Aerosol type classification with machine learning techniques applied to multiwavelength lidar data from EARLINET

del Águila, Ana; Ortiz-Amezcua, Pablo; Tabik, Siham; Bravo-Aranda, Juan Antonio; Fernández-Carvelo, Sol; Alados-Arboledas, Lucas

doi:https://doi.org/10.5194/egusphere-2025-269

29 Jan 2025

Abstract. Aerosol typing is essential for understanding atmospheric composition and its impact on the climate. Lidar-based aerosol typing has often been addressed with manual classification using optical property ranges. However, few works have addressed it using automated classification with machine learning (ML), mainly due to the lack of annotated datasets. In this study, a high-vertical-resolution dataset is generated and annotated for the University of Granada (UGR) station in Southeastern Spain, which belongs to the European Aerosol Research Lidar Network (EARLINET), identifying five major aerosol types: Continental Polluted, Dust, Mixed, Smoke and Unknown. Six ML models – Decision Tree, Random Forest, Gradient Boosting, XGBoost, LightGBM and Neural Network – were applied to classify aerosol types using multiwavelength lidar data from EARLINET, for two system configurations: with and without depolarization data. LightGBM achieved the best performance, with precision, recall, and F1-Score above 90 % (with depolarization) and close to 87 % (without depolarization). The performance for each aerosol type was evaluated, and dust classification improved by ~30 % with depolarization, highlighting its critical role in distinguishing aerosol types. Validation against an independent dataset from a Saharan dust event confirmed robust classification under real and extreme conditions. Compared to NATALI, a neural network-based EARLINET algorithm, the approach presented in this work shows improved aerosol classification accuracy, which emphasizes the benefits of using high-resolution multiwavelength lidar data from real measurements. This highlights the potential of ML-based methods for robust and accurate aerosol typing, establishing a benchmark for future studies using high-resolution multiwavelength lidar data from EARLINET.
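The abstract reports precision, recall, and F1-Score per model and per aerosol type. As a reading aid, the following is a minimal sketch of how these three scores can be computed per class and then macro-averaged. The label sequences are made up for illustration, the function names are hypothetical, and the page does not state which averaging scheme the authors used, so macro-averaging is an assumption here.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Illustrative labels only; these are not data from the study.
y_true = ["Dust", "Dust", "Smoke", "Mixed", "Smoke", "Dust"]
y_pred = ["Dust", "Mixed", "Smoke", "Mixed", "Smoke", "Dust"]

scores = per_class_scores(y_true, y_pred)
macro = macro_average(scores)
for label, (p, r, f) in scores.items():
    print(f"{label:6s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print("macro  precision=%.2f recall=%.2f f1=%.2f" % macro)
```

Macro-averaging weights every aerosol type equally, so a weak score on a rare class lowers the average as much as a weak score on a frequent one; a support-weighted average would behave differently on an imbalanced dataset.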

Received: 20 Jan 2025 – Discussion started: 29 Jan 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

09 Oct 2025

Aerosol type classification with machine learning techniques applied to multiwavelength lidar data from EARLINET

Ana del Águila, Pablo Ortiz-Amezcua, Siham Tabik, Juan Antonio Bravo-Aranda, Sol Fernández-Carvelo, and Lucas Alados-Arboledas

Atmos. Chem. Phys., 25, 12549–12567, https://doi.org/10.5194/acp-25-12549-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-269', Anonymous Referee #1, 11 Mar 2025

See the attached file with the comments.

Citation: https://doi.org/10.5194/egusphere-2025-269-RC1
- AC1: 'Reply on RC1', Ana del Águila, 28 May 2025
  
  The final reply to the referee report is attached as a supplement file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-269-AC1
RC2:
'Comment on egusphere-2025-269', Anonymous Referee #2, 18 Apr 2025

This paper presents a very innovative and relevant study, showing the possibility of using ML techniques to predict the type of aerosols. The work is very well written and structured. However, the following points need to be better detailed.
Line 135: Why was the median used to fill in the gaps? Did you try to use other techniques? Perhaps the use of machine learning techniques could generate a more robust filling.
Table 1: Why were these groups of hyperparameters exclusively selected? Was any analysis of the importance of hyperparameters performed? This can severely affect the final performance of the models, especially for neural networks.
Figure 4: I recommend increasing the font size of the axes.
Figure 4: How did you deal with the problem of the imbalance of the dataset? Because "continental polluted" tends to have worse performance due to the smaller number of data.
Line 285: I expected better results from the NN model with depolarization data, since you have more information about the particle analyzed. Isn't this difference associated with the data input format in the model? Was any preprocessing performed to normalize them?
Figure 5: Considering the use by other users, I think it is important to comment on the computational cost of each model.
Section 3.2.3: Was an analysis of multicollinearity between the features performed? This can affect the importance of each one in the model, as well as the performance of the final model.
Line 323: Because of this statement, I expected that depolarization would present better results in the MLP Classifier.
Line 363: I recommend reviewing the imbalanced dataset issue because if this is not corrected, the cases that are less present in the training tend to perform worse.

Citation: https://doi.org/10.5194/egusphere-2025-269-RC2
- AC2: 'Reply on RC2', Ana del Águila, 28 May 2025
  
  The final reply to the referee report is attached as a supplement file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-269-AC2
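
Referee #2 raises three preprocessing questions: median gap filling, input normalization, and class imbalance. The sketch below illustrates one simple version of each step in plain Python. It is not the authors' pipeline: the function names, the sample values, and the choice of inverse-frequency weights are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median, pstdev


def fill_gaps_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical depolarization-ratio samples with two gaps (None).
depol = [0.05, None, 0.25, 0.30, None, 0.10]
filled = fill_gaps_with_median(depol)
scaled = standardize(filled)

# Hypothetical, deliberately imbalanced label set.
labels = ["Dust", "Dust", "Dust", "Smoke", "Smoke", "Continental Polluted"]
weights = balanced_class_weights(labels)

print(filled)
print([round(v, 3) for v in scaled])
print(weights)
```

The weight formula gives the rarest class the largest weight, which is one common way to keep under-represented types such as Continental Polluted from being dominated during training. Tree-based models are largely insensitive to feature scaling, whereas neural networks usually benefit from it, which is the distinction behind the referee's question about the NN results.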

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Ana del Águila on behalf of the Authors (28 May 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (09 Jun 2025) by Eduardo Landulfo

RR by Anonymous Referee #3 (17 Jul 2025)

Suggestions for revision or reasons for rejection

The presented study is relevant. The developed machine learning (ML) approach has a high potential to be extended to further lidar stations across Europe and the globe. I want to highlight that the authors have implemented not just one but six ML models to find out which one performs best. The logical next step would be the extension of this ML model to further EARLINET stations to include other aerosol types such as marine aerosol and clean continental. I am looking forward to it.

I see the reviewers’ comments well addressed. I would have appreciated if the original reviewers could have checked the revised version again. Now, coming as an additional third reviewer, I still have some minor comments to be addressed before publication.

• Are the lidar data in the EARLINET data base (2012 – 2015) the result of a manual or automatic analysis?
• Fig 2: The reader might be confused as to why the depolarization ratio appears as both an extensive and an intensive property. Probably, you’re referring to the volume and the particle depolarization ratio.
• Tab 1: You’ve chosen a lower boundary for the depolarization ratio of 0.1 for dust. This appears quite low to me. On which study did you base this threshold value? Overall, you could elaborate a bit more on the threshold values in Tab. 1.
• Fig 3: I would advise to mention (at least in the caption) the wavelength of the depolarization ratio because you’ve indicated the wavelengths for all other properties.
• The current aerosol classification for EarthCARE is described in Irbah et al., AMT 2023.
• Sect 3.2.3 The Color Index and Color Ratio basically contain the same information (ratio of backscatter coefficients). So, I could not see why CI(532/1064) has such a high importance and CR(532/1064) a low importance. Is it because the information is somehow redundant?
• Please keep the colors for each aerosol type consistent between Fig. 4 and 8&9.
• Sect 3.3 First case study: Could it be the case that layer 1 was typed as continental pollution because of the lower altitude? As we have learned from Fig. 6, the altitude is the third most important feature for the ML method in the absence of depolarization ratio observations.
• L427 Section 2.1 describes the criteria.
• L468 You’re referring to Sect. 3.3 in Sect. 3.3. This is rather unusual.
• del Águila et al., 2019 reference: doi is not working and the journal name seems strange.
• Ke et al., 2017 and Pedregosa et al., 2011 reference: some authors seem to be missing.
• An open question remains: aerosol mixtures, which might contain different fractions of aerosol types (e.g., dust contributions). E.g., Wandinger et al., AMT 2023 tries to assess these mixtures with a look-up table approach. However, I see that it goes beyond the scope of the current manuscript.
• Data availability: Besides the input data in the EARLINET data base, please consider also making the aerosol typing results of the six ML methods available. This might be of interest for the research community.

References:
Irbah, A., Delanoë, J., van Zadelhoff, G.-J., Donovan, D. P., Kollias, P., Puigdomènech Treserras, B., Mason, S., Hogan, R. J., and Tatarevic, A.: The classification of atmospheric hydrometeors and aerosols from the EarthCARE radar and lidar: the A-TC, C-TC and AC-TC products, Atmos. Meas. Tech., 16, 2795–2820, https://doi.org/10.5194/amt-16-2795-2023, 2023.

Wandinger, U., Floutsi, A. A., Baars, H., Haarig, M., Ansmann, A., Hünerbein, A., Docter, N., Donovan, D., van Zadelhoff, G.-J., Mason, S., and Cole, J.: HETEAC – the Hybrid End-To-End Aerosol Classification model for EarthCARE, Atmos. Meas. Tech., 16, 2485–2510, https://doi.org/10.5194/amt-16-2485-2023, 2023.
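
Referee #3's question about the Color Index and Color Ratio concerns feature redundancy. When two features are monotonic transforms of each other, a tree-based model can split on either one, so importance may concentrate on one and leave the other near zero. The sketch below shows a simple rank-correlation check for such redundancy. The exact definitions of CI and CR in the manuscript are not given on this page, so the two features used here (a backscatter ratio and its logarithm) are purely hypothetical stand-ins, as are the sample values and function names.

```python
import math


def ranks(values):
    """Rank values from 1..n (assumes no ties, which holds for the demo data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical backscatter coefficients at two wavelengths (arbitrary units).
beta_532 = [2.0, 1.5, 3.2, 0.9, 2.7]
beta_1064 = [1.0, 1.2, 1.1, 0.8, 0.9]

# Two stand-in features carrying the same information in different forms.
ratio = [a / b for a, b in zip(beta_532, beta_1064)]
log_ratio = [math.log(r) for r in ratio]

rho = spearman(ratio, log_ratio)
print(f"Spearman rank correlation: {rho:.3f}")
```

A rank correlation at or near 1 indicates that the two features order the samples identically, in which case their individual importance scores should be interpreted jointly rather than separately. Whether this explains the importances reported in the manuscript is for the authors to confirm.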

RR by Anonymous Referee #1 (24 Jul 2025)

RR by Anonymous Referee #2 (26 Jul 2025)

ED: Publish as is (21 Aug 2025) by Eduardo Landulfo

AR by Ana del Águila on behalf of the Authors (22 Aug 2025) Manuscript

Viewed

Total article views: 1,050 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
859	160	31	1,050	40	62

Views and downloads (calculated since 29 Jan 2025)

Month	HTML	PDF	XML	Total
Jan 2025	68	11	4	83
Feb 2025	86	27	5	118
Mar 2025	54	19	1	74
Apr 2025	53	16	2	71
May 2025	54	23	4	81
Jun 2025	37	15	6	58
Jul 2025	37	13	0	50
Aug 2025	108	18	0	126
Sep 2025	355	16	9	380
Oct 2025	7	2	0	9

Viewed (geographical distribution)

Total article views: 1,052 (including HTML, PDF, and XML) Thereof 1,052 with geography defined and 0 with unknown origin.

Latest update: 09 Oct 2025

Short summary

This study applies machine learning (ML) techniques to classify aerosols using high-resolution multiwavelength lidar data from the EARLINET network. We developed a reference dataset and evaluated six ML models, with LightGBM achieving over 90 % accuracy. Depolarization data proved critical for improving dust classification. Validated against a Saharan dust event, our approach improves aerosol classification and may help refine lidar-based processing strategies.

