Machine learning-driven characterization and prescription of aerosol optical properties for atmospheric models

do Rosário, Nilton Évora; Longo, Karla M.; Toso, Pedro H.; Freitas, Saulo R.; Yamasoe, Marcia A.; Rodrigues, Luiz Flávio; Medeiros, Otavio; Velho, Haroldo Campos; Menezes, Isilda da Cunha; Miranda, Ana Isabel

doi:10.5194/egusphere-2025-454

Preprints

https://doi.org/10.5194/egusphere-2025-454

16 Apr 2025

Nilton Évora do Rosário, Karla M. Longo, Pedro H. Toso, Saulo R. Freitas, Marcia A. Yamasoe, Luiz Flávio Rodrigues, Otavio Medeiros, Haroldo Campos Velho, Isilda da Cunha Menezes, and Ana Isabel Miranda

Abstract. Accurate modeling of aerosol optical properties is critical to simulating aerosol radiative effects. However, uncertainties in the simulation of aerosol intensive optical properties are still significant. The use of observations to constrain aerosol optical properties in models has therefore been proposed as an option. In addition, explicit computations of optical properties are still too costly for operational models, which makes observation-based prescriptions a convenient solution. We developed an observation-based prescription of aerosol optical properties, driven by machine-learning techniques, that can be applied in models. The Iberian Peninsula (IP) was taken as the reference domain, and the aerosol products from the AERONET sites across the IP as the main dataset. First, clustering was applied to define the typical aerosol optical regimes affecting the IP atmosphere. Five typical regimes were identified. Two of them were dominated by the coarse mode and were associated with Saharan dust: one was found to be close to pure dust, while the other indicated a mixed scenario of dust and pollution. Two of the non-dust regimes, strongly and moderately absorbing, were found to be associated with smoke. The remaining non-dust regime, with no clear association, occurs mostly in the eastern portion of the IP. Afterward, using aerosol-type columnar mass density from MERRA-2, a model was trained as a predictor of the optical regimes using the Random Forest method. The model was tested under distinct aerosol scenarios. Prediction accuracy ranged from 60 to 75 %, depending on the regime, with an average accuracy of 70 %.
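The abstract reports accuracy both per regime and on average, and the discussion further down this page also mentions scoring the classifier on a dust versus non-dust split. As a minimal sketch of how such scores relate to one another, the Python below computes overall accuracy, per-regime recall and precision, and the accuracy after collapsing five regimes into two groups. The regime names are hypothetical paraphrases of the abstract, and the label lists are invented for illustration only; they are not the paper's data or results.

```python
from collections import Counter

# Hypothetical regime names paraphrasing the abstract (two dust, three non-dust).
REGIMES = ["dust_pure", "dust_mixed", "smoke_strong", "smoke_moderate", "other"]
DUST = {"dust_pure", "dust_mixed"}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true regime, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def per_class_precision(y_true, y_pred):
    """For each predicted regime, the fraction of those predictions that are right."""
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / predicted[c] for c in predicted}


def to_binary(labels):
    """Collapse the five regimes into a dust / non-dust split."""
    return ["dust" if lab in DUST else "non_dust" for lab in labels]


# Invented labels: four samples per regime, with a few deliberate confusions.
y_true = [name for name in REGIMES for _ in range(4)]
y_pred = (
    ["dust_pure"] * 3 + ["dust_mixed"]
    + ["dust_mixed"] * 3 + ["dust_pure"]
    + ["smoke_strong"] * 3 + ["smoke_moderate"]
    + ["smoke_moderate"] * 2 + ["smoke_strong", "other"]
    + ["other"] * 3 + ["dust_mixed"]
)

print("five-class accuracy:", accuracy(y_true, y_pred))
print("dust/non-dust accuracy:", accuracy(to_binary(y_true), to_binary(y_pred)))
print("recall per regime:", per_class_recall(y_true, y_pred))
print("precision per regime:", per_class_precision(y_true, y_pred))
```

In this toy set most errors are confusions between regimes of the same family, so the five-class score (14 of 20 correct) is lower than the dust versus non-dust score (19 of 20 correct). That is one way a large gap between the two kinds of score can arise, though whether it explains the paper's numbers is not something this sketch can show.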

Received: 30 Jan 2025 – Discussion started: 16 Apr 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-454', Anonymous Referee #1, 22 May 2025
The authors use K-means clustering on AERONET data to identify different regimes of aerosol optical properties over the Iberian Peninsula, and subsequently train random forests to predict these regimes from aerosol column densities provided by MERRA-2. While the paper is interesting and fits the journal, I would still recommend major revisions; please refer to the comments below.
Major comments:
Ln 44. “limitations in the current …. address information?” What do you mean by “address information”?

Ln 48. “geographical representativity” Can you elaborate on what you mean here? Adebiyi et al. (2023) mention that libraries typically assume identical mineral compositions of certain aerosol species regardless of location. Is that what you mean here?

Ln 63-66. I do not fully understand the contrast between “absorption” and “relative contribution of ..” implied here. Are absorption problems not one of the consequences of misrepresenting the fine and coarse modes?

Ln 76. Bias in what quantity?

Ln 107-110: This sentence is a bit ambiguous to me, do you mean that (1) dust regimes and (2) smoke regimes are the major source of differences? If so, could you rephrase?

Could you discuss or comment on the accuracy of AERONET retrievals, either in section 2.2 or later in the results? To what extent can we consider these the ground truth?

Section 2.1. To me, this section felt too elaborate and sometimes irrelevant for your study. I would suggest limiting it to a short overview of what aerosol types may be expected where (and when), perhaps aided by showing (long-term averages of) MERRA column densities or something similar, without the climate information.

Figure 1. Related to the previous comment, I am not sure surface elevation is the most relevant quantity to show here.

Section 2.2: Could you provide more information on the time resolution of the AERONET data, and the time periods used for the study?

Ln 213: What is the Lidar Ratio and what is it used for?

Ln 216: I assumed “mixed” in this context means multiple aerosols occurring in one column, irrespective of height. Does the height at which different aerosol species occur matter for your optical properties and their retrieval, and (how) does that affect your work?

Ln 246-247: Consider introducing table 1 at the start of previous paragraph (ln. 221). Lines 221-229 feel out of place now.

Table 1. Am I right that you don’t use aerosol optical depth? How would that be derived when your model is used? Also, would your approach still work if you only included the optical properties that are used in radiative transfer calculations (SSA, ASY at different wavelengths)?

Ln 296: What exactly is the Optical Model in this context?

Ln 314-320: I am not sure I fully understand the equation and accompanying text, are you computing WCSS by minimising sum_1^k W(ck), or are you seeking to minimise WCSS? Also, could you explain more what you mean by Elbow?

Ln 251-253: What do you mean by class imbalance, and what issues regarding atmospheric measurements are you addressing exactly?

Ln 394: What is cluster stability?

Ln 438: you mention marine particle scenarios here in relation to Clusters C2-C4, but marine aerosols are barely mentioned later on. Are they not important enough, or present in all regimes in similar quantities?

Section 3.1: What are the mean aerosol column densities from MERRA-2 for each regime?

Figure 4: Based on what data are these size distributions computed?

Ln 492-493, table 3: Where are the contents of this table discussed? Line 500 perhaps; I would refer to table 3 on that specific line.

A general comment on figures: I find the large variety in font sizes (for example, large bold text in Figure 6, small font size in Figures 11 and 12), in combination with the low dpi of some figures, distracting from their main message.

Ln 501: This hypothesis was already stated on line 470.

Ln 508-509: “however, … , but…” consider removing either however or but?

Ln 535: Do you have any metrics showing whether the random forest is overfitting its training data?

Line 563: Could you elaborate on the “extra training”, would you need extra variables, longer time series?

Ln 573: Accuracy is not included in Table 4.

Ln 580: Looking at Table 4, does C2 not have the lowest precision?

Ln 590: Based on Figure 10, is organic carbon really one of the primary factors? The relative importance of organic carbon is very similar to that of sea salt, SO4, and SO2.

Ln 623/698: Could you discuss these AOD thresholds in more detail? Is your model only trained for cases with a higher AOD? What percentage of time is this threshold actually reached? I suppose that even for relatively low aerosol optical depths, you would still want optical properties to be well-represented.

Ln 703: What exactly do you mean by “randomly simulated following a Gaussian…”? Please elaborate.

Ln 705: “In addition … size behaviors”, does this refer to the Angstrom exponent (Figure 14)?

Ln 714: Why does MERRA underestimate SSA in C3 and C4 despite the coarser particles?

Ln 727: What exactly are the expected and observed values referring to?

Figure 12: I am not sure if a diverging colormap is the most appropriate choice here.

Figures 13-15: Do the dashed lines show the means? If so, please state in the caption.

Ln 784-786: Could you elaborate on the generalizability of your approach and how it would benefit atmospheric models, e.g. climate simulations? Would you need many random forests, each for a small part of the world, and use those to infer spectral optical properties given column amounts of the various aerosol species?

Ln 785: “aerosol radiative forcing” Would it be possible to also illustrate the possible benefit of your approach in terms of aerosol radiative effects? For example, using clear-sky radiative transfer computations with either your aerosol optical properties or those from MERRA-2.

Minor comments & technical corrections
Ln 16. “simulation aerosol” -> “simulation of aerosol”

Ln 20. “a observational-based” -> “an observation-based”

Ln 37. “Aerosol particles’ importance”, maybe rephrase to “The importance of aerosols”

Ln 38. What do you mean by “direct players”?

Ln 40. Could you be more specific in this sentence? It is not very clear to me.

Ln 60. “due to treatment of aerosol mixing state”: can you formulate this more clearly?

Ln 70. “microphysical” missing word?

Ln 102-103. “aerosol scenarios variability”, should “variability” be omitted here?

Ln 241. When or where?

“data fitted to a training process,…”: could you rephrase this more clearly?

Ln 339-341: “mass density, trying to” This sentence does not flow well.

Ln 367: “takes .. account” is redundant.

Figure 3 misses (a) (b) (c) (d) labels

Figure 9: The colorbar misses a label

Ln 582: What do you mean by “cost”?

Ln 635: What do you mean by “corridor”?

Ln 658: “Regimes regarding” -> “regimes such as”?

Ln 682: “550 nm field” -> “550 nm”

Ln 723: What do you mean by “physical distribution characteristics”?
Citation: https://doi.org/10.5194/egusphere-2025-454-RC1
- AC1: 'Reply on RC1', Nilton Rosario, 26 Aug 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-454/egusphere-2025-454-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-454-AC1
RC2:
'Comment on egusphere-2025-454', Anonymous Referee #2, 15 Jul 2025

The study characterizes the typical aerosol intensive optical properties affecting the Iberian Peninsula (IP), comprising Spain and Portugal, using the atmospheric column inversion products from the AERONET sites. The authors employed K-means clustering to analyze historical aerosol intensive properties across all AERONET sites that operated for at least 2 years and have the highest-quality Level 2.0 dataset available. Five distinct aerosol optical regimes affecting the IP were identified based on the clustering technique. Aerosol-type columnar mass density data (dust, organic carbon, black carbon, sea salt, and sulphates) from the MERRA-2 reanalysis were then used to predict the aerosol optical regime with the Random Forest supervised learning method. The performance of the trained model was tested under various aerosol scenarios; prediction accuracy ranged from 60% to 75%, and exceeded 90% when predicting solely dust or non-dust optical regimes. Overall, the study is very interesting and fits the journal scope. The manuscript is well written but requires some improvement in clarity on certain aspects before reconsideration. Recent literature needs to be cited.

Comments:
Line 37: Statement starting with 'Via'?

Line 70: compositions -> composition,

Line 70: It should be 'microphysical properties'

Line 70: computations -> computation

Line 76: What parameters are being referred to in 'aerosol simulation'?

Line 197: What do you mean by observation-constrained approaches? Are you referring to the threshold-based aerosol type classification methods? Please clarify.

Lines 211-215: What is the rationale for choosing these aerosol intensive properties? How are the Lidar Ratio (LR) and Linear Depolarization Ratio (LDR) derived from AERONET sky radiance measurements? How reliable are the LR and LDR derived from AERONET?

Line 235: Which climate models are being referred to here?

Table 1: Are VMR-F, VMR-C, STD-F, STD-C, Reff-F, and Reff-C provided by the AERONET inversion products, or are they derived by the authors? Please clarify. Since these intensive properties are inversion products of AERONET, how did you account for the impact of their uncertainty on the aerosol optical regimes identified through K-means clustering (Section 2.4)? There is not much discussion of the influence of the observational/inversion uncertainty of aerosol intensive properties on the identified clusters and on the interpretation of your results.

Line 286: Use Sulphate or sulfate consistently throughout the manuscript.

Lines 285-290: It was mentioned that the MERRA-2 Aerosol Diagnostic Product (ADP) for aerosol types is considered in this study. Dust, Black Carbon, Organic Carbon, Sea-Salt and Sulphate aerosol mass concentrations at specific levels are integrated over the entire atmospheric column to obtain columnar aerosol optical properties such as extinction, scattering and absorption optical depth. It is not clear how the mass concentrations of individual species are converted to optical depths. At least a proper citation of the method adopted could have been included. At which wavelength are these obtained? Did you validate the extinction optical depth derived from MERRA-2 against the aerosol optical depth from AERONET? Similarly, how does the SSA from MERRA-2 compare with the corresponding SSA from AERONET?

Line 309: There exist several methods and indices to decide on the appropriate number of clusters such as Elbow, Silhouette, Davies Bouldin, and Calinski-Harabasz indices. I have noticed that in the following study: https://doi.org/10.1016/j.atmosres.2022.106518, the authors have stated that the correct number of clusters derived from different approaches may not lead to a single solution. What is the rationale for adopting the Elbow method, except the fact that it is a widely used method for determining the optimal number of clusters?

Lines 325-328: What do you mean by 'clusters average'?

Lines 334-336: It was not mentioned anywhere how the times were synchronized between the AERONET inversion parameters and MERRA-2 data of aerosol species column mass density. Each of the AERONET inversion parameters and MERRA-2 aerosol species column mass densities might have different ranges of variability and units. How is this accounted for in the ML model while identifying the clusters? I mean to ask if the ML model does any scaling and normalization of different parameters. If not, won't the range of variability and units have any impact on the aerosol classification?

Line 461: Large radius spread for C3 ... What does this infer?

Lines 505-506: There has been no mention of the categorization of seasons up to this point. How are months categorized into seasons?

Table 3: What do the values in brackets correspond to? Standard deviation or error? This may be mentioned in the table caption.

Line 555: How can you say that this would not introduce a substantial error in the radiative effect calculations? In terms of what metric is the radiative effect calculated: radiative forcing or heating rates? It would be better to quantify this error. I suggest checking this study: https://doi.org/10.1016/j.jqsrt.2024.109179, to see whether it provides some insights on the errors associated with direct radiative effects.

Lines 573-583: It appears that these details are repeated again. Please check and avoid repetitions.

Line 598: reanalyzes -> reanalyses

Figure 10: The short forms (dst, oac, ssl, so4, so2, bcc) used as x-axis labels for the features should be defined in the caption of Figure 10. It is also not clear whether this relative importance is obtained for the entire IP region or only for the grid cells containing the AERONET sites. Could you produce similar figures to ascertain the relative importance of the aerosol intrinsic parameters (Table 1) for the different clusters (or aerosol scenarios) identified in this study, together with the predictor variables from MERRA-2?

Line 625: All of a sudden MERRA-2 AOD field is taken as a reference. AERONET sites also provide the AOD and SSA values, which could have been checked during the period of various scenarios (Case#01, Case#02, Case#03, Case#04).

Lines 673-674: 'lower computational cost' --> How is this quantified? Have you compared with any other methods of aerosol classification?

Figure 11 Caption: MODIS Terra?

Lines 697-699: Earlier it was mentioned AERONET AOD > 0.4 but for MERRA-2 AOD > 0.3. Why?

Citation: https://doi.org/10.5194/egusphere-2025-454-RC2
- AC2: 'Reply on RC2', Nilton Rosario, 26 Aug 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-454/egusphere-2025-454-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-454-AC2

Supplement

https://doi.org/10.5194/egusphere-2025-454-supplement

Short summary

The present article focuses on the use of observations to constrain aerosol optical properties in climate models. We combine a machine learning approach (based on clustering), used to identify and characterize aerosol optical regimes, with another machine learning technique (Random Forest), used to train the prescription of the identified optical regimes from a mixture of columnar mass densities of different aerosol types.
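The clustering step named in this summary can be sketched as follows. The referee comments on this page identify the method as K-means, with the number of clusters chosen by the Elbow criterion on the within-cluster sum of squares (WCSS), i.e. the sum over clusters of the squared distances between each point and its cluster centroid. Everything else in this sketch is an assumption made for illustration: the two features stand in for any pair of intensive optical properties with different numerical ranges, the data are synthetic, and the z-score standardization is one common choice rather than something the page confirms the authors applied.

```python
import math
import random


def standardize(rows):
    """Z-score each column so features with larger ranges do not dominate distances."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (labels, centroids, WCSS)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


def elbow_curve(points, k_max, n_init=5):
    """Best WCSS over a few random restarts, for k = 1 .. k_max."""
    return {k: min(kmeans(points, k, seed=s)[2] for s in range(n_init))
            for k in range(1, k_max + 1)}


# Synthetic data: three well-separated groups in two hypothetical features
# whose raw ranges differ by orders of magnitude.
rng = random.Random(42)
centers = [(0.95, 0.2), (0.85, 1.8), (0.92, 1.2)]
raw = [[rng.gauss(cx, 0.005), rng.gauss(cy, 0.05)]
       for cx, cy in centers for _ in range(30)]
X = standardize(raw)

curve = elbow_curve(X, k_max=6)
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

The Elbow criterion reads the printed curve for the value of k beyond which the WCSS stops dropping sharply. Because that reading is a visual judgment, other indices (Silhouette, Davies-Bouldin, Calinski-Harabasz, as listed in the second referee report) are often checked alongside it.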

