This work is distributed under the Creative Commons Attribution 4.0 License.
Deep learning representation of the aerosol size distribution
Abstract. Aerosols influence Earth's radiative balance by scattering and absorbing solar radiation, affect cloud formation, and play important roles in precipitation, ocean seeding, and human health. Accurate modeling of these effects requires knowledge of the chemical composition and size distribution of aerosol particles present in the atmosphere. Computationally intensive applications like remote sensing and weather forecasting commonly use simplified representations of aerosol microphysics that prescribe the aerosol size distribution (ASD), introducing uncertainty in climate predictions and aerosol retrievals. This work develops a neural network model, termed MAMnet, to predict the ASD and mixing state from the bulk aerosol mass and the meteorological state. MAMnet can be driven by the output of single-moment, mass-based aerosol schemes or by reanalysis products. We show that MAMnet is able to accurately reproduce the predictions of a two-moment aerosol microphysics model as well as field measurements. Our model paves the way to improving the physical representation of aerosols in physical models while maintaining the versatility and efficiency required in large-scale applications.
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2025-482 - No compliance with the policy of the journal', Juan Antonio Añel, 07 Apr 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, you have archived both the GEOS-ESM and the MAMnet code on GitHub. However, GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other long-term archival and publishing alternatives, such as Zenodo. Therefore, the current situation with your manuscript is irregular. Please publish your code in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.

Also, in the Data Availability section of your manuscript you provide generic links to the main web pages of the full datasets, rather than to the specific data that you have used in your work. We cannot accept this. You must provide the exact data that you have used to develop your work; importantly, in the case of the work that you present, the exact data used for the training of the neural network. This is critical to ensure the replicability of your work, and therefore its scientific character.
I have to note that if you do not fix this problem, we will have to reject your manuscript for publication in our journal.
Finally, please, remember that you must include a modified 'Code and Data Availability' section in a potentially reviewed manuscript, containing the DOI of the new repositories that you create to solve the issues pointed out here.
Juan A. Añel
Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2025-482-CEC1
RC1: 'Review on egusphere-2025-482', Anonymous Referee #1, 28 Apr 2025
The manuscript describes the development of a neural-network model, MAMnet, trained on model output from GEOS+MAM, with the goal of creating a computationally cheap platform to estimate aerosol size distributions using outputs from bulk aerosol models, with MERRA-2 used as an example. The work is interesting and worthy of publication after my comments below are addressed; most are minor, but some might qualify as major, especially those on the evaluation.
General comments
How does the trained model perform during a different time period? Aerosol concentrations in the 1990s were much higher than they are today; is the model, which is practically driven only by temperature (and air density, which does not change much with climate change), able to capture that time period? More generally, what is the validity range of the model, given its training dataset?
How much computational time is saved? There is no MERRA-2+MAM model, but the comparison between GEOS, GEOS+MAM, MERRA-2, and MERRA-2+MAMnet should be able to provide the necessary information.
I assume it is MAM7 that is used in this work; shouldn't you use this name to distinguish it from other MAM versions?
I am really surprised that only temperature and air density have been used for the meteorological state. I would expect that 3-dimensional wind fields (long-range transport), clouds and precipitation (wet removal, CCN, activation), and surface type (dry deposition) would be of key importance. Clouds can be also important for sulfate formation in the aqueous phase, and then cloud evaporation should affect sulfate size distribution. How can a model be accurate without these processes included?
The lifetime of a single species in MAM (e.g. SU) would depend on the removal rates in each mode, which differ in mode solubility (a function of mode composition) and sedimentation velocity (a function of mode size). The NN training implicitly uses this information, but the NN application in a bulk model like GOCART does not have that distinction when calculating SU mass, so SU is inherently different across models by design. The NN will likely try to compensate for that, but could you comment on this?
Specific comments
Line 9: Replace “physical representation” with “aerosol microphysics representation”. A machine-learned approach is not physics.
Line 24: “of the same size” should be “in the same bin”. Bulk approaches allow particles in different bins to have the same size but different composition, e.g. sulfate vs. nitrate.
Line 25: “they fail to distinguish” is too harsh, please replace with “they are not designed to resolve”. They would fail if they would try to resolve ASD, but they don’t.
Lines 38-39: “These models offer the most physically consistent representation of the ASD” is not necessarily correct, since modal models assume a shape of the size distribution per mode, typically a lognormal, which is an approximation of reality. One could argue that sectional models, which are even more expensive than modal ones, are better, since they can freely calculate the ASD shape without the need of a lognormal, but they also suffer from assumptions needed when moving mass and number from one section to another. Particle-resolved models might be the most realistic ones, but these are practically impossible to use in large-scale models. The point is that mentioning that modal schemes are the most physically consistent is incorrect.
Line 96: Which years were simulated, and 72 vertical levels up to what altitude?
Line 97: Please elaborate on the choice of 9 AM/PM UTC time for the output and especially the 12-hour frequency. Understandably this is a lot of output already, but I would argue that sampling any individual location just twice a day has a high probability to miss the diurnal variability of ASD. I would expect that 4 times a day would be the minimum reasonable sampling frequency, as a first guess.
Section 2.2.1: I do not follow the file counting and usage. 25 were "randomly selected without replacement for training" (what does that mean?), 10 were used "for the testing of the trained model", 100 were "not used during training" (how were they used?). What are these files? Each instantaneous output produces one file, so 2 per day × 365 days × 5 years of files? If yes, what happens with the remaining thousands of files? And how many have been used for training? I see later (lines 139-140) the statement "5 output files for training, 2 for validation", which makes even less sense. Please explain.
Figure 1: Please explain what MAMnet loss is. It is not referenced anywhere else in the manuscript. Also, why is GOCART mentioned? This figure is for the development of the NN, not its application. Isn't GOCART only used for the application?
Table 3: Too many new concepts appear there without explanation. Please help the reader understand what these are, or move this table to an appendix if you consider it too technical to expand on.
Section 3: I would recommend adding a section 3.1, "evaluation against GEOS+MAM", analogous to the current section 3.1, "evaluation against observations", instead of leaving this material under the generic section 3.
Figure 2: Are these global means per layer? Assuming so, is this a good metric, especially for number concentration? Wouldn't doing this regionally be much more meaningful? I appreciate the zonal means and maps later, but my question stands. To be more specific, how can you say "systematic errors emerge" in line 199 without knowing whether this error is widespread or just some very large scattered errors that overwhelm the mean?
Figures 2-3, regarding mass concentrations: what is the model performance in terms of mass conservation? The results per mode do not need to conserve mass, but per species across modes mass conservation is paramount. Thinking even further, how will the mass conservation concept be applied when using MAMnet in production runs?
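One simple way such a constraint could be enforced, sketched here only as an illustration and not as anything the authors describe (array names are hypothetical), is to rescale the predicted per-mode masses of each species so that their sum matches the input bulk mass:

```python
import numpy as np

def renormalize_species_mass(pred_mode_mass, bulk_mass, eps=1e-30):
    """Rescale predicted per-mode masses so a species conserves its bulk mass.

    pred_mode_mass : (n_samples, n_modes) predicted masses for one species
    bulk_mass      : (n_samples,) input bulk mass for the same species
    """
    total = pred_mode_mass.sum(axis=1, keepdims=True)
    scale = bulk_mass[:, None] / np.maximum(total, eps)
    return pred_mode_mass * scale
```

Such a post-hoc renormalization guarantees conservation per species, at the cost of redistributing any prediction error proportionally across modes.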
Lines 253-262, and Figure 6: These are an evaluation against MERRA-2, not observations, as the title of section 3.1 denotes. This whole paragraph and figure would serve well as a conclusion to the discussion just before this section, so consider moving them right after line 247, before section 3.1 starts.
Section 3.1: Although I agree with the motivational first paragraph of this section (lines 249-252), it sounds more like wishful thinking. MAMnet is trained with model data, not measurements, so at its peak performance it will be able to emulate the modeled data. In terms of measurements, it can only be as good as the GEOS+MAM or MERRA-2 models, and any improvement in skill when compared with measurements (if at all evident) will be coincidental, thus irrelevant. What is really missing from both sections 3.1.1 and 3.1.2 is a baseline discussion: how does MERRA-2 alone perform when compared with measurements? Of course MERRA-2 does not simulate the ASD, but biases in the total aerosol mass (per species or not) will impact the ASD. Even more, GEOS+MAM does not include assimilation, so other sorts of biases are likely present in the ASD of the training dataset. Since this paper is about MAMnet, and since section 3.1 as a whole is meant to demonstrate its overall skill, not knowing the skill of the training dataset is a major shortcoming. At the very least, GEOS+MAM should be presented in figures 7 and 8, but a mass concentration comparison (or a citation of past evaluation efforts) should be presented as well.
Section 3.2: please explain what Shapley values are exactly. There is some information in the figure legend, but a short introduction would be useful. Also, since this is a comparison against the model data, I would recommend moving it before the observations sections, so swapping sections 3.1 and 3.2.
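For readers unfamiliar with the term: a Shapley value additively attributes a single prediction to the input features, relative to the expected prediction over a background sample, so a positive value means the feature pushed that prediction upward. A typical computation with the shap package, shown here only as an illustration with hypothetical variable names, looks like:

```python
import shap

# background: a modest reference sample of training inputs,
# used to estimate the expected model output
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(X_test)  # per-sample, per-feature attributions
shap.summary_plot(shap_values, X_test)       # beeswarm summary of feature importance
```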
Line 334: What do you mean by “possibly by promoting secondary aerosol formation” here? Secondary organics will evaporate more at higher temperatures, while secondary inorganic aerosols will have a more complex relationship depending on relative humidity as well.
Technical corrections
Line 44: Change “ML models, we can” to “ML models can”.
Line 79: Add “of different sizes” after “five mass bins”.
Line 80: Replace “hydrophilics” with “hydrophilic”.
Line 86: Table 2 is referenced before Table 1.
Line 97: Replace “these” with “that”.
Figure 1: rho_air is mentioned in the legend, but it is termed AIRD in the figure.
Line 109: Replace “Kg” with “kg”.
Lines 179 and 181: “the original MAM” and “GEOS+MAM” are the same thing, right? Please use one terminology throughout, for clarity.
Line 214: “smaller and less massive” is the same, why not just say “smaller”?
Line 217: Replace “near-perfect” with “very high”.
Line 223: Replace “sulfates” with “sulfate”.
Line 260: Replace “accurate” with “accurately”.
Line 311: Replace “tends align” with “tends to align”.
Figure 9: Please add a figure legend that explains the color lines, on top of the verbal description present in the caption.
Line 363: Replace “predicted concentrations” with “predicted number concentrations”.
Citation: https://doi.org/10.5194/egusphere-2025-482-RC1
RC2: 'Comment on egusphere-2025-482', Anonymous Referee #2, 12 May 2025
This manuscript uses a global aerosol microphysics model to train a neural network model to estimate aerosol size distributions and mixing state from bulk aerosol masses. This is overall a useful contribution. However, I feel that the overview of microphysics methods and understanding needs to be improved, and I'd like the authors to evaluate their results against the typical way of estimating the size distribution and CCN from bulk masses. Once these and the specific issues below have been addressed, I am supportive of this manuscript being published.
Editorial note: Often, the figures are quite far from where they are discussed in the manuscript, which requires a lot of scrolling or flipping. In most cases, it seems like it would have been straightforward to have arranged the figures to be closer to their discussion.
Major comment
I feel that the paper is missing the #1 test of ML size distributions that I’d want to see. The easiest (and usual) way to get size distributions (and CCN etc.) from a bulk model is simply to assume a fixed size distribution for each species. In your case, it would be good to just use the global average distributions from MAM as a global conversion from bulk to a size distribution. I’d like to know how much better MAMnet is compared to this simplest approximation, which to me is the test of the value of the ML.
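For concreteness, such a baseline, sketched here under the assumption of one fixed lognormal per species with illustrative parameter values, would convert bulk mass directly to number concentration:

```python
import numpy as np

def bulk_mass_to_number(mass_conc, rho_p, d_pg, sigma_g):
    """Number concentration implied by a fixed lognormal size distribution.

    mass_conc : aerosol mass concentration [kg m-3]
    rho_p     : particle density [kg m-3]
    d_pg      : geometric mean (number) diameter [m]
    sigma_g   : geometric standard deviation [-]
    """
    # Mean particle volume of a lognormal: <V> = (pi/6) Dpg^3 exp(4.5 ln^2 sigma_g)
    mean_volume = (np.pi / 6.0) * d_pg**3 * np.exp(4.5 * np.log(sigma_g) ** 2)
    return mass_conc / (rho_p * mean_volume)

# e.g. 1 ug m-3 of sulfate with Dpg = 0.1 um, sigma_g = 1.8 -> ~2e8 m-3
n = bulk_mass_to_number(1e-9, 1800.0, 1e-7, 1.8)
```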
Specific comments
L8: This needs more information than “two-moment”. You track two moments for 7 different modes, so this is a two-moment *modal* scheme. There are just plain two-moment (or X-moment) schemes (without assuming a modal shape, e.g., the MATRIX scheme in the GISS climate model) or two-moment sectional schemes (two moments in each size section; e.g., Adams and Seinfeld, 2002 referenced in the manuscript), so I wasn’t sure what you were referring to when originally reading the abstract.
L22: Bulk models don’t need to have bins (in the model you describe later, most species don’t have bins). The key is that even if there are bins, there are no microphysics calculations (nucleation, condensation, coagulation, etc.) that would let the size distribution evolve.
L23: Bulk models do not need to assume external mixing. You can assume that at any given size (you would need to assume a size distribution), all species are mixed into the same particles (internal mixing). Just because GOCART assumes external mixing doesn't mean that you need to assume external mixing with a bulk model.
L32: two moments of the ASD *for each mode*.
L34: Again, bulk schemes can “handle” internal mixing (you just assume it). What modal schemes can do (assuming they are simulating multiple modes) is to have an explicit calculation of which particles are internally vs. externally mixed that varies in space and time. You also haven’t established that models can have multiple modes yet.
L46 and throughout. Like the other reviewer said, it’s common with MAM to put the number of modes after (in this case MAM7). However, I believe there may be multiple MAM configurations with 7 modes in the literature (e.g., I believe that one has a nucleation mode, and this one does not).
L45: There is other work to parameterize bulk aerosol mass to the size distribution, such as https://doi.org/10.1029/2021GL094133 and https://doi.org/10.5194/acp-23-5023-2023
L97: “From these simulation*s*”
L110: “a a”
L116: My brain wants to read “During training data” as one phrase. Please add a comma after “training”.
L168-177: Please elaborate more on how these methods estimate CCN. Do you expect them to be better than your model estimates? I would guess a lot of assumptions go into these products, so I wouldn’t necessarily expect them to be a useful evaluation. (Note: Figures 8 and 9 show huge disagreements between the datasets.)
L185: [-1, 1] would be the range for “within an order of magnitude of the target value”. [-0.5, 0.5] is an order of magnitude window around the target value (or within a factor of about 3.2).
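In symbols, assuming the usual definition of the mean log bias:

$$\mathrm{MLB} = \frac{1}{n}\sum_{i=1}^{n}\log_{10}\frac{\hat{y}_i}{y_i}, \qquad |\mathrm{MLB}| \le 1 \;\Leftrightarrow\; \text{within an order of magnitude}, \qquad |\mathrm{MLB}| \le 0.5 \;\Leftrightarrow\; \text{within a factor of } 10^{0.5} \approx 3.16.$$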
Figure 2: (1) The colormap is a strange choice for a diverging colorbar in the lower panel. It would be better to make it white in the middle. (2) Would be easier to interpret if it were rotated such that height was the y axis. (3) Have the surface (1000 hPa) be at the origin rather than the model top.
Figure 2 and discussion around line 200: I suspect that the challenge in MAMnet predicting the Aitken mode may stem from the difficulty of predicting when/where nucleation is occurring. If other inputs that may help predict nucleation, like solar radiation and SO2, were included, it might do a better job with the Aitken mode. Also possible is that fresh fossil-fuel combustion emissions (vs. aged in the accumulation mode) into the Aitken mode might be hard to predict, and NOx as an input might help with this as the NOx lifetime is on a similar order as a typical aging timescale (12-24 hours) and they tend to be co-emitted.
L234-236: This sentence overstates things. Relative variability in Dpg is strongly buffered to relative variability in Mass/Number since it goes with the cube root of this ratio. For example, factor-of-2 error in M/N would only be a 26% error in Dpg. Is the MLB of 0.01 really that remarkable or surprising given that it’s much more stable than M/N?
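To spell out the arithmetic behind this buffering:

$$D_{pg} \propto \left(\frac{M}{N}\right)^{1/3} \quad\Rightarrow\quad \text{a factor-of-2 error in } \frac{M}{N} \text{ changes } D_{pg} \text{ by only } 2^{1/3} - 1 \approx 26\%.$$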
L255: Do we expect the observational constraints (just polar-orbiter AOD in cloud-free regions, right?) on MERRA-2 to improve the relative balance of species masses? My understanding is it just scales the mass of all species in the column up/down until AOD is pushed closer towards the obs.
L271: How did you sample the model for the high-altitude sites? These sites are tricky since they are often at a much higher altitude than the gridbox mean altitude. Sometimes they are in the PBL, sometimes they aren’t. I recommend just leaving them out.
Figure 6 and some other discussions: Is there a way to make MAMnet conserve the mass of the inputs? This seems like a critical thing to do. Also, I recommend ug kg-1 or sm-3 rather than kg kg-1 since people are used to thinking of aerosol masses in ug m-3.
Figure 7: Is this any better or worse than how MAM itself does? I suspect they both have similar issues.
Figures 8 and 9: Are these products that are used for comparison any good? They vary so much, and I’m guessing that there are a lot of assumptions that go into getting CCN from the products.
Figure 9: Please add a legend to the figure rather than stating the colors in the caption.
L313: Please explain what a Shapley value is. What does a high or low feature value mean?
L348-349: Isn’t there a way to just force MAMnet to conserve the mass of the inputs?
L352-353: Are these better than the reference test (fixed size dist for each species) that I described above?
L356-359: Like my earlier comment, “exceedingly well” is an overstatement. Dpm is buffered to errors in Mass/Number.
Citation: https://doi.org/10.5194/egusphere-2025-482-RC2
RC3: 'Comment on egusphere-2025-482', Anonymous Referee #3, 27 May 2025
The use of a deep learning model to approximate the aerosol size distribution from bulk mass inputs is interesting and operationally valuable. Integration with MERRA-2 opens opportunities for reanalysis and assimilation improvements.
Following are my line-by-line comments:
L24-27: You may also flag that some modal schemes like GLOMAP in UK Met Office Unified Model assume a lognormal shape for each mode with prescribed geometric standard deviation and each mode is internally mixed. (in L87-88 you do mention something similar for another model)
L35: Consider specifying orders of magnitude, or citing a study quantifying what is really "better" for representing the ASD in models.
L56-57: Clarify what is meant by “meteorological state”—mention that it includes only temperature and air density up front, since this is unexpectedly minimal and a key methodological decision.
L77-79: Clarify whether the model includes any simplified representation of aerosol growth, aging, or wet removal in GOCART (even if parametrized), because "transport and evolution" may be construed to cover many physical phenomena.
L87-88: Suggest clarifying whether the geometric mean diameter is prognosed or computed diagnostically.
L96-97: What years were simulated? Why only two time points per day? This sparsity might miss diurnal features. How were the 25 output files selected, and what does "one file" correspond to (a single timestamp across the globe?)? The use of only 25 files for training seems low given the mention of >100M samples later. Please clarify.
L98-100: Add a sentence on whether aerosols evolve freely in these simulations or are constrained by observations. Can you clarify to which model levels these "horizontal winds" were nudged, and to what extent they affect aerosol number concentrations?
L104-105: Better to specify (a sketch of the alternatives follows the bullets):
- Was standardization applied before or after the log10 transformation?
- Are temperature and air density standardized globally or per level?
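For concreteness, the two alternatives these bullets distinguish might look like the following (purely illustrative data and shapes):

```python
import numpy as np

# Illustrative (samples, levels) field of positive mass concentrations
x = np.random.lognormal(mean=-20.0, sigma=2.0, size=(1000, 72))

x_log = np.log10(np.clip(x, 1e-30, None))  # log10 first; clip guards against zeros

# Option 1: global statistics over all samples and levels
x_std_global = (x_log - x_log.mean()) / x_log.std()

# Option 2: per-level statistics (one mean/std per vertical level)
x_std_per_level = (x_log - x_log.mean(axis=0)) / x_log.std(axis=0)
```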
L110-115: Clarify whether this Dpg was compared only during evaluation, or whether it was ever used in the loss function. Please state your loss function, as some physics-informed neural network models have modified it as well.
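For reference, a physics-informed variant of a standard regression loss, shown here purely as an illustration and not as the manuscript's actual formulation, could add a Dpg-consistency penalty:

```python
import torch

def composite_loss(pred, target, dpg_pred, dpg_target, alpha=0.1):
    # Standard MSE on the (transformed) mass and number outputs ...
    mse = torch.mean((pred - target) ** 2)
    # ... plus a penalty tying the diagnosed Dpg to its target value
    dpg_penalty = torch.mean((dpg_pred - dpg_target) ** 2)
    return mse + alpha * dpg_penalty
```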
L116-120: Clarify whether the flattened fields are shuffled across time and space, or whether there’s structure preserved (e.g., batches by time or region). Were any vertical or horizontal correlations exploited or lost?
L125-134: Were other architectures considered (e.g., transformers, residual connections)? If not, briefly justify.
L139-140: The earlier statement (line 97-98) says 25 files used for training, but here it says “5 for training, 2 for validation.” I think I am missing something here?
L163-175: Briefly discuss how errors in Dpg propagate to aerosol number concentration errors for CCN.
L244: The underestimation of Dpg in the SH is attributed to low data availability. Could it also be due to extrapolation error, since MAMnet may have learned associations biased toward NH-dominant training data? Could this be tested by applying class reweighting in the loss function (see the sketch below)?
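Such a reweighting could be as simple as a per-sample weighted MSE, sketched here with hypothetical hemisphere-frequency weights:

```python
import torch

def weighted_mse(pred, target, weights):
    # weights: per-sample, e.g. inverse hemisphere sample frequency, so that
    # underrepresented Southern Hemisphere samples contribute larger gradients
    return torch.mean(weights * (pred - target) ** 2)
```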
On reading the conclusion, the following questions came to mind:
- L340-344: Under what conditions does this input feature set (bulk mass, T, ρ) suffice? Where do predictions degrade (e.g., strong vertical motions, boundary layer transitions)? Why were other physically relevant predictors (RH, precipitation, cloud fraction, wind) excluded? Does this limit the model's use in complex meteorological regimes? Without input features tied to wet/dry removal, nucleation, or chemical aging, can the model really be used in weather forecasting or satellite retrievals across diverse regions, even when we find high correlations?
- L345-350: MAMnet is trained only on MAM model outputs, so how does the model avoid learning MAM's own biases? Can we say that evaluation against MERRA-2 is not necessarily independent since the training data is nudged to MERRA-2 meteorology?
- L351-357: Can MAMnet conserve total aerosol mass by design, or does this emerge from the calculation? This is never proven numerically, only implied via Dpg.
L358-370: There’s no attribution of error—how much is due to MAMnet, and how much due to MERRA-2 inputs?
Other than these, my overarching general comments are as follows:
- Unclear what one “file” represents—single timestep? Single day? Entire global field?
- It is also unclear whether any temporal or spatial overlap exists between train/test sets.
- No analysis on extrapolation over different time periods (e.g., pre-2000). The network is trained on a 5-year window using meteorology from MERRA-2 (likely post-2000). How would the model perform in periods with different emissions (e.g., 1980s)? Alternatively, discuss potential limitations in extrapolating to past or future climate states.
- How is the SHAP analysis computed over such a high-dimensional sample space (using any explainer method)? Was it computed on the flattened single-level dataset? How do you deal with feature correlation?
- Is MAMnet architecture resolution-agnostic? Though you use single-level training to make the model resolution-independent, how would MAMnet perform in coarser (~2.5°) or finer (<1°) gridded input?
Citation: https://doi.org/10.5194/egusphere-2025-482-RC3