the Creative Commons Attribution 4.0 License.
A framework to holistically investigate processes controlling the aerosol lifecycle using explainable AI techniques
Abstract. General circulation models (GCMs) face significant uncertainties in estimating Earth's radiative budget due to aerosol-cloud interactions (ACI). To improve the representation of ACI in GCMs it is crucial to constrain processes controlling the aerosol lifecycle and the resulting size distribution. This is challenging due to the complexity and number of competing atmospheric processes that interact over large spatial and temporal scales which require untangling to elucidate dominant processes controlling aerosol properties. This study aims to (a) develop a generic explainable AI framework from air-mass history to build an accurate representation of processes controlling aerosol properties, from this, (b) identify key relationships between aerosol processes and their impacts on observed aerosol number concentrations, and (c) provide robust process-based observational constraints to aid in the isolation of GCM structural uncertainties. This is achieved by developing XGBoost regression models to simulate Aitken and accumulation mode number concentrations for receptor surface stations and application of TreeSHAP to identify key processes from explanatory variables describing meteorological and aerosol processes collocated to Lagrangian air-mass trajectories. The fidelity of this framework is demonstrated for the Antarctic station Trollhaugen, situated in a pristine region in which GCMs exhibit significant biases. Aerosol number concentrations at Trollhaugen were shown to be dominated by marine sources as well as transport from the free troposphere. The contribution from aloft dominates aerosol burden of the Aitken mode in the transitions between summer and winter, in contrast to a larger contribution in the summer from local marine sources from transport in the boundary layer.
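For readers unfamiliar with TreeSHAP, the attribution it produces is additive: for one sample, the model prediction equals a base value plus one contribution per explanatory variable. In standard SHAP notation (the symbols are generic and not taken from the manuscript), this reads:

```latex
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x), \qquad \phi_0 = \mathbb{E}[f(X)]
```

Here $M$ is the number of explanatory variables, $\phi_j(x)$ is the SHAP value of variable $j$ for sample $x$, and $\phi_0$ is the expected model output over the background data.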
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4298', Ruth Price, 26 Nov 2025
- RC2: 'Comment on egusphere-2025-4298', Anonymous Referee #2, 19 Dec 2025
Overall Assessment
This study proposes an innovative and insightful explainable AI framework that integrates Lagrangian airmass history with XGBoost and TreeSHAP for precise simulation and attribution of aerosol processes, providing in-depth analysis for the Antarctic case. However, the lack of rigorous physical verification for its key conclusions limits their persuasiveness. Additionally, the manuscript requires a thorough language revision to address recurrent grammatical errors.
Major Comments
1. The study relies heavily on inferring dominant aerosol processes from feature importance rankings and SHAP-value relationships. However, the scientific robustness of these conclusions is limited by a lack of validation linking these statistical associations to physical causality. For instance, the attribution of the log-correlation between chlorophyll-a and aerosol concentration to marine biogenic sources (Lines 564-566) is made without isolating confounding effects from co-varying drivers like temperature. Similarly, the dual role of precipitation (sink near site, source further away) is explained solely by opposing correlations of weighted vs. unweighted sums (Lines 711-721), lacking mechanistic validation against wet scavenging physics. The proposed contribution of free-tropospheric transport to the Aitken mode rests primarily on SHAP analysis (Lines 621-625) and would benefit from direct support, such as aerosol composition measurements. Finally, the speculative interpretation of positive SHAP values at very high windspeeds (>20 m/s) as possibly from sea spray or blowing snow (Lines 611-618) remains unsubstantiated by concurrent observational evidence or targeted analysis.
2. The argument for the framework's out-of-distribution generalizability remains limited. While Section 4.5 presents proof-of-concept applications at the Värriö and Mace Head sites, the analysis does not sufficiently demonstrate its applicability to environments mechanistically distinct from the Antarctic case. For instance, the model is not tested to see whether the primary drivers identified at Trollhaugen (e.g., marine sources, free-tropospheric transport) remain valid or are superseded by different key factors (e.g., biogenic VOCs at Värriö, air mass history at Mace Head) in these new settings. A comparative analysis of how feature importance rankings shift across these diverse sites is needed to substantiate the claim of broad applicability.
3. The inclusion of 35 explanatory variables is not accompanied by a clear description of the variable selection criteria. While the study correctly notes a high correlation (r=0.8) between trajectory speed and 10-m wind speed and acknowledges the limitation in separating their importance with TreeSHAP (Lines 595-598), it does not detail any preprocessing steps (e.g., filtering, regularization) taken to mitigate the impact of such collinearity on the SHAP-based interpretation. This leaves the feature importance rankings for these—and potentially other—correlated variables difficult to interpret confidently.
4. The criteria for removing extreme low-concentration data (N80 < 4 cm⁻³) are not scientifically substantiated, being described only as "appeared anomalous" (Lines 177-179). This subjective filtering risks altering the training data distribution, potentially biasing the model's learning of wintertime aerosol processes and obscuring the unique source-sink balance under very dry polar conditions. The lack of a clear, objective threshold undermines the reproducibility of the analysis.
Minor Comments
1. Lines 11-12: Add a comma to separate the introductory phrase from the main clause.
2. Lines 12-14: The sentence is grammatically confused due to its convoluted structure and a misplaced "which" clause.
3. Line 41: Ensure a comma separates the author name(s) and year (e.g., Fiddes et al., 2024), and check for this throughout.
4. Line 39: "e.g." should not be used before references. Please check elsewhere for similar issues.
5. Line 44: Add a subject before "but" to create two complete clauses. For example: "...will have significant impacts by acting to enhance or dampen RF, but these feedbacks are currently poorly constrained...".
6. Line 69: Add the missing verb. Correct to: "Whilst these techniques are useful to improve understanding of model bias, …".
7. Line 75: Use a comma instead of a semicolon before "However". Check elsewhere for similar issues.
8. Line 407: Insert a comma before “so”.
9. Line 426: Change the semicolon to a comma.
10. Line 534: Change “ranking are” to “ranking is” or “rankings are”.
11. Line 539: Change “slightly changes” to “slight changes”.
12. Line 579: Change “not expect” to “not expected”.
13. Line 596: Change “demonstrate” to “demonstrates”.
14. Line 671: Change the comma before “therefore” to a period. For example: “...Southern Ocean (McCoy et al., 2021). Therefore, the negative relationship...”.
15. Line 683: Change “dilution aerosols” to “dilution of aerosols” or “aerosol dilution”.
Citation: https://doi.org/10.5194/egusphere-2025-4298-RC2
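Major comment 3 above concerns collinear predictors such as trajectory speed and 10-m wind speed (r = 0.8). One simple screening step, sketched here in Python as an illustration and not as the authors' method, is to list every pair of explanatory variables whose absolute Pearson correlation meets a chosen threshold, so that their SHAP importances can be reported jointly or one of the pair dropped. The variable names, the sample values, and the 0.8 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical values for illustration only.
features = {
    "trajectory_speed": [2.0, 4.0, 6.0, 8.0, 10.0],
    "wind_speed_10m": [1.5, 3.0, 5.5, 6.5, 9.0],
    "chlorophyll_a": [0.3, 0.1, 0.4, 0.1, 0.2],
}
flagged = correlated_pairs(features)
```

With these made-up numbers only the two wind-related variables are flagged. A screen of this kind does not remove the attribution ambiguity, but it documents which rankings should be read as a group.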
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 248 | 120 | 24 | 392 | 33 | 15 | 16 |
Review of “A framework to holistically investigate processes controlling the aerosol lifecycle using explainable AI techniques”, Duncan et al., EGUsphere, https://doi.org/10.5194/egusphere-2025-4298
This paper uses a novel combination of back-trajectory analysis and machine learning models to investigate aerosol processes in the Antarctic region, with a focus on sources and sinks of aerosol measured at Trollhaugen as well as the seasonal cycle. I congratulate the authors for this work, which includes a large range of data sources and complex, machine learning-based methodology.
The results show that a machine learning model trained on air-mass history, meteorological data and various proxies for aerosol sources and removal processes can outperform a global climate model from the CMIP6 ensemble when aerosol number concentration is compared to measurements.
Further analysis of the observational dataset and machine learning model using advanced statistical techniques reveals key processes that are important for controlling the aerosol size distribution at Trollhaugen throughout the year. Results from other locations show that the method can also be applied in other regions. The article therefore provides an important contribution both to our understanding of Antarctic aerosol and to the field of aerosol modelling more generally. I recommend that the article is published after the minor comments below have been addressed.
General comments
Documentation of ACTRIS data: the time resolution of the raw data is not clear. Also, according to Table S1, 2014 appears to contain more flagged data than other years. Do the authors know possible causes for this, and are there any implications for a possible sensitivity to which years are selected for the testing vs. training datasets?
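The sensitivity to the choice of test years raised here could be probed with a leave-one-year-out split, in which each year is held out once and the model is retrained on the remaining years. The Python sketch below only builds the splits; the years, the record layout, and the values are hypothetical, and nothing is assumed about the authors' actual train/test procedure.

```python
def leave_one_year_out(records):
    """Yield (held_out_year, train, test) with each year held out once.

    `records` is a list of (year, value) tuples; this layout is hypothetical.
    """
    years = sorted({year for year, _ in records})
    for held_out in years:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


# Hypothetical samples tagged by year (values are placeholders).
records = [(2014, 110.0), (2014, 95.0), (2015, 130.0), (2016, 120.0), (2016, 90.0)]
splits = list(leave_one_year_out(records))
```

Comparing a skill score across the held-out years, and for 2014 in particular, would indicate whether the larger share of flagged data in that year affects the results.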
Figures and tables: I suggest that for clarity and brevity, Table 3 could be reported directly in the text or combined with Table 1, and Figures 3 and 4 could be combined into one figure. Tables 1 and 5 could also be combined, as discussed further below.
Use of different sites: it was not clear to me until later in the article (section 4.6) that results from multiple sites are presented, which makes references like “sites” or “for each site” confusing earlier in the text (e.g. Lines 421, 425, 442, title of section 4.1). Värriö and Mace Head could be briefly mentioned in earlier sections (abstract, introduction or methods) and Tables 1 and 5 could be combined.
Discussion of UKESM bias: You show that UKESM has a significant low bias in aerosol number concentration and also show that the ML model outperforms UKESM. This is a very interesting result and you rightly say in the conclusions that the approach presented in this study should therefore be used to inform model developments. However, I found the recommendations for future work on this topic a bit vague. Since identifying and reducing sources of model bias seems to me to be one of the main sources of value of this approach, I feel that the article would benefit from more detailed discussion of this in the conclusion.
Data and code: I note that the authors plan to release the data and code associated with the article upon publication. Because the article’s results rely on the synthesis of several datasets and the implementation of complex numerical methods, I strongly recommend that the authors deposit the code and any novel, processed datasets in openly accessible repositories (e.g. Zenodo or GitHub) with persistent identifiers as soon as possible.
Specific comments
Lines 158-161 and section S1.1: please clarify the time resolution of the raw PNSD measurements used before 6-hourly means are taken. Currently, it is not clear how the filtering steps affect the data availability within 6-hourly windows.
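The dependence of 6-hourly data availability on the raw resolution can be made explicit by reporting, for each window, the fraction of expected raw samples that survived filtering. A minimal Python sketch follows; the hourly raw resolution, the 50 % coverage threshold, and the sample values are assumptions for illustration only, since the raw resolution of the manuscript's data is exactly what is being asked.

```python
from collections import defaultdict


def six_hourly_means(samples, raw_step_hours=1.0, min_coverage=0.5):
    """Average valid samples into 6-hour windows and report coverage.

    `samples` is a list of (hours_since_start, value) pairs that passed
    filtering. Returns {window_index: (mean, coverage)}, where mean is
    None if coverage is below `min_coverage`. Windows with no valid
    samples at all do not appear in the result.
    """
    expected = 6.0 / raw_step_hours  # raw samples in a complete window
    windows = defaultdict(list)
    for hour, value in samples:
        windows[int(hour // 6)].append(value)
    result = {}
    for index, values in sorted(windows.items()):
        coverage = len(values) / expected
        mean = sum(values) / len(values) if coverage >= min_coverage else None
        result[index] = (mean, coverage)
    return result


# Hypothetical hourly concentrations (cm^-3); hours 6 to 9 were filtered out.
samples = [(0, 100.0), (1, 110.0), (2, 90.0), (3, 100.0), (4, 105.0), (5, 95.0),
           (10, 80.0), (11, 120.0)]
means = six_hourly_means(samples)
```

In this made-up case the first window is complete, while the second retains only two of six samples and is therefore rejected. Reporting the coverage alongside each mean would show how much of each 6-hourly value rests on filtered data.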
Lines 180-181: the size ranges for the target variables are given as approximate (~30-80, ~80-660nm), whereas I would imagine that they are defined precisely based on which bins are summed over. Please clarify this.
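Stating the exact bin edges would make the target variables reproducible. As an illustration, the number concentration in a size range can be obtained from a binned distribution dN/dlogDp by summing each bin's value times its logarithmic width over the bins that lie fully inside the range. The Python sketch below uses hypothetical bin edges and values, and its rule for partially overlapping bins (exclude them) is an assumption, not the authors' procedure.

```python
from math import log10


def integrate_number(edges_nm, dn_dlogdp, lower_nm, upper_nm):
    """Sum dN/dlogDp * dlogDp over bins fully inside [lower_nm, upper_nm].

    `edges_nm` has one more element than `dn_dlogdp`. Returns the number
    concentration and the exact diameter limits actually integrated.
    """
    total = 0.0
    used = []
    for low, high, value in zip(edges_nm[:-1], edges_nm[1:], dn_dlogdp):
        if low >= lower_nm and high <= upper_nm:
            total += value * (log10(high) - log10(low))
            used.append((low, high))
    limits = (used[0][0], used[-1][1]) if used else None
    return total, limits


# Hypothetical bin edges (nm) and dN/dlogDp values (cm^-3).
edges_nm = [10.0, 31.6, 100.0, 316.0, 1000.0]
dn_dlogdp = [50.0, 200.0, 80.0, 5.0]
n_range, limits = integrate_number(edges_nm, dn_dlogdp, 30.0, 320.0)
```

Returning the integrated limits alongside the concentration makes it clear that a nominal range such as 30 to 320 nm corresponds here to exactly 31.6 to 316 nm, which is the kind of precise definition requested.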
Lines 192-193: do the authors plan to release the code for this bug fix, or is it documented elsewhere?
Lines 298-302: should “geometric standard deviation” on line 298 be “geometric mean diameter” as on line 300? Please clarify, and if so, consider reformulating the first two sentences to avoid repetition and aid clarity.
Line 479: “source” instead of “sources”
Line 482: remove comma after “similarly”
Line 561: in the text you reference seasonal behaviour: “As found with numerous previous studies, with less sea ice, in the warmer months, Antarctic sites have increased aerosol concentrations associated with more contribution from the ocean.” However, this makes it somewhat unclear whether the data presented in Figures 8 and 9 is seasonal or for the full annual cycle – please clarify.
Line 563: “Accumulated chlorophyll is also ranked highly for SHAP of both aerosol size range models (Fig. 7) and demonstrates a very consistent logarithmic relationship (Figs 8d and 9c).” Firstly, it seems like this should read “Figs 8c and 9c”. Secondly, the phrase “demonstrates a very consistent logarithmic relationship” is not totally clear to me. What is the relationship consistent over? Please specify.
Lines 702-706: I am not following what physical process could explain the link between increased time in cloud and increased particle number in both modes. “Cloud processing” is given, but I would expect this to increase accumulation mode number at the expense of Aitken mode number, as described in lines 700 and 701. Please clarify this section.
Lines 802-804: “consistent sources of model bias can be identified, to improve representation of natural aerosol processes in GCMs tuned for best global representation”. Although this study does indeed present the UKESM underestimation of aerosol number and shows that the ML models outperform UKESM in this regard, the rest of the results focus only on insights into the observations and ML model. Please add more detail about how this approach can be used to identify sources of model bias, such as the bias presented in Figure 4.