A framework to holistically investigate processes controlling the aerosol lifecycle using explainable AI techniques
Abstract. General circulation models (GCMs) face significant uncertainties in estimating Earth's radiative budget due to aerosol-cloud interactions (ACI). To improve the representation of ACI in GCMs it is crucial to constrain processes controlling the aerosol lifecycle and the resulting size distribution. This is challenging due to the complexity and number of competing atmospheric processes that interact over large spatial and temporal scales which require untangling to elucidate dominant processes controlling aerosol properties. This study aims to (a) develop a generic explainable AI framework from air-mass history to build an accurate representation of processes controlling aerosol properties, from this, (b) identify key relationships between aerosol processes and their impacts on observed aerosol number concentrations, and (c) provide robust process-based observational constraints to aid in the isolation of GCM structural uncertainties. This is achieved by developing XGBoost regression models to simulate Aitken and accumulation mode number concentrations for receptor surface stations and application of TreeSHAP to identify key processes from explanatory variables describing meteorological and aerosol processes collocated to Lagrangian air-mass trajectories. The fidelity of this framework is demonstrated for the Antarctic station Trollhaugen, situated in a pristine region in which GCMs exhibit significant biases. Aerosol number concentrations at Trollhaugen were shown to be dominated by marine sources as well as transport from the free troposphere. The contribution from aloft dominates aerosol burden of the Aitken mode in the transitions between summer and winter, in contrast to a larger contribution in the summer from local marine sources from transport in the boundary layer.
Review of “A framework to holistically investigate processes controlling the aerosol lifecycle using explainable AI techniques”, Duncan et al., EGUsphere, https://doi.org/10.5194/egusphere-2025-4298
This paper uses a novel combination of back-trajectory analysis and machine learning models to investigate aerosol processes in the Antarctic region, with a focus on sources and sinks of aerosol measured at Trollhaugen as well as the seasonal cycle. I congratulate the authors for this work, which includes a large range of data sources and complex, machine learning-based methodology.
The results show that a machine learning model trained on air-mass history, meteorological data and various proxies for aerosol sources and removal processes can outperform a global climate model from the CMIP6 ensemble when aerosol number concentration is compared to measurements.
Further analysis of the observational dataset and machine learning model using advanced statistical techniques reveals key processes that are important for controlling the aerosol size distribution at Trollhaugen throughout the year. Results from other locations show that the method can also be applied in other regions. The article therefore provides an important contribution both to our understanding of Antarctic aerosol and to the field of aerosol modelling more generally. I recommend that the article be published once the minor comments below have been addressed.
General comments
Documentation of ACTRIS data: the time resolution of the raw data is not clear. Also, according to Table S1, 2014 appears to contain more flagged data than other years. Do the authors know possible causes for this, and are there any implications for a possible sensitivity to which years are selected for the testing vs. training datasets?
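One way to examine the sensitivity raised above is to report the flagged fraction per year and then hold out each year in turn as the test set. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the record layout and the toy values are hypothetical, and the model fitting step is deliberately left out.

```python
from collections import defaultdict


def flagged_fraction_by_year(records):
    """records: iterable of (year, is_flagged) pairs.

    Returns {year: fraction of records flagged in that year}.
    """
    total = defaultdict(int)
    flagged = defaultdict(int)
    for year, is_flagged in records:
        total[year] += 1
        flagged[year] += int(is_flagged)
    return {year: flagged[year] / total[year] for year in sorted(total)}


def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test


# Hypothetical toy records (year, flagged?); these are not the ACTRIS data.
records = [
    (2014, True), (2014, True), (2014, False),
    (2015, False), (2015, True),
    (2016, False), (2016, False),
]
fractions = flagged_fraction_by_year(records)
splits = list(leave_one_year_out([year for year, _ in records]))
```

Refitting the model on each training split and comparing the test scores across held-out years would show whether the choice of test year, and in particular the more heavily flagged 2014, changes the reported skill.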
Figures and tables: I suggest that, for clarity and brevity, Table 3 could be reported directly in the text or combined with Table 1, and Figures 3 and 4 could be combined into one figure. Tables 1 and 5 could also be combined, as discussed further below.
Use of different sites: it was not clear to me until later in the article (section 4.6) that results from multiple sites are presented, which makes references like “sites” or “for each site” confusing earlier in the text (e.g. Lines 421, 425, 442, title of section 4.1). Varrio and Mace Head could be briefly mentioned in earlier sections (abstract, introduction or methods) and Tables 1 and 5 could be combined.
Discussion of UKESM bias: You show that UKESM has a significant low bias in aerosol number concentration and also show that the ML model outperforms UKESM. This is a very interesting result and you rightly say in the conclusions that the approach presented in this study should therefore be used to inform model developments. However, I found the recommendations for future work on this topic a bit vague. Since identifying and reducing sources of model bias seems to me to be one of the main sources of value of this approach, I feel that the article would benefit from more detailed discussion of this in the conclusion.
Data and code: I note that the authors plan to release the data and code associated with the article upon publication. Because the article’s results rely on the synthesis of several datasets and the implementation of complex numerical methods, I strongly recommend that the authors deposit the code and any novel, processed datasets in openly accessible repositories (e.g. Zenodo or GitHub) with persistent identifiers as soon as possible.
Specific comments
Lines 158-161 and section S1.1: please clarify the time resolution of the raw PNSD measurements used before 6-hourly means are taken. Currently, it is not clear how the filtering steps affect the data availability within 6-hourly windows.
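The availability question could be answered by reporting, alongside each 6-hourly mean, the fraction of raw samples that survived filtering in that window. A minimal sketch follows, assuming an hourly raw resolution purely for illustration (the true resolution is exactly what the comment asks the authors to state); the function name and toy values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def six_hourly_means(samples, raw_step_minutes):
    """samples: list of (datetime, value or None); None marks a filtered sample.

    Returns {window_start: (mean or None, coverage)}, where coverage is the
    number of valid samples divided by the number expected in a 6-hour window.
    """
    expected = (6 * 60) // raw_step_minutes
    windows = defaultdict(list)
    for t, value in samples:
        start = t.replace(hour=(t.hour // 6) * 6, minute=0, second=0, microsecond=0)
        bucket = windows[start]  # creates the window even if every sample is filtered
        if value is not None:
            bucket.append(value)
    result = {}
    for start in sorted(windows):
        valid = windows[start]
        mean = sum(valid) / len(valid) if valid else None
        result[start] = (mean, len(valid) / expected)
    return result


# Hypothetical hourly series for 12 hours; hours 1 to 3 are filtered out.
t0 = datetime(2014, 1, 1)
samples = [
    (t0 + timedelta(hours=h), None if h in (1, 2, 3) else 100.0 + 10.0 * h)
    for h in range(12)
]
means = six_hourly_means(samples, raw_step_minutes=60)
```

In this toy case the first window keeps three of six samples (coverage 0.5) and the second keeps all six, which is the kind of per-window statistic that would make the effect of the filtering transparent.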
Lines 180-181: the size ranges for the target variables are given as approximate (~30-80 nm, ~80-660 nm), whereas I would imagine that they are defined precisely by the bins that are summed over. Please clarify this.
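To illustrate why the limits should be exact: when a number concentration is obtained by summing size bins, N is the sum of dN/dlogDp times the bin width in log10(Dp), and the true limits are the outer edges of the first and last bins included. The sketch below assumes that only bins lying fully inside the nominal range are summed; the bin edges, the values and the function name are hypothetical and are not taken from the manuscript.

```python
import math


def mode_number_concentration(edges_nm, dn_dlogdp, lower_nm, upper_nm):
    """Sum the bins lying fully inside [lower_nm, upper_nm].

    edges_nm:  bin edge diameters in nm (one more entry than there are bins)
    dn_dlogdp: dN/dlog10(Dp) for each bin (cm^-3)
    Returns (N, exact lower edge used, exact upper edge used).
    """
    total = 0.0
    used = []
    for i, value in enumerate(dn_dlogdp):
        lo, hi = edges_nm[i], edges_nm[i + 1]
        if lo >= lower_nm and hi <= upper_nm:
            # Bin contribution: dN/dlogDp multiplied by the bin width in log10 space.
            total += value * (math.log10(hi) - math.log10(lo))
            used.append((lo, hi))
    if not used:
        return 0.0, None, None
    return total, used[0][0], used[-1][1]


# Hypothetical bin edges (nm) and a flat distribution of 100 cm^-3 per log10 unit.
edges = [25.0, 32.0, 50.0, 79.0, 126.0]
values = [100.0, 100.0, 100.0, 100.0]
n_aitken, lower_used, upper_used = mode_number_concentration(edges, values, 30.0, 80.0)
```

Here a nominal range of 30-80 nm resolves to exact limits of 32-79 nm, which is the kind of precise statement the manuscript could give in place of the approximate values.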
Lines 192-193: do the authors plan to release the code for this bug fix, or is it documented elsewhere?
Lines 298-302: should “geometric standard deviation” on line 298 be “geometric mean diameter” as on line 300? Please clarify, and if so, consider reformulating the first two sentences to avoid repetition and aid clarity.
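For reference, the two quantities are distinct: the geometric mean diameter is the exponential of the number-weighted mean of ln(Dp), whereas the geometric standard deviation is the exponential of the number-weighted standard deviation of ln(Dp) and is dimensionless. A minimal sketch using these standard definitions (population form, dividing by the total number) follows; the function name and toy values are hypothetical and this is not necessarily how the manuscript computes them.

```python
import math


def geometric_mean_and_std(diameters_nm, counts):
    """Number-weighted geometric mean diameter (nm) and geometric standard deviation."""
    n_total = sum(counts)
    ln_dg = sum(n * math.log(d) for d, n in zip(diameters_nm, counts)) / n_total
    variance = sum(
        n * (math.log(d) - ln_dg) ** 2 for d, n in zip(diameters_nm, counts)
    ) / n_total
    return math.exp(ln_dg), math.exp(math.sqrt(variance))


# Toy case: equal numbers of particles at 50 nm and 200 nm.
d_g, sigma_g = geometric_mean_and_std([50.0, 200.0], [1.0, 1.0])
```

For this toy case the geometric mean diameter is 100 nm and the geometric standard deviation is 2, so the two terms cannot be used interchangeably.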
Line 479: “source” instead of “sources”
Line 482: remove comma after “similarly”
Line 561: in the text you reference seasonal behaviour: “As found with numerous previous studies, with less sea ice, in the warmer months, Antarctic sites have increased aerosol concentrations associated with more contribution from the ocean.” However, this makes it somewhat unclear whether the data presented in Figures 8 and 9 is seasonal or for the full annual cycle – please clarify.
Line 563: “Accumulated chlorophyll is also ranked highly for SHAP of both aerosol size range models (Fig. 7) and demonstrates a very consistent logarithmic relationship (Figs 8d and 9c).” Firstly, it seems like this should read “Figs 8c and 9c.” Secondly, the phrase “demonstrates a very consistent logarithmic relationship” is not totally clear to me. What is the relationship consistent over? Please specify.
Lines 702-706: I am not following what physical process could explain the link between increased time in cloud and increased particle number in both modes. “Cloud processing” is given, but I would expect this to increase accumulation mode number at the expense of Aitken mode number, as described in lines 700 and 701. Please clarify this section.
Lines 802-804: “consistent sources of model bias can be identified, to improve representation of natural aerosol processes in GCMs tuned for best global representation”. Although this study does indeed show that UKESM underestimates aerosol number and that the ML models outperform UKESM in this regard, the rest of the results focus only on insights into the observations and the ML model. Please add more detail about how this approach can be used to identify sources of model bias, such as the bias presented in Figure 4.