An Ensemble Machine Learning Method to Retrieve Aerosol Parameters from Ground-based Sun-sky Photometer Measurements
Abstract. Ground-based Sun-sky photometers have been widely used to measure aerosol optical and microphysical properties, yet the conventional numerical inversion schemes are often computationally expensive. In this study, we developed an explainable Ensemble Machine Learning (EML) model that simultaneously retrieves aerosol single scattering albedo (SSA), scattering asymmetry parameter (g), effective radius (reff), and fine-mode fraction (FMF) from direct and diffuse solar radiation measurements, with feature importance quantified using SHapley Additive exPlanations (SHAP). The EML model was trained and validated on a dataset of 110,000 samples simulated using the T-matrix particle scattering model and the VLIDORT radiative transfer model, encompassing diverse aerosol, atmospheric, and surface conditions. The algorithm demonstrated robustness through ten-fold cross validation, achieving correlation coefficients of 0.94, 0.95, 0.92, and 0.90 for SSA, g, reff, and FMF on the validation set, respectively. SHAP-based feature importance analysis confirmed the physical interpretability of the model, highlighting its effective use of multi-band radiance information and the stronger dependence of SSA retrieval on aerosol optical depth (AOD) relative to g and reff. Retrieval uncertainties estimated from repeated noise perturbation experiments were 0.03 for SSA, 0.02 for g, 0.08 for reff, and 0.09 for FMF. Applied to 132,067 sets of raw photometer measurements, the EML-based retrieval produced forward radiance fitting residuals comparable to those of the AERONET official inversion products. Moreover, compared with numerical algorithms, the EML model eliminates the need for a priori assumptions and smoothness constraints, while improving computational efficiency by more than five orders of magnitude.
General Comments
This study aims to develop an ensemble machine learning model for the retrieval of aerosol parameters, i.e., aerosol single scattering albedo (SSA), scattering asymmetry parameter (g), effective radius (r_eff), and fine-mode fraction (FMF), utilizing ground-based sky radiance observations from the AERONET network and radiative transfer simulations. The authors combine three powerful and widely used machine learning techniques to achieve high-accuracy retrievals at reduced computational cost, without the a priori assumptions and constraints required by traditional inversion algorithms. The retrieved products are well presented and the ML model is sufficiently evaluated. However, the model architecture is not described in detail, and some information on the configuration and on the data used as input in both the ML and RTM models is not very clear to me.
Overall, this is a well-written manuscript that fits into the scope of AMT. I would recommend publication of this manuscript after the following issues have been addressed.
Specific Comments
Section 2.1: It would be useful to list all AERONET sites used in the study and the period the dataset covers. What are the criteria (if any) for the selection of the sites and data period?
Page 6, line 130: What do you mean by “randomly combined”?
Section 2.2: Are the values of the EML target parameters (SSA, g, r_eff, and FMF) derived from the RTM, using the T-matrix for SSA and g and Equations (2) and (3) for r_eff and FMF? Is there any information from AERONET that you use in these computations?
Section 2.2: Does the RTM input from AERONET include the EML target aerosol parameters (SSA, g, r_eff, and FMF) in any way? That could probably be considered data leakage.
Sections 2.2 and 2.3: It is not quite clear to me whether (and which) AERONET data are used as input in the RTM simulations and EML training, or whether AERONET data are used only for the final evaluation. Are the spectral AOD values used in EML model training and cross-validation derived directly from AERONET? Apart from surface reflectance, are any aerosol parameters from AERONET used as input to the RTM? Please clarify even if aerosol parameters are used only indirectly, e.g., to compute another parameter that is then used as input. Consider adding a table with the RTM configuration and input variables (with their sources, e.g., AERONET, ERA5, climatology).
Section 2.3: Please list in detail all variables used as input to the EML model, e.g., all AOD and sky radiance wavelengths, all geometric parameters including all RAAs, etc. Are all 30 RAAs mentioned in line 117 used as input? Consider adding a table with all the input variables and their sources (if any), e.g., AERONET, RTM simulations, etc.
Page 8, line 183: Random Forest (RF), Gradient Boosting (GB), and Multi-Layer Perceptron (MLP) are referred to as “base learners”. The term “base learner” typically refers to weak and inexpensive models, such as shallow decision trees, linear models, or naive Bayes models. RF and GB are themselves powerful ensemble methods, and MLPs are computationally expensive models that are often used as “strong learners” on their own. In the context of building an ensemble of RF + GB + MLP, you could perhaps say that RF, GB, and MLP are used as base learners to construct a “higher-level” ensemble model, but not that these models are “base learners” by definition. Please rephrase accordingly.
Section 2.3: Since these three architectures are strong learners on their own, have you tried training the RF, GB, and MLP architectures separately and comparing the results with the EML, to see whether there is an actual improvement when these methods are combined? If there is no significant improvement, then there may be no need to build such a “higher-level” model ensemble.
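For illustration, such a comparison could be set up along the following lines. This is only a minimal sketch assuming scikit-learn and synthetic stand-in data; the ensemble here is a simple prediction average, which may differ from the authors' actual combination scheme and hyperparameters.

```python
# Hypothetical sketch (not the authors' code): compare RF, GB, and MLP
# individually against a simple averaging ensemble under identical CV folds.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the simulated training data (120 input features).
X, y = make_regression(n_samples=2000, n_features=120, noise=0.1, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=300, random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                                      random_state=0)),
}
models["Ensemble"] = VotingRegressor(list(models.items()))

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```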
Section 2.3: What method is used as the base learner in the GB model? Decision trees or something else? Please clarify in the manuscript.
Section 2.3: Please mention the values used for important hyperparameters of RF and GB, e.g., number of estimators, learning rate, maximum depth, and maximum features.
Section 2.3: Please describe the MLP architecture and hyperparameters used.
Section 2.3: How are the different model components (MLP, RF, GB) ensembled? Do you use model stacking or another methodology? Please clarify in the manuscript.
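As one example of what such a description could cover, a stacking scheme (one of several possibilities; the authors' actual method may be a simple average, weighted blend, or something else) could look roughly as follows, assuming scikit-learn and synthetic stand-in data:

```python
# Hypothetical sketch: combining RF, GB, and MLP via stacking. This is one
# possible ensembling scheme, not necessarily the one used in the manuscript.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2000, n_features=120, noise=0.1, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
        ("gb", GradientBoostingRegressor(n_estimators=300, random_state=0)),
        ("mlp", make_pipeline(StandardScaler(),
                              MLPRegressor(hidden_layer_sizes=(64,),
                                           max_iter=500, random_state=0))),
    ],
    final_estimator=RidgeCV(),  # meta-learner fitted on out-of-fold predictions
    cv=5,
)
stack.fit(X, y)
```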
Section 2.3: Does the EML model predict all target aerosol parameters (SSA, g, r_eff, and FMF) at once? Please clarify in the manuscript.
Page 11, line 255 and Table 1: How are these metrics aggregated across all retrieved variables? Were they obtained already aggregated from the EML model, or were they averaged afterwards? Are there metrics for each predicted parameter separately? To my knowledge, the ML model can report validation metrics separately for each predicted variable. If possible, include separate metrics for each target variable. Also, consider reporting the standard deviation alongside every average value.
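Reporting per-target metrics is straightforward; a minimal sketch (assuming scikit-learn, a multi-output regressor, and synthetic stand-in data) is given below:

```python
# Hypothetical sketch: report R2 and RMSE separately for each of the four
# targets (SSA, g, r_eff, FMF) rather than as a single aggregated score.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with four simultaneous targets.
X, Y = make_regression(n_samples=2000, n_features=120, n_targets=4,
                       noise=0.1, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

for i, name in enumerate(["SSA", "g", "r_eff", "FMF"]):
    r2 = r2_score(Y_te[:, i], Y_hat[:, i])
    rmse = np.sqrt(mean_squared_error(Y_te[:, i], Y_hat[:, i]))
    print(f"{name}: R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```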
Section 3.2: If the EML model is trained using target aerosol parameters (SSA, g, r_eff, and FMF) that are not directly retrieved from AERONET, but instead computed using RTM, T-matrix calculations, and Equations (2) and (3), then comparisons with AERONET products should be interpreted as an evaluation of the ML-derived product against an independent dataset, rather than as a direct assessment of model performance. A true evaluation of the algorithm should be based on comparisons between predicted values and target values derived using the same methodology as the training dataset, since the ML model’s performance is defined by how accurately it reproduces the specific target quantities it was trained to predict. Please clarify this distinction.
Section 3.3: Have you also checked the feature importance for each individual learner (RF, GB, MLP)? It would be interesting to see whether there are differences in how the different model architectures “learn” from the data.
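SHAP could be applied to each learner separately (e.g., TreeExplainer for RF/GB and KernelExplainer for the MLP); as a model-agnostic alternative, permutation importance would also work. A minimal sketch under these assumptions, with synthetic stand-in data:

```python
# Hypothetical sketch: permutation importance computed for each learner
# separately (SHAP explainers could be used analogously per learner).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

learners = {
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=300, random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                                      random_state=0)),
}
for name, model in learners.items():
    model.fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    top = np.argsort(imp.importances_mean)[::-1][:5]
    print(f"{name} top-5 features: {top}")
```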
Section 3.3: In the caption of Figure 4, it is mentioned that 120 features are used in total and that sky radiances from 23 observation geometries are included. Please mention all features in the manuscript and consider adding a table listing them. Since the number of features is quite large, are there any features that could be considered unimportant based on the feature importance analysis and could therefore be excluded from EML training? Have you also conducted an analysis of possible correlations between the different input variables? Such an analysis may reveal further variables that could be excluded from the training (see the sketch below).
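A simple correlation screening of the inputs could be done roughly as follows; this is only a sketch assuming pandas, a synthetic stand-in feature table, and an arbitrary |r| = 0.95 threshold:

```python
# Hypothetical sketch: screen the 120 input features for strong mutual
# correlation as candidates for removal before EML training.
import numpy as np
import pandas as pd

# Synthetic stand-in for the real feature table.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(5000, 120)),
                        columns=[f"feat_{i:03d}" for i in range(120)])

corr = features.corr().abs()
# Keep only the upper triangle so each feature pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(f"{len(to_drop)} features exceed |r| = 0.95 with an earlier feature")
```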
Page 16, Figure 4: Radiance at 675 nm seems to play an important role in the retrieval of g at 870 and 1020 nm. Do you have any explanation for this?
Page 16, Figure 4: Consider showing the importance of AOD at each wavelength separately.
Appendix B: Was the EML also trained on data with AOD < 0.4? If not, then it is perhaps expected to fail in such cases. To test EML performance on data with AOD < 0.4 and present meaningful results, you should perhaps include such cases in the model training.
Technical Corrections
Page 2, line 45: The period after “radiative transfer model (RTM)” appears to be a typographical error.
Page 5, Figure 1: There are some typographical errors, e.g., “meansurement” and “redisual”, in the “Algorithm Evaluation” box of the flowchart.
Page 2, Figure 2: There are three rows, not four. Please rephrase the following sentence in the caption accordingly: “The four rows correspond to the four retrieved variables: SSA, g, 𝑟𝑒𝑓𝑓, and FMF.”
Page 2, Figure 2: The sentence “The four columns represent the observation bands at 440, 675, 870, and 1020 nm.” applies to the first and second row, but not the last one. Please clarify in the caption.
Page 16, Figure 4: Similarly, replace “four rows” with “three rows” in the caption and clarify that “The four columns represent the observation wavelengths…” applies only to the first and second rows.
Page 17, Figure 5: The colorbar needs improvement. Consider removing the values from the colorbar, which, to my understanding, do not correspond to the metrics’ values, and changing the colorbar title; alternatively, remove the colorbar altogether and keep only the explanation in the figure caption.
Page 20, Figure 7: Please replace “the numbers inside each box” with “the numbers above each box” in the caption.