the Creative Commons Attribution 4.0 License.
Technical note: Reconstructing surface missing aerosol elemental carbon data in long-term series with ensemble learning
Abstract. Ground-based measurements of elemental carbon (EC) – classified under thermal-optical methods and considered as a surrogate for black carbon – are essential for assessing air quality and evaluating climate impacts. However, data gaps caused by technical challenges impede comprehensive analyses of long-term trends. This study proposes an ensemble learning method to address these challenges. The model uses readily accessible ground observation air pollutant data as proxies for EC-related tracers, along with meteorological parameters, to enhance prediction accuracy. It integrates outputs from Gradient Boosting Regression Trees, eXtreme Gradient Boosting, and Random Forest models, combining them through ridge regression to produce robust predictions. We applied this approach to reconstruct hourly EC concentrations from 2013 to 2023 for four cities in Eastern China, filling 45–79 % of missing data and improving prediction performance by 8–17 % compared to individual models. Over the 11-year period, EC exhibited an overall decline (-0.20 to -0.14 µg m-3 a-1), with a more significant decline from 2013 to 2020 (-0.24 to -0.15 µg m-3 a-1) from 3.26 µg m-3 to 1.59 µg m-3, followed by a noticeable slowdown from 2020 to 2023 (-0.12 to -0.04 µg m-3 a-1). Additionally, a fixed emission approximation method based on ensemble learning is proposed to quantitatively analyze the drivers of long-term EC trends. The analysis reveals that anthropogenic emission controls were the predominant contributors, accounting for approximately 92 % of the changes in EC trends from 2013 to 2020. However, their influence weakened post-2020, contributing approximately 80 %. These findings highlight that while China's Clean Air Actions implemented since 2013 have significantly reduced black carbon concentrations, sustained and enhanced strategies are still necessary to further mitigate black carbon pollution in the country.
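The stacking step described in the abstract, where ridge regression combines the outputs of three base models, can be sketched as follows. This is a minimal illustration that assumes the base models have already produced predictions. The numbers, the function names (`solve_linear`, `fit_ridge`, `predict`), and the penalty `alpha=1.0` are hypothetical and are not taken from the manuscript.

```python
# Minimal sketch: blend three base-model predictions with ridge regression.
# All values and names are illustrative assumptions, not the authors' data or code.

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ridge(features, target, alpha):
    """Return (intercept, weights) minimising ||y - b0 - Xw||^2 + alpha * ||w||^2.

    The intercept is left unpenalised by centring X and y first.
    """
    n, p = len(features), len(features[0])
    col_means = [sum(row[j] for row in features) / n for j in range(p)]
    y_mean = sum(target) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in features]
    yc = [y - y_mean for y in target]
    gram = [[sum(xc[i][a] * xc[i][b] for i in range(n)) + (alpha if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    weights = solve_linear(gram, rhs)
    intercept = y_mean - sum(w * mu for w, mu in zip(weights, col_means))
    return intercept, weights

def predict(intercept, weights, row):
    return intercept + sum(w * v for w, v in zip(weights, row))

# Hypothetical hourly EC values (ug m-3); columns: GBRT, XGBoost, RF predictions.
base_preds = [
    [2.9, 3.1, 3.0],
    [2.1, 2.0, 2.3],
    [1.4, 1.6, 1.5],
    [3.6, 3.4, 3.3],
    [1.0, 1.1, 0.9],
    [2.5, 2.6, 2.4],
]
observed = [3.0, 2.1, 1.5, 3.5, 1.0, 2.5]

intercept, weights = fit_ridge(base_preds, observed, alpha=1.0)
blended = [predict(intercept, weights, row) for row in base_preds]
```

In a real stacking setup the meta-learner is usually fitted on out-of-fold base-model predictions rather than in-sample ones, so that it does not simply reward the base model that overfits most. Whether the authors did this is not stated in the abstract.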
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-2776', Anonymous Referee #2, 20 Dec 2024
Long-term in-situ observations of black carbon aerosols are crucial for studying their environmental and climatic effects. However, in real-world observational studies, there are several inevitable technical challenges, such as data gaps. This paper proposes a machine learning method that elegantly addresses this issue. The method is applied to reconstruct time-series data of elemental carbon (EC) aerosols from four cities in eastern China. The results are also validated by comparing them with other datasets. Furthermore, the paper introduces a novel method for assessing the driving factors of long-term trends in elemental carbon, as well as evaluating the uncertainty associated with this approach. I believe both methods hold significant value for the field of atmospheric monitoring. Overall, the paper is well designed and written. However, I have the following points that the authors should address:
The authors introduce MERRA-2 black carbon column concentration data as one of the predictor variables. They also compare MERRA-2 near-surface black carbon concentrations and find that the MERRA-2 data tends to overestimate the site's elemental carbon data. I suggest that the authors conduct a sensitivity test by training the machine learning model without using MERRA-2 black carbon column concentration as a predictor variable and compare the results with the current ones.
The trend changes in EC aerosols are influenced by both meteorological conditions and emissions. In eastern China, the sources of black carbon generally include vehicle emissions and industrial coal combustion. While the paper quantifies the overall anthropogenic emission trend drivers, there is relatively little information on specific emission sectors, which may be a limitation of the method employed. The paper analyzes the daily variation of EC over the years and suggests that the reduction of motor vehicle emissions may be a major factor driving the decline in EC levels. I suggest that the authors could try to extend this analysis by investigating the trend changes of EC during vehicle emission rush hours or by quantifying the driving factors for these peak periods. This could provide a more detailed understanding of the trend changes.
The authors use the ridge regression algorithm for the multivariate regression analysis but do not employ the traditional multiple linear regression algorithm. I recommend that the authors clarify this choice. Additionally, regarding Equation 1, the expression may cause confusion because GBRTs, XGBoost, and RF are abbreviations for different machine learning algorithms, yet they are presented as variables in the formula. I suggest the authors optimize the notation for clarity.
Line 148 – 149: Appropriate references should be cited to support the use of these pollutants as tracers for source characterization.
Line 239: The phrase "Reconstruction of missing data of EC and trend analysis" should be revised to "Reconstruction of missing data of EC and comparison".
Line 336 – 337: The discussion on the impact of COVID-19 lockdowns on EC trend changes is well noted as a factual observation. Could the authors further discuss or quantify this impact?
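On the notation point raised above for Equation 1, one possible way to keep algorithm names separate from variables is sketched below. This is an assumed form, not the manuscript's actual Equation 1: the symbols $\hat{y}_m(t)$ (prediction of base model $m$ at time $t$), $\beta_m$ (blending weights), and $\lambda$ (ridge penalty) are introduced here for illustration only.

```latex
\widehat{\mathrm{EC}}(t) = \beta_0 + \sum_{m \in \{\mathrm{GBRT},\,\mathrm{XGB},\,\mathrm{RF}\}} \beta_m\, \hat{y}_m(t),
\qquad
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_t \bigl( y(t) - \widehat{\mathrm{EC}}(t) \bigr)^2 + \lambda \sum_m \beta_m^2 .
```

Setting $\lambda = 0$ recovers ordinary multiple linear regression. A positive $\lambda$ shrinks the weights, which tends to stabilise them when the base-model predictions are strongly correlated, as is usually the case for tree ensembles trained on the same predictors.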
Citation: https://doi.org/10.5194/egusphere-2024-2776-RC1
RC2: 'Comment on egusphere-2024-2776', Anonymous Referee #3, 29 Jan 2025
The authors build an ensemble learning model that integrates three machine learning models, Gradient Boosting Regression Trees (GBT), eXtreme Gradient Boosting (XGB), and Random Forest (RF), and combines them through ridge regression to generate robust predictions, in order to fill gaps in the elemental carbon (EC) data from 2013 to 2023 in the Yangtze River Delta, China. The reconstructed EC dataset is validated by intercomparison with other datasets. Lastly, ensemble learning is used to design a fixed emission approximation method to disentangle and quantify the contribution of anthropogenic drivers to the EC reduction.
This work is well organized. The authors present sufficient evidence of the robustness and good performance of the ensemble learning method. However, I'm sceptical about certain results of this study, particularly those from the fixed emission approximation method. Acceptance of this manuscript is contingent upon the authors thoroughly validating those results. In addition, several places in the manuscript require improvement. I recommend acceptance after the authors address the comments and concerns detailed below.
General comment:
After reading this manuscript, my initial impression is that the authors have a wide knowledge of machine learning. However, I have some concerns. As you mention in Section 2.4.3 (Line 225), the errors increase when 2018 and 2019 are used as baseline years.
1) I am confused by the reason you provide, namely missing meteorological parameters. As far as I know, ERA5 is a continuously updated dataset and should not have missing values in 2018 and 2019. Please clarify this point.
2) If possible, try using ground-based measurements of meteorological factors rather than ERA5.
3) Please clarify in Section 2.1 how you retrieved the meteorological factors from ERA5 for the four cities.
4) In principle, the choice of the baseline year is critical. The baseline year should be representative of typical conditions. If the selected year is anomalous (e.g. the large emission reduction in the COVID year), it could lead to an overestimation or underestimation. Could you explain how you chose the baseline year?
Specific comment:
1) Line 27: Rephrase the sentence: from 2013 to 2020 (-0.24 to -0.15 µg m⁻³ a⁻¹) from 3.26 µg m⁻³ to 1.59 µg m⁻³
2) Maintain consistent verb tenses throughout the narrative. For example, "we evaluated…" in Line 199 and "we propose…" in Line 206.
3) Line 214: "If the FEA method were…". Please double-check the whole text and use singular and plural forms correctly.
RC3: 'Comment on egusphere-2024-2776', Anonymous Referee #4, 08 Feb 2025
The current manuscript aims to address the lack of continuous black carbon (BC) data for four cities by gap-filling the measurements, ultimately to assess the trends in this pollutant resulting from the mitigation plans enforced in China over the 2013-2023 period. The reconstruction of these measurements is conducted by means of an ensemble of machine learning (ML) techniques validated against the existing data, providing good agreement for all sites and years. Additionally, the manuscript provides a method to estimate weather and emission contributions to the reported concentrations, thereby assessing the effectiveness of the abatement actions based solely on the anthropogenic drivers of BC.
The reviewer agrees to publish this article under minor revisions.
Overall Feedback
The presented manuscript is outstanding in its implementation of machine learning in atmospheric aerosol studies while keeping sight of its final purpose: evaluating the trends of the studied pollutant as a consequence of the implemented abatement plans. The paper consists of three main blocks: i. gap filling of the BC time series; ii. separation of the anthropogenic and meteorological drivers of BC evolution; iii. trend analysis of the outcomes of i. and ii. to evaluate China's pollution mitigation actions. The manuscript is in general very well written and structured. However, I list below certain aspects that should be addressed:
- The BC and EC data used in the EL models are not clearly described, nor is the conversion from one to the other. First, please state for which cities you have EC, for which you have rBC, and for which you have both (Nanjing only, I assume). I see that Figure 1 shows a 1:1 slope for the presented sites and the Nanjing dataset, but this cannot hold for every site (Jeong et al., 2004; Rigler et al., 2020). Since you cannot provide a 1:1 EC-BC scatterplot for your other sites, please at least state the risk that EC and BC are not interchangeable at the remaining sites and indicate the possible consequences.
- In Sections 2.2 and 2.3 you mention the limitations of measurements and simulations and the substantial uncertainty these could introduce into the EL model. Did you consider including the uncertainties of both measurements and models as predictors in your EL? If they turned out to be strong predictors, you could narrow down which instrumental errors are most problematic for your data reconstruction, and you might improve the predictions by filtering them out.
- Line 134. Provide some explanation on the advantages of the ridge regression or a reference.
- Please provide a list of all the “meteorological and emission indicator variables” (Line 151) that you feed the model with.
- The proportion of data trained vs. reconstructed is concerning. Even with good reconstruction metrics, I am somewhat skeptical that extrapolating what the model learned to other years is not an oversimplification, especially if the years to be reconstructed precede the mitigation policies, as for Xuzhou and Zhenjiang. You could be missing significant drivers of EC that were minimized after the abatement regulations. Please reconsider evaluating such long-term trends for these last two cities if you do not have any measurements or satellite information about the earlier atmospheric composition. Also, provide the correlation with CO and NOx, which you gave for the whole period, for the reconstructed periods alone in addition to the overall long-term correlation.
- Figure 4. Could the comparison between EC (EL predictions), BCC (MERRA-2), and BC (TAP) mislead the interpretation of the plots, since these variables are not directly interchangeable? Please discuss the limits of the comparability.
- I see the power of the FEA method to distinguish between meteorological and anthropogenic drivers, and I consider it a very well-conceived approach. However, I would restrict the i's to be quite close to the j's and k's. Training with 2013 and predicting 2022 and 2023 might be unrealistic, since the fixed emission hypothesis is less robust over such a span. This is especially concerning when training is performed with reconstructed data and there are no measurements to validate those years, as I mentioned two points above. I think being conservative here and acknowledging the limitations of your datasets would make your trend evaluations sturdier, especially since some readers might be rather ML-skeptic.
- Please state explicitly that CMET (i,i) is the self-prediction for year i based on training with year i. This can be understood from the text, but stating it would help readers for whom this nomenclature might be new.
- In the text (lines 222-223), you provide uncertainties for these Ys. Can you please explain how you obtained those uncertainties?
- About Figure 3, how do you explain that the uncertainties of your method are positive for almost all cells? If I understood properly, negative uncertainties should be as probable as positive ones, since |ΔANTi,j| ~ |ΔANTj,i|.
- About Figure 3, the lowest uncertainties are obtained for Xuzhou, which has lower measurement availability, whilst Suzhou, with higher coverage, has higher uncertainty. For me, this reinforces the idea that predicting without measurement anchors in the early period at Xuzhou and Zhenjiang can lead to an oversimplification of the BC concentrations, which might be convenient for the FEA method. I find it more natural that the model struggles for the measurement-based 2018-2019 baseline periods than that it does not struggle for the predicted 2013-2015.
- Please describe the trend estimator you use in the methods section (is it Sen's slope?) and provide a significance estimate for your results. Do you use "seasonal" Sen's slopes, so that seasonal effects are less taken into account?
- The discussion on why MERRA-2 is not properly capturing trends is very interesting (lines 244-251). Please, could you further detail which (meteorological/emissions) situations are better/worse captured by MERRA and TAP?
- Table S5. Why do you think the Zhenjiang city reconstruction is significantly worse than the others?
- Please, provide a short explanation on the meteorological normalization method by Grange et al., 2018.
- Figure S5e-h. It seems that around 2013 the emission diel cycles were rather flat, whilst they become more marked in the last years. This could be because: i. the meteorological influence was underestimated for those periods; ii. emission patterns or sources changed. Please discuss this variation.
- In the last paragraph of your results section you explain the reductions of anthropogenic emissions over the study period, which is the objective of the paper. Could you also give some insight into the trend of meteorological impacts on concentrations? Do you consider that the atmospheric influence should be static over the trend, or do you expect steady changes?
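To make the trend-estimator question above concrete, the sketch below shows a Sen's slope (median of pairwise slopes) together with a Mann-Kendall significance test. It is only an illustration of what such an estimator and significance measure could look like, not the method used in the manuscript. The function names and the annual values are hypothetical, and the p-value uses the normal approximation without a tie correction.

```python
# Minimal sketch of Sen's slope with a Mann-Kendall test (assumes no tied values).
# Function names and the example series are hypothetical, not from the manuscript.
from math import erfc, sqrt
from statistics import median

def sen_slope(times, values):
    """Median of the slopes between all pairs of points with distinct times."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct times")
    return median(slopes)

def mann_kendall_s(values):
    """Mann-Kendall S: number of increasing pairs minus decreasing pairs."""
    s = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    return s

def mann_kendall_p(values):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n = len(values)
    s = mann_kendall_s(values)
    if s == 0:
        return 1.0
    var_s = n * (n - 1) * (2 * n + 5) / 18
    z = (s - 1) / sqrt(var_s) if s > 0 else (s + 1) / sqrt(var_s)
    return erfc(abs(z) / sqrt(2))

# Made-up annual mean concentrations (ug m-3), for illustration only.
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
annual_ec = [3.2, 3.0, 2.7, 2.6, 2.2, 2.0, 1.9, 1.6]

slope = sen_slope(years, annual_ec)   # trend in ug m-3 per year
p_value = mann_kendall_p(annual_ec)   # significance of the monotonic trend
```

A seasonal variant would compute the pairwise slopes only between values from the same season (for example the same calendar month in different years) before taking the median, which reduces the influence of the seasonal cycle on the estimated trend.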
Technical changes
- Figure 3 is a bit difficult to read. I understand that the FEA uncertainty shown here is the Yi of equations 10 and 11; please indicate this instead of "FEA uncertainty (%)" in the colourbar.
- Figure 4: please adjust the transparency or the drawing order of the a-d time series so that we can see at a glance when the observations actually occur.
References
Jeong, C. H., Hopke, P. K., Kim, E., & Lee, D. W. (2004). The comparison between thermal-optical transmittance elemental carbon and Aethalometer black carbon measured at multiple monitoring sites. Atmospheric Environment, 38(31), 5193-5204.
Rigler, M., Drinovec, L., Lavrič, G., Vlachou, A., Prévôt, A. S., Jaffrezo, J. L., ... & Močnik, G. (2020). The new instrument using a TC–BC (total carbon–black carbon) method for the online measurement of carbonaceous aerosols. Atmospheric Measurement Techniques, 13(8), 4333-4351.
Citation: https://doi.org/10.5194/egusphere-2024-2776-RC3