the Creative Commons Attribution 4.0 License.
Diagnosing drivers of PM2.5 simulation biases from meteorology, chemical composition, and emission sources using an efficient machine learning method
Shuai Wang
Mengyuan Zhang
Yueqi Gao
Peng Wang
Qingyan Fu
Abstract. Chemical transport models (CTMs) are widely used for air pollution modeling but suffer from significant biases due to uncertainties in simplified parameterizations, meteorological fields, and emission inventories. Accurate diagnosis of simulation biases is critical for model improvement, interpretation of results, and efficient air quality management, especially for simulations of fine particulate matter (PM2.5). In this study, an efficient method based on machine learning (ML) was designed to diagnose the drivers of Community Multiscale Air Quality (CMAQ) model biases in simulating PM2.5 concentrations from three perspectives: meteorology, chemical composition, and emission sources. A source-oriented CMAQ was used to diagnose the influences of different emission sources on PM2.5 biases. The ML models showed good fitting ability, with a small performance gap between training and validation. The CMAQ model underestimated PM2.5 by -19.25 to -2.66 μg/m3 in 2019, especially in winter and spring and during high-PM2.5 events. Among the components, secondary organic components showed the largest contribution to the PM2.5 simulation bias across regions and seasons (13.8–22.6 %). Relative humidity, cloud cover, and soil surface moisture were the main meteorological factors contributing to PM2.5 bias in the North China Plain, the Pearl River Delta, and northwestern China, respectively. Both primary and secondary inorganic components from residential sources showed the largest contributions (12.05 % and 12.78 %), implying large uncertainties in this sector. The ML-based method provides a valuable complement to traditional mechanism-based methods for model improvement, with high efficiency and low reliance on prior information.
Status: open (until 05 Oct 2023)
CEC1: 'Comment on egusphere-2023-1531', Juan Antonio Añel, 05 Sep 2023
Dear authors,
I would like to point out an issue regarding the code availability in your manuscript and CMAQ. Currently, for CMAQ, you mention that it is available in a GitHub repository. GitHub repositories are not acceptable for scientific publication or long-term code archival; GitHub itself says so on its web page, and we mention it in our Code and Data policy. Fortunately, for CMAQ, Zenodo repositories are also available, such as https://zenodo.org/record/5213949. Therefore, please look for the Zenodo repository corresponding to the CMAQ version that you use, post it in reply to this comment, and cite it in your manuscript instead of the GitHub repository. If there is no Zenodo repository for the version that you have used, you can upload the code yourself and create a new one.
Thanks,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2023-1531-CEC1
RC1: 'Comment on egusphere-2023-1531', Anonymous Referee #1, 06 Sep 2023
This paper presents an efficient machine learning method for diagnosing the drivers of Community Multiscale Air Quality (CMAQ) model biases in simulating PM2.5 concentrations from three perspectives: meteorology, chemical composition, and emission sources. The authors used a source-oriented CMAQ to diagnose the influences of different emission sources on PM2.5 biases. While the approach presented in the manuscript, particularly the emphasis on identifying biases in the CMAQ simulation of PM2.5 concentrations, is innovative and distinct from conventional predictive models, a few issues should be addressed.
Major comments:
1. Line 102: "Five-fold cross-validation method was used to evaluate the model fit and prediction ability (Browne, 2000)," and Line 143: "The ML models were trained separately for different regions and seasons, and a 5-fold cross-validation was used to measure the model performance (Figure 4)." Cross-validation is employed primarily for model selection and hyperparameter tuning rather than evaluating the ML performance. To accurately evaluate the model's performance and generalization capabilities, it is essential to test it on a dataset it has never seen during training or validation. In addition, for clarification and completeness, the authors should provide a detailed explanation of why they chose 5-fold cross-validation and how they implemented this methodology in their study.
2. Line 221: "In addition, the main objective of this study was diagnosing the contributors to CMAQ simulation biases using machine learning, therefore we did not pursue a very good model performance."
If the model's efficacy is insufficient, the interpretations and conclusions that can be drawn from it may be weakened. It is crucial to ensure that the model's predictions or interpretations are at least reasonably accurate. In addition, when using a specific model such as LightGBM, it would be beneficial to provide justification or evidence for why it outperforms other models such as Random Forest (Line 219) in this study. Such a justification can lend more credibility to the findings and insights derived from the model.
Minor Comments:
1. The study area for this manuscript is China; therefore, "in China" should be added to the title.
2. Line 56: Chemical components constitute PM2.5, so they would be considered as labels. Why are they used as features? Additionally, wouldn't the linear summation of all these chemical components essentially represent PM2.5? If I have misunderstood any aspect, it would be helpful if the author could explain it.
3. Line 62: How are "problematic data points" defined?
4. Line 78: Table S1 is "Summary of the WRF model variables used in this study." The list of PM2.5 components simulated by CMAQ is not available in Table S1.
5. Line 96: Could you please clarify what is meant by "three combinations of input variables"? Does this refer to pairwise combinations of the categories (e.g., "meteorological factors" + "chemical components") or something else?
6. Line 107: It should be "Observed PM2.5 concentrations".
7. Figure 3: The left axis features a stacked bar plot for sectoral contribution (with a maximal y-value of 100%), whereas the right axis represents PM2.5 concentration using a scatter plot. However, the areas in which the scatter points overlap with the bars do not provide clear information, making the use of dual axes potentially misleading.
8. Figure 4: Some models have a training R2 lower than 0.6. This suggests that these models might be underfitting (please see my "major comments").
Citation: https://doi.org/10.5194/egusphere-2023-1531-RC1
RC2: 'Comment on egusphere-2023-1531', Anonymous Referee #2, 15 Sep 2023
General comments
################
The article describes an interesting method to determine the origin of the bias of a CTM using an ML algorithm. It applies this method to an interesting case and makes it possible to determine which sectors the biases come from. This method has the potential to be applied in similar studies. An extensive bibliography is provided, enabling the reader to find more details when necessary.
It would be nice to have a comparison of the results of this method, in the case studied, with those of other methods. A short reminder of the basics of the algorithms used (even if the references provided cover that in detail) would have been welcome.
Specific comments
#################
L17: "model biases": bias with respect to observations? This could be made more precise.
L20: even if this is the abstract, the expression "fitting ability" should be clarified.
L29: could you remind the reader of the definitions of the acronyms PPM and SIA/SOA?
L42: which ML methods?
L41-52: Could you quantify the gain in computing resources you achieve using your ML method relative to methods using Monte Carlo or Latin hypercube sampling? Have you tried these other methods and compared their results with the method described in this article?
L63: could you specify what you mean by "same simulation grid"? Could this grid be defined?
L80: would it be possible to give more detail on the source apportionment method?
L90-102: a brief presentation of the algorithms would be interesting (see general comments).
L90: the only citation for this method in the bibliography is from a conference. Why not mention "Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., … Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3146–3154."?
L102: even if it is defined in the reference, it would be nice to remind the reader (briefly) what the X-fold cross-validation method consists of.
L135-136: why analyze the source sectors of only SIA and not SOA?
L156: What is imperfect: the pathways or their current knowledge ?
L196: could you be more specific about the subsurface conditions?
L206: I don't see the values mentioned for R2 and RMSE (0.53 and 20.18) in Figure 6. What do they relate to?
Technical corrections
#####################
L25: "contribution" => "contribution to this bias".
L64: I would suggest changing the formulation: "Analysis focused on nationwide as well as several interested regions" => "Analysis has been carried out on several regions of interest and on all of China."
L65: could you justify the choice of the regions ?
L79: "conducted over" => "carried out"
L107 "higher" => "highest", "lower" => "lowest"
L173: "the stationary" => "a stationary"
L175: "the uncertain" => "the uncertainties"
L177: "Earth's radiation receipts" : I would prefer "radiation received by the Earth"
L181: "shown the dominant" => "showed the dominant".
L183: "the missing" => "the lack"
L185: "can associate" => "can be associated"
L186: "attributed" => "be attributed"
L196: "influenced" => "is influenced"
L199 " the underestimates" => "an underestimation"
Figure S4: difficult to distinguish between the different shades of red/pink. The use of another color scale (with different colors) would be clearer.
Figure 4:
Bottom of figure 4: RSME => RMSE
Figure 5: difficult to distinguish between the different shades of red/pink. The use of another color scale (with different colors) would be clearer.
Problem in numbering: we have the same notation for figures (S1, ...) and tables (S1, ...). It would be better to have different notations for the tables and for the figures.
Citation: https://doi.org/10.5194/egusphere-2023-1531-RC2
Model code and software
Machine learning code and training datasets Shuai Wang https://zenodo.org/record/7907626