the Creative Commons Attribution 4.0 License.
Explainable ensemble machine learning revealing enhanced anthropogenic emissions of particulate nitro-aromatic compounds in eastern China
Abstract. Nitro-aromatic compounds (NACs) are important atmospheric pollutants that impact air quality, atmospheric chemistry, and human health. Understanding the relationship between NACs formation and key environmental driving factors is crucial for mitigating their environmental and health impacts. In this work, we combined an ensemble machine learning (EML) model with the SHapley Additive exPlanation (SHAP) and positive matrix factorization (PMF) model to identify the key driving factors for ambient particulate NACs covering primary emissions, secondary formation, and meteorological conditions based on field observations at urban, rural, and mountain sites in eastern China. The EML model effectively reproduced ambient NACs and recognized that anthropogenic emissions (i.e., coal combustion, traffic emission, and biomass burning) were the most important driving factors, with a total contribution of 49.3 %, while significant influences from meteorology (27.4 %) and secondary formation (23.3 %) were also confirmed. Seasonal variation analysis showed that direct emissions presented positive responses to NACs concentrations in spring, summer, and autumn, while temperature had the largest impact in winter. By evaluating NACs formation and loss at various locations in winter, we found that anthropogenic sources played a dominant role in increasing NACs levels in urban and rural sites, while reduced ambient temperature along with secondary formation from gas-phase oxidation was the main reason for relatively high particulate NACs levels at the mountain site. This work provides a reliable modelling method for understanding the dominant sources and influencing factors for atmospheric NACs and highlights the necessity of strengthening emission source controls to mitigate organic aerosol pollution.
Status: open (until 05 Apr 2025)
RC1: 'Comment on egusphere-2025-165', Anonymous Referee #1, 13 Mar 2025
Overall evaluation:
Li et al. investigate the influencing factors of particulate nitro-aromatic compounds (NACs) in eastern China, including meteorological factors and primary and secondary sources of NACs. The combination of machine learning with the PMF model is a key feature. Also, machine learning has not been applied to study NACs before. So, this paper studies an important component of atmospheric aerosols (i.e., NACs) with an innovative approach. Results are presented in a logical and organized manner with thorough discussion. Conclusions are clear and reasonable. Nevertheless, there are still some places where clarifications are needed, none of which are major or critical issues. Also, the language could be further improved. Overall, I would recommend a minor revision before this paper could be accepted.
Minor comments:
Line 24-25: The authors state that “temperature had the largest impact in winter”. It is still not clear whether higher temperatures impose a positive or negative impact on NAC abundances in winter. Please briefly elaborate here.
Line 48-49: (1) The term “in-situ” seems unnecessary here. (2) Specifically, only aromatic VOCs could be oxidized to produce NACs. (3) More recent references should also be cited here. A recommended reference is shown as follows.
Men Xia et al., 2023, JGR:A, Observations and Modeling of Gaseous Nitrated Phenols in Urban Beijing: Insights From Seasonal Comparison and Budget Analysis.
Line 55: How could solar radiation inhibit NAC photolysis? In my understanding, solar radiation should enhance NAC photolysis. Also, “NACs photolysis production and loss” lacks clarity. Please double check the expressions here.
Line 56: What does “their” refer to, the abundance of NACs or the influencing factors of NACs? Please clarify. Also, please check the potential abuse or overuse of “it” and “they” in other places.
Line 67: So far, it is inappropriate to judge that machine learning is a more advanced method than PMF or PCA analysis. As an emerging method that is only recently applied in atmospheric chemistry, some scholars also hold a conservative attitude toward the usage of machine learning.
Line 72-73: It is not clear whether Qin et al. and Peng et al. investigated NACs or other compounds.
Line 79: The authors mention “source apportionment”. Does that mean the authors also use methods like PMF or PCA? Please do clarify this key point.
Line 84-85: The combination of PMF and machine learning is a highlight in this paper, which should be emphasized more clearly and thoroughly here, and maybe emphasized again in other places, e.g., the last paragraph in the conclusion part.
Line 97-98: The authors honestly acknowledge that some data has been reported in previous studies, which is of course good manners. Nevertheless, it is more important to emphasize what data is newly reported here, if any.
Line 112: Check for typo of “filed campaigns”.
Line 115: The authors mention SO2, NO2, and O3. Was NO measured? Usually, NO and NO2 are measured together by gas analyzers.
Line 135: Was 2-nitrophenol detected? Why or why not?
Line 142-146: Please elaborate what is the overall/total uncertainty of measuring NACs?
Line 159-161: Although more details could be found in SI, it is still necessary to state other data or parameters input into the PMF. Also, the key message of PMF methods stated in SI should also be briefly summarized and mentioned in the main text.
Line 164: The expression “were considered firstly in this study” sounds misleading. The mentioned machine learning algorithms have already been applied in previous studies.
Line 182: Check for typo of “leaners”.
Line 195: Check the grammar for “for quantify”. Please also carefully check the grammar issues in other places.
About section 2.5 Aerosol surface area density (Sa) prediction. This section needs to be moved to SI. The prediction of Sa by machine learning is not a major scientific goal of this study.
In Table 1, at least the total NACs concentrations, which is key to this study, should be mentioned. Since the season has been mentioned, the detailed sampling period is less interesting and may be recorded in SI.
In Figure 2, it is not clear how to understand these box plots. Please clearly state what the boxes and data dots mean in this figure. For example, in the box plot, which marks represent the mean and median values, and which marks show the interquartile range?
Line 329: Check typo for “expect winter”. Check for grammar for “which with a little high contribution”.
Line 345: What do PE and SF mean? To help readers understand the figure clearly, please elaborate here even if they are defined elsewhere.
Citation: https://doi.org/10.5194/egusphere-2025-165-RC1
RC2: 'Comment on egusphere-2025-165', Anonymous Referee #2, 25 Mar 2025
The manuscript investigated the sources and drivers of particulate nitro-aromatic compounds (NACs) in eastern China using a combination of machine learning and receptor modelling. The study’s main focus is how primary emissions, secondary formation, and meteorological factors contribute to ambient NAC levels across different locations and seasons. The authors proposed an ensemble machine learning (EML) model coupled with SHAP (SHapley Additive exPlanation) values and a PMF (Positive Matrix Factorization) source apportionment to interpret NAC variations. Eleven sampling sites (urban, rural, mountain) over multiple seasons provide a robust dataset of NAC concentrations and related variables. The EML model achieves high predictive performance (as can be expected from statistical modelling). The authors conclude that strengthened control of combustion emissions is necessary to mitigate particulate NAC pollution, as their modelling highlights the outsized role of human sources even in a region with complex meteorological and secondary processes.
Overall, this work is important. It extends existing literature on NAC sources (which previously relied on linear models or standalone PMF) by providing a more interpretable quantification of each factor’s contribution. The study is well grounded in current literature and clearly exhibits its novelty by bridging source apportionment with explainable AI. A few methodological clarifications and edits (detailed below) could further strengthen the work before this paper could be submitted.
Major issues:
1. I agree that the ensemble machine learning approach is appropriate for capturing complex nonlinear relationships, but some details would benefit from clarification to enhance confidence in the results. The authors note an 80/20 random split with cross-validation, but given data from multiple sites and seasons, it would be helpful to discuss whether any site-specific bias could affect the model. If, for instance, all data from a particular location or season mostly fell into the training set, the reported high R2 might not fully reflect generalizable performance. An ideal approach (if data allow) would be to test the model’s predictive skill in a leave-one-site-out or leave-one-season-out manner to ensure it generalizes across different scenarios.
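The leave-one-site-out check suggested above can be sketched as follows. This is a minimal illustration, not the authors' EML pipeline: the helper names (`leave_one_group_out`, `fit_line`, `r2`), the single predictor, the straight-line stand-in model, and all numbers are assumptions made up for the example.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one group per fold."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; a stand-in for the real model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r2(y_true, y_pred):
    """Coefficient of determination on a held-out fold."""
    my = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic, illustrative values only (not data from the manuscript).
sites = ["urban"] * 4 + ["rural"] * 4 + ["mountain"] * 4
emission = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5, 0.5, 1.0, 1.5, 2.0]
nac = [3.1, 4.9, 7.2, 8.8, 3.6, 5.7, 7.4, 9.9, 3.4, 3.5, 3.9, 4.0]

scores = {}
for held_out, train_idx, test_idx in leave_one_group_out(sites):
    # Fit only on the other sites, then score on the site never seen in training.
    a, b = fit_line([emission[i] for i in train_idx], [nac[i] for i in train_idx])
    pred = [a + b * emission[i] for i in test_idx]
    scores[held_out] = r2([nac[i] for i in test_idx], pred)

for site, score in scores.items():
    print(f"held-out site: {site:8s}  R2 = {score:.2f}")
```

A per-site score that is much lower than the pooled random-split score would indicate that the model leans on site-specific patterns. The same splitter works for a leave-one-season-out test by passing season labels as `groups`.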
While the integration of PMF source contributions as input features is innovative, this could introduce circular reasoning if not carefully handled – since NAC concentrations themselves (via their speciation) inform the PMF factors. The authors should reassure that using PMF outputs (four source factor contributions) as predictors does not inadvertently “double count” NAC information. One way to address this would be to emphasize that the ML model’s target was the total NAC (or NAC subgroups) concentration and that PMF factors, being based on species patterns, serve as independent explanatory variables capturing source-type influences. Clarifying these points will help readers understand the modelling strategy and trust that the conclusions (e.g., anthropogenic share of ~49%) are data-driven and not an artifact of the model design.
2. The claim of “enhanced anthropogenic emissions” driving NAC pollution needs to be positioned against existing studies to ensure the manuscript’s novelty is clear. Prior works have already pointed to combustion sources (coal, biomass burning, vehicle emissions) as major NAC contributors. My understanding is that the manuscript’s novelty is primarily methodological, and that this study’s added value lies in quantifying the contributions with a new method and revealing nuanced patterns (like seasonal driver shifts and differences between urban/rural/mountain sites). The authors should ensure readers recognize that the significance lies in using an explainable ML approach to confirm and detail known drivers, rather than in discovering an entirely new source of NACs. This framing will prevent any impression that the study is merely repeating known information, and will instead show that it provides new insights into the magnitude and context of anthropogenic influence.
3. The use of SHAP values is a strong point of the study, but some aspects of the SHAP-based findings could be explained more clearly to avoid confusion. One issue is the meaning of negative SHAP contributions for certain factors. For example, the authors mention that at the mountain site, primary emissions had a mean SHAP contribution of –5.7 ng m-3, which initially sounds like primary sources were somehow reducing NAC levels. The intended meaning is presumably that local primary emissions are minimal at the mountain (so their absence corresponds to lower baseline NAC, hence a negative SHAP relative to other sites). Also for discussions regarding “temperature” and “BLH”, providing one or two sentences of intuition (e.g., “a negative SHAP value for a factor means that higher values of that factor are associated with lower NAC concentrations”) when introducing the SHAP results would make the explanation more accessible, especially for readers new to SHAP analysis.
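The "relative to baseline" reading of a negative SHAP value can be illustrated with a toy calculation. The sketch below computes exact Shapley values by enumerating feature coalitions for a hypothetical two-feature linear model. It does not use the `shap` library or the authors' model, and the function names, weights, and background values are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values for sample x; returns (base_value, [phi_i])."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take the sample's values
        # and the remaining features are drawn from the background rows.
        total = 0.0
        for row in background:
            z = [x[j] if j in subset else row[j] for j in range(n)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                contrib += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(contrib)
    return value(set()), phi


def toy_model(z):
    """Hypothetical model: emissions raise NACs, temperature lowers them."""
    emission, temperature = z
    return 2.0 * emission - 1.0 * temperature


# Synthetic background (all sites) and one low-emission sample.
background = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
sample = [1.0, 2.0]  # emission below the background mean, average temperature

base, phi = shapley_values(toy_model, sample, background)
print("base value:", base)
print("SHAP (emission, temperature):", phi)
print("base + sum(SHAP):", base + sum(phi), "model output:", toy_model(sample))
```

In this toy case the emission feature receives a SHAP value of -4 even though its model weight is positive. The negative sign reflects that the sample's emissions are below the dataset average, not that emissions remove NACs. The base value plus the SHAP values reproduces the model output for the sample.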
SHAP can also capture pairwise interactions, so the authors could discuss any interactions or co-variability among factors that were observed. For example, did the authors notice if certain meteorological conditions amplify the effect of emissions (high humidity aiding secondary formation of NACs, etc.)? Ensuring the SHAP results are clearly linked back to physical processes (mixing, photochemistry, emissions timing) will make the conclusions more convincing and useful for policy implications.
4. Line 186, the multi-target modelling approach, where NPs, NCs, and NSAs were predicted simultaneously (mentioned in the Methods), is an interesting aspect but is not very prominently discussed in the results. The conclusion hints that different functional groups had different key drivers (e.g., gas-phase oxidation dominating NSAs). It would strengthen the paper to emphasize these findings a bit more in the Results section 3.3 or 3.4 – for instance, explicitly stating which sources were most important for each NAC subclass. This adds depth to the analysis (showing the model’s strength in capturing subtle differences).
Also, given that the data span 2014–2021, one could ask whether trends over that period were considered; for example, have emission controls in China over the years impacted NAC levels? This may be outside the scope of the current paper’s focus on spatial drivers, but a short note in the discussion could acknowledge that temporal trends were not the focus here (assuming no strong trend was observed after accounting for other factors).
Minor issues:
The paper is generally well-written, but a few sentences should be edited for clarity or correctness. Examples are given below, but the authors should read through the manuscript for similar minor language issues:
- Line 73, “Given the complex nonlinear links… it is necessary to establish an effective and reliable evaluation method to comprehensively understand and assess the importance and contribution of each factor…”. This could be broken into two sentences to avoid confusion.
- Line 258, the phrasing “is in coincided with” is grammatically incorrect.
- Line 182, a typo “leaner” should be “learner”
- Line 389, “Jinan ang Beijng” should be “and Beijing”
- Line 363, “confirmed by a previous observational study”
1. As noted, the manuscript uses many abbreviations (NACs, PMF, EML, SHAP, NP, NC, NSA, BLH, SSR, WS_V, WS_H, etc.). It would be very helpful to provide a list of abbreviations early on to improve readability.
2. Line 95, there is a minor point about terminology. Calling Mount Lao a “mountain” site when it’s only 166 m altitude is a bit confusing as Mount Tai is at 1534 m a.s.l. It might be worth clarifying that Mount Lao site is at a lower elevation (perhaps a foothill or a coastal mountain location) to avoid readers questioning if it truly represents a clean mountain background.
3. Line 184, when talking about the performance of a model, it cannot be validated or verified as natural systems are never closed, it can only be evaluated.
4. Linked to point 2, please ensure that figure legends and captions (e.g., Figures 4-7) fully describe what the plots represent. The caption may list the variables by name (or refer to a legend) so readers don’t have to infer abbreviations (e.g., PE, SF, etc).
Citation: https://doi.org/10.5194/egusphere-2025-165-RC2
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
126 | 28 | 5 | 159 | 20 | 5 | 10 |
Viewed (geographical distribution)
Country | # | Views | % |
---|---|---|---|
China | 1 | 67 | 35 |
United States of America | 2 | 66 | 35 |
United Kingdom | 3 | 12 | 6 |
France | 4 | 5 | 2 |
undefined | 5 | 5 | 2 |