the Creative Commons Attribution 4.0 License.
Spatio-temporal modeling of air pollutant concentrations in Germany using machine learning
Abstract. Machine learning (ML) models are becoming a meaningful tool for modeling air pollutant concentrations. ML models are capable of learning and modeling complex non-linear interactions between variables, and they require less computational effort than chemical transport models (CTMs). In this study, we used gradient boosted tree (GBT) and multi-layer perceptron (MLP; neural network) algorithms to model near-surface nitrogen dioxide (NO2) and ozone (O3) concentrations over Germany at 0.1 degree spatial resolution and daily intervals.
We trained the ML models using TROPOMI satellite column measurements combined with information on emission sources, air pollutant precursors and meteorology as feature variables. We found that the trained GBT models for NO2 and O3 explained a major portion of the variability in the observed concentrations (R2 = 0.68–0.88, RMSE = 4.77–8.67 μg m-3 and R2 = 0.74–0.92, RMSE = 8.53–13.2 μg m-3, respectively). The trained MLP models performed worse than the trained GBT models for both NO2 and O3 (R2 = 0.46–0.82 and R2 = 0.42–0.9, respectively).
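The two skill scores quoted above can be sketched as follows. This is a minimal illustration, assuming R2 is the coefficient of determination (the preprint may instead use a squared correlation) and using made-up station values; the function names `rmse` and `r_squared` are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (an assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up concentrations in ug m-3, for illustration only.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(obs, pred):.2f}, R2 = {r_squared(obs, pred):.3f}")
```

Reporting both scores is useful because RMSE keeps the physical units of the concentrations, whereas R2 is dimensionless and can be compared between pollutants.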
Our NO2 GBT model outperforms the CAMS model, a data-assimilated CTM, whereas our O3 GBT model slightly underperforms it. However, our NO2 and O3 ML models require less computational effort than CTMs. Therefore, we can analyze people’s exposure to near-surface NO2 and O3 with significantly less effort. During the study period (2018-04-30 to 2021-07-01), we found that around 36 % of people lived in locations where the WHO NO2 limit was exceeded on more than 25 % of days, while 90 % of the population resided in areas where the WHO O3 limit was exceeded on more than 25 % of days. While metropolitan areas had high NO2 concentrations, rural areas, particularly in southern Germany, had high O3 concentrations.
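The exposure statistic above (share of the population living where a limit is exceeded on more than 25 % of days) can be sketched as below. This is an illustration only: the grid cells, populations, and the limit value of 25.0 are placeholders, and the function names are hypothetical rather than taken from the preprint.

```python
def exceedance_fraction(daily_values, limit):
    """Fraction of days on which the concentration is above the limit."""
    return sum(1 for v in daily_values if v > limit) / len(daily_values)


def exposed_population_share(cells, limit, day_fraction=0.25):
    """Share of the total population living in cells where `limit` is
    exceeded on more than `day_fraction` of days.

    `cells` is a list of (population, daily_values) pairs.
    """
    total = sum(pop for pop, _ in cells)
    exposed = sum(
        pop
        for pop, values in cells
        if exceedance_fraction(values, limit) > day_fraction
    )
    return exposed / total


# Placeholder grid cells: (population, daily concentrations in ug m-3).
cells = [
    (1000, [30.0, 28.0, 10.0, 12.0]),  # 2 of 4 days above the limit: counted
    (3000, [26.0, 10.0, 11.0, 12.0]),  # exactly 25 % of days: not "more than"
    (1000, [5.0, 6.0, 7.0, 8.0]),      # never above the limit
]
share = exposed_population_share(cells, limit=25.0)  # limit is a placeholder
print(f"{100 * share:.0f} % of the population exposed")
```

Weighting by population rather than by area matters here, because densely populated cells tend to coincide with high NO2 concentrations.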
Furthermore, our ML models can be used to evaluate the effectiveness of mitigation policies. The GBT model reproduced the changes in near-surface NO2 and O3 concentrations over Germany during the 2020 COVID-19 lockdown period: after accounting for meteorology, near-surface NO2 decreased significantly (by 23±5.3 %) and near-surface O3 increased slightly (by 1±4.6 %) over ten major German metropolitan areas, compared to 2019. Finally, our O3 GBT model is highly transferable to other countries, at least to neighboring countries and locations where no measurements are available (R2 = 0.87–0.94), whereas our NO2 GBT model is only moderately transferable (R2 = 0.32–0.64).
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-463', Anonymous Referee #1, 01 Jun 2023
The authors explored the gradient boosted tree approach for spatio-temporal modelling of NO2 and O3 and applied it to Germany. There are some issues to address in the revised version:
- Validations:
- Table 1 lists the types of datasets used in this study. Could you clarify which dataset was used as the ground-truth data?
- Figures 5-6 show the spatial distribution of the averaged NO2 and O3 during the study period. Is the study period between 2019-07-17 and 2020-01-31? Could you specify which months were used for Summer, Spring, Autumn, and Winter? The data sets were pre-processed at a daily scale. Could you please generate spatial maps illustrating the average daily concentrations of NO2 and O3 during Summer and Winter, instead of considering the seasonal averages? Furthermore, could you compare these results with the CAMS reanalysis?
- Line 131: “we also included “Near-surface NO2” modeled from NO2 ML model as a feature variable in the O3 ML model.” However, the “Near-surface NO2” modeled from the NO2 ML model is not listed in Figure 3 (d). I would expect it to be the top feature affecting the O3 prediction results. Is this the case? Perhaps you could use the ML model to derive the direct relationship between O3 and the modeled “Near-surface NO2”.
- Line 243: “After the discussed model evaluation, we trained the GBT model using 100% of the data and modeled the near-surface NO2 and O3 concentrations over the study domain at 0.1 degree resolution and daily”. This is not clear. Did you re-train the model? How do you validate your model?
Citation: https://doi.org/10.5194/egusphere-2023-463-RC1
RC2: 'Comment on egusphere-2023-463', Anonymous Referee #2, 23 Jun 2023
General comments
The authors develop a machine learning framework for modeling NO2 and O3 concentrations in Germany, and based on that, they analyze human exposure to the two air pollutants and the effects of COVID quarantine. The authors also discuss the transferability of their model.
The manuscript is well organized and in particular the methodology is thoroughly described. However, before it can be published, I believe the authors should address the comments below.
Specific comments
Line 129: Does the “season” (season of the year) feature in the ML model take only 4 values? In my opinion, “day of the year” would be a better feature to help the model learn the daily variability of air pollutants. The authors should try this or clarify their choice.
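One common way to supply a “day of the year” feature to an ML model is a cyclic sine/cosine encoding, so that 31 December and 1 January end up close together. The sketch below is an illustration of that idea only; it is not taken from the preprint or the referee comment, and the function name is hypothetical.

```python
import calendar
import math
from datetime import date


def day_of_year_features(d):
    """Map a date to (sin, cos) of its position within the year."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_index = d.timetuple().tm_yday - 1  # 0 for 1 January
    angle = 2.0 * math.pi * day_index / days_in_year
    return math.sin(angle), math.cos(angle)


# Consecutive days across the year boundary map to neighbouring points.
print(day_of_year_features(date(2019, 12, 31)))
print(day_of_year_features(date(2020, 1, 1)))
```

A plain integer day number would place these two dates 364 units apart, which matters more for distance-based models and neural networks than for tree-based models.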
Line 131: Given the coupled nature of NO2 and ozone, I would suggest the authors try to include O3 as a feature in the NO2 ML model, as they did with NO2 for the O3 model, or clarify why they did not do so.
Line 148: The 24h-mean of ERA-5 data makes sense for the NO2 model, but I would suggest the authors test the daytime-mean or daily-max for the O3 model, as ozone is calculated as MDA8. This is especially the case for daily-max 2m temperature, which has been shown to be well correlated with MDA8 ozone.
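For reference, MDA8 (the maximum daily 8-hour average) can differ considerably from a 24h-mean. The sketch below uses a simplified convention in which only 8-hour windows lying entirely within one day are considered; operational definitions differ in how windows crossing midnight are handled, and the hourly values are made up.

```python
def mda8(hourly):
    """Maximum of the 8-hour running means over 24 hourly values.

    Simplification (an assumption): only windows fully inside the day
    are used, which gives 17 candidate windows.
    """
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    return max(sum(hourly[i:i + 8]) / 8.0 for i in range(len(hourly) - 7))


# Made-up ozone series in ug m-3 with an afternoon plateau.
hourly_o3 = [40.0] * 10 + [100.0] * 8 + [40.0] * 6

daily_mean = sum(hourly_o3) / len(hourly_o3)
print(f"24h-mean = {daily_mean:.1f}, MDA8 = {mda8(hourly_o3):.1f}")
```

In this example the 24h-mean is well below the MDA8, which illustrates why the aggregation chosen for meteorological features may need to match the aggregation of the target.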
Line 160: Authors should give the exact size of data samples (both training and testing set), as text or labelled on the figure.
Line 205: It is interesting to see that road density is the most important feature, given that it has constant values which don’t show temporal variations. Can the authors explain this further?
Line 229 (and also line 153): The fact that MLP is worse than GBT can be interesting or maybe controversial here, as people now tend to believe that deep learning techniques should outperform light-weight algorithms such as GBT. The authors should explain more about this, as it is an important and perhaps new finding. Personally, I can think of a few questions below that might help clarify this.
- What is tabular/structured data and what is non-tabular/unstructured data? Is the data we use for air pollutant prediction usually of the former type?
- Is the use of tabular/structured data the only reason why GBT outperforms MLP in this study? Is it possible that the size of the data samples limits the capability of MLP, given that it is a deep learning technique after all?
- In addition to the work of Heaton and Lundberg et al., can the authors find any other studies focused on the prediction of air pollutants that support the results of this study?
- What about other neural network techniques? The authors may not need to try them, but should at least give a brief discussion, as MLP is one of the simplest deep learning algorithms.
Section 3.3: In this section, the authors discuss the exceedances of NO2 and O3 using data produced by the GBT model, but the model’s ability to capture extreme pollution is hardly evaluated in the validation section above. In fact, the scatter plot in Figure 4 indicates that the model does have a weakness in reproducing large NO2/O3 values. Therefore, I would suggest that the authors add a discussion of this uncertainty when analyzing the population living in areas above the WHO limit.
In addition, a temporal evaluation of the daily time-series (CAMS/GBT versus ground-truth O3) may be meaningful, such as using the temporal correlation coefficient.
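A temporal correlation coefficient of this kind could, for example, be computed per station as the Pearson correlation between the observed and modeled daily series. The sketch below assumes that definition; the station identifiers and values are placeholders.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Placeholder daily O3 series per station: (observed, modeled).
stations = {
    "station_a": ([60.0, 80.0, 100.0, 70.0], [62.0, 78.0, 95.0, 73.0]),
    "station_b": ([40.0, 50.0, 45.0, 55.0], [55.0, 45.0, 50.0, 40.0]),
}
temporal_r = {name: pearson_r(obs, mod) for name, (obs, mod) in stations.items()}
for name, r in temporal_r.items():
    print(f"{name}: r = {r:.2f}")
```

Computing the coefficient station by station separates how well the day-to-day variability is captured from how well the spatial pattern is captured.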
Citation: https://doi.org/10.5194/egusphere-2023-463-RC2
AC1: 'Comment on egusphere-2023-463', Vigneshkumar Balamurugan, 28 Jul 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-463/egusphere-2023-463-AC1-supplement.pdf
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
368 | 131 | 18 | 517 | 7 | 7 |
Vigneshkumar Balamurugan
Adrian Wenzel
Frank N. Keutsch