the Creative Commons Attribution 4.0 License.
Predicting peak daily maximum 8-hour ozone, and linkages to emissions and meteorology, in Southern California using machine learning methods
Abstract. The growing abundance of data is conducive to using numerical methods to relate air quality, meteorology, and emissions, and so to identify which factors affect pollutant concentrations. Often, it is the extreme values that are of interest for health and regulatory purposes (e.g., the National Ambient Air Quality Standard for ozone uses the annual 4th highest maximum daily 8-hour average (MDA8) ozone), though such values are the most challenging to predict using empirical models. We built four computational models, the Generalized Additive Model (GAM), the Multivariate Adaptive Regression Splines, the Random Forest, and the Support Vector Regression, to derive observation-based relationships between the 4th highest MDA8 ozone in the South Coast Air Basin and precursor emissions, meteorological factors, and large-scale climate patterns. All models had similar predictive performance, though the GAM showed a somewhat higher R² value (0.96) with a lower root mean square error and mean bias.
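The target quantity in the abstract, the annual 4th highest MDA8 ozone, can be illustrated with a short sketch. This is not the authors' code: the function names, the synthetic input, and the window convention (8-hour windows that start in each hour of the day and may run into the next day) are assumptions made for illustration, and regulatory data handling adds completeness rules that are omitted here.

```python
def synthetic_day(peak):
    """Hypothetical 24-hour ozone profile (ppbv): zero except for an
    8-hour afternoon plateau at `peak` (hours 10 to 17)."""
    return [float(peak) if 10 <= hour <= 17 else 0.0 for hour in range(24)]


def mda8_series(hourly, window=8):
    """Return one MDA8 value per day from a flat list of consecutive
    hourly values (24 values per day)."""
    n_days = len(hourly) // 24
    result = []
    for day in range(n_days):
        means = []
        for start in range(day * 24, day * 24 + 24):
            chunk = hourly[start:start + window]
            # Skip windows that run past the end of the record.
            if len(chunk) == window:
                means.append(sum(chunk) / window)
        result.append(max(means))
    return result


def fourth_highest(values):
    """Return the 4th highest value, the statistic the abstract refers to."""
    if len(values) < 4:
        raise ValueError("need at least four daily values")
    return sorted(values, reverse=True)[3]


# Five synthetic days with different afternoon peaks.
hourly = []
for peak in [10, 50, 30, 40, 20]:
    hourly.extend(synthetic_day(peak))

daily = mda8_series(hourly)
print(daily, fourth_highest(daily))
```

In practice the daily series would cover a full ozone season per monitoring site, and the 4th highest value would be taken per year.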
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2022-396', William Stockwell, 20 Sep 2022
General Comments
This paper presents four different empirical models for estimating peak daily maximum 8-hour (averaged) ozone from meteorological factors and the level of nitrogen oxide emissions (NOx = NO2 + NO). Four statistical models, the Generalized Additive Model (GAM), the Multivariate Adaptive Regression Splines, the Random Forest, and the Support Vector Regression, were developed and applied to estimate ozone concentrations in the South Coast Air Basin (SoCAB) of California, including Los Angeles and the surrounding region. The use of empirical models for estimating extreme ozone concentrations is particularly relevant because these results may be of interest for health and regulatory purposes. The models may be improved further as the available datasets become larger due to more observations being made over time. Empirical / statistical models are usually more accurate than first-principles numerical forecast models (as long as there are no large changes in the conditions used to derive the empirical models).
There is a long history of the development of empirical models that extends back several decades. However, there are new concepts in machine learning that the authors have used to inform their research. I commend the authors for their appropriate citation of the literature, but they might consider a paragraph to mention the long history of empirical / statistical models.
Specific Comments
The authors provide an excellent discussion of their four models: the Generalized Additive Model (GAM), the Multivariate Adaptive Regression Splines, the Random Forest, and the Support Vector Regression. This clearly written presentation is an outstanding introduction to modern empirical modeling. I can easily imagine using this paper in a graduate atmospheric science course. The correlations between the model predictions and observations are high and the biases are low. There are only small differences in performance, in terms of accuracy and required computational resources, between the four approaches for the dataset examined. It would be interesting to see a similar comparison for a much larger dataset in a future paper. Overall, I find this paper by Gao et al. to be a valuable contribution to the literature.
Technical Corrections
Please consider a paragraph to mention the long history of empirical / statistical models if space allows.
Citation: https://doi.org/10.5194/egusphere-2022-396-RC1
AC1: 'Reply on RC1', Ziqi Gao, 07 Nov 2022
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2022/egusphere-2022-396/egusphere-2022-396-AC1-supplement.pdf
-
RC2: 'Comment on egusphere-2022-396', Anonymous Referee #2, 01 Nov 2022
General Comments
In the manuscript, four observation-based machine learning models are developed to predict the top 30 and the 4th highest maximum daily 8-hour average (MDA8) ozone (O3) concentrations as a function of emissions, meteorological factors, and large-scale climate patterns in Southern California, USA. The top O3 concentrations, especially the extreme statistics of O3 concentration, are very difficult to predict accurately. The results show that these four models can explain most of the variation in the observed high O3 concentrations. The study examines the applicability of these models in the South Coast Air Basin (SoCAB) and provides alternative methods for predicting top O3 concentrations in other regions. I would recommend publication in Geoscientific Model Development after consideration of the following comments.
Specific comments
1. As shown in Figure 2, all four models tend to slightly overestimate the lower MDA8 O3 concentrations and to underpredict the higher ones relative to the observations. The four models have a very small mean bias (MB, around 1 ppbv) when predicting the top 30 MDA8 O3 concentrations (Table S3), but all have a larger MB for the 4th highest MDA8 O3, underestimating it by about 10 ppbv on average (Table 2). As shown in Figure 3, more than 90% of the predicted O3 concentrations are lower than the observations, which is consistent with the underestimation of the higher MDA8 O3 in Figure 2. This indicates that the relationships between model inputs and predicted ozone differ across ozone levels, even within the 30 highest MDA8 O3 concentrations. I wonder whether a lower MB and RMSE for the 4th highest MDA8 O3 would be expected from empirical models developed using only higher MDA8 O3 values (for example, data from the top 15 MDA8 O3 days).
2. As discussed in Section 3.3 (Limitations), precursor emissions in SoCAB and local meteorological variables have been included in the development of the four models. The structure of the model equations in the manuscript would be applicable to regions where top MDA8 O3 concentrations are mainly affected by local emissions. However, for regions where the top MDA8 O3 is significantly influenced by cross-regional O3 transport, more variables might need to be considered in developing the predictive models (such as precursor emissions in surrounding regions).
3. In the study, precursor emissions are shown to be the most significant factors affecting the peak O3 levels in SoCAB, and maximum temperature is of relatively high importance among the meteorological variables. I suggest illustrating the annual NOx and VOC emissions and the maximum temperature from 1990 to 2019 together with the corresponding 4th highest MDA8 O3 (or the top 30 MDA8 O3 concentrations) in the Supplementary Information.
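The pattern raised in comment 1, a small overall mean bias alongside a larger underestimation at the highest values, can be reproduced with a minimal sketch. The numbers below are synthetic and are not taken from the manuscript; the function names are hypothetical, and R² is assumed here to be the squared Pearson correlation, which may differ from the definition used by the authors.

```python
import math


def mean_bias(pred, obs):
    """Mean of (prediction - observation); negative means underestimation."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)


def rmse(pred, obs):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def r_squared(pred, obs):
    """Squared Pearson correlation between predictions and observations."""
    mp = sum(pred) / len(pred)
    mo = sum(obs) / len(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (var_p * var_o)


def bias_of_top(pred, obs, k):
    """Mean bias restricted to the k highest observed values."""
    top = sorted(zip(obs, pred), reverse=True)[:k]
    return sum(p - o for o, p in top) / k


# Synthetic observations (ppbv) and predictions that shrink toward the mean,
# so low values are overestimated and high values underestimated.
obs = [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]
mean_obs = sum(obs) / len(obs)
pred = [mean_obs + 0.5 * (o - mean_obs) for o in obs]

print(mean_bias(pred, obs))       # overall bias cancels out
print(bias_of_top(pred, obs, 2))  # bias over the two highest days
print(rmse(pred, obs), r_squared(pred, obs))
```

In this toy case the overall mean bias is zero and the correlation is perfect, yet the two highest days are underestimated by 12.5 ppbv on average, which is why bias evaluated over the full sample can hide errors at the extremes.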
Technical comments
None.
Citation: https://doi.org/10.5194/egusphere-2022-396-RC2
AC2: 'Reply on RC2', Ziqi Gao, 07 Nov 2022
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2022/egusphere-2022-396/egusphere-2022-396-AC2-supplement.pdf
Data sets
Predicting peak daily maximum 8-hour ozone, and linkages to emissions and meteorology, in Southern California using machine learning methods Ziqi Gao https://doi.org/10.5281/zenodo.6892062
Model code and software
Predicting peak daily maximum 8-hour ozone, and linkages to emissions and meteorology, in Southern California using machine learning methods Ziqi Gao https://doi.org/10.5281/zenodo.6892066
Ziqi Gao
Yifeng Wang
Petros Vasilakos
Cesunica E. Ivey
Khanh Do
Armistead Goode Russell