the Creative Commons Attribution 4.0 License.
Implications of VOC Oxidation in Atmospheric Chemistry: Development of a Comprehensive AI Model for Predicting Reaction Rate Constants
Abstract. Volatile Organic Compounds (VOCs) significantly influence global atmospheric chemistry through oxidative reactions with oxidants. These reactions produce key precursors to the formation of atmospheric fine particulate matter (PM2.5) and ozone (O3), which in turn play a crucial role in regulating O3 pollution and reducing PM2.5 concentrations. With the increasing diversity of VOCs, the need for advanced modeling techniques to accurately estimate the atmospheric oxidation reaction rate constants (ki, where i ∈ {•OH, •Cl, NO3, or O3}) has become more urgent. Here we introduce Vreact, a Siamese message passing neural network (MPNN) architecture that jointly models VOC–oxidant reactivity. The model simultaneously predicts log10ki values and achieves a mean squared error (MSE) of 0.299 and a coefficient of determination (R²) of 0.941 on the internal test set. This framework overcomes the single-oxidant constraint of traditional models, enabling unified and scalable prediction of VOC oxidation kinetics across multiple oxidants. An interactive web tool (http://vreact.envwind.site:8001) is provided to facilitate non-expert access to reactivity screening. Vreact offers valuable insights into the formation and evolution of atmospheric pollutants, and serves as a critical resource for developing effective control and emission strategies, ultimately supporting global efforts to mitigate air pollution and improve public health.
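For readers less familiar with the two metrics quoted in the abstract, the following is a minimal sketch of how MSE and R² are computed on log10-transformed rate constants. The rate constants and predictions are made-up illustrative numbers, not values from the Vreact dataset, and the helper names are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured rate constants spanning several orders of magnitude.
measured_k = [1e-11, 1e-12, 1e-14, 1e-17]

# The log transform compresses the wide value range before scoring.
y_true = [math.log10(k) for k in measured_k]

# Hypothetical model outputs, already on the log10 scale.
y_pred = [-11.5, -12.0, -13.5, -17.0]

print(f"MSE = {mse(y_true, y_pred):.3f}, R2 = {r2(y_true, y_pred):.3f}")
```

Because both metrics are evaluated on the log10 scale, an MSE of about 0.3 corresponds to a typical error of roughly half an order of magnitude in the rate constant itself.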
Status: closed
-
RC1: 'Comment on egusphere-2025-1241', Gianluca Armeli, 19 May 2025
General comments
The study in question presents a new model for the prediction of reaction rate constants of volatile organic compounds (VOCs). The authors used the reaction rate constant dataset by McGillen et al. to train a Siamese message passing neural network (MPNN) to predict these rate constants. The resulting model was given the name “Vreact” and it was shown to outperform existing models for reaction rate constant prediction.
The dataset used in this study comprises 2802 gas-phase reaction rate constants for 1586 VOCs and 4 oxidants (·OH, ·Cl, ·NO3 and O3). The authors underline this diversity of oxidants as one of their advantages compared to previous models which only use a single oxidant per model. Because of the wide value range of reaction constants, the values were log-transformed. Vreact takes the SMILES string of the VOC and the oxidant as inputs, which is an established and modern approach in chem-informatics. Graph representations are generated from these inputs and fed to the neural network that creates the molecular feature tensors A and B. Further mathematical operations are executed to account for the effects of molecular interactions. Finally, the prediction value for the reaction rate constant is made.
Moreover, the authors evaluate how Vreact can contribute to the understanding of aerosol formation mechanisms. They showcase the oxidation of 2-methyl-4-penten-2-ol, discussing different reaction pathways and how the interaction layer of Vreact can be used for comprehension. Furthermore, the authors gathered more data from 2020 and onwards, which they called the ‘post-2020 test set’ to analyze the extrapolation ability of Vreact, leading to satisfactory results. Besides, more insights on the reaction rates of specific chemical classes are provided.
All in all, the article presents a modern and sustainable study. The Vreact model that is the key component of this work was built on well-established methods and principles and is convincing overall in its performance. Vreact’s advantages and improvements over other models were clearly outlined in a comprehensible way. The study was conducted in a scientifically sound manner with no obvious shortcomings. Although the topic is rather data-scientific, its atmospheric relevance became evident. The illustrations used are helpful and supportive. The supplementary material contains further details on the model architecture and is useful for a deeper understanding. Another valuable resource is the web tool version of Vreact, reinforcing reproducibility and open data.
Specific comments
After the results of the test set were presented, the authors provided more extensive evaluations and showcases of the model’s abilities. First, they draw a more detailed comparison between Vreact and the existing single-oxidant models. To this end, they use two independent approaches: 1) using the pre-trained Vreact to predict the test sets from the literature and 2) retraining Vreact on the original train/test splits of the literature. Approach 2) is a bullet-proof method that really isolates the model’s predictive capability and delivers a nice comparison. Approach 1) has the potential problem that the literature test sets contain data points that are part of Vreact’s training set. This would be problematic because machine learning models generally perform significantly better on seen data, resulting in an unfair comparison. It would be appreciated if the authors could address this issue briefly, since it has not been mentioned in the text so far.
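The overlap concern raised for approach 1) can be checked mechanically. The sketch below is illustrative only: it assumes each data point is identified by a (VOC SMILES, oxidant SMILES) pair and that the SMILES strings have already been canonicalized so that identical molecules compare equal. The function name and example pairs are hypothetical and do not come from the manuscript or its repository.

```python
def split_by_overlap(train_pairs, literature_test_pairs):
    """Partition a literature test set into pairs seen and unseen during training."""
    seen_in_training = set(train_pairs)
    seen = [p for p in literature_test_pairs if p in seen_in_training]
    unseen = [p for p in literature_test_pairs if p not in seen_in_training]
    return seen, unseen

# Hypothetical (VOC, oxidant) pairs, assumed to be canonical SMILES.
train_pairs = [("C=C", "[OH]"), ("CCO", "[OH]"), ("C=C", "[Cl]")]
literature_test = [("CCO", "[OH]"), ("CC=C", "[OH]"), ("CCO", "[Cl]")]

seen, unseen = split_by_overlap(train_pairs, literature_test)
print(f"{len(seen)} of {len(literature_test)} literature test pairs were seen in training")
```

Reporting performance on the `unseen` subset alone would give a comparison that is not inflated by previously seen data points.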
Technical corrections
No typing errors or other technical problems were found.
Citation: https://doi.org/10.5194/egusphere-2025-1241-RC1
-
AC1: 'Reply on RC1', Xian Liu, 22 May 2025
We highly appreciate the reviewer's observations and suggestions proposed to improve the original manuscript. Please find the responses and the additions made to the original text in the attached document.
-
RC2: 'Reply on AC1', Gianluca Armeli, 22 May 2025
Thank you for considering my comments. Everything is clear now, I do not have any further remarks.
Citation: https://doi.org/10.5194/egusphere-2025-1241-RC2
-
RC3: 'Comment on egusphere-2025-1241', Anonymous Referee #2, 10 Jun 2025
Zhang and co-authors present a machine learning model for predicting reaction rate constants of VOC–oxidant pairs using a Siamese neural network. The model is novel in its design and application combination, especially in handling multiple atmospheric oxidants. The results demonstrate good predictive performance alongside chemical insight. The results also demonstrate varying performance on the test set depending on which oxidant is considered. The model is tested on an additional external dataset, and is used to make predictions of rate constants for compounds lacking measurements.
From my point of view, the manuscript is generally well-written and clearly structured. However, methodological and interpretative aspects would benefit from clarification to ensure reproducibility and help contextualize the findings. Nonetheless, I happily recommend it for publication subject to minor revision.
General comments
- I understand that the major benefit of Vreact is the ability to predict reactivities for multiple oxidants. Could the authors further clarify the motivation for using a Siamese neural network over simpler alternative architectures which could also provide predictions for multiple oxidants (such as a one-hot encoding of oxidant identity)? Given that only four oxidants are included, it would be helpful to understand whether the architecture was chosen for scalability, improved interpretability, or flexibility. Will more oxidants be considered in the future?
- The background does not discuss quantum chemistry methods for computing these types of rate constants. A brief discussion could help position this new method in the broader context of rate constant prediction for atmospheric reactions.
- Methods and Table S1 suggest that stratified sampling was used to balance oxidant classes across train/validation/test splits. Since the model operates on VOC–oxidant pairs, it is unclear whether the same VOC can appear in different splits with different oxidants. If so, this could introduce information leakage. Please clarify whether VOCs were kept disjoint across splits.
- In Figure 3G, model performance on the external OH dataset is lower than for O₃, which is the reverse of the trend observed in the internal test set. Could this difference be a result of data quality, compound overlap, or target range?
- Clustering was used to analyze molecular groups and their reactivity (Figures 2E, 3D–F), but details on how these embeddings and clusters were generated are not provided in the Methods. A brief description of how the Morgan fingerprint was constructed (with which parameters) would be a useful addition to the Methods. Similarly, the UMAP and SOM methods could be briefly described there, along with any hyperparameters.
- Finally, the manuscript would benefit from an outlook contextualizing the model's performance by identifying which applications the current accuracy supports and which may require improvement. Relating how performance varies across different oxidants and how this relates to the amount of available data could further emphasize the paper's contribution to understanding data requirements for reliable model accuracy for atmospheric applications.
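The question in the general comments about keeping VOCs disjoint across splits can be made concrete with a small sketch. The code below groups records by VOC before splitting, so that no VOC contributes records to more than one split. It is an illustration of the idea only and is not the authors' sampling procedure: the record layout `(voc, oxidant, log10_k)`, the function name, the 80/10/10 fractions, and the placeholder data are all assumptions. Unlike the stratified sampling described in the manuscript, this simple version does not balance oxidant classes.

```python
import random
from collections import defaultdict

def voc_disjoint_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test so that each VOC is in one split only."""
    by_voc = defaultdict(list)
    for record in records:
        by_voc[record[0]].append(record)  # record[0] is the VOC identifier

    vocs = sorted(by_voc)                 # sort first for a reproducible shuffle
    random.Random(seed).shuffle(vocs)

    n_train = int(fractions[0] * len(vocs))
    n_val = int(fractions[1] * len(vocs))
    voc_groups = (
        vocs[:n_train],
        vocs[n_train:n_train + n_val],
        vocs[n_train + n_val:],
    )
    # Every record of a given VOC follows that VOC into its split.
    return tuple([r for v in group for r in by_voc[v]] for group in voc_groups)

# Placeholder records: 10 hypothetical VOCs, each measured with two oxidants.
records = [(f"VOC{i}", ox, 0.0) for i in range(10) for ox in ("OH", "Cl")]
train, val, test = voc_disjoint_split(records)
print(len(train), len(val), len(test))
```

Splitting at the VOC level trades some control over per-oxidant balance for a stricter guarantee against leakage between splits.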
Specific comments
- Line 29: Add citations on data-driven methods applied to atmospheric chemistry.
- Line 45: “primarily” → “primary.”
- Line 48: The phrase “with NO₃ radicals” is repeated; I suggest removing one instance.
- Line 48: “the atmosphere’s self-cleaning capacity” is ambiguous; consider clarifying, rephrasing, or removing.
- Line 90: Typo—“and and”
- Line 149: “functional group” could be replaced with “molecular motif” when referring to double bonds.
- Line 191: It is mentioned in the Results that MSE is the metric that was used for hyperparameter optimization. This information should also be included in the Methods section for clarity.
- Improve resolution of Figures 1–5. Figure 5A would be clearer as a conventional bar chart rather than a circular one, so that bar heights can be matched to y values more easily.
Citation: https://doi.org/10.5194/egusphere-2025-1241-RC3
-
AC2: 'Reply on RC3', Xian Liu, 19 Jun 2025
-
AC3: 'Reply on RC3', Xian Liu, 19 Jun 2025
Publisher’s note: this comment is a copy of AC2 and its content was therefore removed on 24 June 2025.
Citation: https://doi.org/10.5194/egusphere-2025-1241-AC3
Data sets
Data sets Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact
Model code and software
Model code Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact
Interactive computing environment
Interactive computing environment Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact
Viewed
- HTML: 517
- PDF: 171
- XML: 23
- Total: 711
- Supplement: 230
- BibTeX: 15
- EndNote: 28