Implications of VOC Oxidation in Atmospheric Chemistry: Development of a Comprehensive AI Model for Predicting Reaction Rate Constants

Zhang, Xin; Luo, Jiaqi; Pan, Wenxiao; Xue, Qiao; Liu, Xian; Fu, Jianjie; Zhang, Aiqian; Jiang, Guibin

doi:https://doi.org/10.5194/egusphere-2025-1241

Preprints

11 Apr 2025

Abstract. Volatile organic compounds (VOCs) significantly influence global atmospheric chemistry through reactions with atmospheric oxidants. These reactions produce key precursors of atmospheric fine particulate matter (PM_2.5) and ozone (O₃), and therefore play a crucial role in efforts to regulate O₃ pollution and reduce PM_2.5 concentrations. With the increasing diversity of VOCs, the need for advanced modeling techniques to accurately estimate the atmospheric oxidation reaction rate constants (k_i, where i ∈ {•OH, •Cl, NO₃, or O₃}) has become more urgent. Here we introduce Vreact, a Siamese message passing neural network (MPNN) architecture that jointly models VOC–oxidant reactivity. The model simultaneously predicts log₁₀k_i values and achieves a mean squared error (MSE) of 0.299 and a coefficient of determination (R²) of 0.941 on the internal test set. This framework overcomes the single-oxidant constraint of traditional models, enabling unified and scalable prediction of VOC oxidation kinetics across multiple oxidants. An interactive web tool (http://vreact.envwind.site:8001) is provided to facilitate non-expert access to reactivity screening. Vreact offers valuable insights into the formation and evolution of atmospheric pollutants, and serves as a critical resource for developing effective control and emission strategies, ultimately supporting global efforts to mitigate air pollution and improve public health.
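To make the reported metrics concrete, the following minimal sketch shows how MSE and R² are computed on log₁₀-transformed rate constants. The rate-constant values are hypothetical and chosen only for easy arithmetic; the function names are illustrative and are not taken from the Vreact code base.

```python
import math


def log10_rate_constants(ks):
    """Convert rate constants to the log10 scale used for training and scoring."""
    return [math.log10(k) for k in ks]


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted rate constants (illustrative values only).
k_measured = [1e-11, 1e-13, 1e-17, 1e-15]
k_predicted = [1e-11, 1e-12, 1e-17, 1e-16]

y_true = log10_rate_constants(k_measured)
y_pred = log10_rate_constants(k_predicted)

print(f"MSE = {mse(y_true, y_pred):.3f}, R2 = {r2(y_true, y_pred):.3f}")
```

Because the error is measured on the log₁₀ scale, an MSE near 0.3 corresponds to a typical deviation of roughly half an order of magnitude in k_i.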

Received: 18 Mar 2025 – Discussion started: 11 Apr 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 1254 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Supplement (1065 KB)

Journal article(s) based on this preprint

22 Oct 2025

Implications of VOC oxidation in atmospheric chemistry: development of a comprehensive AI model for predicting reaction rate constants

Xin Zhang, Jiaqi Luo, Wenxiao Pan, Qiao Xue, Xian Liu, Jianjie Fu, Aiqian Zhang, and Guibin Jiang

Atmos. Chem. Phys., 25, 13379–13391, https://doi.org/10.5194/acp-25-13379-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-1241', Gianluca Armeli, 19 May 2025

General comments
The study in question presents a new model for the prediction of reaction rate constants of volatile organic compounds (VOCs). The authors used the reaction rate constant dataset by McGillen et al. to train a Siamese message passing neural network (MPNN) to predict these rate constants. The resulting model was named “Vreact” and was shown to outperform existing models for reaction rate constant prediction.
The dataset used in this study comprises 2802 gas-phase reaction rate constants for 1586 VOCs and 4 oxidants (·OH, ·Cl, ·NO₃ and O₃). The authors underline this diversity of oxidants as one of their advantages compared to previous models, which only use a single oxidant per model. Because of the wide value range of the rate constants, the values were log-transformed. Vreact takes the SMILES strings of the VOC and the oxidant as inputs, which is an established and modern approach in cheminformatics. Graph representations are generated from these inputs and fed to the neural network that creates the molecular feature tensors A and B. Further mathematical operations are executed to account for the effects of molecular interactions. Finally, the reaction rate constant is predicted.
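The pipeline summarized in this comment (two inputs encoded into feature tensors A and B, which are then combined) can be illustrated with a toy sketch. This is not the Vreact architecture: the dense tanh encoder, the shared weights, the element-wise product used as the interaction step, and all numeric values are assumptions made purely for illustration, and fixed-length feature vectors stand in for the molecular graphs that an MPNN would process.

```python
import math


def encode(features, weights):
    """Toy shared encoder: one dense layer with tanh, used for either molecule."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def predict_log10_k(voc_features, oxidant_features, shared_weights, readout):
    """Toy Siamese forward pass returning a scalar log10(k) estimate."""
    a = encode(voc_features, shared_weights)      # feature vector A (VOC)
    b = encode(oxidant_features, shared_weights)  # feature vector B (oxidant), same weights
    interaction = [x * y for x, y in zip(a, b)]   # assumed interaction: element-wise product
    return sum(r * i for r, i in zip(readout, interaction))


# Hypothetical inputs and parameters (illustrative values only).
voc = [0.2, 1.0, -0.5]
oxidant = [1.0, 0.0, 0.3]
W = [[0.5, -0.3, 0.1],
     [0.2, 0.4, -0.6]]
readout = [1.5, -2.0]

pred = predict_log10_k(voc, oxidant, W, readout)
print(f"toy log10(k) prediction: {pred:.4f}")
```

The defining feature shown here is that both branches reuse the same encoder parameters, so adding a new oxidant requires only a new input rather than a new model.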
Moreover, the authors evaluate how Vreact can contribute to the understanding of aerosol formation mechanisms. They showcase the oxidation of 2-methyl-4-penten-2-ol, discussing different reaction pathways and how the interaction layer of Vreact can be used for comprehension. Furthermore, the authors gathered more data from 2020 and onwards, which they called the ‘post-2020 test set’ to analyze the extrapolation ability of Vreact, leading to satisfactory results. Besides, more insights on the reaction rates of specific chemical classes are provided.
All in all, the article presents a modern and sustainable study. The Vreact model, which is the key component of this work, was built on well-established methods and principles and performs convincingly overall. Vreact’s advantages and improvements over other models were outlined clearly and comprehensibly. The study was conducted in a scientifically sound manner with no obvious shortcomings. Despite it being a rather data-science-oriented topic, its atmospheric relevance became evident. The illustrations used are helpful and supportive. The supplementary material contains further details on the model architecture and is useful for a deeper understanding. Another valuable resource is the web tool version of Vreact, reinforcing reproducibility and open data.
Specific comments
After presenting the test-set results, the authors provide more extensive evaluations and showcases of the model’s abilities. First, they draw a more detailed comparison between Vreact and the existing single-oxidant models, using two independent approaches: 1) using the pre-trained Vreact to predict the test sets from the literature and 2) retraining Vreact on the original train/test splits of the literature. Approach 2) is a bullet-proof method that really isolates the model’s predictive capability and delivers a nice comparison. Approach 1) has the potential problem that the literature test sets may contain data points that are part of Vreact’s training set. This would be problematic because machine learning models generally perform significantly better on seen data, resulting in an unfair comparison. It would be appreciated if the authors could address this issue briefly, since it is not mentioned in the text so far.
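The overlap concern raised for approach 1) can be checked mechanically. The sketch below is a hypothetical helper, not part of the Vreact code: it assumes each data point is keyed by a (VOC SMILES, oxidant) pair and that SMILES strings have already been canonicalized so that identical molecules compare equal. The example pairs are invented for illustration.

```python
def find_overlap(train_pairs, external_pairs):
    """Split an external test set into pairs already seen in training and unseen pairs."""
    train = set(train_pairs)
    seen = [p for p in external_pairs if p in train]
    unseen = [p for p in external_pairs if p not in train]
    return seen, unseen


# Hypothetical (canonical SMILES, oxidant) pairs, for illustration only.
train_pairs = [("CCO", "OH"), ("C=C", "O3"), ("CC(C)=O", "OH")]
external_pairs = [("C=C", "O3"), ("CCO", "Cl"), ("CCCC", "OH")]

seen, unseen = find_overlap(train_pairs, external_pairs)
print(f"{len(seen)} of {len(external_pairs)} external pairs also occur in the training set")
```

Reporting metrics separately on the unseen subset would give a comparison that is not inflated by memorized training points.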
Technical corrections
No typing errors or other technical problems were found.

Citation: https://doi.org/10.5194/egusphere-2025-1241-RC1
- AC1:
  'Reply on RC1', Xian Liu, 22 May 2025
  
  We highly appreciate the reviewer's observations and suggestions proposed to improve the original manuscript. Please find the responses and the additions made to the original text in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2025-1241-AC1
  - RC2: 'Reply on AC1', Gianluca Armeli, 22 May 2025
    
    Thank you for considering my comments. Everything is clear now, I do not have any further remarks.
    
    Citation: https://doi.org/10.5194/egusphere-2025-1241-RC2
RC3:
'Comment on egusphere-2025-1241', Anonymous Referee #2, 10 Jun 2025
Zhang and co-authors present a machine learning model for predicting reaction rate constants of VOC–oxidant pairs using a Siamese neural network. The model is novel in its design and application combination, especially in handling multiple atmospheric oxidants. The results demonstrate good predictive performance alongside chemical insight. The results also demonstrate varying performance on the test set depending on which oxidant is considered. The model is tested on an additional external dataset, and is used to make predictions of rate constants for compounds lacking measurements.
From my point of view, the manuscript is generally well-written and clearly structured. However, some methodological and interpretative aspects would benefit from clarification to ensure reproducibility and help contextualize the findings. I happily recommend it for publication subject to minor revision.
General comments
I understand that the major benefit of Vreact is the ability to predict reactivities for multiple oxidants. Could the authors further clarify the motivation for using a Siamese neural network over simpler alternative architectures that could also provide predictions for multiple oxidants (such as a one-hot encoding of oxidant identity)? Given that only four oxidants are included, it would be helpful to understand whether the architecture was chosen for scalability, improved interpretability, or flexibility. Will more oxidants be considered in the future?
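For reference, the simpler alternative mentioned in this comment could look like the following sketch, in which the oxidant identity is appended to the VOC features as a one-hot vector and a single-branch model consumes the result. The feature values and function names are hypothetical; only the four oxidants come from the manuscript.

```python
OXIDANTS = ["OH", "Cl", "NO3", "O3"]  # the four oxidants covered by the dataset


def one_hot_oxidant(name):
    """Return a 4-element one-hot vector identifying the oxidant."""
    if name not in OXIDANTS:
        raise ValueError(f"unknown oxidant: {name}")
    return [1.0 if name == ox else 0.0 for ox in OXIDANTS]


def baseline_input(voc_features, oxidant):
    """Concatenate VOC features with the oxidant one-hot code."""
    return list(voc_features) + one_hot_oxidant(oxidant)


# Hypothetical two-dimensional VOC feature vector, for illustration only.
x = baseline_input([0.5, 1.0], "O3")
print(x)
```

A limitation of this encoding is that the input width is tied to the fixed oxidant list, so an oxidant outside that list cannot be represented without changing the model.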

Quantum chemistry methods for computing these types of rate constants are not mentioned in the background. A brief discussion of them could help position this new method in the broader context of rate constant prediction for atmospheric reactions.

Methods and Table S1 suggest that stratified sampling was used to balance oxidant classes across train/validation/test splits. Since the model operates on VOC–oxidant pairs, it is unclear whether the same VOC can appear in different splits with different oxidants. If so, this could introduce information leakage. Please clarify whether VOCs were kept disjoint across splits.
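A VOC-disjoint split of the kind asked about here can be sketched as follows. This is a generic grouped split, not the authors' procedure; the function name, the example pairs, and the parameters are hypothetical, and unlike the stratified sampling described in the manuscript this simple version does not balance oxidant classes.

```python
import random


def voc_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Split (VOC, oxidant) pairs so that every VOC lands in exactly one subset."""
    vocs = sorted({voc for voc, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(vocs)
    n_test = max(1, round(len(vocs) * test_fraction))
    test_vocs = set(vocs[:n_test])
    train = [p for p in pairs if p[0] not in test_vocs]
    test = [p for p in pairs if p[0] in test_vocs]
    return train, test


# Hypothetical (SMILES, oxidant) pairs; several VOCs appear with more than one oxidant.
example_pairs = [
    ("CCO", "OH"), ("CCO", "Cl"),
    ("C=C", "OH"), ("C=C", "O3"), ("C=C", "NO3"),
    ("CC(C)=O", "OH"),
    ("CCCC", "OH"), ("CCCC", "Cl"),
    ("C=CC=C", "O3"),
]

train, test = voc_disjoint_split(example_pairs, test_fraction=0.2, seed=0)
print(f"train pairs: {len(train)}, test pairs: {len(test)}")
```

Splitting by VOC rather than by pair means that a test-set prediction can never draw on the same molecule having been seen during training with a different oxidant.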

In Figure 3G, model performance on the external OH dataset is lower than for O₃, which is the reverse of the trend observed in the internal test set. Could this difference be a result of data quality, compound overlap, or target range?

Clustering was used to analyze molecular groups and their reactivity (Figures 2E, 3D–F), but details on how these embeddings and clusters were generated are not provided in the Methods. A brief description of how the Morgan fingerprint was constructed (with which parameters) would be a good addition to the Methods. Similarly, the UMAP and SOM methods could be briefly described there, along with any hyperparameters.

Finally, the manuscript would benefit from an outlook that contextualizes the model's performance by identifying which applications the current accuracy supports and which may require improvement. Relating how performance varies across different oxidants and how this relates to the amount of available data could further emphasize the paper's contribution to understanding data requirements for reliable model accuracy for atmospheric applications.

Specific comments
Line 29: Add citations on data-driven methods applied to atmospheric chemistry.

Line 45: “primarily” → “primary.”

Line 48: The phrase “with NO₃ radicals” is repeated; I suggest removing one instance.

Line 48: “the atmosphere’s self-cleaning capacity” is ambiguous; consider clarifying, rephrasing, or removing.

Line 90: Typo—“and and”

Line 149: “functional group” could be replaced with “molecular motif” when referring to double bonds.

Line 191: The Results mention that MSE was the metric used for hyperparameter optimization. This information should also be included in the Methods section for clarity.

Improve the resolution of Figures 1–5. Figure 5A would be clearer as a conventional bar chart rather than a circular one, which would make it easier to match bar height with the y value.
Citation: https://doi.org/10.5194/egusphere-2025-1241-RC3
- AC2: 'Reply on RC3', Xian Liu, 19 Jun 2025
  
  We highly appreciate the reviewer's observations and suggestions proposed to improve the original manuscript. Please find the responses and the additions made to the manuscript in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2025-1241-AC2
- AC3: 'Reply on RC3', Xian Liu, 19 Jun 2025
  
  Publisher’s note: this comment is a copy of AC2 and its content was therefore removed on 24 June 2025.
  
  Citation: https://doi.org/10.5194/egusphere-2025-1241-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Xian Liu on behalf of the Authors (23 Jun 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (26 Jun 2025) by Thomas Berkemeier

RR by Anonymous Referee #1 (30 Jun 2025)

ED: Publish subject to minor revisions (review by editor) (09 Jul 2025) by Thomas Berkemeier

Dear Authors,

Thank you very much for your revision. Before the paper can be accepted in ACP, I would like to reiterate an important point made by reviewer #2, which I believe was not sufficiently addressed and is crucial to showing the significance of this work. The original comment read:

"I understand that the major benefit of Vreact is the ability to predict reactivities for multiple oxidants. Could the authors further clarify the motivation for using a Siamese neural network over simpler alternative architectures which also could provide prediction for multiple oxidants (such as a one-hot encoding of oxidant identity). Given that only four oxidants are included, it would be helpful to understand whether the architecture was chosen for scalability, improved interpretability, or flexibility."

I would prefer if the authors give quantitative results for their newly added statement of "achieving higher accuracy, stronger interpretability and wider scalability".

1. Accuracy: It would really strengthen the paper if the authors could show and quantify, using unbiased calculations (e.g. with same training time), that the Siamese neural network architecture is superior to simpler alternative model architectures. Can the authors show what happens if you train the same type of neural network (just without the Siamese architecture), for each oxidant? Is the result worse? I think it's not enough here to compare with other published models that may have different numbers of neural network hyperparameters, training time, etc.

2. Interpretability: Can the authors show what precisely can be learned about the detailed chemical interactions of two molecules with the Siamese neural network architecture? If not, it should be indicated that this is only a hypothetical feature that was not yet explored.

Minor Comments (lines refer to manuscript with tracked changes)

l. 64: "QC approaches combine ab initio or density-functional theory calculations ..."

- DFT is generally considered an ab initio method (though not a first-principles method).

l. 90: "They extract the interaction features of chemical reactions in depth, rather than performing simple reactant concatenating. Yet, their application has largely focused on synthesis or materials chemistry, not atmospheric multiphase oxidation."

- Can the authors explain what they mean by the first sentence?
- Second sentence: this study does not look at multiphase chemistry, either.

l. 98: " Vreact shows significantly improved performance"

- Please provide the level of significance for this statement.
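One possible way to attach a significance level to such a comparison is a paired sign-flip permutation test on per-compound errors of two models evaluated on the same test set. The sketch below is a generic illustration under that assumption; it is not the procedure used by the authors, and the error values are hypothetical.

```python
import random


def paired_sign_flip_test(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided p-value for the mean paired difference between two error lists."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical squared errors of a baseline model and a new model on 12 shared compounds.
baseline_errors = [1.0] * 12
new_model_errors = [0.0] * 12

p_value = paired_sign_flip_test(baseline_errors, new_model_errors)
print(f"p = {p_value:.4f}")
```

The test is paired because both models are scored on identical compounds, which removes compound-to-compound difficulty from the comparison.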

AR by Xian Liu on behalf of the Authors (17 Jul 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (17 Jul 2025) by Thomas Berkemeier

AR by Xian Liu on behalf of the Authors (18 Jul 2025) Author's response Manuscript

Supplement

https://doi.org/10.5194/egusphere-2025-1241-supplement

Data sets

Data sets Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact

Model code and software

Model code Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact

Interactive computing environment

Interactive computing environment Xin Zhang and Jiaqi Luo https://github.com/Luo-Jiaqi/Vreact

Viewed

Total article views: 1,054 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
807	223	24	1,054	242	16	30

Views and downloads (calculated since 11 Apr 2025)

Month	HTML	PDF	XML	Total
Apr 2025	109	22	7	138
May 2025	57	29	3	89
Jun 2025	67	50	10	127
Jul 2025	46	30	2	78
Aug 2025	106	33	1	140
Sep 2025	390	26	1	417
Oct 2025	32	33	0	65

Viewed (geographical distribution)

Total article views: 1,040 (including HTML, PDF, and XML) Thereof 1,040 with geography defined and 0 with unknown origin.

Latest update: 22 Oct 2025

Short summary

VOCs drive atmospheric chemistry via oxidation, forming PM_2.5/ozone precursors. This study introduces Vreact, a graph-based AI model predicting reaction rate constants (k_i) for multiple oxidants simultaneously. It achieves high accuracy (MSE = 0.281 and R² = 0.941 for log₁₀k_i), overcoming single-oxidant model limits. A web tool enables rapid rate screening. Vreact advances pollutant formation insights and supports emission control strategies, aiding global air quality and public health efforts.

