This work is distributed under the Creative Commons Attribution 4.0 License.
Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN)
Abstract. The vapor pressures (pvap) of organic molecules play a crucial role in the partitioning of secondary organic aerosol (SOA). Given the vast diversity of atmospheric organic compounds, experimentally determining pvap of each compound is infeasible. Machine Learning (ML) algorithms allow the prediction of physicochemical properties based on complex representations of molecular structure, but their performance crucially depends on the availability of sufficient training data. We propose a novel approach to predict pvap using group contribution-assisted graph convolutional neural networks (GC2NN). The models use molecular descriptors like molar mass alongside molecular graphs containing atom and bond features as representations of molecular structure. Molecular graphs allow the ML model to better infer molecular connectivity compared to methods using other, non-structural embeddings. We achieve best results with an adaptive-depth GC2NN, where the number of evaluated graph layers depends on molecular size. We present two vapor pressure estimation models that achieve strong agreement between predicted and experimentally determined pvap. The first is a general model with broad scope that is suitable for both organic and inorganic molecules and achieves a mean absolute error (MAE) of 0.67 log-units (R2=0.86). The second model is specialized on organic compounds with functional groups often encountered in atmospheric SOA, achieving an even stronger correlation with the test data (MAE=0.36 log-units, R2=0.97). The adaptive-depth GC2NN models clearly outperform existing methods, including parameterizations and group-contribution methods, demonstrating that graph-based ML techniques are powerful tools for the estimation of physicochemical properties, even when experimental data are scarce.
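The adaptive-depth idea in the abstract can be illustrated with a minimal sketch, not the authors' actual architecture: a toy graph convolution in NumPy where the number of message-passing layers grows with molecule size, so messages can propagate across larger molecules. The depth rule (one layer per ~3 heavy atoms, clamped), the feature dimensions, and the `gcn_layer`/`adaptive_depth_gcn` names are illustrative assumptions.

```python
import numpy as np

def gcn_layer(H, A_hat, W):
    """One graph convolution: aggregate neighbor features, project, ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

def adaptive_depth_gcn(atom_features, adjacency, weights, min_depth=2, max_depth=8):
    """Run a variable number of graph convolutions, scaling depth with
    molecule size (a hypothetical stand-in for the paper's adaptive depth)."""
    n_atoms = adjacency.shape[0]
    # Symmetrically normalized adjacency with self-loops:
    # A_hat = D^{-1/2} (A + I) D^{-1/2}
    A = adjacency + np.eye(n_atoms)
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))
    # Hypothetical depth rule: one layer per ~3 heavy atoms, clamped.
    depth = int(np.clip(n_atoms // 3, min_depth, max_depth))
    H = atom_features
    for k in range(depth):
        H = gcn_layer(H, A_hat, weights[k % len(weights)])
    # Readout: mean-pool node embeddings into one molecule-level vector.
    return H.mean(axis=0), depth
```

In a real model the molecule vector would then be concatenated with global descriptors (e.g., molar mass) and passed to dense layers predicting log10(pvap).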
Status: closed
-
RC1: 'Comment on egusphere-2025-1191', Anonymous Referee #1, 02 Jun 2025
This manuscript presents a compelling machine learning framework—group contribution-assisted graph convolutional neural networks (GC2NN)—for improving vapor pressure predictions of organic and inorganic molecules. The study is comprehensive, technically sound, and well-articulated. It provides a rigorous benchmark against established methods and convincingly demonstrates the advantages of adaptive-depth GC2NN models, especially in handling compounds with limited experimental data. The paper is suitable for publication pending minor revisions to improve clarity, reproducibility, and contextualization of the results.
- Abstract: Briefly explain what "group contribution-assisted" means in lay terms.
- Model Hyperparameters: Include a summary of final selected hyperparameters in the main text (not only supplementary).
- Loss Function Choice: Explain why MAE was chosen over RMSE or other metrics, especially given outlier sensitivity.
- Model Training Time: Mention hardware specifications (already noted) in the main text to aid reproducibility.
- Model interpretability: Can the model provide insight into which functional groups have the greatest influence on pvap? Any feature attribution analysis would be welcome.
- Uncertainty analysis: Consider quantifying uncertainty beyond the MAE (e.g., ensemble predictions or Bayesian GCNNs) to account for experimental variability.
- ELVOC challenge: Expand on potential strategies for improving the model's accuracy for ELVOCs beyond dataset expansion.
- Atmospheric modeling relevance: Emphasize more explicitly how your results could improve parameterizations in SOA models or regional climate simulations.
- Extension potential: Could this approach be adapted to other physicochemical properties (e.g., Henry's law constants, reactivity)? Add a sentence on this.
- The conclusions could briefly mention potential future developments, such as integrating physics-informed neural networks or hybrid QM/ML models.
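The ensemble-based uncertainty quantification suggested in the referee's comments could look like the following sketch. The `ensemble_predict` helper and the use of the sample standard deviation as the uncertainty measure are illustrative assumptions, not part of the manuscript:

```python
import numpy as np

def ensemble_predict(models, x):
    """Predict log10(pvap) with several independently trained models and
    report the ensemble mean and its spread as an uncertainty estimate."""
    preds = np.array([m(x) for m in models])
    # Sample standard deviation across ensemble members as uncertainty.
    return preds.mean(), preds.std(ddof=1)
```

Deep ensembles of this kind are a common, architecture-agnostic baseline for uncertainty estimation and would require no change to the GC2NN model itself, only repeated training with different random seeds.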
Citation: https://doi.org/10.5194/egusphere-2025-1191-RC1 -
EC1: 'Reply on RC1', Jason Williams, 02 Jun 2025
Dear Anonymous Referee,
Unfortunately, half of your response has been posted in a language that some readers will find difficult to read, which also falls outside the guidelines of the journal.
Can you please repost your comment?
Thank you in advance,
Jason Williams.
Citation: https://doi.org/10.5194/egusphere-2025-1191-EC1
-
RC2: 'Comment on egusphere-2025-1191', Patrick Rinke, 10 Jun 2025
The manuscript by Krüger et al. introduces group contribution-assisted graph convolutional neural networks (GC2NN) as a novel machine learning approach to predict the vapor pressures of organic molecules, which are critical for understanding secondary organic aerosol (SOA) formation. By combining molecular descriptors with graph-based representations of molecular structure, the adaptive-depth GC2NN models significantly outperform traditional methods, especially in data-scarce regimes. Two models were developed: a general-purpose model with a mean absolute error (MAE) of 0.67 log-units and a specialized model for SOA-relevant compounds achieving a much lower MAE of 0.36 log-units, demonstrating high predictive accuracy.
The manuscript is clearly written, the approach is sound, and the findings are well described. Graph convolutional neural networks are relatively new in molecular atmospheric science, and the findings will be of interest to the readers of GMD. I recommend the manuscript for publication in GMD provided my minor comments below are addressed.
- “We assembled a data set of SMILES representations of 6128 compounds with experimental saturation vapor pressure measurements... ” When I read this sentence, I was wondering if the dataset would be made available. Later, in the data availability statement, it is clarified that the data can be found alongside the code. It might be helpful to allude to the data availability already at this point in the manuscript.
- Figure 1: It is interesting that the vapour pressure distribution of the confined dataset (shown in panel d) is skewed towards higher vapour pressures. Could this be explained, and could this explanation be added to the manuscript?
- GC2: What exactly is the input to the graph convolutions? Figure 2 is scant on details in this regard.
- I am trying to understand the added value of the group contribution component. The functional groups are already part of the graph. So in principle, one would assume that nothing new is added by supplying them separately as input. Could we think of this extra channel as some kind of weighting that increases the importance of the functional groups in the input representation? One would think that the neural network would adjust such a weighting internally already during the training, just from graph input, if it deems functional group features to be particularly important. But maybe supplying the group contribution separately enforces a higher importance from the start.
- I would appreciate more details on the GC2 architecture (since these are also not supplied in the SI). How exactly is the merging done in the merging layers?
- page 11: “ … or could be a result of the sparsity of ELVOC data in the training set.” More ELVOCs are available in Besel et al., J. Aerosol Sci. 179, 106375 (2024): https://www.sciencedirect.com/science/article/pii/S0021850224000429?via%3Dihub
- Section 3.3. GC^2NN-GeckoQ: Do the authors have any insight into why GeckoQ is so much harder to learn? GeckoQ is by far the largest dataset of the ones studied in this work, so it should be easier to learn, or the learning curve should reach similar errors at higher training data volumes.
- The “Conclusions” section is more of a Discussion section and should be named as such, because there is no other section named discussion, but there should be.
- “… making spatial relations between molecular substructures directly interpretable.” Did the authors try to interpret the models and extract chemical insight?
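The reviewer's intuition about the extra group-contribution channel can be made concrete with a minimal sketch, assuming the channel is a functional-group count vector concatenated with the pooled graph embedding before a dense projection. The function name, shapes, and the concatenation scheme are hypothetical; the manuscript's actual merging layers may differ:

```python
import numpy as np

def merge_gc_channel(graph_embedding, group_counts, W_merge, b_merge):
    """Hypothetical merging layer: concatenate the pooled graph embedding
    with an explicit functional-group count vector, then project with a
    dense layer. Supplying the counts as a separate channel makes this
    information available at full strength from the first training step,
    instead of requiring the network to learn to extract it from the graph."""
    merged = np.concatenate([graph_embedding, group_counts])
    return np.maximum(W_merge @ merged + b_merge, 0.0)
```

Under this reading, the extra channel acts as an inductive bias rather than new information, which matches the reviewer's "weighting enforced from the start" interpretation.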
Citation: https://doi.org/10.5194/egusphere-2025-1191-RC2 -
AC1: 'Response to reviewers of egusphere-2025-1191', Matteo Krüger, 15 Jul 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1191/egusphere-2025-1191-AC1-supplement.pdf