This work is distributed under the Creative Commons Attribution 4.0 License.
Improved vapor pressure predictions using group contribution-assisted graph convolutional neural networks (GC2NN)
Abstract. The vapor pressures (pvap) of organic molecules play a crucial role in the partitioning of secondary organic aerosol (SOA). Given the vast diversity of atmospheric organic compounds, experimentally determining pvap of each compound is infeasible. Machine Learning (ML) algorithms allow the prediction of physicochemical properties based on complex representations of molecular structure, but their performance crucially depends on the availability of sufficient training data. We propose a novel approach to predict pvap using group contribution-assisted graph convolutional neural networks (GC2NN). The models use molecular descriptors like molar mass alongside molecular graphs containing atom and bond features as representations of molecular structure. Molecular graphs allow the ML model to better infer molecular connectivity compared to methods using other, non-structural embeddings. We achieve best results with an adaptive-depth GC2NN, where the number of evaluated graph layers depends on molecular size. We present two vapor pressure estimation models that achieve strong agreement between predicted and experimentally determined pvap. The first is a general model with broad scope that is suitable for both organic and inorganic molecules and achieves a mean absolute error (MAE) of 0.67 log-units (R2=0.86). The second model is specialized on organic compounds with functional groups often encountered in atmospheric SOA, achieving an even stronger correlation with the test data (MAE=0.36 log-units, R2=0.97). The adaptive-depth GC2NN models clearly outperform existing methods, including parameterizations and group-contribution methods, demonstrating that graph-based ML techniques are powerful tools for the estimation of physicochemical properties, even when experimental data are scarce.
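The adaptive-depth idea in the abstract can be illustrated with a minimal sketch, not the authors' actual architecture: a toy graph convolution in NumPy where the number of message-passing layers grows with molecule size, so messages can propagate across larger molecules. The depth rule (one layer per ~3 heavy atoms, clamped), the feature dimensions, and the `gcn_layer`/`adaptive_depth_gcn` names are illustrative assumptions.

```python
import numpy as np

def gcn_layer(H, A_hat, W):
    """One graph convolution: aggregate neighbor features, project, ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

def adaptive_depth_gcn(atom_features, adjacency, weights, min_depth=2, max_depth=8):
    """Run a variable number of graph convolutions, scaling depth with
    molecule size (a hypothetical stand-in for the paper's adaptive depth)."""
    n_atoms = adjacency.shape[0]
    # Symmetrically normalized adjacency with self-loops:
    # A_hat = D^{-1/2} (A + I) D^{-1/2}
    A = adjacency + np.eye(n_atoms)
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))
    # Hypothetical depth rule: one layer per ~3 heavy atoms, clamped.
    depth = int(np.clip(n_atoms // 3, min_depth, max_depth))
    H = atom_features
    for k in range(depth):
        H = gcn_layer(H, A_hat, weights[k % len(weights)])
    # Readout: mean-pool node embeddings into one molecule-level vector.
    return H.mean(axis=0), depth
```

In a real model the molecule vector would then be concatenated with global descriptors (e.g., molar mass) and passed to dense layers predicting log10(pvap).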
Status: closed
-
RC1: 'Comment on egusphere-2025-1191', Anonymous Referee #1, 02 Jun 2025
This manuscript presents a compelling machine learning framework—group contribution-assisted graph convolutional neural networks (GC2NN)—for improving vapor pressure predictions of organic and inorganic molecules. The study is comprehensive, technically sound, and well-articulated. It provides a rigorous benchmark against established methods and convincingly demonstrates the advantages of adaptive-depth GC2NN models, especially in handling compounds with limited experimental data. The paper is suitable for publication pending minor revisions to improve clarity, reproducibility, and contextualization of the results.
- Abstract: Briefly explain what "group contribution-assisted" means in lay terms.
- Model Hyperparameters: Include a summary of final selected hyperparameters in the main text (not only supplementary).
- Loss Function Choice: Explain why MAE was chosen over RMSE or other metrics, especially given outlier sensitivity.
- Model Training Time: Mention hardware specifications (already noted) in the main text to aid reproducibility.
- Model interpretability: Can the model provide insight into which functional groups have the greatest influence on pvap? Any feature attribution analysis would be welcome.
- Uncertainty analysis: Consider quantifying uncertainty beyond the MAE (e.g., ensemble predictions or Bayesian GCNNs) to account for experimental variability.
- ELVOC challenge: Expand on potential strategies for improving the model's accuracy for ELVOCs beyond dataset expansion.
- Atmospheric modeling relevance: Emphasize more explicitly how your results could improve parameterizations in SOA models or regional climate simulations.
- Extension potential: Could this approach be adapted to other physicochemical properties (e.g., Henry's law constants, reactivity)? Add a sentence on this.
- The conclusions could briefly mention potential future developments, such as integrating physics-informed neural networks or hybrid QM/ML models.
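The ensemble-based uncertainty quantification suggested in the referee's comments could look like the following sketch. The `ensemble_predict` helper and the use of the sample standard deviation as the uncertainty measure are illustrative assumptions, not part of the manuscript:

```python
import numpy as np

def ensemble_predict(models, x):
    """Predict log10(pvap) with several independently trained models and
    report the ensemble mean and its spread as an uncertainty estimate."""
    preds = np.array([m(x) for m in models])
    # Sample standard deviation across ensemble members as uncertainty.
    return preds.mean(), preds.std(ddof=1)
```

Deep ensembles of this kind are a common, architecture-agnostic baseline for uncertainty estimation and would require no change to the GC2NN model itself, only repeated training with different random seeds.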
Citation: https://doi.org/10.5194/egusphere-2025-1191-RC1 -
EC1: 'Reply on RC1', Jason Williams, 02 Jun 2025
Dear Anonymous Referee,
Unfortunately, half of your response has been posted in a language that some readers will find difficult to read, which also falls outside the guidelines of the journal.
Can you please repost your comment?
Thank you in advance,
Jason Williams.
Citation: https://doi.org/10.5194/egusphere-2025-1191-EC1
-
RC2: 'Comment on egusphere-2025-1191', Patrick Rinke, 10 Jun 2025
The manuscript by Krüger et al. introduces group contribution-assisted graph convolutional neural networks (GC2NN) as a novel machine learning approach to predict the vapor pressures of organic molecules, which are critical for understanding secondary organic aerosol (SOA) formation. By combining molecular descriptors with graph-based representations of molecular structure, the adaptive-depth GC2NN models significantly outperform traditional methods, especially in data-scarce regimes. Two models were developed: a general-purpose model with a mean absolute error (MAE) of 0.67 log-units and a specialized model for SOA-relevant compounds achieving a much lower MAE of 0.36 log-units, demonstrating high predictive accuracy.
The manuscript is clearly written, the approach is sound, and the findings are well described. Graph convolutional neural networks are relatively new in molecular atmospheric science, and the findings will be of interest to the readers of GMD. I recommend the manuscript for publication in GMD provided my minor comments below are addressed.
- “We assembled a data set of SMILES representations of 6128 compounds with experimental saturation vapor pressure measurements... ” When I read this sentence, I was wondering if the dataset would be made available. Later, in the data availability statement, it is clarified that the data can be found alongside the code. It might be helpful to allude to the data availability already at this point in the manuscript.
- Figure 1: It is interesting that the vapour pressure distribution of the confined dataset (shown in panel d) is skewed towards higher vapour pressures. Could this be explained, and could this explanation be added to the manuscript?
- GC2: What exactly is the input to the graph convolutions? Figure 2 is scant on details in this regard.
- I am trying to understand the added value of the group contribution component. The functional groups are already part of the graph. So in principle, one would assume that nothing new is added by supplying them separately as input. Could we think of this extra channel as some kind of weighting that increases the importance of the functional groups in the input representation? One would think that the neural network would adjust such a weighting internally already during the training, just from graph input, if it deems functional group features to be particularly important. But maybe supplying the group contribution separately enforces a higher importance from the start.
- I would appreciate more details on the GC2 architecture (since these are also not supplied in the SI). How exactly is the merging done in the merging layers?
- page 11: “ … or could be a result of the sparsity of ELVOC data in the training set.” More ELVOCs are available in Besel et al., J. Aerosol Sci. 179, 106375 (2024): https://www.sciencedirect.com/science/article/pii/S0021850224000429?via%3Dihub
- Section 3.3. GC^2NN-GeckoQ: Do the authors have any insight into why GeckoQ is so much harder to learn? GeckoQ is by far the largest dataset of the ones studied in this work, so it should be easier to learn, or the learning curve should reach similar errors at higher training data volumes.
- The “Conclusions” section is more of a Discussion section and should be named as such, because there is no other section named discussion, but there should be.
- “… making spatial relations between molecular substructures directly interpretable.” Did the authors try to interpret the models and extract chemical insight?
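The reviewer's intuition about the extra group-contribution channel can be made concrete with a minimal sketch, assuming the channel is a functional-group count vector concatenated with the pooled graph embedding before a dense projection. The function name, shapes, and the concatenation scheme are hypothetical; the manuscript's actual merging layers may differ:

```python
import numpy as np

def merge_gc_channel(graph_embedding, group_counts, W_merge, b_merge):
    """Hypothetical merging layer: concatenate the pooled graph embedding
    with an explicit functional-group count vector, then project with a
    dense layer. Supplying the counts as a separate channel makes this
    information available at full strength from the first training step,
    instead of requiring the network to learn to extract it from the graph."""
    merged = np.concatenate([graph_embedding, group_counts])
    return np.maximum(W_merge @ merged + b_merge, 0.0)
```

Under this reading, the extra channel acts as an inductive bias rather than new information, which matches the reviewer's "weighting enforced from the start" interpretation.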
Citation: https://doi.org/10.5194/egusphere-2025-1191-RC2 -
AC1: 'Response to reviewers of egusphere-2025-1191', Matteo Krüger, 15 Jul 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1191/egusphere-2025-1191-AC1-supplement.pdf