Similarity-Based Analysis of Atmospheric Organic Compounds for Machine Learning Applications

Sandström, Hilda; Rinke, Patrick

doi:10.48550/arXiv.2406.18171

Preprints

https://doi.org/10.48550/arXiv.2406.18171

09 Sep 2024

Abstract. The formation of aerosol particles in the atmosphere impacts air quality and climate change, but many of the organic molecules involved remain unknown. Machine learning could aid in identifying these compounds through accelerated analysis of molecular properties and detection characteristics. However, such progress is hindered by the current lack of curated datasets for atmospheric molecules and their associated properties. To tackle this challenge, we propose a similarity analysis that connects atmospheric compounds to existing large molecular datasets used for machine learning development. We find a small overlap between atmospheric and non-atmospheric molecules using standard molecular representations in machine learning applications. The identified out-of-domain character of atmospheric compounds is related to their distinct functional groups and atomic composition. Our investigation underscores the need for collaborative efforts to gather and share more molecular-level atmospheric chemistry data. The presented similarity-based analysis can be used for future dataset curation for machine learning development in the atmospheric sciences.

Received: 01 Aug 2024 – Discussion started: 09 Sep 2024

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

15 May 2025

Similarity-based analysis of atmospheric organic compounds for machine learning applications

Hilda Sandström and Patrick Rinke

Geosci. Model Dev., 18, 2701–2724, https://doi.org/10.5194/gmd-18-2701-2025, 2025

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-2432', Anonymous Referee #1, 19 Sep 2024

H. Sandström and P. Rinke conducted a similarity-based analysis of various datasets containing organic compounds. They highlighted the challenges posed by the lack of curated datasets for atmospheric molecules and aimed to connect atmospheric compounds with existing large molecular datasets. Their investigation revealed that atmospheric molecules have limited overlap with non-atmospheric compounds due to distinct functional groups and atomic compositions. They utilized two molecular similarity metrics, specifically t-SNE and the Tanimoto similarity index, to compare atmospheric datasets (Wang, Gecko, and Quinones) among themselves and with non-atmospheric datasets (including drug-like and metabolite compounds). Their findings emphasize the need for collaborative efforts to improve dataset curation in order to enhance machine learning applications in atmospheric sciences.
From my point of view, their manuscript is well-written. All methods are well explained and referenced, and the text is easy to read and understand. The data manipulation and presented results are sufficiently explained. I do have minor (or rather nitpicking) suggestions for improving the manuscript (see below). Nevertheless, I am very pleased to recommend this manuscript for publication.
COMMENTS:
1) Regarding Equation 1, it appears to be incorrect. Since the surrounding text and graphs make sense, I assume this is just a typo. Nevertheless, the correct equation should be:

- either: |A \cap B| / |A \cup B|

- or: |A \cap B| / (|A| + |B| - |A \cap B|)

- but not: |A \cap B| / (|A \cup B| - |A \cap B|)

2) The Tanimoto similarity distribution does indeed provide some information on the similarity between the two datasets. However, would it not be even more relevant for machine learning applications to compare the distributions of the highest Tanimoto similarity indices, taken between compounds from the analyzed dataset and all compounds in the reference dataset?

3) The last sentence of Section 2.1 is hard to follow (during the first reading). Please try to be more descriptive.
4) Could you please elaborate on the role of dataset size and diversity? How would the similarity comparison change if, for example, the MONA dataset were removed from the t-SNE analysis? Also, have you tried shuffling the datasets and comparing again? Would you obtain the same conclusions? The size and distance in the t-SNE analysis are not informative—does it even make sense to use this analysis for similarity comparison or any filtering? I ask this to understand whether Figures 6a and 6b are truly different due to the choice of different representations, or if the differences arise because t-SNE is highly sensitive to initial conditions.
5) Nitpicking note on Figure 4 caption: Functional groups which are present in at least 10% of a dataset are shown in c), but in the end you show even smaller fractions (i.e. peaks below 0.1), which just made me wonder whether I understand the graphs correctly.
6) Another nitpicking note: It would be nice if Figures 4a and 5a used the same bin sizes (and scales).
Very nice paper. Good luck with your science!
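
Comments 1 and 2 above can be illustrated with a minimal sketch. It is not the authors' code: fingerprints are represented here as hypothetical Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and the function names `tanimoto`, `tanimoto_alt` and `max_similarity_to_reference` are assumptions made for illustration. The sketch checks that the two correct forms of the index agree, shows that the questioned form gives a different value, and computes the highest similarity of each query compound to a reference set.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) index |A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def tanimoto_alt(a, b):
    """Equivalent form |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    common = len(a & b)
    denom = len(a) + len(b) - common
    return common / denom if denom else 0.0


def max_similarity_to_reference(queries, reference):
    """For each query fingerprint, the highest Tanimoto index to any reference fingerprint."""
    return [max(tanimoto(q, r) for r in reference) for q in queries]


# Toy fingerprints (hypothetical bit indices, not real molecules).
a = {1, 2, 3, 4}
b = {3, 4, 5}
print(tanimoto(a, b), tanimoto_alt(a, b))  # the two correct forms agree

# The form questioned in comment 1 divides by |A ∪ B| - |A ∩ B| instead.
questioned = len(a & b) / (len(a | b) - len(a & b))
print(questioned)  # larger than the Tanimoto index for these sets

reference = [{1, 2, 3}, {7, 8, 9}]
queries = [{1, 2, 3, 4}, {9, 10}]
print(max_similarity_to_reference(queries, reference))
```

The nearest-neighbour variant costs one comparison per query-reference pair, so for datasets of the sizes discussed here a vectorised or indexed implementation would likely be needed in practice.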

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC1
RC2: 'Comment on egusphere-2024-2432', Anonymous Referee #2, 20 Sep 2024

The manuscript by Sandström and Rinke investigates the similarity of organic compounds in multiple different datasets with focus on atmospheric oxidation products. In addition to comparing the molecular descriptors, the authors compared other molecular attributes between the datasets. The study shows how the compounds present in large data banks do not coincide with atmospherically relevant compounds. Therefore, these data banks are not sufficient training data for machine learning models in atmospheric studies. This is an important observation for future development of machine learning models and datasets compiled for the training of those models. I happily recommend that the article should be accepted after minor corrections.
General comments
1. Related to the first paragraph of page 14, how is the size of the datasets taken into account in the Tanimoto similarity analysis? For example, comparing Gecko and Wang, Gecko has 166434 compounds and Wang only 3414. It's obvious that a set of 166434 easily contains more compounds that are similar to others, because the total number is just so big. If you were to take 3414 of the most different compounds from the Gecko dataset, would the result be similar to the Wang-Wang Tanimoto distribution? Or the opposite, if you would increase the size of the Wang dataset to 166434 compounds, would it be possible to create an equally diverse set of atmospherically relevant oxidized organics? If the distributions were plotted without normalization, would the Gecko-Gecko distribution in the low similarity region still be higher in absolute values than the Wang-Wang distribution?
2. In the Tanimoto similarities, it would be interesting to see the percentages of the compounds in each of the similarity categories (low, intermediate, high).
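
One way to probe the two general comments above is sketched below, under stated assumptions. Fingerprints are again hypothetical sets of on-bit indices, the toy data are randomly generated rather than taken from Gecko or Wang, and the helper names are illustrative. The cut-offs of 0.1 (low) and 0.4 (high) are the reference values mentioned in this discussion. Note that the sketch draws a random subsample, which is a simpler check than selecting the most different compounds as the comment suggests.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto index for two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Tanimoto index for every unordered pair within one dataset."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


def category_percentages(similarities, low=0.1, high=0.4):
    """Percentage of pairs that are low (< low), high (>= high) or in between."""
    n = len(similarities)
    n_low = sum(s < low for s in similarities)
    n_high = sum(s >= high for s in similarities)
    return {
        "low": 100.0 * n_low / n,
        "intermediate": 100.0 * (n - n_low - n_high) / n,
        "high": 100.0 * n_high / n,
    }


def subsample(fingerprints, size, seed=0):
    """Random subsample without replacement, reproducible through the seed."""
    return random.Random(seed).sample(fingerprints, size)


# Toy stand-ins for a large and a small dataset (random bits, not real molecules).
rng = random.Random(42)
large = [set(rng.sample(range(64), 8)) for _ in range(200)]
small = subsample(large, 20, seed=1)

print(category_percentages(pairwise_similarities(large)))
print(category_percentages(pairwise_similarities(small)))
```

Repeating the subsampling with several seeds would give a spread for each percentage, which helps separate a genuine difference in diversity from a pure size effect.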
Specific comments
3. Page 1: "However, the underlying molecular-level processes involving organic molecules remain poorly understood, due to the vast number of organic compounds participating in atmospheric chemistry." For readers who are not familiar with atmospheric aerosol, add before this a sentence of how these organic compounds are connected to the aerosol particles you mention in the previous sentences (presumably SOA, since you talk about particle formation).

"human-based activities, like" -> "human-based activities, such as"

"Organic aerosol particle formation" -> "Secondary organic aerosol particle formation"
4. Page 2: "datasets like Gecko" -> "datasets such as Gecko"

"degradation of 143 atmospheric compounds" Can you be more specific? Are these all organics? Hydrocarbons/VOCs or already oxidized species?
5. Page 3: "In recent years, machine learning methods have shown promise..." Hyttinen et al., 2022 doesn't use machine learning methods.
6. Page 6: "We tested three different perplexity values of 5, 50 and 100." Since perplexity is an important hyperparameter in t-SNE, a short explanation of its meaning here would be useful.
7. Figures 4 and 5: Can you specify what the lines are in Figures 4b and 5b? Is the interval showing the range of ratios in the whole dataset? If yes, is the marker then the median? To my eye the markers seem to hit the center of the lines in all cases. Also, there are molecules in Gecko that have fewer O than C, right? If the lines are for the ranges, the O:C for Gecko seems off.
8. Page 8: "Oxygen-carrying groups like hydroxyls" -> "Oxygen-carrying groups such as hydroxyls"

"Functional groups like peroxides" -> "Functional groups such as peroxides"
9. Figure 8: Can you comment on why the Tanimoto similarity distributions with the MACCS fingerprint are so much less smooth compared to the topological fingerprint? Is it related to the size of the fingerprint? And would a larger bin size in these histograms be more convenient? I assume that the "noise" in the distributions doesn't really give any important information about the similarities.
10. Page 12: "Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes." In datasets of atmospheric compounds compared to the other datasets, right? Now it sounds like there aren't amines and amides in the atmosphere.
11. Page 13: "Furthermore, the similarity between molecular representations like fingerprints can unveil" -> "such as"
12. Page 15: "which can be characterized by properties like" -> "such as"
13. Figure 9: Add reference to GeckoQ. Also, use SI units instead of mbar in the x-axis label.
14. Page 16: "assessing not only the overlap of target values, but also to carefully examining" -> "not only assessing the overlap of target values, but also carefully examining"

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC2
RC3: 'Comment on egusphere-2024-2432', Jonas Elm, 02 Oct 2024

Sandström and Rinke investigate how closely atmospheric organic molecules resemble data in existing curated databases for machine learning (ML) applications. In particular they study the atmospheric Gecko dataset, the Wang dataset based on the master chemical mechanism (MCM) and a dataset consisting of quinones. These are compared to themselves and to the well-known QM9 dataset, as well as nablaDFT and MONA. To estimate the similarity between the datasets the authors apply a supervised ML method in the form of the Tanimoto index and an unsupervised ML method in the form of t-SNE clustering. Two different molecular representations are tested: The topological fingerprint and the MACCS fingerprint.
It is found that existing databases do not cover atmospheric organic molecules well. While this to some extent is no surprise, as these datasets are curated for vastly different purposes, it highlights the need for assembling specialized atmospheric databases in the future. Overall, this is very interesting work that builds upon the existing machine learning development in aerosol science, and the conclusion that more specialized atmospheric datasets are needed is a welcome appeal to the community.
The work is meticulously carried out, and the manuscript is well-written and easy to follow. Overall, the work fits well in Geoscientific Model Development, and I am happy to recommend the manuscript for publication, essentially as is. I only have a few minor comments. I emphasize that these are not demands and the authors are free to dismiss the requests if they deem it necessary.

Comments
Page 6: “We interpret our results by introducing a high and low similarity reference values. This choice is motivated by previous studies of Tanimoto similarity (Liu et al., 2018; Moret et al., 2023).”
I do not really have a gut feeling for the Tanimoto similarity values chosen as not similar (less than 0.1) and similar (0.4 or above). The authors mention that 0.4 or above has been shown to improve ML model performance. Can this value be quantified somehow in the form of the molecular structures? I.e. how similar/dissimilar should the structures be for these cut-off values? For instance, a simple example of some structures that corresponds to the different values would be helpful.

Page 6-7: “Both fingerprints have been used in atmospheric chemistry machine learning applications (Lumiaro et al., 2021; Besel et al., 2023, 2024) and are therefore pertinent for our comparison.”
Figure 6 shows the difference between the two chosen representations. As both of the applied descriptors are fingerprints, it would be interesting to perform a similar analysis based on another descriptor with a different architecture. In quantum chemical ML applications there are many possibilities such as the Coulomb matrix, SOAP, MBTR, FCHL, etc. Hence, could the authors speculate on how sensitive the similarity analysis is to the choice of descriptor architecture?

Page 12: “Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes.”
This is simply a fact of the Gecko, Wang and Quinone datasets not including such compounds. Perhaps, further stress that this indicates that such species should be present in atmospheric databases to have a versatile and representative atmospheric dataset.

Page 14: “In Figures 7 and 8, we observed that Gecko molecules exhibit greater similarity to each other, while the Wang compounds are more diverse.”
Is this not simply related to the relative size of the two datasets? In addition, too many similar structures in the dataset just leads to redundant structures and essentially overtraining on specific molecular features. Would a cleaned-up version of the Gecko dataset, where structurally too similar molecules are removed, be a better fit for future training of ML models?
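
The clean-up idea raised in this comment could look roughly like the greedy filter sketched below. This is an illustration only, not a procedure from the manuscript: fingerprints are hypothetical sets of on-bit indices, the function name `diversity_filter` is an assumption, and the threshold of 0.4 is borrowed from the "similar" reference value discussed above rather than recommended as a pruning cut-off.

```python
def tanimoto(a, b):
    """Tanimoto index for two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def diversity_filter(fingerprints, threshold=0.4):
    """Greedy filter: keep a fingerprint only if its similarity to every
    fingerprint kept so far is below the threshold."""
    kept = []
    for fp in fingerprints:
        if all(tanimoto(fp, k) < threshold for k in kept):
            kept.append(fp)
    return kept


# Toy fingerprints (hypothetical bit indices, not real molecules).
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},      # similarity 0.6 to the first entry, so it is dropped
    {10, 11, 12},
    {10, 11, 12, 13},  # similarity 0.75 to the third entry, so it is dropped
    {20, 21},
]
print(diversity_filter(fingerprints))
```

The result of such a greedy pass depends on the input order, and the cost grows quadratically with dataset size, so a dataset as large as Gecko would probably call for a more scalable selection scheme.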

Page 17: “Examples of such initiatives have recently been developed, such as the Aerosolomics project (Thoma et al., 2022).”
Perhaps explicitly specify that you are referring to experimental initiatives here. I would argue that our Atmospheric Cluster DataBase (ACDB), comprising the Clusteromics I-V and Clusterome datasets, serves a similar purpose, but from a computational point of view.

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC3
CEC1: 'Comment on egusphere-2024-2432: No compliancy with the policy of the journal', Juan Antonio Añel, 29 Oct 2024

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html

You have archived your code on a Git repository. However, Git repositories are not suitable for scientific publication. This flaw in your manuscript was already pointed out by the Topical Editor when you submitted your manuscript and before the Discussions stage. Despite this, you have failed to address and solve the issue, which is especially disappointing.
Therefore, you must publish your code in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. The current situation with your manuscript is therefore irregular. Also, please include the relevant primary input/output data.
Accordingly, if you do not fix this problem promptly, we will have to reject your manuscript for publication in our journal. I therefore request that you reply to this comment before the end of the Discussions period with the information (link and DOI) for the new repository that complies with the policy.
Also, no license is listed in the Git repository. If you do not include a license, the code remains your property and nobody can use it. Therefore, when uploading the model's code to the new repository, you may want to choose a free software/open-source (FLOSS) license. We recommend the GPLv3. You simply need to include the file 'https://www.gnu.org/licenses/gpl-3.0.txt' as LICENSE.txt with your code. Also, you can choose other options that Zenodo provides: GPLv2, Apache License, MIT License, etc.
Also, in any revised manuscript, you must include a modified 'Code and Data Availability' section containing the DOI of the code.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2024-2432-CEC1
- AC1: 'Reply on CEC1', Hilda Sandström, 29 Oct 2024
  Dear Juan A. Añel,
  Thank you for your email and for highlighting the compliance issue regarding our manuscript's associated code. We sincerely apologize for missing this important point raised by the Topical Editor and appreciate your guidance in addressing it.
  We have now uploaded the versions of the code used in our manuscript to Zenodo, and you can access them at the following links:
  Code Repository: DOI: 10.5281/zenodo.14007731, https://zenodo.org/records/14007731
  
  Additional Code: DOI: 10.5281/zenodo.14007835, https://zenodo.org/records/14007835
  
  Furthermore, the datasets utilized for our analysis are freely available, as detailed in our README.md file at this link: DOI: 10.5281/zenodo.14007731 (https://zenodo.org/records/14007731).
  We have also ensured that a proper license is included with the uploaded code, specifically the GPLv3 license, as recommended.
  Finally, we will update the 'Code and Data Availability' section in the manuscript to include the DOI of the newly uploaded code and any other necessary details.
  Thank you again for your valuable feedback. We look forward to your guidance on the next steps regarding our manuscript.
  Best regards,
  
  Hilda Sandström
  
  Citation: https://doi.org/10.5194/egusphere-2024-2432-AC1
  - CEC2: 'Reply on AC1', Juan Antonio Añel, 29 Oct 2024
    
    Dear authors,
    Many thanks for addressing this issue so quickly. We can now consider the current version of your manuscript to be in compliance with our code policy.
    Regards,
    Juan A. Añel
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2024-2432-CEC2

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-2432', Anonymous Referee #1, 19 Sep 2024

H. Sandstrom and P. Rinke conducted a study focused on the similarity-based analysis and its various datasetes containing organic compounds. They highlighted the challenges posed by the lack of curated datasets for atmospheric molecules and aimed to connect atmospheric compounds with existing large molecular datasets. Their investigation revealed that atmospheric molecules have limited overlap with non-atmospheric compounds due to distinct functional groups and atomic compositions. They utilized two molecular similarity metrics, specifically t-SNE and the Tanimoto similarity index, to compare atmospheric datasets (Wang, Gecko, and Quinones) between themseves and with non-atmospheric datasets (including drug-like and metabolite compounds). Their findings emphasize the need for collaborative efforts to improve dataset curation in order to enhance machine learning applications in atmospheric sciences.
From my point of view, their manuscript is well-written. All methods are well explained and referenced, and the text is easy to read and understand. The data manipulation and presented results are sufficiently explained. I do have minor (or rather nitpicking) suggestions for improving the manuscript (see below). Nevertheless, I am very pleased to recommend this manuscript for publication.
COMMENTS:
1) Regarding Equation 1, it appears to be incorrect. Since the surrounding text and graphs make sense, I assume this is just a typo. Nevertheless, the correct equation should be:

- either: |A \bigcap B| / |A \bigcup B|

- or: |A \bigcap B| / (|A| + |B| - |A \bigcap B|)

-but not: |A \bigcap B| / (|A \bigcup B| - |A \bigcap B|)

2) The Tanimoto similarity distribution does indeed provide some information on the similarity between the two datasets. However, would it not be even more relevant for machine learning applications to compare the distributions of the highest Tanimoto similarity indices, taken between compounds from the analyzed dataset and all compounds in the reference dataset?

3) The last sentence of Section 2.1 is hard to follow (during the first reading). Please try to be more descriptive.
4) Could you please elaborate on the role of dataset size and diversity? How would the similarity comparison change if, for example, the MONA dataset were removed from the t-SNE analysis? Also, have you tried shuffling the datasets and comparing again? Would you obtain the same conclusions? The size and distance in the t-SNE analysis are not informative—does it even make sense to use this analysis for similarity comparison or any filtering? I ask this to understand whether Figures 6a and 6b are truly different due to the choice of different representations, or if the differences arise because t-SNE is highly sensitive to initial conditions.
5) Nitpicking note on Figure 4 caption: Functional group which are at least in 10% of dataset are shown in c), but in the end you show even smaller fraction (i.e. peaks below 0.1), which just made me wonder whether I understand the graphs correctly.
6) Another nitpicking note: It would be nice if the figures 4a and 5a use the same bin sizes (and scales).
Very nice paper. Good luck with your science!

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC1
RC2: 'Comment on egusphere-2024-2432', Anonymous Referee #2, 20 Sep 2024

The manuscript by Sandström and Rinke investigates the similarity of organic compounds in multiple different datasets with focus on atmospheric oxidation products. In addition to comparing the molecular descriptors, the authors compared other molecular attributes between the datasets. The study shows how the compounds present in large data banks do not coincide with atmospherically relevant compounds. Therefore, these data banks are not sufficient training data for machine learning models in atmospheric studies. This is an important observation for future development of machine learning models and datasets compiled for the training of those models. I happily recommend that the article should be accepted after minor corrections.
General comments
1. Related to the first paragraph of page 14, how is the size of the datasets taken into account in the Tanimoto similarity analysis? For example comparing Gecko and Wang, Gecko has 166434 compounds and Wang only 3414. It's obvious that 166434 easily contains more compounds that are similar to others, because the total number is just so big. If you were to take 3414 of the most different compounds from the Gecko dataset, would the result be similar to the Wang-Wang Tanimoto distribution? Or the opposite, if you would increase the size of the Wang dataset to 166434 compounds, would it be possible to create equally diverse set of atmospherically relevant oxidized organics? If the distributions were plotted without normalization, would the Gecko-Gecko distribution in the low similarity region still be higher in absolute values than the Wang-Wang distribution?
2. In the Tanimoto similarities, it would be interesting to see the percentages of the compounds in each of the similarity categories (low, intermediate, high).
Specific comments
3. Page 1: "However, the underlying molecular-level processes involving organic molecules remain poorly understood, due to the vast number of organic compounds participating in atmospheric chemistry." For readers who are not familiar with atmospheric aerosol, add before this a sentence of how these organic compounds are connected to the aerosol particles you mention in the previous sentences (presumably SOA, since you talk about particle formation).

"human-based activities, like" -> "human-based activities, such as"

"Organic aerosol particle formation" -> "Secondary organic aerosol particle formation"
4. Page 2: "datasets like Gecko" -> "datasets such as Gecko"

"degradation of 143 atmospheric compounds" Can you be more specific? Are these all organics? Hydrocarbons/VOCs or already oxidized species?
5. Page 3: "In recent years, machine learning methods have shown promise..." Hyttinen et al., 2022 doesn't use machine learning methods.
6. Page 6: "We tested three different perplexity values of 5, 50 and 100." Since perplexity is an important hyperparameter in t-SNE, a short explanation of its meaning here would be useful.
7. Figures 4 and 5: Can you specify what the lines are in Figures 4b and 5b? Is the interval showing the range of ratios in the whole dataset? If yes, is the marker then the median? To my eye the markers seem to hit the center of the lines in all cases. Also, there are molecules in Gecko that have fewer O than C, right? If the lines are for the ranges, the O:C for Gecko seems off.
8. Page 8: "Oxygen-carrying groups like hydroxyls" -> "Oxygen-carrying groups such as hydroxyls"

"Functional groups like peroxides" -> "Functional groups such as peroxides"
9. Figure 8: Can you comment on why the Tanimoto similarity distributions with the MACCS fingerprint are so much less smooth compared to the topological fingerprint? Is it related to the size of the fingerprint? And would a larger bin size in these histograms be more convenient? I assume that the "noise" in the distributions doesn't really give any important information about the similarities.
10. Page 12: "Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes." In datasets of atmospheric compounds compared to the other datasets, right? Now it sounds like there aren't amines and amides in the atmosphere.
11. Page 13: "Furthermore, the similarity between molecular representations like fingerprints can unveil" -> "such as"
12. Page 15: "which can be characterized by properties like" -> "such as"
13. Figure 9: Add reference to GeckoQ. Also, use SI units instead of mbar in the x-axis label.
14. Page 16: "assessing not only the overlap of target values, but also to carefully examining" -> "not only assessing the overlap of target values, but also carefully examining"

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC2
RC3: 'Comment on egusphere-2024-2432', Jonas Elm, 02 Oct 2024

Sandström and Rinke investigate how closely atmospheric organic molecules resemble data in existing curated databases for machine learning (ML) applications. In particular they study the atmospheric Gecko dataset, the Wang dataset based on the master chemical mechanism (MCM) and a dataset consisting of quinones. These are compared to themselves and to the well-known QM9 dataset, as well as nablaDFT and MONA. To estimate the similarity between the datasets the authors apply a supervised ML method in the form of the Tanimoto index and an unsupervised ML method in the form of t-SNE clustering. Two different molecular representations are tested: The topological fingerprint and the MACCS fingerprint.
It is found that existing databases do not cover atmospheric organic molecules well. While this to some extent is no surprise, as these datasets are curated for vastly different purposes, it highlights the need for assembling specialized atmospheric databases in the future. Overall, this is very interesting work, that build upon the existing machine learning development in aerosol science and the conclusion that more specialized atmospheric datasets are needed is a welcoming appeal to the community.
The work is meticulously carried out, the manuscript is well-written and easy to follow. Overall, the work fits well in Geophysical Model Development, and I am happy to recommend the manuscript for publication, essentially as is. I only have a few minor comments. I emphasize that these are not demands and the authors are free to dismiss the requests if they deem it necessary.

Comments
Page 6: “We interpret our results by introducing a high and low similarity reference values. This choice is motivated by previous studies of Tanimoto similarity (Liu et al., 2018; Moret et al., 2023).”
I do not really have a gut feeling for the Tanimoto similarity values chosen as not similar (less than 0.1) and similar (0.4 or above). The authors mention that 0.4 or above has been shown to improve ML model performance. Can this value be quantified somehow in the form of the molecular structures? I.e. how similar/dissimilar should the structures be for these cut-off values? For instance, a simple example of some structures that corresponds to the different values would be helpful.

Page 6-7: “Both fingerprints have been used in atmospheric chemistry machine earning applications (Lumiaro et al., 2021; Besel et al., 2023, 2024) and are therefore pertinent for our comparison.”
Figure 6 shows the difference between the two chosen representations. As both of the applied descriptors are fingerprints, it interesting to have perform similar analysis based on another descriptor with different architecture. In quantum chemical ML applications there are many possibilities such as coulomb matrix, SOAP, MBTR, FCHL, ect. Hence, could the authors speculate on how sensitive the similarity analysis is to the choice of descriptor architecture?

Page 12: “Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes.”
This is simply a fact of the Gecko, Wang and Quinone datasets not including such compounds. Perhaps, further stress that this indicates that such species should be present in atmospheric databases to have a versatile and representative atmospheric dataset.

Page 14: “In Figures 7 and 8, we observed that Gecko molecules exhibit greater similarity to each other, while the Wang compounds are more diverse.”
Is this not simply related to the relative size of the two datasets? In addition, too many similar structures in the dataset just leads to redundant structures and essentially overtraining on specific molecular features. Would a cleaned-up version of the Gecko dataset, where structurally too similar molecules are removed, be a better fit for future training of ML models?

Page 17: “Examples of such initiatives have recently been developed, such as the Aerosolomics project (Thoma et al., 2022).”
Perhaps explicitly specify that you are referring to experimental initiatives here. I would argue that our Atmospheric Cluster DataBase (ACDB) comprising the Clusteromics I-V and Clusterome datasets serve a similar purpose, but from a computational point of view.

Citation: https://doi.org/10.5194/egusphere-2024-2432-RC3
CEC1:
'Comment on egusphere-2024-2432: No compliancy with the policy of the journal', Juan Antonio Añel, 29 Oct 2024

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html

You have archived your code in a Git repository. However, Git repositories are not suitable for scientific publication. This flaw in your manuscript was already pointed out by the Topical Editor when you submitted your manuscript, before the Discussions stage. Despite this, you have failed to address and solve the issue, which is especially disappointing.
Therefore, you must publish your code in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. The current situation with your manuscript is therefore irregular. Also, please include the relevant primary input/output data.
If you do not fix this problem promptly, we will have to reject your manuscript for publication in our journal. Therefore, I request that you reply to this comment before the end of the Discussions period with the information (link and DOI) for the new repository that complies with the policy.
Also, no license is listed in the Git repository. If you do not include a license, the code remains your property and nobody can use it. Therefore, when uploading the model's code to the new repository, you may want to choose a free software/open-source (FLOSS) license. We recommend the GPLv3. You simply need to include the file 'https://www.gnu.org/licenses/gpl-3.0.txt' as LICENSE.txt with your code. Alternatively, you can choose one of the other options that Zenodo provides: GPLv2, Apache License, MIT License, etc.
Finally, in any revised version of the manuscript, you must include a modified 'Code and Data Availability' section that gives the DOI of the code.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2024-2432-CEC1
- AC1:
  'Reply on CEC1', HILDA SANDSTRÖM, 29 Oct 2024
  Dear Juan A. Añel,
  Thank you for your email and for highlighting the compliance issue regarding our manuscript's associated code. We sincerely apologize for missing this important point raised by the Topic Editor and appreciate your guidance in addressing it.
  We have now uploaded the versions of the code used in our manuscript to Zenodo, and you can access them at the following links:
  Code Repository: DOI: 10.5281/zenodo.14007731, https://zenodo.org/records/14007731
  
  Additional Code: DOI: 10.5281/zenodo.14007835 , https://zenodo.org/records/14007835
  
  Furthermore, the datasets utilized for our analysis are freely available, as detailed in our README.md file at this link: DOI: 10.5281/zenodo.14007731 (https://zenodo.org/records/14007731).
  We have also ensured that a proper license is included with the uploaded code, specifically the GPLv3 license, as recommended.
  Finally, we will update the 'Code and Data Availability' section in the manuscript to include the DOI of the newly uploaded code and any other necessary details.
  Thank you again for your valuable feedback. We look forward to your guidance on the next steps regarding our manuscript.
  Best regards,
  
  Hilda Sandström
  
  Citation: https://doi.org/10.5194/egusphere-2024-2432-AC1
  - CEC2: 'Reply on AC1', Juan Antonio Añel, 29 Oct 2024
    
    Dear authors,
    Many thanks for addressing this issue so quickly. We can now consider the current version of your manuscript to be in compliance with our code policy.
    Regards,
    Juan A. Añel
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2024-2432-CEC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Hilda Sandström on behalf of the Authors (06 Dec 2024) Author's response Author's tracked changes Manuscript

ED: Publish subject to minor revisions (review by editor) (13 Jan 2025) by Sergey Gromov

Dear Hilda Sandström and Prof. Patrick Rinke,

After the internal discussion with the Executive Editors of the GMD, we have concluded that the manuscript needs a minor revision before it can be published. The reason is still the compliance with the GMD policies regarding the data used – we cannot accept references to pre-existing datasets in their current form (see the details below). Unfortunately, this implies some additional work on your side; fortunately, this will improve the publication quality and ensure reproducibility of the results.

Currently, the “Code and data availability” section of the MS reads: “The datasets used for this analysis are all freely available from original publications or the database website (see Table 1).” Unfortunately, referenced datasets are not available in compliance with our policy. I quote the decision of Juan A. Añel, the GMD Executive Editor who reviewed this case:
“For this work the authors use seven pre-existing datasets (they list them in Table 1) that they do not link or publish in one of the repositories that we can accept. In some cases (e.g. Wang) they only cite a paper, not a repository. These datasets are: Wang, Gecko, Quinones, QM9, nablaDFT, MassBank Europe, MassBank of North America. The authors should store in a permanent repository the data they have used from these datasets, or ask the owners or publishers of such datasets to do it. I understand that it is not their data, but if they can use them, probably they can republish them, at least the part that they have used. Therefore, given that the authors have submitted a new version of their work for a new round of reviews, before continuing with it, please, send it back to the authors to address the issues with the datasets.”

I therefore would like to ask you to resubmit the manuscript with the following amendments introduced:
- Either provide the reference to the data repository (containing archived data in the exact form and version you have used) or store in a permanent repository the data you have used from these datasets, or ask the owners or publishers of such datasets to do it. The requirements (specifically the archive standards) are listed on the GMD website at: https://www.geoscientific-model-development.net/policies/code_and_data_policy.html and should be satisfied for every dataset. The updated references may be accommodated in Table 1;
- I encourage making a copy of every used dataset, if such is attainable;
- Please combine the contents of the code and workflow datasets you provide (currently there are two separate Zenodo publications) and add a sufficient description of their contents (listing the title of the manuscript, I guess you can reuse some of the contents of your `readme.md` files). The dataset at https://zenodo.org/records/14007835 provides mere classes but not the main workflow/code to obtain the results you present.
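As an illustration of what archiving data "in exact form and version" can involve in practice, a checksum manifest can be stored next to the archived copies so that readers can verify them later. This is only a sketch under stated assumptions: the GMD policy does not prescribe this approach, and the function names, file name and file content below are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


# Demonstration on a temporary directory with one invented file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_subset.csv").write_bytes(b"smiles\nCCO\n")  # hypothetical content
    manifest = build_manifest(root)
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(manifest_json)
```

Uploading such a manifest together with the dataset copies would let anyone recompute the digests and confirm that the archived files match the ones used in the analysis.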

I am looking forward to receiving the revised manuscript,
With best regards,
S. Gromov


AR by Hilda Sandström on behalf of the Authors (22 Jan 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (18 Feb 2025) by Sergey Gromov

AR by Hilda Sandström on behalf of the Authors (23 Feb 2025) Author's response Manuscript

Journal article(s) based on this preprint

15 May 2025

Similarity-based analysis of atmospheric organic compounds for machine learning applications

Hilda Sandström and Patrick Rinke

Geosci. Model Dev., 18, 2701–2724, https://doi.org/10.5194/gmd-18-2701-2025, 2025


Model code and software

Atmospheric Compound Similarity Analysis Hilda Sandström https://gitlab.com/cest-group/atmospheric_compound_similarity_analysis

Interactive computing environment

Atmospheric Compound Similarity Analysis Hilda Sandström https://gitlab.com/cest-group/atmospheric_compound_similarity_analysis


Viewed

Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.

Total article views: 1,395 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,383	0	12	1,395	0	0


Views and downloads (calculated since 09 Sep 2024)

Month	HTML	XML	Total
Sep 2024	170	0	170
Oct 2024	60	0	60
Nov 2024	50	0	50
Dec 2024	20	0	20
Jan 2025	92	0	92
Feb 2025	48	0	48
Mar 2025	12	0	12
Apr 2025	12	0	12
May 2025	46	0	46
Jun 2025	34	12	46
Jul 2025	32	0	32
Aug 2025	80	0	80
Sep 2025	380	0	380
Oct 2025	26	0	26
Nov 2025	46	0	46
Dec 2025	62	0	62
Jan 2026	54	0	54
Feb 2026	54	0	54
Mar 2026	56	0	56
Apr 2026	30	0	30
May 2026	16	0	16
Jun 2026	3	0	3


Viewed (geographical distribution)


Total article views: 1,395 (including HTML, PDF, and XML) Thereof 1,395 with geography defined and 0 with unknown origin.


Latest update: 25 Jun 2026


Short summary

Machine learning has the potential to aid the identification of organic molecules involved in aerosol formation. Yet, progress is stalled by a lack of curated atmospheric molecular datasets. Here, we compared atmospheric compounds with large molecular datasets used in machine learning and found minimal overlap using similarity algorithms. Our result underlines the need for collaborative efforts to curate atmospheric molecular data to facilitate machine learning models in the atmospheric sciences.
