This work is distributed under the Creative Commons Attribution 4.0 License.
Similarity-Based Analysis of Atmospheric Organic Compounds for Machine Learning Applications
Abstract. The formation of aerosol particles in the atmosphere impacts air quality and climate change, but many of the organic molecules involved remain unknown. Machine learning could aid in identifying these compounds through accelerated analysis of molecular properties and detection characteristics. However, such progress is hindered by the current lack of curated datasets for atmospheric molecules and their associated properties. To tackle this challenge, we propose a similarity analysis that connects atmospheric compounds to existing large molecular datasets used for machine learning development. We find a small overlap between atmospheric and non-atmospheric molecules using standard molecular representations in machine learning applications. The identified out-of-domain character of atmospheric compounds is related to their distinct functional groups and atomic composition. Our investigation underscores the need for collaborative efforts to gather and share more molecular-level atmospheric chemistry data. The presented similarity-based analysis can be used for future dataset curation for machine learning development in the atmospheric sciences.
Status: open (until 04 Nov 2024)
RC1: 'Comment on egusphere-2024-2432', Anonymous Referee #1, 19 Sep 2024
H. Sandström and P. Rinke conducted a study focused on similarity-based analysis of various datasets containing organic compounds. They highlighted the challenges posed by the lack of curated datasets for atmospheric molecules and aimed to connect atmospheric compounds with existing large molecular datasets. Their investigation revealed that atmospheric molecules have limited overlap with non-atmospheric compounds due to distinct functional groups and atomic compositions. They utilized two similarity-analysis methods, specifically t-SNE and the Tanimoto similarity index, to compare atmospheric datasets (Wang, Gecko, and Quinones) among themselves and with non-atmospheric datasets (including drug-like and metabolite compounds). Their findings emphasize the need for collaborative efforts to improve dataset curation in order to enhance machine learning applications in atmospheric sciences.
From my point of view, their manuscript is well-written. All methods are well explained and referenced, and the text is easy to read and understand. The data manipulation and presented results are sufficiently explained. I do have minor (or rather nitpicking) suggestions for improving the manuscript (see below). Nevertheless, I am very pleased to recommend this manuscript for publication.
COMMENTS:
1) Regarding Equation 1, it appears to be incorrect. Since the surrounding text and graphs make sense, I assume this is just a typo. Nevertheless, the correct equation should be:
- either: |A ∩ B| / |A ∪ B|
- or: |A ∩ B| / (|A| + |B| − |A ∩ B|)
- but not: |A ∩ B| / (|A ∪ B| − |A ∩ B|)
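For concreteness, the equivalence of the two correct forms (and the discrepancy of the incorrect one) can be checked numerically. A minimal Python sketch, representing fingerprints as sets of on-bit indices (the example sets are invented for illustration):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def tanimoto_alt(a: set, b: set) -> float:
    """Equivalent form: |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical on-bit indices of two fingerprints
A = {1, 3, 5, 8}
B = {3, 5, 9}
assert tanimoto(A, B) == tanimoto_alt(A, B) == 0.4
# The incorrect form |A ∩ B| / (|A ∪ B| - |A ∩ B|) would give 2/3 here instead
```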
2) The Tanimoto similarity distribution does indeed provide some information on the similarity between the two datasets. However, would it not be even more relevant for machine learning applications to compare the distributions of the highest Tanimoto similarity indices, taken between compounds from the analyzed dataset and all compounds in the reference dataset?
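The nearest-neighbour variant suggested here can be sketched as follows; a minimal illustration in plain Python with set-based fingerprints (the dataset contents are invented):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def max_similarities(query_fps, reference_fps):
    """For each query fingerprint, keep only its highest Tanimoto
    similarity to any fingerprint in the reference dataset."""
    return [max(tanimoto(q, r) for r in reference_fps) for q in query_fps]

# Hypothetical analyzed dataset and reference dataset
queries = [{1, 2, 3}, {4, 5}]
refs = [{1, 2}, {2, 3, 4}, {5, 6}]
nearest = max_similarities(queries, refs)  # one value per query compound
```

The distribution of these maxima directly answers "how close is each compound to its best match in the reference set", which is often the quantity of interest when judging training-set coverage.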
3) The last sentence of Section 2.1 is hard to follow (during the first reading). Please try to be more descriptive.
4) Could you please elaborate on the role of dataset size and diversity? How would the similarity comparison change if, for example, the MONA dataset were removed from the t-SNE analysis? Also, have you tried shuffling the datasets and comparing again? Would you obtain the same conclusions? The size and distance in the t-SNE analysis are not informative: does it even make sense to use this analysis for similarity comparison or any filtering? I ask this to understand whether Figures 6a and 6b are truly different due to the choice of different representations, or if the differences arise because t-SNE is highly sensitive to initial conditions.
5) Nitpicking note on the Figure 4 caption: functional groups that occur in at least 10% of the dataset are shown in c), but in the end you show even smaller fractions (i.e., peaks below 0.1), which made me wonder whether I understand the graphs correctly.
6) Another nitpicking note: it would be nice if Figures 4a and 5a used the same bin sizes (and scales).
Very nice paper. Good luck with your science!
Citation: https://doi.org/10.5194/egusphere-2024-2432-RC1
RC2: 'Comment on egusphere-2024-2432', Anonymous Referee #2, 20 Sep 2024
The manuscript by Sandström and Rinke investigates the similarity of organic compounds in multiple different datasets, with a focus on atmospheric oxidation products. In addition to comparing the molecular descriptors, the authors compared other molecular attributes between the datasets. The study shows how the compounds present in large data banks do not coincide with atmospherically relevant compounds. Therefore, these data banks are not sufficient training data for machine learning models in atmospheric studies. This is an important observation for future development of machine learning models and datasets compiled for the training of those models. I happily recommend that the article should be accepted after minor corrections.
General comments
1. Related to the first paragraph of page 14, how is the size of the datasets taken into account in the Tanimoto similarity analysis? For example, comparing Gecko and Wang, Gecko has 166434 compounds and Wang only 3414. It's obvious that a set of 166434 compounds easily contains more compounds that are similar to others, simply because the total number is so big. If you were to take the 3414 most different compounds from the Gecko dataset, would the result be similar to the Wang-Wang Tanimoto distribution? Or the opposite: if you were to increase the size of the Wang dataset to 166434 compounds, would it be possible to create an equally diverse set of atmospherically relevant oxidized organics? If the distributions were plotted without normalization, would the Gecko-Gecko distribution in the low-similarity region still be higher in absolute values than the Wang-Wang distribution?
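Selecting "the most different compounds" as proposed above is commonly done with greedy MaxMin (farthest-point) picking, a standard diversity-selection heuristic; a minimal sketch with set-based fingerprints (fingerprint data invented for illustration):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def maxmin_pick(fps, k):
    """Greedily select k fingerprint indices, each step adding the
    compound least similar (lowest max Tanimoto) to those already picked."""
    picked = [0]  # seed with the first compound
    while len(picked) < k:
        best, best_score = None, None
        for i in range(len(fps)):
            if i in picked:
                continue
            # similarity to the closest already-picked compound
            score = max(tanimoto(fps[i], fps[j]) for j in picked)
            if best_score is None or score < best_score:
                best, best_score = i, score
        picked.append(best)
    return picked

# Hypothetical fingerprints; two near-duplicate pairs and one outlier
fps = [{1, 2}, {1, 2, 3}, {7, 8}, {7, 9}, {1, 8}]
subset = maxmin_pick(fps, 3)  # skips the near-duplicates
```

Comparing the Tanimoto distribution of such a diversity-matched subset against the Wang-Wang distribution would separate the size effect from a genuine diversity difference.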
2. In the Tanimoto similarities, it would be interesting to see the percentages of the compounds in each of the similarity categories (low, intermediate, high).
Specific comments
3. Page 1: "However, the underlying molecular-level processes involving organic molecules remain poorly understood, due to the vast number of organic compounds participating in atmospheric chemistry." For readers who are not familiar with atmospheric aerosol, add before this a sentence of how these organic compounds are connected to the aerosol particles you mention in the previous sentences (presumably SOA, since you talk about particle formation).
"human-based activities, like" -> "human-based activities, such as"
"Organic aerosol particle formation" -> "Secondary organic aerosol particle formation"
4. Page 2: "datasets like Gecko" -> "datasets such as Gecko"
"degradation of 143 atmospheric compounds" Can you be more specific? Are these all organics? Hydrocarbons/VOCs or already oxidized species?
5. Page 3: "In recent years, machine learning methods have shown promise..." Hyttinen et al., 2022 doesn't use machine learning methods.
6. Page 6: "We tested three different perplexity values of 5, 50 and 100." Since perplexity is an important hyperparameter in t-SNE, a short explanation of its meaning here would be useful.
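For reference, t-SNE's perplexity is defined as 2 raised to the Shannon entropy of the conditional neighbour distribution, and is commonly read as a smooth effective number of nearest neighbours each point attends to. A minimal sketch of that definition (distribution values invented):

```python
import math

def perplexity(p):
    """Perplexity = 2**H(P), with Shannon entropy H in bits."""
    h = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return 2 ** h

# A uniform distribution over k neighbours has perplexity exactly k,
# matching the "effective neighbourhood size" reading of the parameter.
uniform = [0.25] * 4
assert abs(perplexity(uniform) - 4.0) < 1e-12
```

This is why the tested values 5, 50 and 100 roughly correspond to emphasizing local versus increasingly global structure in the embedding.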
7. Figures 4 and 5: Can you specify what the lines are in Figures 4b and 5b? Is the interval showing the range of ratios in the whole dataset? If yes, is the marker then the median? To my eye the markers seem to hit the center of the lines in all cases. Also, there are molecules in Gecko that have fewer O than C, right? If the lines are for the ranges, the O:C for Gecko seems off.
8. Page 8: "Oxygen-carrying groups like hydroxyls" -> "Oxygen-carrying groups such as hydroxyls"
"Functional groups like peroxides" -> "Functional groups such as peroxides"
9. Figure 8: Can you comment on why the Tanimoto similarity distributions with the MACCS fingerprint are so much less smooth compared to the topological fingerprint? Is it related to the size of the fingerprint? And would a larger bin size in these histograms be more convenient? I assume that the "noise" in the distributions doesn't really give any important information about the similarities.
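One plausible explanation for the coarser MACCS histograms is a bit-count effect: with an n-bit fingerprint the Tanimoto index can only take finitely many rational values, and the 166-bit MACCS keys admit far fewer distinct values than a 2048-bit topological fingerprint, so counts pile up on a sparser grid. A small enumeration sketch:

```python
from fractions import Fraction

def distinct_tanimoto_values(n):
    """Count distinct values |A∩B| / |A∪B| achievable with n-bit
    fingerprints, by enumerating all (intersection, union) pairs."""
    values = {Fraction(0)}
    for union in range(1, n + 1):          # possible |A ∪ B|
        for inter in range(0, union + 1):  # possible |A ∩ B|
            values.add(Fraction(inter, union))
    return len(values)

# The set of reachable similarity values grows rapidly with bit count,
# so histograms over short fingerprints look "spiky" at fixed bin width.
counts = {n: distinct_tanimoto_values(n) for n in (8, 16, 32)}
```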
10. Page 12: "Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes." In datasets of atmospheric compounds compared to the other datasets, right? Now it sounds like there aren't amines and amides in the atmosphere.
11. Page 13: "Furthermore, the similarity between molecular representations like fingerprints can unveil" -> "such as"
12. Page 15: "which can be characterized by properties like" -> "such as"
13. Figure 9: Add reference to GeckoQ. Also, use SI units instead of mbar in the x-axis label.
14. Page 16: "assessing not only the overlap of target values, but also to carefully examining" -> "not only assessing the overlap of target values, but also carefully examining"
Citation: https://doi.org/10.5194/egusphere-2024-2432-RC2
RC3: 'Comment on egusphere-2024-2432', Jonas Elm, 02 Oct 2024
Sandström and Rinke investigate how closely atmospheric organic molecules resemble data in existing curated databases for machine learning (ML) applications. In particular, they study the atmospheric Gecko dataset, the Wang dataset based on the Master Chemical Mechanism (MCM), and a dataset consisting of quinones. These are compared to themselves and to the well-known QM9 dataset, as well as nablaDFT and MONA. To estimate the similarity between the datasets, the authors apply the Tanimoto similarity index and an unsupervised ML method in the form of t-SNE clustering. Two different molecular representations are tested: the topological fingerprint and the MACCS fingerprint.
It is found that existing databases do not cover atmospheric organic molecules well. While this is to some extent no surprise, as these datasets are curated for vastly different purposes, it highlights the need for assembling specialized atmospheric databases in the future. Overall, this is very interesting work that builds upon the existing machine learning development in aerosol science, and the conclusion that more specialized atmospheric datasets are needed is a welcome appeal to the community.
The work is meticulously carried out, and the manuscript is well-written and easy to follow. Overall, the work fits well in Geoscientific Model Development, and I am happy to recommend the manuscript for publication, essentially as is. I only have a few minor comments. I emphasize that these are not demands, and the authors are free to dismiss the requests if they deem it necessary.
Comments
Page 6: “We interpret our results by introducing a high and low similarity reference values. This choice is motivated by previous studies of Tanimoto similarity (Liu et al., 2018; Moret et al., 2023).”
I do not really have a gut feeling for the Tanimoto similarity values chosen as not similar (less than 0.1) and similar (0.4 or above). The authors mention that 0.4 or above has been shown to improve ML model performance. Can this value be quantified somehow in the form of the molecular structures? I.e. how similar/dissimilar should the structures be for these cut-off values? For instance, a simple example of some structures that corresponds to the different values would be helpful.
Page 6-7: “Both fingerprints have been used in atmospheric chemistry machine learning applications (Lumiaro et al., 2021; Besel et al., 2023, 2024) and are therefore pertinent for our comparison.”
Figure 6 shows the difference between the two chosen representations. As both of the applied descriptors are fingerprints, it would be interesting to perform a similar analysis based on another descriptor with a different architecture. In quantum chemical ML applications there are many possibilities, such as the Coulomb matrix, SOAP, MBTR, FCHL, etc. Hence, could the authors speculate on how sensitive the similarity analysis is to the choice of descriptor architecture?
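Of the alternatives mentioned, the Coulomb matrix is the simplest to sketch: diagonal entries 0.5 * Z_i**2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. A minimal pure-Python version (the diatomic geometry below is invented for illustration):

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix descriptor:
    M[i][i] = 0.5 * Z_i**2.4
    M[i][j] = Z_i * Z_j / |R_i - R_j|  for i != j
    """
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4
            else:
                m[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return m

# Hypothetical C-O pair separated by 1.2 (arbitrary length units)
M = coulomb_matrix([6, 8], [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)])
```

Unlike binary fingerprints, this descriptor is geometry-based and real-valued, so set-overlap similarity measures such as Tanimoto no longer apply directly; kernel-based distances would be the natural counterpart, which illustrates why the similarity analysis could be sensitive to descriptor architecture.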
Page 12: “Our comparison of nitrogen-containing functional groups instead revealed a lack in amine and amide content in atmospheric compounds compared to the other compound classes.”
This is simply a fact of the Gecko, Wang and Quinone datasets not including such compounds. Perhaps further stress that this indicates such species should be present in atmospheric databases to have a versatile and representative atmospheric dataset.
Page 14: “In Figures 7 and 8, we observed that Gecko molecules exhibit greater similarity to each other, while the Wang compounds are more diverse.”
Is this not simply related to the relative size of the two datasets? In addition, too many similar structures in the dataset just leads to redundant structures and essentially overtraining on specific molecular features. Would a cleaned-up version of the Gecko dataset, where structurally too similar molecules are removed, be a better fit for future training of ML models?
Page 17: “Examples of such initiatives have recently been developed, such as the Aerosolomics project (Thoma et al., 2022).”
Perhaps explicitly specify that you are referring to experimental initiatives here. I would argue that our Atmospheric Cluster DataBase (ACDB) comprising the Clusteromics I-V and Clusterome datasets serve a similar purpose, but from a computational point of view.
Citation: https://doi.org/10.5194/egusphere-2024-2432-RC3
Model code and software
Atmospheric Compound Similarity Analysis Hilda Sandström https://gitlab.com/cest-group/atmospheric_compound_similarity_analysis
Interactive computing environment
Atmospheric Compound Similarity Analysis Hilda Sandström https://gitlab.com/cest-group/atmospheric_compound_similarity_analysis
Viewed
Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
92 | 0 | 0 | 92 | 0 | 0
Viewed (geographical distribution)
- Total: 0
- HTML: 0
- PDF: 0
- XML: 0