the Creative Commons Attribution 4.0 License.
Technical note: Towards atmospheric compound identification in chemical ionization mass spectrometry with machine learning
Abstract. Chemical ionization mass spectrometry (CIMS) is widely used in atmospheric chemistry studies. However, due to the complex interactions between reagent ions and target compounds, chemical understanding remains limited and compound identification difficult. In this study, we apply machine learning to a reference dataset of pesticides in two standard solutions to build a model that can provide insights from CIMS analyses in atmospheric science. The CIMS measurements were performed with an orbitrap mass spectrometer coupled to a thermal desorption multi-scheme chemical ionization inlet unit (TD-MION-MS) with both negative and positive ionization modes utilizing Br-, O2-, H3O+ and (CH3)2COH+ (AceH+) as reagent ions. We then trained two machine learning methods on this data: 1) random forest (RF) for classifying if a pesticide can be detected with CIMS, and 2) kernel ridge regression (KRR) for predicting the expected CIMS signals. We compared their performance on five different representations of the molecular structure: the topological fingerprint (TopFP), the molecular access system keys (MACCS), a custom descriptor based on standard molecular properties (RDKitPROP), the Coulomb matrix (CM) and the many-body tensor representation (MBTR). The results indicate that MACCS outperforms the other descriptors. Our best classification model reaches a prediction accuracy of 0.85 ± 0.02 and a receiver operating characteristic curve area of 0.91 ± 0.01. Our best regression model reaches an accuracy of 0.44 ± 0.03 logarithmic units of the signal intensity. Subsequent feature importance analysis of the classifiers reveals that the most important structural fragments are NH and OH for the negative ionization schemes and nitrogen-containing groups for the positive ionization schemes.
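As a reading aid for the two classification figures quoted in the abstract (accuracy and area under the receiver operating characteristic curve), the sketch below computes both from a list of detection labels and classifier scores. The labels, scores, the 0.5 decision threshold, and the function names are illustrative assumptions, not the authors' data or code.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of compounds whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a detected compound is scored
    higher than an undetected one (ties count one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one detected and one undetected compound")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = detected, 0 = not detected.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.1]

print(accuracy(labels, scores))
print(roc_auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes how well the scores rank detected above undetected compounds across all thresholds, which is why the two numbers can differ.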
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1846', Anonymous Referee #1, 15 Sep 2024
Review Comments: Towards Atmospheric Compound Identification in Chemical Ionization Mass Spectrometry with Machine Learning
This paper uses analysis of pesticide standard materials by a variety of chemical ionization techniques to test the utility of machine learning methods in identifying a) whether a given molecule will be identifiable by a given CIMS ionization technique and b) how sensitive the CIMS technique will be to said detectable compound. Additional analysis investigates what structural characteristics (via elements of molecular descriptors) drive these capabilities. I congratulate the authors on the extremely impressive battery of instrumentation and analytical techniques used in this work and believe that it has the potential to significantly benefit the atmospheric chemistry community. In its current form, however, the manuscript has significant issues in framing that must be addressed before I can recommend it for publication. First, the title and framing of this work must be updated to reflect that this work represents an analysis of pesticides, which are not representative of atmospheric organics. Second, although this is a technical report, the authors must clearly articulate the goal and justification for the many methods applied. Third, references and comparisons to alternative methods, including fundamental chemistry related to proton affinity and other aspects of ionization chemistry that are typically used in identifying whether or not a given analyte is likely to be detectable by a given ionization method, are almost completely excluded. These comparisons form a critical foundation in establishing if and how these methods may be useful to atmospheric chemists, and would be an invaluable sanity check on whether the machine learning methods used in this study are successfully identifying known reaction phenomena. In its current form, the manuscript’s focus is so broad and technical, and references to the potential use cases of the different predictions are so incompletely addressed, that the potential utility of the methods used is extremely difficult to identify.
Overall, I would suggest that the authors consider a broader atmospheric chemistry audience in this work's structure, meaning that a justification of why each method is selected, what it is intended to predict, and how those predictions would be useful for a broader atmospheric chemistry community should be included at the beginning of each section. Finally, as this work exclusively operates in a forward direction from a known compound to predict its detection by CIMS, either additional analysis must be performed to evaluate how it might operate backwards from CIMS data to a prediction of identity, or the paper title must be re-framed away from a claim of providing compound identification. I again applaud the authors in this extremely impressive body of work and wish them success in re-framing the manuscript.
Major Comments:
- Pesticides do not present a representative sample of atmospheric compound composition; for example, they are significantly biased towards halogenated species and phosphates. I recommend that the title and focus of the manuscript be altered to reflect the pesticide focus area. The implications for predicting properties of atmospheric compounds more generally should be described in a conclusions or implications section.
- How was fragmentation accounted for when determining whether or not a species was detectable? Was only the parent ion included in the search?
- Line 114: Please justify this assertion. Could the differences in intensity instead reflect differences in the kinetics or thermodynamics of reactions leading to differing yields of charged parent ions?
- Line 126: Please justify why the ability of t-SNE to cluster compounds by the ionization chemistry with which they can be analyzed is a justification of the ML approach. What underlying characteristics of the molecules does this reflect? Molecular characteristics, rather than the appropriate ionization chemistry for a given analyte, are the stated methodological aim of this work and would be a more compelling target property for clustering or modelling.
- Please be more explicit regarding what characteristics of the molecules each descriptor is intended to elucidate, why they are important for the atmospheric chemistry community, and how they are not currently adequately approximated by current CIMS analysis methods. There are so many methods applied that the goal is getting lost. The molecular descriptors section would be significantly improved by a clear explanation of each descriptor for which modelling was attempted and a justification for why being able to predict this property would be useful for the broader atmospheric chemistry community. This should come at the beginning of the molecular descriptors section (before section 3.1).
- Section 4.3: Can you please explain the performance metrics more clearly? Is each prediction actually a Boolean yes-no of whether the model correctly predicted the molecular descriptor? Is there no nuance on whether a prediction is “closer” or “farther off”? Can you please clarify how “undetected” compounds are being included in this analysis? Earlier, the same wording was used to describe compounds that were excluded from analysis.
- Section 5: Please provide some reference to previous methods and potential use cases when assessing performance- for example, what methods are currently used to identify whether a given molecule will be detectable by CIMS? Under what circumstances would this information be valuable?
- 3 chemical insight: Please clarify what chemical insight you are looking for here. I think it is chemical insight into why various molecules are or are not detectable by different ionization chemistries. This needs to be explicitly stated and explained and should be the primary focus of this section. Please also make at least some reference to the fundamental chemical mechanisms at play and how they do or do not agree with the features your model is identifying as important in predicting whether or not a given molecule will be detectable by CIMS.
- Section 6: This method works in a forward direction from a known compound to predict whether or not that compound will be detectable by CIMS, and if so how sensitive different CIMS mechanisms will be to said molecule. The conclusion states that this method will be useful in identifying atmospheric compounds, which inherently operates in a backwards direction from a detected CIMS exact mass and intensity to the identity, structure, and quantity of an unknown compound. Please explicitly state the suggested use case for this method, as it is not clear to me how the process could be reversed to assist in identification of atmospheric organics in a complicated ambient environment.
- In the section describing the importance factors for the RF models, please provide some context on the fundamental underlying chemistry as to whether the machine learning model is identifying predictors as important that are known to be important drivers of CIMS ionization chemistry.
Minor Comments:
- Lines 34-42: Previous work on characterizing the chemical properties of atmospheric compounds via machine learning should be included here (Sandstrom et al., 2024; Besel et al., 2024; others).
- Please address the treatment of isomers more completely. How many formulae corresponded to multiple isomers? How were the properties of the isomers amalgamated?
- Line 93: What characteristics did the undetectable pesticides share? How did these compare to the detectable species? What does this mean for the biases of this method when applied to atmospheric organics or pesticide mixtures? Can you please differentiate between the undetected and detected species as described in Figure 2? Why does panel c appear to show an about equal number of undetected and detected species when the text states that most species were detectable? If panel c illustrates which species are detectable by a single method, rather than defining “undetectable” as not detectable by any method, this should be explained more clearly and the color scheme should be changed to reflect the different definition of “undetectable”.
- Please explain the purpose of Figure 3, panel a, more clearly. I am not clear on what it illustrates.
- Line 285: I think there are some typos in this sentence, and I am not sure what it is meant to say.
- Figure 7: Can MAE be appropriately calculated and compared on log-normal data? Please justify.
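To make the question about MAE on log-scale data concrete, the sketch below computes an MAE on log10-transformed intensities and converts it into a multiplicative factor. The intensity values and function names are invented for illustration; only the 0.44 log-unit figure comes from the abstract. Under these assumptions, an MAE in log10 units reads as a typical fold error between predicted and measured signal, not as an additive error.

```python
import math


def log10_mae(measured, predicted):
    """Mean absolute error between log10-transformed intensities."""
    errors = [abs(math.log10(p) - math.log10(m))
              for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors)


def fold_error(mae_log10):
    """Multiplicative factor corresponding to an MAE in log10 units
    (the geometric mean of the per-compound fold errors)."""
    return 10 ** mae_log10


# Invented intensities spanning several orders of magnitude.
measured = [1e3, 1e4, 1e5, 1e6]
predicted = [2e3, 5e3, 1e5, 4e6]

mae = log10_mae(measured, predicted)
print(round(mae, 3), round(fold_error(mae), 2))

# The abstract's 0.44 log units corresponds to roughly a factor of 2.75.
print(round(fold_error(0.44), 2))
```

Because the error is taken on logarithms, over- and under-prediction by the same factor contribute equally, which is the usual argument for this metric when intensities span orders of magnitude.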
Citation: https://doi.org/10.5194/egusphere-2024-1846-RC1
RC2: 'Comment on egusphere-2024-1846', Anonymous Referee #2, 27 Sep 2024
General comments
Bortolussi and coauthors present machine learning methods that will be of great utility to users of chemical ionization mass spectrometry (CIMS) for atmospheric chemical detection. The methods outlined here are described well and the applications will allow atmospheric scientists to calculate the feasibility of a CIMS technique in detecting a potential analyte without the need for a detailed understanding of ion-molecule interactions or quantum mechanical calculations for estimating binding enthalpies, proton affinities, or electron affinities. One major comment is in regards to general applicability for the community. As written, it is not clear how a field atmospheric chemist can use these outlined methods with their own instrument without first calibrating for a massive suite of compounds. There is only brief mention that these methods can be used “prior to deployment”. Is the idea that eventually a large library can be generated using a growing training set (beyond pesticides) and then one can generally search if a reagent ion can detect an analyte with a considerable enough relative signal intensity? I also understand intensities vary across instruments which I address in the specific comments. I have other specific comments below, mainly regarding clarification on ion chemistry, fragmentation, superiority over previous methods and linking to prior research, and general applicability that should be addressed before publication. Again, I think the methods are well-outlined and this is a helpful guide for understanding reagent ion selectivity and CIMS detection sensitivity in a more applied way. It would be great to eventually see this extended to other popular CIMS reagent ions like I-, NH4+, and benzene cluster cations in other work.
Specific Comments
Lines 40-41: There are quite a few CIMS datasets now, and each set generates a massive amount of data. The issue is more that they are not publicly available and there are not many detailed standard data sets. I would remove the word “data” and keep data standards. Further, you mention fragmentation but there is no method to address fragmentation patterns in this manuscript.
Lines 45-46: You should elaborate on why you chose pesticides rather than more ubiquitous atmospheric compounds that represent more of the reactive gas abundance. Are pesticides diverse, and do they represent a range of functional groups that can be detected with CIMS? Are they obscure and thus a good training set for difficult-to-identify compounds?
Lines 74-76: Why are the routes of ionization specified for H3O+ and O2- but not Br- and AceH+? Also, can you list somewhere in the manuscript or supplement the experimental conditions for your ion-molecule reactions in the front end of your instrument, i.e. the temperature and pressure of your reaction chamber? Many species detected with O2- are generated as secondary compounds with signal that scales with front end pressure (and frequency of collisions) so this would be a helpful reference.
Lines 85-86: The number of isomers should be listed. If there are many isomers then wouldn’t keeping isomers in the dataset introduce considerable uncertainty for your performance metrics and chemical insights section?
Lines 88-89: Was there any evidence that these species underwent fragmentation for the four ionization methods? A strong application of the work in this manuscript could be to identify fragments that may contribute signal artifacts to other quantified species at presumed parent masses. Using this method as is for predicting intensities at a parent ion mass and applying to the field can get complicated when there are contributing fragments. Your manuscript title asserting identification is not entirely true without considering potential fragments.
Line 101-102: You should specify that the negative ionization methods are more selective than positive ones for detecting parent masses of this specific suite of molecules.
Line 114-115: Can you please specify what you mean by “different parts of the target molecules”?
Line 123-124: This seems consistent with intuition for CIMS reagent ion selectivity. You should tie in previous work using Br- and I- reagent ions for detecting these types of compounds to support this and explain why this method for selecting a reagent ion is superior to calculating proton affinities, electron affinities, or binding enthalpies. One may prefer to learn Gaussian and run a few relatively quick simulations rather than collect an extensive calibration suite and apply these methods. I understand the ML methods presented here can offer more detailed insight than compound to compound QM simulations but since this is a technical note presenting developments in compound detection with CIMS, a technical comparison is warranted. Is this method more straightforward to implement AND more accurate than previous methods? This could be tied into your performance metrics section since the accuracy looks pretty good.
Line 293-295: CIMS signal intensity should only correlate to binding strength of an analyte and reagent ion for reagent ions that undergo adduct formation. In this case, that is only true for Br- and not applicable to the other reagent ions. This sentence should either be revised to apply to only Br-, offer another explanation for why all descriptors perform similarly, or remove this statement.
Line 344-345: I am a little confused by the repeated mention of binding strength for all reagent ions, since only signals from Br- ionization should be dependent on binding enthalpies (like the Iyer et al. 2016 paper for I- ionization cited in this manuscript). For “binding”, are you more referring to an orientation that increases the likelihood of ion-molecule collisions, resulting in ionized analyte? In this example you mention van der Waals binding for positive ion species, but is there a role for the higher molecular weights to correlate with proton affinity due to an increased availability of electron density across a larger molecular structure for molecules with similar functional groups? Although this gets complicated when considering fragmentation for larger molecules. In general, I think you need to clarify throughout what is meant by “binding” vs. actual ionization mechanisms.
I am not familiar with orbitrap mass spectrometers as routine field instruments. Do you expect applicability to change across CIMS instruments, particularly for routinely used, but lower signal and resolution field deployable time-of-flight mass spectrometers? You do a good job of keeping intensities and absolute differences in log space as a way of normalizing signal, but would considerably lowering the resolution narrow your training data set and limit predictions? In other words, do you need the resolution of an orbitrap for your compounds? Some mention of general applicability to other instruments would be helpful.
In general, did you test your ML methods for compounds not identified in your training set? You generate predictors and accuracy metrics for already identified species that you train with. What happens if you predict the signal intensity for a given concentration of a compound not in your set? I think that would be the eventual, broader application of these methods. It would also assert the superiority of the presented methods over previous ones, in addition to showing how it could advance our understanding of ion-molecule interactions in CIMS.
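One generic way to check predictions for compounds outside the training set is k-fold cross-validation at the compound level, sketched below. This is only an illustration of the idea raised in the comment above: the compound names, the fold count, and the function name are hypothetical, and nothing here is taken from the manuscript's actual validation procedure.

```python
import random


def k_fold_splits(compounds, k=5, seed=0):
    """Yield (train, test) lists so that each compound is held out exactly once."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled compounds into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test


# Hypothetical compound identifiers.
compounds = [f"compound_{i}" for i in range(12)]
splits = list(k_fold_splits(compounds, k=4))

for train, test in splits:
    # A model would be fitted on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

Each compound is predicted only by a model that never saw it during fitting, so the averaged test error estimates performance on unseen compounds from the same chemical space.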
Technical corrections
Line 25-26: The sentence about MIONs seems out of place in that paragraph. It should be merged more with CIMS benefits rather than come right after “but compound identification remains challenging”.
Line 171: You should specify that the Coulomb matrix is M.
Line 336-337: Please explain a little more what you mean by “… if the polarity is right…”.
Line 284-285: This last sentence needs to be rewritten. I am assuming you mean that you are attributing the variance in the learning curves to the small size of the datasets?
Citation: https://doi.org/10.5194/egusphere-2024-1846-RC2
AC1: 'Comment on egusphere-2024-1846', Federica Bortolussi, 16 Nov 2024