the Creative Commons Attribution 4.0 License.
An Algorithm for Automatic Fitting and Formula Assignment in Atmospheric Mass Spectra
Abstract. Mass spectrometry is an established method for studying the chemical composition of gases and particles in the atmosphere. Using this technique, signals corresponding to thousands, or even tens of thousands, of compounds may be detected in ambient air. Identifying all the peaks in the mass spectra is often arduous and time-consuming, particularly when multiple overlapping peaks are present. Manual peak fitting and identification may take even experienced analysts anywhere from weeks to months, depending on the desired accuracy and completeness.
In this work, we attempted to automate the fitting and formula assignment workflow and to evaluate how far the process can get using a "one button" algorithm. The algorithm takes commonly known, instrument-specific parameters as input and, at the press of one button, runs and provides a list of likely peaks for the mass spectrum. It uses weighted least squares fitting and a modified version of the Bayesian information criterion, along with an iterative formula assignment process. We applied it to synthetic mass spectra, a gas-phase chemical ionization mass spectrometer (CIMS) dataset, and an aerosol mass spectrometer (AMS) dataset. The results were largely comparable with previous manual peak fitting and identification, but were achieved in a fraction of the time. Erroneous assignments mainly appeared for low-intensity signals subject to interference from nearby higher-intensity signals, a case that is also challenging for manual peak fitting. The algorithm provides an excellent starting point for a peak list, which can be manually revised if needed.
The main result of this study is the algorithm itself. While further improvements and tweaks are possible, the algorithm presented here is currently being implemented in the commonly used Tofware analysis software package to allow easy use by the broader community. We hope this frees researchers' valuable time for data interpretation rather than data processing and curation.
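To make the fitting idea in the abstract concrete, the following is a minimal sketch of weighted least squares peak fitting combined with an information criterion to choose the number of peaks. It is not the authors' implementation: it assumes a Gaussian peak shape, fixed candidate peak positions, a known constant noise level, and the standard (unmodified) Bayesian information criterion. All function names and numbers are hypothetical.

```python
import numpy as np

def gaussian(x, center, sigma):
    """Unit-height Gaussian peak shape (an assumed shape, for illustration)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def fit_heights(x, y, centers, sigma, weights):
    """Weighted least squares for peak heights at fixed candidate positions."""
    design = np.column_stack([gaussian(x, c, sigma) for c in centers])
    sw = np.sqrt(weights)  # scaling rows by sqrt(w) turns WLS into ordinary LS
    heights, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    chi2 = float(np.sum(weights * (y - design @ heights) ** 2))
    return heights, chi2

def bic(chi2, n_points, n_params):
    """Standard BIC for Gaussian noise of known variance (not the modified form)."""
    return chi2 + n_params * np.log(n_points)

def make_demo_spectrum(seed=0):
    """Synthetic spectrum with two overlapping peaks; all numbers are hypothetical."""
    rng = np.random.default_rng(seed)
    x = np.linspace(99.8, 100.2, 81)
    sigma, noise_sd = 0.02, 1.0
    y = 100 * gaussian(x, 99.98, sigma) + 40 * gaussian(x, 100.03, sigma)
    y = y + rng.normal(0.0, noise_sd, x.size)
    weights = np.full(x.size, 1.0 / noise_sd**2)  # weights = 1 / variance
    return x, y, sigma, weights

if __name__ == "__main__":
    x, y, sigma, w = make_demo_spectrum()
    for centers in ([99.98], [99.98, 100.03]):
        heights, chi2 = fit_heights(x, y, centers, sigma, w)
        score = bic(chi2, x.size, len(centers))
        print(len(centers), "peak(s): BIC =", round(score, 1))
```

In this sketch the model with the lower BIC is preferred, so an extra peak is accepted only when it reduces the weighted residual by more than the penalty for the added parameter. A real spectrum would also require fitting peak positions and using signal-dependent weights.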
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-3047', Anonymous Referee #1, 07 Nov 2024
This study presents a technical advancement toward automating peak fitting and formula assignment in mass spectrometric analysis, a task traditionally requiring significant manual effort. The authors developed an algorithm intended to streamline this process by integrating weighted least squares fitting, a modified Bayesian information criterion, and iterative formula assignment. Their approach aims to deliver a preliminary list of likely peaks that can later be refined as needed, thus reducing the time analysts typically spend on labor-intensive, manual identification. The algorithm was tested on gas-phase CIMS and aerosol AMS datasets, and showed comparable accuracy to manual fitting in many cases. However, as with manual methods, lower-intensity signals and interferences from adjacent peaks presented challenges, which led to occasional erroneous assignments. The study's main output, the algorithm, is reported to be undergoing integration into the Tofware analysis software, making it accessible to a broader user base and thus potentially transforming routine data processing workflows. Overall, the work represents a valuable contribution to the field, promising to free up researchers’ time for data interpretation rather than time-consuming data processing tasks. I recommend its publication in AMT, provided that the following comments are adequately addressed.
- I assume the peak shape function follows a Gaussian distribution? Since position is defined based on the peak shape function, variations in shape could impact the algorithm’s reliability.
- More detail on the rationale for default parameter values, especially for critical values like n_max and the parameter A, would be helpful. How robust are these defaults across different instruments and sample types? It would be useful to know how sensitive the fitting results are to these parameter settings and if there is a straightforward way to optimize or validate these parameters for different datasets.
- The current formula list appears to be derived from existing datasets validated by specific instruments, which are selectively sensitive to certain groups of compounds. While these formulas are relevant to particular compounds, they don’t encompass all possible combinations of elements that adhere to established chemical bonding rules. Given the complexity of organic carbon mixtures in the atmosphere, expanding the list to include additional elements beyond C, H, O, N, and S, and more importantly, to include all possible formula combinations that obey the valency rules, will be a crucial step for future development.
- Page 13, line 301: Following the previous comment, implementing the odd nitrogen rule would serve as a valuable criterion for automatically excluding incorrect formulas as discussed here.
- Given that low-intensity peaks are prone to interference and misassignment, further explanation on strategies for handling such peaks would strengthen the method. A more rigorous treatment of noise and background signals, including methods for background subtraction, and options to customize baseline input, could improve accuracy.
- I am not sure if isotopic checks are currently incorporated in the algorithm, but incorporating an optional check for isotopes could reduce misassignments, particularly for elements with non-standard isotopic distributions.
Citation: https://doi.org/10.5194/egusphere-2024-3047-RC1 - AC1: 'Reply on RC1', Valter Mickwitz, 20 Dec 2024
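The valency and nitrogen-rule filters raised in RC1 can be illustrated with a short sketch. This is not part of the manuscript's algorithm. It assumes neutral, closed-shell formulas containing only C, H, N, O and S, with N treated as trivalent and O and S as divalent, and the function names are hypothetical. Open-shell species such as radicals would fail these checks, so a filter of this kind would need to be optional.

```python
def twice_double_bond_equivalents(c, h, n=0):
    """Return 2 * DBE, where DBE = C - H/2 + N/2 + 1 (O and S do not contribute)."""
    return 2 * c - h + n + 2

def is_plausible_closed_shell(c, h, n=0, o=0, s=0):
    """Valence check: DBE must be a non-negative integer for a neutral molecule."""
    twice_dbe = twice_double_bond_equivalents(c, h, n)
    return twice_dbe >= 0 and twice_dbe % 2 == 0

def nominal_mass(c, h, n=0, o=0, s=0):
    """Nominal (integer) mass from the most abundant isotope of each element."""
    return 12 * c + h + 14 * n + 16 * o + 32 * s

def satisfies_nitrogen_rule(c, h, n=0, o=0, s=0):
    """Nitrogen rule: nominal mass is odd exactly when the N count is odd."""
    return nominal_mass(c, h, n, o, s) % 2 == n % 2

if __name__ == "__main__":
    # (C, H, N, O, S) tuples chosen only as examples
    for formula in [(10, 16, 0, 3, 0), (10, 15, 0, 3, 0), (10, 17, 1, 3, 0)]:
        print(formula, is_plausible_closed_shell(*formula),
              satisfies_nitrogen_rule(*formula))
```

For this element set the two checks agree on parity: both require the hydrogen and nitrogen counts to be both even or both odd. The valence check additionally rejects formulas with too many hydrogen atoms.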
RC2: 'Comment on egusphere-2024-3047', Anonymous Referee #2, 07 Dec 2024
This study introduces an automated algorithm to streamline peak fitting and formula assignment in mass spectrometry for atmospheric analysis. The algorithm uses weighted least squares fitting and a modified Bayesian information criterion to identify peaks in mass spectra. It was tested on synthetic data and real datasets from gas-phase oxidation using chemical ionization mass spectrometry and particle measurement by aerosol mass spectrometry, yielding results comparable to manual methods but much faster. Errors were mainly observed with low-intensity signals affected by higher-intensity interference. Despite these errors, the algorithm offers a valuable starting point for peak identification and can be manually refined if necessary. Overall, the manuscript is well written. The technique is useful and valuable to the community. I have some minor comments:
- Line 144: "assigns," not "Assigns."
- Line 154: Two “for which” are redundant.
- Figure 3: Why was a default value of 0.2×FWHM used? Have you performed a sensitivity analysis to determine this value?
- Line 156: The term “the other peak” is confusing. Is it “the peak” or “another peak” in line 154?
- Line 132: How is the isotopic contribution calculated without assigning chemical formulas first?
- Line 162: Should step 5 go back to step 1 since changes in the number of peaks and chemical formulas also affect the isotopic contribution?
- Line 180: Please provide more details on how the synthetic data was generated and the mass spectrum for better understanding and visualization.
- Line 187: How was the baseline determined?
- Lines 203-205: Why is fluorine included for generating gas-phase formulas for alpha-pinene ozonolysis products?
- Line 261: What is meant by "calibration error"? Does it occur during the calibration process before peak assignment?
- Figure 4: For N_corr and S_corr, what do you mean by “correct”? How do you know they are correct? Is it because they are based on synthetic data?
- Lines 320-325: To help readers, provide the peak lists for the gas-phase and particle-phase tests based on the rules in Appendix A2.
- Lines 342-344: It’s unclear how the values of 97%, 94%, and 80% were determined. Please clarify.
- Line 352: Should peak shape be considered before step 1?
- Lines 394-407: Consider combining this paragraph into the Conclusion section.
- The authors mentioned that errors were mainly observed with low-intensity signals. Are there strategies to minimize errors in peak assignment for low-intensity peaks? Please discuss.
- Have the authors compared this algorithm with others mentioned in lines 42-43? What are its advantages and limitations?
Citation: https://doi.org/10.5194/egusphere-2024-3047-RC2 - AC2: 'Reply on RC2', Valter Mickwitz, 20 Dec 2024
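Several RC2 comments concern how isotopic contributions depend on the assigned formulas. As a rough illustration only, and not the manuscript's method, the sketch below predicts the position and intensity of the single-13C isotopologue once a formula with a known carbon count has been assigned. It assumes a singly charged ion, a 13C natural abundance of about 1.07 %, and ignores contributions from 2H, 15N, 17O and 33S. The function name and input values are hypothetical.

```python
C13_ABUNDANCE = 0.0107      # approximate natural abundance of 13C
C13_MASS_SHIFT = 1.0033548  # mass difference between 13C and 12C, in Da

def c13_isotopologue(mono_mz, mono_intensity, n_carbon):
    """Predict m/z and intensity of the single-13C peak for a singly charged ion.

    The ratio of "exactly one 13C" to "no 13C" follows from the binomial
    distribution: n * p / (1 - p).
    """
    ratio = n_carbon * C13_ABUNDANCE / (1.0 - C13_ABUNDANCE)
    return mono_mz + C13_MASS_SHIFT, mono_intensity * ratio

if __name__ == "__main__":
    # Hypothetical monoisotopic peak with 10 carbon atoms
    iso_mz, iso_intensity = c13_isotopologue(184.1099, 1000.0, 10)
    print(round(iso_mz, 4), round(iso_intensity, 1))
```

Because the predicted isotopologue signal depends on the carbon count, any change to the assigned formulas changes the isotopic contribution at the neighbouring mass, which is the dependency the referee's question about iterating between steps points to.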