An Algorithm for Automatic Fitting and Formula Assignment in Atmospheric Mass Spectra

Mickwitz, Valter; Peräkylä, Otso; Graeffe, Frans; Worsnop, Douglas; Ehn, Mikael

doi:https://doi.org/10.5194/egusphere-2024-3047


14 Oct 2024


Abstract. Mass spectrometry is an established method for studying the chemical composition of gases and particles in the atmosphere. Using this technique, signals corresponding to thousands, or even tens of thousands, of compounds may be detected from ambient air. The process of identifying all the peaks in the mass spectra is often arduous and time-consuming, in particular when multiple overlapping peaks are present. This manual peak fitting and identification may take even experienced analysts anywhere from weeks to months to complete, depending on the desired accuracy and completeness.

In this work, we attempted to automate the fitting and formula assignment workflow and to evaluate how far the process can get using a "one-button" algorithm. The algorithm constructed in this work takes commonly known, instrument-specific parameters as input and, at the press of one button, runs and ultimately provides a list of likely peaks for the mass spectrum. The algorithm utilizes weighted least squares fitting and a modified version of the Bayesian information criterion along with an iterative formula assignment process. We applied it to synthetic mass spectra and to both a gas-phase chemical ionization mass spectrometer (CIMS) dataset and an aerosol mass spectrometer (AMS) dataset. The results were largely comparable with manual peak fitting and identification done previously, but were achieved in a fraction of the time. Erroneous assignments mainly appeared at low-intensity signals with interference from nearby higher-intensity signals, a case that is also challenging for manual peak fitting. This algorithm provides an excellent starting point for a peak list, which, if needed, can be manually revised.
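To illustrate how weighted least squares fitting and an information criterion can work together to decide how many peaks a signal contains, here is a minimal sketch. It is not the authors' implementation: it assumes a Gaussian peak shape, fixed candidate peak positions, a known noise level, and the standard (unmodified) Bayesian information criterion. All names and numerical values (`gaussian`, `wls_amplitudes`, `bic`, the m/z values, the FWHM) are hypothetical.

```python
import numpy as np


def gaussian(x, center, fwhm):
    """Unit-height Gaussian peak (assumed peak shape, for illustration only)."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)


def wls_amplitudes(x, y, sigma_y, centers, fwhm):
    """Weighted least squares fit of peak amplitudes at fixed positions.

    With positions and width fixed, the model is linear in the amplitudes,
    so the weighted problem reduces to ordinary least squares on rows
    scaled by 1 / sigma_y.
    """
    design = np.column_stack([gaussian(x, c, fwhm) for c in centers])
    w = 1.0 / sigma_y
    amps, *_ = np.linalg.lstsq(design * w[:, None], y * w, rcond=None)
    chi2 = float(np.sum(((y - design @ amps) * w) ** 2))
    return amps, chi2


def bic(chi2, n_params, n_points):
    """Standard BIC for Gaussian noise with known variance (not the modified version)."""
    return chi2 + n_params * np.log(n_points)


# Synthetic signal: two overlapping peaks one FWHM apart (hypothetical values).
rng = np.random.default_rng(0)
x = np.linspace(200.9, 201.3, 81)
fwhm = 0.04
truth = 1000.0 * gaussian(x, 201.08, fwhm) + 300.0 * gaussian(x, 201.12, fwhm)
sigma_y = np.sqrt(truth + 1.0)  # counting-like noise, assumed known
y = truth + rng.normal(0.0, sigma_y)

# Candidate models with one and two peaks; keep the one with the lowest BIC.
candidates = {1: [201.08], 2: [201.08, 201.12]}
results = {}
for k, centers in candidates.items():
    amps, chi2 = wls_amplitudes(x, y, sigma_y, centers, fwhm)
    results[k] = (amps, bic(chi2, len(centers), len(x)))

best_k = min(results, key=lambda k: results[k][1])
print("selected number of peaks:", best_k)
```

The penalty term `n_params * log(n_points)` is what discourages adding peaks that only fit noise; a full workflow would also have to fit peak positions and account for isotopic contributions, which this sketch leaves out.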

The main result of this study is the algorithm itself. While further improvements and tweaks are possible, the algorithm presented here is currently being implemented into the commonly used Tofware analysis software package, to allow easy utilization by the broader community. We hope this will free up researchers' valuable time for data interpretation rather than data processing and curation.

Received: 30 Sep 2024 – Discussion started: 14 Oct 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.


Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

03 Apr 2025

An algorithm for automatic fitting and formula assignment in atmospheric mass spectra

Valter Mickwitz, Otso Peräkylä, Frans Graeffe, Douglas Worsnop, and Mikael Ehn

Atmos. Meas. Tech., 18, 1537–1559, https://doi.org/10.5194/amt-18-1537-2025, 2025


Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-3047', Anonymous Referee #1, 07 Nov 2024
This study presents a technical advancement toward automating peak fitting and formula assignment in mass spectrometric analysis, a task traditionally requiring significant manual effort. The authors developed an algorithm intended to streamline this process by integrating weighted least squares fitting, a modified Bayesian information criterion, and iterative formula assignment. Their approach aims to deliver a preliminary list of likely peaks that can later be refined as needed, thus reducing the time analysts typically spend on labor-intensive, manual identification. The algorithm was tested on gas-phase CIMS and aerosol AMS datasets, and showed comparable accuracy to manual fitting in many cases. However, as with manual methods, lower-intensity signals and interferences from adjacent peaks presented challenges, which led to occasional erroneous assignments. The study's main output—the algorithm—is reported to be undergoing integration into the Tofware analysis software, making it accessible to a broader user base and thus potentially transforming routine data processing workflows. Overall, the work represents a valuable contribution to the field, promising to free up researchers’ time for data interpretation rather than time-consuming data processing tasks. I recommend its publication in AMT, considering the following comments are adequately addressed.
I assume the peak shape function follows a Gaussian distribution? Since position is defined based on the peak shape function, variations in shape could impact the algorithm’s reliability.

More detail on the rationale for default parameter values, especially for critical values like n_max and the parameter A, would be helpful. How robust are these defaults across different instruments and sample types? It would be useful to know how sensitive the fitting results are to these parameter settings and if there is a straightforward way to optimize or validate these parameters for different datasets.

The current formula list appears to be derived from existing datasets validated by specific instruments, which are selectively sensitive to certain groups of compounds. While these formulas are relevant to particular compounds, they don’t encompass all possible combinations of elements that adhere to established chemical bonding rules. Given the complexity of organic carbon mixtures in the atmosphere, expanding the list to include additional elements beyond C, H, O, N, and S, and more importantly, to include all possible formula combinations that obey the valency rules, will be a crucial step for future development.

Page 13, line 301: Following the previous comment, implementing the odd nitrogen rule would serve as a valuable criterion for automatically excluding incorrect formulas as discussed here.

Given that low-intensity peaks are prone to interference and misassignment, further explanation on strategies for handling such peaks would strengthen the method. A more rigorous treatment of noise and background signals, including methods for background subtraction, and options to customize baseline input, could improve accuracy.

I am not sure if isotopic checks are currently incorporated in the algorithm, but incorporating an optional check for isotopes could reduce misassignments, particularly for elements with non-standard isotopic distributions.
Citation: https://doi.org/10.5194/egusphere-2024-3047-RC1
- AC1: 'Reply on RC1', Valter Mickwitz, 20 Dec 2024
  
  We thank the referee for their comments. The complete reply is provided in the attached PDF.
  
  Citation: https://doi.org/10.5194/egusphere-2024-3047-AC1
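
The nitrogen rule raised by Referee #1 can be illustrated with a short sketch. It assumes neutral, closed-shell molecules containing only C, H, O, N, and S, for which an odd nominal mass goes together with an odd number of nitrogen atoms. The function names are hypothetical and this is not part of the authors' algorithm. Any reagent ion or charge carrier would have to be removed from the formula before the check, and open-shell species such as radicals would fail it by construction.

```python
# Nominal (integer) masses of the most abundant isotopes.
NOMINAL_MASS = {"C": 12, "H": 1, "O": 16, "N": 14, "S": 32}


def nominal_mass(formula):
    """Nominal mass of a formula given as a dict of element counts."""
    return sum(NOMINAL_MASS[element] * count for element, count in formula.items())


def obeys_nitrogen_rule(formula):
    """True if the parity of the nominal mass matches the parity of the N count.

    Assumes a neutral, closed-shell molecule made of C, H, O, N, and S only.
    """
    return nominal_mass(formula) % 2 == formula.get("N", 0) % 2


# Hypothetical candidate formulas for illustration.
candidates = [
    {"C": 10, "H": 16, "O": 4},          # even mass, no nitrogen
    {"C": 10, "H": 15, "N": 1, "O": 6},  # odd mass, one nitrogen
    {"C": 10, "H": 16, "N": 1, "O": 6},  # even mass, one nitrogen
]
kept = [f for f in candidates if obeys_nitrogen_rule(f)]
```

Used as a filter, a check like this would discard the third candidate while keeping the first two. Whether such a filter is desirable depends on whether radicals are expected in the sample.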
RC2:
'Comment on egusphere-2024-3047', Anonymous Referee #2, 07 Dec 2024
This study introduces an automated algorithm to streamline peak fitting and formula assignment in mass spectrometry for atmospheric analysis. The algorithm uses weighted least squares fitting and a modified Bayesian information criterion to identify peaks in mass spectra. It was tested on synthetic data and real datasets from gas-phase oxidation using chemical ionization mass spectrometry and particle measurement by aerosol mass spectrometry, yielding results comparable to manual methods but much faster. Errors were mainly observed with low-intensity signals affected by higher-intensity interference. Despite these errors, the algorithm offers a valuable starting point for peak identification and can be manually refined if necessary. Overall, the manuscript is well written. The technique is useful and valuable to the community. I have some minor comments:
Line 144: "assigns," not "Assigns."

Line 154: Two “for which” are redundant.

Figure 3: Why was a default value of 0.2×FWHM used? Have you performed a sensitivity analysis to determine this value?

Line 156: The term “the other peak” is confusing. Is it “the peak” or “another peak” in line 154?

Line 132: How is the isotopic contribution calculated without assigning chemical formulas first?

Line 162: Should step 5 go back to step 1 since changes in the number of peaks and chemical formulas also affect the isotopic contribution?

Line 180: Please provide more details on how the synthetic data was generated and the mass spectrum for better understanding and visualization.

Line 187: How was the baseline determined?

Lines 203-205: Why is fluorine included for generating gas-phase formulas for alpha-pinene ozonolysis products?

Line 261: What is meant by "calibration error"? Does it occur during the calibration process before peak assignment?

Figure 4: For N_corr and S_corr, what do you mean by “correct”? How do you know they are correct? Is it because they are based on synthetic data?

Lines 320-325: To help readers, provide the peak lists for the gas-phase and particle-phase tests based on the rules in Appendix A2.

Lines 342-344: It’s unclear how the values of 97%, 94%, and 80% were determined. Please clarify.

Line 352: Should peak shape be considered before step 1?

Lines 394-407: Consider combining this paragraph into the Conclusion section.

The authors mentioned that errors were mainly observed with low-intensity signals. Are there strategies to minimize errors in peak assignment for low-intensity peaks? Please discuss.

Have the authors compared this algorithm with others mentioned in lines 42-43? What are its advantages and limitations?
Citation: https://doi.org/10.5194/egusphere-2024-3047-RC2
- AC2: 'Reply on RC2', Valter Mickwitz, 20 Dec 2024
  
  We thank the referee for their comments. The complete reply is provided in the attached PDF.
  
  Citation: https://doi.org/10.5194/egusphere-2024-3047-AC2


Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Valter Mickwitz on behalf of the Authors (20 Dec 2024) Author's response Author's tracked changes Manuscript

ED: Publish as is (24 Dec 2024) by Mingjin Tang

AR by Valter Mickwitz on behalf of the Authors (28 Jan 2025) Manuscript


Supplement

https://doi.org/10.5194/egusphere-2024-3047-supplement


Viewed

Total article views: 386 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
201	102	83	386	82	8	10


Views and downloads (calculated since 14 Oct 2024)

Month	HTML	PDF	XML	Total
Oct 2024	86	39	4	129
Nov 2024	26	15	3	44
Dec 2024	40	25	41	106
Jan 2025	20	5	14	39
Feb 2025	15	3	0	18
Mar 2025	13	15	19	47
Apr 2025	1	0	2	3


Viewed (geographical distribution)

Total article views: 357 (including HTML, PDF, and XML) Thereof 357 with geography defined and 0 with unknown origin.


Latest update: 03 Apr 2025


Short summary

This work presents and evaluates an algorithm that automatically fits peaks and identifies formulas, steps that are necessary but time-consuming in most applications of mass spectrometry within atmospheric science. The aim of the algorithm is to save researchers working on these tasks significant amounts of time and to allow them to proceed with their analysis. The work demonstrates that this algorithm can achieve the goal of speeding up analysis while providing accurate formulas.

