UCB-GLOBES: An open-access mass spectral database of identified and unidentified atmospheric organic compounds

Yee, Lindsay D.; Franklin, Emily B.; Weber, Robin J.; Zhao, Jessica; Zhang, Tiger; Xu, Stephanie; Santillan, Isaac; Li, Fangyuan; Jen, Coty N.; Zhang, Haofei; Liang, Yutong; Isaacman Van-Wertz, Gabriel; Wernis, Rebecca A.; Offenberg, John; Lewandowski, Michael; Joo, Taekyu; Takeuchi, Masayuki; Eris, Gamze; Xu, Weiqi; Ng, Nga L.; Chen, Yuzhi; Shilling, John E.; Upshur, Mary Alice; Gray Bé, Ariana; Thomson, Regan J.; Geiger, Franz M.; Goldstein, Allen H.

doi:10.5194/egusphere-2026-116

Preprints

05 Feb 2026

Abstract. Chemical characterization of atmospheric organic aerosols using gas chromatography with 70 eV electron ionization mass spectrometry (GC/EI-MS) has been used for decades in advancing molecular marker detection and identification, though primarily through suspect screening and/or targeted analyses. To advance non-targeted analyses of environmental samples, we have catalogued approximately 27,000 mass spectra (MS) of semi-volatile organic aerosol (OA) analytes observed in ambient samples from the U.S. and the Central Amazon and/or laboratory simulations of secondary OA (SOA) formation in the open-access University of California Berkeley Goldstein Library of Organic Biogenic Environmental Spectra (UCB-GLOBES). These samples are representative of OA under urban and biomass burning influences as well as SOA derived from biogenic precursors (e.g., isoprene, monoterpenes, sesquiterpenes) and biomass burning intermediates. MS are documented in UCB-GLOBES without regard to known chemical identity, annotated with extensive metadata such as sample source/experimental conditions, structural information gained from MS analyses, and predicted chemical properties such as average carbon oxidation state and carbon number. UCB-GLOBES MS can be imported into the NIST MS Search program, and we have also provided a Jupyter Notebook for MS visualization and comparisons. We demonstrate the utility of UCB-GLOBES through MS reanalyses of prior analytes observed in ambient data, finding a 20 % reduction in the number of analytes assigned to OA source categories reliant solely on time series correlation and an overall 11 % increase in new MS-based OA source categorization for the Southeast U.S. For 1,513 analytes observed previously in the Central Amazon, we found 375 MS matches using UCB-GLOBES vs. 136 MS matches during prior analyses, representing a 14 % gain in newly confirmed or newly categorized OA species. 
While OA from laboratory oxidation experiments in UCB-GLOBES is highly diverse chemically, on average only 29 % of UCB-GLOBES MS have a mass spectral match to another MS entry in UCB-GLOBES and/or in the NIST MS Database. This indicates that roughly 70 % of UCB-GLOBES MS are unique thus far, not observed more than once among the laboratory oxidation samples and ambient data in UCB-GLOBES. Further, only 18 % can be positively identified in the NIST MS database or with known authentic standards. This points to a large gap between these laboratory simulations and ambient OA. Overall, the UCB-GLOBES database can be utilized for improving confidence in OA source categorization and/or identification, novel chemical marker discovery, tracking chemical diversity, de novo structure and properties prediction, and improving MS search and matching algorithms. It can also inform future research priorities for the chemical characterization of atmospheric organic samples.
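
For readers unfamiliar with the metric named in the abstract, average carbon oxidation state is commonly approximated from elemental ratios as OSc ≈ 2 O:C − H:C for compounds containing only C, H, and O. The sketch below applies that approximation to known formulas. It is an illustration only: UCB-GLOBES reports predicted properties for spectra whose identity is often unknown, and the function name and example compounds here are assumptions, not part of the database or its notebook.

```python
def carbon_oxidation_state(n_c: int, n_h: int, n_o: int) -> float:
    """Approximate average carbon oxidation state, OSc ~ 2*O/C - H/C.

    Valid only for compounds made of C, H, and O; nitrogen- or
    sulfur-containing species need additional terms not handled here.
    """
    if n_c <= 0:
        raise ValueError("carbon number must be positive")
    return 2.0 * n_o / n_c - n_h / n_c


# Illustrative, well-known compounds (not taken from UCB-GLOBES records).
examples = {
    "alpha-pinene (C10H16)": (10, 16, 0),
    "pinic acid (C9H14O4)": (9, 14, 4),
    "levoglucosan (C6H10O5)": (6, 10, 5),
}

for name, (c, h, o) in examples.items():
    print(f"{name}: nC = {c}, OSc = {carbon_oxidation_state(c, h, o):+.2f}")
```

For derivatized analytes, the formula passed to such a function would need to be that of the underivatized parent compound if the intent is to describe the atmospheric species rather than its silylated form.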

How to cite. Yee, L. D., Franklin, E. B., Weber, R. J., Zhao, J., Zhang, T., Xu, S., Santillan, I., Li, F., Jen, C. N., Zhang, H., Liang, Y., Isaacman Van-Wertz, G., Wernis, R. A., Offenberg, J., Lewandowski, M., Joo, T., Takeuchi, M., Eris, G., Xu, W., Ng, N. L., Chen, Y., Shilling, J. E., Upshur, M. A., Gray Bé, A., Thomson, R. J., Geiger, F. M., and Goldstein, A. H.: UCB-GLOBES: An open-access mass spectral database of identified and unidentified atmospheric organic compounds, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2026-116, 2026.

Received: 09 Jan 2026 – Discussion started: 05 Feb 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 1887 KB)

Supplement (7164 KB)

Status: final response (author comments only)

RC1: 'Comment on egusphere-2026-116', Anna Feerick, 14 Mar 2026
General Comments
The preprint is of high quality. The paper addresses relevant scientific questions within the scope of AMT, and, importantly, provides a much-desired resource within the gas chromatography nontarget analysis and atmospheric communities. While the current data in the UCB-GLOBES library is most useful for derivatized compounds separated through GCxGC, the development of a platform that can be contributed to by the greater community and provides an open-source option for semi-volatile organic aerosols will greatly expand the depth to which researchers can explore their environmental sampling sets. The worth of the UCB-GLOBES libraries in retrospective analysis is demonstrated through the re-classification of sourcing for both the GoAmazon and Southeast US SOAS studies. Most importantly, for the study of organic aerosols, it provides a framework for comparing different SOA sources and informing future research directions for SOA characterization. Areas of improvement include adding additional ions to the MS spectra related to TMS derivatization to improve the likelihood of a correct library match and to provide a quantitative measure of similarity between source profiles.

Specific Comments
All current MS spectra in the UCB-GLOBES library are of derivatized compounds. Adding this point to the abstract or the conclusion would improve clarity and understanding of the current use cases for other GC/EI-MS practitioners.
Line 254: Does the simple neighbor comparison find peaks that are considered local maxima in addition to those of true maxima?
Section: Automated metadata entries from MS featurization: MW prediction, base peak, five highest intensity m/z ions
Line 257: Does an exceedance of the 75th intensity percentile mean that comparisons for the candidate molecular ion had to be at least 75% of the highest intensity ion for the spectrum at that point? If so, why pick 75%? Were these intensities considered after blank subtraction?
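
The two readings of the threshold raised in this comment can be told apart with a small sketch. The spectrum, the nearest-rank percentile definition, and the function names below are hypothetical and are not taken from the manuscript's code. The point is only that a 75th-percentile cut-off is computed from the distribution of ion intensities within a spectrum, whereas a 75 % cut-off is relative to the base peak.

```python
import math


def percentile_threshold(intensities, pct):
    """Nearest-rank percentile of the intensity values in one spectrum."""
    ranked = sorted(intensities)
    rank = max(1, math.ceil(pct / 100.0 * len(ranked)))
    return ranked[rank - 1]


def base_peak_threshold(intensities, fraction):
    """Cut-off defined as a fraction of the most intense ion."""
    return fraction * max(intensities)


# Hypothetical spectrum: m/z -> relative intensity (base peak = 999).
spectrum = {73: 999, 75: 310, 117: 120, 147: 450, 217: 80, 289: 40, 304: 15, 319: 5}
values = list(spectrum.values())

p75 = percentile_threshold(values, 75)   # rank-based cut-off
f75 = base_peak_threshold(values, 0.75)  # base-peak-relative cut-off

# Ions that exceed each cut-off; the two definitions select different sets.
above_p75 = sorted(mz for mz, i in spectrum.items() if i > p75)
above_f75 = sorted(mz for mz, i in spectrum.items() if i > f75)
print(p75, above_p75)
print(f75, above_f75)
```

In this made-up spectrum the percentile reading admits more ions than the base-peak reading, which is why the distinction matters for how many candidate molecular ions are considered.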

I’d highly recommend including the M-15 peak in saved mass spectra, in addition to the five highest intensity ions. Since trimethylsilyl derivatives in EI-MS have a characteristic [M-15]+ ion, this ion is very likely to be associated with the mass spectra of the assumed molecular ion. Having more unique ions to the feature of interest would improve the cosine similarity scores and increase the likelihood of a correct library match or a correct rejection.
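
The effect described in this comment can be sketched with an unweighted cosine similarity between two sparse spectra. Everything below is an assumption made for illustration: the two compounds and their intensities are invented, and library search tools often use weighted variants of this score rather than the plain form shown here. The sketch shows two hypothetical TMS derivatives that are indistinguishable on their five most intense ions but separate once a compound-specific [M-15]+ ion is stored.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Unweighted cosine similarity between two spectra given as {m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical TMS derivatives that share their five most intense ions.
top5 = {73: 999, 147: 400, 75: 300, 117: 200, 129: 150}
compound_a = {**top5, 247: 350}  # adds [M-15]+ for an assumed M = 262
compound_b = {**top5, 261: 350}  # adds [M-15]+ for an assumed M = 276

score_top5 = cosine_similarity(top5, top5)              # five ions only
score_with = cosine_similarity(compound_a, compound_b)  # with [M-15]+ stored

print(f"top five ions only: {score_top5:.3f}")
print(f"with [M-15]+ ions:  {score_with:.3f}")
```

With only the shared ions stored, the two compounds would be reported as a perfect match; the additional ion lowers the score and gives a match threshold something to reject on.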

Line 298: How were Retention index tolerances checked?
Line 455-456: The authors state that the copaiba oil spread shown in Figure 6 is similar to the sesquiterpene system in a) and the monoterpene system in c). I agree up until Nc > 20, where I believe the copaiba oil lacks the data points to confidently say it overlaps with monoterpenes. Since “similar spread” seems to be determined by a qualitative rather than a quantitative measure and can leave room for debate, I’d recommend including a quantitative metric for the degree of similarity for these different databases.
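
One possible quantitative metric, offered only as an illustration of this comment, is the two-sample Kolmogorov-Smirnov statistic applied to one axis of the figure, for example carbon number. The values below are invented placeholders rather than UCB-GLOBES data, the function name is hypothetical, and the comment does not prescribe a specific metric. Because the statistic is one-dimensional, it would have to be applied separately to OSc and Nc or replaced by a two-dimensional measure.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical cumulative distributions
    of the two samples (0 = identical, 1 = completely separated).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Placeholder carbon numbers for two hypothetical oxidation systems.
system_1 = [10, 12, 14, 15, 15, 18, 20, 22, 25, 28]
system_2 = [9, 10, 10, 12, 14, 15, 16, 18, 19, 20]

print(f"KS statistic: {ks_statistic(system_1, system_2):.2f}")
```

A value near 0 would support a claim of similar spread along that axis, while a value near 1 would indicate that the two systems occupy largely different ranges.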

Technical corrections:
Table 1, Dataset Description:
Acedox equation has a parenthesis that needs to be removed.

Copox equation may need a comma after O3

Apinox and Apinox2 equations need a space removed after alpha

Limonox, Myrcox, and Isopox equations need compound names to be all lowercase

Isopox equation may need “as OH source)” to not be a subscript

Table 2, Description column
Remove periods from the end of the descriptions for Synonyms and Contributor_ID

For Description of Column_Type, add a space before the -, or change to a colon

Figure 1: A border should be present separating the bars of Fire Science Lab and Napa, CA Fires 2017
Figure 3: ISOP and BBOA could use a color change to make them more distinct in black and white
Figure 4: The number overlap with the border on the right side is very difficult to see in black and white. I’d recommend removing the numbers on the right side of each graphic to reduce clutter.
Figure 5: Some of the borders between the matches and unmatched blocks look thicker than others. If possible, I’d recommend unifying the border size. Additionally, there is a light blue block at the top of the 2-methyfuran/OH bar. If this is part of the No Matches, Unknown category, I’d recommend combining the two.
Figure 6: These figures are difficult to parse in black and white, and sub-figures a, c, and d would benefit from more color variations. To improve the clarity of sub-figure a, I’d recommend O3 and NOx be split into dark colors for one and light colors for the other. This would help visualize the similar spreads of OSc and Nc. Lightening the apin+NO3 in sub-figure c could bring a similar clarity. For sub-figure d, darkening 3-me-fur+OH would help it stand out against the FSL_FIREX data. Capitalize the last word in the title of sub-figure d “burn”.
Line 91: “have been born using” feels awkward. One option for alteration is to add “from” between born and using or replace with “have been made using”
Line 106: re-evaluate the use of “/” in this sentence. Are you using “/” as “or”, or is it implying that “known chemical identity” and “MS generated from authentic standards readily available” are the same thing in the previously mentioned resources?
Line 498-500: Clarify the sentence “In contrast…chemical categorization”. I do not understand it. This makes it difficult to understand what the following example is trying to highlight.
Line 167: used a hyphen when writing high-resolution in the following paragraph. Either add a hyphen here or remove one from line 76 and 187
Line 178: remove space between “N-methyl-N-“ and “trimethylsilyl”
Line 227: The current phrasing implies that low polarity peaks were the contaminants. Is this correct? If not, and PFMD was the “additional known contaminants” make “contaminants” in this sentence singular.
Line 238: remove “be able to”
Line 329: May benefit from a paragraph break between “column.” and “Using UCB-GLOBES…”
Line 456: Add c) after the semi-colon to connect the following statement with the correct sub-figure.
Citation: https://doi.org/10.5194/egusphere-2026-116-RC1
RC2: 'Comment on egusphere-2026-116', Anonymous Referee #2, 26 Mar 2026

The manuscript by Yee et al. presents an impressive effort to prepare a freely available library of GC/EI-MS mass spectra of semi-volatile organic aerosol compounds, UCB-GLOBES. The background and methodology is clearly presented, and the manuscript is in general very well written.
The usability of UCB-GLOBES is illustrated by examples of source reclassification of compounds observed in ambient studies, as well as prediction of selected key properties, and comparison between aerosols from laboratory studies and ambient studies. The UCB-GLOBES database will serve as a reference for researchers in the field.
Specific comments:
The range of temperature in particular, and also of relative humidity, in the laboratory studies is quite narrow compared to ambient environmental conditions. I suggest adding a short discussion of this limitation and its potential implications.
Thermal desorption of aerosol samples is widely used but may fragment some compounds. Can the authors comment on potential implications regarding the database and its future applicability?
Minor comments:
Line 116: An ending parenthesis is missing at the end of the sentence.
Page 17 (Line 330-): I suggest dividing the paragraph into 2-3 shorter paragraphs to improve readability.

Citation: https://doi.org/10.5194/egusphere-2026-116-RC2
RC3: 'Comment on egusphere-2026-116', Anonymous Referee #3, 06 Apr 2026

This paper presents the development and use of a GC/EI-MS mass spectral library for organic aerosol component and source identification. It is an extension of a previous version of the library published in Jen et al., 2019, extending the number of mass spectra from ~4800 to ~27,000. This is an impressive piece of work, with a huge amount of effort invested in labelling the mass spectra with an extensive set of metadata. A Jupyter notebook has been developed to aid the user interface, providing a series of useful tools for analysis. The authors indicate that other labs will be able to add new mass spectra to the library in the future, providing an important and novel resource for the atmospheric science community. The paper is very well written and easy to follow. I suggest publication after addressing a small number of minor comments.

Comments
Page 7: For the sesquiterpenes, there are over 1,000 MS each. What is the S/N ratio used for peak identification? It would be useful to know how much overlap there is between the MS for these experiments.
Line 115: I think the bracket at the end of this line should be a comma or else a second close bracket is needed.
Line 144: Most of these lab SOA samples are without seed aerosol, this should be stated here.
Line 214: Has accessing the library been tested by another lab or have the mass spectra been compared to those generated on another instrument?
Table 2: The description under the “name” “Compound name or given name as an unknown compound….” doesn’t really make sense. Also, are the metrics OSc, O:C etc calculated minus the silyl groups? This should be clarified.
Line 260: Were the molecular formulas predicted based on the isotope ratios of the molecular ions where applicable?
Line 281 – What does the LDY mean? It isn’t explained previously.
Line 300- Why did you pick ~800 compounds for the template? Why did you not do a full non-target analysis and then align the peaks?
Line 356 – Can you explain why you think the BBOA/MT/SQT and BBOA/SQT can be interpreted as terpene-derived primary biomass burning organic aerosols rather than small common oxidation products?
Line 364: There is a large drop in compounds assigned as ASOA. There are no laboratory experiments for anthropogenic VOC precursors and so this is likely to be underestimated, especially where highly oxidised low C products are formed. Also, quite a few have been transferred to isoprene SOA. Could this be due to anthropogenic isoprene emissions?
Line 381: The number of hits is perhaps surprisingly low. Do you have any insights into the missing species based on the calculated metrics?
Line 392: It would be useful to suggest the types of experiments that need to be done to extend the range of atmospheric conditions – biogenic + NOx, anthropogenic SOA etc.
Line 418: Does the statement about “29 % of MS are considered to have a MSM” relate to the ambient samples? It’s not clear.
Line 420: The meaning of the sentence starting with “on average, 18 % of MSM….” is not clear and should be rewritten.
Figure 5: Do the hashed bars indicate positive identification using NIST or other known MS libraries?
Line 446: The section on Ch3MS-RF is less convincing than the rest of the paper. The ML model was built on 130 standard compounds, but the library has 27,000 spectra. What is the crossover in terms of chemical functionality compared to the expected products? For instance, how does the model deal with the formation of nitrate groups in the OSc calculation for the a-pinene + NO3 experiment? In Figure 6c, the nitrate chemistry looks very different and the abundance of dimers is much higher; perhaps the associated reference has some detailed analysis for comparison? More detail is needed here on the limitations of this approach.
Line 516: The future directions are too specific to the authors group and existing samples. I would prefer something more generic that can be applied to the broader community.

Citation: https://doi.org/10.5194/egusphere-2026-116-RC3

Supplement

https://doi.org/10.5194/egusphere-2026-116-supplement

Data sets

ucbglobes2025_v1 Lindsay Yee https://doi.org/10.5281/zenodo.18176760

Interactive computing environment

UCB-GLOBES MS Data Visualization and Comparison Tool_v1 Lindsay Yee https://doi.org/10.5281/zenodo.18177255

Viewed

Total article views: 457 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
296	141	20	457	67	49	66

Views and downloads (calculated since 05 Feb 2026)

Month	HTML	PDF	XML	Total
Feb 2026	160	82	12	254
Mar 2026	105	50	5	160
Apr 2026	31	9	3	43

Latest update: 09 Apr 2026

Short summary

An open-access mass spectral database of identified and unidentified compounds in atmospheric and laboratory-generated organic aerosols is released to aid future molecular discoveries in the environmental sciences. Identification of air pollution sources and origins is improved using the ~27,000 mass spectral records in the UCB-GLOBES database.

