the Creative Commons Attribution 4.0 License.
Flow cytometry and machine learning enable identification of allergenic urban tree pollen
Abstract. Exposure to allergenic pollen is a major public health concern, as it is a key trigger for respiratory allergies, including seasonal allergic rhinitis, which affects approximately 20 % of the global population. Monitoring airborne pollen is essential for prevention and clinical management, yet traditional identification methods, such as light microscopy, are time-consuming and often limited to genus- or family-level resolution. Here, we present a high-throughput approach combining flow cytometry with machine learning to identify pollen from urban environments. We collected a reference database of pollen from 97 species across 34 genera, representing the dominant allergenic trees and other common airborne taxa in Montreal, Canada. Using flow cytometry, we measured particle size, granularity, and fluorescence intensity across multiple excitation and emission channels, and applied a Random Forest classifier to distinguish pollen taxa. At the species level, the model achieved a mean F1-score of 0.76, while genus-level classification reached 0.90, with misclassifications largely occurring among closely related species. Granularity and fluorescence parameters from the violet and blue lasers were the most distinctive features. Our results demonstrate that flow cytometry combined with machine learning provides an efficient, scalable alternative to microscopy, with potential for large-scale urban pollen monitoring.
Status: open (until 27 Feb 2026)
- RC1: 'Comment on egusphere-2025-6259', Anonymous Referee #1, 01 Feb 2026
- RC2: 'Comment on egusphere-2025-6259', Anonymous Referee #2, 02 Feb 2026
Overview:
In this study, Tardif et al. sought to distinguish tree, grass & weed pollen in an urban area (Montréal) using flow cytometry data & a machine learning algorithm (specifically random forest). They identified granularity & fluorescence parameters from the violet (Violet610_A) & blue (PB450_A) lasers as the most distinctive features for discerning pollen species, though species-level performance of the random forest model (F1 = 0.76) was lower than genus-level performance (F1 = 0.90). I generally found the manuscript to be well written, with few spelling or grammatical issues. This study presents a strong step toward scalable pollen classification using non-imaging cytometry. However, some concerns and requests for clarification are listed below:
Major comments:
Line 94-95: For this study, I understand that the purity of the pollen is critical when analyzing the flow cytometer data, because non-pollen particles of similar size or fluorescence intensity can influence the classification. However, no data are presented to validate the pollen's purity by another method. For example, would it be possible to assess pollen purity by image analysis of an extracted pollen subsample under a light/fluorescence microscope?
Line 143-144 = Since the data is unbalanced, the authors chose to use synthetic minority over-sampling to balance the classes. However, I was under the impression that oversampling can cause model overfitting & an inaccurate representation of the smallest minority classes (~300 – 35000 is very unbalanced). Were steps taken to ensure that didn’t happen? It may at least be worth a mention in the discussion.
Line 145-146 = This reads as though synthetic minority over-sampling is applied to the samples first, and the dataset is then split into 70% training & 30% validation. If so, would this not lead to overfitting & inflated performance metrics, since information from the training data would leak into the validation set? Perhaps I’ve misinterpreted what’s written, or this doesn’t matter. If so, the order of the steps taken may need clarification.
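To make the ordering concern concrete, the following is a minimal sketch of the leakage-free sequence: split first, then oversample the training portion only. It is not the authors' pipeline. The two taxon names, the two features, and the helper functions `stratified_split` and `oversample_by_interpolation` are hypothetical, and the oversampler is a simplified stand-in that interpolates between random same-class events rather than between nearest neighbours as SMOTE does.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def oversample_by_interpolation(rows, labels, seed=0):
    """Simplified SMOTE-like stand-in (assumption, not real SMOTE):
    synthesise minority events on the segment between two random
    events of the same class until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        for _ in range(target - len(members)):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            out_rows.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            out_labels.append(label)
    return out_rows, out_labels


# Toy, strongly imbalanced data with two hypothetical features per event.
rng = random.Random(42)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
rows += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]
labels = ["taxon_A"] * 200 + ["taxon_B"] * 20

# 1) Split first, so the held-out events are fixed before any synthesis.
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Oversample the training portion only; the test set stays untouched.
X_train_bal, y_train_bal = oversample_by_interpolation(X_train, y_train)
```

With this ordering, no synthetic event is derived from a held-out event, and the validation set keeps the original class imbalance, so the reported metrics reflect the distribution the model would actually encounter.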
Line 160 = While F1-scores are useful for measuring model performance in a single metric, the likely strong class imbalance and use of oversampling would suggest that including metrics such as PR AUC in addition to F1-scores would provide a more complete summary of the model's performance (Saito & Rehmsmeier, 2015 https://doi.org/10.1371/journal.pone.0118432).
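As an illustration of how the two metrics differ, here is a minimal sketch that computes a one-vs-rest F1-score at a fixed threshold alongside average precision, one common summary of the area under the precision-recall curve. The function names, the 0.5 threshold, and the toy scores are assumptions for illustration and are not taken from the manuscript. For a multi-class model, the same calculation would be repeated per taxon and averaged.

```python
def average_precision(is_positive, scores):
    """Average precision: mean of the precision values at the rank of
    each true positive, with events ranked by descending score.
    Ties in score are broken by input order in this sketch."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(is_positive)
    if n_positive == 0:
        raise ValueError("average precision is undefined without positive events")
    hits = 0
    total = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            total += hits / rank
    return total / n_positive


def f1_at_threshold(is_positive, scores, threshold=0.5):
    """F1-score for a single class after thresholding the class score."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for t, p in zip(is_positive, predicted) if t and p)
    fp = sum(1 for t, p in zip(is_positive, predicted) if not t and p)
    fn = sum(1 for t, p in zip(is_positive, predicted) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy one-vs-rest example: 2 events of a rare taxon among 8 events.
is_positive = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]

ap = average_precision(is_positive, scores)
f1 = f1_at_threshold(is_positive, scores)
```

F1 describes a single operating point, whereas average precision summarises ranking quality across all thresholds, which is why the two are complementary for rare classes.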
Minor comments:
Lines 103-106: The authors state that the pollen's fluorescence depends on the fluorescent proteins on its surface (my understanding is that these are not proteins). If so, please provide a reference.
Line 148 = At the beginning of Section 2.4, it is stated that four supervised classification algorithms were tested, and that the random forest performed best. There are many types of random forest classifiers, such as random forest by randomisation, which deals well with unbalanced/noisy data. Which random forest classifiers (Breiman?) were tested? It may be worth mentioning why this particular random forest was chosen over other random forest classifiers.
Citation: https://doi.org/10.5194/egusphere-2025-6259-RC2
Data sets
Pollen Flow Cytometry Datasets and Classification Models Sarah Tardif https://doi.org/10.6084/m9.figshare.30870641
Model code and software
Pollen-classification-model Sarah Tardif https://github.com/SarahTardif/Pollen-classification-model
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 144 | 63 | 14 | 221 | 27 | 15 | 16 |
The study is oriented towards identifying airborne pollen, and the authors imply that flow cytometry is an efficient alternative to microscopical identification. It is particularly valuable to see that analysis of flow cytometry measurements seems to enable discrimination between different species of the same plant genus, which is usually not possible in routine microscopical analysis. The manuscript is well written, and it clearly describes the possibilities of standard flow cytometers for identifying pollen.
However, in my opinion the manuscript lacks tests and discussion of the applicability of the proposed method (flow cytometry measurement and the developed random forest classification model) to the analysis of aerosol samples. The authors emphasized the importance of adapting the models to real environmental samples (lines 259-264). But in my opinion, for Atmospheric Measurement Techniques, more than just theoretical discussion is needed when linking to atmospheric measurements. Without tests on atmospheric samples, it is mere speculation that the proposed approach has a “potential for large-scale urban pollen monitoring”.
There are several aspects that should be addressed/discussed:
Basing pollen identification exclusively on measurements that most flow cytometers routinely used in healthcare can provide is very important. But using the same classification algorithm on different devices (even of the same model) appears to be challenging (as the authors also clearly noted in lines 273-281). If it is not possible to test the model on a different device measuring the same parameters, the authors should at least discuss the measurement uncertainty for each parameter and refer to other studies that observed differences in flow cytometry parameters between devices.
In line 133, the authors indicate that the training dataset for the genus Thuja was impossible to clean of debris. Was the presence of debris confirmed by microscope? If not, how can you be sure it is not part of the normal pollen variability? The pollen of Thuja (and many other Cupressaceae) tends to break in a wet environment, resulting in separation of the exine from the rest of the pollen grain. Could it be that those separated exines are the “debris” you see in the data?
In Table A1, the author citation needed for a complete scientific name is missing (e.g. Ambrosia artemisiifolia L.). Genus names should be written in italics and should also include author citations.