A general framework for evaluating real-time bioaerosol classification algorithms
Abstract. Advances in automatic bioaerosol monitoring require updated approaches to evaluate particle classification algorithms. We present a training and evaluation framework based on three metrics: (1) Kendall's Tau correlation between predicted and manual concentrations, (2) a scaling factor to assess identification efficiency, and (3) an off-season noise ratio quantifying off-season false predictions. Metrics are computed per class across confidence thresholds and five stations, and visualised in graphs revealing overfitting, station-specific biases, and sensitivity–specificity trade-offs. We provide optimal ranges for each metric, calculated respectively from correlations on co-located manual measurements, a worst-case-scenario off-season noise ratio, and the physical sampling limits constraining an acceptable scaling factor. The evaluation framework was applied to seven deep-learning classifiers trained on holography and fluorescence data from SwisensPoleno devices, and compared with the 2022 holography-only classifier. Classifier performances are compared through visualisation methods, helping identify over-training and misclassification between morphologically similar taxa or between pollen and non-pollen particles. This methodology allows a transparent and reproducible comparison of classification algorithms, independent of classifier architecture and device. Its adoption could help standardise performance reporting across the research community, all the more so when evaluation datasets are standardised across different regions.
Status: open (until 13 Feb 2026)
- RC1: 'Comment on egusphere-2025-5440', Anonymous Referee #1, 10 Jan 2026
- RC2: 'Comment on egusphere-2025-5440', Anonymous Referee #2, 10 Jan 2026
Referee comments on “A general framework for evaluating real-time bioaerosol classification algorithms” by Marie-Pierre Meurville, Bernard Clot, Sophie Erb, Maria Lbadaoui-Darvas, Fiona Tummon, Gian-Duri Lieberherr, and Benoît Crouzy
This manuscript addresses a very important aspect of the introduction of automatic measurements of airborne pollen. The machine-learning-based classification algorithm remains the largest source of uncertainty in the automatization process. The authors propose metrics and acceptance criteria for classification algorithms, based on data collected in the operational SwissPollen network using the automatic Swisens Poleno Jupiter air-flow cytometer coupled with a deep-learning classification algorithm, alongside the manual Hirst-type (Burkard) method. The study fits well within the aims and scope of Atmospheric Measurement Techniques, but before it can be accepted for publication the authors should address the following aspects:
Methods
In lines 138-141 the authors indicate that measurements of the SwisensPoleno are corrected by applying a multiplier to adjust the event count. Please explain this multiplier with respect to the device and the particle type, and how it affects the scaling factor. Does this multiplier also compensate for losses resulting from filtering out bad measurements, e.g. missing holographic images, as indicated in a recent study on classifying fungal spores (Bruffaerts et al. 2025)? If not, please give information on the expected losses for each of the analysed pollen types (at least from what was seen when cleaning the dataset for training the algorithms).
Why was the multiplier not applied to raindrops?
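For illustration, a minimal sketch of how such a count-to-concentration multiplier could enter the computation (the names, flow value, and formula below are assumptions for illustration, not the authors' implementation):

```python
# Hypothetical sketch of a count-to-concentration correction.
# All names and values are illustrative assumptions, not device specifications.

SAMPLE_FLOW_M3_PER_H = 2.4  # placeholder sampling flow rate, m^3/h

def hourly_concentration(event_count: int, multiplier: float,
                         hours: float = 1.0) -> float:
    """Convert classified event counts into a concentration (grains/m^3).

    `multiplier` would bundle device- and particle-specific corrections,
    e.g. sampling efficiency and losses from filtering out bad events
    (such as missing holographic images).
    """
    sampled_volume_m3 = SAMPLE_FLOW_M3_PER_H * hours
    return event_count * multiplier / sampled_volume_m3

# e.g. 12 Betula events in one hour with an assumed correction factor of 2.5:
print(hourly_concentration(12, multiplier=2.5))  # -> 12.5 grains/m^3
```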
The paragraph starting at line 142 refers to Hirst-type manual measurements, correct? Please indicate this clearly and, since the manual Hirst-type data are used in the algorithm evaluation, describe the method used to obtain daily values (e.g. which sampler is used: Burkard, Lanzoni, SPT?). The authors should also explain how they limited the following known problems. In particular,
1) what fraction of the sample is analysed, and to what extent does it meet the criteria prescribed by the European standard EN16868?
2) what was the airflow, how was it measured, and what was the variation between measurements?
3) how is human error limited (what is the pollen-identification education and experience of the personnel who analysed the samples collected in 2024 and 2025, and what magnification is used for identification)? This should include the measurement uncertainty for reproducibility (several analysts, same sample) and repeatability (one analyst, several times, same sample) for each analyst, and to what extent these meet the criteria prescribed by the European standard EN16868.
It seems that using the mean off-season concentration in relation to the season sum will underestimate the off-season noise ratio (especially for short off-season peaks). Why not use the ratio between the sum of the off-season and the sum of the in-season pollen concentrations?
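To make the two candidate definitions concrete, a minimal sketch (the formulas are inferred from the text, not taken from the authors' code):

```python
import numpy as np

def noise_ratio_mean(off_season, season):
    """Ratio as described in the manuscript: mean off-season concentration
    relative to the seasonal sum."""
    return off_season.mean() / season.sum()

def noise_ratio_sum(off_season, season):
    """Alternative suggested here: off-season sum over in-season sum,
    which does not dilute short off-season peaks."""
    return off_season.sum() / season.sum()

# A single short off-season peak barely moves the mean-based ratio:
season = np.array([50.0, 120.0, 200.0, 80.0])
off_season = np.zeros(180)
off_season[90] = 45.0  # one spurious peak
print(noise_ratio_mean(off_season, season))  # ~0.00056
print(noise_ratio_sum(off_season, season))   # 0.1
```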
Please explain (and support with data or references) the statement in line 172: "Scaling factors between 1-20 were considered reasonable for SwissensPoleno Jupiter, with values larger than 20 indicating the automatic measurement system would reach detection limit of the manual device.". If this is explained in lines 259-265, please move that text to the Methods. Also, why 1-20 grains/m3 when the threshold was set to 24 grains/m3?
Please include a reference for the "trapezoidal rule" mentioned in line 178.
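For reference, the trapezoidal rule approximates the area under a sampled curve by summing the trapezoid areas between consecutive grid points; a self-contained sketch with hypothetical metric values (equivalent to scipy.integrate.trapezoid):

```python
def trapezoid_area(x, y):
    """Trapezoidal rule: sum of 0.5 * (y[i] + y[i+1]) * (x[i+1] - x[i])."""
    return sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(zip(x, y), zip(x[1:], y[1:])))

# Hypothetical metric curve sampled over confidence thresholds:
thresholds = [0.5, 0.7, 0.9, 0.99]
tau_values = [0.62, 0.68, 0.71, 0.55]
print(trapezoid_area(thresholds, tau_values))  # ~0.33
```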
Results and discussion
The authors indicated that Kendall's Tau is less sensitive to outliers, but is it affected by the scaling factor? Please show how it compares in this respect to other non-parametric alternatives commonly used in aerobiology for comparing aerobiological datasets, e.g. Spearman's Rho.
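One property worth making explicit: both Kendall's Tau and Spearman's Rho are rank-based, so a constant positive scaling factor leaves them unchanged. A quick check on synthetic data (scipy assumed; not the authors' pipeline):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
manual = rng.gamma(2.0, 30.0, size=60)                 # synthetic daily manual counts
predicted = 0.2 * manual + rng.normal(0, 5, size=60)   # classifier that undercounts

tau, _ = kendalltau(manual, predicted)
rho, _ = spearmanr(manual, predicted)
tau_scaled, _ = kendalltau(manual, 13.7 * predicted)   # apply an arbitrary scaling

print(tau, rho)                     # the two rank correlations, for comparison
print(np.isclose(tau, tau_scaled))  # True: rank correlations ignore the scaling factor
```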
The different approaches for defining the main season (first non-zero day for low seasons, or when at least four of the seven days had an average pollen concentration greater than 20 particles per cubic metre for pollen types detected at high concentrations) could lead to algorithms that minimise accuracy at low concentrations for some pollen classes. Please discuss what impact this could have for end users (could concentrations below 20 still be relevant as in-season for some pollen types?).
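A sketch of the two season-start rules as quoted above (pandas assumed; the function names and the convention of reporting the last day of the first qualifying window are illustrative assumptions):

```python
import pandas as pd

def season_start_high(daily, thresh=20.0):
    """High-concentration pollen: season starts once at least 4 of 7
    consecutive days exceed `thresh` grains/m^3 (rule as quoted above)."""
    hits = (daily > thresh).astype(int).rolling(7).sum()
    qualifying = hits[hits >= 4]
    return qualifying.index[0] if not qualifying.empty else None

def season_start_low(daily):
    """Low-concentration pollen: season starts at the first non-zero day."""
    nonzero = daily[daily > 0]
    return nonzero.index[0] if not nonzero.empty else None

days = pd.date_range("2025-03-01", periods=30)
daily = pd.Series([0] * 5 + [25, 30, 10, 40, 50, 60, 22] + [5] * 18,
                  index=days, dtype=float)
print(season_start_high(daily))  # 2025-03-10
print(season_start_low(daily))   # 2025-03-06
```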
The description (lines 267-281) of the calculation of the reference Kendall's Tau and off-season noise ratio from manual measurements might better suit the Methods section. In the Results it would be interesting to check to what extent those reference values are robust; the authors could compare their results to values calculated from three side-by-side operating Hirst-type samplers from the intercomparison campaign organised in Munich (Maya-Manzano et al. 2023).
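Such a robustness check could be as simple as pairwise rank correlations among the co-located traps; a sketch on synthetic placeholder series (the real input would be the three side-by-side Hirst time series from the Munich campaign):

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
truth = rng.gamma(2.0, 25.0, size=90)  # synthetic "true" daily concentrations
traps = {name: truth * rng.lognormal(0.0, 0.2, size=90)  # per-trap counting noise
         for name in ("A", "B", "C")}

# Pairwise Kendall's Tau between co-located samplers gives a ceiling for
# what any automatic-vs-manual comparison can realistically achieve:
for (n1, s1), (n2, s2) in combinations(traps.items(), 2):
    tau, p = kendalltau(s1, s2)
    print(f"{n1} vs {n2}: tau = {tau:.2f} (p = {p:.3g})")
```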
The authors indicated (line 306) that the scaling factors for the same pollen at different locations/instruments need to be different. Please show in a table the data for all pollen types and locations at the chosen optimal confidence threshold. From Figure 1 and Figure A1, it seems that for some pollen types the difference is much larger than the up-to-72% difference in flow measurements of different manual devices indicated by Oteros et al. 2016. (This is a relevant aspect, but if the manual measurements follow EN16868 the flow should be measured at least once a week for each device, and this could give information on how much this aspect of the manual measurements affects the scaling factor.)
This leads to the question: is there a difference in scaling factors for the same location/device between different seasons? If the factor is not stable, does this mean that side-by-side manual measurements are required until the scaling factor can be determined metrologically from sampling efficiency and algorithm losses?
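A season-to-season stability check along these lines could look as follows (synthetic placeholder data; defining the scaling factor as the ratio of seasonal sums is my assumption, not necessarily the manuscript's):

```python
import numpy as np

rng = np.random.default_rng(2)

def scaling_factor(manual_daily, auto_daily):
    """Assumed definition: the factor by which the automatic seasonal sum
    must be multiplied to match the manual one."""
    return manual_daily.sum() / auto_daily.sum()

# One station, three seasons, with a drifting identification efficiency:
for year in (2023, 2024, 2025):
    manual = rng.gamma(2.0, 40.0, size=60)   # synthetic manual season
    auto = manual / rng.uniform(3.0, 8.0)    # synthetic automatic counts
    print(year, round(scaling_factor(manual, auto), 1))
```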
In Figure 6, change the histograms to lines, because it seems that pollen is hidden behind the water droplets and it is not possible to see to what extent the pollen classification changed between the two algorithms.
Throughout the manuscript, pollen classes are written in both italicised and normal font. If the pollen classes refer to botanical taxa, genera should be italicised. If the pollen classes refer to morphological types, please indicate so in the Methods and format all of them in normal font.
References
CEN 16868. 2019. Ambient air - Sampling and analysis of airborne pollen grains and fungal spores for networks related to allergy - Volumetric Hirst method.
Bruffaerts, N., Graf, E., Matavulj, P., Tiwari, A., Pyrri, I., Zeder, Y., Erb, S., Plaza, M., Dietler, S., Bendinelli, T., D’hooge, E., Sikoparija, B. 2025. Advancing automated identification of airborne fungal spores: guidelines for cultivation and reference dataset creation. Aerobiologia. doi: 10.1007/s10453-025-09864-y
Maya-Manzano, J.M., Tummon, F., Abt, R., Allan, N., Bunderson, L., Clot, B., Crouzy, B., Erb, S., Gonzalez-Alonso, M., Graf, E., Grewling, L., Haus, J., Kadantsev, E., Kawashima, S., Martinez-Bracero, M., Matavulj, P., Mills, S., Niederberger, E., Lieberherr, G., Lucas, R.W., O'Connor, D.J., Oteros, J., Palamarchuk, J., Pope, F.D., Rojo, J., Schäfer, S., Schmidt-Weber, C., Šikoparija, B., Skjøth, C.A., Sofiev, M., Stemmler, T., Triviño, M., Buters, J. 2023. Towards European automatic bioaerosol monitoring: Comparison of 9 automatic pollen observational instruments with classic Hirst-type traps. Science of the Total Environment 866, 161220. doi: 10.1016/j.scitotenv.2022.161220
Oteros, J., Buters, J., Laven, G., Röseler, S., Wachter, R., Schmidt-Weber, C., Hofmann, F. 2016. Errors in determining the flow rate of Hirst-type pollen traps. Aerobiologia 33, 201-210. doi: 10.1007/s10453-016-9467-x
Citation: https://doi.org/10.5194/egusphere-2025-5440-RC2
RC1: 'Comment on egusphere-2025-5440', Anonymous Referee #1, 10 Jan 2026
This paper proposes a method for evaluating pollen recognition models that process raw data from a Swisens Poleno device. The main strength of the method is that it goes beyond the conventional practice of evaluating a model on a test dataset that is a subset of the dataset used to train the model. Although such a test dataset was not presented to the model during training, it cannot be representative of realistic operating conditions because it originates from data typically collected under laboratory conditions. To address this gap, the authors present a set of metrics to compare model results with Hirst-type sampler measurements, where the collected pollen grains are identified and counted manually.
But the paper still needs further improvement. It is essential to ensure coherence among the paper's title, stated objectives, body, and conclusions. The title of the paper is "A general framework for evaluating real-time bioaerosol classification algorithms". From the title, one might infer that a software solution will be presented to rank models automatically. However, the text indicates that model evaluation is not straightforward: after the metrics are computed, the results are visualised, and the models must be assessed manually without a clearly defined algorithm. Although seven models are discussed, they were not fully evaluated against one another. Most attention is devoted to comparing a model that accepts only holographic images with the rest, which combine holography with fluorescence. Intuitively, even without a formal study, one might expect the latter to perform better. On the one hand, under favourable conditions, when the models being compared are truly different, the proposed evaluation metrics have proven successful. On the other hand, the applicability of the proposed methodology may be questionable when evaluating models that differ less markedly.
In the Introduction, we see the statement "We aim at providing a full pipeline and protocol to help the community 1) train and fine-tune deep-learning classifiers ...". However, it does not appear that this goal is purposefully pursued further in the body of the paper, and the Conclusions do not mention the achieved result at all.
Section "2.5 Algorithm evaluation", which should highlight the novelty of the proposed approach, is rather laconic. Two of the proposed metrics, the correlation and the scaling factor, are intuitive and have been used many times in similar research. A novelty could be that the commonly applied Pearson correlation is replaced with Kendall's Tau correlation; however, neither in this section nor elsewhere in the paper is there any data-driven evidence demonstrating that this substitution is justified.
The introduction of the parameters area under the curve (AUC) and the difference ∆AUC seems unsuccessful. The AUC metric is widely used in machine learning; however, by the usual definition, the curve under which the area is calculated is a ROC curve, so the area under the ROC curve is restricted to the range (0, 1). In the case of the curve "scaling factor versus confidence", a special point (confidence = 1) exists where the value approaches infinity. The phrase "the area under each metric curve was computed using the trapezoidal rule" does not explain how this issue of infinity was solved. The large dispersion of the AUC and ∆AUC values indicates that these parameters cause more problems than they are useful for model evaluation. The text of Section 3.1 would become simpler and more understandable if these metrics were eliminated.
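The problem pointed out here can be made explicit: if the scaling factor diverges as the confidence threshold approaches 1, the trapezoidal AUC over that axis depends on the arbitrary grid cutoff near 1. A toy illustration (synthetic curve, not the authors' data):

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule over irregularly spaced points."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

conf = 1.0 - np.logspace(-1, -7, 7)  # 0.9, 0.99, ..., 1 - 1e-7
scale = 1.0 / (1.0 - conf)           # toy scaling factor diverging as conf -> 1

print(trap(scale, conf))             # ~29.7: grows with every decade added towards 1
print(trap(scale[:3], conf[:3]))     # ~9.9: same curve truncated at 0.999
```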
Regarding Figures 1 and 2: in Section 3.2 we find that a confidence threshold of 0.9 was chosen for the taxa *Alnus*, *Betula*, and *Corylus*. Therefore, confidence-threshold values of 1−10⁻³, 1−10⁻⁴, 1−10⁻⁵, 1−10⁻⁶, and 1−10⁻⁷ are outside the scope of attention and would be better excluded from the plots, to make it easier to see what happens near the working point (confidence = 0.9).
Regarding Figure 4: "The evaluation of these classes is particularly challenging given that their seasons overlap". This is probably a fundamental problem that prevents the use of the off-diagonal values of the correlation matrix for model-evaluation purposes.
The Conclusions section needs to be improved. The general thread of the paper is lost in the conclusions. I would like to see the results of what has been achieved in the development of the model evaluation method; the current version of the Conclusions comments on individual, seemingly randomly selected facts.