the Creative Commons Attribution 4.0 License.
Evaluating machine learning model performance in a two-step colocation process for TVOC and BTEX sensor calibration
Abstract. Calibration of low-cost air quality sensors (LCSs) for total volatile organic compound (TVOC) and benzene, toluene, ethylbenzene, and xylenes (BTEX) quantification remains challenging due to the sensors' cross-sensitivity to temperature and humidity and their tendency to drift over time. In this study, we aimed to improve TVOC and BTEX metal oxide sensor calibration using a two-step colocation strategy. This strategy made it possible to develop the calibration model under environmental conditions closely matching those of the field, which is essential for model transferability from colocation to field conditions. The approach also addressed inter-sensor variability and drift in the harmonization step. In addition to TVOC and BTEX, we applied the two-step colocation process to nitrogen dioxide (NO2) electrochemical sensors to demonstrate the broader applicability of our approach beyond TVOC and BTEX quantification.
Next, we compared the performance of multiple machine learning models, including ridge, lasso, random forest, gradient boosting, extreme gradient boosting, support vector regression, and linear regression, to investigate the optimal model choice for calibration. We found that no single model performed best across all pollutants. For example, gradient boosting excelled at capturing peak TVOC concentrations, while linear regression performed best for BTEX. Conversely, linear regression was the worst-performing model for NO2. Overall, the models showed satisfactory RMSE around 40–50 ppb for TVOC, 1.25–1.75 ppb for BTEX, and 4–6 ppb for NO2. However, all models also overestimated baseline concentrations and underestimated peaks. The severity of this bias depended on the reference concentration distribution, with the most severe peak underestimation occurring in the more heavily skewed TVOC and BTEX data. The systematic bias at baseline and peak concentrations was not evident in the overall mean bias error, which was near zero for all pollutants. This result underscores the need to evaluate model performance across the entire concentration distribution. Finally, we found that calibration performance was sensitive to the choice of training and testing data split. Future research could seek to optimize the training and testing split to ensure robust model transferability to field data.
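The evaluation pitfall noted in the abstract — a near-zero overall mean bias error masking systematic baseline overestimation and peak underestimation — can be illustrated with a short sketch. The skewed data, the shrink-toward-the-mean predictor, and the quartile cutoffs below are hypothetical illustrations, not the study's models or measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed "reference" concentrations (ppb), loosely TVOC-like
ref = rng.lognormal(mean=2.0, sigma=1.0, size=1000)
# Hypothetical calibration that shrinks predictions toward the mean
pred = 0.5 * ref + 0.5 * ref.mean()

mbe = np.mean(pred - ref)                    # overall mean bias error: ~0 here
rmse = np.sqrt(np.mean((pred - ref) ** 2))

low = ref < np.percentile(ref, 25)           # baseline quartile
high = ref > np.percentile(ref, 75)          # peak quartile
baseline_bias = np.mean(pred[low] - ref[low])  # > 0: baseline overestimated
peak_bias = np.mean(pred[high] - ref[high])    # < 0: peaks underestimated
```

Even though `mbe` comes out essentially zero, the two quartile-level biases are large and opposite in sign, which is why metrics stratified across the concentration distribution are needed.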
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4697', Anonymous Referee #1, 18 Nov 2025
AC1: 'Reply on RC1', Caroline Frischmon, 26 Mar 2026
We are grateful for the feedback received from the reviewer and have responded to their comments below.
1. It would be helpful to explicitly list the MOX sensors and NO₂ sensors used in this study, so readers can quickly identify the instrumentation without searching through the text.
- Thank you for this suggestion. We added a table of sensors (Table 1) as well as specific sensor names in the abstract.
2. The reviewer is interested in whether BTEX and tVOC are naturally correlated. If they are, how would changes in this correlation influence the model performance? Some clarification or analysis on this point would strengthen the interpretation.
- In response to this comment, we assessed the correlation between BTEX and tVOC in the reference data using the Pearson correlation coefficient (R) and did not find a strong correlation between the two (R = 0.54). For comparison, the correlation between BTEX and NO2 was also 0.54, and the correlation between TVOC and NO2 was 0.61. Some correlation is expected because all pollutant concentrations are influenced by diurnal trends in dispersion related to boundary layer height.
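The correlation check described in this reply can be reproduced in a few lines. The arrays below are hypothetical time-aligned placeholders, not the study's reference data.

```python
import numpy as np

# Hypothetical time-aligned reference concentrations (ppb)
btex = np.array([1.2, 0.8, 2.5, 3.1, 1.0, 4.2, 2.0, 1.5])
tvoc = np.array([40.0, 35.0, 90.0, 70.0, 50.0, 120.0, 60.0, 45.0])

# Pearson correlation coefficient from the 2x2 correlation matrix
r = np.corrcoef(btex, tvoc)[0, 1]
```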
3. Providing a time-series plot showing how sensor readings and reference instrument measurements evolve over time would be necessary. Including an example of how training and testing datasets are selected, especially how temporal blocks are separated, would help demonstrate that autocorrelation issues are addressed and that overfitting is avoided.
- We added an example test/train split time series in Figure 4. We appreciate this suggestion as it provides an opportunity to clarify the test/train split process and to show how the reference data and secondary standard evolve over time, especially with seasonal differences. For example, BTEX and TVOC concentrations tended to be higher in the middle of the study period and lower at the start and end. Figure 5 shows how the field pods evolve over time by comparing the pre- and post-harmonization results. We emphasize here that Figure 4 shows only one test/train split out of 20. We used 20 runs to assess the robustness of the models. In each run, the starting point of the first test section was randomly selected, and the second section always started half the total time series length away from the first section to ensure the sections cover different conditions. This information is provided in the second paragraph of Section 2.3.
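A minimal sketch of the two-section test/train split described in this reply follows. Only the random start of the first section and the half-series offset of the second come from the reply; the series length, section length, and the wrap-around at the end of the series are our assumptions for illustration.

```python
import numpy as np

def two_section_test_indices(n, section_len, rng):
    """Pick two test sections: one at a random start, and a second offset
    by half the series length (wrapping past the end of the series)."""
    start1 = rng.integers(0, n)
    start2 = (start1 + n // 2) % n
    idx1 = np.arange(start1, start1 + section_len) % n
    idx2 = np.arange(start2, start2 + section_len) % n
    return np.concatenate([idx1, idx2])

rng = np.random.default_rng(42)
n = 1000           # hypothetical number of time steps
section_len = 100  # hypothetical test-section length (10% each)
test_idx = two_section_test_indices(n, section_len, rng)
train_idx = np.setdiff1d(np.arange(n), test_idx)  # remaining points train
```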
4. Is it feasible to report the importance of variables (mainly raw sensor values used as input) for each model? For Lasso, coefficients provide direct interpretability, but for other models (e.g., tree-based or ensemble methods), presenting variable-importance metrics would enhance transparency and allow readers to better understand the drivers behind model predictions.
- The reviewer’s suggestion to provide information on feature importance is helpful to clarify which sensors drive calibration model predictions. We originally only shared which features Lasso removed from the model, which can help to show which features are unimportant but does little to explain which are important out of those left in the model. Thus, we added a figure that displays the feature importance for the random forest, gradient boosting, and extreme gradient boosting models. This plot reveals that the importance of features is generally consistent across the ensemble decision tree-based methods. Discussion of this plot was added to Section 3.3.
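For readers unfamiliar with how tree-ensemble importances are obtained, a short sketch follows. The synthetic data and the feature names (`mox_signal`, `temperature`, `humidity`) are hypothetical stand-ins, not the study's actual inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical predictors: raw sensor signal, temperature, humidity
X = rng.normal(size=(500, 3))
# Target dominated by the first predictor, weakly tied to the second
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; the dominant predictor ranks first
names = ["mox_signal", "temperature", "humidity"]
importances = dict(zip(names, model.feature_importances_))
```

The same `feature_importances_` attribute exists on scikit-learn's gradient boosting estimator, so the comparison across ensemble models is straightforward.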
Citation: https://doi.org/10.5194/egusphere-2025-4697-AC1
RC2: 'Comment on egusphere-2025-4697', Anonymous Referee #2, 05 Mar 2026
The manuscript presents a practical study of the calibration of low-cost sensors for TVOC, BTEX, and NO2. I believe the calibration of TVOC and BTEX is worth exploring given their complex nature, impacts on human health and atmospheric chemistry, and rarity in low-cost sensing studies. I have a few comments for the authors to consider:
1. In the abstract, it is not clear what you mean by referring to a 'two-step colocation strategy'. It is better to briefly introduce its principles in one or two sentences for non-experts.
2. Line 66, I believe the harmonization is to address 'inter-sensor variability' (between different sensors) instead of 'intra-sensor variability' (deterioration of the same sensor due to drifting and ageing).
3. In the calibration models for TVOC and BTEX you include metal oxide signals. Is metal oxide also measured by your low-cost sensors, or is it a reading from the reference station? If it is from the reference station, how do you obtain concurrent and colocated metal oxide levels?
4. It seems you do not consider cross-sensitivity for NO2 sensors, despite the fact that there is known cross-sensitivity between NO2, NO, and Ozone signals. For TVOC and BTEX you only considered metal oxide for cross-sensitivity. Please discuss if there are other known cross-sensitivity sources and the potential uncertainty if omitted.
5. You tested several calibration algorithms. But due to the small number of explanatory variables, some more complicated models (GB, XGB, ANN...) perform no better than simple linear regression.
6. I think R2 is a very important performance metric to show in the main manuscript, as it is critical to at least get the correct trends from low-cost sensors. The calibration performance of a low-cost sensor is, after all, built on its reliability in measurement.
Citation: https://doi.org/10.5194/egusphere-2025-4697-RC2
AC2: 'Reply on RC2', Caroline Frischmon, 26 Mar 2026
We appreciate the reviewer’s comments and added clarification to the manuscript based on their feedback.
- In the abstract, it is not clear what you mean by referring to a 'two-step colocation strategy'. It is better to briefly introduce its principles in one or two sentences for non-experts.
- We appreciate this suggestion and added a brief description of the two-step colocation to the abstract.
- Line 66, I believe the harmonization is to address 'inter-sensor variability' (between different sensors) instead of 'intra-sensor variability' (deterioration of the same sensor due to drifting and ageing).
- Thank you for catching this! We changed every mention of intra-sensor variability to inter-sensor variability in the manuscript.
- In the calibration models for TVOC and BTEX you include metal oxide signals. Is metal oxide also measured by your low-cost sensors, or is it a reading from the reference station? If it is from the reference station, how do you obtain concurrent and colocated metal oxide levels?
- Metal oxide here refers to the type of TVOC and BTEX sensors used (“metal oxide VOC sensors”). Our original wording was confusing about this, so we added clarification to lines 109 and 165.
- It seems you do not consider cross-sensitivity for NO2 sensors, despite the fact that there is known cross-sensitivity between NO2, NO, and Ozone signals. For TVOC and BTEX you only considered metal oxide for cross-sensitivity. Please discuss if there are other known cross-sensitivity sources and the potential uncertainty if omitted.
- Colocating with a reference instrument under ambient conditions helps to correct for sensor cross-sensitivity to any non-target pollutants (including NO, O3, etc.) because the models account for any sensor response occurring under ambient conditions. This is mentioned in lines 27-29. Thus, we do not need to include these pollutant sensors in our models. (Also see the note above for the metal oxide clarification.)
- You tested several calibration algorithms. But due to the small number of explanatory variables, some more complicated models (GB, XGB, ANN...) perform no better than simple linear regression.
- In some cases, yes, though the linear regression performed worse for NO2. This may also indicate that the BTEX and TVOC sensors respond more linearly to the target pollutant than the NO2 sensor.
- I think R2 is a very important performance metric to show in the main manuscript, as it is critical to at least get the correct trends from low-cost sensors. The calibration performance of a low-cost sensor is, after all, built on its reliability in measurement.
- We did not include R2 within the performance metric figures in the paper because splitting R2 by percentile group does not provide useful analysis, since the range of values used to calculate R2 changes within each group. Thus, these values were not easy to fit into the performance metric figures. To avoid confusing the reader, we instead put the R2 figures into the supplement. Based on the reviewer’s feedback here, we added emphasis to the R2 values within each pollutant results section (Lines 286, 310, and 336).
Citation: https://doi.org/10.5194/egusphere-2025-4697-AC2
Overall, the manuscript is well-organized and clearly presented, with effective illustrations and figures. I would offer the following minor comments for further discussion and consideration: