This work is distributed under the Creative Commons Attribution 4.0 License.
Machine Learning Calibration of Low-Cost Air Quality Gas Sensors
Abstract. Low-cost sensors (LCSs) for measuring the concentrations of gaseous pollutants hold great promise for air quality monitoring (AQM), as they can improve the spatio-temporal resolution of observational networks. However, the performance of LCSs is affected by a number of factors, including the temperature and relative humidity of ambient air, as well as cross-sensitivities with gaseous species other than the target gas, thereby deteriorating the quality of their measurements. To address these issues, data from LCSs can be calibrated against reference instruments using machine learning (ML) algorithms. Here, we have evaluated the performance of a number of ML algorithms for calibrating measurements from CO, NO2, O3 and SO2 LCSs against respective reference measurements. The best model is then used to determine (1) the influence of the temporal resolution of the measurements on the calibration performance, (2) the minimum fraction of data needed for model training while maintaining the quality of calibrated measurements within acceptable levels, and (3) the ideal frequency of calibration with collocated reference measurements. We found that the quality of LCS measurements improves significantly for all sensors after ML calibration, with Random Forest (RF) being the best-performing algorithm, corroborating previous work. By varying the temporal resolution of the training data from 1 h to 2 min, the performance of the RF model in terms of the normalized root mean squared error and the relative expanded uncertainty calculated at the maximum observed concentration improves by 11–21 %. The results also suggest that the minimum fraction of data required for training the ML models depends on the frequency of carrying out collocated measurements with reference instruments and using the resulting datasets for training the calibration model. If the calibrations are carried out on a monthly basis, ca. 50 % of the period is needed for collecting data to train the RF algorithm and qualify the LCSs for indicative measurements as defined by the EU directive (2008/50/EC). If the training is carried out every 3 or 6 months by sampling the training data continuously, then ca. 60 % of the measuring period is required for collecting training data. In those cases, if the sampling of the training data is made over specific periods every month, but the entire training dataset is used to calibrate the measurements over 3 or 6 months, the amount of data required for qualifying the LCSs for indicative measurements can be significantly reduced, to 22 %. However, this would require that the measurements from the LCSs be calibrated retrospectively, which is not a problem for certain applications.
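The calibration workflow summarized in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline or dataset: a raw sensor signal with assumed temperature and humidity interference is calibrated against a "reference" with a Random Forest, and the normalized root mean squared error (nRMSE) of the calibrated output is reported.

```python
# Hedged sketch: RF calibration of a synthetic low-cost-sensor signal
# against a synthetic reference, reporting nRMSE on a held-out period.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 2000
temp = rng.uniform(5, 35, n)      # ambient temperature (deg C), assumed
rh = rng.uniform(20, 90, n)       # relative humidity (%), assumed
ref = rng.uniform(0, 100, n)      # "true" reference concentration (ppb)
# raw sensor output: target gas plus T/RH interference and noise (assumed form)
raw = 0.8 * ref + 0.5 * temp - 0.2 * rh + rng.normal(0, 3, n)

X = np.column_stack([raw, temp, rh])
split = n // 2                    # first half of the record used for training
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], ref[:split])
pred = model.predict(X[split:])

rmse = mean_squared_error(ref[split:], pred) ** 0.5
nrmse = rmse / (ref[split:].max() - ref[split:].min())
print(f"nRMSE = {nrmse:.3f}")
```

The feature set (raw signal, temperature, humidity) is an assumption for illustration; the paper's models may use additional predictors such as cross-sensitive gas channels.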
Status: final response (author comments only)
- CC1: 'Comment on egusphere-2026-897', Ronald Cohen, 01 Apr 2026
- RC1: 'Comment on egusphere-2026-897', Anonymous Referee #1, 21 Apr 2026
# General comments
Ioannidis et al. explore calibration approaches for low-cost gas sensors for environmental and air quality monitoring. A six-month dataset including different types of Alphasense gas sensors in Nicosia, Cyprus was used to explore various machine learning algorithms to calibrate the sensors' observations with co-located reference monitors which were considered the ground truth. The results show that the calibration models improved the measurement performance of the sensors, the random forest algorithm was the most performant algorithm, and recommendations are given for what aggregation periods, the fraction of training/testing sets, and the frequency of calibration are best for these types of sensors.
The manuscript is well written, is clear, and the figures are presented well. However, I have concerns about the novelty and generality of the work presented. The calibration of low-cost sensors for air quality monitoring is a well-researched topic, and Ioannidis et al. do not provide new or solid insights. The authors touch on the poor measurement performance of low-cost sensors being driven by manufacturers focusing their calibration procedures on laboratory testing, rather than the variability that is experienced in environmental monitoring applications. Although this is undoubtedly true, there are two other fundamental issues with low-cost sensors and their measurement performance that need to be solved: (i) many of these gas sensors respond to other environmental variables much more than the target gas, i.e., they are very unspecific. Figure 6 demonstrates this point very well. (ii) The calibration of low-cost sensors is unstable over time and this needs very careful management. The latter point is only discussed indirectly. Therefore, we have a situation where a gas sensor is unspecific to the measurand and the calibration is unstable.
These two features of low-cost sensors put into question the portability of the calibration strategy applied in Ioannidis et al. I would argue that the splitting of a dataset into training and testing sets is a mechanism to train and test a model, but a third validation set is required to truly determine if any given trained model is useful for a monitoring application. The training and testing sets come from the same calibration space, but a validation period should be outside that space, generally in the future from the model's perspective, and be tested with truly parallel observations of the ground truth. This situation gives rise to "true" field performance as could be expected if a low-cost sensor were deployed to a field location without a reference monitor, exactly the point of these devices.
I would recommend that the authors apply their testing to a validation set for more robust performance metrics and further evaluate the fundamental deficiencies of low-cost sensors in their interpretation of their results.
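The reviewer's suggestion can be sketched in code. This is a hypothetical illustration with synthetic data, not the authors' dataset: a chronologically later validation window is held out in addition to the usual train/test split, with a slow sensor drift included to mimic the calibration instability the reviewer describes.

```python
# Hedged sketch: chronological train | test | future-validation split,
# with an assumed slow sensor drift to mimic deployment conditions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 3000
t = np.arange(n)
temp = 20 + 10 * np.sin(2 * np.pi * t / 500) + rng.normal(0, 1, n)
ref = 50 + 20 * np.sin(2 * np.pi * t / 300) + rng.normal(0, 2, n)
drift = 0.002 * t                         # assumed slow sensor drift
raw = 0.8 * ref + 0.5 * temp + drift + rng.normal(0, 3, n)

X = np.column_stack([raw, temp])
i_tr, i_te = int(0.5 * n), int(0.7 * n)   # 50 % train, 20 % test, 30 % validation
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:i_tr], ref[:i_tr])

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_test = rmse(ref[i_tr:i_te], model.predict(X[i_tr:i_te]))
rmse_val = rmse(ref[i_te:], model.predict(X[i_te:]))
print(rmse_test, rmse_val)  # drift tends to degrade the later validation window
```

Because the validation window lies furthest from the training period, accumulated drift typically inflates its error relative to the adjacent test window, which is exactly the "true field performance" gap the reviewer is asking the authors to quantify.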
# Itemised comments
- Line 72. Linear regression is probably not considered a machine learning model
- Line 104. Is this a good idea? If there is uncertainty in the positive direction, there should be an uncertainty in the negative direction to avoid a bias.
- Line 148. Only one of r or R2 probably needs to be used; they mostly overlap and show the same thing
- Line 200 onward. Many sub- and super-scripts are missing in the text
- Line 221. When `n` is so high, statistical significance between two groups will almost always be detected, so this might not be a very useful test
- Line 242. The SO2 sensor does not seem suitable for ambient SO2 monitoring
- Line 245-ish. Displaying time series would be useful too. Maybe a clear single example could be added to the text?
- Line 260-ish. Do any of the models require much more training and/or prediction time than others? Is time or computation resource requirements a practical issue for this dataset?
- Line 356. This is a key point that could be tested with a validation dataset.
Citation: https://doi.org/10.5194/egusphere-2026-897-RC1
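The Line 221 comment above, about large `n` making significance tests uninformative, can be demonstrated with a short synthetic example (hypothetical data, for illustration only): with tens of thousands of samples, even a practically negligible shift between two groups yields a tiny p-value, so an effect size such as Cohen's d is more informative.

```python
# Hedged sketch: with large n, a trivially small group difference is
# "statistically significant"; the effect size reveals it is negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.00, 1.0, 50_000)
b = rng.normal(0.03, 1.0, 50_000)   # a practically negligible shift

t_stat, p = stats.ttest_ind(a, b)
# Cohen's d with the pooled standard deviation
d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p:.2e}, Cohen's d = {d:.3f}")
```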
- RC2: 'Comment on egusphere-2026-897', Anonymous Referee #2, 03 May 2026
# Overall quality of the preprint - general comments
The overall impression of the paper is that it is very well-written and well-organised. The study is also interesting and, in my view, well worth publishing. However, before publication, the authors need to address the scientific questions and issues outlined below.
# Individual scientific questions/issues - specific comments
Issue #1: The calibrations have been carried out only over a 6-month winter period. It would have been a strength if they had covered a full year, including a summer period. It would be helpful if the authors could include a brief description of the potential challenges of including a summer period and how they believe it could have affected the results of the paper.
Issue #2: The interpretation of p-values in relation to the Shapiro-Wilk test for normality is incorrect. A low p-value indicates rejection of the null hypothesis that the data are normally distributed, thereby indicating non-normality, which is to be expected. Thus, further tests of differences in means and variances should use statistical methods that do not assume normality.
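Issue #2 can be illustrated with a short sketch on synthetic skewed data (not the paper's dataset): a small Shapiro-Wilk p-value rejects the null hypothesis of normality, so a subsequent comparison of the two groups should use a test that does not assume normality, such as Mann-Whitney U.

```python
# Hedged sketch: Shapiro-Wilk rejects normality for skewed data,
# motivating a non-parametric two-sample comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # skewed, like pollutant data
y = rng.lognormal(mean=1.2, sigma=0.8, size=500)

w, p_norm = stats.shapiro(x)
print(f"Shapiro-Wilk p = {p_norm:.1e}")  # small p -> reject normality

u, p_mw = stats.mannwhitneyu(x, y)       # no normality assumption
print(f"Mann-Whitney p = {p_mw:.1e}")
```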
Issue #3: Referring to Fig. 4, XGBoost is actually slightly better than RF for all compounds except SO2, yet you describe RF as the best model. RF is indeed very close, so I guess it is quite OK to use it in the rest of the paper. However, you need to explain why you have chosen it over XGBoost. Are there other arguments, such as ease-of-use or robustness, for example?
Issue #4: SO2 is often below the detection limit. Perhaps you should consider removing it from the paper?
# List of technical corrections - typing errors
- 145: Cal and Ref should have bars above them to indicate averages.
- 239: Perhaps replace the term validation here with testing?
- 252: Referring to Fig. 3, perhaps use LAB-calibrated on the y-axis in the left column of plots and ML-calibrated on the y-axis of the right matrix of plots.
- 358: Referring to Fig. 8, the sentence should read: At least 80% for CO, 70% for NO2 and 50% for O3.
- 370: Referring to Fig. 9, the sentence should read: 22% for CO and O3 and 30% for NO2.
- 392: 22% should be replaced by 22% for CO and O3 and 30% for NO2.
- 497: Replace “measure- ment” with “measurement” and use the following for reference to the report: (NILU rapport 1/2020, in Norwegian), NILU, https://nilu.no/publikasjon/1809472/, 2020.
Citation: https://doi.org/10.5194/egusphere-2026-897-RC2
Data sets
Machine Learning Calibration of Low-Cost Air Quality Gas Sensors - Data Giannis Ioannidis https://doi.org/10.5281/zenodo.18629746
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 310 | 104 | 23 | 437 | 41 | 14 | 18 |
The two papers below provide performance metrics relevant to the interpretation described in this paper and show that excellent calibration metrics can be achieved with a physically interpretable model.
A.R. Winter, Y. Zhu, N.G. Asimow, M.Y. Patel, and R.C. Cohen, Sustained Performance of Low-Cost Air Quality Sensors in Long-Term Deployments, ACS Sensors, https://doi.org/10.1021/acssensors.5c00566, 2025.
A.R. Winter, Y. Zhu, N.G. Asimow, M.Y. Patel, and R.C. Cohen, A Scalable Calibration Method for Enhanced Accuracy in Dense Air Quality Monitoring Networks, ES&T, https://doi.org/10.1021/acs.est.4c08855, 2025. See Table 4 for a more comprehensive list of others who have worked on the issue of LCS performance metrics.