Recalibration of low-cost air pollution sensors: Is it worth it?
Abstract. The appropriate period of collocation of a low-cost air sensor (LCS) with reference measurements is often unknown. Previous low-cost air sensor studies have shown that, due to sensor ageing and the seasonality of environmental interferences, periodic sensor calibration needs to be performed to guarantee sufficient data quality. While these limitations are well established, it is still unclear how often a recalibration of a sensor needs to be carried out. In this study, we aim to demonstrate how frequently widely used air sensors from two manufacturers (Alphasense and Sensirion) for the relevant air pollutants O3 and PM2.5 should be recalibrated. Sensor calibration functions were built using Multiple Linear Regression, Ridge Regression, Random Forest and Extreme Gradient Boosting. We use state-of-the-art test protocols for air sensors provided by the United States Environmental Protection Agency (EPA) and the European Committee for Standardization (CEN) for evaluative guidance. We conducted a yearlong collocation campaign at an urban background air and climate monitoring station next to the University Hospital Augsburg, Germany. The LCS were exposed to a wide range of environmental conditions, with air temperatures between -10 and 36 °C, relative air humidity between 19 and 96 % and air pressure between 937 and 983 hPa. The ambient concentrations of O3 and PM2.5 reached up to 83 ppb and 153 µg m-3, respectively. For the baseline single training period of 5 months, the calibrated O3 and PM2.5 sensors were able to reflect the hourly reference data well during the training (R2: O3 = 0.92–1.00; PM2.5 = 0.93–0.98) and the following test period (R2: O3 = 0.93–0.97; PM2.5 = 0.84–0.93). Additionally, the sensor errors were generally acceptable during the training (RMSE: O3 = 0.80–4.35 ppb; PM2.5 = 1.45–2.51 µg m-3) and the following test period (RMSE: O3 = 3.62–5.84 ppb; PM2.5 = 2.04–3.02 µg m-3). By investigating different recalibration cycles using a pairwise calibration strategy, our results indicate that a regular in-season recalibration is required to obtain the highest quantitative validity for the analysed low-cost air sensors, with monthly recalibrations appearing to be the most suitable approach. In contrast, an extension of the training period for the calibration models had only a minor overall impact on improving the low-cost air sensors' ability to capture temporal variations in the observed O3 and PM2.5 concentrations. The measurement uncertainties of the calibrated O3 and PM2.5 LCS were able to meet the data quality objective (DQO) for indicative measurements for the different calibration models. Compared to a one-time pre-deployment sensor calibration, in-season recalibration can broaden the scope of application for an LCS (indicative measurements, objective estimation, non-regulatory supplemental and informational monitoring).
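For illustration only (this is not the authors' code), a minimal Python sketch of the general workflow summarised above: fitting a calibration model on an initial collocation period and evaluating it on the following test period with R2 and RMSE. The file name, column names and split dates are hypothetical placeholders.

```python
# Minimal sketch of a collocation-based LCS calibration, assuming hourly data
# in a CSV file; the file name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("collocation_hourly.csv", parse_dates=["time"], index_col="time")

features = ["o3_sensor_raw", "air_temperature", "relative_humidity"]  # hypothetical
target = "o3_reference"                                               # hypothetical

# Baseline setup from the abstract: a single 5-month training period,
# evaluated on the following months of the collocation year.
train = df.loc[:"2023-05-31"]
test = df.loc["2023-06-01":]

model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(train[features], train[target])

pred = model.predict(test[features])
print("R2  :", r2_score(test[target], pred))
print("RMSE:", mean_squared_error(test[target], pred) ** 0.5)
```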
Status: closed
RC1: 'Comment on egusphere-2025-2677', Laurent Spinelle, 18 Aug 2025
First of all, I would like to congratulate the authors for the work carried out and presented in this paper. After having read the full document, I am not sure that the conclusions or the study really answer the question asked in the title. In fact, the authors ask the question of the need for re-calibration of low-cost sensors, but they do not really answer it in the document, as they present an interesting use of sensors for ambient air monitoring (the "pairwise calibration strategy") based on a monthly exchange of LCS between a collocation site and a measurement site. This strategy, somewhat interesting when looking at the sensors' performances, is much more time-consuming than a classic network installation as, in the end, 2 LCS are always running, adding the necessity of installation/removal every month. However, the interesting comparison of calibration results using several training lengths against both US-EPA and European standards brings a lot of valuable information.
I also made some minor comments along the document, listed below:
- Line 153: length of this stabilization phase?
- Line 155: the comma could be removed.
- Line 157: The 3 of O3 should be in subscript.
- Line 165: Are the daily means for LCS based on the hourly values or on the raw values? The end of this paragraph suggests that the daily means have been calculated using hourly values. Did you check the impact on the data?
- Line 183: This PM sensor sentence seems to me not to be in the right paragraph, as the PM data have been discussed in the previous one.
- Lines 184-189: This explanation could maybe be moved after the first paragraph of 2.4, where the use of T and RH in the calibration models is explained. It was somewhat confusing to me to read first that the data from the BME280 were not used, and then to see that they are finally used. Only on a second read did I pay attention to the fact that the BME280 data were not used for the gas sensors.
- Table 1: the first row is not the easiest to read, in particular for O3 and NO2, as there is not a clear separation between the T (end of O3) and VNO2 (beginning of NO2).
- Line 218: what do you mean by merging the data by hour? Is it the mean calculation?
- Line 395: you should mention in the previous paragraph 2.7 (Performance metrics and target values) that the measurement, and thus the evaluation, has been carried out only for an urban background site, whereas the CEN document asks for different testing sites, for example a rural site for O3.
- Figures 8, 9, 10 and 11: I would advise the authors to write the titles of the different graphs in a clearer way; at first look, it is not easy to see the difference between each plot.
Citation: https://doi.org/10.5194/egusphere-2025-2677-RC1
- AC1: 'Reply on RC1', Paul Gäbel, 16 Sep 2025
RC2: 'Comment on egusphere-2025-2677', Anonymous Referee #2, 24 Oct 2025
This manuscript shows different options for the calibration of LCS, in particular for O3 and PM2.5. The goal is to show a tradeoff between the model accuracy obtained from an initial training dataset (in terms of duration) and recurrent recalibrations.
The discussion is interesting, and it is an open question. Notice that many issues have to be considered for this problem, with regard to the initial dataset (quality, range, duration, sampling frequency, locations for deployments), the models used for calibration (statistical ones or based on AI (machine learning, deep learning)) and the sensor types and features (gas, cross sensitivity, fabrication (electrochemical, Metal Oxide (MOX), NDIR and/or optical), aging effect), to name a few. Nevertheless, the authors focus on the sensors for O3 (Alphasense Ox-B431) and PM2.5 (Sensirion AG SPS30) and use 4 different models (MLR, RR, RF, XGB) for calibration.
Below are the suggested comments (C) to improve your manuscript:
C1.- The title should be clearer and more specific, including key words such as tradeoff, O3 and PM2.5.
C2.- The study is carried out with 2 sensors: O3 (Alphasense Ox-B431) and PM2.5 (Sensirion AG SPS30). The selection should be justified and motivated: why these ones? Are these the most common, the most reliable, the best price vs. quality ratio, etc.? The authors should provide a survey (a study of the state of the art) about this. This information is very useful for the reader.
In addition, in Section 2.1, the names of the sensors for O3 and PM2.5 and their abbreviations (AS-B431, SAG-SPS30) as well as their features should be placed in a table to ease reading.
C3.- The references are a bit confusing. I am not sure if it is the proper format and whether they are correctly compiled (not linked with the reference section). For instance, (Gäbel et al., 2022) cannot be found directly in the reference list, although with a double lookup you can assume that it refers to a paper in Sensors MDPI from the same authors.
Also, an update of these references is welcome, with more recent ones.
C4.- Figure 1 is a bit confusing. Maybe a flow diagram of the proposal of the manuscript (the tradeoff between training duration and recalibration) would be better.
C5.- In my opinion, the analysis of 2 different deployments (AELCM009 and AELCM010) is interesting, to see the behavior (variability) between the different sensors.
But the content of this manuscript could be improved in a more comprehensive way. It could be carried out by using the whole dataset and running on it the different variables of the tradeoff: x = duration of initial training, y = recalibration time. Based on (x, y) you can plot the different metrics (R2, RMSE, REU, …) or a cost function (this is mentioned later in C11) as a heatmap (in 3D plots), instead of using a fixed training of 5 months, with extended periods of 1 month, and with recalibration at different periods. A heatmap would make it easier to understand and to see the optimum, rather than Figures 2-4 and 5-7. Notice that these figures are ambiguous and unclear. Also, the caption is a bit redundant except for 1, 2 or 3 months.
Besides, it should be noted that datasets usually have a higher sampling frequency, typically 10 min (or even lower), rather than 1 hour. This should be explained. The sampling frequency could even be a new variable to be considered in the tradeoff, instead of 1 hour as default.
C6.- In Section 2.1, it would be nice to place some pictures of the boxes and the deployment, although you refer to them in your own reference (Gäbel et al., 2022).
C7.- Section 2.4 requires a better description and more detail of the models used. This can be summarized in a table with a short description and a reference. Additional information could be interesting, such as the library used, the hyperparameters used (if needed), whether there is overfitting in the machine learning models, etc.
In Table 1, the target (in features/target) is not necessary if it is the same as the model name (in each column). Also, for clarity it is recommended to show only the 2 models that you are using: O3 and PM2.5.
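For illustration only, a minimal sketch of how the four model types named in the manuscript (MLR, RR, RF, XGB) could be instantiated in Python; the libraries and hyperparameter values shown are assumptions for illustration, not the authors' actual configuration.

```python
# Hypothetical instantiation of the four calibration model types; libraries and
# hyperparameter values are illustrative assumptions, not the study's setup.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

models = {
    "MLR": LinearRegression(),
    "RR":  Ridge(alpha=1.0),
    "RF":  RandomForestRegressor(n_estimators=500, random_state=42),
    "XGB": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
}

# Each model could then be fitted on the same training split and compared with
# identical metrics (R2, RMSE) on the held-out test period.
```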
C8.- Abbreviations are repeated many times. As a general rule for abbreviations, define them once and use them always, except in the abstract.
Besides, a glossary at the end of the paper would be interesting.
C9.- In addition to Table 2 (with the stats of the dataset for 1 day), why do you not plot the stats for the whole period (1 year?) and/or plot their values over time?
Is 36 °C in Augsburg correct?
You could also include in Table 2 the same stats for all the features (variables) of your dataset (AEMSxx, Vxx).
C10.- Conclusions are too long. You could simplify them and add more relevant conclusions, since it is well known that with these LCS recalibration is always required.
Besides, both in the abstract and in the conclusions, you should highlight your contribution.
C11.- As mentioned before in C5, you could plot a heatmap; below are other suggestions to visualize the results:
- Error-vs-time curves: plot RMSE(t) for different recalibration strategies. This shows how quickly accuracy decays and how recalibration recovers it.
- Heatmap: x-axis = initial training duration (T₀), y-axis = recalibration interval (days), z = a metric (RMSE, R2, …). This visually shows regions where short initial training + frequent recalibration ≈ long initial training + infrequent recalibration (see the sketch after this list).
- Pareto frontier / cost-accuracy plot: x-axis = operational/calibration cost, y-axis = long-term mean RMSE. Mark strategies on the plot.
- Bar chart: number of recalibrations vs mean RMSE for each T₀.
- Time-to-failure distributions: for threshold-triggered policies, plot histogram of detection delays.
- Uncertainty band plots (error ± CI) to show statistical significance between strategies.
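For illustration, a minimal sketch (not the authors' code) of the heatmap suggested above, sweeping the initial training duration and the recalibration interval and colouring each cell by a metric such as RMSE; the placeholder values would in practice come from the recalibration experiments.

```python
# Sketch of the suggested training-duration x recalibration-interval heatmap.
import numpy as np
import matplotlib.pyplot as plt

train_months = [1, 2, 3, 4, 5]     # initial training duration T0 (months)
recal_days = [0, 30, 60, 90]       # recalibration interval (0 = no recalibration)

# Placeholder values; in practice each cell would hold the test-period RMSE
# obtained with that (training duration, recalibration interval) combination.
rng = np.random.default_rng(0)
rmse = rng.uniform(2.0, 6.0, size=(len(train_months), len(recal_days)))

fig, ax = plt.subplots()
im = ax.imshow(rmse, origin="lower", aspect="auto", cmap="viridis")
ax.set_xticks(range(len(recal_days)), labels=recal_days)
ax.set_yticks(range(len(train_months)), labels=train_months)
ax.set_xlabel("Recalibration interval (days)")
ax.set_ylabel("Initial training duration (months)")
fig.colorbar(im, ax=ax, label="RMSE (test period)")
plt.show()
```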
Citation: https://doi.org/10.5194/egusphere-2025-2677-RC2
- AC2: 'Reply on RC2', Paul Gäbel, 05 Dec 2025