Improving the quantification of peak concentrations for air quality sensors via data weighting

Frischmon, Caroline; Silberstein, Jon; Guth, Annamarie; Mattson, Erick; Porter, Jack; Hannigan, Michael

doi:10.5194/egusphere-2024-4080

Preprints

https://doi.org/10.5194/egusphere-2024-4080

22 Jan 2025

Abstract. Traditional calibration models for low-cost air quality sensors have demonstrated a tendency to under-predict peak concentrations. We assessed the utility of adding data weights to low-cost sensor colocation data to improve the quantification of peak concentrations. Specifically, we explored the effects of data weighting on three different pollutant colocation datasets: total volatile organic compounds, carbon monoxide, and methane. Leveraging two different weighting functions, a sigmoidal and piecewise weighting regime, we explored the impacts of the base model choice (multilinear regression vs random forest models), the sensitivity of weighting functions, and the ability of data weighting to improve high-concentration pollution measurements. When compared to unweighted colocation data, we demonstrate significant reductions in both error (root mean square error-RMSE) and bias (mean bias error-MBE) for pollutant peaks across all three datasets when data weighting is employed. For the top percentile of data, we observe an average of 23 % reduction in RMSE and a 35 % reduction in MBE when optimal weights are employed. More significant reductions occurred in the 95–99th percentile of data, where MBE was reduced by an average of 70 %. RMSE in the 95–99th percentile was reduced by an average of 26 %. However, data weighting can also generate larger errors at baseline pollutant concentrations. Data weighting regimes were sensitive to input parameters, and input weighting functions may be tuned to better predict peak concentration data without significant reductions in the fidelity of baseline pollutant predictions.
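A minimal sketch of how the two weighting regimes might be applied to a multilinear regression. The functional forms and all parameter names (`midpoint_z`, `steepness`, `max_weight`, `percentile`, `high_weight`) are assumptions made for illustration, since the abstract does not give the authors' exact definitions. The data are synthetic, and the metrics are computed on the same data used for fitting, purely to show the mechanics.

```python
import numpy as np

def sigmoid_weights(y, midpoint_z=3.0, steepness=1.0, max_weight=10.0):
    """Assumed form: weights rise smoothly from 1 toward max_weight as the
    reference concentration exceeds the mean by midpoint_z standard deviations."""
    z = (y - y.mean()) / y.std()
    return 1.0 + (max_weight - 1.0) / (1.0 + np.exp(-steepness * (z - midpoint_z)))

def piecewise_weights(y, percentile=95.0, high_weight=10.0):
    """Assumed form: weight high_weight at or above the given percentile, else 1."""
    threshold = np.percentile(y, percentile)
    return np.where(y >= threshold, high_weight, 1.0)

def fit_weighted_mlr(X, y, w):
    """Weighted least squares: minimise sum(w * (y - A @ beta) ** 2)."""
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mbe(y_true, y_pred):
    return float(np.mean(y_pred - y_true))

# Synthetic colocation record: a stable baseline with rare, large peaks.
rng = np.random.default_rng(0)
n = 2000
y_ref = 2.0 + rng.exponential(0.2, n)
is_peak = rng.random(n) < 0.03
y_ref[is_peak] += rng.uniform(5.0, 20.0, is_peak.sum())

# Hypothetical sensor predictors: a saturating signal plus temperature.
signal = np.log1p(y_ref) + rng.normal(0.0, 0.02, n)
temperature = rng.normal(20.0, 5.0, n)
X = np.column_stack([signal, temperature])

schemes = {
    "unweighted": np.ones(n),
    "sigmoidal": sigmoid_weights(y_ref),
    "piecewise": piecewise_weights(y_ref),
}

# Compare error and bias within the top percentile of reference data.
top = y_ref >= np.percentile(y_ref, 99)
results = {}
for name, w in schemes.items():
    pred = predict(fit_weighted_mlr(X, y_ref, w), X)
    results[name] = (rmse(y_ref[top], pred[top]), mbe(y_ref[top], pred[top]))
    print(f"{name:>10}: top-1% RMSE={results[name][0]:.2f}, MBE={results[name][1]:.2f}")
```

The same weight vectors could be passed to any estimator that accepts per-sample weights. Tuning the assumed parameters trades peak accuracy against baseline accuracy, which is the sensitivity the abstract describes.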

Received: 23 Dec 2024 – Discussion started: 22 Jan 2025

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

15 Jul 2025

Improving the quantification of peak concentrations for air quality sensors via data weighting

Caroline Frischmon, Jonathan Silberstein, Annamarie Guth, Erick Mattson, Jack Porter, and Michael Hannigan

Atmos. Meas. Tech., 18, 3147–3159, https://doi.org/10.5194/amt-18-3147-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-4080', Anonymous Referee #1, 17 Feb 2025

I appreciate the work in this paper as it provides guidance for other users of sensors seeking to measure pollutants that do not vary much diurnally and present as rare, intermittent transient events with the vast majority of data being at some baseline level. Overall it is a good study that highlights the importance of weighting collocation data for such pollutants.
Comments:

1. In section 2.1 there is no mention of the TVOC sensors make/models used.

2. I noticed throughout most of the paper and figures, sigmoidal weighting is discussed/appears before piecewise weighting, so consider switching the order of Sections 2.4.1 and 2.4.2 to be consistent.

3. Both sections 2.4.1 and 2.4.2 use an "X" variable to describe either a percentile or an offset, which is confusing. Consider using a different variable other than "X" to describe one of those. In addition, Section 2.6 uses lowercase "x" instead of uppercase "X" by mistake.

4. In section 3, the subsection numbering is unusual (e.g. 3.0.1 rather than 3.1 or 3.1.1); consider using nonzero subsection numbering.

5. In general I find Figs. 4-9 not easy to decipher.

For the sensitivity plots, perhaps there are too many weighting parameters shown on the same plot, but I find it difficult to see which weighting parameters are performing best in order to connect it with the in-text statements of which weighting parameters were further explored (e.g., "Therefore, we chose to further analyze the sigmoidal z_sigmoid=3 and percentile_piecewise=95th").

For the timeseries/scatterplots, I likewise am having trouble distinguishing whether the sigmoidal or piecewise scatterplots are hugging the 1-1 line closer. I think there could be some refinement of Figs 4-9 to help make it clearer how the reader can also arrive at the in-text conclusions.

Citation: https://doi.org/10.5194/egusphere-2024-4080-RC1
- AC2:
  'Reply on RC1', Caroline Frischmon, 25 Mar 2025
  We appreciate the reviewer’s feedback on our study. In particular, we found their definition of the pollutants as those that “do not vary much diurnally and present as rare, intermittent transient events with the vast majority of data being at some baseline level” especially helpful and have added part of this definition to the abstract.
  Below are our responses to the reviewer’s specific comments.
  
  1. In section 2.1 there is no mention of the TVOC sensors make/models used.
  VOC measurements were collected via Figaro metal oxide sensors, which are now listed in lines 84-85 of the text.
  
  2. I noticed throughout most of the paper and figures, sigmoidal weighting is discussed/appears before piecewise weighting, so consider switching the order of Sections 2.4.1 and 2.4.2 to be consistent.
  Thank you for this suggestion. We have switched the order for consistency.
  
  3. Both sections 2.4.1 and 2.4.2 use an "X" variable to describe either a percentile or an offset, which is confusing. Consider using a different variable other than "X" to describe one of those. In addition, Section 2.6 uses lowercase "x" instead of uppercase "X" by mistake.
  
  We changed the percentile to P and fixed the lowercase “x” in Section 2.6. Thank you!
  
  4. In section 3, the subsection numbering is unusual (e.g. 3.0.1 rather than 3.1 or 3.1.1); consider using nonzero subsection numbering.
  
  We appreciate the referee bringing this to our attention and have changed the numbering to 3.1.
  
  5. In general I find Figs. 4-9 not easy to decipher.
  
  For the sensitivity plots, perhaps there are too many weighting parameters shown on the same plot, but I find it difficult to see which weighting parameters are performing best in order to connect it with the in-text statements of which weighting parameters were further explored (e.g., "Therefore, we chose to further analyze the sigmoidal z_sigmoid=3 and percentile_piecewise=95th").
  
  For the timeseries/scatterplots, I likewise am having trouble distinguishing whether the sigmoidal or piecewise scatterplots are hugging the 1-1 line closer. I think there could be some refinement of Figs 4-9 to help make it clearer how the reader can also arrive at the in-text conclusions.
  
  Thank you for pointing this out. We have updated Figures 4, 6, and 8 to only include the best model type (MLR or RF) for each pollutant to make these plots easier to read. Plots for both MLR and RF are now available in the supplemental information. For Figures 5, 7, and 9, we added some information to the figure captions to help readers understand the conclusions that can be drawn from these plots. Regarding the comparison between sigmoidal and piecewise for the 1-to-1 line, we agree that it is not easy to distinguish the difference between each weighting scheme here. This is why we chose to compare sigmoidal and piecewise using Figure 10 rather than Figures 5, 7, and 9. The only exception to this is the apparent shifted baseline for CO using sigmoid weighting.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4080-AC2
RC2:
'Comment on egusphere-2024-4080', Anonymous Referee #2, 03 Mar 2025
This is a helpful paper looking at different calibration weighting schemes for CH4, CO, and TVOC. It could be strengthened by adding figures that allow for an easier takeaway of the main findings of the paper. The current figures are very complicated and may be better suited to the SI, since it is hard to look at them and understand which sensor performs best. The findings will be helpful for a variety of sensor users.
What TVOC sensor was used?

It is hard to look at figures 4 and 6 and understand which performs best

It would be helpful to understand what the concentrations are associated with the data percentiles.

Testing/training is not described in the methods

Line 231 “?” in the citation
Citation: https://doi.org/10.5194/egusphere-2024-4080-RC2
- AC1:
  'Reply on RC2', Caroline Frischmon, 25 Mar 2025
  We are grateful for the feedback received from the reviewer and have responded to specific comments below. In order to improve the clarity of our figures, we added some context to the figure captions and moved some figure details to the supplementary information.
  What TVOC sensor was used?
  VOC measurements were collected via Figaro metal oxide sensors, which are now listed in lines 84-85 of the text.
  
  It is hard to look at figures 4 and 6 and understand which performs best
  Thank you for pointing this out. We have updated the plots to only include the best model type (MLR or RF) for each pollutant to make these plots easier to read. Plots for both MLR and RF are now available in the supplemental information.
  
  It would be helpful to understand what the concentrations are associated with the data percentiles.
  Thank you for the suggestion. This information was added to the captions of Figures 4, 6, and 8.
  
  Testing/training is not described in the methods
  We appreciate the referee bringing this gap to our attention. We added the information shared below on testing/training to lines 144-146.
  "For CO and CH4, the first and last ten percent of data was excluded from model training to test the models' ability to predict concentrations under unseen conditions. This data is henceforth referred to as testing data. For the TVOC dataset, the last 20 percent was used as testing data to achieve more peaks within the testing dataset."
  
  Line 231 “?” in the citation
  We have corrected this citation issue. Thank you.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4080-AC1
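
The testing/training split quoted in the reply to RC2 can be sketched as a boolean mask over a time-ordered record. This is an illustrative reading of the quoted text, not the authors' code: the function name `edge_holdout_mask`, its parameters, and the record length of 1000 samples are all assumptions.

```python
import numpy as np

def edge_holdout_mask(n_samples, head_frac=0.10, tail_frac=0.10):
    """Return a boolean mask that is True for rows held out as testing data
    at the start and end of a time-ordered record (hypothetical helper)."""
    n_head = int(round(n_samples * head_frac))
    n_tail = int(round(n_samples * tail_frac))
    mask = np.zeros(n_samples, dtype=bool)
    mask[:n_head] = True                 # first head_frac of the record
    mask[n_samples - n_tail:] = True     # last tail_frac of the record
    return mask

n_samples = 1000  # assumed record length, for illustration only

# CO and CH4: first and last ten percent held out for testing.
co_test = edge_holdout_mask(n_samples, head_frac=0.10, tail_frac=0.10)

# TVOC: only the last 20 percent held out for testing.
tvoc_test = edge_holdout_mask(n_samples, head_frac=0.0, tail_frac=0.20)

# Training rows are simply the complement of each testing mask.
co_train = ~co_test
tvoc_train = ~tvoc_test
```

Holding out contiguous blocks at the edges of the record, rather than random rows, keeps the testing data separated in time from the training data, which matches the stated aim of predicting under unseen conditions.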

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Caroline Frischmon on behalf of the Authors (25 Mar 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (30 Mar 2025) by Albert Presto

RR by Anonymous Referee #1 (12 Apr 2025)

ED: Publish subject to technical corrections (27 Apr 2025) by Albert Presto

AR by Caroline Frischmon on behalf of the Authors (27 Apr 2025) Manuscript

Supplement

https://doi.org/10.5194/egusphere-2024-4080-supplement

Viewed

Total article views: 318 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
241	57	20	318	36	15	32

Views and downloads (calculated since 22 Jan 2025)

Month	HTML	PDF	XML	Total
Jan 2025	47	12	5	64
Feb 2025	45	7	2	54
Mar 2025	47	11	3	61
Apr 2025	29	7	2	38
May 2025	21	2	3	26
Jun 2025	39	15	5	59
Jul 2025	13	2	0	15
Aug 2025	0
Sep 2025	0
Oct 2025	0
Nov 2025	0
Dec 2025	1	0	1
Jan 2026	0

Viewed (geographical distribution)

Total article views: 329 (including HTML, PDF, and XML). Of these, 329 have a defined geography and 0 are of unknown origin.

Latest update: 06 Jan 2026

Short summary

Air quality sensors often under-predict peak concentrations, which is a major issue in applications such as emissions event detection. This manuscript details a novel approach involving data weighting to improve quantification of these peak concentrations. To demonstrate its effectiveness, we applied data weighting to carbon monoxide, methane, and VOC sensor data. This work broadens our ability to use air sensors in contexts where accurate quantification of peak concentrations is essential.

