the Creative Commons Attribution 4.0 License.
A tuneable framework for outlier detection in PM2.5 air sensor networks during wildland fire smoke events
Abstract. In recent years the use of air sensors has rapidly expanded across North America to measure fine particulate matter (PM2.5), particularly in response to increasing air quality impacts from wildland fire. With the benefit of enhanced spatial and temporal coverage, the scientific community and the public have come to rely on sensor networks as valuable sources of air quality information. With an increasing variety of sensor devices being deployed, there is a need to validate and harmonize PM2.5 data between different device types. While significant attention has been given to calibration and correction equations to improve the accuracy of a given sensor's measurement, there is a need to develop tractable and generalizable methods of identifying malfunctioning or unreliable sensors, given the maintenance, siting, and operation of many of these devices is unknown. In this paper, we propose a method of identifying outlier PM2.5 sensors, defined as those whose measurements deviate strongly from other local measurements due to hardware faults or to hyper-local environmental conditions that are not representative of typical ambient air quality conditions. While detecting outliers during typical conditions is a fairly straightforward task, detecting outliers during smoke events is challenging due to real, erratic shifts in PM2.5 concentrations. Here, we present a novel method of detecting outliers within sensor networks by combining measures from information theory and machine learning. We first define a tuneable, rule-based detection function that balances the Shannon entropy of a local network against the information content of an individual sensor's measurement. We then use this function, together with additional information-theoretic and short-term temporal features, to train a gradient-boosted decision tree for automated outlier detection. 
Hourly PM2.5 measurements from various device types were collected for 11 unique smoke events across North America in 2024 and 2025, and a stratified sample of sensor data were randomly perturbed to simulate 5 commonly seen faults. In each of these cases, we assessed each method's ability to detect the simulated faults. We demonstrate that either of these methods, while trained on a semi-synthetic dataset, can act as a useful data validation procedure when applied to both real-time air quality reporting and retrospective analysis.
Status: open (until 08 Jul 2026)
- RC1: 'Comment on egusphere-2026-1273', Anonymous Referee #1, 22 Jun 2026
- RC2: 'Comment on egusphere-2026-1273', Anonymous Referee #4, 25 Jun 2026
The authors present a framework for grouping and detecting sensor errors during wildfire smoke events using ideas from information theory. The work has important implications for detecting errors in real time for platforms such as AirNow and PurpleAir. However, it was difficult to evaluate this manuscript as it is not well organized and is bloated with figures and datasets that fall outside the typical paper organization (i.e., methods, results, discussion). There are interesting analyses in this manuscript, but a reorganization of the paper is needed so that the important takeaways can be understood more cogently. See comments below.
Intro:
"While many parts of North America, particularly large population centres, are covered by regulatory PM2.5 monitoring networks, these stations are sparse in smoke-prone regions." --> This probably needs a citation, perhaps use:
Kelp, M., S. Lin, J.N. Kutz, and L.J. Mickley (2022). A new approach for optimal placement of PM2.5 air quality sensors: case study for the contiguous United States, Env. Res. Letters, 17, 034034, DOI: 10.1088/1748-9326/ac548f.
Otherwise, doesn't the relatively long correlation length-scale of smoke and the fact that populated centers are monitored cover most of our bases in terms of population smoke exposure?
2nd paragraph: but doesn't deSouza et al. (2021) show that these low-cost sensors are deployed largely in less socioeconomically diverse locations? And perhaps in less optimal locations as well?
"Even if the device is performing as intended, publicly operated sensors may produce aberrant measurements if they are sited or used with the intent to capture hyper-local conditions of interest and therefore may not reflect ambient air quality conditions at large" --> I feel that this needs a citation, or perhaps I am not understanding what is meant. Does it matter what the sensor is intended to be used for? Is this public knowledge? If not, there are many studies that leverage PurpleAir or AirNow data; does that mean they may be flawed? Or perhaps this is a motivating point of your study? In any event, I think slightly more clarification is needed.
Methods:
2.1: would be helpful to have in the SI a list of the different types of sensors. Also, no discussion of differing signal-to-noise detection ability has been mentioned. Why are regulatory grade monitors treated the same as low cost sensors?
--> I see that they are discussed later; this could probably be mentioned here. But the latter part of my question remains.
2.2: "We define a smoke event as a geographic region impacted by elevated PM2.5 concentrations as a direct result from wildland fire. These events have a clear start and end time, delineated by lower, ambient PM2.5 concentrations. The sites used in this study were selected from known events that occurred during 2024 – 2025 that were measured by sufficient PM2.5 monitors and sensors to capture the spatial and temporal extent of the smoke. There were 10 geographic sites used, and 11 total events, with one geographic area used for both an event caused by regional wildfire (WF) smoke, and another event due to a local prescribed fire (Rx)."
--> This is not specific enough. Do you define the smoke event based on HRRR-Smoke? Based on NOAA HMS? Based on elevated CO from satellites?
There is a lot of literature on the difficulty of attributing PM specifically to smoke, due to the definition of the elevated smoke extent and to the background PM. I'm sure these events are 'known', but more detail is needed on how you chose them and delineated the times.
See: Liu, T., F.M. Panday, M.C. Caine, M. Kelp, D.C. Pendergrass, and L.J. Mickley (2024). Is the smoke aloft? Caveats regarding the use of the Hazard Mapping System (HMS) smoke product as a proxy for surface smoke presence across the United States, International Journal of Wildland Fire, 33, WF23148, DOI: 10.1071/WF23148.
Fig 1, Table 1: how is the extent determined in Fig 1, and what are the conditions of the Rx fire treatments? I downloaded the companion repository (Illson, 2026b) and it is fairly unstructured, with little documentation other than the data. I think there needs to be more information on these fires and the data you are using, even if it is not the main point of the analysis. It will still potentially inform the results.
Fig 4: this figure is quite nice
3.1: "while providing measures of disorder that remain meaningful during smoke events." --> I don't know what "meaningful" means here.
"This categorization avoids overemphasis on minor numerical differences, while capturing meaningful shifts in health-relevant air quality." --> I don't really buy this. If you want to say that it makes the mathematical optimization easier, then that is perfectly fine to say. But public health researchers would argue that any concentration of PM2.5 is 'meaningful'. There are large resources devoted to studying the health effects/impacts of PM at very low (i.e., <5 µg/m3) concentrations.
L323-332: no citations are given and the discussion here is fairly vague, referring to information theory as an approach or an idea, rather than just identifying specific applications or models or tests, etc.
"neighbouring devices whose measurements fall into AQI bin 𝑥" --> are these the categorical AQI bins? Are the bins not equal in size? For example, you can have a monitor that is off by 50 AQI points but contained in the same category (at high smoke concs) but a monitor that is off by 5 that switches categories is flagged? Does that mean the latter example has greater entropy than the former even though it displays much less error?
Ok, I think the following paragraphs addressed it a little bit, but the idea of the bins is still unclear. You show a time series of AQI which has discrete numbers (even though abstracted), but now it's only using the binned AQI categories, or are you also using the AQI scores in the bins as well?
Figure 7: font sizes are too small. Text is small, hard to see the lines.
"In terms of spatial configuration, the optimal proximity varied widely for each fault type, ranging from 5,000 to 20,000 m, implying that specific fault detectability may be at least partially scale-dependent."
--> I think this is important to actually tease out. You show that this information theory approach works well, but there isn't much analysis on the interpretability at all. Why do these sets of parameters lead to the best performance for these specific error types?
I don't find Figure 9 very useful.
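To make the AQI-bin question raised under 3.1 above concrete, here is a minimal sketch of how binning interacts with entropy and information content. It is not the authors' detection function: it assumes plain Shannon entropy and surprisal over the six US EPA AQI categories, and the helper names (`aqi_bin`, `shannon_entropy`, `surprisal`) and the choice to pool the sensor with its neighbours when estimating probabilities are hypothetical.

```python
import math
from bisect import bisect_left
from collections import Counter

# Upper AQI value of the first five US EPA categories (Good ... Very Unhealthy).
# Anything above 300 falls in the sixth category (Hazardous).
AQI_UPPER_EDGES = [50, 100, 150, 200, 300]


def aqi_bin(aqi):
    """Map an AQI value to a category index 0..5."""
    return bisect_left(AQI_UPPER_EDGES, aqi)


def shannon_entropy(bins):
    """Entropy (bits) of the empirical distribution of category indices."""
    n = len(bins)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(bins).values())


def surprisal(sensor_bin, neighbour_bins):
    """Information content -log2 p(sensor_bin), in bits.

    ASSUMPTION: p is estimated from the neighbours pooled with the sensor
    itself, so p is never zero. The manuscript may define this differently.
    """
    pooled = list(neighbour_bins) + [sensor_bin]
    return -math.log2(pooled.count(sensor_bin) / len(pooled))


# Heavy smoke: the sensor reads 50 AQI points above its neighbours,
# but everything stays inside the 201-300 category.
heavy_neighbours = [aqi_bin(a) for a in [225, 225, 225, 225, 225]]
heavy_surprisal = surprisal(aqi_bin(275), heavy_neighbours)

# Near a category edge: the sensor reads only 5 AQI points above its
# neighbours, but crosses from the 51-100 category into 101-150.
edge_neighbours = [aqi_bin(a) for a in [98, 98, 98, 98, 98]]
edge_surprisal = surprisal(aqi_bin(103), edge_neighbours)
```

Under these assumptions the 50-point deviation has zero surprisal, while the 5-point deviation has a surprisal of log2(6), about 2.58 bits, even though the neighbourhood entropy is zero in both cases. Whether the manuscript's function shows the same asymmetry depends on details that are not reproduced in this discussion.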
The organization of this paper is a bit messy. We are already on section 4 and it seems to be introducing new methods and a different line of inquiry than what the paper had originally set up. I would rewrite all methods to be in the methods section.
I find the XGBoost sections fairly haphazard. There are no tables or figures and it seems to come out of nowhere.
In section 5, new data sources and scope are introduced. I think this manuscript needs a large reorganization. It contains many interesting analyses but it is long-winded and not effectively organized. The Discussion section introduces even more figures and analysis. There are 17 figures in this paper, which could easily be cut to half that number, given that many of the figures do not add much substantive information or could be condensed.
The high misclassification rate during local smoke events is concerning, especially given that this is a benefit of using low-cost sensors. There is no discussion of what the correlation length scale of wildfire or Rx fire smoke might look like, but these are generally long-travelled plumes (at least for wildfires), which should make the local classification of smoke an easier task, at least from first principles.
Citation: https://doi.org/10.5194/egusphere-2026-1273-RC2
- RC3: 'Comment on egusphere-2026-1273', Byeongseong Choi, 25 Jun 2026
The proposed framework, which combines distance-based sensor networks, entropy-based comparison metrics, and machine-learning-assisted rule optimization, is technically interesting and practically relevant. In its current form, the manuscript already appears publishable with minor revisions. However, several additional clarifications and justifications could further strengthen the impact and interpretation of the work.
(I initially misunderstood the review process and included several comments only in the initial evaluation rather than in the public discussion. For completeness and to ensure that the authors can address them during the discussion phase, I am reposting those comments below.)
1. A fundamental question is the lack of a clear comparison baseline. Additional justification may be beneficial regarding why the proposed framework is particularly necessary under smoke conditions, unless there are cases where conventional approaches perform well under normal air-quality conditions but fail during smoke events.
2. A stochastic spatiotemporal modeling framework may also be considered as a comparison approach, rather than relying solely on distance-based network generation. Including a brief discussion of physically informed spatiotemporal transport formulations in the Discussion section may help further contextualize the limitations and assumptions of the current network-based approach. For example:
Choi, B., and Hummel, M. A. (2025). Spatiotemporal air quality prediction using stochastic advection–diffusion model for multimodal data fusion. Environmental Research Letters, 20(1), 014065.
3. (Suggestion) Reducing redundant explanations in several sections may improve the flow of the manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-1273-RC3
Review for paper: "A tuneable framework for outlier detection in PM2.5 air sensor networks during wildland fire smoke events". The authors propose an outlier detection method for PM2.5 sensors that leverages information theory and machine learning and is especially useful during smoke events. Data integrity, as well as data quality, is one of the major concerns in low-cost sensor networks. Hence, automatic outlier detection is a crucial task. The paper presents a relevant novelty and a useful use case. However, there are some aspects that need to be further clarified regarding the machine learning phase:
1) It seems that the authors are introducing synthetic anomalies. Then, they perform a threshold selection for the information-based rule and the XGBoost. It is very important to clearly state what data has been used for training, what data for validation/grid search, and for testing.
2) The authors should clearly explain the difference/benefits of the two methods.
3) The information-theoretic approach should be compared with other, simpler methods, e.g., one of their own criteria: the absolute value of the difference between the sensor reading and the average of the neighborhood values.
4) Why have the authors chosen the XGBoost for binary classification? They should motivate their choice and compare it with some other well-known methods.
5) What is the labelling process for the supervised binary classification?
6) What is the effect of a sparse sensor network? Are the anomalies also identifiable in the presence of smoke? During smoke events, are the outliers identifiable only in the case of a dense network?
7) Is the proposed method optimal or nearly optimal for all scenarios, i.e., smoke and non-smoke events?
8) Is the method intended for large anomalies, e.g., lasting more than one hour? What if there is just a single measurement that deviates from normality? Is it flagged? These scenarios are also relevant in cases where data is fed to downstream applications.
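The simpler baseline suggested in comment 3 could look roughly like the sketch below. It is only an illustration of that comparison criterion, not anything taken from the manuscript: the function name `flag_by_neighbour_mean`, the threshold value, and the example readings are all hypothetical.

```python
from statistics import mean


def flag_by_neighbour_mean(reading, neighbour_readings, threshold):
    """Flag a sensor when |reading - mean(neighbours)| exceeds a threshold.

    All values share the same units (e.g. PM2.5 concentration).
    ASSUMPTION: with no neighbours there is nothing to compare against,
    so the reading is left unflagged.
    """
    if not neighbour_readings:
        return False
    return abs(reading - mean(neighbour_readings)) > threshold


# Hypothetical hourly readings; the neighbourhood mean is 45.
neighbours = [40.0, 45.0, 50.0]
suspect_flagged = flag_by_neighbour_mean(180.0, neighbours, threshold=50.0)
typical_flagged = flag_by_neighbour_mean(48.0, neighbours, threshold=50.0)
```

A fixed threshold like this is easy to tune under typical conditions, but during smoke events real concentrations can differ sharply between nearby sites, which is presumably why a comparison against the entropy-based rule would be informative.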