the Creative Commons Attribution 4.0 License.
A Framework for Dynamic Hyper-local Source Apportionment using Low-cost Sensors for Real-time Policy Action
Abstract. The presence of particulate matter, toxic gases and other pollutants in the air poses a significant risk to human health and the environment. Identifying the different sources of air pollution, termed Source Apportionment (SA), needs to be done in real time in order to understand the dynamics of the contributing sources and to enable policymakers to frame effective regulatory measures to curb air pollution. The unit deployed for implementing the SA framework at a particular location must also be cost-effective, so that it becomes feasible to create a dense network of such units and thus cover a wide geographical area. The use of low-cost air quality monitoring sensors has become popular in this regard. In our proposed framework we use low-cost air quality sensor units in conjunction with machine learning models to develop a low-cost real-time solution for SA. Multi-output regression models, which are supervised machine learning models, are used for this purpose. Reference Grade Instruments are used for learning calibration models for the low-cost sensors as well as the multi-output regression models for SA. Once the calibration and multi-output regression models are learnt during training, the proposed framework allows the low-cost sensors to be deployed in the field as standalone devices, which collect on-field data and store it on a remote server through a wireless network. This data can be pulled at the user end, calibrated and then fed to the trained model to obtain the SA results in terms of the relative abundance of the different sources in ambient air.
Mean Absolute Error (MAE) has been used as the metric to measure the accuracy in predicting the relative abundance of different sources, while Spearman's Rank Order Correlation Coefficient (SROCC) and Normalized Discounted Cumulative Gain (NDCG) have been used to estimate how well the proposed approach ranks the relative abundance of the different sources in the correct order. Extensive experimentation using data gathered from two different environments in the city of Lucknow, India shows the robustness of the proposed approach in performing real-time SA. MAEs of less than 5 % have been obtained in predicting the relative abundance of most of the organic as well as elemental sources, while SROCC values greater than 0.75 and NDCG values greater than 0.85 obtained for all the sources show that the proposed framework also performs very well in ranking most of the sources in the correct order of their actual contribution to air pollution.
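For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how MAE, SROCC and NDCG could be computed for a vector of predicted source contributions. The functions and numbers are illustrative, not taken from the manuscript, and the SROCC implementation omits tie handling for brevity:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted source shares."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def srocc(y_true, y_pred):
    """Spearman's rank-order correlation: Pearson correlation of the
    ranks. No tie handling (illustration only)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return np.corrcoef(rank(y_true), rank(y_pred))[0, 1]

def ndcg(y_true, y_pred):
    """Normalized Discounted Cumulative Gain: rewards placing the
    dominant sources first; relevance = true contribution."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    order = np.argsort(-y_pred)                    # predicted ranking
    ideal = np.sort(y_true)[::-1]                  # ideal ranking
    disc = 1.0 / np.log2(np.arange(2, len(y_true) + 2))
    return (y_true[order] * disc).sum() / (ideal * disc).sum()

# Illustrative relative abundances (%) for five hypothetical sources
true_pct = [40.0, 25.0, 15.0, 12.0, 8.0]
pred_pct = [35.0, 28.0, 17.0, 9.0, 11.0]
```

Note that MAE penalizes errors in the predicted shares themselves, whereas SROCC and NDCG only penalize errors in the predicted ordering of the sources, which is why the two kinds of metric can disagree.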
Status: open (until 05 Jun 2026)
- RC1: 'Comment on egusphere-2025-5677', Anonymous Referee #1, 13 Mar 2026
- RC2: 'Comment on egusphere-2025-5677', Anonymous Referee #2, 01 May 2026
Overview
This manuscript proposes a unique approach for source apportionment of PM2.5 via low-cost air quality (LCAQ) sensor inputs to machine learning models that are trained via collocated reference grade instruments (RGI). The authors train multi-output regression models from about 7 days of collocated LCAQ/RGI measurements taken at two sites roughly 20 km apart in Lucknow, India, where positive matrix factorization (PMF) is first applied to aerosol mass spectrometer (AMS) and energy dispersive x-ray fluorescence (Xact 625i) measurements and the resolved factors serve as the ground truth for model training. Model predictions of AMS and Xact 625i PMF factors are claimed to agree with AMS/Xact 625i PMF factors over 2-day test periods at both sites. As the authors have correctly pointed out, traditional source apportionment (e.g., PMF) of RGI measurements is resource-intensive and often time-consuming. Therefore, their research objectives are timely and relevant to Atmospheric Measurement Techniques, and such a framework would represent an important contribution to the state of the science. However, in its current form, the manuscript lacks the necessary data and methodological rigor to support the objectives or conclusions the authors claim to accomplish. For these reasons, which I detail below, major revisions with additional data and methodological improvements should be made before considering the work for publication.
Major Issues
1. The work’s claims about the proposed framework are disproportionate to the limited training and test data used. Much more RGI/LCAQ data is necessary for building and evaluating the framework this manuscript proposes.
The authors describe their framework as robust and effective (e.g. lines 17-18 claim “extensive experimentation done” shows the “robustness of the proposed approach.”) yet the employed 80/20 split and very limited test set (2 days of test data) prevent a rigorous evaluation of model performance and the use of AMS/Xact 625i PMF factors as ground truths. PMF factors will vary over time scales much larger than the 10-day windows considered here. It has also been well documented in literature that AMS and Xact 625i PMF factor profiles themselves can vary substantially across sites and, to an extent, even within a site over time (Zhang et al., 2007; Canonaco et al., 2021; Chen et al., 2022). While the framework proposed is interesting, an expanded high-quality dataset is required to demonstrate its potential merit.
In order to demonstrate the need for more data, I pose a few questions to the authors. (a) Because the training set is within 0-7 days of the test set, predicted factor concentrations will already be similar due to autocorrelation – what if the model predicted concentrations a month later or a year later? If periodic RGI/LCAQ relocation periods are required for model training, this would be important information to include. A test set several weeks apart from the training set would ensure autocorrelation effects are not impacting model performance. (b) In this first submitted manuscript each LCAQ sensor was collocated with the RGI mobile lab and it’s my understanding that each regression model is specific to each site. What if the site B model was applied to nearby LCAQ sensor data from site C? This would be another more rigorous evaluation that would show the ability of multi-output regression models to quantify similar sources at a different place and time.
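To make suggestion (a) concrete, a chronological split with a buffer gap between the last training sample and the first test sample would remove the autocorrelation confound. A minimal sketch, where the function and variable names are hypothetical and the gap length is illustrative:

```python
import numpy as np

def gapped_time_split(timestamps, train_frac=0.8, gap_hours=24.0):
    """Chronological train/test split with a buffer gap, so that test
    samples are not immediately adjacent to (and thus autocorrelated
    with) the last training samples. Returns boolean sample masks."""
    t = np.asarray(timestamps, dtype=float)  # hours since deployment start
    order = np.argsort(t)
    cutoff = t[order[int(train_frac * len(t)) - 1]]
    train_mask = t <= cutoff
    test_mask = t >= cutoff + gap_hours
    return train_mask, test_mask

# Illustrative: 10 days of 30-minute observations (480 samples)
hours = np.arange(0, 240, 0.5)
train_mask, test_mask = gapped_time_split(hours, train_frac=0.8,
                                          gap_hours=24.0)
```

A longer gap (weeks rather than a day) would of course be a stronger test, as argued above; the point of the sketch is only that samples inside the gap are used by neither partition.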
While RGI data and its source apportionment products are resource-intensive, there are many long-term deployments of RGI collocated near LCAQ monitors with free, open-source data products that would be well suited to the models proposed by the authors. For example, the ACTRIS network has aerosol chemical speciation monitors and aethalometers deployed at many sites and has conducted harmonized source apportionment (Chen et al., 2022).
2. There are concerns about the integrity and reliability of RGI and LCAQ data.
The authors thoughtfully describe the calibration procedure for LCAQ sensors and note the poor agreement between RGI and LCAQ monitors for SO2 and NO2 because measured concentrations are below detection limits. This is especially evident in the Fig. 4 SO2 agreement for site C. The challenges of LCAQ monitors for measuring SO2 and NO2 have been noted extensively (Duvall et al., 2016). If the SO2 and NO2 measurements are not reliable, they should not be included in the machine learning models as they will only impair the model’s ability to predict PM2.5 sources.
In Fig. 11, there appears to be poor time resolution or a lack of RGI AMS PMF data and LCAQ PMF predictions within the 2-day test set for site C. Between Oct 8 00:00 and Oct 8 20:00 (about half of the test set), it appears there are only 3 data points being compared. Interestingly, this data appears to be present in Fig. A2. Why does this appear to be excluded in the evaluation in Fig. 11?
Additional information on the operation of the HR-ToF-AMS and Xact 625i should be included in the manuscript. For the HR-ToF-AMS, what were the results of ionization efficiency calibrations? For the Xact 625i, what was the signal-to-noise of the elements measured? For all RGI deployments, what was the inlet height/configuration and position in the mobile lab?
3. The PMF source factors are being overinterpreted, and additional details should be provided on PMF source apportionment methods.
From 20 days of HR-ToF-AMS and Xact 625i measurements, the authors have resolved 5 AMS source factors and 7 Xact 625i source factors and claim the model is able to predict these different source factors well. Some of the PMF factors (e.g., “Fe-smelting”) seem very specific considering that other large sources of particulate iron exist (e.g., brake wear). The manuscript needs a more thorough rationalization of the PMF solutions (e.g., correlations with external tracers, diurnal profiles, comparisons to prior literature, bootstrapping). If the RGI PMF factors themselves have a high degree of uncertainty and mixing, I would not expect the LCAQ sensors to predict such factors well. If the model is not able to reproduce accurate source apportionment of these 12 factors with an expanded dataset, I would advise the authors to consider applying the model to simpler groups first (e.g., hydrocarbon like organic aerosol, oxygenated organic aerosol, sulfate, nitrate).
Canonaco et al., A new method for long-term source apportionment with time-dependent factor profiles and uncertainty assessment using SoFi Pro: application to 1 year of organic aerosol data, Atmospheric Measurement Techniques, 2021. https://doi.org/10.5194/amt-14-923-2021
Chen et al., European aerosol phenomenology − 8: Harmonised source apportionment of organic aerosol using 22 Year-long ACSM/AMS datasets, Environment International, 2022. https://doi.org/10.1016/j.envint.2022.107325
Duvall et al., Performance Evaluation and Community Application of Low-Cost Sensors for Ozone and Nitrogen Dioxide, Sensors, 2016. https://doi.org/10.3390/s16101698
Zhang et al., Ubiquity and dominance of oxygenated species in organic aerosols in anthropogenically-influenced Northern Hemisphere midlatitudes, Geophysical Research Letters, 2007. https://doi.org/10.1029/2007GL029979
Citation: https://doi.org/10.5194/egusphere-2025-5677-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,197 | 1,289 | 68 | 2,554 | 110 | 126 |
Recommendation to the editor:
The manuscript presents a potentially interesting proof-of-concept for real-time, low-cost SA, but it is not ready for publication in its current form. The fundamental issues of scope (single city, effectively single week of training and 2 days of test data), lack of instrument calibration, the absence of a proper validation set, the use of sub-detection-limit and potentially degraded sensor data in predictive models, and the overstated generalization claims represent major methodological and scientific integrity concerns that cannot be resolved easily. A substantially expanded underlying dataset — spanning multiple seasons, cities, or years — combined with the methodological corrections described below would be required before the work could be reconsidered for publication.
Review Summary
This manuscript proposes a machine learning framework for real-time, hyper-local air pollution source apportionment (SA) using low-cost air quality (LCAQ) sensors co-located with Reference Grade Instruments (RGI). Multi-output linear regression models are trained on calibrated LCAQ data, with SA outputs from Positive Matrix Factorization (PMF) applied to HR-ToF-AMS and Xact-625i data serving as ground truth. The framework is evaluated at two sites in Lucknow, India, during a single month (October 2023). While the research topic is timely and relevant to the scope of Atmospheric Measurement Techniques, the manuscript has fundamental deficiencies in experimental scope, methodological rigor, and scientific integrity of its claims that preclude publication in its current form.
Major Concerns
1. Overstated Claims of Robustness and Generalization
The introduction asserts that the objective of this study is for “enhancing model robustness, calibration fidelity, and generalizability across diverse sensing environments” (p. 4, l. 115) and that "extensive experimentation has been carried out to validate the robustness of the proposed framework" (p. 5, l. 133). However, the entire study consists of two deployments within a single city during a single month. Each site yields approximately 350 observations at 30-minute resolution (roughly one week of data) (p. 15, l. 332-333). With an 80/20 train-test split, the test set contains only ~70 observations per site — less than two days of data. These sample sizes are wholly insufficient to support claims of robustness or generalizability. To substantiate such claims, the authors would need to collect data across multiple seasons at the same sites, across multiple cities, or across multiple years. The current field monitoring scope supports only a proof-of-concept demonstration, not the full investigation expected of a manuscript proposing a new framework.
2. Micro-Aethalometer Calibration Not Described
The Micro-Aethalometer (AethLabs AE-51) is deployed alongside the LCAQ sensor unit and its BC measurements are used as a key predictor. The paper states only that the device "comes lab calibrated" (p. 2, l. 176) and cites a field intercomparison study, but provides no site-specific calibration or cross-validation against the EBAM or other RGI at either deployment site. Notably, the abstract of that very field intercomparison study reads, “Real-world quality assurance of these instruments should be performed through field IC against reference instruments with longer durations in areas of slowly changing eBC concentration” (Alas et al., 2020). For a manuscript focused on calibration methodology, the absence of any field calibration assessment of this instrument is a significant omission that should be rectified.
3. Potential NO₂ Sensor Degradation Mid-Deployment
Figure 3 displays the NO₂ calibration time series at Site-B and visually suggests a step-change in sensor behavior around 16–17 October 2023, consistent with sensor performance degradation. The authors must formally test this by computing the calibration correlation coefficient separately for data before and after this break-point. If a statistically significant change is identified, NO₂ data collected after that date should be excluded from training and inference. Given the already limited dataset size (~350 observations), such an exclusion could substantially constrain the usable training data and reduce the available inputs to three reliably calibrated pollutants (CO, O₃, and PM₂.₅), given the acknowledged below-detection-limit issues with SO₂. The implications for model validity need to be fully addressed; without this analysis, the NO₂ data in its current form cannot be considered usable.
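The suggested break-point diagnostic could be as simple as computing the sensor-reference correlation separately on each side of the suspected date, followed by a significance test (e.g., Fisher's z) on the difference. A sketch on synthetic data, where the degradation model is purely illustrative and not a claim about the actual sensor:

```python
import numpy as np

def breakpoint_r(t, sensor, reference, t_break):
    """Pearson r between sensor and reference, computed separately
    before and after a suspected break-point in sensor behaviour."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(sensor, dtype=float)
    ref = np.asarray(reference, dtype=float)
    pre, post = t < t_break, t >= t_break
    r_pre = np.corrcoef(s[pre], ref[pre])[0, 1]
    r_post = np.corrcoef(s[post], ref[post])[0, 1]
    return r_pre, r_post

# Synthetic illustration: the sensor tracks the reference until t=100,
# then degrades into uncorrelated noise.
rng = np.random.default_rng(0)
t = np.arange(200.0)
ref = 20 + 10 * np.sin(t / 10)
sensor = np.where(t < 100,
                  ref + rng.normal(0.0, 0.5, 200),   # healthy: ref + noise
                  rng.normal(20.0, 5.0, 200))        # degraded: pure noise
r_pre, r_post = breakpoint_r(t, sensor, ref, 100.0)
```

A sharp drop from `r_pre` to `r_post` would corroborate the step-change visible in Figure 3 and justify excluding the post-break data.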
4. Use of Below-Detection-Limit SO₂ Data in Predictive Models
The authors acknowledge that all ambient SO₂ concentrations at both sites fall below the minimum detection limit (MDL) of the Alphasense B4 sensor (~5 ppb), resulting in R² values near zero compared to reference grade instrumentation (p. 10, l. 261-263; 0.03 at Site-B, −0.17 at Site-C). Yet SO₂ appears to be retained as a predictor in the SA regression models, and the authors do not explicitly state that all SO₂ measurements are excluded. Using a pollutant whose measurements are all sub-MDL as a model predictor introduces noise rather than signal and violates fundamental principles of analytical measurement.
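A simple pre-processing guard would drop any channel that is predominantly below its MDL before model training. A hypothetical sketch, where the function name, threshold, and MDL values are illustrative:

```python
import numpy as np

def drop_below_mdl(X, feature_names, mdl, frac_threshold=0.9):
    """Drop any feature column whose measurements fall below its minimum
    detection limit (MDL) in more than `frac_threshold` of samples, since
    such a channel carries noise rather than signal."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j, name in enumerate(feature_names):
        frac_sub_mdl = np.mean(X[:, j] < mdl.get(name, -np.inf))
        if frac_sub_mdl <= frac_threshold:
            keep.append(j)
    return X[:, keep], [feature_names[j] for j in keep]

# Illustrative: an SO2 channel entirely below its ~5 ppb MDL is dropped,
# while a CO channel well above its MDL is retained.
X = np.column_stack([np.full(100, 2.0),            # SO2 (ppb), all < 5
                     np.linspace(0.3, 1.2, 100)])  # CO (ppm)
X_kept, names = drop_below_mdl(X, ["SO2", "CO"],
                               mdl={"SO2": 5.0, "CO": 0.04})
```

Making such a filter explicit in the manuscript would also document exactly which predictors entered each site's model.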
5. Cross-Sensitivity of Alphasense B4 Sensors Not Validated
The authors note (p. 34, l. 607–608) that the Alphasense B4 auxiliary electrode compensates for cross-sensitivities from interfering gases; however, the calibration model (Equation 1) does not include any terms to explicitly test for or remove residual cross-sensitivities among pollutant channels. The authors must either demonstrate empirically that cross-sensitivities are negligible in their deployment context (e.g., by showing that adding cross-sensitivity terms explains negligible additional variance), or account for these effects within the calibration framework. This is particularly important given the poor NO₂ and SO₂ calibration results. Until this is demonstrated, the LCS calibration model cannot be considered fit for use.
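The empirical check suggested above amounts to comparing the variance explained by the calibration fit with and without cross-terms. A synthetic sketch, where the interference coefficients are invented purely for illustration and do not describe the actual B4 sensors:

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_r2(X, y):
    """Ordinary least-squares fit with intercept; in-sample R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return r2(y, A @ coef)

# Synthetic example: a raw NO2 channel that responds partly to O3 (a
# known class of electrochemical interference). Adding the O3 channel
# as a cross-term should then explain substantially more variance.
rng = np.random.default_rng(1)
no2_true = rng.uniform(5, 40, 500)
o3 = rng.uniform(10, 60, 500)
no2_raw = 0.8 * no2_true - 0.3 * o3 + rng.normal(0, 1, 500)

r2_plain = fit_r2(no2_raw.reshape(-1, 1), no2_true)          # NO2 only
r2_cross = fit_r2(np.column_stack([no2_raw, o3]), no2_true)  # + cross-term
```

If the gap between the two R² values were negligible on the authors' real data, that would be the empirical demonstration requested; a large gap would mean cross-terms belong in Equation 1.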
6. Absence of a Validation Set — Risk of Overfitting and Contaminated Evaluation
The paper describes only a training/test split (80/20). No separate validation set is used for hyperparameter tuning or model selection (Appendix C evaluates multiple regression models including gradient boosting and random forests with tuned hyperparameters). Without a held-out validation set, the reported test-set performance risks being optimistic: if any model selection decisions were informed — even informally — by test-set behavior, the test set no longer represents a true independent evaluation. Standard practice in supervised machine learning requires a three-way split (e.g., 60/20/20 or 70/15/15) when multiple models or hyperparameters are compared. The authors must clarify their model selection procedure and, if the test set was used in any capacity for model comparison, must provide corrected evaluation on a genuinely held-out partition.
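A minimal sketch of the three-way split described above (a random split is shown for brevity; for time-series data a chronological split with buffer gaps would be more appropriate, and all names are hypothetical):

```python
import numpy as np

def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Random 60/20/20 index split. The validation set drives model and
    hyperparameter selection; the test set is touched exactly once,
    after all selection decisions are frozen."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(test_frac * n)
    n_val = int(val_frac * n)
    train_idx = idx[n_test + n_val:]
    val_idx = idx[n_test:n_test + n_val]
    test_idx = idx[:n_test]
    return train_idx, val_idx, test_idx

# With the manuscript's ~350 observations per site, this leaves 210
# training, 70 validation, and 70 test samples.
train_idx, val_idx, test_idx = three_way_split(350)
```

Even this split leaves very few test samples, which reinforces the case for an expanded dataset made in Concern 1.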
7. Use of R² for Calibration Assessment is Inappropriate; Spearman Correlation Required
The paper uses Pearson R² to evaluate sensor calibration (Figures 3 and 4) and the feature correlation heat map (Figure 6). The R² values reported for NO₂ (0.22 at Site-B) and SO₂ (−0.17 at Site-C) are strikingly poor and should disqualify these sensors from use, yet the manuscript characterizes the calibration as "reasonably good" (p. 10, l. 259-260) and proceeds to include these pollutants as model inputs. Pearson R² is sensitive to outliers and is poorly suited to noisy electrochemical sensor data. The authors should recompute the statistics in Figures 3, 4, and 6 using Spearman rank-order correlation coefficients, which are more appropriate for this type of data and provide an honest characterization of calibration quality.
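To illustrate why a rank-based statistic is preferable for outlier-prone sensor data, the sketch below shows a single spurious spike flipping the sign of Pearson r on otherwise perfectly monotone data, while Spearman remains strongly positive (synthetic data; tie handling omitted for brevity):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    No tie handling (illustration only)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return np.corrcoef(rank(x), rank(y))[0, 1]

# Perfectly correlated series, then one spurious sensor spike
x = np.arange(20.0)
y = x.copy()
y[0] = 500.0   # a single outlier against the overall trend

p = pearson_r(x, y)
s = spearman_r(x, y)
```

In production code one would normally use `scipy.stats.spearmanr`, which handles ties correctly; the hand-rolled rank function above is only to keep the example dependency-free.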