A Framework for Dynamic Hyper-local Source Apportionment using Low-cost Sensors for Real-time Policy Action
Abstract. The presence of particulate matter, toxic gases and other pollutants in the air poses a significant risk to human health and the environment. Identifying the different sources of air pollution, termed Source Apportionment (SA), needs to be done in real time in order to understand the dynamics of the contributing sources and to enable policy makers to frame effective regulatory measures to curb air pollution. The unit deployed for implementing the SA framework at a particular location must also be cost-effective, so that it becomes feasible to create a dense network of such units and thus cover a wide geographical area. The use of low-cost air quality monitoring sensors has become popular in this regard. In our proposed framework we use low-cost air quality sensor units in conjunction with machine learning models to develop a low-cost real-time solution for SA. Multi-output regression models, which are supervised machine learning models, are used for this purpose. Reference Grade Instruments are used for learning calibration models for the low-cost sensors as well as the multi-output regression models for SA. Once the calibration and multi-output regression models are learnt during training, the proposed framework allows the low-cost sensors to be deployed in the field as standalone devices, where they collect on-field data and store it in a remote server through a wireless network. This data can be pulled at the user end, calibrated, and then fed to the trained model to obtain the SA results in terms of the relative abundance of the different sources in ambient air.
Mean Absolute Error (MAE) has been used as the metric to measure accuracy in predicting the relative abundance of different sources, while Spearman's Rank Order Correlation Coefficient (SROCC) and Normalized Discounted Cumulative Gain (NDCG) have been used to estimate how well the proposed approach ranks the different sources in the correct order of abundance. Extensive experimentation using data gathered from two different environments in the city of Lucknow, India shows the robustness of the proposed approach in performing real-time SA. An MAE of less than 5 % has been obtained in predicting the relative abundance of most of the organic as well as elemental sources, while SROCC values greater than 0.75 and NDCG values greater than 0.85 obtained for all the sources show that the proposed framework also performs very well in predicting most of the sources in the correct order of their actual contribution to air pollution.
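For reference when checking the reported numbers, the three evaluation metrics named in the abstract can be sketched in plain Python. The sample source fractions below are hypothetical, not taken from the manuscript:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error between actual and predicted source fractions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def spearman_rho(actual, predicted):
    """Spearman rank correlation, using the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    ra, rp = ranks(actual), ranks(predicted)
    n = len(actual)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rp))
    return 1 - 6 * d2 / (n * (n * n - 1))

def ndcg(actual, predicted):
    """NDCG with the actual abundances as relevance, ranked by predicted abundance."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    dcg = sum(actual[i] / math.log2(pos + 2) for pos, i in enumerate(order))
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(sorted(actual, reverse=True)))
    return dcg / idcg

# Hypothetical relative abundances for five sources (fractions summing to ~1)
actual = [0.40, 0.25, 0.20, 0.10, 0.05]
predicted = [0.35, 0.30, 0.15, 0.12, 0.08]
```

With these inputs the MAE is 0.04 (4 %), and because the predicted ranking matches the actual ranking exactly, both SROCC and NDCG equal 1.0.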
Recommendation to the editor:
The manuscript presents a potentially interesting proof-of-concept for real-time, low-cost SA, but it is not ready for publication in its current form. The fundamental issues of scope (single city, effectively single week of training and 2 days of test data), lack of instrument calibration, the absence of a proper validation set, the use of sub-detection-limit and potentially degraded sensor data in predictive models, and the overstated generalization claims represent major methodological and scientific integrity concerns that cannot be resolved easily. A substantially expanded underlying dataset — spanning multiple seasons, cities, or years — combined with the methodological corrections described below would be required before the work could be reconsidered for publication.
Review Summary
This manuscript proposes a machine learning framework for real-time, hyper-local air pollution source apportionment (SA) using low-cost air quality (LCAQ) sensors co-located with Reference Grade Instruments (RGI). Multi-output linear regression models are trained on calibrated LCAQ data, with SA outputs from Positive Matrix Factorization (PMF) applied to HR-ToF-AMS and Xact-625i data serving as ground truth. The framework is evaluated at two sites in Lucknow, India, during a single month (October 2023). While the research topic is timely and relevant to the scope of Atmospheric Measurement Techniques, the manuscript has fundamental deficiencies in experimental scope, methodological rigor, and scientific integrity of its claims that preclude publication in its current form.
Major Concerns
1. Overstated Claims of Robustness and Generalization
The introduction asserts that the objective of this study is “enhancing model robustness, calibration fidelity, and generalizability across diverse sensing environments” (p. 4, l. 115) and that "extensive experimentation has been carried out to validate the robustness of the proposed framework" (p. 5, l. 133). However, the entire study consists of two deployments within a single city during a single month. Each site yields approximately 350 observations at 30-minute resolution (roughly one week of data) (p. 15, l. 332-333). With an 80/20 train-test split, the test set contains only ~70 observations per site, i.e. less than two days of data. These sample sizes are wholly insufficient to support claims of robustness or generalization. To substantiate such claims, the authors would need to collect data across multiple seasons at the same sites, across multiple cities, or across multiple years. The current field monitoring scope supports only a proof-of-concept demonstration; it does not constitute the full investigation needed to support a manuscript proposing a new framework.
2. Micro-Aethalometer Calibration Not Described
The Micro-Aethalometer (AethLabs AE-51) is deployed alongside the LCAQ sensor unit and its BC measurements are used as a key predictor. The paper states only that the device "comes lab calibrated" (p. 2, l. 176) and cites a field intercomparison study, but provides no site-specific calibration or cross-validation against the EBAM or other RGI at either deployment site. Notably, the abstract of that very field intercomparison study reads, “Real-world quality assurance of these instruments should be performed through field IC against reference instruments with longer durations in areas of slowly changing eBC concentration” (Alas et al., 2020). For a manuscript focused on calibration methodology, the absence of any field calibration assessment of this instrument is a significant omission that must be rectified.
3. Potential NO₂ Sensor Degradation Mid-Deployment
Figure 3 displays the NO₂ calibration time series at Site-B and visually suggests a step-change in sensor behavior around 16–17 October 2023, consistent with sensor performance degradation. The authors must formally test this by computing the calibration correlation coefficient separately for data before and after this break-point. If a statistically significant change is identified, NO₂ data collected after that date should be excluded from training and inference. Given the already limited dataset size (~350 observations), such an exclusion could substantially constrain the usable training data and, given the acknowledged below-detection-limit issues with SO₂, reduce the available inputs to three reliably calibrated pollutants (CO, O₃, and PM₂.₅). If the break-point is confirmed, the NO₂ data in its current form would be unusable, and the implications for model validity must be fully addressed.
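The suggested break-point check can be sketched as follows: compute the calibration correlation in each segment, then compare the two coefficients with a Fisher r-to-z test, a standard procedure for testing whether two correlations differ. Segment sizes and values in the usage note are illustrative, not taken from the manuscript:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fisher_z_diff(r1, n1, r2, n2):
    """Two-sample z statistic comparing correlations r1 (n1 points) and r2 (n2 points)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se
```

For example, if the pre-break segment gives r = 0.9 over 50 points and the post-break segment gives r = 0.3 over 50 points, `fisher_z_diff(0.9, 50, 0.3, 50)` exceeds 1.96, i.e. the change is significant at the 5 % level and the post-break data should be excluded.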
4. Use of Below-Detection-Limit SO₂ Data in Predictive Models
The authors acknowledge that all ambient SO₂ concentrations at both sites fall below the minimum detection limit (MDL) of the Alphasense B4 sensor (~5 ppb), resulting in R² values near zero compared to reference grade instrumentation (p. 10, l. 261-263; 0.03 at Site-B, −0.17 at Site-C). Yet SO₂ appears to be retained as a predictor in the SA regression models, and the authors do not explicitly state that all SO₂ measurements are excluded. Using a pollutant whose measurements all fall below the MDL as a model predictor introduces noise rather than signal and violates fundamental principles of analytical measurement.
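A minimal screening step of the kind the authors should apply before feature selection is sketched below; the 5 ppb MDL is the value cited in the manuscript, while the helper name and masking convention are ours:

```python
MDL_SO2_PPB = 5.0  # minimum detection limit cited for the Alphasense B4 SO2 sensor

def screen_mdl(values, mdl):
    """Return (fraction of readings below the MDL, readings with sub-MDL values masked)."""
    below = sum(1 for v in values if v < mdl)
    masked = [v if v >= mdl else None for v in values]
    return below / len(values), masked
```

A channel whose sub-MDL fraction is at or near 1.0 (as reported here for SO₂ at both sites) carries no usable signal and should be dropped from the predictor set entirely rather than masked point-wise.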
5. Cross-Sensitivity of Alphasense B4 Sensors Not Validated
The authors note (p. 34, l. 607–608) that the Alphasense B4 auxiliary electrode compensates for cross-sensitivities from interfering gases; however, the calibration model (Equation 1) does not include any terms to explicitly test for or remove residual cross-sensitivities among pollutant channels. The authors must either demonstrate empirically that cross-sensitivities are negligible in their deployment context (e.g., by showing that adding cross-sensitivity terms explains negligible additional variance), or account for these effects within the calibration framework. This is particularly important given the poor NO₂ and SO₂ calibration results. Until such an analysis is provided, the currently developed LCAQ calibration model cannot be considered fit for use.
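One way to perform the suggested empirical check is a nested-model comparison: fit the calibration with and without candidate interferent channels and compare the gain in R². The sketch below uses synthetic data and a plain normal-equations OLS fit, not the authors' Equation 1:

```python
def ols_r2(rows, y):
    """R^2 of an OLS fit y ~ 1 + predictors, via normal equations and Gaussian elimination."""
    X = [[1.0] + list(r) for r in rows]  # prepend intercept column
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):                 # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k                     # back substitution
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    yhat = [sum(be * xi for be, xi in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic example: the reference value depends on both the target channel (x1)
# and an interferent channel (x2); omitting x2 leaves unexplained variance.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 0, 2, 1, 0, 2, 1, 0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
r2_with_cross = ols_r2([[a, b] for a, b in zip(x1, x2)], y)
r2_without = ols_r2([[a] for a in x1], y)
```

If `r2_with_cross - r2_without` is negligible on the real co-location data, cross-sensitivity can be dismissed empirically; if not, the interferent terms belong in the calibration model.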
6. Absence of a Validation Set — Risk of Overfitting and Contaminated Evaluation
The paper describes only a training/test split (80/20). No separate validation set is used for hyperparameter tuning or model selection (Appendix C evaluates multiple regression models including gradient boosting and random forests with tuned hyperparameters). Without a held-out validation set, the reported test-set performance risks being optimistic: if any model selection decisions were informed — even informally — by test-set behavior, the test set no longer represents a true independent evaluation. Standard practice in supervised machine learning requires a three-way split (e.g., 60/20/20 or 70/15/15) when multiple models or hyperparameters are compared. The authors must clarify their model selection procedure and, if the test set was used in any capacity for model comparison, must provide corrected evaluation on a genuinely held-out partition.
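The conventional remedy can be sketched as a chronological three-way split, which both separates model selection from final evaluation and avoids temporal leakage in autocorrelated time-series data such as these 30-minute observations. The fractions below are illustrative:

```python
def chrono_split(records, train_frac=0.6, val_frac=0.2):
    """Chronological train/validation/test split; records are assumed time-ordered.

    Keeping the split chronological (rather than random) prevents temporally
    adjacent, highly correlated observations from leaking across partitions.
    """
    n = len(records)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return records[:i], records[i:j], records[j:]
```

Model and hyperparameter selection would then use only the validation partition, with the test partition touched exactly once for the final reported numbers.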
7. Use of R² for Calibration Assessment is Inappropriate; Spearman Correlation Required
The paper uses Pearson R² to evaluate sensor calibration (Figures 3 and 4) and the feature correlation heat map (Figure 6). The R² values reported for NO₂ (0.22 at Site-B) and SO₂ (−0.17 at Site-C) are strikingly poor and should disqualify these sensors from use, yet the manuscript characterizes the calibration as "reasonably good" (p. 10, l. 259-260) and proceeds to include these pollutants as model inputs. Pearson R² is sensitive to outliers and is poorly suited to noisy electrochemical sensor data. The authors should replace Figures 3, 4, and 6 with Spearman rank-order correlation coefficients, which are more appropriate for this type of data and provide an honest characterization of calibration quality.
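The sensitivity difference is easy to demonstrate: for monotone data containing a single extreme outlier, the Spearman coefficient remains 1.0 while the Pearson coefficient collapses. The values below are synthetic:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation (no-ties formula)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = list(range(1, 11))
y = [2.0 * i for i in x[:-1]] + [200.0]  # perfectly monotone, one extreme outlier
```

Here `spearman(x, y)` is exactly 1.0 because the ordering is preserved, while `pearson(x, y)` drops well below 0.7 because a single outlier dominates the covariance, which is precisely why rank-based statistics are the more honest choice for noisy electrochemical sensor data.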
References: