the Creative Commons Attribution 4.0 License.
A Framework for Dynamic Hyper-local Source Apportionment using Low-cost Sensors for Real-time Policy Action
Abstract. The presence of particulate matter, toxic gases and other pollutants in the air poses a significant risk to human health and the environment. Identifying the different sources of air pollution, which is termed Source Apportionment (SA), needs to be done in real time in order to understand the dynamics of the contributing sources and to enable policy makers to frame effective regulatory measures to curb air pollution. The unit deployed for implementing the SA framework at a particular location must also be cost-effective, so that it becomes feasible to create a dense network of such units and thus cover a wide geographical area. The use of low-cost air quality monitoring sensors has become popular in this regard. In our proposed framework we use low-cost air quality sensor units in conjunction with machine learning models to develop a low-cost real-time solution for SA. Multi-output regression models, which are supervised machine learning models, are used for this purpose. Reference Grade Instruments are used for learning calibration models for the low-cost sensors as well as the multi-output regression models for SA. Once the calibration and multi-output regression models are learnt during training, the proposed framework allows the low-cost sensors to be deployed in the field as standalone devices, where they collect on-field data and store it on a remote server through a wireless network. This data can be pulled at the user end, calibrated and then fed to the trained model to obtain the SA results in terms of the relative abundance of the different sources in ambient air.
Mean Absolute Error (MAE) has been used as the metric to measure the accuracy in predicting the relative abundance of different sources, while Spearman's Rank Order Correlation Coefficient (SROCC) and Normalized Discounted Cumulative Gain (NDCG) are the metrics that have been used to estimate how well the proposed approach predicts the relative abundance of the different sources in the correct order. Extensive experimentation done using data gathered from two different environments in the city of Lucknow, India shows the robustness of the proposed approach in doing real-time SA. MAE values of less than 5 % have been obtained in predicting the relative abundance of most of the organic as well as elemental sources, while values of SROCC greater than 0.75 and NDCG greater than 0.85 obtained for all the sources show that the proposed framework also performs very well in predicting most of the sources in the correct order of their actual contribution to air pollution.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5677', Anonymous Referee #1, 13 Mar 2026
- RC2: 'Comment on egusphere-2025-5677', Anonymous Referee #2, 01 May 2026
Overview
This manuscript proposes a unique approach for source apportionment of PM2.5 via low-cost air quality (LCAQ) sensor inputs to machine learning models that are trained via collocated reference grade instruments (RGI). The authors train multi-output regression models from about 7 days of collocated LCAQ/RGI measurements taken at two sites roughly 20 km apart in Lucknow, India, where positive matrix factorization (PMF) is first applied to aerosol mass spectrometer (AMS) and energy dispersive x-ray fluorescence (Xact 625i) measurements and the resolved factors serve as the ground truth for model training. Model predictions of AMS and Xact 625i PMF factors are claimed to agree with AMS/Xact 625i PMF factors over 2-day test periods at both sites. As the authors have correctly pointed out, traditional source apportionment (e.g., PMF) of RGI measurements is resource-intensive and often time-consuming. Therefore, their research objectives are timely and relevant to Atmospheric Measurement Techniques, and the proposed framework would represent an important contribution to the state of the science. However, in its current form, the manuscript lacks the necessary data and methodological rigor to support the objectives or conclusions the authors claim to accomplish. For these reasons, which I detail below, major revisions with additional data and methodological improvements should be made before considering the work for publication.
Major Issues
1. The work’s claims about the proposed framework are disproportionate to the limited training and test data used. Much more RGI/LCAQ data is necessary for building and evaluating the framework this manuscript is proposing.
The authors describe their framework as robust and effective (e.g. lines 17-18 claim “extensive experimentation done” shows the “robustness of the proposed approach.”) yet the employed 80/20 split and very limited test set (2 days of test data) prevent a rigorous evaluation of model performance and the use of AMS/Xact 625i PMF factors as ground truths. PMF factors will vary over time scales much larger than the 10-day windows considered here. It has also been well documented in literature that AMS and Xact 625i PMF factor profiles themselves can vary substantially across sites and, to an extent, even within a site over time (Zhang et al., 2007; Canonaco et al., 2021; Chen et al., 2022). While the framework proposed is interesting, an expanded high-quality dataset is required to demonstrate its potential merit.
In order to demonstrate the need for more data, I pose a few questions to the authors. (a) Because the training set is within 0-7 days of the test set, predicted factor concentrations will already be similar due to autocorrelation – what if the model predicted concentrations a month later or a year later? If periodic RGI/LCAQ relocation periods are required for model training, this would be important information to include. A test set several weeks apart from the training set would ensure autocorrelation effects are not impacting model performance. (b) In this first submitted manuscript each LCAQ sensor was collocated with the RGI mobile lab and it’s my understanding that each regression model is specific to each site. What if the site B model was applied to nearby LCAQ sensor data from site C? This would be another more rigorous evaluation that would show the ability of multi-output regression models to quantify similar sources at a different place and time.
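Point (a) can be illustrated with a minimal sketch of a chronological split that leaves a gap between the training and test periods. The function name `split_with_gap`, the 8-day training window, the 21-day gap, and the synthetic 60-day timestamp series are illustrative assumptions, not values or code from the manuscript.

```python
from datetime import datetime, timedelta

def split_with_gap(timestamps, train_days, gap_days):
    """Train on the first `train_days`, skip `gap_days`, test on the rest."""
    start = min(timestamps)
    train_end = start + timedelta(days=train_days)
    test_start = train_end + timedelta(days=gap_days)
    train_idx = [i for i, t in enumerate(timestamps) if t < train_end]
    test_idx = [i for i, t in enumerate(timestamps) if t >= test_start]
    return train_idx, test_idx

# Hypothetical 30-minute series covering 60 days (48 samples per day).
start = datetime(2023, 10, 1)
stamps = [start + timedelta(minutes=30 * k) for k in range(60 * 48)]

train_idx, test_idx = split_with_gap(stamps, train_days=8, gap_days=21)
separation = stamps[test_idx[0]] - stamps[train_idx[-1]]
print(len(train_idx), len(test_idx), separation)
```

Because no test sample lies within three weeks of the last training sample, short-range autocorrelation cannot carry information from the training set into the evaluation.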
While RGI data and its source apportionment products are resource-intensive, there are many long-term deployments of RGI collocated near LCAQ monitors with free, open-source data products that would be well suited to the models proposed by the authors. For example, the ACTRIS network has aerosol chemical speciation monitors and aethalometers deployed at many sites and has conducted harmonized source apportionment (Chen et al., 2022).
2. There are concerns about the integrity and reliability of RGI and LCAQ data.
The authors thoughtfully describe the calibration procedure for LCAQ sensors and note the poor agreement between RGI and LCAQ monitors for SO2 and NO2 because measured concentrations are below detection limits. This is especially evident in the Fig. 4 SO2 agreement for site C. The challenges of LCAQ monitors for measuring SO2 and NO2 have been noted extensively (Duvall et al., 2016). If the SO2 and NO2 measurements are not reliable, they should not be included in the machine learning models as they will only impair the model’s ability to predict PM2.5 sources.
In Fig. 11, there appears to be poor time resolution or a lack of RGI AMS PMF data and LCAQ PMF predictions within the 2 day test set for site C. Between Oct 8 00:00 and Oct 8 20:00, (about half of the test set) it appears there are only 3 data points being compared. Interestingly, this data appears to be present in Fig. A2. Why does this appear to be excluded in the evaluation in Fig. 11?
Additional information on the operation of the HR-ToF-AMS and Xact 625i should be included in the manuscript. For the HR-ToF-AMS, what were the results of ionization efficiency calibrations? For the Xact 625i, what was the signal-to-noise of the elements measured? For all RGI deployments, what was the inlet height/configuration and position in the mobile lab?
3. The PMF source factors are being overinterpreted, and additional details should be provided on PMF source apportionment methods.
From 20 days of HR-ToF-AMS and Xact 625i measurements, the authors have resolved 5 AMS source factors and 7 Xact 625i source factors and claim the model is able to predict these different source factors well. Some of the PMF factors (e.g., “Fe-smelting”) seem very specific considering that other large sources of particulate iron exist (e.g., brake wear). The manuscript needs a more thorough rationalization of the PMF solutions (e.g., correlations with external tracers, diurnal profiles, comparisons to prior literature, bootstrapping). If the RGI PMF factors themselves have a high degree of uncertainty and mixing, I would not expect the LCAQ sensors to predict such factors well. If the model is not able to reproduce accurate source apportionment of these 12 factors with an expanded dataset, I would advise the authors to consider applying the model to simpler groups first (e.g., hydrocarbon like organic aerosol, oxygenated organic aerosol, sulfate, nitrate).
Canonaco et al., A new method for long-term source apportionment with time-dependent factor profiles and uncertainty assessment using SoFi Pro: application to 1 year of organic aerosol data, Atmospheric Measurement Techniques, 2021. https://doi.org/10.5194/amt-14-923-2021
Chen et al., European aerosol phenomenology − 8: Harmonised source apportionment of organic aerosol using 22 Year-long ACSM/AMS datasets, Environment International, 2022. https://doi.org/10.1016/j.envint.2022.107325
Duvall et al., Performance Evaluation and Community Application of Low-Cost Sensors for Ozone and Nitrogen Dioxide, Sensors, 2016. https://doi.org/10.3390/s16101698
Zhang et al., Ubiquity and dominance of oxygenated species in organic aerosols in anthropogenically-influenced Northern Hemisphere midlatitudes, Geophysical Research Letters, 2007. https://doi.org/10.1029/2007gl029979
Citation: https://doi.org/10.5194/egusphere-2025-5677-RC2
- RC3: 'Comment on egusphere-2025-5677', Anonymous Referee #3, 02 Jun 2026
Review Overview
This manuscript introduces a machine learning framework designed for real-time source apportionment (SA) of urban air pollutants. The authors propose a methodology that utilizes Positive Matrix Factorization (PMF) profiles derived from high-resolution reference instrumentation (HR-ToF-AMS and Xact-625i) as baseline target vectors to train the machine learning based models. While the framework addresses a highly relevant and timely topic for hyper-local air quality monitoring, substantial deficiencies regarding technical reporting, physical plausibility of the underlying datasets, and statistical evaluation preclude its publication in its current form. Major revisions are required to ensure the scientific validity and integrity of the proposed methodology before a final editorial decision can be reached.
Main Concerns
1. Technical Characterization of Reference Instrumentation and Mobile Deployment Setup
The reliability and accuracy of the reference grade instrument (RGI) measurements represent the core baseline of this study, as these observations constitute both the sensor calibration standards and the machine learning training targets. However, the manuscript fails to supply comprehensive descriptions regarding the technical specifications and operational parameters of the reference equipment hosted within the mobile laboratory. Although the low-cost sensors are described briefly, a rigorous accounting of the RGI mobile deployment configuration is missing. The authors must provide explicit details concerning inlet configurations, sampling line loss considerations, and the exact operation protocols used for field validation and calibration of the mass spectrometers and gas analyzers during the campaign period.
2. Methodological Handling and Retention of Unreliable Low-Cost Sensor Signals
A fundamental conflict exists between the authors' evaluation of sensor data quality and its subsequent inclusion in the machine learning framework. The text explicitly states that ambient SO2 concentrations continuously fell below the minimum detection limit (MDL) of the sensors, resulting in an unreliable usage in the regression model. Furthermore, the NO2 data also shows a poor coefficient of determination (R2 = 0.22). Despite acknowledging that these datasets are not reliable in the model, both SO2 and NO2 appear to be retained as active predictors within the regression framework. Incorporating input parameters characterized primarily by low signal-to-noise ratios introduces arbitrary noise that can degrade multi-output model performance. The authors must supply a rigorous justification for retaining these specific chemical datasets, perform a focused sensitivity analysis examining model errors with and without these inputs, and clarify the exact processing workflow applied to sub-MDL measurements.
3. Mass Balance Discrepancies and Physical Incongruities in Aerosol Data
A critical review of the baseline chemical concentrations summarized in Figure 5 reveals several serious reporting errors and physical contradictions. The concentration unit for Volatile Organic Compounds (VOCs) is omitted from the figure entirely. In addition, the reported median Black Carbon (BC) mass concentrations (6537.5 µg/m3 at Site-B and 5512.9 µg/m3 at Site-C) are multiple orders of magnitude higher than the total PM2.5 mass concentrations recorded at the same sites. Because BC is a constituent of total PM2.5 mass, the component concentration cannot physically exceed the bulk aerosol mass concentration. This implies a decimal scaling or unit conversion error (such as a systematic confusion between ng/m3 and µg/m3) during data analysis.
The authors indicate that Z-score normalization was implemented specifically to mitigate the massive disparities in numerical ranges among variables like CO and BC. If the primary BC dataset contains a systemic mathematical or physical error, the entire subsequent normalization workflow, model training, and regression feature weight allocations are compromised. The authors must comprehensively validate their data units, re-verify statistical summaries, and provide a clear chemical overview justifying why these locations are classified as distinct background or traffic environments based on their ambient profiles.
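The unit concern raised for Figure 5 can be turned into a simple plausibility check. The sketch below rescales the two BC medians quoted above under the assumption that they are actually in ng/m3, and tests them against a PM2.5 value. The helper names and the PM2.5 figure of 100 µg/m3 are hypothetical placeholders; the review does not quote the PM2.5 medians.

```python
NG_PER_UG = 1000.0

def ng_to_ug(value_ng_m3):
    """Convert a mass concentration from ng/m3 to ug/m3."""
    return value_ng_m3 / NG_PER_UG

def is_physically_plausible(bc_ug_m3, pm25_ug_m3):
    """BC is a component of PM2.5, so it cannot exceed the bulk mass."""
    return 0.0 <= bc_ug_m3 <= pm25_ug_m3

# BC medians as quoted in the review (labelled ug/m3 in Figure 5).
reported_bc = {"Site-B": 6537.5, "Site-C": 5512.9}

# Hypothetical PM2.5 median in ug/m3, NOT taken from the manuscript.
pm25_placeholder = 100.0

checks = {}
for site, bc in reported_bc.items():
    as_reported = is_physically_plausible(bc, pm25_placeholder)
    rescaled = is_physically_plausible(ng_to_ug(bc), pm25_placeholder)
    checks[site] = (as_reported, rescaled)
    print(site, as_reported, rescaled, round(ng_to_ug(bc), 4))
```

If the values are in ng/m3, they correspond to roughly 6.5 and 5.5 µg/m3, which could plausibly be a fraction of PM2.5; only the authors can confirm which unit was actually recorded.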
4. Statistical Limitations of the Training Temporal Scale
The volume of collected data points raises significant concerns regarding the suitability and generalization performance of the chosen data-driven approach. The underlying dataset is highly constrained, encompassing merely 364 and 369 paired observations for Site-B and 229 and 351 observations for Site-C. Because the raw readings were consolidated into 30-minute intervals and partitioned into an 80/20 training-to-testing ratio, the final validation phase evaluates the model over a transient window equivalent to less than two days of active testing data. Such constrained sample sizes increase the vulnerability of regression models to overfitting and fail to capture the atmospheric variance necessary to claim a robust or generalized framework. The authors need to address how a system trained on such restricted, single-month snapshots can provide reliable predictive capabilities across broader, dynamically changing atmospheric conditions.
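The size of the test window follows directly from the observation counts quoted above. The sketch below works through the arithmetic, assuming a chronological 80/20 split with the training count rounded down; the exact rounding used by the authors is not stated, and `held_out_window` is a hypothetical helper name.

```python
def held_out_window(n_obs, train_pct=80, minutes_per_obs=30):
    """Return (number of test observations, test duration in hours)."""
    n_train = n_obs * train_pct // 100  # integer arithmetic avoids float rounding
    n_test = n_obs - n_train
    return n_test, n_test * minutes_per_obs / 60

# Paired-observation counts quoted in the review.
counts = {"Site-B": (364, 369), "Site-C": (229, 351)}

windows = {}
for site, site_counts in counts.items():
    windows[site] = [held_out_window(n) for n in site_counts]
    print(site, windows[site])
```

Under these assumptions every test window is shorter than 48 hours, and the smallest covers less than one day.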
5. Reporting Gaps and Structural Disconnects in Source Apportionment
The implementation of the PMF source solutions and their integration with the subsequent machine learning steps require significant clarification. While the manuscript notes that factor solutions were selected based on standard statistical criteria (Q/Qexp) and scaled residuals, the manuscript does not present any underlying numerical indices or factor-resolution diagnostic plots for comparison.
Also, the chemical mass spectral markers or particular ion fragments that differentiate the less-oxidized BBOA-1 factor from the more-aged BBOA-2 factor are poorly detailed. The authors should incorporate diurnal profiles or detailed time series profiles to solidify the physical identity of these resolved factors rather than relying strictly on unvalidated mass profiles.
The manuscript suggests that a combined input matrix merging data from both Site-B and Site-C was processed through the initial PMF model, yet this structural step is not explicitly defined. If reference factor profiles were extracted collectively across both locations, the authors must supply a clear scientific rationale for why the subsequent machine learning evaluations and reference-versus-predicted comparisons were divided and executed on a site-segregated basis.
6. Representation Biases and Scale Inadequacies in Visualizations
The graphical methods employed to display and evaluate model output do not provide an objective or statistically comprehensive representation of framework performance. The source contribution comparisons presented in the pie charts (Figures 8, 10, 12, and 14) provide only localized snapshots of model performance. Displaying isolated, single 30-minute data steps selected at disparate times across figures introduces significant selection bias and prevents an objective evaluation of systemic model agreement. These snapshots should be replaced or supplemented with the averages integrated across the entire testing window, or broken down into multi-hour diurnal blocks to demonstrate sustained predictive skill.
In Figure 11, the comparative data resolution appears visibly lower than the defined 30-minute sampling sequence, creating prominent temporal gaps that remain unaddressed in the text. In the time-series trends (such as Figure 13), several trace emission factors—including Pb-rich, Fe-smelting, and Cl-rich sources—exhibit fractional contributions that hover near zero. The absolute scaling of these plots makes it visually impossible to determine whether the model is capturing temporal trends or merely predicting flat baselines. The authors should supplement these figures with explicit cross-correlation scatter plots, residual distribution profiles, or optimized y-axis scale ranges to properly demonstrate real-time tracking fidelity.
Citation: https://doi.org/10.5194/egusphere-2025-5677-RC3
- RC4: 'Comment on egusphere-2025-5677', Anonymous Referee #4, 10 Jun 2026
Overview: The manuscript proposes an appealing and potentially impactful approach: using ML to train low-cost sensors against co-located reference instruments so they can reproduce reference-grade source apportionment. The concept is sound and worth pursuing, but in its current form the manuscript has several gaps in the input data, the ML validation and evaluation, and the reliability of the PMF, as well as internal inconsistencies. These must be addressed through major revisions before it is ready for publication.
Major concerns
- Unreliable input predictors are retained without justification.
Not all of the LCAQ features used as model inputs are adequately calibrated, despite the text stating that all are used "after calibration" (l.211). Most clearly, there is no calibration equation for VOC: Eqn (1) provides calibration models for five species only (CO, NO2, O3, SO2, PM2.5), and VOC has neither a reference instrument nor a calibration model. Please clarify whether VOC is fed raw, state what physical quantity and units it represents, and justify retaining it.
In addition, the manuscript reports that ambient SO2 is below the sensor detection limit at both sites (R2 = 0.03 at Site-B, R2 = -0.17 at Site-C) and that NO2 calibrates poorly (R2 = 0.22 at Site-B). Inputs with this little signal contribute noise rather than information, yet both are retained in the regression and assigned non-negligible coefficient magnitudes (Fig. 16). Please report a sensitivity analysis of model performance with and without SO2, NO2, and VOC, and state precisely how sub-detection-limit values were handled in the feature pipeline. If performance is comparable without them, retaining noisy predictors should be avoided; if performance drops, that dependence on unreliable inputs is itself a concern for field deployment.
- PMF factor solutions require further validation and justification.
The authors justify the study design on the premise that the two sites are strongly contrasting: Site-B a traffic site, Site-C a background site (l.144-148) and they train separate per-site regression models on that basis (l.520). Yet, the PMF was run on a single pooled Site-B/Site-C matrix (l.290, l.561), which constrains each factor to a single profile across the whole run; only the time-resolved contributions differ by site. Pooling therefore cannot reveal whether a factor represents a different source at the two sites, only how active a shared profile is at each. Considering the policy relevance, even in the title, this ambiguity is consequential. Take the “Ferrous-Smelting” factor for example. The manuscript attributes it to "industrial sources or non-exhaust traffic emissions, such as brake or tire wear" (l.587-589), so the same Fe-rich signature could represent smelting at one site but brake/tire wear at the other. Examining the diurnal variability of each factor and its correlation with co-measured species would further help establish factor identity. These analyses, together with rotational and statistical diagnostics (Q/Q_exp, bootstrapping, scaled residuals), would demonstrate the robustness of the PMF solutions.
- The model’s high metric scores largely reflect source dominance, not predictive skill.
The SA is dominated by a single source in every case (LVOOA ~49% at Site-B and ~58% at Site-C; S-rich ~35-52% in the elemental case). A perfect NDCG for identifying the single top source at Site-C simply reflects that one source is almost always the largest. The component-wise MAE is likewise misleading when read in absolute terms. Recast relative to each source’s mean abundance, the minor factors show no skill: Pb-rich at Site-B has an MAE of 2.12% against a mean of 1.70%, and Cl-rich at Site-C has an MAE of 2.41% against a mean of 1.30%, i.e. the error equals or exceeds the quantity being predicted. The apparent success of the model therefore rests almost entirely on tracking the one dominant source, while the figures (pie charts and 0-100% time series) make the minor-source failures visually negligible. The model should be benchmarked against naïve baselines (e.g., always predicting the training-mean composition, or carrying the previous window forward). Only the improvement over such baselines shows whether the model has learned source dynamics rather than reproducing the average. This concern compounds the uncertainty issue below: the reported scores are both inflated (source dominance) and imprecise (small test set), so at present neither their magnitude nor the differences between models can be reliably interpreted.
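The relative-error argument and the two naïve baselines named above can be sketched as follows. The MAE and mean values are the ones quoted in this comment; the short train/test series and the model MAE of 0.5 are synthetic placeholders, and the function names are hypothetical.

```python
def mae(pred, truth):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def mean_baseline(train, n_test):
    """Always predict the training-mean abundance."""
    mean = sum(train) / len(train)
    return [mean] * n_test

def persistence_baseline(train, test):
    """Carry the previous observed value forward by one step."""
    return [train[-1]] + list(test[:-1])

def skill_score(model_mae, baseline_mae):
    """Positive only if the model beats the baseline."""
    return 1.0 - model_mae / baseline_mae

# MAE relative to mean abundance, using the figures quoted in the review.
relative_mae = {
    "Pb-rich, Site-B": 2.12 / 1.70,
    "Cl-rich, Site-C": 2.41 / 1.30,
}

# Synthetic abundance series (percent), for illustration only.
train, test = [1.0, 2.0, 3.0], [2.0, 4.0]
mae_mean = mae(mean_baseline(train, len(test)), test)
mae_persist = mae(persistence_baseline(train, test), test)
skill = skill_score(0.5, mae_mean)  # 0.5 is a hypothetical model MAE
print(relative_mae, mae_mean, mae_persist, skill)
```

A relative MAE above 1 means the typical error is larger than the quantity itself, and a skill score at or below zero means the model adds nothing over the baseline.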
- The reported performance of the selected model is not quantified with uncertainty.
With N_test ~ 46-75, the headline metrics for the chosen linear-regression model (and the abstract's summary claims of MAE < 5%, SROCC > 0.75, NDCG > 0.85) are point estimates from a few dozen test points, with wide and unreported uncertainty. Please report these metrics with bootstrap confidence intervals (resampling test timestamps) so the reliability of the reported performance can be assessed.
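A percentile bootstrap over test timestamps, as requested above, might look like the sketch below. The error series is synthetic, `bootstrap_ci` is a hypothetical helper, and the choice of 2000 resamples and a 95% interval are assumptions rather than requirements.

```python
import random

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-timestamp errors."""
    rng = random.Random(seed)
    n = len(errors)
    means = []
    for _ in range(n_boot):
        # Resample test timestamps with replacement.
        sample = [errors[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic absolute errors (percent) for 60 test timestamps.
errors = [abs(((k * 37) % 11) - 5) * 0.5 for k in range(60)]
point = sum(errors) / len(errors)
lo, hi = bootstrap_ci(errors)
print(round(point, 3), round(lo, 3), round(hi, 3))
```

Resampling individual timestamps treats them as independent. Since 30-minute samples are autocorrelated, a block bootstrap that resamples contiguous runs would likely give wider and more honest intervals.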
- Technical corrections
- S3, Fig. 5: correct BC units and values; add VOC units.
- S3, l.302 vs l.565: reconcile the elemental PMF factor list. Remove or restore “Fireworks”.
- S4, Eqn (5): add the square root to SAE.
- S9: While the full 22-element PMF input list is evident in Fig. A3, state it explicitly in the text as well.
Citation: https://doi.org/10.5194/egusphere-2025-5677-RC4
Viewed
- HTML: 2,152
- PDF: 2,308
- XML: 129
- Total: 4,589
- BibTeX: 193
- EndNote: 216
Recommendation to the editor:
The manuscript presents a potentially interesting proof-of-concept for real-time, low-cost SA, but it is not ready for publication in its current form. The fundamental issues of scope (single city, effectively single week of training and 2 days of test data), lack of instrument calibration, the absence of a proper validation set, the use of sub-detection-limit and potentially degraded sensor data in predictive models, and the overstated generalization claims represent major methodological and scientific integrity concerns that cannot be resolved easily. A substantially expanded underlying dataset — spanning multiple seasons, cities, or years — combined with the methodological corrections described below would be required before the work could be reconsidered for publication.
Review Summary
This manuscript proposes a machine learning framework for real-time, hyper-local air pollution source apportionment (SA) using low-cost air quality (LCAQ) sensors co-located with Reference Grade Instruments (RGI). Multi-output linear regression models are trained on calibrated LCAQ data, with SA outputs from Positive Matrix Factorization (PMF) applied to HR-ToF-AMS and Xact-625i data serving as ground truth. The framework is evaluated at two sites in Lucknow, India, during a single month (October 2023). While the research topic is timely and relevant to the scope of Atmospheric Measurement Techniques, the manuscript has fundamental deficiencies in experimental scope, methodological rigor, and scientific integrity of its claims that preclude publication in its current form.
Major Concerns
1. Overstated Claims of Robustness and Generalization
The introduction asserts that the objective of this study is “enhancing model robustness, calibration fidelity, and generalizability across diverse sensing environments” (p. 4, l. 115) and that "extensive experimentation has been carried out to validate the robustness of the proposed framework" (p. 5, l. 133). However, the entire study consists of two deployments within a single city during a single month. Each site yields approximately 350 observations at 30-minute resolution (roughly one week of data) (p. 15, l. 332-333). With an 80/20 train-test split, the test set contains only ~70 observations per site, less than two days of data. These sample sizes are wholly insufficient to support claims of robustness or generalization. To substantiate such claims, the authors would need to collect data across multiple seasons at the same sites, across multiple cities, or across multiple years. The current field monitoring scope supports only a proof-of-concept demonstration that could motivate a full investigation; it is not sufficient for a manuscript proposing a new framework.
2. Micro-Aethalometer Calibration Not Described
The Micro-Aethalometer (AethLabs AE-51) is deployed alongside the LCAQ sensor unit and its BC measurements are used as a key predictor. The paper states only that the device "comes lab calibrated" (p. 2, l. 176) and cites a field intercomparison study, but provides no site-specific calibration or cross-validation against the EBAM or other RGI at either deployment site. Amazingly, the abstract of that field intercomparison study reads, “Real-world quality assurance of these instruments should be performed through field IC against reference instruments with longer durations in areas of slowly changing eBC concentration” (Alas et al., 2020). For a manuscript focused on calibration methodology, the absence of any field calibration assessment of this instrument is a significant omission that should be rectified.
3. Potential NO₂ Sensor Degradation Mid-Deployment
Figure 3 displays the NO₂ calibration time series at Site-B and visually suggests a step-change in sensor behavior around 16–17 October 2023, consistent with sensor performance degradation. The authors must formally test this by computing the calibration correlation coefficient separately for data before and after this break-point. If a statistically significant change is identified, NO₂ data collected after that date should be excluded from training and inference. Given the already limited dataset size (~350 observations), such an exclusion could substantially constrain the usable training data and reduce the available inputs to three reliably calibrated pollutants (CO, O₃, and PM₂.₅), given the acknowledged below-detection-limit issues with SO₂. The implications for model validity need to be fully addressed; until they are, the underlying data are unusable in their current form.
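One way to carry out the before/after comparison is sketched below: compute the Pearson correlation on each side of the break-point and compare the two with a Fisher z-transform. The data are synthetic, the break index is a placeholder, and the Fisher test assumes independent, roughly bivariate-normal samples, which autocorrelated sensor data only approximate.

```python
from math import atanh, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    return (atanh(r1) - atanh(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))

# Synthetic example: the sensor tracks the reference, then stops responding.
reference = [float(k) for k in range(1, 21)]
sensor = [2.0 * r + (0.5 if i % 2 == 0 else -0.5)
          for i, r in enumerate(reference[:10])]
sensor += [5.0 if i % 2 == 0 else 3.0 for i in range(10)]

break_index = 10  # hypothetical position of the suspected step-change
r_before = pearson_r(sensor[:break_index], reference[:break_index])
r_after = pearson_r(sensor[break_index:], reference[break_index:])
z = fisher_z_difference(r_before, break_index,
                        r_after, len(sensor) - break_index)
print(round(r_before, 3), round(r_after, 3), round(z, 2))
```

A |z| above about 1.96 would indicate a change in correlation at the 5% level under the stated assumptions.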
4. Use of Below-Detection-Limit SO₂ Data in Predictive Models
The authors acknowledge that all ambient SO₂ concentrations at both sites fall below the minimum detection limit (MDL) of the Alphasense B4 sensor (~5 ppb), resulting in R² values near zero compared to reference grade instrumentation (p. 10, l. 261-263; 0.03 at Site-B, −0.17 at Site-C). Yet SO₂ appears to be retained as a predictor in the SA regression models, and the authors do not explicitly state that all SO₂ measurements are excluded. Using a pollutant whose measurements are all sub-MDL as a model predictor introduces noise rather than signal and violates fundamental principles of analytical measurement.
5. Cross-Sensitivity of Alphasense B4 Sensors Not Validated
The authors note (p. 34, l. 607–608) that the Alphasense B4 auxiliary electrode compensates for cross-sensitivities from interfering gases; however, the calibration model (Equation 1) does not include any terms to explicitly test for or remove residual cross-sensitivities among pollutant channels. The authors must either demonstrate empirically that cross-sensitivities are negligible in their deployment context (e.g., by showing that adding cross-sensitivity terms explains negligible additional variance), or account for these effects within the calibration framework. This is particularly important given the poor NO₂ and SO₂ calibration results. The currently developed LCS calibration model is unfit for use.
6. Absence of a Validation Set — Risk of Overfitting and Contaminated Evaluation
The paper describes only a training/test split (80/20). No separate validation set is used for hyperparameter tuning or model selection (Appendix C evaluates multiple regression models including gradient boosting and random forests with tuned hyperparameters). Without a held-out validation set, the reported test-set performance risks being optimistic: if any model selection decisions were informed — even informally — by test-set behavior, the test set no longer represents a true independent evaluation. Standard practice in supervised machine learning requires a three-way split (e.g., 60/20/20 or 70/15/15) when multiple models or hyperparameters are compared. In the future, the authors must clarify their model selection procedure and, if the test set was used in any capacity for model comparison, must provide corrected evaluation on a genuinely held-out partition.
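The three-way split described above can be sketched as a chronological partition in which model selection uses only the validation block. The 60/20/20 proportions come from the example in this comment; the function name, the sample count of 350, and the validation MAE values are hypothetical.

```python
def chronological_three_way_split(n_obs, train_pct=60, val_pct=20):
    """Split indices 0..n_obs-1 into contiguous train/validation/test blocks."""
    n_train = n_obs * train_pct // 100
    n_val = n_obs * val_pct // 100
    idx = list(range(n_obs))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = chronological_three_way_split(350)

# Hypothetical validation MAEs for three candidate models.
validation_mae = {"linear": 4.1, "random_forest": 4.6, "gradient_boosting": 4.4}

# Choose the model on validation data only; the test block stays untouched
# until the single final evaluation of the selected model.
selected = min(validation_mae, key=validation_mae.get)
print(len(train_idx), len(val_idx), len(test_idx), selected)
```

Keeping the blocks contiguous in time also avoids interleaving autocorrelated samples between partitions.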
7. Use of R² for Calibration Assessment is Inappropriate; Spearman Correlation Required
The paper uses Pearson R² to evaluate sensor calibration (Figures 3 and 4) and the feature correlation heat map (Figure 6). The R² values reported for NO₂ (0.22 at Site-B) and SO₂ (−0.17 at Site-C) are strikingly poor and should disqualify these sensors from use, yet the manuscript characterizes the calibration as "reasonably good" (p. 10, l. 259-260) and proceeds to include these pollutants as model inputs. Pearson R² is sensitive to outliers and is poorly suited to noisy electrochemical sensor data. In general, the authors must replace Figures 3, 4, and 6 with Spearman rank-order correlation coefficients, which are more appropriate for this type of data and provide an honest characterization of calibration quality.
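The outlier sensitivity mentioned above can be illustrated with a small sketch that computes both coefficients on the same synthetic data. Spearman's coefficient is implemented here as the Pearson correlation of average ranks; the data and function names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Synthetic monotonic relationship with one extreme reading.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 60.0]
pearson, spearman = pearson_r(x, y), spearman_r(x, y)
print(round(pearson, 3), round(spearman, 3))
```

The single extreme point pulls the Pearson coefficient well below 1 while the rank-based coefficient is unaffected, because the ordering of the points is preserved.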
References: