the Creative Commons Attribution 4.0 License.
Improving Imputation of Missing PM2.5 Speciation Data Using PMF-Informed Source–Receptor Relationships
Abstract. Missing values are ubiquitous in atmospheric monitoring due to instrument drift, calibration cycles, operational interruptions, and other random malfunctions. Such gaps can undermine the reliability of subsequent analyses and introduce systematic biases. Conventional imputation methods, such as K-nearest neighbor (KNN), Bayesian principal component analysis (BPCA), and deep learning architectures, rely primarily on statistical correlations, often require auxiliary inputs, and offer limited physical interpretability. To address this issue, we propose a novel source–receptor informed Positive Matrix Factorization Reconstruction (PMFr) method that leverages PMF-derived source–receptor relationships, rather than purely statistical interpolation, to impute missing PM2.5 speciation data without requiring auxiliary data. Benchmarking against commonly used imputation techniques (KNN, BPCA, and a deep learning predictive model) demonstrates that PMFr achieves superior accuracy and robustness under all real-world missing scenarios, with a mean coefficient of determination (R2) of 0.81, index of agreement (IoA) of 0.92, and mean absolute percentage error (MAPE) of 22.8 %, reducing MAPE by 25.5–29.1 %, particularly for key PM2.5 species, highlighting its potential as a robust tool for recovering reliable data in air quality studies.
Status: open (until 22 Apr 2026)
- RC1: 'Comment on egusphere-2026-474', Anonymous Referee #1, 21 Mar 2026
- RC2: 'Comment on egusphere-2026-474', Anonymous Referee #2, 27 Mar 2026
This manuscript proposes a PMF-based reconstruction method (PMFr) for imputing missing values in PM2.5 speciation datasets by explicitly using source–receptor relationships, rather than relying solely on conventional statistical or machine-learning approaches. The study is interesting and potentially valuable, particularly because it does not evaluate imputation quality only in terms of reconstruction error, but also examines whether the reconstructed dataset preserves consistency in subsequent source apportionment results. This is a meaningful strength of the work. The comparison with several benchmark methods under multiple missing-data scenarios is also generally appropriate.
However, the manuscript still requires revision before it can be considered for publication. The paper shows clear promise, but some of the claims are broader than what is fully supported by the presented analysis, and several parts of the interpretation would benefit from more careful and balanced framing. In particular, the manuscript should more clearly distinguish the conditions under which PMFr performs especially well from those under which its advantage is limited, and it should present the generalizability of the method more cautiously in light of the assumptions underlying PMF-based reconstruction.
[Major comments]
1. The method is evaluated using data from a single urban site over a limited observational period and for a specific set of PM2.5 chemical species. However, the conclusion extends the potential applicability of PMFr to other atmospheric datasets, including VOC-related contexts. While this extension may be reasonable as a future possibility, the current manuscript does not yet demonstrate such breadth. I suggest that the authors moderate the scope of their claims and more clearly state that the present findings support the method under the conditions tested in this study.
2. Several explanations offered for species-specific performance differences are plausible and scientifically sensible, but they are still interpretive rather than directly demonstrated. For example, the discussion of lower sulfate performance, or the explanation of differential behavior for OC/EC and NH4+/NO3−, seems to go beyond the evidence shown in the main performance metrics. These interpretations should be framed more cautiously, using language such as “may reflect,” “is likely associated with,” or “is consistent with,” unless additional analysis is provided to directly support those mechanistic explanations.
3. One of the strengths of the manuscript is that it does not hide the fact that PMFr is not uniformly superior in every situation. There are cases in which performance is weaker, or where the gap between PMFr and alternative methods narrows. These include certain species such as sulfate, situations involving high Fe concentrations, and some scenarios involving medium gaps or instrument-failure-type missingness. These limitations are scientifically important and should be more explicitly synthesized in the discussion. A dedicated paragraph or subsection on the strengths and limitations of PMFr across species types and missingness patterns would make the manuscript more informative and more credible.
4. The manuscript provides multiple reasons for adopting the 7-factor solution, including interpretability, residual behavior, and diagnostic stability. However, the current presentation reads more as a list of supporting points than as a clearly structured argument. The authors should revise this section so that the decision logic becomes easier to follow. For example, the discussion could more explicitly distinguish why the lower-factor solutions were insufficient, why the higher-factor solutions were over-resolved or physically less meaningful, and why the selected solution best balanced interpretability and statistical diagnostics.
5. An important contribution of the study is not only that PMFr often performs well numerically, but also that it preserves source-related structure in a way that may be more physically meaningful for subsequent source apportionment. At present, however, these two strengths are sometimes discussed together as if they were the same claim. The paper would be stronger if it clearly distinguished between them. For example, there may be situations in which another method performs competitively on certain numerical metrics, whereas PMFr retains greater interpretive consistency in the source-apportionment context. Making this distinction explicit would sharpen the central message of the paper.
[Minor comments]
1. The manuscript contains a number of grammatical and stylistic issues, including subject–verb agreement errors, awkward phrasing, incorrect verb forms, article usage problems, and inconsistent spacing around parentheses.
(e.g., “by multipling”, “The best-fitting solution were selected”, “Potassium (K) was treat as missing”, “PMFr achieve/decline/capture”, and “…of scaled residuals(Reff et al., 2007;…”).
2. Table 1 should be checked for ion notation consistency. Several ionic species are listed without charge notation (e.g., NH4, SO4, NO3, Ca, K), whereas the main text uses formal ionic expressions.
3. The conclusion section does a good job of emphasizing the promise of PMFr, but it would be stronger if it also briefly acknowledged the conditions under which the method appears less robust. A short, balanced statement about both the advantages and the observed limitations would make the paper’s ending more convincing and scientifically grounded.
Citation: https://doi.org/10.5194/egusphere-2026-474-RC2
- RC3: 'Comment on egusphere-2026-474', Anonymous Referee #3, 29 Mar 2026
This manuscript proposes a novel imputation framework (PMFr) that leverages source–receptor relationships derived from PMF to reconstruct missing PM2.5 speciation data. Addressing missing data due to instrument failures and monitoring gaps is an important and practical issue, and the attempt to incorporate physically interpretable source profiles rather than relying solely on statistical covariance is a clear strength of this work.
However, several key aspects of the methodology require further elaboration. In particular, the role of the pre-imputation step, the justification for the assignment of uncertainties, and the comparison with standard PMF practices need to be addressed to fully demonstrate the robustness of the proposed approach. I recommend major revisions to clarify the workflow and to more rigorously validate the method before the manuscript can be considered for publication.
Major Comments:
1) The methodological description in Section 2.3 would benefit from further clarification to address potential concerns about model independence. According to the text, tracer species are first imputed using another method, such as KNN, while non-tracers are filled using the geometric mean prior to the initial PMF run. Given that the PMFr framework relies on these pre-imputed values to derive source profiles and subsequently reconstruct missing data, it is currently difficult to isolate the performance of the PMFr method itself from the accuracy of the initial KNN imputation. To address this, the authors should clearly delineate the full workflow, including all intermediate steps. Providing a comprehensive flowchart in Figure 1 that details the full pipeline from raw data to pre-imputation, initial PMF, reconstruction, and the final PMF would greatly improve clarity. Additionally, conducting a sensitivity analysis to evaluate how different pre-imputation methods or levels of error in pre-imputed tracers propagate into the final PMFr results is necessary to demonstrate the methodological robustness of the framework.
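To make the reconstruction step in question concrete, the following is a minimal sketch of how source contributions can be fitted from the observed species of one sample once a profile matrix is held fixed. This is illustrative only, not the authors' exact algorithm; in particular, a plain least-squares fit with a zero clip stands in for the proper non-negative solver a real PMF implementation would use.

```python
import numpy as np

def reconstruct_missing(x_obs, obs_idx, F):
    """PMFr-style reconstruction of one sample (illustrative sketch).

    With the profile matrix F (k sources x m species) held fixed,
    fit the source contribution vector g from the sample's observed
    species only, then predict the full species vector as g @ F.
    The clip is a crude stand-in for a non-negativity constraint.
    """
    A = F[:, obs_idx].T                        # (n_obs, k) design matrix
    g, *_ = np.linalg.lstsq(A, x_obs, rcond=None)
    g = np.clip(g, 0.0, None)                  # crude non-negativity
    return g @ F                               # reconstructed full sample
```

A sensitivity analysis of the kind requested above would then perturb the pre-imputed tracer values feeding into F and track how the reconstructed samples respond.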
2) While the manuscript comprehensively compares PMFr against LI, KNN, BPCA, and DBN, it omits the most widely used baseline in receptor modeling practice. The U.S. EPA PMF 5.0 User Guide recommends handling missing values by replacing them with the species median and assigning an uncertainty of four times the median (400%). As this approach is routinely used in real-world PMF applications, including it as a baseline would help the receptor modeling community better assess the meaningful improvement provided by PMFr. The authors are encouraged to include this EPA-recommended method as a baseline and compare the PMFr performance against it under the various missing-data scenarios.
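For reference, the EPA-recommended baseline described above can be sketched in a few lines (an illustrative implementation of the median-replacement rule, not code from the manuscript or from EPA PMF 5.0 itself):

```python
import numpy as np

def epa_median_fill(X):
    """EPA PMF 5.0-style handling of missing values (sketch).

    Missing entries in each species column are replaced by the
    species median, and those cells are assigned an uncertainty
    of four times the median. Returns the filled matrix and an
    uncertainty matrix for the filled cells (NaN elsewhere).
    """
    X = np.asarray(X, dtype=float)
    X_filled = X.copy()
    U_filled = np.full_like(X, np.nan)
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        med = np.nanmedian(X[:, j])
        X_filled[missing, j] = med
        U_filled[missing, j] = 4.0 * med
    return X_filled, U_filled
```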
3) The treatment of uncertainty in Section 2.3 requires further justification. The manuscript states that tracers are assigned an uncertainty equal to 10% of their imputed value, while non-tracers are assigned an uncertainty equal to eight times the geometric mean. The physical or statistical rationale for these specific multipliers is currently missing. Since the uncertainty matrix directly controls the PMF objective function (Q-value) and strongly influences the model solution, these parameters are critical. The authors should provide a justification for these choices, whether through literature references, empirical evidence, or theoretical reasoning, and briefly discuss or conduct a sensitivity analysis demonstrating how different uncertainty assignments might affect the PMFr performance and subsequent PMF outputs.
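To make the role of the uncertainty matrix concrete, the standard PMF objective can be sketched as follows (symbols X, U, G, F follow the usual PMF notation; this is illustrative, not the authors' code). Because each cell's residual is divided by its assigned uncertainty, multipliers such as 10% of the imputed value or eight times the geometric mean directly rescale the weight that cell carries in the fit:

```python
import numpy as np

def pmf_q_value(X, U, G, F):
    """Scaled-residual objective minimized by PMF (sketch).

    Q = sum_ij ((x_ij - sum_k g_ik * f_kj) / u_ij)^2.
    Larger assigned uncertainties u_ij down-weight the
    corresponding cells; doubling u_ij quarters that cell's
    contribution to Q.
    """
    R = (X - G @ F) / U   # scaled residuals
    return float(np.sum(R ** 2))
```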
Citation: https://doi.org/10.5194/egusphere-2026-474-RC3
Missing data in PM2.5 speciation monitoring, due to instrumental drift, calibration, and maintenance, poses challenges for source apportionment and health risk assessments. Conventional imputation methods, including statistical techniques and deep learning, depend on mathematical correlations and often lack physical interpretability. This study presents a novel Positive Matrix Factorization-based reconstruction method (PMFr) that integrates source profile characteristics into the imputation process. Unlike traditional models that rely solely on data covariance, this approach uses "low-entropy structures" to reconstruct latent information, ensuring chemical consistency and physical interpretability. Given its potential to improve data quality in atmospheric research, the reviewer recommends this work for publication with some revisions and clarifications.
Major Comments:
1) The manuscript introduces a novel framework for data imputation based on low-entropy structures, but lacks practical guidelines on its limits of applicability for specific timestamps. It does not define conditions under which the method may fail due to insufficient observational constraints. The methodology assumes the source contribution vector (G) can be uniquely resolved from observed species, which requires at least one key tracer species for each source factor. However, the manuscript does not address scenarios where all characteristic species for a specific source are missing, leading to an under-constrained system that undermines the imputation's reliability.
The authors should include a section on practical principles for validity checks. It must state that before imputation, users should ensure each identified source factor has at least one non-missing key tracer. If any time point lacks all diagnostic tracers for a source, that data point should be flagged as un-imputable or handled with caution.
To operationalize this principle, the authors should add a table listing the "Non-Missable Key Tracers" for each “pollution source”. This table should clearly map each source factor to its essential diagnostic species. This will serve as a vital reference for practitioners to assess data quality and imputation feasibility before applying the model.
Addressing these points is essential to prevent the misapplication of the method and to clarify the boundary conditions under which the proposed imputation remains scientifically valid.
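A validity check of this kind could be as simple as the following sketch. The source-to-tracer mapping shown is hypothetical, for illustration only; the actual mapping would come from the diagnostic species of the resolved PMF factors:

```python
import numpy as np

# Hypothetical source -> key-tracer mapping (illustrative only;
# the real table would list the diagnostic species of each
# resolved factor, as the requested "Non-Missable Key Tracers").
KEY_TRACERS = {
    "sea_salt": ["Na", "Cl"],
    "dust": ["Ca", "Fe"],
    "traffic": ["EC", "OC"],
}

def flag_unimputable(row, tracers=KEY_TRACERS):
    """Return the source factors whose key tracers are all missing
    at this time point (row: dict species -> value, with None or
    NaN marking missing). A non-empty result means the sample
    should be flagged as un-imputable or handled with caution."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and np.isnan(v))
    return [src for src, species in tracers.items()
            if all(is_missing(row.get(s)) for s in species)]
```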
2) Mixed missing-data patterns (MCMS vs. MCMI) in Cases 4–8. MCMS is inherently much more challenging than MCMI because it removes source identifiability, whereas MCMI only removes temporal continuity. A model might perform well under MCMI but fail catastrophically under MCMS. Combining these two patterns into a single performance metric for each Case obscures the specific source of error. Therefore, the reviewer suggests that reporting the results for pure MCMS scenarios and pure MCMI scenarios separately is more scientifically valid.
3) The PMFr method relies on the assumption that source chemical profiles remain stable over time. However, real-world atmospheric conditions lead to dynamic source signatures that can vary significantly due to seasonal changes, fuel composition, and combustion conditions. This variability can introduce biases in reconstructed data if profiles differ from reality. Though this study uses a short two-month dataset, concerns about using this method over longer periods (e.g., multi-year datasets) highlight issues with profile stability. The manuscript currently lacks guidance on determining the appropriate temporal window for stable profiles. The reviewer advises the authors to provide clear, quantitative guidelines for assessing this assumption, including metrics or statistical tests (like rolling window analysis or change-point detection) to identify when profiles need recalibration or updating.
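As one concrete form such a quantitative guideline could take, a rolling-window check might compare the same factor profile resolved on adjacent time windows; the metric below is a minimal sketch (cosine similarity; any recalibration threshold, e.g. 0.9, would be an assumption to be justified by the authors):

```python
import numpy as np

def profile_similarity(f_a, f_b):
    """Cosine similarity between the same factor profile resolved
    on two adjacent time windows. Values near 1 are consistent
    with a stable source signature; a drop below a chosen
    threshold would signal that the profile needs re-estimation."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    return float(f_a @ f_b / (np.linalg.norm(f_a) * np.linalg.norm(f_b)))
```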
4) The PMFr framework relies on a linear mixing model (C=G×F), assuming observed concentrations are linear combinations of primary emissions. However, secondary components like sulfates, nitrates, and Secondary Organic Carbon (SOC) arise from complex, non-linear photochemical reactions, which the linear assumption may fail to accurately capture, particularly during heavy pollution or specific weather conditions. The manuscript does not sufficiently address the uncertainty introduced by this assumption in reconstructing secondary species. It is recommended that the authors discuss the limitations of the linear model in secondary aerosol formation and consider conducting a sensitivity analysis to quantify the uncertainty.
Minor comments:
1) Line 82. MCMS and MCMI should be defined in the first paragraph of section 2.2.
2) Figure 1b illustrates model performance metrics through a scatter plot comparing MAPE (y-axis) and IOA (x-axis), with R2 values annotated. However, it does not visualize the standard deviation (σ) of modeled data against observations. A model may show high IOA and low MAPE but still misrepresent variability, indicating "amplitude bias," which is crucial for accurate source contribution estimates. The authors should include a Taylor Diagram as a supplementary figure for a comprehensive statistical assessment of variance and correlation in the observed data.