Improving Imputation of Missing PM2.5 Speciation Data Using PMF-Informed Source–Receptor Relationships
Abstract. Missing values are ubiquitous in atmospheric monitoring data because of instrument drift, calibration cycles, operational interruptions, and random malfunctions. Such gaps can undermine the reliability of subsequent analyses and introduce systematic biases. Conventional imputation methods, such as K-nearest neighbor (KNN), Bayesian principal component analysis (BPCA), and deep learning architectures, rely primarily on statistical correlations, require auxiliary inputs, and offer limited physical interpretability. To address these limitations, we propose a source–receptor informed Positive Matrix Factorization Reconstruction (PMFr) method that imputes missing PM2.5 speciation data from PMF-derived source–receptor relationships rather than purely statistical interpolation, and that requires no auxiliary data. Benchmarking against the commonly used KNN, BPCA, and deep learning predictive models shows that PMFr achieves superior accuracy and robustness across all real-world missing-data scenarios, with a mean coefficient of determination (R²) of 0.81, a mean index of agreement (IoA) of 0.92, and a mean absolute percentage error (MAPE) of 22.8 %. Relative to these benchmarks, PMFr reduces MAPE by 25.5–29.1 %, particularly for key PM2.5 species, highlighting its potential as a robust tool for recovering reliable data in air quality studies.
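To make the idea of factorization-based reconstruction concrete, the following is a minimal illustrative sketch and not the PMFr algorithm itself, whose details are not given in this abstract. It assumes that a samples-by-species concentration matrix X can be approximated as the product of non-negative source contributions G and source profiles F, fits that product to the observed entries only, and fills the gaps from G F. The function name masked_nmf_impute, its parameters, and the use of plain multiplicative updates with 0/1 weights are all assumptions made for illustration; PMF proper weights residuals by per-measurement uncertainties, which this sketch omits.

```python
import numpy as np


def masked_nmf_impute(X, n_factors, n_iter=500, eps=1e-9, seed=0):
    """Fill NaNs in X (samples x species) from a non-negative factorization X ~ G @ F.

    Hypothetical helper: observed entries get weight 1, missing entries weight 0,
    so only observed values drive the fit. X is assumed to be non-negative.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)            # True where a measurement exists
    W = observed.astype(float)         # 0/1 weights (no uncertainty weighting here)
    X0 = np.where(observed, X, 0.0)    # placeholder zeros never enter the loss

    rng = np.random.default_rng(seed)
    n_samples, n_species = X.shape
    G = rng.random((n_samples, n_factors)) + eps   # source contributions
    F = rng.random((n_factors, n_species)) + eps   # source profiles

    for _ in range(n_iter):
        # Multiplicative updates for the weighted squared-error loss
        # sum(W * (X - G @ F)**2); they keep G and F non-negative.
        G *= ((W * X0) @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ (W * X0)) / (G.T @ (W * (G @ F)) + eps)

    X_hat = G @ F
    # Keep every observed value; use the reconstruction only where data are missing.
    return np.where(observed, X, X_hat), G, F


def mape(obs, pred):
    """Mean absolute percentage error in percent; obs must be non-zero."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 100.0 * np.mean(np.abs(pred - obs) / np.abs(obs))


if __name__ == "__main__":
    # Synthetic demo: 3 hypothetical sources, 200 samples, 8 species.
    rng = np.random.default_rng(1)
    truth = rng.random((200, 3)) @ rng.random((3, 8)) + 0.01
    hidden = rng.random(truth.shape) < 0.1        # hide about 10 % of entries
    X = truth.copy()
    X[hidden] = np.nan

    filled, G, F = masked_nmf_impute(X, n_factors=3)
    print("MAPE on hidden entries (%):", mape(truth[hidden], filled[hidden]))
```

The number of factors is a user choice in this sketch, just as the number of sources must be chosen in a PMF analysis; the synthetic demo only checks that the mechanics work and says nothing about performance on real speciation data.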