OIRF-LEnKF v1.0: A Self-evolving Data Assimilation System by Integrating Incremental Machine Learning with a Localized EnKF for Enhanced PM2.5 Chemical Component Forecasting and Analysis
Abstract. Assimilating observational data into numerical forecasts is crucial for accurately estimating the spatiotemporal distribution of PM2.5 chemical components (NH₄⁺, NO₃⁻, SO₄²⁻, OC, and BC), which in turn helps quantify the impact of aerosols on the environment, climate change, and human health. However, data assimilation (DA) based on chemical transport models (CTMs) is computationally inefficient for large ensemble sizes and offers limited forecast improvement, because it only provides optimal initial conditions. This paper introduces a machine learning (ML)-based self-evolving data assimilation system (OIRF-LEnKF v1.0) that produces high-quality forecast and analysis fields of chemical components at low computational cost. Computational efficiency tests indicate that the total time consumed by OIRF-LEnKF v1.0 is only 11.41–16.60 % of that of CTM-based DA, and only 0.13–0.20 % during the forecasting process. Sensitivity tests demonstrate that, compared with a stationary training mechanism, the self-evolution mechanism increases the Pearson correlation coefficient (CORR) by 2.28–11.75 % and reduces the root-mean-square error (RMSE) by 32.94–40.98 % during the forecasting process. A 2-month DA experiment shows that the RMSE values of chemical components after DA are below 7.80 µg m⁻³ in the forecasting process and below 2.36 µg m⁻³ in the analysis process, corresponding to reductions of at least 26.38 % and 68.99 %, respectively, relative to the values without DA. Notably, the forecast RMSE values of our system are 33.16–90.10 % lower than those of CTM-based DA, highlighting the superior forecasting capability of our system. Furthermore, the spatial overestimation and underestimation of chemical components are substantially mitigated after DA.
Compared with multiple reanalysis datasets of inorganic salt aerosols (CORR: 0.56–0.89, RMSE: 2.55–8.52 µg m⁻³), the dataset generated by OIRF-LEnKF v1.0 (CORR: 0.97, RMSE: 1.12 µg m⁻³) shows higher correlation and lower error.
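For readers who want to reproduce the kind of skill scores quoted above, the following is a minimal sketch of how CORR, RMSE, and a percentage RMSE reduction could be computed. It is not taken from the OIRF-LEnKF v1.0 code: the function names, the sample concentrations, and the definition of relative reduction as 100 × (reference − value) / reference are all assumptions made for illustration.

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between paired predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def corr(pred, obs):
    """Pearson correlation coefficient between paired predictions and observations."""
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def relative_reduction(reference, value):
    """Percentage by which `value` is lower than `reference` (assumed definition)."""
    return 100.0 * (reference - value) / reference


# Hypothetical concentrations in µg m^-3; these are NOT values from the paper.
obs = [10.0, 12.0, 8.0, 15.0]
no_da = [14.0, 8.0, 12.0, 11.0]    # each value is off by 4 µg m^-3
with_da = [11.0, 11.0, 9.0, 14.0]  # each value is off by 1 µg m^-3

rmse_no_da = rmse(no_da, obs)      # 4.0
rmse_with_da = rmse(with_da, obs)  # 1.0

print("CORR without DA:", corr(no_da, obs))
print("CORR with DA:", corr(with_da, obs))
print("RMSE reduction (%):", relative_reduction(rmse_no_da, rmse_with_da))  # 75.0
```

Under this assumed definition, a forecast RMSE that falls from 4.0 to 1.0 µg m⁻³ corresponds to a 75 % reduction. The percentage ranges reported in the abstract may rest on a different aggregation across components or stations.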