https://doi.org/10.5194/egusphere-2025-3960
11 Sep 2025
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

OIRF-LEnKF v1.0: A Self-evolving Data Assimilation System by Integrating Incremental Machine Learning with a Localized EnKF for Enhanced PM2.5 Chemical Component Forecasting and Analysis

Hongyi Li, Ting Yang, Lei Kong, Di Zhang, Guigang Tang, and Zifa Wang

Abstract. Assimilating observational data into numerical forecasts is crucial for accurately estimating the spatiotemporal distribution of PM2.5 chemical components (NH₄⁺, NO₃⁻, SO₄²⁻, OC, and BC), which in turn helps quantify the impact of aerosols on the environment, climate change, and human health. However, chemical transport model (CTM)-based data assimilation (DA) is computationally inefficient for large ensemble sizes and offers limited improvements in forecasting, because it only provides optimal initial conditions. This paper introduces a machine learning (ML)-based self-evolving DA system (OIRF-LEnKF v1.0) that produces forecast and analysis fields of chemical components with high efficiency and high quality. Computational efficiency tests indicate that OIRF-LEnKF v1.0 requires only 11.41–16.60 % of the total time consumed by CTM-based DA, and only 0.13–0.20 % during the forecasting process. Sensitivity tests demonstrate that, compared with a stationary training mechanism, the self-evolution mechanism in our system increases the Pearson correlation coefficient (CORR) by 2.28–11.75 % and reduces the root-mean-square error (RMSE) by 32.94–40.98 % during the forecasting process. A 2-month DA experiment shows that the RMSE values of chemical components after DA are below 7.80 µg m⁻³ and 2.36 µg m⁻³ during the forecasting and analysis processes, respectively, corresponding to reductions of at least 26.38 % and 68.99 % relative to values without DA. Notably, during the forecasting process the RMSE values of our system are 33.16–90.10 % lower than those of the CTM-based DA, highlighting its superior forecasting capability. Furthermore, the spatial overestimation and underestimation of chemical components are significantly mitigated following DA. Compared with multiple reanalysis datasets of inorganic salt aerosols (CORR: 0.56–0.89, RMSE: 2.55–8.52 µg m⁻³), the dataset generated by OIRF-LEnKF v1.0 (CORR: 0.97, RMSE: 1.12 µg m⁻³) demonstrates higher data quality.
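
The abstract reports skill as CORR, RMSE, and percentage reductions in RMSE. As an illustration only, the following minimal Python sketch shows how such scores are commonly computed. It assumes that a relative reduction is defined as (baseline − improved) / baseline × 100; the preprint page does not state its formula. The function names and the sample concentrations are hypothetical and are not taken from the paper.

```python
import math


def pearson_corr(obs, sim):
    """Pearson correlation coefficient (CORR) between observations and simulations."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def rmse(obs, sim):
    """Root-mean-square error between observations and simulations."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline` (assumed definition)."""
    return 100.0 * (baseline - improved) / baseline


# Made-up concentrations in µg m^-3, for illustration only (not data from the paper).
obs = [10.0, 12.0, 15.0, 11.0]
no_da = [14.0, 17.0, 21.0, 14.0]    # hypothetical run without DA
with_da = [10.5, 12.5, 14.0, 11.5]  # hypothetical run with DA

rmse_no_da = rmse(obs, no_da)
rmse_with_da = rmse(obs, with_da)
print(f"CORR with DA: {pearson_corr(obs, with_da):.2f}")
print(f"RMSE without DA: {rmse_no_da:.2f}, with DA: {rmse_with_da:.2f}")
print(f"RMSE reduction: {relative_reduction(rmse_no_da, rmse_with_da):.2f} %")
```

Under this definition, a statement such as "reductions of at least 26.38 %" would mean that the RMSE after DA is at most 73.62 % of the RMSE without DA.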

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 06 Nov 2025)

Latest update: 11 Sep 2025
Short summary
Chemical transport model-based data assimilation is computationally inefficient for large ensemble sizes and offers limited improvements in forecasting PM2.5 chemical components. This paper introduces a machine learning-based data assimilation system that facilitates rapid iterations for forecasting, assimilation, and incremental learning. Results show that our system achieves superior efficiency and accuracy in forecasting and assimilation compared to traditional data assimilation.