the Creative Commons Attribution 4.0 License.
OIRF-LEnKF v1.0: A Self-evolving Data Assimilation System by Integrating Incremental Machine Learning with a Localized EnKF for Enhanced PM2.5 Chemical Component Forecasting and Analysis
Abstract. Assimilating observational data into numerical forecasts is crucial for accurately estimating the spatiotemporal distribution of PM2.5 chemical components (NH₄⁺, NO₃⁻, SO₄²⁻, OC, and BC), which is beneficial to quantifying the impact of aerosols on the environment, climate change and human health. However, chemical transport model (CTM)-based data assimilation (DA) is computationally inefficient for large ensemble sizes and offers limited improvements in forecasting, as it solely provides optimal initial conditions. This paper introduces a machine learning (ML)-based self-evolving data assimilation system (OIRF-LEnKF v1.0) that achieves high efficiency and high quality in the forecast and analysis fields of chemical components. Computational efficiency tests indicate that the total time consumed by OIRF-LEnKF v1.0 constitutes only 11.41–16.60 % of that of CTM-based DA, particularly during the forecasting process (0.13–0.20 %). Sensitivity tests demonstrate that the self-evolution mechanism in our system enhances the Pearson correlation coefficient (CORR) and reduces the RMSE during the forecasting process by 2.28–11.75 % and 32.94–40.98 %, respectively, compared to the stationary training mechanism. A 2-month DA experiment reveals that the RMSE values of chemical components after DA are less than 7.80 µg m⁻³ and 2.36 µg m⁻³ during the forecasting and analysis processes, respectively, indicating reductions of at least 26.38 % and 68.99 % compared to values without DA. Notably, the RMSE values of our system during the forecasting process exhibit a significant reduction of 33.16–90.10 % compared to those of the CTM-based DA, highlighting the superior forecasting capability of our system. Furthermore, the spatial overestimation and underestimation of chemical components have been significantly mitigated following DA. Compared to multiple reanalysis datasets of inorganic salt aerosols (CORR: 0.56–0.89, RMSE: 2.55–8.52 µg m⁻³), the dataset generated by OIRF-LEnKF v1.0 (CORR: 0.97, RMSE: 1.12 µg m⁻³) demonstrates higher data quality.
Status: open (extended)
-
CEC1: 'No compliance with the policy of the journal', Juan Antonio Añel, 11 Oct 2025
-
AC1: 'Reply on CEC1', Ting Yang, 13 Oct 2025
As attached.
-
CEC2: 'Reply on AC1', Juan Antonio Añel, 13 Oct 2025
Dear authors,
Thanks for the clarification. Unfortunately, this does not solve the problem. First, the declaration on the software and data used in your manuscript should be in the "Code and data availability" section, not in the Section 2.2 (“Data”). Secondly, from the sites listed in the table in your reply, only the one for NP2 is a repository. The others are not acceptable. Therefore, you must store all the data that you have used from the datasets mentioned in such table in a suitable repository according to our policy.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3960-CEC2
-
AC2: 'Reply on CEC2', Ting Yang, 20 Oct 2025
As attached.
-
CEC3: 'Reply on AC2', Juan Antonio Añel, 21 Oct 2025
Dear authors,
Many thanks for addressing the outstanding issues. We can consider the current version of your manuscript now in compliance with the Code and Data policy of our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3960-CEC3
-
RC1: 'Comment on egusphere-2025-3960', Anonymous Referee #1, 17 Nov 2025
General Comments
This paper presents a method called the optimized incremental random forest ensemble forecasting model with the localized ensemble Kalman filter (OIRF-LEnKF), which combines a computationally lightweight emulator of PM2.5 chemical components with the efficient and accurate data assimilation method LEnKF. This paper also presents a mechanism for online learning to update the OIRF model as new data arrives. The results on a real-world assimilation task of PM2.5 concentrations in a region in China show that the proposed algorithm remains stable across a long assimilation horizon while effectively assimilating observations, so that the analyses stay close to a notion of ground truth (based on reanalysis). The use of machine learning in various data assimilation applications is an important line of investigation; however, the authors could provide more motivation for particular choices made while creating the OIRF-LEnKF algorithm and perhaps further contextualize this work in relation to related work.
Scientific Comments
A - The authors propose a random forest model that is incrementally optimized as new information about the system is obtained, but provide minimal motivation for the choice of random forest over other approaches. This application area involves spatial datasets, which makes a neural network architecture like a CNN a good fit, especially given that many state-of-the-art emulators of spatial systems rely in part on CNNs. The motivation for the choice of RF should be made clearer in the paper, and perhaps an ablation against a CNN-type architecture should be provided. To create an ensemble like a random forest, the CNN training could be bagged.
B - Have there been other successful ML models of PM2.5 chemical components? If so, they should be cited in the related work. If these emulators do exist, why were any of them not used instead of the authors' proposed random forest model?
C - In line 90, a claim is made that increasing the number of ensemble members in the forecast “mitigates the underestimation of forecast error covariance”. It certainly helps mitigate, but it is not an assured cure. The authors should modify the language to something like “helps mitigate” to make the statement more accurate.
D - Is the idea of throwing away decision trees that do not perform as well as a predefined threshold on the updated dataset a novel contribution of this paper, or has this approach been used elsewhere? If it has been used elsewhere, the previous works should be cited.
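For readers following comment D, the mechanism under discussion can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: it assumes each tree is scored by mean absolute error (MAE) on the newest batch, that trees scoring worse than a fixed threshold are discarded, and that each discarded tree is replaced by one fitted to the new batch. The one-split "stump" learner, the threshold value, and all function names (`fit_stump`, `update_forest`, `predict`, `mae`) are hypothetical stand-ins.

```python
import random
from statistics import mean

def mae(model, X, y):
    """Mean absolute error of a callable model on the pairs (X, y)."""
    return mean(abs(model(x) - t) for x, t in zip(X, y))

def fit_stump(X, y, rng):
    """Hypothetical stand-in for a decision tree: a one-split stump
    fitted on a bootstrap resample of (X, y)."""
    idx = [rng.randrange(len(X)) for _ in X]  # bootstrap resample
    xs = [X[i] for i in idx]
    ys = [y[i] for i in idx]
    split = rng.choice(xs)  # random split point taken from the sample
    left = [t for x, t in zip(xs, ys) if x <= split]
    right = [t for x, t in zip(xs, ys) if x > split]
    lo = mean(left)  # never empty: the split point itself lies on the left
    hi = mean(right) if right else lo
    return lambda x: lo if x <= split else hi

def update_forest(forest, X_new, y_new, threshold, rng):
    """Keep trees whose MAE on the new batch is within the threshold and
    replace every discarded tree with one fitted on the new batch."""
    kept = [tree for tree in forest if mae(tree, X_new, y_new) <= threshold]
    n_dropped = len(forest) - len(kept)
    replacements = [fit_stump(X_new, y_new, rng) for _ in range(n_dropped)]
    return kept + replacements, n_dropped

def predict(forest, x):
    """Ensemble prediction: mean over the trees."""
    return mean(tree(x) for tree in forest)

# Synthetic regime shift (assumed data): y = x before, y = x + 5 after.
rng = random.Random(0)
X_old = [rng.random() for _ in range(200)]
y_old = list(X_old)
X_new = [rng.random() for _ in range(200)]
y_new = [x + 5.0 for x in X_new]

old_forest = [fit_stump(X_old, y_old, rng) for _ in range(10)]
new_forest, n_dropped = update_forest(old_forest, X_new, y_new,
                                      threshold=1.0, rng=rng)
```

In this toy setting every tree trained on the old regime exceeds the threshold on the new batch, so the whole forest is replaced. With a milder shift only some trees would be swapped out, which is the behaviour comment D asks the authors to place in the context of prior work.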
Technical Corrections
A - It would be helpful to spell out the name of the OIRF-LEnKF in the abstract (line 15).
B - In line 55, “DA technique has been widely used [...]” should be corrected to either “DA has been widely used [...]” or “DA techniques have been widely used [...].”
C - “Where” immediately after equation (1) should be lowercase.
D - What criterion is used to determine a split for each decision tree? Is it based on MAE? This should be made more clear in the text (roughly around line 145).
E - y is used to describe an analysis in line 154 but is then used to describe observations in line 215. The authors should stay consistent in the text that y refers to observations.
F - In lines 154-155, should “nth grid point after DA” be changed to the “ith grid point”? And similarly should “nth DT at the nth grid point” be changed to “nth DT at the ith grid point”?
G - Immediately after equation (5), it should be made clear that f_t^{DT}(x,theta_n) with a bar over the expression refers to the ensemble mean across decision trees in the random forest.
H - I think that the paper would benefit from a mathematical formulation of the difference between domain localization and observation localization (in the section in lines 219-243).
I - The construction of W in equation (11) is not immediately clear. Over what values do i and j range? Why is this matrix forced to be diagonal? What is the dimensionality of W? The answers to these questions should be made more clear in the text.
J - In Table 1, could the authors please also list the dimensions of the analysis (# latitudes, # longitudes, # features)?
K - In Figure 3a, is the objective referenced from the Bayesian optimization? It may be more clear to reference an equation number in the caption. In Figure 3c, why is there a sudden decrease at an ensemble size of 30 in the OIRF-LEnKF/NP2 (%)?
L - The colorbar in all subfigures in Figure 4 should start at 0 so that perceived color variations more closely correspond to significant differences in the values in the table. Figure 4c, for example, has two different colors assigned to 0.77 in the bottom right corner, likely due to small differences past the third decimal place. If these heatmaps no longer look interesting after making this change, then another plotting technique highlighting any interesting aspects should replace Figure 4.
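As context for comments H and I, one common way of writing the distinction is sketched below. It follows the usual local-analysis formulation and is not taken from the manuscript, so it may differ from the authors' equation (11). The selection operator $\mathbf{S}_i$, the weight function $w$, the distances $d_{ij}$, the cutoff radius $r$, and the local observation count $p_i$ are notation assumed here for illustration.

```latex
% Domain localization: the analysis at grid point i uses only the p_i
% observations within a cutoff radius r, picked out by a selection operator S_i.
\mathbf{y}_i = \mathbf{S}_i \mathbf{y}, \qquad
\mathbf{H}_i = \mathbf{S}_i \mathbf{H}, \qquad
\mathbf{R}_i = \mathbf{S}_i \mathbf{R} \mathbf{S}_i^{\mathsf{T}}

% Observation localization: the retained observations are additionally
% down-weighted with distance d_{ij} between grid point i and observation j.
\widetilde{\mathbf{R}}_i^{-1}
  = \mathbf{W}_i^{1/2} \, \mathbf{R}_i^{-1} \, \mathbf{W}_i^{1/2}, \qquad
\mathbf{W}_i = \operatorname{diag}\bigl( w(d_{i1}), \dots, w(d_{i p_i}) \bigr), \qquad
0 \le w(d) \le 1, \quad w(d) = 0 \ \text{for} \ d > r
```

Under these assumptions $\mathbf{W}_i$ is a $p_i \times p_i$ matrix, and it is diagonal because each observation receives a single scalar weight. Whether this matches the construction in the manuscript is for the authors to confirm.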
Citation: https://doi.org/10.5194/egusphere-2025-3960-RC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,311 | 63 | 32 | 1,406 | 19 | 21 |
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have published the ChinaHighPMC data in a restricted repository, and this does not comply with our policy, which requires that all the code and data used to produce a manuscript submitted to the journal be publicly available at the time of submission. Therefore, the current situation with your manuscript is irregular. Please publish the ChinaHighPMC data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.
Also, you must include a modified 'Code and Data Availability' section in a potentially reviewed manuscript, containing the information of the new repository.
I must note that if you do not fix this problem, we cannot accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor