OIRF-LEnKF v1.0: A Self-evolving Data Assimilation System by Integrating Incremental Machine Learning with a Localized EnKF for Enhanced PM2.5 Chemical Component Forecasting and Analysis

Li, Hongyi; Yang, Ting; Kong, Lei; Zhang, Di; Tang, Guigang; Wang, Zifa

doi:10.5194/egusphere-2025-3960

Preprints

https://doi.org/10.5194/egusphere-2025-3960

11 Sep 2025

Abstract. Assimilating observational data into numerical forecasts is crucial for accurately estimating the spatiotemporal distribution of PM2.5 chemical components (NH₄⁺, NO₃⁻, SO₄²⁻, OC, and BC), which in turn helps quantify the impact of aerosols on the environment, climate change, and human health. However, chemical transport model (CTM)-based data assimilation (DA) is computationally inefficient for large ensemble sizes and offers limited improvements in forecasting, as it provides only optimal initial conditions. This paper introduces a machine learning (ML)-based self-evolving data assimilation system (OIRF-LEnKF v1.0) that achieves high efficiency and high quality in the forecast and analysis fields of chemical components. Computational efficiency tests indicate that the total time consumed by OIRF-LEnKF v1.0 is only 11.41–16.60 % of that of CTM-based DA, and only 0.13–0.20 % during the forecasting process. Sensitivity tests demonstrate that, compared to the stationary training mechanism, the self-evolution mechanism in our system increases the Pearson correlation coefficient (CORR) by 2.28–11.75 % and reduces the RMSE by 32.94–40.98 % during the forecasting process. A 2-month DA experiment reveals that the RMSE values of chemical components after DA are less than 7.80 µg m⁻³ during the forecasting process and less than 2.36 µg m⁻³ during the analysis process, reductions of at least 26.38 % and 68.99 %, respectively, compared to values without DA. Notably, the RMSE values of our system during the forecasting process are 33.16–90.10 % lower than those of the CTM-based DA, highlighting the superior forecasting capability of our system. Furthermore, the spatial overestimation and underestimation of chemical components are significantly mitigated following DA. Compared to multiple reanalysis datasets of inorganic salt aerosols (CORR: 0.56–0.89, RMSE: 2.55–8.52 µg m⁻³), the dataset generated by OIRF-LEnKF v1.0 (CORR: 0.97, RMSE: 1.12 µg m⁻³) demonstrates higher data quality.
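
The abstract reports skill as CORR, RMSE, and percentage changes relative to a baseline. As a minimal sketch of how such figures are conventionally computed (the function names and the toy numbers below are illustrative assumptions, not taken from the manuscript or its code):

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between paired predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def pearson_corr(pred, obs):
    """Pearson correlation coefficient (CORR) of two equally long sequences."""
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Toy concentrations in ug m^-3, invented purely for illustration.
obs = [10.0, 12.0, 15.0, 11.0]
without_da = [14.0, 17.0, 19.0, 16.0]
with_da = [10.5, 12.5, 14.0, 11.5]

rmse_without = rmse(without_da, obs)
rmse_with = rmse(with_da, obs)
print(f"RMSE reduction: {percent_reduction(rmse_without, rmse_with):.2f} %")
print(f"CORR with DA: {pearson_corr(with_da, obs):.3f}")
```

Read this way, a statement such as "reductions of at least 26.38 %" means the RMSE after DA is at most 73.62 % of the RMSE without DA.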

Received: 13 Aug 2025 – Discussion started: 11 Sep 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

CEC1:
'No compliance with the policy of the journal', Juan Antonio Añel, 11 Oct 2025

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have published the ChinaHighPMC data in a restricted repository, and this does not comply with our policy, which requires that all the code and data used to produce a manuscript submitted to the journal be publicly available at submission. Therefore, the current situation with your manuscript is irregular. Please publish the ChinaHighPMC data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.
Also, you must include a modified 'Code and Data Availability' section in a potentially reviewed manuscript, containing the information of the new repository.
I must note that if you do not fix this problem, we cannot accept your manuscript for publication in our journal.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2025-3960-CEC1
- AC1:
  'Reply on CEC1', Ting Yang, 13 Oct 2025
  
  As attached.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3960-AC1
  - CEC2: 'Reply on AC1', Juan Antonio Añel, 13 Oct 2025
    
    Dear authors,
    Thanks for the clarification. Unfortunately, this does not solve the problem. First, the declaration on the software and data used in your manuscript should be in the "Code and data availability" section, not in Section 2.2 (“Data”). Second, of the sites listed in the table in your reply, only the one for NP2 is a repository. The others are not acceptable. Therefore, you must store all the data that you have used from the datasets mentioned in that table in a suitable repository according to our policy.
    Juan A. Añel
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2025-3960-CEC2
    
    AC2: 'Reply on CEC2', Ting Yang, 20 Oct 2025
    
    As attached.
    
    Citation: https://doi.org/10.5194/egusphere-2025-3960-AC2
    
    CEC3: 'Reply on AC2', Juan Antonio Añel, 21 Oct 2025
    
    Dear authors,
    Many thanks for addressing the outstanding issues. We can consider the current version of your manuscript now in compliance with the Code and Data policy of our journal.
    Juan A. Añel
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2025-3960-CEC3
RC1:
'Comment on egusphere-2025-3960', Anonymous Referee #1, 17 Nov 2025

General Comments
This paper presents a method called the optimized incremental random forest ensemble forecasting model with the localized ensemble Kalman filter (OIRF-LEnKF), which combines a computationally lightweight emulator of PM2.5 chemical components with the efficient and accurate data assimilation method LEnKF. This paper also presents a mechanism for online learning to update the OIRF model as new data arrive. The results on a real-world assimilation task of PM2.5 concentrations in a region in China show that the proposed algorithm remains stable across a long assimilation horizon while effectively assimilating observations, so that the analyses remain close to a notion of ground truth (based on reanalysis). The use of machine learning in various data assimilation applications is an important investigation; however, the authors could provide more motivation for particular choices made while creating the OIRF-LEnKF algorithm and perhaps further contextualize this work in relation to related work.

Scientific Comments
A - The authors propose a random forest model that is incrementally optimized as new information about the system is obtained, but provide minimal motivation for the choice of random forest over other approaches. This application area involves spatial datasets, which makes a neural network architecture like a CNN a good fit, especially given that many state-of-the-art emulators of spatial systems rely in part on CNNs. The motivation for the choice of RF should be made clearer in the paper, and perhaps an ablation against a CNN-type architecture should be provided. To create an ensemble like a random forest, the CNN training could be bagged.
B - Have there been other successful ML models of PM2.5 chemical components? If so, they should be cited in the related work. If these emulators do exist, why were any of them not used instead of the authors' proposed random forest model?
C - In line 90, a claim is made that increasing the number of ensemble members in the forecast “mitigates the underestimation of forecast error covariance”. It certainly helps mitigate, but it is not an assured cure. The authors should modify the language to something like “helps mitigate” to make the statement more accurate.
D - Is the idea of throwing away decision trees that do not perform as well as a predefined threshold on the updated dataset a novel contribution of this paper, or has this approach been used elsewhere? If it has been used elsewhere, the previous works should be cited.
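
To make Scientific Comment D concrete, the following is a minimal sketch of threshold-based tree replacement in an incrementally updated forest. It is an assumption-laden illustration, not the manuscript's implementation: the one-split stumps, the squared-error split criterion, the RMSE threshold, and all names (`fit_stump`, `update_forest`, and so on) are hypothetical, and the data are invented.

```python
import math
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D inputs (squared-error criterion)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant predictor
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bootstrap_stump(xs, ys, rng):
    """Train one stump on a bootstrap resample, as in bagging."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return fit_stump([xs[i] for i in idx], [ys[i] for i in idx])


def rmse(predict, xs, ys):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))


def forest_predict(forest, x):
    """Forest output is the mean over its trees."""
    return sum(predict_stump(t, x) for t in forest) / len(forest)


def update_forest(forest, xs_new, ys_new, rmse_threshold, rng):
    """Keep trees that still fit the new batch; replace the rest with new ones."""
    kept = [t for t in forest
            if rmse(lambda x: predict_stump(t, x), xs_new, ys_new) <= rmse_threshold]
    refreshed = list(kept)
    while len(refreshed) < len(forest):
        refreshed.append(fit_bootstrap_stump(xs_new, ys_new, rng))
    return refreshed


# Invented data: the target shifts by +10 between the old and the new batch.
rng = random.Random(0)
xs = [i / 10 for i in range(10)]
ys_old = [0.0 if x < 0.5 else 1.0 for x in xs]
ys_new = [y + 10.0 for y in ys_old]

forest = [fit_bootstrap_stump(xs, ys_old, rng) for _ in range(5)]
updated = update_forest(forest, xs, ys_new, rmse_threshold=2.0, rng=rng)
rmse_before = rmse(lambda x: forest_predict(forest, x), xs, ys_new)
rmse_after = rmse(lambda x: forest_predict(updated, x), xs, ys_new)
```

In this toy setting every old tree exceeds the threshold on the shifted batch and is replaced, whereas a batch drawn from the old regime would leave the forest unchanged. Whether the manuscript applies the threshold per tree, per grid point, or on another metric is not stated in this discussion.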

Technical Corrections
A - It would be helpful to spell out the name of the OIRF-LEnKF in the abstract (line 15).
B - In line 55, “DA technique has been widely used [...]” should be corrected to either “DA has been widely used [...]” or “DA techniques have been widely used [...].”
C - “Where” immediately after equation (1) should be lowercase.
D - What criterion is used to determine a split for each decision tree? Is it based on MAE? This should be made clearer in the text (roughly around line 145).
E - y is used to describe an analysis in line 154 but is then used to describe observations in line 215. The authors should stay consistent in the text that y refers to observations.
F - In lines 154-155, should “nth grid point after DA” be changed to the “ith grid point”? And similarly should “nth DT at the nth grid point” be changed to “nth DT at the ith grid point”?
G - Immediately after equation (5), it should be made clear that f_t^{DT}(x,theta_n) with a bar over the expression refers to the ensemble mean across decision trees in the random forest.
H - I think that the paper would benefit from a mathematical formulation of the difference between domain localization and observation localization (in the section in lines 219-243).
I - The construction of W in equation (11) is not immediately clear. Over what values do i and j range? Why is this matrix forced to be diagonal? What is the dimensionality of W? The answers to these questions should be made clearer in the text.
J - In Table 1, could the authors please also list the dimensions of the analysis (# latitudes, # longitudes, # features)?
K - In Figure 3a, is the objective referenced from the Bayesian optimization? It may be more clear to reference an equation number in the caption. In Figure 3c, why is there a sudden decrease at an ensemble size of 30 in the OIRF-LEnKF/NP2 (%)?
L - The colorbar in all subfigures in Figure 4 should start at 0 so that perceived color variations more closely correspond to significant differences in the values in the table. Figure 4c, for example, has two different colors assigned to 0.77 in the bottom right corner, likely due to small differences past the third decimal place. If these heatmaps no longer look interesting after making this change, then another plotting technique highlighting any interesting aspects should replace Figure 4.
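
For context on Technical Corrections H and I, one common textbook way to write the two localization schemes is sketched below. This is a generic formulation under stated assumptions and is not the manuscript's equation (11): the selection matrix S_i, the taper function w, the localization length c, and the number of local observations m_i are notation introduced here for illustration.

```latex
% Domain localization: the analysis at grid point i uses only the
% observations inside a local domain, picked out by a 0/1 selection matrix S_i.
\[
\mathbf{y}_i = \mathbf{S}_i \mathbf{y}, \qquad
\mathbf{H}_i = \mathbf{S}_i \mathbf{H}, \qquad
\mathbf{R}_i = \mathbf{S}_i \mathbf{R} \mathbf{S}_i^{\mathsf{T}}
\]

% Observation localization: within that domain, the inverse observation
% error covariance is tapered by a diagonal weight matrix W_i of size m_i x m_i.
\[
\tilde{\mathbf{R}}_i^{-1} = \mathbf{W}_i^{1/2} \mathbf{R}_i^{-1} \mathbf{W}_i^{1/2},
\qquad
\mathbf{W}_i = \operatorname{diag}\bigl( w(d_{i1}/c), \dots, w(d_{i m_i}/c) \bigr)
\]
```

Here d_ij is the distance between grid point i and the j-th local observation, and w is a decreasing function with w(0) = 1. In this form W_i is diagonal because each observation receives a single distance-dependent weight; for a diagonal R_i, the error variance of observation j is simply inflated by the factor 1 / w(d_ij / c), so distant observations influence the local analysis less.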

Citation: https://doi.org/10.5194/egusphere-2025-3960-RC1
- AC3: 'Reply on RC1', Ting Yang, 24 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-3960/egusphere-2025-3960-AC3-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-3960-AC3
RC2:
'Comment on egusphere-2025-3960', Anonymous Referee #2, 08 Dec 2025

Please find the attached report.

Citation: https://doi.org/10.5194/egusphere-2025-3960-RC2
- AC4: 'Reply on RC2', Ting Yang, 24 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-3960/egusphere-2025-3960-AC4-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-3960-AC4

Viewed

Total article views: 1,778 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,545	178	55	1,778	31	50

Views and downloads (calculated since 11 Sep 2025)

Month	HTML	PDF	XML	Total
Sep 2025	1,078	6	3	1,087
Oct 2025	155	39	16	210
Nov 2025	78	18	13	109
Dec 2025	80	39	12	131
Jan 2026	45	26	3	74
Feb 2026	40	19	0	59
Mar 2026	59	25	7	91
Apr 2026	10	6	1	17

Viewed (geographical distribution)

Total article views: 1,742 (including HTML, PDF, and XML), of which 1,742 have a defined geography and 0 are of unknown origin.

Latest update: 16 Apr 2026

Short summary

Chemical transport model-based data assimilation is computationally inefficient for large ensemble sizes and offers limited improvements in forecasting PM2.5 chemical components. This paper introduces a machine learning-based data assimilation system that facilitates rapid iterations for forecasting, assimilation, and incremental learning. Results show that our system achieves superior efficiency and accuracy in forecasting and assimilation compared to traditional data assimilation.

