This work is distributed under the Creative Commons Attribution 4.0 License.
Using Data Assimilation to Improve Data-Driven Models
Abstract. Data-driven models (DDMs) are developed by analysing extensive datasets to detect patterns and make predictions, without relying on predefined rules or instructions from humans. In fields like numerical weather prediction (NWP), DDMs are gaining popularity as potential replacements for traditional numerical models, thanks to their grounding in a multi-decadal, high-quality data assimilation (DA) analysis product. Recent studies, such as Lang et al. (2024), have demonstrated that DDMs trained on ERA5 (the fifth-generation European Centre for Medium-Range Weather Forecasts atmospheric reanalysis) can outperform traditional numerical models. DA integrates observations from various sources with numerical models to enhance the accuracy of model state estimates and of predictions or simulations of a system's behaviour. Given the complementary benefits of DDMs and DA, the integration of these methods has been gaining traction in a wide range of fields.
This paper focuses on the application of DA methodologies to enhance the precision and efficiency of DDM generation. The aim is to demonstrate the pivotal role that DA can play in refining and optimising DDM generation by incorporating diverse observation data directly, improving the accuracy and reliability of predictive models despite the presence of observational uncertainties. This study shows how DDMs can improve on forecasts from an imperfect model and, in conjunction with DA, can cyclically generate more accurate training data, further enhancing the precision of DDMs.
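The 3D-Var-style analysis update at the heart of the cycling described above can be illustrated with a minimal sketch. The covariances B and R, the observation operator H, and the toy state below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def var3d_analysis(xb, y, B, R, H):
    """One 3D-Var analysis via the equivalent gain form:
    xa = xb + K (y - H xb), with gain K = B H^T (H B H^T + R)^{-1}."""
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
    return xb + K @ (y - H @ xb)

# Toy example: 3-variable state, every component observed directly.
xb = np.array([1.0, 2.0, 3.0])   # background (model forecast)
y = np.array([1.5, 2.5, 3.5])    # observations
B = 0.5 * np.eye(3)              # background-error covariance (assumed)
R = 0.5 * np.eye(3)              # observation-error covariance (assumed)
H = np.eye(3)                    # observation operator

xa = var3d_analysis(xb, y, B, R, H)
# With B = R, the analysis is the midpoint of background and observations.
```

In a full 3D-Var implementation the same analysis is obtained by minimising a quadratic cost function rather than by forming the gain explicitly; the gain form shown here is the small-dimensional equivalent.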
Competing interests: At least one of the (co-)authors is a member of the editorial board of Nonlinear Processes in Geophysics.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: open (until 07 May 2025)
CC1: 'Comment on egusphere-2025-933', Zheqi Shen, 22 Mar 2025
The paper focuses on the application of data assimilation (DA) methods to improving the accuracy and efficiency of data-driven model (DDM) generation. It designs a cycle of “assimilating observational data in a numerical model—training the DDM—repeatedly assimilating observational data in the DDM” and claims that, by assimilating observational data, this cycle can produce a DDM closer to the true values than an imperfect model. The designed experiment uses a coupled Lorenz63 model (L63-cc) to generate the true values and employs the classic Lorenz63 model (L63) as the initial dynamical model. By repeatedly assimilating and retraining, a proxy model for L63-cc is obtained, verifying the accuracy of the approach.
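For reference, the classic L63 system used here as the initial (imperfect) dynamical model can be sketched as follows; the coupling that produces L63-cc is not specified in this discussion, so only the classic equations are shown, with the standard parameter values:

```python
import numpy as np

def lorenz63_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the classic Lorenz-63 equations."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, state, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Integrate a short trajectory; states remain on the bounded attractor.
state = np.array([1.0, 1.0, 1.0])
traj = [state]
for _ in range(1000):
    state = rk4_step(lorenz63_rhs, state, 0.01)
    traj.append(state)
traj = np.array(traj)
```

A truth run from a system like this, observed with added noise, is the usual way such twin experiments generate synthetic observations for the DA cycle.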
The idea of the paper is somewhat interesting, but the content is too sparse and the experiments are relatively simple. The conclusions drawn in the paper are not remarkable and are all predictable. The DA method (3D-Var) and the artificial-intelligence model (LSTM) used in the paper are both rather old-fashioned. The conclusions drawn from the toy model have limited reference value for reality. Overall, the paper reads more like a study report than a research paper. I recommend rejecting this paper. The authors need to use more advanced methods and models to validate the viewpoints of the paper.
Section 2.2 provides a detailed introduction to the three-dimensional variational method, which is common knowledge and unnecessary to introduce specifically.
Section 2.3 introduces the LSTM but does not explain why LSTM is chosen over other algorithms, nor does it provide details on the training loss function with formulas.
Section 2.4 should be the focus of the paper, but it is described too briefly. The process is all in the figure, and Figure 1 is also rather crude.
Regardless, the paper is too short, with too few experimental results to support the conclusions.
Citation: https://doi.org/10.5194/egusphere-2025-933-CC1
RC1: 'Comment on egusphere-2025-933', Marc Bocquet, 03 Apr 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-933/egusphere-2025-933-RC1-supplement.pdf
EC1: 'Comment on egusphere-2025-933', Jie Feng, 15 Apr 2025
I have carefully reviewed the comments from both reviewers. They noted that significant progress has been made in the field of AI-integrated data assimilation, and they raised concerns that the current manuscript may not provide sufficient novelty or contribution to warrant publication in NPG.
Additionally, the reviewers identified several weaknesses in the manuscript’s logic and structure. The manuscript lacks a thorough discussion of recent advancements and key references in the field. The analysis is currently too brief, and the results would benefit from a more detailed and rigorous examination.
Given these concerns, I regret to inform you that I must recommend rejecting the manuscript in its present form. However, I encourage the authors to thoroughly address the reviewers' feedback—particularly by expanding the literature review, strengthening the methodological rigor, and providing a more in-depth analysis of the results—before considering resubmission.
Citation: https://doi.org/10.5194/egusphere-2025-933-EC1
RC2: 'Comment on egusphere-2025-933', Anonymous Referee #2, 19 Apr 2025
This paper presents a methodology for iteratively improving data-driven models through data assimilation. The proposed algorithm is initialized by embedding a physical model into a cycled data assimilation framework to generate an analysis trajectory. This trajectory then serves as training data for learning the data-driven surrogate model. In subsequent iterations, the learned model is used to generate new analysis trajectories, which are in turn used to refine the accuracy of the surrogate model.
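The iteration just summarized can be sketched in miniature. The scalar toy dynamics, the nudging update standing in for 3D-Var, and the linear surrogate standing in for the LSTM below are all illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def truth_step(x):
    """Stand-in 'true' dynamics (a toy scalar map, not the paper's L63-cc)."""
    return 0.95 * x + 0.05

def imperfect_step(x):
    """Biased physical model used to initialize the iteration."""
    return 0.8 * x

def assimilate(forecast, obs, gain=0.5):
    """Scalar nudging update standing in for the 3D-Var analysis."""
    return forecast + gain * (obs - forecast)

# Synthetic truth trajectory and noisy observations of it.
T = 200
truth = np.empty(T)
truth[0] = 0.0
for t in range(1, T):
    truth[t] = truth_step(truth[t - 1])
obs = truth + 0.05 * rng.standard_normal(T)

model = imperfect_step
for _ in range(5):
    # 1) Cycle DA with the current forecast model to build an analysis trajectory.
    analysis = np.empty(T)
    analysis[0] = obs[0]
    for t in range(1, T):
        analysis[t] = assimilate(model(analysis[t - 1]), obs[t])
    # 2) Refit a surrogate x_{t+1} = a*x_t + b on the analysis
    #    (a linear stand-in for the paper's LSTM).
    A = np.vstack([analysis[:-1], np.ones(T - 1)]).T
    a, b = np.linalg.lstsq(A, analysis[1:], rcond=None)[0]
    # 3) The refitted surrogate becomes the forecast model for the next cycle.
    model = lambda x, a=a, b=b: a * x + b

# One-step forecast errors on the late (quasi-steady) part of the truth run.
err_initial = np.sqrt(np.mean((imperfect_step(truth[49:-1]) - truth[50:]) ** 2))
err_final = np.sqrt(np.mean((model(truth[49:-1]) - truth[50:]) ** 2))
```

In this toy setting the refitted surrogate ends up with a smaller one-step forecast error than the initial imperfect model; whether and how fast that happens in higher-dimensional settings is exactly the kind of question the comments below raise.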
The study has the potential to be a nice contribution to the literature, but major revisions are necessary before I can recommend it for publication. Most notably, the paper lacks a clear statement of its contributions, omits details about the proposed algorithm and experimental setup, and presents numerical results that do not sufficiently illuminate the properties of the iterative scheme. Perhaps most concerning is the apparent overlap with the earlier work of Brajard (2019), which follows a similarly structured framework for recovering data-driven models. A more careful and thorough discussion is required to establish the novelty of the proposed approach. Below, I provide a list of general and specific comments for the authors to consider during revision.
References: The introduction should offer a broader overview of recent advances in data-driven modeling, including a discussion of prominent architectures such as graph neural networks, neural operators, and vision transformers. It should also more fully address the intersection of machine learning and data assimilation. At present, the authors primarily cite the works of Brajard (2019) [note the correct publication year is 2020] and Amemiya (2023), while overlooking other key contributions such as Bocquet et al. (2020) [doi:10.3934/fods.2020004] and Farchi et al. (2021) [https://doi.org/10.1016/j.jocs.2021.101468], which also focus on correcting imperfect models using partial and noisy observations.
Lines 34–39: Please revise this section for improved clarity.
Lines 44–48: The analogy with Brajard (2019) should be discussed in greater depth, particularly given the methodological similarity with the iterative scheme described on Lines 34–39. Clarifying what is meant by “adding an imperfect model” would be helpful in this context.
Lines 121–131 (LSTM description): This section needs more mathematical rigor, as the current presentation is largely qualitative. The level of detail should be comparable to the exposition of 3D-Var provided earlier.
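For concreteness, the level of rigor requested would mean stating the standard LSTM cell equations explicitly. The textbook formulation is the following (the manuscript's exact variant may differ):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
```

where $\sigma$ is the logistic sigmoid, $\odot$ is the elementwise product, and $f_t$, $i_t$, $o_t$ are the forget, input, and output gates acting on the cell state $c_t$ and hidden state $h_t$.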
Line 143: Please clarify whether the "imperfect model" referenced here is a physical model or an LSTM. Based on later sections, it appears to be the physical model, but this should be stated explicitly to prevent confusion.
Line 163: How many past states are used as inputs to the LSTM? Was any sensitivity analysis performed on this hyperparameter?
Line 165: Are the same observations assimilated during each iteration of the loop?
Lines 166–167: Could the proposed iterative algorithm fail to converge under certain conditions? It would strengthen the paper to empirically validate convergence across a range of experimental settings.
Lines 178–179: The RMSE appears to stabilize after about 10 iterations.
Line 181: The meaning of “1dt” and “4dt” is unclear. Do these refer to the length of the input sequence passed to the LSTM, or the prediction horizon? Please clarify.
Lines 231–237: It would be helpful to include a more concrete example of how this methodology could be used to improve complex DDMs trained on reanalysis datasets such as ERA5. Would this involve a second training phase where DDMs are used to cycle through observations, thus generating refined reanalysis trajectories which further improve the model? Additionally, how might the choice of data assimilation algorithms influence the convergence properties of the iterative procedure?
Citation: https://doi.org/10.5194/egusphere-2025-933-RC2
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
152 | 49 | 10 | 211 | 7 | 5 |
Viewed (geographical distribution)
Country | # | Views | % |
---|---|---|---|
United States of America | 1 | 71 | 31 |
China | 2 | 49 | 21 |
France | 3 | 23 | 10 |
Japan | 4 | 15 | 6 |
Germany | 5 | 7 | 3 |