the Creative Commons Attribution 4.0 License.
Improving dynamical climate predictions with machine learning: insights from a twin experiment framework
Abstract. Systematic errors in dynamical climate models remain a significant challenge to accurate climate predictions, particularly when modeling the nonlinear coupling between the atmosphere and oceans. Despite notable advances in dynamical climate modeling that have improved our understanding of climate variability, these systematic errors can still degrade predictive skills. In this study, we adopt a twin experiment framework with a reduced-order coupled atmosphere-ocean model to explore the utility of machine learning in mitigating these errors. Specifically, we train a data-driven model on data assimilation increments to learn and emulate the underlying dynamical model error, which is then integrated with the dynamical model to form a hybrid system. Comparison experiments show that the hybrid model consistently outperforms the standalone dynamical model in predicting atmospheric and oceanic variables. Further investigation using hybrid models that correct only atmospheric or only oceanic errors reveals that atmospheric corrections are essential for improving short-term forecasts, while concurrently addressing both atmospheric and oceanic errors yields superior performance in long-term climate prediction.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-212', Anonymous Referee #1, 14 Apr 2025
By using a reduced-order coupled atmosphere-ocean model within a twin experiment framework, the authors demonstrate the capability of machine learning to correct errors in various components of the coupled model on different time scales, thereby enhancing the prediction skill of the model. I found the results interesting and publishable and recommend a minor revision that addresses my comments below.
Specific Comments:
Line 22: "estimate best the state" should be "estimate the best state".
Line 127: Section 2.4. Experimental settings, "experimental settings" sounds like the setting itself is experimental, "experiment settings" sounds more like setting up an experiment.
Citation: https://doi.org/10.5194/egusphere-2025-212-RC1
RC2: 'Comment on egusphere-2025-212', Alban Farchi, 16 Apr 2025
In this manuscript the authors use hybrid modelling to correct model error in a low-order coupled ocean-atmosphere model, namely MAOOAM. Model error is introduced by using a spectral truncation of a reference version of the model. Then, a neural network is trained to correct the 27h forecast errors by learning the analysis increments obtained from an EnKF analysis. Finally, the hybrid model is evaluated in ensemble forecast experiments.
Overall, the manuscript reads well, but I have the feeling that some key information and explanations about the experiments are missing (see general points).
Nevertheless I think that after substantial revision it could make a valuable contribution to the literature.
General comments
----------------
1) At first glance, I had difficulty grasping the objectives of this work, and to what extent it differs from previous work on hybrid modelling (in particular Brajard et al 2021). Therefore, I think that it would make sense to clarify a bit the objectives of this work (in the introduction) and in particular what it means for you to build a model error correction for climate prediction and how it differs from eg NWP or dynamical systems in general.
2) It is absolutely necessary to give typical time scales of the MAOOAM model for each component (atmosphere and ocean), for example by providing the doubling time of errors. Without this information, it is really hard to get an idea of the time evolution in the model and for example to know whether the 50 days and 60 years lead time used in some experiments are long or not.
3) I find the use of the significance test really difficult to understand. In particular, I don't understand what it means for a correlation or for a skill score "to be significant"? For example, what would be the difference between a significant 50% correlation and a non-significant 50% correlation? Perhaps it would be clearer if you explicitly indicated what are the null and tested hypotheses in your Student's t-test. More generally, I have the feeling that you would get a similar information content by computing a confidence interval over the 30 test runs, with the added benefit that it would be much easier to understand and hence to explain.
4) The caption of figure 3 mentions "The atmospheric variables are calculated based on daily data, while the oceanic variables are based on annual average data". Does this mean that the truth and the prediction are averaged (on a daily or annual basis) before computing the RMSE or correlation? To me, this is a really important point, because it means that you evaluate the models in their ability to reproduce daily or annual means, and not their ability to predict trajectories. This is fine, especially in a context of "climate predictions" but should absolutely be discussed in the main text. In my opinion, this raises additional questions, like for example why only the mean, why not higher-order moments or more complex statistical properties? Also, it is not entirely clear to me how fast the daily / annual means evolve in this model, and hence whether the averaging window (one day or one year) makes sense or not.
5) The scientific question raised in section 3.2 is interesting, but I am not entirely sure that the experimental setup chosen in this manuscript is entirely relevant to answer that question, because there is by construction no model error on the oceanic component (which means that the only source of error in the ocean comes from the interaction with the atmosphere). Could you discuss this point?
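As a rough illustration of the suggestion in general comment 3, the sketch below computes a confidence interval for a mean skill score over 30 test runs. It uses a percentile bootstrap, which is only one possible choice and not necessarily what the referee or the authors have in mind. The scores are synthetic placeholders, not values from the manuscript, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_ci(scores, level=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the lower and upper percentiles of the resampled means.
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic stand-in for one skill score evaluated on 30 test runs
# (assumed values for illustration only).
data_rng = random.Random(42)
scores = [0.5 + data_rng.gauss(0.0, 0.1) for _ in range(30)]

mean_score = statistics.fmean(scores)
low, high = bootstrap_ci(scores)
print(f"mean = {mean_score:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Reported this way, a skill score of 50% comes with an explicit range, so a reader can see directly whether the hybrid and dynamical models overlap at a given lead time without needing to interpret a null hypothesis.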
Specific and technical comments
-------------------------------
- Introduction: I have the impression that you move back and forth between dynamical models in general and climate models in particular, which makes the text sometimes a bit harder to follow.
- L 22 "estimate best the state of the climate system" there is a typo in this sentence. Furthermore, I would rather use "initial condition" here than "state" (to be consistent with the start of the sentence).
- L 39-40 "In these works, the hybrid model is tested in an idealized setting in which initial conditions are perfectly known" I guess that here you are referring to the fact that the forecast skill of the hybrid models is evaluated using a dataset with perfect initial conditions, but the current formulation is not entirely clear to me.
- L 41-43 "To our knowledge, the performance of hybrid models under imperfect initial conditions—particularly when using an ensemble of forecasts—has not been thoroughly assessed." Actually, starting the forecast from an analysis and not necessarily perfect initial conditions has already been done, eg in Farchi et al, 2023 (doi 10.1029/2022MS003474).
- L 47-48: "Moreover, observation for training, validation, and testing is relatively limited." I don't fully agree with this. For specific components of the Earth system (eg the atmosphere), there are a lot of observations available.
- L 71-75 Why is the relationship between resolution and number of modes different in the atmosphere and in the ocean?
- L 77-79 "One of the key features of MAOOAM is its ability to modify the number of atmospheric and oceanic model variables simply by adjusting the model’s resolution in the x-direction or y-direction." Actually, any model with a spatial extent possesses this feature, right?
- Figure 1 is difficult to interpret. Beyond the fact that the attractors look different, I don't get much out of it. I would advise first to use the same axes extent for both panels, to use label fonts consistent with the main text, and to use colour or transparency to show the density. Beyond this advice, is the 3D aspect really fundamental here? Otherwise, I would suggest showing a 2D figure (potentially with more panels if needed), which would be easier to interpret.
- L 86 "showing they evolve differently" If I am not mistaken there is a typo in this part of the sentence.
- L 89 "specifically 10 mode less" -> "specifically 10 modes less"
- L 90 "The atmospheric error could then propagate" by using "could" instead of something more affirmative, you imply that in certain cases the error can be limited to the atmospheric component?
- L 98-99 "In this study, we utilize the DAPPER package (Raanes, 2018) for conducting all experiments, as described in section 2.4 and depicted in Fig. 2." You should reformulate this sentence, because as is it sounds like section 2.4 and figure 2 will be about the use of DAPPER.
- L 100-101 "This method reducing the amount of experimentation required in tuning the EnKF DA system, thereby enhancing the performance of the assimilation experiments" I would reformulate this sentence, as one could understand that reducing the number of experiments enhances performance, which is obviously not the case.
- L 111-112 "the training of ANN is using the analysis increments produced by the EnKF (Gregory et al., 2024)" I would suggest to cite Farchi et al 2021 (doi 10.1002/qj.4116) and Brajard et al 2021 (which you already cited elsewhere) before Gregory et al 2024 here.
- L 126 Is the NN correction applied in spectral space or in grid-point space?
- L 135 "true state \sigma^{\mathrm{hf}}" -> "true state \mathbf{x}^{\mathrm{t}}" to be consistent with the notation introduced L 110.
- L 132-136 You forgot to specify what is the observation operator (H=I I imagine) and whether it is applied in spectral space or in grid point space?
- L 137 "and generate a reanalysis" Why a reanalysis here? Why not simply an analysis?
- L 150 "without incorporating validation data to adjust the ANN model during training" If you don't have any validation set, how can you be sure that the training process has converged?
- Section 2.5: Usually, in the RMSE the "mean" is computed over the different variables that constitute one state, in such a way that there is only one RMSE for a prediction. However, as far as I understand, in your case the mean in the RMSE is computed over the 30 test runs, right? In such a way that you get one RMSE value per variable. This should be explicitly mentioned.
- L 199: phi should be psi, right?
- Figures 3 and 4 (and the other ones as well): I am surprised by the "noisiness" of the results, because with low-order models the scores usually look much smoother. Could this be coming from the fact that the test set contains only 30 runs? Would it be affordable to increase this number?
- Figures 9 and 10: are the correlation and RMSE skill score computed again on daily / annual averages or on instantaneous values?
- There are in section 3 a lot of (large) figures, which makes the text sometimes difficult to read (eg when referring to a figure which is located... in the following pages). I know that this is a draft and the one-column draft mode isn't really helping, but when moving to the edition process, I would recommend to be really careful to mitigate as much as possible the inconvenience.
Citation: https://doi.org/10.5194/egusphere-2025-212-RC2