Improving dynamical climate predictions with machine learning: insights from a twin experiment framework

He, Zikang; Brajard, Julien; Wang, Yiguo; Wang, Xidong; Shen, Zheqi

doi:10.5194/egusphere-2025-212

Preprints

https://doi.org/10.5194/egusphere-2025-212

17 Feb 2025

Abstract. Systematic errors in dynamical climate models remain a significant challenge to accurate climate predictions, particularly when modeling the nonlinear coupling between the atmosphere and ocean. Despite notable advances in dynamical climate modeling that have improved our understanding of climate variability, these systematic errors can still degrade predictive skill. In this study, we adopt a twin experiment framework with a reduced-order coupled atmosphere-ocean model to explore the utility of machine learning in mitigating these errors. Specifically, we train a data-driven model on data assimilation increments to learn and emulate the underlying dynamical model error, which is then integrated with the dynamical model to form a hybrid system. Comparison experiments show that the hybrid model consistently outperforms the standalone dynamical model in predicting atmospheric and oceanic variables. Further investigation using hybrid models that correct only atmospheric or only oceanic errors reveals that atmospheric corrections are essential for improving short-term forecasts, while concurrently addressing both atmospheric and oceanic errors yields superior performance in long-term climate prediction.
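
The hybrid approach summarized in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the one-variable toy dynamics (`truth_step`, `imperfect_step`), the use of the truth as a stand-in for the analysis, and a least-squares polynomial fit in place of the neural network. It is not the MAOOAM/EnKF setup used in the preprint.

```python
import numpy as np

def truth_step(x):
    # Hypothetical "true" dynamics: damped linear map plus a weak quadratic term.
    return 0.9 * x + 0.05 * x**2

def imperfect_step(x):
    # Hypothetical imperfect model: the quadratic term is missing (the model error).
    return 0.9 * x

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=200)

# Stand-in for analysis increments (analysis minus forecast). The "analysis" is
# taken to be the truth here, an idealisation; real increments are noisy.
increments = truth_step(states) - imperfect_step(states)

# Fit a correction c(x) = a*x + b*x^2 by least squares.
# This simple regression stands in for the data-driven (neural network) model.
features = np.column_stack([states, states**2])
coeffs, *_ = np.linalg.lstsq(features, increments, rcond=None)

def hybrid_step(x):
    # Hybrid model: dynamical model step plus the learned error correction.
    return imperfect_step(x) + coeffs[0] * x + coeffs[1] * x**2

x0 = 0.5
print("truth    :", truth_step(x0))
print("imperfect:", imperfect_step(x0))
print("hybrid   :", hybrid_step(x0))
```

In this idealised setting the fitted correction recovers the missing term, so the hybrid step matches the truth step; with noisy increments and a chaotic model the correction would only be approximate.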

Received: 17 Jan 2025 – Discussion started: 17 Feb 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Journal article(s) based on this preprint

20 Oct 2025

Improving dynamical climate predictions with machine learning: insights from a twin experiment framework

Zikang He, Julien Brajard, Yiguo Wang, Xidong Wang, and Zheqi Shen

Nonlin. Processes Geophys., 32, 397–409, https://doi.org/10.5194/npg-32-397-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-212', Anonymous Referee #1, 14 Apr 2025

By using a reduced-order coupled atmosphere-ocean model within a twin experiment framework, the authors demonstrate the capability of machine learning to correct errors in various components of the coupled model on different time scales, thereby enhancing the prediction skill of the model. I found the results interesting and publishable and recommend a minor revision that addresses my comments below.
Specific Comments:
Line 22: "estimate best the state" should be "estimate the best state".

Line 127: Section 2.4, "Experimental settings": "experimental settings" sounds like the setting itself is experimental; "experiment settings" sounds more like setting up an experiment.

Citation: https://doi.org/10.5194/egusphere-2025-212-RC1
- AC1: 'Reply on RC1', Zikang He, 16 Jul 2025
  
  Dear Reviewer,
  Thank you for your constructive comments and valuable feedback on our manuscript. Please find our point-by-point responses in the supplement. We sincerely appreciate your time and effort in reviewing our work.
  Best regards,
  
  Zikang He
  
  On behalf of all authors
  
  Citation: https://doi.org/10.5194/egusphere-2025-212-AC1
RC2:
'Comment on egusphere-2025-212', Alban Farchi, 16 Apr 2025

In this manuscript the authors use hybrid modelling to correct model error in a low-order coupled ocean-atmosphere model, namely MAOOAM. Model error is introduced by using a spectral truncation of a reference version of the model. Then, a neural network is trained to correct the 27h forecast errors by learning the analysis increments obtained from an EnKF analysis. Finally, the hybrid model is evaluated in ensemble forecast experiments.
Overall, the manuscript reads well, but I have the feeling that some key information and explanations about the experiments are missing (see general comments).

Nevertheless I think that after substantial revision it could make a valuable contribution to the literature.
General comments

----------------
1) At first glance, I had difficulty grasping the objectives of this work and the extent to which it differs from previous work on hybrid modelling (in particular Brajard et al., 2021). I therefore think it would make sense to clarify the objectives of this work in the introduction, and in particular what it means for you to build a model error correction for climate prediction and how this differs from, e.g., NWP or dynamical systems in general.
2) It is absolutely necessary to give typical time scales of the MAOOAM model for each component (atmosphere and ocean), for example by providing the doubling time of errors. Without this information, it is really hard to get an idea of the time evolution in the model and, for example, to know whether the 50-day and 60-year lead times used in some experiments are long or not.
3) I find the use of the significance test really difficult to understand. In particular, I don't understand what it means for a correlation or for a skill score "to be significant". For example, what would be the difference between a significant 50% correlation and a non-significant 50% correlation? Perhaps it would be clearer if you explicitly stated the null and tested hypotheses of your Student's t-test. More generally, I have the feeling that you would get similar information content by computing a confidence interval over the 30 test runs, with the added benefit that it would be much easier to understand and hence to explain.
4) The caption of figure 3 mentions "The atmospheric variables are calculated based on daily data, while the oceanic variables are based on annual average data". Does this mean that the truth and the prediction are averaged (on a daily or annual basis) before computing the RMSE or correlation? To me, this is a really important point, because it means that you evaluate the models on their ability to reproduce daily or annual means, and not on their ability to predict trajectories. This is fine, especially in a context of "climate predictions", but it should absolutely be discussed in the main text. In my opinion, this raises additional questions: for example, why only the mean, and why not higher-order moments or more complex statistical properties? Also, it is not entirely clear to me how fast the daily / annual means evolve in this model, and hence whether the averaging window (one day or one year) makes sense or not.
5) The scientific question raised in section 3.2 is interesting, but I am not entirely sure that the experimental setup chosen in this manuscript is entirely relevant to answer that question, because there is by construction no model error on the oceanic component (which means that the only source of error in the ocean comes from the interaction with the atmosphere). Could you discuss this point?
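
As an editorial illustration of the confidence interval suggested in general comment 3 (not part of the referee's text), the sketch below computes a 95% interval for the mean of a score over 30 runs. The scores are synthetic placeholders, the function name is hypothetical, and the critical value 2.045 is the two-sided Student's t value for 29 degrees of freedom; the interval also assumes roughly independent, normally distributed scores.

```python
import numpy as np

def mean_confidence_interval(scores, t_crit=2.045):
    """95% confidence interval for the mean of `scores`.

    t_crit=2.045 is the two-sided Student's t critical value for
    29 degrees of freedom, i.e. it is only valid for 30 runs.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    # Standard error of the mean, using the unbiased sample standard deviation.
    half_width = t_crit * scores.std(ddof=1) / np.sqrt(n)
    return mean - half_width, mean + half_width

rng = np.random.default_rng(1)
# Synthetic placeholder: one skill score per test run (30 runs).
skill = rng.normal(loc=0.5, scale=0.1, size=30)

low, high = mean_confidence_interval(skill)
print(f"mean skill = {skill.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero conveys the same message as a significance test on the mean score, while also showing how uncertain the estimate is.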
Specific and technical comments

-------------------------------
- Introduction: I have the impression that you move back and forth between dynamical models in general and climate models in particular, which makes the text sometimes a bit harder to follow.

- L 22 "estimate best the state of the climate system" there is a typo in this sentence. Furthermore, I would rather use "initial condition" here than "state" (to be consistent with the start of the sentence).

- L 39-40 "In these works, the hybrid model is tested in an idealized setting in which initial conditions are perfectly known" I guess that here you are referring to the fact that the forecast skill of the hybrid models is evaluated using a dataset with perfect initial conditions, but the current formulation is not entirely clear to me.

- L 41-43 "To our knowledge, the performance of hybrid models under imperfect initial conditions—particularly when using an ensemble of forecasts—has not been thoroughly assessed." Actually, starting the forecast from an analysis and not necessarily perfect initial conditions has already been done, eg in Farchi et al, 2023 (doi 10.1029/2022MS003474).

- L 47-48: "Moreover, observation for training, validation, and testing is relatively limited." I don't fully agree with this. For specific components of the Earth system (eg the atmosphere), there are a lot of observations available.

- L 71-75 Why is the relationship between resolution and number of modes different in the atmosphere and in the ocean?

- L 77-79 "One of the key features of MAOOAM is its ability to modify the number of atmospheric and oceanic model variables simply by adjusting the model’s resolution in the x-direction or y-direction." Actually, any model with a spatial extent possesses this feature, right?

- Figure 1 is difficult to interpret. Beyond the fact that the attractors look different, I don't get much out of it. I would advise first using the same axis extents for both panels, using label fonts consistent with the main text, and using colour or transparency to show the density. Beyond this advice, is the 3D aspect really fundamental here? If not, I would suggest showing a 2D figure (potentially with more panels if needed), which would be easier to interpret.

- L 86 "showing they evolve differently" If I am not mistaken there is a typo in this part of the sentence.

- L 89 "specifically 10 mode less" -> "specifically 10 modes less"

- L 90 "The atmospheric error could then propagate" by using "could" instead of something more affirmative, you imply that in certain cases the error can be limited to the atmospheric component?

- L 98-99 "In this study, we utilize the DAPPER package (Raanes, 2018) for conducting all experiments, as described in section 2.4 and depicted in Fig. 2." You should reformulate this sentence, because as it stands it sounds like section 2.4 and figure 2 will be about the use of DAPPER.

- L 100-101 "This method reducing the amount of experimentation required in tuning the EnKF DA system, thereby enhancing the performance of the assimilation experiments" I would reformulate this sentence, as one could understand that reducing the number of experiments enhances performance, which is obviously not the case.

- L 111-112 "the training of ANN is using the analysis increments produced by the EnKF (Gregory et al., 2024)" I would suggest to cite Farchi et al 2021 (doi 10.1002/qj.4116) and Brajard et al 2021 (which you already cited elsewhere) before Gregory et al 2024 here.

- L 126 Is the NN correction applied in spectral space or in grid-point space?

- L 135 "true state \sigma^{\mathrm{hf}}" -> "true state \mathbf{x}^{\mathrm{t}}" to be consistent with the notation introduced L 110.

- L 132-136 You forgot to specify what is the observation operator (H=I I imagine) and whether it is applied in spectral space or in grid point space?

- L 137 "and generate a reanalysis" Why a reanalysis here? Why not simply an analysis?

- L 150 "without incorporating validation data to adjust the ANN model during training" If you don't have any validation set, how can you be sure that the training process has converged?

- Section 2.5: Usually, in the RMSE the "mean" is computed over the different variables that constitute one state, in such a way that there is only one RMSE for a prediction. However, as far as I understand, in your case the mean in the RMSE is computed over the 30 test runs, right? In that case you get one RMSE value per variable. This should be explicitly mentioned.

- L 199: phi should be psi, right?

- Figures 3 and 4 (and the other ones as well): I am surprised by the "noisiness" of the results, because with low-order models the scores usually look much smoother. Could this be coming from the fact that the test set contains only 30 runs? Would it be affordable to increase this number?

- Figures 9 and 10: are the correlation and RMSE skill score computed again on daily / annual averages or on instantaneous values?

- There are a lot of (large) figures in section 3, which sometimes makes the text difficult to read (e.g. when referring to a figure located several pages later). I know that this is a draft and that the one-column draft mode isn't really helping, but when moving to the editing process, I would recommend taking great care to mitigate this inconvenience as much as possible.
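
As an editorial illustration of the distinction raised in the Section 2.5 comment above (not part of the referee's text), the sketch below contrasts an RMSE averaged over test runs, which gives one value per variable, with an RMSE averaged over variables, which gives one value per run. The array shapes, function names, and data are hypothetical.

```python
import numpy as np

def rmse_over_runs(pred, truth):
    # Mean over axis 0 (the test runs): one RMSE value per variable.
    return np.sqrt(np.mean((pred - truth) ** 2, axis=0))

def rmse_over_variables(pred, truth):
    # Mean over axis 1 (the state variables): one RMSE value per run.
    return np.sqrt(np.mean((pred - truth) ** 2, axis=1))

n_runs, n_vars = 30, 4
truth = np.zeros((n_runs, n_vars))
# Hypothetical predictions: variable j has a constant error of j + 1 in every run.
pred = truth + np.arange(1, n_vars + 1)

per_variable = rmse_over_runs(pred, truth)    # shape (4,)
per_run = rmse_over_variables(pred, truth)    # shape (30,)
print(per_variable.shape, per_run.shape)
```

Both conventions use the same formula; only the averaging axis differs, which is why the choice needs to be stated explicitly.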

Citation: https://doi.org/10.5194/egusphere-2025-212-RC2
- AC2: 'Reply on RC2', Zikang He, 16 Jul 2025
  
  Dear Reviewer,
  
  Thank you for your constructive comments and valuable feedback on our manuscript. Please find our point-by-point responses in the supplement. We sincerely appreciate your time and effort in reviewing our work.
  
  Best regards,
  
  Zikang He
  
  On behalf of all authors
  
  Citation: https://doi.org/10.5194/egusphere-2025-212-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Zikang He on behalf of the Authors (16 Jul 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (26 Jul 2025) by Josef Ludescher

RR by Alban Farchi (14 Aug 2025)

ED: Publish subject to minor revisions (review by editor) (24 Aug 2025) by Josef Ludescher

AR by Zikang He on behalf of the Authors (26 Aug 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (31 Aug 2025) by Josef Ludescher

AR by Zikang He on behalf of the Authors (04 Sep 2025) Manuscript

Viewed

Total article views: 2,998 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
2,366	528	104	2,998	87	128

Views and downloads (calculated since 17 Feb 2025)

Month	HTML	PDF	XML	Total
Feb 2025	92	24	8	124
Mar 2025	102	22	0	124
Apr 2025	114	22	6	142
May 2025	96	30	6	132
Jun 2025	94	38	6	138
Jul 2025	60	24	10	94
Aug 2025	242	6	6	254
Sep 2025	1,104	28	18	1,150
Oct 2025	74	30	4	108
Nov 2025	56	24	8	88
Dec 2025	52	60	6	118
Jan 2026	50	48	8	106
Feb 2026	82	64	2	148
Mar 2026	84	64	4	152
Apr 2026	30	15	4	49
May 2026	23	17	2	42
Jun 2026	4	5	1	10
Jul 2026	7	7	5	19

Viewed (geographical distribution)

Total article views: 2,991 (including HTML, PDF, and XML). Thereof 2,991 with geography defined and 0 with unknown origin.

Latest update: 30 Jul 2026

Short summary

Climate prediction is challenging due to systematic errors in traditional climate models. We addressed this by training a machine learning model to correct these errors and then integrating it with the traditional climate model to form an AI-physics hybrid model. Our study demonstrates that the hybrid model outperforms the original climate model on both short-term and long-term predictions of the atmosphere and ocean.

