the Creative Commons Attribution 4.0 License.
Relaxation experiments in ML-based weather prediction models to study subseasonal predictability
Abstract. This study explores the use of relaxation experiments in machine learning-based weather prediction (MLWP) models to identify sources of subseasonal predictability, in comparison to a traditional numerical weather prediction (NWP) system. Relaxation involves nudging specific regions of a model toward reanalysis data to isolate their influence on forecast skill. We apply this technique to two MLWP models, Pangu-Weather (fully data-driven) and NeuralGCM (hybrid), and compare the experiments to the Unified Forecast System (UFS). The focus is on week 3–4 forecasts of two major precipitation events in western North America in winter 2022/2023, both linked to Madden-Julian Oscillation (MJO) activity. For the two cases, the MLWP models exhibit higher forecast skill than the UFS at subseasonal lead times. Although tropical relaxation improves the skill in all forecast systems, the gains are greater for the UFS, reflecting the MLWP models' stronger baseline performance. A Rossby wave source (RWS) analysis shows that tropical relaxation consistently improves the large-scale dynamical processes associated with the tropical-extratropical teleconnections leading to both events. These results highlight the potential of relaxation experiments as a low-cost, effective diagnostic for understanding and improving subseasonal forecasts, especially in emerging MLWP systems.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-35', Yannick Peings, 17 Feb 2026
- RC2: 'Comment on egusphere-2026-35', Anonymous Referee #2, 26 Feb 2026
The study discusses how constraining the atmospheric state in the tropics improves sub-seasonal forecasts of two extreme precipitation events in western North America in the winter of 2022/2023 in two machine-learning weather prediction (MLWP) models. It complements an earlier study by some of the authors, where an equivalent relaxation experiment was performed in a physics-based weather prediction model. It is found that the response to the tropical constraint in the MLWP models is similar to that in the physics-based model, although somewhat weaker owing to a stronger baseline performance. An analysis of Rossby wave sources indicates that the MLWP models simulate the tropical-extratropical teleconnections contributing to the extreme events in a physically consistent way. The authors emphasise the general point that such relaxation experiments are a useful diagnostic tool to understand and improve sub-seasonal predictions.
The paper is a useful contribution to the field of MLWP model evaluation and merits publication. Since it discusses two very specific case studies, it lacks the generality that the title suggests, but on the other hand there is value in a detailed assessment of how MLWP forecasts represent tropical-extratropical teleconnections for specific mid-latitude extreme precipitation events. The presentation is mostly clear and concise, but I would recommend some clarifications and revisions, as well as adding some further analysis and discussion - see the comments below.
- Would it be worth investigating the reasons for a different importance of tropical forcing between the two cases a bit more, e.g. by looking into origin and propagation of forecast errors over time? The RWS diagnostic only works for tropical sources, but it would be good to quantify mid- and high-latitude contributions.
- The title is too general for what is being presented. I suggest starting from the title of the reference study by Moore et al. ("Impacts of tropical forecast errors on weeks 3–4 extreme precipitation predictions over California during winter 2022–23") and modifying it to reflect the new aspect of relaxing MLWP models.
- l. 7: the fact that only tropical relaxation is considered should be mentioned earlier than this, potentially first sentence of the abstract
- l. 73: "6-hour forecast increment" - I suspect you are referring to the output available, not to the model time step (which would be hard to believe). Please clarify.
- l. 85: The fact that NeuralGCM uses "perfect" SST prescribed from ERA5 strikes me as important. It means that the NeuralGCM setup could not issue real-time forecasts, and it should have an unfair advantage over the other models considered. What is your view, maybe you can add some discussion or analysis on this?
- ll. 89-90 ("..., Pangu-Weather is ..."): Please specify which Pangu model you are actually using for your inference relaxation study - is it the one with a 24h time step?
- ll. 90ff. ("Overall the model is..."): I don't understand this, please rephrase. With "auto-regressive during training" I assume you are referring to rolling out for more than one model time step during training and minimizing the loss computed from the rolled-out errors. This is indeed more costly during training, but has no impact on inference cost. The real reason that Pangu is cheapest among the models you are considering is probably that it does not need to run a GCM dynamical core (expensive; both UFS and NeuralGCM have one).
- l. 96: Why did you choose to compute the climatology over this long period? Can you please check whether substantial trends are present for the variables you are considering? If this is the case, there is the risk that anomaly correlations presented are inappropriately dominated by these trends.
- Table 1: The "Replay to ERA5" experiment is never used in the manuscript. Why? Please either remove the reference to this experiment, or use it when discussing the results.
- ll. 115ff.: Did you test the sensitivity to the width of the tapering region, or to the functional shape of the transition? If yes, a comment on that would be helpful.
- ll. 119f. ("Relaxed region is corrected by 100% at each time step"): You say on line 108 that the relaxation "gently steers the model state", which is inconsistent with a 100% replacement of the model forecast by the reference. Maybe worth clarifying this on line 108 and elsewhere.
- ll. 127 - 131: Can you please explain the motivation or justification for relaxing different variables for different models? One might argue that this makes the experiments less comparable.
- ll. 148-151: Please elaborate how you compute and interpret anomaly correlation, it is left a bit vague. As I understand it, this is the pattern correlation between the verification anomaly in Fig. 1 and the forecast anomalies in Figs. 2 & 4, correct?
- l. 172: I would not call this a forecast bust. Larger errors are expected for any extreme event occurring in the observations when forecasts have modest levels of skill and tend to predict climatology.
- Figure 1: The green lines are really hard to see - is it worth making separate panels?
- l. 181: Please define how the water vapour flux is computed (I assume you are showing the magnitude of the vector quantity). Also please add some discussion on why you do not use precipitation directly and whether this constitutes a caveat of the study. It might be worth citing Lavers et al., Weather and Forecasting (2017), https://doi.org/10.1175/WAF-D-17-0073.1 in this context.
- Figure 2 caption: Looks to me like the bold green line is at 40 not at 20.
- Figure 3: this is extremely hard to see, even on a very large screen. Please revise to have fewer panels or less details in each.
- l. 235 ("This similarity suggests..."): OK but what does this mean? That only sources in the deep tropics matter?
- l. 257: Could this also be because NeuralGCM sees the observed SST (see one of my earlier comments)?
- l. 269: This is a trivial result - any relaxation of ensembles towards a common reference state will reduce the ensemble spread
- Figure 6: Seeing Z500 anomaly correlations of close to 1 for almost every single ensemble members at 3-4 weeks lead time makes me wonder whether the ACC you compute is a discerning enough metric. Can you please show the same plot for MAE? A reader could conclude from the extremely high ACC for most ensemble members that there is near-perfect deterministic forecast skill in the sub-seasonal range for these events, which I would be sceptical about even with strong impact from tropical sources of predictability.
- l. 276: Can you discuss a bit more why there is less impact for the February event? Given the flow pattern, a stronger impact of mid- or high-latitude dynamics is plausible (see also the Moore et al. study)
- l. 277 (we did not investigate precip directly): I think you need to be upfront about this caveat and discuss it in the methods section
- l. 304 ("reduced tropical influence on the event"): As mentioned before, it would be good to have some further analysis on this.
Citation: https://doi.org/10.5194/egusphere-2026-35-RC2
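As a concrete illustration of the relaxation weighting discussed in the comments above (a tropical band corrected by 100% at each step, with a tapered transition at its edges), here is a minimal Python sketch. The band edges, taper width, and cosine-squared transition are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def relaxation_weight(lat, edge_south=-30.0, edge_north=30.0, taper=10.0):
    """Relaxation weight: 1 inside the tropical band, tapering to 0 over
    `taper` degrees of latitude outside it (cosine-squared transition).
    Band edges and taper width are illustrative, not the paper's values."""
    w = np.zeros_like(lat, dtype=float)
    w[(lat >= edge_south) & (lat <= edge_north)] = 1.0
    s = (lat > edge_south - taper) & (lat < edge_south)   # southern taper
    w[s] = np.cos(0.5 * np.pi * (edge_south - lat[s]) / taper) ** 2
    n = (lat > edge_north) & (lat < edge_north + taper)   # northern taper
    w[n] = np.cos(0.5 * np.pi * (lat[n] - edge_north) / taper) ** 2
    return w

def nudge(model_state, reference, weight, strength=1.0):
    """Blend the model state toward the reference. strength=1.0 is full
    replacement inside the band (the '100% correction' case the referee
    flags); strength < 1 would be the 'gentle steering' reading."""
    w = strength * weight
    return (1.0 - w) * model_state + w * reference
```

With `strength=1.0` the model field is fully overwritten inside the band, which is exactly the inconsistency with "gently steers" that the comment on ll. 119f. points out.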
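For reference, a centred, area-weighted pattern anomaly correlation, as the comment on ll. 148-151 interprets it, could be computed as follows. Whether the paper uses a centred or uncentred convention is exactly what that comment asks the authors to clarify, so this is only one plausible reading:

```python
import numpy as np

def anomaly_pattern_correlation(forecast_anom, verif_anom, weights=None):
    """Centred pattern correlation between a forecast anomaly field and the
    verifying analysis anomaly over a region, optionally area-weighted
    (e.g. by cos(latitude)). One common ACC definition; the manuscript's
    exact convention would need to be confirmed by the authors."""
    f = np.ravel(forecast_anom).astype(float)
    v = np.ravel(verif_anom).astype(float)
    w = np.ones_like(f) if weights is None else np.ravel(weights).astype(float)
    w = w / w.sum()
    fm, vm = np.sum(w * f), np.sum(w * v)       # weighted means
    cov = np.sum(w * (f - fm) * (v - vm))        # weighted covariance
    return cov / np.sqrt(np.sum(w * (f - fm) ** 2) * np.sum(w * (v - vm) ** 2))
```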
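On the water vapour flux question (l. 181): a common choice is the magnitude of the vertically integrated vapour transport, IVT = (1/g) |∫ q (u, v) dp|. A sketch, assuming (without confirmation from the manuscript) that this is the quantity shown:

```python
import numpy as np

G = 9.81  # gravitational acceleration (m s^-2)

def _integrate_p(field, p):
    """Trapezoidal integration over the leading (pressure-level) axis."""
    mid = 0.5 * (field[1:] + field[:-1])
    dp = np.diff(p).reshape((-1,) + (1,) * (field.ndim - 1))
    return np.sum(mid * dp, axis=0)

def ivt_magnitude(q, u, v, pressure_pa):
    """Magnitude of the vertically integrated vapour transport:
    IVT = (1/g) * |integral of q*(u, v) dp|.
    q: specific humidity (kg/kg); u, v: winds (m/s); pressure_pa: pressure
    levels (Pa), ascending; level is the leading axis of q, u, v."""
    ivt_u = _integrate_p(q * u, pressure_pa) / G
    ivt_v = _integrate_p(q * v, pressure_pa) / G
    return np.hypot(ivt_u, ivt_v)
```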
The paper examines the week 3-4 prediction skill of two machine learning weather prediction (MLWP) models for two climate events that brought significant precipitation over California in 2022/23 (December-January 2022/23 and February-March 2023). The two MLWP models, NeuralGCM and Pangu-Weather, are compared to a traditional S2S General Circulation Model (GCM), UFS. The authors use a relaxation technique (nudging) to impose observed atmospheric variability in the tropics in a set of ensemble reforecasts, which they compare to the original reforecasts of the two climate events. They find that imposing accurate tropical variability largely improves the prediction skill of the North Pacific atmospheric circulation and associated moisture flux at week 3-4 lead time, especially for the December case study. This is true for both MLWP models and UFS, with comparable physical mechanisms leading to the improvement (Rossby wave sources in the subtropics). This demonstrates that improved S2S prediction in the tropics would induce higher prediction skill for such precipitation events in the mid-latitudes, and also that the new generation of MLWP models exhibits skill and mechanisms comparable to those of traditional physics-based forecast models when such a tropical relaxation technique is used (at much lower computational cost). The prediction skill of the two MLWP models is in fact slightly higher than UFS for the two case studies, both with and without tropical relaxation, but as noted by the authors, a more robust comparison of prediction skill would require a more systematic evaluation over a greater number of cases.
The paper is a nice contribution to the field of S2S prediction, and it is clear and well written. However, there is room for improvement, and I have some comments and suggestions listed below.
1) l. 28, when discussing the potential for S2S prediction using MLWP models, some references are missing to reflect what has been done already. For instance, the two following papers are relevant references to include as they discuss and demonstrate the advance of S2S forecast skill using these models.
Weyn, J. A., Durran, D. R., Caruana, R., & Cresswell-Clay, N. (2021). Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models. Journal of Advances in Modeling Earth Systems, 13(7), e2021MS002502. https://doi.org/10.1029/2021ms002502
Chen, L., Zhong, X., Li, H., Wu, J., Lu, B., Chen, D., et al. (2024). A machine learning model that outperforms conventional global subseasonal forecast models. Nature Communications, 15(1), 6425. https://doi.org/10.1038/s41467-024-50714-1
2) l. 85 : “Sea surface temperature are prescribed from ERA5.” Can you detail here? Do you maintain SST anomalies from initialization (persistent SST anomalies)?
3) l. 91: “This leads to its significantly lower computational resource requirements compared to the other two models of this study.” Could you give a rough estimate of each MLWP model’s computational cost here, relative to UFS?
4) Section 2.3: it sounds like the daily anomalies for the models are calculated from the ERA5 daily climatology. Ideally the model anomalies should be calculated using the model daily climatology, but this requires a set of hindcasts over a sufficiently long period. I do not think that using the model climatology would significantly change the results, but this should be mentioned for transparency.
5) l. 125, it is unclear what the “model replay” experiment is used for in the study.
6) Section 3.1: the December case study has also been highlighted in our recent paper (Peings et al. 2026) as a window of opportunity for S2S forecasting. The three models used in our study (two MLWP models and the ECMWF S2S model) exhibit good prediction skill for this period at week 2, as shown in the paper, but we also found good skill for week 3 and more generally for the week 2-4 window. We also performed a sensitivity study with one of the MLWP models to demonstrate that the skill was coming from the tropics. I think this paper is worth being cited because it aligns with the results presented here.
Peings, Y., Dong, C., Mahesh, A., Pritchard, M., Collins, W., & Magnusdottir, G. (2026). Subseasonal forecasting and MJO teleconnections in machine learning weather prediction models. Journal of Geophysical Research: Atmospheres, 131, e2025JD044910. https://doi.org/10.1029/2025JD044910
7) The section about the physical mechanism leading to more skillful predictions for the two case studies would benefit from being developed. The RWS anomalies of Fig. 3 and Fig. 5 are noisy and not very explicit. I think it would be interesting to see how they bridge the tropics with the extratropics, i.e., showing the associated Rossby wave, maybe at different lead times (week 1, 2 and 3) to show its development. You could also show how the deep convection anomalies in the tropics differ in CRL versus NTR as a function of time, maybe using a Hovmoller plot (time as a function of longitude), which would reveal how MJO propagation changes with nudging and makes for a more accurate teleconnection. The paper only includes 6 figures, so there is room for a couple of figures further detailing the tropics-extratropics teleconnection leading to improved skill in the North Pacific/North America sector (especially for the December case).
8) In conclusion, when stating that “However, drawing more definitive conclusions will require a systematic evaluation over multiple years and similar events to assess the generalization of these results”, it should be mentioned that a systematic evaluation of the S2S forecast skill for the North Pacific/Western North America region has been done for NeuralGCM (Peings et al. 2026). The study shows that two MLWP models (SFNO-HENS and NeuralGCM) exhibit comparable S2S skill to ECMWF for the case of the MJO and North Pacific atmospheric patterns during the October-March season.
9) l. 296: “This suggests that a better representation of the tropical atmospheric state in the models would have improved the prediction of this particular event”. The conclusion would benefit from a discussion of how the MLWP models have the potential to improve prediction skill in the tropics, and consequently in the mid-latitudes (if they do).
Do the authors anticipate that S2S forecast skill will improve with future developments in both traditional dynamical models and machine-learning weather prediction (MLWP) systems? Or does the current similarity in S2S skill between MLWP models and GCMs indicate an intrinsic predictability limit of the climate system that may be difficult to surpass?
Nudging simulations such as those presented in the paper are valuable for investigating mechanisms and tracing potential sources of predictability for specific events. However, do we realistically expect S2S forecasts in the tropics to become sufficiently accurate to substantially improve prediction skill in the mid-latitudes? I know that is the million-dollar question, but it would be worthwhile to address it in the conclusion to place the results in a broader predictability context.
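The Hovmoller diagnostic suggested in point 7 could be assembled along the following lines; the (time, lat, lon) input layout and the averaging band are assumptions made for illustration:

```python
import numpy as np

def hovmoller(field, lat, lat_band=(-10.0, 10.0)):
    """Build a Hovmoller array (time x longitude) by averaging a
    (time, lat, lon) anomaly field over a tropical latitude band,
    with cosine-latitude weights for the meridional mean.
    The band limits are illustrative choices."""
    mask = (lat >= lat_band[0]) & (lat <= lat_band[1])
    w = np.cos(np.deg2rad(lat[mask]))
    # contract the latitude axis against the weights -> (time, lon)
    return np.tensordot(field[:, mask, :], w, axes=([1], [0])) / w.sum()
```

Plotting such an array for, e.g., OLR or precipitation anomalies in CRL versus NTR would show how MJO propagation changes with nudging, as suggested above.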