Constructing Extreme Heatwave Storylines with Differentiable Climate Models

Whittaker, Tim; Di Luca, Alejandro

doi:10.48550/arXiv.2506.10660

Preprints

https://doi.org/10.48550/arXiv.2506.10660

12 Aug 2025

Abstract. Understanding the plausible upper bounds of extreme weather events is essential for risk assessment in a warming climate. Existing methods, based on large ensembles of physics-based models, are often computationally expensive or lack the fidelity needed to simulate rare, high-impact extremes. Here, we present a novel framework that leverages a differentiable hybrid climate model, NeuralGCM, to optimize initial conditions and generate physically consistent worst-case heatwave trajectories. Applied to the 2021 Pacific Northwest heatwave, our method produces heatwaves up to 3.7 °C more intense than the most extreme member of a 75-member ensemble. These trajectories feature intensified atmospheric blocking and amplified Rossby wave patterns, both hallmarks of severe heat events. Our results demonstrate that differentiable climate models can efficiently explore the upper tails of event likelihoods, providing a powerful new approach for constructing targeted storylines of extreme weather under climate change.
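The core idea of the method (nudging the initial state along the gradient of a heat objective while penalizing large perturbations) can be illustrated without NeuralGCM. The sketch below is a minimal, hypothetical example in plain Python: it assumes a linear two-variable toy model x_{t+1} = A x_t, an objective equal to the mean of one variable over the last few steps minus a quadratic penalty lam * ||delta||^2, and plain gradient ascent. Because the toy model is linear, the gradient is computed exactly by running the adjoint (A transposed) backwards in time, which is the quantity automatic differentiation would return for a differentiable model. None of the names or numbers come from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for small dense nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def objective(a: Matrix, x0: Vector, delta: Vector, n_steps: int,
              window: int, target: int, lam: float) -> float:
    """Mean of variable `target` over the last `window` states of the
    hypothetical toy model x_{t+1} = A x_t started from x0 + delta,
    minus the penalty lam * ||delta||^2."""
    state = [xi + di for xi, di in zip(x0, delta)]
    states = [state]
    for _ in range(n_steps):
        state = matvec(a, state)
        states.append(state)
    heat = sum(s[target] for s in states[-window:]) / window
    return heat - lam * sum(d * d for d in delta)


def gradient(a: Matrix, delta: Vector, n_steps: int,
             window: int, target: int, lam: float) -> Vector:
    """Exact gradient of `objective` with respect to delta.

    Adjoint recursion: adj_t = A^T adj_{t+1} + e / window for steps inside
    the averaging window. For this linear toy model the gradient of the
    heat term does not depend on x0, so x0 is not needed here.
    """
    a_t = [list(col) for col in zip(*a)]  # transpose of A
    n = len(delta)
    e = [1.0 / window if i == target else 0.0 for i in range(n)]
    adj = [0.0] * n
    for t in range(n_steps, -1, -1):
        if t > n_steps - window:  # step t lies inside the averaging window
            adj = [ai + ei for ai, ei in zip(adj, e)]
        if t > 0:
            adj = matvec(a_t, adj)
    return [g - 2.0 * lam * d for g, d in zip(adj, delta)]


def optimize(a: Matrix, x0: Vector, n_steps: int, window: int, target: int,
             lam: float, lr: float = 0.1, iters: int = 200) -> Vector:
    """Plain gradient ascent on the initial perturbation, starting from zero."""
    delta = [0.0] * len(x0)
    for _ in range(iters):
        g = gradient(a, delta, n_steps, window, target, lam)
        delta = [d + lr * gi for d, gi in zip(delta, g)]
    return delta


if __name__ == "__main__":
    A = [[0.9, 0.2], [-0.1, 0.95]]  # hypothetical 2-variable dynamics
    x0 = [1.0, 0.0]
    base = objective(A, x0, [0.0, 0.0], 10, 5, 0, 1.0)
    delta = optimize(A, x0, 10, 5, 0, 1.0)
    print(base, objective(A, x0, delta, 10, 5, 0, 1.0), delta)
```

With lam = 1 and lr = 0.1, each update is delta <- 0.8 delta + 0.1 g_heat, so the iteration converges to g_heat / 2, the maximizer of this concave toy objective; a larger lam keeps the optimized perturbation smaller.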

Received: 01 Aug 2025 – Discussion started: 12 Aug 2025

Status: final response (author comments only)

RC1: 'Comment on egusphere-2025-3748', Anonymous Referee #1, 14 Sep 2025
Dear Editor,
Thank you for inviting me to review this manuscript. Please find below my comments and suggestions to the authors.
General Comment
First of all, I would like to thank the authors for investigating the promising field of finding alternative, less-computationally demanding tools for simulating extreme events. The overarching aim of the study is timely and well chosen.
In this paper, the authors propose a method to simulate storylines of extreme weather events by perturbing the initial conditions of the hybrid climate model NeuralGCM, which is differentiable and therefore amenable to gradient-based optimization. Their framework identifies small but plausible initial perturbations that evolve into more extreme heatwave trajectories. As a proof of concept, the method is applied to the 2021 Pacific Northwest heatwave (PNW2021). The optimized runs produce heatwaves that are up to 3–4 °C hotter than any member of a 75-member stochastic NeuralGCM ensemble, with strengthened blocking and Rossby wave patterns consistent with known physical mechanisms. Importantly, the required perturbations remain within ensemble variability, suggesting plausibility.
The methodology is promising, and I consider the article worthy of publication after revisions. My general impression is that the material is interesting but the presentation sometimes makes it difficult for readers to follow, particularly in Sections 2 and 3, which should undergo some changes. By contrast, I found the Discussion (Section 4) concise, clear, and well situated within the literature. Below, I provide more detailed comments that I hope will help the authors clarify and strengthen the manuscript.
Specific Comments
Section 2.1
Please expand this subsection. It is helpful to introduce the methodology formally, but currently it lacks clarity and rigour. For non-mathematical readers, the description is particularly difficult to follow.

At the beginning, one or two sentences reminding the reader of the subsection’s aim (why optimization is needed and what problem it addresses) would help.

Some notations are undefined (e.g., x^i_0). Please define all variables consistently.

The authors state that Eq. (3) is a good loss function. Could you explain why this particular form is appropriate for this case study, and why alternatives were not chosen?

Please state explicitly what O(X(t)) and F(O(X(t))) represent for this case study.

The term “component” is ambiguous; please clarify what is meant.

Why was a 5-day averaging window chosen? Is this linked specifically to PNW2021, or more generally to temperature autocorrelation?

In Eq. (3), what does the index i represent, and what does it span over? Similarly, please define \gamma and \theta (I assume latitude and longitude).

I find it misleading to call the first term the “heatwave intensity term”, since “heatwave intensity” is formally defined later in Section 2.3 as a different quantity.

The “objective function” mentioned after Eq. (3) should be clearly defined.
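Several of these comments ask what the averaged quantities in the loss actually are. As a generic illustration only (not a reconstruction of Eq. (3)), the sketch below computes one common kind of regional heat diagnostic: a cos(latitude)-weighted mean over the grid points of a region, followed by a moving average over consecutive 5-day windows and its maximum. The input layout (one lat x lon nested list per day) and all function names are assumptions.

```python
import math
from typing import List, Sequence


def area_weighted_mean(field: Sequence[Sequence[float]],
                       lats_deg: Sequence[float]) -> float:
    """Mean of one (lat x lon) field using cos(latitude) area weights."""
    num = 0.0
    den = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den


def running_mean(series: Sequence[float], window: int) -> List[float]:
    """Means over consecutive windows; returns len(series) - window + 1 values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def peak_window_mean(daily_fields, lats_deg: Sequence[float],
                     window: int = 5) -> float:
    """Hypothetical diagnostic: largest `window`-day mean of the regional mean."""
    regional = [area_weighted_mean(f, lats_deg) for f in daily_fields]
    return max(running_mean(regional, window))


if __name__ == "__main__":
    days = [20, 22, 25, 30, 33, 35, 34, 28, 24]
    fields = [[[t, t], [t, t]] for t in days]  # uniform 2 x 2 field per day
    print(peak_window_mean(fields, [45.0, 50.0]))
```

Weighting by cos(latitude) accounts for grid cells shrinking toward the poles on a regular latitude-longitude grid; in the uniform example the weights have no effect, so the result is simply the peak 5-day mean of the daily values.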

Section 2.2
The decision to optimize only the initial conditions while keeping all other parameters fixed is understandable for tractability and stability, but it could limit how representative the resulting extremes are. Is this a true limitation for your study? If so, I suggest mentioning it explicitly.

The mention of 1.4° resolution in this section is confusing. It is not clear how or when this configuration is used. The text should clearly distinguish which experiments are at 2.8° and which at 1.4°, and why both are mentioned at this stage.

The discussion of grid scales, time steps, and numbers of simulations is unclear. Are multiple optimized runs performed, or only one? Is the 75-member ensemble used for both the stochastic NeuralGCM and the optimized runs?

If only one optimized run is presented, how robust are the results to “luck” in initialization? How should the uncertainty in the optimization outcome be quantified?

Table 1 is useful, but please explain what the listed parameters mean (e.g., each of the \lambda parameters), why there are two different numbers of steps, and recall the definition of \tau.

Section 2.3
Please clarify how you treat events separated by only one day: are these counted as two separate events or merged as one? Otherwise there is a risk of double-counting.

This section suggests that heatwave intensity is central to the study, but it seems to be used primarily in Fig. 4e. Consider clarifying that it is one of several diagnostics used.
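The double-counting concern can be made concrete. The sketch assumes a hypothetical event definition: a heatwave is a run of days above a threshold, runs separated by at most max_gap cooler days are merged, and only merged runs lasting at least min_len days are kept. Whether the manuscript merges events, and with what gap, is exactly what the comment asks; the threshold and lengths here are illustrative.

```python
from typing import List, Sequence, Tuple


def find_events(temps: Sequence[float], threshold: float,
                min_len: int = 3, max_gap: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) day indices, inclusive, of hot spells.

    Days with temp > threshold are hot. Hot runs separated by at most
    `max_gap` non-hot days are merged first; `min_len` is applied to the
    merged events, which may therefore contain a few cooler days.
    """
    runs = []
    start = None
    for i, t in enumerate(temps):
        if t > threshold and start is None:
            start = i
        elif t <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:  # series ends during a hot run
        runs.append((start, len(temps) - 1))

    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s, e) for s, e in merged if e - s + 1 >= min_len]


if __name__ == "__main__":
    temps = [30, 31, 32, 25, 33, 34, 35, 20, 20]
    print(find_events(temps, 28))             # counted separately
    print(find_events(temps, 28, max_gap=1))  # merged into one event
```

In the example, with max_gap=0 the two 3-day spells split by one cooler day count as two events, while max_gap=1 merges them into a single 7-day event; stating which convention is used removes the ambiguity.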

Section 3.1
On page 7, the reference should be to Fig. 4, not Fig. 2 (caption).

The evaluation against ERA5 is valuable, and I appreciate the authors’ transparency in acknowledging that NeuralGCM underestimates extreme heat. Could you provide a possible explanation here (e.g., omission of land–atmosphere feedbacks, as later discussed in Section 4)? Even a brief cross-reference would help.

The distribution in Fig. 1a appears bimodal for NeuralGCM. Is this an artifact, or is there a physical reason?

In Fig. 1b, please add a legend to indicate lead times, and specify the simulation period in the caption (otherwise “Day of the month” is hard to interpret).

While you compare against ERA5 temperature, can NeuralGCM also reproduce circulation fields relevant to heatwaves (e.g., Z500)? Showing this would be useful.

As far as I understand, Section 3.1 uses the 2.8° version of NeuralGCM (worth recalling at the beginning of the section). Since you later show (Section 3.4) that the 1.4° configuration reduces biases and better captures PNW2021, it would be valuable to include an ERA5 comparison for the higher resolution as well. Even a supplemental figure would highlight the importance of resolution for heatwave fidelity.

Section 3.2
The optimized trajectories are compared with the stochastic ensemble, but not directly with ERA5. Could you show whether the optimized Z500 patterns resemble those observed?

How many optimized trajectories were run? Is it 75, like the stochastic ensemble, or fewer? Please clarify in the text.

You report a 33% reduction in computational cost. How was this calculated? Please provide details.

Is there an optimal way to select the number of optimization steps?

Please ensure consistency in the definition of “intensity” across the manuscript, and cross-reference to the section where it was defined.
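A common way to answer the first Section 3.2 question quantitatively is a centred spatial pattern correlation between the optimized and ERA5 Z500 fields, weighted by cos(latitude). The sketch below assumes both fields are already on the same latitude-longitude grid and passed as nested lists; it is a standard generic metric, not a diagnostic taken from the manuscript.

```python
import math
from typing import Sequence


def pattern_correlation(a: Sequence[Sequence[float]],
                        b: Sequence[Sequence[float]],
                        lats_deg: Sequence[float]) -> float:
    """cos(latitude)-weighted centred pattern correlation of two (lat x lon) fields."""
    w, xa, xb = [], [], []
    for row_a, row_b, lat in zip(a, b, lats_deg):
        wt = math.cos(math.radians(lat))
        for va, vb in zip(row_a, row_b):
            w.append(wt)
            xa.append(va)
            xb.append(vb)
    sw = sum(w)
    # Weighted spatial means are removed, so constant offsets do not matter.
    ma = sum(wi * v for wi, v in zip(w, xa)) / sw
    mb = sum(wi * v for wi, v in zip(w, xb)) / sw
    cov = sum(wi * (va - ma) * (vb - mb) for wi, va, vb in zip(w, xa, xb))
    var_a = sum(wi * (v - ma) ** 2 for wi, v in zip(w, xa))
    var_b = sum(wi * (v - mb) ** 2 for wi, v in zip(w, xb))
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    z_obs = [[5500.0, 5600.0], [5700.0, 5800.0]]  # made-up Z500 values (m)
    z_opt = [[5520.0, 5640.0], [5690.0, 5850.0]]
    print(pattern_correlation(z_obs, z_opt, [40.0, 50.0]))
```

A value near 1 means the spatial anomaly patterns match even if the fields differ by a constant offset or a scale factor, which is useful when the model has a mean bias; the amplitude difference then needs a separate measure, such as a regional mean difference.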

Section 3.4
Why is the 1.4° experiment presented more briefly than the 2.8°, even though it shows better agreement with ERA5? Presenting the 1.4° case in more detail (with the 2.8° as a supporting comparison) would seem the more logical choice.

Conclusion
The statement “a fraction of the computational cost” is vague. Please quantify—e.g., what fraction compared to a 75-member ensemble?

Final Remark
Overall, I found this to be a promising and well-motivated study, but one that would benefit from greater clarity in the methodology and results sections. The suggestions above aim to improve accessibility and transparency for readers. I believe the manuscript is suitable for publication after these issues are addressed.
Citation: https://doi.org/10.5194/egusphere-2025-3748-RC1
RC2: 'Comment on egusphere-2025-3748', Anonymous Referee #2, 23 Oct 2025

Review of Constructing Extreme Heatwave Storylines with Differentiable Climate Models
Summary
The authors leverage the differentiable nature of NeuralGCM to identify the optimal initial conditions (ICs) for the 2021 Pacific Northwest heatwave, and then demonstrate that the simulations following these ICs indeed produce very extreme heat in the region that exceeds values produced in a standard NeuralGCM ensemble. The paper is a nice demonstration of the potential to use differentiation to produce boosted ensembles.
Major comments
- Table 2 provides information about the perturbations used in the optimized runs (and the range for the ensemble). Is the maximum taken across space? This is not clear. What is the spatial structure of the perturbations? Is it constrained in some way, or does it emerge directly from the differentiation? Can the authors show the perturbations in a figure, and can we learn from their structure?
- Can the authors comment further on the use of N=50 vs N=75? How were these chosen? If we wanted to reduce compute, would we still have success with e.g. N=10? How different would N=75 be from N=something large? I don't necessarily expect the authors to do this experiment, but simply to comment on their expectations for the sensitivity of the results to this choice.
- The authors compare to ERA5 early on, but then drop the comparison. Are the optimized heatwaves comparable to ERA5, perhaps after accounting for any biases in the mean state during the early summer period? Do the other variables shown in Figure 5 follow similar trajectories to ERA5?
- Figure 5 shows that some plausible drivers of the heatwave are largely within the envelope of the original NeuralGCM ensemble, raising the question of what actually caused the extreme heat. One option is that no single driver is extreme, but they are collectively extreme given correlation between them. It may also be worth looking at shortwave radiation and advection if these are available in NeuralGCM.
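The first major comment (is the Table 2 maximum taken over space, and do the perturbations stay within ensemble variability?) suggests a simple per-grid-point diagnostic. The sketch assumes the unperturbed initial state, the optimized perturbation, and the ensemble members' states at the same time are available as flat lists over the same grid points for one variable; the summary it returns (largest absolute perturbation, where it occurs, and the fraction of points inside the ensemble min-max envelope) is illustrative and may differ from how Table 2 is defined.

```python
from typing import List, Sequence, Tuple


def perturbation_summary(base: Sequence[float], delta: Sequence[float],
                         ensemble: List[Sequence[float]]) -> Tuple[float, int, float]:
    """Return (max |delta| over space, its grid index, fraction of grid points
    where base + delta lies within the ensemble min-max envelope)."""
    abs_d = [abs(d) for d in delta]
    i_max = max(range(len(abs_d)), key=lambda i: abs_d[i])
    inside = 0
    for i, (b, d) in enumerate(zip(base, delta)):
        values = [member[i] for member in ensemble]  # all members at point i
        if min(values) <= b + d <= max(values):
            inside += 1
    return abs_d[i_max], i_max, inside / len(base)


if __name__ == "__main__":
    base = [290.0, 291.0, 292.0]  # made-up temperatures (K) at 3 points
    delta = [0.5, -1.5, 0.2]
    ens = [[289.0, 290.0, 291.5], [291.0, 292.0, 292.5]]
    print(perturbation_summary(base, delta, ens))
```

Mapping the per-point results, rather than reporting only a single maximum, would also show the spatial structure of the perturbation that the comment asks about.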
Specific comments
- Please include line numbers in future drafts.

- Page 1: There are a number of papers on the dynamics of the 2021 PNW heatwave beyond the Mass et al. study that should be cited, e.g. White et al., https://www.nature.com/articles/s41467-023-36289-3; Neal et al., https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021GL097699; Duan et al., https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2025EF006216 as a starting point.

- Page 2: McKinnon and Simpson, https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2022GL100380 is also relevant for the use of large ensembles specific to the 2021 PNW heatwave

- Page 2: The computational cost of training the models is high, although prediction is relatively cheap. At what point does the cheapness of the prediction outweigh the expense of the training?

- Page 3: The potential role of quasi-resonant amplification is not fully established within the literature for the 2021 heatwave.

- Page 4: Have you confirmed that the 1000 hPa level is above the surface at all points in the domain? I suspect it is not, based on the topography. Why use 1000 hPa rather than 2 m temperature, which is the more typical choice of variable for heat?

- Page 5 / Table 1: Could the authors provide some intuition about the choice of the two sets of parameters for the optimization process?

- Table 2: The difference between the first and second columns is not clear, and the title of the third column could be improved.

- Figure 1a: Given that the heatwave happened in summer, suggest subsetting the data to a relevant summer period (e.g. June-July) and then comparing histograms.

- Figure 2: Please reduce the thickness and/or number of contours in the middle panel, since it is hard to see the shading.

- Page 13: Could the authors say more about the dual-initial-condition requirement of purely data driven models?

- There are typos, as well as citation errors in the use of in-text vs parenthetical citations, that should be corrected.
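The 1000 hPa question in the Page 4 comment can be checked directly from surface pressure: wherever surface pressure is below 1000 hPa, the 1000 hPa level lies below ground and values there are extrapolated. A minimal sketch, assuming surface pressure is available in hPa as a flat list over grid points (the numbers are made up):

```python
from typing import List, Sequence


def below_ground_mask(surface_pressure_hpa: Sequence[float],
                      level_hpa: float = 1000.0) -> List[bool]:
    """True where the pressure level lies below the terrain surface.

    Pressure decreases with height, so if surface pressure is lower than
    the level's pressure, that level would sit underground.
    """
    return [ps < level_hpa for ps in surface_pressure_hpa]


def fraction_below_ground(surface_pressure_hpa: Sequence[float],
                          level_hpa: float = 1000.0) -> float:
    """Fraction of grid points where the level is below ground."""
    mask = below_ground_mask(surface_pressure_hpa, level_hpa)
    return sum(mask) / len(mask)


if __name__ == "__main__":
    sp = [1013.0, 1005.0, 950.0, 880.0, 700.0]  # hypothetical values (hPa)
    print(below_ground_mask(sp), fraction_below_ground(sp))
```

Points flagged this way (typically over elevated terrain) hold extrapolated rather than physically present 1000 hPa values, which is the concern raised in the comment.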
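The training-cost question in the Page 2 comment reduces to simple arithmetic under stated assumptions: if training costs C_train once, and each simulation costs c_ml with the emulator versus c_phys with a physics-based model, the one-off training cost is recovered after N* = ceil(C_train / (c_phys - c_ml)) simulations. The function name and the numbers below are hypothetical placeholders, not values from the manuscript or from NeuralGCM.

```python
import math


def break_even_runs(train_cost: float, cost_phys: float, cost_ml: float) -> int:
    """Smallest n with train_cost + n * cost_ml <= n * cost_phys.

    All costs must be in the same unit (e.g. GPU- or CPU-hours).
    """
    if cost_phys <= cost_ml:
        raise ValueError("emulator is not cheaper per run; no break-even point")
    return math.ceil(train_cost / (cost_phys - cost_ml))


if __name__ == "__main__":
    # Purely illustrative: training 10,000 units, 50 vs 2 units per run.
    print(break_even_runs(10000.0, 50.0, 2.0))
```

The calculation ignores differences in hardware, energy cost, and whether a model trained once is reused across many studies, so it only frames the question rather than answering it.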

Citation: https://doi.org/10.5194/egusphere-2025-3748-RC2
AC1: 'Replies to reviewers', Tim Whittaker, 27 Nov 2025

We thank the reviewers for their comments, which are addressed in the uploaded combined responses file.

Citation: https://doi.org/10.5194/egusphere-2025-3748-AC1

Viewed

Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.

Total article views: 1,097 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,094	0	3	1,097	0	0

Views and downloads (calculated since 12 Aug 2025)

Month	HTML	XML	Total
Aug 2025	349	0	349
Sep 2025	635	1	636
Oct 2025	77	1	78
Nov 2025	33	1	34

Viewed (geographical distribution)

Total article views: 1,083 (including HTML, PDF, and XML), of which 1,083 have a defined geographic origin and 0 are of unknown origin.

Latest update: 30 Nov 2025

Short summary

Heatwaves are becoming more frequent and more intense. Yet running many climate simulations to find rare worst-case events is slow and costly. We developed a method that adjusts the initial weather conditions to target the most extreme heat scenarios at a fraction of the usual cost. For the 2021 Pacific Northwest heatwave, it found cases up to 3.7 °C hotter than any run in a 75-member ensemble, helping communities prepare for the worst.

