This work is distributed under the Creative Commons Attribution 4.0 License.
Forecast-based attribution of the role of stratospheric variability in weather extremes
Abstract. Variability of the stratospheric polar vortex, particularly its dramatic breakdown during sudden stratospheric warming (SSW) events, has been linked to a number of surface weather extremes. However, attributing the role of stratospheric variability in a specific observed weather extreme, rather than an abstracted class of extremes, has proved highly challenging. Here we use an ensemble of subseasonal forecast simulations from 7 forecast systems participating in the Stratospheric Nudging and Predictable Surface Impacts (SNAPSI) project to carry out this task. By comparing the likelihood of extreme events in free-running forecasts to those with the zonal-mean stratospheric state nudged towards its observed or climatological evolution (while the troposphere is freely-evolving), we are able to calculate the changes in the risk and severity of extremes due to the occurrence, or non-occurrence, of an SSW. We focus on three case-study events: (i) the 2018 boreal SSW and subsequent Eurasian cold air outbreak and snowfall, (ii) the 2019 boreal SSW and subsequent North American cold air outbreak, and (iii) the 2019 austral near-SSW and subsequent Australian heat wave. Through an extreme value statistical analysis, we find in all three cases a significant stratospheric contribution to the risk of relevant weather extremes. In case (i), improving the SSW prediction by nudging as much as doubles the forecast risk of extreme Eurasian cold and UK snow. The differences in risk and severity between experiments nudged to the SSW and to climatology are relatively insensitive to the lead time before the cold air outbreak of case (i). By contrast, in case (ii) this difference only emerges at short lead times before the event, indicating a stratospheric influence on this event that is dependent on the tropospheric state. For case (iii) we find a stronger and more robust stratospheric impact on the severity of the Australian heat wave than on its risk, with the latter being highly sensitive to model bias. The methodology outlined here, including both the experimental design and the semi-parametric approaches for calculating risks, can be applied to attribute several other internal climate system drivers of extreme event risk.
Competing interests: Amy H. Butler is a member of the editorial board of Weather and Climate Dynamics.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 20 Mar 2026)
- RC1: 'Comment on egusphere-2026-230', Anonymous Referee #1, 12 Mar 2026
- RC2: 'Comment on egusphere-2026-230', Nicholas Leach, 18 Mar 2026
Summary
The manuscript presents a forecast-based approach to attribution of surface extremes to sudden stratospheric warming events, suggesting and estimating a variety of metrics for measuring the strength of the link. This link is based on a comparison of free-running and nudged experiments in a variety of forecast models. The breadth of the experiments performed and the resulting analysis is impressive, and provides a reliable counterfactual simulation approach and statistical framework for assessing the role of the stratosphere in individual weather events. My overall suggestions for improvement largely relate to the discussion of the GEV, the utility of the metrics produced, and the overall interpretation of the results. I have split my review into three sections: "significant suggestions", "minor remarks", and "other questions and ideas". The "significant suggestions" are broad areas that I believe the manuscript would benefit from addressing. The "minor remarks" are small suggestions relating to figures or wording that will not substantially change the text. The "other questions and ideas" are comments that, in my view, are not necessary for the publication of this manuscript, but I hope may be useful and/or interesting for the authors regardless. Overall, I believe that this manuscript is suitable for publication with a few changes and additional discussion. For visibility, I have selected "major revisions", as in my view some of the analysis should be repeated under slightly altered definitions of the metrics, but I would consider the changes necessary to be at the very bottom end of "major", as hopefully is clear from my review.

Significant Suggestions
Statistical inference - on the use of GEVDs and derived metrics
Your acknowledgement of the (in)appropriateness of the GEV is appreciated, and I like your description of it as a regularisation tool, rather than the "truth" based on the estimated underlying distribution. However, I think it may be worth emphasising that the confidence intervals estimated are very likely underestimates (especially in the tails, which are at times rather poorly captured). I think it's also important to point out that - besides the seasonal cycle - the assumption of identically distributed variables is only justified if the underlying processes are sufficiently similar. This is unlikely to be the case here, and in my view may well contribute to the structure not captured by the GEV in Fig. 5 (i.e. if the underlying distribution is a mixture). This lack of physical process understanding is particularly relevant for the snow depth analysis, where I would argue the GEV estimates are so poor that they should not be included. Particularly for CNRM, it seems likely that in the "free" ensemble the snowfall in those two top ensemble members was driven by sufficiently different processes that they are effectively drawn from a different distribution. While not necessary for this work, in my view a focus on process understanding (which is a key advantage of forecast-based approaches over statistical ones) would be worth mentioning as an avenue for further research.
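To make the bootstrap point concrete, here is a minimal sketch of the kind of block-maxima GEV fit and bootstrap I have in mind (my own illustration, assuming scipy/numpy; the synthetic data and ensemble size are placeholders, not values from the manuscript):

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# Placeholder for an ensemble of block maxima (e.g. one per member);
# a Gumbel sample stands in for real forecast output here.
maxima = rng.gumbel(loc=0.0, scale=1.0, size=100)

# GEV fit to the pooled maxima (scipy's genextreme parameterisation).
shape, loc, scale = genextreme.fit(maxima)

# Nonparametric bootstrap of the fit. Every resample is drawn from the
# same finite sample, so the resulting interval captures sampling noise
# only -- not model mis-specification (e.g. a mixture of processes),
# which is why tail confidence intervals are very likely underestimates.
boot_fits = [
    genextreme.fit(rng.choice(maxima, size=maxima.size, replace=True))
    for _ in range(1000)
]
rl_best = genextreme.ppf(0.99, shape, loc=loc, scale=scale)
rl_boot = [genextreme.ppf(0.99, c, loc=l, scale=s) for c, l, s in boot_fits]
print(rl_best, np.percentile(rl_boot, [2.5, 97.5]))  # 1%-exceedance return level
```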
The metrics used are pretty standard in attribution, though their specific definition as used here has at least one quirk, which I think is worth either addressing or noting. The risk ratio (RR) is based on the "observed" value, while the quantile shift (QS) is based on the equivalent value at the same quantile as that estimated from the ERA5 distribution. When the free distribution differs significantly from the ERA5 distribution, this can lead to the two metrics being calculated on very different "parts" of the distribution, leading to unintuitive results (such as the estimated attributable influence being an increase in probability but a reduction in intensity). I think it would be best to stick to using one "reference" to present the results: either calculate the quantile shift using the quantile that the observed value lies at in the *free* experiment (not in ERA5), OR calculate the risk ratio based on the value in the *free* experiment that corresponds to the ERA5 quantile. I believe that this would be consistent with how these metrics are typically used in the wider extreme weather attribution literature (considering the "free" experiments to be the "factual" experiment in attribution literature).
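To illustrate the single-reference version I am suggesting, a sketch (my own, not the manuscript's code; `free`, `nudged`, and `obs` are placeholders for the event-metric ensembles and the observed value):

```python
import numpy as np

def attribution_metrics(free, nudged, obs):
    """RR and QS computed against one reference: the 'free' (factual)
    experiment, rather than mixing ERA5 and observed baselines."""
    free, nudged = np.asarray(free), np.asarray(nudged)
    # Risk ratio at the observed threshold (metric defined so that
    # larger values are more extreme).
    rr = np.mean(nudged >= obs) / np.mean(free >= obs)
    # Quantile shift evaluated at the quantile the observed value
    # occupies in the *free* experiment -- i.e. the same part of the
    # distribution the RR refers to.
    q_obs = np.mean(free <= obs)
    qs = np.quantile(nudged, q_obs) - np.quantile(free, q_obs)
    return rr, qs
```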
My final point is along similar lines to the process understanding that I have mentioned. I think it's worth noting that the statistical analysis can only get you so far, and may actually misrepresent the true response, especially at the tails. This is particularly true for the CNRM snow depth case, where it is plausible from the plotted ensembles (though I think an increased ensemble size would be needed to determine this, as the inference is based on only a couple of members) that the nudging actually reduces the probability of truly extreme snowfall.

Model bias and impacts on attributable influence
While you do include a section on the importance of model bias in the discussion, I think it may be worth making more of this in the results. In particular, model drifts could explain some of the discrepancies between the influence estimated at long and short lead times, and these are not discussed.
In the case of the SSWSep19, I think that there are several potential issues that could be made more of or described in more detail:
- The seasonal cycle over the long-lead window. From Fig. S5, it is clear that the majority of the ERA5 events are taken from the end of the window - which isn't available in the long forecast. This is a bit of a challenge, as it introduces a bias that the RR metric will be sensitive to. This is a place where it may be more reasonable to define the event through a quantile for BOTH the RR and QS metrics.
- Similarly, in the short lead time, the RRs are all very small because the extreme event appears to have been entirely predictable (possibly because the models, IFS at least, have a slight warm bias too). This means that the event lies right at the bottom of the distribution of all the forecast experiments and thus the RR will be very small, even if the QS were large. Similarly to the long lead, this would be somewhat alleviated by consistently defining the event through a quantile rather than an observed value. I note that this may not make much difference.

Interpretation
It would be useful for you to provide some additional discussion on the interpretation of the metrics. In particular, it would be very useful to set out how the estimated influence could be dependent on the predictability of the event in the forecast models, and thus how readers should interpret the metrics given (especially the RR metric). This would give potentially very useful context for the snow depth analysis, where the estimated influence could be extremely dependent on the skill of the individual models at predicting extreme snowfall. The noted lack of model hindcasts for bias correction or calibration is a significant drawback here, as I think properly assessing model bias would strengthen the analysis.
Minor Remarks:
- Fig 3: would it be easier to see differences if the IFS panels were shown as differences? At the moment they look much the same, and so the IFS panels don't really provide additional information.
- L186 doesn't make sense? Is the min / max over the 4 daily samples relevant given that the min / max is computed again over the whole integration? Or is this just for clarity to turn the 6-hourly samples into daily estimated minima or maxima?
- L201 on the independence -> may also be worth including that the presence of skill means that independence of the maxima of the blocks is violated (in addition to the independence of the variables within the blocks).
- L213 -> there is a growing body of work into physical bounds on the limits of temperature extremes (and some work into using these physical limits to constrain statistical models of such extremes). May be worth mentioning. I have provided some references below.
- Fig 4 / 5: the x-axis scales used cut off data points in 4 and make them nearly invisible in 5.
- Fig 4: in (i), are the "GEV" fits the median of the bootstrap sample of fits (as opposed to the GEV fit to the original data)? Just asking as the solid black line has a kink around an exceedance probability of 0.6-0.7, which looks a little odd in the context of a parametric GEV model. Or is this because a linear scale without sufficiently small spacing is used to produce the line?
- Fig 4: Could a different symbol be used for the "observed" sample? The bigger cross is hard to pick out. To make it even more clear, the horizontal reference line could be drawn across both panels (i) and (ii).
- L255: "dots" -> "crosses"
- L240: I'm a little unsure about using the "absolute risk" terminology, given that the probabilities estimated throughout are conditional in at least some sense.
- [LEFT IN FOR TRANSPARENCY - though I think this is to do with the quantile-based vs. observed baseline issue I've commented on above] Fig 6: One feature I am struggling to understand is how Fig. 6 appears to show a reduction in risk for an increase in the severity of the extreme assessed in some cases (e.g. panel c/I, nudged B)? Could you comment on how this is possible? I wondered if this was something to do with how the confidence intervals are plotted, but given that this feels non-intuitive, it's worth commenting on.
- Fig 5: explain the dashed lines (these are the ERA5 observed value and quantile per the text, but it would help to state this in the caption or somewhere on the figure).
- L265-270: is it worth stating that, equally, the free experiment could be expected to better match climatology in a low-predictability situation, given that the free experiment has the additional dispersion in the stratosphere and is only conditioned on the predictable component of the weather at the point of initialisation?
- Fig S6 (a)(i) and (a)(iii) appear to disagree - all the red crosses lie below the green in the tail (e.g. less likely than 10^-1) of (i), but in (iii) the severity change indicated by the crosses is that the nudged severity is higher? I wondered if this is a sign error, but I don't think so, since the lower part of the distribution looks consistent.
- Is it possible to include a description of the model specifications (model cycle, resolution, ocean etc.) in the supplement? I think this would be useful.
- An interesting feature relevant to the discussion between L263-275 is that the ensemble variance relative to the free experiment increases not only in the control but in the nudged experiment too (per Fig 4; though the differences in variance may not be significant between the free and nudged experiments). This feels unintuitive - can you explain why?
Other Questions and Ideas:
- Would a "better" (or perhaps actually just "different") experiment design for the control be to nudge each ensemble member towards a different (random) realisation from the observed climatology? This would provide internally physically plausible/realisable stratospheric realisations with the same stratospheric variance properties as the observed realisations, rather than nudging towards the climatological mean.
Regards,
Nick Leach
References
Noyelle, Robin, Yoann Robin, Philippe Naveau, Pascal Yiou, and Davide Faranda. ‘Integration of Physical Bound Constraints to Alleviate Shortcomings of Statistical Models for Extreme Temperatures’. Journal of Climate (2026). https://doi.org/10.1175/JCLI-D-25-0112.1.
Noyelle, Robin, Yi Zhang, Pascal Yiou, and Davide Faranda. ‘Maximal Reachable Temperatures for Western Europe in Current Climate’. Environmental Research Letters 18, no. 9 (2023): 094061. https://doi.org/10.1088/1748-9326/acf679.
Zhang, Yi, and William R. Boos. ‘An Upper Bound for Extreme Temperatures over Midlatitude Land’. Proceedings of the National Academy of Sciences 120, no. 12 (2023): e2215278120. https://doi.org/10.1073/pnas.2215278120.

Citation: https://doi.org/10.5194/egusphere-2026-230-RC2
Article: Forecast-based attribution of the role of stratospheric variability in weather extremes
Author: Seviour et al.
In this work, the authors use the SNAPSI experiments to attribute the role of the stratosphere in the weather extremes that followed three sudden stratospheric warmings: two in the NH and one in the SH. The forecast-based attribution approach is well motivated, and the use of the risk ratio (RR) and quantile shift as complementary metrics is a good way to offset their respective weaknesses. The paper is clear and well structured, and it would be a good contribution to the journal and to the studies done with the SNAPSI experiments. The main areas that could be strengthened are the definition of severity and the attribution when the RR becomes ill-conditioned, potentially by adding the Extreme Forecast Index (EFI) and/or shift of tails (SoT) diagnostics. See my comments below.
Major comments:
Definition of severity: The paper discusses minimum temperature as the relevant extreme for two of the SSW cases, but the severity definition equation is written as a maximum over time. This is consistent if S is defined with a sign convention (e.g. coldness = −T) or if the variable is transformed prior to the maximization (which I don't think is the case, if I read the previous two paragraphs correctly). At present this can be a bit confusing. Could you clarify this in Eq. 1? One explicit way of writing it is sketched below.
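For instance (my own notation, assuming Eq. 1 takes a block maximum of a daily variable X(t) over the event window, not the manuscript's exact form):

```latex
% Severity with an explicit sign: s = +1 for heat extremes,
% s = -1 for cold extremes, applied before the maximum is taken.
S = \max_{t \in \mathrm{window}} \; s\,X(t),
\qquad\text{so for cold extremes}\quad
S = \max_{t}\bigl(-T_{\min}(t)\bigr) = -\min_{t} T_{\min}(t).
```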
Using RR in unstable cases: An RR based on exceeding the observed severity can become 0, infinite, or extremely unstable when the event is too rare for the model. You already discuss this in the text and partly address it by reporting the quantile shift. The EFI and SoT would be valuable additions because they remain informative when the observed threshold is outside the ensemble outcomes, and they are used in operational contexts on the subseasonal-to-seasonal timescale. Could you add this type of diagnostic to your study? It could be interesting to see the changes in EFI/SoT for each experiment and possibly report the differences. EFI and SoT are used, for instance, by ECMWF to predict extreme/unusual events, and the results from this analysis could also be of interest for prediction centres if they are presented using a quantity already in use by them. You may end up with the same weaknesses as in the present study if models have strong biases, but it could still be useful.
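For reference, the diagnostics I have in mind are (stated from memory of the ECMWF definitions, e.g. Lalaurette 2003; please check the exact forms against the operational documentation):

```latex
% Extreme Forecast Index: F_f(p) is the fraction of ensemble members
% below the model-climate quantile at probability p.
\mathrm{EFI} = \frac{2}{\pi} \int_0^1 \frac{p - F_f(p)}{\sqrt{p\,(1-p)}}\,\mathrm{d}p

% Shift of Tails: Q_f and Q_c are forecast-ensemble and model-climate
% quantiles; SoT > 0 when at least 10% of members lie beyond the
% climate 99th percentile.
\mathrm{SoT} = \frac{Q_f(0.90) - Q_c(0.99)}{Q_c(0.99) - Q_c(0.90)}
```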
Minor comments:
The control (nudged toward a climatological zonal mean) is useful as a counterfactual, but it may represent an unrealistic state of the system. You already acknowledge this limitation but consider improving the discussion by adding a possible alternative counterfactual construction.
Figure 2 caption: maximima → maxima. Also, could you use the same colorbar for all the plots? The scales are quite similar.
Figure 3: Could you use one colorbar for the plots in the same row? In addition, some colorbars in the left/right column have an upper/lower triangle that is not present in the equivalent colorbar for the other column. Could the top titles have a larger font size?
Line 197: Could you clarify the expression “(·)+ := max(·, 0)” in plain English for the reader?
Line 257: minumum → minimum
Figure 9 caption: Non-parameteric → Non-parametric
Line 404: intialization → initialization
The CNRM model appears in the table as CNRM-CM6-1 and in Figure 9 as CNRM-CM61. Please make it consistent.