A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity

Rupp, Philip; Spaeth, Jonas; Birner, Thomas

doi:10.5194/egusphere-2025-4925

Preprints

https://doi.org/10.5194/egusphere-2025-4925

Preprints

23 Oct 2025

| 23 Oct 2025

Status: this preprint is open for discussion and under review for Weather and Climate Dynamics (WCD).

A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity

Philip Rupp, Jonas Spaeth, and Thomas Birner

Abstract. Mid-latitude forecast skill at subseasonal timescales often depends on 'windows of opportunity' that may be opened by slowly varying modes such as ENSO, the MJO or stratospheric variability. Most previous work has focused on the predictability of ensemble-mean states, with less attention paid to the reliability of such forecasts and how it relates to ensemble spread, which directly reflects intrinsic forecast uncertainty. Here, we introduce a spread-versus-error framework based on the Spread-Reliability Slope (SRS) to quantify whether fluctuations in ensemble spread provide reliable information about variations in forecast error. Using ECMWF S2S forecasts and ERA5 reanalysis data, aided by idealised toy-model experiments, we show that reliability is controlled by at least three intertwined factors: sampling error, the magnitude of physically driven spread variability and model fidelity in representing that variability. Regions such as northern Europe, the mid-east Pacific, and the tropical west Pacific exhibit robustly high SRS values (≈ 0.6 or greater for 50-member ensembles), consistent with robust modulation by slowly varying teleconnections. In contrast, areas like eastern Canada show little or no reliability, even for 100-member ensembles, reflecting limited low-frequency modulation of forecast uncertainty. We further demonstrate two practical implications: (i) a simple variance rescaling yields a post-processed 'corrected spread' that enforces reliability and may help to bridge ensemble output with user needs; and (ii) time averaging effectively boosts ensemble size, allowing even 10-member ensembles to achieve reliability of spread fluctuations comparable to larger ensembles. Finally, we discuss possible links to the signal-to-noise paradox and emphasize that adequate representation of ensemble spread variability is crucial for exploiting subseasonal windows of opportunity.

Received: 04 Oct 2025 – Discussion started: 23 Oct 2025

Competing interests: At least one of the (co-)authors is a member of the editorial board of Weather and Climate Dynamics.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 23267 KB)

Supplement (17575 KB)

Download & links

Preprint (23267 KB)
Metadata XML
Supplement (17575 KB)
BibTeX
EndNote

Philip Rupp, Jonas Spaeth, and Thomas Birner

Status: open (until 18 Jan 2026)

Post a comment Subscribe to comment alert

RC1:
'Comment on egusphere-2025-4925', Anonymous Referee #1, 28 Nov 2025 reply
The manuscript “A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity” by Rupp et al. explores the relationship between the ensemble spread and forecast error in sub-seasonal ensemble forecasts (days 14-46) by ECMWF system and in a statistical toy model. The authors propose an approach, based on spread-error relationship, to identify regions where variations in ensemble spread correlate with variations in forecast error and demonstrate, using a simple statistical model, that spread-error relationship can be deteriorated by insufficient sampling, lack of physical processes that modulate predictability, and model deficiencies.
The paper provides several interesting ideas, in particular exploring the connection between intra-forecast and inter-forecast variability of the spread, and illustrating several critical issues of sub-seasonal forecasting (such of under-sampling) using the toy model. I have no doubt that the paper should be published in WCD. However, I ask the authors to clarify several critical points before publication.

Major points.
I find that the term “the potential for windows of forecast opportunity” is obscure. I suspect that what the authors mean is “the potential to make skillful forecasts”. Instead, the current message is “the potential for opportunity to make skillful forecasts”. If this is really what the authors want to say, then I wonder what it means in practice.

The authors focus on one property of the forecast – reliability. However, I am used to think of skillful forecasts in terms of accuracy. The forecasts may lack accuracy because of low predictability even if the forecasting system is reliable. Consequently, I am used to think of windows of forecast opportunity as of periods with enhanced skill, and accuracy sufficient for decision making. I feel that your analysis, as illustrated in Figure 3, only highlights areas where physical processes modulate predictability, however it leaves open the question of whether the predictability in these regions is ever sufficient for making skillful forecasts. Therefore, I do not agree with the following statement at L468-469: “Our spread-error framework shows that, over large areas of the Northern Hemisphere, those windows are opened by slowly varying teleconnections”. I feel it is difficult to discuss forecasting opportunity without analyzing accuracy (for example, anomaly correlation coefficients) and therefore I ask the authors to be more careful about their definitions and be more critical about implication of their findings.

I am not sure why spread-error scatterplots should be made using daily values. The authors show in Figure 7 that intra-variations are spurious; thus, a lot of spread in Figure 2b is just noise. Why not define SRS using time-averaged (e.g. weekly mean) statistics?

The authors make important point about time averaging (L13-14); however, this point is only illustrated by a supplementary figure (Figure S3). If the point is important enough to be elevated to the abstract, then the figure should be a part of the main manuscript.

Specific points:
L61-64: Are these assertions supported by research, or is it your hypothesis? If this is the former, a reference is needed. If this is your hypotheses, please be clear about it.
L113: Provide full reference for Leutbecher et al.
L114-115: “A comparison between the IFS model and the CNRM model further shows qualitatively robust patterns (discussed in Section 6).” Robust patterns of what? Also, more information about the used CNRM data is needed.
L115-116: It is quite difficult to comprehend what exactly “forecast spread reliability is influenced by the potential for windows of opportunity” means. I am not sure which definition of “reliability” the authors are using. A reliable ensemble forecast system (or any other forecast system that provides probabilistic forecasts) is one whose predicted probabilities correspond to the observed frequencies; this is what a reliability diagram illustrates. It would help if the authors provided the definition of reliability they are using. In addition, what is the difference between “windows of opportunity” and “potential for windows of opportunity”? “Opportunity” and “potential” sound synonymous to me.
L125-127: “However, if the ensemble size is small, sampling errors will be relatively large. In such a case, some forecast/time step with, e.g., low spread, could be also associated with comparably large error, as the spread is simply underestimated due to sampling error.” You assume that spread is not a good predictor for accuracy, but has this been studied? Also, how to define whether the ensemble size is small or not? The size you are using (50 members at least) does not sound small to me.
Figure 2: Have you tried plotting only the “inter” component of your variance separation, rather than showing daily spread and error, which are mostly noise?
Figure 2 captions: “Red dashed line” not “Orange dashed line”
L151: How do you define “anomaly”? Figure 2 shows only positive values. For anomalies I would expect both positive (above climatology) and negative (below climatology) values.
L175: Do you assume that ensemble mean is well represented in the toy model, or do you also assume it is well represented in operational forecasts? Is this assumption justified?
L242: Does your assumption hold? I understand that, as you under-sample the forecast distribution, the variability of the spread will in general increase. However, I believe that the variability of ensemble mean would also increase, leading to increased error. Why this would not be the case?
L251: If the error is overestimated then how this can lead to a lower error?
L235-255: I cannot understand your explanations for decreased SRS in experiment (b), and I am not sure that you can explain it without analysing variability of ensemble mean.
L262-270: Do you mean that a larger ensemble size than 100 members would be required to capture the spread-error relationship in the case shown in panel “c”? Have you tested this with your toy model?
L271: “intrincic” -> ” intrinsic”
L289-290: Can you be more specific about which effects are unsystematic? I understand that insufficient number of cases leads to unsystematic effects, but can for example small sample size lead to unsystematic effects, or does it always lead to decreased SRS?
L324-329: Can you provide equations for the inter- and intra- variability?
L341: I do not know what the journal’s policy is, but I would prefer to see the definition of the theoretical sampling error estimate in the text rather than in figure captions.
L351-352: I presume you refer to Figure 4d? It would be nice to explicitly refer to this figure in the text, for clarity.
L388-389: It took me a while to figure out that you are using different colour scale for Figs. 9b and 9d. I suggest using the same scale because you are making the point about smallness of the anomalies in Fig.9d, which cannot be seen with the present scales.

Reply
Citation: https://doi.org/10.5194/egusphere-2025-4925-RC1

Philip Rupp, Jonas Spaeth, and Thomas Birner

Supplement

https://doi.org/10.5194/egusphere-2025-4925-supplement

Philip Rupp, Jonas Spaeth, and Thomas Birner

Viewed

Total article views: 345 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
248	72	25	345	25	21	23

HTML: 248
PDF: 72
XML: 25
Total: 345
Supplement: 25
BibTeX: 21
EndNote: 23

Views and downloads (calculated since 23 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	129	18	8	155
Nov 2025	91	15	9	115
Dec 2025	28	39	8	75

Cumulative views and downloads (calculated since 23 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	129	18	8	155
Nov 2025	91	15	9	115
Dec 2025	28	39	8	75

Viewed (geographical distribution)

Total article views: 335 (including HTML, PDF, and XML) Thereof 335 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 27 Dec 2025

Short summary

Weather forecasts several weeks ahead are uncertain, but this uncertainty itself can change depending on large-scale atmospheric conditions. We present a new way to measure how well forecasts capture these changes in uncertainty. Our results show that reliability of uncertainty varies strongly with region and is linked to slow, predictable patterns in the atmosphere. These findings help identify periods when forecasts are more trustworthy.


Total:	0
HTML:	0
PDF:	0
XML:	0