the Creative Commons Attribution 4.0 License.
On the reliability of seasonal snow forecasts
Abstract. Reliable information on seasonal snow conditions is important for long-range weather forecasting and climate modeling. This study evaluates the reliability of winter-mean hindcasts of snow water equivalent (SWE) produced by the ECMWF for the period 1993–2022 within the CopERnIcus climate change Service Evolution (CERISE) project. In probabilistic forecasting, reliability for a binary event is defined as the consistency between forecast probabilities and observed frequencies. Here, reliability is assessed using two independent SWE datasets (ERA5-Land and ESA Snow-CCI v4) across eight land regions in the Northern Hemisphere, excluding mountainous regions. The reliability assessment is performed for two tercile-based binary events representing low- and high-snow-accumulation winters. Reliability is quantified using a weighted linear regression applied to reliability diagrams and is grouped into five categories from perfect to dangerous. The results show good reliability of the ECMWF seasonal snow hindcasts for both low- and high-snow conditions. The assessment is sensitive to the choice of verification dataset, with ERA5-Land yielding slightly higher reliability categories than ESA Snow-CCI. Differences in hindcast reliability between regions and between verification datasets may be linked to snow variability, model representation, and observational uncertainty.
Status: open (until 12 May 2026)
- RC1: 'Comment on egusphere-2026-872', Anonymous Referee #1, 13 Apr 2026
Review for the manuscript: On the reliability of seasonal snow forecasts
General comments:
In their manuscript, the authors provide a detailed analysis of the reliability of seasonal forecasts of snow from the SEAS5 model. The authors use reliability diagrams to analyse the forecast skill for low- (lower tercile) and high- (upper tercile) snow accumulation winters (in terms of SWE), using two independent verification datasets as reference in the forecast skill verification process, namely ERA5-Land and ESA Snow-CCI version 4. The verification against the two observational datasets, which assigns each pre-defined land region to a specific category of forecast reliability following the definitions of Weisheimer and Palmer (2014), yields very different results. ERA5-Land generally yields higher reliability categories than ESA Snow-CCI, a discrepancy that the authors attribute in part to the shared dependence of the hindcasts and ERA5-Land on ERA5 forcing. In general, the seasonal snow hindcasts are classified into “useful” categories (Categories 3–5) for both verification datasets, both snow terciles, and all assessed regions.
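For concreteness, the Weisheimer and Palmer (2014) approach discussed above can be sketched in a few lines: bin the forecast probabilities, compute the observed event frequency per bin, and fit a count-weighted straight line whose slope determines the category. The sketch below is my own simplified illustration (function names are hypothetical, and the point-estimate classification omits the slope-uncertainty ranges the actual scheme uses):

```python
import numpy as np

def reliability_slope(fcst_prob, event_obs, n_bins=10):
    """Bin forecast probabilities, compute observed frequency per bin,
    and fit a count-weighted reliability line (slope, intercept)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(fcst_prob, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    mask = counts > 0
    mean_p = np.bincount(idx, weights=fcst_prob, minlength=n_bins)[mask] / counts[mask]
    obs_freq = np.bincount(idx, weights=event_obs, minlength=n_bins)[mask] / counts[mask]
    # np.polyfit weights multiply the residuals, so sqrt(count) gives
    # count-proportional weighting in the least-squares sense
    slope, intercept = np.polyfit(mean_p, obs_freq, 1, w=np.sqrt(counts[mask]))
    return slope, intercept

def category(slope):
    """Crude point-estimate classification (illustrative only; the
    published scheme also uses the slope's confidence interval)."""
    if slope < 0:
        return 1  # dangerous
    if np.isclose(slope, 1.0, atol=0.1):
        return 5  # (close to) perfect
    if slope >= 0.5:
        return 4  # still useful
    return 3 if slope > 0 else 2  # marginally useful / not useful
```

A perfectly calibrated forecast (observed frequency equal to forecast probability in every bin) yields a slope near 1 and hence the top category.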
The study also compares the probability density functions of SWE anomalies across the two verification products. The authors find that, while both datasets show similar general patterns, ERA5-Land typically displays greater variability and a higher frequency of extreme values, whereas ESA Snow-CCI anomalies tend to be more concentrated near the mean. Near the tercile thresholds, the agreement between the two datasets in classifying binary events (e.g., whether a low- or high-snow-accumulation winter occurs) falls from an overall average of 70–80% to 50–70%. This suggests that the final reliability category assigned to a forecast is sensitive to the specific threshold definitions of the chosen reference data.
The overall study is rigorous and robust, the methodology is clearly explained and therefore replicable, and it addresses the very relevant and timely topic of the reliability of seasonal snow forecasts. However, I have some concerns regarding the overall structure of the manuscript and the logic with which figures are presented. Hence, I suggest some major changes before the manuscript can be accepted for publication.
Specific comments:
In the Introduction, the discussion of previous literature on the verification of snow forecast products is limited. Expanding this section would help better contextualize the present work within the broader literature and clarify its contribution relative to existing studies. For instance, in the Introduction the authors say: “Reliability of seasonal-mean near-surface temperature and precipitation forecasted by the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS) 4 was examined in Weisheimer and Palmer (2014).” While I guess that the original reference was likely introduced to cite the study on which the methodology is based, there are a number of more recent studies that have evaluated the performance of the newer ECMWF seasonal forecast systems (SEAS5) through similar approaches. Some examples (also using the author’s methodology) are:
In its current form, Figure 2 is introduced in relation to the subdivision into eight land regions (L169). It presents climatological mean winter SWE and standard deviation for three datasets, but its role in the manuscript is somewhat unclear. The figure is only briefly discussed, with explicit reference mainly to panels (b), (d), and (f), while the climatological panels (a) and (e) are not described, and panel (c) is only mentioned in relation to the mountain mask (L176). A more complete description of all panels and a clearer link to the regional subdivision, including how it is adapted from Giorgi and Francisco (2000), would help to clarify its purpose in the manuscript.
The authors do not mention whether any detrending has been applied prior to the reliability assessment. If significant long-term trends are present in both the hindcasts and the verification datasets, these could affect the forecast skill, potentially masking the actual capability of the model to capture true interannual variability. While the domain-averaged time series in Figure 5 do not show obvious, pronounced trends, these large-scale averages may obscure significant regional trends occurring at the grid-point level. The authors may therefore consider assessing, or at least discussing, the sensitivity of the reliability categories to detrending anomalies at each grid point, in order to ensure the robustness of the results.
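The per-gridpoint detrending suggested above is inexpensive to test. As a minimal sketch (assuming the anomalies are held in a `(year, lat, lon)` array; the function name is illustrative), a linear least-squares trend can be removed independently at every grid point in vectorized form:

```python
import numpy as np

def detrend_gridpoint(swe):
    """Remove a linear least-squares trend from a (year, lat, lon)
    SWE anomaly array independently at every grid point."""
    n_years = swe.shape[0]
    t = np.arange(n_years, dtype=float)
    flat = swe.reshape(n_years, -1)        # (year, gridpoint)
    coef = np.polyfit(t, flat, 1)          # slope, intercept per column
    trend = np.outer(t, coef[0]) + coef[1] # rebuild the fitted line
    return (flat - trend).reshape(swe.shape)
```

Re-running the category assignment on detrended anomalies would show directly whether the reported reliability is robust to regional trends.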
I am a bit skeptical about the introduction of Fig. 4 and Fig. 5 only in the Discussion section. These figures could instead be presented and described in the Results section, with their implications for the reliability assessment discussed subsequently. Figure 4 is introduced as a means to better understand differences in reliability categories between the two verification datasets. However, as currently written, while the differences between the observational datasets' anomalies are briefly described, their implications for the reliability assessment are not entirely clear to me. Although the authors mention the shared dependence of the ECMWF forecasts and ERA5-Land on the ERA5 forcing, I think showing the ECMWF forecast distribution could provide additional insight into the behaviour of the forecasting system and help in understanding possible biases in the distribution of forecasted anomalies, better supporting the interpretation of the reliability differences found between verification datasets.
In general, the structure of the manuscript needs to be improved, and the paper is lacking in discussion, with some paragraphs presenting results rather than providing interpretation. I have some further minor comments below that the authors should consider when revising their manuscript.
Technical corrections:
L8: The authors write “The results show good reliability of the ECMWF seasonal snow hindcasts for both low- and high-snow conditions.” However, throughout the manuscript, performance is consistently described in terms of “useful” or “marginally useful” categories. To be more specific, I would suggest adding a clarification such as “in that they consistently issue at least marginally useful SWE forecasts independently of the chosen benchmark”.
L10: “slightly higher reliability” could be made more precise, e.g., “1 or 2 categories better”.
L30: I would smooth the sentence (e.g., “Improved forecasts of snow variables may also enhance the representation of large-scale atmospheric circulation in regions with strong snow–atmosphere coupling, such as East Asia”)
L43: “has”
L45: The relevance of snow forecasts for hydrological applications is stated but not substantiated; could the authors include specific examples or relevant references?
L76: I think it would be useful to have some description or a reference explaining how UNC, REL and RES can be interpreted.
L97: Figure 1 is introduced in the Methods section to explain how to read a reliability diagram. However, reliability diagrams themselves are never shown in the main text and appear only in the Appendix. The figure reproduces a standard schematic reliability diagram, very similar to that presented in Weisheimer and Palmer (2014). I am not sure it adds value to the current manuscript, especially for a specialized audience. I suggest either explicitly stating that the figure is adapted from previous work and clarifying its purpose, or removing it and referring directly to the existing literature.
L192: This explanation seems to imply that the distribution of SWE anomalies is skewed, leading to a lower frequency of low-snow events. If so, I would indeed expect fewer forecasts with high probabilities for the lower tercile. Is this what you mean? It would be useful to clarify this point in the manuscript.
L201: While an offset of the best-fit reliability line from the climatological intersection appears to be present in both TIB and WNA regions, it is considerably more pronounced in TIB. In particular, in the TIB region the best-fit line lies systematically below the line of perfect reliability, indicating a tendency towards overforecasting. It would be helpful to further discuss this bias, as this provides useful insight into the nature of the forecast errors.
L213: I am not sure that repeatedly specifying the color of each category adds value. Once defined, this information could be omitted to improve readability and avoid unnecessary repetition.
L247: Are the agreement values reported in Table A1 computed by pooling all gridpoints and forecasted years together?
L254: Given the sensitivity of binary-event reliability to tercile thresholds and the reduced agreement between verification datasets near tercile boundaries, the authors may consider complementing the categorical reliability analysis with a continuous probabilistic verification metric (for example, the Continuous Ranked Probability Score).
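To illustrate the continuous metric suggested at L254: the CRPS for an ensemble forecast can be computed without any binary-event thresholding via the standard energy-form identity CRPS = E|X − y| − ½ E|X − X′|, which makes it insensitive to the tercile-boundary issues discussed above. A minimal sketch (the function name is mine; dedicated implementations also exist, e.g. in the properscoring package):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble CRPS via the energy-form identity:
    CRPS = E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2
```

For a single-member (deterministic) ensemble this reduces to the absolute error, and it decreases as the ensemble sharpens around the verifying value.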
L258: Could the authors better specify what they mean by “snow dataset specifically suited for the forecast verification purposes”?
L267 and L273: The manuscript concludes that the forecasts show “overall good performance”. However, many regions fall into Category 3 (marginally useful), especially when evaluated against ESA Snow-CCI. Given that Category 3 includes cases with limited skill and the results are strongly dependent on the verification dataset, the interpretation of “overall good performance” appears overstated.