the Creative Commons Attribution 4.0 License.
On the reliability of seasonal snow forecasts
Abstract. Reliable information on seasonal snow conditions is important for long-range weather forecasting and climate modeling. This study evaluates the reliability of winter-mean hindcasts of snow water equivalent (SWE) produced by the ECMWF for the period 1993–2022 within the CopERnIcus climate change Service Evolution (CERISE) project. In probabilistic forecasting, reliability for a binary event is defined as the consistency between forecast probabilities and observed frequencies. Here, reliability is assessed using two independent SWE datasets (ERA5-Land and ESA Snow-CCI v4) across eight land regions in the Northern Hemisphere, excluding mountainous regions. The reliability assessment is performed for two tercile-based binary events representing low- and high-snow-accumulation winters. Reliability is quantified using a weighted linear regression applied to reliability diagrams and is grouped into five categories from perfect to dangerous. The results show good reliability of the ECMWF seasonal snow hindcasts for both low- and high-snow conditions. The assessment is sensitive to the choice of verification dataset, with ERA5-Land yielding slightly higher reliability categories than ESA Snow-CCI. Differences in hindcast reliability between regions and between verification datasets may be linked to snow variability, model representation, and observational uncertainty.
Status: open (until 12 May 2026)
- RC1: 'Comment on egusphere-2026-872', Anonymous Referee #1, 13 Apr 2026
Review for the manuscript: On the reliability of seasonal snow forecasts
General comments:
In their manuscript, the authors provide a detailed analysis of the reliability of seasonal forecasts of snow from the SEAS5 model. The authors use reliability diagrams to analyse the forecast skill for low- (lower tercile) and high- (upper tercile) snow accumulation winters (in terms of SWE), using two independent verification datasets as reference in the forecast skill verification process, namely ERA5-Land and ESA Snow-CCI version 4. The verification against the two observational datasets, which assigns each pre-defined land region to a specific category of forecast reliability following the definitions of Weisheimer and Palmer (2014), yields very different results. ERA5-Land generally yields higher reliability categories than ESA Snow-CCI, a discrepancy that the authors attribute in part to the shared dependence of the hindcasts and ERA5-Land on ERA5 forcing. In general, the seasonal snow hindcasts are classified into “useful” categories (Categories 3–5) for both verification datasets, both snow terciles, and all assessed regions.
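For concreteness, the Weisheimer and Palmer (2014) approach discussed above can be sketched in a few lines: bin the forecast probabilities, compute the observed event frequency per bin, and fit a count-weighted straight line whose slope determines the category. The sketch below is my own simplified illustration (function names are hypothetical, and the point-estimate classification omits the slope-uncertainty ranges the actual scheme uses):

```python
import numpy as np

def reliability_slope(fcst_prob, event_obs, n_bins=10):
    """Bin forecast probabilities, compute observed frequency per bin,
    and fit a count-weighted reliability line (slope, intercept)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(fcst_prob, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    mask = counts > 0
    mean_p = np.bincount(idx, weights=fcst_prob, minlength=n_bins)[mask] / counts[mask]
    obs_freq = np.bincount(idx, weights=event_obs, minlength=n_bins)[mask] / counts[mask]
    # np.polyfit weights multiply the residuals, so sqrt(count) gives
    # count-proportional weighting in the least-squares sense
    slope, intercept = np.polyfit(mean_p, obs_freq, 1, w=np.sqrt(counts[mask]))
    return slope, intercept

def category(slope):
    """Crude point-estimate classification (illustrative only; the
    published scheme also uses the slope's confidence interval)."""
    if slope < 0:
        return 1  # dangerous
    if np.isclose(slope, 1.0, atol=0.1):
        return 5  # (close to) perfect
    if slope >= 0.5:
        return 4  # still useful
    return 3 if slope > 0 else 2  # marginally useful / not useful
```

A perfectly calibrated forecast (observed frequency equal to forecast probability in every bin) yields a slope near 1 and hence the top category.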
The study also compares the probability density functions of SWE anomalies across the two verification products. The authors find that, while both datasets show similar general patterns, ERA5-Land typically displays greater variability and a higher frequency of extreme values, whereas ESA Snow-CCI anomalies tend to be more concentrated near the mean. Near the tercile thresholds, the agreement between the two datasets in classifying binary events (e.g., whether a low- or high-snow-accumulation winter occurs) falls from an overall average of 70–80% to 50–70%. This suggests that the final reliability category assigned to a forecast is sensitive to the specific threshold definitions of the chosen reference data.
The overall study is rigorous and robust, the methodology is clearly explained and therefore replicable, and it addresses the very relevant and timely topic of the reliability of seasonal snow forecasts. However, I have some concerns regarding the overall structure of the manuscript and the logic with which figures are presented. Hence, I suggest some major changes before the manuscript can be accepted for publication.
Specific comments:
In the Introduction, the discussion of previous literature on the verification of snow forecast products is limited. Expanding this section would help better contextualize the present work within the broader literature and clarify its contribution relative to existing studies. For instance, in the Introduction the authors say: “Reliability of seasonal-mean near-surface temperature and precipitation forecasted by the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS) 4 was examined in Weisheimer and Palmer (2014).” While I guess that the original reference was likely introduced to cite the study on which the methodology is based, there are a number of more recent studies that have evaluated the performance of the newer ECMWF seasonal forecast systems (SEAS5) through similar approaches. Some examples (also using the author’s methodology) are:
In its current form, Figure 2 is introduced in relation to the subdivision into eight land regions (L169). It presents climatological mean winter SWE and standard deviation for three datasets, but its role in the manuscript is somewhat unclear. The figure is only briefly discussed, with explicit reference mainly to panels (b), (d), and (f), while the climatological panels (a) and (e) are not described, and panel (c) is only mentioned in relation to the mountain mask (L176). A more complete description of all panels and a clearer link to the regional subdivision, including how it is adapted from Giorgi and Francisco (2000), would help to clarify its purpose in the manuscript.
The authors do not mention whether any detrending has been applied prior to the reliability assessment. If significant long-term trends are present in both the hindcasts and the verification datasets, these could affect the forecast skill, potentially masking the actual capability of the model to capture true interannual variability. While the domain-averaged time series in Figure 5 do not show obvious, pronounced trends, these large-scale averages may obscure significant regional trends occurring at the grid-point level. The authors may therefore consider assessing, or at least discussing, the sensitivity of the reliability categories to detrending anomalies at each grid point, in order to ensure the robustness of the results.
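The per-gridpoint detrending suggested above is inexpensive to test. As a minimal sketch (assuming the anomalies are held in a `(year, lat, lon)` array; the function name is illustrative), a linear least-squares trend can be removed independently at every grid point in vectorized form:

```python
import numpy as np

def detrend_gridpoint(swe):
    """Remove a linear least-squares trend from a (year, lat, lon)
    SWE anomaly array independently at every grid point."""
    n_years = swe.shape[0]
    t = np.arange(n_years, dtype=float)
    flat = swe.reshape(n_years, -1)        # (year, gridpoint)
    coef = np.polyfit(t, flat, 1)          # slope, intercept per column
    trend = np.outer(t, coef[0]) + coef[1] # rebuild the fitted line
    return (flat - trend).reshape(swe.shape)
```

Re-running the category assignment on detrended anomalies would show directly whether the reported reliability is robust to regional trends.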
I am a bit skeptical about the introduction of Fig. 4 and Fig. 5 only in the Discussion section. These figures could instead be presented and described in the Results section, with their implications for the reliability assessment discussed subsequently. Figure 4 is introduced as a means to better understand differences in reliability categories between the two verification datasets. However, as currently written, while the differences between the observational datasets' anomalies are briefly described, their implications for the reliability assessment are not entirely clear to me. Although the authors mention the shared dependence of the ECMWF forecasts and ERA5-Land on the ERA5 forcing, I think showing the ECMWF forecast distribution could provide additional insight into the behaviour of the forecasting system and help in understanding possible biases in the distribution of forecasted anomalies, better supporting the interpretation of the reliability differences found between verification datasets.
In general, the structure of the manuscript needs to be improved, and the paper is lacking in discussion, with some paragraphs presenting results rather than providing interpretation. I have some further minor comments below that the authors should consider when revising their manuscript.
Technical corrections:
L8: The authors write “The results show good reliability of the ECMWF seasonal snow hindcasts for both low- and high-snow conditions.” However, throughout the manuscript, performance is consistently described in terms of “useful” or “marginally useful” categories. To be more specific, I would suggest adding a clarification such as “in that they consistently issue at least marginally useful SWE forecasts independently of the chosen benchmark”.
L10: “slightly higher reliability” could be made more precise, e.g., “1 or 2 categories better”.
L30: I would smooth the sentence (e.g., “Improved forecasts of snow variables may also enhance the representation of large-scale atmospheric circulation in regions with strong snow–atmosphere coupling, such as East Asia”)
L43: “has”
L45: The relevance of snow forecasts for hydrological applications is stated but not substantiated; could the authors include specific examples or relevant references?
L76: I think it would be useful to have some description or a reference explaining how UNC, REL and RES can be interpreted.
L97: Figure 1 is introduced in the Methods section to explain how to read a reliability diagram. However, reliability diagrams themselves are never shown in the main text and appear only in the Appendix. The figure reproduces a standard schematic reliability diagram, very similar to that presented in Weisheimer and Palmer (2014). I am not sure it adds value to the current manuscript, especially for a specialized audience. I suggest either explicitly stating that the figure is adapted from previous work and clarifying its purpose, or removing it and referring directly to the existing literature.
L192: This explanation seems to imply that the distribution of SWE anomalies is skewed, leading to a lower frequency of low-snow events. If so, I would indeed expect fewer forecasts with high probabilities for the lower tercile. Is this what you mean? It would be useful to clarify this point in the manuscript.
L201: While an offset of the best-fit reliability line from the climatological intersection appears to be present in both TIB and WNA regions, it is considerably more pronounced in TIB. In particular, in the TIB region the best-fit line lies systematically below the line of perfect reliability, indicating a tendency towards overforecasting. It would be helpful to further discuss this bias, as this provides useful insight into the nature of the forecast errors.
L213: I am not sure that repeatedly specifying the color of each category adds value. Once defined, this information could be omitted to improve readability and avoid unnecessary repetition.
L247: Are the agreement values reported in Table A1 computed by pooling all gridpoints and forecasted years together?
L254: Given the sensitivity of binary-event reliability to tercile thresholds and the reduced agreement between verification datasets near tercile boundaries, the authors may consider complementing the categorical reliability analysis with a continuous probabilistic verification metric (for example, the Continuous Ranked Probability Score).
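To illustrate the continuous metric suggested at L254: the CRPS for an ensemble forecast can be computed without any binary-event thresholding via the standard energy-form identity CRPS = E|X − y| − ½ E|X − X′|, which makes it insensitive to the tercile-boundary issues discussed above. A minimal sketch (the function name is mine; dedicated implementations also exist, e.g. in the properscoring package):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble CRPS via the energy-form identity:
    CRPS = E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2
```

For a single-member (deterministic) ensemble this reduces to the absolute error, and it decreases as the ensemble sharpens around the verifying value.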
L258: Could the authors better specify what they mean by “snow dataset specifically suited for the forecast verification purposes”?
L267 and L273: The manuscript concludes that the forecasts show “overall good performance”. However, many regions fall into Category 3 (marginally useful), especially when evaluated against ESA Snow-CCI. Given that Category 3 includes cases with limited skill and the results are strongly dependent on the verification dataset, the interpretation of “overall good performance” appears overstated.