This work is distributed under the Creative Commons Attribution 4.0 License.
An extension of the WeatherBench 2 to binary hydroclimatic forecasts
Abstract. Binary forecasts of hydroclimatic extremes play a critical part in disaster prevention and risk management. While the recent WeatherBench 2 provides a versatile framework for the verification of deterministic and ensemble forecasts, this paper presents an extension to binary forecasts of the occurrence versus non-occurrence of hydroclimatic extremes. Specifically, sixteen verification metrics on the accuracy and discrimination of binary forecasts are employed and scorecards are generated to showcase the predictive performance. A case study is devised for binary forecasts of wet and warm extremes obtained from both deterministic and ensemble forecasts generated by three data-driven models, i.e., Pangu-Weather, GraphCast and FuXi, and two numerical weather prediction products, i.e., ECMWF’s IFS HRES and IFS ENS. The results show that the receiver operating characteristic skill score (ROCSS) serves as a suitable metric due to its relative insensitivity to the rarity of hydroclimatic extremes. For wet extremes, GraphCast tends to outperform the IFS HRES with the total precipitation of ERA5 data as the ground truth. For warm extremes, Pangu-Weather, GraphCast and FuXi tend to be more skilful than the IFS HRES within a 3-day lead time but become less skilful as lead time increases. Meanwhile, the IFS ENS tends to provide skilful forecasts of both wet and warm extremes at different lead times and at the global scale. Diagnostic plots of forecast time series at selected grid cells show that, at longer lead times, forecasts generated by data-driven models tend to be smoother and less skilful than those generated by physical models. Overall, the extension of the WeatherBench 2 facilitates more comprehensive comparisons of hydroclimatic forecasts and provides useful information for forecast applications.
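As a point of reference for the ROCSS highlighted in the abstract, the following is a minimal NumPy sketch, illustrative only and not the code archived with the paper, of how a ROC skill score can be computed for probabilistic binary forecasts of an extreme event; the function name and the toy data are assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation): ROC skill score (ROCSS)
# for probabilistic binary forecasts of an extreme event, using only NumPy.
# ROCSS = 2 * AUC - 1, so 0 corresponds to a random (no-skill) forecast.
import numpy as np

def rocss(prob_forecast, binary_obs, n_thresholds=101):
    """ROC skill score from forecast probabilities (0..1) and binary
    observations (0/1) of extreme-event occurrence."""
    prob_forecast = np.asarray(prob_forecast, dtype=float)
    binary_obs = np.asarray(binary_obs, dtype=int)
    n_event = binary_obs.sum()
    n_nonevent = binary_obs.size - n_event
    # Start the ROC curve at (0, 0), i.e. a threshold above every forecast.
    hit_rates, false_alarm_rates = [0.0], [0.0]
    for t in np.linspace(1.0, 0.0, n_thresholds):      # sweep from strict to lenient
        warn = prob_forecast >= t                      # "yes" forecasts at this threshold
        hits = np.sum(warn & (binary_obs == 1))
        false_alarms = np.sum(warn & (binary_obs == 0))
        hit_rates.append(hits / n_event)
        false_alarm_rates.append(false_alarms / n_nonevent)
    hr = np.array(hit_rates)
    far = np.array(false_alarm_rates)
    auc = np.sum(np.diff(far) * (hr[:-1] + hr[1:]) / 2.0)  # trapezoidal area under ROC
    return 2.0 * auc - 1.0

# Toy example: 1000 forecast-observation pairs for a 90th-percentile extreme.
rng = np.random.default_rng(0)
obs = (rng.random(1000) < 0.1).astype(int)
probs = np.clip(0.3 * obs + 0.7 * rng.random(1000), 0.0, 1.0)  # imperfect toy forecast
print(f"ROCSS = {rocss(probs, obs):.2f}")
```

For a deterministic (single yes/no) forecast the ROC curve reduces to one point joined to the corners of the diagram, and 2·AUC − 1 collapses to hit rate minus false alarm rate.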
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-3', Anonymous Referee #1, 04 Mar 2025
Dear authors,
I find the idea of extending WeatherBench2 to climatological extremes very interesting. I believe such evaluations would be really valuable for further comparison between numerical and machine-learning-based weather models. However, I personally find the paper more of a summary of possible evaluation metrics demonstrated on WeatherBench2 than an actual extension of WeatherBench2.
The provided codebase and data seem to allow for reproducibility of the results in the paper, but I am not convinced they are easy to reuse for evaluation of new approaches on extreme weather events in the future.
Upon some major revisions, I can see this as a valuable contribution to the WeatherBench2 benchmark.
I have the following detailed comments I would like to be answered and addressed:
1) I think what you propose would be a valuable extension of WeatherBench2. However, what you provide is a set of Jupyter Notebooks. I think it would be better to provide a small library with a set of methods that can be imported and run as part of any other codebase. Would that be feasible?
2) Besides, how do you guarantee future compatibility with WeatherBench2, should it undergo any substantial changes in terms of available datasets or codebase? Would it be possible to make your evaluation scripts a pull request to the WeatherBench2 repository, making them part of the benchmark?
3) You mention GenCast (Price et al., 2025) in your paper, which suggests an alternative evaluation metric for weather extremes borrowed from finance, in particular REV curves. I think it would be an interesting score to include in your evaluation. Could you comment on why it is not included and, if possible, include it?
4) In Section 3.1 you introduce the terminology "hits", "false alarms", "misses" and "correct rejections". Is this the jargon in natural hazards? I would rather suggest the following terminology: "true positives", "false positives", "false negatives" and "true negatives", respectively. This should also be changed anywhere else in the text where the terminology is used.
5) The equations in Table 3 are not understandable without preliminary knowledge. Each equation contains variables that are not explained anywhere in the text. This, in my opinion, needs to be improved, together with the description of each score in the text, which is at the moment rather vague (see also my next comment).
6) On line 153 you introduce HSS, and later on line 161 you introduce ORSS. You describe them as: HSS - accuracy relative to that of the random forecast; ORSS - improvement over the random forecast. It sounds like the two metrics are redundant (the conventional definitions are recalled after this list). Is that the case? If so, why do we need both?
7) Line 171, I would suggest to reformulate the sentence.
8) Line 192 - "As expected, forecasts become less accurate" - why is this expected? You have not motivated anywhere in the text why this should be the case, nor referenced any literature that would explain it. It is mostly the case that forecasts at longer lead times exhibit a strong decrease in performance, but I believe your expectation should be somehow grounded.
9) Line 197 - "As lead time increases, data-driven forecasts can be less skilful than the IFS HRES". This is an interesting observation, without any follow-up argumentation. It would be great to have more insights into this.
10) Figure 5 legend - gray shading says "Warm Extremes". Is that correct or should it be "Wet extremes" since there is precipitation on the y-axis? And the shaded areas are where precipitation is often high.
11) Depending on the extremes that one wants to forecast, the forecast resolution plays an important role. Many extreme precipitation events cannot be detected with models operating on a coarse global grid as they are subject to very local behaviors. Would it be worth extending WeatherBench2 with new datasets that would allow for finer-scale evaluation of some local extreme events? Do you have any insights on which numerical and machine learning models could be used in order to produce more fine grained forecasts?
12) Line 347: "the capability to produce binary forecasts of hydroclimatic extremes warrants further verification" - I would suggest rephrasing
13) Line 357: "total precipitation of ERA5 data is used as the ground truth" - Do you mean ERA5 forecast or reanalysis? I believe only reanalysis data would make sense as a ground truth. I do not think it would be objective to compare to ERA5 precipitation forecast directly as we do not want to match ERA5 forecasting capability, but hopefully improve over it, therefore needing ground-truth data corresponding to reality.
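For reference on point 6: under the conventional 2×2 contingency-table notation (a = hits/true positives, b = false alarms/false positives, c = misses/false negatives, d = correct rejections/true negatives), the two scores are commonly written as

$$\mathrm{HSS} = \frac{2\,(ad - bc)}{(a + c)(c + d) + (a + b)(b + d)}, \qquad \mathrm{ORSS} = \frac{ad - bc}{ad + bc}.$$

Both vanish for a random forecast, but they are not algebraically equivalent: HSS normalises the gain in proportion correct by the accuracy expected by chance, whereas ORSS depends only on the odds ratio ad/bc and is therefore insensitive to the marginal event frequency.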
Best Regards,
Referee
Citation: https://doi.org/10.5194/egusphere-2025-3-RC1
- AC2: 'Reply on RC1', Tongtiegang Zhao, 23 Mar 2025
CEC1: 'Comment on egusphere-2025-3 - No compliance with the policy of Geosci. Model Dev.', Juan Antonio Añel, 21 Mar 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
In your Code and Data Availability section, with only two exceptions, you link your code and data to websites that do not comply with the standards for scientific repositories. These include Google Drives, Google Docs, papers whose availability statements link to GitHub sites (something that is also unacceptable, as our policy clearly states), and links to the main Copernicus portals instead of access to the exact data that you use. Therefore, all these issues fail to comply with our policy, and instead of the current sites that you link or the citations to papers, you must provide the exact repositories (link and DOI) from one of those accepted according to our policy (please check it).
A big issue here is that manuscripts that do not comply with our Code and Data Policy cannot be accepted for Discussions and review. However, your manuscript has ended up here because of an editorial oversight. Because of this, we are granting you a brief time to resolve this situation and reply to this comment with the requested information, which you should then include in any potentially revised version of your manuscript, modifying the Code and Data Availability section accordingly.
Please reply to this comment addressing this situation with the information for the repositories as soon as possible. I have to note that failing to comply with this request will force us to reject your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3-CEC1
AC1: 'Reply on CEC1', Tongtiegang Zhao, 23 Mar 2025
Response:
Executive Editor:
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
In your Code and Data Availability section, with only two exceptions, you link your code and data to websites that do not comply with the standards for scientific repositories. These include Google Drives, Google Docs, papers whose availability statements link to GitHub sites (something that is also unacceptable, as our policy clearly states), and links to the main Copernicus portals instead of access to the exact data that you use. Therefore, all these issues fail to comply with our policy, and instead of the current sites that you link or the citations to papers, you must provide the exact repositories (link and DOI) from one of those accepted according to our policy (please check it).
Thank you very much for clearly pointing to the "Code and Data Policy" of Geoscientific Model Development. After carefully reading the policy and checking the manuscript accordingly, we have deleted all the website links that do not comply with the standards for scientific repositories. We have archived all the code and data on Zenodo:
“Code and data availability
The code and scripts performing all the analysis and plots are archived on Zenodo under https://doi.org/10.5281/zenodo.15067282 (Li and Zhao, 2025a). All the analysis results are archived on Zenodo under https://doi.org/10.5281/zenodo.15067178 (Li and Zhao, 2025b).
The raw data, i.e., forecasts and ground truth data, used in this paper are downloaded from the WeatherBench 2 and are archived on Zenodo under https://doi.org/10.5281/zenodo.15066828 (Li and Zhao, 2025d) and https://doi.org/10.5281/zenodo.15066898 (Li and Zhao, 2025c).
To ensure compatibility with the WeatherBench 2, the code and scripts have been submitted as a pull request to its successor, i.e., WeatherBench-X.” (Page 24, Lines 393 to 401).
A big issue here is that manuscripts that do not comply with our Code and Data Policy cannot be accepted for Discussions and review. However, your manuscript has ended up here because of an editorial oversight. Because of this, we are granting you a brief time to resolve this situation and reply to this comment with the requested information, which you should then include in any potentially revised version of your manuscript, modifying the Code and Data Availability section accordingly.
We are grateful to you for the kind reminder and for granting us the time to address this issue. The code and data are now available on Zenodo:
Li, Q. and Zhao, T.: Code for the extension of the WeatherBench 2 to binary hydroclimatic forecasts (v0.3.0), https://doi.org/10.5281/zenodo.15067282, 2025a.
Li, Q. and Zhao, T.: Data for the extension of the WeatherBench 2 to binary hydroclimatic forecasts (v0.2.0), https://doi.org/10.5281/zenodo.15067178, 2025b.
Li, Q. and Zhao, T.: Data for the extension of the WeatherBench 2 to binary hydroclimatic forecasts: ensemble forecasts for 24h maximum temperature (v0.1.0), https://doi.org/10.5281/zenodo.15066898, 2025c.
Li, Q. and Zhao, T.: Data for the extension of the WeatherBench 2 to binary hydroclimatic forecasts: ensemble forecasts for 24h precipitation (v0.1.0), https://doi.org/10.5281/zenodo.15066828, 2025d.
Please reply to this comment addressing this situation with the information for the repositories as soon as possible. I have to note that failing to comply with this request will force us to reject your manuscript for publication in our journal.
Thank you very much for giving us an opportunity to improve the manuscript. Following the insightful comments and the "Code and Data Policy", we have thoroughly revised the Code and Data Availability section to address the issue and provide the data repositories.
Citation: https://doi.org/10.5194/egusphere-2025-3-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 23 Mar 2025
Dear authors,
Many thanks for addressing this issue so expeditiously. We can consider the current version of your manuscript in compliance with the code and data policy now.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3-CEC2
AC3: 'Reply on CEC2', Tongtiegang Zhao, 24 Mar 2025
Dear Prof. Añel,
Many thanks. It is great to hear the message. This paper is very important to us.
All the best,
Yours sincerely, Tony
Citation: https://doi.org/10.5194/egusphere-2025-3-AC3
RC2: 'Comment on egusphere-2025-3', Anonymous Referee #2, 24 Mar 2025
This manuscript evaluates the performance of leading deterministic deep learning models (GraphCast, Pangu-Weather, and FuXi) against ECMWF’s numerical models, IFS HRES and IFS ENS, in forecasting hydroclimatic extremes. By employing a comprehensive set of skill scores, it provides a detailed assessment of forecast quality. Notably, the study shifts the focus from continuous metrics to binary extremes, expanding upon the work of WeatherBench 2 and offering fresh insights for operational forecasting. In doing so, it also underscores the importance of binary decision-making in high-stakes forecasting scenarios.
I believe this manuscript presents a valuable argument and could make a strong contribution to the evaluation literature on data-driven weather models. However, several key concerns must be addressed before it can be recommended for publication:
1. A clearer justification and explanation of key methodological choices.
2. Visualizations that provide regional or grid-point-level insights for the main evaluation metrics.
3. A more comprehensive discussion of the implications of the findings, including potential limitations that may impact the validity of the conclusions.
Below, I outline my specific concerns and recommendations:
Main concerns
- I find it unclear how global and regional scores are being computed. Are you pooling all data points, defining a single percentile threshold globally, and then taking all data points above that threshold while applying cosine weighting? Or are you defining grid-point-level thresholds, selecting an equal number of extreme data points at each grid point, and then computing an overall score via a cosine-weighted average of individual grid-point scores (a sketch of this second approach follows this list)? Please clarify this in the manuscript. If you are following the first approach, the global scores may not be particularly meaningful, as they would be dominated by data from the warmest/wettest grid points. If so, it would be especially important to provide additional regional or grid-point-level analyses.
- Several symbols and abbreviations in Table 3 are undefined, making it difficult to understand the scores without prior knowledge. Please define all terms explicitly and consider adding a short section introducing the main evaluation metrics used in the study.
- The rationale for selecting specific case studies in Figures 5 and 7 is unclear. Were these grid points chosen because they represent some particular extreme events? Do they highlight specific forecast behavior? If neither, consider moving these figures to the appendix.
- Figure 4 is highly informative, as it provides a grid-point-level comparison of different models using a specific skill metric. Could similar visualizations be provided for additional metrics? At present, most metrics are only analyzed at a global scale, which, while interesting, does not offer any insights into regional model performance.
- The manuscript presents well-justified points of comparison in the discussion, but it would benefit from a clearer articulation of its novelty relative to prior literature. How do these results improve our understanding of the strengths and limitations of data-driven models compared to numerical models? What new insights does this study provide for operational forecasting? How do these findings extend beyond previous evaluation studies?
- While the manuscript discusses some limitations of individual metrics, a broader reflection on the general limitations of evaluating forecasts based solely on binary performance for hydroclimatic extremes would be valuable. For example, it would be useful to acknowledge that binary metrics alone may not fully capture all the qualities of a good forecast, and could also benefit from integration with standard skill metrics to mitigate the risk of the "forecaster’s dilemma" (Lerch, 2017). Additionally, it might be worth discussing why certain models perform particularly well at specific lead times, potentially due to trade-offs between accuracy and forecast activity (Ben Bouallègue and the AIFS team, 2024).
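On the aggregation question raised in the first point above, a minimal sketch of the second approach (grid-point-level percentile thresholds followed by a cosine-latitude-weighted average of per-grid-point scores) could look as follows; the toy fields, the 90th-percentile threshold and the use of proportion correct are assumptions for illustration, not the manuscript's actual configuration.

```python
# Illustrative sketch only: grid-point-level thresholds and cosine-latitude
# weighting when aggregating a binary verification score to a global value.
import numpy as np

rng = np.random.default_rng(0)
n_time, n_lat, n_lon = 365, 32, 64                     # toy dimensions
lat = np.linspace(-89.0, 89.0, n_lat)
obs = rng.gamma(2.0, 2.0, size=(n_time, n_lat, n_lon)) # toy "precipitation"
fcst = obs + rng.normal(0.0, 1.0, size=obs.shape)      # toy forecast with noise

# 1) Grid-point-level threshold: 90th percentile of the local climatology.
thresh = np.percentile(obs, 90, axis=0)                # shape (n_lat, n_lon)
obs_event = obs > thresh
fcst_event = fcst > thresh

# 2) Per-grid-point score: proportion correct as a stand-in for any of the
#    sixteen binary metrics.
score = (obs_event == fcst_event).mean(axis=0)         # shape (n_lat, n_lon)

# 3) Cosine-latitude weighting when averaging to a single global number.
weights = np.cos(np.deg2rad(lat))[:, None] * np.ones((n_lat, n_lon))
global_score = np.average(score, weights=weights)
print(f"Globally averaged proportion correct: {global_score:.3f}")
```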
Minor considerations and typos
- The introduction could further emphasise the necessity of this work. Expanding on why binary forecasts are operationally important and how they complement deterministic or probabilistic forecasts would benefit readers unfamiliar with operational forecasting.
- The manuscript focuses on temperature and wet extremes but does not explain why other extreme variables (e.g., wind) were excluded. A brief justification would be helpful, particularly given that ERA5 may be a more reliable ground truth for other variables than precipitation.
- The choice of approach to statistical testing could do with some more thorough justification. The manuscript currently suggests that the approach follows prior literature, but is a paired t-test valid in this case? For instance, have you verified that the assumption of normality for ROCSS score differences holds (a sketch of such a check follows this list)? A brief explanation of why this method is appropriate would strengthen the argument for this approach.
Section 2.1: For clarity and conciseness, consider removing descriptions of models not used in the analysis.
Line 108: Typo ("Weatherbence 2" should be "WeatherBench 2").
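On the statistical-testing point above, a minimal sketch of a paired t-test on per-grid-cell ROCSS differences, together with a normality check and a non-parametric alternative, might look like this; the score arrays are synthetic placeholders, not values from the manuscript.

```python
# Illustrative sketch only: paired comparison of per-grid-cell ROCSS values
# from two models, with a normality check on the paired differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rocss_model_a = rng.normal(0.55, 0.10, size=500)       # synthetic per-grid-cell scores
rocss_model_b = rng.normal(0.50, 0.10, size=500)

diff = rocss_model_a - rocss_model_b
shapiro_stat, shapiro_p = stats.shapiro(diff)          # normality of the differences
t_stat, t_p = stats.ttest_rel(rocss_model_a, rocss_model_b)
w_stat, w_p = stats.wilcoxon(diff)                     # non-parametric alternative

print(f"Shapiro-Wilk p = {shapiro_p:.3f} (normality of paired differences)")
print(f"Paired t-test: t = {t_stat:.2f}, p = {t_p:.3g}")
print(f"Wilcoxon signed-rank: p = {w_p:.3g}")
```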
References
Lerch, Sebastian, et al. “Forecaster’s Dilemma: Extreme Events and Forecast Evaluation.” Statistical Science, vol. 32, no. 1, 2017, pp. 106–27. JSTOR, http://www.jstor.org/stable/26408123. Accessed 24 Mar. 2025.
Ben Bouallègue and the AIFS team, "Accuracy versus Activity", 2024, doi:10.21957/8b50609a0f.
Citation: https://doi.org/10.5194/egusphere-2025-3-RC2
- AC4: 'Reply on RC2', Tongtiegang Zhao, 27 Mar 2025
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 218 | 58 | 8 | 284 | 8 | 5 |
Viewed (geographical distribution)
| Country | Rank | Views | % |
|---|---|---|---|
| United States of America | 1 | 101 | 34 |
| China | 2 | 52 | 17 |
| Sweden | 3 | 18 | 6 |
| Switzerland | 4 | 17 | 5 |
| France | 5 | 13 | 4 |