Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model
Abstract. Recent advances in AI-based weather prediction have led to artificial intelligence weather prediction (AIWP) models whose forecast skill is competitive with that of traditional numerical weather prediction (NWP) models, at substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models.
We introduce a verification framework that combines spatial verification methods and proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores evaluating extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model outperformed the AIWP model in predicting extreme precipitation events only at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability in predicting heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.
This article contributes to the discussion on the performance of AI models for weather forecasting, with a particular focus on their ability to predict extreme precipitation events. The methodology incorporates several novel ideas in verification, including spatial verification using a neighbourhood pseudo-ensemble, a threshold-weighted CRPS, and a decomposition of the CRPS using post-processing.
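To make the combination of these two ideas concrete, the following is a minimal sketch (not the authors' implementation) of how a neighbourhood pseudo-ensemble could be scored with a threshold-weighted CRPS. It assumes the energy form of the ensemble CRPS, the standard chaining function v(z) = max(z, t) for emphasising values above a threshold t, and a 3x3 neighbourhood; the function names and the example values are illustrative only.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Energy form of the ensemble CRPS:
    mean |x_i - y| - 0.5 * mean |x_i - x_j|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

def tw_crps_ensemble(members, obs, threshold):
    """Threshold-weighted CRPS via the chaining function
    v(z) = max(z, t): transform the ensemble members and the
    observation, then score the transformed values as usual."""
    v = lambda z: np.maximum(np.asarray(z, dtype=float), threshold)
    return crps_ensemble(v(members), v(obs))

# HiRA-style pseudo-ensemble: grid-point values in a neighbourhood
# around the observation location are pooled as ensemble members.
field = np.array([[0.0,  2.0,  5.0],
                  [1.0, 12.0,  3.0],
                  [0.0,  4.0, 20.0]])   # 3x3 neighbourhood (mm)
pseudo_ensemble = field.ravel()         # 9 pseudo-members
score = tw_crps_ensemble(pseudo_ensemble, obs=15.0, threshold=10.0)
```

With threshold = 0 and non-negative precipitation amounts, the threshold-weighted score reduces to the ordinary CRPS, which is a convenient sanity check for an implementation.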
The paper reads very well overall. The data and the verification methodology are generally well described, and the figures are clear and easy to read. However, I found myself going back and forth between Figures 3, 4, and 7 to compare the results. Perhaps the authors could find a way to keep the results from Figure 3 visible in Figures 4 and 7 (and those from Figure 4 in Figure 7). This would make it easier to follow the presentation of the results.
While I find the study very interesting, I would encourage the authors to add a couple of discussion points:
Minor comments:
References:
Ben Bouallegue et al. (2026), SEEPS4ALL: an open dataset for the verification of daily precipitation forecasts using station climate statistics. https://doi.org/10.5194/essd-18-713-2026
Jin et al. (2025), WeatherReal: A Benchmark Based on In-Situ Observations for Evaluating Weather Models. https://doi.org/10.48550/arXiv.2409.09371
Siegert, S. (2017), Simplifying and generalising Murphy's Brier score decomposition. Q.J.R. Meteorol. Soc., 143: 1178-1183. https://doi.org/10.1002/qj.2985
Theis, S.E., Hense, A. and Damrath, U. (2005), Probabilistic precipitation forecasts from a deterministic model: a pragmatic approach. Met. Apps, 12: 257-268. https://doi.org/10.1017/S1350482705001763