https://doi.org/10.48550/arXiv.2510.25045
04 Feb 2026
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model

Nicholas Loveday and Tracy Hertneky

Abstract. Recent advances in AI-based weather prediction have led to the development of artificial intelligence weather prediction (AIWP) models with competitive forecast skill compared to traditional numerical weather prediction (NWP) models, but with substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models.

We introduce a verification framework that combines spatial verification methods and proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores evaluating extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model outperformed the AIWP model in predicting extreme precipitation events only at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability in predicting heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.
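To make the idea of combining a HiRA-style neighbourhood with a threshold-weighted score concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the forecast values in a square window around an observation are treated as a pseudo-ensemble, and that the score is the threshold-weighted CRPS with weight 1 above a threshold and 0 below it, computed with the chaining function v(z) = max(z, t). The abstract does not state which threshold-weighted rule or neighbourhood shape is used, so the function names, the window shape, and the choice of weight are all illustrative assumptions.

```python
import numpy as np


def neighbourhood_members(field, row, col, half_width):
    """Forecast values in a square window centred on (row, col).

    The window is clipped at the grid edges. Treating these values as a
    pseudo-ensemble is the HiRA-style assumption made in this sketch.
    """
    r0, r1 = max(row - half_width, 0), min(row + half_width + 1, field.shape[0])
    c0, c1 = max(col - half_width, 0), min(col + half_width + 1, field.shape[1])
    return field[r0:r1, c0:c1].ravel()


def tw_crps_ensemble(members, obs, threshold):
    """Threshold-weighted CRPS of an ensemble for the weight 1{z >= threshold}.

    Uses the chaining function v(z) = max(z, threshold), so values below the
    threshold are indistinguishable and only behaviour above it is scored.
    """
    v_members = np.maximum(np.asarray(members, dtype=float), threshold)
    v_obs = max(float(obs), threshold)
    # Mean absolute error between transformed members and observation.
    term1 = np.mean(np.abs(v_members - v_obs))
    # Half the mean absolute difference between all pairs of members.
    term2 = 0.5 * np.mean(np.abs(v_members[:, None] - v_members[None, :]))
    return term1 - term2


# Hypothetical 3 x 3 precipitation field (mm) and a point observation.
field = np.array([[0.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0],
                  [0.0, 0.0, 30.0]])
obs, threshold = 20.0, 5.0

# Score with a single grid point and with a 3 x 3 neighbourhood.
score_point = tw_crps_ensemble(neighbourhood_members(field, 1, 1, 0), obs, threshold)
score_3x3 = tw_crps_ensemble(neighbourhood_members(field, 1, 1, 1), obs, threshold)
```

In this sketch, varying `half_width` changes the pseudo-ensemble and therefore the score, which is one way to compare models at equivalent neighbourhood sizes. Lower scores are better, and a forecast-observation pair lying entirely below the threshold scores zero.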


Status: open (until 01 Apr 2026)


Latest update: 12 Feb 2026
Short summary
This study introduces a verification method that accounts for differences in grid resolution when evaluating extreme event forecasts. We apply it to an artificial intelligence-based weather prediction model and a high-resolution numerical weather prediction model. Results show that, when assessed on equivalent neighbourhood scales, the high-resolution numerical weather prediction model outperforms the AI system in predicting extreme precipitation only at short lead times.