Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model

Loveday, Nicholas; Hertneky, Tracy

doi:10.48550/arXiv.2510.25045

Preprints

04 Feb 2026

Abstract. Recent advances in AI-based weather prediction have led to the development of artificial intelligence weather prediction (AIWP) models with competitive forecast skill compared to traditional numerical weather prediction (NWP) models, but with substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models.

We introduce a verification framework that combines spatial verification methods and proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores evaluating extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model only outperformed the AIWP model in predicting extreme precipitation events at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability in predicting heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.
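As a rough illustration of the idea described in the abstract, the following minimal sketch builds a neighbourhood pseudo-ensemble around an observation location and scores it with a threshold-weighted CRPS. It is not the authors' implementation: the function names, the square window, the toy field, and the indicator weight w(t) = 1{t >= threshold} are all assumptions made for this example. The twCRPS is computed with the standard chaining-function form, using v(x) = max(x, threshold).

```python
import numpy as np


def neighbourhood_ensemble(field, i, j, half_width):
    """Return grid values in a square window centred on (i, j).

    The window is clipped at the domain edges. The flattened values are
    treated as pseudo-ensemble members (hypothetical helper for this sketch).
    """
    i0, i1 = max(i - half_width, 0), min(i + half_width + 1, field.shape[0])
    j0, j1 = max(j - half_width, 0), min(j + half_width + 1, field.shape[1])
    return field[i0:i1, j0:j1].ravel()


def crps_ensemble(members, obs):
    """CRPS of the empirical ensemble CDF: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    spread = np.abs(members[:, None] - members[None, :])
    return np.mean(np.abs(members - obs)) - 0.5 * np.mean(spread)


def tw_crps_ensemble(members, obs, threshold):
    """Threshold-weighted CRPS with weight w(t) = 1{t >= threshold}.

    Assumes an indicator weight, for which the chaining function is
    v(x) = max(x, threshold); the score is the CRPS of the transformed values.
    """
    members = np.maximum(np.asarray(members, dtype=float), threshold)
    return crps_ensemble(members, max(obs, threshold))


# Toy case: heavy rain forecast one grid cell away from where it was observed.
field = np.zeros((5, 5))
field[2, 3] = 20.0
obs = 20.0  # observation located at grid cell (2, 2)

point_score = tw_crps_ensemble(neighbourhood_ensemble(field, 2, 2, 0), obs, 10.0)
hood_score = tw_crps_ensemble(neighbourhood_ensemble(field, 2, 2, 1), obs, 10.0)
```

In this toy case the 3x3 neighbourhood contains the displaced heavy-rain cell, so `hood_score` is lower (better) than `point_score`, which illustrates why scores and model rankings can depend on neighbourhood size.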

Received: 21 Nov 2025 – Discussion started: 04 Feb 2026

Status: final response (author comments only)

RC1: 'Comment on egusphere-2025-5796', Anonymous Referee #1, 16 Mar 2026
This article contributes to the discussion on the performance of AI models for weather forecasting, with a particular focus on their ability to predict extreme precipitation events. The methodology incorporates several novel ideas in verification, including spatial verification using a neighbourhood pseudo‑ensemble, a threshold‑weighted CRPS, and a decomposition of the CRPS using post‑processing.
The paper reads very well overall. The data and the verification methodology are generally well described, and the figures are clear and easy to read. However, I found myself going back and forth between Figures 3, 4, and 7 to compare the results. Perhaps the authors could find a way to keep the results from Figure 3 visible in Figures 4 and 7 (and those from Figure 4 in Figure 7). This would make it easier to follow the presentation of the results.
While I find the study very interesting, I would encourage the authors to add a couple of discussion points:
In Section 3.1, it would be important, in my opinion, to discuss representativeness and to what extent a grid-box average can be directly compared to a point observation. In Section 6 (Model climatology), the Q-Q plot is quite compelling. How much of the off-diagonal behaviour is due to smoothness in the forecast versus representativeness issues? A discussion of this could be interesting.

About the climatology used to define the thresholds, using ERA5 instead of long time series of observations touches upon the representativeness issue too. Also, a question is whether the thresholds are season dependent or constant throughout the year. If the latter is correct, what are the implications for the interpretation of the results?

In Section 5, you mention that the CRPS is the integral of Brier scores (BS), with the BS for small thresholds contributing most to the overall score. Could you comment on how the twCRPS works in that respect? Would the main contribution to this score be the BS for the 99th percentile in your example?

You show the discrimination ability based on the twCRPS. It would be interesting to compute the discrimination based on the “full” CRPS too, which would help the interpretation of the results. Also, one could simply show the results of the post-processed forecasts (instead of showing the discrimination); that would ease the comparison with the other results and help assess the impact of post-processing more directly.
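
For readers following the question above about which thresholds contribute to each score, the standard threshold decompositions can be written as below. This is a sketch under an assumption: the weight is taken to be an indicator function above a threshold \tau (for example, a climatological 99th percentile), and the weight actually used in the manuscript may differ. Here F is the forecast CDF, y the observation, and the integrand is the Brier score for the event "precipitation does not exceed t".

```latex
\begin{align}
\mathrm{CRPS}(F, y)
  &= \int_{-\infty}^{\infty} \bigl(F(t) - \mathbb{1}\{y \le t\}\bigr)^2 \,\mathrm{d}t, \\
\mathrm{twCRPS}(F, y)
  &= \int_{-\infty}^{\infty} w(t)\,\bigl(F(t) - \mathbb{1}\{y \le t\}\bigr)^2 \,\mathrm{d}t, \\
\text{and, assuming } w(t) = \mathbb{1}\{t \ge \tau\}:\quad
\mathrm{twCRPS}(F, y)
  &= \int_{\tau}^{\infty} \bigl(F(t) - \mathbb{1}\{y \le t\}\bigr)^2 \,\mathrm{d}t .
\end{align}
```

Under that assumption, every threshold at or above \tau contributes with equal weight and thresholds below \tau contribute nothing, so the score is not determined by the Brier score at \tau alone.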

Minor comments:
Section 2.1. A couple of studies based on precipitation observations have been published. See, for example, Jin et al. (2025) and Ben Bouallegue et al. (2026).

Section 3.1. When mentioning the pseudo-ensemble, I would cite the original paper describing this idea: Theis et al. (2005).

Figures 3, 4, 7, 8. The text does not explain how the confidence intervals are computed (the Diebold-Mariano test is only mentioned in the Code Availability section).

Section 7. The CORP-like decomposition approach: it is not explained how it works. Equation 9 reminds me of Equation 15 in Siegert (2017). Is it the same idea?

References:
Ben Bouallegue et al. (2026), SEEPS4ALL: an open dataset for the verification of daily precipitation forecasts using station climate statistics. https://doi.org/10.5194/essd-18-713-2026
Jin et al. (2025), WeatherReal: A Benchmark Based on In-Situ Observations for Evaluating Weather Models. https://doi.org/10.48550/arXiv.2409.09371
Siegert, S. (2017), Simplifying and generalising Murphy's Brier score decomposition. Q.J.R. Meteorol. Soc., 143: 1178-1183. https://doi.org/10.1002/qj.2985
Theis, S.E., Hense, A. and Damrath, U. (2005), Probabilistic precipitation forecasts from a deterministic model: a pragmatic approach. Met. Apps, 12: 257-268. https://doi.org/10.1017/S1350482705001763
Citation: https://doi.org/10.5194/egusphere-2025-5796-RC1
RC2: 'Comment on egusphere-2025-5796', Anonymous Referee #2, 20 Mar 2026

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2025-5796/egusphere-2025-5796-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2025-5796-RC2
CEC1: 'Comment on egusphere-2025-5796 - No compliance with the policy of the journal', Juan Antonio Añel, 25 Mar 2026

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, I would like to note that in the preprint of your manuscript you do not provide important information regarding the deposit of some code and data for your manuscript. You have provided information about a repository (https://zenodo.org/records/17667747) to the editors internally. However, such information must be public, and should be in your manuscript, and therefore I am making it public here.
Second, the Code and Data Availability section in your manuscript does not provide a repository for the GraphCast and High-Resolution Rapid Refresh models, which you use in your work. Additionally, to access the data, you have linked sites that are not trusted long-term archival repositories, and therefore are not acceptable according to the policy of the journal.
We cannot accept this: it is forbidden by our policy, and your manuscript should never have been accepted for Discussions or peer review given this lack of compliance. Our policy clearly states that all the code and data necessary to replicate a manuscript must be published openly and freely to anyone before submission. The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance and replicability of the published papers for years after their publication. We cannot have manuscripts under discussion that do not comply with our policy.
Therefore, we are granting you a short time to resolve this situation. You have to reply to this comment promptly with the information for the repositories containing all the models, code and data that you use to produce and replicate your manuscript. The reply must include the link and permanent identifier (e.g. DOI). Also, any future version of your manuscript must include the modified section with the new information. The 'Code and Data Availability' section must also be modified to cite the new repository locations, and corresponding references added to the bibliography.
Additionally, I see that two reviewers have already posted comments on your manuscript. I ask you to refrain from addressing the comments of any reviewer until the compliance of your manuscript with the Code and Data Policy of the journal is clarified and resolved.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel

Geosci. Model Dev. Exec. Editor

Citation: https://doi.org/10.5194/egusphere-2025-5796-CEC1
- AC1: 'Reply on CEC1', Nicholas Loveday, 01 Apr 2026
  Dear Juan,
  
  Would the following statement meet your needs? We believe that it should, judging by other papers published in GMD that used the same or similar data.
  
  "One-minute ASOS data was retrieved from https://mesonet.agron.iastate.edu/request/asos/1min.phtml which contains an archive of data provided by the National Climatic Data Center. HRRRv4 uses the Weather Research and Forecasting (WRF) model v3.9.1, which is available at https://www2.mmm.ucar.edu/wrf/users/download/get_source.html (National Center for Atmospheric Research, 2025), with the namelist provided at https://rapidrefresh.noaa.gov/hrrr/wrf.nl.txt (National Oceanic and Atmospheric Administration, 2025) and is also available from https://hrrrzarr.s3.amazonaws.com/index.html. GraphCast-GFS is from NOAA’s Open Data Dissemination (NODD) program https://doi.org/10.1175/BAMS-D-24-0057.1 and can be retrieved from https://noaa-oar-mlwp-data.s3.amazonaws.com/index.html. ERA5 data is available in the Copernicus data store (doi.org/10.24381/cds.adbb2d47) and from https://console.cloud.google.com/storage/browser/weatherbench2/data/era5.
  
  The verification measures and statistical tests used in this paper (e.g., twCRPS) were implemented in the scores package (https://doi.org/10.5281/zenodo.18638494). All code to reproduce the results and figures in this paper is available at https://zenodo.org/records/17667747."
  
  Other notes to the editor:
  
  If it is required to meet the data availability requirements, we can try to put the subset of the GraphCast-GFS, HRRR, and observations data that we used for the paper on Zenodo. We believe this would address any remaining issues with data availability in our statement above. Could you please confirm if we need to do this and if it would address your concerns?
  
  We will update the scores Zenodo link and the paper code Zenodo link to point to the correct versions when we resubmit a revised manuscript.
  
  Thanks,
  
  Nick
  
  Citation: https://doi.org/10.5194/egusphere-2025-5796-AC1
  - CEC2: 'Reply on AC1', Juan Antonio Añel, 01 Apr 2026
    
    Dear authors,
    Many thanks for the reply. Unfortunately, your proposed solution does not address the outstanding issues that I pointed out in my previous comment. We must insist that you publish all the code and data openly, and reply to this comment with the information about them. It is not enough to correct this in a revised version of your manuscript. The information requested is necessary for the Discussions stage and peer review.
    The new text that you propose continues to cite multiple sites that are not suitable for long-term storage of assets linked to the publication of a paper. Only the two Zenodo repositories that you have mentioned are acceptable. The iastate.edu, ucar.edu, noaa.gov, and amazonaws.com sites, and the sites linked in the BAMS paper you cite, are not acceptable. They do not fulfil GMD’s requirements for a persistent data archive because:
    - They do not appear to have a published policy for data preservation over many years or decades (some flexibility exists over the precise length of preservation, but the policy must exist).
    
    - They do not appear to have a published mechanism for preventing authors from unilaterally removing material. Archives must have a policy which makes removal of materials only possible in exceptional circumstances and subject to an independent curatorial decision.
    
    - They do not appear to issue a persistent identifier such as a DOI or Handle for each precise dataset.
    If for any of them we have missed a published policy which does in fact address this matter satisfactorily, please post a response linking to it. If you have any questions about this issue, please post them in a reply.
    I must insist that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
    Juan A. Añel
    
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2025-5796-CEC2
    
    AC2: 'Reply on CEC2', Nicholas Loveday, 08 Apr 2026
    
    Dear Juan,
    Before I upload large amounts of data to Zenodo, could you please let me know if the following will be sufficient?
    - Put all observations used on Zenodo.
    - Put the entire subset of GraphCast-GFS and HRRR data required to reproduce the results on Zenodo.

    I think that the other data and code meet the requirements already:
    - ERA5 data is already on the Copernicus data store (doi.org/10.24381/cds.adbb2d47).
    - Scores code (that I implemented for this work) is on Zenodo.
    - Code to reproduce all data-wrangling, calculations, and plotting is on Zenodo. I will update this when I respond to the reviewers' feedback.
    
    Could you please let me know if this is sufficient? If it is, I will notify you when the data is uploaded to Zenodo.
    Regards,
    Nick
    
    Citation: https://doi.org/10.5194/egusphere-2025-5796-AC2

Short summary

This study introduces a verification method that accounts for differences in grid resolution when evaluating extreme event forecasts. We apply it to an artificial intelligence-based weather prediction model and a high-resolution numerical weather prediction model. Results show that, when assessed on equivalent neighborhood scales, the high-resolution numerical weather prediction model only outperforms the AI system for short lead times in predicting extreme precipitation.

