This work is distributed under the Creative Commons Attribution 4.0 License.
Evaluation of forecasts by a global data-driven weather model with and without probabilistic post-processing at Norwegian stations
Abstract. Over the last two years, tremendous progress has been made in global data-driven weather models trained on numerical weather prediction (NWP) re-analysis data. The most recent models, trained on the ERA5 re-analysis at 0.25° resolution, demonstrate forecast quality on par with ECMWF's high-resolution model across a wide selection of verification metrics. In this study, one of these models, Pangu-Weather, is compared to several NWP models, with and without probabilistic post-processing, for 2-meter temperature and 10-meter wind speed forecasting at 183 Norwegian SYNOP stations up to +60 hours ahead. The NWP models included are the ECMWF HRES and ECMWF ENS models and the Harmonie-AROME ensemble model MEPS with 2.5 km spatial resolution. Results show that the global models perform on the same level, with Pangu-Weather slightly better than the ECMWF models for temperature and slightly worse for wind speed. The MEPS model clearly provided the best forecasts for both parameters. Post-processing improved forecast quality considerably for all models, but to a larger extent for the coarse-resolution global models due to their stronger systematic deficiencies. Apart from this, the main characteristics of the scores were more or less the same with and without post-processing. Our results thus confirm the conclusions of other studies that global data-driven models are promising for operational weather forecasting.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Journal article(s) based on this preprint
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-2838', Anonymous Referee #1, 02 Jan 2024
Over the past two years, there has been rapid and unprecedented progress in data-driven, AI-based models for weather prediction. The paper evaluates the forecast quality of Pangu-Weather, a state-of-the-art AI weather model, and two physics-based NWP models: the ECMWF ensemble, and MEPS, a high-resolution limited area model. The forecasts are compared based on temperature and wind speed data from observation stations in Norway. Overall, the authors find that the unprocessed forecasts of the MEPS model are superior to those of the ECMWF ensemble and the Pangu-Weather model (which perform similarly). After post-processing, the relative differences in terms of the evaluation metrics are much smaller, with slight advantages for the post-processed MEPS forecasts.
In my view, the paper is timely and addresses interesting and important research questions given the recent rise in data-driven weather forecasting. The two main contributions are that
- the paper provides (potentially the first, but at least one of the first) comparisons of AI-based and physics-based weather forecasting models based on station data (rather than the commonly used comparisons based on gridded ERA5 data);
- the paper assesses and quantifies the effect of post-processing on forecasts from AI-based weather models.

The paper is well written; the findings and conclusions are presented clearly and are supported by the presented results. Some minor comments, questions and suggestions are summarized below.
General comments:
- To me, the most interesting contribution was on the effect of post-processing on the AI-based vs. physics-based weather models. Though not entirely unexpected, I found it interesting to see that after post-processing, the differences between the ECMWF and Pangu-Weather forecasts seem to be minimal. This contribution of the paper might be strengthened by highlighting it more, for example in the ‘Conclusions’ section.
- Regarding the first comment from above, it might have been interesting to add some notion of ‘significance’ or uncertainty to the observed score differences, in particular after post-processing. I would expect that (potentially except for parts of the MEPS scores) the differences likely are not significant.
- In terms of post-processing methods, only BQN is applied. Even though I assume that this is not the case, it might be interesting to discuss whether there are any reasons to expect different effects of post-processing for other post-processing approaches.
- The BQN post-processed forecasts only use weather variables directly related to the target variable. However, it has been demonstrated in various papers (on physics-based weather models) that neural network-based post-processing methods such as BQN can benefit substantially from including additional predictors. Would you expect similar improvements when utilizing additional predictors from AI-based weather models? (A generic BQN sketch follows this list.)
- In the presentation of the results in Figures 2 and 3 (and 4 and 5), I found it somewhat confusing that the line types do not consistently refer to raw and post-processed forecasts. Personally, I would have found it more instructive if, for example, scores for deterministic forecasts were always drawn in solid lines and scores for probabilistic forecasts in dashed lines (or vice versa).
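To fix ideas on the BQN approach discussed above, the following is a minimal, generic sketch of a Bernstein quantile network in PyTorch. It is not the authors' implementation: the layer sizes, Bernstein degree and quantile levels are illustrative assumptions, and additional predictors would simply enter as extra columns of the input `x`.

```python
import torch
import torch.nn as nn

class BQN(nn.Module):
    """Generic Bernstein quantile network sketch (illustrative only).

    Maps a predictor vector to Bernstein coefficients; a cumulative sum of
    non-negative increments keeps the coefficients, and hence the predicted
    quantile function, non-decreasing in the quantile level tau.
    """

    def __init__(self, n_predictors, degree=12, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_predictors, hidden), nn.ReLU(),
            nn.Linear(hidden, degree + 1),
        )
        # Bernstein basis evaluated at the fixed levels used for training.
        taus = torch.linspace(0.01, 0.99, 99)
        j = torch.arange(degree + 1, dtype=torch.float32)
        n = torch.tensor(float(degree))
        log_binom = (torch.lgamma(n + 1) - torch.lgamma(j + 1)
                     - torch.lgamma(n - j + 1))
        basis = (torch.exp(log_binom)
                 * taus[:, None] ** j * (1 - taus[:, None]) ** (n - j))
        self.register_buffer("taus", taus)
        self.register_buffer("basis", basis)  # shape (99, degree + 1)

    def forward(self, x):
        raw = self.net(x)
        # First coefficient is unconstrained; later increments are >= 0.
        inc = torch.cat([raw[:, :1], nn.functional.softplus(raw[:, 1:])], dim=1)
        coef = torch.cumsum(inc, dim=1)
        return coef @ self.basis.T  # predicted quantiles at self.taus

def quantile_loss(pred_q, y, taus):
    """Pinball (quantile) loss averaged over all levels and samples."""
    err = y[:, None] - pred_q
    return torch.mean(torch.maximum(taus * err, (taus - 1.0) * err))
```

Training would minimize `quantile_loss` with a standard optimizer; the cumulative-sum construction guarantees non-crossing quantiles by making the Bernstein coefficients non-decreasing.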
Citation: https://doi.org/10.5194/egusphere-2023-2838-RC1
AC1: 'Reply on RC1', John Bjørnar Bremnes, 31 Jan 2024
We would like to thank the reviewer for the feedback on the article. Below is our initial response:
1) Yes, we will add this to the concluding section.
2) Testing statistical significance is challenging due to complex dependencies in space and time, and there are several ways forward; the outcome would therefore depend on how such tests are set up. However, we agree that it would be useful to have some idea of the significance of the differences. We have chosen to focus on site- and lead-time-specific performance by applying the Diebold-Mariano (1995) test separately for each site and lead time and using the Benjamini-Hochberg (1995) procedure to control the false discovery rate at the given level. In the tables below, performance in terms of CRPS for the post-processed forecasts is considered. The reported figures are the percentages of the 183×10 site and lead time combinations where the model in a given row is significantly better than the models in the columns at the 0.05 nominal level. We will include more details on the testing procedure and update the text accordingly in the revised version of the manuscript. We will also consider making a test on a more aggregated level: on the combined site and lead time level there are only up to 365 forecasts in each DM test, while in total there are 602,909 forecasts in the dataset for evaluation.

3) Based on the research literature, it is reasonable to expect that other post-processing methods could provide forecasts of about the same skill. The choice of input data could, however, make a noticeable difference. In this study, only the most relevant variable, interpolated to the site at the given lead time, is used as input. As mentioned in 4), more input variables could very likely improve the performances. The same goes for including forecasts in a neighborhood around the given point in space and time. Further, the latter may have different effects on the various NWP models: for example, the high-resolution MEPS model (2.5 km) could benefit more from this than the coarse-resolution models, since high-resolution forecasts are more prone to phase shifts in time and space with increasing lead time, in particular for wind speed.

4) Yes, including more variables in the input generally improves the scores. We do not see any reason why this should not be the case for data-driven AI models as well.

5) We will consider whether there is a better alternative.

Citation: https://doi.org/10.5194/egusphere-2023-2838-AC1
RC2: 'Comment on egusphere-2023-2838', Anonymous Referee #2, 12 Apr 2024
This manuscript is very well written. I like its length and clarity, and I believe it adds further evidence that ML/AI will revolutionize weather prediction. I have one minor technical comment (below); otherwise, I think the manuscript is well done and ready for full publication.
Page 3: Add a (ME) after mean error (second line from bottom) for consistency with other metric definitions and Table 1.
Citation: https://doi.org/10.5194/egusphere-2023-2838-RC2
AC2: 'Reply on RC2', John Bjørnar Bremnes, 19 Apr 2024
Will be done.
Citation: https://doi.org/10.5194/egusphere-2023-2838-AC2
RC3: 'Comment on egusphere-2023-2838', Anonymous Referee #2, 12 Apr 2024
You conclude that the models, including Pangu-Weather, are considerably less skillful than the high-resolution MEPS model.
a) Does this make a strong case for a high-resolution/limited area re-analysis dataset?
b) Along those same lines, given a high-resolution re-analysis dataset to train on, would you expect that ML/AI models will be equally competitive or better than traditional high-resolution NWP models and ensembles, and that the ML/AI approach will be equally skillful at predicting higher-resolution, higher-impact meteorological phenomena? Or would high-impact meteorological prediction be better served by additional post-processing of the current resolution of Pangu-Weather and other global ML/AI models?
Citation: https://doi.org/10.5194/egusphere-2023-2838-RC3
AC3: 'Reply on RC3', John Bjørnar Bremnes, 19 Apr 2024
Concerning a), one way forward is to build global ML models with higher resolution over the area of interest, for example using a graph-based model with a stretched grid. Since re-analysis data at about 2–5 km spatial resolution are available for parts of the Earth, it would indeed be possible to combine these with data from ERA5 at 0.25° resolution in a single model. Work in this direction is currently in progress, but to our knowledge no results have yet been shown. At this stage, it is therefore difficult to draw any conclusions on either deterministic or probabilistic/ensemble initial-state approaches trained on high-resolution data. It could be that the relative merit of such ML models varies by parameter.
The role or scope of post-processing methods is also not obvious, in particular if there are no additional reference/target data available.
Citation: https://doi.org/10.5194/egusphere-2023-2838-AC3
Peer review completion
Journal article(s) based on this preprint
Viewed
Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
216 | 0 | 1 | 217 | 0 | 0
Co-authors: Thomas N. Nipen and Ivar A. Seierstad