Huge Ensembles Part I: Design of Ensemble Weather Forecasts using Spherical Fourier Neural Operators
Abstract. Simulating low-likelihood high-impact extreme weather events in a warming world is a significant and challenging task for current ensemble forecasting systems. While these systems presently use up to 100 members, larger ensembles could enrich the sampling of internal variability. They may capture the long tails associated with climate hazards better than traditional ensemble sizes. Due to computational constraints, it is infeasible to generate huge ensembles (comprised of 1,000–10,000 members) with traditional, physics-based numerical models. In this two-part paper, we replace traditional numerical simulations with machine learning (ML) to generate hindcasts of huge ensembles. In Part I, we construct an ensemble weather forecasting system based on Spherical Fourier Neural Operators (SFNO), and we discuss important design decisions for constructing such an ensemble. The ensemble represents model uncertainty through perturbed-parameter techniques, and it represents initial condition uncertainty through bred vectors, which sample the fastest growing modes of the forecast. Using the European Centre for Medium-Range Weather Forecasts Integrated Forecasting System (IFS) as a baseline, we develop an evaluation pipeline composed of mean, spectral, and extreme diagnostics. Using large-scale, distributed SFNOs with 1.1 billion learned parameters, we achieve calibrated probabilistic forecasts. As the trajectories of the individual members diverge, the ML ensemble mean spectra degrade with lead time, consistent with physical expectations. However, the individual ensemble members' spectra stay constant with lead time. Therefore, these members simulate realistic weather states during the rollout, and the ML ensemble thus passes a crucial spectral test in the literature. The IFS and ML ensembles have similar Extreme Forecast Indices, and we show that the ML extreme weather forecasts are reliable and discriminating. These diagnostics ensure that the ensemble can reliably simulate the time evolution of the atmosphere, including low-likelihood high-impact extremes. In Part II, we generate a huge ensemble initialized each day in summer 2023, and we characterize the statistics of extremes.
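For readers unfamiliar with the breeding technique mentioned in the abstract, a minimal sketch of one bred-vector cycle (in Python, with a hypothetical `model_step` callable and illustrative state arrays; not the authors' exact implementation) looks roughly like this:

```python
import numpy as np

def bred_vector_pair(model_step, analyses, seed_noise, amplitude):
    """Illustrative bred-vector cycle (sketch only, not the authors' code).

    model_step: hypothetical callable advancing one atmospheric state by one step
    analyses:   list of analysis states x_0, ..., x_T (NumPy arrays)
    seed_noise: initial random perturbation with the same shape as a state
    amplitude:  target perturbation norm used for rescaling each cycle
    """
    perturbation = seed_noise
    for analysis in analyses[:-1]:
        # Integrate the perturbed and unperturbed analyses forward one step,
        perturbed = model_step(analysis + perturbation)
        control = model_step(analysis)
        # take their difference, and rescale it back to the target amplitude.
        perturbation = perturbed - control
        perturbation *= amplitude / np.linalg.norm(perturbation)
    # After several cycles the perturbation projects onto the fastest-growing
    # modes; a centered pair of initial conditions is then x_T +/- perturbation.
    return analyses[-1] + perturbation, analyses[-1] - perturbation
```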
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2024-2420', Juan Antonio Añel, 29 Oct 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
We have checked the Code and Data Availability section in your manuscript and, beyond the two Zenodo repositories you share, many of the assets used in your work are posted on servers that we cannot accept. You mention several GitHub sites, which are not valid repositories, and even the link to one of them (the one related to the scoring) does not work.
Therefore, the current situation with your manuscript is irregular. Please publish your code in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.
If you do not fix this problem, we will have to reject your manuscript for publication in our journal.
Also, in any potentially revised manuscript you must include the modified 'Code and Data Availability' section with the DOI of the code (and another DOI for the dataset, if necessary).
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2420-CEC1
RC1: 'Comment on egusphere-2024-2420', Peter Düben, 02 Nov 2024
Part 1 and 2 are both interesting papers that document the development and use of a machine-learned ensemble weather forecast model with an enormous number of ensemble members. The papers fit well into GMD, but I think that they should be revised following the comments below.
All the best,
Peter

Part 1:
The paper documents a very interesting result, as it shows that an SFNO-type model can be used to develop a competitive ensemble forecast system when combined with bred vectors and multi-checkpointing.
Major comments:
- Page 2: The ML model has “orders-of-magnitudes” lower computational cost. Is this really true? More than a factor of 10? This could only be possible if the IO cost (which will stay the same) is considered to be less than 10% of the overall cost (also see the comment on Part 2). And what is the “cost”? Time, energy, or hardware purchase?
- P6, paragraph starting with “We choose SFNO…”: I find this part difficult to follow. It would be good to remind the reader about the architecture of the SFNO and to clearly state what is changed to see only “linear” scaling with horizontal resolution. This will clearly not be the case if the size of the SFNO is increased (?). And I thought I had seen talks by NVIDIAns that showed that there were fundamental problems when scaling SFNO to km-scale resolution? And is “super-linear” more or less than linear when you talk about the cost? What exactly do you mean by downscaling and scale factor here (I think I know but only since I know the previous papers)? I do not understand why a lower scale factor would lead to a larger ensemble spread.
- Figure 4: I do not really understand how the process is repeated for t-1 and t0.
- Figure 10: Is the control member equivalent to a normal ensemble member, or are there small differences (as in IFS)? Can you also plot 0h?
- Page 20: 9 initial dates per year does not seem correct.
- Section 3.3.2 and 3.3.3 read a bit too much like a text book. Can you refer to literature and keep the discussion shorter?
- Page 25: The discussion of the pipeline and the earth2mip package indicates that you consider this to be one of the main contributions of the paper. I think it could be, but you would probably need to make it more prominent in the write-up. It is hardly mentioned at the moment. You could maybe show how the package works for another ML model, more or less out of the box? Is the extreme diagnostics pipeline that is mentioned on page 27 meant to be used by other groups and to work as a benchmark?
Minor comments:
- Abstract: “these” -> “These”
- P4: “set This”
- Figure 2b: For what timestep are the spectra calculated?
Part 2:
The paper presents very interesting and useful information on how the huge ensemble is generated and how much effort it takes to run such a model. However, the evaluation of the usefulness of a huge ensemble is rather weak, as it presents the “easy” task of Gaussian predictions but avoids diagnostics that evaluate the “hard” tasks for a huge ensemble that could actually show its real usefulness. I am left a bit puzzled after reading the paper about how the huge ensembles could actually become useful. I do not think that our current 50-member ensembles would greatly benefit from more ensemble members if we assume that we predict Gaussian distributions. We have EMOS to improve predictions for 50-member ensembles, so there is no need for huge ensembles. How does the information gain of 4 compare against the IFS ensemble with EMOS? I also do not think that an ensemble range that predicts temperatures between 295 and 320 K will be of any help for a decision maker (as seen in Figure 5). Having a couple of members from a >1000-member ensemble close to the truth will not trigger any decisions for a forecast. The same is true for the outcome-weighted CRPS discussion. It seems to be of less relevance to have a prediction of the uncertainty range of the probability for an extreme prediction.
If we assume that the distributions of variables that are of interest are non-Gaussian, in particular for extremes, the huge ensembles may be extremely useful to sample the tails of the distribution. But in this case, we would need to still show that the ensemble is actually representing the tails of the distribution correctly. This should be evaluated but it is a very hard problem, not only for the ensemble system, but also for the evaluation as you would need a very long test period to sample extreme events to understand the real quality of the ensemble when representing a 4-sigma event for, say, precipitation with enough statistics. This may not be possible without overlap between training and testing datasets.
It also smells a bit like cherry-picking that the evaluation focusses on day 10+, where you see a good spread-error ratio. I would like to see evaluations at earlier lead times (in particular for Figures 4, 6, and 7). If you represent all possible weather situations at day 10, this may well indicate that your model is all over the place once it is basically uncorrelated with the real-world trajectory. It would be a much stronger statement if you saw the same at day 2 or 5.
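For context, the spread-error ratio referred to here compares the ensemble spread to the RMS error of the ensemble mean; a minimal unweighted sketch (hypothetical array shapes, ignoring latitude weighting and finite-ensemble corrections) is:

```python
import numpy as np

def spread_error_ratio(ensemble, observation):
    """Illustrative spread-error ratio at one lead time (hypothetical inputs).

    ensemble:    array of shape (n_members, n_points) of forecasts
    observation: array of shape (n_points,) with the verifying truth
    """
    ens_mean = ensemble.mean(axis=0)
    # Root-mean-square error of the ensemble mean against the observation
    rmse = np.sqrt(np.mean((ens_mean - observation) ** 2))
    # Ensemble spread: root-mean-square of the member standard deviation
    spread = np.sqrt(np.mean(ensemble.var(axis=0, ddof=1)))
    return spread / rmse  # roughly 1 for a statistically calibrated ensemble
```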
It would also be good if you could show results for more challenging quantities such as precipitation as well.
The first part, which outlines how a huge ensemble can be run and what hardware is needed, is very interesting. However, it would be good if you could put the results a bit better into perspective. The data pipeline that you describe seems to bring a machine of the size of Perlmutter to its limits. A 25 GB/s connection is rather expensive to maintain. This does not go down well with the claim that ML models are orders of magnitude cheaper when compared to conventional models.
Would it be possible to compare the huge ensemble also against other ML ensemble systems that are published in the literature?
Minor comments:
- P6: How large is the model if you want to send it around instead of the data?
- How does climate change enter the discussion around huge ensembles?
- Figure 11 seems to have an error in the caption with 240, 246, 252… not fitting to day 4,7,10.
Citation: https://doi.org/10.5194/egusphere-2024-2420-RC1
RC2: 'Comment on egusphere-2024-2420', Anonymous Referee #2, 18 Jan 2025
The manuscript presents an approach to forecasting low-likelihood high-impact extreme weather events using a Spherical Fourier Neural Operator with bred vectors and multiple checkpoints (SFNO-BVMC), addressing a significant challenge faced by current deep-learning weather prediction models. The results demonstrate the model's capability to predict extreme events while achieving reduced computational costs compared to traditional Numerical Weather Prediction (NWP) methods, potentially marking a significant milestone in weather forecasting. Despite these promising results, several aspects warrant further research. The authors mainly focus on 2m temperature, especially heat extremes, from model configuration to diagnostics; given that the authors deliberately included 2m dewpoint temperature as a model input variable, incorporating predictions of derived heat-extreme indices would provide valuable insights into the model's capabilities. Furthermore, I recommend evaluating a broader range of LLHIs to strengthen the reliability of the approach. I think the authors can incorporate cold extremes along with heat extremes. What about wind extremes, which are among the predicted variables? Most importantly, this model does not encompass floods/precipitation, which can cause the highest-impact extremes. In its current form, the LLHI diagnosis may be too narrow to adequately showcase the model's full ability to predict various extreme weather events.
In addition to various extreme events, actual forecasts would be helpful for recognizing the usefulness of the model. Diagnostics with real-event predictions would be more persuasive. For example, t2m ensemble time series at a certain grid point, trajectories of each ensemble member for each variable, and the difference from IFS could strengthen the model's credibility.
Major Comments
1. (p.4) “Existing work has shown that simple Gaussian perturbations do not yield a sufficiently dispersive ensemble. (Scher and Messori, 2021; Bülte et al., 2024): the ensemble spread from these perturbations is too small.”
: If so, you can still adopt singular vectors or other methods to reflect initial condition uncertainty. Are bred vectors superior to other approaches? Are they the cheapest way other than simple Gaussian perturbations?
2. (p.4) “Each resulting checkpoint represents an equivalently plausible set of weights that can model the time evolution of the atmosphere from an initial state.”
: Bonev et al. (2023) assessed their SFNO for weather prediction via ACC only. As hyperparameters and input variables have changed, I am curious about the predictive performance of this version. Does each of the checkpoints generate reliable forecasts? A comprehensive assessment of SFNO via metrics beyond ACC is required.
3. (p.6) “In this study, we add 2-meter (2m) dewpoint temperature as another variable; for our SFNO training dataset, we obtain the 2m dewpoint temperature field from ERA5.”
: Vertical velocity and precipitation are excluded. As I mentioned above, precipitation is important in extreme weather forecasting. Is there any specific reason for excluding precipitation?
4. (p.10) Figure 3. Ensemble spread from different numbers of checkpoints.
: Model configuration also focused on 2m temperature. Do we need to change the number of checkpoints if we want to forecast wind extremes? Do we need to change it every time for different variables? Selecting the number of checkpoints based on the comparison among multiple variables would be a more optimal choice.
5. (p.17) “On the second criterion, crucially, their spectra remain constant through the 360-hour rollout (Figure 10 and Figure 11).”
: Degradation of power at short wavelengths occurs in many DLWPs. Is the degradation in all DLWP models then due to autoregressive fine-tuning? This seems like too crucial a problem for the cause to be merely hypothesized. I think it would be beneficial for readers to pinpoint the cause.
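For reference, one common simplified way to monitor such degradation over a rollout is to track a zonal wavenumber spectrum of each forecast field; a minimal sketch (not the paper's exact spherical-harmonic diagnostic) is:

```python
import numpy as np

def zonal_power_spectrum(field):
    """Illustrative zonal wavenumber spectrum of one lat-lon field.

    field: array of shape (n_lat, n_lon), e.g. 850 hPa temperature at one lead time
    """
    # FFT along longitude at each latitude, then average the power over latitudes.
    coeffs = np.fft.rfft(field, axis=1)
    power = np.mean(np.abs(coeffs) ** 2, axis=0)
    # Index k is the zonal wavenumber; loss of power at high k over the rollout
    # would indicate the short-wavelength degradation discussed here.
    return power
```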
6. (p.17) “While the control and perturbed spectra remain constant through the rollout, the SFNO-BVMC ensemble mean does increasingly blur with lead time. Figure 12 shows that the ensemble means of SFNO-BVMC and IFS ENS similarly degrade in power after 24 hours, 120 hours, and 240 hours.”
: In the first paragraph of Section 3.2 (Spectral Diagnostics), the authors explain that power decay is one symptom of blurriness, but this sentence seems to presume the two are equivalent. Section 3.2 needs to be clearer. What is the relationship between spectra and blurriness in general, and what do you find for SFNO? Why is SFNO-BVMC different from other DLWPs with respect to the power spectrum?
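One idealized way to see the relationship, under the illustrative assumption (not a claim from the paper) that each member is a common predictable signal plus mutually uncorrelated small-scale differences, is

$$x_i = s + e_i, \qquad \bar{x} = s + \frac{1}{N}\sum_{i=1}^{N} e_i,$$

so at wavenumber $k$ the expected power of the ensemble mean is approximately

$$P_{\bar{x}}(k) \approx P_s(k) + \frac{P_e(k)}{N},$$

while each member retains $P_{x_i}(k) \approx P_s(k) + P_e(k)$. As lead time grows, $P_e(k)$ dominates at small scales, so the ensemble mean loses small-scale power (appears blurred) even though individual member spectra can remain realistic.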
7. (p.19) “This is necessary but as yet insufficient validation for our main scientific interest in LLHIs.”
: I expect more analysis of LLHIs, such as case studies of events that occurred in recent years, even though the authors acknowledge the lack of validation. It would provide a more robust evaluation and help illustrate the model's practical value.
Minor comments
1. (p.12) “First, they contain a land-sea contrast for surface fields such as 10m wind speed and 2m temperature. For these surface fields, perturbations have distinct amplitudes and spatial scales over the land and ocean.”
: It’s a bit difficult for me to discern the difference. Could you show the amplitude in another way?
2. (p.14) “On 850 hPa temperature, 2m temperature, 850 hPa specific humidity, and 500 hPa geopotential, SFNO-BVMC lags approximately 18 hours behind IFS ENS, though their performance is comparable.”
: CRPS scores at all pressure levels would be useful for readers, e.g. as in GenCast or GraphCast.
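For reference, the standard ensemble estimator of the CRPS (a textbook definition, not specific to this paper) for an $M$-member ensemble $x_1,\dots,x_M$ and verifying observation $y$ is

$$\widehat{\mathrm{CRPS}} = \frac{1}{M}\sum_{i=1}^{M}\lvert x_i - y\rvert \;-\; \frac{1}{2M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}\lvert x_i - x_j\rvert,$$

with the "fair" variant replacing $\tfrac{1}{2M^2}$ by $\tfrac{1}{2M(M-1)}$; averaging this over grid points and pressure levels would yield the kind of scorecard suggested here.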
Citation: https://doi.org/10.5194/egusphere-2024-2420-RC2
AC1: 'Response to CEC1 Comment', Ankur Mahesh, 19 Jan 2025
We have amended our code and data availability statement to the following statement:
The code, datasets, and models are all stored at https://doi.org/10.57967/hf/4200. We include the code to train SFNO, conduct ensemble inference with bred vectors and multiple checkpoints, and conduct scoring and analysis. We also open-source the model weights of the trained SFNO. See the README.txt for information on how to use the codebase and for the permissive license associated with the code and data.
This change should appear on arXiv shortly. We have removed the references to temporary repositories and stored all code and data at the above DOI, in compliance with GMD's code and data policy.
Citation: https://doi.org/10.5194/egusphere-2024-2420-AC1
Data sets
Trained ML Model Weights Ankur Mahesh et al. https://portal.nersc.gov/cfs/m4416/earth2mip_prod_registry/
Model code and software
Code to run the ML model for inference Ankur Mahesh et al. https://github.com/ankurmahesh/earth2mip-fork/tree/HENS