the Creative Commons Attribution 4.0 License.
Huge Ensembles Part II: Properties of a Huge Ensemble of Hindcasts Generated with Spherical Fourier Neural Operators
Abstract. In Part I, we created an ensemble based on Spherical Fourier Neural Operators. As initial condition perturbations, we used bred vectors, and as model perturbations, we used multiple checkpoints trained independently from scratch. Based on diagnostics that assess the ensemble's physical fidelity, our ensemble has comparable performance to operational weather forecasting systems. However, it requires several orders of magnitude fewer computational resources. Here in Part II, we generate a huge ensemble (HENS), with 7,424 members initialized each day of summer 2023. We enumerate the technical requirements for running huge ensembles at this scale. HENS precisely samples the tails of the forecast distribution and presents a detailed sampling of internal variability. For extreme climate statistics, HENS samples events 4σ away from the ensemble mean. At each grid cell, HENS improves the skill of the most accurate ensemble member and enhances coverage of possible future trajectories. As a weather forecasting model, HENS issues extreme weather forecasts with better uncertainty quantification. It also reduces the probability of outlier events, in which the verification value lies outside the ensemble forecast distribution.
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2024-2422', Juan Antonio Añel, 29 Oct 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
We have checked the Code and Data Availability section in your manuscript, and beyond the two Zenodo repositories you share, many of the assets used in your work are posted on servers that we cannot accept. You mention several GitHub sites, which are not valid repositories, and even the link to one of them (the one related to the scoring) does not work.
Therefore, the current situation with your manuscript is irregular. Please, publish your code in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible, as we can not accept manuscripts in Discussions that do not comply with our policy.
In this way, if you do not fix this problem, we will have to reject your manuscript for publication in our journal.
Also, you must include the modified 'Code and Data Availability' section in a potentially reviewed manuscript, the DOI of the code (and another DOI for the dataset if necessary).
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2422-CEC1
RC1: 'Comment on egusphere-2024-2422', Peter Düben, 02 Nov 2024
Part 1 and 2 are both interesting papers that document the development and use of a machine-learned ensemble weather forecast model with an enormous number of ensemble members. The papers fit well into GMD, but I think that they should be revised following the comments below.
All the best,
Peter
Part 1:
The paper documents a very interesting result, as it shows that an SFNO-type model can be used to develop a competitive ensemble forecast system when combined with bred vectors and multi-checkpointing.
Major comments:
- Page 2: The ML model has “orders-of-magnitudes” lower computational cost. Is this really true? More than a factor of 10? This could only be possible if the IO cost (that will stay the same) is considered to be of less than 10% of the overall cost (also see comment for Part 2). And what is the “cost”? Time, energy, or hardware purchase?
- P6, paragraph starting with “We choose SFNO…”: I find this part difficult to follow. It would be good to remind the reader about the architecture of the SFNO and to clearly state what is changed to see only “linear” scaling with horizontal resolution. This will clearly not be the case if the size of the SFNO is increased (?). And I thought I had seen talks by NVIDIAns that showed that there were fundamental problems when scaling SFNO to km-scale resolution? And is “super-linear” more or less than linear when you talk about the cost? What exactly do you mean by downscaling and scale factor here (I think I know but only since I know the previous papers)? I do not understand why a lower scale factor would lead to a larger ensemble spread.
- Figure 4: I do not really understand how the process is repeated for t-1 and t0.
- Figure 10: Is the control member equivalent to a normal ensemble member, or are there small differences (as in IFS)? Can you also plot 0h?
- Page 20: 9 initial dates per year does not seem correct.
- Section 3.3.2 and 3.3.3 read a bit too much like a text book. Can you refer to literature and keep the discussion shorter?
- Page 25: The discussion of the pipeline and the earth2mip package indicates that you consider this to be one of the main contributions of the paper. I think it could be, but you would probably need to make it more prominent in the write-up. It is hardly mentioned at the moment. You could maybe show how the package works for another ML model, more or less out of the box? Is the extreme diagnostics pipeline that is mentioned on page 27 meant to be used by other groups and to work as a benchmark?
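The cost question in the first major comment above (Page 2) can be made concrete with an Amdahl-style bound. The following is a minimal sketch, assuming the total cost splits into an I/O part that the ML model does not reduce and a compute part that it does. The function name and the example fractions are illustrative assumptions, not figures from the manuscript.

```python
def overall_speedup(io_fraction: float, compute_speedup: float) -> float:
    """Overall cost reduction when only the non-I/O part gets cheaper.

    io_fraction: share of the original cost spent on I/O (assumed unchanged).
    compute_speedup: factor by which the remaining cost is reduced.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must lie in [0, 1]")
    if compute_speedup <= 0.0:
        raise ValueError("compute_speedup must be positive")
    remaining_cost = io_fraction + (1.0 - io_fraction) / compute_speedup
    return 1.0 / remaining_cost


if __name__ == "__main__":
    # Hypothetical I/O shares; the compute speedup is deliberately very large.
    for io_fraction in (0.01, 0.1, 0.5):
        print(io_fraction, overall_speedup(io_fraction, 1e6))
```

Under this assumption the overall gain can never exceed 1 / io_fraction, so a gain above a factor of 10 requires I/O to account for less than 10% of the original cost, which is the condition stated in the comment.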
Minor comments:
- Abstract: “these” -> “These”
- P4: “set This”
- Figure 2b: For what timestep are the spectra calculated?
Part 2:
The paper presents very interesting and useful information on how the huge ensemble is generated and how much effort it requires to run such a model. However, the evaluation of the usefulness of a huge ensemble is rather weak as it presents the “easy” task of Gaussian predictions but avoids diagnostics that evaluate the “hard” tasks for a huge ensemble that could actually show the real usefulness. I am left a bit puzzled after reading the paper how the huge ensembles could actually become useful. I do not think that our current 50-member ensembles would greatly benefit from more ensemble members if we assume that we predict Gaussian distributions. We have EMOS to improve predictions for 50-member ensembles, so no need for huge ensembles. How does the information gain of 4 compare against the IFS ensemble with EMOS? I also do not think that an ensemble range that predicts temperatures between 295 and 320 K will be of any help for a decision maker (as seen in Figure 5). To have a couple of members from a >1000-member ensemble close to the truth will not trigger any decisions for a forecast. The same is true for the outcome-weighted CRPS discussion. It seems to be of less relevance to have a prediction of the uncertainty range of the probability for an extreme prediction.
If we assume that the distributions of variables that are of interest are non-Gaussian, in particular for extremes, the huge ensembles may be extremely useful to sample the tails of the distribution. But in this case, we would need to still show that the ensemble is actually representing the tails of the distribution correctly. This should be evaluated but it is a very hard problem, not only for the ensemble system, but also for the evaluation as you would need a very long test period to sample extreme events to understand the real quality of the ensemble when representing a 4-sigma event for, say, precipitation with enough statistics. This may not be possible without overlap between training and testing datasets.
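For a sense of scale, the rarity of a 4σ event can be estimated with a short calculation. This is a minimal sketch that assumes a Gaussian forecast distribution and independent members, which is exactly the assumption the paragraph above says may fail for extremes. The function names are illustrative and do not come from the manuscript.

```python
import math


def gaussian_upper_tail(k_sigma: float) -> float:
    """P(Z > k_sigma) for a standard normal variable Z."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))


def prob_at_least_one(p: float, n_members: int) -> float:
    """Chance that at least one of n independent members exceeds the threshold."""
    return 1.0 - (1.0 - p) ** n_members


if __name__ == "__main__":
    p4 = gaussian_upper_tail(4.0)
    print("P(Z > 4 sigma):", p4)
    # Ensemble sizes taken from the discussion: 50 members and 7,424 members.
    for n_members in (50, 7424):
        expected_count = n_members * p4
        print(n_members, expected_count, prob_at_least_one(p4, n_members))
    # Rough number of independent verification cases needed to expect one event.
    print("cases per expected event:", 1.0 / p4)
```

The last quantity, 1 / p4, indicates how many independent verification cases are needed on average before one such event is observed, which is why a very long test period would be required to evaluate the tails with enough statistics.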
It also smells a bit like cherry-picking when the evaluation is focussing on day 10+ as you see a good spread-error ratio here. I would like to see evaluations of earlier lead times (in particular for Figure 4, 6, 7). If you represent all possible weather situations at day 10, this can well indicate that your model is all over the place when it is basically uncorrelated with the real-world trajectory. It would be a much stronger statement if you see the same at day 2 or 5.
It would also be good if you could show results for more challenging quantities such as precipitation as well.
The first part that outlines how a huge ensemble can be run and what hardware is needed is very interesting. However, it would be good if you could put the results a bit better into perspective. The data pipeline that you describe seems to bring a machine of the size of Perlmutter to its limits. A 25 GB/s connection is rather expensive to maintain. This does not go down well with the claim that ML models are orders of magnitude cheaper when compared to conventional models.
Would it be possible to compare the huge ensemble also against other ML ensemble systems that are published in the literature?
Minor comments:
- P6: How large is the model if you want to send it around instead of the data?
- How does climate change enter the discussion around huge ensembles?
- Figure 11 seems to have an error in the caption with 240, 246, 252… not fitting to day 4,7,10.
Citation: https://doi.org/10.5194/egusphere-2024-2422-RC1
RC2: 'Comment on egusphere-2024-2422', Anonymous Referee #2, 18 Jan 2025
In Part 1, the integrated system was validated, while in Part 2, the focus shifted to simulating extreme weather events, particularly those exceeding 4 standard deviations from the mean. The creation and analysis of 7,424 ensemble members using a range of probabilistic metrics is impressive and represents a significant advancement in ensemble-based forecasting. This large ensemble approach has substantial potential for improving the prediction and assessment of extreme weather events, offering valuable insights into their likelihood and associated uncertainties. Moreover, the integration of artificial intelligence, specifically through Spherical Fourier Neural Operators, presents a promising new direction for weather forecasting, combining computational efficiency with robust performance.
However, several concerns need to be addressed, as outlined in the detailed comments below. With these revisions, we believe the manuscript will be well-prepared for publication.
Major Comments
1. The authors do not demonstrate whether the error accumulates and spreads as the lead time progresses, and the explanation for this omission is unclear. Since the perturbations are based on bred vectors, the lack of error accumulation could potentially result from deviations introduced by the initial bred vector itself. Therefore, including an analysis of the variance and characteristics of the bred vector would enhance the validity of the results and provide a more convincing argument.
2. The authors have only analyzed temperature, but with the availability of u10m data, further analysis of wind gusts could be conducted. Limiting the results to temperature alone restricts the reliability of the study. Additional analyses of other extreme events using different variables would strengthen the manuscript. If it is feasible to modify the deep learning model to simulate precipitation, I would recommend including it. If not, at the very least, wind gust analysis should be explored and discussed.
3. The study focuses exclusively on heat waves from June to August. However, cold waves represent the opposite end of temperature extremes and are equally important. If the authors can demonstrate that their system is capable of reproducing cold waves, it would significantly enhance the reliability of the model in predicting a broader range of temperature extremes.
Minor Comments
1. Perturbations were applied using both bred vectors and checkpoints. One question that arises is which of these two methods is more sensitive to an increased perturbation. A brief analysis and discussion on this point would be valuable for readers to better understand the relative impact of each approach.
2. There are typos in plural and singular forms; please find and correct them.
Citation: https://doi.org/10.5194/egusphere-2024-2422-RC2
AC1: 'Response to CEC1 Comment', Ankur Mahesh, 19 Jan 2025
We have amended our code and data availability statement to the following statement:
The code, datasets, and models are all stored at https://doi.org/10.57967/hf/4200. We include the code to train SFNO, conduct ensemble inference with bred vectors and multiple checkpoints, and scoring and analysis code. We also open-source the model weights of the trained SFNO. See the README.txt for information on how to use the codebase and for the permissive license associated with the code and data.
This change should appear on arXiv shortly. We have removed the references to temporary repositories and stored all code and data at the above DOI, in compliance with GMD's code and data policy.
Citation: https://doi.org/10.5194/egusphere-2024-2422-AC1
Data sets
Trained Machine Learning Model Weights Ankur Mahesh et al. https://portal.nersc.gov/cfs/m4416/earth2mip_prod_registry/
Model code and software
Ensemble Inference Code Ankur Mahesh et al. https://github.com/ankurmahesh/earth2mip-fork/tree/HENS