the Creative Commons Attribution 4.0 License.
AI-based seasonal large ensembles for fluvial flood risk: Evaluation over the Elbe basin
Abstract. A key challenge in risk analysis is to identify hazard events that are plausible and yet extrapolate beyond historical observations with appropriate frequency. For flood risk management, this can be done with large ensembles of synthetic, physically plausible weather scenarios that extend beyond the historical record to sample low-likelihood, high-impact events. Traditional statistical approaches for synthetic weather generation are often limited in variability and physical realism. Here, we show for the first time that a machine learning weather prognostic model, combined with a diagnostic precipitation model, can generate seasonal-scale ensembles suitable for flood risk assessment. Specifically, we adapt the huge ensembles (HENS) approach using a Spherical Fourier Neural Operator (SFNO)-based model combined with an Adaptive Fourier Neural Operator (AFNO)-based diagnostic precipitation model, using the NVIDIA Earth-2 stack, in a framework which we call “PrecipHENS”, to produce > 1000 synthetic European winter seasons of precipitation and temperature at 0.25° resolution in 112 GPU hours on NVIDIA L40s GPUs. In an Elbe River case study, PrecipHENS reproduces key features of the precipitation and temperature climatology, preserves spatial and temporal dependence – including decay of extremal co-occurrence with distance – and generates a wider diversity of extreme events than an industry-standard conditional multivariate extreme value model benchmark. Principal component analysis of extreme precipitation fields shows that PrecipHENS spans a much broader space of storm structures (≈ 81 % of 1 × 1 grid cells) than the benchmark (≈ 50 %) or the historical record (≈ 19 %), indicating plausible novelty rather than repetition of past patterns. Coupled with a hydrological model, the AI-generated weather sequences produce river flow simulations consistent with historical climatology and extreme discharge patterns.
These results demonstrate the potential of AI-based weather models to support event set generation for flood hazard and risk applications. Beyond flood risk, such AI-based large-ensemble weather generation offers a general framework for applications that benefit from expanding the physically plausible sample space, including risk assessment, climate-impact analysis, storyline development and statistical characterisation of extremes.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5841', Anonymous Referee #1, 24 Dec 2025
- RC2: 'Comment on egusphere-2025-5841', Anonymous Referee #2, 15 Jan 2026
General Comments
This manuscript introduces and validates a novel AI-driven framework, PrecipHENS, which integrates a Spherical Fourier Neural Operator (SFNO)-based model combined with an Adaptive Fourier Neural Operator (AFNO)-based diagnostic precipitation model to generate seasonal-scale, high-resolution weather ensembles for flood risk assessment. Through a case study in the Elbe River basin, this manuscript systematically compares the PrecipHENS outputs against a multivariate extreme-value statistical model (the benchmark) across multiple dimensions such as climatology preservation, spatiotemporal coherence, extreme event representation, and methodological generalizability. The results demonstrate that PrecipHENS can realistically reproduce historical climatic features and generate extreme precipitation events with greater spatial diversity, and the produced meteorological data can be translated into plausible river flow simulations via a hydrological model named GR4J. This study successfully demonstrates the feasibility and considerable potential of AI weather generation models for directly constructing extreme precipitation and flood risk event sets, representing a significant methodological novelty with important application prospects.
The paper is well-written with sound experimental design, although there are a few points requiring clarification and minor adjustments. Comments are listed below.
Major Comments
- The authors note that this study applied the lumped hydrological model, GR4J, on individual watersheds without explicit channel routing. This is a reasonable simplification given the core focus on validating the meteorological input. However, the specific limitations this choice imposes on the applicability of the conclusions should be more explicitly discussed. It should be stated that the current framework’s output is flow event sets at sub-basin outlets, and its direct use for fluvial flood risk assessment has constraints, particularly concerning flood peak propagation and superposition.
- All datasets in this study, including the 44-year historical reference used by the statistical benchmark model and the AI model trained on data from 1979-2016, rely on relatively short climatic records. A key point that could be more explicitly discussed is that the short record length fundamentally limits the reliable characterization and statistical estimation of low-frequency, high-return-period extreme events. This implies not only that the historical data itself may lack some genuine extreme patterns, but also that both the benchmark and PrecipHENS methods, which involve extrapolation or learning from this short series, carry inherent and considerable uncertainty for events far exceeding the record length (e.g., high return periods of 50 or 100 years). This should be more clearly elevated as a common foundational limitation affecting all approaches.
- Similar to (2), the manuscript mentions that the SFNO model was trained on data before 2016. Given climate non-stationarity, readers may be concerned about the implications of this data timeliness for future risk analysis. It is recommended that the authors briefly discuss this as a recognized limitation.
Minor Comments
- P2, L47, “Each event set comprises of spatially …”, remove “of”.
- P8, L200, “...with such length following a Poisson distribution with the unit being the number of years.”, revising to “...with the length (in years) following a Poisson distribution.”
Citation: https://doi.org/10.5194/egusphere-2025-5841-RC2
RC3: 'Comment on egusphere-2025-5841', Anonymous Referee #3, 16 Jan 2026
General Comments:
This manuscript introduces PrecipHENS, a novel AI-driven framework that combines a deep-learning-based weather emulator (SFNO + AFNO) with the conceptual hydrological model GR4J to estimate fluvial flood risk in the Elbe River basin. By generating an ensemble of more than 1,000 synthetic winter seasons, the study seeks to overcome the limitations of relatively short historical records when estimating tail risk. The proposed approach is compared against a traditional statistical method based on Heffernan and Tawn (2004), with performance evaluated using four criteria: whether each method reproduces the observed precipitation climatology; spatial and temporal coherence; representation of extremes; and methodological robustness.
Overall, this manuscript fits well within the scope of Natural Hazards and Earth System Sciences. However, a few questions require clarification, and some revisions are needed.
Specific Comments:
1. The text refers to "Figure 17" when describing river flow simulations at Dresden. Figure 17 displays Gumbel parameters; the correct reference is Figure 18. Please correct this reference.
2. The authors highlight that PrecipHENS spans approximately 81% of the PCA space, compared to only 19% for the historical record, and interpret this as evidence that the framework generates novel extremes. Could the authors clarify how the physical plausibility of these novel events is ensured? Specifically, do these outlier events correspond to coherent atmospheric structures?
3. Are all 1,008 ensemble members derived from initial conditions sampled strictly from the 9–16 November 2023 window? If so, have the authors tested initializing from different years (e.g., neutral, El Niño, and La Niña years) to ensure that the ensemble spans the full range of climatological variability?
4. The current evaluation focuses primarily on end-of-pipe outputs (i.e., precipitation), without assessing the intermediate atmospheric variables generated by SFNO, such as geopotential height, pressure fields, or wind patterns. Given that the Discussion explicitly acknowledges that “remaining uncertainties concern how well the underlying SFNO emulator reproduces large-scale atmospheric circulation patterns”, could the authors include a brief evaluation or at least a visualization of these generated atmospheric fields?
Citation: https://doi.org/10.5194/egusphere-2025-5841-RC3
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 289 | 104 | 18 | 411 | 11 | 9 |
This manuscript provides a novel framework for supplementing the short period of record of historical precipitation and river flow data through an AI-based methodology that creates hundreds of ensemble members from which to assess rare extreme events. This methodology represents a creative and sophisticated approach to addressing the problem that low-likelihood, high-impact events are rarely seen in our short historical period. I am impressed by the breadth of knowledge presented on the methodology in this study and think this application will constitute a considerable advancement in the academic literature on this topic.
While I think this approach can be a meaningful contribution to the literature, there are several areas where the manuscript’s contents can either be better explained or address topics that are currently lacking. I will detail my main concerns below, with a list of minor comments after.
-How do we know that the precipitation outputs of PrecipHENS are physically plausible?
We know, and the authors do an excellent job of showing, that the PrecipHENS precipitation outputs are statistically plausible in reference to the historical data, but the modeling framework presented here is entirely based on an AI-based emulator of ERA5, so it is important to know whether these outputs represent potential physically realistic scenarios. In short, how do we know that PrecipHENS generates thousands of “right answers” instead of thousands of “wrong answers”? With a physics-based framework like UNSEEN, we can trust that they are “right answers” because the underlying model is a physics-based dynamical model, but that is not the case for PrecipHENS. I acknowledge that this is a very difficult question to answer with any degree of certainty, so I think the authors should at least acknowledge that this remains an open question related to their method if they are unable to fully answer it.
-What do the data distributions look like for each of the three cases (Historical, Benchmark, PrecipHENS)?
The authors show many sophisticated statistical analyses to illustrate that PrecipHENS passes the tests for G1-4; however, a more basic depiction of the precipitation data generated for each of these cases is missing. It would greatly improve the manuscript to see how the distributions (e.g., PDFs) vary across the Historical, Benchmark, and PrecipHENS cases, especially regarding the tails. This is important because we want to know whether the precipitation data are being drawn from the same distributions, which has implications for the differences between Benchmark and PrecipHENS shown throughout the paper. One possibility here is that the underlying distribution could potentially show that PrecipHENS is generating hundreds of “right answers” (from the comment above), especially if there is no discernible difference between the Historical and PrecipHENS PDFs.
Additionally, it would be helpful to see a precipitation plot version of Figure C4 (which I really like!) to get a better understanding of the similarity of actual precipitation values between the Historical and PrecipHENS. This also can help with an understanding of how trustworthy the precipitation output from PrecipHENS is.
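For concreteness, the distribution comparison requested above could be sketched as follows. This is a minimal illustration with synthetic stand-in data, not the authors' code: the three gamma-distributed arrays are placeholders for basin-averaged daily precipitation from each source, and the sample sizes only loosely mimic the 44 historical winters and the >1000 synthetic seasons.

```python
# Hypothetical sketch: comparing bulk and tail quantiles of daily precipitation
# across the three cases. All data here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for basin-averaged daily precipitation (mm/day) from each source.
historical = rng.gamma(shape=0.6, scale=5.0, size=44 * 90)    # 44 winters
benchmark = rng.gamma(shape=0.6, scale=5.0, size=1000 * 90)   # resampled record
preciphens = rng.gamma(shape=0.6, scale=4.8, size=1008 * 90)  # AI ensemble

# Comparing selected quantiles directly is a compact alternative to full PDFs:
probs = [0.5, 0.9, 0.99, 0.999]
for name, x in [("historical", historical),
                ("benchmark", benchmark),
                ("preciphens", preciphens)]:
    q = np.quantile(x, probs)
    print(f"{name:11s}", " ".join(f"{v:7.2f}" for v in q))
```

Plotting kernel density estimates of the three arrays on a log y-axis would give the PDF comparison the comment asks for, with the tail behaviour made visible.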
-What are the big picture goals this approach is trying to accomplish?
A clearer representation of the problem from the beginning of the manuscript would help to ensure there is no misunderstanding of the capabilities of this methodology. For example, this method is entirely based on historical data, so it is very relevant to an understanding of extreme events today (or in the next few years, let’s say). But as extreme events are occurring with greater frequency and magnitude, it will likely become out-of-date (i.e., an underestimation) for calculation of extremes a decade from now or longer out into the future. This is an important point about this method that is not currently addressed by the manuscript.
This can also help to contextualize the illustrated dry bias of PrecipHENS. While it may not be entirely generalizable for the Elbe River individually, extreme precipitation in general is increasing in the future with climate warming. Is it a problem that the PrecipHENS approach is showing a dry bias in reference to the historical data when we know with a fair degree of certainty that the historical data is likely the lower end of what we expect in the near future? This can especially be seen in the longer return period events in Figure 9 that clearly have a dry bias (far fewer points falling above the reference line than below).
-Add motivation for why the Benchmark method is used
The Benchmark method is essentially a resampling of the historical period to give it a much longer period of record, but with the same underlying density of extreme events. Why is it necessary to construct this Benchmark method as a point of comparison with PrecipHENS instead of performing a comparison with only the Historical Data? This type of motivation in Section 2.2.1 would improve this manuscript.
-Selection of initial conditions
Are all initial conditions for PrecipHENS taken from 9-16 Nov. 2023, as lines 243-245 appear to suggest? The description here is a bit unclear on these details related to the model. If initial conditions are indeed all derived from this 7-day period, I would argue that this unnecessarily restricts the variability of the initial conditions, which would be a potential issue with the model and its results. If this were the case, the model would be considerably more robust with a more diverse set of initial conditions for atmospheric patterns than simple variations on a 7-day timeframe. I am hoping that I am understanding this poorly and there is more diversity in the set of initial conditions run with the model.
Minor comments:
-Line 191: What is the historical record here? Stations? ERA5? Something else?
-Figure B2: I recommend using a different color scale for this figure because it is temperature. The figure currently gives the incorrect assumption that PrecipHENS is biased cold. I suggest either flipping the bias color bar direction so brown is warmer, or using a blue-to-red color bar with red indicating warmer.
-Lines 388-390: Why are the PrecipHENS’ Gumbel parameters closer to the Historical than the Benchmark even though the Benchmark is based on inference about the extremes? It would be nice to get another sentence or two describing why this counterintuitive behavior exists.
-Figure 9: Did the authors consider showing the 500-year and 1000-year return periods on this plot as well? Since there are 1000+ years of data, this approach would also be applicable to these long return periods.
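To illustrate why the 500-year and 1000-year return periods are reachable here, a Gumbel return-level calculation can be sketched as follows. This is not the authors' code: the seasonal-maxima series is a synthetic placeholder, and a simple method-of-moments fit stands in for whatever estimator the paper uses.

```python
# Minimal sketch: Gumbel return levels for long return periods from a
# seasonal-maximum series. Data and fitting method are illustrative only.
import numpy as np

def gumbel_fit_moments(maxima):
    """Method-of-moments Gumbel fit: scale from std, location from mean."""
    beta = np.std(maxima, ddof=1) * np.sqrt(6) / np.pi
    mu = np.mean(maxima) - 0.5772156649 * beta  # Euler-Mascheroni constant
    return mu, beta

def gumbel_return_level(mu, beta, T):
    """Level exceeded on average once every T years (Gumbel quantile)."""
    return mu - beta * np.log(-np.log(1.0 - 1.0 / T))

rng = np.random.default_rng(1)
maxima = rng.gumbel(loc=40.0, scale=12.0, size=1008)  # stand-in seasonal maxima
mu, beta = gumbel_fit_moments(maxima)
for T in (10, 100, 500, 1000):
    print(T, round(gumbel_return_level(mu, beta, T), 1))
```

With 1008 synthetic seasons, the 500-year and 1000-year levels sit inside the sample rather than requiring extrapolation far beyond the record, which is the point of the comment.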
-Lines 454-457: What does the calculated proportion of non-overlapping 1x1 grid cells in PCA space that contain at least one event from each model mean exactly? I understand that it shows us greater diversity of events in PrecipHENS, but there could be another sentence or two clarifying how this method is actually constructed and why it’s done this way.
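One plausible construction of this grid-cell metric can be sketched as follows (the paper's exact procedure may differ; the event fields and source labels here are synthetic placeholders). Events are projected onto the leading two principal components, the PC plane is tiled with 1 × 1 cells, and a source's coverage is the fraction of all occupied cells containing at least one of its events.

```python
# Illustrative sketch of a PCA grid-cell diversity metric; placeholder data.
import numpy as np

rng = np.random.default_rng(2)
fields = rng.normal(size=(500, 40))  # stand-in: 500 events x 40 grid points
fields -= fields.mean(axis=0)        # center before PCA

# PCA via SVD; scores are projections onto the two leading components.
_, _, vt = np.linalg.svd(fields, full_matrices=False)
scores = fields @ vt[:2].T           # shape (events, 2)

labels = np.array(["A"] * 400 + ["B"] * 100)  # two hypothetical sources

# Tile the PC plane with 1x1 cells and count occupancy per source.
cells = {tuple(c) for c in np.floor(scores).astype(int)}  # all occupied cells
for src in ("A", "B"):
    occupied = {tuple(c) for c in np.floor(scores[labels == src]).astype(int)}
    print(src, len(occupied) / len(cells))
```

Under this construction, a larger fraction means a source's events reach more distinct regions of the reduced storm-structure space, which is how the ≈ 81 % vs ≈ 19 % contrast would be read.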
-Figure 11: What dataset is used to construct the data shown in this figure?
-Line 583: I believe the authors mean Figure 18 (not Figure 17).