The need for uncertainty: why probabilistic LSTMs are key to improving flood predictions and enabling learned warning rules
Abstract. Deterministic model predictions can struggle to adequately capture extreme events such as floods and droughts, which are of particular relevance in hydrology. This limitation arises because deterministic models collapse the conditional runoff distribution to a single point estimate. Probabilistic modeling provides a way to address this issue by explicitly representing uncertainty and assigning non-zero probabilities to a range of possible outcomes, including rare and extreme events, thereby capturing the full range of plausible hydrological responses. Motivated by this perspective, we examine whether probabilistic Long Short-Term Memory (LSTM) models improve the representation of extreme events in rainfall–runoff simulations across Switzerland. Overall, the probabilistic models show good calibration, although some miscalibration remains for the extremes. Differences between models mainly manifest in how uncertainty is distributed: some approaches produce narrower and lighter-tailed distributions, while others yield broader distributions with heavier tails. These trade-offs highlight that probabilistic models differ not only in sharpness but also in their calibration for rare events. We observe this trade-off also in the models' accuracy metrics. When evaluating the mean of the probabilistic predictions using the Nash–Sutcliffe efficiency (NSE), none of the probabilistic approaches outperforms the deterministic LSTM in terms of average predictive accuracy. However, a clear advantage over the deterministic model emerges when focusing on the tail of the discharge distribution. For the most extreme events (top 0.1 % of the discharge distribution), the deterministic LSTM underestimates more than 90 % of observed values (since it provides estimates of an expectation), whereas probabilistic predictions can capture a substantially larger fraction (67 %) of these extremes within their upper predictive bounds.
Building on the additional information provided by probabilistic runoff predictions, we further show how they can be translated into actionable flood warnings using reinforcement learning. To this end, we introduce a Flood Risk Communication Agent (FRiCA) that operates on probabilistic runoff predictions and learns decision rules for issuing warnings of varying intensity. The FRiCA is implemented as an LSTM-based policy network and is trained by rewarding correct warning levels while penalizing the underestimation of flood severity. Results indicate that the FRiCA outperforms simple fixed heuristics, such as issuing warnings based on the predictive mean or a fixed high quantile (e.g., the 99th percentile). While this behavior already demonstrates the potential of reinforcement learning for improved flood risk communication, it also motivates further exploration of reward design and policy-network architectures for decision policies that adapt to varying hydrological and societal contexts.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
[A] Overall Assessment
This is an interesting and well-executed study. I personally learned a great deal while reading it. Strictly speaking, I don’t see any major concerns that would prevent publication. However, I encourage the authors to reflect on and further clarify the following points to help strengthen and polish the manuscript.
[B] Conceptual and Methodological Questions
[C] Readability Suggestion
The information presented is professional and technically solid. However, I suggest breaking up some of the longer paragraphs into smaller sub-paragraphs—particularly in sections such as the Results—to improve readability.
This is only my second time reviewing an EGU-style preprint, so this suggestion may partly reflect personal preference.
[D] Specific Comments
Line 10: What is the definition of “rare events”?
Line 14: “The deterministic LSTM underestimates more than 90% of observed values.” Is this conclusion derived from results obtained across all single-objective functions? What is the basis for this statement?
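To make the basis of this question concrete, the mechanism can be illustrated with a minimal synthetic sketch (my own illustration in Python, not the paper's data or models): a conditional-mean predictor systematically falls below top-0.1 % observations, while an upper predictive quantile can cover a sizable fraction of them.

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 200_000, 1.0
mu = rng.normal(0.0, 0.5, size=n)          # hypothetical conditional location per time step
y = rng.lognormal(mean=mu, sigma=sigma)    # synthetic "observed" discharge

mean_pred = np.exp(mu + sigma**2 / 2)      # deterministic stand-in: conditional mean
q999_pred = np.exp(mu + 3.0902 * sigma)    # probabilistic stand-in: conditional 99.9 % quantile

top = y >= np.quantile(y, 0.999)           # most extreme 0.1 % of observations
underestimated = np.mean(mean_pred[top] < y[top])
covered = np.mean(q999_pred[top] >= y[top])
print(f"conditional mean underestimates {underestimated:.0%} of top-0.1% events")
print(f"upper predictive bound covers {covered:.0%} of them")
```

In such a toy setting the mean underestimates essentially all top-0.1 % events, because extremes are driven by noise realizations far above the conditional expectation; this makes the 90 % figure plausible, but the manuscript should still state which experiments it is derived from.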
Line 34–35: “Although LSTMs often outperform conceptual and process-based models for the majority of the flow regime.” Could references be provided to support this claim?
Line 49–54: It may be helpful to move these experimental results to the Appendix.
Line 69–70: “Hydrological systems are, in principle, deterministic dynamical systems...” Could the authors clarify what theoretical framework or principle this statement refers to?
Line 71–73: The statement regarding one-forcing-to-many-responses may require references.
Line 111–112: Are there alternative modeling perspectives to this reasoning? Could the authors further explain why a reinforcement-learning–based decision module was chosen?
Section 2.2: Is it necessary to clarify whether the network is operating in an extrapolation setting (e.g., PUB)?
Line 146: Does “five-member ensemble” refer only to different random seeds?
Line 224–225: Why were 5,000 Monte Carlo samples selected?
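To give this question some quantitative footing, a standard back-of-envelope argument (my own illustration, not from the manuscript) is that an exceedance probability p estimated from n independent Monte Carlo draws has standard error sqrt(p(1-p)/n):

```python
import math

def rel_se(p: float, n: int) -> float:
    """Relative standard error of estimating an exceedance probability p from n draws."""
    return math.sqrt(p * (1 - p) / n) / p

# With 5,000 samples, a 99th-percentile event (p = 0.01) is resolved to ~14 %
# relative error, while a 99.9th-percentile event (p = 0.001) only to ~45 %.
print(f"p=0.01,  n=5000: {rel_se(0.01, 5000):.0%}")
print(f"p=0.001, n=5000: {rel_se(0.001, 5000):.0%}")
```

A sentence along these lines would clarify whether 5,000 samples were chosen to resolve the tail quantiles of interest or simply for computational convenience.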
Line 248: For readers less familiar with reinforcement learning, it would be helpful to clarify: Why is the decision policy parameterized by an LSTM? How is the decision-making policy defined in hydrologic terms?
Line 252: Why were 32 quantiles used?
Line 253–254: Could the authors elaborate on what a “single-step decision process” looks like in the hydrologic context?
Line 260–262: Why were +100 and –5000 chosen as reward values?
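For readers, it may help to spell out the asymmetry these values induce. A minimal sketch of such a reward scheme (my own illustration; only the +100 and −5000 values come from the manuscript, and the graded overwarning penalty below is an assumed placeholder):

```python
def warning_reward(issued: int, true_level: int) -> float:
    """Reward for issuing warning level `issued` when the realized flood level is `true_level`.

    +100 and -5000 follow the values quoted in the manuscript; the graded
    overwarning penalty is an assumption added purely for illustration.
    """
    if issued == true_level:
        return 100.0                       # correct warning level
    if issued < true_level:
        return -5000.0                     # underestimated flood severity
    return -50.0 * (issued - true_level)   # assumed mild penalty for overwarning
```

The 50:1 asymmetry strongly pushes a learned policy toward precautionary warnings whenever flood probability is non-negligible; a brief sensitivity analysis over these magnitudes would make the design choice easier to assess.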
Section 2.5: This section contains many implementation details. An illustrative figure could significantly improve clarity.
[E] Performance Analysis
Line 281–288: Have the authors evaluated performance by grouping catchments according to functional traits or hydrologic regimes?
Line 310–313: The manuscript mentions that 9% of test data fall below the 1st quantile of BQN predictions. Is this behavior visible in any figure?
Line 325–326: Do the authors have insights into why results differ from Klotz et al. (2022) on CAMELS-US?
Section 3.2: It may strengthen the manuscript to further analyze how the three approaches learn (or fail to learn) uncertainty bounds during events.
For example: How does LSTM parameterization affect head-layer behavior? What are we learning mechanistically from this comparison?
Line 336 (Figure 3): Was any attempt made to improve the poor uncertainty bound of CMAL? Is this behavior associated with the model structure? How generalizable is this finding across locations?
Line 343: Were rising and falling limbs quantitatively separated during analysis?
Line 346: How does CMAL learn heteroscedasticity? Is this assumption embedded in the objective function?
Line 338–379: Are there insights regarding spatial and temporal performance differences? It may also be helpful to report deterministic LSTM performance per location rather than only median values.
[F] Event Definition and Aggregation
Line 397: Please specify which two ARIs are used.
Line 399: Does “over 158 catchments” imply that 38 basins were excluded? What is the impact of including/excluding them?
Line 401–404: The event definition is unclear. It would be helpful to clearly compare this definition with previous literature and explain how flood categories are identified following BAFU (2024).
[G] Additional Technical Clarifications
Line 435–437: Was there any attempt to train likelihood-based models using higher-order moments?
Line 454–455: More detail on the role of noise-based regularization could strengthen the main argument.
Line 518–519: A reference supporting the statement about subjective risk choice in reinforcement learning would be useful.
Line 510–541: Given the technical complexity of reinforcement learning, expanding literature support in this section would improve rigor.