This work is distributed under the Creative Commons Attribution 4.0 License.
The need for uncertainty: why probabilistic LSTMs are key to improving flood predictions and enabling learned warning rules
Abstract. Deterministic model predictions can struggle to adequately capture extreme events such as floods and droughts, which are of particular relevance in hydrology. This limitation arises because deterministic models collapse the conditional runoff distribution to a single point estimate. Probabilistic modeling provides a way to address this issue by explicitly representing uncertainty and assigning non-zero probabilities to a range of possible outcomes, including rare and extreme events, thereby capturing the full range of plausible hydrological responses. Motivated by this perspective, we examine whether probabilistic Long Short-Term Memory (LSTM) models improve the representation of extreme events in rainfall–runoff simulations across Switzerland. Overall, the probabilistic models show good calibration, although some miscalibration remains for the extremes. Differences between models mainly manifest in how uncertainty is distributed: some approaches produce narrower and lighter-tailed distributions, while others yield broader distributions with heavier tails. These trade-offs highlight that probabilistic models differ not only in sharpness but also in their calibration for rare events. We observe this trade-off also in the models' accuracy metrics. When evaluating the mean of the probabilistic predictions using the Nash–Sutcliffe efficiency (NSE), none of the probabilistic approaches outperforms the deterministic LSTM in terms of average predictive accuracy. However, a clear advantage over the deterministic models emerges when focusing on the tail of the discharge distribution. For the most extreme events (top 0.1 % of the discharge distribution), the deterministic LSTM underestimates more than 90 % of observed values (since it provides estimates of an expectation), whereas probabilistic predictions can capture a substantially larger fraction (67 %) of these extremes within their upper predictive bounds.
Building on the additional information provided by probabilistic runoff predictions, we further show how they can be translated into actionable flood warnings using reinforcement learning. To this end, we introduce a Flood Risk Communication Agent (FRiCA) that operates on probabilistic runoff predictions and learns decision rules for issuing warnings of varying intensity. The FRiCA is implemented as an LSTM-based policy network and is trained by rewarding correct warning levels while penalizing the underestimation of flood severity. Results indicate that the FRiCA outperforms simple fixed heuristics, such as issuing warnings based on the predictive mean or a fixed high quantile (e.g., the 99th percentile). While this behavior already demonstrates the potential of reinforcement learning for improved flood risk communication, it also motivates further exploration of better reward design and policy-network architectures for context-dependent decision policies that adapt to varying hydrological and societal contexts.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 05 Apr 2026)
- RC1: 'Comment on egusphere-2026-469', Anonymous Referee #1, 28 Feb 2026
- RC2: 'Comment on egusphere-2026-469', Sandeep Poudel & Scott Steinschneider (co-review team), 15 Mar 2026
Summary:
This study addresses two objectives: (1) evaluating probabilistic LSTM networks on extreme rainfall-runoff events, and (2) introducing reinforcement learning (RL) methods on top of probabilistic forecasts to issue flood warnings. Overall, the authors find that probabilistic LSTM performs better for extreme events, as it assigns some probability mass to those events, whereas deterministic LSTM mostly underestimates them. They also demonstrate the value of RL-type methods in issuing reliable flood warnings compared to simple fixed heuristics. The paper covers important aspects and has practical applications, and therefore has merit. However, there are several major concerns that need to be addressed to improve the clarity of the work and better support the results of their analysis. See comments below.
Major Comments:
- The DRN and CMAL models are not really that different. Both estimate the parameters of a distribution for flow, and it's just the distribution that differs (countable mixture of asymmetric Laplace distributions vs. logistic distribution). However, the authors use the negative log-likelihood as the loss for CMAL and CRPS as the loss for DRN. Therefore, it's unclear how the analysis separates the impact of the loss function from the structure of the distribution on model performance. The authors should therefore try using the negative log-likelihood as a loss for DRN and CRPS as a loss for CMAL, and then comment on whether it's the loss or the distribution type that leads to the bigger separation in performance.
- The FRiCA reinforcement learning model feels underdeveloped and tacked on at the end. Additionally, based on the results, it does not come off as a convincing improvement over simply using a static quantile of the probabilistic predictive distribution. I still think it's interesting and a useful contribution, but the framing in the introduction needs to be adjusted to communicate to readers that this part of the paper is exploratory: a proof of concept demonstrating how probabilistic DL hydrologic predictions could be incorporated into peak flow estimation and decision support for flood warnings.
- Furthermore, I found the methodological description of FRiCA a little difficult to follow. This section (Section 2.5) would benefit from clearer conceptual framing. In particular, the state–action–reward structure is not clearly defined, making it hard to understand what the agent is actually learning. In addition, the reward design (e.g., +100 vs. −5000 vs. 0) and the masking based on the 85th quantile could be better justified, as it's not clear whether those choices influence the learned behavior. Overall, I recommend adding a concise conceptual overview of the decision problem, defining the RL components more explicitly, and providing clearer justification (or a sensitivity analysis) for the reward structure and activation threshold.
- The introduction contains a comprehensive discussion of the saturation limit problem of LSTM networks, which is also one of the major motivations of this study. The authors also point out a few strategies they have previously implemented to partially overcome these limitations. However, the analysis in this study gives little attention to the saturation limit problem, as seen from the result that, although probabilistic LSTMs are somewhat better than deterministic LSTMs, the saturation problem still persists with them. Therefore, either some supporting analysis, perhaps incorporating ideas the authors have previously hinted at, or at the very least a detailed discussion of this limitation as it applies to the probabilistic LSTM, is necessary.
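To make the loss-swap experiment suggested in the first major comment concrete, here is a rough sketch of the two scoring rules involved. The function names are hypothetical, and a sample-based CRPS estimator stands in for whatever closed form the authors actually use; the point is only that NLL and CRPS score the same predictive distribution differently, so they can be crossed with either distribution type.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll_logistic(y, mu, s):
    # Negative log-likelihood of y under a logistic(mu, s) distribution:
    # -log f(y) = z + log(s) + 2 log(1 + exp(-z)), with z = (y - mu) / s
    z = (y - mu) / s
    return z + np.log(s) + 2.0 * np.log1p(np.exp(-z))

def crps_mc(y, samples):
    # Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    # usable for any predictive distribution we can draw from
    t1 = np.mean(np.abs(samples - y))
    t2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return t1 - t2

# Hypothetical predictive logistic distribution and one observation
mu, s, y = 2.0, 0.5, 3.1
samples = rng.logistic(mu, s, size=2000)
nll = nll_logistic(y, mu, s)
crps = crps_mc(y, samples)
```

Because `crps_mc` only needs draws, the same criterion could in principle be applied to a CMAL mixture, and the logistic NLL above to DRN, which is all the suggested swap requires.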
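For reference, one plausible reading of the reward scheme described in the manuscript (+100 / −5000 / 0), as questioned in the FRiCA comment above; the function and its arguments are hypothetical, and the authors' actual implementation may differ:

```python
def warning_reward(action_level, true_level,
                   hit_reward=100.0, miss_penalty=-5000.0):
    # Correct warning level earns the hit reward; a warning weaker than the
    # realized flood severity incurs the large underestimation penalty;
    # over-warning is left unpenalized in this reading of the scheme.
    if action_level == true_level:
        return hit_reward
    if action_level < true_level:
        return miss_penalty
    return 0.0
```

Written out this way, the asymmetry (100 vs. 5000) makes the agent's likely behavior easier to reason about, which is why a sensitivity analysis on these magnitudes seems warranted.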
Minor comments:
- In general, paragraphs are far too long, making it hard to separate out key ideas and natural breaks in the narrative. I would introduce more paragraph breaks throughout to improve readability.
- Lines 14-16: Provide a quantitative estimate rather than only stating that probabilistic predictions capture extreme events within their bounds.
- Lines 97-98: The motivation here is not clear. We already have a probabilistic LSTM model that predicts the parameters of the CMAL distribution, which is flexible enough and should in principle be capable of capturing extremes. You introduce two alternative approaches, but why are these necessary? What is the limitation of CMAL, does it have a property that causes it to struggle with extremes, and do the two methods you introduce (DRN and BQN) have properties that could potentially better represent extremes? In short, out of the many distributions and methods available, the rationale for why these two in particular are used here is missing.
- Lines 111-112: You do not provide any background on the RL method here. Why would RL be a suitable candidate for issuing more reliable flood warnings? There should be references to support this, from hydrology or similar domains, to provide the reader with useful context. On this point and the one above, the authors are assuming a great deal of prior knowledge on the part of readers, which might not be the case. Please provide brief context and background so that the study is more accessible.
- Table 2: The hidden size of the deterministic LSTM (64) is substantially smaller than that of the other probabilistic models (256 or 250). Do you have any comment on this? Was the size for the deterministic LSTM tuned specifically for this region, while the size for the probabilistic models was adopted directly from studies using CAMELS-US? For the CAMELS-US dataset, which is much larger than your data, the common hidden size choice is similar for both deterministic and probabilistic LSTMs (256 vs. 250). If 64 hidden units are sufficient for deterministic LSTM in your region, a similar size may also suffice for the probabilistic models, and the larger size may simply be overfitting. I would be interested to hear your explanation for this or see any supporting analysis.
- Lines 174-178: DRN is not clearly explained by your description here. Is this a family of many regression methods, or is logistic regression always used with DRN? If the choice to use logistic regression was yours, clarify why.
- Lines 193-196: Streamflow is strictly positive, so why did you decide to remove the strictly positive enforcement? This seems contrary to what should be done, so the reasoning needs to be clarified.
- Lines 206-215: This is a bit confusing. The authors first direct us to Schulz and Lerch (2022) and Schulz et al. (2024) to read about how predictive distributions are combined, but then in the next paragraph, they describe the process.
- Lines 245-276: This entire section is somewhat confusing, and the working mechanism of FRiCA is not clearly communicated. I suggest breaking it into smaller paragraphs to more clearly explain how it works, and, if possible, adding a conceptual figure illustrating the FRiCA mechanism.
- Figure 1: Looking at this PIT histogram, what stands out is that for tail flood events, BQN actually performs better than both CMAL and DRN, both of which behave very similarly to each other. For the most extreme flood events, the histogram mass is much higher than 1 for both CMAL and DRN, suggesting that extreme flood observations more frequently fall in the right tail of the forecast distribution, indicating that extremes are being underforecasted. Given that the main focus of this paper is capturing extreme flood events with probabilistic models, why is DRN considered the best probabilistic method based on average behavior rather than extreme performance? Based on this figure and the CRPS values provided, I am not entirely convinced that DRN is the best method.
- Lines 327-335: The authors argue that the DRN predictive distributions provide an advantage for extremes, but they cite that only 67% of the extreme flow observations fall within the 99th percentile bounds of the predicted distribution. While better than a deterministic model that strongly underestimates the peaks, the performance of the probabilistic bounds of the DRN model isn't great either. Overall, I would recommend the authors not overstate the performance of the probabilistic models for the most extreme events. This also extends to the title of the paper; it's not clear that probabilistic models 'are key' to improving flood predictions based on these results.
- Figure 2: This figure shows the limitations of the deterministic LSTM model and some advantage of the probabilistic model, but it is still not very informative. It is expected that by using a probabilistic LSTM, which can assign some probability mass to extreme events, performance would improve to some degree compared to the deterministic model. However, given that a large portion of the introduction focused on the saturation limit of LSTM, I was expecting more experimentation or analysis on that topic. This figure simply shows that the probabilistic LSTM can be somewhat better, but the saturation problem is still there. The extremes are still clearly underestimated, and most extreme events remain beyond the reach of even the probabilistic LSTM. Also, the figure shows the 99th percentile interval for the probabilistic LSTM, but it is unclear whether the model is assigning meaningful probability mass to those extremes. At a minimum, showing the bounds of the 25th to 75th percentile interval would be important here. And, given that the saturation limit was a major focus of the introduction and motivation for this study, I was expecting some analysis of this issue in the context of the probabilistic LSTM as well, but it is entirely absent.
- Typo: correct 'improvemed' to 'improvement'.
- Line 343 and elsewhere throughout the manuscript: LSTM_det is written as 'LSTMdet', without the subscript.
- Line 396: It might be useful to have a 1-line reminder about how hits, misses, and false alarms are defined
- Line 398-403: I’m not sure this text is necessary
- Lines 405-412: This gets at one of my main comments above. It's not clear based on this result that the FRiCA model actually provides meaningful benefit over simply using the 99th percentile of the DRN predictive distribution. Also, it seems to collapse to just using a different static percentile most of the time (the 95th percentile). So it's not clear what the benefit is over just calibrating a static quantile threshold to apply to the probabilistic predictions.
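To illustrate the PIT diagnostic behind the Figure 1 comment: the sketch below is entirely synthetic (a logistic predictive distribution that is too narrow for heavier-tailed Student-t errors), but it reproduces the pile-up of mass in the outer histogram bins that signals underforecast extremes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: predictive distribution is logistic(mu, 0.5) per case,
# while the true errors are wider and heavier-tailed (Student-t, 3 dof),
# mimicking an underdispersed forecast.
n = 5000
mu = rng.normal(size=n)
obs = mu + rng.standard_t(df=3, size=n)

# PIT value = predictive CDF evaluated at the observation
pit = 1.0 / (1.0 + np.exp(-(obs - mu) / 0.5))
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0), density=True)
# Density > 1 in the outer bins (a U-shape) means observations fall in the
# forecast tails too often, i.e., extremes are underforecast.
```

A well-calibrated forecast would give a flat histogram (density near 1 in every bin); the mass above 1 in the last bin is exactly the right-tail behavior described for CMAL and DRN.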
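On the interval-coverage point raised for Figure 2, empirical coverage of central predictive intervals can be read directly off predictive samples; a minimal synthetic sketch (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical predictive samples (rows: events, columns: draws) versus
# observations from a slightly wider truth distribution.
samples = rng.lognormal(0.0, 0.5, size=(1000, 500))
obs = rng.lognormal(0.0, 0.7, size=1000)

def interval_coverage(samples, obs, lo, hi):
    # Fraction of observations inside the central [lo, hi] quantile
    # interval, computed row-wise from the predictive draws.
    ql = np.quantile(samples, lo, axis=1)
    qh = np.quantile(samples, hi, axis=1)
    return float(np.mean((obs >= ql) & (obs <= qh)))

cov_50 = interval_coverage(samples, obs, 0.25, 0.75)  # nominal 50 %
cov_98 = interval_coverage(samples, obs, 0.01, 0.99)  # nominal 98 %
```

Reporting such empirical coverage for the 25th-75th band alongside the 99th percentile bound would make clear how much probability mass the model actually assigns near the extremes.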
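On calibrating a static quantile threshold: a grid search over quantile levels against a categorical skill score (CSI here) would be the natural baseline to compare FRiCA against. The sketch below is synthetic and all names are hypothetical; it only shows the shape of such a calibration.

```python
import numpy as np

rng = np.random.default_rng(2)

def csi(warn, flood):
    # Critical success index: hits / (hits + misses + false alarms)
    hits = int(np.sum(warn & flood))
    misses = int(np.sum(~warn & flood))
    false_alarms = int(np.sum(warn & ~flood))
    denom = hits + misses + false_alarms
    return hits / denom if denom else 0.0

# Hypothetical ensemble forecasts (rows: days, columns: members) and obs
n_days, n_members = 3000, 100
signal = rng.gamma(2.0, 1.0, size=n_days)
ens = signal[:, None] * rng.lognormal(0.0, 0.3, size=(n_days, n_members))
obs = signal * rng.lognormal(0.0, 0.3, size=n_days)

flood_threshold = np.quantile(obs, 0.98)   # top-2% of observed flows
flood = obs > flood_threshold

# Calibrate a single static quantile level by grid search
levels = [0.50, 0.90, 0.95, 0.99]
scores = {q: csi(np.quantile(ens, q, axis=1) > flood_threshold, flood)
          for q in levels}
best_level = max(scores, key=scores.get)
```

If the learned policy mostly collapses to one such level (e.g., the 95th percentile), this few-line baseline is the fair comparison point for quantifying FRiCA's added value.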
Citation: https://doi.org/10.5194/egusphere-2026-469-RC2
RC3: 'Comment on egusphere-2026-469', Jonathan Frame, 28 Mar 2026
Review: The need for uncertainty: why probabilistic LSTMs are key to improving flood predictions and enabling learned warning rules
This paper tests three different methods of predicting streamflow probabilities, and compares those against a deterministic LSTM. Interestingly, this paper demonstrates that high-probability predictions usually underestimate peak flows, but the high end of the low-probability predictions is more capable of matching peaks than the deterministic LSTM. This is further demonstrated with a hit-and-miss comparison, showing that the LSTM misses floods more often than not, yet the 99th percentile captures floods, except for "extreme floods", which are still outside the 99th percentile. This paper also includes a reinforcement learning-based method for translating streamflow probability into actionable flood warnings.
Paper organization: This paper is generally easy to read and easy to understand. There are a few sentences that are hard to follow, I’ll point those out in the line comments.
Novelty: This paper tests three methods for probabilistic streamflow predictions, and in the end, none of them turn out to sufficiently capture the most extreme events. CMAL had been used before, but I wonder whether the authors chose the other two methods almost at random, or whether there was consideration of their applicability to the problem. I am left wondering if probabilistic streamflow in general can't capture those extremes, or if it is just these methods. DRN was developed for weather and BQN for wind speed. I'm not suggesting more methods be added, but I think it is worth reflecting on whether there is a better method specifically for streamflow. Or would it make sense to develop a custom method? If not, is it a limitation of the LSTM itself? I think these would need to be addressed in order to answer research question 1 completely.
Research question 2 seems either 1) trivial or 2) poorly worded. This discussion piece on question 2 is also kind of strange. Particularly line 481-483. I’m not sure what is meant by this: “it illustrates that runoff generation at the catchment scale in large-sample datasets is not unique”.
Lines 38-40: LSTM outperforms standard calibration of static parameters for conceptual models. Actually, there is a slight correction: the conceptual model in Frame 2022 (including the corrigendum) never outperforms the LSTM. The NWM does, but we were not able to re-calibrate that without the extreme events, so it isn't really a fair comparison.
Lines 554-556: "This highlights that the primary benefit of probabilistic modeling does not lie in improving point-prediction accuracy, but rather in providing a structured and interpretable representation of predictive uncertainty." I'm not sure you can claim this as a general conclusion. This reads to me more as a specific interpretation of this study. For instance, I'd love to see 68% of peak streamflow observations within the standard deviation of your probability distribution. Just because this result doesn't achieve that doesn't mean it isn't a benefit of probabilistic modeling in general, if the objective is met. And again, on Line 557: "biggest value addition from the probabilistic models emerges in the upper tail of the discharge distribution" seems like a specific result of this study, not a general conclusion, but it is written as something general. I guess it should be "biggest value addition from THESE probabilistic models".
I trust that the code will be released by the authors, but I am surprised not to see it linked here already.
Citation: https://doi.org/10.5194/egusphere-2026-469-RC3
Model code and software
The need for uncertainty: why probabilistic LSTMs are key to improving flood predictions and enabling learned warning rules Sanika Baste https://doi.org/10.5281/zenodo.18385505
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 308 | 192 | 19 | 519 | 12 | 20 |
[A] Overall Assessment
This is an interesting and well-executed study. I personally learned a great deal while reading it. Strictly speaking, I don’t see any major concerns that would prevent publication. However, I encourage the authors to reflect on and further clarify the following points to help strengthen and polish the manuscript.
[B] Conceptual and Methodological Questions
[C] Readability Suggestion
The information presented is professional and technically solid. However, I suggest breaking up some of the longer paragraphs into smaller sub-paragraphs—particularly in sections such as the Results—to improve readability.
This is only my second time reviewing an EGU-style preprint, so this suggestion may partly reflect personal preference.
[D] Specific Comments
Line 10: What is the definition of “rare events”?
Line 14: “The deterministic LSTM underestimates more than 90% of observed values.” Is this conclusion derived from results obtained across all single-objective functions? What is the basis for this statement?
Line 34–35: “Although LSTMs often outperform conceptual and process-based models for the majority of the flow regime.” Could references be provided to support this claim?
Line 49–54: It may be helpful to move these experimental results to the Appendix.
Line 69–70: “Hydrological systems are, in principle, deterministic dynamical systems...” Could the authors clarify what theoretical framework or principle this statement refers to?
Line 71–73: The statement regarding one-forcing-to-many-responses may require references.
Line 111–112: Are there alternative modeling perspectives to this reasoning? Could the authors further explain why a reinforcement-learning–based decision module was chosen?
Section 2.2: Is it necessary to clarify whether the network is operating in an extrapolation setting (e.g., PUB)?
Line 146: Does “five-member ensemble” refer only to different random seeds?
Line 224–225: Why were 5,000 Monte Carlo samples selected?
Line 248: For readers less familiar with reinforcement learning, it would be helpful to clarify: Why is the decision policy parameterized by an LSTM? How is the decision-making policy defined in hydrologic terms?
Line 252: Why were 32 quantiles used?
Line 253–254: Could the authors elaborate on what a “single-step decision process” looks like in the hydrologic context?
Line 260–262: Why were +100 and –5000 chosen as reward values?
Section 2.5: This section contains many implementation details. An illustrative figure could significantly improve clarity.
[E] Performance Analysis
Line 281–288: Have the authors evaluated performance by grouping catchments according to functional traits or hydrologic regimes?
Line 310–313: The manuscript mentions that 9% of test data fall below the 1st quantile of BQN predictions. Is this behavior visible in any figure?
Line 325–326: Do the authors have insights into why results differ from Klotz et al. (2022) on CAMELS-US?
Section 3.2: It may strengthen the manuscript to further analyze how the three approaches learn (or fail to learn) uncertainty bounds during events.
For example: How does LSTM parameterization affect head-layer behavior? What are we learning mechanistically from this comparison?
Line 336 (Figure 3): Was any attempt made to improve the poor uncertainty bound of CMAL? Is this behavior associated with the model structure? How generalizable is this finding across locations?
Line 343: Were rising and falling limbs quantitatively separated during analysis?
Line 346: How does CMAL learn heteroscedasticity? Is this assumption embedded in the objective function?
Line 338–379: Are there insights regarding spatial and temporal performance differences? It may also be helpful to report deterministic LSTM performance per location rather than only median values.
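Regarding the heteroscedasticity question (Line 346): in likelihood-based heads, heteroscedasticity is embedded in the objective function because the scale is itself a predicted output. A Gaussian stand-in illustrates the mechanism below (CMAL uses a countable mixture of asymmetric Laplacians rather than a Gaussian, but the principle is the same); the function name is hypothetical.

```python
import numpy as np

def heteroscedastic_nll(y, mu, log_sigma):
    # Gaussian NLL with a predicted, input-dependent scale. Because the
    # scale is a network output, minimizing this loss lets the model learn
    # larger spread where errors are larger (e.g., high flows), so
    # heteroscedasticity is built into the objective function itself.
    sigma = np.exp(log_sigma)
    return (0.5 * np.log(2.0 * np.pi) + log_sigma
            + 0.5 * ((y - mu) / sigma) ** 2)

# Toy check: for a fixed absolute error, the loss-minimizing predicted scale
# matches that error magnitude, pushing the optimizer toward honest,
# input-dependent spread estimates rather than one global sigma.
grid = np.linspace(-3.0, 3.0, 601)
best_log_sigma = [grid[np.argmin(heteroscedastic_nll(0.0, e, grid))]
                  for e in (0.1, 1.0)]
```

The same trade-off between the `log_sigma` penalty and the scaled squared error is what any likelihood-based head, including a mixture one, exploits to learn state-dependent uncertainty.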
[F] Event Definition and Aggregation
Line 397: Please specify which two ARIs are used.
Line 399: Does “over 158 catchments” imply that 38 basins were excluded? What is the impact of including/excluding them?
Line 401–404: The event definition is unclear. It would be helpful to clearly compare this definition with previous literature and explain how flood categories are identified following BAFU (2024).
[G] Additional Technical Clarifications
Line 435–437: Was there any attempt to train likelihood-based models using higher-order moments?
Line 454–455: More detail on the role of noise-based regularization could strengthen the main argument.
Line 518–519: A reference supporting the statement about subjective risk choice in reinforcement learning would be useful.
Line 510–541: Given the technical complexity of reinforcement learning, expanding literature support in this section would improve rigor.