the Creative Commons Attribution 4.0 License.
Unveiling the Limits of Deep Learning Models in Hydrological Extrapolation Tasks
Abstract. Long Short-Term Memory (LSTM) networks have shown strong performance in rainfall-runoff modelling, often surpassing conventional hydrological models in benchmark studies. However, recent studies raise questions about their ability to extrapolate, particularly under extreme conditions that exceed the range of their training data. This study examines the performance of a stand-alone LSTM trained on 196 catchments in Switzerland when subjected to synthetic design precipitation events of increasing intensity and varying duration. The model’s response is compared to that of a hybrid model and evaluated against hydrological process understanding. Our study reiterates that the stand-alone LSTM is not capable of predicting discharge values above a theoretical limit, and we show that this limit (73 mm d⁻¹) is below the range of the data the model was trained on (183 mm d⁻¹ when trained on CAMELS-CH). Furthermore, the LSTM exhibits a concave runoff response under extreme precipitation, indicating that event runoff coefficients decrease with increasing design precipitation, a phenomenon not observed in the hybrid model used as a benchmark. We show that saturation of the LSTM cell states alone does not fully account for this characteristic behavior, as the LSTM does not reach full saturation, particularly for the 1-day events. Instead, its gating structures prevent new information about the current extreme precipitation from being incorporated into the cell states. Adjusting the LSTM architecture (for instance, by increasing the number of hidden states) and/or using a larger, more diverse training dataset can help mitigate the problem. However, these adjustments do not guarantee improved extrapolation performance, and the LSTM continues to predict values below the range of the training data or to show unfeasible runoff responses during the 1-day design experiments.
Despite these shortcomings, our findings highlight the inherent potential of stand-alone LSTMs to capture complex hydro-meteorological relationships. We argue that more robust training strategies and model configurations could address the observed limitations, preserving the promise of stand-alone LSTMs for rainfall-runoff modelling.
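The "theoretical limit" mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard LSTM whose hidden states h = o * tanh(c) feed a single linear output layer; the abstract does not spell out the architecture, so this is an assumption, and the weights, bias, and cell-state values below are hypothetical. Because every hidden state lies in [-1, 1], the output in normalised target space cannot exceed the bias plus the sum of the absolute head weights, and it falls short of that bound whenever the cell states are not saturated or the output gates are not fully open.

```python
import math

def output_bound(head_weights, head_bias):
    """Largest value of y = w . h + b when every |h_i| <= 1."""
    return head_bias + sum(abs(w) for w in head_weights)

def hidden_state(cell_states, output_gates):
    """Standard LSTM hidden state: h_i = o_i * tanh(c_i)."""
    return [o * math.tanh(c) for c, o in zip(cell_states, output_gates)]

def head_output(hidden, head_weights, head_bias):
    """Linear output layer applied to the hidden state."""
    return head_bias + sum(w * h for w, h in zip(head_weights, hidden))

# Hypothetical head parameters (not taken from the study).
weights = [0.8, -0.5, 1.2]
bias = 0.3
bound = output_bound(weights, bias)

# Fully saturated cell states, aligned with the sign of each weight,
# with fully open output gates: the output reaches the bound.
saturated = head_output(
    hidden_state([50.0, -50.0, 50.0], [1.0, 1.0, 1.0]), weights, bias
)

# Unsaturated cell states and partly closed output gates:
# the output stays clearly below the bound.
partial = head_output(
    hidden_state([1.0, -1.0, 1.0], [0.9, 0.9, 0.9]), weights, bias
)

print(bound, saturated, partial)
```

The bound applies in the normalised space of the target variable; converting it to mm d⁻¹ would require the scaling used during training, which is not reproduced here.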
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: final response (author comments only)
CC1: 'Comment on egusphere-2025-425', Baoying Shan, 17 Feb 2025
Do you think the LSTM could perform better if we added a binary variable to the inputs (such as [0 0 0 0 0 0 1 1 ...], where 1 marks extreme precipitation)?
Citation: https://doi.org/10.5194/egusphere-2025-425-CC1
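One way the binary flag suggested in this comment might be constructed is sketched below. The threshold rule (a nearest-rank empirical quantile), the function names, and the precipitation values are all hypothetical; neither the comment nor the manuscript specifies how "extreme" would be defined.

```python
def empirical_quantile(values, q):
    """Nearest-rank style quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def extreme_flag(precip, threshold):
    """Return 1 for time steps at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in precip]

# Hypothetical daily precipitation series in mm d^-1 (illustrative only).
precip = [0.0, 2.0, 5.0, 1.0, 0.0, 40.0, 85.0, 3.0]

# Assumed definition of "extreme": at or above the 75th percentile.
threshold = empirical_quantile(precip, 0.75)
flags = extreme_flag(precip, threshold)

# Each time step then carries two input features: (precipitation, flag).
inputs = list(zip(precip, flags))
print(threshold, flags)
```

In practice the threshold would have to be derived from the training period only, so that the flag does not leak information from the evaluation data.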
AC1: 'Reply on CC1', Sanika Baste, 18 Feb 2025
Publisher’s note: this comment was edited on 26 February 2025. The following text is not identical to the original comment, but the adjustments were minor without effect on the scientific meaning.
Dear Baoying Shan,
Thank you for your comment on our manuscript, egusphere-2025-425. We agree that it would be worthwhile to check whether flagging precipitation information by means of the suggested binary variable improves performance. We are also currently investigating different strategies to overcome the limitations shown. However, the primary focus of our current study is to highlight the characteristic behavior of LSTMs within the described empirical setting. In this context, we have chosen to concentrate on demonstrating the effects of increasing the model size (256 hidden states instead of 64) and/or training it on more diverse data. We see your suggestion as a valuable direction for further improving model performance and look forward to exploring such strategies, including the one you proposed, in future research.
Sincerely,
Sanika Baste, on behalf of all co-authors
Citation: https://doi.org/10.5194/egusphere-2025-425-AC1
RC1: 'Comment on egusphere-2025-425', Basil Kraft, 14 Mar 2025
Dear Authors,
I have read your manuscript with great interest. The study is well-conducted, and the results are clearly presented. The findings are highly relevant and raise important questions about the generalization capabilities of LSTM models in hydrological applications.
I have some major and minor remarks that I believe will help improve the manuscript. Please find a short summary along with major and minor remarks in the attached file.
Kind regards,
Basil Kraft
- AC2: 'Reply on RC1', Sanika Baste, 10 Apr 2025
RC2: 'Comment on egusphere-2025-425', Anonymous Referee #2, 14 Apr 2025
General comment
This paper compares a stand-alone LSTM model with a hybrid HBV-LSTM model on the CAMELS-CH dataset. It also examines the impact of training the stand-alone LSTM on the CAMELS-US dataset alongside CAMELS-CH, and of using 256 nodes instead of 64. The main focus of the discussion is on the ability of both models (stand-alone LSTM and hybrid) to show a linear pattern between simulated peak flows and rainfall when applying "synthetic" rainfall far higher than the observed values. The results clearly show that the stand-alone LSTM cannot exhibit this linear pattern, because simulated discharges tend towards a limit value, as predicted by a previous study. More interestingly, they clearly show that this observed limit is far lower than the theoretical limit expected by the authors.
In my opinion, this paper is very interesting as it is the first to clearly and honestly address the limitations of LSTM models in hydrology.
I recommend publication after revision.
Main comments
My main comments are:
1) Even though it is outside the scope of the paper, I would have appreciated a "deeper" and "fairer" comparison between the stand-alone LSTM model and the hybrid HBV-LSTM model. The paper is short (only 3 figures of results in the main text), there is room for that. My main criticism is that no hyperparameter (HP) tuning is done for either model. The HP values are simply taken from previous studies. I think that the results could be different if a proper fine tuning was done for each model.
2) For analysis, it would also be very interesting to see the results for a single HBV model as a benchmark, which is very "cheap" to calibrate locally. Is there an improvement, and is it "worth" the huge amount of data and GPU time required? For example, the authors added CAMELS-US data to their CAMELS-CH training dataset and moved from 64 to 256 nodes, which would have required a considerable amount of additional resources, but they don't show the corresponding improvement.
3) As the paper focuses on extremes, I also think that the evaluation against the observed runoff should not be limited to the NSE criterion as in Fig. 1 (which is the only figure presenting model performance), but should include a deeper analysis, including, for example, signatures calculated on flood events.
4) The same comment applies to the second part of the results (Figs. 2 and 3, using synthetic rainfall): only 1 flood for 3 catchments is shown (a little more in the appendix), whereas the authors have thousands of examples. A synthetic metric should be found that "summarises" the different observed behaviours (between catchments, but also for the same catchment under different conditions). A "visual" analysis of a few examples, as in this paper, is a first step towards forming initial hypotheses. But these hypotheses should then be tested in depth.
5) This last point (the need for a synthetic metric that allows a "deep" analysis) leads me to my main comment. The authors don't clearly explain why, from a hydrological point of view, peak discharge should increase linearly with extreme rainfall. I fully agree with this, and even if it seems obvious, I think it would be valuable to anchor the paper with more basic hydrological references. In terms of synthetic metrics, I would, for example, calculate a regression coefficient between peak discharge and synthetic rainfall and see how it changes as a function of rainfall, as in the paper, but also as a function of the initial moisture content before a flood and/or the runoff coefficient. I would also not look at the results flood by flood, but try to find a graphical representation of all floods and catchments together.
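The regression-based metric proposed in point 5 could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, and the precipitation and peak-discharge values are invented for demonstration, not taken from the manuscript. An overall least-squares slope summarises the response, while slopes between consecutive design events show whether the response flattens (a concave response) as rainfall increases.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def local_slopes(x, y):
    """Slope between each pair of consecutive design events."""
    return [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]

def slopes_decline(slopes, tol=1e-9):
    """True if every local slope is strictly smaller than the previous one."""
    return all(b < a - tol for a, b in zip(slopes, slopes[1:]))

# Invented design precipitation (mm d^-1) and peak discharges (mm d^-1).
design_precip = [50.0, 100.0, 150.0, 200.0, 250.0]
linear_peaks = [20.0, 60.0, 100.0, 140.0, 180.0]   # linear response
flattening_peaks = [20.0, 45.0, 60.0, 68.0, 70.0]  # concave response

for name, peaks in [("linear", linear_peaks), ("flattening", flattening_peaks)]:
    slopes = local_slopes(design_precip, peaks)
    print(name, ols_slope(design_precip, peaks), slopes, slopes_decline(slopes))
```

Computing these slopes per catchment and per initial moisture state, then plotting their distribution, would be one way to summarise all floods and catchments together rather than inspecting individual events.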
Minor comments
L100: Why did you not perform hyperparameter tuning? (An LSTM expert once told me that hyperparameter tuning is absolutely required in any case, and that if "hydrologists" don't have the necessary GPU resources, they should not use LSTMs.)
L200: You should give more details on model performance, for instance using flood signatures.
L224: The results for the 256-node LSTM and/or the training using CAMELS-US should be presented in Fig. 1 and discussed. Does this huge amount of additional data improve model performance?
L235: You should link this more clearly to basic hydrological processes, such as soil saturation and the effect of initial moisture conditions.
Figures 2 and 3: the term "observation" is misleading. There are no observed discharges in these figures.
L260: This statement is supported by only 1 flood for 3 catchments. You should try to demonstrate it using many more discharge simulations (... which you have).
L299: "Extreme hydrological events often coincide with distinct regime shifts": I fully agree, but could you explain what you mean to a "non-hydrologist", in terms of the processes involved?
Citation: https://doi.org/10.5194/egusphere-2025-425-RC2