the Creative Commons Attribution 4.0 License.
The ability of LSTM to model snowmelt versus rainfall generated floods
Abstract. One of the most important skills of hydrological models is to simulate timing and magnitude of flood events. Long Short-Term Memory (LSTM) networks are currently among the most successful models for streamflow and flood prediction over large regions. In snow-influenced catchments, which typically comprise a minority in large-scale studies, floods are generated by two distinctly different processes, snowmelt and rainfall. The applicability of hydrological models in such regions is therefore dependent on their ability to represent both types of floods. Nevertheless, flood evaluations of LSTM taking different flood-generating processes into account are currently lacking. This study fills this gap by evaluating the ability of LSTM to model flood peak characteristics separately for snowmelt and rainfall generated floods. The trained LSTM model successfully simulated streamflow time series across the 103 evaluated catchments, with average NSE of 0.85 and average KGE of 0.87 over the unseen evaluation period. LSTM exhibited better performance in the majority of the catchments in terms of flood peak timing and magnitude for both rainfall and snowmelt generated floods when compared to the operational hydrological model in the region (HBV) used as a benchmark. Both models had a 24 pp higher percentage of correctly simulated peak days for rainfall generated floods as compared to snowmelt generated floods. LSTM outperformed HBV for a larger proportion of the catchments in terms of peak timing of rainfall generated events (83 %) as compared to snowmelt generated events (64 %). On the other hand, a larger proportion of the catchments were improved by LSTM for snowmelt generated events as compared to rainfall generated events when considering peak magnitudes. The largest improvements in peak magnitudes were found for rainfall generated events, in particular for catchments where HBV exhibited high (> 40 %) absolute errors.
Overall, our findings bring confidence that LSTM can improve hydrological services in regions subject to both snowmelt and rainfall generated floods.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-1056', Klaus Vormoor, 10 Apr 2026
AC1: 'Reply on RC1', Sigrid Joergensen Bakke, 05 May 2026
We thank the reviewer for his time and efforts in providing a thorough and constructive review, which helps strengthen our manuscript. Hereby we outline our plan to address his comments.
General comment 1 highlights false positive events as an additional important aspect when evaluating a model’s suitability for purposes such as flood warning. We agree and will add an analysis of floods that were simulated but not observed, quantifying the proportion of events where this occurs separately for the two models. To align with our study design, we will do this analysis independently for the different flood types.
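A minimal sketch of how such a false-positive share could be counted, assuming floods are represented as lists of peak dates and that a simulated peak counts as matched when an observed peak lies within a tolerance window. The function name, the one-day tolerance, and the example dates are illustrative assumptions and not the procedure of the manuscript.

```python
from datetime import date, timedelta


def false_positive_share(simulated_peaks, observed_peaks, tolerance_days=1):
    """Share of simulated flood peaks with no observed peak within the tolerance.

    Hypothetical helper: the matching rule (absolute date difference) and the
    default tolerance are assumptions made for illustration only.
    """
    if not simulated_peaks:
        return 0.0
    window = timedelta(days=tolerance_days)
    unmatched = [
        sim for sim in simulated_peaks
        if not any(abs(sim - obs) <= window for obs in observed_peaks)
    ]
    return len(unmatched) / len(simulated_peaks)


# Invented peak dates for one catchment and one flood type.
observed = [date(2015, 5, 20), date(2016, 10, 3)]
simulated = [date(2015, 5, 21), date(2016, 7, 14), date(2016, 10, 3), date(2017, 6, 1)]
share = false_positive_share(simulated, observed)
```

Running the same function once per model and per flood type would give the separate proportions described above.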
General comment 2 concerns the possibility of defining flood generating processes using snowmelt information from the applied hydrological models. In this way, ref. reviewer’s specific comment 8, one can evaluate the ability of the models to correctly simulate the specific flood type as classified. We agree that this is an interesting topic, and existing methods of interpretable AI make such analyses possible. As suggested by the reviewer, we will address this topic in the discussion.
General comment 3 (and specific comment 10) points out that the different objective functions applied may give one model an advantage over the other in simulating peak flow characteristics. We agree that it can have an impact, and will follow up by 1) evaluating both models in terms of the objective function used in calibration of HBV, and 2) calibrating HBV for a selection of catchments using sum of squared error (SSE) as objective function (which gives the same optimum as NSE for single catchment calibration).
1) In our first draft, we evaluated both models using NSE, which is also the objective function of LSTM. Analogously, we will include HBV’s objective function (sum of squared error plus squared volume error) as an additional evaluation metric. In both the calibration period and the evaluation period, LSTM obtained a lower (i.e. better) score than HBV in the majority (60%) of the catchments. We will add this result and comment on it in the revised manuscript. Strong LSTM performance under HBV’s objective function indicates that its advantage is not solely an artifact of being optimized for NSE.
2) We plan to calibrate HBV using SSE as the objective function, to quantitatively assess the influence of the objective function on HBV’s ability to simulate peak flow characteristics. Key results will be presented and commented on in the revised manuscript.
We specifically chose to use existing HBV simulations from the national operational flood forecasting service to be in accordance with our main objective (ref L49): “The main objective is to evaluate LSTM’s potential for operational use in snow-influenced regions…”. We agree that this choice, and in particular the associated objective functions, requires more careful treatment in the analysis, and our revision will assess more explicitly whether the comparison and the resulting conclusions are sensitive to the different objective functions. In general, we note that a fully neutral comparison between the two models is not possible, due to fundamental differences. Other objective functions and calibration procedures may be more suitable for locally calibrated process-based models as opposed to globally calibrated deep learning models.
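The equivalence between SSE and NSE noted in the reply to general comment 3 follows from the standard definition of NSE. It is written here for a single catchment with observed flow $Q_{\mathrm{obs},t}$, simulated flow $Q_{\mathrm{sim},t}(\theta)$ and parameter set $\theta$; this notation is an assumption made for illustration and is not taken from the manuscript.

```latex
\mathrm{NSE}(\theta)
  = 1 - \frac{\sum_{t=1}^{T}\left(Q_{\mathrm{sim},t}(\theta) - Q_{\mathrm{obs},t}\right)^{2}}
             {\sum_{t=1}^{T}\left(Q_{\mathrm{obs},t} - \overline{Q}_{\mathrm{obs}}\right)^{2}}
  = 1 - \frac{\mathrm{SSE}(\theta)}{C},
\qquad
\arg\max_{\theta}\,\mathrm{NSE}(\theta) = \arg\min_{\theta}\,\mathrm{SSE}(\theta)
```

The denominator $C > 0$ depends only on the observations of that catchment, so it is constant during calibration. When several catchments are pooled, $C$ differs between catchments and the two criteria weight catchments differently, which is why the equivalence is limited to single catchment calibration.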
First technical correction suggests replacing “most successful models for streamflow and flood prediction over large regions” with “most successful deep-learning models for streamflow and flood prediction over large regions”. We wish to keep our original formulation, as LSTM is the most successful over large regions also when compared to process-based models (demonstrated by several large-scale studies, e.g. references in L24-27).
Additional specific comments, questions and technical corrections are appreciated, and will be addressed by us in the revisions (we foresee no issues).
Citation: https://doi.org/10.5194/egusphere-2026-1056-AC1
RC2: 'Comment on egusphere-2026-1056', Anonymous Referee #2, 12 Apr 2026
Bakke et al. assess the ability of LSTM to model snowmelt and rainfall generated floods using the HBV model as a benchmark in 103 catchments across Norway. They find that LSTM outperforms HBV in terms of simulating both the timing and magnitude of flood peaks for rainfall induced events and snowmelt events. This is a very timely and relevant contribution, and the manuscript is overall well-written and the results are well-presented. Thus, I have only a few comments the authors may want to consider before publication.
General comments
- The authors assess the performance of the LSTM for snowmelt, rainfall, and mixed events separately. Although the definition of the different event types is briefly described in the methods section 2.6, I would recommend to expand on the procedure of defining the flood generation processes. I assume that boundaries between the different processes are generally difficult to define, and it would be interesting to see how a change in the distribution of processes (e.g., more mixed events, less rainfall events, …) may influence the predictions obtained with the LSTM in each group, e.g., when looking at Figure 5 (or if this has an impact at all?). I do not expect the authors to add a detailed sensitivity analysis, but recommend adding this to the discussion section.
- In my view, the discussion is quite methods-oriented, which is well-reasoned in itself. However, I am missing potential (physical) explanations on why both models performed differently for the different event types, for example, potential reasons for the differences in timing errors between snowmelt and rainfall generated events? I understand that interpreting the results obtained with the LSTM can be more challenging compared to the HBV model. Nevertheless, I would encourage the authors to include potential explanations into the discussion section, including suitable references of previous studies.
Minor comments
L 42-45: This sounds as if you trained the LSTM separately for snowmelt and rainfall events, but as far as I understood, you trained it over all catchments and evaluated its predictions separately. Please clarify in the text.
L 65: In my opinion, the data and method section is excessively long. The authors may want to consider creating a Supporting Information document and moving parts of this section to that document, e.g., model details in sections 2.3, 2.4, 2.5.
L 117: The 103 stations have been repeated many times; consider consolidating these parts.
L 196-199: This paragraph needs references.
L 227-231: I do not think that it is necessary to mention the periods before 1994 here, as 15 years can already be considered a sufficient length.
Figure 5: Why do the authors think the peak timing of snowmelt events is harder to predict than mixed events? Intuitively, one could expect that it would be the other way around? This may be added to the discussion (see also general comment).
L 368: Any other references here?
L 390: This sentence needs a reference.
Citation: https://doi.org/10.5194/egusphere-2026-1056-RC2
AC2: 'Reply on RC2', Sigrid Joergensen Bakke, 05 May 2026
We thank the anonymous reviewer for the positive and constructive feedback, which will help improve our manuscript. Below, we outline our plan to address the comments.
General comment 1 highlights how the thresholds chosen to classify flood events may influence the results. We agree with the comment, and plan to 1) visualise the distribution of fractional rainfall contribution of all flood events, and 2) produce equivalents to Figures 5 and 6 with two other thresholds (at least one of them widening the mixed event group). This will form the basis for commenting on the sensitivity of event type definition in the discussion section.
General comment 2 (and specific comment about Fig. 5) suggests including potential explanations for why each model performs differently (e.g. in peak timing) for the different event types. We agree this is an interesting topic for discussion, and plan to expand on the existing comments in the discussion chapter (L359-L364).
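The threshold sensitivity described above can be sketched as follows, assuming each event is summarised by the fraction of rainfall in the total water input (rainfall plus snowmelt). The function names, the threshold values (1/3 and 2/3, then 0.2 and 0.8) and the example fractions are hypothetical and are not the values used in the manuscript.

```python
def classify_flood_event(rain_fraction, lower, upper):
    """Label an event from its rainfall fraction of total water input.

    Hypothetical rule: at or below `lower` is snowmelt, at or above `upper`
    is rainfall, anything in between is mixed.
    """
    if not 0.0 <= rain_fraction <= 1.0:
        raise ValueError("rain_fraction must lie in [0, 1]")
    if rain_fraction >= upper:
        return "rainfall"
    if rain_fraction <= lower:
        return "snowmelt"
    return "mixed"


def class_counts(rain_fractions, lower, upper):
    """Count events per class for one pair of thresholds."""
    counts = {"snowmelt": 0, "mixed": 0, "rainfall": 0}
    for fraction in rain_fractions:
        counts[classify_flood_event(fraction, lower, upper)] += 1
    return counts


# Invented rainfall fractions for six events.
fractions = [0.05, 0.25, 0.40, 0.55, 0.70, 0.95]
baseline = class_counts(fractions, lower=1 / 3, upper=2 / 3)
widened = class_counts(fractions, lower=0.2, upper=0.8)  # wider mixed group
```

Comparing `baseline` and `widened` shows how events migrate into the mixed group as the thresholds move apart, which is the effect the alternative versions of Figures 5 and 6 are meant to expose.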
We agree with all remaining specific comments.
Citation: https://doi.org/10.5194/egusphere-2026-1056-AC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 916 | 505 | 69 | 1,490 | 95 | 77 |
The study by Bakke et al. evaluates an LSTM model with regard to its ability to simulate snowmelt- vs. rainfall-generated flood events and examines the potential for its operational use in flood forecasting in snow-influenced regions such as Norway. The manuscript is well structured and well written, and the selection and quality of the figures are excellent. Testing deep learning approaches like LSTM to simulate streamflow/floods and comparing them to a benchmark model is nothing new. However, the innovation of this study lies in the specific consideration of flood generating processes for this model evaluation and comparison. As a result, this study could make a relevant contribution to HESS and should be considered for publication once some incomplete and unclear points have been resolved.
GENERAL COMMENTS:
SPECIFIC COMMENTS:
Introduction or Data: Since the study focuses on Norway, the relevance of snowmelt, rainfall and a combination of both for flood generation should be better demonstrated for this region.
Introduction: LSTM is one deep learning approach among others. The literature review should be a bit broader in this regard. Since LSTMs often perform better than other methods, this actually provides a good reason for why LSTM is applied in this study.
L42 I am at least aware of the aforementioned study that applies an LSTM to simulate rain-on-snow floods in Germany.
L87 Please indicate the percentage or number of catchments that are partly glacier-covered.
L92 Comparability of the SeNorge snow model and the output of the HBV snow routine: Here, I was wondering if it would make sense to evaluate the ability of the benchmark model to correctly model the specific flood type as classified by the approach in section 2.6 using the data described in 2.2 and the output of the catchment specific HBV. On the other hand, this cannot be tested for the LSTM. Same direction as my GC (2).
Table 1 Just for curiosity: has it been tested how the LSTM performs when using only subsets of these catchment attributes? Alternatively, could specific attributes be identified as important predictors? Sometimes this leads to some hardly explainable surprises.
L126 Was the same objective function used to optimize the LSTM? If not, this could have an impact on the ability of simulating peak flows.
Section 2.4 Is the LSTM applied on a gridded basis or for catchment averages?
Sections 2.4 and 2.5 I suggest moving the last paragraph of 2.4 to 2.5 and the first paragraph of 2.5 to 2.4.
Section 2.6 The approach of detecting flood generating processes is comparatively simple but effective. Still, I suggest putting this into context of other more sophisticated approaches of flood type differentiation.
L269 I am a bit surprised about the low number of mixed floods. Were the same thresholds regarding snowmelt- and rainfall contribution used as in the cited reference? Generally, it is worth adding one or two more sentences on this procedure in section 2.6 so that the reader does not need to search for the reference.
L281 For hydrological models, the goodness-of-fit is often higher due to low model errors during winter low flows, particularly in snow-dominated catchments. Is this also the case for LSTM? According to Figure 4 (c), probably not as pronounced as for HBV.
Figure 5 should indicate the number of events per class either in the bar labels or in the figure caption.
Figure 10. Station numbers in the x-axis might be confusing. I suggest numbering from 1-103, or indicating that numbers are station numbers or catchment IDs, respectively.
L389 This sentence needs reformulation. What aspects? Maybe “The relevance of ... differ between”. In addition, these “aspects” may also differ for different spatial and temporal scales, i.e., flood type dominance, relevance of catchment (storage) characteristics, antecedent conditions, flood durations etc.
TECHNICAL CORRECTIONS:
L2 I suggest writing “… most successful deep learning models…”
L114 Abbreviation NVE has already been introduced
Check the use of “percent” vs. “percentage”. For example, I think it is called mean absolute percentage error.
When referred to a figure in the main text, it is “Figure #” not “Fig. #”; “(Fig. #)” is correct.
L351 “has previously shown”
L365 I suggest shortening this to “…for different flood types…”
L420 “errors were notably smaller” rather than “…better”
L511 Reference Langsholt is missing an “E.”
References:
Czakay, C., Tarasova, L., and Ahrens, B.: Composition, frequency and magnitude of future rain-on-snow floods in Germany, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2025-3532, 2025.
De la Fuente, L. A., Ehsani, M. R., Gupta, H. V., and Condon, L. E.: Toward interpretable LSTM-based modeling of hydrological systems, Hydrol. Earth Syst. Sci., 28, 945–971, https://doi.org/10.5194/hess-28-945-2024, 2024.