The ability of LSTM to model snowmelt versus rainfall generated floods

Bakke, Sigrid Jørgensen; Barna, Danielle Marie; Engeland, Kolbjørn; Kolberg, Sjur Anders; Nordeide, Sunniva

doi:10.5194/egusphere-2026-1056

Preprints

06 Mar 2026

Abstract. One of the most important skills of hydrological models is to simulate the timing and magnitude of flood events. Long Short-Term Memory (LSTM) networks are currently among the most successful models for streamflow and flood prediction over large regions. In snow-influenced catchments, which typically comprise a minority in large-scale studies, floods are generated by two distinctly different processes, snowmelt and rainfall. The applicability of hydrological models in such regions is therefore dependent on their ability to represent both types of floods. Nevertheless, flood evaluations of LSTM taking different flood-generating processes into account are currently lacking. This study fills this gap by evaluating the ability of LSTM to model flood peak characteristics separately for snowmelt and rainfall generated floods. The trained LSTM model successfully simulated streamflow time series across the 103 evaluated catchments, with an average Nash–Sutcliffe efficiency (NSE) of 0.85 and an average Kling–Gupta efficiency (KGE) of 0.87 over the unseen evaluation period. LSTM exhibited better performance in the majority of the catchments in terms of flood peak timing and magnitude for both rainfall and snowmelt generated floods when compared to the operational hydrological model in the region (HBV) used as a benchmark. For both models, the percentage of correctly simulated peak days was 24 percentage points higher for rainfall generated floods than for snowmelt generated floods. LSTM outperformed HBV for a larger proportion of the catchments in terms of peak timing of rainfall generated events (83 %) as compared to snowmelt generated events (64 %). On the other hand, a larger proportion of the catchments were improved by LSTM for snowmelt generated events as compared to rainfall generated events when considering peak magnitudes. The largest improvements in peak magnitudes were found for rainfall generated events, in particular for catchments where HBV exhibited high (> 40 %) absolute errors. Overall, our findings bring confidence that LSTM can improve hydrological services in regions subject to both snowmelt and rainfall generated floods.
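
For readers unfamiliar with the two skill scores quoted in the abstract, the following is a minimal sketch of how NSE and KGE are commonly computed from paired observed and simulated streamflow series. It assumes the original (2009) KGE formulation based on correlation, variability ratio, and bias ratio; the page does not state which KGE variant the preprint uses, and the function names are illustrative only.

```python
import numpy as np


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    sse = np.sum((sim - obs) ** 2)
    return 1.0 - sse / np.sum((obs - obs.mean()) ** 2)


def kge(obs, sim):
    """Kling-Gupta efficiency (assumed 2009 form): 1 is a perfect fit."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Both functions expect gap-free series of equal length; handling of missing observations is left out of this sketch.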

Received: 24 Feb 2026 – Discussion started: 06 Mar 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-1056', Klaus Vormoor, 10 Apr 2026
The study by Bakke et al. evaluates an LSTM model with regard to its ability to simulate snowmelt- vs. rainfall-generated flood events and examines the potential for its operational use in flood forecasting in snow-influenced regions such as Norway. The manuscript is well structured and well written, and the selection and quality of the figures are excellent. Testing deep learning approaches like LSTM to simulate streamflow/floods and comparing them to a benchmark model is nothing new. However, the innovation of this study lies in the specific consideration of flood generating processes for this model evaluation and comparison. As a result, this study could make a relevant contribution to HESS and should be considered for publication once some incomplete and unclear points have been resolved.
GENERAL COMMENTS:
The authors state that the main objective of this study is to evaluate the potential of LSTM for operational use in flood simulations and flood forecasting in snow-influenced regions. However, the model is only tested against observed peak flow events (i.e., events that actually occurred). How does the LSTM perform compared to the HBV model in terms of simulating false positive events (i.e., floods that were simulated but not observed)? I think this is another important aspect which determines the model’s suitability for different purposes such as flood warning. A confusion matrix comparing both models could provide some insights in this regard.

The advantage of a process-orientated hydrological model compared to LSTM is that the flood generating processes can be directly derived from the model output using the described approach since HBV also simulates snowmelt. LSTM and other deep learning approaches are black-box models and do not support this process mapping. There are a few approaches that attempt to combine LSTM with explainable AI (e.g. De la Fuente et al., 2024; Czakay et al., 2025 and references within) to better capture the hydrometeorological process relationships. I think that AI approaches for hydrological applications will evolve in this direction. So, if not by extending this study in this direction, this topic should at least be addressed in the discussion.

It seems that the HBV model and the LSTM have been calibrated/trained by minimizing different objective functions (sum of squared errors + squared total volume errors vs. NSE). This may impact the relative ability of both approaches to reproduce peak flow events. For the comparison of the LSTM against the benchmark HBV model this could be a serious issue. It needs to become clear that LSTM does not have an advantage in modelling flood events due to a better and more specific training for simulating high discharge compared to the benchmark model.

SPECIFIC COMMENTS:
Introduction or Data: Since the study focuses on Norway, the relevance of snowmelt, rainfall and a combination of both for flood generation should be better demonstrated for this region.
Introduction: LSTM is one deep learning approach among others. The literature review should be a bit broader in this regard. Since LSTMs often perform better than other methods, this actually provides a good reason for why LSTM is applied in this study.
L42 I am at least aware of the aforementioned study (Czakay et al., 2025), which applies an LSTM to simulate rain-on-snow floods in Germany.
L87 Please indicate the percentage or number of catchments that are partly glacier-covered.
L92 Comparability of the SeNorge snow model and the output of the HBV snow routine: Here, I was wondering if it would make sense to evaluate the ability of the benchmark model to correctly model the specific flood type as classified by the approach in section 2.6 using the data described in 2.2 and the output of the catchment specific HBV. On the other hand, this cannot be tested for the LSTM. Same direction as my GC (2).
Table 1 Just out of curiosity: has it been tested how the LSTM performs when using only subsets of these catchment attributes? Alternatively, could specific attributes be identified as important predictors? Sometimes this leads to some hardly explainable surprises.
L126 Was the same objective function used to optimize the LSTM? If not, this could have an impact on the ability of simulating peak flows.
Section 2.4 Is the LSTM applied on a gridded basis or for catchment averages?
Sections 2.4 and 2.5 I suggest moving the last paragraph of 2.4 to 2.5 and the first paragraph of 2.5 to 2.4.
Section 2.6 The approach of detecting flood generating processes is comparatively simple but effective. Still, I suggest putting this into context of other more sophisticated approaches of flood type differentiation.
L269 I am a bit surprised about the low number of mixed floods. Were the same thresholds regarding snowmelt- and rainfall contribution used as in the cited reference? Generally, it is worth adding one or two more sentences on this procedure in section 2.6 so that the reader does not need to search for the reference.
L281 For hydrological models, the goodness-of-fit is often higher due to low model errors during winter low flows, particularly in snow-dominated catchments. Is this also the case for LSTM? According to Figure 4 (c), probably not as pronounced as for HBV.
Figure 5 should indicate the number of events per class either in the bar labels or in the figure caption.
Figure 10. Station numbers in the x-axis might be confusing. I suggest numbering from 1-103, or indicating that numbers are station numbers or catchment IDs, respectively.
L389 This sentence needs reformulation. What aspects? Maybe “The relevance of ... differ between”. In addition, these “aspects” may also differ for different spatial and temporal scales, i.e., flood type dominance, relevance of catchment (storage) characteristics, antecedent conditions, flood durations etc.
TECHNICAL CORRECTIONS:
L2 I suggest writing “… most successful deep learning models…”
L114 Abbreviation NVE has already been introduced
Check the use of “percent” vs. “percentage”. For example, I think it is called mean absolute percentage error.
When referring to a figure in the main text, it is “Figure #” not “Fig. #”; “(Fig. #)” is correct.
L351 “has previously shown”
L365 I suggest shortening this to “…for different flood types…”
L420 “errors were notably smaller” rather than “…better”
L511 Reference Langsholt is missing an “E.”
References:
Czakay, C., Tarasova, L., and Ahrens, B.: Composition, frequency and magnitude of future rain-on-snow floods in Germany, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2025-3532, 2025
De la Fuente, L. A., Ehsani, M. R., Gupta, H. V., and Condon, L. E.: Toward interpretable LSTM-based modeling of hydrological systems, Hydrol. Earth Syst. Sci., 28, 945–971, https://doi.org/10.5194/hess-28-945-2024, 2024.
Citation: https://doi.org/10.5194/egusphere-2026-1056-RC1
- AC1: 'Reply on RC1', Sigrid Joergensen Bakke, 05 May 2026
  
  We thank the reviewer for his time and efforts in providing a thorough and constructive review, which helps strengthen our manuscript. Hereby we outline our plan to address his comments.
  General comment 1 highlights false positive events as an additional important aspect of the evaluation of a model’s suitability for purposes such as flood warning. We agree and will add an analysis of floods that were simulated but not observed, and quantify the proportion of events where this occurs separately for the two models. To align with our study design, we will do this analysis independently for the different flood types.
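
  A minimal sketch of the kind of contingency (confusion) table the reviewer refers to is given below. It is an illustration only and not the analysis planned for the revision: it compares daily threshold exceedances rather than detected flood events, and the function names and the flood threshold are hypothetical.

```python
import numpy as np


def flood_contingency(obs, sim, threshold):
    """Count daily hits, misses, false alarms and correct negatives.

    A day counts as a flood day when streamflow is at or above `threshold`
    (a hypothetical, user-chosen value in the same unit as the series).
    """
    obs_flood = np.asarray(obs, dtype=float) >= threshold
    sim_flood = np.asarray(sim, dtype=float) >= threshold
    return {
        "hits": int(np.sum(obs_flood & sim_flood)),
        "misses": int(np.sum(obs_flood & ~sim_flood)),
        "false_alarms": int(np.sum(~obs_flood & sim_flood)),
        "correct_negatives": int(np.sum(~obs_flood & ~sim_flood)),
    }


def false_alarm_ratio(table):
    """Share of simulated flood days that were not observed (NaN if none simulated)."""
    simulated = table["hits"] + table["false_alarms"]
    return table["false_alarms"] / simulated if simulated else float("nan")
```

  Running the same function on the LSTM and HBV simulations of one catchment, and separately per flood type, would give directly comparable false alarm ratios for the two models.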
  General comment 2 concerns the possibility of defining flood generating processes using snowmelt information from the applied hydrological models. In this way, ref. reviewer’s specific comment 8, one can evaluate the ability of the models to correctly simulate the specific flood type as classified. We agree that this is an interesting topic, and existing methods of interpretable AI make such analyses possible. As suggested by the reviewer, we will address this topic in the discussion.
  General comment 3 (and specific comment 10) points out that the different objective functions applied may give one model an advantage over the other in simulating peak flow characteristics. We agree that this can have an impact, and will follow up by 1) evaluating both models in terms of the objective function used in the calibration of HBV, and 2) calibrating HBV for a selection of catchments using the sum of squared errors (SSE) as objective function (which gives the same optimum as NSE for single-catchment calibration).
  
  1) In our first draft, we evaluated both models using NSE, which is also the objective function of LSTM. Analogously, we will include HBV’s objective function (sum of squared errors plus squared volume error) as an additional evaluation metric. In both the calibration period and the evaluation period, LSTM achieved a lower (i.e. better) score than HBV in the majority (60%) of the catchments. We will add this result and comment on it in the revised manuscript. Strong LSTM performance under HBV’s objective function indicates that its advantage is not solely an artifact of being optimized for NSE.
  
  2) We plan to calibrate HBV using SSE as the objective function, to quantitatively assess the influence of objective function on HBV’s ability to simulate peak flow characteristics. Key results will be presented and commented on in the revised manuscript.
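
  The equivalence between SSE and NSE for a single catchment mentioned above follows from the standard definitions, sketched here with notation that is ours (observed flow $Q_{\mathrm{obs},t}$, simulated flow $Q_{\mathrm{sim},t}(\theta)$ for parameter set $\theta$, and observed mean $\bar{Q}_{\mathrm{obs}}$):

```latex
\begin{align}
\mathrm{SSE}(\theta) &= \sum_{t=1}^{T} \bigl(Q_{\mathrm{sim},t}(\theta) - Q_{\mathrm{obs},t}\bigr)^2 \\
\mathrm{NSE}(\theta) &= 1 - \frac{\mathrm{SSE}(\theta)}{\sum_{t=1}^{T} \bigl(Q_{\mathrm{obs},t} - \bar{Q}_{\mathrm{obs}}\bigr)^2} \\
\arg\max_{\theta}\, \mathrm{NSE}(\theta) &= \arg\min_{\theta}\, \mathrm{SSE}(\theta)
\end{align}
```

  The last line holds because the denominator is positive and does not depend on $\theta$. When NSE is averaged over many catchments, each catchment's SSE is scaled by its own observed variance, so the equivalence with a pooled SSE no longer applies; this is why the statement is restricted to single-catchment calibration.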
  We specifically chose to use existing HBV simulations that are used in the national operational flood forecasting service to be in accordance with our main objective (ref L49): “The main objective is to evaluate LSTM’s potential for operational use in snow-influenced regions…”. We agree that this choice, and in particular the associated objective functions, requires more careful treatment in the analysis, and our revision will assess more explicitly whether the comparison and the resulting conclusions are sensitive to the different objective functions. In general, we note that a fully neutral comparison between the two models is not possible, due to fundamental differences. Other objective functions and calibration procedures may be more suitable for locally calibrated process-based models as opposed to globally calibrated deep learning models.
  First technical correction suggests replacing “most successful models for streamflow and flood prediction over large regions” with “most successful deep-learning models for streamflow and flood prediction over large regions”. We wish to keep our original formulation, as LSTM is the most successful over large regions also when compared to process-based models (demonstrated by several large-scale studies, e.g. references in L24-27).
  Additional specific comments, questions and technical corrections are appreciated, and will be addressed by us in the revisions (we foresee no issues).
  
  Citation: https://doi.org/10.5194/egusphere-2026-1056-AC1
RC2:
'Comment on egusphere-2026-1056', Anonymous Referee #2, 12 Apr 2026
Bakke et al. assess the ability of LSTM to model snowmelt and rainfall generated floods using the HBV model as a benchmark in 103 catchments across Norway. They find that LSTM outperforms HBV in terms of simulating both the timing and magnitude of flood peaks for rainfall induced events and snowmelt events. This is a very timely and relevant contribution, and the manuscript is overall well-written and the results are well-presented. Thus, I have only a few comments the authors may want to consider before publication.
General comments
The authors assess the performance of the LSTM for snowmelt, rainfall, and mixed events separately. Although the definition of the different event types is briefly described in the methods section 2.6, I would recommend expanding on the procedure of defining the flood generation processes. I assume that boundaries between the different processes are generally difficult to define, and it would be interesting to see how a change in the distribution of processes (e.g., more mixed events, fewer rainfall events, …) may influence the predictions obtained with the LSTM in each group, e.g., when looking at Figure 5 (or if this has an impact at all?). I do not expect the authors to add a detailed sensitivity analysis, but recommend adding this to the discussion section.

In my view, the discussion is quite methods-oriented, which is well-reasoned in itself. However, I am missing potential (physical) explanations on why both models performed differently for the different event types, for example, potential reasons for the differences in timing errors between snowmelt and rainfall generated events? I understand that interpreting the results obtained with the LSTM can be more challenging compared to the HBV model. Nevertheless, I would encourage the authors to include potential explanations into the discussion section, including suitable references of previous studies.

Minor comments
L 42-45: This sounds as if you trained the LSTM separately for snowmelt and rainfall events, but as far as I understood, you trained it over all catchments and evaluated its predictions separately. Please clarify in the text.
L 65: In my opinion, the data and method section is excessively long. The authors may want to consider creating a Supporting Information document and moving parts of this section to that document, e.g., model details in sections 2.3, 2.4, 2.5.
L 117: The 103 stations have been repeated many times; consider consolidating these parts.
L 196-199: This paragraph needs references.
L 227-231: I do not think that it is necessary to mention the periods before 1994 here, as 15 years can already be considered a sufficient length.
Figure 5: Why do the authors think the peak timing of snowmelt events is harder to predict than mixed events? Intuitively, one could expect that it would be the other way around? This may be added to the discussion (see also general comment).
L 368: Any other references here?
L 390: This sentence needs a reference.
Citation: https://doi.org/10.5194/egusphere-2026-1056-RC2
- AC2: 'Reply on RC2', Sigrid Joergensen Bakke, 05 May 2026
  
  We thank the anonymous reviewer for the positive and constructive feedback, which will help improve our manuscript. Below, we outline our plan to address the comments.
  General comment 1 highlights how the thresholds chosen to classify flood events may influence the results. We agree with the comment, and plan to 1) visualise the distribution of the fractional rainfall contribution of all flood events, and 2) produce equivalents to Figures 5 and 6 with two other thresholds (at least one of them widening the mixed event group). This will form the basis for commenting on the sensitivity to the event type definition in the discussion section.
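
  The threshold sensitivity described above can be illustrated with a minimal sketch. It assumes that each flood event is summarised by its fractional rainfall contribution (between 0 and 1) and that two thresholds separate snowmelt, mixed, and rainfall events; the threshold values, function names, and example fractions below are hypothetical and are not the ones used in the manuscript.

```python
from collections import Counter


def classify_event(rain_fraction, lower, upper):
    """Label one event from its fractional rainfall contribution.

    `lower` and `upper` are hypothetical thresholds: at or below `lower`
    the event is snowmelt generated, at or above `upper` it is rainfall
    generated, and anything in between is mixed.
    """
    if not 0.0 <= rain_fraction <= 1.0:
        raise ValueError("rain_fraction must lie in [0, 1]")
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if rain_fraction >= upper:
        return "rainfall"
    if rain_fraction <= lower:
        return "snowmelt"
    return "mixed"


def class_counts(rain_fractions, lower, upper):
    """Number of events per class for one choice of thresholds."""
    return Counter(classify_event(f, lower, upper) for f in rain_fractions)


# Made-up rainfall fractions, classified with a narrow and a wide mixed band.
fractions = [0.05, 0.2, 0.4, 0.55, 0.7, 0.9, 1.0]
narrow_band = class_counts(fractions, lower=0.33, upper=0.67)
wide_band = class_counts(fractions, lower=0.1, upper=0.9)
```

  Comparing the per-class counts (and the per-class skill scores computed on them) across threshold choices shows how strongly the event type definition drives the reported differences.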
  General comment 2 (and the specific comment about Fig. 5) suggests including potential explanations for why each model performs differently (e.g. in peak timing) for the different event types. We agree this is an interesting topic for discussion, and plan to expand on the existing comments in the discussion chapter (L359-L364).
  We agree with all remaining specific comments.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1056-AC2

Short summary

Hydrological models need to simulate both rainfall and snowmelt generated floods in regions with snow. We evaluated a deep learning model’s ability to capture timing and magnitude of floods generated by snowmelt and rainfall separately. Timing was better simulated for rainfall than snowmelt generated floods, whereas results for flood peak magnitudes were similar. Compared to an operational model, the deep learning model was better at simulating both flood types in the majority of the catchments.

