The ability of LSTM to model snowmelt versus rainfall generated floods
Abstract. One of the most important skills of hydrological models is to simulate timing and magnitude of flood events. Long Short-Term Memory (LSTM) networks are currently among the most successful models for streamflow and flood prediction over large regions. In snow-influenced catchments, which typically comprise a minority in large-scale studies, floods are generated by two distinctly different processes, snowmelt and rainfall. The applicability of hydrological models in such regions is therefore dependent on their ability to represent both types of floods. Nevertheless, flood evaluations of LSTM taking different flood-generating processes into account are currently lacking. This study fills this gap by evaluating the ability of LSTM to model flood peak characteristics separately for snowmelt and rainfall generated floods. The trained LSTM model successfully simulated streamflow time series across the 103 evaluated catchments, with average NSE of 0.85 and average KGE of 0.87 over the unseen evaluation period. LSTM exhibited better performance in the majority of the catchments in terms of flood peak timing and magnitude for both rainfall and snowmelt generated floods when compared to the operational hydrological model in the region (HBV) used as a benchmark. Both models had a 24 pp higher percentage of correctly simulated peak days for rainfall generated floods as compared to snowmelt generated floods. LSTM outperformed HBV for a larger proportion of the catchments in terms of peak timing of rainfall generated events (83 %) as compared to snowmelt generated events (64 %). On the other hand, a larger proportion of the catchments were improved by LSTM for snowmelt generated events as compared to rainfall generated events when considering peak magnitudes. The largest improvements in peak magnitudes were found for rainfall generated events, in particular for catchments where HBV exhibited high (> 40 %) absolute errors.
Overall, our findings bring confidence that LSTM can improve hydrological services in regions subject to both snowmelt and rainfall generated floods.
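For readers less familiar with the two scores quoted in the abstract, the following minimal sketch shows how NSE and KGE are commonly computed from observed and simulated streamflow. It assumes the original 2009 formulation of KGE; the manuscript may use a modified variant, and the function names are illustrative only.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((s - o) ** 2 for o, s in zip(obs, sim))
    obs_variability = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variability


def kge(obs, sim):
    """Kling-Gupta efficiency (2009 form, assumed here): 1 is a perfect fit."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    sd_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    r = cov / (sd_obs * sd_sim)   # linear correlation
    alpha = sd_sim / sd_obs       # variability ratio
    beta = mean_sim / mean_obs    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Made-up daily discharge values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]
print(round(nse(observed, simulated), 3), round(kge(observed, simulated), 3))
```

Both scores are computed over the whole time series, so high values do not by themselves guarantee that individual flood peaks are well reproduced.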
The study by Bakke et al. evaluates an LSTM model with regard to its ability to simulate snowmelt- vs. rainfall-generated flood events and examines the potential for its operational use in flood forecasting in snow-influenced regions such as Norway. The manuscript is well structured and well written, and the selection and quality of the figures are excellent. Testing deep learning approaches like LSTM to simulate streamflow/floods and comparing them to a benchmark model is nothing new. However, the innovation of this study lies in the specific consideration of flood-generating processes in this model evaluation and comparison. As a result, this study could make a relevant contribution to HESS and should be considered for publication once some incomplete and unclear points have been resolved.
GENERAL COMMENTS:
SPECIFIC COMMENTS:
Introduction or Data: Since the study focuses on Norway, the relevance of snowmelt, rainfall and a combination of both for flood generation should be better demonstrated for this region.
Introduction: LSTM is one deep learning approach among others. The literature review should be a bit broader in this regard. Since LSTMs often perform better than other methods, this actually provides a good reason for why LSTM is applied in this study.
L42 I am at least aware of the aforementioned study that applies an LSTM to simulate rain-on-snow floods in Germany.
L87 Please indicate the percentage or number of catchments that are partly glacier-covered.
L92 Comparability of the SeNorge snow model and the output of the HBV snow routine: I was wondering whether it would make sense to evaluate the ability of the benchmark model to correctly reproduce the flood type, as classified by the approach in Section 2.6, using the data described in Section 2.2 and the output of the catchment-specific HBV. On the other hand, this cannot be tested for the LSTM. This goes in the same direction as my general comment (2).
Table 1 Just out of curiosity: has it been tested how the LSTM performs when using only subsets of these catchment attributes? Alternatively, could specific attributes be identified as important predictors? Sometimes this leads to surprises that are hard to explain.
L126 Was the same objective function used to optimize the LSTM? If not, this could affect the ability to simulate peak flows.
Section 2.4 Is the LSTM applied on a gridded basis or for catchment averages?
Sections 2.4 and 2.5 I suggest moving the last paragraph of 2.4 to 2.5 and the first paragraph of 2.5 to 2.4.
Section 2.6 The approach of detecting flood generating processes is comparatively simple but effective. Still, I suggest putting this into context of other more sophisticated approaches of flood type differentiation.
L269 I am a bit surprised about the low number of mixed floods. Were the same thresholds regarding snowmelt- and rainfall contribution used as in the cited reference? Generally, it is worth adding one or two more sentences on this procedure in section 2.6 so that the reader does not need to search for the reference.
L281 For hydrological models, the goodness-of-fit is often higher due to low model errors during winter low flows, particularly in snow-dominated catchments. Is this also the case for LSTM? According to Figure 4 (c), probably not as pronounced as for HBV.
Figure 5 should indicate the number of events per class either in the bar labels or in the figure caption.
Figure 10 The station numbers on the x-axis might be confusing. I suggest either numbering from 1 to 103 or indicating that the numbers are station numbers or catchment IDs.
L389 This sentence needs reformulation. What aspects? Maybe “The relevance of ... differ between”. In addition, these “aspects” may also differ for different spatial and temporal scales, i.e., flood type dominance, relevance of catchment (storage) characteristics, antecedent conditions, flood durations etc.
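To illustrate the threshold question raised in my comment on L269, the following sketch classifies events by the shares of rainfall and snowmelt in the total water input before the peak. The function name, the 2/3 thresholds and the event values are hypothetical; they are not taken from the manuscript or the cited reference and only show how the choice of thresholds controls the number of mixed floods.

```python
def classify_flood(rain_mm, melt_mm, rain_threshold=2 / 3, melt_threshold=2 / 3):
    """Label an event by the dominant water input (hypothetical thresholds)."""
    total = rain_mm + melt_mm
    if total <= 0:
        return "undetermined"  # no water input in the window
    rain_share = rain_mm / total
    melt_share = melt_mm / total
    if rain_share >= rain_threshold:
        return "rainfall"
    if melt_share >= melt_threshold:
        return "snowmelt"
    return "mixed"


# Hypothetical (rainfall, snowmelt) sums in mm over a window before each peak.
events = [(30.0, 5.0), (5.0, 30.0), (20.0, 20.0), (24.0, 16.0)]

strict = [classify_flood(r, m) for r, m in events]
loose = [classify_flood(r, m, 0.55, 0.55) for r, m in events]
print(strict.count("mixed"), loose.count("mixed"))
```

Lowering the thresholds narrows the band of shares labelled as mixed, which is one possible reason for a low number of mixed floods.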
TECHNICAL CORRECTIONS:
L2 I suggest writing “… most successful deep learning models…”
L114 The abbreviation NVE has already been introduced.
Check the use of “percent” vs. “percentage”. For example, I think it is called mean absolute percentage error.
When referring to a figure in the running text, use “Figure #” rather than “Fig. #”; in parentheses, “(Fig. #)” is correct.
L351 “has previously shown”
L365 I suggest shortening this to “…for different flood types…”
L420 “errors were notably smaller” rather than “…better”
L511 Reference Langsholt is missing an “E.”
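Regarding the “percent” vs. “percentage” remark above, the metric I have in mind is the mean absolute percentage error. A minimal sketch of the textbook definition is given below; the function name and the peak values are illustrative, and no assumption is made about whether the manuscript aggregates per event or per catchment.

```python
def mean_absolute_percentage_error(obs, sim):
    """Mean of |sim - obs| / |obs|, expressed in percent."""
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and of equal length")
    if any(o == 0 for o in obs):
        raise ValueError("observed values must be non-zero")
    relative_errors = [abs(s - o) / abs(o) for o, s in zip(obs, sim)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Made-up observed and simulated flood peaks, for illustration only.
observed_peaks = [100.0, 200.0]
simulated_peaks = [110.0, 180.0]
print(mean_absolute_percentage_error(observed_peaks, simulated_peaks))
```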
References:
Czakay, C., Tarasova, L., and Ahrens, B.: Composition, frequency and magnitude of future rain-on-snow floods in Germany, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2025-3532, 2025.
De la Fuente, L. A., Ehsani, M. R., Gupta, H. V., and Condon, L. E.: Toward interpretable LSTM-based modeling of hydrological systems, Hydrol. Earth Syst. Sci., 28, 945–971, https://doi.org/10.5194/hess-28-945-2024, 2024.