the Creative Commons Attribution 4.0 License.
Assessing the stability of LSTM runoff projections in Switzerland under climate scenarios
Abstract. Climate change is intensifying the global water cycle, altering both mean runoff and extremes, and strengthening the need for reliable hydrological projections to support adaptation. Traditionally, such projections have relied on process-based models. More recently, machine learning models, and in particular Long Short-Term Memory (LSTM) networks, have shown strong skill in predicting and reconstructing runoff from observations, raising interest in their use for hydrological projections. However, their ability to provide stable and physically credible results when forced with future climates beyond their training domain remains largely unexplored. Here we evaluate this question in Switzerland, a region strongly exposed to warming due to its alpine environment and glacier influence. An LSTM trained on observed meteorological and discharge data is driven with CH2018 climate and glacier projections for 1981–2100, and benchmarked against Hydro-CH2018 simulations from the process-based model PREVAH under identical forcings. Results show that the LSTM reproduces key hydrological signals closely – wetter winters, drier summers, and elevation-dependent trends – consistently across catchments and climate chains. Divergences are most pronounced in alpine and glacier-fed catchments, where runoff dynamics are more complex, yet the main governing patterns are captured. The largest limitation arises for extremes, where the LSTM underestimates peak flows, consistent with previously reported saturation effects. Overall, this study demonstrates that LSTMs can deliver robust mean-flow projections and trends comparable to a process-based benchmark, while highlighting persistent challenges in representing hydrological extremes.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-6058', Anonymous Referee #1, 20 Jan 2026
- RC2: 'Comment on egusphere-2025-6058', Anonymous Referee #2, 24 Jan 2026
The study by Courvoisier et al. compares an LSTM architecture with a process-based model regarding runoff behaviour across Switzerland - for historical as well as future (extreme) climate change projections.
The paper addresses an important topic and I generally like how the authors approached the problem, presented and discussed the results. I enjoyed reading the work, particularly the discussion and interpretation.
However, from a process-based perspective and for explaining differences between the models, I am missing some crucial information. And I think a few additions to the results could help with the interpretation. While I feel the paper has the quality to be accepted in HESS, I think a few of my comments require this to be a 'major revision'.
I hope the authors find my comments useful to improve the manuscript.
Comments to individual sections:
Section 2.1: Suggest to better explain in the text what the differently labelled catchments in figure 1 are used for, e.g. it is unclear what "second part of our analysis" means in the caption at this stage.
It is unclear why 96 minimally impacted catchments were chosen and then projected on 307 - you introduce a bias already (e.g. missing out on reservoir impacts to name one obvious point)?
Section 2.5: The glaciers play a crucial role in the future runoff predictions, I think. Could you add some more detail here? E.g. which/how many of the basins have no glacier extent in summer by end of century and is this consistently applied in both LSTM and PREVAH?
Section 2.6 and 3.1: You compare these two models, and try to attribute differences - but I don't know where the models actually differ. I strongly suggest to give an overview (a table with a side-by-side comparison would work nicely) of the diverging input data and model structure of PREVAH vs the LSTM - observed meteo, cc data (and meteo parameters used), cal-val-test approach, glacier data and simulation approach in PREVAH, spatial representation, observational vs projection catchments, you train the LSTM on natural catchments - what about reservoirs in the projection catchments - are those included in PREVAH...?
In this sense, I don't see a value in section 3.1 without mentioning PREVAH. I think it is more important to understand the differences than to explain the LSTM in detail alone (e.g. I'd be interested in how PREVAH simulates glaciers, snow and ET, which is needed to understand the differences presented, e.g. in section 4.5).
Section 4.3/4.4/4.7: I like the maps. However, I wonder what a scatter plot of LSTM vs PREVAH runoff (could add the corresponding r²) would look like next to the maps (one dot = one catchment, x = LSTM runoff, y = PREVAH runoff; the color of the dot could represent observational, projection or selected catchments, or the elevation band color, or the ecoregion ... whatever you find most appropriate). This would give a better sense of the actual differences between the models and allow more intuitive diagnostics for why you see the differences.
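As an illustration of the r² suggested for the scatter plot, the following is a minimal sketch of the squared Pearson correlation between per-catchment runoff values from two models. The function name and the example numbers are hypothetical and are not taken from the manuscript; note also that r² measures linear association only and would not reveal a systematic offset between the two models.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences.

    x and y are assumed to hold one value per catchment, e.g. mean runoff
    from the LSTM (x) and from PREVAH (y). Hypothetical helper.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross products and squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r^2 is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-catchment runoff values, for illustration only.
lstm_runoff = [1.0, 2.0, 3.0]
prevah_runoff = [1.0, 3.0, 2.0]
print(r_squared(lstm_runoff, prevah_runoff))
```

Reporting the bias or the slope of a fitted line alongside r² would complement this, since two models can correlate perfectly while differing in magnitude.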
Major comments:
l.132ff I'd like to understand the bias adjustment and cc data better: Which "Swiss gridded observations" - are those the same that you forced the model with (chapter 2.2)? Did you conduct the QM yourself or does CH2018 provide that? I suggest to discuss the implications if the bias adjustment was done on a meteorology dataset different to the forcing dataset. Also, how did you prepare the cc data for the models: Spatial aggregation->Bias adjustment. Or Bias adjustment->spatial aggregation?
l.220ff I like this cross validation methodology. I assume, per fold, you have 12 catchments that are validation, 12 test and 72 train. Is it correct that you took for valid: year 2016, 2017, 2018 of the 12 validation catchments; year 2019-2024 of the 12 testing catchments and the remaining years of the 72 training catchment for train? But what do the grey shades in Fig 2ab mean - do they correspond? What is the light grey in b that is not part of a?
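The catchment split assumed in the l.220ff comment (12 validation, 12 test and 72 training catchments per fold) can be sketched as below. This only illustrates the referee's reading, not the authors' confirmed procedure: the number of folds (8, from 96 / 12), the rotation scheme and the function name are assumptions, and the additional split by years is not included.

```python
def make_folds(catchment_ids, n_folds=8):
    """Rotate equally sized catchment groups through validation and test.

    Assumed scheme (hypothetical): in fold k, group k is used for
    validation, group k+1 for testing, and all other groups for training.
    """
    n = len(catchment_ids)
    if n_folds < 3 or n % n_folds != 0:
        raise ValueError("catchments must divide evenly into >= 3 folds")
    size = n // n_folds
    groups = [catchment_ids[i * size:(i + 1) * size] for i in range(n_folds)]
    folds = []
    for k in range(n_folds):
        test_k = (k + 1) % n_folds
        train = [c for j, g in enumerate(groups)
                 if j not in (k, test_k) for c in g]
        folds.append({
            "train": train,
            "validation": list(groups[k]),
            "test": list(groups[test_k]),
        })
    return folds


# 96 placeholder catchment IDs, matching the count in the comment above.
folds = make_folds(list(range(96)))
print(len(folds[0]["train"]), len(folds[0]["validation"]), len(folds[0]["test"]))
```

Under this scheme every catchment serves exactly once as a validation catchment and once as a test catchment across the folds.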
l.242/369ff as far as I understand your LSTM model architecture is the same as Kraft et al. 2024 - with the only difference that you added dynamic glacier data. I'd have expected the LSTM would perform slightly better with this additional information. Do you have a (short) reasoning why it didn't (including Kraft et al's LSTM in the suggested table might help to explain this - see comment to 2.6)?
l.339 I am confused by this. Yes, fig 8 shows a stronger decline for PREVAH vs the LSTM, but I can't see that in fig 7. For 7b,c,d,f I even see the opposite.
l.347ff, 413ff and l.529ff I think the key question here is: Why is this happening and which model is closer to reality? I know that you cannot answer this question as it's the future. This likely goes beyond your scope, but I wonder if it would help (you could add your thoughts to the discussion):
1. when comparing to observed data. I assume these future events are significantly outside your training data? If that wouldn't be too extreme, you could look at the test data of the LSTM (section 3.5) and extract the maximum/minimum precip periods and check the same events for PREVAH, and evaluate which model is closer to obs (this could help for your discussion in chapter 5.7)?
2. to attribute this to PREVAH's 'knowledge' of the driving processes (such as the additional climate data it gets to calculate ET), different constraints (or no constraints) for the glacier extent, the internal mass balance constraint that the LSTM doesn't have (could the application of a mass-conserving LSTM improve the situation)?
l.584 You earlier mention that this interpretation of low-flow performance requires caution and I don't see this adequately evaluated in your paper to merit mentioning this in the short conclusion.
Minor comments:
l.93 When does the data end/what was your cutoff? Mention the spatial resolution.
l.96 PRISM and SYMAP - reference and spell out on first use
l.102 how did you spatially average in detail? Extract entire cells or did you 'split' cells? If the catchment area is small and the grids large, you can introduce errors particularly in mountainous terrain.
l.116 Can you give a rationale why you used topsoil information only?
l.126 suggest to add that it is based on the CMIP5 framework
l.129 you earlier write that the resolution was 2km - how is the product downscaled from the 12-50km CORDEX resolution to 2km?
l.148 Suggest to make it clearer whether you did any glacier simulations or if this was Brunner et al. 2019b's work.
l.150 I think you mean section 3.2
l.185 in section 2.5 you cite Brunner et al. 2019b as the source for the glacier data
l.201 I don't see the 24 tested alternatives in Kraft et al. 2025 section 3.6 - seems you employed the pre-print version of their approach?
l.251 fig 4: suggest to add the number of catchments per boxplot (n=x) in the caption.
l.261 I don't think DJF is a period where much melt dynamics is going on in Switzerland and JJA is not really low flow in most of the catchments (you mention this yourself in l.271-272)
l.262 add that this is about the annual panel
l.307 I think the LSTM surpasses PREVAH by 2060, or?
l.327 fig 8 caption "(mm period−1 vs 1991–2020) for 2071-2100" I think you mean something like 2071-2100 - 1991–2020?
l.476-482 the language here suggests we are still in results. Suggest to rephrase.
l.539 deeper deeper
l.576 whether
l.715, l.623 provide links to final paper?
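To illustrate the point raised for l.102 about extracting entire cells versus 'splitting' cells, here is a minimal sketch of an area-weighted catchment mean. The function name, the cell values and the overlap fractions are hypothetical; the manuscript's actual averaging procedure is not known from these comments.

```python
def catchment_mean(cell_values, overlap_fractions):
    """Area-weighted mean of gridded values over a catchment.

    overlap_fractions[i] is the assumed fraction (0..1) of grid cell i
    that lies inside the catchment; passing 1.0 for every cell
    reproduces the 'extract entire cells' approach.
    """
    if len(cell_values) != len(overlap_fractions) or not cell_values:
        raise ValueError("need matching, non-empty inputs")
    total_weight = sum(overlap_fractions)
    if total_weight <= 0:
        raise ValueError("catchment does not overlap any cell")
    weighted_sum = sum(v * w for v, w in zip(cell_values, overlap_fractions))
    return weighted_sum / total_weight


# Two hypothetical cells (e.g. precipitation in mm); the second one is
# only half inside the catchment.
values = [10.0, 20.0]
print(catchment_mean(values, [1.0, 0.5]))  # cells split by overlap
print(catchment_mean(values, [1.0, 1.0]))  # entire cells extracted
```

For a small catchment covered by only a few coarse cells, the two variants can differ noticeably, which is the source of error the comment refers to.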
Citation: https://doi.org/10.5194/egusphere-2025-6058-RC2
- RC3: 'Review of egusphere-2025-6058', Anonymous Referee #3, 16 Feb 2026
General comments:
This paper investigates the question of whether LSTMs can be used for climate change impact prediction, or rather: are they a valuable alternative to process-based hydrological models for climate change impact predictions in alpine environments? Or even more precisely: are LSTMs a valuable alternative to the selected process-based model (PREVAH)? Previous work has asked this question for other case studies and other models. From the paper as it is currently framed, it is unclear if this is simply a state-of-the-art case study or if it goes beyond the state-of-the-art. That said, case studies can well deserve publication in HESS, but given that the methods have all been used / have been developed in previous work, I think that the paper could make clearer what it contributes to the literature.
More importantly: as far as I see, the reference simulations with the selected PREVAH model are problematic as reference simulations: these simulations are calibrated for some of the catchments; but for many other catchments, the model is somehow regionalized (we do not have information on this in the paper). I do not think that this is good practice: How can such a sample of two very different populations be used as a reference to evaluate a purely data-driven approach against? Furthermore, we see from Figure 3 that for many catchments, the reference model has an NSE value below 0.7 and even below 0.5: even if I cannot relate entirely to what 0.7 means for these catchments, 0.5 at daily time step is for sure a really bad performance for this climate, so these catchments should not serve as reference to compare the LSTM against. Furthermore, given the presented performance comparison, we simply know that the LSTM has a similar performance distribution. This is not informative: what if the LSTM does well for the ones where PREVAH does not do well and vice versa? If the focus of the paper is on seasonal signals only, we should at least expect a censoring of the catchments / regions for which the seasonal signal is not well reproduced a) by PREVAH, b) by LSTM, c) by both (which is very interesting!).
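The paired comparison asked for here can be sketched as follows: compute the Nash-Sutcliffe efficiency (NSE) per catchment for each model, then group catchments by which model falls below a threshold. This is a minimal sketch only; the function names, the catchment IDs and the NSE values are hypothetical, and the 0.5 threshold is taken from the comment above rather than from the manuscript.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("need matching, non-empty series")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - sse / spread


def classify_catchments(nse_prevah, nse_lstm, threshold=0.5):
    """Group catchments by which model scores below the NSE threshold."""
    groups = {"prevah_only": [], "lstm_only": [], "both": [], "neither": []}
    for cid in sorted(set(nse_prevah) & set(nse_lstm)):
        prevah_poor = nse_prevah[cid] < threshold
        lstm_poor = nse_lstm[cid] < threshold
        if prevah_poor and lstm_poor:
            groups["both"].append(cid)
        elif prevah_poor:
            groups["prevah_only"].append(cid)
        elif lstm_poor:
            groups["lstm_only"].append(cid)
        else:
            groups["neither"].append(cid)
    return groups


# Hypothetical per-catchment NSE values, for illustration only.
prevah_scores = {"A": 0.8, "B": 0.4, "C": 0.3}
lstm_scores = {"A": 0.9, "B": 0.7, "C": 0.2}
print(classify_catchments(prevah_scores, lstm_scores))
```

Unlike comparing two performance distributions, this grouping shows whether the models fail on the same catchments or on different ones.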
Furthermore: we should not forget that we talk about streamflow simulations here but we do not see any actual simulations, not even for seasonal simulations: I think that it is of utmost importance to reconnect the data-based methods to actual data and hydrology. Otherwise we do not learn much about how well LSTMs perform for the selected hydroclimatic region.
From a model development perspective, information is missing: how well does the PREVAH model, forced with climate data (rather than with observed precip & temperature), reproduce observed streamflow? This is a serious lack of information: the models are trained with observed data and then run with climate data; the divergence between the models for the future period might be rooted at least partly in the divergence of their simulations for the observed period.
Furthermore, the training input data set contains a gridded spatial rainfall product, which can be assumed to show very different quality for different catchments in an alpine environment. Do the climate scenarios have the same spatial resolution? And are the gridded climate scenarios debiased to the spatial rainfall product?
Otherwise, each of the models (PREVAH and LSTM) learns how to deal with the observation-based data set (i.e. how to translate it into observed streamflow), but then makes its own errors when applying what it has learned to a rather different data set. In this setting, what can you learn from such a comparison framework?
Let's imagine: Model A is better than model B for the observed period if fed with observed data, but worse than B if fed with climate data. What do we learn from this? That model A was overfitted? Or that the climate data is not good? If now model A is close to model B for the future period, what does this tell us? That by chance, they both agree? I know that this is not the objective of the paper but given that it contributes to the question of how to use LSTMs for climate change impact studies, these questions should be discussed.
Next: the hydrological process model certainly received PET as input for training. How was this dealt with for climate change simulations? And does the LSTM receive PET? And if not, is this not unfair?
Next: The LSTM is optimized with the mean squared error (MSE) between normalized simulated and observed discharge. This is certainly very different from the optimisation criterion used for the process model (MSE is rarely used). Why was not a similar / the same criterion used? And what can you learn from two models that are optimized based on different criteria? Besides: was the PREVAH model calibrated with some global optimisation method?
I believe that this paper needs to better frame the analysis framework and to clarify, in its methods section, how A is compared to B and what conclusions can be drawn. Furthermore, we need more details on the model set up.
Detailed comments:
- Abstract and elsewhere: what does "stable results" mean and how do you measure "physically credible"?
- Intro: what do you mean by "These results highlight their robustness and scalability"? Does any of the references explicitly assess the robustness of the models and if so, how is this defined in a hydrological modelling context? What is meant by scalability, what paper discusses this and how is it defined in a hydrological modelling context? Similarly, what is meant by "achieve strong generalization performance" in a hydrological context? The introduction has to be much more hydrology-specific.
- Intro: what do you mean by "Despite their success in present-day forecasting,"? Is this paper about present-day forecasting (forecasting = predicting current state based on previous state) or is it about continuous simulation (prediction) of hydrological states based on driver inputs (i.e. without feeding in the previous hydrological state)?
- The abstract states "Divergences are most pronounced in alpine and glacier-fed catchments, where runoff dynamics are more complex, yet the main governing patterns are captured.": I would argue that the main governing patterns in these environments can be captured by feeding only temperature into the LSTM; should this be toned down?
- Section 2: "The dataset includes most of the variables used in Kraft et al. (2024) and combines those typically employed to force the spatially distributed model PREVAH" - can this be more precise: did PREVAH use similar data or might part of the divergence come from different reference data for parameter estimation?
- Section 2: do the same dynamic glacier fractions feed into the LSTM and into the PREVAH simulations or do they receive different input? This is unclear in my view.
- Section 4: what is CH-RUN? This should be clearer somewhere
- What is "dynamical stability of projections", should this be defined in the methods?
- The paper mentions that PREVAH is widely used; the model is however only used in Switzerland (given the references)
Citation: https://doi.org/10.5194/egusphere-2025-6058-RC3
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 330 | 188 | 30 | 548 | 29 | 39 |
Sanika Baste