Hybrid models generalize better to warmer climate conditions than process-based and purely data-driven models

Bohl, Jan P.; Wood, Raul R.; Frank, Corinna; Astagneau, Paul C.; Peters, Jonas; Brunner, Manuela I.

doi:10.5194/egusphere-2025-5201

Preprints


07 Nov 2025


Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).


Abstract. Deep-learning based rainfall-runoff models, in particular long short-term memory networks (LSTM), have been shown to outperform traditional hydrological models at various tasks, both when used as purely data-driven models and when combined with process-based models in a hybrid setting. These tasks include predictions in ungauged basins (PUB) and regions (PUR), tasks which have traditionally been challenging for conceptual hydrological models. While the spatial generalizability of deep-learning based models has received a lot of attention, it is less clear how they generalize to unseen and warmer climate conditions, i.e. how suitable these models are for hydrological climate impact studies. To address this research gap, we assess the ability of three types of models including (1) fully data-driven (LSTMs), (2) conceptual (Hydrologiska Byråns Vattenbalansavdelning (HBV)), and (3) hybrid (LSTM-HBV) models to simulate streamflow under conditions warmer than those used to train the models by running a differential split sample test. That is, we trained the models using data from the historical period 1960–1990 and evaluated them on both data of this period as well as of the warmer period 2000–2023. We find that LSTMs, while being the most accurate during the 1960–1990 period, have inferior generalizability to the warm period compared to the hybrid and conceptual models. In addition, we show that when generalizing to the warm period, hybrid models have similar accuracy as LSTMs, independently of whether the entire streamflow distribution or extreme events such as floods and droughts are considered. However, for snow-dominated catchments, all models suffer from similar reductions in accuracy when simulating streamflow under unseen climate conditions and the LSTM is the most accurate model for all periods. 
A detailed look at the snowmelt simulations of the hybrid and conceptual model suggests that better process-representation might be needed to accurately capture the dynamics of snow-melt and -accumulation processes, which are highly sensitive to changes in temperature. We conclude that the hybrid models effectively combine the high accuracy of LSTMs when predicting in ungauged basins with the good generalizability under changes in climate of conceptual hydrological models. This makes them a suitable choice for hydrological climate change impact assessments, particularly in ungauged basins.
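
As a rough illustration of the differential split-sample test described in the abstract, the following Python sketch calibrates a model on the 1960–1990 period and scores it on both that period and 2000–2023 using the Nash-Sutcliffe efficiency (NSE). The one-parameter runoff-coefficient model, the record layout (`date`, `precip`, `flow`), and all function names are assumptions made for illustration; they are not the HBV, LSTM, or hybrid models evaluated in the paper.

```python
from datetime import date


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of obs."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def split_by_period(records, start_year, end_year):
    """Keep records whose date falls within [start_year, end_year]."""
    return [r for r in records if start_year <= r["date"].year <= end_year]


def fit_runoff_coefficient(records):
    """Hypothetical toy model: flow = c * precip, fitted by least squares."""
    num = sum(r["precip"] * r["flow"] for r in records)
    den = sum(r["precip"] ** 2 for r in records)
    return num / den


def differential_split_sample(records, train=(1960, 1990), test=(2000, 2023)):
    """Calibrate on the training period, then score on both periods."""
    cold = split_by_period(records, *train)
    warm = split_by_period(records, *test)
    c = fit_runoff_coefficient(cold)  # calibration uses the cold period only
    scores = {}
    for name, subset in (("cold", cold), ("warm", warm)):
        obs = [r["flow"] for r in subset]
        sim = [c * r["precip"] for r in subset]
        scores[name] = nse(obs, sim)
    return c, scores
```

A drop in the score from the `cold` to the `warm` entry indicates that the calibrated relationship no longer holds under the changed conditions, which is what the test is designed to expose.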

Received: 27 Oct 2025 – Discussion started: 07 Nov 2025

Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.


Status: open (until 12 Jan 2026)


RC1: 'Comment on egusphere-2025-5201', Anonymous Referee #1, 04 Dec 2025

The manuscript evaluates the ability of different model types (HBV, LSTM, and a hybrid model) to predict river streamflow under different climate conditions, particularly when the training/calibration period differs from the testing/validation/prediction period. This issue is critical when applying machine learning models to future climate-change impact studies. The manuscript is well written, the experimental design is appropriate for the scientific questions, and the results are clearly illustrated. I have several major comments that I would like to discuss with the authors. If these can be addressed, I would recommend the paper for publication.
First, I think a sensitivity test should be conducted: before applying the models to the warmer period, perturb the input variables (such as temperature or precipitation) and evaluate how the models respond to these changes. This is relevant for the subsequent analysis, since some models may be sensitive to such perturbations while others are not.
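
One possible form of such a sensitivity test is sketched below in Python: the forcing is shifted in temperature or scaled in precipitation, the model is rerun, and the relative change in mean simulated flow is reported. The forcing layout (`temp`, `precip`), the function names, and the `toy_model` are hypothetical stand-ins, not the models or procedure used in the manuscript.

```python
def perturb_forcing(forcing, delta_t=0.0, precip_factor=1.0):
    """Return a copy of the forcing with shifted temperature and scaled precipitation."""
    return [
        {"temp": f["temp"] + delta_t, "precip": f["precip"] * precip_factor}
        for f in forcing
    ]


def mean_flow_response(model, forcing, delta_t=0.0, precip_factor=1.0):
    """Relative change in mean simulated flow caused by the perturbation."""
    base = model(forcing)
    perturbed = model(perturb_forcing(forcing, delta_t, precip_factor))
    base_mean = sum(base) / len(base)
    perturbed_mean = sum(perturbed) / len(perturbed)
    return (perturbed_mean - base_mean) / base_mean


def toy_model(forcing):
    """Hypothetical model: precipitation minus a temperature-dependent loss."""
    return [max(f["precip"] - 0.1 * max(f["temp"], 0.0), 0.0) for f in forcing]
```

Any callable that maps a forcing series to a flow series can be passed as `model`, so the same perturbations could in principle be applied to each model type and the responses compared.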
Another concern relates to the importance of the different input features. Is temperature the most important predictor, or do other variables differ more between the cold and warm periods? The manuscript does not discuss precipitation changes, and I think a feature-importance/SHAP analysis is possible for the LSTM or hybrid model. It would be helpful to understand whether precipitation or PET, although changing less than temperature, may have a stronger influence on streamflow. I am also concerned that the evaluation metrics do not differ much between models across the different periods. Such an analysis would help confirm that the differences in model performance are due to climate warming.
A few minor comments
Line 1: Use consistent terminology: either “deep learning,” “deep-learning” (as an adjective), or “DL,” throughout the manuscript.
Line 1: Spell out “Long Short-Term Memory (LSTM)” on first use.
The abstract is currently very conceptual. Please include key numerical results (e.g., NSE, KGE) to quantify performance. For example, Lines 10–12 mention that the LSTM performs best during the cold period but worse during the warm period; this should be supported with specific numbers.
From the abstract, the advantages of the hybrid model over the LSTM are not obvious. Lines 10–15 suggest that hybrid models have similar accuracy to LSTMs; please clarify the added benefit.
Line 86: Please correct the citation formatting.
The introduction is well written.
Line 153: If the Po River basin is not included in the analysis, it may be better not to mention it here (or clarify this later, as in Line 159).
Line 310: Please clarify the distinction between “in-sample HBV” and “regional HBV.”
Line 320: This relates to my major concern: how does precipitation change between periods and among different catchments?
Lines 315–319: The reported values are very close to each other, and they represent means or medians over hundreds of catchments. Could these differences fall within model uncertainty?
Section 3.4: In general, the hybrid and HBV models perform worse than the LSTM model. Is this due to limitations of HBV in snow-affected catchments, where LSTM may better learn snow–streamflow relationships? Does the hybrid model inherit these limitations from HBV, preventing it from outperforming the LSTM?
Line 415: All models show higher performance for flood events than for drought or low flows. Is this due to the choice of objective function (NSE), which emphasizes high-flow periods?
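
The question about the objective function can be made concrete with a small Python sketch. Because NSE is built on squared errors, a given relative error costs far more at high flows than at low flows; computing NSE on log-transformed flows is one common way to shift the weight toward low flows. The flow values below are invented for illustration, and `log_nse` with its `eps` offset is an assumed helper, not a metric reported in the manuscript.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency on the raw flow values."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def log_nse(obs, sim, eps=1e-6):
    """NSE on log-transformed flows; eps (an assumption) guards against log(0)."""
    log_obs = [math.log(o + eps) for o in obs]
    log_sim = [math.log(s + eps) for s in sim]
    return nse(log_obs, log_sim)


# Invented series: four low-flow days and one flood peak.
observed = [1.0, 1.0, 1.0, 1.0, 100.0]
low_flow_error = [2.0, 2.0, 2.0, 2.0, 100.0]  # low flows doubled, peak exact
peak_error = [1.0, 1.0, 1.0, 1.0, 80.0]       # low flows exact, peak 20% too low

for label, sim in (("low-flow error", low_flow_error), ("peak error", peak_error)):
    print(label, round(nse(observed, sim), 4), round(log_nse(observed, sim), 4))
```

In this toy case plain NSE ranks the simulation with doubled low flows above the one with a 20% peak error, while the log-transformed variant reverses the ranking.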


Citation: https://doi.org/10.5194/egusphere-2025-5201-RC1
CC1: 'How do we define 'generalizability'?', Sacha Ruzzante, 17 Dec 2025

This is a useful and timely paper. The testing of model generalizability follows best practices in the hydrologic literature, but I'd like to take this opportunity to ask exactly what is meant by 'generalizability'. Is it:
a) Which model has the highest accuracy in an unseen warm test period?

b) Which model has the smallest reduction (or largest increase) in accuracy when moving from calibration (cold) to testing (warm) periods?

c) Which model most accurately simulates the change in hydrologic conditions between warm and cold periods?
These three alternate definitions are subtly different and suggest different statistical tests. This study relies on definitions (a) and (b). In climate change projection studies, however, it is common to summarize results as a percentage change from historical conditions, for which definition (c) is the most relevant.
As an illustrative example: in a catchment, suppose the observed peak flow increases by 50%, from 100 cms to 150 cms, between the cold and warm periods. Models A and B give the following results:
Period	Observed	Model A	Model B
Cold	100	90	90
Warm	150	135	148
Change	50%	50%	64%
Model A has a persistent bias of -10%, and correctly predicts an increase in peak flows of 50%, while model B overpredicted the increase (64%). However, by definitions (a) and (b), we would select model B as the most generalizable since its accuracy in the warm period is highest and the accuracy improves from the calibration (cold) to the testing (warm) period.
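
The arithmetic of this example can be reproduced with a short Python sketch that scores each model under the three definitions. The peak-flow values are taken from the table above; the function and key names are illustrative assumptions.

```python
def relative_error(obs, sim):
    """Signed relative error of a simulated value against the observation."""
    return (sim - obs) / obs


def relative_change(cold, warm):
    """Relative change from the cold period to the warm period."""
    return (warm - cold) / cold


def generalizability_summary(observed, simulated):
    """Score one model under definitions (a), (b), and (c)."""
    err_cold = abs(relative_error(observed["cold"], simulated["cold"]))
    err_warm = abs(relative_error(observed["warm"], simulated["warm"]))
    sim_change = relative_change(simulated["cold"], simulated["warm"])
    obs_change = relative_change(observed["cold"], observed["warm"])
    return {
        "warm_abs_error": err_warm,               # (a) accuracy in the warm period
        "error_increase": err_warm - err_cold,    # (b) loss of accuracy, cold to warm
        "change_error": sim_change - obs_change,  # (c) error in the simulated change
    }


# Peak flows in cms, from the example table.
observed = {"cold": 100.0, "warm": 150.0}
models = {
    "A": {"cold": 90.0, "warm": 135.0},
    "B": {"cold": 90.0, "warm": 148.0},
}

for name, simulated in models.items():
    print(name, generalizability_summary(observed, simulated))
```

Model B comes out ahead on `warm_abs_error` and `error_increase`, while Model A has a `change_error` of zero, matching the point that definitions (a) and (b) can disagree with definition (c).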
This example is relevant to Table C1, where (for example) the hybrid model is shown to have the best performance for DVPB in the warm period (4.9%), but this represents a large reduction from the cold period DVPB (9.5%). In comparison, the LSTM has the most stable DVPB across the three periods, as indicated at L358. In this case, it seems that the LSTM will predict the change in DVPB best, and be most generalizable by definition (c). These numbers would, however, be more informative if compared on a catchment-by-catchment basis rather than comparing the median values.
I recommend including (perhaps in an appendix) a comparison of the observed and simulated change in various hydrologic signatures between the cold and warm periods (e.g., the mean annual flow, the mean monthly flow for each month, and the high flow, low flow, and drought metrics already calculated in the paper).
As a second point, the LSTM is found to generalize most poorly in the warmest catchments. To me, this makes sense, given that the LSTM is extrapolating most strongly in these catchments. In the colder catchments (warm period), the LSTM can learn from the warm catchments (cold period). For the warm catchments (warm period) there is no analogue set of catchments from which to learn. It might be worthwhile to mention this explanation alongside the explanations already given (L427-442).


Citation: https://doi.org/10.5194/egusphere-2025-5201-CC1


Data sets

E-OBS Daily Gridded Meteorological Data for Europe from 1950 to Present Derived from in-Situ Observations Copernicus Climate Change Service, Climate Data Store https://doi.org/10.24381/cds.151d3ec6

SPASS - new gridded climatological snow datasets for Switzerland C. Marty et al. https://www.doi.org/10.16904/envidat.580

SNOWGRID Klima v2.1 GeoSphere Austria https://doi.org/10.60669/fsxx-6977

Model code and software

Caravan - A global community dataset for large-sample hydrology F. Kratzert https://github.com/kratzert/Caravan/

Analyzing the generalization capabilities of hybrid hydrological models for extrapolation to extreme events E. Acuna Espinoza https://doi.org/10.5281/zenodo.14191623


Viewed

Total article views: 466 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
341	103	22	466	11	14


Views and downloads (calculated since 07 Nov 2025)

Month	HTML	PDF	XML	Total
Nov 2025	232	47	11	290
Dec 2025	98	48	11	157
Jan 2026	11	8	0	19



Latest update: 09 Jan 2026

Short summary

To assess climate impacts on streamflow, we need models that can predict streamflow under future conditions. This study compares three model types: data-driven (LSTM), conceptual (HBV), and hybrid (LSTM-HBV). LSTMs perform best overall, but HBV and hybrid models generalize better to warmer climates. Hybrid models are a promising tool for climate impact assessments, combining the accuracy of LSTMs with the better generalizability of traditional models. In snowy regions, all models struggle to generalize.

