the Creative Commons Attribution 4.0 License.
Hybrid models generalize better to warmer climate conditions than process-based and purely data-driven models
Abstract. Deep-learning-based rainfall-runoff models, in particular long short-term memory (LSTM) networks, have been shown to outperform traditional hydrological models at various tasks, both when used as purely data-driven models and when combined with process-based models in a hybrid setting. These tasks include predictions in ungauged basins (PUB) and regions (PUR), which have traditionally been challenging for conceptual hydrological models. While the spatial generalizability of deep-learning-based models has received considerable attention, it is less clear how they generalize to unseen, warmer climate conditions, i.e., how suitable these models are for hydrological climate impact studies. To address this research gap, we assess the ability of three types of models, namely (1) fully data-driven (LSTM), (2) conceptual (Hydrologiska Byråns Vattenbalansavdelning, HBV), and (3) hybrid (LSTM-HBV) models, to simulate streamflow under conditions warmer than those used to train the models by running a differential split-sample test. That is, we trained the models on data from the historical period 1960–1990 and evaluated them on data from both this period and the warmer period 2000–2023. We find that LSTMs, while being the most accurate during the 1960–1990 period, generalize to the warm period less well than the hybrid and conceptual models. In addition, we show that when generalizing to the warm period, hybrid models achieve accuracy similar to that of LSTMs, regardless of whether the entire streamflow distribution or extreme events such as floods and droughts are considered. However, for snow-dominated catchments, all models suffer from similar reductions in accuracy when simulating streamflow under unseen climate conditions, and the LSTM is the most accurate model in all periods.
A detailed look at the snowmelt simulations of the hybrid and conceptual models suggests that better process representation might be needed to accurately capture the dynamics of snow accumulation and melt, which are highly sensitive to changes in temperature. We conclude that hybrid models effectively combine the high accuracy of LSTMs when predicting in ungauged basins with the good generalizability of conceptual hydrological models under changing climate conditions. This makes them a suitable choice for hydrological climate change impact assessments, particularly in ungauged basins.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5201', Anonymous Referee #1, 04 Dec 2025
- CC1: 'How do we define 'generalizability'?', Sacha Ruzzante, 17 Dec 2025
This is a useful and timely paper. The testing of model generalizability follows best practices in the hydrologic literature, but I'd like to take this opportunity to ask exactly what is meant by 'generalizability'. Is it:
a) Which model has the highest accuracy in an unseen warm test period?
b) Which model has the smallest reduction (or largest increase) in accuracy when moving from calibration (cold) to testing (warm) periods?
c) Which model most accurately simulates the change in hydrologic conditions between warm and cold periods?

These three alternate definitions are subtly different and suggest different statistical tests. This study relies on definitions (a) and (b). In climate change projection studies, however, it is common to summarize results as a percentage change from historical conditions, for which definition (c) is the most relevant.
As an illustrative example: in a catchment, suppose the observed peak flow increases by 50%, from 100 cms to 150 cms, between the cold and warm periods. Models A and B give the following results:
| Period | Observed | Model A | Model B |
|---|---|---|---|
| Cold | 100 | 90 | 90 |
| Warm | 150 | 135 | 148 |
| Change | 50% | 50% | 64% |

Model A has a persistent bias of -10% and correctly predicts an increase in peak flows of 50%, while Model B overpredicts the increase (64%). However, by definitions (a) and (b), we would select Model B as the most generalizable, since its accuracy in the warm period is highest and its accuracy improves from the calibration (cold) to the testing (warm) period.
This example is relevant to Table C1, where (for example) the hybrid model is shown to have the best performance for DVPB in the warm period (4.9%), but this represents a large reduction from the cold period DVPB (9.5%). In comparison, the LSTM has the most stable DVPB across the three periods, as indicated at L358. In this case, it seems that the LSTM will predict the change in DVPB best, and be most generalizable by definition (c). These numbers would, however, be more informative if compared on a catchment-by-catchment basis rather than comparing the median values.
I recommend including (maybe in an appendix) a comparison of the observed and simulated change in various hydrologic signatures between the cold and warm periods (eg., the mean annual flow, the mean monthly flow for each month, and the high flow, low flow, and drought metrics already calculated in the paper).
As a second point, the LSTM is found to generalize most poorly in the warmest catchments. To me, this makes sense, given that the LSTM is extrapolating most strongly in these catchments. In the colder catchments (warm period), the LSTM can learn from the warm catchments (cold period). For the warm catchments (warm period) there is no analogue set of catchments from which to learn. It might be worthwhile to mention this explanation alongside the explanations already given (L427-442).
Citation: https://doi.org/10.5194/egusphere-2025-5201-CC1
- RC2: 'Comment on egusphere-2025-5201', Anonymous Referee #2, 20 Jan 2026
Bohl et al. conducted a comprehensive benchmark of different types of models (purely data-driven, hybrid, and conceptual) with respect to their generalizability to warmer scenarios and extreme events. They found that hybrid models generalize more robustly to scenarios with distribution shifts than the LSTM and the stand-alone HBV. The manuscript is well written and easy to follow. I appreciate the comprehensive evaluation framework and the in-depth discussions provided in the paper, and I support publication of this study. I have some moderate/minor comments below, mainly aimed at further improving the clarity of the methodology and results.
I suggest that the authors add a table summarizing the details of all the evaluation experiments and scenarios, so that readers can quickly review and refer to each experiment discussed, especially given that the study considers different models under different types of generalization scenarios.
Are the generalization tests for the warm and warmest periods also predictions in ungauged basins? For example, are the historical observations of those basins excluded during training? Please clarify.
Does the "in-sample HBV" mentioned throughout the manuscript refer to the HBV calibrated in each individual catchment? If so, only the stand-alone conceptual HBV has in-sample prediction results, while all other results reported for the LSTM and hybrid models are for prediction in ungauged basins?
Line 175: my understanding is that the SWE observations used for the evaluation are also model-based data with uncertainties rather than ground truth. Could the authors discuss how these data might affect the evaluation of the SWE simulated by the hybrid model?
I feel that "LSTM-HBV" may not be an accurate name for the hybrid models developed in Feng et al. (2022, 2023), which those studies specifically call "differentiable hydrological models" or "δ (delta) models". "LSTM-HBV" reads like a loose coupling or a post-processing type of model, which does not reflect the core of these hybrid models. Moreover, although the hybrid models use HBV as the backbone, the frameworks also substantially modify the original HBV structure.
Please cite Feng et al. (2022, 2023) in lines 135 and 206 when referring to the "hybrid model", because the authors employ the hybrid models introduced in these previous studies.
NM7Q should be just one value per year rather than a time series, based on the LFPB equations. Please revise the related use of "time series" when introducing this concept.
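The point about NM7Q yielding one value per year can be illustrated with a short sketch. The data below are synthetic, and `q` is a hypothetical stand-in for a daily streamflow series; NM7Q is computed here as the minimum 7-day moving-average flow of each calendar year:

```python
# Sketch: NM7Q as the minimum 7-day mean flow per year -- one value per
# year, not a time series. `q` is a synthetic daily streamflow series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", "2002-12-31", freq="D")
q = pd.Series(rng.gamma(2.0, 5.0, size=len(dates)), index=dates)

# 7-day moving average, then the annual minimum of that average
nm7q = q.rolling(window=7).mean().groupby(q.index.year).min()
print(nm7q)  # exactly one NM7Q value per calendar year
```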
Figure 7: which catchment is this figure plotted for? Or is it actually the mean across all catchments and days of the year? Please clarify.
Line 409: I feel "it generalizes equally well…" can be a bit misleading, given that the conceptual model's absolute performance is apparently lower than that of the other two models. Maybe a more accurate statement is that the conceptual model shows a performance reduction similar to the hybrid model's when generalizing to warmer climate conditions?
Line 413: the use of "the latter" is confusing here. I assume you refer to the conceptual HBV model, but it's not clear.
Line 417: is there a typo here? Figure 5c seems to show that the hybrid model is the most accurate for drought volumes, but you say HBV here.
I am glad that the authors discuss the potential reasons for the underperformance in snow-dominated catchments and the benefits of hybrid models over purely data-driven models for predicting untrained variables, in lines 435 and 456, respectively. Good job! These points are valuable and important to think about further.
Code availability: The authors of Feng et al. (2022) have publicly released all their model codes on Zenodo, and NeuralHydrology reimplemented these codes in its library. Therefore, credit should also be given to the original developers, for example by noting in line 522 that the Python library used reimplements the model codes of Feng et al. (2022), and by citing the original Zenodo release.
Feng, D., Shen, C., Liu, J., Lawson, K., & Beck, H. (2022). differentiable parameter learning (dPL) + HBV hydrologic model. Zenodo. https://doi.org/10.5281/zenodo.7091334
Citation: https://doi.org/10.5194/egusphere-2025-5201-RC2
Data sets
E-OBS Daily Gridded Meteorological Data for Europe from 1950 to Present Derived from in-Situ Observations Copernicus Climate Change Service, Climate Data Store https://doi.org/10.24381/cds.151d3ec6
SPASS - new gridded climatological snow datasets for Switzerland C. Marty et al. https://www.doi.org/10.16904/envidat.580
SNOWGRID Klima v2.1 GeoSphere Austria https://doi.org/10.60669/fsxx-6977
Model code and software
Caravan - A global community dataset for large-sample hydrology F. Kratzert https://github.com/kratzert/Caravan/
Analyzing the generalization capabilities of hybrid hydrological models for extrapolation to extreme events E. Acuna Espinoza https://doi.org/10.5281/zenodo.14191623
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 412 | 131 | 27 | 570 | 19 | 18 |
The manuscript evaluates the ability of different model types (HBV, LSTM, and a hybrid model) to predict river streamflow under different climate conditions, particularly when the training/calibration period differs from the testing/validation/prediction period. This issue is critical when applying machine learning models to future climate-change impact studies. The manuscript is well written, the experimental design is appropriate for the scientific questions, and the results are clearly illustrated. I have several major comments that I would like to discuss with the authors. If these can be addressed, I would recommend the paper for publication.
First, I think a sensitivity test should be conducted: before applying the models to the warmer period, perturb the input variables (such as temperature or precipitation) and evaluate how the models respond to these changes. This is relevant for the subsequent analysis, since some models may be sensitive to such perturbations while others are not.
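The suggested perturbation test can be sketched as follows. This is a minimal illustration, not the paper's code: `simulate` is a hypothetical toy stand-in for any of the three models (LSTM, HBV, hybrid), and the forcings are synthetic:

```python
# Minimal sketch of the suggested sensitivity test: perturb one forcing
# (here, air temperature by +1 to +4 K) and compare the simulated mean flow
# with the unperturbed run. `simulate` is a toy placeholder model.
import numpy as np

def simulate(precip, temp):
    # toy water balance: flow drops as temperature (and thus evaporation) rises
    return np.maximum(precip - 0.1 * np.maximum(temp, 0.0), 0.0)

rng = np.random.default_rng(1)
precip = rng.gamma(1.5, 3.0, size=365)                      # synthetic daily rain
temp = 10.0 + 10.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 365))

baseline = simulate(precip, temp).mean()
for dT in (1.0, 2.0, 4.0):
    perturbed = simulate(precip, temp + dT).mean()
    print(f"+{dT:.0f} K: mean flow changes by "
          f"{100.0 * (perturbed / baseline - 1.0):+.1f} %")
```

Applying the same perturbations to each trained model would reveal whether the models differ in their sensitivity to warming before they are evaluated on the warm period.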
Another concern relates to the importance of the different input features. Is temperature the most important predictor, or do other variables differ more between the cold and warm periods? The manuscript does not discuss precipitation changes, and I think a feature-importance/SHAP analysis is possible for the LSTM or hybrid model. It would be helpful to understand whether precipitation or PET, although changing less than temperature, may have a stronger influence on streamflow. Given that the evaluation metrics do not differ much between models across periods, such an analysis would also help confirm that the differences in model performance are indeed due to climate warming.
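One model-agnostic way to rank input features, simpler than SHAP, is permutation importance. The sketch below is illustrative only: `predict` is a hypothetical toy model (not from the paper), and the feature names are assumptions; importance is measured as the increase in MSE after shuffling one input column:

```python
# Sketch of model-agnostic permutation importance: shuffle one feature
# column at a time and measure how much the prediction error grows.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                   # toy columns: precip, temp, pet
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

def predict(X):
    return 2.0 * X[:, 0] - 0.5 * X[:, 1]        # toy "model" that ignores pet

base_mse = np.mean((predict(X) - y) ** 2)
importance = {}
for j, name in enumerate(["precip", "temp", "pet"]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])        # break the feature-target link
    importance[name] = np.mean((predict(Xp) - y) ** 2) - base_mse
    print(f"{name}: importance = {importance[name]:.3f}")
```

For a trained LSTM or hybrid model, `predict` would wrap the model's forward pass over the forcing sequences; the same shuffle-and-score loop applies.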
A few minor comments
Line 1: Use consistent terminology: either “deep learning,” “deep-learning” (as an adjective), or “DL,” throughout the manuscript.
Line 1: Spell out “Long Short-Term Memory (LSTM)” on first use.
The abstract is currently very conceptual. Please include key numerical results (e.g., NSE, KGE) to quantify performance. For example, Lines 10–12 mention that the LSTM performs best during the cold period but worse during the warm period; this should be supported with specific numbers.
From the abstract, the advantages of the hybrid model over the LSTM are not obvious. Lines 10–15 suggest that hybrid models have similar accuracy to LSTMs; please clarify the added benefit.
Line 86: Please correct the citation formatting.
The introduction is well written.
Line 153: If the Po River basin is not included in the analysis, it may be better not to mention it here (or clarify this later, as in Line 159).
Line 310: Please clarify the distinction between “in-sample HBV” and “regional HBV.”
Line 320: This relates to my major concern: how does precipitation change between periods and among different catchments?
Lines 315–319: The reported values are very close to each other, and they represent means or medians over hundreds of catchments. Could these differences fall within model uncertainty?
Section 3.4: In general, the hybrid and HBV models perform worse than the LSTM model. Is this due to limitations of HBV in snow-affected catchments, where LSTM may better learn snow–streamflow relationships? Does the hybrid model inherit these limitations from HBV, preventing it from outperforming the LSTM?
Line 415: All models show higher performance for flood events than for drought or low flows. Is this due to the choice of objective function (NSE), which emphasizes high-flow periods?