Hybrid models generalize better to warmer climate conditions than process-based and purely data-driven models
Abstract. Deep-learning based rainfall-runoff models, in particular long short-term memory networks (LSTM), have been shown to outperform traditional hydrological models at various tasks, both when used as purely data-driven models and when combined with process-based models in a hybrid setting. These tasks include predictions in ungauged basins (PUB) and regions (PUR), tasks which have traditionally been challenging for conceptual hydrological models. While the spatial generalizability of deep-learning based models has received a lot of attention, it is less clear how they generalize to unseen and warmer climate conditions, i.e. how suitable these models are for hydrological climate impact studies. To address this research gap, we assess the ability of three types of models including (1) fully data-driven (LSTMs), (2) conceptual (Hydrologiska Byråns Vattenbalansavdelning (HBV)), and (3) hybrid (LSTM-HBV) models to simulate streamflow under conditions warmer than those used to train the models by running a differential split sample test. That is, we trained the models using data from the historical period 1960–1990 and evaluated them on both data of this period as well as of the warmer period 2000–2023. We find that LSTMs, while being the most accurate during the 1960–1990 period, have inferior generalizability to the warm period compared to the hybrid and conceptual models. In addition, we show that when generalizing to the warm period, hybrid models have similar accuracy as LSTMs, independently of whether the entire streamflow distribution or extreme events such as floods and droughts are considered. However, for snow-dominated catchments, all models suffer from similar reductions in accuracy when simulating streamflow under unseen climate conditions and the LSTM is the most accurate model for all periods. 
A detailed look at the snowmelt simulations of the hybrid and conceptual model suggests that better process-representation might be needed to accurately capture the dynamics of snow-melt and -accumulation processes, which are highly sensitive to changes in temperature. We conclude that the hybrid models effectively combine the high accuracy of LSTMs when predicting in ungauged basins with the good generalizability under changes in climate of conceptual hydrological models. This makes them a suitable choice for hydrological climate change impact assessments, particularly in ungauged basins.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
The manuscript evaluates the ability of different model types (HBV, LSTM, and a hybrid model) to predict river streamflow under different climate conditions, particularly when the training/calibration period differs from the testing/validation/prediction period. This issue is critical when applying machine learning models to future climate-change impact studies. The manuscript is well written, the experimental design is appropriate for the scientific questions, and the results are clearly illustrated. I have several major comments that I would like to discuss with the authors. If these can be addressed, I would recommend the paper for publication.
First, I think a sensitivity test should be conducted. Before applying the models to the warmer period, the input variables (such as temperature or precipitation) could be perturbed to evaluate how each model responds to these changes. This is relevant for the subsequent analysis, because some models may be sensitive to such perturbations while others are not.
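To make the suggestion concrete, the following is a minimal sketch of such a perturbation test. It assumes each model can be wrapped as a function that maps a forcing dictionary to a streamflow series; the names `perturb`, `sensitivity`, and `toy_model`, as well as the forcing values, are hypothetical and only stand in for the HBV, LSTM, and hybrid models of the manuscript.

```python
from typing import Callable, Dict, List, Sequence

Forcing = Dict[str, List[float]]
Model = Callable[[Forcing], List[float]]


def perturb(forcing: Forcing, delta_t: float = 0.0, precip_factor: float = 1.0) -> Forcing:
    """Return a copy of the forcing with shifted temperature and scaled precipitation."""
    return {
        "temperature": [t + delta_t for t in forcing["temperature"]],
        "precipitation": [p * precip_factor for p in forcing["precipitation"]],
    }


def sensitivity(model: Model, forcing: Forcing, delta_ts: Sequence[float]) -> Dict[float, float]:
    """Relative change in mean simulated streamflow for each temperature shift."""
    base_sim = model(forcing)
    base = sum(base_sim) / len(base_sim)
    changes = {}
    for dt in delta_ts:
        sim = model(perturb(forcing, delta_t=dt))
        changes[dt] = (sum(sim) / len(sim) - base) / base
    return changes


def toy_model(forcing: Forcing) -> List[float]:
    """Hypothetical stand-in: runoff is precipitation minus a temperature-dependent loss."""
    return [
        max(p - 0.1 * max(t, 0.0), 0.0)
        for p, t in zip(forcing["precipitation"], forcing["temperature"])
    ]


# Illustrative forcing values, not data from the manuscript.
forcing = {"temperature": [5.0, 10.0, 15.0], "precipitation": [4.0, 6.0, 8.0]}
result = sensitivity(toy_model, forcing, delta_ts=[0.0, 2.0])
```

Running the same `sensitivity` call for each model type and comparing the resulting curves would show which models respond to a temperature shift at all, before any comparison on the observed warm period.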
Another concern relates to the importance of the different input features. Is temperature the most important predictor, or do other variables differ more between the cold and warm periods? The manuscript does not discuss precipitation changes, and I think a feature-importance/SHAP analysis is possible for the LSTM or hybrid model. It would be helpful to understand whether precipitation or PET, although changing less than temperature, may have a stronger influence on streamflow. My concern is that the evaluation metrics do not differ much between models across the different periods, so I would like to confirm that the differences in model performance are indeed due to climate warming.
A few minor comments
Line 1: Use consistent terminology: either “deep learning,” “deep-learning” (as an adjective), or “DL,” throughout the manuscript.
Line 1: Spell out “Long Short-Term Memory (LSTM)” on first use.
The abstract is currently very conceptual. Please include key numerical results (e.g., NSE, KGE) to quantify performance. For example, Lines 10–12 mention that the LSTM performs best during the cold period but worse during the warm period; this should be supported with specific numbers.
From the abstract, the advantages of the hybrid model over the LSTM are not obvious. Lines 10–15 suggest that hybrid models have similar accuracy to LSTMs. Please clarify the added benefit.
Line 86: Please correct the citation formatting.
The introduction is well written.
Line 153: If the Po River basin is not included in the analysis, it may be better not to mention it here (or clarify this later, as in Line 159).
Line 310: Please clarify the distinction between “in-sample HBV” and “regional HBV.”
Line 320: This relates to my major concern: how does precipitation change between periods and among different catchments?
Lines 315–319: The reported values are very close to each other, and they represent means or medians over hundreds of catchments. Could these differences fall within model uncertainty?
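One possible way to check this, sketched below under stated assumptions, is a paired percentile bootstrap over catchments on the difference in a skill score between two models. The function name `paired_bootstrap_median_diff` and the score values are hypothetical and purely illustrative; they are not results from the manuscript.

```python
import random
import statistics
from typing import List, Tuple


def paired_bootstrap_median_diff(
    a: List[float],
    b: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Median of (a - b) over catchments with a percentile bootstrap interval."""
    if len(a) != len(b):
        raise ValueError("scores must be paired per catchment")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    # Resample catchments with replacement and recompute the median difference.
    boots = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(diffs), lower, upper


# Illustrative per-catchment scores for two models (made-up numbers).
scores_model_a = [0.80, 0.75, 0.82, 0.70, 0.78]
scores_model_b = [0.78, 0.74, 0.79, 0.69, 0.75]
median_diff, lower, upper = paired_bootstrap_median_diff(scores_model_a, scores_model_b)
```

If the resulting interval excludes zero, the difference in medians is unlikely to be an artefact of catchment sampling alone. This addresses sampling variability only, not parameter or structural uncertainty of the models.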
Section 3.4: In general, the hybrid and HBV models perform worse than the LSTM model. Is this due to limitations of HBV in snow-affected catchments, where LSTM may better learn snow–streamflow relationships? Does the hybrid model inherit these limitations from HBV, preventing it from outperforming the LSTM?
Line 415: All models show higher performance for flood events than for drought or low flows. Is this due to the choice of objective function (NSE), which emphasizes high-flow periods?
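To illustrate why this question matters, here is a small sketch comparing NSE with NSE computed on log-transformed flows, a common low-flow-oriented variant. The flow series is synthetic, the helper names `nse` and `log_nse` are my own, and the sketch makes no assumption about the objective function actually used in the manuscript.

```python
import math
from typing import Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def log_nse(obs: Sequence[float], sim: Sequence[float], eps: float = 1e-6) -> float:
    """NSE on log-transformed flows; eps guards against log(0)."""
    return nse([math.log(o + eps) for o in obs], [math.log(s + eps) for s in sim])


# Synthetic series with one flood peak between low flows.
obs = [1.0, 1.0, 2.0, 50.0, 2.0, 1.0]
sim_low_bias = [1.5, 1.5, 3.0, 50.0, 3.0, 1.5]   # all low flows overestimated by 50 %
sim_peak_bias = [1.0, 1.0, 2.0, 25.0, 2.0, 1.0]  # only the peak underestimated by 50 %

nse_low, nse_peak = nse(obs, sim_low_bias), nse(obs, sim_peak_bias)
log_low, log_peak = log_nse(obs, sim_low_bias), log_nse(obs, sim_peak_bias)
```

In this toy case NSE barely penalizes the low-flow bias but drops sharply for the peak bias, while the log-transformed score ranks the two simulations the other way round. A model trained or calibrated on NSE would therefore be expected to favor high-flow accuracy.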