Preprints
https://doi.org/10.5194/egusphere-2025-1076
14 Mar 2025
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

How well do hydrological models learn from limited discharge data? A comparison of process- and data-driven models

Maria Staudinger, Anna Herzog, Ralf Loritz, Tobias Houska, Sandra Pool, Diana Spieler, Paul D. Wagner, Juliane Mai, Jens Kiesel, Stephan Thober, Björn Guse, and Uwe Ehret

Abstract. A widespread assumption is that data-driven models only achieve good results with sufficiently large training data, while process-based models are usually expected to be superior in data-poor situations. In our study, we investigate this assumption by calibrating several process-based and data-driven hydrological models with training data sets of observed discharge that differ in the number of data points and the type of data selection. The tested models include four commonly used process-based models (GR4J, HBV, mHM, and SWAT+) and four data-driven models (conditional probability distributions, regression trees, ANN, and LSTM), which are calibrated for three meso-scale catchments representing three different landscapes in Germany: the Iller in the Alpine region, the Saale in the low mountain ranges, and the Selke in the Central German lowlands. We used conditional entropy to evaluate model performance and the learning capability of a model (i.e., change in model performance with increasing sample size).
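As an illustration of the evaluation metric named above, the following is a minimal sketch of a histogram-based estimate of conditional entropy, H(target | predictor), in bits. The abstract does not specify the estimator, so the equal-width binning, the number of bins, and the function names `bin_index` and `conditional_entropy` are assumptions made for this example and are not taken from the study. Read with observed discharge as the target and simulated discharge as the predictor, a lower value means that less uncertainty about the observations remains once the simulation is known.

```python
import math
from collections import Counter


def bin_index(value, lower, upper, n_bins):
    """Map a value to an equal-width bin index in [0, n_bins - 1] (assumed binning)."""
    if upper == lower:
        return 0  # constant series: everything falls into one bin
    idx = int((value - lower) / (upper - lower) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp the maximum into the last bin


def conditional_entropy(target, predictor, n_bins=10):
    """Estimate H(target | predictor) in bits from paired samples."""
    if len(target) != len(predictor) or len(target) == 0:
        raise ValueError("target and predictor must be non-empty and equally long")
    t_lo, t_hi = min(target), max(target)
    p_lo, p_hi = min(predictor), max(predictor)
    pairs = [
        (bin_index(t, t_lo, t_hi, n_bins), bin_index(p, p_lo, p_hi, n_bins))
        for t, p in zip(target, predictor)
    ]
    joint = Counter(pairs)                    # counts of (target bin, predictor bin)
    marginal = Counter(p for _, p in pairs)   # counts of predictor bin
    n = len(pairs)
    entropy = 0.0
    for (_, p_bin), count in joint.items():
        # p(t, p) * log2( p(t | p) ), with p(t | p) = count / marginal count
        entropy -= (count / n) * math.log2(count / marginal[p_bin])
    return entropy
```

With this estimator, a predictor identical to the target gives a conditional entropy of zero, while a constant predictor leaves the full entropy of the binned target.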

In addition to the main question of this study, i.e., to what extent the performance of the different models depends on the training data set, we also investigated whether the selection of the training data (random or according to information content, selection of contiguous time periods, or independent time points) plays a role. We also investigated whether there is a relationship between the information contained in the data and the shape of the learning curve for different models that allows prediction of the achievable model performance, and whether the use of more spatially distributed model inputs leads to improved model performance compared to spatially lumped inputs.
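Two of the selection schemes mentioned above, independent time points and a contiguous time period, can be sketched as index selections over a daily series. This is only a schematic illustration: the function names, the seeding, and the uniform random choice of the start day are assumptions for the example, and the selection by information content is not shown because the abstract does not describe how it is computed.

```python
import random


def sample_independent_points(n_total, n_sample, seed=0):
    """Pick n_sample distinct time steps at random from the whole record."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_sample))


def sample_contiguous_period(n_total, n_sample, seed=0):
    """Pick one randomly placed block of n_sample consecutive time steps."""
    rng = random.Random(seed)
    start = rng.randrange(n_total - n_sample + 1)  # last valid start is included
    return list(range(start, start + n_sample))
```

Both functions return the same number of training indices, so the two schemes differ only in how those indices are spread over the record.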

Process-based models outperformed data-driven models for small amounts of training data due to their predefined structure based on process representation. However, with increasing amounts of training data, the learning curve of the process-based models quickly saturated, and with about 2 to 5 years of training data, the data-driven LSTM consistently outperformed all process-based models. In particular, the LSTM continued to learn from more training data without approaching saturation. Surprisingly, fully random sampling of training data points for the HBV model led to better learning results not only compared to consecutive random sampling but also compared to optimal sampling in terms of information content. Analyzing the multivariate catchment data allows inferring how well these data can be used to predict discharge. When no memory was considered, the conditional entropy was large, but as soon as some memory was introduced in the form of a past day or past week, the conditional entropy became smaller, suggesting that memory is a very important component in the data and that capturing it improves model performance. This was particularly the case for the catchments in the low mountain ranges and the Alpine region.

Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Short summary
Four process-based and four data-driven hydrological models are compared using different...