https://doi.org/10.5194/egusphere-2025-1076
14 Mar 2025
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

How well do hydrological models learn from limited discharge data? A comparison of process- and data-driven models

Maria Staudinger, Anna Herzog, Ralf Loritz, Tobias Houska, Sandra Pool, Diana Spieler, Paul D. Wagner, Juliane Mai, Jens Kiesel, Stephan Thober, Björn Guse, and Uwe Ehret

Abstract. A widespread assumption is that data-driven models only achieve good results with sufficiently large training data, while process-based models are usually expected to be superior in data-poor situations. In our study, we investigate this assumption by calibrating several process-based and data-driven hydrological models with training data sets of observed discharge that differ in the number of data points and the type of data selection. The tested models include four commonly used process-based models (GR4J, HBV, mHM, and SWAT+) and four data-driven models (conditional probability distributions, regression trees, ANN, and LSTM), which are calibrated for three meso-scale catchments representing three different landscapes in Germany: the Iller in the Alpine region, the Saale in the low mountain ranges, and the Selke in the Central German lowlands. We used conditional entropy to evaluate model performance and the learning capability of a model (i.e., change in model performance with increasing sample size).
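As an illustration of the evaluation measure named above, the following is a minimal sketch of a histogram-based estimate of the conditional entropy of observed discharge given simulated discharge. The function name `conditional_entropy`, the equal-width binning, and the default of 10 bins are assumptions made for this example and are not taken from the preprint; the authors' own implementation is in the Zenodo repository listed under "Data sets" and may differ.

```python
import numpy as np


def conditional_entropy(target, predictor, n_bins=10):
    """Estimate H(target | predictor) in bits from a 2-D histogram.

    Assumption: equal-width bins on both variables (illustrative choice).
    """
    target = np.asarray(target, dtype=float)
    predictor = np.asarray(predictor, dtype=float)

    # Joint counts: rows index target bins, columns index predictor bins.
    joint, _, _ = np.histogram2d(target, predictor, bins=n_bins)
    p_joint = joint / joint.sum()
    p_pred = p_joint.sum(axis=0)  # marginal distribution of the predictor

    def entropy(p):
        p = p[p > 0]  # empty bins contribute nothing to the sum
        return float(-(p * np.log2(p)).sum())

    # Chain rule: H(T | P) = H(T, P) - H(P)
    return entropy(p_joint.ravel()) - entropy(p_pred)


# Synthetic stand-ins for observed and simulated discharge (not real data).
rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=5.0, size=3000)
simulated = observed + rng.normal(0.0, 2.0, size=3000)
print(conditional_entropy(observed, simulated))
```

Lower values mean that the simulation leaves less uncertainty about the observations. Repeating the calculation for models trained on increasing sample sizes would trace out a learning curve in the sense used in the abstract.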

In addition to the main question of this study, i.e., to what extent the performance of the different models depends on the training data set, we also investigated whether the selection of the training data (random or according to information content, selection of contiguous time periods, or independent time points) plays a role. We also investigated whether there is a relationship between the information contained in the data and the shape of the learning curve for different models that allows prediction of the achievable model performance, and whether the use of more spatially distributed model inputs leads to improved model performance compared to spatially lumped inputs.
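Two of the selection schemes mentioned above, independent random time points and a contiguous time period, can be sketched as index selections on a daily series. The function names and the use of a single randomly placed block are assumptions for illustration only; the selection by information content is not shown because the abstract does not specify how it is computed.

```python
import numpy as np


def sample_random_points(n_total, n_sample, rng):
    """Pick n_sample independent time steps without replacement."""
    return np.sort(rng.choice(n_total, size=n_sample, replace=False))


def sample_contiguous_block(n_total, n_sample, rng):
    """Pick one contiguous period of n_sample time steps at a random start."""
    # rng.integers excludes the upper bound, so the block always fits.
    start = rng.integers(0, n_total - n_sample + 1)
    return np.arange(start, start + n_sample)


rng = np.random.default_rng(1)
n_days = 10 * 365   # hypothetical record length
n_train = 2 * 365   # hypothetical training sample size

idx_points = sample_random_points(n_days, n_train, rng)
idx_block = sample_contiguous_block(n_days, n_train, rng)
print(idx_points[:5], idx_block[:5])
```

Both schemes return the same number of training points, so any difference in the resulting model performance can be attributed to how the points are distributed in time.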

Process-based models outperformed data-driven models for small amounts of training data because of their predefined structure based on process representation. However, with increasing amounts of training data, the learning curve of the process-based models quickly saturated, and with about 2 to 5 years of training data, the data-driven LSTM consistently outperformed all process-based models. In particular, the LSTM continued to learn from more training data without approaching saturation. Surprisingly, fully random sampling of training data points for the HBV model led to better learning results not only compared to consecutive random sampling but also compared to optimal sampling in terms of information content. Analyzing multivariate catchment data allows predictions about how these data can be used to predict discharge. When no memory was considered, the conditional entropy was large, but as soon as some memory was introduced in the form of a past day or past week, the conditional entropy became smaller, suggesting that memory is a very important component of the data and that capturing it improves model performance. This was particularly the case for the catchments in the low mountain ranges and the Alpine region.
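The effect of memory on conditional entropy can be illustrated with a toy example. The sketch below uses a synthetic linear-reservoir series, not the catchment data of the study, and compares the conditional entropy of discharge given same-day precipitation with the value obtained when the previous day's discharge is added as a predictor. The helper names, the bin count, and the reservoir coefficients are assumptions chosen for illustration.

```python
import numpy as np


def entropy_bits(counts):
    """Shannon entropy in bits of a histogram given as raw counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def cond_entropy(target, predictors, n_bins=8):
    """H(target | predictors) in bits; predictors is a list of 1-D arrays."""
    data = np.column_stack([target] + list(predictors))
    joint, _ = np.histogramdd(data, bins=n_bins)
    pred_only = joint.sum(axis=0)  # sum out the target dimension
    return entropy_bits(joint.ravel()) - entropy_bits(pred_only.ravel())


# Toy data: discharge as a linear reservoir driven by random rainfall.
rng = np.random.default_rng(42)
n = 5000
rain = rng.gamma(shape=0.5, scale=4.0, size=n)
q = np.zeros(n)
for t in range(1, n):
    q[t] = 0.9 * q[t - 1] + 0.1 * rain[t]

h_no_memory = cond_entropy(q[1:], [rain[1:]])
h_one_day = cond_entropy(q[1:], [rain[1:], q[:-1]])
print(h_no_memory, h_one_day)
```

Histogram estimates of this kind are biased low when many predictors are combined with few data points, because the bins become sparsely populated. Part of any decrease seen on real data can therefore be an estimation artifact rather than information gained.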

Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 15 May 2025)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-1076', Salvatore Manfreda, 18 Apr 2025

Data sets

MariStau/IMPRO_infotheory_Data_Code: Data and code used to calculate conditional entropy values. Maria Staudinger and Uwe Ehret. https://doi.org/10.5281/zenodo.14938050


Viewed

Total article views: 203 (including HTML, PDF, and XML)
  • HTML: 178
  • PDF: 19
  • XML: 6
  • Total: 203
  • Supplement: 10
  • BibTeX: 7
  • EndNote: 4
Views and downloads, including cumulative counts, are calculated since 14 Mar 2025.

Viewed (geographical distribution)

Total article views: 251 (including HTML, PDF, and XML), of which 251 have a defined geography and 0 are of unknown origin.
Latest update: 21 Apr 2025
Short summary
Four process-based and four data-driven hydrological models are compared using different training data. We found that process-based models perform better with small data sets but stop learning early, while data-driven models continue to learn for longer. The study highlights the importance of memory in the data and the impact of different data sampling methods on model performance. The direct comparison of these models is novel and provides a clear understanding of their performance under various data conditions.