https://doi.org/10.5194/egusphere-2026-3272
11 Jun 2026
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Setting the Bar: Benchmarks for Model Performances in Large-Sample Hydrology

Jan Seibert, Marc Vis, and Sandra Pool

Abstract. The availability of large-sample hydrometeorological datasets, now widespread across many regions worldwide, has changed hydrological catchment modelling. Assessing model performance is an essential component of any modelling exercise, and an important question is how to interpret the values of performance measures. The performance of uncalibrated bucket-type models varies considerably across regions and can reach Nash-Sutcliffe efficiency (NSE) values of 0.8 or higher, particularly in humid or snow-dominated catchments. This implies that judging model performance against a fixed value of a performance measure, as sometimes suggested in the literature, is inappropriate. Instead, one should recognise that the performance to expect from any model in a particular catchment can vary widely, depending on local hydroclimatic conditions and the quality of the available data. At the same time, a perfect fit (NSE value of 1) is usually impossible to achieve because of errors and uncertainties in the model and data. It is therefore helpful to compare model performance to lower and upper benchmarks.
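To make the reference points in the abstract concrete, the following minimal sketch computes the NSE from its standard definition and then places a score between a lower and an upper benchmark. The function names and the linear rescaling in `relative_to_benchmarks` are assumptions made for illustration only; the abstract does not state how the authors relate a model score to the two benchmarks.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    # Squared model errors relative to the variance of the observations.
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def relative_to_benchmarks(score, lower, upper):
    """Hypothetical helper: rescale a score so that 0 equals the lower
    benchmark and 1 equals the upper benchmark (an illustrative assumption)."""
    return (score - lower) / (upper - lower)


if __name__ == "__main__":
    # Made-up streamflow values, for illustration only.
    obs = [1.0, 2.0, 3.0, 4.0]
    print(nse(obs, obs))               # perfect simulation
    print(nse(obs, [2.5] * 4))         # simulation equal to the observed mean
    # The same NSE of 0.75 reads very differently in two hypothetical catchments.
    print(relative_to_benchmarks(0.75, lower=0.70, upper=0.90))
    print(relative_to_benchmarks(0.75, lower=0.20, upper=0.80))
```

Under this assumed rescaling, an NSE of 0.75 sits close to the lower benchmark in the first hypothetical catchment and close to the upper benchmark in the second, which is the reason a fixed threshold can mislead.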

The purpose of this study was twofold. First, we examined how to compute lower benchmarks, including determining appropriate ensemble sizes, assessing the effects of parameter ranges, deciding whether to use random or regional parameter sets, and evaluating how best to aggregate the ensemble of simulations. We also examined the relationships between lower and upper benchmarks and catchment characteristics. Second, we used these findings to compute both lower and upper benchmarks for many of the existing large-sample datasets. By providing these values to the modelling community, we aim to facilitate the broader use of lower and upper benchmarks in large-sample hydrological modelling studies. We argue that these values are valuable because they provide a common basis for evaluating model performance across the various large-sample datasets. This allows model performance to be assessed against what one could and should expect for a particular catchment.
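The question of how to aggregate an ensemble of simulations into a single lower-benchmark value can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the ensemble has already been simulated (one row per parameter set), and the two aggregation options shown, the median of the member NSE values and the NSE of the ensemble-mean hydrograph, are assumed candidates chosen for illustration. All function names are hypothetical.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency of one simulated series."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def lower_benchmark_candidates(observed, ensemble):
    """Return two candidate aggregations for an ensemble of simulations.

    ensemble has shape (n_members, n_timesteps), one row per parameter set.
    Both aggregations are illustrative assumptions.
    """
    sims = np.asarray(ensemble, dtype=float)
    member_scores = np.array([nse(observed, member) for member in sims])
    return {
        # Option A: score every member, then take the median score.
        "median_of_member_nse": float(np.median(member_scores)),
        # Option B: average the hydrographs first, then score the mean series.
        "nse_of_ensemble_mean": float(nse(observed, sims.mean(axis=0))),
    }


if __name__ == "__main__":
    # Made-up values: one perfect member and one member equal to the observed mean.
    obs = [1.0, 2.0, 3.0, 4.0]
    ensemble = [[1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]]
    print(lower_benchmark_candidates(obs, ensemble))
```

The two options generally give different values, because averaging hydrographs smooths out member errors before scoring, so the choice of aggregation affects where the lower benchmark ends up.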

Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 23 Jul 2026)

Latest update: 11 Jun 2026
Short summary
We studied how well simple bucket-type models can reproduce observed river flow using large datasets from many regions of the world. Model performance varies widely depending on local conditions, so fixed performance thresholds are misleading. To judge model performance more fairly, we propose lower and upper benchmarks. These benchmarks help to clarify what level of model performance is achievable and thus make it possible to compare models across different catchments.