EGUsphere

Copernicus Publications

Göttingen, Germany

10.5194/egusphere-2026-3272

Setting the Bar: Benchmarks for Model Performances in Large-Sample Hydrology

Jan Seibert¹ (https://orcid.org/0000-0002-6314-2124)

Marc Vis¹ (https://orcid.org/0000-0002-5589-2611)

Sandra Pool (https://orcid.org/0000-0001-9399-9199)

University of Zurich, Department of Geography, Winterthurerstrasse 190, 8057 Zurich, Switzerland

Eawag, Swiss Federal Institute of Aquatic Science and Technology, Department Water Resources and Drinking Water, Überlandstrasse 133, 8600 Dübendorf, Switzerland

11 06 2026

2026 1 23

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-3272/

The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-3272/egusphere-2026-3272.pdf

The availability of large-sample hydrometeorological datasets, now widespread across many regions worldwide, has changed hydrological catchment modelling. Assessing model performance is an essential component of any modelling exercise, and an important question is how to interpret the values of performance measures. The performance of uncalibrated bucket-type models varies significantly across regions and can reach Nash-Sutcliffe efficiency (NSE) values of 0.8 or higher, particularly in humid or snow-dominated catchments. This implies that judging model performance against a fixed value of a performance measure, as sometimes suggested in the literature, is inappropriate. Instead, one should consider that the performance we should expect from any model in a particular catchment can vary widely, depending on local hydroclimatic conditions and the quality of the available data. At the same time, a perfect fit (an NSE value of 1) is usually impossible to achieve because of errors and uncertainties in the model and data. It is therefore helpful to compare model performance to lower and upper benchmarks.

The purpose of this study was two-fold. First, we examined how to compute lower benchmarks, including determining appropriate ensemble sizes, assessing the effects of parameter ranges, deciding whether to use random or regional parameter sets, and evaluating how best to aggregate the ensemble of simulations. We also examined the relationships between lower and upper benchmarks and catchment characteristics. Second, we used these findings to compute both lower and upper benchmarks for many of the existing large-sample datasets. By providing these values to the modelling community, we aim to facilitate the broader use of lower and upper benchmarks in large-sample hydrological modelling studies. We argue that these values are valuable because they provide a basis for evaluating model performance across the various large-sample datasets. This will allow model performance to be assessed in light of what one could and should expect for a particular catchment.
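
As an illustration of the idea of reading an NSE value against catchment-specific benchmarks, the following minimal Python sketch computes the NSE, derives a lower benchmark from an ensemble of uncalibrated simulations, and rescales a model score between a lower and an upper benchmark. It is not the procedure used in this study: the function names, the choice of the median as the aggregation statistic, and the linear rescaling are all assumptions made for illustration only.

```python
from statistics import mean, median


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors) / (sum of squared
    deviations of the observations from their mean)."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated series must have equal length")
    obs_mean = mean(observed)
    squared_errors = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - obs_mean) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when the observations are constant")
    return 1.0 - squared_errors / spread


def lower_benchmark(observed, ensemble, aggregate=median):
    """Hypothetical lower benchmark: an aggregate (assumed here: the median) of the
    NSE values obtained by an ensemble of uncalibrated simulations."""
    return aggregate([nse(observed, simulated) for simulated in ensemble])


def relative_performance(score, lower, upper):
    """Hypothetical rescaling: 0 means the model only matches the lower benchmark,
    1 means it reaches the upper benchmark."""
    if upper <= lower:
        raise ValueError("upper benchmark must exceed lower benchmark")
    return (score - lower) / (upper - lower)
```

Under these assumptions, a model with an NSE of 0.85 in a catchment whose lower benchmark is already 0.8 would score far lower on the rescaled measure than the same NSE in a catchment with a lower benchmark of 0.3, which is the kind of catchment-specific interpretation argued for above.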