Setting the Bar: Benchmarks for Model Performances in Large-Sample Hydrology

Seibert, Jan; Vis, Marc; Pool, Sandra

doi:10.5194/egusphere-2026-3272

Preprints

https://doi.org/10.5194/egusphere-2026-3272

11 Jun 2026

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Abstract. The availability of large-sample hydrometeorological datasets, now widespread across many regions worldwide, has changed hydrological catchment modelling. Assessing model performance is an essential component of any modelling exercise, and an important question is how to interpret performance measure values. Performances of uncalibrated bucket-type models vary significantly across regions and can reach NSE values of 0.8 or higher, particularly in humid or snow-dominated catchments. This implies that using a fixed value for a performance measure to judge model performance, as sometimes suggested in the literature, is inappropriate. Instead, one should consider that, given local hydroclimatic conditions and the quality of the available data, the performance we should expect from any model in a particular catchment can vary widely. At the same time, a perfect fit (NSE value of 1) is usually impossible to achieve due to errors and uncertainties in the model and data. Therefore, it is helpful to compare model performances to lower and upper benchmarks.

The purpose of this study was two-fold. First, we examined how to compute lower bounds, including determining appropriate ensemble sizes, assessing the effects of parameter ranges, deciding whether to use random or regional parameter sets, and evaluating how best to aggregate the ensemble of simulations. We also examined the relationships between lower and upper benchmarks and catchment characteristics. Secondly, we utilised these findings to compute both lower and upper benchmarks for many of the existing large sample datasets. By providing these values to the modelling community, we aim to facilitate the broader use of lower and upper benchmarks in large-sample hydrological modelling studies. We argue that these values are valuable as they provide a basis for evaluating model performance across the various large-sample datasets. This will allow assessment of model performance, considering what one could and should expect for a particular catchment.

Received: 05 Jun 2026 – Discussion started: 11 Jun 2026

Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 2177 KB)

Supplement (5774 KB)

Status: open (until 08 Aug 2026)

RC1:
'Very valuable, but needs to provide details on setup and performance values', Benedikt Heudorfer, 19 Jun 2026

General comment:
This paper deals with very relevant questions regarding the proper comparison of model quality on large-sample hydrological datasets. It uses the HBV conceptual model to provide upper and lower performance bounds at the dataset and catchment level. This is a highly relevant contribution that can also guide the general machine learning community, as it provides (among other things) reasonable lower bounds on the predictability of individual catchments from a conceptual hydrological model. Since neural network training is inherently random, model performance for a given catchment can vary substantially between model runs. The study presented here can help address this shortcoming by telling us where good performance can be expected and where not.

Major comment:
However, the main shortcoming of this paper is that it needs better documentation of how the models are actually run, on which catchments they are run, on which time periods they are trained, what the exact performance is for each respective dataset and/or catchment, etc. While most of this is not sensible to report in the main body of the paper, these details should be included somewhere: in the supplement files, in a data repository, or published along with the code (the link in the data availability section is dead). These details are crucial for what the study sets out to do, namely enabling comparison and benchmarking of models in future studies; without them, future studies will not be able to replicate the experimental setup properly. See also the specific comments on this topic.

Specific comments:

Line 46: please introduce NPE properly.

Line 98: full stop missing.

Section 2.3 is missing crucial information, e.g. on test periods. To allow future studies to compare performance, the exact test periods should be named for each dataset. Background: the choice of data strongly affects the value of the performance metric (as the “elephant in the room” paper by Maier 2023 points out: https://doi.org/10.1016/j.envsoft.2023.105779). If exact periods and catchment lists are not reported for each dataset, a valid comparison with your results is practically impossible, rendering the attempt to provide benchmarks for future reference futile. The exact list of catchments on which you train should also be reported somewhere. People need it to match their model setup exactly in the future, in case they want to benchmark their results.

Line 117-119: “The first one to two years were used as a warming-up period to obtain reasonable initial conditions for the storage components of HBV. The remaining years were then used for model calibration (upper benchmark) and evaluation (lower benchmark) with streamflow.” Does that mean no test period for calibrated models, and the reported upper benchmarks are from the calibration period metrics? If yes, that should be stated clearly. I don’t know how it is treated in the conceptual modelling community, but in machine learning this would be considered data leakage and not a reliable metric for model performance, as it will mostly speak for the degree of overfitting. Could you please elaborate whether the phenomenon of overfitting is relevant in the conceptual modelling domain, or why the choice to report calibration period metric makes sense? Or if I misunderstand the section and this all is beside the point, please clarify.

Line 131-134: very good strategy, which reflects the authors' knowledge about the dependency of metrics on the calculation procedure. Speaks for the rigour of the study.
Line 135-138: here as well, the exact catchment IDs associated with the lower benchmark should be made public somewhere, if not already present in a published codebase. That is, unless the random selection of the 10 catchments was repeated with replacement, in which case tracing it may be pointless. Background: the choice of data strongly affects the value of the performance metric.

Section 3.1: only the evaluation based on NPE is reported, but Section 2.3.1 states that performance was evaluated based on NSE, KGE and NPE. Since NPE is more novel and NSE and KGE are more widely used, KGE and NSE should be reported as well. Also, to allow proper use of this benchmark in future studies, tables with exact metric values (including any measures of spread such as the IQR) should be reported, if only in the supplement.
Section 3.2: Having the influence of the calculation procedure on the final metric value finally worked out in a paper is super exciting to me. We briefly discussed this effect in our 2025 paper (https://doi.org/10.1029/2024GL113036), and internal tests show metric value deviations of up to 0.1 NSE depending on the calculation procedure, but I am not aware that this has been specifically addressed anywhere yet. Very nice result, which could be highlighted more prominently in my humble opinion.

Line 208-209: I don’t understand exactly how this parameter range width is applied; how can a parameter range be negative? The range values should also be included in the x-axis labels of Figure 3.

Table 3: ah, here is the table with the exact upper and lower benchmark values; but it should be further split into the individual datasets, so people can compare their models against it in the future. These are surprisingly strong lower benchmarks, by the way. Reading this paper made me think about how a lower benchmark might be computed for ML/DL models. If you have any ideas, I’d appreciate it if you could elaborate on them and/or put them into the paper.

Citation: https://doi.org/10.5194/egusphere-2026-3272-RC1
- AC1: 'Quick Reply on link comment in RC1', Jan Seibert, 22 Jun 2026
  
  Thanks for the overall positive assessment of our manuscript. While we will, of course, reply in detail to the various valuable comments, here is a quick reply regarding the 'dead' Zenodo link.
  Please see the footnote: "This will be the link to the data on Zenodo once the paper is accepted. A temporary link is made available to the editor and reviewers."
  The reason is that once we activate the link, we cannot make changes in the data (structure) on Zenodo without creating a new version. However, it seems the temporary link might not have reached the reviewer. We will send this by email to the reviewer (and others upon request).
  
  Citation: https://doi.org/10.5194/egusphere-2026-3272-AC1
- AC2: 'Reply on RC1', Jan Seibert, 08 Jul 2026
  
  Please see our responses in the attached pdf
  
  Citation: https://doi.org/10.5194/egusphere-2026-3272-AC2
RC2:
'Comment on egusphere-2026-3272', Tam Nguyen, 24 Jun 2026

Disclaimer: ChatGPT was used to improve the clarity, grammar, and readability of this review without altering its intended meaning. I carefully reviewed and verified the revised text to ensure that all comments accurately reflect my original assessment and opinions.
General Assessment
From my understanding (please correct me if I have misunderstood any aspect of the methodology), the study derives lower and upper ranges for model performance metrics (NSE, KGE, and NPE). The lower benchmark is obtained from random or regional parameter sets, whereas the upper benchmark is derived from calibrated parameter sets. This framework is applied across multiple large-sample hydrological datasets (CAMELS and LamaH) using the HBV model. The overarching objective appears to be the establishment of performance ranges that can serve as references for future studies, allowing researchers to contextualize model performance in relation to catchment characteristics.

1. Strengths of the Manuscript

The manuscript represents, to my knowledge, the first study to establish the HBV model at such a large scale, covering 12 CAMELS and LamaH datasets. This extensive analysis provides a valuable contribution to the hydrological modelling community, especially in terms of model evaluation and inter-comparison.

Considering that nowadays there are many machine learning models and papers, this paper raises an important question to the ML community: “Do we need another ML model for streamflow prediction?” Considering the lack of trust and explainability in many ML models, this paper shows that a bucket-type model with “low data demand and ease of application” can be highly scalable, provide physically explainable results, and still achieve high performance. Therefore, ML models may only be needed if they can outperform bucket-type models such as this one.

2. Points for Discussion and Potential Improvements

From my understanding, the upper benchmark is derived from the calibration period. If this is correct, it should be stated explicitly in the manuscript. While overfitting may be less problematic for a relatively parsimonious model such as HBV, the interpretation of an upper benchmark derived from calibration data warrants discussion (e.g., if such a model is evaluated for another period, the upper benchmark could be lower). For more complex models with larger parameter spaces (e.g., SWAT, mHM), calibration performance may approach the theoretical optimum and therefore provide an unrealistically optimistic benchmark.

The upper benchmark is likely influenced by several factors, including: model structure, calibration period, objective function, data quality and completeness, record length, model complexity. These factors deserve further discussion and analysis in the manuscript.

Human influence on the upper and lower benchmarks: catchments where streamflow is substantially modified by reservoirs, weirs, diversions, or water transfer schemes may exhibit much lower values for both benchmarks. Additional analysis of how anthropogenic influences affect the lower and upper benchmarks would strengthen the manuscript.

3. Specific Comments

L48–49

“A perfect fit (value of 1) is usually impossible to achieve due to model and data errors and uncertainties. In the case of NSE, a value of zero can be considered a benchmark representing the prediction of a constant discharge equal to the annual mean discharge.”

This is the first explicit mention of discharge in the manuscript. Since the study focuses specifically on streamflow prediction, it may be helpful to introduce this earlier in the introduction so that readers immediately understand the scope of the performance evaluation.
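
For reference, the zero benchmark in the quoted passage follows directly from the standard definition of NSE. The notation below is an assumption introduced here, not taken from the manuscript: $Q_{\mathrm{sim},t}$ and $Q_{\mathrm{obs},t}$ are simulated and observed discharge at time step $t$, and $\overline{Q}_{\mathrm{obs}}$ is the mean observed discharge over the evaluation period.

```latex
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T}\left(Q_{\mathrm{sim},t} - Q_{\mathrm{obs},t}\right)^{2}}
                        {\sum_{t=1}^{T}\left(Q_{\mathrm{obs},t} - \overline{Q}_{\mathrm{obs}}\right)^{2}},
\qquad
Q_{\mathrm{sim},t} = \overline{Q}_{\mathrm{obs}} \;\; \forall t
\;\Longrightarrow\;
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T}\left(\overline{Q}_{\mathrm{obs}} - Q_{\mathrm{obs},t}\right)^{2}}
                        {\sum_{t=1}^{T}\left(Q_{\mathrm{obs},t} - \overline{Q}_{\mathrm{obs}}\right)^{2}} = 0 .
```

A constant prediction equal to the observed mean therefore scores exactly zero, and any simulation with a larger squared error than that constant scores below zero.
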
L54–58

“…using any fixed value for a performance measure to judge model performance ... might not be appropriate…”

While I agree with the authors’ argument regarding the limitations of “absolute” performance thresholds, it may be worth acknowledging that absolute performance measures still provide valuable information. For example, NSE = 1 represents a theoretically optimal model and remains an important reference point for model development. Relative benchmarks are highly useful for contextual comparisons, but they may not fully replace absolute performance measures, particularly in regions where all available models perform poorly.
L63

“both upper and lower benchmarks”

The terms "upper" and "lower" benchmark should be defined more explicitly when first introduced. Initially, I interpreted the lower benchmark as the minimum achievable model performance, whereas it actually appears to represent performance obtained under minimal calibration effort (i.e., random parameter sets).
L66–67

“the choice of model is less crucial…”

I am not fully convinced by this statement. For example, Kratzert et al. (2024, https://doi.org/10.5194/hess-28-4187-2024) demonstrated substantial differences in predictive performance among VIC, mHM, and LSTM models. Additional justification or supporting evidence would strengthen this conclusion.
L71–72 and L76–77

The manuscript discusses using either the median ensemble performance or the performance of the ensemble mean simulation as a lower benchmark.

From a parameter-estimation perspective, the ensemble mean simulation may not correspond to any physically realizable parameter set. Therefore, the median performance across ensemble members appears conceptually more consistent for model comparison purposes. It would be helpful if the authors could provide a stronger rationale for preferring one aggregation method over the other.
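
The difference between the two aggregation options can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, NSE stands in for any of the three metrics, and the "ensemble" is synthetic noise around a synthetic observed series rather than HBV output.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency of one simulated series against observations."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def median_of_performances(ensemble, obs):
    """Option 1: score every ensemble member, then take the median score."""
    return float(np.median([nse(member, obs) for member in ensemble]))

def performance_of_mean(ensemble, obs):
    """Option 2: average the simulations per time step, then score that one series."""
    return float(nse(np.mean(ensemble, axis=0), obs))

# Synthetic stand-in data (assumption): a seasonal "observed" series and
# 100 ensemble members that differ from it by independent random errors.
rng = np.random.default_rng(0)
obs = 2.0 + np.sin(np.linspace(0.0, 6.0 * np.pi, 365))
ensemble = obs + rng.normal(0.0, 0.5, size=(100, obs.size))

median_score = median_of_performances(ensemble, obs)
mean_sim_score = performance_of_mean(ensemble, obs)
```

In this synthetic case the errors of individual members are independent and cancel when averaged, so the ensemble mean simulation scores much higher than the median member. Real parameter ensembles have correlated errors, so the gap will generally be smaller, but the example shows why the two options are not interchangeable as a lower benchmark.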
L73–74

“The purpose of this study was two-fold. Firstly, we evaluated different approaches to derive lower benchmark values, and secondly, we provide upper and lower benchmark values for existing large-sample datasets.”

I think it should be made clear, here or earlier, that model performance refers to “streamflow”.
Section 2.1
Providing one detailed example of the HBV model setup and input data would greatly enhance reproducibility and facilitate adoption of the framework by other researchers.

In addition, the simulation, calibration, warm-up, and evaluation periods are not clearly described.

Table 1: Consider adding the simulation period, warm-up period, calibration period, and evaluation period. This information would improve transparency and reproducibility.
L117

“The remaining years…”

Please specify the exact calibration and evaluation periods rather than referring to them generically.
L129–130

The analysis of subsets containing 1, 2, 5, and 10 parameter sets is acceptable, although it is not immediately clear how informative these very small sample sizes are in practice.

Additionally, since 10,000 represents the full parameter ensemble rather than a subset, the wording could be revised for clarity.
L131

“The selection of subsets was repeated ten times”. I think the authors could repeat this more often to obtain a robust result (my rule of thumb is 30 times or more; please see this reference: https://web.stanford.edu/class/archive/cs/cs109/cs109.1212/lectureNotes/LN18_clt.pdf?utm_source=chatgpt.com). As the simulation results are already available, this should not require much effort.
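
A minimal sketch of such a repeated subset selection is given below, assuming the per-parameter-set performance values are already stored in an array. The function name, the synthetic scores, and the use of the median as the subset statistic are assumptions made for illustration; they are not taken from the manuscript.

```python
import numpy as np

def subset_median_spread(scores, subset_size, n_repeats, seed=0):
    """Draw `n_repeats` random subsets (without replacement within a subset)
    and return the mean and sample standard deviation of the subset medians.
    Requires n_repeats >= 2 for the standard deviation to be defined."""
    rng = np.random.default_rng(seed)
    medians = np.array([
        np.median(rng.choice(scores, size=subset_size, replace=False))
        for _ in range(n_repeats)
    ])
    return float(medians.mean()), float(medians.std(ddof=1))

# Synthetic stand-in for 10,000 performance values (assumption, not real results).
scores = np.random.default_rng(42).uniform(0.0, 1.0, size=10_000)

# Spread of the subset median for small and large subsets, 30 repetitions each.
mean_small, sd_small = subset_median_spread(scores, subset_size=10, n_repeats=30)
mean_large, sd_large = subset_median_spread(scores, subset_size=1000, n_repeats=30)
```

Because the scores are precomputed, raising `n_repeats` only costs additional resampling, not additional model runs, which is the point of the comment above.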

Table 2: The parameter-range combinations are difficult to follow. It would help if the authors explicitly stated the number of combinations considered and perhaps provided examples such as [LL1, UL1], [LL1, UL2], etc.
L157 (NPE): Many readers may be less familiar with NPE than NSE or KGE. A brief description of NPE and its interpretation relative to the other metrics would improve accessibility.
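
As a hedged pointer for readers, and assuming that NPE here refers to the non-parametric variant of the Kling-Gupta efficiency proposed by Pool et al. (2018), its usual form is sketched below. The manuscript's own definition takes precedence if it differs. The symbols are assumptions introduced here: $r_s$ is the Spearman rank correlation between simulated and observed discharge, $\beta$ the ratio of mean simulated to mean observed discharge, and $I(k)$ and $J(k)$ the time steps at which the $k$-th largest simulated and observed flows occur.

```latex
\mathrm{NPE} = 1 - \sqrt{\left(r_s - 1\right)^{2} + \left(\alpha_{\mathrm{NP}} - 1\right)^{2} + \left(\beta - 1\right)^{2}},
\qquad
\beta = \frac{\overline{Q}_{\mathrm{sim}}}{\overline{Q}_{\mathrm{obs}}},
\qquad
\alpha_{\mathrm{NP}} = 1 - \frac{1}{2}\sum_{k=1}^{n}
\left|\frac{Q_{\mathrm{sim}}\!\left(I(k)\right)}{n\,\overline{Q}_{\mathrm{sim}}}
     - \frac{Q_{\mathrm{obs}}\!\left(J(k)\right)}{n\,\overline{Q}_{\mathrm{obs}}}\right| .
```

Under this assumption, NPE has the same structure and the same optimum of 1 as KGE, but replaces the Pearson correlation with a rank correlation and the ratio of standard deviations with a comparison of normalised flow duration curves, which makes it less sensitive to a few extreme values.
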
L162–163

“…generated maps of model performance for each country and predicted model performance using random forest regression trees.”

I only observed maps for the United States. If maps were generated for additional regions, it would be useful to present them or provide them as supplementary material.

Figure 1: I was surprised to observe that the median NPE appears to increase with increasing numbers of random parameter sets (Figure 1a). Additional discussion of this behaviour would be helpful (e.g., is this model-specific behavior?)

Furthermore, the rationale for presenting NPE rather than NSE or KGE in this figure is not entirely clear. Including the mathematical definition of the reported statistics would also improve interpretability.

Section 3.2: It is not clear to me what the purpose of this section is.
Section 3.4: It is not clear to me which range was ultimately selected for deriving the lower benchmark.
Figure 3

This figure clarified that seven benchmark ranges were considered rather than the 49 combinations I initially inferred from Table 2. It may be helpful to explain this more clearly earlier in the manuscript.
Figure 5

It would be highly valuable to provide a supplementary CSV file containing: catchment identifiers, dataset name, benchmark values, calibration period, evaluation period. This would substantially increase the utility and reproducibility of the dataset for future studies.
L263

From my experience, seasonality in streamflow, precipitation, and other climatic variables is often a major determinant of hydrological predictability. Catchments with strong seasonal signals are frequently easier to model and may achieve higher performance metrics.

It would therefore be interesting to investigate whether seasonality metrics could explain some of the observed variation in benchmark performance and whether they should be included among the predictor variables considered in the analysis.

Citation: https://doi.org/10.5194/egusphere-2026-3272-RC2
- AC3: 'Reply on RC2', Jan Seibert, 08 Jul 2026
  
  Please see our responses in the attached pdf
  
  Citation: https://doi.org/10.5194/egusphere-2026-3272-AC3

Supplement

https://doi.org/10.5194/egusphere-2026-3272-supplement

Viewed

Total article views: 257 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
191	54	12	257	16	4	5

Views and downloads (calculated since 11 Jun 2026)

Month	HTML	PDF	XML	Total
Jun 2026	80	36	5	121
Jul 2026	111	18	7	136

Latest update: 22 Jul 2026

Short summary

We studied how well simple bucket-type models can reproduce observed river flow using large data sets from many regions in the world. Model performance varies widely depending on local conditions, so fixed performance thresholds are misleading. To better judge model performances, we propose lower and upper benchmarks. These benchmarks help to better understand what level of model performance is achievable and, thus, enable us to compare models across different catchments.
