This work is distributed under the Creative Commons Attribution 4.0 License.
Technical note: Separating signal from noise in large-domain hydrologic model evaluation – Benchmarking model performance under sampling uncertainty
Abstract. Large-domain hydrologic modeling studies are becoming increasingly common. The evaluation of the resulting models is, however, often limited to the use of aggregated performance scores that show where model accuracy is higher and lower. Moreover, the inherent uncertainty in such scores, stemming from the choice of time periods used for their calculation, often remains unaccounted for. Here we use a collection of simple benchmarks, whilst accounting for this sampling uncertainty, to provide context for the performance scores of a large-domain hydrologic model. The benchmarks suggest that there are considerable constraints on the model's performance in approximately one-third of the basins used for model calibration and in approximately half of the basins where model parameters are regionalized. Sampling uncertainty has limited impact: in most basins the model is either clearly better or clearly worse than the benchmarks, although accounting for sampling uncertainty remains important when comparing models of similar performance. The areas where the benchmarks outperform the model only partially overlap with the areas where the model achieves lower performance scores, which suggests that improvements may be possible in more regions than a first glance at model performance values would indicate. A key advantage of these benchmarks is that they are easy and fast to compute, particularly compared to the cost of configuring and running the model. This makes benchmarking a valuable tool that can complement more detailed model evaluation techniques by quickly identifying areas that should be investigated more thoroughly.
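The preprint itself defines its benchmark set; purely as an illustration of what a "simple benchmark" of this kind can look like, the sketch below scores a day-of-year climatology of observed flow with the Kling-Gupta efficiency. The function names and the choice of benchmark are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009); 1 is perfect, lower is worse."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def doy_climatology_benchmark(obs_cal, doy_cal, doy_eval):
    """Benchmark 'model': for each day of year, predict the mean observed
    flow on that calendar day during the calibration period."""
    clim = {d: obs_cal[doy_cal == d].mean() for d in np.unique(doy_cal)}
    return np.array([clim[d] for d in doy_eval])

# A model only adds value in a basin if it beats such a benchmark, e.g.:
# kge(model_sim, obs_eval) > kge(doy_climatology_benchmark(...), obs_eval)
```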
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-6460', Anonymous Referee #1, 13 Mar 2026
- RC2: 'Comment on egusphere-2025-6460', Anonymous Referee #2, 16 Mar 2026
The authors provide a nice and easy-to-read study on large-domain hydrological model evaluation. They introduce benchmarks as a valuable tool to investigate where a large-domain hydrologic model might still be lacking in performance. This makes the manuscript, in my opinion, very interesting for a wider hydrological audience. Before publication, however, I think the manuscript would benefit from somewhat more analysis of why the model fails in certain areas. Below, I try to give some constructive feedback for the authors to consider:
General remarks:
- I have the feeling that a simple explanation of why a model fails if it is worse than a certain benchmark would help readers understand the point of this technical note.
- I have the feeling the Introduction is not well linked to the rest of the manuscript. I did not get the impression that the questions raised there were answered. The manuscript does not give any guidance on whether a score is indicative or useful for a model, nor does it go into quantifying their uncertainty. Isn't the point of the manuscript more to find the regions and reasons where the model fails against the suggested benchmarks? I recommend restructuring the introduction accordingly.
- Discussion: I would like to see a more in-depth discussion of what it actually means if the benchmark is better than the model. After that, you can go into the analysis of where and why the model might have failed. For this, however, I would recommend putting more emphasis on why the model has failed. Maybe look at it from a model development perspective: what would you need to do to improve the model? Try to give some guidance. For example, correlating your J index against catchment attributes (soil, land use, climate, etc.), and also against the KGE, might give more insights (see the sketch after this list). I acknowledge that the authors already provide some discussion of why the model might fail under certain circumstances, but very few of these points are really based on the results of this manuscript; they rest rather on the authors' knowledge and other literature.
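To make the reviewer's suggestion concrete, here is a minimal sketch of such a correlation analysis. It assumes a per-basin table holding the paper's J index, the KGE, and catchment attributes; the synthetic data and the attribute column names below are placeholders, not values from the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic stand-in for a per-basin table; in practice the J index, KGE,
# and attributes (e.g., from CAMELS) would be loaded from the study's data.
rng = np.random.default_rng(42)
n_basins = 500
df = pd.DataFrame({
    "J": rng.normal(size=n_basins),           # benchmark-based skill index
    "KGE": rng.normal(size=n_basins),         # model performance score
    "aridity": rng.lognormal(size=n_basins),  # placeholder attribute names
    "forest_frac": rng.uniform(size=n_basins),
    "snow_frac": rng.uniform(size=n_basins),
})

# Rank correlation of the skill index against the score and each attribute
for col in ["KGE", "aridity", "forest_frac", "snow_frac"]:
    rho, p = spearmanr(df["J"], df[col])
    print(f"{col:12s} rho = {rho:+.2f}  (p = {p:.3f})")
```

Rank correlation is used here because skill indices and attributes are rarely jointly Gaussian; any monotone association still registers.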
Minor comments:
- Title: Does it really fit what the manuscript is about? I would suggest something like “Technical note: Benchmarking large-domain hydrological model performance”. If the authors want to state uncertainty in their title, they should be specific about what kind of uncertainty they are referring to.
- Introduction: What are simple benchmarks exactly, where have they been used, what is their benefit, and how do they relate to hydrological signatures?
- Section 2.4 Benchmarks: It should be better explained which benchmarks are actually used and why.
- Figure 2: Might it be easier to focus on the evaluation period only? And maybe I missed it in the Data and Methods section, but it should be clearly defined what the evaluation period is. Section 2.4 speaks of a validation period; is this used as a synonym here? If so, it would be better to use only one of the two terms throughout the manuscript.
- Figure 3: Spell out what the BMs actually stand for; that's not too much text for the figure.
Citation: https://doi.org/10.5194/egusphere-2025-6460-RC2
Data sets
Data for "Separating Signal from Noise in Large-Domain Hydrologic Model Evaluation: Benchmarking model performance under sampling uncertainty" Gaby Gründemann, Wouter Knoben, Yalan Song, Katie van Werkhoven, and Martyn Clark https://doi.org/10.5281/zenodo.18028487
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 142 | 95 | 14 | 251 | 53 | 11 | 22 |
Review of "Technical note: Separating signal from noise in large-domain hydrologic model evaluation - Benchmarking model performance" by Gründemann et al.
The technical note promotes the use of various benchmarks for model performance evaluation, particularly in a large-domain setting (or for large-sample studies), and includes a quantification of sampling uncertainty across different periods through bootstrapping of hydrological years.
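As a rough sketch of what such year-block bootstrapping can look like (the manuscript's exact resampling scheme may differ; the use of KGE as the score and the function names here are assumptions):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def bootstrap_kge(sim, obs, year_id, n_boot=1000, seed=0):
    """Resample whole hydrological years with replacement and recompute
    the score, yielding a sampling distribution rather than one value."""
    rng = np.random.default_rng(seed)
    years = np.unique(year_id)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        drawn = rng.choice(years, size=years.size, replace=True)
        idx = np.concatenate([np.flatnonzero(year_id == y) for y in drawn])
        scores[i] = kge(sim[idx], obs[idx])
    return np.percentile(scores, [5, 50, 95])  # e.g., a 90 % interval
```

Resampling whole years rather than individual days preserves the seasonal autocorrelation of streamflow within each resampled block.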
The manuscript is clearly and concisely written, and well structured.
Before I can recommend publication, however, I would like to raise the following comments:
Major comments:
- Since this note is all about the benchmarks, I think two ingredients are missing:
1) Please add the benchmarks and their description to the main text, not just to the supplementary material, and ensure that the abbreviations match those in the figures (or vice versa)
2) Each of the benchmarks is essentially a test of how well a model should minimally perform regarding a specific aspect. This is not discussed in detail in the manuscript, but I think providing some examples would really help promote the use of various benchmarks, from very simple ones targeting maybe the water balance to more complex ones. I would suggest extending the discussion and conclusions accordingly, as well as adding this explanation regarding which aspect they are benchmarking to the table describing them.
- There is the sampling uncertainty and there is the model uncertainty, but the metrics are also affected by the uncertainty inherent in the observations. It would be worth reminding the reader that these uncertainties can be considerably large and influence the performance metric. For instance, for discharge, there is the rating curve uncertainty, which is not constant but varies with the flows (see for instance Westerberg et al., 2011); a small sketch of such a flow-dependent perturbation follows below.
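The sketch referenced above illustrates the point: perturbing observed discharge with a multiplicative error whose spread grows with flow, then re-scoring, shows how much of the metric can be attributable to observation error alone. The error model here is a simple assumption for illustration, not the one derived by Westerberg et al. (2011).

```python
import numpy as np

def perturb_discharge(obs, rng, base_err=0.05, high_flow_err=0.20):
    """Multiplicative lognormal noise whose spread grows with flow,
    mimicking a rating curve that is least certain at high flows."""
    q = (obs - obs.min()) / (obs.max() - obs.min())      # 0..1 flow proxy
    sigma = base_err + (high_flow_err - base_err) * q    # flow-dependent error
    return obs * rng.lognormal(mean=0.0, sigma=sigma)

# Re-scoring the same simulation against many perturbed 'observations'
# gives a distribution of the metric due to observation error alone, e.g.:
# scores = [kge(sim, perturb_discharge(obs, rng)) for _ in range(1000)]
```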
Line by line comments:
Abstract
L4 name at least some examples of what is meant by a simple benchmark, i.e. make it more specific
L5-7 these results are valid for the study region and basins but not for other regions; please add that the data set is from the United States, and maybe even add NWM
L9 ", though accounting..." this part of the sentence is not clear. Please rephrase.
Main text
L21-25 the words "score", "statistics", "efficiency", and "metrics" are used interchangeably. I would suggest using only one, where applicable, and using it consistently throughout the manuscript
L22 "and more" remove (there is already "for example" in the same sentence)
L34 ... or further checks are required
L40 "can be " -> "is"
L120 since the benchmarks are the core of this note, Table S1 should be moved to the main text and the abbreviations adjusted accordingly.
L126 "as" -> "that"
L239 Supporting
L239 abbreviation was already introduced in L25
L257 "perform" missing?
L262 which benchmark? please add
L284 remove "and" before "snow"
Figure 2 In the upper panel, the lines are not distinguishable in b&w print
Figure 3 Please add the written-out benchmarks in the caption so that the figure can stand alone.
References
Westerberg, I., Guerrero, J. L., Seibert, J., Beven, K. J., & Halldin, S. (2011). Stage‐discharge uncertainty derived with a non‐stationary rating curve in the Choluteca River, Honduras. Hydrological Processes, 25(4), 603-613.