This work is distributed under the Creative Commons Attribution 4.0 License.
Technical note: Separating signal from noise in large-domain hydrologic model evaluation – Benchmarking model performance under sampling uncertainty
Abstract. Large-domain hydrologic modeling studies are becoming increasingly common. The evaluation of the resulting models is, however, often limited to the use of aggregated performance scores that show where model accuracy is higher and lower. Moreover, the inherent uncertainty in such scores, stemming from the choice of time periods used for their calculation, often remains unaccounted for. Here we use a collection of simple benchmarks, whilst accounting for this sampling uncertainty, to provide context for the performance scores of a large-domain hydrologic model. The benchmarks suggest that there are considerable constraints on the model's performance in approximately one-third of the basins used for model calibration and in approximately half of the basins where model parameters are regionalized. Sampling uncertainty has limited impact: in most basins the model is either clearly better or clearly worse than the benchmarks, although accounting for sampling uncertainty remains important when comparing models of similar performance. The areas where the benchmarks outperform the model only partially overlap with the areas where the model achieves lower performance scores, which suggests that improvements may be possible in more regions than a first glance at model performance values would indicate. A key advantage of these benchmarks is that they are easy and fast to compute, particularly compared to the cost of configuring and running the model. This makes benchmarking a valuable tool that can complement more detailed model evaluation techniques by quickly identifying areas that should be investigated more thoroughly.
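The preprint itself defines its benchmark set; purely as an illustration of what a "simple benchmark" of this kind can look like, the sketch below scores a day-of-year climatology of observed flow with the Kling-Gupta efficiency. The function names and the choice of benchmark are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009); 1 is perfect, lower is worse."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def doy_climatology_benchmark(obs_cal, doy_cal, doy_eval):
    """Benchmark 'model': for each day of year, predict the mean observed
    flow on that calendar day during the calibration period."""
    clim = {d: obs_cal[doy_cal == d].mean() for d in np.unique(doy_cal)}
    return np.array([clim[d] for d in doy_eval])

# A model only adds value in a basin if it beats such a benchmark, e.g.:
# kge(model_sim, obs_eval) > kge(doy_climatology_benchmark(...), obs_eval)
```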
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-6460', Anonymous Referee #1, 13 Mar 2026
- RC2: 'Comment on egusphere-2025-6460', Anonymous Referee #2, 16 Mar 2026
The authors provide a nice and easy-to-read study on large-domain hydrological model evaluation. They introduce benchmarks as a valuable tool to investigate where a large-domain hydrologic model might still be lacking in performance. This makes the manuscript, in my opinion, very interesting for a wider hydrological audience. Before publication, however, I think the manuscript would benefit from somewhat more analysis of why the model fails in certain areas. Below, I try to give some constructive feedback for the authors to consider:
General remarks:
- I have the feeling that a simple explanation of why a model fails if it is worse than a certain benchmark would help readers understand the point of this technical note.
- I have the feeling the Introduction is not well linked to the rest of the manuscript. I did not get the impression that the questions raised there were answered. The manuscript does not give any guidance on whether a score is indicative or useful for a model, nor does it go into quantifying their uncertainty. Isn't the point of the manuscript more to find the regions and reasons where the model fails against the suggested benchmarks? I recommend restructuring the introduction accordingly.
- Discussion: I would like to see a more in-depth discussion of what it actually means if the benchmark is better than the model. After that, you can go into the analysis of where and why the model might have failed. For this, however, I would recommend putting more emphasis on why the model has failed. Maybe look at it from a model development perspective: what would you need to do to improve the model? Try to give some guidance. For example, correlating your J index against catchment attributes (soil, land use, climate, etc.), and also against the KGE, might give more insights (see the sketch after this list). I acknowledge that the authors already provide some discussion of why the model might fail under certain circumstances, but very few of these points are really based on the results of this manuscript; they rest rather on the authors' knowledge and other literature.
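To make the reviewer's suggestion concrete, here is a minimal sketch of such a correlation analysis. It assumes a per-basin table holding the paper's J index, the KGE, and catchment attributes; the synthetic data and the attribute column names below are placeholders, not values from the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic stand-in for a per-basin table; in practice the J index, KGE,
# and attributes (e.g., from CAMELS) would be loaded from the study's data.
rng = np.random.default_rng(42)
n_basins = 500
df = pd.DataFrame({
    "J": rng.normal(size=n_basins),           # benchmark-based skill index
    "KGE": rng.normal(size=n_basins),         # model performance score
    "aridity": rng.lognormal(size=n_basins),  # placeholder attribute names
    "forest_frac": rng.uniform(size=n_basins),
    "snow_frac": rng.uniform(size=n_basins),
})

# Rank correlation of the skill index against the score and each attribute
for col in ["KGE", "aridity", "forest_frac", "snow_frac"]:
    rho, p = spearmanr(df["J"], df[col])
    print(f"{col:12s} rho = {rho:+.2f}  (p = {p:.3f})")
```

Rank correlation is used here because skill indices and attributes are rarely jointly Gaussian; any monotone association still registers.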
Minor comments:
- Title: Does it really fit what the manuscript is about? I would suggest something like “Technical note: Benchmarking large-domain hydrological model performance”. If the authors want to state uncertainty in their title, they should be specific about what kind of uncertainty they are referring to.
- Introduction: What are simple benchmarks exactly, where have they been used, what is their benefit, and how do they relate to hydrological signatures?
- Section 2.4 Benchmarks: It should be better explained which benchmarks are actually used and why.
- Figure 2: Might it be easier to focus on the evaluation period only? And maybe I missed it in the Data and Methods section, but it should be clearly defined what the evaluation period is. Section 2.4 speaks of a validation period; is this used as a synonym here? If so, it would be better to use only one of the two terms throughout the manuscript.
- Figure 3: Spell out what the BMs actually stand for; that's not too much text for the figure.
Citation: https://doi.org/10.5194/egusphere-2025-6460-RC2
Data sets
Data for "Separating Signal from Noise in Large-Domain Hydrologic Model Evaluation: Benchmarking model performance under sampling uncertainty" Gaby Gründemann, Wouter Knoben, Yalan Song, Katie van Werkhoven, and Martyn Clark https://doi.org/10.5281/zenodo.18028487
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 142 | 95 | 14 | 251 | 53 | 11 | 22 |
Review of "Technical note: Separating signal from noise in large-domain hydrologic model evaluation - Benchmarking model performance" by Gründemann et al.
The technical note promotes the use of various benchmarks for model performance evaluation, particularly in a large-domain setting (or for large-sample studies), and includes a quantification of sampling uncertainty across different periods through bootstrapping of hydrological years.
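As a rough sketch of what such year-block bootstrapping can look like (the manuscript's exact resampling scheme may differ; the use of KGE as the score and the function names here are assumptions):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def bootstrap_kge(sim, obs, year_id, n_boot=1000, seed=0):
    """Resample whole hydrological years with replacement and recompute
    the score, yielding a sampling distribution rather than one value."""
    rng = np.random.default_rng(seed)
    years = np.unique(year_id)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        drawn = rng.choice(years, size=years.size, replace=True)
        idx = np.concatenate([np.flatnonzero(year_id == y) for y in drawn])
        scores[i] = kge(sim[idx], obs[idx])
    return np.percentile(scores, [5, 50, 95])  # e.g., a 90 % interval
```

Resampling whole years rather than individual days preserves the seasonal autocorrelation of streamflow within each resampled block.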
The manuscript is clearly and concisely written, and well structured.
Before I can recommend publication, however, I would like to raise the following comments:
Major comments:
- Since this note is all about the benchmarks, I think two ingredients are missing:
1) Please add the benchmarks and their description to the main text, not just to the supplementary material, and ensure that the abbreviations match those in the figures (or vice versa)
2) Each of the benchmarks is essentially a test of how well a model should minimally perform regarding a specific aspect. This is not discussed in detail in the manuscript, but I think providing some examples would really help promote the use of various benchmarks, from very simple ones targeting maybe the water balance to more complex ones. I would suggest extending the discussion and conclusions accordingly, as well as adding this explanation regarding which aspect they are benchmarking to the table describing them.
- There is the sampling uncertainty and there is the model uncertainty, but the metrics are also affected by the uncertainty inherent in the observations. It would be worth reminding the reader that these uncertainties can be considerably large and influence the performance metric. For instance, for discharge, there is the rating curve uncertainty, which is not constant but varies with the flows (see for instance Westerberg et al., 2011); a small sketch of such a flow-dependent perturbation follows below.
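The sketch referenced above illustrates the point: perturbing observed discharge with a multiplicative error whose spread grows with flow, then re-scoring, shows how much of the metric can be attributable to observation error alone. The error model here is a simple assumption for illustration, not the one derived by Westerberg et al. (2011).

```python
import numpy as np

def perturb_discharge(obs, rng, base_err=0.05, high_flow_err=0.20):
    """Multiplicative lognormal noise whose spread grows with flow,
    mimicking a rating curve that is least certain at high flows."""
    q = (obs - obs.min()) / (obs.max() - obs.min())      # 0..1 flow proxy
    sigma = base_err + (high_flow_err - base_err) * q    # flow-dependent error
    return obs * rng.lognormal(mean=0.0, sigma=sigma)

# Re-scoring the same simulation against many perturbed 'observations'
# gives a distribution of the metric due to observation error alone, e.g.:
# scores = [kge(sim, perturb_discharge(obs, rng)) for _ in range(1000)]
```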
Line by line comments:
Abstract
L4 name at least some examples of what is meant by a simple benchmark, i.e. make it more specific
L5-7 these results are valid for the study region and basins but not for other regions; please add that the data set is from the United States, and maybe even add NWM
L9 ", though accounting..." this part of the sentence is not clear. Please rephrase.
Main text
L21-25 the words "score", "statistics", "efficiency", and "metrics" are used interchangeably. I would suggest using only one, where applicable, and using it consistently throughout the manuscript
L22 "and more" remove (there is already "for example" in the same sentence)
L34 ... or further checks are required
L40 "can be " -> "is"
L120 since the benchmarks are the core of this note, Table S1 should be moved to the main text and the abbreviations adjusted accordingly.
L126 "as" -> "that"
L239 Supporting
L239 abbreviation was already introduced in L25
L257 "perform" missing?
L262 which benchmark? please add
L284 remove "and" before "snow"
Figure 2 In the upper panel, the lines are not distinguishable in b&w print
Figure 3 Please add the written-out benchmarks in the caption so that the figure can stand alone.
References
Westerberg, I., Guerrero, J. L., Seibert, J., Beven, K. J., & Halldin, S. (2011). Stage‐discharge uncertainty derived with a non‐stationary rating curve in the Choluteca River, Honduras. Hydrological Processes, 25(4), 603-613.