This work is distributed under the Creative Commons Attribution 4.0 License.
Metrics that Matter: Objective Functions and Their Impact on Signature Representation in Conceptual Hydrological Models
Abstract. Although objective functions (OFs) are widely discussed in the literature, many modelling studies still default to a few common metrics, without much consideration of their relative strengths and weaknesses. This paper systematically investigates the impact of OF choice on the representation of various streamflow characteristics across 47 conceptual models and 10 hydro-climatically diverse catchments selected from the CARAVAN dataset. We use eight different OFs for calibration, including the Kling–Gupta efficiency (KGE), Nash–Sutcliffe efficiency (NSE), and their respective logarithmic variants, as well as four more recently proposed metrics. We evaluate the representation of 15 hydrological signatures that capture a relevant selection of streamflow characteristics to determine generalizable strengths and weaknesses of individual OFs across different models and catchments. Results show that the choice of OF can significantly affect a model's capability to simulate different hydrological signatures such as runoff ratios, extreme flow percentiles, and certain baseflow characteristics. While certain signatures, particularly those related to flow variability, are relatively insensitive to OF choice, others exhibit large performance shifts across different OFs. Generally, no single OF simultaneously achieved high performance across all tested signatures, highlighting that a single-objective calibration is unlikely to lead to an all-purpose model. Our results reinforce calls to choose objective functions deliberately and in line with the objectives of a study. They also provide initial guidance on which metrics highlight particular facets of streamflow behaviour.
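As background for the metrics named above, the sketch below (Python) gives the textbook definitions of NSE and KGE (in its 2009 form) together with one common way of constructing their logarithmic variants; the log-transform offset and the exact formulations used in the paper are assumptions here, not taken from the manuscript.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of error variance to observed variance."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency (Gupta et al., 2009): correlation, variability ratio, bias ratio."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()    # variability ratio
    beta = sim.mean() / obs.mean()   # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def log_variant(metric, obs, sim, eps=None):
    """Apply a metric to log-transformed flows; a small offset keeps zero flows finite.
    The offset of 1/100 of the mean observed flow is one common, but not universal, choice."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    if eps is None:
        eps = 0.01 * obs.mean()
    return metric(np.log(obs + eps), np.log(sim + eps))
```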
Status: final response (author comments only)
CC1: 'Comment on egusphere-2025-5413', Keith Beven, 28 Nov 2025
CC5: 'Reply on CC1', Bettina Schaefli, 07 Jan 2026
I do not like to use strong words in scientific discussions, but, to use the same words, "it is somewhat depressing" to see such a negative comment on a paper that made very clear what it intends to study. Yes, it could have discussed the implications of uncertainty more deeply, but we, as peers, should judge the value of the presented work with respect to the research question it asks, not with respect to the research questions that we ourselves would have asked.
Citation: https://doi.org/10.5194/egusphere-2025-5413-CC5
AC1: 'Reply on CC1', Peter Wagener, 09 Jan 2026
We understand Prof Beven’s frustration regarding the usage of aggregated metrics and the often limited consideration of all potentially relevant uncertainty sources in hydrological modelling studies. Generally, we agree with these concerns, and we will discuss the various sources of uncertainty and how they may influence our work in an updated part of the discussion.
There are, however, reasons to structure the experiment as we did, and our approach is in line with how many other studies approach the quantification of uncertainty, namely by fixing some parts of the analysis while extending others. Assessing all relevant uncertainty sources remains difficult in practice without making several strong assumptions (for example, about the nature and magnitude of input and observational uncertainty) and without substantially increasing the computational and analysis effort (for example, by using n optimized parameter sets per catchment, per model, per objective function).
To address the specific mention of parameter uncertainty, we decided to use one optimized parameter set per model because (1) this is currently still representative of standard modelling practice (e.g., operational practice, as suggested by reviewer 1, as well as contemporary large-sample hydrology), and (2) it keeps the analysis of results manageable, given that the analysis already covers three dimensions (model structures, objective functions, and catchments).
We believe that we do, to some extent, account for the main critique mentioned: the need to "reject models instead of accepting an optimised model as satisfactory". We do this by imposing a minimum expectation for model performance. By requiring the models to exceed the performance of a specific benchmark, we expect them to at least be able to predict departures from the typical seasonal cycle. This approach represents a location-specific benchmark that avoids the potential difficulties involved in setting absolute expectations for objective function scores (cf. Schaefli & Gupta, 2007), and it results in 48% of our optimized model runs being rejected a priori.
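The exact benchmark is not spelled out in this reply; a minimal Python sketch of one benchmark consistent with the description (a calendar-day climatology, in the spirit of Schaefli & Gupta, 2007) could look as follows, with the rejection rule being that a model run must beat the mean seasonal cycle. The function name and the choice of climatology are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def benchmark_efficiency(obs, sim, dates):
    """Skill relative to a day-of-year climatology benchmark:
    1 - SSE(model) / SSE(benchmark). Values <= 0 mean the model does not beat
    the mean seasonal cycle and the run would be rejected a priori."""
    q = pd.Series(np.asarray(obs, float), index=pd.DatetimeIndex(dates))
    clim = q.groupby(q.index.dayofyear).transform("mean").to_numpy()  # benchmark series
    sse_model = np.sum((np.asarray(sim, float) - q.to_numpy()) ** 2)
    sse_bench = np.sum((clim - q.to_numpy()) ** 2)
    return 1.0 - sse_model / sse_bench

# A run is kept only if it beats the seasonal-cycle benchmark:
# keep = benchmark_efficiency(obs, sim, dates) > 0.0
```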
Furthermore, the main aim of the paper is to highlight the skill and shortcomings of the applied objective functions with respect to hydrologic signatures, which relate to specific aspects of hydrologic behaviour, albeit admittedly not at the level of detail mentioned in the commentary (e.g., "wetting up periods at the end of summer"). Nevertheless, we believe there is value in exploring the relationship between objective functions and signatures in a setting that mimics common applications (i.e., calibrate once) but with a much broader model sample than is typical, because this will give insight into whether a model is "fit for purpose".
In conclusion, we will (1) revisit the introduction of our paper to ensure this reasoning is clear to the reader, (2) ensure the outcomes of the benchmarking procedure (the number of models rejected) are clarified in "Section 3.1 Benchmarking the Models", and (3) revise the discussion to more clearly outline how model structure uncertainty falls within the wider uncertainties that affect hydrologic modeling. In particular, we will emphasize further that, for hydrologic signature representation, our results suggest that the order of relative importance is broadly catchment sampling > objective functions > models, and that this can inform where future studies direct most of their effort.
Citation: https://doi.org/10.5194/egusphere-2025-5413-AC1
CC2: 'Comment on egusphere-2025-5413', John Ding, 02 Dec 2025
The authors, three of four affiliated with the University of Calgary, Canada, need to explain a glaring omission from their references: Melsen et al. (2025) (https://doi.org/10.1080/02626667.2025.2475105). Is it simply that the study on which the discussion paper is based was substantially completed before the publication of the Melsen et al. paper?
I've been both amused and amazed that a long-lost but recently rediscovered Nash-Ding efficiency (NDE; Ding, 1974, Page 65) performs as well as a latecomer, the KGE, even numerically (Duc and Sawada, 2023, Figure 2; https://doi.org/10.5194/hess-27-1827-2023).
I'm grateful to both my alma mater, the University of Guelph, Canada, and my former employer, Ontario Ministry of Natural Resources in Toronto. Because of their foresight, I found myself, one of three young Government of Ontario engineers, listening to Master Nash delivering his NSE lecture at Guelph, one splendid fall afternoon of 1968.
Whose idea was it to invite Nash, of Wallingford, UK, and/or Galway, Ireland, to visit Guelph? Their identity remains buried in the archives. The search continues.
PS. This reminiscence has been prompted upon reading the 'little rant' in the last paragraph raised by Keith Beven in CC1 elsewhere.
Citation: https://doi.org/10.5194/egusphere-2025-5413-CC2
AC2: 'Reply on CC2', Peter Wagener, 09 Jan 2026
RC1: 'Comment on egusphere-2025-5413', Guillaume Thirel, 23 Dec 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-5413/egusphere-2025-5413-RC1-supplement.pdf
CC3: 'Reply on RC1', Keith Beven, 24 Dec 2025
We would surely all agree, Guillaume, that optimisation is convenient for applications of models, but in terms of the science, convenience does not always equate to good practice. We do, after all, KNOW that different objective functions produce different optimal parameter sets (as shown again in this paper). We do KNOW that optimisation methods will fit the errors in the data - and that the more parameters in a model, the more that will be the case (including machine learning models, which have lots and lots of parameters). We do KNOW that because of data errors, different periods of calibration data will produce different optimal parameter sets. That is one reason why performance is often not so good in "validation" periods. We do KNOW that there will be many models that are nearly as good as any optimum that is found, and that may depend only on the errors in the data (look at all the dotty plots that have been produced over the years). We do KNOW that if we try to use two or more objective functions then there will be a Pareto front or manifold that results from the trade-off in performance between them, and that there will be models that are just behind the front only because of the errors in the data. We do KNOW that different model structures might be needed for different catchments because of differences in the mechanisms of hydrological response.
So I would still suggest that asking what might constitute the best optimum is asking the wrong question. The correct question is when is a model (or models) fit for purpose for use in prediction (for whatever purpose we might be interested in). Defining fitness-for-purpose given what we understand about data uncertainties and model equifinalities is, of course, much more difficult than optimisation - but it is a useful way of getting stakeholders more involved in the modelling process and of conveying a greater understanding of data limitations. That is where we really need more research. :-)
Citation: https://doi.org/10.5194/egusphere-2025-5413-CC3
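For readers who have not worked with multi-objective calibration, the Pareto trade-off referred to in the comment above can be made concrete in a few lines of Python; the sketch below simply filters non-dominated parameter sets from a table of hypothetical scores (all objectives oriented so that larger is better) and is not taken from the paper under discussion.

```python
import numpy as np

def pareto_front(scores):
    """Return a boolean mask of non-dominated rows. `scores` has shape
    (n_parameter_sets, n_objectives), with every objective 'larger is better'."""
    scores = np.asarray(scores, float)
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        # row j dominates row i if j is >= i in every objective and > in at least one
        dominated_by = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        keep[i] = not dominated_by.any()
    return keep

# Hypothetical scores for five parameter sets on two objectives
# (e.g. KGE on flows and KGE on log flows):
scores = np.array([[0.80, 0.50], [0.70, 0.70], [0.60, 0.80], [0.50, 0.40], [0.65, 0.60]])
print(pareto_front(scores))  # [ True  True  True False False]
```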
RC2: 'Comment on egusphere-2025-5413', Anonymous Referee #2, 29 Dec 2025
The value and relevance of different objective functions to quantify the performance of hydrologic models is a never-ending story. The authors present a study in which they connect several versions of hydrologic efficiency metrics with hydrologic signatures to study their interactions across catchments and models. The study is technically fine, and I see no obvious problem in what the authors have done. What I would really like the authors to reflect on more extensively is what they have learned.
The authors state in their abstract that “Results show that the choice of OF can significantly affect a model’s capability to simulate different hydrological signatures ...”. And that “Generally, no single OF simultaneously achieved high performance across all tested signatures, highlighting that a single-objective calibration is unlikely to lead to an all-purpose model. Our results reinforce calls to choose objective functions deliberately and in line with the objectives of a study.”
OK, I believe you but – given the studies listed in Table 1 and many others before then – did we not know this already? No metric gives us a perfect fit, and hence we should assume that differences in metric formulation are reflected in differences in signature values. Let me cite an older paper (the first one in which I looked for a suitable quote): "The selection of the best efficiency measures should reflect the intended use of the model and should concern model quantities which are deemed relevant for the study at hand (Janssen and Heuberger 1995)." This quote is from Krause et al. (2005, Advances in Geosciences), referencing Janssen and Heuberger (1995, Ecological Modelling). I am not suggesting that the current study is not providing new insights, just that the rather general statements in the abstract are not those insights. Could you be more specific in your conclusions and hence in what this specific study adds to our knowledge base?
And finally, if you want to get hydrologic signatures right (which should reflect hydrologic processes more directly), then why optimize an efficiency metric and not directly a combination of those signatures you think best reflect your catchment and purpose? In this sense, what functions of the catchment are represented by the chosen signatures? Can you discuss the signatures you selected in terms of the hydrologic behavior they (should) reflect? What do your signature clusters tell you about the underlying processes that a model should reflect? Can we learn something of diagnostic value from using these integrated metrics?
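One way to read the referee's suggestion is as a signature-based objective function. A minimal sketch of such a composite objective (hypothetical, not the authors' method) could be a weighted mean of relative errors over a chosen set of signatures, to be minimized during calibration.

```python
import numpy as np

def signature_objective(obs, sim, signature_funcs, weights=None):
    """Hypothetical composite objective: weighted mean absolute relative error
    across a set of signature functions (each maps a flow series to a scalar).
    Lower is better; purely illustrative."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    errs = np.array([abs(f(sim) - f(obs)) / (abs(f(obs)) + 1e-12) for f in signature_funcs])
    w = np.ones_like(errs) if weights is None else np.asarray(weights, float)
    return float(np.sum(w * errs) / np.sum(w))

# Example signature set: mean flow and the 5th/95th flow percentiles
sig_funcs = [np.mean, lambda q: np.percentile(q, 5), lambda q: np.percentile(q, 95)]
# score = signature_objective(obs, sim, sig_funcs)
```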
The authors further conclude that: “Together, our results support the argument for a purpose-based model calibration, that considers multiple aspects of the flow regime, and multi-objective calibration setups, rather than defaulting to a familiar single metric (Mai, 2023; Jackson et al., 2019).” This is rather close to: “This paper suggests that the emergence of a new and more powerful model calibration paradigm must include recognition of the inherent multiobjective nature of the problem and must explicitly recognize the role of model error.” from Gupta et al. (1998, WRR). The multi-objective nature of the problem is clear, so why are we still looking for a single metric solution?
Minor comment:
The authors use the slope of the flow duration curve (FDC). I assume that this signature quantifies the slope of the central part of the FDC, though the authors never specify this. It would be good if they would.
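For reference, one widely used mid-segment definition of the FDC slope (as in, e.g., Sawicz et al., 2011) is sketched below; whether the paper uses exactly these percentile bounds is an assumption on my part.

```python
import numpy as np

def fdc_slope_midsegment(q):
    """Slope of the mid-segment of the flow duration curve, taken between the
    flows exceeded 33% and 66% of the time (one common convention; assumes
    strictly positive flows so that the logarithms are defined)."""
    q = np.asarray(q, float)
    q33 = np.percentile(q, 67)   # flow exceeded 33% of the time
    q66 = np.percentile(q, 34)   # flow exceeded 66% of the time
    return (np.log(q33) - np.log(q66)) / (0.66 - 0.33)
```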
Citation: https://doi.org/10.5194/egusphere-2025-5413-RC2
CC4: 'Comment on egusphere-2025-5413: A targeted analysis to answer a well specified question', Bettina Schaefli, 07 Jan 2026
Let me start my comment in an unusual way: Imagine yourself in the middle of the difficult process of choosing one or at most two metrics for model calibration within a wider model intercomparison project, involving several modelling groups using different models. The project wants to renew national-scale climate change impact simulations because the existing ones show deficiencies with respect to low flow simulations. You have your own preference for calibration metrics ("I only use NSE because at least I know very well what it does and it has a clear link to the formal parameter estimation literature"), but you have to discuss with colleagues who have very different preferences ("KGE-like metrics are the new standard", "we should use the new metric by xxx"). After several discussion rounds, you discover a paper that investigates exactly the question that you have been ruminating on for several weeks: Is there a modern hydrological metric that outperforms all others across a typical range of signatures for typical hydrological regimes? What luck that someone took the time to analyze this question!
Many thanks to the authors of this study for having provided a detailed analysis of a specific modelling question that many of us are facing. The design of the study emphasizes an aspect that is often omitted from available studies, in particular in times of large-sample hydrology: the variability arising from different model structures. We have become so excited about using hundreds of catchments for our studies that we tend to omit model ensembles.
As pointed out by other comments in this discussion, the study does not explicitly address data or parameter uncertainty: every data point included in the analysis (a signature value obtained for a given model calibrated with a given objective function for a given catchment) could in principle be augmented by analyzing e.g. the 20 best calibrations or by analyzing different input data sets or by analyzing different data portions. But this would be a different study.
The study also does not put much emphasis on connecting the results to dominant hydrological processes or controls. In particular, it would be nice to see some discussion of which signatures are (by nature!) dominated by the input data. I could think of rising limb density, which can arguably only be well simulated if the rainfall input has good properties. An attempt is made in Table 5, where the signatures are put into categories, but the category "streamflow" is perhaps a bit too generic. How do these signatures depend on input data quality and climate, on hillslope-scale partitioning functions, or on storage-release functions? The ones that strongly depend on "functions" are probably more strongly influenced by objective function choice. Any thoughts on this could perhaps complement the discussion.
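As an illustration of the signature mentioned above, rising limb density can be computed directly from the hydrograph; the sketch below uses one common definition (number of rising limbs divided by the total time the hydrograph is rising), though conventions differ between studies and the paper's exact formulation is not restated here.

```python
import numpy as np

def rising_limb_density(q):
    """Number of rising limbs divided by the total number of time steps spent
    rising (one common definition; conventions differ between studies)."""
    q = np.asarray(q, float)
    rising = np.diff(q) > 0                              # True where flow increases
    # a rising limb starts where a rising step follows a non-rising step
    n_limbs = int(rising[0]) + np.sum(rising[1:] & ~rising[:-1])
    t_rising = rising.sum()
    return n_limbs / t_rising if t_rising > 0 else np.nan
```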
Two detailed comments:
- The term "hydrological regime" is used in places (e.g., in line 83) where I believe it should read "hydrological response" or "hydrological behaviour", since the signatures do not only capture the regime.
- Sentence at line 463 ("The claim that KGE resolves (..)") seems incomplete.
Citation: https://doi.org/10.5194/egusphere-2025-5413-CC4
AC3: 'Reply on CC4', Peter Wagener, 09 Jan 2026
Model code and software
Code for Analysis and Visualization Peter Wagener https://github.com/peterwagener/OF_Signature_Code.git
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 567 | 195 | 38 | 800 | 55 | 119 | 104 |
CC1: 'Comment on egusphere-2025-5413', Keith Beven, 28 Nov 2025
It is somewhat depressing that, after all the decades of past work on aleatory and epistemic uncertainties in data, models, and the identification of parameters, there are still studies that essentially ignore the impacts (except for a conclusion that parameter uncertainty is not an issue because different random seeds give similar optimal parameter sets for some OFs).
But why have you not considered that some of the data you are using might be disinformative for model evaluation; that there may be parameter sets close to your optima that will give similar "performance" however that is measured; that different periods of data (with different errors) will give different optimal parameter sets, etc etc (see for example Beven, K. J., 2024, A short history of philosophies of hydrological model evaluation and hypothesis testing, WIRES Water, e1761, 69 (5): 519-527, https://doi.org/10.1002/wat2.1761 and the references therein).
We have known these things for a very long time - but the real issue to be addressed is whether a model (even when optimised as in this study) can really be considered as fit for purpose when there are often glaring visual issues in performance (during wetting up periods at the end of summer, for example) that are glossed over by the types of global OFs used here. That was one of the reasons why I rejected the concept of optimal parameter sets more than 30 years ago, in favour of seeking models that might be consistent with the observations and what we know about their uncertainties. Trying to assess those uncertainties is, of course, a much more difficult problem than simply applying an optimisation algorithm (particularly for the epistemic uncertainties), but just thinking about what might be involved in doing so is a really valuable exercise.
Apologies in advance for this little rant but if we do not approach the modelling process with a bit deeper thought, how are we going to progress the science? That surely requires ways of rejecting models and then trying to do better, not of accepting that an optimised model is de facto considered satisfactory.
Keith Beven