Metrics that Matter: Objective Functions and Their Impact on Signature Representation in Conceptual Hydrological Models
Abstract. Although objective functions (OFs) are widely discussed in the literature, many modelling studies still default to a few common metrics, without much consideration of their relative strengths and weaknesses. This paper systematically investigates the impact of OF choice on the representation of various streamflow characteristics across 47 conceptual models and 10 hydro-climatically diverse catchments selected from the CARAVAN dataset. We use eight different OFs for calibration, including the Kling–Gupta efficiency (KGE), Nash–Sutcliffe efficiency (NSE), and their respective logarithmic variants, as well as four more recently proposed metrics. We evaluate the representation of 15 hydrological signatures that capture a relevant selection of streamflow characteristics to determine generalizable strengths and weaknesses of individual OFs across different models and catchments. Results show that the choice of OF can significantly affect a model's capability to simulate different hydrological signatures such as runoff ratios, extreme flow percentiles, and certain baseflow characteristics. While certain signatures, particularly those related to flow variability, are relatively insensitive to OF choice, others exhibit large performance shifts across different OFs. Generally, no single OF simultaneously achieved high performance across all tested signatures, highlighting that a single-objective calibration is unlikely to lead to an all-purpose model. Our results reinforce calls to choose objective functions deliberately and in line with the objectives of a study. They also provide initial guidance on which metrics highlight particular facets of streamflow behaviour.
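The abstract centres on the Kling–Gupta efficiency (KGE) and Nash–Sutcliffe efficiency (NSE). For readers unfamiliar with them, a minimal sketch of the standard textbook formulas follows; the function names and test data are illustrative only and are not taken from the paper under review.

```python
# Two common calibration objective functions named in the abstract:
# Nash-Sutcliffe efficiency (NSE) and Kling-Gupta efficiency (KGE).
# Standard formulas; variable names are illustrative, not from the paper.
from statistics import mean, pstdev


def nse(obs, sim):
    """NSE = 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2); 1 is perfect."""
    obs_mean = mean(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs, sim):
    """KGE (Gupta et al., 2009): 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    mo, ms = mean(obs), mean(sim)
    so, ss = pstdev(obs), pstdev(sim)
    # Pearson correlation between observed and simulated flows
    r = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / (len(obs) * so * ss)
    alpha = ss / so  # variability ratio
    beta = ms / mo   # bias ratio
    return 1.0 - ((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2) ** 0.5
```

The logarithmic variants mentioned in the abstract are typically obtained by applying the same formulas to log-transformed flows, which shifts the weight of the metric from high flows towards low flows.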
It is somewhat depressing that, after all the decades of past work on aleatory and epistemic uncertainties in data, in models, and in the identification of parameters, there are still studies that essentially ignore their impacts (except for concluding that parameter uncertainty is not an issue because different random seeds give similar optimal parameter sets for some OFs).
But why have you not considered that some of the data you are using might be disinformative for model evaluation; that there may be parameter sets close to your optima that give similar "performance", however that is measured; that different periods of data (with different errors) will give different optimal parameter sets; and so on? (See, for example, Beven, K. J., 2024, A short history of philosophies of hydrological model evaluation and hypothesis testing, WIREs Water, e1761, https://doi.org/10.1002/wat2.1761, and the references therein.)
We have known these things for a very long time, but the real issue to be addressed is whether a model (even when optimised, as in this study) can really be considered fit for purpose when there are often glaring visual issues in performance, for example during wetting-up periods at the end of summer, that are glossed over by the types of global OFs used here. That was one of the reasons why I rejected the concept of optimal parameter sets more than 30 years ago, in favour of seeking models that might be consistent with the observations and what we know about their uncertainties. Trying to assess those uncertainties is, of course, a much more difficult problem than simply applying an optimisation algorithm (particularly for the epistemic uncertainties), but just thinking about what might be involved in doing so is a really valuable exercise.
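The alternative sketched here, seeking models consistent with the observations rather than a single optimum, is the idea behind limits-of-acceptability approaches such as GLUE. A minimal illustration of that spirit follows; the function names, error bounds, and acceptance threshold are all illustrative assumptions, not a specific published procedure.

```python
# Minimal sketch of a limits-of-acceptability screen, in the spirit of
# GLUE: instead of ranking parameter sets by one global OF, retain every
# set whose simulation stays within observation uncertainty bounds at
# (nearly) all time steps. Thresholds here are illustrative assumptions.


def behavioural(obs, sim, rel_err=0.2, min_frac=0.95):
    """Accept a simulation if at least min_frac of time steps fall inside
    +/- rel_err relative uncertainty bounds around each observation.
    Assumes strictly positive flow values."""
    inside = sum(
        1
        for o, s in zip(obs, sim)
        if o * (1 - rel_err) <= s <= o * (1 + rel_err)
    )
    return inside / len(obs) >= min_frac


def screen(obs, simulations):
    """Return the indices of parameter sets judged behavioural: an
    ensemble of acceptable models, rather than a single 'optimum'."""
    return [i for i, sim in enumerate(simulations) if behavioural(obs, sim)]
```

The point of the design is that the output is a set, possibly empty, in which case every tested model is rejected and the modeller must do better, which is exactly the hypothesis-testing stance argued for below.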
Apologies in advance for this little rant, but if we do not approach the modelling process with a bit more depth of thought, how are we going to progress the science? That surely requires ways of rejecting models and then trying to do better, not accepting that an optimised model is de facto satisfactory.
Keith Beven