Improving low and high flow simulations at once: An enhanced metric for hydrological model calibration
Abstract. The choice of an objective function for hydrological model calibration is a critical step that directly influences model performance and suitability for the intended use cases. While calibration functions should ideally be tailored to specific modeling objectives, such as flood forecasting or drought monitoring, general-purpose metrics are typically used in practice. The two most widely adopted objective functions are the Nash–Sutcliffe Efficiency (NSE) and the Kling–Gupta Efficiency (KGE). While the NSE is a simple normalization of the mean square error, the KGE overcomes some of the NSE limitations and is often preferred due to its decomposable structure, capturing bias, relative variability, and correlation. However, KGE still suffers from limitations, including sensitivity to outliers and assumptions of linearity and normality in the error distribution, which particularly limit performance under low-flow conditions. Although several alternatives to NSE and KGE have been proposed, none has clearly outperformed these standard metrics across the full flow duration curve (FDC), especially for improving low flows without degrading performance elsewhere. To address these limitations, we propose a new metric, the Joint Divergence Kling-Gupta Efficiency (JDKGE), that enhances the KGE by incorporating an additional component based on the Jensen–Shannon Divergence (JSD). We evaluate the JDKGE metric using two hydrological process-based models (GR6J and OS-LISFLOOD), applied to two large and diverse samples of catchments spanning a broad range of hydroclimatic conditions. Calibrated using a suite of objective functions, both models are then evaluated with multiple performance metrics, including KGE, JSD, quantile ratios, and FDC-based signatures. Results show that calibrations using JDKGE significantly improve low-flow simulations compared to KGE, NSE and other competitors, while maintaining comparable or improved performance in other regimes, including high flows. Multi-objective calibration experiments further reveal that substantial gains in distributional similarity (i.e., reductions in JSD) can be achieved with only marginal changes in overall performance (KGE). Moreover, the JDKGE objective function leads to a balanced compromise between KGE and JSD and a reduction in model equifinality. This study highlights the importance of carefully selecting the objective function for hydrological model calibration and proposes JDKGE as an effective solution for improving low-flow performance while retaining general-purpose applicability for floods and water management.
Summary
This paper introduces a new objective function that appends to the existing KGE' a component based on information theory. This new JDKGE metric is then tested in a variety of ways, covering three modeling experiments. For the first, the authors calibrate the GR6J model for 240 catchments in France (spatially lumped, daily time step) with the new JDKGE, as well as with KGE, KGE', NSE, NSE_log and KGE_NP. They then compare the resulting distributions of performance scores on a number of separate metrics (KGE', the three KGE components r / alpha / beta, the Jensen-Shannon Divergence, and three flow percentile biases). These comparisons are both visual and based on statistical tests on the resulting distributions. The second experiment involves calibrating GR6J and OS-LISFLOOD for 45 basins worldwide, using KGE' and JDKGE as objective functions. Distributions are compared as CDFs of a smaller number of metrics (KGE', JSD, and lower and upper flow percentiles). The third experiment involves multi-objective calibration of GR6J for the 240 French basins, using KGE' and JSD as complementary objectives. The resulting Pareto fronts are compared to the JDKGE single optimum, with results (flow duration curves, averaged seasonal flows) shown for three basins. The paper concludes that in most cases the JDKGE criterion improves the simulation of certain flow percentiles, without much associated change in KGE' (i.e., improvements come at limited or no cost elsewhere). There is a caveat that this is not the case everywhere, which the authors attribute to poor suitability of the tested model structures for these specific cases.
Opinion
I am somewhat in two minds about this paper. On the one hand, the authors set out to improve low-flow simulations, settled on a method, and show that it in principle has the intended effect. In that light, this paper can be a useful reference to spur further work. On the other hand, parts of the paper seem rather ad hoc to me. There are a number of things that stand out to me as potentially benefitting from further attention:
[1] The justification for the new metric seems very minimal to me. There are two references to support the idea that using or adding metrics derived from information theory might be helpful. There is some explanation of the benefits of using something derived from the JSD, but no discussion of alternatives that could have been considered but were ultimately rejected. There is also no explanation of why adding the new component to the KGE in the way that was done (as an equally-weighted fourth term under the square root) is the best option. This makes it difficult to understand whether the current implementation was a deliberate choice (meaning that alternatives were assessed but judged unhelpful) or more a lucky first attempt (meaning that follow-up work is possible that investigates alternative information-theory-based components, different constructions of the objective function, etc.).
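For concreteness, my reading of the construction (please correct me if this is not the intended form) is something along the lines of:

JDKGE = 1 - sqrt( (r - 1)^2 + (beta - 1)^2 + (gamma - 1)^2 + JSD(log Q_obs, log Q_sim)^2 )

i.e., the JSD term enters as a fourth squared component with the same weight as the three KGE' components (correlation r, bias ratio beta, variability ratio gamma), and with an ideal value of 0 rather than 1. If that reading is correct, the open questions above become quite specific: why equal weights, why this particular aggregation under the square root, and why the JSD rather than another divergence measure.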
[2] The comparison does not seem as clear to me as it could be. Lines 204-205 explain that the JSD component in the JDKGE uses log-transformed flows. This means the JDKGE is a function of both regular and transformed flows (i.e., f(Q, log(Q))). The objective functions used for comparison, however, only have access to a single set of flows: KGE, KGE', NSE and KGE_NP are all f(Q), whereas NSE_log is f(log(Q)). It is thus unclear from the presented comparisons whether the improvements provided by JDKGE result from the inclusion of the JSD component, from the fact that JDKGE is a merged metric of regular and transformed flows, or from a combination of both. A comparison with a JDKGE* variant, in which the JSD component only has access to regular flows, seems a very useful addition, because it would indicate whether the improvements come from the information-theoretic component itself or from the extra weight given to low flows by the log(Q) transformation.
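To make this concrete, below is a minimal sketch (my own illustration, not the authors' implementation; the binning, the zero-flow offset, and the file names are placeholders) of the two JSD variants I would like to see compared:

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(x, y, bins=50):
    # Histogram both samples on common bin edges, then compute the Jensen-Shannon
    # divergence; scipy returns the JS distance (square root of the divergence),
    # hence the squaring.
    edges = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), bins + 1)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    return jensenshannon(p, q, base=2) ** 2

q_obs = np.loadtxt("q_obs.txt")   # placeholder input files
q_sim = np.loadtxt("q_sim.txt")
eps = 0.01 * np.nanmean(q_obs)    # small offset so zero flows can be log-transformed

jsd_raw = jsd(q_obs, q_sim)                              # JDKGE* variant: JSD on raw flows
jsd_log = jsd(np.log(q_obs + eps), np.log(q_sim + eps))  # my reading of the manuscript: JSD on log flows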
[3] While reading the manuscript, it was unclear to me whether the results shown are for calibration data (i.e., the outcome of data fitting) or for evaluation data (i.e., an estimate of how the model performs on unseen data). This clarification is needed, and if, as I expect based on lines 442-444, all results shown are for calibration only, I would strongly recommend adding an assessment of JDKGE on unseen data. Particularly in light of the rather short calibration windows, evidence is needed that JDKGE helps the optimizer find parameter sets that describe the general behaviour of the catchment, rather than parameter sets that merely fit the specific conditions of the calibration data.
[4] The calibration period seems rather short to me: eight years for the French basins and four or more years for the global basins. Particularly in intermittent regimes, this means the results are really a snapshot of model performance rather than a more generally applicable assessment. An analysis of the robustness of the conclusions with respect to calibration data length would be very welcome.
[5] In addition, there are numerous things that I think would strengthen the paper substantially:
[6] All analysis is performed on distributions of scores only, without any mention of the presence or absence of regional patterns. It would be instructive to see whether the changes obtained from using JDKGE are spread uniformly in space or whether there are specific regions where the metric enhances simulations more than others. Either outcome would be informative.
[7] Related to the previous point, scores such as NSE and KGE are known to be (sometimes highly) conditional on the data used to calculate them. This sampling uncertainty (Lamontagne et al., 2020; Clark et al., 2021; Vrugt & De Oliveira, 2022) can be estimated and is often rather large. An assessment of sampling uncertainty would show how the changes in evaluation metric values obtained by switching to JDKGE compare to the uncertainty in the scores themselves. If the benefits of using JDKGE are consistently larger than the uncertainty in the scores, this would be a helpful line of evidence. Conversely, if the benefits of using JDKGE are small compared to the uncertainty in the scores themselves, this would be worth knowing too.
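As a purely illustrative sketch of what I have in mind (my own code, not a prescribed procedure), the sampling distribution of KGE' could be approximated with a simple bootstrap over time steps; resampling whole years or blocks, as discussed in the sampling-uncertainty literature cited above, would be preferable for autocorrelated daily flows:

import numpy as np

def kge_prime(obs, sim):
    # Modified Kling-Gupta efficiency (Kling et al., 2012)
    r = np.corrcoef(obs, sim)[0, 1]
    beta = sim.mean() / obs.mean()                               # bias ratio
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # variability ratio
    return 1 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)

def kge_sampling_interval(obs, sim, n_boot=1000, seed=0):
    # Naive bootstrap over individual time steps; block or yearly resampling
    # would better respect the autocorrelation of daily streamflow.
    rng = np.random.default_rng(seed)
    n = len(obs)
    scores = [kge_prime(obs[idx], sim[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.percentile(scores, [2.5, 97.5])   # 95 % sampling interval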
[8] As far as I can tell, experiments 1 and 2 calibrate a single parameter set per model per metric per basin. One of the main conclusions of the work is that considerable improvements in low flows can be obtained with minimal reduction in KGE' scores. This strongly suggests that the main benefit of the JDKGE is that it nudges the optimization away from the very highest KGE' optimum, while in absolute terms these differences need not be large. I think this could be tied in more closely with the existing literature on parameter uncertainty and equifinality.
[9] Relatedly, some evidence of convergence of the calibration algorithm would be good to add. The text (lines 625-626) already suggests that there are certain cases in the multi-objective experiment where calibration was not fully successful, and some indication of how often this happened in the other two experiments would be helpful.
I hope these points highlight why I struggle to decide what to recommend for this paper. On the one hand, it outlines a potential trajectory to improve model calibration. On the other hand, I have a number of open questions about the approach presented here, some of which seem fairly fundamental to me. I hope the above and the comments in the PDF are helpful in some way.
Please note that Section 2.1.3. was beyond my ability to assess.