Evaluating Flexible Configurations of the Shyft Hydrologic Model Framework Across Mainland Norway
Abstract. The development and application of numerous hydrological models have played an indispensable role in advancing our understanding of hydrological processes, improving forecasting capabilities, supporting the design and operation of water conservancy projects, and facilitating water resource assessments. However, owing to the spatial heterogeneity and temporal variability of climate and basin characteristics, the inherent complexity of hydrological processes, and data limitations, hydrological modeling faces two major bottlenecks: first, no single model is universally applicable to all river basins; second, further improving the simulation accuracy of existing fixed-structure models remains challenging. Hydrological modeling frameworks with flexible structures and configurable components have therefore emerged as the next generation of model development. Shyft is one such flexible modeling framework. It is cross-platform and open source, jointly developed by academic and industrial partners, and supports uncertainty analysis, streamflow simulation, and forecasting. Most evaluation efforts of the framework to date have focused on smaller basins, so there is a need to benchmark model performance more comprehensively. Here, we present a public benchmark of discharge simulation for 109 catchments across mainland Norway. Five model configurations are evaluated, combining two evapotranspiration routines (Priestley-Taylor and Penman-Monteith), two runoff methods (Kirchner and HBV), and two snow modules (temperature-index and semi-physical). The models are calibrated with ten variants of target goal functions: a KGE-based family (KGE, LKGE, bcKGE, KGE_LKGE, KGE_bcKGE) and an NSE-based family (NSE, LNSE, bcNSE, NSE_LNSE, NSE_bcNSE). The simulations are divided into two major groups: with and without precipitation correction. The evaluation is performed from 1981 to 2020 (approx. 40 years) at a daily time step. Using KGE, NSE, and percent bias (PBIAS) as the main evaluation metrics, the model configurations are compared against each other and against climatological benchmarks. The results show that all selected models beat both the mean- and median-flow benchmarks for the majority of catchments under all target goal function set-ups. 89 % of catchments gain higher performance scores with precipitation correction, but the picture is mixed across metrics and models. The KGE and NSE metrics reveal that models combining the temperature-index snow-tiles model with Kirchner runoff (-STK) perform best, but require precipitation correction to improve PBIAS. Models with the semi-physical gamma-snow routine (-GSK) show relatively low KGE and NSE scores, especially in the Mountain and Inland hydrological regimes, but have the lowest |PBIAS| when no precipitation correction is applied. Precipitation correction has limited effect on the -GSK models and even deteriorates some of their scores. The model combining temperature-index snow-tiles with HBV runoff instead of Kirchner (-STHBV) is the most sensitive to precipitation correction: it has the worst PBIAS of all models without precipitation correction but jumps to third place in all three metrics when the correction is applied. The study highlights that KGE-based goal functions reduce PBIAS more than any of the NSE-based goal functions.
The study confirms that logarithmic transformation of streamflows, whether LKGE or LNSE is used as the target goal function, generates parameter sets with a majority of outliers (KGE scores lower than -0.41). This new benchmark has the potential to help diagnose problems, improve algorithms, and support further development of the hydrological part of Shyft. The modeling results are made publicly available for further investigation.
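For concreteness, the goal-function families listed in the abstract can be sketched as follows. This is a minimal illustrative sketch using the standard definitions of KGE, NSE, and PBIAS; the exact transforms, the Box-Cox parameter, and the weighting of the combined variants (e.g. KGE_LKGE) used in the paper are assumptions on my part. The -0.41 outlier threshold quoted above corresponds to the KGE of the mean-flow benchmark, 1 - sqrt(2) ≈ -0.41.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = np.corrcoef(sim, obs)[0, 1]        # linear correlation
    alpha = np.std(sim) / np.std(obs)      # variability ratio
    beta = np.mean(sim) / np.mean(obs)     # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def pbias(sim, obs):
    """Percent bias."""
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

# Transformed variants (offset and lambda are illustrative assumptions):
def log_q(q, eps=0.01):
    return np.log(q + eps)                 # LKGE / LNSE use log-transformed flows

def boxcox_q(q, lam=0.3):
    return (q ** lam - 1.0) / lam          # bcKGE / bcNSE use Box-Cox flows

# Combined variants such as KGE_LKGE presumably average the untransformed
# and transformed scores, e.g.:
def kge_lkge(sim, obs):
    return 0.5 * (kge(sim, obs) + kge(log_q(sim), log_q(obs)))
```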
Summary
This paper investigates the functioning of the Shyft modelling framework in 109 basins across Norway. Shyft allows the user to construct hydrologic models from components. In this work, five different configurations are tested. The five models are calibrated for each of the 109 basins with 10 different objective functions (called goal functions in this paper). Model performance is then assessed through (1) CDFs (in which models are compared to two seasonal-cycle benchmarks), (2) box plots of aggregated performance on KGE, NSE and PBIAS scores, and (3) maps showing which combination of model and goal function gives the highest KGE score in each of the 109 basins. This analysis is done twice, once with a precipitation correction factor and once without it. The paper concludes with recommendations about (1) the types of Shyft options that work well in these Norwegian basins, (2) the preferential use of KGE over NSE for less biased simulations, and (3) the use of a precipitation correction factor.
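As a side note for readers unfamiliar with such benchmarks: a seasonal-cycle benchmark predicts, for each calendar day, the long-term mean (or median) of all observed flows on that day. A minimal sketch follows; the paper's exact construction (e.g. any smoothing or leave-one-year-out scheme) is an assumption here.

```python
import pandas as pd

def seasonal_benchmark(obs: pd.Series, stat: str = "mean") -> pd.Series:
    """Calendar-day climatology benchmark for a daily discharge series
    indexed by a DatetimeIndex."""
    doy = obs.index.dayofyear
    clim = obs.groupby(doy).mean() if stat == "mean" else obs.groupby(doy).median()
    return pd.Series(clim.loc[doy].values, index=obs.index)

# A model "beats the benchmark" in a catchment if its score against the
# observations exceeds the benchmark's score, e.g. (using kge from above):
# kge(sim.values, obs.values) > kge(seasonal_benchmark(obs).values, obs.values)
```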
Comments
I have made some line-by-line comments in the PDF. Below are some further thoughts that I could not easily fit into a specific comment in the PDF.
[1] The idea of running mosaics of different models in space (i.e., finding an appropriate model for each individual basin, rather than a single model that works well enough for all of them) has been around for a while, but actual implementations that test the validity of this approach are a more recent development. In that sense, this paper is a timely contribution. However, the introduction might be improved by being a bit more specific about the current knowledge gaps and state of the art for these approaches. It currently focuses heavily on Shyft but does not expand much on the more general multi-model approach that Shyft allows a user to set up.
[2] Related, the stated goal of this work (l. 73-75) is “Using Shyft as an example, the objective of this research is to evaluate the performance of flexible model configurations from a benchmarking perspective, considering different objective functions, accuracy of precipitation input, and streamflow regimes”. This suggests a general approach applied to a specific case, but after reading the work (and particularly the conclusions) my main takeaway is that we now have some idea of which of the five tested Shyft configurations work well in Norway. It is less clear to me which general lessons can be learned from this work that are valuable to readers who do not actively use Shyft to model Norwegian basins. The benchmarking seems fairly minimal to me (just a comparison of CDFs), the conclusions about objective functions are possibly not that surprising (see [3c] below), the precipitation-accuracy aspect seems quite model-dependent (some models provide better simulations with P corrections, others do not and sometimes get worse), and the flow regimes are specific to Norwegian basins. It would be good to emphasize the more general aspects of the work to justify publication in a broad international journal like HESS, rather than in a more regional or model-focused journal.
[3] I believe part of the issue is that the justification of various choices can be improved. Doing so might help highlight what broader lessons can be learned from this work. See below.
[3a] Based on Figure 2, it seems that Shyft supports more model configurations than just the five tested in this work. Some justification of why specifically these five are tested and not others would be welcome. This should, I think, extend to describing why certain parameters are chosen for calibration and others are not: lines 398-399, for example, suggest that one sensitive parameter was not calibrated at all. Being clearer about the reasoning behind these choices would be good.
[3b] This would also be a good opportunity to outline what can be learned from pairwise comparisons of the models. For example, if I understand Table A1 correctly, the single difference between RPMSTK and RPMGSK is the choice of snow module (ST vs GS). This means that differences between the two models can be more clearly attributed to a specific modelling decision. This is the core idea behind multi-hypothesis modelling frameworks such as SUMMA (which Shyft is apparently inspired by), and leaning more into this part of the literature (it is currently not particularly prevalent in the framing of the paper) could help emphasize the more broadly applicable conclusions of this work.
[3c] The analysis broadly consists of looking at the performance of the models on either KGE alone (CDFs, maps) or on a combination of KGE, NSE and PBIAS. I understand that there must be some consistent metric(s) on which to evaluate the 10 different objective/goal functions, but it is not particularly clear to me why specifically these three (or this one) were chosen. For example, it is widely accepted that hydrologic models tend to perform best on the metric they were calibrated on, so it is possibly not very surprising that any goal function that uses KGE shows better performance on KGE than any of the NSE-family goal functions do. Explaining more clearly why the chosen analysis approach is helpful would be good (e.g., by connecting the methodology to existing research gaps, operational practice, or something else).
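To make this point concrete: assuming the standard definitions, the link between KGE-based calibration and lower PBIAS follows directly from the KGE decomposition, since the bias ratio is an explicit term:

$$\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2}, \qquad \beta = \frac{\mu_{\mathrm{sim}}}{\mu_{\mathrm{obs}}}, \qquad \mathrm{PBIAS} = 100\,(\beta - 1),$$

so maximizing KGE directly penalizes |PBIAS|, whereas NSE penalizes bias only implicitly through the squared errors. Stating this explicitly in the paper would help readers see which parts of the goal-function comparison are new findings and which follow by construction.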
[4] Readability and clarity can be improved in my opinion. Most of my line-by-line comments focus on this. See also the points below.
[4a] More generally, I struggled quite a bit to keep track of the five model acronyms. The letters don't mean much to me because I lack the familiarity with the Shyft software needed to map the acronyms onto specific models. The acronyms are also rather similar, which makes the text quite hard to follow. I expect that even just naming the models “model 1 - 5” (or “stack 1 - 5” to stick with the Shyft terminology) would be easier for readers to follow.
[4b] It's also not always clear whether results in figures refer to the calibration or evaluation period. This should be clarified in all cases.