This work is distributed under the Creative Commons Attribution 4.0 License.
Evaluating Flexible Configurations of the Shyft Hydrologic Model Framework Across Mainland Norway
Abstract. The development and application of numerous hydrological models have played an indispensable role in advancing our understanding of hydrological processes, improving forecasting capabilities, supporting the design and operation of water conservancy projects, and facilitating water resource assessments. However, due to the spatial heterogeneity and temporal variability of climate and basin characteristics, the inherent complexity of hydrological processes, and data limitations, hydrological modeling faces two major bottlenecks: first, no single model is universally applicable to all river basins; second, further improvement in the simulation accuracy of existing fixed-structure models remains challenging. As a result, hydrological modeling frameworks with flexible structures and configurable components represent the next generation of model development. Shyft is one such flexible modeling framework. It is cross-platform and open source, jointly developed by academic and industrial partners, and supports uncertainty analysis, streamflow simulation, and forecasting. Most evaluation efforts of the framework to date have focused on smaller basins, so there is a need to benchmark model performance more comprehensively. Here, we present a public benchmark for discharge simulation in 109 catchments across mainland Norway. Five model configurations are evaluated, combining two evapotranspiration routines (Priestley-Taylor and Penman-Monteith), two runoff methods (Kirchner and HBV), and two snow modules (temperature-index and semi-physical). The models are calibrated with ten variants of target goal functions: a KGE-based family (KGE, LKGE, bcKGE, KGE_LKGE, KGE_bcKGE) and an NSE-based family (NSE, LNSE, bcNSE, NSE_LNSE, NSE_bcNSE). The simulations are divided into two major groups: without and with precipitation correction.
The evaluation is performed from 1981 to 2020 (approx. 40 years) at a daily time step. Using KGE, NSE, and percent bias (PBIAS) as the main evaluation metrics, the model configurations are compared against each other and against climatological benchmarks. The results show that all selected models beat both the mean- and median-flow benchmarks for the majority of catchments under all target goal function set-ups. 89 % of catchments gain higher performance scores with precipitation correction, but the picture is mixed across metrics and models. The KGE and NSE metrics reveal that models combining the temperature-index snow-tiles model with Kirchner runoff (-STK) perform best, but require precipitation correction to improve PBIAS. Models with the semi-physical gamma-snow routine (-GSK) show relatively low KGE and NSE scores, especially in the Mountain and Inland hydrological regimes, but have the lowest |PBIAS| if no precipitation correction is applied. Precipitation correction has a limited effect on the -GSK models, even deteriorating some of the scores. The model combining temperature-index snow-tiles with HBV runoff instead of Kirchner (-STHBV) is the most sensitive to precipitation correction: it has the worst PBIAS score of all models without precipitation correction, but jumps to third place in all three metrics when the correction is applied. The study highlights that KGE-based goal functions reduce PBIAS more than any of the NSE-based goal functions. It also confirms that logarithmic transformation of streamflows, whether LKGE or LNSE is used as the target goal function, generates parameter sets with a majority of outliers (KGE scores lower than -0.41). This new benchmark has the potential to help with diagnosing problems, improving algorithms, and further development of the hydrological part of Shyft. Modeling results are made publicly available for further investigation.
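For readers decoding the metric notation above (KGE, NSE, PBIAS, and log-transformed variants such as LKGE), the standard formulas can be made concrete with a short sketch. This is an illustrative implementation, not code from Shyft, and the epsilon offset used in the log transform is an assumed convention rather than the authors' choice:

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: distance of (r, alpha, beta) from the ideal
    point (1, 1, 1). A constant mean-flow benchmark scores 1 - sqrt(2),
    i.e. roughly -0.41, which is where the outlier threshold comes from."""
    r = np.corrcoef(sim, obs)[0, 1]       # linear correlation
    alpha = np.std(sim) / np.std(obs)     # variability ratio
    beta = np.mean(sim) / np.mean(obs)    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: skill relative to the observed mean flow."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(sim, obs):
    """Percent bias: positive values mean the model overestimates total flow."""
    return 100.0 * np.sum(np.asarray(sim) - np.asarray(obs)) / np.sum(obs)

def lkge(sim, obs, eps=None):
    """KGE on log-transformed flows (LKGE). The small offset eps guards
    against zero flows; 1 % of the mean observed flow is a common choice."""
    if eps is None:
        eps = 0.01 * np.mean(obs)
    return kge(np.log(np.asarray(sim) + eps), np.log(np.asarray(obs) + eps))
```

For a simulation equal to 1.1 times the observations, for example, r = 1 while alpha and beta are both 1.1, giving KGE = 1 - sqrt(0.02) (about 0.86) and PBIAS = 10 %.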
Status: closed
- RC1: 'Comment on egusphere-2025-4071', Anonymous Referee #1, 12 Nov 2025
Summary
This paper investigates the functioning of the Shyft modelling framework in 109 basins across Norway. Shyft allows the user to construct hydrologic models from components. In this work, five different configurations are tested. The 5 models are calibrated for each of the 109 basins with 10 different objective functions (called goal functions in this paper). Model performance is then assessed through (1) CDFs (in which models are compared to two seasonal cycle benchmarks), (2) box plots of aggregated performance on KGE, NSE and PBIAS scores, and (3) maps showing which combinations of model and goal function give the highest KGE score in each of the 109 basins. This analysis is done twice, once while including a precipitation correction factor and once without this factor. The paper concludes with recommendations about (1) the type of Shyft options that work well in these Norwegian basins, (2) the preferential use of KGE over NSE for less biased simulations, and (3) the use of a precipitation correction factor.
Comments
I have made some line-by-line comments in the PDF. Below are some further thoughts that I could not easily fit into a specific comment in the PDF.
[1] The idea of running mosaics of different models in space (i.e., finding an appropriate model for each individual basin, not a single model that works well enough for all of them) has been around for a while, but actual implementations that test the validity of this approach are a more recent development. In that sense, this paper is a timely contribution. However, the introduction might be improved by being a bit more specific about the current knowledge gaps and state of the art for these approaches. It currently focuses heavily on Shyft, but does not expand much on the more general multi-model approach that Shyft allows a user to set up.
[2] Related, the stated goal of this work (l. 73-75) is “Using Shyft as an example, the objective of this research is to evaluate the performance of flexible model configurations from a benchmarking perspective, considering different objective functions, accuracy of precipitation input, and streamflow regimes”. This suggests a general approach applied to a specific case, but after reading the work (and particularly the conclusions) my main takeaway is that we now have some idea of which of the five tested Shyft configurations work well in Norway. It’s less clear to me which general lessons can be learned from this work that are valuable to readers who do not actively use Shyft to model Norwegian basins. The benchmarking seems fairly minimal to me (just a comparison of CDFs), the conclusions about objective functions are possibly not that surprising (see [3c] below), the accuracy-of-precipitation context seems quite dependent on the model (some models provide better simulations with P corrections, others do not and sometimes get worse), and the flow regimes are specific to Norwegian basins. It would be good to emphasize the more general aspects of the work to justify publication in a broad international journal like HESS, compared to publication in a more regional or model-focused journal.
[3] I believe part of this is that the justification of various choices can be improved. Doing so might help highlight what broader lessons might be learned from this work. See below.
[3a] Based on Figure 2 it seems to me that Shyft supports more model configurations than just the five tested in this work. Some justification about why specifically these five are tested and not others would be welcome. This should, I think, extend to describing why certain parameters are chosen for calibration and others are not: lines 398-399 for example suggest that one sensitive parameter was not calibrated at all. Being clearer about the reasoning for these choices would be good.
[3b] This would also be a good opportunity to outline what can be learned from pairwise comparison of the models. For example, if I understand Table A1 correctly, the single difference between RPMSTK and RPMGSK is the choice of snow module (ST vs GS). This would mean that differences between both models can be more clearly attributed to specific modelling decisions. This is the core idea behind multi-hypothesis modelling frameworks such as SUMMA (which Shyft is apparently inspired by), and leaning more into this part of the literature (it’s currently not particularly prevalent in the framing of the paper) could help emphasize the more broadly applicable conclusions from this work.
[3c] The analysis broadly consists of looking at the performance of the models on either KGE alone (CDFs, maps) or on a combination of KGE, NSE and PBIAS. I understand that there must be some sort of consistent metric(s) to evaluate the 10 different objective/goal functions on, but it’s not particularly clear to me why specifically these three (or this one) were chosen. For example, it’s widely accepted that hydrologic models tend to perform best for the metric they were calibrated on, so it’s possibly not very surprising that any goal function that uses KGE shows better performance on KGE than any of the NSE-family of goal functions do. I think it would be good to explain a bit more clearly why the chosen analysis approach is helpful (e.g. by connecting the methodology to existing research gaps, operational practice or something else).
[4] Readability and clarity can be improved in my opinion. Most of my line-by-line comments focus on this. See also the points below.
[4a] More generally, I struggled quite a bit keeping track of the five model acronyms. The letters don’t mean much to me because I don’t have the familiarity with the Shyft software to map the acronyms onto specific models. The acronyms are also rather similar and this makes the text quite hard to follow. I expect that even just naming the models “model 1 - 5” (or “stack 1 - 5” to stick with the Shyft terminology) would be easier to follow for readers.
[4b] It’s also not always clear if results in figures refer to the calibration or evaluation period. This should be clarified in all cases.
- AC1: 'Reply on RC1', Olga Silantyeva, 16 Jan 2026
- RC2: 'Comment on egusphere-2025-4071', Anonymous Referee #2, 09 Dec 2025
The article compares five hydrological model structures on a large set of catchments in Norway. Models are tested with ten alternative calibration functions and evaluated with three efficiency criteria. Two versions of the models are evaluated, without and with precipitation correction factor.
Though large-sample hydrology studies often give insights on modelling performance, I found the originality of this study is unclear. The sensitivity of model results to calibration options has been already analysed in the literature, with recent studies cited by the authors. The individual evaluation of models is useful but it is difficult to get general conclusions from this work given some limitations in the testing framework (see detailed comments below).
Several choices lack justifications and results should be analysed in more detail to get actual insights from this modelling experiment. Research questions should also be more clearly stated. This would help the authors better explaining and demonstrating the novelty of their work.
I found this study has too many weaknesses and should undergo in-depth modifications including strengthening the introduction and discussion, designing a more convincing testing framework and providing a more detailed analysis of results. I think this would make the article very different. That is why I advise rejection.
Detailed comments
- The abstract could be much more concise and to the point to highlight the main points. Notations (families of KGE and NSE criteria) are unclear without further explanations.
- The introduction should better highlight the research questions addressed by the authors and explain what is the novelty of their study compared to existing works.
- Data: Given the problems noticed in the water balance of a large part of the catchments, this issue should be analysed in more detail to better identify the possible causes. We understand that snow undercatch may be one of them, but this is unlikely to be the only one. Some Budyko-type framework may be useful to analyse problems between catchments. This would help better understand the modelling results.
- Data: It is unclear whether there were gaps in flow time series. Streamflow data quality is not commented.
- Performance metrics: The selected performance metrics (NSE and KGE) are biased towards high flows, so it is not a strong surprise that the objective functions found to be preferable also emphasize high flows to some extent.
- Objective functions: The authors cite a recent study which extensively tested various objective functions, with a similar (but more rigorous) testing protocol. What does the study proposed here bring that is new? Besides, I was surprised that the authors ignore a recommendation from a previous study they cite not to use the KGE criterion calculated on log-transformed flows… only to come to the same conclusion that this criterion should not be used.
- Testing protocol: The testing protocol should be improved. First, a full split-sample test should be done on the two subperiods: calibration on the first and validation on the second, and vice versa. Model evaluation in calibration and validation would then be available on all data, which would help evaluate model robustness (comparison of mean performance in validation against mean performance in calibration) more rigorously. Second, the distribution of results should not mix calibration and validation results. Indeed, this may favour more complex models, which have more degrees of freedom in calibration. Third, it is unclear whether the local calibration tool used is well adapted to the level of complexity of the tested models, which may have more than ten parameters, with possible secondary optima in the response surface where the algorithm may be trapped. Last, only a one-year warm-up is used. If there are catchments with strong groundwater contribution (I do not know if it is the case), a single year may not be sufficient. This choice should be justified.
- The -0.41 threshold on KGE should be explained at first appearance in the text.
- There is no information on parameter distributions obtained by the various models. Are they realistic? What can be learnt from these distributions on the limits of the five model structures? What can be learnt from the values of the precipitation correction factor on the water balance issues the catchment may face?
- The comparison between the two modelling options on precipitation correction is a bit repetitive and direct comparison between graphs is difficult.
- It is unclear whether the correction factor is applied to solid precipitation only (to compensate for snow undercatch) or to all precipitation. Would it make a difference?
- The authors argued that models are robust but no result is shown on that.
- The authors do not really comment the possible complementarity between model structures. For example, if the best performing structure was selected (in calibration) for each catchment, what would be the gain (in validation) over each structure applied on all catchments individually?
- Be consistent with criteria notations in figures and in the text (lower case / upper case)
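As an editorial aside, the full split-sample test requested above (calibrate on each subperiod, validate on the other, and compare mean calibration against mean validation performance) can be sketched as follows. The callables, their signatures, and the robustness summary are illustrative assumptions, not part of the study under review:

```python
import numpy as np

def full_split_sample_test(calibrate, evaluate, period1, period2):
    """Klemes-style full split-sample test. `calibrate(data)` returns a
    parameter set; `evaluate(params, data)` returns a performance score
    (e.g. KGE or NSE). Both directions are run, so every year appears in
    both a calibration and a validation role."""
    scores = {}
    for cal, val, tag in [(period1, period2, "cal1/val2"),
                          (period2, period1, "cal2/val1")]:
        params = calibrate(cal)
        scores[tag] = {"calibration": evaluate(params, cal),
                       "validation": evaluate(params, val)}
    mean_cal = np.mean([s["calibration"] for s in scores.values()])
    mean_val = np.mean([s["validation"] for s in scores.values()])
    # A large drop from calibration to validation signals poor robustness.
    scores["robustness_loss"] = mean_cal - mean_val
    return scores
```

With a toy linear model calibrated by matching total volumes, both calibration scores are perfect while the cross-period validation scores degrade; the gap reported in `robustness_loss` (an illustrative summary, not a standard statistic) is the robustness comparison the referee asks for.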
Citation: https://doi.org/10.5194/egusphere-2025-4071-RC2
- AC2: 'Reply on RC2', Olga Silantyeva, 16 Jan 2026