Larger hydrological simulation uncertainties where runoff generation capacity is high: insights from 63 catchments in southeastern China
Abstract. Traditional parameter calibration strategies that focus on a single optimal parameter set may lead to large uncertainties and biases in simulating internal hydrological processes because of parameter equifinality. This study used the semi-distributed Tsinghua Hydrological Model based on Representative Elementary Watershed (THREW) to investigate the influence of parameter equifinality on uncertainties in surface–subsurface runoff partitioning. The model was implemented in 63 catchments in southeastern China with high-quality rainfall and streamflow data. Behavioral parameter sets were selected based on KGE thresholds to quantify uncertainty in estimates of the contribution of subsurface runoff (Csub). Correlation analyses were conducted to investigate factors influencing these uncertainties. Results showed that: (1) the THREW model performed well across the 63 catchments, with an average optimal KGE (KGEopt) of 0.846. Csub varied widely among catchments, ranging from 1.0 % to 74.1 % (mean = 31.7 %), and was below 50 % in 84 % of the catchments, indicating that surface runoff was the dominant runoff generation mechanism in the study area. (2) Substantial uncertainty in Csub can arise from small differences in KGE, with notable variability among catchments. The uncertainty in Csub was modest in most catchments, with mean Bias (difference between the Csub estimated using the optimal set and the average across all behavioral parameter sets) and Range (max–min across behavioral sets) of 2.7 % and 15.8 %, respectively. However, the uncertainty can be large in some catchments, where reliance on a single optimal parameter set is likely inappropriate. (3) Runoff ratio was identified as an important catchment attribute significantly correlated with Csub and its uncertainty. In catchments with stronger runoff-generation capacity, the model tended to be less sensitive and the simulation of internal runoff-component partitioning tended to exhibit larger uncertainties. Such evidence can provide empirical a priori guidance on the likely magnitude of uncertainties and help inform calibration-strategy selection.
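For reference, the two headline uncertainty measures in the abstract can be written out explicitly (a minimal formalization of the parenthetical definitions above, not notation taken from the paper; $C_{\mathrm{sub}}^{(i)}$ denotes the value simulated by the $i$-th of the $N$ behavioral parameter sets, and $C_{\mathrm{sub}}^{\mathrm{opt}}$ the value from the best-KGE set):

$$
\mathrm{Bias} = C_{\mathrm{sub}}^{\mathrm{opt}} - \frac{1}{N}\sum_{i=1}^{N} C_{\mathrm{sub}}^{(i)},
\qquad
\mathrm{Range} = \max_{i} C_{\mathrm{sub}}^{(i)} - \min_{i} C_{\mathrm{sub}}^{(i)}
$$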
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 09 Dec 2025)
- RC1: 'Comment on egusphere-2025-4445', Anonymous Referee #1, 02 Dec 2025
- RC2: 'Comment on egusphere-2025-4445', Anonymous Referee #2, 08 Dec 2025
Summary
The main goal of the paper is to analyze whether predictors can be found that indicate whether calibrating a particular catchment only once is sufficient to accurately estimate internal hydrological processes and variables (in this case the fraction of subsurface flow), or whether, because of parameter equifinality, an ensemble mean over multiple calibrations is necessary for an accurate estimate.
The authors calibrated the THREW model 50 times for each of the 63 catchments used in this study. The catchments were calibrated on streamflow using KGE as the objective function. A threshold value was applied to select only those parameter sets that performed reasonably well in terms of KGE. Four different measures were defined to describe the uncertainty in the estimated fraction of subsurface flow resulting from simulations with the calibrated parameter sets. Linking the uncertainty measures to catchment characteristics, the authors found that the estimation of the fraction of subsurface flow was less certain for catchments with higher runoff ratios.
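To fix ideas, the selection-and-measurement workflow described above can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the array names, the threshold rule relative to the best KGE, and the exact RBias normalization are assumptions based on the description in the paper):

```python
import numpy as np

def uncertainty_measures(kge, csub, delta=0.05):
    """Select behavioral parameter sets (KGE within `delta` of the best
    KGE) and summarize the spread of the simulated subsurface-runoff
    contribution Csub (%) among them.

    kge, csub : per-parameter-set results of the 50 repeated calibrations
    """
    kge, csub = np.asarray(kge, float), np.asarray(csub, float)
    behavioral = csub[kge >= kge.max() - delta]        # behavioral subset
    csub_opt = csub[kge.argmax()]                      # Csub of the best-KGE set
    bias = csub_opt - behavioral.mean()
    return {
        "n_behavioral": behavioral.size,
        "Bias": bias,
        "RBias": 100.0 * bias / behavioral.mean(),     # relative Bias (%)
        "STD": behavioral.std(ddof=1) if behavioral.size > 1 else 0.0,
        "Range": behavioral.max() - behavioral.min(),
    }
```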
Review
The paper is well written and easy to follow, the language is clear, and the general topic of uncertainties in calibration strategies is relevant and interesting. However, I have a few concerns regarding the methodology used in this study, which I outline below together with some other major and minor feedback.
Major comments:
The authors compared two calibration strategies: calibrating the model only once and calibrating the model 50 times. They defined two measures (Bias and RBias) to analyze how representative the contribution of subsurface runoff (Csub) estimated by the parameter set with the highest KGE is. However, calibrating the model only once can result in any of the 50 parameter sets with equal likelihood. There is no guarantee (and one cannot know) that the calibration resulted in the parameter set with the highest KGE value. Therefore, measures analyzing the representativeness of the best parameter set (in terms of KGE) do not seem meaningful to me for the purpose of this study. The 'Bias' measure, for example, only shows to which extent Csub modeled by the best parameter set corresponds to the mean Csub of all calibrations, but it says nothing at all about how wrong or inaccurate Csub can possibly be for a single calibration.
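A measure closer to what this comment calls for would quantify how far a single, equally likely calibration outcome can stray from a reference value. A minimal sketch of that alternative perspective (names illustrative; this is the reviewer's suggestion made concrete, not a measure used in the paper):

```python
import numpy as np

def worst_case_error(csub_all):
    """Worst-case deviation of Csub from the ensemble mean when a single
    calibration may return ANY of the repeated-calibration outcomes
    (behavioral or not) with equal likelihood."""
    csub_all = np.asarray(csub_all, float)
    return np.abs(csub_all - csub_all.mean()).max()
```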
The authors calibrated the model for each of the catchments 50 times. They then selected those parameter sets whose KGE exceeded a certain threshold, the so-called behavioral parameter sets, and used them for further analyses [Line 171-172]. I think this selection is problematic. Any of the removed parameter sets could be the result of a single calibration. Figure 5b clearly shows the possible consequences of excluding parameter sets from further analyses for the Shuikou catchment. For this catchment, all behavioral parameter sets have a Csub value roughly between 25 % and 32 % (a rather small spread), but a single calibration could result in a Csub value of up to at least 55 %. I understand that if a calibrated parameter set doesn't model the discharge very well, one may have less trust in the modeled Csub. It's unfortunate that the optimization strategy was not very effective, given the often small number of behavioral parameter sets. But that doesn't mean that one can simply ignore the fact that a single calibration could result in any of the non-behavioral parameter sets.
I think it is important to zoom in a bit on the possible consequences of using a performance threshold for the results and/or conclusions of the paper. Even though the authors mention that it is difficult to define a QR threshold below which a single optimal parameter set can be judged sufficiently credible [Line 351-353], the paper implicitly concludes that the smaller QR, the smaller the uncertainty measures, and therefore the more likely that one calibration may be sufficient to accurately estimate Csub. However, a strong, significant correlation was found between QR and the number of behavioral parameter sets [Line 305-307]. So, the lower QR, the more parameter sets were removed from the analysis because the calibration algorithm struggled to find a good fit. In other words, for those catchments for which one calibration might potentially be sufficient according to the analyses within this study, the chances of ending up with a parameter set that does not even calibrate the discharge well are rather high.
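The confounding described here could be checked directly with a rank correlation between the runoff ratio and the number of behavioral parameter sets across catchments. A sketch using scipy (the per-catchment values below are placeholders, not data from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-catchment values, for illustration only
qr = np.array([0.35, 0.48, 0.52, 0.61, 0.70, 0.74])    # runoff ratio
n_behavioral = np.array([31, 24, 22, 15, 9, 6])        # behavioral sets

rho, p_value = spearmanr(qr, n_behavioral)             # rank correlation
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```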
Specific questions / feedback:
- Line 88-90: "(1) to quantify the uncertainty in the contribution of subsurface runoff (Csub) resulting from small changes in model performance metric (KGE)" -> As outlined earlier, for the purpose of this study, this should in my opinion have been e.g.: 'to quantify the uncertainty in the contribution of subsurface runoff (Csub) when calibrating a hydrological model multiple times'.
- Line 159-160: "the optimization stopped when the objective converged or the number of model runs reached a threshold" -> Which of the two happened for the different calibrations within this study? Considering the often large number of non-behavioral parameter sets, it would be valuable to know if the calibration got stuck in a local optimum or reached the maximum number of model runs. In the case of the latter, the number of model runs during each calibration might not have been set high enough.
- Line 204-205: KGE values for the best parameter sets are given, but what are the values for the other 49 parameter sets (range and/or distribution)?
- Line 219-221: "Considering the random generation of initial parameter sets within each pySOT running, the number of behavioral parameter sets serve as a partial indicator of model sensitivity." -> I don't agree with this statement. PySOT is an optimization algorithm, and the number of behavioral parameter sets tells more about how easy or difficult it is for the algorithm to find the optimum, i.e. about the 'smoothness' and the exact structure of the parameter space.
- Line 271-273: "the 90th percentiles of Bias and Range were below 5% and 10%, respectively, when the KGE threshold was set 0.01 below KGEopt, indicating that Csub estimation is robust in most catchments if the threshold is set sufficiently high" -> I do not fully understand the reasoning here. Why exactly do the percentiles need to be low? If it is about robustness, one can set the threshold to KGEopt. Then there will be only one parameter set left, resulting in an 'excellent robustness' of Csub, with Bias, RBias, STD and Range all equal to 0... (See the sketch after this list, which makes this degeneracy explicit.)
- Figure 9: Is there a correlation between the number of behavioral parameter sets and the uncertainty measures (Bias and Range)? And could it be that the correlation between the uncertainty measures and QR is an artifact of the fact that not all parameter sets were used for the calculation of the uncertainty measures?
- Line 331-332: "the optimal parameter set can adequately represent the parameter sets that produced sufficiently high KGE" -> But how can one know what is 'sufficiently high'?
- Line 350-353: "However, it is difficult to derive a reliable equation to predict potential modeling uncertainty from catchment attributes, or to define a MAR/QR threshold above or below which a single optimal parameter set can be judged sufficiently credible". Let's assume that such an equation and/or threshold could be defined. Since for studies involving multiple catchments it is common practice to treat all catchments in an identical way in order to be able to make comparisons and draw overall conclusions, calibrating some of the catchments only once, and others multiple times may complicate the analyses and conclusions. Can the authors elaborate a bit on which studies could benefit from an equation or threshold to decide if a single optimal parameter may be sufficient, and how they would apply it in practice?
- Line 363-365: "Because of deficiencies in observational data for multiple runoff components, it is difficult to directly validate the Csub estimates in this study" -> Although validating the Csub estimates is indeed difficult, I would suggest including a map with the catchments color coded by Csub. Spatial patterns of Csub in combination with expert knowledge about the characteristics and behavior of the different catchments might help to assess the likelihood of correct Csub values.
- Line 368-369: "such as an end-member mixing model and a groundwater model" -> Can you elaborate a bit more on these models, in particular to what extent the modeling was based on field measurements/data?
- Line 371-373: "the positive correlation between Csub and topographic slope and the negative correlation between Csub and mean annual rainfall are consistent with patterns reported in previous studies" -> Neither correlation is significant, and the correlation direction alone is not a very strong similarity.
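As flagged in the comment on Line 271-273 above, the degeneracy of threshold-dependent robustness is easy to demonstrate by sweeping the KGE threshold. A minimal sketch (names illustrative; at delta = 0 only the optimal set survives, so Bias and Range are zero by construction regardless of the true parametric uncertainty):

```python
import numpy as np

def sweep_thresholds(kge, csub, deltas=(0.10, 0.05, 0.02, 0.01, 0.0)):
    """Bias and Range of Csub as a function of the KGE threshold
    (KGEopt - delta), to show how both shrink as the threshold rises."""
    kge, csub = np.asarray(kge, float), np.asarray(csub, float)
    csub_opt = csub[kge.argmax()]
    results = {}
    for delta in deltas:
        behavioral = csub[kge >= kge.max() - delta]
        results[delta] = (csub_opt - behavioral.mean(),           # Bias
                          behavioral.max() - behavioral.min())    # Range
    return results
```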
Minor comments:
- Line 22-25: Are the KGE and Csub statistics for 63 catchments (as mentioned here) or for the 50 catchments that have at least 10 behavioral parameter sets (as mentioned in the results section)?
- Line 47: "Parameter calibration is a necessary step in developing hydrological models" -> This depends on the type of model.
- Line 63-64: "single–optimal-parameter approaches remain the default in many applications" -> Refs are 12-15 years old. Please support the statement with some newer publications.
- Line 110-111: "Considering the variable quality of the raw data (mainly completeness and temporal resolution)" -> What do you exactly mean by the temporal resolution being of 'variable quality' other than having missing data (which falls under 'completeness')?
- Line 113: 'Exceeded' or 'was below 3650'?
- Line 121-124: Were lapse rates used?
- Line 128: "Soil parameters" -> which parameters are these?
- Figure 2 (and the corresponding model description in the text):
- Use the same terminology consistently (e.g. u-zone, b-zone, n-zone etcetera).
- Leave out the components that are not used in this study (snow, glacier)
- Line 149: "in our previous publications" -> Include refs.
- Table 2: Where are the parameters exactly used within the model and what is their exact function? If possible integrate the parameters into Figure 2.
- Table 3: Include the meaning of the abbreviations in the caption.
- Line 337-338: "Generalized Likelihood Uncertainty Estimation (GLUE) framework" -> Add ref.
Citation: https://doi.org/10.5194/egusphere-2025-4445-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 149 | 34 | 15 | 198 | 12 | 10 |
The study quantified the effect of parametric uncertainty on the resulting runoff partitioning using 50 calibrated parameter sets of the Tsinghua Hydrological Model (THREW) in more than 60 catchments in the Yangtze River Basin. It shows that in catchments with a larger runoff ratio, the parametric uncertainty is higher, resulting in higher variability of the simulated runoff components.
Although I find the analysis of the relationship between parametric uncertainty and possible catchment attributes controlling it interesting in general, the investigation performed in this study is rather incomplete and poorly motivated. The model appears to be unvalidated. Moreover, a rather small sample of parameter sets was considered for the analysis, compromising the reliability of the results. No rationale for selecting possible catchment controls is provided, and the results are based on simulations of a single model with a single objective function, making the findings difficult to generalize. Please find my detailed comments below.
General comments
Motivation: The motivation of the study as mentioned in Line 68-70 is to provide a priori guidance on whether to select only the single best calibrated parameter set or instead choose multiple behavioral parameter sets. I rather disagree with such a premise. Even if some studies still use a single best parameter set, they are simply ignoring parametric uncertainty altogether, while there is a large body of studies, including those referenced by the authors, showing that using multiple parameter sets is essential for accounting for and communicating uncertainty. I do find the idea of knowing a priori how much parametric uncertainty one might expect useful for the sake of understanding and improving model limitations and for lowering the computational costs of running a large number of simulations. However, especially given the rather inconclusive results of this study, using a single parameter set would in any case mean ignoring the uncertainty. See also my detailed comments.
Lack of Validation: It seems that there is no validation of model performance provided in this study; the methods section describes only the calibration approach. Moreover, it seems that the total length of observations used for calibration is about 2 years, which is too short to result in reliable parameter identification. Without validation, there is no proof of parameter validity for an uncalibrated period.
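For context, the split-sample test this comment calls for only requires the objective recomputed on a hold-out period. A minimal sketch of the KGE itself, in its standard Gupta et al. (2009) formulation (the calibration/validation split is left schematic):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]       # linear correlation
    alpha = sim.std() / obs.std()         # variability ratio
    beta = sim.mean() / obs.mean()        # bias ratio
    return 1.0 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)

# Split-sample idea: calibrate on the first part of the record and report
# kge(sim, obs) on the remaining, unseen part separately.
```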
Parameter sampling: Given that the focus of this study is to investigate parameter uncertainty, 50 parameter sets seem like a very low number of samples. A much more comprehensive sampling is needed to prove that the lower number of behavioral sets is indeed not simply an artefact of the small sample size.
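A denser, space-filling sample of the feasible parameter space, as suggested here, could for instance be drawn with a Latin hypercube. A sketch using scipy's qmc module (the dimensionality and bounds are placeholders, not the THREW parameter ranges):

```python
import numpy as np
from scipy.stats import qmc

n_params, n_samples = 10, 5000                 # placeholders
l_bounds = np.zeros(n_params)                  # placeholder lower bounds
u_bounds = np.ones(n_params)                   # placeholder upper bounds

sampler = qmc.LatinHypercube(d=n_params, seed=42)
unit_sample = sampler.random(n=n_samples)      # points in the unit hypercube
params = qmc.scale(unit_sample, l_bounds, u_bounds)  # map to parameter ranges
```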
Single calibration strategy and single model: While the study aims to provide generalizable results on how to decide a priori on a calibration strategy, it only examines one single objective function for the calibration (i.e., KGE) and one single conceptual model. In the Limitations section the authors themselves highlight the latter as a limitation. As hydrological models of even rather similar structures are known to behave quite differently in how they partition the fluxes (Merz et al., 2022, https://doi.org/10.1175/BAMS-D-21-0284.1; van Kempen et al., 2020, https://doi.org/10.5194/nhess-21-961-2021), several model structures must be analyzed to confirm the reliability of the suggestions presented here. Similarly, given the strong effect of the objective function on the resulting parameter sets, a comprehensive set of experiments with various objective functions is needed to evaluate the generalizability of the findings.
Potential controls of uncertainty: The manuscript does not provide the rationale for selecting the few examined catchment attributes as potential controls of parameter uncertainty. This is necessary to understand why the examined attributes were selected in the first place and whether there are any other potential properties that could be used to explain the observed parameter uncertainty.
Specific comments
Line 18, 111 and elsewhere: It is not quite clear what is meant by “high-quality rainfall” here. It is also not clarified later in the Methods part. It would be much more instructive to specify the temporal resolution, the length of the observations, or the density of the observations to highlight a particular aspect of data quality. Please revise.
Line 23: Given the theme of this paper, I do not think that the model performance is relevant for the abstract. Consider omitting this.
Line 25-26: At this point, this statement is not clear, because it is not yet clarified how the uncertainty is computed. Please clarify and revise.
Line 34-36: This recommendation is rather general. It would be helpful to specifically highlight what sort of guidance this study can provide for the choice of the calibration strategy.
Line 49-51: This recent review of Wagner et al., 2025 (https://doi.org/10.1002/wat2.70018) might be worth mentioning here.
Line 53: I disagree that these are two different calibration approaches. Considering multiple behavioral parameter sets vs taking one best single parameter set stands for “accounting for parametric uncertainty” vs “ignoring it”. While most of the current modeling studies account for parametric uncertainty using various approaches, using a single best parameter set remains a poor practice that unfortunately still persists in some studies. Yet, this does not make it a distinct calibration strategy. Please revise.
Line 57-59: Given the relevance of behavioral parameter sampling, this part seems rather short and incomplete to me, especially in terms of the references mentioned. It seems that most of the seminal works of Keith Beven on this topic are overlooked here. Please add.
Line 63-64: As I mentioned in my comment above, even if some studies still use a single parameter set, there is a consensus that in order to represent parametric uncertainty multiple parameter sets should be considered as the study referenced here also emphasizes.
Line 99 and elsewhere: The term “average” is ambiguous. Please specify if this is mean or median.
Line 101: Please clarify here what is meant by runoff ratio and how it was computed in this study (see the sketch at the end of this list for one common convention).
Figure 1: This figure would be more instructive if it would also display precipitation or runoff ratios that are used as explanatory controls of the uncertainty in this study.
Line 112-113: This is a rather misleading way to report missing values. Please specify the study period and the portion of missing data across all catchments.
Line 115: Please quantify this by specifying how much water level data vs discharge data is available.
Line 116-119: Please specify how these relationships were built. Was it a linear relationship?
Line 128-132: It is not clear how this information was used. Is it needed for model input? This has to be clarified.
Line 149: Even if the model was used in previous studies, the key equations and especially the parameters have to be specified. Parameters can easily be added near the corresponding compartment in Figure 2. Please add.
Section 2.3: This section must provide complete information on the lengths of the calibration and validation periods.
Table 2: Are these all the parameters of the model? Please clarify.
Line 172: It is not clear what is meant by the “optimal KGE”, the highest one?
Line 179: Even if the river bed might be impervious, this is not a primary reason for surface runoff. For perennial rivers, as here, rainfall falling on the channel surface is much more likely to flow horizontally along the channel than vertically towards the river bed. Please revise.
Section 3.1: It is not clear whether the reported performance corresponds to the calibration or to the validation period. Please clarify.
Figure 8: Please add explanation of clusters in the caption. Please also clarify what is meant here by maximum rainfall. Event? Rate? Volume?
Line 206: Please clarify what makes these two catchments typical and why for another example later another set of two catchments were selected. Please avoid subjective choice of catchments for examples and present the results for all study catchments.
Figure 3: This figure could be more efficiently presented as a boxplot or a violin plot.
Line 216 and elsewhere: Please avoid term “significantly” if no statistical test was used.
Line 260: Please be more specific on which catchments are these.
Figure 6 and all other figures: Please explain all the terms and acronyms from the figure in the caption.
Table 3 and elsewhere: Please specify type of correlation used here.
Line 306: I am not sure if the term “model sensitivity” is suitable here. Given that “parameter sensitivity” is a rather established term, I would avoid using it in a different context.
Line 338: It is not clear what is meant here by the assimilation of the datasets. Please clarify.
Line 367-371: Please specify if these studies were in the same study region/ catchments.
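As an aside to the Line 101 comment above, one common convention for the runoff ratio is the long-term mean runoff depth divided by the long-term mean precipitation depth; whether the paper follows this convention is exactly what that comment asks the authors to state. A minimal sketch:

```python
import numpy as np

def runoff_ratio(q_mm, p_mm):
    """Runoff ratio QR under one common convention: long-term mean runoff
    depth divided by long-term mean precipitation depth (same units and
    period; NaNs are ignored in each record's mean)."""
    return np.nanmean(np.asarray(q_mm, float)) / np.nanmean(np.asarray(p_mm, float))
```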