Exploring Diverse Modeling Schemes for Runoff Prediction: An Application to 544 Basins in China
Abstract. Hydrological modeling plays a key role in water resource management and flood forecasting. However, in China, with its diverse geography and complex climate types, a systematic evaluation of different modeling schemes for large-sample hydrological datasets is still lacking. This study constructed a preliminary dataset of catchment attributes and meteorology covering 544 basins in China and systematically evaluated the applicability of process-based models (PBMs), long short-term memory (LSTM) models, and hybrid modeling methods. The results demonstrate that: (1) The accuracy of meteorological data critically affects the prediction performance of hydrological models; high-quality precipitation data enable a model to better simulate the runoff generation process in a basin and thereby improve prediction accuracy. (2) The hybrid modeling methods possess regional modeling capabilities comparable to those of the LSTM model and demonstrate strong generalization; in predicting ungauged basins, the hybrid models are more stable than the LSTM model. (3) Of the two hybrid modeling methods, the differentiable hybrid modeling scheme offers a deeper representation and simulation of hydrological processes and can output unobserved intermediate hydrological variables, and its prediction results are more consistent with the basin water balance. These results provide a systematic assessment of the applicability of different hydrological modeling methods across 544 basins in China and offer important guidance for the selection and optimization of future hydrological models.
Status: final response (author comments only)
CC1: 'Comment on egusphere-2025-1161', Junzhi Liu, 10 Jun 2025
The authors constructed a watershed property and meteorological dataset covering 544 watersheds in China and systematically evaluated the applicability of different modeling schemes. The paper is well-structured and the experiments are comprehensive. I believe the following improvements will further enhance the quality of the article.
1. Is there any reference or rationale for the determination of the watershed boundaries?
2. The abbreviations of the models in the article are very confusing. Please explain them uniformly in the appropriate place.
3. Line 284 describes the PUB test method. Why are the remaining 9 clusters used for training? (A generic leave-one-cluster-out sketch follows this comment.)
4. The authors claim that the differentiable mixed hydrological model can output unobserved intermediate hydrological variables, but there is no data to support this.
5. What does the spatial distribution map in Figure 12 mean? A detailed explanation should be given in the image caption.
6. Why do we use the runoff predictions for water balance assessment? What is the purpose of calculating the water imbalance ratio? Please add an explanation.
7. There is an error in the labeling of Figure 11. Two sub-figures (b) appear. Please modify them.
Citation: https://doi.org/10.5194/egusphere-2025-1161-CC1
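For context, the PUB test mentioned in point 3 appears to follow a leave-one-cluster-out scheme, in which each basin cluster is treated in turn as pseudo-ungauged. The sketch below is a minimal, generic illustration of such a split; the function, the basin identifiers, and the cluster assignments are hypothetical and not taken from the manuscript.

```python
def leave_one_cluster_out(basin_clusters):
    """Yield (held_out, train, test): each cluster is held out once as the
    pseudo-ungauged test set while all other clusters form the training set."""
    cluster_ids = sorted(set(basin_clusters.values()))
    for held_out in cluster_ids:
        test = [b for b, c in basin_clusters.items() if c == held_out]
        train = [b for b, c in basin_clusters.items() if c != held_out]
        yield held_out, train, test

# Illustrative usage with hypothetical basins and cluster labels.
basin_clusters = {"basin_001": 0, "basin_002": 1, "basin_003": 0, "basin_004": 2}
for held_out, train, test in leave_one_cluster_out(basin_clusters):
    print(f"cluster {held_out}: train on {len(train)} basins, test on {len(test)} basins")
```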
AC1: 'Reply on CC1', Chunxiao Zhang, 13 Jul 2025
Dear Professor Liu:
On behalf of all the co-authors, I would like to sincerely thank you for your valuable advice. We appreciate your recognition of the structure and comprehensiveness of our study, and we are grateful for the suggestions you provided to further improve the clarity and quality of the manuscript. Your comments helped us identify several areas where more detailed explanation, clearer labeling, and stronger justification were needed. In response, we have carefully revised the manuscript to address each of your concerns. To facilitate a more effective and efficient review process, we kindly ask the editors and reviewers to focus on the following key points before proceeding with our item-by-item responses:
- Model abbreviations and definitions: We acknowledge the confusion caused by inconsistent abbreviations. In the revised version, we have standardized all abbreviations and added a summary table to clearly distinguish the different model versions.
- Interpretation of intermediate variables and diagnostic plots: The differentiable hybrid models do indeed allow for the simulation of intermediate hydrological states. We have now added visualizations to support this claim and revised the text accordingly to avoid overstatement.
- Purpose of water balance evaluation: The water imbalance ratio was intended as a diagnostic metric to assess the physical consistency of runoff predictions. We have added clearer justification and explanation in the corresponding section.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers’ comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn
CC2: 'Comment on egusphere-2025-1161', Zeqiang Chen, 11 Jun 2025
This study conducted regional hydrological modeling work based on a large sample data set in China. This work can provide some reference for regions that currently do not have a large sample basin data set. However, I suggest the following modifications:
1. Why did the author only choose two sets of precipitation data, or did the temperature data also come from two sets of data products? A more detailed description of the data source is needed.
2. The description of the hybrid model structure in Section 3.4 is confusing. Please try to describe the operating logic of the two hybrid models separately.
3. Line 251: Which of the 6 categories each of the 15 attributes belongs to needs additional explanation, or this should be added to Table 2.
4. Line 455: The author mentioned here the accuracy of climate characteristics and rainfall data. Among the 15 attributes, which meteorological data product is used to calculate "p_mean", "pet_mean", etc., or did the author use other methods?
5. The author uses the Budyko curve to examine the watershed water balance in Section 4.1, while in Section 4.5, the water budget closure method is employed. Why are different methods used to verify the watershed's water balance situation?
6. There are still many available high-quality meteorological data products. I can understand the author's decision to limit the scope of the article to control its length. However, this needs to be clarified in the conclusion section of the article.
Citation: https://doi.org/10.5194/egusphere-2025-1161-CC2
AC2: 'Reply on CC2', Chunxiao Zhang, 13 Jul 2025
Dear Professor Chen:
Thank you very much for your careful review and constructive suggestions. We greatly appreciate your recognition of our efforts in constructing a large sample basin data set in China, and we agree that such datasets can provide valuable references for regions where high-quality large-sample hydrological data are still lacking. Before addressing your specific comments one by one, we would like to offer a few clarifications regarding key aspects of our study:
- Regarding the choice and use of meteorological data: We selected CN05.1 and ERA5-Land as the two precipitation products in this study because they are the most widely used and accessible datasets in China. To ensure consistency and comparability, the temperature data used in our experiments were also obtained from these two datasets. We have now added a more detailed description of all meteorological variables and their sources in the revised manuscript.
- Regarding the hybrid model structure: We acknowledge that the previous description of the hybrid modeling framework may have caused confusion. In the revised version, we have rewritten Section 3.4 to separately explain the design and functioning logic of the two hybrid models, accompanied by updated schematics and clearer terminology.
- Regarding the scope of meteorological dataset selection: We appreciate your suggestion to explicitly acknowledge the limited number of forcing datasets used. To this end, we have revised the conclusion section to clearly state that the current study focuses on two widely used datasets and that future extensions could include a broader range of high-quality products to enhance generality.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers’ comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn
RC1: 'Comment on egusphere-2025-1161', Anonymous Referee #1, 18 Jun 2025
In this work, the authors compare the performance of two process-based models, an LSTM, and a couple of hybrid models on a rather large set of catchments in China. On paper, this work is interesting, as Chinese hydrology is not so commonly studied and insights on the use of hybrid methods are needed. However, the presentation of this work is of rather poor quality (figures and figure captions), and some methodological features make it difficult to draw solid conclusions. In addition, the discussion is sparse and scattered across the results.
Main remarks
Namely, most of the figures are poorly described in their captions, and many small to larger mistakes or discrepancies are present. Some of them also use fonts that are too small to be read comfortably.
For assessing the performance of the different experiments, the authors compare the simulated streamflow to streamflow from VIC-CN05.1… which is also simulated streamflow. This choice is justified, although I guess that VIC-CN05.1 had to be evaluated against observed streamflow, so why not use it. To add to the confusion, several of the experiments come from models forced by CN05.1. That induces a bias in the conclusions that can be drawn.
Finally, there is no discussion section. Some discussion arises in the results section, but it is rather rare and sparse. The effect is that it is difficult to understand the added value of this work for the hydrological community, and no recommendations are provided.
Miscellaneous
Abstract:
- The abstract mentions the use of PBMs, but not what they are used for, neither what we can conclude about them.
- Line 36: conclusions about the two hybrid models are drawn, but those are not detailed before
- L 40-42: This is not a concluding sentence for an abstract; this is the rationale of the study. Here we need you to give us the major guidance resulting from your work.
Line 52: Why is the complexity of hydrological processes increasing? It seems to me that all this discussion is about natural processes, which do not complexify in time.
L 86: I do agree for physically-based models, but conceptual/empirical ones only need 3 variables, namely precipitation, temperature and streamflow. This is not a substantial amount of high-quality data! For example, the EXP-HYDRO model used by the authors needs exactly these data, plus the day length, and the Xin'anjiang model only needs these data.
L 91: I do not agree, see previous comment
L 168: From now onwards, I wonder if most elements should rather appear in the material and methods section of the manuscript
L 190: Why do accurate daily runoff observation data often need to be kept confidential?
Figure 1: Please make the different maps more uniform. Panel b uses a different color for foreign countries. In addition, please do not use the same color for China and the seas (panel a). I also suggest removing the bottom-right islands, as there are no basins there and they are originally not on the map. Imagine if French researchers put all French territories on all maps!!
Caption of Figure 1: In a I see the areas, in b the DEM, in c the catchments and in d the climates (only this last one is correct). Please modify.
L 234: I was completely lost here. There must be a nuance between the different terms (observation, runoff, runoff hydrograph), but I initially didn't get it. Only later on, while reading the results, I understood that the VIC-CN05.1 dataset is simulations from the VIC model forced by CN05.1. That was not clear at all.
L 284: Do you mean 4? There are 5 clusters
Figure 4: While a is understandable, I do not get b at all. What is FCNN? It is never defined in the text. Please improve or develop the caption.
Figure 5: Please use the same range for the distribution of P values for the two products over the diverse basins. Also make sure to use the same categories, it seems that there are many more categories for CN05.1 than for ERA5. I guess this is basin-averaged P and T? Please specify.
Figure 5: The scale indicates a gradual color scale for P and T, but the maps only display categorical values, with only 5 colors. Please correct. What is the period? Is it the total period or the evaluation period (1995-2015)? These two comments are valid for most figures that follow
L 407: How is the drought index calculated?
L 415: That definitely induces a bias! It is easier to reproduce streamflow obtained from a model forced by a dataset, when you use the same dataset…
L 416: This is methods, not results
L 420-425: This is discussions, not results
Figure 6: what is the blue shaded area?
L 433: This is a somehow unfair comparison, as the reference data used to calculate NSE comes from VIC forced by CN05.1. Then, when you compare models forced by ERA5 to these data, you include the error coming from the PBM and the error coming from the input data set.
Figure 7, left: what is this scale? It does not include regular intervals between values
Figure 7, caption: the authors state that the colormap includes values from 0 to 1. That would be great, to compare the four maps together. Unfortunately, the left maps do not use the same range as the right maps.
L 411 and following: The differences should be discussed in terms of what processes are important for these basins and what is the link with the processes present in the PBMs. We need interpretation!
L 469: The fact that the LSTM performs very well with CN05.1 comes from the fact that the authors do not try to reproduce observed streamflow but simulated streamflow. This means that the LSTM does not excel in reproducing the processes leading to streamflow from meteorological input, but rather excels in mimicking the behavior of the VIC model. This is highly different and is caused by the experimental setup. In addition, this might indicate that the LSTM cannot cope with input errors.
L 488-491: these are discussions, not results
Figure 9, 10: random scales prevent from comparing the different parts of the figure
L 528-537: these are discussions, not results
Figure 12, 13: fonts are too small, we cannot read
Citation: https://doi.org/10.5194/egusphere-2025-1161-RC1
AC3: 'Reply on RC1', Chunxiao Zhang, 13 Jul 2025
Dear Reviewer:
On behalf of all the co-authors, I would like to sincerely thank you for your valuable advice. Your suggestions are very professional and meticulous. We attach great importance to your feedback and will seriously consider your suggestions. As mentioned, large-sample basin hydrological research is relatively uncommon in China. Therefore, we are eager to make some attempts in this area, hoping our research can serve as a foundation for future hydrological studies in the country. Before addressing your specific suggestions one by one, we would like to clarify the following points:
- Regarding the quality of figure captions: We realize that the current presentation of figures may not be clear enough. We have revisited all figures, improved the resolution of the images, and provided more detailed and accurate descriptions in the figure captions to better present our results.
- Regarding the clarity of the methodology: We have added a detailed description to the methods section and more fully explained the process of obtaining the data. We hope that this will make it easier for you and other readers to evaluate the validity of our research results.
- Regarding the expansion of the Discussion section: We realize that the organization of the Results and Discussion sections is quite confusing, and the discussion content is relatively sparse. We have reorganized the Results section based on your specific comments, and set up a separate Discussion section to explore the potential significance and impact of the results to enhance the depth and breadth of the discussion.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers’ comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn
RC2: 'Comment on egusphere-2025-1161', Anonymous Referee #2, 30 Jun 2025
This paper constructs a dataset of catchment attributes, meteorology, and simulated flow for 544 catchments in China, and compares across process based, machine learning, and hybrid models for flow predictions, and compares across two meteorological forcing products. It is found that a hybrid model often out-performs a purely data driven approach or a process based model, and model performance is highly subject to the quality of meteorological inputs. As relatively few large-sample systematic studies have been performed in China, this study provides a base dataset and guidance for future modeling efforts.
This paper was interesting, relatively clear to read, and generally appropriate for this journal. However, I have one major concern about the experimental setup, namely the use of the CN05 precipitation data to generate the flow data, and the subsequent comparison of models that are forced with CN05 versus ERA5 precipitation. I think this feature should be brought up explicitly earlier, since it likely has a large effect on how we should interpret the findings of model differences. For example, if we had used ERA5 forcing to generate streamflow, these comparative results would likely be flipped or greatly altered. Otherwise, several major and minor comments are listed below.
Major comments:
The runoff data product is actually simulated with VIC, a process based model, which uses one of the two meteorological datasets (the CN05.1) as input. In general, the uncertainties in the VIC data product that is being used as a proxy for “observed data” in terms of model training and evaluation should be brought up more clearly. This seems relevant in several places in the study. For example, the finding that the CN05 product leads to better performance is mainly because it is embedded in the flow data. This is explicitly mentioned in Line 470, and I would say that is extremely likely that it is what is happening. Meanwhile, there are other places in the paper where it is posed that CN05 must be the superior forcing product because it leads to better model results. For example, Line 32 in the abstract cites that “high-quality precipitation data better enables the model to simulate runoff processes”. While this is surely true in general, I think the results of this study reflect the fact that one of the products was used to generate the original flow data, making it a more unfair comparison. With this, I’m not sure what the significance of comparing between the two meteorological products is here – since whichever product is used to generate the flow data product is likely to be the most useful input to another model that is trained to emulate the flow data. It would be better to bring in a third met product (or more) to really compare this, or drop this aspect of the study and focus on differences between models based on a single forcing product.
It is easy to get confused between model names. For example, authors could use subscripts on “EXP” and “XAJ” to indicate the process based, alternative hybrid, and differentiable hybrid model versions.
Given that the comparison between the two meteorological products may not be valid, as one is directly used to produce the training data – it would be valuable to discuss the differences between the model types more in the results. For example, more specific explanations of differences in behavior between the pure machine learning model and the hybrid versions (maybe just given ERA5 as the forcing).
Similarly, all figure captions could use more detail. For example many terms and abbreviations are not described and specific panels not described in detail, and some color scales seem to not match between panels. Several specific figure-related comments are included below.
Minor comments:
Figure 4: Figure 4a makes it look like only VP is an input to the LSTM but I think it should be all the forcing variables? As with other figures more description here would be useful.
Figure 5: the legends show the colors as continuous but the dots in the maps make it seem discrete (that there are 5 colors). It would be better if the legend reflected the ranges for these 5 colors.
Figure 9: The color scales are all different, so it makes it hard to compare between the four panels. Same in Figure 10 – why are the color scales different for NSE for the two different forcing cases?
Line 230: think there is a word missing “provided by the originates”
Line 421: cross to across
Citation: https://doi.org/10.5194/egusphere-2025-1161-RC2
AC4: 'Reply on RC2', Chunxiao Zhang, 13 Jul 2025
Dear Reviewer:
First of all, I would like to sincerely thank you for your recognition and support of our work. Your comments reflect your deep professional expertise. We attach great importance to your feedback and will carefully consider your valuable suggestions. After reading similar studies in this journal, we hoped to carry out comparable research in China, although in the actual experiments we did encounter some difficulties that are hard to resolve for the time being. This may also be one of the reasons why large-sample hydrological research is not yet common in China. Nevertheless, we have kept an exploratory spirit and hope that, through our efforts, we can make some preliminary attempts and encourage further attention and research from relevant scholars. Before addressing your specific suggestions one by one, we would like to clarify the following points:
- Regarding the source of the runoff data: We acknowledge that our original manuscript lacked a clear explanation of the runoff data source. In the revised version, we have made targeted improvements to the data section, adding a more detailed description and a more complete explanation of how the runoff data were obtained. We believe these revisions will help you and other readers evaluate the reliability and validity of our results more effectively.
- Regarding the experimental setup: One of our main motivations for using two meteorological forcing products was to assess the robustness of the models under different input conditions. We have clarified this intention in the experimental design section. In addition, we have added a new discussion section on how the choice of the runoff data product affects the interpretation of the model comparison. These clarifications are intended to provide greater transparency about our experimental setup.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers' comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn
RC3: 'Comment on egusphere-2025-1161', Anonymous Referee #3, 02 Jul 2025
This paper presents a comparison of three approaches to hydrological modelling in a large sample of catchments in China. Methods compared are classic process/physics based models, deep learning models, and hybrid architectures. In theory, this is a good idea and probably could be useful to the community. However, the paper lacks scientific rigour and the chosen methodology is flawed in multiple aspects. I will detail these below as well as provide some suggestions for a future version of this paper, which still has some potential. I will start with the major issues first and list some minor points afterwards.
1. The first major point is the fact that the target variable is not observed data, but an output from VIC. While I understand that this might be required for confidentiality reasons, it also means that the results cannot be trusted or generalized elsewhere. This is because results are compared to the outputs of VIC and, as such, the models are rewarded if they emulate VIC, rather than simulate real streamflow. And since VIC has its own biases, strengths and weaknesses, we are simply evaluating the ability of these new modeling techniques to emulate the same biases. To make this point, I contend that we could obtain NSE values of 1 if we simply used VIC as one of the models in this study. Would it mean that VIC is much better than other models? Of course not, and the same is true with the relationship between these new models and the VIC outputs. The same goes for internal variable analysis: models that are more similar to VIC will perform better. The problem with this whole approach is that we cannot learn from results vs application in the real world, because the response surface of the optimization problem is much, much smoother and easier to navigate than one using real observations, which are uncertain and error-prone. Models will never perform as well on real data as on these synthetic data. Therefore I think this study has a very limited reach and usefulness while using synthetic streamflow data.
2. The second major issue is linked to the previous one, and that is the use of CN05.1 as the input data to VIC, which is then used again as an input in the other models. This means that any model using CN05.1 will most likely perform better than another using the ERA5 dataset, simply because the processes are artificial and conditioned on the use of CN05.1. In the study, there are a few sections commenting on how CN05.1 performs better than ERA5 (ex lines 467-469: "When using CN05.1 precipitation data, the median NSE for LSTM in regional modeling and PUB reached an impressive 0.95 and 0.93, respectively."). These results, while contextualized by noting that this stems from its use in VIC, still give the impression that CN05.1 is better than ERA5, which is not true given the evaluation method presented here. This issue would not arise if VIC was not used at all (as per my point #1 above), but if the authors decide to continue using VIC for a revised version of this paper, they need to simply remove CN05.1 as one of the datasets in the comparison to be fully independent. Doing so would at least allow simplifying the paper enough that they could then delve into the analysis of internal variables of the hybrid models, which seems to me a key advantage but is not discussed or evaluated in the present paper.
3. This issue is related to the way the deep learning and hybrid models are trained, and has two distinct sub-issues. We often see this from hydrologists that work with deep learning models for the first time and is a common mistake that is easy to make but has important consequences. The first is the fact that the authors have only 2 periods of data: Training (or calibration; 1975-1995) and a testing period (1995-2015). While adequate for PBM calibration, this is simply unacceptable for Deep-learning models. These models need 3 periods of data: Training, Validation, and Testing. Training is used on the forward pass and backpropagation steps to tune the weights and parameters according to the chosen gradient descent method. The model is then evaluated on the Validation period after each epoch, and the objective function score is computed on that period specifically. The model training is then stopped when the validation period loss stops improving and starts regressing. Finally, when the model has stopped improving and the training is stopped, then the model performance is evaluated on the third, independent Testing period. Failing to stop training will inevitably lead to overfitting and unreliable results, which is the case here. Therefore, a revised study should follow best practices and add an independent testing period for the deep learning models and also the PBM which should share the same testing period. I also note that there are no details on the selected objective function for training the LSTMs nor do we know how this was computed for the regional models? How are error/loss metrics calculated on multiple basins at the same time? A few studies proposed some methods to do this, for example just in HESS see Kratzert et al (2019) and Arsenault et al. (2023) listed below. The second sub-point is that the authors state: "All input data are normalized before training" without further details on how this was done. This is critical, because the data need to preserve independence between the [Training] and [Validation; Testing] periods. Data need to be normalized using a scaler of some sort (which one was used?) using the training period data, and then the scaler is applied to the validation and testing data. Failing to do so means that the testing period data are included in the scaler and as such the training will benefit from knowing the scale of data it can expect to get. This is called data contamination and needs to be avoided. Nothing in the paper at this stage seems to suggest this was performed. As such, I believe the results are flawed and performance is overestimated in this study.
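To illustrate the data-contamination point above, the sketch below shows one common way to split a record chronologically into training, validation, and testing periods and to fit a standardization scaler on the training period only; the variable names, array shapes, and period lengths are illustrative, not taken from the manuscript.

```python
import numpy as np

def chronological_split(forcing, flow, n_train, n_val):
    """Split (time, features) arrays into consecutive train/val/test blocks."""
    return {
        "train": (forcing[:n_train], flow[:n_train]),
        "val": (forcing[n_train:n_train + n_val], flow[n_train:n_train + n_val]),
        "test": (forcing[n_train + n_val:], flow[n_train + n_val:]),
    }

def fit_scaler(x_train):
    """Compute mean and std from the training period only, to avoid leakage."""
    return x_train.mean(axis=0), x_train.std(axis=0) + 1e-8

def apply_scaler(x, mean, std):
    """Apply the training-period scaler to any period (val, test, or train)."""
    return (x - mean) / std

# Illustrative usage with synthetic data: 40 years of daily forcing (2 variables).
rng = np.random.default_rng(0)
forcing = rng.normal(size=(14600, 2))
flow = rng.normal(size=(14600, 1))

splits = chronological_split(forcing, flow, n_train=7300, n_val=3650)
mean, std = fit_scaler(splits["train"][0])            # statistics from training data only
x_val = apply_scaler(splits["val"][0], mean, std)
x_test = apply_scaler(splits["test"][0], mean, std)
```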
Given these points, I am of the opinion that the paper needs major modifications and that it should be in a quite different format if resubmitted. In any case, I have some other suggestions to improve an eventual future version of the paper here.
4. Line 32-33: This seems trivial that better inputs will lead to better modelling. I would remove.
5. Line 33: At this stage, readers don't know what hybrid modelling refers to. Please add a few key details to set the table, maybe a 1-sentence description.
6. Line 76: "solve" is a strong word. Perhaps "help address"?
7. Line 86-88: This is also true for deep learning models (perhaps even more so than PBMs!)
8. Lines 90-91: "LSTMs... can effectively capture nonlinear relationships...": so do PBMs, depending on the structure. The difference really lies in the model learning the relationships between weather and flows without humans providing any physical sense.
9. Lines 97-98: redundant sentence with the previous.
10. Lines 127-130: Would need a bit more details. Are the equations of the model kept as-is? Is it an emulator of the PBM? Do the parameters preserve the same meaning?
11. Line 214: Any reason why ERA5-Land is not used? Should be better/more precise especially in mountainous areas?
12. Line 230: "provided by the originates" : missing word here.
13. Line 258 (and multiple others): many times the word "relatively" is used to tone down some element. I suggest rephrasing to say that they are accessible or some other word that would be more precise. Same goes everywhere.
14. Line 295: all process based models follow this law of water balance, I would remove.
15. Line 300-320: Xin'anjiang does not model snow processes? How is it used in mountainous and other basins where snow is present?
16. Line 349: This is quite a high initial learning rate. What is the learning rate decay rate or function? Also, what is the objective function used? What is the model training patience for the stopping criterion? Is there a stopping criterion, or do all runs lead to 150 iterations? If not, at 150 training iterations the model will definitely be overfitting and providing poor results compared to a well-tuned model. (A minimal early-stopping sketch is given after this list.)
17. Line 363: capture nonlinear relationships that evade the physics depicted in the PBMs
18. Figure 4: PMB is used throughout the study, I would change PBHM to PBM
19. Lines 395-397: It seems to me that CN05.1 is more evenly distributed but has more lower extremes, opposed to what is written here.
20. Line 409: water-heat? perhaps mass-energy?
21. Figure 8: this figure has 2 panel "b". Also this figure needs more details in the legends and caption to fully understand, it is unclear. Add details to what each panel refers to / is presenting.
22. Line 467: These two references do not support the statement that NSE>=0.55 is good. Knoben says 0.50, Newman says it means the model has some skill, and NSE=0.8 shows reasonably good performance. Please clarify.
23. Line 534-537: Indeed, since XAJ does not have snow process representation?
24. Figure 11: there are two B panels. 2nd B panel missing numbers of the overall source/destination bins. Same comments as for Figure 8.
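To make the stopping-criterion question in point 16 concrete, the sketch below shows a generic patience-based early-stopping loop with a simple exponential learning-rate decay. The callables `train_one_epoch` and `validation_loss`, and all hyperparameter values, are placeholders rather than anything reported in the manuscript.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              lr0=0.01, decay=0.95, patience=10, max_epochs=150):
    """Generic training loop: stop once the validation loss has not improved
    for `patience` consecutive epochs, instead of always running max_epochs."""
    best_val, best_epoch, wait, lr = float("inf"), 0, 0, lr0
    for epoch in range(max_epochs):
        train_one_epoch(lr)              # one pass over the training period
        val = validation_loss()          # loss on the independent validation period
        if val < best_val:
            best_val, best_epoch, wait = val, epoch, 0   # improvement: reset patience
        else:
            wait += 1                    # no improvement this epoch
            if wait >= patience:
                break                    # stop before overfitting sets in
        lr *= decay                      # exponential learning-rate decay
    return best_epoch, best_val
```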
References
-Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019.
-Arsenault, R., Martel, J.-L., Brunet, F., Brissette, F., and Mai, J.: Continuous streamflow prediction in ungauged basins: long short-term memory neural networks clearly outperform traditional hydrological models, Hydrol. Earth Syst. Sci., 27, 139–157, https://doi.org/10.5194/hess-27-139-2023, 2023.
Citation: https://doi.org/10.5194/egusphere-2025-1161-RC3
AC5: 'Reply on RC3', Chunxiao Zhang, 13 Jul 2025
Dear Reviewer:
First of all, we would like to express our sincere gratitude for your careful review and critical evaluation of our manuscript. Your comments reflect a deep understanding of hydrological modeling practices, and we greatly appreciate the time and effort you invested in providing such detailed and constructive feedback.
We fully recognize that our study still has room for improvement in terms of scientific rigor and clarity. In particular, your thoughtful observations regarding the use of simulated runoff data, the dependence on CN05.1, and the training procedures for deep learning models have prompted us to re-examine our methodology and presentation more thoroughly.
Our overarching objective in this work is to promote large-sample hydrological modeling research in China. Despite notable progress globally, this line of research is still in its early stages in China due to challenges in consistent data availability, basin delineation, and unified model evaluation protocols. These difficulties are precisely what motivated us to attempt the construction of a consistent large-sample dataset and to evaluate multiple modeling approaches as a foundational step. We believe such efforts, though imperfect in their initial form, are valuable in initiating broader engagement from the hydrological community. Before addressing your specific concerns point-by-point, we would like to clarify three key aspects that underpin our original study design and also guide the revisions we have made:
- Regarding the source of runoff data: We acknowledge that the initial manuscript did not clearly explain the nature and origin of the target runoff variable. In the revised version, we have revised the data section to make it explicit that the runoff data is derived from the VIC-CN05.1 product. We further emphasize that this is not observational data, and we discuss the implications of using such a product for model training and evaluation.
- Regarding the experimental setup and use of forcing data: One motivation for comparing models under two widely used meteorological forcing products (CN05.1 and ERA5-Land) was to assess the relative consistency and robustness of model performance rankings under different input conditions. However, we now acknowledge that using CN05.1 as both the driver of the VIC-generated runoff and a forcing input introduces a structural dependence. To mitigate misinterpretation, we have carefully revised our manuscript, particularly in the abstract, results, and conclusion, to avoid overemphasizing the performance of CN05.1 and instead explicitly acknowledge the potential bias arising from this design.
- Regarding the model training procedures and data preprocessing: In response to your comments, we have clarified our training/testing split in the revised Methodology section, added details about the loss functions, and described the normalization procedure. We ensured that no data leakage occurred during training by applying scalers derived solely from the training period.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers’ comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn
RC4: 'Comment on egusphere-2025-1161', Anonymous Referee #4, 02 Jul 2025
* Summary
The authors present a study of runoff prediction across a wide range of watersheds in China.
They compare the performance of simple hydrologic models, LSTM models, and hybrid methods that combine the two in recreating a runoff time series and water budget closure. Performance comparisons and interpretation consider two different forcing datasets, which appear to have substantial effects on simulated runoff.
* General comments
The work represents an ambitious and admirable effort with considerable potential.
The inclusion of LSTM and hybrid hydrologic model - LSTM approaches is novel and provides an interesting dimension to the analysis.
Overall, some additional explanation and clarification would help better connect the stated goals of the work with what was actually done.
Main issues that should be addressed to improve the rigor and value of the manuscript:
- The question of what model structures (including those that incorporate LSTM functionality) can best represent the relevant hydrologic processes across a continental scale domain with varying landscape and hydroclimatic characteristics is an important one. Providing useful answers to this sort of question relies on some interrogation and analysis of which model structures are associated with better performance. The authors compile a dataset of basin characteristics but make no connection between that and model performance variation. This manuscript would be improved with more explanation and interpretation of how known differences in model structures relate to better or worse model performance - and better address the stated goal of providing "scientific guidance for selection and application of hydrologic models" [line 166-167]
- The performance of models appears to be evaluated by comparing to the results of a different (VIC) model. The practical need for this approach (lack of consistent and comprehensive streamflow data) is understandable, but a rationale that justifies how this provides meaningful insight is not provided. For example - does the analysis reflect which model structures are most similar to VIC, or do they provide some broader insight about a "true" or "best" hydrologic model for different watersheds and regions? Additional explanation and justification is needed to clarify this component of the study.
- It is not entirely clear what the value of using the ERA5 and CN05.1 forcing datasets is when the evidence suggests the CN05.1 dataset provides better consistency with local conditions and better performance (albeit in comparison to a model also forced with CN05.1). It seems that if the performance were being evaluated against true runoff or streamflow observations, model performance (such as with NSE as in Figure 7) would provide a meaningful basis for interpretation. As presented, the inclusion of models driven by ERA5 forcing complicates (and potentially obscures) a clear interpretation of the appropriateness of model structures. For example - how does training/calibrating a model that uses the ERA5 forcing against a target based on the CN05.1 forcing generate reliable information?
- The water budget analysis is a valuable complement to the runoff performance comparisons. However, it is unclear exactly how the closure error should be interpreted. The water balance presented implies the closure error may include changes in watershed storage (groundwater, snow, deep soil, etc.) that may or may not be well represented in the models. Some further clarification and explanation is needed to justify the interpretation that the "smaller value of epsilont, the better the water budget balance of the basin" [lines 580-581].
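For readers less familiar with the metric: one common formulation of the relative water budget closure error over an evaluation window, which may differ in detail from the authors' definition of epsilon_t, is

```latex
\epsilon_t = \frac{P_t - ET_t - Q_t - \Delta S_t}{P_t}
```

where P_t is precipitation, ET_t evapotranspiration, Q_t runoff, and \Delta S_t the change in basin storage. Values of epsilon_t close to zero indicate good closure; the reviewer's point is that \Delta S_t (groundwater, snow, deep soil moisture) is often unobserved, which complicates interpreting a small closure error as model quality.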
* Other comments
- In general the figures are well done and informative. Their effectiveness could be improved in many cases by 1) enlarging the axis and label text and 2) providing more informative and more complete captions and annotations. In many figures (e.g. Figure 11) it is difficult to directly discern the many types of information being presented.
- The methods and interpretation used for the "prediction in ungaged basins (PUB)" portion of the analysis is a bit confusing. Some more specific explanation that covers 1) the intent of this analysis and 2) how it is different from the other performance comparisons would make the paper much more effective.
- Check for redundant, extraneous, or unclear sentences. Some examples that would benefit from revision:
- Lines 96-98: "type of model excels in data collaboration....This type of model performs well in data-driven collaboration..."
- Lines 127-128: "..neuralizes the process-based model and adjusts model parameters by back propagating gradients based on daily prediction results.."
- Lines 321-322: "With the continuous advancement of deep learning technology, its applications in the field of hydrology are also expanding."
Citation: https://doi.org/10.5194/egusphere-2025-1161-RC4
AC6: 'Reply on RC4', Chunxiao Zhang, 13 Jul 2025
Dear Reviewer:
We would like to sincerely thank you for your recognition and thoughtful comments on our manuscript. Your feedback reflects a deep understanding of hydrological modeling and provides valuable insights for improving the rigor and clarity of our work. Our study aims to conduct a large-sample hydrological modeling benchmark across China using multiple modeling paradigms, including process-based models (PBMs), deep learning models (LSTMs), and differentiable hybrid models. After reviewing several large-sample hydrology studies published in this journal, we were inspired to carry out similar research in China. While we encountered several practical challenges, particularly the limited availability of long-term, consistent observational data, we believe that this work represents a preliminary but meaningful step toward large-sample hydrology in data-sparse regions.
Your comments helped to highlight key aspects that needed clarification, giving us confidence to revise and improve our study. Before responding to your specific points, we would like to briefly clarify the following three main issues:
- Regarding the source of runoff data: We acknowledge that the original paper did not fully explain the rationale for using VIC-generated runoff data as a reference for model performance evaluation. We chose VIC-CN05.1 runoff as a surrogate for observed data because of the spatial coverage and temporal span required for large-sample model evaluation. We agree that the VIC model itself has limitations, so we have clarified in the revision that this analysis focuses on the consistency and physical interpretability of the models under a unified reference, rather than asserting an absolute predictive advantage.
- Regarding the two forcing datasets: We agree that the comparison between ERA5-Land and CN05.1-driven model results could be misinterpreted if not clearly contextualized. We have now revised the relevant sections to better explain that this comparison is not intended to directly assess which dataset is "better," but rather to understand how meteorological forcing uncertainty can propagate through different models and affect both runoff prediction and hydrological consistency.
- We have responded to your suggestions one by one (in italics, with specific issues numbered) and attached a document that provides detailed responses to each of your comments, and we hope that the changes will improve clarity and remove any ambiguity.
Thank you for your patience, please see below for a point-by-point response to the reviewers’ comments and concerns.
Sincerely,
Chunxiao Zhang
zcx@cugb.edu.cn