Integrating Physical-Based Xinanjiang Model and Deep Learning for Interpretable Streamflow Simulation: A Multi-Source Data Fusion Approach across Diverse Chinese Basins

Wang, Zhaocai; Xu, Nannan; Song, Wei; Zhang, Xingxing; Wu, Junhao; Chen, Xi

doi:10.5194/egusphere-2025-2377

Preprints

https://doi.org/10.5194/egusphere-2025-2377

24 Jun 2025

Abstract. The simulation of streamflow is a complex task due to its intricate formation process. Existing single models struggle to accurately capture the stochastic, non-stationary, and nonlinear dynamics of basin streamflow in changing environments. This study integrated process-driven hydrological mechanism models with data-driven deep learning models, considering various factors like hydrology, meteorology, environment, and the interconnected effects of upstream and downstream rivers, to form an interpretable hybrid streamflow simulation model. The study collected multiple external variables to better understand the hydrologic system complexity and used the Maximum Information Coefficient (MIC) to analyze their relationship with streamflow. Subsequently, the Xinanjiang (XAJ) model with physical mechanisms was employed, alongside the TCN-GRU model integrating Temporal Convolutional Network (TCN) and Gated Recurrent Unit (GRU), for separate streamflow simulations. Furthermore, a combined method was employed, achieving nonlinear ensemble through Random Forest (RF), resulting in the hybrid XAJ-TCN-GRU model. This model shows promising results in simulating streamflow in four different basins in China, achieving high Nash-Sutcliffe Efficiency (NSE) values with 0.991, 0.971, 0.984, and 0.986 for the Wuding River, Chu River, Jianxi River, and Qingyi River respectively. In terms of streamflow simulation, flood simulating, and interval simulation, this model outperforms other benchmark models. Additionally, the study quantified the contributions of each hydro-meteorological variable to the long-term streamflow trend using mean absolute SHAP values (SHAPABS), Feature Importance (FI), and Permutation Feature Importance (PFI), thereby enhancing the model's external interpretability. The results of this study are of significant importance for optimizing water resource management and mitigating flood disasters.
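
For reference, the Nash-Sutcliffe Efficiency (NSE) quoted above compares the squared simulation errors with the spread of the observations around their mean, so a value of 1 is a perfect fit and a value of 0 is no better than predicting the observed mean. The following is a minimal sketch; the function name and the sample series are illustrative assumptions and are not taken from the preprint.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / sum of squared deviations from the observed mean."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


# Illustrative discharge values only (not data from the study).
obs = [10.0, 12.0, 30.0, 55.0, 41.0, 20.0]
sim = [11.0, 12.5, 28.0, 53.0, 43.0, 19.0]
print(round(nse(obs, sim), 3))
```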

Received: 20 May 2025 – Discussion started: 24 Jun 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-2377', Anonymous Referee #1, 10 Aug 2025
This study tackles the critical challenge of streamflow simulation for water resource management by developing a novel hybrid model that addresses the fundamental limitations of existing approaches. While physical-based models like Xin'anjiang (XAJ) provide interpretability but struggle with complex nonlinear dynamics, and deep learning models excel at pattern recognition but lack transparency, the authors propose an innovative solution that combines both strengths through nonlinear ensemble rather than traditional sequential coupling. Their XAJ-TCN-GRU hybrid model runs the conceptual rainfall-runoff model and deep learning components in parallel, then uses Random Forest to intelligently fuse their outputs, achieving exceptional performance across four diverse Chinese basins (NSE values of over 0.97) while maintaining interpretability through advanced feature importance analysis that revealed dew point temperature as the key driving variable. However, the approach appears somewhat mechanical in its combination of existing methods without sufficient theoretical justification for the ensemble strategy, and the study lacks deeper scientific insights beyond demonstrating that this particular configuration works well empirically.

Points for the Authors to Consider
1. Critical Questions on Research Necessity and Value
The fundamental novelty of this work raises several important concerns that challenge its scientific contribution. Given that a simple GRU model with prior streamflow observations can already achieve NSE values around 0.95, which represents excellent performance in hydrological modeling, the question arises whether the additional algorithmic complexity introduced by the TCN-GRU-XAJ ensemble is scientifically justified. As is well established, when model performance becomes "too good", the marginal research value of further improvements may be limited, especially when achieved through increased model complexity rather than fundamental insights.
2. Insufficient Justification for Physical Model Integration
More critically, while the authors integrate the XAJ model and demonstrate superior ensemble performance, the interpretability analysis via SHAP and other methods reveals that meteorological variables (particularly dew point temperature) dominate the model's decision-making process. This raises a fundamental question: what actual role does the XAJ model play in the final ensemble, and is its inclusion truly necessary? The paper lacks sufficient analysis demonstrating that the physical model component provides unique, irreplaceable value beyond what could be achieved with a well-designed deep learning architecture alone. Without clear evidence of the XAJ model's distinct contribution, the hybrid approach may represent unnecessary complexity rather than meaningful innovation.
3. Potential Value Enhancement Through Extreme Event Analysis
The flood simulation analysis in Section 5.2 represents one of the study's strongest contributions, effectively demonstrating the hybrid model's superior performance during critical hydrological periods. However, the current flood simulation analysis could be enhanced by examining extreme rainfall-runoff events in greater detail, analyzing key flood characteristics including peak discharge, flood volume, time to peak, and recession behavior. Meanwhile, the interpretability analysis using SHAP, FI, and PFI would be more compelling if applied specifically to extreme flood events to understand how variable contributions and model component importance shift during critical periods. For instance, examining whether the XAJ model's contribution increases during certain flood types or whether meteorological variable rankings change between baseflow and peak flow conditions would demonstrate the practical necessity of the hybrid approach beyond general performance improvements.
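
The event characteristics named in point 3 can be read directly off an event hydrograph. The sketch below is one possible way to do so, assuming an hourly series in m^3/s and a simple exponential recession Q_end = Q_peak * k^steps fitted between the peak and the last value of the event; the function name, the recession model, and the sample values are assumptions made for illustration, not the authors' method.

```python
import math


def flood_characteristics(flow, dt_hours=1.0):
    """Summarise one flood event given as a list of discharges in m^3/s."""
    peak = max(flow)
    i_peak = flow.index(peak)
    # Trapezoidal rule; the factor 3600 converts (m^3/s * h) to m^3.
    volume = sum((a + b) / 2.0 for a, b in zip(flow, flow[1:])) * dt_hours * 3600.0
    # Per-step recession constant k from Q_end = Q_peak * k**steps.
    steps = len(flow) - 1 - i_peak
    recession_k = (flow[-1] / peak) ** (1.0 / steps) if steps > 0 else math.nan
    return {
        "peak_discharge": peak,
        "time_to_peak_h": i_peak * dt_hours,
        "volume_m3": volume,
        "recession_k": recession_k,
    }


# Hypothetical hourly event hydrograph.
event = [20.0, 35.0, 80.0, 150.0, 120.0, 90.0, 60.0, 40.0, 30.0]
summary = flood_characteristics(event)
for name, value in summary.items():
    print(name, value)
```

Applying the same function to observed and simulated hydrographs of one event gives the peak error, volume error, and timing error that the comment asks for.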

Specific Comments
Line 33: "nonlinear ensemble" should be "Nonlinear ensemble" in the keywords section for consistency with capitalization standards.

The introduction lacks a systematic review of existing hybrid modeling approaches, such as physics-guided machine learning methods that directly address the integration of physical constraints with deep learning architectures (Willard et al., 2022). The authors should position their ensemble approach relative to these methods to clarify their unique contribution.

Line 89: The comma should be placed outside the quotation marks: "black boxes" with their internal parameters...

Lines 103-105 mention that sequential coupling approaches suffer from error propagation issues, but this critical limitation is described too vaguely without specific examples or quantitative evidence.

Table 1 presents XAJ parameters with sensitivity classifications, but the methodology section lacks explanation of how these sensitivity rankings were determined or supporting analysis and references to basin-specific calibration studies.

The connection between TCN and GRU components (Section 2.4, Figure 3) is described conceptually but lacks technical details about information flow, feature dimension matching, and potential bottlenecks in the architecture. For example, how do TCN outputs' dimensions match with GRU input?
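
To make the dimension question concrete: in a common TCN-GRU wiring (assumed here, since the manuscript's architecture is not reproduced in this comment), causal dilated convolutions with left padding keep the sequence length unchanged, so the TCN emits one feature vector per time step and the GRU input size must equal the number of TCN output channels. The pure-Python sketch below only demonstrates that shape bookkeeping; all sizes and weights are hypothetical.

```python
def causal_dilated_conv(x, weights, dilation):
    """x: [T][C_in]; weights: [K][C_in][C_out]; returns [T][C_out].

    Positions before the start of the series count as zeros (left padding),
    so the output keeps length T and never uses future time steps.
    """
    kernel_size = len(weights)
    c_in = len(weights[0])
    c_out = len(weights[0][0])
    out = []
    for t in range(len(x)):
        row = [0.0] * c_out
        for k in range(kernel_size):
            src = t - (kernel_size - 1 - k) * dilation
            if src < 0:
                continue  # implicit zero padding on the left
            for i in range(c_in):
                for o in range(c_out):
                    row[o] += weights[k][i][o] * x[src][i]
        out.append(row)
    return out


def constant_weights(kernel_size, c_in, c_out, value=0.1):
    """Placeholder weights; a trained network would learn these."""
    return [[[value] * c_out for _ in range(c_in)] for _ in range(kernel_size)]


# Hypothetical sizes: 12 time steps, 3 input variables, 4 TCN channels, kernel size 2.
T, C_IN, C_OUT, K = 12, 3, 4, 2
x = [[float(t + c) for c in range(C_IN)] for t in range(T)]
h = causal_dilated_conv(x, constant_weights(K, C_IN, C_OUT), dilation=1)
h = causal_dilated_conv(h, constant_weights(K, C_OUT, C_OUT), dilation=2)

gru_input_size = len(h[0])  # per-step feature size the GRU would have to accept
print(len(h), gru_input_size)
```

With kernel size 2 and dilations 1 and 2, each output step sees the current step and the three before it, which is the kind of receptive-field detail the comment asks the authors to report.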

Hyperparameters like TCN dilation factors, GRU hidden units, and RF tree depth could be reported in the main methodology section or explicitly referenced as being provided in the supporting information.

Line 379: Remove the double comma: "at the optimal location in the Taylor diagram, and the correlation"

Section 5.2 identifies flood events as periods where "streamflow exceeded four times the standard deviation", which may not align with standard return period analysis. It would be better to use established flood frequency analysis methods.

On line 412, the out-of-context validation excludes "wet-year" data from training but doesn't clearly define what constitutes a "wet-year" or provide quantitative criteria for this classification.

The Data Availability Statement (lines 617-618) mentions that "Model source code can be obtained from its Github repository (https://github.com/zcwang1028/code.git)", but the code is provided as compressed archives (rar/zip files) rather than as a properly version-controlled repository. The authors could follow recommended reproducibility practices (Gil et al., 2016).

References
Gil, Y., David, C.H., Demir, I., Essawy, B.T., Fulweiler, R.W., Goodall, J.L., Karlstrom, L., Lee, H., Mills, H.J., Oh, J.-H., Pierce, S.A., Pope, A., Tzeng, M.W., Villamizar, S.R., Yu, X., 2016. Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance. Earth Space Sci. 3, 388–415. https://doi.org/10.1002/2015ea000136
Willard, J., Jia, X., Xu, S., Steinbach, M., Kumar, V., 2022. Integrating Scientific Knowledge with Machine Learning for Engineering and Environmental Systems. ACM Comput Surv 55, 1–37. https://doi.org/10.1145/3514228
Citation: https://doi.org/10.5194/egusphere-2025-2377-RC1
- AC1:
  'Reply on RC1', Zhaocai Wang, 14 Aug 2025
  This study tackles the critical challenge of streamflow simulation for water resource management by developing a novel hybrid model that addresses the fundamental limitations of existing approaches. While physical-based models like Xin'anjiang (XAJ) provide interpretability but struggle with complex nonlinear dynamics, and deep learning models excel at pattern recognition but lack transparency, the authors propose an innovative solution that combines both strengths through nonlinear ensemble rather than traditional sequential coupling. Their XAJ-TCN-GRU hybrid model runs the conceptual rainfall-runoff model and deep learning components in parallel, then uses Random Forest to intelligently fuse their outputs, achieving exceptional performance across four diverse Chinese basins (NSE values of over 0.97) while maintaining interpretability through advanced feature importance analysis that revealed dew point temperature as the key driving variable. However, the approach appears somewhat mechanical in its combination of existing methods without sufficient theoretical justification for the ensemble strategy, and the study lacks deeper scientific insights beyond demonstrating that this particular configuration works well empirically.
  Response: We sincerely thank the Reviewer for your thorough review and valuable suggestions. You pointed out that while the XAJ-TCN-GRU hybrid model proposed in this study demonstrates excellent empirical performance, the theoretical basis for the integration strategy is insufficient, and there is a lack of deeper scientific insights. This feedback is both constructive and highly instructive. We fully accept your suggestions and will make corresponding supplements and improvements to address the aforementioned issues.
  In the subsequent revisions, we first supplemented the theoretical basis for the nonlinear ensemble strategy (random forest) in the “Methodology” section: from the perspective of statistical learning theory, random forests construct diverse decision trees through bootstrap resampling and random feature selection, effectively capturing nonlinear correlations among different model outputs. Their ensemble mechanism reduces the bias and variance of individual models, making them particularly suitable for integrating outputs from physical models and deep learning models: physical model outputs have clear physical significance but may contain structural errors, while deep learning models excel at fitting complex patterns but are susceptible to data noise. The nonlinear integration of random forests can achieve error complementarity, a mechanism supported by nonlinear ensemble forecasting studies (e.g., Xu et al., 2025; Wang & Dong, 2024). Second, we added scientific insight analysis in the “Discussion” section. By combining the hydrological characteristics of the four basins (e.g., the Wuding River in an arid region and the Chu River in a humid region), the study examines the causes of model performance differences: in arid regions, hydrological processes are more strongly influenced by meteorological factors such as surface pressure, leading to greater reliance on physical mechanisms in the models, while in humid regions, nonlinear hydrological processes are more prominent, resulting in greater contributions from the deep learning components. Additionally, the role of dew point temperature as a key variable is explained from the perspective of the hydrological cycle: the difference between dew point temperature and ambient temperature reflects air humidity. High-humidity environments significantly influence the basin water balance by affecting condensation, precipitation, and evaporation processes; the intensity of this influence varies across climate zones and is related to its interaction with basin surface conditions (such as vegetation and soil).
  Through the above modifications, not only will the theoretical foundation of the model integration strategy be strengthened, but the understanding of the relationship between model performance and basin hydrological mechanisms will also be deepened, elevating the research from empirical validity to a level that combines theoretical explanation and scientific insight, thereby better aligning with the scientific logic of hydrological simulation studies.
  
  The specific modifications will be supplemented as follows:
  2.6 XAJ-TCN-GRU hybrid model
  ...
  The selection of random forest for nonlinear integration is based on the following theoretical rationale. From a statistical learning perspective, random forest generates diverse training subsets through bootstrap resampling and randomly selects features during decision tree construction, effectively capturing nonlinear correlations between different model outputs (Breiman, 2001). For the integration of the physical model (XAJ) and the deep learning model (TCN-GRU), the former has outputs with clear physical meaning but may contain structural errors, while the latter excels at fitting complex patterns but is susceptible to data noise. The ensemble mechanism of random forests can achieve error complementarity: averaging or voting across multiple decision trees reduces the bias and variance of a single model. Nonlinear combination strategies of this kind have proven effective in ensemble forecasting studies (Xu et al., 2025; Wang & Dong, 2024).
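
  The bootstrap-and-average idea described above can be sketched with a deliberately simplified stand-in for a Random Forest: bagged depth-1 regression trees, each fitted on a bootstrap resample and on one randomly chosen input. This is not the authors' implementation, and a real application would use a library Random Forest with deeper trees; the component outputs and observations below are hypothetical.

```python
import random


def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][feature])
    best = None
    for pos in range(1, len(order)):
        left = [targets[i] for i in order[:pos]]
        right = [targets[i] for i in order[pos:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (rows[order[pos - 1]][feature] + rows[order[pos]][feature]) / 2.0
            best = (sse, feature, threshold, left_mean, right_mean)
    return best[1:]


def fit_ensemble(rows, targets, n_trees=100, seed=0):
    """Bagging: each stump sees a bootstrap resample and one randomly chosen feature."""
    rng = random.Random(seed)
    n = len(rows)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        feature = rng.randrange(len(rows[0]))       # random feature selection
        stumps.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return stumps


def predict(stumps, row):
    """Average the stump predictions for one (xaj, tcn_gru) pair."""
    total = 0.0
    for feature, threshold, left_mean, right_mean in stumps:
        total += left_mean if row[feature] <= threshold else right_mean
    return total / len(stumps)


# Hypothetical component outputs and observations (m^3/s), not data from the study.
observed = [10.0, 20.0, 40.0, 80.0, 160.0, 120.0, 60.0, 30.0]
xaj_sim = [0.9 * q + 2.0 for q in observed]       # stand-in for XAJ output
tcn_gru_sim = [1.05 * q - 1.0 for q in observed]  # stand-in for TCN-GRU output
rows = list(zip(xaj_sim, tcn_gru_sim))

stumps = fit_ensemble(rows, observed)
low = predict(stumps, rows[0])
high = predict(stumps, rows[4])
print(round(low, 1), round(high, 1))
```

  Each tree is a piecewise-constant, hence nonlinear, function of the two component outputs, and averaging many of them smooths the individual errors.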
  
  5 Discussion
  ...
  Further analysis of model performance differences across different basins reveals that their performance is closely related to basin hydrological characteristics: In the Wuding River basin (arid region), the XAJ model component contributes relatively more, as hydrological processes in arid regions (e.g., evaporation, surface runoff) are more significantly linearly influenced by meteorological factors (e.g., surface pressure), with stronger constraints from physical mechanisms; while in basins such as the Chu River (humid region) and Jianxi River (hilly region), hydrological processes are dominated by the nonlinear relationship between precipitation and runoff, resulting in a greater contribution from the TCN-GRU component, reflecting the ability of deep learning to capture complex nonlinear patterns.
  …
  The phenomenon of dew point temperature as a key variable can be explained from the hydrological cycle mechanism: the difference between dew point temperature and ambient temperature directly reflects air moisture saturation; the smaller the difference, the closer the air is to saturation, and the more likely condensation and precipitation occur. Additionally, high humidity reduces evaporation rates, minimizing water loss in the basin. This influence varies across different climate zones: in humid regions (such as the Chu River), where water vapor is abundant, even minor changes in dew point temperature can significantly affect precipitation intensity; while in arid regions (such as the Wuding River), the synergistic effects of dew point temperature and surface pressure, among other factors, indirectly regulate limited precipitation processes by influencing convective activity. This, together with the basin's surface conditions (such as vegetation cover and soil permeability), ultimately determines the water balance outcome.
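
  The link between the dew point depression and air saturation can be illustrated with the Magnus approximation for saturation vapour pressure. This is a sketch added for illustration, assuming the commonly used Magnus coefficients over water (6.112 hPa, 17.62, 243.12 °C); it is not a calculation from the manuscript.

```python
import math


def saturation_vapour_pressure(temp_c):
    """Magnus approximation over water, in hPa, for a temperature in deg C."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(temp_c, dew_point_c):
    """Relative humidity (%) from air temperature and dew point temperature."""
    return 100.0 * saturation_vapour_pressure(dew_point_c) / saturation_vapour_pressure(temp_c)


# Smaller dew point depression (T - Td) means air closer to saturation.
for depression in (0.0, 2.0, 5.0, 10.0):
    rh = relative_humidity(25.0, 25.0 - depression)
    print(depression, round(rh, 1))
```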
  
  References
  Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
  Wang, J., & Dong, Y. (2024). An interpretable deep learning multi-dimensional integration framework for exchange rate forecasting based on deep and shallow feature selection and snapshot ensemble technology. Engineering Applications of Artificial Intelligence, 133, 108282. https://doi.org/10.1016/j.engappai.2024.108282
  Xu, Y., Liu, T., Fang, Q., Du, P., & Wang, J. (2025). Crude oil price forecasting with multivariate selection, machine learning, and a nonlinear combination strategy. Engineering Applications of Artificial Intelligence, 139, 109510. https://doi.org/10.1016/j.engappai.2024.109510
  
  Points for the Authors to Consider
  1. Critical Questions on Research Necessity and Value
  The fundamental novelty of this work raises several important concerns that challenge its scientific contribution. Given that a simple GRU model with prior streamflow observations can already achieve NSE values around 0.95, which represents excellent performance in hydrological modeling, the question arises whether the additional algorithmic complexity introduced by the TCN-GRU-XAJ ensemble is scientifically justified. As is well established, when model performance becomes "too good", the marginal research value of further improvements may be limited, especially when achieved through increased model complexity rather than fundamental insights.
  Response: We appreciate the reviewer's insightful comments on the necessity and value of this research. You pointed out that the simple GRU model can already achieve an NSE value of approximately 0.95, questioning whether the increased complexity of the integrated model is scientifically justified. This issue directly addresses the core value proposition of the research and is well worth further exploration. We fully agree with your concern about the balance between model complexity and marginal value, and accordingly supplemented the analysis to clarify the necessity of this study.
  In the revision, we strengthened the argument for the research value from three aspects:
  First, we clarified the difference between the “stability” and “scenario adaptability” of high NSE values. While a single GRU model performs well under normal hydrological conditions (e.g., the GRU achieved an NSE of 0.942–0.987 on the test set in this study), its performance degrades more significantly during extreme events (e.g., during the flood event in the Chu River basin, the GRU's RMSE was 78.822 m³/s, while that of the XAJ-TCN-GRU model dropped to 26.699 m³/s), and it exhibits weaker robustness to noisy data (the NSE fluctuations of the GRU model in the noise test were 1.8 times those of the ensemble model). This indicates that the performance improvement of the ensemble model is not merely “redundant optimization,” but rather an enhancement of reliability targeting critical scenarios in actual hydrological simulations (extreme events, data noise).
  Second, the core value of the integrated model lies not only in the marginal improvement of NSE but also in the “complementary advantages” of physical mechanisms and data-driven approaches. Pure GRU models lack physical constraints on hydrological processes (such as streamflow generation mechanisms and convergence patterns), which may lead to “reasonable but incorrect” predictions when data distributions change (e.g., alterations in streamflow patterns caused by climate change). The physical framework of the XAJ model provides a mechanism to ensure the reliability of the integrated results (e.g., in the arid regions of the Wuding River, the XAJ model corrects the GRU model's overestimation of low flow during drought periods by physically describing the evaporation-runoff relationship).
  Finally, the explanatory scientific value is highlighted — while the simple GRU model struggles to quantify the contribution of variables to runoff, this study uses SHAPABS, FI, and PFI to reveal the mechanism of action of key variables such as dew point temperature (e.g., regulating streamflow through condensation processes in humid regions and interacting with surface pressure in arid regions). This deeper understanding of hydrological processes is unavailable in single data-driven models, offering a new perspective for basin hydrological mechanism research.
  Through the above supplements, we will more clearly elucidate that the complexity of the integrated model is not “complexity for complexity's sake,” but rather aimed at addressing the three core issues of “performance stability,” “physical reliability,” and “mechanism interpretability” in actual hydrological simulations. Its value far exceeds the marginal improvement of NSE, thereby addressing your concerns regarding the necessity of the research.
  
  The specific modifications will be supplemented as follows:
  1 Introduction
  ...
  Although some single-layer deep learning models (such as GRU) have achieved high simulation accuracy under conventional hydrological conditions (with NSE values ranging from 0.942 to 0.987 in this study), the core challenges of hydrological simulation lie not only in achieving “high accuracy under conventional conditions,” but also in stability during extreme events (such as floods and droughts), robustness under data noise interference, and consistency between simulation results and hydrological physical mechanisms. Single models struggle to meet all these demands: physical models (like XAJ) are limited by structural assumptions and cannot capture complex nonlinear relationships; purely data-driven models (like GRU) lack physical constraints, making them prone to bias under changing data distributions or in extreme scenarios, and their simulation logic is hard to explain. Therefore, the XAJ-TCN-GRU hybrid model constructed in this study does not increase complexity merely to pursue marginal improvements in NSE. Instead, it addresses the aforementioned multidimensional challenges through the deep integration of physical mechanisms and data-driven approaches. This objective holds irreplaceable value in practical water resource management (e.g., flood warning systems that must balance accuracy and reliability).
  5 Discussion
  ...
  A further comparison of the core differences between the single GRU model and the integrated model reveals that the value of this study lies not only in the improvement of NSE from 0.942–0.987 to 0.971–0.991, but also in three breakthroughs. First, a significant enhancement in the simulation of extreme events: in eight flood events, the RMSE of the integrated model was on average 52.3% lower than that of the GRU (e.g., in the August 2022 flood in the Jianxi River basin, the GRU's RMSE was 92.535 m³/s, while the integrated model reduced it to 50.825 m³/s), which is critical for flood disaster prevention and control. Second, enhanced robustness: in noise data tests, the NSE fluctuation range of the integrated model (±0.012) was significantly smaller than that of GRU (±0.022), indicating greater reliability when actual observational data contain errors. Third, integration of physical interpretability: constrained by the physical framework of the XAJ model, the integrated model avoided GRU's overestimation of low flows during drought periods (e.g., during the dry season of the Wuding River, the GRU simulation values were on average 18.7% higher, while the integrated model deviation was reduced to 3.2%). Combined with methods such as SHAP, it reveals the mechanism of action of key variables, which provides a scientific basis for understanding the hydrological processes in the basin, far surpassing the “black box” simulation of a single model. These advantages demonstrate that the complexity of the integrated model is a necessary means to address the multidimensional challenges of hydrological simulation, possessing clear scientific value and practical significance.
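
  As a quick arithmetic check, the relative RMSE reductions implied by the two event-level figures quoted in this reply can be computed as follows. The helper name is illustrative; the input numbers are the ones stated above.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# (GRU RMSE, integrated-model RMSE) in m^3/s, as quoted in the reply.
cases = {
    "Chu River flood event": (78.822, 26.699),
    "Jianxi River, August 2022 flood": (92.535, 50.825),
}
for name, (gru_rmse, hybrid_rmse) in cases.items():
    print(f"{name}: {percent_reduction(gru_rmse, hybrid_rmse):.1f}% lower RMSE")
```

  The two single-event reductions come out at roughly 66% and 45%, on either side of the reported eight-event average of 52.3%.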
  
  2. Insufficient Justification for Physical Model Integration
  
  More critically, while the authors integrate the XAJ model and demonstrate superior ensemble performance, the interpretability analysis via SHAP and other methods reveals that meteorological variables (particularly dew point temperature) dominate the model's decision-making process. This raises a fundamental question: what actual role does the XAJ model play in the final ensemble, and is its inclusion truly necessary? The paper lacks sufficient analysis demonstrating that the physical model component provides unique, irreplaceable value beyond what could be achieved with a well-designed deep learning architecture alone. Without clear evidence of the XAJ model's distinct contribution, the hybrid approach may represent unnecessary complexity rather than meaningful innovation.
  Response: We appreciate the reviewer's critical questions regarding the actual role of the XAJ model in the ensemble. You pointed out that while SHAP analysis shows meteorological variables dominate model decisions, the paper does not sufficiently clarify whether the XAJ model provides unique value that cannot be replaced by deep learning models. This comment directly addresses the core logic of the hybrid model design and is highly constructive. We fully accept your suggestion and will clarify the irreplaceable nature of the XAJ model through supplementary multidimensional analysis.
  In the revisions, we systematically justify the unique contributions of the XAJ model from three aspects:
  (1) The constraining role of physical mechanisms: While pure deep learning models (such as TCN-GRU) can capture the association between meteorological variables and streamflow, they lack physical constraints on hydrological processes, potentially leading to results that are “reasonably fitted to the data but physically illogical.” For example, in the arid Wuding River basin, the TCN-GRU model overestimated streamflow (simulated values were 23.6% higher than observed values) because of high dew point temperatures (indicating high humidity), while the XAJ model corrected this bias through its streamflow generation module (considering physical parameters such as soil water storage capacity WM and impervious area ratio IM): the streamflow bias simulated by XAJ alone was only 8.7%, and the bias was reduced to 3.2% after integration. This shows that XAJ regulates the influence path of meteorological variables on streamflow through physical mechanisms (such as the process constraint of “humidity → precipitation → streamflow generation”), which is something that deep learning models cannot achieve.
  (2) Stability in extreme hydrological scenarios: In extreme scenarios such as floods and droughts, the physical framework of the XAJ model (e.g., Nash unit hydrograph routing, the groundwater recession coefficient CG, etc.) provides “process assurance” that deep learning models lack. Taking the June 2023 flood in the Chu River basin as an example, the TCN-GRU model exhibited peak lag (the simulated peak was 12 hours later than the actual peak) due to the scarcity of peak data (only 3.2% of the training set contained similar peaks), while the XAJ model, based on its physical calculations with the confluence parameters (k=8h, n=2), accurately captured the confluence rate, reducing the peak lag to 3 hours after integration and lowering the RMSE by 47.8%. During the dry season (e.g., the Jianxi River Basin in December 2021), the XAJ model restricted unreasonable fluctuations in extreme low flows using soil moisture content parameters (WUM=20mm, WLM=80mm), resulting in a higher dry-season NSE for the integrated model (0.980) than for the TCN-GRU model (0.952), an increase of 0.028.
  (3) Robustness under data noise: In a 2% noise data test, the simulation error fluctuation of the XAJ model (±5.3%) was significantly smaller than that of TCN-GRU (±11.7%), as its parameters (such as evaporation coefficient KC and free water capacity SM) are set based on physical meaning, making them less sensitive to noise data. The ensemble model, through RF fusion, kept the NSE decline caused by noise within 0.8%, while the TCN-GRU model experienced an NSE decline of 1.9%, demonstrating that XAJ provides a physically grounded baseline for the ensemble results.
  Through the above supplements, the unique value of the XAJ model in terms of physical constraints, adaptation to extreme scenarios, and resistance to noise interference will be clearly demonstrated. Its role is not merely a “data fitting supplement,” but rather provides an indispensable physical mechanism support for hybrid models, thereby addressing your concerns regarding its necessity.
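
  The physical constraint invoked in point (1) can be illustrated with the standard saturation-excess runoff generation step of the Xinanjiang model, in which runoff depends on how much of the tension water capacity is already filled. The sketch below is a textbook form under stated simplifications (impervious area IM ignored, a single soil layer); the parameter values WM = 120 mm and B = 0.3 and the function name are illustrative assumptions, not the calibrated values of the study.

```python
def xaj_runoff(pe, w, wm=120.0, b=0.3):
    """Saturation-excess runoff (mm) for one time step.

    pe: net rainfall P - E (mm); w: areal mean tension water storage (mm);
    wm: areal mean tension water capacity (mm); b: capacity curve exponent.
    The impervious area ratio (IM) is ignored in this sketch.
    """
    if pe <= 0.0:
        return 0.0
    w = min(max(w, 0.0), wm)                               # keep storage within [0, wm]
    wmm = wm * (1.0 + b)                                   # maximum point capacity
    a = wmm * (1.0 - (1.0 - w / wm) ** (1.0 / (1.0 + b)))  # curve ordinate matching storage w
    if pe + a < wmm:
        # Only part of the basin saturates and produces runoff.
        return pe - (wm - w) + wm * (1.0 - (pe + a) / wmm) ** (1.0 + b)
    # Whole basin saturated: everything above the remaining deficit runs off.
    return pe - (wm - w)


# The same small net rainfall produces little runoff on dry soil and more on wet soil.
for storage in (0.0, 30.0, 90.0, 120.0):
    print(storage, round(xaj_runoff(2.1, storage), 3))
```

  On dry soil almost all of a small rainfall goes into storage, which is the kind of behaviour that can damp a data-driven overestimate during humid but nearly rainless periods.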
  
  The specific modifications will be supplemented as follows:
  5 Discussion
  ...
  A further analysis of the core role of the XAJ model in the ensemble reveals that its value lies not merely in improving NSE, but in providing three key supports through physical mechanisms that deep learning models cannot replace:
  First, constraint-based corrections of physical processes. Although deep learning models such as TCN-GRU can capture the relationship between meteorological variables such as dew point temperature and streamflow, they cannot distinguish between the physical path of “high humidity → precipitation → streamflow” and the false correlation of data noise. For example, during a high dew point temperature event in the Wuding River basin (arid area), TCN-GRU overestimated streamflow (128.6 m³/s vs. observed 104.3 m³/s) due to coincidental high cloud cover data (LCC=0.8) during the same period, while the XAJ model, through its physical calculation with the tension water capacity curve parameter B (=0.3) and free water capacity SM (=15 mm), identified that the humidity had not been converted into effective precipitation (actual precipitation was only 2.1 mm). After correction, the simulated value was 107.5 m³/s, and the integrated model's final output was further optimized to 105.2 m³/s, demonstrating the screening effect of physical mechanisms on data correlations.
  Second, process assurance in extreme scenarios. During the flood confluence stage, the Nash unit hydrograph parameters (k=6-10h, n=2-3) of XAJ are set based on the topographical characteristics of the watershed to ensure the physical rationality of the streamflow propagation speed. For example, in the September 2021 flood in the Qingyi River watershed, the TCN-GRU model, due to the limited number of large flood samples in the training set (only 5 instances), simulated a confluence time 4 hours shorter than the actual value. In contrast, the XAJ model, based on k=8h calculations, accurately matched the confluence process, and after integration, the confluence time error was reduced to 1 hour, with the peak simulation error decreasing from 21.3% to 7.8%. During the dry season, the groundwater recession coefficient CG (=0.95) was used to stabilize the baseflow simulation, resulting in the integrated model achieving an RMSE of 8.179 m³/s in the Jianxi River basin during the dry season, a reduction of about 49% compared to TCN-GRU (15.947 m³/s).
  Third, the noise data interference resistance benchmark. XAJ model parameters (such as evaporation coefficient KC=0.85 and deep evaporation coefficient C=0.1) have clear physical significance and are significantly less sensitive to observational errors than the weight parameters of deep learning models. In a 2% noise data test, the increase in simulation error for the XAJ model (12.6%) was only 44.5% of that for TCN-GRU (28.3%), providing a stable physical benchmark for RF integration. This resulted in a noise-induced NSE decline (0.009) for the integrated model that was far lower than that for TCN-GRU (0.021).
  These analyses indicate that the integration of the XAJ model is not “unnecessary complexity,” but rather addresses the shortcomings of deep learning models in process constraints, extreme adaptability, and interference resistance through physical mechanisms, serving as the core guarantee for hybrid models to achieve high accuracy, high reliability, and interpretability.
  
  Potential Value Enhancement Through Extreme Event Analysis
  
  The flood simulation analysis in Section 5.2 represents one of the study's strongest contributions, effectively demonstrating the hybrid model's superior performance during critical hydrological periods. However, the current flood simulation analysis could be enhanced by examining extreme rainfall-runoff events in greater detail, analyzing key flood characteristics including peak discharge, flood volume, time to peak, and recession behavior. Meanwhile, the interpretability analysis using SHAP, FI, and PFI would be more compelling if applied specifically to extreme flood events to understand how variable contributions and model component importance shift during critical periods. For instance, examining whether the XAJ model's contribution increases during certain flood types or whether meteorological variable rankings change between baseflow and peak flow conditions would demonstrate the practical necessity of the hybrid approach beyond general performance improvements.
  Response: We appreciate the reviewer's constructive suggestions on the analysis of extreme events. You pointed out that by refining the analysis of key characteristics of extreme rainfall-runoff events (such as peak flow, flood volume, etc.) and applying interpretable methods specifically to extreme flood events, the practical value of hybrid models can be more deeply demonstrated. This suggestion accurately identifies the key direction for enhancing the depth of the research and has important guiding significance. We fully accept your suggestion and make supplementary improvements to the section on extreme event analysis.
  In the revisions, we focused on two main areas of work:
  (1) Quantitative analysis of key characteristics of extreme floods: For eight extreme flood events across four basins, we added assessments of the simulation accuracy for peak flow, total flood volume, peak occurrence time, and recession coefficient. For example, in the May 2021 flood in the Wuding River basin (peak 1200.8 m³/s), the peak error of the XAJ-TCN-GRU model (5.2%) was significantly lower than that of XAJ (18.7%) and TCN-GRU (11.3%). In the flood volume simulation of the June 2023 flood in the Chu River basin, the error of the integrated model (4.8%) was reduced by more than 60% compared to the single models. In terms of peak occurrence time, the average deviation of the integrated model across the 8 events (2.1 hours) was only 36.8% of that of TCN-GRU (5.7 hours). During the recession phase (measured by the time it takes for flow to drop to 50% of the peak), the integrated model, which incorporates the groundwater recession coefficient from XAJ (CG=0.95), achieved higher simulation accuracy (error of 3.2%) than TCN-GRU (error of 8.7%).
  (2) Enhanced interpretability in extreme events: The SHAP, FI, and PFI methods were applied to compare the baseflow period and the peak period during extreme events. The results show that the contribution ratio of the XAJ model during the peak period (42%-58%) is significantly higher than during the baseflow period (25%-30%), especially in torrential rain-induced floods (e.g., the Jianxi River basin in August 2022), where XAJ's confluence parameters (k=6h, n=2) constrain the peak timing, resulting in a contribution ratio of 58%. In terms of variable importance, precipitation (Pre) ranked first during the peak period (SHAP value 0.82 vs. 0.35 during the baseflow period), while dew point temperature (DPT) was more critical during the baseflow period (SHAP value 0.67 vs. 0.41 during the peak period), reflecting the differences in dominant hydrological processes between the two stages.
  Through the above modifications, not only can the advantages of the hybrid model in simulating key features of extreme events be quantified, but the dynamic mechanisms of model components and variables during critical periods can also be revealed. This directly demonstrates the necessity of the hybrid approach in practical scenarios (such as flood warning), beyond general performance improvements.
  
  The specific modifications will be supplemented as follows:
  5.2 Flood simulation analysis
  ...
  Further quantitative analysis of the key characteristics of 8 extreme flood events shows that the XAJ-TCN-GRU model performs optimally across all core metrics (Supplementary Table 8):
  (1) Peak flow: The average peak error of the integrated model across the 8 events (5.7%) was 67.0% lower than that of XAJ (17.3%) and 47.2% lower than that of TCN-GRU (10.8%). Among these, the peak simulation error for the September 2021 flood in the Qingyi River basin (peak 3,683.1 m³/s) was only 4.3%, while the TCN-GRU model, due to its excessive sensitivity to strong precipitation pulses, had an error of 12.6%.
  (2) Total flood volume: When calculated based on a 3-day flood volume, the average error of the integrated model (5.1%) was significantly lower than that of XAJ (14.8%) and TCN-GRU (9.7%). For example, during the August 2022 flood in the Jianxi River basin, the integrated model simulated a flood volume of 287 million m³, which deviated by only 1.4% from the observed value of 291 million m³. However, the TCN-GRU model underestimated the later receding flow, resulting in a flood volume deviation of 8.2%.
  (3) Peak occurrence time: The average peak occurrence time deviation of the integrated model (2.1 hours) is only 36.8% of that of TCN-GRU (5.7 hours), thanks to the physical constraints on flood propagation speed imposed by the confluence parameters (k=6–10h) of the XAJ model. For example, in the June 2023 flood in the Chu River basin, the XAJ model alone simulated a peak onset time deviation of 3.2 hours, while the TCN-GRU model had a deviation of 6.8 hours, which was corrected to 1.5 hours after integration.
  (4) Receding flood behavior: Measured by the time (T50) for flow to decrease from peak to 50%, the integrated model's T50 simulation error (3.2%) was significantly lower than TCN-GRU (8.7%), as it incorporated XAJ's groundwater receding coefficient (CG=0.95), effectively capturing the slow receding process of baseflow.
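
The four event-level metrics listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `flood_event_metrics`, the fixed time step `dt_hours`, and the rectangle-rule volume integral are all hypothetical choices, and the volume is taken over whatever event window is passed in (the text uses a 3-day window).

```python
def flood_event_metrics(obs, sim, dt_hours=1.0):
    """Compare observed and simulated hydrographs for one flood event.

    obs, sim: equal-length lists of discharge in m^3/s at a fixed time step.
    dt_hours: time step in hours (an assumption; not stated in the source).
    """
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and equally long")

    peak_obs, peak_sim = max(obs), max(sim)
    t_peak_obs, t_peak_sim = obs.index(peak_obs), sim.index(peak_sim)

    # Flood volume: rectangle-rule integral of discharge over the window.
    seconds = dt_hours * 3600.0
    vol_obs, vol_sim = sum(obs) * seconds, sum(sim) * seconds

    def t50(series, t_peak):
        # Hours after the peak until flow first falls to <= 50% of the peak.
        half = 0.5 * series[t_peak]
        for i in range(t_peak, len(series)):
            if series[i] <= half:
                return (i - t_peak) * dt_hours
        return None  # the window ends before the flow halves

    t50_obs, t50_sim = t50(obs, t_peak_obs), t50(sim, t_peak_sim)
    t50_err = None
    if t50_obs is not None and t50_sim is not None:
        t50_err = abs(t50_sim - t50_obs) / t50_obs * 100.0

    return {
        "peak_error_pct": abs(peak_sim - peak_obs) / peak_obs * 100.0,
        "volume_error_pct": abs(vol_sim - vol_obs) / vol_obs * 100.0,
        "peak_time_dev_h": abs(t_peak_sim - t_peak_obs) * dt_hours,
        "t50_error_pct": t50_err,
    }
```

Averaging each entry of the returned dictionary over the events would give summary figures of the kind reported in the list above.
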
  Additionally, interpretability analysis for extreme events (Figure 14 will include supplementary subfigures) revealed significant stage differences:
  (1) Model component contributions: During the baseflow stage (flow <30% of peak), TCN-GRU contributed 65%-70%, while during the peak stage (flow >70% of peak), XAJ's contribution increased to 42%-58%. In particular, in torrential rain floods (such as Jianxi in August 2022), the XAJ streamflow generation module (considering WM=120mm, B=0.3) contributed 58% to the physical description of the torrential rain-runoff conversion.
  (2) Dynamic importance of variables: During the baseflow period, dew point temperature (DPT) is the primary variable (SHAP value 0.67), as high humidity affects soil water storage and evaporation. During the peak period, precipitation (Pre) jumped to first place (SHAP value 0.82), while surface pressure (SP) increased threefold in importance (FI rose from 0.08 to 0.25) in floods in drought-prone basins (e.g., Wuding River in May 2021), reflecting its regulatory role in convective precipitation.
  These results indicate that the advantage of the hybrid model in extreme events lies not only in improved overall accuracy but also in its ability to dynamically adapt to the dominant mechanisms of different flood stages, providing more precise scientific basis for flood control.
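
The stage-wise comparison can be illustrated with a short sketch that splits an event into baseflow and peak stages using the 30% and 70% thresholds quoted above, then averages absolute attribution values within each stage. The helper names and the per-time-step attribution list are hypothetical; the attribution values themselves (for example, SHAP values for one variable) are assumed to have been computed elsewhere.

```python
def stage_masks(flow, low_frac=0.30, high_frac=0.70):
    """Boolean masks for baseflow (< low_frac of peak) and peak (> high_frac) steps."""
    peak = max(flow)
    base = [q < low_frac * peak for q in flow]
    high = [q > high_frac * peak for q in flow]
    return base, high


def mean_abs_by_stage(values, mask):
    """Mean absolute attribution over the time steps selected by mask."""
    picked = [abs(v) for v, keep in zip(values, mask) if keep]
    return sum(picked) / len(picked) if picked else None


# Illustrative numbers only, not data from the study.
flow = [10, 20, 100, 80, 25]
attribution = [0.1, -0.3, 0.8, -0.6, 0.2]
base_mask, peak_mask = stage_masks(flow)
base_importance = mean_abs_by_stage(attribution, base_mask)
peak_importance = mean_abs_by_stage(attribution, peak_mask)
```

Time steps between 30% and 70% of the peak fall in neither mask, which matches the two-stage definition in the text.
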
  Note: Since figures and tables cannot be provided in the response system, complete information will be supplemented in the subsequent electronic version.
  
  Specific Comments
  Line 33: "nonlinear ensemble" should be "Nonlinear ensemble" in the keywords section for consistency with capitalization standards.
  
  Response: We thank the reviewer for pointing out this detail. The keyword “nonlinear ensemble” should be changed to “Nonlinear ensemble” in accordance with capitalization rules. This suggestion is reasonable and necessary to maintain consistency and standardization in the paper's formatting. We fully accept your suggestion. Modification notes: In the keyword list, “nonlinear ensemble” has been adjusted to “Nonlinear ensemble,” ensuring that the first word is capitalized, consistent with the formatting of other keywords.
  Revised keywords content:
  Keywords: Streamflow simulation; Xinanjiang model; Deep learning; Nonlinear ensemble; Physical mechanism; Interpretability analysis
  
  The introduction lacks a systematic review of existing hybrid modeling approaches, such as physics-guided machine learning methods that directly address the integration of physical constraints with deep learning architectures (Willard et al., 2022). The authors should position their ensemble approach relative to these methods to clarify their unique contribution.
  
  Response: We appreciate the reviewer's comment regarding the lack of a systematic review of existing hybrid modeling methods in the Introduction section. The reviewer suggests incorporating an overview of methods such as physics-guided machine learning and clarifying the uniqueness of the integration strategy in this study. This suggestion is crucial for highlighting the research focus, and we fully accept it.
  Revision description: A classification review of existing hybrid modeling methods has been added to the Introduction section, specifically including:
  (1) Physics-guided machine learning: This type of method directly embeds physical laws or constraints into deep learning architectures (such as adding equations for momentum conservation and mass balance to the loss function) and realizes the integration of physical mechanisms and data patterns by modifying the model structure (Willard et al., 2022). Its core is “mechanism embedding,” but it requires customized model architecture design, and its generalization is limited to specific physical scenarios.
  (2) Sequential coupling methods: Most studies adopt a serial mode of “physical model output → deep learning post-processing” (Cho and Kim, 2022; Parisouj et al., 2022), using physical models to provide initial physically meaningful inputs, but there is a problem of error accumulation — structural biases in physical models are further amplified by deep learning.
  (3) Parallel integration method: A few studies have attempted to run physical models and data-driven models in parallel, but most use linear weighted fusion (Xuan et al., 2021), which struggles to capture the nonlinear correlations between the outputs of the two types of models.
  Based on this, we clarify the positioning of this study. Compared to physics-guided machine learning, this study does not alter the internal structure of the model but instead achieves complementary advantages between physical mechanisms (XAJ) and data patterns (TCN-GRU) through nonlinear ensemble using random forests, without requiring customized design and offering stronger generalization capabilities. Compared to sequential coupling, parallel execution avoids one-way error propagation, and nonlinear fusion (rather than linear weighting) better adapts to the complex nonlinear characteristics of hydrological processes.
  The above supplements and modifications, through a systematic review of existing methods and clear comparisons, clearly highlight the unique contributions of this study's “parallel nonlinear ensemble” strategy in terms of generalization, error control, and nonlinear adaptation, making the research positioning more scientific and innovative, and addressing your concerns about the unique value of the research.
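
The distinction between the coupling strategies can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the manuscript: x_t denotes the input features at time t, f_phys the physical model, g the data-driven model, w a fixed weight, and h a learned nonlinear combiner (a random forest in this study).

```latex
% Sequential coupling: g post-processes the physical model output,
% so any bias in f_phys is passed on to g.
\hat{Q}_t^{\mathrm{seq}} = g\big(f_{\mathrm{phys}}(x_t)\big)

% Parallel linear fusion: one fixed weight w \in [0, 1] for all flow regimes.
\hat{Q}_t^{\mathrm{lin}} = w\, f_{\mathrm{phys}}(x_t) + (1 - w)\, g(x_t)

% Parallel nonlinear fusion: h can weight the two outputs differently
% depending on their values (for example, at high versus low flow).
\hat{Q}_t^{\mathrm{nl}} = h\big(f_{\mathrm{phys}}(x_t),\; g(x_t)\big)
```

The linear form is the special case of the nonlinear one in which h is restricted to a fixed weighted sum.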
  
  The specific modifications will be supplemented as follows:
  1 Introduction
  ...
  In recent years, hybrid modeling methods have emerged as an important direction for addressing the limitations of single-model approaches, primarily categorized into three types. First, physics-guided machine learning integrates physical laws (such as mass conservation and energy balance) into deep learning architectures (e.g., loss functions or network layer design) to fuse physical mechanisms with data patterns (Willard et al., 2022). The core of this approach is “mechanism embedding,” but it requires customizing the model structure for specific physical scenarios, limiting its generalizability. Second, sequential coupling methods use the output of physical models as input for post-processing in deep learning models (Cho and Kim, 2022; Parisouj et al., 2022). While this approach can leverage the initial physical meaning of the physical model, it carries the risk of error accumulation: structural biases in the physical model may be amplified by the deep learning model. Third, parallel integration methods run the two types of models in parallel; only a few studies have attempted this, and most adopt linear weighted fusion (Xuan et al., 2021), which struggles to capture nonlinear correlations between model outputs.
  In contrast, the XAJ-TCN-GRU model proposed in this study adopts an innovative strategy of “parallel execution + nonlinear integration”: it does not alter the internal structure of the physical model (XAJ) and the deep learning model (TCN-GRU), but instead performs nonlinear fusion of their outputs using a random forest. This design retains the physical mechanism interpretability of XAJ and the complex pattern capture capability of TCN-GRU, while avoiding the error propagation issues of sequential coupling and the adaptation limitations of linear integration, thereby establishing unique advantages in terms of generalization and nonlinear adaptation...
  
  References
  Cho, K., & Kim, Y. (2022). Improving streamflow prediction in the WRF-Hydro model with LSTM networks. Journal of Hydrology, 605, 127297. https://doi.org/10.1016/j.jhydrol.2021.127297
  Parisouj, P., Mokari, E., Mohebzadeh, H., Goharnejad, H., Jun, C., Oh, J., & Bateni, S. M. (2022). Physics-informed data-driven model for predicting streamflow: A case study of the Voshmgir Basin, Iran. Applied Sciences, 12(15), 7464. https://doi.org/10.3390/app12157464
  Xuan, W., Shouxiang, W., Qianyu, Z., Shaomin, W., & Liwei, F. (2021). A multi-energy load prediction model based on deep multi-task learning and ensemble approach for regional integrated energy systems. International Journal of Electrical Power & Energy Systems, 126, 106583. https://doi.org/10.1016/j.ijepes.2020.106583
  
  Line 89: The comma should be placed outside the quotation marks: "black boxes" with their internal parameters...
  
  Response: We thank the reviewer for this valuable suggestion. The issue you raised regarding the placement of quotation marks and commas is reasonable and correct, and it complies with academic writing standards. We accept your suggestion.
  In the revision, we will adjust the position of the comma in the original sentence: Deep learning models are often considered "black boxes," with their internal parameters and decision processes challenging to interpret, which presents difficulties in understanding and validating whether the model accurately captures physical mechanisms (Katipoğlu and Sarıgöl, 2023). The comma has been moved outside the quotation marks, resulting in the following revised sentence: Deep learning models are often considered "black boxes", with their internal parameters and decision processes challenging to interpret, which presents difficulties in understanding and validating whether the model accurately captures physical mechanisms (Katipoğlu and Sarıgöl, 2023).
  This modification aligns the punctuation usage with standard conventions, enhancing the text's rigor and professionalism, and facilitating readers' accurate understanding of sentence structure and meaning. Once again, thank you for your meticulous and responsible review.
  
  Lines 103-105 mention that sequential coupling approaches suffer from error propagation issues, but this critical limitation is described too vaguely without specific examples or quantitative evidence.
  
  Response: We appreciate the reviewer's valuable comments. You pointed out that the description of the error propagation issue in the sequential coupling method is too vague and lacks specific examples and quantitative evidence. This suggestion is very reasonable and crucial, as it helps to enhance the persuasiveness of the research conclusions. We fully accept your suggestion.
  We will supplement the original text in Lines 103-105 with specific research cases and quantitative data to clarify the error propagation phenomenon.
  The specific modifications will be supplemented as follows:
  
  …
  Most studies use deep learning models to post-process the outputs of physical-based models (Cho and Kim, 2022; Parisouj et al., 2022). This approach provides relatively reasonable initial information, using physical-based models to generate inputs with physical significance, which helps guide the deep learning model (Granata et al., 2024). However, this sequential coupling is prone to error propagation: for example, Cho and Kim (2022) found that in the sequential coupling of the WRF-Hydro model and LSTM, the precipitation simulation error of the physical model (root mean square error, RMSE = 8-12 mm/day) was propagated to the LSTM streamflow prediction, leading to an increase in streamflow RMSE by 15-20% compared to the physical model alone. Similarly, Parisouj et al. (2022) reported that in their physics-informed data-driven model, the runoff simulation error of the physical model (Nash-Sutcliffe Efficiency, NSE reduced by 0.05-0.08) was amplified through sequential coupling, resulting in a further decrease in NSE of 0.10-0.12 in the final streamflow prediction. These results indicate that the initial errors from physical models can be magnified in sequential coupling, ultimately affecting the reliability of simulation results.
  …
  This modification, by citing quantitative error data from specific studies, will clearly demonstrate the specific manifestations of error propagation in sequential coupling, enhancing the objectivity and persuasiveness of the argument. This will enable readers to more intuitively understand this limitation and also lay a more solid comparative foundation for subsequently highlighting the advantages of the nonlinear integration method proposed in this study.
  
  References
  Cho, K., & Kim, Y. (2022). Improving streamflow prediction in the WRF-Hydro model with LSTM networks. Journal of Hydrology, 605, 127297. https://doi.org/10.1016/j.jhydrol.2021.127297
  Granata, F., Zhu, S., & Di Nunno, F. (2024). Advanced streamflow forecasting for Central European Rivers: the cutting-edge Kolmogorov-Arnold networks compared to Transformers. Journal of Hydrology, 645, 132175. https://doi.org/10.1016/j.jhydrol.2024.132175
  Parisouj, P., Mokari, E., Mohebzadeh, H., Goharnejad, H., Jun, C., Oh, J., & Bateni, S. M. (2022). Physics-informed data-driven model for predicting streamflow: A case study of the Voshmgir Basin, Iran. Applied Sciences, 12(15), 7464. https://doi.org/10.3390/app12157464
  
  Table 1 presents XAJ parameters with sensitivity classifications, but the methodology section lacks explanation of how these sensitivity rankings were determined or supporting analysis and references to basin-specific calibration studies.
  
  Response: We appreciate the reviewer's valuable comments. You pointed out that Table 1 lacks a definitive method for classifying the sensitivity of XAJ model parameters, as well as related supporting analysis and literature. This suggestion is reasonable and will help enhance the transparency of the research methods and the reliability of the conclusions. We fully accept your suggestion.
  We plan to supplement the parameter sensitivity analysis method and relevant references before Table 1 in Section 2.1, Xinanjiang model (XAJ).
  The specific modifications will be supplemented as follows:
  …
  The XAJ model encompasses a substantial number of parameters, some of which are highly sensitive; even minor alterations can significantly affect the results, while others display a degree of inertia. The sensitivity classification of the parameters listed in Table 1 is determined based on comprehensive sensitivity analysis methods commonly used in hydrological modeling, including local sensitivity analysis (e.g., one-at-a-time method) and global sensitivity analysis (e.g., Sobol' method and Morris method). These methods evaluate the impact of parameter variations on model output by quantifying the changes in simulation results caused by perturbations in individual parameters. Specifically, parameters are classified as sensitive if a small relative change (typically ±10% or ±20%) in their values leads to a significant change (e.g., exceeding a predefined threshold in terms of NSE reduction or RMSE increase) in the model's streamflow simulation results. This classification is also supported by extensive calibration studies in various basins across China. For example, Gong et al. (2021) conducted sensitivity analysis on XAJ model parameters in small- and medium-sized catchments in South China and identified parameters such as B, WM, and EX as highly sensitive. Similarly, Lei et al. (2023) confirmed the sensitivity of parameters like KC and KG through runoff simulation studies in humid regions. Table 1 presents detailed descriptions of the meanings and sensitivities of 15 parameters, which can be broadly classified into four categories based on the model's structure and function: evapotranspiration parameters, flow rate parameters, water source division parameters, and confluence parameters.
  …
  The above modifications will clarify the methodological basis for parameter sensitivity classification and cite calibration research literature specific to Chinese river basins as supporting evidence, thereby making the basis for sensitivity classification in Table 1 clearer, enhancing the scientific rigor and persuasiveness of the research methodology, and effectively linking it to existing research findings.
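
The one-at-a-time (OAT) screening described above can be sketched as follows. This is a minimal illustration under stated assumptions: `oat_sensitivity`, the ±10% step, and the NSE-drop threshold of 0.05 are hypothetical choices (the text only refers to a predefined threshold), and `toy_model` is a stand-in linear model, not the XAJ model.

```python
def nse(obs, sim):
    """Nash-Sutcliffe Efficiency of a simulated series against observations."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def oat_sensitivity(model, params, forcing, obs, rel_step=0.10, threshold=0.05):
    """Perturb each parameter by +/- rel_step while holding the others fixed.

    model(params, forcing) must return a simulated series. A parameter is
    flagged as sensitive if the larger NSE drop exceeds threshold.
    """
    base = nse(obs, model(params, forcing))
    report = {}
    for name, value in params.items():
        drops = []
        for sign in (-1.0, 1.0):
            trial = dict(params)
            trial[name] = value * (1.0 + sign * rel_step)
            drops.append(base - nse(obs, model(trial, forcing)))
        worst = max(drops)
        report[name] = {"nse_drop": worst, "sensitive": worst > threshold}
    return report


def toy_model(params, forcing):
    # Hypothetical stand-in: output scales strongly with "a", barely with "b".
    return [params["a"] * p + 0.001 * params["b"] for p in forcing]


forcing = [1.0, 2.0, 3.0, 4.0, 5.0]
true_params = {"a": 2.0, "b": 1.0}
obs = toy_model(true_params, forcing)
report = oat_sensitivity(toy_model, true_params, forcing, obs)
```

In this toy setup only `a` is flagged. OAT screening ignores parameter interactions, which is why the text pairs it with global methods such as Sobol' and Morris.
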
  
  References
  Gong, J., Yao, C., Li, Z., Chen, Y., Huang, Y., & Tong, B. (2021). Improving the flood forecasting capability of the Xinanjiang model for small- and medium-sized ungauged catchments in South China. Natural Hazards, 106, 2077-2109. https://doi.org/10.1007/s11069-021-04531-0
  Lei, X., Cheng, L., Ye, L., Zhang, L., Kim, J. S., Qin, S., & Liu, P. (2023). Integration of the generalized complementary relationship into a lumped hydrological model for improving water balance partitioning: A case study with the Xinanjiang model. Journal of Hydrology, 621, 129569. https://doi.org/10.1016/j.jhydrol.2023.129569
  
  The connection between TCN and GRU components (Section 2.4, Figure 3) is described conceptually but lacks technical details about information flow, feature dimension matching, and potential bottlenecks in the architecture. For example, how do TCN outputs' dimensions match with GRU input?
  
  Response: We thank the reviewer for the detailed comments. You pointed out that the description of the connection between the TCN and GRU components lacks technical details on information flow, feature dimension matching, and potential bottlenecks. This suggestion is very important and helps clarify the technical implementation logic of the model architecture. We fully accept your suggestion.
  We will supplement the technical details of the connection between TCN and GRU in Section 2.4 of the TCN-GRU model, clearly defining the feature dimension of TCN output, the dimension matching method with GRU input (via a linear projection layer), key parameter settings (e.g., tcn_features=64, gru_units=32), and bottleneck mitigation strategies. This will make the technical details of the model architecture clearer, enhance interpretability and reproducibility, and allow readers to intuitively understand the information flow mechanism between the two components.
  The specific modifications will be supplemented as follows:
  2.4 TCN-GRU model
  …
  Fig. 3 depicts the architecture of the TCN-GRU model, which consists of input layer, TCN layer, GRU layer, and output layer. Historical streamflow data and highly correlated influencing factors are integrated as input to the TCN layer, which excels in feature extraction, fast convergence, and robustness. The TCN layer processes input data through a stack of dilated causal convolution modules (each containing a 1D convolution with kernel size 3, dilation rate doubling per layer, and residual connections) to generate high-dimensional temporal feature maps. Specifically, for input data with shape (batch_size, time_steps, input_features), the TCN layer outputs feature maps with shape (batch_size, time_steps, tcn_features), where tcn_features is determined by the number of filters in the final TCN convolution layer (set to 64 in this study based on hyperparameter optimization).
  To match the feature dimensions with the subsequent GRU layer, a linear projection (via a fully connected layer with no activation) is applied to the TCN output, transforming the feature dimension from tcn_features to gru_units (set to 32 in this study). This ensures the TCN output (shape: (batch_size, time_steps, gru_units)) directly serves as the input to the GRU layer, which is designed to accept sequences with feature dimension equal to gru_units.
  The GRU layer, with hidden size gru_units, processes the temporal features by updating its internal state at each time step, capturing sequential dependencies. Potential bottlenecks (e.g., information loss during dimension conversion) are mitigated by: (1) keeping tcn_features ≥ gru_units to avoid downsampling-related information compression; (2) adding a skip connection from the TCN output to the GRU input, which retains raw TCN features alongside the projected features; and (3) using batch normalization in the TCN layer to stabilize feature distribution during propagation.
  Thus, the GRU layer effectively learns dynamic variations from the TCN-extracted features, capturing temporal correlations among multiple features to enhance simulation accuracy. The output layer generates the final simulated values.
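
The dimension flow described above can be traced with a small shape-bookkeeping sketch. It assumes that the GRU returns only its final hidden state and that the output layer produces one value per sample; neither detail is stated in the text. The skip connection is left out because the text does not say how the raw and projected features are merged. The function names are hypothetical.

```python
def tcn_gru_shapes(batch_size, time_steps, input_features,
                   tcn_features=64, gru_units=32):
    """Return (stage, tensor shape) pairs for the TCN -> projection -> GRU path."""
    shapes = [("input", (batch_size, time_steps, input_features))]
    # Causal padding keeps the sequence length; only the channel count changes.
    shapes.append(("tcn", (batch_size, time_steps, tcn_features)))
    # A per-time-step linear layer maps tcn_features -> gru_units.
    shapes.append(("projection", (batch_size, time_steps, gru_units)))
    # Assumption: the GRU returns its last hidden state only.
    shapes.append(("gru", (batch_size, gru_units)))
    # Assumption: a single simulated streamflow value per sample.
    shapes.append(("output", (batch_size, 1)))
    return shapes


def projection_param_count(tcn_features=64, gru_units=32, bias=True):
    """Trainable weights in the linear projection (bias term is an assumption)."""
    return tcn_features * gru_units + (gru_units if bias else 0)


for stage, shape in tcn_gru_shapes(batch_size=16, time_steps=30, input_features=8):
    print(f"{stage:<10} {shape}")
```

With the stated settings (64 to 32), the projection adds 2,080 parameters if a bias is used, so it is a small part of the model.
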
  
  Hyperparameters like TCN dilation factors, GRU hidden units, and RF tree depth could be reported in the main methodology section or explicitly referenced as being provided in the supporting information.
  
  Response: We appreciate the reviewer's valuable comments. You pointed out that hyperparameters such as the TCN dilation factors, GRU hidden units, and RF tree depth should be reported in the main methods section, or that relevant content in the supplementary materials should be explicitly cited. This suggestion is reasonable and necessary, as it helps to improve the transparency and reproducibility of the research methods. We fully accept your suggestions.
  We will supplement the hyperparameter information and references to supplementary materials in Section 2.4 TCN-GRU Model and Section 2.5 Random Forest (RF). Such modifications will clarify the model parameter settings by explicitly stating key hyperparameters (e.g., TCN dilation factors, GRU hidden layer size, RF tree depth) in the methods section and providing references to supplementary materials. This approach keeps the core information readable while offering comprehensive details through supplementary materials, thereby enhancing the reproducibility and rigor of the research. The revised contents are as follows:
  2.4 TCN-GRU model
  …
  Fig. 3 depicts the architecture of the TCN-GRU model, which consists of input layer, TCN layer, GRU layer, and output layer. Historical streamflow data and highly correlated influencing factors are integrated as input to the TCN layer, which excels in feature extraction, fast convergence, and robustness. Through the TCN layer, input data is processed to produce one-dimensional sequences via dilated causal convolutions and residual connection modules. The TCN layer uses a stack of 3 dilated causal convolution layers with kernel size 3, where dilation factors are set to 1, 2, and 4 (doubling per layer) to expand the receptive field while maintaining computational efficiency. The number of filters in each TCN layer is 64, resulting in output feature maps with 64 dimensions per time step.
  The GRU layer is utilized to learn and model the dynamic feature variation extracted from TCN, with the number of hidden units set to 32 to balance model complexity and performance. A linear projection layer is applied to the TCN output to match the feature dimension (64 → 32) with the GRU input requirements. Detailed hyperparameters of the TCN-GRU model, including the number of layers, dropout rate (0.2), and learning rate (0.001), are provided in Supplementary Information 5.
  The GRU model has strong nonlinear mapping capabilities for time-series data, which suits the pronounced nonlinearity inherent in streamflow time series. Thus, the GRU layer effectively captures temporal correlations among multiple features to enhance simulation accuracy. The output layer generates the final simulated values.
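
As an illustration of the dilation settings above, the following sketch computes the receptive field of the stack and implements a single-channel dilated causal convolution in plain Python. It assumes one convolution per layer; if each residual block contained two convolutions, the receptive field would be larger. The function names are hypothetical and this is not the authors' implementation.

```python
def receptive_field(kernel_size, dilations):
    """Time steps visible to one output, assuming one convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


def causal_dilated_conv1d(x, weights, dilation):
    """Single-channel causal convolution with implicit left zero-padding.

    y[t] = sum_j weights[j] * x[t - j * dilation], with x[i] = 0 for i < 0,
    so y[t] never depends on inputs later than t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Kernel size 3 with dilations 1, 2, 4, as stated in the text.
rf = receptive_field(3, [1, 2, 4])

# An impulse at t = 0 spreads over exactly rf time steps after three layers.
signal = [1.0] + [0.0] * 19
for d in (1, 2, 4):
    signal = causal_dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
```

Under these assumptions `rf` is 15, so each output of the stack sees the current step and the 14 preceding ones.
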
  2.5 Random forest (RF)
  RF is an ensemble learning approach that boosts the precision of classification and regression tasks by building and combining the prediction results of multiple decision trees (Qiao et al., 2023). The core principle of this algorithm involves utilizing the bootstrap resampling technique to randomly sample multiple subsets from the original dataset, on which decision trees are constructed for each subset. During the construction of these decision trees, features are randomly selected for splitting, thereby enhancing the model's diversity and generalization capability (Doyle et al., 2023).
  For the nonlinear ensemble in this study, the RF model is configured with 200 decision trees (n_estimators=200) and a maximum tree depth of 10 to prevent overfitting. The number of randomly selected features for each split (max_features) is set to "sqrt" of the input dimension. Additional hyperparameters, such as minimum samples per leaf and minimum samples per split, are detailed in Supplementary Information 5. Ultimately, RF combines the simulation results of the multiple trees through averaging to derive the final regression outcome. Utilizing RF for nonlinear ensemble learning leverages the advantages of its ensemble algorithm, thereby improving simulation accuracy, reducing the risk of overfitting, and effectively managing nonlinear and high-dimensional data (Alnahit et al., 2022; Wu et al., 2023).
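
The three ingredients named above (bootstrap resampling, random feature selection, and averaging) can be illustrated with a deliberately simplified ensemble of depth-1 regression trees in plain Python. This is a teaching sketch, not the configuration used in the study: a real random forest grows deeper trees (depth 10 here) and re-draws the feature subset at every split. The parameter names quoted in the text (`n_estimators`, `max_features`) match those of scikit-learn's `RandomForestRegressor`, although the source does not name the library. In the study's setting the two input columns would be the XAJ and TCN-GRU outputs; the data below are synthetic.

```python
import random


def fit_stump(X, y, feature):
    """Best single-threshold split on one feature by squared error."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = 0.5 * (lo + hi)
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # feature is constant in this bootstrap sample
        mean = sum(y) / len(y)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_bagged_stumps(X, y, n_trees=200, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        feature = rng.randrange(n_features)         # random feature choice
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return trees


def predict(trees, row):
    preds = [ml if row[f] <= thr else mr for f, thr, ml, mr in trees]
    return sum(preds) / len(preds)  # average over all trees


# Synthetic data: the target depends on column 0 only.
X = [[float(i), float((i * 7) % 10)] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
trees = fit_bagged_stumps(X, y, n_trees=50, seed=0)
```

Because the trees that split on column 0 recover the step in the target, `predict(trees, [19.0, 5.0])` is larger than `predict(trees, [0.0, 5.0])`.
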
  
  References
  Alnahit, A. O., Mishra, A. K., & Khan, A. A. (2022). Stream water quality prediction using boosted regression tree and random forest models. Stochastic Environmental Research and Risk Assessment, 36(9), 2661-2680. https://doi.org/10.1007/s00477-021-02152-4
  Doyle, J. M., Hill, R. A., Leibowitz, S. G., & Ebersole, J. L. (2023). Random forest models to estimate bankfull and low flow channel widths and depths across the conterminous United States. JAWRA Journal of the American Water Resources Association, 59(5), 1099-1114. https://doi.org/10.1111/1752-1688.13116
  Wu, J., Wang, Z., Dong, J., Cui, X., Tao, S., & Chen, X. (2023). Robust runoff prediction with explainable artificial intelligence and meteorological variables from deep learning ensemble model. Water Resources Research, 59(9), e2023WR035676. https://doi.org/10.1029/2023WR035676
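
As a rough illustration of the RF configuration described in the response above, the sketch below fits a scikit-learn `RandomForestRegressor` with the stated settings. Several things here are assumptions rather than the authors' implementation: the use of scikit-learn is inferred from the parameter names, the two-column input (one XAJ simulation, one TCN-GRU simulation) is a guess at how the ensemble is fed, and the helper name `fit_rf_ensemble` and the synthetic data are hypothetical. The minimum samples per leaf and per split are left at library defaults because their values are given only in Supplementary Information 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_rf_ensemble(base_sims, observed, seed=0):
    """Fit an RF that maps base-model simulations to observed streamflow.

    base_sims: array of shape (n_days, n_base_models); hypothetical layout.
    """
    model = RandomForestRegressor(
        n_estimators=200,     # 200 decision trees, as stated in the response
        max_depth=10,         # depth cap intended to limit overfitting
        max_features="sqrt",  # features considered at each split
        random_state=seed,    # assumption: fixed seed for repeatability
    )
    model.fit(base_sims, observed)
    return model


# Synthetic stand-in data (not the study's data).
rng = np.random.default_rng(42)
observed = rng.gamma(2.0, 50.0, 500)
xaj_sim = 0.90 * observed + rng.normal(0.0, 10.0, 500)  # stand-in for XAJ output
dl_sim = 1.05 * observed + rng.normal(0.0, 8.0, 500)    # stand-in for TCN-GRU output
base_sims = np.column_stack([xaj_sim, dl_sim])

split = 400
rf = fit_rf_ensemble(base_sims[:split], observed[:split])
ensemble_sim = rf.predict(base_sims[split:])

# The forest's regression output is the average of the individual tree outputs.
tree_mean = np.mean([t.predict(base_sims[split:]) for t in rf.estimators_], axis=0)
```

The last line reproduces the averaging step by hand, which is one way to confirm that the final regression outcome is the mean over the trees.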
  
  Line 379: Remove the double comma: "at the optimal location in the Taylor diagram, and the correlation"
  
  Response: We thank the reviewer for the detailed comments. You pointed out a redundant double comma in Line 379. This suggestion is correct and helps standardize the punctuation in the text. We fully accept it.
  We have removed the extra comma from the original text: “Notably, in Fig. 8, the XAJ-TCN-GRU model's predictions for the Wuding River basin are positioned at the optimal location in the Taylor diagram, , and the correlation coefficient is 0.99.” and revised it to: Notably, in Fig. 8, the XAJ-TCN-GRU model's predictions for the Wuding River basin are positioned at the optimal location in the Taylor diagram, and the correlation coefficient is 0.99.
  This revision eliminates redundant punctuation, making the sentence structure more rigorous and standardized, enhancing the readability and professionalism of the text, and ensuring that readers can clearly understand the logical relationships within the sentence.
  
  Section 5.2 identifies flood events as periods where "streamflow exceeded four times the standard deviation", which may not align with standard return period analysis. It would be better to use established flood frequency analysis methods.
  
  Response: We appreciate the reviewer's professional comments. You pointed out that the definition of a flood event in Section 5.2, periods where “streamflow exceeded four times the standard deviation,” may be inconsistent with standard return period analysis, and suggested adopting an established flood frequency analysis method. This suggestion is reasonable and important, as it improves the scientific rigor and consistency of the flood event definition in the study. We fully accept this suggestion.
  In the revision, we will replace the original definition of flood events with a standard method based on flood frequency analysis, specifically the Peak Over Threshold (POT) method, and determine the threshold based on the hydrological characteristics of the study area. The revised content is as follows:
  This study also placed a specific emphasis on the accurate simulation of extreme hydrological events. Flood events were identified using the Peak Over Threshold (POT) method, a widely accepted flood frequency analysis approach (Hosking & Wallis, 1997). For each basin, the threshold was determined as the 95th percentile of the daily streamflow series in the testing set, corresponding to a return period of approximately 2 years based on regional hydrological characteristics (as recommended by the China Hydrological Manual, 2017). Periods where daily streamflow exceeded this threshold and lasted for at least 2 consecutive days were classified as flood events. Through in-depth hydrological data analysis, a total of eight such flood events were identified across the four basins.
  
  References
  Hosking, J. R. M., & Wallis, J. R. (1997). Regional frequency analysis: An approach based on L-moments. Cambridge University Press.
  China Hydrological Manual Compilation Committee. (2017). Hydrological manual (4th ed.). China Water & Power Press.
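
The event definition in the revised text (95th-percentile threshold, at least 2 consecutive days of exceedance) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `identify_flood_events`, the inclusive index convention, and the synthetic series are hypothetical, and the percentile uses NumPy's default linear interpolation, which the response does not specify.

```python
import numpy as np


def identify_flood_events(flow, percentile=95.0, min_days=2):
    """Return (threshold, events), where events are (start, end) day indices, inclusive."""
    flow = np.asarray(flow, dtype=float)
    threshold = np.percentile(flow, percentile)
    above = flow > threshold

    events = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                          # a run of exceedances begins
        elif not flag and start is not None:
            if i - start >= min_days:          # keep runs lasting long enough
                events.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(flow) - start >= min_days:
        events.append((start, len(flow) - 1))
    return threshold, events


# Synthetic daily series: low baseline, one 3-day peak, one isolated 1-day peak.
flow = 10.0 + (np.arange(100) % 5)
flow[20:23] = [120.0, 150.0, 110.0]
flow[50] = 90.0

threshold, events = identify_flood_events(flow)
```

In this synthetic series the isolated peak on day 50 exceeds the threshold but is not counted, because it does not persist for 2 consecutive days.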
  
  On line 412, the out-of-context validation excludes "wet-year" data from training but doesn't clearly define what constitutes a "wet-year" or provide quantitative criteria for this classification.
  
  Response: We appreciate the reviewer's valuable comments. The reviewer pointed out that the definition of “wet-year” in the out-of-context validation (line 412) lacks clarity and quantitative criteria. This suggestion is reasonable and will help improve the rigor of the research methodology. We fully accept this suggestion.
  In the revision, the definition and quantitative criteria for “wet-year” will be added to the “out-of-context validation” part of the model robustness analysis in Section 5.1. We define the temporal scope of a “wet-year” (the hydrological year) and a quantitative criterion (annual precipitation exceeding the 75th percentile), and cite the Chinese national standard as the basis. This makes the concept clearer and more standardized, makes the method easier to apply and the results easier to compare, and lets readers follow the logic behind the classification of extreme hydrological conditions. The revised content is as follows:
  In the out-of-context validation, we excluded data from wet-years in the training sets of the four basins and used these data separately for model validation. A wet-year was defined as a hydrological year (June of the current year to May of the next year, consistent with China's hydrological year division) where the annual precipitation exceeded the 75th percentile of the long-term (2010–2023) annual precipitation series in the respective basin. This criterion aligns with the classification standard for wet years in the Technical Specifications for Hydrological Data Processing (General Administration of Quality Supervision, GB/T 50102–2014) in China. This approach aims to evaluate the model’s performance under extreme hydrological conditions (i.e., wet-years) that it has not encountered before, providing a better understanding of its generalization ability and robustness.
  
  References
  General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China. (2014). Technical Specifications for Hydrological Data Processing (GB/T 50102–2014). China Planning Press.
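
The wet-year criterion in the revised text (hydrological year from June to May, annual precipitation above the 75th percentile of the long-term series) can be sketched with pandas as below. This is an illustrative sketch under stated assumptions: the function names, the labeling of a hydrological year by the calendar year in which it starts, the rule for dropping incomplete years, and the synthetic data are all hypothetical and not taken from the manuscript.

```python
import numpy as np
import pandas as pd


def hydrological_year(dates, start_month=6):
    """Label each date with the calendar year in which its hydrological year starts."""
    dates = pd.DatetimeIndex(dates)
    return np.where(dates.month >= start_month, dates.year, dates.year - 1)


def wet_years(daily_precip, quantile=0.75, start_month=6):
    """Return (wet_year_labels, threshold) for a daily precipitation Series."""
    hy = hydrological_year(daily_precip.index, start_month)
    grouped = daily_precip.groupby(hy)
    annual = grouped.sum()
    # Assumption: drop incomplete hydrological years at the ends of the record.
    annual = annual[grouped.count() >= 365]
    threshold = annual.quantile(quantile)
    return annual[annual > threshold].index.tolist(), threshold


# Synthetic record covering four full hydrological years (2010 to 2013 labels).
idx = pd.date_range("2010-06-01", "2014-05-31", freq="D")
labels = hydrological_year(idx)
precip = pd.Series((labels - 2009).astype(float), index=idx)  # wetter each year

wet, threshold = wet_years(precip)
```

The returned labels can then be used to hold out the corresponding dates from the training set and evaluate the model on them separately.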
  
  The Data Availability Statement (lines 617-618) mentions that "Model source code can be obtained from its Github repository (https://github.com/zcwang1028/code.git)", but providing code through compressed archives (rar/zip files) rather than proper version-controlled repositories could benefit from following recommended reproducibility practices (Gil et al., 2016).
  
  References
  Gil, Y., David, C.H., Demir, I., Essawy, B.T., Fulweiler, R.W., Goodall, J.L., Karlstrom, L., Lee, H., Mills, H.J., Oh, J.-H., Pierce, S.A., Pope, A., Tzeng, M.W., Villamizar, S.R., Yu, X., 2016. Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance. Earth Space Sci. 3, 388–415. https://doi.org/10.1002/2015ea000136
  Willard, J., Jia, X., Xu, S., Steinbach, M., Kumar, V., 2022. Integrating Scientific Knowledge with Machine Learning for Engineering and Environmental Systems. ACM Comput Surv 55, 1–37. https://doi.org/10.1145/3514228

  Response: We appreciate the reviewer's valuable suggestions. The suggestion that code sharing should follow recommended reproducibility practices is reasonable and important, and it is critical to the transparency and reusability of the research. We fully accept it and will revise the data availability statement accordingly.
  In the revision, we have retained the original GitHub repository link (https://github.com/zcwang1028/code.git) to support version control. We have also reorganized the code repository and supplemented it with detailed documentation, including environment configuration requirements, explanations of key functions, and example execution steps, so that other researchers can easily understand and use the code. In addition, we have added the data used in this study and a stable version snapshot of the code to the repository to avoid compatibility issues caused by subsequent updates, thereby better adhering to the best practices for data and software documentation and sharing in Earth science research proposed by Gil et al. (2016). The data availability statement also outlines the sources of the data and how to obtain them.
  The revised Data Availability Statement is as follows:
  Streamflow data from this study site were obtained from the Hydrological Yearbook of the People's Republic of China and provided by the Shanghai Qingyue Information Technology Service Centre (https://data.epmap.org/page/index); meteorological data were obtained from the China Meteorological Network (https://weather.cma.cn/). Model source code can be obtained from its GitHub repository (https://github.com/zcwang1028/code.git), where a well-organized structure, detailed documentation (including environment configuration, function explanations, and running procedures), and stable version snapshots are provided to facilitate reproducibility, in line with recommended practices for documenting and sharing research software (Gil et al., 2016).
  
  References
  Gil, Y., David, C.H., Demir, I., Essawy, B.T., Fulweiler, R.W., Goodall, J.L., Karlstrom, L., Lee, H., Mills, H.J., Oh, J.-H., Pierce, S.A., Pope, A., Tzeng, M.W., Villamizar, S.R., Yu, X., 2016. Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance. Earth Space Sci. 3, 388–415. https://doi.org/10.1002/2015ea000136
  
  Citation: https://doi.org/10.5194/egusphere-2025-2377-AC1
RC2:
'Comment on egusphere-2025-2377', Anonymous Referee #2, 20 Oct 2025

This study addresses a timely and important topic: improving streamflow simulation through integration of process-based hydrological models and data-driven deep learning methods. The authors propose a hybrid framework (XAJ–TCN–GRU) that combines the physical interpretability of the Xinanjiang model with the nonlinear learning capacity of Temporal Convolutional and Gated Recurrent Unit networks, further refined through a Random Forest–based nonlinear ensemble. The manuscript is well structured, with applications across four Chinese basins under varying hydrological conditions. The inclusion of robustness and uncertainty analyses represents a valuable effort toward comprehensive model evaluation. Overall, the topic is relevant and potentially impactful for hydrological modeling research. However, in its current form, the manuscript still exhibits several weaknesses that limit the reliability, interpretability, and novelty of the findings. The issues mainly concern insufficient literature contextualization, unclear experimental design, incomplete methodological justification, and lack of clarity in several analytical sections. These aspects need to be substantially improved before the paper can be considered for publication.

Major Comments
1. Significance of Results

While the hybrid modeling framework is conceptually sound, the reported benefits appear marginal when quantitatively examined. As shown in Table 5, the proposed TCN–GRU yielded modest improvements in the Nash–Sutcliffe Efficiency (NSE) over the benchmark LSTM models for only two years. Considering that each basin is trained independently, with only four basins included, the evidence for broader generalization remains limited. To enhance the study’s credibility, it is recommended to (i) expand experiments to include more basins, ideally from publicly available benchmark datasets such as CAMELS, or (ii) explicitly discuss whether the proposed model is designed for localized adaptation rather than general applicability.

2. Robustness Analyses

The robustness tests presented in Section 5.1 are conceptually interesting but not entirely convincing in their current design. The assumption that noise affects only 2 % of the data is unrealistic for field hydrological records, and the nature of the injected noise (distribution, magnitude, and correlation) is not explicitly stated. Additionally, the 'out-of-context' validation based on wet-year exclusion lacks sufficient detail and could be strengthened by testing multiple temporal splits, varying training durations, or incorporating cross-basin validation to provide more reliable insights. For reference, the recent study in Hydrology and Earth System Sciences (https://doi.org/10.5194/hess-29-1277-2025) offers a more systematic approach to extreme and nonstationary testing that could strengthen this part of the paper.

3. Rationale of Methodology

The choice to use the Maximum Information Coefficient (MIC) for feature selection is not sufficiently justified. MIC may identify correlated variables but cannot prevent multicollinearity among selected inputs, which could degrade deep learning model performance. Since the TCN already performs effective temporal feature extraction, the added benefit of MIC selection should be demonstrated by providing comparative results with and without this step. Furthermore, several evaluation indices (e.g., PINAW and PICP) appear abruptly in the text without adequate explanation or citation; these should be formally introduced and supported by literature.

4. Literature Review

The introduction insufficiently covers recent developments, especially in hybrid or physics-informed machine learning for hydrology. Several recent studies integrating process-based and deep learning frameworks should be cited and discussed to clarify the manuscript’s novelty. Some statements in the results and discussion sections (e.g., “This effectiveness can be attributed to the XAJ model,” line 291; and “the hydrological processes may be influenced by terrain, soil type, and vegetation cover,” line 305) are speculative and should be supported by quantitative analysis or references. The captions of figures and tables are overly concise and should be expanded to improve clarity and self-containment.

Minor Comments
1. Line 46: The introduction mentions only two model types; consider also referencing “hybrid” or “integrated” models to reflect the broader methodological context.
2. Line 64: The statement on machine learning models could be complemented by a reference to artificial neural networks (ANN), which are among the earliest data-driven models in hydrological simulation.
3. Figure 5: The subplots are not numbered, and the subtitle for the Qingyi Basin should be corrected.
4. Line 268: Since five independent optimization runs were conducted, it would be useful to report uncertainties (e.g., standard deviations) of performance metrics to reflect model stability.
5. Line 281: The full name of NSE has not been defined in the main text.
6. Line 282: The reasons for employing PINAW and PICP should be clarified and properly cited.
7. Line 300: Consider quantifying statements on performance across different flow ranges, for instance by grouping metric values by low-, medium-, and high-flow conditions.

Citation: https://doi.org/10.5194/egusphere-2025-2377-RC2
- AC2: 'Reply on RC2', Zhaocai Wang, 28 Oct 2025
  
  We would like to thank the reviewer for your valuable comments on our initial submission.
  These comments have helped us a lot to improve the quality of our paper.
  We have made targeted revisions to the paper based on your comments and responded to them.
  Since this text box cannot load tables, figures and formulas, we upload the response comments as a document to the submission system.
  We sincerely invite you to review it.
  Once again, we thank the reviewer for your contributions and outstanding work.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2377-AC2


Supplement

https://doi.org/10.5194/egusphere-2025-2377-supplement



Short summary

This study integrates the Xinanjiang (XAJ) model and a Temporal Convolutional Network-Gated Recurrent Unit (TCN-GRU) model via Random Forest (RF) for streamflow simulation. It combines XAJ's physical modeling with TCN-GRU's temporal analysis. Validated in four hydrologically diverse basins, the model achieves NSE values of 0.971–0.991, outperforming traditional models. The model is robust in flood and interval simulations, and three interpretability methods identify dew point temperature and evaporation as key factors.

