the Creative Commons Attribution 4.0 License.
A Distributed Hybrid Physics-AI Framework for Learning Corrections of Internal Hydrological Fluxes and Enhancing High-Resolution Regionalized Flood Modeling
Abstract. To advance the discovery of scale-relevant hydrological laws while better exploiting massive multi-source data, merging artificial intelligence with process-based modeling has emerged as a compelling approach, as demonstrated in recent lumped hydrological modeling studies. This research proposes a general spatially distributed hybrid modeling framework that seamlessly combines differentiable process-based modeling with neural networks. We focus on hybridizing a differentiable hydrological model with neural networks, leveraging the temporal memory effect of the original model, on top of a differentiable kinematic wave routing over a flow direction grid. We evaluate flood modeling performance and analyze the interpretability of learned conceptual parameters and corrections of internal fluxes using two high-resolution datasets (dx = 1 km, dt = 1 h). The first dataset involves 235 catchments in France, used for local calibration-validation and model structure comparisons between the classical GR-like model and the hybrid approach. The second dataset presents a challenging multi-catchment modeling setup in flash flood-prone areas to demonstrate the framework's regionalization learning capabilities. The results show that the hybrid models achieve superior accuracy and robustness compared to classical approaches in both spatial and temporal validation. Analysis of the spatially distributed parameters and internal fluxes reveals the hybrid models' nuanced behavior, their adaptability to diverse hydrological responses, and their potential for uncovering physical processes.
Status: open (until 05 Apr 2025)
RC1: 'Comment on egusphere-2024-3665', Anonymous Referee #1, 24 Feb 2025
Review of HESS Manuscript
“A Distributed Hybrid Physics-AI Framework for Learning Corrections of Internal Hydrological Fluxes and Enhancing High-Resolution Regionalized Flood Modeling”
Dear Editor, I have attached my review of the manuscript.
1. Scope
The scope of the paper is well suited for HESS.
2. Summary
The authors introduce a distributed hybrid hydrological model. The model is based on the GRU process-based model architecture but includes embedding neural networks that are used to parameterize the process-based model. They test their model on 256 catchments located in France, divided into two datasets (235 and 21 catchments for the first and second datasets respectively). They conclude that the hybrid approaches perform better than the stand-alone process-based models.
Overall, the manuscript has the potential to be a good contribution, however, there are certain aspects mentioned in the comments below that should be taken into account before the manuscript is accepted.
3. Evaluation
Major comments:
Model comparison: The authors compare a stand-alone GRU model and their hybrid model approaches. I think the comparison between these two models is necessary, valuable, and the results are presented clearly. However, the fact that the hybrid performs better than the stand-alone process-based model is expected. With the hybrid approach, you have a model with more degrees of freedom, and the embedded NN can compensate for structural deficiencies in the process-based part, which will increase performance.
What I think is missing, to have a better idea of where the hybrid stands, is a comparison with a purely data-driven approach. For example, having a stand-alone LSTM, trained regionally with lumped meteorological inputs (e.g., catchment average values) would be a good benchmark. Or use as inputs, not only the basin-averaged values of precipitation, temperature, etc,… but also include other basin-averaged statistics (mean, std, max and min) that you can compute from the gridded products. This way we can see how the hybrid approach performs against purely data-driven methods, and if the extra effort of going distributed is worth it.
Section 3: Here you present two datasets, with which you run two sets of experiments. The first dataset includes 235 non-nested catchments in France with 13 years of data. In this dataset, you test the effect of having a NN for process parameterization. In the second dataset, you have 21 catchments in the Mediterranean region, both nested and independent, with 7 years of data. This one you use to test the model regionalization. Is there a reason why this last test cannot be made in the first dataset? One can evaluate regionalization from catchment to catchment, and not only inside the same catchment. Moreover, having results in 235 catchments for the second experiment will give more robust tests. Also, you can mix everything in a single dataset with 256 catchments. I was just wondering why did you make this division?
Line 282-294: The differences between the models are quite small. For example, the difference shown in Figure 4 between the median NSE for the GR.U and the GRNN.U is 0.008 and between the GRD and the GRNN.D is 0.014. Are the differences between the reported distributions statistically significant? I think this point should be further discussed. Because the hybrid approach has a higher flexibility than the process-based model. The embedded neural networks produce flux-correction parameters for each pixel and timestep, and if the differences between the hybrid and the stand-alone process-based model are small, it would be interesting to find out why. Maybe the physical dissipation of the basins makes it unnecessary to have so much detail, if one is just interested in the simulated discharge at a specific point. Or maybe the meteorological data is restricting further increases in quality.
As an additional question, do the flux correction parameters allow the model to artificially increase/decrease the amount of water (violate the mass-conservation principle) in the control volume?
Line 295-300: You indicate that you are evaluating the performance of the model in flash floods, and then you evaluate it in 2700 events during the 6-year validation period. Are these 2700 events flash floods or just regular floods? How did you classify them?
Line 327-333: In these lines (and Figure 7) you compare the NSE for 143 flood events, indicating that the hybrid models perform better. Even if this is true, all the models performed quite badly. For the GR.U and GRNN.U the median NSEs are -0.48 and 0.09, which is a clear indication that the models do not work at all. Just taking the mean of the observed data would yield an NSE of 0. For the other two models, the NSE did improve, but was still quite low (0.19 and 0.37). You should expand the discussion here and try to understand why all models are performing so badly.
Minor comments:
Line 61: Clarify “This study”.
Line 72: Replace “have to be advanced” with “should advance”
Line 74: What do you mean by “earth critical zone”?
Line 136: The purple color of the parameters is almost red. I would suggest choosing another color scheme, more colorblind-friendly.
Line 183: What do you mean by neutralized atmospheric inputs?
Line 278-279: You indicate about Figure 3 “The results demonstrate the superior accuracy of hybrid methods compared to the classic models…” but it is not clear from the Figure, because one cannot see any details. There are certain peaks in which the hybrid is better, but you have 6 subplots, each with 5 years of hourly discharges, so you cannot really appreciate much. I imagine that if one looks at specific events, sometimes the hybrid is better, sometimes both are similar, and sometimes the process-based is better. Maybe plot only a subset of the testing period, or specific events where the differences are significant. Then, with general metrics you can make the point on which model tends to perform better.
Line 285-290: I would separate more clearly (in different paragraphs) the results reported in calibration and validation. It is not usual practice to compare models using results in the calibration period, as any meaningful comparison should be made in validation. If you want to report the results in calibration that is perfectly ok, but a clearer distinction should be made.
Line 290: The RMSE for GRNN.U, according to Figure 4, is 1.38 not 1.30. You should correct this in the text.
Figure 5. Is the Ebf metric (baseflow) a good/necessary indicator for performance during flood events?
Line 323-326: It is not clear what you want to say.
Line 352: You indicate that “Some spatial patterns in these corrections seem to emerge across France, and although analyzing trends in corrections as a function of physical explanatory factors may yield insights, it is beyond the scope of this study focusing on detailed quantitative analysis of those spatio-temporal corrections”. What are the spatial patterns showing in Figure 8? Because for me they are not so clear. Also, why is analysing the correction factors as a function of physical characteristics out of scope? I think this is one of the most interesting parts you should focus on. If one of the advantages of hybrid models is that they produce physical interpretability, then one should interpret what the models are doing.
Line 361: You indicate that “a majority of exchange flux corrections fq,4 that share the same sign as fq,1.” Can you quantify this with a metric? Because from the figure it is not obvious. fq4 shows more red in the bottom, but I am not sure if the majority of cases are in accordance.
Line 265: You indicate that “periodic behaviors are observed over time in all four heatmaps”. For fq4 I cannot distinguish clear periodic behaviors. It could be useful to plot the time series of some basins. Maybe include them in the appendix.
Line 385 and Figure 10b: You indicate that “Interestingly, these maps also reveal spatial variability in internal flux corrections.” It would be interesting to analyse why these patterns emerge.
Line 424: Rephrase “Also, one could also...”
Citation: https://doi.org/10.5194/egusphere-2024-3665-RC1
AC1: 'Reply on RC1', Ngo Nghi Truyen Huynh, 20 Mar 2025
We greatly appreciate your time and effort in reviewing our manuscript and providing constructive and detailed comments. Below is our point-by-point response addressing your concerns, with reviewer comments in italic and author responses in bold.
"Major comments. Model comparison: The authors compare a stand-alone GRU model and their hybrid model approaches. I think the comparison between these two models is necessary, valuable, and the results are presented clearly. However, the fact that the hybrid performs better than the stand-alone process-based model is expected. With the hybrid approach, you have a model with more degrees of freedom, and the embedded NN can compensate for structural deficiencies in the process-based part, which will increase performance."
We appreciate your feedback regarding the model comparison. Indeed, the hybrid model, for example GRNN.U, has more degrees of freedom than the stand-alone GR.U model. However, it is important to note that the inputs and outputs of the flux correction model are physically consistent and of the same dimension as the original model. This design allows the hybrid model to learn nonlinearities in the internal flux laws, which we analyze thoroughly in both time and space throughout the paper.
While the hybrid model (GRNN.U) does not necessarily have more conceptual parameters (maintaining the same number of reservoirs and connections here), it does introduce more nonlinearity in the internal flux law corrections with the neural network Φ1. This added complexity effectively increases the model's degrees of freedom, contributing to its enhanced performance.
This more complex hybrid model outperforms the original model in calibration, which is expected given its added complexity. Remarkably, it maintains robustness in both spatial and temporal validations, as evidenced by the numerical results. The hybrid approaches have been rigorously tested through temporal and spatial validation, demonstrating their robustness and performance not only on calibration but also on various validation scenarios. Moreover, what sets our hybrid approach apart is its physical interpretability, which demonstrates its strength compared to pure machine learning or deep learning approaches.
We will further clarify, analyze, and justify these aspects in the revision to address your concerns.
"What I think is missing, to have a better idea of where the hybrid stands, is a comparison with a purely data-driven approach. For example, having a stand-alone LSTM, trained regionally with lumped meteorological inputs (e.g., catchment average values) would be a good benchmark. Or use as inputs, not only the basin-averaged values of precipitation, temperature, etc,… but also include other basin-averaged statistics (mean, std, max and min) that you can compute from the gridded products. This way we can see how the hybrid approach performs against purely data-driven methods, and if the extra effort of going distributed is worth it."
Thank you for your suggestion. This work focuses on a spatially distributed conceptual model based on physics and its physical hybridization. Our emphasis is on the rigorous presentation and analysis of this framework over a large sample, its performance in calibration and spatio-temporal extrapolation, and the interpretability of internal fluxes corrected with the hybrid approach, as you mentioned in the previous comment that the comparison between the GR and the hybrid GRNN models is "necessary, valuable, and the results are presented clearly".
First, we believe that analyzing the question of lumped versus spatialized models is not within the scope of this study. A spatially distributed approach is essential given the high spatial variabilities involved within the flash flood-prone catchments of this dataset.
Second, unlike traditional and hybrid process-based models, which rely on physical conservation equations of mass, momentum, energy, and empirical closures, pure LSTM or ML models do not inherently impose physical constraints, leading to reduced physical interpretability and generalizability, especially under extreme or unseen hydrological conditions (Beven, 2020; Sit et al., 2020; Shen, 2018). Given this context, we believe that a comparison with a stand-alone LSTM is beyond the scope of this study, which focuses on the hybridization and physical interpretability of spatially distributed process-based models. Our goal is not to benchmark the hybrid approach against purely data-driven methods but to demonstrate the improvements achieved through the hybridization of a well-established, spatially distributed, differentiable numerical hydrological model. A benchmark represents the scope of another full study.
Moreover, building a stand-alone LSTM, operating at an hourly time step, trained regionally, and capable of accounting for basin-average forcings/descriptors and spatial information, would be interesting but would represent a significant undertaking based on our previous experience with daily LSTM models (Hashemi et al., 2022). This is the scope of another full study. Such pure deep learning models, with even more degrees of freedom than our relatively parsimonious hybrid model, are hardly interpretable or extrapolable beyond the training set, like other neural networks. Therefore, we believe that these two approaches are not comparable within this context, where the present paper features a rigorous presentation and detailed analysis, especially of optimized quantities and internal fluxes, over a large sample and in regionalization. We believe this design and study are scientifically solid; this will be better explained in the revised manuscript to answer such questions.
We appreciate your suggestion and will keep it in mind for future benchmarking studies that compare a broader range of modeling approaches.
"Section 3: Here you present two datasets, with which you run two sets of experiments. The first dataset includes 235 non-nested catchments in France with 13 years of data. In this dataset, you test the effect of having a NN for process parameterization. In the second dataset, you have 21 catchments in the Mediterranean region, both nested and independent, with 7 years of data. This one you use to test the model regionalization. Is there a reason why this last test cannot be made in the first dataset? One can evaluate regionalization from catchment to catchment, and not only inside the same catchment. Moreover, having results in 235 catchments for the second experiment will give more robust tests. Also, you can mix everything in a single dataset with 256 catchments. I was just wondering why did you make this division?"
Thank you for your valuable feedback. We will provide clarifications on the complementary goals of the experiments on the two datasets, which should address your points.
For the dataset with 235 catchments:
- Objective: Test the performance of ϕ1 (NN for flux correction) using local calibrations only at downstream gauges. This setup demonstrates the efficiency of the ϕ1 NN in improving model performance.
- Reason for separate testing: Conducting a multi-catchment setup across the entire mesh of France is computationally challenging, given the high spatio-temporal resolution of the data and model (see Huynh et al. (2024) for details on the computational costs of the adjoint model). Moreover, this would require investigating cost functions adapted for meaningful information selection/weighting over a set of catchments with contrasting areas and physics (see the issues for regionalization with downstream catchments in Huynh et al. (2024)).
For MedEst dataset:
- Objective: Evaluate regional calibration in a multi-catchment setup. This dataset tests the performance of both ϕ1 (flux correction NN) and ϕ2 (regionalization NN).
- Reason for focused regionalization: Analyzing physical interpretability in a national multi-catchment setup is complex. For this initial study, we focused on regionalization performance within a specific and known study zone. It is worth noting that regionalization performance over a larger zone, covering approximately 1/4 of France, has already been studied in Huynh et al. (2024). Future studies can certainly explore a national multi-catchment setup, as you suggested.
By separating the datasets, we aimed to provide a clear and focused analysis of both the flux correction and regionalization capabilities of our hybrid model. We appreciate your suggestion and will consider expanding the scope in future research.
"Line 282-294: The differences between the models are quite small. For example, the difference shown in Figure 4 between the median NSE for the GR.U and the GRNN.U is 0.008 and between the GRD and the GRNN.D is 0.014. Are the differences between the reported distributions statistically significant? I think this point should be further discussed. Because the hybrid approach has a higher flexibility than the process-based model. The embedded neural networks produce flux-correction parameters for each pixel and timestep, and if the differences between the hybrid and the stand-alone process-based model are small, it would be interesting to find out why. Maybe the physical dissipation of the basins makes it unnecessary to have so much detail, if one is just interested in the simulated discharge at a specific point. Or maybe the meteorological data is restricting further increases in quality."
We appreciate your insightful comments regarding the differences between the models and the significance of the reported distributions.
First, while the median improvements may appear small in temporal validation, as you pointed out (noting that hybrid models show clearly higher performance in calibration), it is important to consider the entire distribution. In addition to the median values, we observe notable enhancements in other statistical measures, such as the interquartile range (0.25 and 0.75 quantiles) and the whiskers in the boxplots.
Another important point is that, for catchments that already exhibit satisfactory performance, the effect of hybridization is relatively small (leading to similar median, 0.75, and 0.95 quantile values). However, for poorly performing basins, the hybrid models provide substantial improvements, as evidenced by enhanced performance in the lower quartiles. To further illustrate this, an additional catchment-by-catchment comparison graph may be helpful, and we will consider adding this in the revised version. Note that even when the global NSE improvement is slight, the proposed method enables learning spatio-temporal corrections of model internal fluxes within a basin/region, as shown by the analysis of the obtained internal flux corrections for each of the 235 catchments across France. This property of the proposed approach is very promising for exploiting other data to improve the representation of evapotranspiration, infiltration, groundwater exchanges, and other processes.
Finally, a particularly interesting result from this comparison is the performance of the hybrid model with spatially uniform control (GRNN.U) in multi-gauge regional calibration. Despite maintaining spatially uniform parameters, this model achieves performance comparable to the original model with spatially distributed parameters (GR.D). This result is remarkable, as it highlights the ability of ϕ1 to leverage both the memory effects of the ODEs and the NN’s capacity to exploit spatially distributed forcings and learn spatio-temporal compensations for modeling rigidity/uncertainties, which is a very promising property in view of regionalization with more diverse information from multi-source data.
We will include a more detailed comparison of these statistical indicators in the revised version to highlight the improvements more comprehensively.
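To make the significance question concrete, one option we could report is a paired sign-flip permutation test on per-catchment NSE differences. The sketch below is a minimal illustration on synthetic scores (the effect size and data are hypothetical, not the study's actual results):

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.
    Under H0 the two models' per-catchment scores are exchangeable, so
    each per-catchment difference may flip sign with probability 1/2."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    permuted = np.abs((signs * d).mean(axis=1))
    # Add-one smoothing keeps the p-value strictly positive
    return (1 + np.sum(permuted >= observed)) / (n_perm + 1)

# Synthetic per-catchment NSE scores for two models (illustration only)
rng = np.random.default_rng(42)
model_a = rng.uniform(0.3, 0.9, 235)             # e.g. a baseline model
model_b = model_a + rng.normal(0.03, 0.05, 235)  # a slightly better variant
print(paired_permutation_test(model_b, model_a))  # small p-value: significant
```

A Wilcoxon signed-rank test on the same paired differences would be a standard nonparametric alternative reaching a similar conclusion on this kind of data.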
"As an additional question, do the flux correction parameters allow the model to artificially increase/decrease the amount of water (violate the mass-conservation principle) in the control volume?"
Thank you for this valuable comment. The simulated water balance is influenced by the correction of kexc, which represents the exchange flux and can result in either gains or losses relative to the original model (which is already non-conservative). The flux correction is illustrated in the graphs, showing its spatio-temporal variability, which makes it complex to analyze directly. We will add this clarification in the revised version.
"Line 295-300: You indicate that you are evaluating the performance of the model in flash floods, and then you evaluate it in 2700 events during the 6-year validation period. Are these 2700 events flash floods or just regular floods? How did you classify them?"
Thank you for this comment. The selection of flood events is performed using an automatic segmentation algorithm based on Huynh et al. (2023), which identifies events where peak flows exceed a certain quantile threshold. Therefore, we acknowledge that these 2700 events are not necessarily flash floods but rather general flood events that include flash floods. To ensure clarity, we will remove the term “flash” from the text.
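To illustrate the peak-over-threshold idea behind the event selection, a deliberately simplified toy sketch is shown below (a stand-in assumption for illustration, not the actual multi-criteria segmentation algorithm of Huynh et al., 2023):

```python
import numpy as np

def segment_events(q, quantile=0.98, max_gap=24):
    """Toy flood-event segmentation: flag hourly discharges above a high
    quantile of the record, then merge flagged timesteps separated by at
    most max_gap hours into (start, end) index windows."""
    q = np.asarray(q, float)
    threshold = np.quantile(q, quantile)
    above = np.flatnonzero(q > threshold)
    if above.size == 0:
        return []
    events, start, prev = [], above[0], above[0]
    for i in above[1:]:
        if i - prev > max_gap:   # long gap below threshold: close the event
            events.append((int(start), int(prev)))
            start = i
        prev = i
    events.append((int(start), int(prev)))
    return events

# Deterministic hourly series with two well-separated synthetic floods
q = np.full(2000, 5.0)
q[500:520] = 90.0    # flood 1
q[1400:1415] = 70.0  # flood 2
print(segment_events(q))  # [(500, 519), (1400, 1414)]
```

The actual algorithm additionally uses hydrological signatures and multi-criteria checks; the quantile threshold shown here is only the first ingredient.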
"Line 327-333: In these lines (and Figure 7) you compare the NSE for 143 flood events, indicating that the hybrid models perform better. Even if this is true, all the models performed quite badly. For the GR.U and GRNN.U the median NSEs are -0.48 and 0.09, which is a clear indication that the models do not work at all. Just taking the mean of the observed data would yield an NSE of 0. For the other two models, the NSE did improve, but was still quite low (0.19 and 0.37). You should expand the discussion here and try to understand why all models are performing so badly."
We acknowledge that the NSE values computed for the 143 flood events are relatively low across all models. It is important to note that NSE for flood events, which are short time series with high values, is highly sensitive to small timing errors. Even slight discrepancies in peak timing can lead to substantial decreases in NSE. Additionally, data and modeling uncertainties may vary between events, making the accurate prediction of highly contrasted events particularly challenging.
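This timing sensitivity can be illustrated with a toy example (a synthetic Gaussian-shaped hydrograph, not data from the study): a simulation that reproduces the flood shape exactly but shifts the peak by a few hours drops quickly toward the zero-skill level of the observed mean.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / spread of obs about its mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def hydrograph(peak_hour, t):
    """Synthetic Gaussian-shaped flood over a constant base flow."""
    return 5.0 + 95.0 * np.exp(-(((t - peak_hour) / 6.0) ** 2))

t = np.arange(72.0)  # a 72 h flood event window
obs = hydrograph(36, t)
print(round(nse(obs, obs), 3))                            # 1.0: perfect match
print(round(nse(obs, np.full_like(obs, obs.mean())), 3))  # 0.0: mean benchmark
print(round(nse(obs, hydrograph(39, t)), 2))  # 3 h timing error only
print(round(nse(obs, hydrograph(42, t)), 2))  # 6 h error: near zero skill
```

Over short, peaked event series even a shape-perfect simulation with a modest timing offset can score near or below the NSE of the observed mean, which partly explains the low event-scale values.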
The models are calibrated on the entire time series, and in this figure, we evaluate the validation results specifically for flood events, where classical approaches often struggle to accurately estimate water dynamics. This discrepancy highlights the difficulty in capturing the rapid and intense nature of flood events, even with advanced hybrid models.
We also recognize the need to investigate potential sources of error, including input data quality and model structural limitations, and the impact of using a calibration metric based solely on flood events, which could explain the overall challenges in flood event simulation. This will be clarified in the revised version.
"Minor comments."
We greatly appreciate your detailed minor comments and valuable suggestions. We will address each of these comments with a thorough explanation in the revision. Thank you again for your careful and constructive review of our manuscript.
Ngo Nghi Truyen Huynh and Pierre-André Garambois, on behalf of the authors.
References:
Beven, K., 2020. Deep learning, hydrological processes and the uniqueness of place. Hydrological Processes 34, 3608–3613.
Hashemi, R., Brigode, P., Garambois, P.A., Javelle, P., 2022. How can we benefit from regime information to make more effective use of long short-term memory (LSTM) runoff models? Hydrology and Earth System Sciences 26, 5793–5816.
Huynh, N.N.T., Garambois, P.A., Colleoni, F., Javelle, P., 2023. Signatures-and-sensitivity-based multi-criteria variational calibration for distributed hydrological modeling applied to Mediterranean floods. Journal of Hydrology 625, 129992.
Huynh, N.N.T., Garambois, P.A., Colleoni, F., Renard, B., Roux, H., Demargne, J., Jay-Allemand, M., Javelle, P., 2024. Learning regionalization using accurate spatial cost gradients within a differentiable high-resolution hydrological model: Application to the French Mediterranean region. Water Resources Research 60, e2024WR037544.
Shen, C., 2018. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resources Research 54, 8558–8593.
Sit, M., Demiray, B.Z., Xiang, Z., Ewing, G.J., Sermet, Y., Demir, I., 2020. A comprehensive review of deep learning applications in hydrology and water resources. Water Science and Technology 82, 2635–2670.
Citation: https://doi.org/10.5194/egusphere-2024-3665-AC1
RC2: 'Comment on egusphere-2024-3665', Tadd Bindas, 04 Apr 2025
To whom it may concern,
Thank you for including me in the peer review process of your paper. From my understanding, the study is about a spatially distributed differentiable hybrid model which runs diffusive wave routing on a grid. This model is designed to learn hydrological processes at the basin and regional scales, similar to [1] but with the bucket-based rainfall-runoff module connected to the routing component. Further, internal fluxes and hydrological parameters are estimated with neural networks. Results show the differentiable models have better performance, with fluxes and parameters correcting the biases in the buckets.
Overall, I believe this paper is well written and is a novel approach to grid-based differentiable rainfall modeling. I recommend this for acceptance with technical corrections based on the brief comments below.
Best,
Tadd Bindas
Major Comments
- Figure 9: While it’s important to understand the internal fluxes of your system, Figure 9 is confusing. It appears there are no temporal differences in flux per catchment, so the X axis is distracting. Further, the ordering of the fluxes is not clear. I don’t think this figure is required, as a spatial understanding of flux, as shown in Figure 10b, is clearer to the reader.
Minor Comments
- Some of the LaTeX math functions regarding TanH in section 4.2 were messed up when translating to PDF.
Citations:
- Bindas, T., Tsai, W.-P., Liu, J., Rahmani, F., Feng, D., Bian, Y., et al. (2024). Improving river routing using a differentiable Muskingum-Cunge model and physics-informed machine learning. Water Resources Research, 60, e2023WR035337. https://doi.org/10.1029/2023WR035337
Citation: https://doi.org/10.5194/egusphere-2024-3665-RC2
Viewed

| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 381 | 61 | 8 | 450 | 6 | 7 |
Viewed (geographical distribution)

| Country | Views | % |
|---|---|---|
| United States of America | 121 | 28 |
| France | 77 | 17 |
| China | 26 | 6 |
| Germany | 18 | 4 |
| United Kingdom | 15 | 3 |