When physics gets in the way: an entropy-based evaluation of conceptual constraints in hybrid hydrological models
Abstract. Merging physics-based with data-driven approaches in hybrid hydrological modeling offers new opportunities to enhance predictive accuracy while addressing challenges of model interpretability and fidelity. Traditional hydrological models, developed using physical principles, are easily interpretable but often limited by their rigidity and assumptions. In contrast, machine learning (ML) methods, such as Long Short-Term Memory (LSTM) networks, offer exceptional predictive performance but are often criticized for their black-box nature. Hybrid models aim to reconcile these approaches by imposing physics to constrain and understand what the ML part of the model does. This study introduces a quantitative metric based on Information Theory to evaluate the relative contributions of physics-based and data-driven components in hybrid models. Through synthetic examples and a large-sample case study, we examine the role of physics-based conceptual constraints: can we actually call the hybrid model "physics-constrained", or does the data-driven component overwrite these constraints for the sake of performance? We test this on the arguably most constrained form of hybrid models, i.e., we prescribe structures of typical conceptual hydrological models and allow an LSTM to modify only their parameters over time, as learned during training against observed discharge data. Our findings indicate that performance predominantly relies on the data-driven component, with the physics constraint often adding minimal value or even making the prediction problem harder. This observation challenges the assumption that integrating physics should enhance model performance by informing the LSTM. More alarmingly, the data-driven component is able to avoid (parts of) the conceptual constraint by driving certain parameters to insensitive constants or value sequences that effectively cancel out certain storage behavior. Our proposed approach helps to analyze such conditions in depth, which provides valuable insights into model functioning, case study specifics, and the power or problems of prior knowledge prescribed in the form of conceptual constraints. Notably, our results also show that hybrid modeling may offer hints towards parsimonious model representations that capture dominant physical processes, but avoid illegitimate constraints. Overall, our framework can (1) uncover the true role of constraints in presumably "physics-constrained" machine learning, and (2) guide the development of more accurate representations of hydrological systems through careful evaluation of the utility of expert knowledge to tackle the prediction problem at hand.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-1699', Georgios Blougouras & Shijie Jiang (co-review team), 08 Jun 2025
General Comments
The manuscript of Álvarez Chaves et al. tackles a highly relevant, important and immediate issue that the hybrid hydrological community is facing. The authors explore and compare the role of the data-driven component across different hybrid models, using entropy-based metrics, in two experiments: a synthetic experiment (known ground truth) and a real-world experiment (provided by the CAMELS-GB dataset). Does the conceptual model, to which the data-driven component is attached, provide any guidance? Or is the data-driven component overpowering it (potentially 'overriding' parts of the conceptual model along the way)?
I found the article very informative. As I mentioned before, the question at hand is highly relevant, and the authors make a great effort in developing a robust methodology to explore it. The methodological contribution is, in my opinion, not limited to the proposed metric, but also extends to the authors’ diligence in exploring everything that happens ‘under the hood’ of their models. I expect the findings to resonate with hybrid hydrological modellers. At the same time, the suggested methodological workflow is of interest to more process-oriented/catchment-scale hydrologists as well, since it offers a way to test and refine the physical representation of their models.
The ‘personal’ language that the authors use (comments in parentheses, quotes inspired by general literature etc.) is also appreciated. It is dynamic and engaging, which enhances the reading experience. The manuscript is, in general, well formatted and easy to follow (with some exceptions - refer to ‘specific comments’), and has clear figures.
However, there are some key conceptual/general understanding concerns (see next paragraphs of this section), as well as some more specific comments (see the relevant section below) that should be addressed by the authors, before the manuscript is ready for publication.
General comment 1: According to the phrasing of the paper in many instances, the readers might be led to believe that hybrid modeling is used here as a way to help the LSTM predictive performance (e.g., L106-107: ‘While purely data-driven… genuinely enhances model performance’, or, L621: ‘...reduce the effort required…’). However, the LSTM’s role in the hybrid architectures explored in the manuscript is not to ‘lead the predictions’, but rather to infer the conceptual model parameters. Therefore, describing the models as ‘physics-constrained’ (L9) might not be the most accurate description - maybe something like ML-enhanced/parameter learning/… differentiable modeling would be more fitting. The change in terminology (although potentially annoying) implies different perspectives regarding the role of the data-driven and the physics-driven components of the model. To be more precise, the manuscript explores if different architectures are ‘helping the LSTM’, even though to begin with, the concept of parameter learning usually reflects the opposite direction (i.e., exploiting the LSTM to help the conceptual model improve its predictions). In my opinion, this creates a ‘conceptual mismatch’ between the perspective of the manuscript and the ‘typical’ way such hybrid models are used. I encourage the authors to clarify if their framing is applicable or not under this context - maybe it might be more accurate to view the LSTM as being ‘constrained’, rather than ‘assisted’, by the conceptual model. In light of this architecture interpretation, it would be helpful for the authors to revisit their interpretation of the entropy-based results, to ensure that the evaluation and subsequent conclusions are aligned with the actual model structure and what it represents.
General comment 2: The manuscript defines the value of hybrid modeling primarily in terms of streamflow prediction, measuring the “utility” of physics constraints by the degree to which they reduce LSTM parameter variability (entropy) for this single task. In my view, this approach risks narrowing the broader motivation for hybrid models, which is not limited to improving runoff prediction or reducing model complexity, but also includes enabling more meaningful and diagnostically useful process representations (e.g., for ET, soil moisture, or system anomalies). I am concerned that by focusing only on this narrow predictive context and a single diagnostic (entropy), the study may misrepresent the broader role of physical constraints in hybrid models. Many would consider physical constraints valuable even when predictive skill does not improve, as they support interpretability and process fidelity. I recommend that the authors more explicitly clarify the intended scope of their analysis and discuss the limitations of generalizing these findings to other goals of hybrid modeling.
General comment 3: In the current setup, the LSTM is only used to infer parameters, and streamflow is the sole observation timeseries used for evaluation. In my view, this setup may further amplify the well-known problem of equifinality in hydrological modeling. From this perspective, I am not convinced that low LSTM entropy necessarily reflects a physically meaningful or faithful conceptual model; rather, it may simply mean that the LSTM has found a convenient way to “satisfy” the runoff constraint, possibly by exploiting compensatory effects among parameters or model components. While the manuscript explores this issue by visualizing the distributions of the LSTM-predicted parameters (e.g., in Fig2 and Fig9), I would encourage the authors to discuss more explicitly how equifinality in particular might influence the interpretation of their results.
Specific Comments
Introduction: Why are you mentioning the modular hydrological frameworks more than once (L31, L47,...)? To my understanding, there is no clear and direct follow-up regarding such efforts in the rest of the manuscript (one could make a conceptual link, but it would still benefit from a clarification by the authors).
For the ‘1.3 Hybrid models’ subsection: To someone unfamiliar with such modelling efforts, I think the current subsection lacks a bit of clarity - why should people care about hybrid models, and how have people employed them in the (recent) past? I think a little bit more context is required in this section (especially given that in the end, from your findings and conclusions, the paper would appeal not only to ‘hybrid modellers’, but also to the general hydrological modeling community). In general, every ‘modeling introduction’ subsection (i.e., 1.1, 1.2, 1.3) should have clear pros and cons of using each modeling type - this is not clear for section 1.3. Furthermore, additional context would be beneficial regarding the hydrological community’s evolution from 1.1 to 1.2 and then 1.3. Why did the LSTM become the ‘benchmark’ for many researchers in the last 5 years? Why do more and more hydrologists attempt to use hybrid models instead of pure data-driven models? Some of these questions are already answered in the text, but in a way that, in my opinion, might not be extremely clear to someone who is not well familiarized with hybrid modeling.
L106: Do you mean ‘data-driven components’? Because hydrological models are inherently ‘physics’-driven models. Otherwise, please revise.
L105-109: This whole text passage is written from the point of view of someone who wants to improve data-driven hydrological models by incorporating physical principles. At the same time, the opposite pathway (trying to improve physical models by using data-driven modules) is also common in hydrology, and can also benefit from your suggested contributions. Why not mention it here as well?
L110-111 (contribution 1): This quantitative metric is not ‘self-standing’, to my understanding, but relative to a ‘benchmark’ model (in this case, the pure LSTM), right? I believe it would benefit the clarity of the manuscript if the authors were explicit about this here.
L114-115 (contribution 3): This suggested contribution is somewhat unclear under the provided context - a reader would need to read the rest of the manuscript first to fully grasp what the authors mean by ‘effective’ and ‘prescribed’. I suggest revising this.
L122: ‘favoring’ -> again, the context is missing at this point for the readers to fully understand what ‘favoring’ the data-driven component means, and I would suggest revising this a little bit to provide more information.
L173: Similar to my point regarding the 1.3 section, I think the authors should provide a little bit more context about why this revival of dynamic parameters has happened in hydrological modeling. I would refer more to Tsai et al. here (https://doi.org/10.1038/s41467-021-26107-z), which is already cited in your paper regardless.
L174 / Figure 1: To my understanding, the different model setups are not yet utilized here, and they do not become fully relevant until section 3 (which could be viewed as a gray zone between a ‘methods’ and a ‘results’ section). It is fine if the model setups remain in Figure 1 and this section, but then the authors should also present what the individual models represent, because there are 5 subfigures in Figure 1, leaving a lot of questions for the readers (especially those not familiar with past group efforts utilizing the SHM and Nonsense models). Otherwise, the authors could move the figure and refer to it in a more relevant section.
NSE* calculation: In the original work, this metric is cross-basin and there was an additional sum in the formula (over all basins) which does not exist here, while Kratzert et al. divided by the number of basins instead of the number of days. Here you use a basin-averaged metric, deviating from the original formula as far as I can see (but please correct me if I am mistaken). Does this imply that the evaluation is done at the basin scale? That would confuse/conflict with the rest of your paper, where you indicate that you train on multiple basins.
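For concreteness, here is how I read the two variants (notation mine, and it may differ from both papers; s(b) is the standard deviation of observed discharge in basin b, and epsilon a small constant):

```latex
% Cross-basin loss of Kratzert et al., as I recall it:
% one average over B basins, summing over the N_b days of each basin b
\mathrm{NSE}^{*} = \frac{1}{B}\sum_{b=1}^{B}\sum_{n=1}^{N_b}
  \frac{\left(\hat{q}_{n,b} - q_{n,b}\right)^{2}}{\left(s(b) + \epsilon\right)^{2}}

% Basin-averaged variant as it appears to be used in the manuscript:
% one average over the N days of a single basin b
\mathrm{NSE}^{*}_{b} = \frac{1}{N}\sum_{n=1}^{N}
  \frac{\left(\hat{q}_{n} - q_{n}\right)^{2}}{\left(s(b) + \epsilon\right)^{2}}
```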
Section 2.3: I am concerned that the entropy metric may depend strongly on the specific architecture and hyperparameters of the LSTM. Have you tested how robust the entropy-based findings are to changes in LSTM design (e.g., number of hidden units) or to using alternative machine learning models? I am curious to what extent the conclusions hold beyond the particular ML setup chosen in this study.
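As a concrete (and purely illustrative) starting point for such a robustness check, the sketch below shows a simple histogram (plug-in) estimate of the differential entropy of a predicted parameter series - my own numpy stand-in, not the UNITE toolbox implementation the manuscript actually uses - applied to two synthetic series that mimic a near-constant and a highly variable LSTM output:

```python
import numpy as np

def histogram_entropy(samples, bins=50):
    """Plug-in (histogram) estimate of differential entropy in nats."""
    density, edges = np.histogram(samples, bins=bins, density=True)
    mass = density * np.diff(edges)  # probability mass per bin
    nz = mass > 0
    # H = -sum_i f_i * w_i * log(f_i) for a piecewise-constant density f
    return float(-np.sum(mass[nz] * np.log(density[nz])))

rng = np.random.default_rng(0)
# Synthetic stand-ins for LSTM-predicted parameter series under two designs:
near_constant = 0.5 + 0.01 * rng.standard_normal(5000)  # low-entropy series
highly_variable = rng.uniform(0.0, 1.0, 5000)           # high-entropy series
print(histogram_entropy(near_constant))    # strongly negative (narrow density)
print(histogram_entropy(highly_variable))  # near 0, the entropy of U(0, 1)
```

One could rerun such an estimate for models trained with, e.g., 32/64/128/256 hidden units and check whether the ranking of the models along the entropy axis is preserved.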
L200: Here I would suggest a small change in the phrasing - maybe emphasize that you move away from analyzing the entropy metric of the predicted parameter values, but that exploring (and visualizing) the LSTM-predicted parameters nevertheless remains an important step in exploring the ‘under-the-hood’ performance of the model. Currently it reads as if you will completely ignore the information from the LSTM-predicted parameters.
L214: Mention that the link for the UNITE toolbox is found in the ‘code availability’ section.
L233 -> Appendix A1: please specify how these parameter ranges can be justified (you mention Beck et al., 2016, 2020, but only later in the text, leaving an open question until then).
Section 3.2: I think it would be beneficial to be more clear as to why you use a stand-alone LSTM model to compare against the hybrid setups (for example, L727-L730 of Appendix B could be mentioned here).
L274: I would say that randomly selecting 5 basins and not repeating the experiment might not convince many large-sample hydrologists - especially if you do not elaborate on these random catchments or their representativeness… One could repeat the exact same experiment by using 5 other randomly selected basins (no need to go too much into detail, providing the results in the supplementary and the logs on the repository is more than enough), or further demonstrate why conducting an experiment on a set of 5 randomly selected catchments is more than enough to derive safe and transferable conclusions.
L362: I think mentioning results about models appearing in the supplementary (and especially figures that contain additional results from models not yet introduced before the supplementary information) can lead to unnecessary complication. Maybe you can remove them from the main figures and present them directly in the supplementary information. Alternatively, if you wish to keep them, some additional information about models 6, 7, and 8 in the main manuscript would be helpful. Furthermore, you could also use specific shapes in Figs. 3 and 4 to represent each type of ‘wrong model’ - this would be practical if you wish to keep models 6, 7 and 8 in the main figures (e.g., over-parameterized models have a ‘square’ shape or a certain color, ‘wrong architecture’ models have a striped color, etc…). This is not necessary to be done, just a suggestion / visualization experiment.
L380: ‘matches the true system’ (and many other instances in section 3 where the word ‘true’ is used) -> I feel like the phrasing is a bit misleading, as ‘true’ here refers to an idealized setup. Of course you have mentioned this in the manuscript, but the word choice is still important, and someone could make the implicit connection that these models can well capture the real-world ‘true’ signal, which is not the case, as we see in section 4.
L383: clarify which LSTM based entropy you refer to here - hidden-state or parameter based, because the distinction is important and it needs to be clear to the readers.
‘On the Complexity of the Prediction Task’ -> I think this can be seamlessly merged with the 3.4 subsection, or go directly into the appendix - I feel like it doesn’t provide enough information as a stand-alone part of the manuscript...
Subsection 3.4 ‘Summary of the proposed approach’ and overall point about Section 3: Here, all the models had an (almost) perfect fit. This makes the evaluation ‘easier’, because now we know if the LSTM had to work ‘overtime’ to ‘save’ the model performance. But how can we evaluate this in cases where the models do not have equally comparable performance metrics? In other words, how do we measure the ‘effort’ by the LSTM to 'save' the streamflow prediction if the final models do not predict streamflow equally well? I am aware you touch on this a little in the next chapter, but I think additional discussion on this matter in Subsection 3.4 would be very helpful - I am sure many readers would have this question.
L454-459: Here, you could mention that a description of these models (SHM, Bucket and Nonsense) can also be found later in this manuscript.
L476: ‘...improving prediction skill’ -> I assume you mean compared to your LSTM-baseline, right? Be more specific as the baseline comparison is important.
L485: ‘These five basins were carefully chosen…’ -> how? Please elaborate. Also, are these 5 basins the ones in Fig. 6? This is not immediately clear. A follow-up question on L519-520: Could this also be related to the hydrological processes involved in these catchments? You do not provide any information regarding the catchments, despite their being carefully chosen (L485). Not all models can fit all catchments, and the manuscript does not explain how these catchments were selected or what their characteristics are… This creates uncertainty for the readers: how do we know that these results are not affected by the catchment selection and specific catchment behavior? In other words, can the authors ensure that their findings are transferable across and beyond the 5 selected catchments? (similar question/point for selecting basin 73014 in L568 and Fig. 9)
L564-L567: You mention the two different components, the first being the low vs. high entropy, and the second being the unaffected vs. suspicious time-varying patterns. Do you think there could be a visual representation of this on a 2-axes plot? I am thinking of something like Figure 4 but in 2-D. In this case, the more you move towards the ‘high’ entropy values, the less important the change in parameters is (you have already mentioned that high entropy indicates ‘struggling due to the imposed constraint’), but the more you move towards the ‘low’ entropy values, you can get a different ‘color’ (if we draw a comparison to Fig. 4), depending on whether we have high or low variability on the parameter axis. This is not necessary, but I thought it might be interesting to try and visualize this important insight - it could help promote and establish the detailed methodology; a rough sketch follows below.
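A minimal matplotlib sketch of the proposed two-axis view, using synthetic placeholder values (the variable names and the choice of standard deviation as the variability measure are my own assumptions, not taken from the manuscript):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_models = 12
# Synthetic placeholders: one point per (model, parameter) combination.
entropy = rng.uniform(0.0, 3.0, n_models)      # entropy of the LSTM-predicted parameter
variability = rng.uniform(0.0, 1.0, n_models)  # e.g. std. dev. of the parameter series in time

fig, ax = plt.subplots()
ax.scatter(entropy, variability)
for i, (x, y) in enumerate(zip(entropy, variability)):
    # Label each point with a hypothetical model identifier.
    ax.annotate(f"M{i + 1}", (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xlabel("Entropy of LSTM-predicted parameter")
ax.set_ylabel("Temporal variability of parameter")
ax.set_title("Sketch: entropy vs. parameter variability (synthetic values)")
plt.show()
```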
L573-574: Wouldn’t the fact that the model has learned behaviors from training on other basins be a good thing? I am confused about the point you aim to make here.
L585-589 and Figure 10: You mention this later on, but it would be nice to already elaborate a bit here on why the Hybrid Nonsense model ranks in the top 2 - it is a logical question that a lot of readers would immediately have.
Fig. 11: It would be helpful to add the % in the bars as well.
General question about section 4: What about a joint analysis of ΔH and ΔNSE? I mentioned this earlier as well, but when trying to simulate real-world catchments, the performance of the models can vary quite a lot - it would then be hard to judge and compare the different entropies across models if the baseline of their predictive performance is not comparable. It would be nice to provide some clarifying perspectives on this while concluding section 4.
L648: I would like to suggest adding the word ‘evaluating’: ‘building **and evaluating** hybrid models’. It is a bit ‘nit-picky’ as a comment, but I think it is an important part of your implications and deserves to be mentioned.
L672: This sentence reads like you imply something along the lines of: ‘process-based modeling of catchment scale streamflow is unnecessary - why go in the long effort of creating or applying these models if LSTMs can be better?’. I know this is not your initial intent, so I would suggest rephrasing in order to avoid confusion.
Appendix B: It would be helpful to add some more context/information about the design of the additional models.
Technical Corrections
L38: ‘Typically catchment scale processes of in a rainfall-runoff…’
L174: ‘...used [in] our case…
L210: us -> is?
L211: you repeat ‘of’
L253: I believe ‘setup’ is just a noun - ‘set [space] up’ should be the verb needed in this context. Maybe I am wrong.
L351: ‘fix’ -> ‘adjust’ might be better fitting?
L408: is there a full stop [.] missing after ‘model’?
L472-475: This sentence is 4 lines long and quite hard to read through - I would suggest splitting it into individual sentences to ensure readability.
General ‘correction’: make sure you adopt a unified style when it comes to capitalizing titles throughout the manuscript. I see some passages have a ‘capitalized’ style (e.g., ‘Comparing Conceptual Constraints on the Entropy Axis’) and some others are more free with capitalizing words (e.g., ‘3.3.2 Measuring entropy of conceptual model parameter space’).
Citation: https://doi.org/10.5194/egusphere-2025-1699-RC1
- AC1: 'Reply on RC1', Manuel Alvarez Chaves, 11 Jul 2025
RC2: 'Comment on egusphere-2025-1699', Anonymous Referee #2, 26 Jun 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1699/egusphere-2025-1699-RC2-supplement.pdf
- AC2: 'Reply on RC2', Manuel Alvarez Chaves, 11 Jul 2025