An argument for parsimony in differentiable hydrologic models
Abstract. Differentiable hydrologic models that use machine learning to infer parameters for process-based models show promise for both prediction and inference. However, these models are often developed with time-varying parameters, despite evidence that such flexibility can undermine physical consistency and yield only marginal predictive improvements over simpler static approaches. In this study, we revisit the comparison between static and dynamic differentiable models across 531 CAMELS-US basins, evaluating key architectural choices: (1) neural network type (multi-layer perceptron (MLP) vs. long short-term memory network (LSTM)); (2) process model configuration (single- versus ensemble-parameter estimation); and (3) comprehensive versus alternative input feature sets. Using the Hydrologiska Byråns Vattenbalansavdelning (HBV) conceptual model, we find that ensemble parameterizations consistently outperform single-parameter configurations, and that static, MLP-based ensembles achieve performance comparable to dynamic, LSTM-based ensembles despite their simpler structure. Additionally, we find that LSTM-estimated parameters rarely exhibit meaningful temporal variability despite their time-varying inputs, and when they do, this temporal variability may reflect hydrologic model equifinality rather than process dynamics. We further show that models using only latitude and longitude as static inputs achieve spatial generalization comparable to models using comprehensive feature sets describing climate, topography, geology, soils, and land cover. Similarly, temporal generalization is retained even when comprehensive features are replaced with physically meaningless values. This indicates static inputs are primarily used as spatial proxies to generalize in space and for site memorization when generalizing in time, rather than representations of physical basin processes. 
Overall, our results support reduced complexity in differentiable hydrologic modeling to provide greater transparency while retaining predictive performance.
Review of HESS Manuscript
“An argument for parsimony in differentiable hydrologic models”
Dear editor, please find attached my review of the manuscript.
1. Scope
The article falls within the scope of HESS.
2. Summary
The authors test how hybrid models composed of a neural network (NN) coupled with HBV behave when the NN provides a static parameterization. They test two NN architectures, an MLP and an LSTM, and show that the performance of statically parameterized hybrid models is competitive with dynamic parameterization.
3. General comments
In general, I really liked the paper. I think it is very well-written, the objectives and the methods used are clear. I just have two major comments.
Comment 1:
Why didn’t you use the same periods as Feng et al. (2022)? You compare your model to theirs, but you used different training and testing periods. One of the advantages of running experiments on CAMELS-US is that you can benchmark against existing models, automatically placing your study in the current literature. But to do this, you should reproduce the other study as closely as possible, and this includes the training/testing splits. Even with different dates you obtained similar performance, but other factors could have played a role. I would suggest that you consider running the experiments for the same periods.
Comment 2:
You are selling hybrids as a way to obtain performance competitive with data-driven models while maintaining an interpretable model. But I think it is fair to ask: how much interpretability are we actually gaining? I ask this as someone who has worked a lot with hybrid models, and who also bought the interpretability argument.
Your models have 16 HBVs acting in parallel, for a total of 320 parameters. Can you really interpret this? If each HBV has 3 or 4 buckets, you are talking about 48 to 64 states. Moreover, you have basin-averaged quantities, so how useful is the information contained in the buckets? What can you tell a stakeholder based on this? That there is a lot of snow in the catchment and that the groundwater is high? This could be the case, but how useful is it? I ask this as a genuine question; maybe it is useful, I just do not know how. In my opinion, basin-averaged conceptual models give interpretability mostly by association, and the physical principles are quite weak. LSTMs also do not provide physical interpretability, but I think the method is quite honest about it and mostly focused on performance.
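To make the bookkeeping behind my concern explicit, a minimal sketch (the head count and per-head parameter count follow from the manuscript's 320 total; the averaging function is my assumption about how the parallel heads are combined):

```python
import numpy as np

N_HEADS, N_PARAMS, N_BUCKETS = 16, 20, 4

# parameter and state counts one would need to interpret
assert N_HEADS * N_PARAMS == 320    # parameters across the ensemble
assert N_HEADS * N_BUCKETS == 64    # internal storages (3 buckets -> 48)

def ensemble_discharge(q_heads):
    """q_heads: (N_HEADS, T) simulated discharge from each parallel HBV.
    Only the average reaches the loss, so the individual heads (and their
    bucket states) are free to take any equifinal combination."""
    return np.asarray(q_heads).mean(axis=0)

# illustrative: 16 heads over 100 time steps collapse to one series
q = ensemble_discharge(np.random.default_rng(0).random((N_HEADS, 100)))
assert q.shape == (100,)
```

The point of the sketch is that interpretation would have to happen on the 16 individual heads, while only their average is ever constrained by data.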
To be clear, this does not invalidate any of the paper's experiments or results, and hybrid models might have other advantages. Moreover, making the models more parsimonious by replacing the LSTM with an MLP would indeed be a good strategy. My only concern is that promising interpretability might be an oversell, because even with a simpler and more parsimonious model, is the conceptual head layer really interpretable?
4. Specific comment
Line 290-296: Why do LSTMs outperform the MLP for the single-HBV case, if both provide a static parameterization? Is there any additional flexibility that the LSTM can provide? During testing there is only one set of parameters for the whole testing series. Are the parameters “better informed” because of the dynamic input series? And why does this difference disappear for the ensemble case? The performance gain from 1 to 16 HBVs is expected (more model flexibility), but why does the gap between the MLP and the LSTM disappear? Further explanation is necessary.
Line 303-304: What about out-of-sample in time? Are these metrics similar?
Line 329-333: I do not share this point.
I agree with what you said before (lines 329-330) that temporal generalization is worse with 2 static inputs, because 2 static inputs are not enough to “fine-tune” the basin's specific behaviour. On the other hand, 30 static attributes allow the MLP and LSTM (I will say NN to refer to both) to further tune the output to the basin response. So for sure the NNs have some memory of which basin they are treating.
What I do not agree with is saying that no meaningful relationships are learned (line 329) from the static attributes. You actually showed the opposite with the second experiment! You showed that if the static attributes are permuted, the performance drops (with respect to the reference) for out-of-sample in space, proving that some meaningful relationships were being learned. Otherwise, why would the performance drop?
The fact that the permutation does not drop performance for out-of-sample in time can be explained as follows: if the static-feature space is large enough (30 dimensions), the NN can still use this space to fine-tune basin-specific behaviours. It basically uses it to memorize where it is and give the best parameters. However, because the static attributes are no longer consistent with the basin characteristics (they were permuted), out-of-sample performance in space drops with respect to the reference (0.42 instead of 0.62). This shows that the reference did interpret the non-permuted attributes in a way that allows it to better generalize in space.
The NNs do not have to interpret the static attributes the same way we do; we are not training them to do that. Moreover, they should not be used for counterfactual checking (e.g. what would happen to runoff if we doubled the area), because again, we are not training them to do that. But they can learn that basins at certain latitudes are more arid, or that bigger areas show bigger discharges.
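For concreteness, here is how I read the permutation experiment, as a minimal sketch assuming whole attribute vectors are reassigned across basins (the function name and array shapes are mine):

```python
import numpy as np

def permute_static_attributes(X, rng):
    """Reassign entire static-attribute vectors across basins.

    X: (n_basins, n_features) array. After permutation, each basin still
    carries a unique 30-dimensional vector usable as an identifier
    (so temporal generalization via memorization survives), but the vector
    no longer matches the basin's physical characteristics (so the learned
    attribute-to-response mapping, and spatial generalization, break).
    """
    perm = rng.permutation(X.shape[0])  # shuffle whole rows, keep vectors intact
    return X[perm]

rng = np.random.default_rng(0)
X = rng.normal(size=(531, 30))          # 531 CAMELS-US basins, 30 attributes
X_perm = permute_static_attributes(X, rng)

# the marginal distribution of each attribute is unchanged...
assert np.allclose(np.sort(X_perm, axis=0), np.sort(X, axis=0))
# ...but basin identity is scrambled
assert not np.array_equal(X_perm, X)
```

This is exactly why the experiment is informative: it separates the identifier role of the attributes (preserved) from their physical role (destroyed).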
This comment also extends to the second point in the Discussion and Conclusions.
Section 3.3 and Figure 7: Is there a reason you are doing these experiments in the training period? I think it is better to keep consistency by doing everything in the testing period. Further down in this section (line 374 onward) you switch to results in the testing period. Why not do everything in testing?
Line 390-393: I do not agree with this conclusion.
The deficiency I see in this set of experiments is that you are approximating dynamic parameters as static parameters that can vary every 60 days. Two months is not a short period: you cannot react to specific events, only to seasonal changes. The flexibility this gives you is a seasonal model with a specific behavior every two months. The dynamic models from Feng et al. (2022) or Acuña Espinoza et al. (2024), on the other hand, can react to specific events.
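To make the point concrete, a minimal sketch of what a 60-day parameter window allows (the block length follows the manuscript; the parameter values are illustrative):

```python
import numpy as np

def blockwise_to_daily(theta_blocks, block_len=60):
    """Expand parameters that are constant over 60-day blocks into a daily
    series. The resulting 'dynamic' signal can only change at block edges,
    so it resolves seasonality but not event-scale variability."""
    return np.repeat(theta_blocks, block_len)

# hypothetical one-year daily series: 6 blocks of 60 days
theta_blocks = np.array([0.2, 0.3, 0.5, 0.4, 0.3, 0.2])
theta_daily = blockwise_to_daily(theta_blocks)

assert len(theta_daily) == 360
# at most 5 change points over the year, vs up to 359 for a daily parameter
assert np.count_nonzero(np.diff(theta_daily)) == 5
```

With so few change points per year, the experiment can only probe seasonal compensation, not event-scale dynamics, which is why I think the conclusion below overreaches.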
For example, the models Delta_1 and Delta_n from Feng et al. (2022) show that when only one HBV is used, the median goes from 0.64 to 0.71, so there is a big performance gain when going dynamic. Feng et al. also show that with 16 HBVs the gain due to dynamic parameterization is much smaller, going from 0.71 to 0.73. This, I would argue, is an indication that what gives the improved performance is the additional model flexibility. Whether that flexibility is physically consistent is another point, and maybe you can argue that it is not. You also showed the same thing: 16 HBVs performed better than 1, because the model is more flexible.
Álvarez Chaves et al. (2026) show this in a really nice set of experiments; I highly recommend this reading. If one has a perfect model, the parameters will be static, and as the model becomes a worse representation of what is happening, the dynamic parameterization starts to compensate for structural deficiencies.
All this is to say that I do not think your experiments can show that “Consequently, the observed temporal variability in LSTM-predicted parameters may be better interpreted as a manifestation of equifinality rather than evidence of dynamically evolving process”, because the 60-day windows you are using do not allow you to focus on dynamically evolving processes.
This comment extends to point 3 of Discussion and Conclusions.
5. Recommendation
Based on the comments above, I recommend acceptance subject to major revisions. As indicated, I think the document is really well done. I was planning to suggest minor revisions, but the timeline could be a bit short for the authors. However, if the editor feels the revisions can be done as minor, I would also support that.
Kind regards,
Eduardo
References:
Acuña Espinoza, E., Loritz, R., Álvarez Chaves, M., Bäuerle, N., & Ehret, U. (2024). To bucket or not to bucket? Analyzing the performance and interpretability of hybrid hydrological models with dynamic parameterization. Hydrology and Earth System Sciences, 28(12), 2705–2719. https://doi.org/10.5194/hess-28-2705-2024
Álvarez Chaves, M., Acuña Espinoza, E., Ehret, U., & Guthke, A. (2026). When physics gets in the way: An entropy-based evaluation of conceptual constraints in hybrid hydrological models. Hydrology and Earth System Sciences, 30(3), 629–658. https://doi.org/10.5194/hess-30-629-2026
Feng, D., Liu, J., Lawson, K., & Shen, C. (2022). Differentiable, Learnable, Regionalized Process-Based Models With Multiphysical Outputs can Approach State-Of-The-Art Hydrologic Prediction Accuracy. Water Resources Research, 58(10), e2022WR032404. https://doi.org/10.1029/2022WR032404