Testing the Temporal and Spatial Transferability of a Water Balance Model using a Parameter Learning Approach
Abstract. Reliable transferability of hydrological model parameters across time and space remains a challenge for large‑scale water resources assessment. In this study, we investigate whether a differentiable hybrid framework can identify robust and physically coherent parameter sets for annual streamflow modeling across a large‑sample dataset of 3,044 catchments from eight countries. To focus on temporal and spatial transferability analysis, we work at the annual time scale using what we consider to be the simplest possible model: an annual anomaly model of climate elasticities, coupled with the Turc–Mezentsev formulation for the long-term streamflow mean (MQ). A dense neural network is trained in an end‑to‑end fashion to map catchment descriptors to the four model parameters, with gradients propagated through the entire modeling chain.
We evaluate the framework using three cross‑validation settings inspired by Klemeš (1986): temporal, spatial, and combined temporal–spatial cross-validation. As a benchmark, we compare the hybrid model against local, catchment‑by‑catchment linear regressions under temporal cross-validation.
Our results show that, for temporal transferability, our parameter learning approach outperforms local calibration, yielding higher Nash-Sutcliffe efficiency (NSE) values while producing elasticity coefficients that remain within plausible physical ranges, despite lacking explicit parameter constraints. By contrast, spatial transferability reveals a marked limitation: the anomaly component extrapolates well spatially, but regionalizing MQ from descriptors proves difficult, with MQ errors dominating the loss of performance in spatial and spatiotemporal cross-validation. Experiments with random descriptors further show that our parameter learning uses attributes mainly as catchment identifiers in temporal cross-validation but relies on their physical content to sustain spatial transfer, particularly for MQ. Overall, the study demonstrates that simple differentiable hybrid annual models can learn robust and interpretable anomaly parameters, while highlighting MQ regionalization as the main remaining bottleneck for spatially transferable annual streamflow predictions.
Dear authors,
Your paper, which tests the performance of spatio-temporal regionalization using a neural-network-based approach against a simple reference, is interesting and highlights the potential of hybrid modelling for solving existing problems in hydrology. Our main problem is that existing techniques do not perform as expected when validated in space-time; I have had the displeasure of experiencing this myself. This paper sets out to handle the issue at an annual scale rather than the more common daily scale, which is a start at least. However, it would have been more interesting to see the daily or hourly cases (CAMELS datasets exist for this), as a better comparison could then have been drawn with existing techniques. This was noted in the conclusions; maybe in another article, I guess. I do not see any major flaws in the paper, but I do have a concern about the reference model used here. I think this article needs a couple of rounds of revisions, and hence I am asking for a major revision. Nothing that cannot be fixed with relative ease. Below are my general and specific comments. Please respond to them accordingly. I am open to discussion.
General Comments:
Overall rigor and detail are very good. The idea is simple and elegant. However, the reference approach is too simple in my view, and a non-linear one could have been used, as others already use non-linear approaches in regionalization; I see the current comparison as unfair, since it is not state-of-the-art. We already use non-linearity to regionalize model parameters, for example in the Model Parameter Regionalization (MPR) technique of the mesoscale Hydrological model (mHM), which utilizes non-linear relationships between, say, soil type and a reservoir constant for use in the model. I am sure you have heard of it. I would therefore ask: why is the comparison made against a linear model as the reference? Linearity is only one possibility out of infinitely many in the general case. I understand that a non-linear reference goes against model parsimony, but compared to a neural network, the increase in the number of parameters would be negligible. Next issue: there is no mention of the catchment IDs selected for this study. It is important to disclose them so that others can also use them while, of course, citing this study. You could also share your final training and validation datasets along with the code, if possible. The rest looks good. Some minor comments are below.
Specific Comments:
L13: Which gradients are propagated through the whole modeling chain? Please specify.
L58: Model parsimony is mentioned here, but I am curious to see how, later on, a model that goes against this philosophy (i.e., a neural network) is selected and evaluated.
L76-78: For vegetation, I would assume that the relationship of AET to PET in these models takes care of such changes if they are consistent in time (e.g., more vegetation in spring/summer). However, I have never tested myself whether it really brings an increase in performance; in any case, increasing the number of model parameters generally always leads to an improvement. Handling complex groundwater interactions is generally out of the question for such simple models anyhow: the data are generally unavailable (e.g., fissures in karst), and adding such detail to simple models makes them complicated, which defeats the reason simple models are used in the first place. Multiyear hydrological memory (if any exists, e.g., a glacier or a lake) is going to be very local, IMO.
L161-167: A good idea.
L189-191: Taking the average of the gridded values is more correct, but I think the difference from using a single catchment-wide temperature is not large.
L195: Is there a list of all the catchment IDs somewhere?
L223-233: Good description of the equation. Such simplicity is appreciated.
L247-249: Now that I think about it, this is what Hundecha and Bardossy (in 2004, I think) were doing when they regionalized HBV parameters somewhere in Germany. The calibrate-locally-first, fit-a-regression-later approach has been obsolete since then, IMO, although some people still do it.
L250-255: Any idea how many parameters need to be calibrated for this dense neural network?
L264-266: Is it correct to use MSE? Won't catchments with an overall lower precipitation/flow be disadvantaged? Was NSE used as a penalty during learning or not? It is not clear. Using NSE together with MSE does not make sense, since NSE is just a normalized MSE; only NSE is enough, IMO.
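To make my point concrete, here is a minimal sketch (synthetic data, names my own) of why MSE penalizes dry catchments while NSE, being MSE normalized by each catchment's own flow variance, does not:

```python
import numpy as np

def mse(obs, sim):
    return np.mean((obs - sim) ** 2)

def nse(obs, sim):
    # NSE = 1 - MSE / Var(obs): the MSE normalized by the catchment's
    # own flow variability, so low-flow catchments are not disadvantaged.
    return 1.0 - mse(obs, sim) / np.var(obs)

# Toy example: two catchments with the same relative error structure
# but very different flow magnitudes.
rng = np.random.default_rng(0)
obs_wet = 100.0 + 10.0 * rng.standard_normal(50)
obs_dry = 1.0 + 0.1 * rng.standard_normal(50)
sim_wet = obs_wet + 5.0    # 5% bias on the wet catchment
sim_dry = obs_dry + 0.05   # 5% bias on the dry catchment

print(mse(obs_wet, sim_wet), mse(obs_dry, sim_dry))  # differ by a factor of 10,000
print(nse(obs_wet, sim_wet), nse(obs_dry, sim_dry))  # of comparable magnitude
```

A loss summed over catchments in MSE terms is therefore dominated by the wettest catchments, whereas an NSE-based loss weights them comparably.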
L275-279: Is it shown how different the results of these five models are? Very different results would generally suggest that the model setup is possibly incorrect.
L279-285: Were the 77 combinations also run multiple times, and did 8 hidden layers with 192 units per layer come out as the best solution each time? Again, just to be sure that the answer does not change every time the configurations are tested. The batch size is also fixed! I do not know how long the configurations take to run.
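Regarding my earlier question about the number of trainable parameters (comment on L250-255): for a fully connected network with 8 hidden layers of 192 units, the count follows directly from the layer sizes. A sketch (the input dimension of 25 descriptors is my assumption; only the 8 x 192 architecture and the 4-parameter output come from the manuscript):

```python
def dense_param_count(n_in, hidden, n_out):
    """Weights plus biases of a fully connected network."""
    sizes = [n_in] + hidden + [n_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# 8 hidden layers x 192 units (as selected in the paper);
# an input of 25 catchment descriptors is my assumption.
n = dense_param_count(25, [192] * 8, 4)
print(n)  # 265156, i.e. on the order of 3e5 trainable parameters
```

Whatever the exact input dimension, the count is dominated by the 7 hidden-to-hidden blocks of 192 x 192 weights each, so it sits in the hundreds of thousands. Reporting this number in the paper would help readers judge the parsimony argument.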
L295: Let’s hope that the discarding does not lead to surprises later.
L299-301: I think it is implicitly meant that, for spatial validation, the time period is fixed. Maybe mention this explicitly to avoid any confusion.
L335: The table shows extreme thinning of the data, from what I understand. I wonder how much can be learned from such small samples.
L341-350: Maybe a bit more information could be added here, e.g., what is meant by a statistically insignificant regression coefficient.
L356-364: Very good results. Maybe discuss the calibration before the validation? It is not strange that HYBRID TEMP is better than LOCAL TEMP; it is strange that HYBRID FIT and LOCAL FIT are almost the same. I am assuming this is by chance. I mean, HYBRID FIT could have been better here as well.
L372-374: Well, linear regression is used, and it is most probable that the true relationship is non-linear. What if a non-linear regression were used as the reference? It would be interesting to see, for sure. Linear relationships are only one possibility. What if simply using a polynomial of degree 2 or 3 led to performance close to what HYBRID achieves? Linear is too naïve, IMO.
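To illustrate how cheap such a non-linear reference would be: a per-catchment polynomial fit is only a few lines. A sketch with synthetic data (the quadratic anomaly response here is invented purely for illustration and stands in for whatever per-catchment regression the authors used):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic annual anomalies for one catchment: a mildly non-linear
# response of the flow anomaly to the precipitation anomaly.
p_anom = rng.uniform(-1.0, 1.0, 40)
q_anom = 0.8 * p_anom + 0.5 * p_anom**2 + 0.05 * rng.standard_normal(40)

def fit_predict(x, y, degree):
    # least-squares polynomial fit, coefficients in ascending order
    coefs = np.polynomial.polynomial.polyfit(x, y, degree)
    return np.polynomial.polynomial.polyval(x, coefs)

mses = {}
for degree in (1, 2, 3):
    resid = q_anom - fit_predict(p_anom, q_anom, degree)
    mses[degree] = np.mean(resid**2)
    print(degree, mses[degree])
# the degree-2 and degree-3 fits capture the curvature that the linear fit misses
```

The extra one or two coefficients per catchment are negligible next to a neural network's parameter count, so parsimony is not a strong argument against trying this as the reference.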
L374-378: It is good (also as a selling point) and also strange that, although many parameter combinations were possible, the model still stuck to the 0-1 range. I mean, how does the HYBRID approach know what is physical when nobody told it? I can think of one test to check whether the HYBRID model always produces values between 0 and 1: give it random combinations of inputs, including values that are not present in any dataset considered here. The total number of samples should be kept very large, e.g., one billion. Then look at the resulting parameters and check whether they still lie between 0 and 1. If yes, then we are good. If not, then something has to be done.
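The stress test I have in mind could be sketched as follows. Everything here is a placeholder: `hybrid_model` would be the authors' trained descriptor-to-parameter network, not this random-weight sigmoid stub (which is trivially bounded; the point is the procedure, not the stub):

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder for the trained descriptor-to-parameter network: a real
# test would load the authors' model here instead of this stub.
n_descriptors = 12  # assumed; the actual descriptor count is the authors'
W = rng.standard_normal((n_descriptors, 4))

def hybrid_model(x):
    # stub: linear map plus sigmoid, standing in for the real network
    with np.errstate(over="ignore"):
        return 1.0 / (1.0 + np.exp(-(x @ W)))

# Sample descriptor combinations far outside any observed catchment range.
# I suggested ~1e9 samples in the comment above; 1e5 keeps this demo fast.
x = rng.uniform(-100.0, 100.0, size=(100_000, n_descriptors))
params = hybrid_model(x)

in_range = bool(np.all((params >= 0.0) & (params <= 1.0)))
print("all predicted parameters within [0, 1]:", in_range)
```

If the authors' unconstrained network passes this check far outside the training distribution, the claim that it has implicitly learned physical bounds becomes much stronger; if it fails, an explicit output constraint would be warranted.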
L386-390: Interesting. Traditional techniques also fail at the same point. It makes me wonder about Beven's uniqueness-of-place problem.
L396-405: Even more interesting. All this time, we could have switched to anomalies (I do not know how) rather than using the actual discharge. But to compute anomalies we first need the data itself, i.e., observations, which we may not have; this leaves us at the same place where we started. One could test the same for a conceptual daily rainfall-runoff model and see whether the overall Pearson correlation is good; the NSE would of course be bad, as usual.
L511: Here it is mentioned that MSE was used. This information was not clear until now.
L519-521: A crucial use case. I agree.