Do reservoir-influenced gauges need explicit consideration in machine learning models? A case study with Hydra-LSTM

Ruparell, Karan; Yamazaki, Dai; Hunt, Kieran; Cloke, Hannah; Prudhomme, Christel; Pappenberger, Florian; Chantry, Matthew

doi:10.5194/egusphere-2026-2909

Preprints

https://doi.org/10.5194/egusphere-2026-2909

16 Jun 2026

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Abstract. Reservoirs fundamentally alter downstream river flow regimes, decoupling discharge from natural meteorological forcing and challenging standard hydrological prediction. While data-driven models, such as Long Short-Term Memory (LSTM) networks, show promise in regulated catchments, it remains unclear how training data composition across natural and regulated rivers influences model generalisability and behaviour. In this study, we investigate how the presence or absence of reservoir-influenced catchments in training data impacts model performance across different flow regimes and alters the physical drivers the models learn to rely on. Using carefully matched subsets of the CAMELS-GB dataset, we trained separate specialist LSTMs (reservoir and non-reservoir), a pooled Full LSTM, and a multi-headed Hydra-LSTM to investigate whether explicit architectural specialisation offers any advantage over pooled training alone. Models were evaluated on held-out test gauges using standard performance metrics and gradient importance analysis to interpret feature reliance. Our results demonstrate that exposure to reservoir-influenced catchments during training is essential. Models trained exclusively on natural catchments consistently overestimate the mean and variance of regulated flows. Conversely, training exclusively on reservoir-influenced data degrades performance on non-reservoir-influenced rivers (KGE reduction of ≥ 0.1), with importance given primarily to anthropogenic static features, such as abstraction rates, at the expense of precipitation drivers. A single Full LSTM trained on combined data matched the performance of both specialist models in their respective domains, implicitly switching its feature reliance between regimes. The Hydra-LSTM performed comparably to the Full LSTM throughout, indicating that the shared body may act as a regulariser limiting over-specialisation, but that explicit architectural specialisation provides no further benefit under these conditions. We conclude that pooling training data across regimes is a highly effective strategy for general-purpose modelling. However, case studies highlight a fundamental limitation: purely meteorological inputs remain insufficient for predicting flows in heavily managed single-purpose reservoirs, where unobserved human operational decisions dominate the hydrograph.
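As a point of reference for the KGE figures quoted in the abstract, the sketch below computes the Kling-Gupta efficiency in its original 2009 form together with its three components, so that a bias in the mean (beta) or in the variability (alpha) can be read off separately. Whether the preprint uses this variant or a modified one is not stated on this page; the function name and the toy series are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def kge_components(obs, sim):
    """Return (kge, r, alpha, beta) for paired observed and simulated flows.

    Assumes the original Gupta et al. (2009) formulation:
    r = linear correlation, alpha = sd(sim) / sd(obs), beta = mean(sim) / mean(obs).
    """
    if len(obs) != len(sim) or len(obs) < 2:
        raise ValueError("obs and sim must have the same length (>= 2)")
    mu_o, mu_s = mean(obs), mean(sim)
    sd_o, sd_s = pstdev(obs), pstdev(sim)
    if sd_o == 0 or sd_s == 0 or mu_o == 0:
        raise ValueError("KGE is undefined for constant series or zero mean flow")
    cov = sum((o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (sd_o * sd_s)
    alpha = sd_s / sd_o   # variability ratio
    beta = mu_s / mu_o    # bias ratio
    kge = 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return kge, r, alpha, beta


# Toy example (made-up numbers): a simulation that doubles a damped hydrograph.
observed = [2.0, 2.1, 2.0, 2.4, 2.2, 2.0]
simulated = [2 * q for q in observed]
kge, r, alpha, beta = kge_components(observed, simulated)
```

In this toy case the timing is perfect (r equals 1), yet alpha and beta are both 2, which is the kind of overestimated variance and mean the abstract reports for models trained only on natural catchments.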

Received: 21 May 2026 – Discussion started: 16 Jun 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 18 Aug 2026)

RC1: 'Comment on egusphere-2026-2909', Saskia Salwey, 21 Jul 2026

This paper investigates the importance of the type of training sets used to simulate flow in reservoir-influenced catchments using LSTMs. The findings suggest that exposure to reservoir-influenced gauge records is essential for simulating flow in impacted catchments and that models trained only in natural catchments cannot capture the dynamics of a regulated flow regime. Interestingly, the results show that a full LSTM trained on a combined dataset could match the performance of both specialist models. The paper reads well and presents important and interesting findings but would benefit from a clearer explanation of the wider impact of the results as well as a more detailed explanation of where these findings sit in the broader literature. Below are some suggestions for how the paper might be improved.
Major comments:
The novelty of this study and its research questions should be better situated in the context of the existing literature. At the moment it is hard to understand where the authors' research questions have come from and why these are questions that need to be answered. I would assume that a model trained on reservoir data only will not perform well in natural catchments (and vice versa), so could you better explain why this needs to be tested? Perhaps previous work has suggested that this is not the case? Or perhaps the novelty lies instead in understanding whether pooling reservoir and natural gauge data into a training set could be a more efficient approach? Paragraph 3 in the introduction could be a good place to discuss this in more detail.
I really like how the authors have chosen two case studies to focus on; this works very well. However, it might be useful to add a third case study to this section as a middle ground between the two the authors have currently presented. Currently it seems that at Vyrnwy neither the reservoir nor non-reservoir models are very good (I can see that the reservoir model has made improvements but the scores are still relatively low), and at Gwyfrai both models perform reasonably well. Could the authors perhaps pick a third case study where the reservoir model clearly makes large improvements on the non-reservoir model by recreating some of these very ‘reservoir’ behaviors that non-reservoir models simply can’t recreate?
I realize this is not the focus of the manuscript, but I think this analysis would benefit from some discussion about the performance of the authors' various LSTM models in comparison to alternative hydrological modelling approaches. I admit that I am biased because I spent quite a lot of time trying to improve the simulation of reservoir-impacted catchments in a semi-distributed hydrological model, but I think that a comparison to the wider literature could be useful nonetheless. As an example, in Salwey et al. (2024) we simulate flow at Vyrnwy and improve the NSE from -1.23 to 0.28 by including simple reservoir operating rules, but similarly to the results in Figure 3 we can’t recreate the nuances of these specific rules enough to improve the simulations further. I’d be interested in some commentary on the pros and cons of the approach adopted in this paper in comparison to others in the literature. I think this would help users in the tricky process of selecting which model to use where!
It would be nice to include more discussion on the relevance of this paper's findings for society, and for modelling practices more generally. At the moment the technical results are clear but the manuscript would benefit from better explaining the implications of its results.
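For readers comparing the NSE values quoted for Vyrnwy in the comments above with the KGE scores used in the manuscript, a minimal sketch of the Nash-Sutcliffe efficiency follows. The function name and the toy data are illustrative assumptions and are not taken from either study.

```python
from statistics import mean


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - (squared error of sim) / (squared deviation of obs from its mean)."""
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and of equal length")
    mu_o = mean(obs)
    denom = sum((o - mu_o) ** 2 for o in obs)
    if denom == 0:
        raise ValueError("NSE is undefined when observations are constant")
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - num / denom


# Made-up flows illustrating the three reference cases.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = nse(observed, observed)                      # simulation equals observation
mean_only = nse(observed, [3.0] * 5)                   # constant prediction at the observed mean
biased = nse(observed, [q + 3.0 for q in observed])    # right shape, large constant bias
```

A negative score, such as the -1.23 baseline mentioned above, means the simulation has a larger squared error than a constant prediction at the observed mean, which is what the biased toy series reproduces.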
Minor comments:
In general I really liked how the authors visualized the paired catchments in Figure 1, but I was slightly confused about which catchment boundaries have been marked on the map? How have these been chosen? Also, on L108 the authors mention that Figure 1 shows the catchment elevation but it is not clear to me how this can be read off the map.
I think it’s interesting how none of the models can recreate the constant flow releases present in the Vyrnwy timeseries. Could the authors comment on why they think the model has not been able to learn this behavior? In my experience these plateaus on the hydrograph (which are also reflected in the flow duration curves) are very iconic of reservoir-impacted catchments!
L311 there is a typo here.
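The link between hydrograph plateaus and the flow duration curve raised in the comments above can be illustrated with a short sketch. It ranks flows in descending order and assigns exceedance probabilities using the Weibull plotting position, then reports the share of time steps spent at the single most frequent flow value as a crude plateau indicator. The plotting position, the rounding tolerance, and the toy series are assumptions made for illustration only.

```python
from collections import Counter


def flow_duration_curve(flows):
    """Return [(exceedance_probability, flow)] with flows sorted from high to low.

    Uses the Weibull plotting position p = rank / (n + 1).
    """
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    return [((i + 1) / (n + 1), q) for i, q in enumerate(ranked)]


def plateau_fraction(flows, decimals=2):
    """Fraction of time steps at the most frequent flow value (after rounding)."""
    counts = Counter(round(q, decimals) for q in flows)
    return counts.most_common(1)[0][1] / len(flows)


# Made-up series: a constant release of 0.5 interrupted by a few higher flows.
regulated = [0.5, 0.5, 0.5, 3.2, 1.8, 0.5, 0.5, 0.5, 2.4, 0.5]
fdc = flow_duration_curve(regulated)
share_at_release = plateau_fraction(regulated)
```

On a flow duration curve, a constant release appears as a flat segment spanning a wide range of exceedance probabilities; in the toy series the last seven points of `fdc` all carry the same flow.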

Citation: https://doi.org/10.5194/egusphere-2026-2909-RC1
CC1: 'Comment on egusphere-2026-2909', Jesús Casado Rodríguez, 24 Jul 2026

General Evaluation
The manuscript trains different deep learning models on the CAMELS-GB dataset, searching for a procedure that is able to reproduce both natural and regulated catchments. It creates a paired dataset of natural and regulated catchments and trains two different architectures (LSTM and Hydra-LSTM) on three different data samples: all catchments, only regulated catchments, or only natural catchments. The results indicate that special architectures for reservoirs are not necessary, but the model must be exposed to regulated catchments.
The paper is interesting, as it is an open question how to model regulated catchments in the typical lumped structure of the LSTM models used in hydrology. The comparison of models using a paired dataset is very interesting for this benchmarking. However, the methods and data used in the paper are not well explained. The two model architectures should be explicitly defined, and their differences should be made clear. There is no reference to what dynamic inputs are used in the models. The results lack an analysis of model performance against degree of regulation, which would help assess whether these specific results in Great Britain could be extrapolated to different hydrological regimes.
I recommend minor revision before publication.
Major comments
Lines 89-93. I understand the logic behind the pairing of the reservoir-influenced and natural catchments; the idea is to provide model trainings with the same amount of data. However, this procedure is limiting the amount of data the model is trained on. My understanding is that the capacity of deep learning models relies on the amount of data, compute and size of the model. Isn't this reduction of data availability limiting the capacity of the models? I understand that this is done here to make a fair comparison among models, but it could probably be mentioned. Even further, the Full LSTM or Hydra-LSTM: Main could be trained on the whole dataset.
Section 2.1 Data Sources and Preprocessing. There is no mention of the dynamic inputs in the dataset. There could be an Appendix B defining them. Are they just meteorological forcings from ERA5? Are there any temporal encoders used to help reproduce reservoir operations? Depending on the reservoir use, the operations are driven by temporal features. For instance, hydropower generation will be correlated with working days, but irrigation or water supply reservoirs have clear seasonal patterns. How are these models supposed to learn these patterns? Could these temporal encoders be fed directly to the reservoir head?

In connection with the following comment about the Methods, it would be interesting to explain where in the model the dynamic and static inputs are fed. According to the original Hydra-LSTM reference (Ruparell et al., 2025), this architecture can receive inputs directly on the head. Is that the case for the Hydra-LSTM:Reservoir head?
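One common way to supply the temporal information discussed in this comment is to encode calendar position as cyclical features. The sketch below is an illustrative assumption and not a description of the inputs used in the manuscript: it maps a date to sine and cosine terms for day of year plus a working-day flag, with hypothetical function and variable names.

```python
import math
from datetime import date


def temporal_features(d):
    """Return (sin_doy, cos_doy, is_working_day) for a datetime.date.

    Day of year is mapped onto a circle so that 31 December and 1 January are
    neighbours; a fixed 365.25-day period is assumed. The working-day flag is
    Monday to Friday and ignores public holidays.
    """
    doy = d.timetuple().tm_yday
    angle = 2.0 * math.pi * doy / 365.25
    is_working_day = 1.0 if d.weekday() < 5 else 0.0
    return math.sin(angle), math.cos(angle), is_working_day


new_year = temporal_features(date(2026, 1, 1))     # a Thursday
year_end = temporal_features(date(2025, 12, 31))   # a Wednesday
saturday = temporal_features(date(2026, 7, 4))     # a Saturday
```

The sine and cosine pair captures the seasonal pattern mentioned for irrigation and water supply reservoirs without a discontinuity at the turn of the year, while the working-day flag targets the weekly cycle mentioned for hydropower.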
Section 3.1 Models Trained. This section is a bit convoluted. The hypotheses behind the Hydra-LSTM models are introduced before the actual models. Even if there are citations to the specific models, I would briefly introduce the LSTM and Hydra-LSTM architectures and the particularities of the Hydra-LSTM model, so unfamiliar readers can follow. I would use a list or bullet points to clarify the hypotheses. A figure summarizing the structural differences among models would be helpful.
Section 5 Discussion. As I will detail in the minor comments, the paper lacks an analysis of how heavily regulated the catchments in the dataset are, and how the different models perform in the most regulated samples. Modelling lightly regulated catchments, even in the presence of reservoirs, is relatively easy for a general (Full) model, as the natural streamflow is not dramatically affected, so the performance metrics may be relatively good. The challenge is modelling heavily regulated catchments. Given the climate and orography of Great Britain, this may not be the best case study for heavy regulation, which does not invalidate the results, but it should be specified that extrapolation to other climates and regimes is not certain.
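The analysis requested in this comment could be sketched as follows, assuming the common definition of degree of regulation as upstream storage capacity divided by mean annual flow volume. The class thresholds, the function names, and all numbers are hypothetical and serve only to show the shape of the calculation; they are not values from the manuscript or from Shrestha et al. (2024).

```python
from statistics import median

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def degree_of_regulation(storage_m3, mean_flow_m3s):
    """Upstream storage capacity divided by mean annual flow volume (dimensionless)."""
    if mean_flow_m3s <= 0:
        raise ValueError("mean flow must be positive")
    return storage_m3 / (mean_flow_m3s * SECONDS_PER_YEAR)


def median_score_by_class(gauges, edges=(0.1, 0.5)):
    """Group gauges into low / medium / high regulation and return the median score per class.

    `gauges` is an iterable of (storage_m3, mean_flow_m3s, score) tuples. The
    class edges are arbitrary illustrative thresholds, not published values.
    """
    groups = {"low": [], "medium": [], "high": []}
    for storage, flow, score in gauges:
        dor = degree_of_regulation(storage, flow)
        label = "low" if dor < edges[0] else "medium" if dor < edges[1] else "high"
        groups[label].append(score)
    return {label: median(scores) for label, scores in groups.items() if scores}


# Made-up gauges: (upstream storage in m3, mean flow in m3/s, KGE).
toy_gauges = [
    (1.0e6, 5.0, 0.85),
    (2.0e6, 8.0, 0.80),
    (6.0e7, 6.0, 0.60),
    (1.2e8, 3.0, 0.20),
    (9.0e7, 2.0, 0.35),
]
summary = median_score_by_class(toy_gauges)
```

Reporting a per-class median in this way would show whether a good overall median is driven mainly by lightly regulated gauges.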
Minor comments
Line 28. Incomplete citations. The date is missing in "UK Centre for Ecology & Hydrology", and the cite "of Civil Engineers, 2015" is incomplete.
Line 35. Incomplete citation in "Yoshimi et al."; the date is missing.
Line 38. Reservoir operations may not be readily available in Great Britain, but national datasets exist in other countries like ResOpsUS (https://zenodo.org/records/6612040) or ResOpsBR+CARS (https://zenodo.org/records/16096623).
Line 54. I reckon there are more relevant references to process-based reservoir schemes:
Hanasaki, N., Kanae, S., & Oki, T. (2006). A reservoir operation scheme for global river routing models. Journal of Hydrology, 327(1–2), 22–41. https://doi.org/10.1016/j.jhydrol.2005.11.011 

Haddeland, I., Skaugen, T., & Lettenmaier, D. P. (2006). Anthropogenic impacts on continental surface water fluxes. Geophysical Research Letters, 33(8). https://doi.org/10.1029/2006GL026047 

Zajac, Z., Revilla-Romero, B., Salamon, P., Burek, P., Hirpa, F., & Beck, H. (2017). The impact of lake and reservoir parameterization on global streamflow simulation. Journal of Hydrology, 548, 552–568. https://doi.org/10.1016/j.jhydrol.2017.03.022 

Turner, S. W. D., Steyaert, J. C., Condon, L., & Voisin, N. (2021). Water storage and release policies for all large reservoirs of conterminous United States. Journal of Hydrology, 603. https://doi.org/10.1016/j.jhydrol.2021.126843 

Hanazaki, R., Yamazaki, D., & Yoshimura, K. (2022). Development of a Reservoir Flood Control Scheme for Global Flood Models. Journal of Advances in Modeling Earth Systems, 14(3). https://doi.org/10.1029/2021MS002944 

Salwey, S., Coxon, G., Pianosi, F., Lane, R., Hutton, C., Bliss Singer, M., McMillan, H., & Freer, J. (2024). Developing water supply reservoir operating rules for large-scale hydrological modelling. Hydrology and Earth System Sciences, 28(17), 4203–4218. https://doi.org/10.5194/hess-28-4203-2024 

Shrestha, P. K., Samaniego, L., Rakovec, O., Kumar, R., Mi, C., Rinke, K., & Thober, S. (2024). Toward Improved Simulations of Disruptive Reservoirs in Global Hydrological Modeling. Water Resources Research, 60(4). https://doi.org/10.1029/2023WR035433 

Line 81. Shouldn't the Scottish Environmental Protection Agency publication be cited?
Figure 1. It could be interesting to show reservoir size and degree of regulation in this figure. It would help to understand the level of regulation of British rivers for comparison against other datasets, or to correlate degree of regulation with model performance---probably the higher the regulation the harder to model. If that's the case, the results in GB may not be applicable to heavily regulated rivers such as those in (semi)arid climates.
Line 147. To be more consistent with the LSTM models, I would name this Hydra-LSTM: Full. That would clarify the difference between Hydra-LSTM: Full and the specialists Hydra-LSTM: Res-Head and Hydra-LSTM: NonRes-Head. I understand that both models have the same structure, first Full is trained on the whole dataset. The body is frozen and only the head is trained independently on the two datasets.
Line 156. Improve readability as it is a bit repetitive.
Lines 163-167. This is a very interesting setup. I think it would be interesting to further explain the benefits of training sequence-to-sequence. Why is the window size 90? It feels like a short window, only looking at 3 months of data. Think about reservoirs whose degree of regulation is of one or a few years.
Line 227. It would ease reading if the notation was consistent. The "Hydra reservoir head" was previously called "Hydra-LSTM:Res-Head".
Line 247. How is the head in Hydra-LSTM Reservoir Head trained? Is the body frozen and the head trained from scratch? Or is the whole model fine-tuned? It is not specified.

The fact that Hydra-LSTM Res-head can't really reproduce better reservoir behaviour may imply that it didn't learn during this finetuning/retraining process. This seems to be a problem in both specialists' heads.
Table 3. It is not clear to me what it means that "metrics were computed over a 90-day rolling window". Since this is a hindcast model, shouldn't it produce the complete time series for the test period and compare it against observations?
Figure 2. The labels in the legend are not consistent with the naming in the paper. It's understandable which model each label refers to, but it hinders readability. It's not clear what "each point in the curve representing the average KGE in a different gauge" means; there aren't points in the figure; what's the "average KGE" of a gauge? There is a typo in "-a minimum KGE", remove the hyphen.
Lines 256 and 265. It would be interesting to show the degree of regulation or the degree of disruptivity of these two reservoirs to get a clearer picture of how much they regulate streamflow. Definitions of these two attributes can be found in Shrestha et al. (2024), indicated above. In their work, they come up with thresholds for these two values indicating reservoirs whose disruptivity is so low that they don't affect river discharge downstream. It could be the case that Gwyfrai is a non-disruptive reservoir, which would explain why all models perform similarly well.
Line 278. The sentence "percentage of the upstream catchment area that is inland water" is not clear. Does it mean the percentage of the catchment area that is regulated by reservoirs?
Lines 283-285. The fact that the relevant static features are almost identical is expected as the models share most of the parameters, i.e., the body. It could also be an indicator that the training of the heads didn't work.
Line 286. There is an issue in this reference. Is it a Figure or a Table? It was first referenced as Table 5, but here it's referenced as a Figure, and the actual tables are named as a Figure.
Lines 294-295. This statement needs an analysis of the model performance versus degree of regulation (or another metric of human intervention). It could be that the median/average performance is good enough because the majority of reservoir-influenced gauges are not heavily regulated, as is the case for Gwyfrai.

As a humid country, my guess is that the reservoir regulation in GB is not very strong. That doesn't mean that these results can be extrapolated to arid or semi-arid climates, where rivers are strongly regulated.
Figure 5. I think that reorganizing the table columns by type of model (LSTM vs Hydra) would help visualize that the feature importance is similar within each type of model, particularly for the Hydra models.
Line 305. Could you extract these worse-performing cases and analyse the degree of regulation? Are there good-performing gauges with a high degree of regulation?
Line 311. Typo. Replace "gayges" by "gauges".
Lines 343-345. To understand this, the Hydra architecture should've been mentioned in the Methods. The Hydra-LSTM has not been introduced, particularly the changes compared to the LSTM.
Line 350. Typo. "Howeverm".
Lines 363-365. The covariate for this pattern analysis is not only geographic location as an indicator of climate or geology, but also the level of human regulation. As mentioned before, the paper lacks an analysis of the model performance related to the level of regulation.
Line 365. Typo. "indivdual".
Lines 377-378. Could the reservoir operations be learned from data, instead of being dynamic inputs? That data is available in some countries (ResOpsUS, ResOpsBR+CARS), or satellite estimates could be used.
Lines 387-388. I would restrict this statement to (semi)humid climates like Great Britain. Extrapolation to (semi)arid climates, where rivers are heavily regulated, is to be tested, as mentioned already in Line 396.
Line 399. I'm not sure that the cite (Mason and Dance, 2026) is relevant in the field of remote sensing for reservoir modelling. There is plenty of recent literature in that field:
Schwatke, C., Dettmering, D., Bosch, W., & Seitz, F. (2015). DAHITI - An innovative approach for estimating water level time series over inland waters using multi-mission satellite altimetry. Hydrology and Earth System Sciences, 19(10), 4345–4364. https://doi.org/10.5194/hess-19-4345-2015

Pekel, J. F., Cottam, A., Gorelick, N., & Belward, A. S. (2016). High-resolution mapping of global surface water and its long-term changes. Nature, 540(7633), 418–422. https://doi.org/10.1038/nature20584

Schwatke, C., Dettmering, D., & Seitz, F. (2020). Volume variations of small inland water bodies from a combination of satellite altimetry and optical imagery. Remote Sensing, 12(10). https://doi.org/10.3390/rs12101606

Donchyts, G., Winsemius, H., Baart, F., Dahm, R., Schellekens, J., Gorelick, N., Iceland, C., & Schmeier, S. (2022). High-resolution surface water dynamics in Earth’s small and medium-sized reservoirs. Scientific Reports, 12(1). https://doi.org/10.1038/s41598-022-17074-6

Khandelwal, A., Karpatne, A., Ravirathinam, P., Ghosh, R., Wei, Z., Dugan, H. A., Hanson, P. C., & Kumar, V. (2022). ReaLSAT, a global dataset of reservoir and lake surface area variations. Scientific Data, 9(1). https://doi.org/10.1038/s41597-022-01449-5

Hao, Z., Chen, F., Jia, X., Cai, X., Yang, C., Du, Y., & Ling, F. (2024). GRDL: A new global reservoir area-storage-depth data set derived through deep learning-based bathymetry reconstruction. Water Resources Research, (60). https://doi.org/10.1029/2023WR035781

Hou, J., van Dijk, A. I. J. M., Renzullo, L. J., & Larraondo, P. R. (2024). GloLakes: Water storage dynamics for 27000 lakes globally from 1984 to present derived from satellite altimetry and optical imaging. Earth System Science Data, 16(1), 201–218. https://doi.org/10.5194/essd-16-201-2024

Citation: https://doi.org/10.5194/egusphere-2026-2909-CC1

Short summary

Reservoirs change how rivers behave, making them harder to predict. We tested whether machine learning models can learn these effects by training on river sites with and without upstream reservoirs across Great Britain. Models without reservoir training overestimated regulated flows, while reservoir-only models performed poorly on natural rivers. Training on both performed well everywhere. All models failed at heavily managed reservoirs where human decisions dominate river flow.

