Benchmarking reservoir operation schemes for large-scale hydrological models
Abstract. There are approximately 62,000 large dams worldwide that significantly alter the hydrological regimes of most major rivers. Despite their importance, reservoirs remain poorly represented in Large-Scale Hydrological Models (LSHMs) due to the complexity of human-driven operations and a widespread lack of observational records. Consequently, reservoir routines in LSHMs must balance structural simplicity with limited data requirements. In this study, we utilize the ResOpsUS dataset to benchmark four reservoir routines of increasing complexity: LISFLOOD, CaMa-Flood, mHM, and STARFIT. We evaluate these routines across 164 reservoirs in the United States and test which target variables are most informative for parameter estimation. Our results indicate that the mHM routine consistently achieves the highest performance; however, its dependence on site-specific demand data limits its applicability at the global scale. In contrast, the CaMa-Flood routine provides a robust compromise, significantly outperforming the linear logic of LISFLOOD while maintaining parsimonious data requirements. Crucially, we find that calibrating to reservoir storage is more informative than calibrating to outflow, as it effectively captures the dynamics of both state variables. This finding paves the way for the use of satellite-derived storage products in the calibration of LSHMs. The findings of this study have been implemented in the upcoming versions of the European and Global Flood Awareness Systems (EFAS v6 and GloFAS v5).
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-904', Saskia Salwey, 23 Mar 2026
- RC2: 'Comment on egusphere-2026-904', Anonymous Referee #2, 30 Mar 2026
In this study, the authors evaluate four reservoir operational schemes, LISFLOOD, CaMa-Flood, mHM and STARFIT, to identify which scheme is the most robust for large-scale hydrological models. To do this, they evaluate 164 dams in the CONUS domain and determine which target variables are the most informative for model calibration. The results demonstrate that mHM performs the best but requires site-specific data, which is not always accessible at the global scale. CaMa-Flood ultimately provides the best trade-off, combining lower data requirements with accurate reservoir storage levels. The authors also find that reservoir models should be calibrated against reservoir storage rather than reservoir outflow.
Major Comments:
The cut-offs for DOR and DOD are not well defined. I would be interested to know more about why these exact values were used. I would also be interested to know whether these values cut dams whose main purposes are not labeled as Hydropower, as I would assume any dam focused primarily on providing hydropower regulation would have a lower DOR. Based on Figure 1, it looks like the majority of the dams that were cut correspond to hydropower dams in the eastern US. Including the dam purpose would give a broader idea of the purposes the study covers.
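For context, a minimal sketch of how such a screening threshold is typically computed, assuming DOR is defined as the ratio of total storage capacity to mean annual inflow volume (a common definition in the reservoir literature). The variable names, synthetic inflow series, and the 0.1 cut-off below are hypothetical illustrations, not values taken from the manuscript:

```python
import numpy as np

def degree_of_regulation(capacity_m3: float, inflow_m3s: np.ndarray) -> float:
    """Degree of regulation: storage capacity divided by mean annual inflow volume.

    capacity_m3 -- total reservoir storage capacity [m^3]
    inflow_m3s  -- daily inflow series [m^3/s]
    """
    seconds_per_year = 365.25 * 86400
    mean_annual_volume = np.mean(inflow_m3s) * seconds_per_year  # [m^3/yr]
    return capacity_m3 / mean_annual_volume

# Synthetic 10-year daily inflow and a hypothetical 5e8 m^3 reservoir.
rng = np.random.default_rng(0)
daily_inflow = rng.gamma(2.0, 50.0, size=3650)
dor = degree_of_regulation(5.0e8, daily_inflow)

# Illustrative screening: keep reservoirs that noticeably alter the flow regime.
# The 0.1 threshold is an assumption for this sketch, not the paper's value.
keep = dor >= 0.1
```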
I really like that you published your updated version of ResOpsUS+CARS. I imagine it will be of great use in further evaluations of reservoir models.
The calibration of storage and outflow is mentioned in L60 as well as in Table 1 and L165; however, I would be curious to know what this decoupled calibration looks like. It seems that the reservoir model is calibrated independently of the hydrological model, but without the explicit link I am unsure. I would add a brief description in the article and perhaps a flow chart in the Appendix. I would also be interested in how the actual calibration of the desired parameters was done per model.
The inclusion of KGE for outflow and storage is an informative addition. I would be interested to know how the combined outflow-and-storage KGE was calculated: is it simply the average of the outflow and storage KGEs? Additionally, the KGE components could be interesting to show, as they might say more about the dynamics with regard to variability and bias in these schemes. Looking at the storage and streamflow KGE components would also reveal whether storage on average fits better with respect to one of the components, which could strengthen the argument that storage calibration is more important than streamflow calibration.
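To make the question concrete, here is a minimal sketch of the standard KGE decomposition (Gupta et al., 2009) applied separately to outflow and storage, with an arithmetic-mean combination shown purely as one possible reading of a combined score; whether the paper actually averages the two is exactly what is being asked. All series below are synthetic placeholders:

```python
import numpy as np

def kge_components(sim: np.ndarray, obs: np.ndarray):
    """Kling-Gupta efficiency and its three components:
    r (correlation), alpha (variability ratio), beta (bias ratio)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return kge, r, alpha, beta

# Synthetic observed/simulated outflow and storage for illustration only.
rng = np.random.default_rng(1)
obs_q = rng.gamma(2.0, 50.0, 3650)                      # observed outflow
sim_q = 1.1 * obs_q + rng.normal(0.0, 5.0, 3650)        # biased, noisy simulation
obs_s = np.cumsum(rng.normal(0.0, 1.0, 3650)) + 100.0   # observed storage
sim_s = obs_s + rng.normal(0.0, 2.0, 3650)

kge_q, *_ = kge_components(sim_q, obs_q)
kge_s, *_ = kge_components(sim_s, obs_s)
kge_combined = 0.5 * (kge_q + kge_s)  # assumed averaging; not confirmed by the paper
```

Reporting r, alpha, and beta separately per scheme would show whether, for example, storage errors are dominated by bias while outflow errors are dominated by missing variability.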
I really like your conclusions on L312-315 with respect to incorporating remotely sensed data for calibration; however, remotely sensed storage time series also contain errors, for example the overestimation of low storage values. It would be interesting to know whether the authors also looked at model calibration with remotely sensed data (perhaps using the data discussed in Appendix A). If so, I would be interested in any biases they found when comparing this calibration to calibrations done on direct observations of reservoir storage.
The authors mention that mHM performs the best yet is greatly limited by its reliance on demand time series. The discussion goes on to note that in data-rich regions demand can be inferred using machine learning. I would be interested to know the authors' opinion on using modelled demand from other large-scale hydrological models and whether that would be a suitable replacement in data-scarce regions. If so, would the recommendation be to include demand in more generic reservoir schemes?
Minor Comments:
Figure 4: The dashed blue line in panel b is hard to see. I would recommend making it wider or perhaps using another color.
In Line 400 there is a floating comma.
Appendix A: I believe the title should read Evaluation of GWW Storage Estimates in lieu of Evaluatin of GWW storage estimates
Citation: https://doi.org/10.5194/egusphere-2026-904-RC2
Data sets
ResOpsUS+CARS: Reservoir Operations US and CAtchment and Reservoir Static attributes Jesús Casado-Rodríguez, Juliana Disperati, and Peter Salamon https://doi.org/10.5281/zenodo.15978041
Model code and software
Reservoirs in LSHM Jesús Casado-Rodríguez https://github.com/casadoj/reservoirs-LSHM.git
- RC1: 'Comment on egusphere-2026-904', Saskia Salwey, 23 Mar 2026
This study benchmarks the performance of four reservoir operation schemes across the US in a large-scale hydrological model. The manuscript compares four different calibration strategies, finding that calibrating to reservoir storage is more informative than calibrating to reservoir outflow. I think that the results of this study are very important for the modelling community and the take-home messages should be carefully considered by anyone incorporating reservoirs into large-scale hydrological modelling. In fact, I have been hoping someone would publish a study similar to this for a while so thank you! In general the manuscript was very clear and well-written, but below I have left a few suggestions for how I think it could be improved.
It would be interesting to hear more about your model calibration strategy (as introduced on L60). It is not super clear to me how the specific details of the calibration worked. Am I right in thinking that you used the ‘default’ reservoir parameters from the literature listed in Tables 2, 3 and 4 and then calibrated the non-reservoir parameters around these? If so, considering that the results using the default reservoir parameters often failed to capture the storage dynamics well, do you think that the non-reservoir parameters were calibrated in a way which means they overcompensate for poorly represented reservoir processes? Did you compare the selected non-reservoir parameters to values used in natural catchments or the literature to see whether they were physically realistic? Perhaps you can elaborate on this a bit.
Finding that calibrating the model to reservoir storage is more informative than calibrating to outflow is a really interesting (and useful!) result. I am pleased that your results suggest we may be able to utilize satellite data for model calibration, but I wonder whether you should demonstrate this in the manuscript. Did you try integrating satellite data (e.g. the data you discuss in Appendix A) into some of your reservoir storage calibration experiments? If not, I think this would be a very valuable addition to Appendix A or the manuscript. If this paper is going to advocate for this possibility, it would be nice to showcase it, particularly because in many places storage data like that in ResOpsUS is not available. It would be interesting to know how differences in satellite-derived storage impact the results.
How were the thresholds for DOR and DOD selected to define a significantly altered natural flow regime?
At some point (even if in an appendix) I would be interested to hear about the breakdown of the individual KGE components. Did all aspects of the metric perform similarly or were there some that were always high or always low? How did this vary across reservoirs of different types?
Could Figure 1 also show the primary purpose of the reservoirs? This seems important for the operations. Would it be possible to add some analysis somewhere describing how the KGE performance varied across reservoirs of different types? This could link nicely to the discussion in Section 5.2. I think it would be useful for readers to understand whether the results of this study would apply to other locations where perhaps there is a different distribution of reservoir types.
I think one of the most interesting results in this paper is on L356 where you state that STARFIT was not markedly superior to CaMa-Flood which is far simpler. There is often an assumption in our field that more data/ complexity will always lead to better results and so I think it is important that we highlight that this is not always the case. Could you consider mentioning this in the abstract?
You mention several times that STARFIT still has distinct advantages over the other schemes (e.g. on L357 and L445), but I cannot see how your results evidence this; it seems to have been outperformed by simpler methods. Can you make it clearer why you think this?
I think it is really nice that you have published the ResOpsUS+CARS dataset!