Using satellite observations to validate and improve reservoir storage simulations in global hydrological models
Abstract. Global hydrological models (GHMs) increasingly incorporate generic reservoir operation schemes (GROS) to simulate the regulation of rivers by dams. However, the reliability of GROS remains largely unvalidated on a global scale due to the historical scarcity of open in situ data. Here, we leverage the Global Reservoir Storage (GRS) satellite dataset to conduct the first comprehensive quantitative evaluation of reservoir storage simulations globally from five GHMs: H08, WaterGAP2-2e (WGP), MIROC-INTEG-LAND (MIL), CWatM (CWT) and LPJmL5-7-10-fire (LPJ). H08, WGP, MIL and LPJ adopted the process-based Hanasaki et al. (2006) reservoir operation scheme (H06), while CWT adopted the piecewise-function rule curve approach of Burek et al. (2013, 2020) (LIS). We address two primary questions: (1) how accurately do state-of-the-art GHMs reproduce global reservoir storage dynamics? and (2) are model deficiencies attributable to parametric rigidity (i.e., the adoption of globally uniform parameters) in GROS? We evaluated monthly reservoir storage series at 424 major dams (capacity ≥ 0.5 km³) over the historical period, 1999–2018. Performance was quantified using the Kling-Gupta Efficiency (KGE). Two post-hoc bias correction methods—linear scaling and variance-matching—were applied to the raw monthly storage simulations to evaluate whether simple, targeted statistical transformations could recover model skill. To comprehensively address parametric rigidity, we conducted a sensitivity analysis on H08 using its H06 scheme by varying two parameters: target storage level (TSL) and the degree of regulation threshold (DORT) and using LIS by varying the normal storage limit (LN). Our evaluation reveals that current GROS yield generally unsatisfactory performance, characterised by two distinct features. The first concerns seasonal amplitude in storage. MIL initially achieves the highest skill: 52.36 % of dams had a KGE > -0.41. 
However, KGE decomposition revealed this skill was largely due to dampened intra-annual variability rather than being driven by high correlation and/or low bias error. In contrast, the other GHMs often exhibit excessive seasonal drawdown, systematically overestimating storage amplitude. The second feature pertains to temporal dynamics in storage: within the group exhibiting exaggerated seasonal drawdown, H06-based models—H08, WGP and LPJ—significantly outperform the LIS-based CWT in temporal correlation. We demonstrate that when variance-matching bias correction is applied across all GHMs, two things happen: firstly, the performance of all GHMs becomes generally satisfactory (median KGE > -0.41), and secondly, the GHMs with exaggerated seasonal drawdown outperform MIL in terms of KGE, owing to their superior temporal correlation (H06-based GHMs) and mean bias estimation performance (except H08). By contrast, linear scaling yields only marginal improvements, indicating that correcting variability errors is substantially more effective than adjusting mean bias alone. Furthermore, sensitivity analyses confirm that exaggerated seasonal drawdown is primarily a result of parameter choices rather than inherent flaws in GROS. These findings highlight two critical insights: (1) one-size-fits-all parameters are a primary limitation in global reservoir modelling; and (2) satellite observations are a viable dataset for calibrating reservoir operation schemes in GHMs.
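For readers who want to follow the evaluation logic, the metric and the two post-hoc corrections named in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation. The KGE is written in its standard three-component form (correlation r, variability ratio alpha, bias ratio beta). The exact formulas used in the manuscript for linear scaling and variance-matching are not given in the abstract, so the versions here are assumptions: linear scaling is taken as a multiplicative rescaling to the observed mean, and variance-matching as a rescaling of anomalies about the simulated mean to the observed standard deviation. The function names are hypothetical. For reference, the KGE > -0.41 threshold corresponds to the score of a mean-of-observations benchmark, 1 - sqrt(2) (Knoben et al., 2019).

```python
import numpy as np


def kge_components(sim, obs):
    """Return (r, alpha, beta): correlation, std ratio, mean ratio."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]      # temporal correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # mean bias ratio
    return r, alpha, beta


def kge(sim, obs):
    """Kling-Gupta Efficiency in its standard three-component form."""
    r, alpha, beta = kge_components(sim, obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def linear_scaling(sim, obs):
    """Assumed form: multiplicative rescaling so the simulated mean matches obs."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return sim * (obs.mean() / sim.mean())


def variance_matching(sim, obs):
    """Assumed form: rescale anomalies so the simulated std matches obs.

    The simulated mean is deliberately left unchanged, so only alpha is corrected.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return sim.mean() + (sim - sim.mean()) * (obs.std() / sim.std())
```

Under these assumed forms, variance-matching sets alpha to 1 while leaving r and beta untouched, whereas linear scaling sets beta to 1 and leaves r untouched. This makes it easy to see which KGE component each correction targets.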
This manuscript examines two generic reservoir operation schemes implemented in five global hydrological models (GHMs). The authors first evaluate the storage simulations of the five GHMs against satellite-derived reservoir storage products, using the Global Reservoir Storage (GRS) dataset as the primary benchmark across 424 dams globally. They then test two simple post-hoc corrections, variance-matching and linear scaling, to determine whether satellite-derived storage data can recover simulation skill, and conduct a parameter sensitivity analysis on a small number of dams. The study demonstrates that satellite-derived products can be used both for statistical correction and for parameter tuning of generic reservoir operation schemes, improving GHM reservoir simulation performance.
While the topic is relevant to this journal, I have serious concerns about the scientific contribution of the study in its current form. Specifically, three major issues critically affect the quality of the work:
1. Too much unrelated and unnecessary content.
The two central research questions of this study are: (1) How accurate are generic reservoir operation schemes in GHMs globally? and (2) Can reservoir simulation performance be improved by correcting against satellite products and by replacing globally uniform parameters? Substantial content does not serve these questions and could be removed, including the intercomparison of satellite-based storage products (Sect. 2.3.1 and 3.1) and the development of the new GDS dataset (Sect. 2.3.2). Previous research, such as Cooley et al. (2025), already performed a comprehensive global intercomparison of five satellite-derived reservoir storage products (GLWS, GRS, GloLakes, GRDL-Y and GRDL-L) against in situ observations. Moreover, the newly developed GDS dataset performs worse than the existing GRS product (Sect. 3.1), so introducing it here does not appear necessary.
2. The attempt to use satellite-derived products to improve GHM performance is too simple and lacks further exploration.
Testing simple correction methods with satellite data is a reasonable first step, but doing so introduces a satellite-data requirement for each reservoir, which undermines the central advantage of generic reservoir operation schemes: they can be applied to any reservoir, regardless of satellite data availability. I therefore suggest that the authors calibrate the key parameters against satellite products across a larger sample of reservoirs and then examine whether the resulting optimal parameter values relate systematically to reservoir characteristics. Such relationships could then be transferred to reservoirs lacking satellite coverage, preserving the generality of the scheme while still benefiting from satellite-based calibration.
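To make the suggested transfer step concrete, a minimal sketch follows. It assumes the per-reservoir calibration has already been done and covers only the last two steps: relating the calibrated parameter to a reservoir characteristic and predicting it for reservoirs without satellite coverage. Everything specific here is hypothetical: the function names, the choice of a single characteristic (for example, a degree-of-regulation measure), the linear form of the relationship, and the [0, 1] clipping range for a fractional parameter such as a target storage level. In practice the functional form and predictors would need to be justified by the data.

```python
import numpy as np


def fit_transfer_function(characteristic, optimal_param):
    """Fit a least-squares line: calibrated parameter vs. reservoir characteristic.

    Both inputs are 1-D arrays with one entry per calibrated reservoir.
    A linear relationship is an assumption made purely for illustration.
    """
    slope, intercept = np.polyfit(
        np.asarray(characteristic, dtype=float),
        np.asarray(optimal_param, dtype=float),
        deg=1,
    )
    return slope, intercept


def predict_param(characteristic, slope, intercept, lower=0.0, upper=1.0):
    """Predict the parameter for reservoirs lacking satellite coverage.

    Predictions are clipped to [lower, upper] so that extrapolation cannot
    return values outside the (assumed) admissible parameter range.
    """
    x = np.asarray(characteristic, dtype=float)
    return np.clip(slope * x + intercept, lower, upper)
```

Reporting the goodness of fit of such a relationship, and validating the transferred parameters on held-out reservoirs that do have satellite data, would show whether the genericness of the scheme can be retained.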
3. Too few reservoir samples for the parameter tuning tests.
The detailed sensitivity analysis is conducted on only three dams, with four more added only as a supplementary check (Sect. S10). Even combined, seven dams out of 424 are far too few to support general conclusions about suggested parameter ranges. Moreover, it is to be expected, even without running these tests, that adaptive, reservoir-specific parameters would improve simulation performance. The tuning experiments on three dams therefore mostly confirm an already-expected result.
These three issues fundamentally limit the scientific depth and impact of the work. I appreciate the authors' efforts, but I recommend rejecting the manuscript in its current form, and I encourage the authors to consider these points in future work to strengthen both the methodology and the structure of the paper.
Minor Comments
References
Cooley, S. W., Wang, J., Gao, H., Yao, F., Livneh, B., Li, Y., et al. (2025). Global intercomparison of satellite-derived variability in reservoir storage. Environmental Research Letters, 20(8), 084035. https://doi.org/10.1088/1748-9326/ade903
Knoben, W. J. M., Freer, J. E., & Woods, R. A. (2019). Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores. Hydrology and Earth System Sciences, 23(10), 4323–4331. https://doi.org/10.5194/hess-23-4323-2019