the Creative Commons Attribution 4.0 License.
PLSTM-Reg v1.0: A regional physics-encoded LSTM model for simulating reservoir operations under data scarcity
Abstract. Representing reservoir operations in large-scale hydrological models remains difficult due to complex release decisions and scarce operational records. Here, we develop PLSTM-Reg v1.0, a regional deep learning framework with physics encoded to simulate reservoir operations across diverse systems. The framework is evaluated using 256 representative reservoirs across the Contiguous United States, focusing on three core capabilities: temporal generalization to unseen periods, spatial transfer to unseen reservoirs, and historical data reconstruction. Under temporal testing, the regional model improves 1-day-ahead release forecasts from a median Kling–Gupta Efficiency (KGE) of 0.83 to 0.96 relative to local counterparts, and reduces poorly simulated cases (KGE < 0.8) from 41.8 % to 2.3 %. For long-term simulation, storage performance reaches a median KGE of 0.79, a modest gain over local models (0.76) but with notable robustness for reservoirs with large capacity. When transferred to unseen reservoirs, the model substantially outperforms widely used rule-based schemes: median KGE rises from 0.55 (best benchmark) to 0.73 for release and from 0.22 to 0.59 for storage, and the proportion of usable simulations (KGE > 0.5) increases from 56.6 % to 89.8 % for release and 14.4 % to 61.7 % for storage. In historical storage reconstruction, incorporating monthly satellite-derived surface area strengthens storage estimates and enables reconstruction accuracy comparable to models trained with local records. These results demonstrate that cross-reservoir deep learning combined with physical knowledge provides a scalable scheme for representing human water management within large-scale hydrological and land surface models under widespread data scarcity.
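For readers unfamiliar with the headline metric, the following is a minimal sketch of the Kling-Gupta Efficiency in its original 2009 form. It assumes the manuscript uses this standard definition; the page does not say whether a modified variant is applied, and the function name `kge` is illustrative only.

```python
import math

def kge(sim, obs):
    """Kling-Gupta Efficiency, 2009 form:
    1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2).
    Assumes obs is not constant and has a non-zero mean."""
    n = len(obs)
    if n != len(sim) or n < 2:
        raise ValueError("sim and obs must have the same length (>= 2)")
    mean_s = sum(sim) / n
    mean_o = sum(obs) / n
    # Population standard deviations; their ratio does not depend on n vs n-1.
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs)) / n
    r = cov / (std_s * std_o)   # linear correlation
    alpha = std_s / std_o       # variability ratio
    beta = mean_s / mean_o      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A perfect simulation scores 1, and each of the three terms (correlation, variability, bias) can lower the score independently.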
Status: open (until 25 May 2026)
-
CEC1: 'Comment on egusphere-2026-1098 - No compliance with the policy of the journal', Juan Antonio Añel, 28 Mar 2026
reply
-
CC1: 'Reply on CEC1', Bin Yu, 29 Mar 2026
reply
Publisher’s note: the content of this comment was removed on 31 March 2026 since the comment was posted by mistake.
Citation: https://doi.org/10.5194/egusphere-2026-1098-CC1 -
AC1: 'Reply on CEC1', Yi Zheng, 29 Mar 2026
reply
Please see attachment.
-
CEC2: 'Reply on AC1', Juan Antonio Añel, 30 Mar 2026
reply
Dear authors,
Many thanks for your quick reply. Unfortunately, it does not resolve all the pending issues with the data in your manuscript. We cannot accept Hydroshare.org or Nasa.gov sites as permanent repositories for the assets used to perform the work described in a manuscript. Therefore, you must deposit the data that you have made available through those sites in one of the repositories that we can accept.
Please, reply to this comment with the information for the new repositories containing the mentioned datasets, and a modified Code and Data Availability section that complies with the policy of the journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2026-1098-CEC2 -
AC2: 'Reply on CEC2', Yi Zheng, 30 Mar 2026
reply
Please see attachment.
-
CEC3: 'Reply on AC2', Juan Antonio Añel, 30 Mar 2026
reply
Dear authors,
Again, thanks for addressing this issue so quickly. I have checked the repositories, and we can now consider the current version of your manuscript to be in compliance with the code policy of the journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2026-1098-CEC3
-
RC1: 'Comment on egusphere-2026-1098', Ningpeng Dong, 19 Apr 2026
reply
Greetings! This study presents PLSTM-Reg v1.0, a regional physics-encoded LSTM framework that effectively simulates reservoir operations by integrating mass-balance constraints with deep learning. The methodology demonstrates strong spatial and temporal generalizability, particularly through its synergy with remote sensing data. Overall, I think this is a very high-quality, well-organized study that addresses a critical challenge in hydrological modeling, which fully meets the requirements for publication in Geoscientific Model Development (GMD).
I have two minor suggestions. First, I recommend that the authors move the descriptions of static attributes to the main text, as they are essential parts of the model structure. Second, I would like to see a little discussion on which of these attributes have the most important impacts on the model results and which are less so.
I also have three technical questions that I hope the authors could clarify to improve reproducibility. 1) What is the specific sequence length used for the training and testing of the sequence-to-sequence regional model? 2) Why did the authors choose a recursive sequence-to-one model instead of a direct sequence-to-sequence approach (e.g., outputting a 7-day vector) for the 7-day lead time forecast? 3) Did the authors encounter gradient vanishing or truncation issues, given that the hard physical constraints in the PLSTM cell might truncate the gradient flow during backpropagation?
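The distinction raised in question 2 can be illustrated with a minimal sketch of a recursive sequence-to-one rollout, in which each one-step prediction is fed back into the input window. The function names and the stand-in model `damped_persistence` are hypothetical and are not taken from the PLSTM-Reg code.

```python
def recursive_forecast(one_step_model, history, horizon=7):
    """Roll a one-step-ahead model forward `horizon` steps,
    appending each prediction to the input window."""
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        y = one_step_model(window)
        forecasts.append(y)
        window = window[1:] + [y]  # drop the oldest value, append the prediction
    return forecasts

def damped_persistence(window):
    """Hypothetical stand-in for a trained model: 90 % of the latest value."""
    return 0.9 * window[-1]

week_ahead = recursive_forecast(damped_persistence, [10.0, 10.0, 10.0], horizon=7)
```

A direct sequence-to-sequence model would instead return all seven values in a single call, so its errors at later lead times would not depend on its own earlier predictions.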
Best
Ningpeng Dong
Citation: https://doi.org/10.5194/egusphere-2026-1098-RC1 -
RC2: 'Comment on egusphere-2026-1098', Anonymous Referee #2, 28 Apr 2026
reply
The manuscript addresses a critical challenge in large-scale hydrological modeling: the generalization of reservoir operation rules to data-scarce regions. The authors propose PLSTM-Reg v1.0, a physics-encoded Long Short-Term Memory (LSTM) model that integrates reservoir storage directly into the LSTM cell to enforce mass conservation. The model is evaluated using 256 reservoirs across the Continental United States (CONUS) through five distinct experimental setups designed to test temporal generalization, spatial generalization, and historical data reconstruction. Overall, the manuscript presents an innovative approach to reservoir modeling. The initial results are promising; however, there are several major concerns regarding the experimental logic, the clarity of the evaluation metrics, and the positioning of the work within the existing literature that need to be addressed.
Major comments
- Logical Consistency in Experiment Comparison (IV vs. V): The authors conclude that incorporating remotely sensed (RS) surface area leads to superior model performance compared to local models (mentioned in places like lines 23-26 and lines 313-343). However, the logic underpinning this comparison in Table 1 is not fully sound.
Specifically, in Experiment IV, storage information is omitted. In Experiment V, remotely sensed surface area, which serves as a highly informative proxy for storage, is included. Given that Experiment V introduces a significant predictive variable that Experiment IV lacks, it is not a safe conclusion that V would outperform IV because of adding RS information. This comparison does not necessarily prove the superiority of RS integration in reconstructing historical records; rather, it highlights the value of the extra information provided. To make this claim robust, the authors must clarify if the “reconstruction” baseline is intended to represent a scenario where NO storage records exist. If storage data were available and utilized in IV, would the RS approach still be superior?
- Justification of Model Input for Long-Term Simulations: Following the point above, why is storage omitted from the long-term simulation setups? In many operational contexts, historical storage records (even if fragmented) are the primary benchmark. The authors should justify the decision to exclude storage in these specific experiments or discuss the implications of this omission for the model's practical utility.
- Review of Relevant Literature: The literature review (Lines 58-72) summarizes recent regionalization efforts (e.g., Turner et al., 2021; Steyaert et al., 2025) but misses several foundational or recent studies that utilize machine learning and remote sensing for parameter generalization. Including the following would provide a more comprehensive context:
- DZTR Model: Utilizing hydrological quantiles for straightforward generalization (Yassin et al., 2019).
- https://doi.org/10.5194/hess-23-3735-2019
- SBTS Model: Generalizing storage-based piecewise rules using ML and satellite observations (Shen et al., 2025).
- https://doi.org/10.1029/2024WR037620
- MODROM Model: A modular model using ML for generalization (Li & Villarini, 2026).
- https://doi.org/10.1029/2025MS005180
- Clarity on Evaluation Metrics and Performance (Lines 214-229): The KGE scores presented in Figure 3 are exceptionally high, with medians near 1.0. This level of "perfection" suggests the evaluation might be influenced by the model's short-term configuration. Are these scores calculated on the training, testing, or overall dataset? If the model utilizes lagged information for next-day release predictions, the high performance is likely a reflection of persistence rather than operational rule learning. The authors need to provide a more detailed description of the evaluation: specifically, define the ground truth, the simulation period, and the exact lead time for the releases being scored.
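One way to probe the persistence concern in the last major comment is to score a naive forecast that simply repeats the latest observed release. The sketch below assumes the 2009 form of KGE and uses a synthetic sinusoidal release series, not data from the manuscript; `persistence_forecast` is an illustrative name.

```python
import math
from statistics import fmean, pstdev

def kge(sim, obs):
    """Kling-Gupta Efficiency (2009 form)."""
    ms, mo = fmean(sim), fmean(obs)
    ss, so = pstdev(sim), pstdev(obs)
    cov = fmean([(s - ms) * (o - mo) for s, o in zip(sim, obs)])
    r = cov / (ss * so)
    return 1.0 - math.sqrt((r - 1) ** 2 + (ss / so - 1) ** 2 + (ms / mo - 1) ** 2)

def persistence_forecast(release, lead=1):
    """Naive forecast: the release `lead` days ahead equals today's release."""
    if lead < 1:
        raise ValueError("lead must be at least 1")
    sim = release[:-lead]  # value known at issue time
    obs = release[lead:]   # value that actually occurred
    return sim, obs

# Synthetic, smoothly varying release series (arbitrary units), illustration only.
release = [100 + 30 * math.sin(2 * math.pi * d / 365) for d in range(730)]

kge_lead1 = kge(*persistence_forecast(release, lead=1))
kge_lead30 = kge(*persistence_forecast(release, lead=30))
```

If a learned model's 1-day-ahead KGE is not clearly above this baseline on the same test period, the score says little about whether operating rules were learned.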
Minor Comments
- Line 20: The term "local models" is used without a prior definition. Please provide a brief explanation at the first mention to ensure clarity for readers unfamiliar with the authors' specific terminology.
- Line 27: The hybrid nature of the model (Physics + DL) is a highlight of the work but is under-represented in the Abstract. Please mention the specific physical constraints (e.g., mass balance) earlier in the abstract to emphasize the "Physics-encoded" aspect.
- Line 110: Please add a formal citation for the Daymet dataset.
- Lines 119-120: When discussing "reconstructing records from remote sensing," please specify which variables are being reconstructed (e.g., inflow, storage, or release).
- Line 123: Regarding the use of SARAH-CONUS to supplement GRSAD: Did the authors perform a consistency check or bias correction between these two datasets? Please clarify if there were systematic differences and how they were handled.
- Line 183: The term "PLSTM-Loc" appears abruptly. Please define this counterpart model before using its abbreviation.
- Equations (1)-(3): There are inconsistencies in the fonts used for variables and operators. Please ensure all LaTeX formatting is uniform throughout the manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-1098-RC2 -
CC2: 'Comment on egusphere-2026-1098', Baptiste Francois, 07 May 2026
reply
I would like to congratulate the authors for this original and well-structured contribution. The PLSTM-Reg model is an elegant way to embed physical consistency directly into the recurrent loop, and the results across the diverse set of reservoirs are compelling. I have several questions regarding the model training and evaluation that I hope the authors can address:
1. Estimation of operational release limits Q_min and Q_max
The Physical Knowledge Module (Equations S8–S10) requires per-reservoir estimates of minimum and maximum allowable release, Q_min and Q_max. The manuscript and supplement do not describe how these values were derived. Could the authors clarify whether they were computed from the observed outflow record (e.g., the observed minimum and maximum, or empirical percentiles) or taken from engineering or regulatory sources?
2. LSTM warmup period and handling of the physical storage state
Standard LSTM implementations use a warmup period to bring the hidden state from zero-initialization to a realistic operating regime before the loss is computed. In PLSTM-Reg, the physical storage state s_t is simultaneously updated at every timestep via the water-balance equation (Eq. S9), creating a tighter coupling between the recurrent state and the physical state than in a conventional LSTM. Could the authors clarify: (a) Was a warmup period used, and if so, what length? (b) During warmup, was the physical knowledge module (Eqs. S8–S11) active, was the physical state held at the observed initial storage, or was it handled in some other way?
3. Training sequence length
The choice of sequence length is particularly important for PLSTM-Reg because the model performs a free-run simulation over the full sequence, meaning error in storage accumulates over time. Could the authors indicate the sequence length used during training and whether they observed sensitivity to this hyperparameter? Specifically, did longer sequences lead to better long-term simulation performance at the cost of slower training convergence?
4. Long-term water balance consistency
While the physical knowledge module enforces local mass balance at every timestep (Eq. S11) by construction, this does not guarantee that the long-term water balance is respected at the reservoir scale. We identify several potential sources of systematic bias:
First, the linear output head predicts a candidate release r~_t without any architectural constraint that enforces long-term mass conservation (i.e., there is no mechanism ensuring that the mean predicted release equals the mean inflow minus the long-term storage change). The LSTM may systematically over- or under-predict releases, creating a persistent bias in the physical storage trajectory.
Second, even if the candidate release were unbiased on average, the physical knowledge module can introduce a systematic offset by activating the storage and flow constraints. For instance, frequent clamping at S_max (spilling excess water via Eq. S11) or at Q_max (capping release and accumulating storage) will alter the long-term mean release relative to what the LSTM predicted. The direction and magnitude of this effect depend on the distribution of reservoir states relative to the constraint bounds and is not self-correcting.
Third, and more fundamentally, the observed inflow and release records in datasets such as ResOpsUS may themselves not close the water balance, due to measurement uncertainty, unobserved fluxes (direct lake evaporation, groundwater exchange, water withdrawals from the reservoir), or data gaps. In such cases, the LSTM may partially learn to compensate for these residuals. This raises the question of whether the model is learning physically meaningful release dynamics or partly fitting an artifact of the input data.
Did the authors evaluate the long-term water balance of the PLSTM-Reg simulations, for example by comparing the multi-year mean simulated release to the mean inflow minus the observed long-term storage trend? A systematic evaluation of this property across the 259 reservoirs, and a discussion of how unobserved fluxes in the training data may affect the model's behavior, would significantly strengthen the physical interpretability of the results.
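The check proposed in this item can be sketched as a mean daily residual of the reservoir water balance. The sketch assumes inflow and release are daily volumes in the same units as storage; the function name and the example numbers are illustrative and are not drawn from ResOpsUS or the manuscript.

```python
def water_balance_residual(inflow, release, storage_start, storage_end):
    """Mean daily residual of inflow - release - storage change.
    Returns (residual, residual relative to mean inflow)."""
    n = len(inflow)
    if n == 0 or n != len(release):
        raise ValueError("inflow and release must be non-empty and equally long")
    mean_inflow = sum(inflow) / n
    mean_release = sum(release) / n
    mean_storage_change = (storage_end - storage_start) / n
    residual = mean_inflow - mean_release - mean_storage_change
    return residual, residual / mean_inflow

# Hypothetical four-day record: 10 units/day in, 9 units/day out,
# while storage rises by 8 units. The balance does not close.
residual, relative = water_balance_residual(
    [10.0] * 4, [9.0] * 4, storage_start=100.0, storage_end=108.0
)
```

A residual near zero for the simulated release combined with a non-zero residual for the observed record would indicate that the data themselves do not close the balance, which is the third source of bias listed above.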
I look forward to seeing this work published and hope these questions can be addressed in the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-1098-CC2 -
CC3: 'Follow-up question', Baptiste Francois, 07 May 2026
reply
After reviewing the code shared by the authors, I would like to ask a few follow-up questions regarding Equation S8 and the role of Q_min and Q_max in the Physical Knowledge Module.
In the released code (ReservoirLSTM.rnncell), the physical module applies Equations S9–S11 as described in the supplement, but Equation S8 does not appear to be implemented. Instead, the candidate release predicted by the linear output head is denormalized directly using the per-reservoir empirical mean and standard deviation of observed outflow:
    release_denorm = release_factor * target_scale[0] + target_center[0]
This denormalization step does act as a soft, data-driven constraint: when the LSTM output is near zero, the predicted release is close to the observed mean, and the per-reservoir scaling discourages predictions far outside the observed distribution. However, this is a statistical prior derived from the training data rather than a hard operational bound. It cannot be directly interpreted as Q_min and Q_max, and it provides no guarantee that releases remain within physically or operationally meaningful bounds, particularly for out-of-distribution conditions.
I understand that the storage clamp (Eq S10) combined with Eq S11 provides an implicit physical floor and ceiling on release (you cannot release more than the available water, and you must spill when the reservoir is full). This is a meaningful hard constraint. However, it does not serve the same purpose as Q_min (minimum environmental or operational flow requirement) and Q_max (maximum release capacity).
Could the authors clarify: (a) Was Equation S8 intentionally omitted from the implementation, with the normalization scheme serving as the intended proxy for release bounds? (b) If so, how are Q_min and Q_max defined in the manuscript, and are they used anywhere in the training or evaluation pipeline? (c) Does the absence of an explicit release clamp affect the physical interpretability of the model, particularly the claim that "physical knowledge" is enforced?
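The difference between a release clamp and a storage clamp discussed in this comment can be made concrete with a minimal single-step sketch. It is written from the verbal description in the comment only and is not a transcription of Equations S8–S11 or of the released code; all names and the ordering of operations are assumptions.

```python
def constrained_step(storage, inflow, candidate_release,
                     s_min, s_max, q_min=None, q_max=None):
    """One daily step with an optional hard release clamp followed by a
    storage clamp. Returns (new_storage, final_release)."""
    release = candidate_release
    # Optional operational bounds on release (the role attributed to Eq. S8).
    if q_min is not None:
        release = max(release, q_min)
    if q_max is not None:
        release = min(release, q_max)
    # Water balance update, then clamp storage to its physical range.
    new_storage = min(max(storage + inflow - release, s_min), s_max)
    # Recompute release so that mass is conserved exactly:
    # spill when the reservoir is full, cut back when it is empty.
    final_release = storage + inflow - new_storage
    return new_storage, final_release

# Nearly full reservoir: a small candidate release is raised to a spill of 20.
full_case = constrained_step(90, 30, 5, s_min=0, s_max=100)
# Release capacity of 10 caps a candidate release of 50.
capped_case = constrained_step(50, 0, 50, s_min=0, s_max=100, q_max=10)
```

In this sketch the storage clamp takes precedence, so the final release can still leave the [Q_min, Q_max] range when the reservoir is full or empty. Whether the authors intend that precedence is exactly the kind of detail the questions above ask them to document.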
In addition, I am curious about the spatial out-of-sample evaluation. The model's linear output head predicts a normalized release factor that must be denormalized using the per-reservoir mean and standard deviation of observed release (target_center and target_scale in the code). For reservoirs that were withheld from training, this implies that observed release statistics are still required at inference time. If that is the case, the model is not fully independent of observed data for unseen reservoirs, which would partially undermine the out-of-sample spatial generalization claim. Could the authors clarify how denormalization is handled for reservoirs not seen during training, specifically whether per-reservoir observed statistics are used?
I also noticed that the initial physical storage state is hard-coded to 0.5 (i.e., half of the normalized capacity) regardless of the actual observed storage at the start of each sequence:
    # initialize reservoir storage using 0.5
    device = x_d_embedded.device
    temp_vars = torch.ones((batch,2)).to(device=device, dtype=x_d_embedded.dtype)*0.5
Could the authors comment on the sensitivity of the model to this initialization choice? In particular, for short sequences or reservoirs that are frequently at extreme storage levels (near empty or near full), a fixed 0.5 initialization may introduce a systematic spin-up bias. Was this choice evaluated against alternatives such as initializing from the observed storage at the start of each sequence, or using a learned initial state? Is this initial storage used as the start of the warm-up period during training only?
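The spin-up concern can be illustrated with a toy experiment that runs the same normalized-storage simulation from two initial states and counts the steps until the trajectories agree. Everything here is hypothetical: the linear release rule r = k * s, the constant inflow, and the tolerance are stand-ins chosen for illustration and do not represent PLSTM-Reg.

```python
def simulate(initial_storage, inflow, k=0.05):
    """Toy normalized-storage run with a hypothetical release rule r = k * s."""
    s = initial_storage
    path = []
    for q in inflow:
        release = k * s
        s = min(max(s + q - release, 0.0), 1.0)  # keep storage within [0, 1]
        path.append(s)
    return path

def spinup_length(path_a, path_b, tol=0.01):
    """First step from which the two runs stay within `tol`; None if never."""
    for t in range(len(path_a)):
        if all(abs(a - b) <= tol for a, b in zip(path_a[t:], path_b[t:])):
            return t
    return None

inflow = [0.04] * 365                  # constant normalized inflow (assumption)
from_fixed = simulate(0.5, inflow)     # fixed 0.5 initialization
from_observed = simulate(0.95, inflow) # stand-in for an observed initial state
steps = spinup_length(from_fixed, from_observed)
```

In this toy setting the gap between the two runs shrinks by a factor of (1 - k) per step, so a slower-responding reservoir (smaller k) needs a longer spin-up. Any sequence shorter than that length would carry the initialization bias into the scored period.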
I raise these points not as a criticism but because the relationship between the described equations and the actual implementation is not entirely clear to me from the manuscript alone, and a clarification would be valuable to readers seeking to reproduce or extend this work. I remain very interested in the authors' work and look forward to their answers.
Citation: https://doi.org/10.5194/egusphere-2026-1098-CC3
-
RC3: 'Comment on egusphere-2026-1098', Anonymous Referee #3, 21 May 2026
reply
This manuscript develops PLSTM-Reg, a regional LSTM model trained and evaluated using long-term daily records from 256 reservoirs across the CONUS to simulate reservoir release and storage under temporal and spatial generalization settings. The study is valuable because it uses a high-quality reservoir dataset, considers both short-term and long-term simulation performance, designs multiple experiments to test transferability, and tests the added value of remote-sensing surface area. The manuscript is generally well organized. However, the novelty relative to recent pooled or transferable reservoir-operation models is not fully stated, and several aspects of the validation require clarification, including possible information leakage from static attributes and the fairness of local-model comparisons. Overall, I find the study promising and potentially useful for large-scale hydrologic modeling, but I recommend major revision before publication.
Major comments
1. Table S1 lists “Average discharge” as an input. This raises two concerns. First, it may partly explain the improvement of the regional model over local models, since local models do not use this static input. An ablation that excludes average discharge would be more convincing. Second, in the spatial-transfer experiment, average discharge for held-out reservoirs may create leakage. This would weaken the claim that the model is tested on truly “unseen” or data-scarce reservoirs.
2. The model seems to assume that reservoir operation depends only on local basin forcings, local inflow/storage, and reservoir attributes. However, many real reservoirs are operated as part of cascade reservoir systems. The authors should clarify whether such cases are included and whether PLSTM-Reg can handle coupled upstream–downstream operation (for example, whether the current framework is flexible enough to accept upstream reservoir information as a static input). If not, the application boundary should be stated more clearly.
3. The comparison with recent advances is incomplete. Comparing mainly with older rule-based models weakens the significance of the claimed innovation, especially because the key claim is multi-reservoir pooling/regional learning. The following studies should be compared with, or at least discussed:
Tran et al. (2025), “Improving the prediction of daily reservoir releases over the CONUS using conditioned LSTM.” This is a close predecessor: a pooled/conditioned LSTM across nearly 200 CONUS reservoirs using static reservoir attributes.
The following three are cited in this work, but only as background. They should be discussed more directly against PLSTM-Reg as baselines:
- Ford and Sankarasubramanian (2023), “Generalizing reservoir operations using a piecewise classification and regression approach.”
- Turner et al. (2021), “Water storage and release policies for all large reservoirs of conterminous United States.”
- Chen et al. (2022), “Developing a generic data-driven reservoir operation model.”
4. The source of improvement should be interpreted more carefully. The gains over rule-based methods may come from both regional pooling and the flexible neural-network architecture. Simply attributing the improvement to the regional setting is not fully fair. The authors should distinguish the effects of regional training, static attributes, nonlinear LSTM architecture, and physical constraints.
Minor comments
- Line 14. “Representative reservoirs” needs a clearer definition in the data section to avoid selection bias. Is data length the only selection criterion?
- Line 50. Previous models require water-demand data for parameterization, while this model does not use demand data but still performs well. The authors should explain why demand information may be implicitly captured, for example through storage, release history, seasonality, reservoir purpose, and static attributes.
- Line 124-126. It would be helpful to know the missing ratios of the remote sensing data.
- Line 165. The authors use the term “forecasting.” However, if future forcings/inflows are assumed to be known when running the model recursively, this is closer to conditional simulation or hindcast prediction than true operational forecasting. Please clarify.
- Line 176. Long-term simulation capability depends on the training window length. Longer sequence windows may help the model learn to avoid error accumulation. This does not require a new experiment, but it is worth mentioning in the discussion.
- Line 197. The authors should clarify how the monthly surface-area time series is used at daily steps. If the current month’s value is used for all days in that month, it may include future information for earlier days.
- Lines 227–229. It is interesting that release performance differs clearly between PLSTM-Reg and PLSTM-Loc, while storage performance is nearly the same. The authors should explain this.
- Figure 3. Since all statistical tests appear significant, the asterisks could be removed and the caption can simply state that all comparisons pass the significance test. Same for Figure 6.
- Lines 273–274. In the “unseen reservoir” experiment, it is unclear how PLSTM-Loc is trained. Does it use local data from the held-out reservoir, or random initial weights? This needs clarification.
- Line 280. The improvement over rule-based methods may come from both regional pooling and flexible black-box modeling. The manuscript should avoid attributing the gain only to regional learning.
- Figure 5. Please clarify why the metric values are not consistent with Figure S2. Do they correspond to different experimental settings or evaluation periods?
- Text S2. Rule-based benchmarks use default parameters, not calibrated ones, so the comparison does not necessarily imply that PLSTM-Reg outperforms these schemes.
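The look-ahead concern in the Line 197 comment above can be illustrated with a minimal sketch that assigns each day the value of the previous calendar month, so that no observation from later in the same month is used. The manuscript's actual handling is not described on this page; the function name, the dictionary layout, and the area values are hypothetical.

```python
from datetime import date

def lagged_monthly_value(day, monthly):
    """Return the value for the calendar month before `day`.
    `monthly` maps (year, month) to a value; None if that month is missing."""
    if day.month > 1:
        key = (day.year, day.month - 1)
    else:
        key = (day.year - 1, 12)
    return monthly.get(key)

# Hypothetical monthly surface areas in km^2, for illustration only.
monthly_area = {(2020, 1): 12.4, (2020, 2): 12.9}

mid_february = lagged_monthly_value(date(2020, 2, 15), monthly_area)
```

Using the current month's value for every day of that month would instead expose early-month days to information observed later, which is the leakage the comment describes.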
Citation: https://doi.org/10.5194/egusphere-2026-1098-RC3
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 628 | 243 | 87 | 958 | 123 | 41 | 50 |
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, you have archived your code in a private Zenodo repository, which we cannot accept. The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance and replicability of the published papers for years after their publication. Therefore, you must publish the code used in your manuscript openly and without restrictions to continue the Discussions and peer review process.
In addition, to access the data used and produced in your work you cite several sites; however, the cited sites do not fulfil GMD’s requirements for a persistent data archive because:
* They do not appear to have a published policy for data preservation over many years or decades (some flexibility exists over the precise length of preservation, but the policy must exist).
* They do not appear to have a published mechanism for preventing authors from unilaterally removing material. Archives must have a policy which makes removal of materials only possible in exceptional circumstances and subject to an independent curatorial decision.
* They do not appear to issue a persistent identifier such as a DOI or Handle for each precise dataset.
If we have missed a published policy which does in fact address this matter satisfactorily, please post a response linking to it. If you have any questions about this issue, please post them in a reply. One site cited in your manuscript, the Texas Data Repository, could almost be considered compliant with our policy; however, after reading their policy, it seems that the service for hosting and curating the data is not provided directly by the Texas Digital Library but by AWS, which is a private company, and therefore we cannot accept it as a long-term repository valid for scientific publication.
Please, therefore, publish all the code and data used in your work and necessary to replicate it in one of the appropriate repositories according to the policy of the journal, and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible. We cannot have manuscripts under discussion that do not comply with our policy.
The "Code and Data Availability" section must also be modified to cite the new repository locations, and corresponding references added to the bibliography.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel
Geosci. Model Dev. Executive Editor