This work is distributed under the Creative Commons Attribution 4.0 License.
Predictive Performances of Machine Learning– and Deep Learning–Based Univariate and Multivariate Reservoir Inflow Predictions in the Chao Phraya River Basin
Abstract. This study demonstrated the predictability of Machine Learning (ML)– and Deep Learning (DL)–based univariate and multivariate predictions of reservoir inflows at Bhumibol (BB) and Sirikit (SK), two major dams in the Chao Phraya River Basin. XGBoost, a tree-based ensemble algorithm, and LSTM, a deep neural network algorithm, were selected to develop the daily and monthly prediction models. For univariate prediction, the inflows of the BB and SK dams were predicted separately using two individual models. In contrast, for multivariate prediction, a single model was developed to predict the inflows of both dams simultaneously, facilitating integrated decision-making. Across all prediction scenarios, the ML– and DL–based models predicted daily reservoir inflows of the BB and SK dams more accurately than monthly inflows, achieving NSE values of 0.86 and 0.77, respectively. Because the LSTM algorithm can effectively handle larger datasets, the single multivariate LSTM model produced results close to those of the individual univariate XGBoost and LSTM models for the BB and SK dams. The XGBoost models mostly outperformed the LSTM models on the test datasets for both daily and monthly univariate predictions. Across all prediction scenarios, both univariate and multivariate models consistently underpredicted low reservoir inflows and overpredicted high reservoir inflows. Therefore, extracting specific and informative insights from the results of each model type, forecasting horizon, and algorithm used can significantly enhance decision-making support for both real-time reservoir operation and long-term reservoir management planning.
Status: closed
RC1: 'Comment on egusphere-2025-16', Anonymous Referee #1, 24 Mar 2025
This manuscript explores the application of two widely known data-driven algorithms—XGBoost and LSTM—in both univariate and multivariate modes for daily and monthly inflow predictions at two key reservoirs in the Chao Phraya River Basin. The topic is timely and relevant in the context of AI-driven hydrological forecasting. However, the manuscript, in its current form, fails to meet the scientific standards and novelty threshold expected by Hydrology and Earth System Sciences. The work is largely confirmatory, methodologically simplistic, and lacks both theoretical depth and critical interpretation. It represents an incremental application of well-established techniques without significant advancement in methodology, theory, or hydrological insight. Below are my detailed comments:
1. Despite the claim of contributing to reservoir inflow forecasting through multivariate models, the study does not introduce any methodological innovation. The application of XGBoost and LSTM, both extensively used in hydrology, adds no novelty unless combined with a new model architecture, uncertainty treatment, explainability component, or integration with process-based models. The experimental setting is rudimentary, and the results primarily confirm what has already been established in dozens of prior studies. Moreover, the assertion that multivariate prediction of inflows has rarely been studied is not substantiated and contradicts recent literature. The references cited are selective and outdated, omitting more advanced hybrid or physics-informed ML approaches currently under development in the hydrological community.
2. The manuscript fails to clearly define its scientific objectives or hypotheses. The rationale behind comparing univariate and multivariate approaches is weakly stated and not embedded in a theoretical or operational framework. The problem formulation is generic and reads more like a technical report than a scientific investigation.
3. The literature review is overly descriptive and lacks synthesis. It resembles an annotated bibliography rather than a critical narrative. Foundational works on multivariate time series modeling, ensemble learning, recent benchmarks on hybrid models, and the emerging field of physics-informed ML in hydrology are all missing. Furthermore, no discussion is provided on model explainability, uncertainty quantification, or generalization capacity, all of which are central themes in the current hydrological ML research agenda.
4. The methodology exhibits some critical flaws:
- No hyperparameter optimization strategy is described beyond brute-force listing of combinations.
- Feature selection is based solely on Pearson correlation, ignoring non-linear dependencies or mutual information approaches (see the sketch after these comments).
- The study does not address overfitting or generalization. Despite LSTM being known for susceptibility to overfitting, no regularization, dropout, or model selection strategy is employed.
- No benchmark model is used for reference, which is standard in HESS-level contributions (a naive persistence baseline is sketched after these comments).
5. The manuscript presents no discussion of data quality, treatment of missing values, stationarity, or outlier detection.
6. The results section is overly descriptive, listing metrics without proper analysis or critical discussion. Additionally, the model performances reported are relatively modest, especially for monthly inflow prediction, yet are uncritically presented as acceptable.
7. The discussion does not provide new hydrological or methodological insight. There is no exploration of why certain models perform better under given conditions, nor any effort to relate findings to hydrological processes. The difference in performance between the two dams, for instance, is acknowledged but not explained.
8. The implications for operational decision-making—often emphasized in the introduction—are not convincingly revisited.
9. The conclusions are largely a restatement of the results, without any critical reflection or forward-looking perspective. The authors do not acknowledge the substantial limitations of their study—particularly the lack of generalization, interpretability, and robustness of the models.
10. The manuscript suffers from structural repetition and verbosity. Some figures (e.g., radar plots) are poorly designed and do not enhance interpretability.
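To make points 4b and 4d concrete, a minimal sketch is given below. It is an illustration only, assuming NumPy arrays of candidate lagged predictors `X`, an inflow target `y`, and a `feature_names` list; none of these names come from the manuscript. It shows a mutual information ranking, which captures non-linear dependence that Pearson correlation misses, and a naive persistence benchmark that any skilful daily model should beat.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def rank_features_by_mi(X, y, feature_names):
    """Rank candidate inputs by mutual information, which, unlike
    Pearson correlation, also captures non-linear dependence."""
    mi = mutual_info_regression(X, y, random_state=0)
    order = np.argsort(mi)[::-1]
    return [(feature_names[i], float(mi[i])) for i in order]

def persistence_nse(y):
    """NSE of the naive persistence benchmark that predicts today's
    inflow with yesterday's observation."""
    obs, sim = y[1:], y[:-1]
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
```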
Citation: https://doi.org/10.5194/egusphere-2025-16-RC1
- CC1: 'Reply on RC1', Areeya Rittima, 03 May 2025
- AC2: 'Reply on RC1', Areeya Rittima, 05 Jun 2025
RC2: 'Comment on egusphere-2025-16', Anonymous Referee #2, 22 May 2025
The paper investigates the ability of Machine Learning (XGBoost) and Deep Learning (LSTM) models to predict daily and monthly reservoir inflows for the Bhumibol (BB) and Sirikit (SK) dams in Thailand's Chao Phraya River Basin. The authors evaluate both univariate models (predicting each dam separately) and multivariate models (predicting both dams simultaneously).
The multivariate approach is tested only with LSTM and not with XGBoost. Why do the authors not apply a multivariate approach using XGBoost as well?
The paper is generally well structured and presents a clear methodology. The correlation analysis presented in Section 2.1 serves as a useful preliminary step for selecting relevant input features. However, this approach would benefit from being complemented by a more in-depth evaluation of feature importance based on the model's actual behavior. This is particularly relevant for complex architectures such as LSTM, where the relationship between inputs and outputs is not always easily interpretable.
Have the authors employed a systematic cross-validation strategy? In Section 2.3, the XGBoost configuration specifies only a 2-fold validation, which is the bare minimum. It would be more appropriate to use 5- or 10-fold cross-validation to ensure more reliable results. Furthermore, for the LSTM models, it appears that no form of cross-validation has been applied. To strengthen the robustness of the comparative analysis, a more comprehensive approach, such as k-fold cross-validation, should be adopted and clearly described for all models.
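For illustration, a minimal sketch of such a scheme is given below, assuming NumPy arrays `X` and `y` and placeholder XGBoost hyperparameters that are not those of the manuscript. An expanding-window k-fold split preserves temporal order while still giving multiple validation folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def expanding_window_nse(X, y, n_splits=5):
    """Expanding-window k-fold CV: each fold trains on all data that
    precedes its test block, so temporal ordering is never violated."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
        model.fit(X[train_idx], y[train_idx])
        obs, sim = y[test_idx], model.predict(X[test_idx])
        scores.append(1.0 - np.sum((obs - sim) ** 2)
                      / np.sum((obs - obs.mean()) ** 2))
    return float(np.mean(scores)), float(np.std(scores))
```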
The description of the algorithms, while thorough, is excessively detailed and could be streamlined by referencing existing literature where appropriate. Additionally, the paper lacks a critical discussion of the limitations inherent to the algorithms used. The conclusion section would also be strengthened by including suggestions for future research directions and potential areas of improvement.
In Table 6, the test metrics unexpectedly outperform the training metrics—for example, in the XGBoost monthly model (S3–BB), the NSE increases from 0.411 (training) to 0.675 (testing). Generally, this should not occur, as the model is specifically trained on the training data, while the test set is meant to evaluate its ability to generalize to unseen data. Although it is technically possible for test metrics to exceed training metrics, such results often indicate potential issues—such as noise or outliers in the training set, an under-trained model, or a methodological flaw in the data splitting process, especially if temporal ordering is not preserved.
It would be important to clarify whether the authors used a random shuffling approach to split the dataset into training and testing sets.
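A chronological hold-out, sketched below under the same placeholder array names, avoids the leakage that random shuffling would introduce into time series data.

```python
def chronological_split(X, y, test_fraction=0.2):
    """Hold out the most recent block of observations as the test set;
    no shuffling, so no future information leaks into training."""
    cut = int(len(y) * (1.0 - test_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]
```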
Moreover, Table 6 shows that models using daily resolution consistently outperform those using monthly resolution across all evaluation metrics. Intuitively, one might expect the opposite—that monthly predictions would perform better—as temporal aggregation typically reduces noise and smooths out short-term variability. However, this counterintuitive outcome warrants further investigation. The authors should explore this aspect in more depth, discussing possible reasons.
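One candidate explanation worth checking is the loss of training samples under aggregation. The toy example below, on purely synthetic data, shows the roughly 30-fold reduction in sample count when a 30-year daily record is aggregated to monthly totals.

```python
import numpy as np
import pandas as pd

# Synthetic 30-year daily inflow series, illustrative only.
idx = pd.date_range("1994-01-01", "2023-12-31", freq="D")
daily = pd.Series(np.random.default_rng(0).gamma(2.0, 10.0, len(idx)), index=idx)
monthly = daily.resample("MS").sum()  # monthly totals smooth daily noise
print(len(daily), len(monthly))       # 10957 daily vs 360 monthly samples
```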
In Section 3.2, comparing models on different test sets may compromise the reliability of the evaluation. To ensure a fair and consistent comparison, it is recommended to exclude a predefined period of observations and use this subset as a shared validation set for all models. This approach guarantees that all models are assessed on the same data, minimizing variability and potential biases in performance comparison.
In Table 8, the minimum and average observed values (i.e., those labeled as "Obs.") should be identical across all models tested on the same dataset, unless there were differences in data splitting or preprocessing. These "Obs." values represent the actual ground truth measurements used to evaluate model performance. Therefore, if the training set remains unchanged across models, the minimum, average, and maximum observed values in the training set should be the same. The same applies to the test set: if the data used is consistent, the observed statistics must also be consistent.
However, there are clear inconsistencies in Table 8. For example, in the SK monthly models (S3 vs. S4), the minimum observed inflow in the training set changes from 61.48 MCM (S3) to 46.50 MCM (S4).
The Authors acknowledge the tendency of the models to underestimate low flows and overestimate high flows (Table 8). What do the authors propose to address the consistent overestimation of maximum values and underestimation of minimum values? Do they suggest any strategies or methodologies to mitigate this issue?
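One common mitigation, offered here only as a suggestion and not as the authors' method, is to fit the model in log space so the squared-error loss weights relative errors more evenly between low flows and flood peaks. A minimal sketch, again with placeholder hyperparameters:

```python
import numpy as np
from xgboost import XGBRegressor

def fit_predict_log_space(X_train, y_train, X_test):
    """Train on log-transformed inflow and invert the transform at
    prediction time; this compresses peaks and stretches low flows."""
    model = XGBRegressor(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, np.log1p(y_train))
    return np.expm1(model.predict(X_test))
```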
Minor comments:
- Improve the resolution of the images; in their current state, they are very difficult to read.
- The captions of the tables and figures should be more descriptive of the content they present.
- Figure 7: The images A.1, A.2, B.1, and B.2 do not use the same scale, which makes direct comparison difficult.
Citation: https://doi.org/10.5194/egusphere-2025-16-RC2
- AC1: 'Reply on RC2', Areeya Rittima, 03 Jun 2025
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 291 | 71 | 15 | 377 | 10 | 13 |