the Creative Commons Attribution 4.0 License.
Uncovering a Key Predictors for Enhancing Daily Streamflow Simulation Using Machine Learning
Abstract. The sequence of droughts and wetter periods in Australia poses challenges for long-term hydrologic modelling. This paper develops a novel machine learning-based approach to uncover key predictors that improve daily streamflow predictions during and after the Millennium drought (1997 to 2009) in 39 gauged sub-catchments in Western Victoria, Australia.
For this purpose, a hybrid approach is adopted, combining simulations from the GR4J hydrological model with physical data as forcing (predictors) for multiple ML algorithms to identify the key predictors for improving streamflow prediction. GR4J is a widely used operational hydrological model in Australia. ML models including predictors representing long-term runoff coefficient and short-term runoff and rainfall showed the greatest improvement in streamflow predictions, particularly for low flows. This suggests that GR4J has limited ability to capture short/long-term persistence and therefore model enhancement should focus on these shortcomings. All ML algorithms resulted in improved streamflow prediction, with Multilayer Perceptron (MLP) consistently yielding the highest Nash Sutcliffe Efficiency, and Random Forest showing the strongest improvement in terms of low-flow prediction. Long-term runoff coefficient and machine learning were most effective in catchments with lower long-term runoff coefficients. Overall, this study provides insights for water resources management in drought-prone regions, highlighting the key predictors in the combination of ML and hydrological modelling to improve streamflow predictions during and after droughts.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-553', Anonymous Referee #1, 03 Apr 2025
- AC1: 'Reply on RC1', Arash Aghakhani, 22 Jun 2025
- RC2: 'Comment on egusphere-2025-553', Anonymous Referee #2, 12 Apr 2025
Summary
This paper explores the application of Machine Learning (ML) algorithms to post-process physically-based hydrological models and find key predictors to improve daily streamflow predictions. The methodology is tested over 39 sub-catchments in Western Victoria, Australia, using the streamflow predictions of the GR4J physically-based model and a set of additional climatic and hydrological variables (rainfall, potential evapotranspiration, streamflow-derived variables).
Review
While the topic of the paper is definitely of interest to the scientific community and fits the scope of the HESS journal, I have several comments that the authors should address to improve the overall quality and scientific relevance of this manuscript. I have both general and specific comments.
General comments:
- The introduction section is far too long and sometimes addresses points that are not revisited later in the paper. This is the case for the non-stationarity issue in physically-based hydrological models. The authors devote considerable space in the introduction to this topic, which is then not touched upon again in the analysis or in the discussion. I suggest shortening it or changing something in the analysis (see my general comment #6). In general, the introduction could be shortened and focused on the two research questions.
- The introduction focuses heavily on drought conditions, but this is not reflected in the two research questions. Since part of the results also analyses the performance of all models under low-flow conditions, I suggest either tailoring the current research questions to low-flow conditions or adding a third question about them.
- Results and discussion section: I believe the results section could be divided into subsections addressing the different research questions, which would improve readability. As it stands, it is difficult to keep the two research questions in focus.
- Results and discussion section: there is no real discussion of the results in the context of the literature, neither for GR4J nor for the Machine Learning (ML) models. The literature offers plenty of evidence for the role of past streamflow in improving streamflow predictions (because of its high autocorrelation), but this is never mentioned in the paper. The relevance of past streamflow should not have come as a surprise. Also, there is no mention of the limitations of this study.
- The comparison of GR4J with ML models trained with streamflow is not entirely fair, as GR4J is not using the same input variables. While I understand that part of the purpose of this paper was to indeed find out which other predictors not used by GR4J should instead be considered, I believe that the authors should add a sentence acknowledging the difference in the models, also mentioning the well-known importance of past streamflow for ML models.
- I do not agree with the way performance is presented, i.e. showing the performance of the ML and GR4J models in the pre-drought, drought and post-drought conditions all together. I believe it would be more informative to first compare performance on the testing set (during the pre-drought period), which is supposed to represent hydrologically “regular” conditions. This would allow identifying (potential) deficiencies that are specific to the GR4J model (for instance the lack of long-term memory) and the relevant input variables for streamflow prediction. Then, repeating the same comparison for the Millennium drought and post-drought periods would show whether the deficiencies and relevant input features are the same, or whether the Millennium drought indeed changed the hydrological characteristics of the catchments analysed. If the ML models perform better in both periods, this would support the thesis that physically-based models are not adequate for hydrological studies under non-stationary conditions, while ML models are, which would also justify the introduction's focus on climate non-stationarity.
- The ML models used exclude the Long Short-Term Memory (LSTM) model, which is nowadays considered the state of the art among ML algorithms for hydrology. While I understand that adding yet another model to the analysis would now require considerable time, I believe that the authors should acknowledge the use of these models in the literature review of the introduction, explain why LSTM was not used, and potentially add this to the limitations of the work.
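The period-by-period comparison suggested in the general comments could be organised along the following lines. This is only a minimal sketch: the drought years come from the abstract (1997 to 2009), but the exact boundary days, the function names and the mean absolute error used as a placeholder metric are assumptions, not the authors' setup.

```python
from datetime import date

# Drought years follow the abstract (1997 to 2009); the exact first and
# last days of the drought period are assumptions made for illustration.
DROUGHT_START = date(1997, 1, 1)
DROUGHT_END = date(2009, 12, 31)


def period_of(day):
    """Label a day as pre-drought, drought or post-drought."""
    if day < DROUGHT_START:
        return "pre-drought"
    if day <= DROUGHT_END:
        return "drought"
    return "post-drought"


def score_by_period(days, observed, simulated, metric):
    """Apply `metric` separately to the samples of each period."""
    groups = {}
    for d, o, s in zip(days, observed, simulated):
        obs_list, sim_list = groups.setdefault(period_of(d), ([], []))
        obs_list.append(o)
        sim_list.append(s)
    return {p: metric(o, s) for p, (o, s) in groups.items()}


def mean_abs_error(observed, simulated):
    """Placeholder metric; any skill score with this signature would do."""
    return sum(abs(o - s) for o, s in zip(observed, simulated)) / len(observed)


# Toy values, not data from the manuscript.
days = [date(1990, 6, 1), date(1990, 6, 2), date(2003, 6, 1), date(2015, 6, 1)]
obs = [2.0, 4.0, 1.0, 3.0]
sim = [2.5, 3.5, 2.0, 3.0]
scores = score_by_period(days, obs, sim, mean_abs_error)
```

Running the same function once for GR4J and once for each ML model would give one score per model and period, which is the comparison the comment asks for.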
Specific comments:
- Line 25: climate change is listed in the keywords, but the paper says nothing about it beyond a mention in the introduction. I suggest the authors change the keyword or update the manuscript with additional considerations about climate change.
- Line 32: space missing between “noteworthy example” and “(van Dijk et al., 2013)”.
- Lines 34-36: I believe the referencing style is incorrect; the references should all be within the same brackets.
- Line 47: there is a typo. It should be “non-stationarity” rather than “non-stationary”.
- Lines 95-96: I do not agree with this statement, as developing ML models also requires writing code yourself. I suggest revising it.
- Line 115: Section 2.2.2.8 is mentioned, but no such section exists.
- Lines 130-131: I do not agree about this literature gap, as there are several attempts to use ML in a hybrid configuration to improve streamflow predictions.
- Section 2.1: as part of the focus of the paper is related to finding relevant input features, I would already specify here which variables are used in the ML models, or at least add a sentence guiding the reader to read section # xx to have more information about it.
- Section 2.2: the methodology would be easier to follow if a general workflow or an overview of the overall methodology were given before going into the details of each modelling part.
- Section 2.2 and results: how are the rainfall change and streamflow change linked to the rest of the investigation?
- Line 194: what are the four parameters of the GR4J model?
- Figure 2: there is no legend about the symbols used. What is En, Es, Pn, Ps, and so on?
- Lines 205-208: does the finding that Random Forest, Gradient Boosting and MLP perform best come from the results of this work or from the papers referenced? Also, if the other models are discarded, why present them as part of the methodology in the first place?
- Line 223: it is mentioned that RF can be used to check the importance of features, but why is this characteristic not leveraged, given that part of the paper focuses on finding the most relevant predictors to improve streamflow predictions? If it is not used, I believe this should be justified in the paper, especially after adding this line.
- Section 2.2.3: what is the target of the model? It is not really specified.
- Line 256: how many steps back of rainfall and potential evapotranspiration are considered? What is the lag between the target and these predictors?
- Lines 278-289: the explanation of the variables used in the predictive mode and the training mode is not clear. If real observations are used to compute the runoff coefficient and short/long-term memory in the predictive mode only, which variables are used in the training mode?
- Line 332: it is mentioned that negative values are omitted, while previously (line 328, figure 5 caption), it is mentioned that negative values are shown as 0. I believe the same approach should be followed everywhere, to avoid inconsistent evaluation across the paper.
- Lines 361-362: why NDVI, LAI, groundwater exchange and not other variables?
- Lines 380-381: GR4J overestimates low flow, but is this in any period (pre and post-drought), or only after (during) millennium drought?
- Lines 390-391: it is mentioned that it is not evident which predictor has the greatest overall impact. As this is one of the research questions of the paper, this result should be addressed again when discussing the limitations of this work.
- Lines 403-406: ranges of runoff coefficients are given. However, it would be interesting for the reader to know which hydrological conditions or regimes these coefficients correspond to, for instance catchments with a flashy response, long recession limbs, or high/low interannual variability.
- Lines 464-466: these lines are not clear. How is it possible that the results of the post-drought period show that there are memory effects in the pre-drought conditions?
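The question raised for Line 256 (how many steps back are considered, and with what lag relative to the target) can be made concrete with a small sketch. The helper name `lagged_features`, the choice of two lags and the rainfall values are purely illustrative assumptions; the manuscript does not state the configuration actually used.

```python
def lagged_features(series, n_lags):
    """Build predictor rows for a target at day t.

    The row for day t holds series[t-1], ..., series[t-n_lags], so the
    first n_lags days have no complete row and are skipped.
    """
    rows = []
    for t in range(n_lags, len(series)):
        rows.append([series[t - k] for k in range(1, n_lags + 1)])
    return rows


# Illustrative daily rainfall (mm); not data from the manuscript.
rain = [0.0, 5.0, 12.0, 0.0, 3.0]
n_lags = 2
X = lagged_features(rain, n_lags)  # one row per usable day
target_days = list(range(n_lags, len(rain)))  # days the rows line up with
```

Stating `n_lags` and the offset between predictor and target explicitly, as above, is the kind of detail the comment asks the authors to report.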
- AC2: 'Reply on RC2', Arash Aghakhani, 22 Jun 2025
Review of HESS Manuscript
“Uncovering a Key Predictors for Enhancing Daily Streamflow Simulation Using Machine Learning”
Dear Editor,
Please find attached my review of the manuscript.
The article falls within the scope of HESS.
The authors use 3 machine learning algorithms as post-processors of the GR4J model. They create multiple models by varying the inputs that go into the ML algorithms and compare the performance of such models. They evaluate their model in a subset of basins in Australia.
Major comments:
Section 2.2.3:
In this section, the authors explain the different models used in their study. In general, the output of the GR4J model and several additional variables are used as input to different machine learning algorithms that act as postprocessors (Figure 3). However, the performance comparison of the different models has errors.
As a benchmark, the authors use GR4J, a rainfall-runoff model that receives meteorological input and predicts discharge. However, models 2c, 3, 4, 5 and 6 use observed discharges as input to the ML algorithm in addition to the GR4J predictions. One cannot compare a model that receives observed discharge as an input with a model that does not; the former is expected to perform better. Discharge is highly autocorrelated in time, so the discharge at time t-1 is an extremely good predictor of the discharge at time t. This is why, in the results, they report that “Model 2c demonstrates improvement with respect to GR4J, which emphasises the importance of taking into account short-term streamflow memory.” This is not a surprising finding, and it invalidates the comparison between models that receive discharge and models that do not.
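The point about temporal correlation can be illustrated with a short, hedged sketch. The series below is synthetic (two storm pulses followed by exponential recessions with an assumed 5% daily decay) and is not data from the manuscript; the function name is likewise illustrative.

```python
import math


def lag1_autocorrelation(q):
    """Pearson correlation between q[t-1] and q[t]."""
    x, y = q[:-1], q[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic discharge: storm pulses on days 20 and 70, then a recession
# in which flow drops by 5% per day (all values are assumptions).
flow = []
q = 1.0
for t in range(120):
    if t in (20, 70):
        q += 8.0
    q *= 0.95
    flow.append(q)

r1 = lag1_autocorrelation(flow)
```

Even with abrupt storm responses in the series, yesterday's flow remains strongly correlated with today's, which is why any model given observed discharge at t-1 starts with a large advantage over one driven by meteorological forcing alone.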
Section 3.
In line 333, the authors indicate that “None of the models 1, 2a, and 2b show any improvement over the GR4J predictions”. This is contrary to what has been shown in the literature, where using ML models as postprocessors of process-based models improves performance (Frame et al., 2021) because of the enhanced flexibility of the resulting hybrid model. Further explanation of why this is not the case here is required.
Moreover, in the case reported by the authors, the models are performing badly. Based on Figure 7b and 7c, models 1, 2a and 2b reported a negative NSE for 60% (or more) of the basins. This indicates that just taking the average flow is better than the model, and consequently, the models are not working at all. Why is this the case?
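For readers less familiar with the metric, the following sketch shows why a negative NSE means the model is worse than simply predicting the mean observed flow. It uses the standard Nash-Sutcliffe definition; the flow values are made up for illustration.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of the observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


# Made-up flows, for illustration only.
obs = [1.0, 3.0, 2.0, 6.0, 4.0]
mean_flow = [sum(obs) / len(obs)] * len(obs)  # "just take the average"
biased = [o + 3.0 for o in obs]               # a strongly biased model

nse_perfect = nse(obs, obs)       # perfect simulation
nse_mean = nse(obs, mean_flow)    # mean-flow benchmark
nse_biased = nse(obs, biased)     # worse than the benchmark
```

The mean-flow benchmark scores zero by construction, so any model with NSE below zero adds error relative to that trivial predictor.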
In lines 370-372, the authors indicate that GR4J model lacks the capacity to represent low flow context, because the other ML algorithms performed better. However, this is again an unfair comparison because all the ML algorithms that you are using in this comparison receive discharge as input, which will be an extremely informative predictor of the discharge in the next time step, especially during low flow periods. Therefore, this is not a valid comparison.
General comment:
Even though the authors present an interesting study, the ML methods used are far from the current state of the art. It has been shown in multiple studies that LSTMs perform well as purely ML methods (Kratzert2019b, Kratzert2021 and Feng2020 for CAMELS US, Lees2021 for CAMELS GB, Loritz2024 for CAMELS DE) and as postprocessors of process-based models (Frame et al., 2021). The overall poor performance of the hybrid models presented in this study (when they did not receive discharge as input) indicates that the general pipeline could be improved, as the ML postprocessor is not doing its job.
Moreover, it should be noted that other strategies for constructing hybrid models, like using ML methods to parameterize a process-based model (Kraft2022, Feng2022, AcuñaEspinoza2024) or using ML methods to replace process-based model parts (Höge2022, Li2023, Li2024) have shown improved performance with respect to the stand-alone conceptual model, which would be worth considering given that in this case, the hybrid models that did not receive discharge as input are not able to outperform the stand-alone GR4J model.
Therefore, I believe the models presented in the paper do not meet the current state of the art, and further improvement is necessary.
Minor comments:
Line 34: Use proper citation format.
Line 43: This sentence does not read well. Please improve the phrasing.
Line 47: Should be: hydrologic non-stationarity, use the noun and not the adjective.
Line 60: What are you referring to here as validation data? Is it the data that you use to evaluate your model after calibration (but this will also include the forcings)? or the target variable that you are predicting? It just seems that the word validation here is out of context because errors can be found in other types of data too.
Line 62: Are you using conceptual hydrological models as a synonym of process-based models, as a subcategory or as a different category? The connection with the previous idea could be improved.
Line 95: I do not agree with this phrase. There is code development by the user, because you are still using a model. Machine learning methods are models, and they need to be coded. It would be better to indicate that, during training, the model learns to map the input-output relationships with fewer prior constraints on how this mapping should be done.
Lines 98-101: It would be good to cite the studies that use these types of models.
Line 124: “or with data containing irrelevant or redundant information.” Do you have a source or examples that justify this? Because in principle, if data is not relevant for a ML model, the model could just ignore it.
Line 128: Which published studies?
Line 130: I disagree that there is a “literature gap on how machine learning can be used to improve hydrological models”. There are a lot of studies published in this area. Of course, there are things that can be improved, but what you mentioned here is too general.
Lines 205-207: You should only mention the models that you will present results for. You are saying that multiple algorithms were assessed in the study, you are naming them, and then saying that some of them are not going to be discussed. So why mention them at all? To make the study cleaner, I suggest you talk only about the results you are presenting. Also, why is Lees et al. (2021) cited in this part? They used an LSTM model, which you are not using. Moreover, please clarify what the other citations are referring to.
Line 214: I would suggest avoiding this kind of phrase. Saying that random forest is one of the most powerful statistical learning methods is subjective. This would depend on the application you have, the metric you are using, and many other factors.
Line 227: Same here, avoid saying that gradient boosting is widely recognised as one of the most powerful algorithms. This is again subjective, case-dependent and not related to the main point you are trying to make.
Line 238: This is not true. MLP is not the most popular type of neural network in hydrology. The current state of the art has been achieved with LSTMs (Kratzert2019b, Kratzert2021 and Feng2020 for CAMELS US, Lees2021 for CAMELS GB, Loritz2024 for CAMELS DE). Transformers have also shown good results in CAMELS US. Both of these methods considerably outperform MLP.
Line 252: What do you mean by calibration and optimisation were conducted for the training and test period only? You should not calibrate for the test period. The test period is used to evaluate the model that was calibrated during the training period. I think there is a misunderstanding on the names you are using.
Line 266: Improve phrasing of “an intentional effort”.
Line 272: Are you referring to mean-squared error or sum-squared error?
Lines 270-277: What you refer to here as a testing period is what is normally referred to as validation.
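To make the terminology in the comments on Line 252 and Lines 270-277 concrete, a conventional chronological split might look like the sketch below. The function name and the period lengths are assumptions for illustration; they are not the split used in the manuscript.

```python
def chronological_split(series, n_train, n_val):
    """Split a time-ordered series without shuffling.

    train:      used to fit (calibrate) the model
    validation: used to tune hyperparameters or stop training
    test:       touched only once, to evaluate the final model
    """
    train = series[:n_train]
    validation = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, validation, test


# 100 consecutive days, split 60 / 20 / 20 (lengths are illustrative).
days = list(range(100))
train, validation, test = chronological_split(days, n_train=60, n_val=20)
```

Keeping the test block strictly after the other two, and never calibrating on it, is what allows it to serve as an independent evaluation.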
References
Acuña Espinoza, E., Loritz, R., Álvarez Chaves, M., Bäuerle, N., and Ehret, U.: To bucket or not to bucket? Analyzing the performance and interpretability of hybrid hydrological models with dynamic parameterization, Hydrol. Earth Syst. Sci., 28, 2705–2719, https://doi.org/10.5194/hess-28-2705-2024, 2024.
Frame, J. M., Kratzert, F., Raney II, A., Rahman, M., Salas, F. R., and Nearing, G. S.: Post-Processing the National Water Model with Long Short-Term Memory Networks for Streamflow Predictions and Model Diagnostics, Journal of the American Water Resources Association, 57, 885–905, https://doi.org/10.1111/1752-1688.12964, 2021.
Feng, D., Liu, J., Lawson, K., and Shen, C.: Differentiable, Learnable, Regionalized Process-Based Models With Multiphysical Outputs can Approach State-Of-The-Art Hydrologic Prediction Accuracy, Water Resour. Res., 58, e2022WR032404, https://doi.org/10.1029/2022WR032404, 2022.
Höge, M., Scheidegger, A., Baity-Jesi, M., Albert, C., and Fenicia, F.: Improving hydrologic models for predictions and process understanding using neural ODEs, Hydrol. Earth Syst. Sci., 26, 5085–5102, https://doi.org/10.5194/hess-26-5085-2022, 2022.
Kraft, B., Jung, M., Körner, M., Koirala, S., and Reichstein, M.: Towards hybrid modeling of the global hydrological cycle, Hydrol. Earth Syst. Sci., 26, 1579–1614, https://doi.org/10.5194/hess-26-1579-2022, 2022.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019b.
Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. S.: A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall-runoff modeling, Hydrol. Earth Syst. Sci., 25, 2685–2703, https://doi.org/10.5194/hess-25-2685-2021, 2021.
Lees, T., Buechel, M., Anderson, B., Slater, L., Reece, S., Coxon, G., and Dadson, S. J.: Benchmarking data-driven rainfall–runoff models in Great Britain: a comparison of long short-term memory (LSTM)-based models with four lumped conceptual models, Hydrol. Earth Syst. Sci., 25, 5517–5534, https://doi.org/10.5194/hess-25-5517-2021, 2021.
Li, B., Sun, T., Tian, F., Tudaji, M., Qin, L., and Ni, G.: Hybrid hydrological modeling for large alpine basins: A semi-distributed approach, Hydrol. Earth Syst. Sci., 28, 4521–4538, https://doi.org/10.5194/hess-28-4521-2024, 2024.
Li, B., Sun, T., Tian, F., and Ni, G.: Enhancing process-based hydrological models with embedded neural networks: A hybrid approach, J. Hydrol., 625, 130107, https://doi.org/10.1016/j.jhydrol.2023.130107, 2023.
Loritz, R., Dolich, A., Acuña Espinoza, E., Ebeling, P., Guse, B., Götte, J., Hassler, S. K., Hauffe, C., Heidbüchel, I., Kiesel, J., Mälicke, M., Müller-Thomy, H., Stölzle, M., and Tarasova, L.: CAMELS-DE: Hydro-meteorological time series and attributes for 1582 catchments in Germany, Earth Syst. Sci. Data, 16, 5625–5642, https://doi.org/10.5194/essd-16-5625-2024, 2024.