Long Short-Term Memory Networks for Real-time Flood Forecast Correction: A Case Study for an Underperforming Hydrologic Model
Abstract. Flood forecasting systems play a key role in mitigating the socio-economic damage caused by flooding events. The majority of these systems rely on process-based hydrologic models (PBHM), which are used to predict future river runoff. To enhance the forecast accuracy of these models, many operational flood forecasting systems implement error correction techniques, which is particularly important if the underlying hydrologic model is underperforming. In particular, AutoRegressive Integrated Moving Average (ARIMA) type models are frequently employed for this purpose. Despite their popularity, numerous studies have pointed out potential shortcomings of these models, such as a decline in forecast accuracy with increasing lead time. To overcome the limitations of conventional ARIMA models, we propose a novel forecast correction technique based on a hindcast-forecast Long Short-Term Memory (LSTM) network. We showcase the effectiveness of the proposed approach by rigorously comparing its capabilities to those of an ARIMA model, using one underperforming PBHM as a case study. Additionally, we test whether the LSTM benefits from the PBHM's results or whether similar accuracy can be reached with a standalone LSTM. Our investigations show that the proposed LSTM model significantly improves the PBHM's forecasts. Compared to ARIMA, the LSTM achieves higher forecast accuracy for longer lead times. In terms of flood event runoff, the LSTM performs mostly on par with ARIMA in predicting the magnitude of the events; however, it substantially outperforms ARIMA in predicting the timing of the peak runoff. Furthermore, our results provide no reliable evidence of whether the LSTM is able to extract information from the PBHM's results, given the largely equal performance of the proposed and standalone LSTM models.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1030', Anonymous Referee #1, 07 Jun 2024
Dear editor and authors,
The following comment details my review of the manuscript “Long Short-Term Memory Networks for Real-time Flood Forecast Correction: A Case Study for an Underperforming Hydrologic Model” submitted to HESS.
In this preprint the authors present a model comparison study in which two (or three depending on the application) models are compared in their ability to forecast runoff. The models compared are all statistical- or machine learning-based models which take as inputs predictions of an underperforming conceptual model. The preprint is well written and the results are compelling. The scope of the manuscript is well suited for HESS and it has potential to be a great contribution to the literature on runoff forecasting, as well as models which combine physics-based and data-driven approaches.
However, I have a number of major and minor comments/suggestions that should be addressed before final publication and ultimately will benefit the manuscript and overall study.
Major Comments
The comparison is not “fair”
What the ARIMA model is doing is very different from what the LSTM-based models are doing, and this “unfair” comparison is apparent in the results. Evidently, a model that can use precipitation data in its forecasting step will be better at predicting events that have precipitation as their main driver, rather than the current or past discharge as calculated by an underperforming PBHM.
What is missing is a model that sits between the ARIMA and the HLSTM-PBHM and bridges the gap between the two approaches. In principle this could be an ARIMA that considers exogenous inputs (ARIMAX), or an LSTM that predicts errors without the aid of external variables, for a direct comparison with the presented ARIMA model. This way we see how performance changes from a model that only corrects the PBHM (ARIMA), to a model that relies on the PBHM but can use other inputs when the PBHM fails (ARIMAX, sketched below), to a model that accounts for all available input data and chooses what to use (HLSTM-PBHM).
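To make the ARIMAX suggestion concrete, below is a minimal sketch using `statsmodels` on synthetic data; the variable names (`q_obs`, `q_sim`, `precip`) and the order (2, 0, 1) are illustrative assumptions, not values from the manuscript.

```python
# Minimal ARIMAX sketch: model the PBHM residuals with precipitation as
# an exogenous input. Synthetic data; order (2, 0, 1) is illustrative.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n = 500
precip = rng.gamma(2.0, 1.0, size=n)                    # synthetic precipitation
q_sim = 5.0 + 0.8 * precip + rng.normal(0.0, 0.5, n)    # stand-in PBHM simulation
q_obs = q_sim + 0.3 * precip + rng.normal(0.0, 0.3, n)  # "observed" runoff
residuals = q_obs - q_sim                               # PBHM errors to be modelled

# Plain ARIMA on the residuals (the correction compared in the manuscript)
arima = ARIMA(residuals, order=(2, 0, 1)).fit()

# ARIMAX: same residual model, but precipitation enters as an exogenous input
arimax = ARIMA(residuals, exog=precip, order=(2, 0, 1)).fit()

print("ARIMA AIC: ", arima.aic)
print("ARIMAX AIC:", arimax.aic)  # lower here, since the errors depend on precip
```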
Ultimately, I think the models which should be part of the study are: ARIMA, ARIMAX, an LSTM which predicts errors using only Qsim and Qobs (name: eLSTM?), and the presented HLSTM-PBHM. Furthermore, although the HLSTM was added to address a specific concern regarding the combination of the PBHM and an LSTM, I don't think it is able to address this issue effectively, as the authors also recognize by saying that in their findings: “We did not find strong evidence of whether the inclusion of the PBHM’s results benefited the accuracy of the LSTM.” My suggestion is that the HLSTM is dropped completely. Given that this model simply serves to check whether Qsim is somewhat informative to the LSTM in the HLSTM-PBHM, this could be made clearer through a sensitivity analysis rather than through a different model with a different architecture. I suggest the sensitivity analysis could be done using integrated gradients as in Kratzert et al. (2019), or simply by replacing the Qsim input with noise and observing the effect on the model's predictions (a minimal sketch of this ablation follows below). If the LSTM does not consider the Qsim input useful, replacing it with noise should have no effect, and vice versa.
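A minimal sketch of that noise ablation, assuming a Keras-style `model.predict` and inputs shaped `(batch, time, features)`; the channel index and model interface are hypothetical stand-ins for the authors' code.

```python
# Noise ablation sketch: replace the Qsim input channel with Gaussian noise
# of matching mean/std and measure the change in the model's predictions.
import numpy as np

def qsim_noise_ablation(model, x, qsim_channel, seed=0):
    """Mean squared difference between predictions with the real Qsim input
    and with that channel replaced by noise. A value near zero suggests the
    model does not use the Qsim input."""
    rng = np.random.default_rng(seed)
    channel = x[:, :, qsim_channel]
    x_noisy = x.copy()
    x_noisy[:, :, qsim_channel] = rng.normal(channel.mean(), channel.std(), channel.shape)
    y_ref = model.predict(x)        # predictions with the real inputs
    y_abl = model.predict(x_noisy)  # predictions with Qsim replaced by noise
    return float(np.mean((y_ref - y_abl) ** 2))
```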
This flaw in the study design is also reflected in the research questions established in the introduction. None of the research questions concern the ARIMA model, so why is it part of the study at all? From the point of view of the RQs, the study should focus only on the HLSTM-PBHM and the previously described eLSTM, which could be considered a deep-learning adaptation of ARIMA, while the HLSTM-PBHM is more akin to an ARIMAX. This keeps the scope of the paper within error-correcting strategies, and the discussion can then focus on the benefit of precipitation as an input during forecasting and on the difference between years where the PBHM is acceptable (2014 and 2016) and where it is terrible (2017).
Hyperparameters of the LSTM-based models
In the supplemental information of the paper by Nevo et al. (2022), their model is described to have an LSTM of 128 hidden units for hindcast and another 128 hidden units for forecast, which is similar to the 96 used in this article, but in the case of Nevo et al. (2022), the model was trained to forecast in at least 165 basins which use LSTM for the “stage forecast model”. The architecture presented by Nevo et al. (2022) is also based on the MTS-LSTM presented by Gauch et al. (2021). Although the purpose of that second paper is not forecasting, the idea of “handing” the hidden states from one LSTM to another is the same, and in their case both LSTMs which send and receive the hidden states have 64 hidden nodes. This is also applied in a regional case study in which the amount of data that the model needs to ingest is a lot larger. Finally, in a more recent example by some of the same authors of the previous papers, Kratzert et al. (2024) train single-basin LSTMs using models with hidden nodes ranging from 8 to, at most, 32. This is not a criticism of not using LSTM in a regional setting, in my view LSTMs are still valid for application in a single basin, acknowledging their limitations, but their size shouldn’t be the same as those used in regional modelling.
This could be addressed by adding smaller sizes to the hyperparameter search space (see the sketch below), and I would encourage the authors to present their training/validation results in the supplemental material of the article, in the form of loss curves, metrics on the training/validation sets, etc.
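As an illustration, the smaller sizes could simply be appended to the existing search space; `build_model` and `train_and_validate` below are hypothetical placeholders for the authors' training code.

```python
# Sketch: grid search including the smaller hidden sizes used in
# single-basin studies (8-32), alongside the current 96.
import itertools

search_space = {
    "hidden_size": [8, 16, 32, 64, 96],  # smaller sizes added
    "dropout": [0.0, 0.2, 0.4],
    "learning_rate": [1e-3, 5e-4],
}

results = []
for hidden, drop, lr in itertools.product(*search_space.values()):
    model = build_model(hidden_size=hidden, dropout=drop)   # hypothetical helper
    history = train_and_validate(model, learning_rate=lr)   # hypothetical helper
    results.append(((hidden, drop, lr), min(history["val_loss"])))

best_config, best_val_loss = min(results, key=lambda r: r[1])
print(best_config, best_val_loss)
```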
Checking the code repository and additional files provided by the authors, the `\tb` folder was not included, so the TensorBoard logs cannot be checked.
On checking the code further, there are leftover classes and functions that were not part of the study, such as the models that include `CNN` layers. I would suggest a general cleanup, but the code definitely runs, and I was able to generate one of the trials done for the study and the corresponding logs.
Taking a look at some of the post-processing notebooks, I find the information presented in `notebook_aux_peak_events.ipynb` informative and would encourage the authors to include some of the plots for peak events in Section 4.2 of the manuscript. The colors for the ARIMA model need to be adjusted, though, because they are the same as those of the “measured” data.
Minor Comments
General: I suggest that the name of the HLSTM-PBHM should be flipped around to PBHM-HLSTM as the PBHM is the initial step in the pipeline.
Line 72: The hindcast-forecast LSTM cannot be called “novel”, as it is adapted from the approach of Nevo et al. (2022), which in itself is adapted from other sources.
Line 146: How was the search-space for the hyperparameters of the ARIMA model defined?
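One common way to define such a search space is a small grid over (p, d, q) ranked by an information criterion; the bounds below are illustrative assumptions, run on a synthetic stand-in series.

```python
# Sketch: AIC-ranked grid search over ARIMA orders (p, d, q).
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
residuals = 0.1 * rng.normal(size=300).cumsum()  # stand-in PBHM error series

scores = {}
for order in itertools.product(range(4), range(2), range(4)):  # p, d, q bounds
    try:
        scores[order] = ARIMA(residuals, order=order).fit().aic
    except Exception:
        continue  # some orders may fail to converge

best_order = min(scores, key=scores.get)
print("best (p, d, q):", best_order, "AIC:", scores[best_order])
```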
Line 171: Why was a loss function that combines two metrics chosen? Was minimizing NSE, or simply MSE, tried? In Appendix B the authors say that this was adapted from Nevo et al. (2022), but in that paper a negative log-likelihood is minimized, as their model is probabilistic.
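For reference, a plain NSE objective is a one-liner in most frameworks; the PyTorch sketch below is an illustration, not the authors' implementation.

```python
# Sketch: standalone NSE-based objective. Minimizing this ratio maximizes
# NSE, since NSE = 1 - ratio; it reduces to MSE scaled by the variance of
# the observations in each batch.
import torch

def nse_loss(y_pred: torch.Tensor, y_true: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    squared_errors = torch.sum((y_pred - y_true) ** 2)
    variance = torch.sum((y_true - y_true.mean()) ** 2) + eps
    return squared_errors / variance  # equals 1 - NSE
```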
Figure 4: On the right-hand side of the figure I am missing Qforecast, as in Figure 3. Furthermore, the “Targets” should either be dropped from this figure or added to Figure 3, so that the two model architectures are described in an equivalent way.
Table 2: Although PBIAS is a good overall metric to include, these results would greatly benefit from including metrics which directly target low and high flows, such as FHV and FLV; see Gauch et al. (2021) or directly Yilmaz et al. (2008). Some of the results described in the text using PBIAS are better described using these two metrics (a sketch of both metrics follows the next comment).
Also, I find the reported PBIAS metrics for the ARIMA models to be strange given their KGE and NSE per year. From the code, in the `run_arima.py` I see it calls the `calculate_bias` function but only prints values to the screen while in `notebook_tab1-4_table_results.ipynb` the metrics are read directly from a file. I'm guessing that `metrics.txt` is generated using the data in the other files in that folder `metric_nse.txt`, etc. but I also don't see where in the code those files are dumped. `run_arima.py` dumps the `all_fc_df` as a pickle, but I'm not sure if that DataFrame is used to generate the metrics `.txt` files.
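To make the FHV/FLV suggestion concrete, a minimal sketch following Yilmaz et al. (2008) is given below; the 2% high-flow and 30% low-flow segments follow that paper, and the implementation details are my own reading of it.

```python
# Sketch: flow-duration-curve bias metrics after Yilmaz et al. (2008).
import numpy as np

def fhv(q_obs, q_sim, top=0.02):
    """Percent bias over the high-flow segment (top 2%) of the FDC."""
    k = max(1, int(top * len(q_obs)))
    obs = np.sort(q_obs)[::-1][:k]
    sim = np.sort(q_sim)[::-1][:k]
    return 100.0 * (sim - obs).sum() / obs.sum()

def flv(q_obs, q_sim, bottom=0.3, eps=1e-6):
    """Percent bias of the log-space low-flow volume (bottom 30% of the FDC)."""
    k = max(1, int(bottom * len(q_obs)))
    obs = np.log(np.sort(q_obs)[:k] + eps)
    sim = np.log(np.sort(q_sim)[:k] + eps)
    vol_obs = (obs - obs.min()).sum()
    vol_sim = (sim - sim.min()).sum()
    return -100.0 * (vol_sim - vol_obs) / (vol_obs + eps)
```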
Table 3: Generalization refers to a model's ability to predict on previously unseen data drawn from the same distribution as the data used to train the model. It does not concern differences between validation and testing. I would consider correcting this table to show differences between the testing and training sets, or not including it at all, as most of the discussion centered on these results appears, or can be included, in other sections.
Fig. 6 and Fig. 7: The legend of both columns should not be shared. Currently it appears as if the “normalized win ratio” has something to do with the standard deviation of each model.
References
- Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., & Hochreiter, S. (2021). Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network. Hydrology and Earth System Sciences, 25(4), 2045–2062. https://doi.org/10.5194/hess-25-2045-2021
- Kratzert, F., Gauch, M., Klotz, D., & Nearing, G. (2024). HESS Opinions: Never train an LSTM on a single basin. Hydrology and Earth System Sciences Discussions, 1–19. https://doi.org/10.5194/hess-2023-275
- Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., & Klambauer, G. (2019). NeuralHydrology—Interpreting LSTMs in Hydrology. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science, 11700, 347–362. https://doi.org/10.1007/978-3-030-28954-6_19
- Nevo, S., Morin, E., Gerzi Rosenthal, A., Metzger, A., Barshai, C., Weitzner, D., Voloshin, D., Kratzert, F., Elidan, G., Dror, G., Begelman, G., Nearing, G., Shalev, G., Noga, H., Shavitt, I., Yuklea, L., Royz, M., Giladi, N., Peled Levi, N., … Matias, Y. (2022). Flood forecasting with machine learning models in an operational framework. Hydrology and Earth System Sciences, 26(15), 4013–4032. https://doi.org/10.5194/hess-26-4013-2022
- Yilmaz, K. K., Gupta, H. V., & Wagener, T. (2008). A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9). https://doi.org/10.1029/2007WR006716
Citation: https://doi.org/10.5194/egusphere-2024-1030-RC1
AC1: 'Reply on RC1', Sebastian Gegenleithner, 24 Jul 2024
On behalf of the authors, I would like to thank the anonymous referee for his/her valuable feedback on our manuscript. We also appreciate the time and effort the referee dedicated to this review. In our opinion, the referee's constructive feedback will significantly enhance the quality of the presented manuscript.
Our response to the referee's comments, including our intended modifications to the manuscript, can be found in the attached file (response-RC1.pdf).
RC2: 'Comment on egusphere-2024-1030', Anonymous Referee #2, 16 Jul 2024
This study proposes an application of Long Short-Term Memory Networks (LSTM) complemented by the results of a hydrological model (PBHM) for operational flood forecasting in a smaller mountainous catchment. The performance of the resulting HLSTM-PBHM is compared with an ARIMA error correction model and a standalone application of the LSTM (HLSTM).
The results of this study are particularly significant, as they reveal performance improvements for the HLSTM-PBHM, especially for larger lead times. These findings have practical implications for flood forecasting in similar catchments.
The paper is within the scope and very interesting for the readers of HESS. The authors address a topic of high relevance for flood forecasting since studies focusing on small catchments and requiring sub-daily time steps are limited.
The authors have done a commendable job of presenting the scientific results concisely and in a well-structured manner. However, I have some fundamental comments on the interpretation of the proposed method and on the design of the experiment comparing the different approaches:
- From my perspective, the proposed HLSTM-PBHM is an informed approach that uses precalculated results of the hydrological model (PBHM) combined with observations for the hindcast rather than applying an explicit error correction as the ARIMA error correction model does. Therefore, the title of the paper should reflect this, and I suggest revising it.
- A consequence of this approach is that the input data used and the internal corrections of the HLSTM-PBHM cannot be compared with the residual errors of the hydrological model and the corrections calculated by the explicit error correction models.
- In general, a comprehensive analysis of the residual errors of the PBHM model, e.g., their underlying statistical distribution, would be helpful and give the reader more insight for interpreting the results. It would also test the assumption that the errors are normally distributed. Many studies (among them [1]) found highly heteroscedastic residual variances, which should be checked and accounted for in the residuals of this study (a minimal diagnostic sketch follows this list).
- Please also briefly introduce the PBHM model in the Methods Section as it is used in the study.
- Multiple sources of error exist in flood forecasting, due to meteorological uncertainties and those arising from the structure and parametrization of the hydrological model. Please elaborate in the discussion on how these different contributions could be considered in future developments of the HLSTM-PBHM.
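To illustrate the kind of residual analysis suggested in the third point above, a minimal sketch with synthetic data is given below; in the study, `q_obs` and `q_sim` would of course come from the catchment data.

```python
# Sketch: normality and heteroscedasticity diagnostics for PBHM residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
q_sim = rng.gamma(3.0, 2.0, size=1000)               # stand-in PBHM runoff
q_obs = q_sim * (1.0 + rng.normal(0.0, 0.15, 1000))  # errors grow with flow
residuals = q_obs - q_sim

# Normality: Shapiro-Wilk test (a small p-value rejects normality)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Heteroscedasticity: rank correlation between |residual| and simulated flow;
# a strong positive correlation indicates variance growing with flow.
rho, p = stats.spearmanr(q_sim, np.abs(residuals))
print("Spearman rho:", round(rho, 3), "p-value:", p)
```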
I suggest that the authors consider the above points before final publication. This will ultimately benefit the manuscript and the overall study.
[1] Li, D., Marshall, L., Liang, Z., Sharma, A., & Zhou, Y. (2021). Characterizing distributed hydrological model residual errors using a probabilistic long short-term memory network. Journal of Hydrology, 603, doi: 10.1016/j.jhydrol.2021.126888.
Citation: https://doi.org/10.5194/egusphere-2024-1030-RC2
AC2: 'Reply on RC2', Sebastian Gegenleithner, 24 Jul 2024
On behalf of the authors, I would like to thank the anonymous referee for his/her valuable feedback on our manuscript. We also appreciate the time and effort the referee dedicated to this review. In our opinion, the referee's constructive feedback will significantly enhance the quality of the presented manuscript.
Our response to the referee's comments, including our intended modifications to the manuscript, can be found in the attached file (response-RC2.pdf).
Model code and software
Model code for "Long Short-Term Memory Networks for Real-time Flood Forecast Correction: A Case Study for an Underperforming Hydrologic Model" Sebastian Gegenleithner, Manuel Pirker, Clemens Dorfmann, Roman Kern, and Josef Schneider https://github.com/tug17/ForecastModel
Interactive computing environment
Plots for "Long Short-Term Memory Networks for Real-time Flood Forecast Correction: A Case Study for an Underperforming Hydrologic Model" Sebastian Gegenleithner, Manuel Pirker, Clemens Dorfmann, Roman Kern, and Josef Schneider https://github.com/tug17/ForecastModel
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 423 | 126 | 43 | 592 | 18 | 27 |