Testing data assimilation strategies to enhance short-range AI-based discharge forecasts
Abstract. Effective discharge forecasts are essential in operational hydrology. The accuracy of such forecasts, particularly at short lead times, is generally improved by integrating recent measured discharges through data assimilation (DA) procedures. Recent studies have demonstrated the effectiveness of deep learning (DL) approaches for rainfall-runoff (RR) modeling, particularly Long Short-Term Memory (LSTM) networks, which outperform traditional approaches. However, most of these studies do not include DA procedures, which may limit their operational forecast performance. This study proposes and evaluates three DA strategies that incorporate either recent observed discharges or the forecast discharges of a pre-trained benchmark model (BM). The proposed strategies, based on a Multilayer Perceptron (MLP) orchestrator, include: (1) the integration of recent observed discharges, (2) the integration of both recent discharge observations and pre-trained BM forecasts, and (3) the post-processing of BM forecast errors. Experiments are carried out on the CAMELS-US dataset with two established benchmark models: the trained LSTM model from Kratzert et al. (2019) and the conceptual Sacramento Soil Moisture Accounting (SAC-SMA) model from Newman et al. (2017), covering both machine learning and conceptual RR simulation approaches. Lead times of 1, 3, and 7 days, covering short- and mid-term horizons, are considered. The approaches are evaluated in two forecast frameworks: (1) perfect meteorological forecasts over the forecasting lead time and (2) highly uncertain ensemble meteorological forecasts. The two frameworks yield contrasting outcomes. Under the perfect forecast framework, the application of DA leads to substantial improvements in forecast performance, although the magnitude of these gains depends on the initial performance of the BM models and the forecasting lead time. Improvements are consistently significant for the SAC-SMA cases, while for the LSTM cases, gains are observed mainly in basins where the LSTM initially underperforms. However, the ensemble forecast evaluation yields unexpected results: the performance ranking of the tested models changes markedly compared to the perfect forecast framework. The LSTM model, in particular, appears penalized by the unreliability – specifically, the under-dispersion – of its forecast ensembles, meaning that its predictions are insufficiently responsive to meteorological forcing over the forecast lead time. This finding underscores the importance of ensuring reliable ensemble dispersion for the efficient operational deployment of AI-based hydrological forecasts.
Review of HESS Manuscript
“Testing data assimilation strategies to enhance short-range AI-based discharge forecasts”
Please find attached my review of the manuscript.
The article falls within the scope of HESS.
The authors use the outputs of an LSTM and a SAC-SMA model, together with historical discharge and meteorological variables, as inputs to a feed-forward neural network that produces hydrological forecasts 1, 3, and 7 days ahead. They ran their experiments on the CAMELS-US dataset.
I believe there are important deficiencies in the study that should be addressed before moving forward in the review process. See my comments below.
General comments
Section 2.1.
The authors indicate that they are benchmarking against Kratzert et al. (2019) and Newman et al. (2017), but they use different training/testing periods than those benchmarks. In Section 2.1 you indicate that you are training over 1989-2006 and testing over 2006-2008. However, Kratzert et al. used 1999-2008 for training and 1989-1999 for testing. This means you are training on much more data than they did, and testing on only a two-year period. Why this difference?
Moreover, you also indicate that you re-simulated the 1989-2008 period with the pre-trained models from Kratzert et al. and Newman et al., but again, you are not respecting the training/testing split used in their original studies.
Using benchmarks is an extremely valuable technique because it automatically places your method within the existing literature; however, the conditions of the original studies need to be respected. One should, if possible, adapt the new experiment to the existing benchmark, otherwise the comparison is not meaningful.
Also, why are you using only two years of testing?
Lastly, you should benchmark your study against other studies that used data assimilation methods. On CAMELS-US there are the studies of Nearing et al. (2022) and Feng et al. (2020) or, more recently, Yang et al. (2025).
Section 2.3
In section 2.3 you indicate that a specific model should be calibrated for each lead time, and that the alternative is inefficient. But this is not true. The LSTM has a linear layer at the end that transforms the hidden states to discharge.
Your model is a simple extension of this, but instead of a linear layer that goes from hidden states to discharge, you have a feed-forward neural network (so a couple of linear layers) in which, besides the simulated discharge, you also concatenate some past meteorological variables and discharge. I have nothing against the simplicity of the model, because if it is simple and it works, great; but there are several things that need to be considered.
You can run the LSTM as a sequence-to-sequence model, roll it over the forecast horizon (so for a 7-day forecast, unroll 7 steps), and at each step concatenate the hidden state with whatever additional inputs you want and pass it through the feed-forward neural network. This way you have a consistent and generalizable model that does not require a different embedding for each case.
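To make the suggestion concrete, below is a minimal PyTorch sketch of such a rollout, assuming the hindcast and forecast forcings are available as tensors; all class and variable names and dimensions are illustrative, not taken from the manuscript:

import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    # Warm up an LSTM on the hindcast period, roll it over the forecast horizon,
    # and map each hidden state (plus any extra inputs) to discharge with a small FFN.
    def __init__(self, n_forcings, n_extra, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_forcings, hidden_size, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size + n_extra, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, past_forcings, future_forcings, extra_future_inputs):
        # past_forcings:       (batch, hindcast_len, n_forcings)
        # future_forcings:     (batch, n_lead, n_forcings), e.g. the forecast meteorology
        # extra_future_inputs: (batch, n_lead, n_extra), e.g. lagged observed discharge or BM output
        _, state = self.lstm(past_forcings)           # warm up on the hindcast period
        h_seq, _ = self.lstm(future_forcings, state)  # roll over the forecast horizon
        x = torch.cat([h_seq, extra_future_inputs], dim=-1)
        return self.head(x)                           # (batch, n_lead, 1): one value per lead time

model = Seq2SeqForecaster(n_forcings=5, n_extra=2)
q_hat = model(torch.randn(8, 365, 5), torch.randn(8, 7, 5), torch.randn(8, 7, 2))
print(q_hat.shape)  # torch.Size([8, 7, 1])

The point is that a single head serves all lead times, so no per-lead-time model or retraining is needed.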
Also, why are you evaluating only days 1, 3, and 7? Why not all seven days?
Furthermore, you are using climatological ensembles to create possible forecasts, but I do not believe this is the best strategy. Climatology is normally used at medium to seasonal ranges (a couple of weeks to some months), where meteorological forecast models are no longer reliable, and for variables that present a cyclic pattern (temperature, radiation, ...), but you are using it for precipitation 1 or 3 days ahead, which I do not believe is good practice. How is the precipitation on the 1st of November over the last 18 years related to the precipitation on the 1st of November this year? I do not think there is a strong relationship between these values that can be used for short-term forecasting, especially for precipitation, which is the most important variable driving the forecast. If you want to use it for temperature or radiation, that can be a (non-ideal but defensible) option, but I would highly recommend not using it for precipitation.
In this section you also indicate the absence of operational weather forecast archives, but this is not fully correct. Shalev and Kratzert (2024) released historical weather forecasts for Caravan, which includes the CAMELS-US dataset. It is true that not all the products are available over your testing period, but: one option is to use CHIRPS-GEFS, which covers only precipitation but is available in your period of interest. Another, better option is to benchmark a simple LSTM against Kratzert et al. (2019) and, once that is working well, to evaluate your model and the new LSTM over the period (2016-2024) for which the historical forecasts from Shalev and Kratzert (2024) are available. This is more work but would give you a robust study to evaluate your model under real forecast conditions.
Sections 2.2 and 2.5.2
Here you indicate that you use an ensemble of 60 runs. Why 60? Most studies use between 5 and 10. Do you get significantly different results with 60? If you want to use 60, that is your choice, but in Section 2.5.2 you indicate that, because of the 60 ensemble runs, you face an unreasonable computational cost and will therefore only use a subset of basins. If 5 to 10 runs give you the same results as 60, then you can reduce the computational cost and run the study over the full region, which would produce more robust results.
Also, the 18 ensemble members can be accommodated in the batch dimension of the input tensor, and the different seeds can be run in parallel (even on a single GPU), so most of the computational overhead you are reporting can be overcome with some technical tweaking.
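As an illustration of what I mean by technical tweaking, here is a sketch (with purely illustrative shapes and an untrained LSTM) of folding ensemble members, and even basins, into the batch dimension so the whole ensemble is evaluated in one forward pass:

import torch

n_members, seq_len, n_features = 18, 365, 5
model = torch.nn.LSTM(n_features, 64, batch_first=True)

# One basin: treat the 18 members as independent samples along the batch axis.
forcings = torch.randn(n_members, seq_len, n_features)
with torch.no_grad():
    out, _ = model(forcings)          # one pass evaluates all members
print(out.shape)                      # torch.Size([18, 365, 64])

# Several basins: flatten (n_basins, n_members) into a single batch axis, then reshape back.
n_basins = 4
x = torch.randn(n_basins, n_members, seq_len, n_features)
with torch.no_grad():
    out, _ = model(x.reshape(n_basins * n_members, seq_len, n_features))
out = out.reshape(n_basins, n_members, seq_len, -1)

The different training seeds are independent models, so they can likewise be dispatched in parallel (separate processes or CUDA streams) rather than run sequentially.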
Also, from what I understood, you are not training the LSTM, just the feed-forward neural network. Is that right? If so, this should be extremely fast, and I do not understand where the unreasonable computational cost is coming from.
Line 38: I do not agree that discharge simulation and discharge forecasting are fundamentally different tasks. You are trying to model the same system and the same rainfall-runoff response. In forecasting mode you have the increased uncertainty of the meteorological input; however, that is more of a limitation than a fundamentally different task. Multiple operational models are calibrated with observed data in pseudo-forecast mode and later incorporated into forecasting pipelines, and they work well. Data-driven methods have the advantage that, if trained with real forecasts, they can learn to compensate for systematic biases, but again, this is more a matter of training strategies that compensate for data-quality limitations than an indication that the task is fundamentally different.
Line 66: The title of Fig. 1 should be more self-explanatory.
Line 127-128: What do you mean by "The direct forecasts from the benchmark models were assumed to be unchanged for the tested lead time; therefore, no further running was necessary"?
Line 136: Is the reference to figure 1 correct here?
Figure 7: You should explain the colors in the legend of the figure as well, not only in the text above; the figure plus its legend should be self-explanatory. Also, as a suggestion, the message of this figure would be better conveyed by a boxplot-per-lead-time graph. The boxplots would give the distribution across basins, and because there is one for each model and each lead time, they can be easily compared. Something similar to your Figure 8, but for the different lead times (you can also see Figure 3 from Nearing et al., 2024).
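For concreteness, here is a minimal matplotlib sketch of the grouped-boxplot layout I have in mind; the scores and model labels are random placeholders, not results from the manuscript:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
models = ["BM", "DA strategy 1", "DA strategy 2", "DA strategy 3"]   # placeholder labels
lead_times = [1, 3, 7]
# Placeholder per-basin scores for each model and lead time.
scores = {m: [rng.normal(0.7 - 0.05 * j, 0.1, size=500) for j in range(len(lead_times))]
          for m in models}

fig, ax = plt.subplots(figsize=(8, 4))
width = 0.18
for i, (m, color) in enumerate(zip(models, ["C0", "C1", "C2", "C3"])):
    positions = [j + (i - 1.5) * width for j in range(len(lead_times))]
    bp = ax.boxplot(scores[m], positions=positions, widths=width * 0.9,
                    showfliers=False, patch_artist=True)
    for box in bp["boxes"]:
        box.set_facecolor(color)
    ax.plot([], [], color=color, label=m)   # proxy artist so each model appears in the legend
ax.set_xticks(range(len(lead_times)))
ax.set_xticklabels([f"{lt}-day" for lt in lead_times])
ax.set_xlabel("Lead time")
ax.set_ylabel("Per-basin score (e.g. KGE)")
ax.legend(loc="lower left")
plt.tight_layout()
plt.show()

One panel like this per metric makes the across-basin distributions and the model ranking at each lead time immediately comparable.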
Section 3.2.2: Can you explain in detail how you constructed this figure? How did you construct the 10 classes? What does a lower or a higher rank indicate?
Line 330-335: Can you explain in detail how you are evaluating the LSTM here to produce these results? You also indicate that "This result suggests that the LSTM model is insufficiently responsive to recent meteorological inputs". Could it be that, because the LSTM is driven only by meteorology, and because the climatological forecasts are not good (see my comment above), the predictions are biased? The other models have the advantage of having discharge, which is a highly autoregressive variable, so they somehow compensate. However, if all you have are meteorological forecasts and these are nonsense, how can the model perform well? I think this point is important and can bias the results you are presenting.
References:
Nearing, G. S., Klotz, D., Frame, J. M., Gauch, M., Gilon, O., Kratzert, F., Sampson, A. K., Shalev, G., & Nevo, S. (2022). Technical note: Data assimilation and autoregression for using near-real-time streamflow observations in long short-term memory networks. Hydrology and Earth System Sciences, 26(21), 5493–5513. https://doi.org/10.5194/hess-26-5493-2022
Nearing, G., Cohen, D., Dube, V., et al. (2024). Global prediction of extreme floods in ungauged watersheds. Nature, 627, 559–563. https://doi.org/10.1038/s41586-024-07145-
Feng, D., Fang, K., & Shen, C. (2020). Enhancing streamflow forecast and extracting insights using long-short term memory networks with data integration at continental scales. Water Resources Research, 56, e2019WR026793. https://doi.org/10.1029/2019WR026793
Yang, Y., Pan, M., Feng, D., Xiao, M., Dixon, T., Hartman, R., Shen, C., Song, Y., Sengupta, A., Delle Monache, L., & Ralph, F. M. (2025). Improving streamflow simulation through machine learning-powered data integration and its potential for forecasting in the Western U.S. Hydrology and Earth System Sciences, 29(20), 5453–5476. https://doi.org/10.5194/hess-29-5453-2025
Shalev, G., & Kratzert, F. (2024). Caravan MultiMet: Extending Caravan with multiple weather nowcasts and forecasts. arXiv preprint arXiv:2411.09459. https://arxiv.org/abs/2411.09459