Extending Medium-Range Global Flood Forecasts: The Google Global Flood Forecasting Model Version 2

Cohen, Deborah; Amira, Rony; Aschner, Rom; Carny, Yuval; Feinstein, Ben; Fester, Hadas; Fronman, Shmulik; Gauch, Martin; Gilon, Oren; Green, Rotem; Hassidim, Avinatan; Klotz, Daniel; Kratzert, Frederik; Korenfeld, Dan; Loike, Gila; Markel, Amit; Matias, Yossi; Mayo, Rotem; Metzger, Asher; Mosheyev, Benny; Niego, Aviel; Rees, Stephanie; Reinstein, Emily; Sicherman, Amitay; Shalev, Guy; Shefi, Omri; Shildan, Yuval; Zemach, Ido; Zlydenko, Oleg; Nearing, Grey

doi:10.5194/egusphere-2026-2283

Preprints

https://doi.org/10.5194/egusphere-2026-2283


29 Apr 2026


Abstract. This paper evaluates an updated flood forecasting system that significantly extends reliable lead times. We evaluated this updated model (v2) against the prior system (v1) and established third-party benchmarks across 1,223 global test basins. The primary finding is that the v2 system extends the reliable predictive horizon by 6 days in gauged basins and 1 day in ungauged basins relative to the v1 nowcast, as measured by the Nash Sutcliffe Efficiency. Along with this paper, we release an open-source codebase for training both the v1 and v2 forecast models with the open-source Caravan dataset.

How to cite. Cohen, D., Amira, R., Aschner, R., Carny, Y., Feinstein, B., Fester, H., Fronman, S., Gauch, M., Gilon, O., Green, R., Hassidim, A., Klotz, D., Kratzert, F., Korenfeld, D., Loike, G., Markel, A., Matias, Y., Mayo, R., Metzger, A., Mosheyev, B., Niego, A., Rees, S., Reinstein, E., Sicherman, A., Shalev, G., Shefi, O., Shildan, Y., Zemach, I., Zlydenko, O., and Nearing, G.: Extending Medium-Range Global Flood Forecasts: The Google Global Flood Forecasting Model Version 2, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2026-2283, 2026.

Received: 21 Apr 2026 – Discussion started: 29 Apr 2026

Competing interests: All authors are employed by their primary affiliation, Google, the organization that developed and operates the Google Global Flood Forecasting system and the associated open-source Google Hydrology codebase evaluated in this manuscript. The authors declare that they have no other competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.


Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-2283', Dai Yamazaki, 25 May 2026
This manuscript presents an evaluation of version 2 of the Google Global Flood Forecasting system against the previous v1 system and established third-party benchmarks. The study is valuable because it provides transparency about an operational global flood forecasting system, introduces important technical updates such as the ME-LSTM architecture and GraphCast meteorological forcings, and contributes open-source resources for the hydrological community. I consider the topic important and suitable for publication after revision.
My main concerns are related to the interpretation of the reported performance improvements. First, the comparison between v1 and v2 may be affected by changes in the gauged/ungauged status of evaluation basins, because the v2 system uses expanded training data. If some basins were ungauged in v1 but gauged in v2, part of the reported improvement may reflect increased spatial coverage of local streamflow training data rather than improvements in the model architecture or meteorological forcings. This issue should be clarified and, if relevant, quantified.
Second, the manuscript mainly evaluates the system using NSE and KGE components. These metrics are useful for assessing overall hydrograph prediction skill, but the manuscript is framed as a flood forecasting study. The authors should therefore discuss more clearly what the reported improvements imply for practical flood prediction, especially with respect to the improved correlation component, reduced forecast variability, flood peak timing, peak magnitude, and other aspects of flood warning performance. If event-based verification is beyond the scope of the paper, its absence should be acknowledged as a limitation or future direction.
Overall, I think the paper has strong potential, but the above issues should be addressed to make the central conclusions more robust and easier to interpret.
Note: the same comments are attached as a PDF.

[1] Possible confounding due to changes in gauged/ungauged status between v1 and v2
My most important concern is that it is not clear whether each evaluation basin has the same gauged/ungauged status in both v1 and v2. Since the v2 system uses expanded training data, including Caravan, some basins may have been ungauged in v1 but gauged in v2.
If such basins are included in the current “gauged” evaluation, the reported improvement from v1 to v2 may reflect not only improvements in the model architecture or the use of GraphCast forcings, but also the effect of newly including local streamflow observations from those basins in the training data. In other words, the improvement may partly reflect an ungauged-to-gauged transition. In that case, interpreting the v2 improvement mainly as an effect of the upgraded model structure would be potentially misleading.
I therefore ask the authors to clarify whether the gauged/ungauged status of each evaluation basin is consistent between v1 and v2. It would also be useful to separate the evaluation into at least the following groups:
basins that are gauged in both v1 and v2;

basins that are ungauged in v1 but gauged in v2;

basins that are ungauged in both v1 and v2.

This decomposition would help distinguish the effects of model and input-data improvements from the effect of increased spatial coverage of the training data. In particular, if basins that changed from ungauged in v1 to gauged in v2 show large improvements, the interpretation of the current aggregate v1-v2 comparison may change substantially.
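The decomposition requested above could be sketched roughly as follows. This is a minimal illustration and not the authors' evaluation code: the record fields (`gauged_v1`, `gauged_v2`, `nse_v1`, `nse_v2`) and all numbers are hypothetical placeholders.

```python
from statistics import median


def status_group(gauged_v1: bool, gauged_v2: bool) -> str:
    """Label a basin by whether its streamflow was in the training data of each version."""
    if gauged_v1 and gauged_v2:
        return "gauged in both"
    if not gauged_v1 and gauged_v2:
        return "ungauged in v1, gauged in v2"
    if not gauged_v1 and not gauged_v2:
        return "ungauged in both"
    return "gauged in v1, ungauged in v2"


def median_delta_nse_by_group(basins):
    """Return the median (NSE_v2 - NSE_v1) for each status group."""
    deltas = {}
    for b in basins:
        group = status_group(b["gauged_v1"], b["gauged_v2"])
        deltas.setdefault(group, []).append(b["nse_v2"] - b["nse_v1"])
    return {group: median(values) for group, values in deltas.items()}


# Hypothetical placeholder records, for illustration only.
basins = [
    {"gauged_v1": True, "gauged_v2": True, "nse_v1": 0.70, "nse_v2": 0.75},
    {"gauged_v1": True, "gauged_v2": True, "nse_v1": 0.60, "nse_v2": 0.69},
    {"gauged_v1": False, "gauged_v2": True, "nse_v1": 0.40, "nse_v2": 0.65},
    {"gauged_v1": False, "gauged_v2": False, "nse_v1": 0.45, "nse_v2": 0.50},
]
result = median_delta_nse_by_group(basins)
```

If the second group showed a much larger median change than the other two, that would point to training-data coverage rather than architecture or forcings as the main driver of the aggregate improvement.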

[2] Limitations of NSE/KGE and the practical meaning of the improvements for flood forecasting
The manuscript demonstrates improved hydrograph prediction skill of the v2 system using NSE and KGE components. This evaluation is useful. However, because the manuscript focuses on a flood forecasting system, I think the authors should discuss more clearly what these improvements mean from a practical flood forecasting perspective.
In particular, the improvement in the correlation component of KGE is important. It may indicate better prediction of hydrograph phase, rising limbs, and flood peak timing, which are highly relevant for early warning. On the other hand, the reduction in forecast variability may imply possible underestimation of peak discharge. I therefore suggest that the authors interpret the meaning of both improved correlation and reduced variability more carefully in the context of flood forecasting. Although the Conclusions identify the KGE decomposition result as one of the main improvements, the main text currently contains relatively little discussion of why the correlation improvement is especially important.
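To make the trade-off between correlation and variability concrete, the sketch below computes NSE and the three KGE components for a synthetic hydrograph. It assumes the original KGE formulation (correlation r, variability ratio alpha as the ratio of standard deviations, bias ratio beta as the ratio of means); the manuscript may use a different variant, and the discharge values are invented for illustration.

```python
import numpy as np


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency of sim against obs."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def kge_components(obs, sim):
    """Return correlation, variability ratio, bias ratio, and the resulting KGE."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    r = float(np.corrcoef(obs, sim)[0, 1])
    alpha = float(sim.std() / obs.std())
    beta = float(sim.mean() / obs.mean())
    kge = 1.0 - float(np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))
    return {"r": r, "alpha": alpha, "beta": beta, "kge": kge}


# Synthetic daily discharge with one flood peak (illustrative values only).
obs = np.array([1.0, 2.0, 10.0, 4.0, 2.0, 1.0])
# A forecast with perfect timing but half the variability around the mean.
damped = obs.mean() + 0.5 * (obs - obs.mean())

parts = kge_components(obs, damped)
score = nse(obs, damped)
```

In this example the damped forecast has a correlation of 1 and an NSE of 0.75, yet its peak is only about two thirds of the observed peak, which is the kind of underestimation that aggregate scores can hide.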
If possible, it would also be helpful to include one or a few representative hydrograph examples, such as a basin where v1 missed the timing of a flood peak but v2, or v2 with GraphCast, captured it better. Such examples would help readers understand how the improvement in statistical metrics appears in actual forecast time series.
Finally, NSE and KGE alone do not directly evaluate several important aspects of flood disaster prediction, such as peak discharge, threshold exceedance, false alarms, and missed events. This limitation should at least be clearly acknowledged in the Conclusion or in a Limitations/Future Directions section.
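As an illustration of the threshold-based verification mentioned above, the following sketch counts hits, misses, and false alarms for daily exceedances of a flood threshold and derives the probability of detection (POD), false alarm ratio (FAR), and critical success index (CSI). It scores individual days rather than grouped flood events, and the threshold and series are hypothetical.

```python
import numpy as np


def exceedance_scores(obs, sim, threshold):
    """Day-wise contingency scores for exceedance of a discharge threshold."""
    obs_event = np.asarray(obs, dtype=float) > threshold
    sim_event = np.asarray(sim, dtype=float) > threshold
    hits = int(np.sum(obs_event & sim_event))
    misses = int(np.sum(obs_event & ~sim_event))
    false_alarms = int(np.sum(~obs_event & sim_event))
    nan = float("nan")
    pod = hits / (hits + misses) if hits + misses else nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    total = hits + misses + false_alarms
    csi = hits / total if total else nan
    return {"hits": hits, "misses": misses, "false_alarms": false_alarms,
            "pod": pod, "far": far, "csi": csi}


# Hypothetical observed and forecast discharge, and a hypothetical threshold.
obs = [1, 6, 8, 2, 7, 1]
sim = [1, 7, 4, 6, 8, 1]
scores = exceedance_scores(obs, sim, threshold=5)
```

An operational evaluation would more likely define thresholds per gauge (for example from return periods) and allow a timing tolerance when matching events; both choices are left out of this sketch.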

Abstract:
In the Abstract, the system is described only as an “updated flood forecasting system”, but the name of the Google Global Flood Forecasting system is not explicitly stated. Although this is already included in the title, I think it would be useful to name the system explicitly in the Abstract, since the Abstract is often read independently.
In addition, the current Abstract does not clearly explain what technical changes were introduced in v2. I recommend adding one concise sentence summarizing the main technical updates, such as the replacement with the ME-LSTM architecture, improved integration of multiple meteorological input products and robustness to missing inputs, expanded training data through Caravan, and the inclusion of GraphCast meteorological forcings. This would help readers understand the technical basis for the reported improvement, rather than only seeing the performance outcome.

P2 L15: Alignment between the Introduction and the Results
The limitations of the previous system and the improvements introduced in the new system should be presented in a way that is more clearly aligned with the analyses in the Results section.
In the current Introduction, the v1-to-v2 upgrade is described as addressing three data-related challenges: training data availability, temporally limited data records, and input data distribution shifts. These are relevant points, but the Results section mainly discusses the improvements in terms of two components: improvements on the hydrological model side, including ME-LSTM and expanded training data, and improvements on the meteorological forcing side through the use of GraphCast.
I think the Introduction would be clearer if it first described the main limitations of v1 and then explained how v2 was designed to address them through both an improved model architecture and improved meteorological forecast inputs. This would make the narrative from motivation to methods and results more consistent.
If you add an analysis of the "ungauged-to-gauged" effect to the Results, please revise this part of the Introduction so that it aligns with the updated Results section.

P2 L18: The study objectives should explicitly include performance evaluation.
At the end of the Introduction, the authors state that the two main objectives of the paper are to provide transparency about the progress and challenges of the operational flood forecasting system, and to facilitate research on ML-based flood forecasting by providing open-source resources. However, the main focus of the manuscript is the performance evaluation and benchmarking of the v2 system. I therefore suggest that the stated objectives should explicitly include evaluating the predictive performance of the v2 operational system against v1 and third-party benchmarks. This would make the objectives better aligned with the structure and conclusions of the manuscript.

Table 1:
Table 1 is useful for reproducibility because it provides the full list of static catchment attributes. However, the table is very long and mostly consists of an enumeration of input variables, which substantially interrupts the flow of Section 2.1.1. I suggest keeping a concise summary in the main text, including the number of attributes, data sources, major categories, and representative examples, and moving the full attribute list to the Supporting Information or an Appendix. This would improve readability without reducing reproducibility.

P6 L9:
The descriptions of the meteorological input data in Section 2.1.2 and the training settings in Section 2.3 are presented mainly as bullet lists. The use of bullet lists itself is not a problem. However, for a model description paper, it is important not only to state what was used, but also to explain why those design choices were made and what data-availability or operational constraints motivated them. I suggest adding more explanation of the rationale behind choices such as feature unioning, input feature dropout, noise injection, batch size, number of epochs, and batch limits. This would make the model design and training strategy easier to understand and reproduce conceptually.

Figure 3:
Figure 3 is important for explaining the forecast initialization artifact in the ED-LSTM, but in its current form it is not easy to identify where the unnatural behavior appears. I suggest that the authors indicate the transition point from the hindcast period to the forecast period more clearly, for example using arrows, annotations, or highlighting, and explicitly show which part of the predicted hydrograph corresponds to the artifact. It would also help readers if the authors showed, next to the problematic example, a case without a strong artifact or a corresponding ME-LSTM example where the issue is reduced.
More generally, figures with multiple panels should include panel labels such as (a), (b), and (c). This would make it easier to refer to specific panels in the text and captions.

P10 L9: ME-LSTM
The ME-LSTM is one of the central technical improvements in this manuscript, but the roles of the two LSTM layers are not sufficiently clear from the current text and Figure 4. My understanding is that the first LSTM layer represents the evolving hydrological state derived from the hindcast sequence, while the second LSTM layer combines this state information with forecast embeddings to predict future streamflow. However, the current description does not make clear whether the first layer is only used as an initialization mechanism, or whether it continues to update state information during the forecast period.
It is also unclear how the training loss is applied across the hindcast and forecast periods, and whether the forecast layer is specifically optimized for future lead-time predictions. These points are important for understanding how the ME-LSTM differs from the ED-LSTM handoff approach.
I therefore suggest that the authors explain more clearly, both in the text and in Figure 4, the different roles of the hindcast and forecast models in the ME-LSTM, the flow of information between the first and second LSTM layers, how state information is updated during the forecast period, and how the loss function is applied.

Figure 4:
In Figure 4, it is not clear which LSTM block corresponds to the first layer and which corresponds to the second layer. Since the text describes the ME-LSTM as a two-layer stacked LSTM, the first and second LSTM layers should be explicitly labelled in both the figure and the caption.
The meaning of “Output” in the figure should also be clarified. It is currently unclear whether this refers to the predicted streamflow, the parameters of the CMAL predictive distribution, or the deterministic mean discharge used for evaluation.
In addition, the handling of missing inputs, which is an important advantage of the ME-LSTM architecture, is not easy to understand from the current figure. I suggest making NaNs or missing input products more visually prominent, and clearly indicating which inputs are included in the masked mean operation and which inputs are excluded. This should also be explained explicitly in the figure caption.
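For readers unfamiliar with the operation, a masked mean over input products can be sketched as below. This is an assumption about the general technique and not the ME-LSTM implementation: it supposes that each meteorological product has already been embedded into a vector of common size and that a boolean flag marks which products are available at a given time step.

```python
import numpy as np


def masked_mean(embeddings, available):
    """Average product embeddings over the available products only.

    embeddings: array of shape (n_products, embed_dim); rows of missing
        products may contain NaN.
    available: boolean array of shape (n_products,).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    available = np.asarray(available, dtype=bool)
    if not available.any():
        raise ValueError("at least one input product must be available")
    # Zero out missing rows so that their NaNs cannot leak into the sum.
    safe = np.where(available[:, None], embeddings, 0.0)
    return safe.sum(axis=0) / available.sum()


# Three hypothetical products; the second one is missing at this time step.
embeddings = [[1.0, 2.0], [np.nan, np.nan], [3.0, 6.0]]
available = [True, False, True]
combined = masked_mean(embeddings, available)
```

Here the missing product is excluded from both the sum and the count, so the result is the mean of the two available embeddings.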

Figure 5:
Figure 5 is one of the key figures for the global performance comparison. However, the current CDF panels show many lines corresponding to multiple model configurations and multiple lead times at the same time, making the figure difficult to interpret. I suggest reorganizing this figure, for example by separating the comparison among models from the comparison across lead times, either into different figures or different panels. Another option would be to show CDFs only for selected representative lead times, while presenting the full lead-time dependence using boxplots or median performance curves.
The legend font is also too small and should be enlarged. In addition, each panel should be labelled clearly, for example as (a), (b), (c), and (d), so that the text and caption can refer to the individual panels more easily.

Figure 8:
Figure 8 supports one of the central conclusions of the manuscript, but the upper panels, especially the upper-left panel, are difficult to interpret because the legend is insufficient. It is not clear from the figure alone what is being compared. The authors should more clearly indicate the correspondence among the v1 nowcast, the v2 forecasts at different lead times, and the gauged/ungauged settings, both in the figure and in the caption.

Section 4.4 Effect of Hydrological Characteristics
In Section 4.4, the authors analyze the relationship between hydrological characteristics and model performance. In addition to the current attribute-based analysis, it would be useful to show the spatial distribution of the v1-to-v2 skill improvement on a world map. This would help readers understand where the updated system improves most, and whether the improvements are concentrated in particular regions or hydroclimatic settings. If the effects of gauged and ungauged evaluation are mixed, it may be better to show separate maps for gauged and ungauged basins. This would also help clarify whether the spatial pattern of improvement is related to model generalization, local training data availability, or meteorological forcing improvements.
If such a spatial map is added to the main text, Figure 10 could potentially be moved to the Supplementary Information, since the map may provide a more direct and intuitive view of where the model improvement occurs globally.

Section 4.5 and Figure 11
Section 4.5 and Figure 11 provide a useful comparison between gauged and ungauged performance. However, the discussion could be expanded to better explain what this performance gap implies for the reliability of the system in ungauged basins. Since global flood forecasting often targets regions where local streamflow observations are limited or unavailable, the relative performance of ungauged predictions compared with gauged predictions is highly important. I suggest that the authors discuss more explicitly how large the ungauged penalty is, whether it varies by region or hydrological characteristics, and what this means for operational confidence in ungauged basins.

Figure 11
In the lower panel of Figure 11, the line corresponding to the improvement ratio or relative percentage change appears to be shown as a dashed line. However, this dashed line style is not represented in the legend. The authors should revise the legend so that the line styles and colors are consistent with the plotted data.
Citation: https://doi.org/10.5194/egusphere-2026-2283-RC1
RC2:
'Comment on egusphere-2026-2283', Wenzhong Li, 28 May 2026
Overall comments: This manuscript presents a comparative evaluation of the Google Global Flood Forecasting system Version 2 (v2), Google Global Flood Forecasting system Version 1 (v1), and related benchmark models across 1223 basins worldwide. Using NSE as the main evaluation metric, the authors show that, relative to the v1 0-day lead time forecast, the v2 system extends the reliable predictive horizon by 6 days in gauged basins and by 1 day in ungauged basins. The manuscript is also accompanied by an open-source codebase that enables training of both the v1 and v2 models using the open-source Caravan dataset.

Specific comments:
The abstract is too concise. It does not sufficiently reflect the methodological innovations of the v2 model, and it also lacks adequate research background.

In the title, abstract, and elsewhere, the authors emphasize that this is a flood forecasting system. However, based on the definition of the model target variable and the official open-source code, the main evaluation in this paper appears to focus on daily streamflow simulation using metrics such as NSE. Daily streamflow simulation is an important component of flood forecasting, but in my view it is not the whole task. For example, when reading the abstract, I would expect hourly-scale flood forecasting, or that the forecasting system would include warning information, water level, and even inundation information.

Therefore, I have a question: is the object of this study the "core runoff/streamflow forecasting component within a flood forecasting system", or the "complete flood forecasting system itself"?
Previous work by Nevo et al. (2022) explicitly stated that Google's operational flood forecasting system consists of four subsystems: data validation, stage forecasting, inundation modeling, and alert distribution. In contrast, the target variable in the present paper is daily streamflow, and the main metrics are NSE and KGE. Therefore, more precisely, this paper evaluates the hydrological prediction core of a flood forecasting system, rather than the complete flood forecasting system itself.
If the term "flood forecasting system" is used, I suggest that the authors at least add event-level flood metrics or provide a clearer discussion in the Supplement. If the paper is only discussing the core support for a "flood forecasting system", then the title and related wording should not directly state "global flood forecasting system". The authors should make the terminology consistent throughout the paper, or address this issue in the outlook or discussion.
Following the previous comment, I think the authors should provide additional supporting metrics and justification related to the terms "operational system" and "operational flood". This paper provides substantial support for future flood forecasting systems and operational forecasting, but most of the evaluation focuses on model performance for daily streamflow prediction, rather than improvement of a global flood system.

Using "daily streamflow prediction metrics" to support claims about an "operational system" lacks solid support and evidence. Actual flood forecasting, especially for small and medium-sized basins, usually requires hourly-scale results, whereas daily-scale forecasts are not sufficiently fine. In operational flood control and emergency response, people often care about metrics such as peak flow and time of peak flow, rather than NSE, a goodness-of-fit metric that strongly favors long-term average behavior. We already know the inherent limitations of NSE, KGE, and similar metrics. The authors should discuss why daily streamflow prediction metrics such as NSE are sufficient to support an operational system in the context of a global flood forecasting system, or alternatively define the current system's limitations more clearly. Otherwise, the claims in the paper may appear overstated.
The Introduction is very concise. However, as a research paper, it should clearly present the key research gap, the necessity of the study, and the need to upgrade existing technologies or solutions. For example, the main focus of this paper is the v2 system, but the current first paragraph mainly discusses the status of machine learning in streamflow simulation, model development, interpretability, and uncertainty quantification. These topics are only briefly mentioned, without specific literature citations, which makes the Introduction too brief.

In addition, the second paragraph directly turns to "using operational machine learning hydrology models for global-scale riverine flood forecasting", but it does not discuss the innovation of the v2 system or the improvements over v1. Since a substantial part of the v2 improvement comes from the introduction of GraphCast, I suggest that the authors at least add discussion of how meteorological data can improve flood forecasting models.
In Section 2.1.2, the authors state that HRES and GraphCast forecast archives begin in approximately 2012 and 2016, respectively. To use the full historical streamflow record from 1980 to 2024, the authors substitute ERA5-Land reanalysis data for HRES/GraphCast forecast inputs in earlier years when such forecasts are unavailable, and assume that these reanalysis data serve as an "effective proxy" for the forecast inputs. The justification given in the paper is that "HRES shares the same underlying physical model as ERA5, and GraphCast is trained on ERA5". However, this assumption is not supported or demonstrated. I suggest that the authors conduct a comparison for years when HRES/GraphCast and ERA5-Land are both available, and show whether their precipitation, temperature, and other distributions are similar. Alternatively, the authors could compare whether NSE/KGE differs substantially when ERA5-Land is used as input versus when the actual GraphCast forecast inputs are used.

In Section 3, the authors state that "For the ungauged setting, the v1 system used random k-fold (k=10) cross-validation, whereas the v2 system used a single holdout test set". I suggest that the authors explain why different spatial evaluation protocols were used for v1 and v2, how the v2 holdout basins were selected, and why a single spatial split is sufficient to evaluate ungauged generalization. This clarification is important because the spatial split strategy may affect the comparability of ungauged performance between v1 and v2.
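The difference between the two protocols can be illustrated with a short sketch. The splitting functions, the random seed, and the 10% test fraction are assumptions for illustration; how the v2 holdout basins were actually selected is the open question raised in this comment.

```python
import numpy as np


def kfold_basin_splits(basin_ids, k=10, seed=0):
    """Randomly partition basins into k folds; each fold is held out once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(basin_ids))
    return [fold.tolist() for fold in np.array_split(shuffled, k)]


def single_holdout(basin_ids, test_fraction=0.1, seed=0):
    """Randomly reserve one fixed subset of basins as the only test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(basin_ids))
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:].tolist(), shuffled[:n_test].tolist()


basin_ids = list(range(1223))  # placeholder IDs for the 1,223 test basins
folds = kfold_basin_splits(basin_ids, k=10)
train_ids, test_ids = single_holdout(basin_ids, test_fraction=0.1)

# With k-fold, every basin is evaluated as ungauged exactly once.
evaluated_kfold = sorted(b for fold in folds for b in fold)
```

With k-fold cross-validation the ungauged score covers all basins, whereas a single holdout scores only the reserved subset, so the two ungauged results are not computed over the same basins unless the comparison is restricted accordingly.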

In Section 4.2, the authors state that "Figure 6 disaggregates the improvements provided by the ME-LSTM architecture and expanded training data from the predictive skill injected by the GraphCast meteorological forcings". However, the authors also state that "Blue boxes represent the Delta NSE gained by transitioning from the v1 to v2 model architecture and expanded Caravan training data. Green boxes represent the additional Delta NSE gained by incorporating GraphCast". The paper has already demonstrated the contribution of GraphCast, but it has not separated the contribution of the ME-LSTM architecture change from the contribution of the expanded Caravan training data. I think additional experiments and evidence could be added in the Supplement.

The paper states that "We take the mean of the predicted distribution to be the deterministic model prediction that we evaluate in this Paper". Since both v1 and v2 produce probabilistic forecasts using "a countable mixture of asymmetric Laplacians (CMAL) distribution", I think it is necessary to explain why only the mean of the predicted distribution is evaluated. Deterministic NSE/KGE metrics can indicate predictive performance, but they cannot evaluate the quality of probabilistic forecasts. For flood forecasting, probabilistic forecast results themselves are important. I suggest adding metrics such as prediction interval coverage.
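A prediction interval coverage check of the kind suggested here can be sketched as follows. The lower and upper bounds are assumed to be quantiles taken from the predictive distribution (for example the 5% and 95% quantiles); how they would be extracted from the CMAL output is not shown, and the numbers are hypothetical.

```python
import numpy as np


def interval_coverage(obs, lower, upper):
    """Fraction of valid observations that fall inside [lower, upper]."""
    obs = np.asarray(obs, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    valid = ~np.isnan(obs)  # skip days without an observation
    inside = (obs[valid] >= lower[valid]) & (obs[valid] <= upper[valid])
    return float(inside.mean())


# Hypothetical observations and interval bounds; one observation is missing.
obs = [1.0, 2.0, 3.0, 4.0, 5.0, np.nan]
lower = [0.0, 1.0, 3.5, 3.0, 4.0, 2.0]
upper = [2.0, 3.0, 4.5, 5.0, 4.5, 6.0]
coverage = interval_coverage(obs, lower, upper)
```

For a nominal 90% interval, an empirical coverage well below 0.9 would indicate overconfident forecasts, and a value well above 0.9 would indicate intervals that are wider than necessary.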

In Section 4.5 and Figure 11, the authors state that the v2 system "yields higher absolute performance globally", but is also "proportionally more sensitive to the absence of local streamflow data for training". Specifically, at a 0-day lead time, the v2 system without GraphCast has a median gauged NSE of 0.78 and an absolute median penalty of 0.07, meaning the difference between gauged NSE and ungauged NSE, corresponding to a 10.6% relative decrease. With GraphCast, the v2 system has a median gauged NSE of 0.83, but the absolute median penalty increases to 0.12, corresponding to a 19.8% relative decrease. I suggest that the authors explicitly state in the abstract or conclusion that GraphCast improves overall absolute performance, but also increases the penalty between the gauged and ungauged settings. This does not mean that GraphCast brings the same magnitude of improvement under both gauged and ungauged settings. The paper should not only state that the lead time is extended by one day in ungauged basins.
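As a check on how such percentages can be computed, the sketch below divides the quoted median penalties by the quoted median gauged NSE values. This simple ratio gives about 9.0% and 14.5%, not the quoted 10.6% and 19.8%. One possible explanation, stated here as an assumption, is that the quoted percentages are medians of per-basin relative changes, which generally differ from a ratio of medians. The per-basin pairs are hypothetical.

```python
from statistics import median


def relative_decrease(gauged_nse, penalty):
    """Penalty expressed as a percentage of the gauged NSE."""
    return 100.0 * penalty / gauged_nse


# Ratio of the medians quoted in the comment above.
without_graphcast = relative_decrease(0.78, 0.07)
with_graphcast = relative_decrease(0.83, 0.12)

# Hypothetical per-basin (gauged NSE, ungauged NSE) pairs.
pairs = [(0.9, 0.85), (0.8, 0.6), (0.5, 0.4)]
ratio_of_medians = relative_decrease(
    median(g for g, _ in pairs),
    median(g - u for g, u in pairs),
)
median_of_ratios = median(relative_decrease(g, g - u) for g, u in pairs)
```

For the hypothetical pairs, the ratio of medians is 12.5% while the median of per-basin ratios is 20%, so stating which definition is used matters when reporting the ungauged penalty.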

In Section 4.2, the authors state that "GraphCast forcings improve correlation but lower forecast variance", and further mention possible "spatial and temporal smoothing" and "underprediction of variance" at longer lead times. However, this paper studies flood forecasting, while the NSE and KGE metrics used in the paper cannot accurately demonstrate high-flow prediction performance. Flood peaks and high flows are very important for flood forecasting. From a mathematical perspective, lower forecast variance caused by GraphCast may indicate that the model underestimates high flows. Therefore, I suggest that the authors supplement the analysis with relevant flood peak or high-flow simulation results or metrics.
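Two simple peak-oriented diagnostics of the kind requested here are the relative peak magnitude error and the peak timing error. The sketch below computes both for a single event window; the window selection and the discharge values are hypothetical, and a full analysis would first need a rule for identifying events.

```python
import numpy as np


def peak_errors(obs, sim):
    """Relative peak magnitude error and peak timing error (in time steps).

    A negative magnitude error means the forecast peak is too low; a positive
    timing error means the forecast peak arrives late.
    """
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    i_obs = int(np.argmax(obs))
    i_sim = int(np.argmax(sim))
    magnitude_error = float((sim[i_sim] - obs[i_obs]) / obs[i_obs])
    timing_error = i_sim - i_obs
    return magnitude_error, timing_error


# Hypothetical daily discharge for one flood event window.
obs = [2.0, 3.0, 20.0, 8.0, 4.0]
sim = [2.0, 4.0, 9.0, 15.0, 5.0]
magnitude_error, timing_error = peak_errors(obs, sim)
```

In this example the forecast peak is 25% too low and one day late, two properties that NSE and KGE summarize only indirectly.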
At the beginning of the paper, the authors state that v2 improves upon v1. The paper has explained that v1 is an ED-LSTM, while v2 introduces ME-LSTM, expanded Caravan training data, and new meteorological inputs such as GraphCast. The authors also acknowledge that the performance difference between v1 and v2 is "a compound effect", and that the two systems use different spatial split strategies in the ungauged evaluation. However, the exact differences between v1 and v2 are not clearly compared. I think the paper lacks a table, namely a v1-v2 comparison table, listing differences in model architecture, training data sources, dynamic meteorological inputs, whether GraphCast is included, temporal and spatial splitting strategies, and related aspects.

The paper uses the terms Google Global Flood Forecasting system and operational system in multiple places, but the methods, evaluation metrics, and open-source code mainly correspond to the runoff/streamflow model forecasting component of the Google FloodHub flood forecasting platform, rather than an end-to-end operational flood warning system. Section 2.1 clearly states that the model training target is daily streamflow at the basin outlet. The open-source code also mainly provides model training, evaluation, and related workflows; the target variable in the configuration file is streamflow, and the main evaluation metrics are NSE/KGE. Meanwhile, the public configuration file also states that this open-source pipeline differs from the operational pipeline.

I suggest that the authors more clearly distinguish whether what is evaluated and open-sourced in this paper is the "runoff/streamflow model component" or the complete "operational flood warning system". If the paper claims to evaluate the complete system, it should explain whether operational components such as real-time data validation, flood-threshold determination, inundation mapping, and alert distribution are included in the evaluation and open-source code. If they are not included, I suggest revising the wording in the title, abstract, methods, or Code Availability section, and clearly specifying which results can be reproduced using the released code. The provided code and the paper need to be explicitly aligned.
Citation: https://doi.org/10.5194/egusphere-2026-2283-RC2
CEC1: 'Comment on egusphere-2026-2283 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Jun 2026

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived the GoogleHydrology code and Caravan on GitHub. However, GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other long-term archival and publishing alternatives, such as Zenodo.
The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance and replicability of published papers for years after their publication. Please, therefore, publish your code and data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible. We cannot have manuscripts under discussion that do not comply with our policy.
Later, if the Topical Editor decides to continue with the review or publication process of your manuscript and you are requested to upload a new version of it, the 'Code and Data Availability' section of your manuscript must also be modified to cite the new repository locations, and the corresponding references must be added to the bibliography.
Additionally, although you do not seem to use it directly for the work presented, you link to the Google Runoff Reanalysis & Reforecast dataset (GRRR) through a webpage hosted on google.com. Again, this is not an acceptable repository for long-term preservation, so it is of limited use. We have already found that a link in its description to a Nature paper does not work. We would encourage you to share any data in repositories that ensure long-term preservation to better serve the purpose of open science that you discuss in the manuscript.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2026-2283-CEC1
CC1: 'Comment on egusphere-2026-2283', Oliver Konold, 15 Jun 2026

Dear authors,
We thank you for this substantial and transparent contribution to the hydrology community. Making the complete model pipeline behind the operational Google FloodHub publicly available is a commendable step that is rare for operational forecasting systems of this scale. The release of the Google Runoff Reanalysis and Reforecast (GRRR) dataset, providing historical simulations and reforecasts at over one million locations globally, represents a genuinely valuable resource that will enable validation efforts in data-sparse regions. We fully share the authors' stated commitment to transparency, and offer the following comments in that spirit.
Given that Google FloodHub is among the most widely accessible existing global flood forecasting services, reaching stakeholders and vulnerable populations in data-scarce regions who may act directly on its warnings, we believe a particularly thorough evaluation is warranted. Our comments focus on areas where the present evaluation could be strengthened to fully support the paper’s central claims. Most of the analyses we suggest appear feasible with data and infrastructure the authors have already released. We hope these observations are useful for the revision and for the continued development of the system.
Specific Comments
1. Alignment between the flood-forecasting framing and the evaluation metrics
The paper presents FloodHub v2 as a flood forecasting system, yet the evaluation relies exclusively on NSE and KGE computed over full hydrographs of daily mean streamflow. No flood-specific metrics are reported: there is no analysis of peak-flow bias (FHV; Yilmaz et al., 2008), peak timing error, threshold-exceedance skill (POD/FAR/CSI; Wilks, 2011), or return-period event performance. This stands in notable contrast to the predecessor paper (Nearing et al., 2024), which evaluated 1-, 2-, 5-, and 10-year return period events explicitly. We would encourage the authors to add such analyses, as a system designed to issue flood warnings should ideally be evaluated on the events that motivate those warnings.
An additional consideration is that daily averaging can smooth peak discharges, particularly in small and flashy catchments where sub-daily dynamics govern the flood response (Ficchì et al., 2016). Reporting skill for daily maximum discharge alongside daily mean discharge would provide a more informative picture for a flood-focused application. We note also that recent work has identified intrinsic limits of LSTM architectures for extreme discharge prediction: Baste et al. (2025) showed that saturation of gating structures can cap predictable discharge below the maximum of the training data, while Kratzert et al. (2024) described the underlying tanh saturation mechanism. Situating the present results in relation to this literature would be valuable.
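As an illustration of the threshold-exceedance scores mentioned in this comment, the following minimal sketch computes POD, FAR (taken here as the false alarm ratio) and CSI from paired daily series. The function name, the threshold and the discharge values are hypothetical and are not taken from the manuscript or its released code.

```python
def contingency_scores(obs, sim, threshold):
    """Return (POD, FAR, CSI) for exceedances of a fixed discharge threshold."""
    hits = misses = false_alarms = 0
    for o, s in zip(obs, sim):
        obs_event = o >= threshold
        sim_event = s >= threshold
        if obs_event and sim_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif sim_event:
            false_alarms += 1
    nan = float("nan")
    # Each score is undefined when its denominator is zero.
    pod = hits / (hits + misses) if hits + misses else nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    total = hits + misses + false_alarms
    csi = hits / total if total else nan
    return pod, far, csi


# Made-up daily discharge values (m3/s), for illustration only.
observed = [10, 55, 60, 12, 8, 70, 9]
forecast = [12, 40, 65, 52, 7, 80, 10]
pod, far, csi = contingency_scores(observed, forecast, threshold=50)
```

In practice the threshold would be a basin-specific return-period discharge, and the scores would be computed separately for each lead time.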
2. Probabilistic head: consistency between training objective and evaluation
The model is trained with a CMAL head and a log-likelihood loss averaged uniformly over all time steps, which means rare flood events contribute only marginally to the gradient. For the evaluation, the CMAL distribution is then reduced to its mean and assessed deterministically. As Klotz et al. (2022) demonstrated, the mean of asymmetric Laplacian mixtures is a sub-optimal point estimator in tail regimes, where a deterministic LSTM outperformed probabilistic CMAL-based models. The benchmarking procedure established in that paper, probability plots for reliability (Laio and Tamea, 2007), is absent here.
We suggest either (a) evaluating the full predictive distribution using proper scoring rules and reliability diagrams, or (b) clearly framing the system as effectively deterministic and providing justification for the CMAL loss over simpler alternatives. In addition, an operational coverage analysis would directly address the question of whether the uncertainty communication is appropriate for flood warning purposes. Such an analysis would quantify how often observed daily maximum discharge exceeds the upper quantiles of the CMAL predictive distribution, for example using qmax values from the hourly LamaH-CE (Klingler et al., 2021) or CAMELS-DE (Loritz et al., 2024) catchments.
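The coverage check suggested here can be sketched in a few lines. The example assumes that a per-day upper quantile (here the 95 % quantile) has already been extracted from the predictive distribution; the function name and all numbers are hypothetical.

```python
def exceedance_rate(observed, upper_quantile):
    """Fraction of time steps where the observation exceeds the predicted quantile."""
    pairs = list(zip(observed, upper_quantile))
    if not pairs:
        raise ValueError("need at least one time step")
    return sum(o > q for o, q in pairs) / len(pairs)


# Made-up daily maximum discharge and predicted 95 % quantiles (m3/s).
obs_qmax = [30.0, 42.0, 18.0, 95.0, 27.0]
q95_pred = [35.0, 40.0, 25.0, 80.0, 33.0]
rate = exceedance_rate(obs_qmax, q95_pred)
```

For a reliable 95 % quantile the rate should be close to 0.05 over a long record; a much larger value would indicate that the upper tail of the forecast is too narrow for observed daily maxima.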
3. Disentangling the v1-to-v2 improvement
The v1-to-v2 comparison involves simultaneous changes in architecture (ED-LSTM to ME-LSTM), training data (5,680 to 15,923 gauges), and meteorological forcing (addition of GraphCast). As presented, the respective contributions of these changes cannot be separated. An ablation experiment isolating at least the data-expansion effect from the architecture effect would considerably clarify the source of the reported gains. Additionally, v1 used random k-fold cross-validation while v2 uses a single spatial holdout; as Roberts et al. (2017) have shown, random splits that retain spatially adjacent basins in training tend to produce optimistic estimates relative to spatially blocked designs, which makes the gauged/ungauged comparisons between versions difficult to interpret directly.
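To make the contrast with random splits concrete, the sketch below assigns basins to folds by latitude/longitude block, so that neighbouring basins always end up in the same fold. The block size, the round-robin fold assignment and the basin coordinates are assumptions chosen for illustration; this is not the split used in either model version.

```python
import math


def spatial_block_folds(basins, block_deg=5.0, n_folds=3):
    """Map basin id -> fold index; basins in the same lat/lon block share a fold."""

    def block_of(lat, lon):
        return (math.floor(lat / block_deg), math.floor(lon / block_deg))

    blocks = sorted({block_of(lat, lon) for _, lat, lon in basins})
    # Round-robin assignment of whole blocks to folds (a simplifying assumption).
    block_to_fold = {block: i % n_folds for i, block in enumerate(blocks)}
    return {bid: block_to_fold[block_of(lat, lon)] for bid, lat, lon in basins}


# Hypothetical basins: (id, latitude, longitude).
basins = [("a", 47.1, 15.2), ("b", 47.9, 16.0), ("c", 12.3, -70.4), ("d", -3.0, 36.5)]
folds = spatial_block_folds(basins)
```

Basins "a" and "b" are less than one degree apart and therefore fall in the same fold, which is the property a random k-fold split does not guarantee.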
4. GraphCast forcing: skill gains and variance damping
The KGE decomposition shows that the variability ratio γ worsens with lead time when GraphCast is added, consistent with the known spatial and temporal smoothing of MSE-trained AI weather models and their tendency to underpredict variance at longer lead times (Ben Bouallègue et al., 2024). For flood applications, systematic variance underprediction translates to systematic peak underestimation at precisely the lead times where the system claims its greatest advantage. The paper acknowledges this conceptually but does not quantify its effect on flood peaks. An analysis of peak-flow errors stratified by lead time and forcing type would clarify the practical implications of this trade-off, inform future forcing choices, and strengthen the paper.
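The link between variance damping and the variability ratio can be illustrated with a small sketch. It assumes the modified KGE formulation in which γ is the ratio of the coefficients of variation of simulated and observed flow; the manuscript's exact definition may differ, and the series are made up.

```python
import math
import statistics


def kge_components(obs, sim):
    """Return (r, beta, gamma, KGE) with gamma defined as a ratio of CVs."""
    mu_o, mu_s = statistics.fmean(obs), statistics.fmean(sim)
    sd_o, sd_s = statistics.pstdev(obs), statistics.pstdev(sim)
    cov = sum((o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (sd_o * sd_s)                # linear correlation
    beta = mu_s / mu_o                     # bias ratio
    gamma = (sd_s / mu_s) / (sd_o / mu_o)  # variability ratio
    kge = 1 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return r, beta, gamma, kge


# A forecast with perfect timing and mean but half the amplitude.
obs = [1.0, 2.0, 3.0, 4.0, 10.0]
mu = statistics.fmean(obs)
damped = [mu + 0.5 * (o - mu) for o in obs]
r, beta, gamma, kge = kge_components(obs, damped)
```

In this constructed case the correlation and bias terms are ideal, so the entire KGE loss comes from γ, while the simulated peak (7.0) falls well short of the observed peak (10.0).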
5. Validation of the feature-unioning imputation strategy
The substitution of ERA5-Land for missing HRES/GraphCast forecasts prior to 2012/2016 rests on the assumption that shared model lineage implies distributional equivalence. Available evidence suggests this assumption warrants verification: Konold et al. (2025) quantified systematic, variable-specific differences between ERA5-Land and ECMWF-HRES across 451 basins using Wasserstein distances and found that such domain shifts can substantially degrade model skill. Additional complexity arises from the non-stationarity of the HRES archive (IFS cycle upgrades, including an upgrade in 2016) versus the frozen ERA5 model cycle (Hersbach et al., 2020). A straightforward validation would be to compare the distributions of ERA5-Land, HRES, and GraphCast over their overlap period (2016–2024), for example as per-variable quantile–quantile plots stratified by region or catchment type. The data for this analysis appear to already exist within the authors’ training pipeline.
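A per-variable distribution comparison of the kind suggested here could start from something as simple as the sketch below, which computes the one-dimensional Wasserstein-1 distance between two equally sized samples as the mean absolute difference of their sorted values. It does not reproduce the procedure of Konold et al. (2025); the variable names and values are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equally sized empirical samples."""
    if not sample_a or len(sample_a) != len(sample_b):
        raise ValueError("samples must be non-empty and of equal length")
    # For equal-size samples, the optimal transport plan pairs sorted values.
    paired = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in paired) / len(sample_a)


# Made-up daily precipitation (mm) for one basin over a common period.
era5_land_precip = [0.0, 1.0, 2.0, 5.0]
hres_precip = [2.5, 0.5, 5.5, 1.5]
distance = wasserstein_1d(era5_land_precip, hres_precip)
```

The paired sorted values are also exactly the points of a quantile-quantile plot, so the same intermediate result serves both diagnostics.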
6. Integration of near-real-time discharge observations
The v2 system does not incorporate near-real-time discharge observations as model forcing. We recognise that the system is designed primarily for ungauged basins, where such data are unavailable, and that a simulation-mode model is the appropriate choice for that core mission. However, the headline 6-day lead-time extension is reported for gauged locations, where streamflow records exist and are, for a substantial subset, operationally available in near real time. Feng et al. (2020) demonstrated substantial forecast skill gains from discharge data integration at continental scale, and Nearing et al. (2022) developed autoregression and data assimilation methods specifically for LSTM streamflow models. For the gauged subset, the authors might clarify whether discharge integration was considered, why it is absent from the operational design, and what fraction of gauged test basins would support it. This would help readers understand the headroom available for further skill improvement.
7. Reporting of hyperparameter tuning
The paper states that hyperparameter tuning experiments are not reported. For a benchmark-style evaluation, this is an important omission: if v2 received more extensive tuning than v1, or than the GloFAS/GEOGLOWS baselines (which appear to be taken as-is), part of the reported improvement reflects tuning effort rather than methodological advance. Following the recommendations of Bouthillier et al. (2021), we encourage the authors to report the search space, computational budget, selection metric, and validation protocol for hyperparameter optimisation, and to address run-to-run variance by reporting results over multiple random initialisations, as is standard practice in large-sample hydrology (Kratzert et al., 2019b).
8. Mechanistic interpretation of catchment-attribute analysis
The finding that skill improvements concentrate in arid catchments is interesting and consistent with recent results by Konold et al. (2025), who report the same pattern in a forecast-bias context and offer a mechanistic explanation: arid, episodic catchments exhibit weak baseline skill and nonlinear rainfall–runoff relationships that are highly sensitive to input quality, whereas snow-dominated catchments benefit from the seasonal signal retained in the LSTM cell state (Kratzert et al., 2019a). Connecting the present attribute correlations to such mechanisms would strengthen Section 4.4. We also encourage the authors to link the ΔNSE–attribute analysis to the KGE decomposition results: if GraphCast systematically damps forecast variance, arid and flashy basins, where amplitude errors matter most, should be particularly sensitive to this effect. Testing this hypothesis would unify two currently separate parts of the results and provide a more complete picture of where and why v2 improves over v1.
We hope these comments are useful and look forward to the authors’ response. The FloodHub v2 system represents a significant engineering and scientific achievement, and we believe that addressing these points would substantially strengthen both the manuscript and the community’s ability to assess and build upon the operational system. We are also attaching a pdf version of our comments, together with the revised preprint of the aforementioned paper by Konold et al. (2025), as a single document.
Oliver Konold and Karsten Schulz
Institute of Hydrology and Water Management (HyWa), BOKU University, Vienna, Austria

References
Bartholmes, J. C., Thielen, J., Ramos, M. H., and Gentilini, S.: The European Flood Alert System EFAS – Part 2: Statistical skill assessment of probabilistic and deterministic operational forecasts, Hydrol. Earth Syst. Sci., 13, 141–153, https://doi.org/10.5194/hess-13-141-2009, 2009.
Baste, S., Klotz, D., Acuña Espinoza, E., Bardossy, A., and Loritz, R.: Unveiling the limits of deep learning models in hydrological extrapolation tasks, Hydrol. Earth Syst. Sci., 29, 5871–5891, https://doi.org/10.5194/hess-29-5871-2025, 2025.
Ben Bouallègue, Z., Clare, M. C. A., Magnusson, L., Gascón, E., Maier-Gerber, M., Janoušek, M., Rodwell, M., Pinault, F., Dramsch, J. S., Lang, S. T. K., Raoult, B., Rabier, F., Chevallier, M., Sandu, I., Dueben, P., Chantry, M., and Pappenberger, F.: The rise of data-driven weather forecasting: A first statistical assessment of machine learning-based weather forecasts in an operational-like context, Bull. Am. Meteorol. Soc., 105, E864–E883, https://doi.org/10.1175/BAMS-D-23-0162.1, 2024.
Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., Mohammadi Sepahvand, N., Raff, E., Madan, K., Voleti, V., Ebrahimi Kahou, S., Michalski, V., Arbel, T., Pal, C., Varoquaux, G., and Vincent, P.: Accounting for variance in machine learning benchmarks, Proceedings of Machine Learning and Systems (MLSys), 3, 747–769, https://doi.org/10.48550/arXiv.2103.03098, 2021.
Feng, D., Fang, K., and Shen, C.: Enhancing streamflow forecast and extracting insights using long–short term memory networks with data integration at continental scales, Water Resour. Res., 56, e2019WR026793, https://doi.org/10.1029/2019WR026793, 2020.
Ficchì, A., Perrin, C., and Andréassian, V.: Impact of temporal resolution of inputs on hydrological model performance: An analysis based on 2400 flood events, J. Hydrol., 538, 454–470, https://doi.org/10.1016/j.jhydrol.2016.04.016, 2016.
Gauch, M., Kratzert, F., Frame, J. M., Nearing, G., and Hochreiter, S.: How to deal with missing input data in machine learning for hydrology, Hydrol. Earth Syst. Sci., 29, 6221–6235, https://doi.org/10.5194/hess-29-6221-2025, 2025.
Haiden, T., Janousek, M., Vitart, F., Tanguy, M., Prates, F., and Chevallier, M.: Evaluation of ECMWF forecasts, including the 2023 upgrade, ECMWF Technical Memorandum, No. 902, https://doi.org/10.21957/ef3evxy25, 2024.
Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., et al.: The ERA5 global reanalysis, Q. J. R. Meteorol. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803, 2020.
Klingler, C., Schulz, K., and Herrnegger, M.: LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe, Earth Syst. Sci. Data, 13, 4529–4565, https://doi.org/10.5194/essd-13-4529-2021, 2021.
Klotz, D., Kratzert, F., Gauch, M., Keefe Sampson, A., Brandstetter, J., Klambauer, G., Hochreiter, S., and Nearing, G.: Uncertainty estimation with deep learning for rainfall–runoff modeling, Hydrol. Earth Syst. Sci., 26, 1673–1693, https://doi.org/10.5194/hess-26-1673-2022, 2022.
Konold, O., Feigl, M., Podast, P., Klingler, C., and Schulz, K.: BiasCast: Learning and adjusting real time biases from meteorological forecasts to enhance runoff predictions, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2025-4978, 2025.
Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., and Klambauer, G.: NeuralHydrology – Interpreting LSTMs in Hydrology, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer LNCS 11700, 347–362, https://doi.org/10.1007/978-3-030-28954-6_19, 2019a.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019b.
Laio, F., and Tamea, S.: Verification tools for probabilistic forecasts of continuous hydrological variables, Hydrol. Earth Syst. Sci., 11, 1267–1277, https://doi.org/10.5194/hess-11-1267-2007, 2007.
Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S., and Battaglia, P.: Learning skillful medium-range global weather forecasting, Science, 382, 1416–1421, https://doi.org/10.1126/science.adi2336, 2023.
Lavers, D. A., Harrigan, S., and Prudhomme, C.: Precipitation biases in the ECMWF Integrated Forecasting System, J. Hydrometeorol., 22, 1187–1198, https://doi.org/10.1175/JHM-D-20-0308.1, 2021.
Loritz, R., Dolich, A., Espinoza, E. A., Ebeling, P., Guse, B., Götte, J., Hassler, S. K., Hauffe, C., Heidbüchel, I., Kiesel, J., Mälicke, M., Müller-Thomy, H., Stölzle, M., and Tarasova, L.: CAMELS-DE: hydro-meteorological time series and attributes for 1582 catchments in Germany, Earth Syst. Sci. Data, 16, 5625–5642, https://doi.org/10.5194/essd-16-5625-2024, 2024.
Muñoz-Sabater, J., Dutra, E., Agustí-Panareda, A., Albergel, C., Arduini, G., Balsamo, G., Boussetta, S., Choulga, M., Harrigan, S., Hersbach, H., Martens, B., Miralles, D. G., Piles, M., Rodríguez-Fernández, N. J., Zsoter, E., Buontempo, C., and Thépaut, J.-N.: ERA5-Land: a state-of-the-art global reanalysis dataset for land applications, Earth Syst. Sci. Data, 13, 4349–4383, https://doi.org/10.5194/essd-13-4349-2021, 2021.
Nearing, G. S., Klotz, D., Frame, J. M., Gauch, M., Gilon, O., Kratzert, F., Sampson, A. K., Shalev, G., and Nevo, S.: Technical note: Data assimilation and autoregression for using near-real-time streamflow observations in long short-term memory networks, Hydrol. Earth Syst. Sci., 26, 5493–5513, https://doi.org/10.5194/hess-26-5493-2022, 2022.
Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim, A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F., Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T. Y., Weitzner, D., and Matias, Y.: Global prediction of extreme floods in ungauged watersheds, Nature, 627, 559–563, https://doi.org/10.1038/s41586-024-07145-1, 2024.
Nevo, S., Morin, E., Gerzi Rosenthal, A., Metzger, A., Barshai, C., Weitzner, D., Voloshin, D., Kratzert, F., Elidan, G., Dror, G., Begelman, G., Nearing, G., Shalev, G., Noga, H., Shavitt, I., Yuklea, L., Royz, M., Giladi, N., Peled Levi, N., Reich, O., Gilon, O., Maor, R., Timnat, S., Shechter, T., Anisimov, V., Gigi, Y., Levin, Y., Moshe, Z., Ben-Haim, Z., Hassidim, A., and Matias, Y.: Flood forecasting with machine learning models in an operational framework, Hydrol. Earth Syst. Sci., 26, 4013–4032, https://doi.org/10.5194/hess-26-4013-2022, 2022.
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., Warton, D. I., Wintle, B. A., Hartig, F., and Dormann, C. F.: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure, Ecography, 40, 913–929, https://doi.org/10.1111/ecog.02881, 2017.
Wilks, D. S.: Statistical Methods in the Atmospheric Sciences, 3rd ed., Academic Press, Oxford, https://doi.org/10.1016/C2009-0-02520-4, 2011.
Yilmaz, K. K., Gupta, H. V., and Wagener, T.: A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model, Water Resour. Res., 44, W09417, https://doi.org/10.1029/2007WR006716, 2008.

Citation: https://doi.org/10.5194/egusphere-2026-2283-CC1

Data sets

Model Data Grey Nearing, Frederik Kratzert, Martin Gauch https://doi.org/10.5281/zenodo.19676842

Model code and software

GoogleHydrology Grey Nearing, Omri Shefi, Amit Markel, Frederik Kratzert, Martin Gauch https://github.com/google-research/flood-forecasting

Viewed

Total article views: 803 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
520	263	20	803	22	24

Views and downloads (calculated since 29 Apr 2026)

Month	HTML	PDF	XML	Total
Apr 2026	64	27	4	95
May 2026	325	178	12	515
Jun 2026	71	36	3	110
Jul 2026	60	22	1	83

Latest update: 22 Jul 2026

Short summary

We improved our global river model to provide earlier, more reliable streamflow predictions. Testing over approximately 1,000 watersheds, we found it predicts river flows up to six days further into the future for monitored rivers, and one day further for unmonitored ones. Releasing our code publicly empowers the science community to improve global water forecasting.

