the Creative Commons Attribution 4.0 License.
Extending Medium-Range Global Flood Forecasts: The Google Global Flood Forecasting Model Version 2
Abstract. This paper evaluates an updated flood forecasting system that significantly extends reliable lead times. We evaluated this updated model (v2) against the prior system (v1) and established third-party benchmarks across 1,223 global test basins. The primary finding is that the v2 system extends the reliable predictive horizon by 6 days in gauged basins and 1 day in ungauged basins relative to the v1 nowcast, as measured by the Nash Sutcliffe Efficiency. Along with this paper, we release an open-source codebase for training both the v1 and v2 forecast models with the open-source Caravan dataset.
Competing interests: All authors are employed by their primary affiliation, Google, the organization that developed and operates the Google Global Flood Forecasting system and the associated open-source Google Hydrology codebase evaluated in this manuscript. The authors declare that they have no other competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 24 Jun 2026)
- RC1: 'Comment on egusphere-2026-2283', Dai Yamazaki, 25 May 2026
- RC2: 'Comment on egusphere-2026-2283', Wenzhong Li, 28 May 2026
Overall comments: This manuscript presents a comparative evaluation of the Google Global Flood Forecasting system Version 2 (v2), Google Global Flood Forecasting system Version 1 (v1), and related benchmark models across 1223 basins worldwide. Using NSE as the main evaluation metric, the authors show that, relative to the v1 0-day lead time forecast, the v2 system extends the reliable predictive horizon by 6 days in gauged basins and by 1 day in ungauged basins. The manuscript is also accompanied by an open-source codebase that enables training of both the v1 and v2 models using the open-source Caravan dataset.
Specific comments:
- The abstract is too concise. It does not sufficiently reflect the methodological innovations of the v2 model, and it also lacks adequate research background.
- In the title, abstract, and elsewhere, the authors emphasize that this is a flood forecasting system. However, based on the definition of the model target variable and the official open-source code, the main evaluation in this paper appears to focus on daily streamflow simulation using metrics such as NSE. Daily streamflow simulation is an important component of flood forecasting, but in my view it is not the whole task. For example, when reading the abstract, I would expect hourly-scale flood forecasting, or that the forecasting system would include warning information, water level, and even inundation information.
Therefore, I have a question: is the object of this study the "core runoff/streamflow forecasting component within a flood forecasting system", or the "complete flood forecasting system itself"?
Previous work by Nevo et al. (2022) explicitly stated that Google's operational flood forecasting system consists of four subsystems: data validation, stage forecasting, inundation modeling, and alert distribution. In contrast, the target variable in the present paper is daily streamflow, and the main metrics are NSE and KGE. Therefore, more precisely, this paper evaluates the hydrological prediction core of a flood forecasting system, rather than the complete flood forecasting system itself.
If the term "flood forecasting system" is used, I suggest that the authors at least add event-level flood metrics or provide a clearer discussion in the Supplement. If the paper is only discussing the core support for a "flood forecasting system", then the title and related wording should not directly state "global flood forecasting system". The authors should make the terminology consistent throughout the paper, or address this issue in the outlook or discussion.
- Following the previous comment, I think the authors should provide additional supporting metrics and justification related to the terms "operational system" and "operational flood". This paper provides substantial support for future flood forecasting systems and operational forecasting, but most of the evaluation focuses on model performance for daily streamflow prediction, rather than improvement of a global flood system.
Using "daily streamflow prediction metrics" to support claims about an "operational system" is not well supported by evidence. Actual flood forecasting, especially for small and medium-sized basins, usually requires hourly-scale results; daily-scale forecasts are too coarse. In operational flood control and emergency response, people often care about metrics such as peak flow and time of peak flow rather than NSE, a goodness-of-fit metric that strongly favors long-term average behavior. The inherent limitations of NSE, KGE, and similar metrics are well known. The authors should discuss why daily streamflow prediction metrics such as NSE are sufficient to support an operational system in the context of a global flood forecasting system, or alternatively define the current system's limitations more clearly. Otherwise, the claims in the paper may appear overstated.
- The Introduction is very concise. However, as a research paper, it should clearly present the key research gap, the necessity of the study, and the need to upgrade existing technologies or solutions. For example, the main focus of this paper is the v2 system, but the current first paragraph mainly discusses the status of machine learning in streamflow simulation, model development, interpretability, and uncertainty quantification. These topics are only briefly mentioned, without specific literature citations, which makes the Introduction too brief.
In addition, the second paragraph directly turns to "using operational machine learning hydrology models for global-scale riverine flood forecasting", but it does not discuss the innovation of the v2 system or the improvements over v1. Since a substantial part of the v2 improvement comes from the introduction of GraphCast, I suggest that the authors at least add discussion of how meteorological data can improve flood forecasting models.
- In Section 2.1.2, the authors state that HRES and GraphCast forecast archives begin in approximately 2012 and 2016, respectively. To use the full historical streamflow record from 1980 to 2024, the authors substitute ERA5-Land reanalysis data for HRES/GraphCast forecast inputs in earlier years when such forecasts are unavailable, and assume that these reanalysis data serve as an "effective proxy" for the forecast inputs. The justification given in the paper is that "HRES shares the same underlying physical model as ERA5, and GraphCast is trained on ERA5". However, this assumption is not supported or demonstrated. I suggest that the authors conduct a comparison for years when HRES/GraphCast and ERA5-Land are both available, and show whether their precipitation, temperature, and other distributions are similar. Alternatively, the authors could compare whether NSE/KGE differs substantially when ERA5-Land is used as input versus when the actual GraphCast forecast inputs are used.
- In Section 3, the authors state that "For the ungauged setting, the v1 system used random k-fold (k=10) cross-validation, whereas the v2 system used a single holdout test set". I suggest that the authors explain why different spatial evaluation protocols were used for v1 and v2, how the v2 holdout basins were selected, and why a single spatial split is sufficient to evaluate ungauged generalization. This clarification is important because the spatial split strategy may affect the comparability of ungauged performance between v1 and v2.
- In Section 4.2, the authors state that "Figure 6 disaggregates the improvements provided by the ME-LSTM architecture and expanded training data from the predictive skill injected by the GraphCast meteorological forcings". However, the authors also state that "Blue boxes represent the Delta NSE gained by transitioning from the v1 to v2 model architecture and expanded Caravan training data. Green boxes represent the additional Delta NSE gained by incorporating GraphCast". The paper has already demonstrated the contribution of GraphCast, but it has not separated the contribution of the ME-LSTM architecture change from the contribution of the expanded Caravan training data. I think additional experiments and evidence could be added in the Supplement.
- The paper states that "We take the mean of the predicted distribution to be the deterministic model prediction that we evaluate in this Paper". Since both v1 and v2 produce probabilistic forecasts using "a countable mixture of asymmetric Laplacians (CMAL) distribution", I think it is necessary to explain why only the mean of the predicted distribution is evaluated. Deterministic NSE/KGE metrics can indicate predictive performance, but they cannot evaluate the quality of probabilistic forecasts. For flood forecasting, probabilistic forecast results themselves are important. I suggest adding metrics such as prediction interval coverage.
- In Section 4.5 and Figure 11, the authors state that the v2 system "yields higher absolute performance globally", but is also "proportionally more sensitive to the absence of local streamflow data for training". Specifically, at a 0-day lead time, the v2 system without GraphCast has a median gauged NSE of 0.78 and an absolute median penalty of 0.07, meaning the difference between gauged NSE and ungauged NSE, corresponding to a 10.6% relative decrease. With GraphCast, the v2 system has a median gauged NSE of 0.83, but the absolute median penalty increases to 0.12, corresponding to a 19.8% relative decrease. I suggest that the authors explicitly state in the abstract or conclusion that GraphCast improves overall absolute performance, but also increases the penalty between the gauged and ungauged settings. This does not mean that GraphCast brings the same magnitude of improvement under both gauged and ungauged settings. The paper should not only state that the lead time is extended by one day in ungauged basins.
- In Section 4.2, the authors state that "GraphCast forcings improve correlation but lower forecast variance", and further mention possible "spatial and temporal smoothing" and "underprediction of variance" at longer lead times. However, this paper studies flood forecasting, while the NSE and KGE metrics used in the paper cannot accurately demonstrate high-flow prediction performance. Flood peaks and high flows are very important for flood forecasting. From a mathematical perspective, lower forecast variance caused by GraphCast may indicate that the model underestimates high flows. Therefore, I suggest that the authors supplement the analysis with relevant flood peak or high-flow simulation results or metrics.
- At the beginning of the paper, the authors state that v2 improves upon v1. The paper has explained that v1 is an ED-LSTM, while v2 introduces ME-LSTM, expanded Caravan training data, and new meteorological inputs such as GraphCast. The authors also acknowledge that the performance difference between v1 and v2 is "a compound effect", and that the two systems use different spatial split strategies in the ungauged evaluation. However, the exact differences between v1 and v2 are not clearly compared. I think the paper lacks a table, namely a v1-v2 comparison table, listing differences in model architecture, training data sources, dynamic meteorological inputs, whether GraphCast is included, temporal and spatial splitting strategies, and related aspects.
- The paper uses the terms Google Global Flood Forecasting system and operational system in multiple places, but the methods, evaluation metrics, and open-source code mainly correspond to the runoff/streamflow model forecasting component of the Google FloodHub flood forecasting platform, rather than an end-to-end operational flood warning system. Section 2.1 clearly states that the model training target is daily streamflow at the basin outlet. The open-source code also mainly provides model training, evaluation, and related workflows; the target variable in the configuration file is streamflow, and the main evaluation metrics are NSE/KGE. Meanwhile, the public configuration file also states that this open-source pipeline differs from the operational pipeline.
I suggest that the authors more clearly distinguish whether what is evaluated and open-sourced in this paper is the "runoff/streamflow model component" or the complete "operational flood warning system". If the paper claims to evaluate the complete system, it should explain whether operational components such as real-time data validation, flood-threshold determination, inundation mapping, and alert distribution are included in the evaluation and open-source code. If they are not included, I suggest revising the wording in the title, abstract, methods, or Code Availability section, and clearly specifying which results can be reproduced using the released code. The provided code and the paper need to be explicitly aligned.
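One of the comments above asks for prediction interval coverage as a complement to deterministic NSE/KGE. The following is a minimal sketch of that metric, assuming lower and upper quantile bounds have already been extracted from the predicted distribution. The function name and all numbers are hypothetical and are not taken from the manuscript or the GoogleHydrology codebase.

```python
from typing import Sequence


def interval_coverage(obs: Sequence[float],
                      lower: Sequence[float],
                      upper: Sequence[float]) -> float:
    """Return the fraction of observations inside [lower, upper]."""
    if not (len(obs) == len(lower) == len(upper)):
        raise ValueError("obs, lower and upper must have equal length")
    if len(obs) == 0:
        raise ValueError("need at least one observation")
    hits = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return hits / len(obs)


# Hypothetical daily streamflow (m^3/s) with a nominal 90% interval
# given by assumed 5% and 95% quantiles of the forecast distribution.
obs = [12.0, 15.5, 40.2, 22.1, 9.8]
q05 = [10.0, 12.0, 20.0, 18.0, 10.5]
q95 = [14.0, 19.0, 35.0, 30.0, 15.0]

coverage = interval_coverage(obs, q05, q95)
print(f"empirical coverage: {coverage:.2f} (nominal 0.90)")
```

An empirical coverage well below the nominal level would suggest intervals that are too narrow, which is the kind of information a mean-only evaluation cannot show.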
- CEC1: 'Comment on egusphere-2026-2283 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Jun 2026
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived the GoogleHydrology code and Caravan on GitHub. However, GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other long-term archival and publishing alternatives, such as Zenodo.
The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance and replicability of the published papers for years after their publication. Please, therefore, publish your code and data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible. We cannot have manuscripts under discussion that do not comply with our policy.
Later, if the Topical Editor decides to continue with the review or publication process of your manuscript and you are requested to upload a new version of it, then the 'Code and Data Availability' section of your manuscript must also be modified to cite the new repository locations, and corresponding references added to the bibliography.
Additionally, although you do not seem to directly use them for the work presented, you link the Google Runoff Reanalysis & Reforecast dataset (GRRR), using a webpage hosted on google.com. Again, this is not an acceptable repository for long-term preservation and is therefore of limited use. We have already found a link there, in the description, to a Nature paper that does not work. We would encourage you to share any data in repositories that ensure long-term preservation to better serve the purpose of open science that you discuss in the manuscript.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2026-2283-CEC1
Data sets
Model Data Grey Nearing, Frederik Kratzert, Martin Gauch https://doi.org/10.5281/zenodo.19676842
Model code and software
GoogleHydrology Grey Nearing, Omri Shefi, Amit Markel, Frederik Kratzert, Martin Gauch https://github.com/google-research/flood-forecasting
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 404 | 214 | 18 | 636 | 18 | 17 |
This manuscript presents an evaluation of version 2 of the Google Global Flood Forecasting system against the previous v1 system and established third-party benchmarks. The study is valuable because it provides transparency about an operational global flood forecasting system, introduces important technical updates such as the ME-LSTM architecture and GraphCast meteorological forcings, and contributes open-source resources for the hydrological community. I consider the topic important and suitable for publication after revision.
My main concerns are related to the interpretation of the reported performance improvements. First, the comparison between v1 and v2 may be affected by changes in the gauged/ungauged status of evaluation basins, because the v2 system uses expanded training data. If some basins were ungauged in v1 but gauged in v2, part of the reported improvement may reflect increased spatial coverage of local streamflow training data rather than improvements in the model architecture or meteorological forcings. This issue should be clarified and, if relevant, quantified.
Second, the manuscript mainly evaluates the system using NSE and KGE components. These metrics are useful for assessing overall hydrograph prediction skill, but the manuscript is framed as a flood forecasting study. The authors should therefore discuss more clearly what the reported improvements imply for practical flood prediction, especially with respect to the improved correlation component, reduced forecast variability, flood peak timing, peak magnitude, and other aspects of flood warning performance. If event-based verification is beyond the scope of the paper, its absence should be acknowledged as a limitation or future direction.
Overall, I think the paper has strong potential, but the above issues should be addressed to make the central conclusions more robust and easier to interpret.
Note: the same comments are attached as a PDF.
[1] Possible confounding due to changes in gauged/ungauged status between v1 and v2
My most important concern is that it is not clear whether each evaluation basin has the same gauged/ungauged status in both v1 and v2. Since the v2 system uses expanded training data, including Caravan, some basins may have been ungauged in v1 but gauged in v2.
If such basins are included in the current “gauged” evaluation, the reported improvement from v1 to v2 may reflect not only improvements in the model architecture or the use of GraphCast forcings, but also the effect of newly including local streamflow observations from those basins in the training data. In other words, the improvement may partly reflect an ungauged-to-gauged transition. In that case, interpreting the v2 improvement mainly as an effect of the upgraded model structure would be potentially misleading.
I therefore ask the authors to clarify whether the gauged/ungauged status of each evaluation basin is consistent between v1 and v2. It would also be useful to separate the evaluation into at least the following groups:
This decomposition would help distinguish the effects of model and input-data improvements from the effect of increased spatial coverage of the training data. In particular, if basins that changed from ungauged in v1 to gauged in v2 show large improvements, the interpretation of the current aggregate v1-v2 comparison may change substantially.
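A minimal sketch of such a decomposition, assuming each evaluation basin carries a flag for whether its streamflow record was part of the v1 and v2 training sets. The record layout, the function name, and the NSE values are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict
from statistics import median


def median_delta_by_status(records):
    """Group basins by (v1 status -> v2 status) and return median NSE change.

    records: iterable of (gauged_in_v1, gauged_in_v2, nse_v1, nse_v2).
    """
    groups = defaultdict(list)
    for gauged_v1, gauged_v2, nse_v1, nse_v2 in records:
        key = "{}->{}".format(
            "gauged" if gauged_v1 else "ungauged",
            "gauged" if gauged_v2 else "ungauged",
        )
        groups[key].append(nse_v2 - nse_v1)
    return {key: median(deltas) for key, deltas in groups.items()}


# Hypothetical per-basin scores, not results from the manuscript.
records = [
    (True, True, 0.70, 0.78),
    (True, True, 0.60, 0.66),
    (False, True, 0.40, 0.65),
    (False, True, 0.50, 0.70),
    (False, False, 0.45, 0.50),
]

summary = median_delta_by_status(records)
for status, delta in sorted(summary.items()):
    print(f"{status}: median delta NSE = {delta:.3f}")
```

If the "ungauged->gauged" group showed a much larger median change than the other groups, part of the aggregate v1-v2 difference would be attributable to training-data coverage.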
[2] Limitations of NSE/KGE and the practical meaning of the improvements for flood forecasting
The manuscript demonstrates improved hydrograph prediction skill of the v2 system using NSE and KGE components. This evaluation is useful. However, because the manuscript focuses on a flood forecasting system, I think the authors should discuss more clearly what these improvements mean from a practical flood forecasting perspective.
In particular, the improvement in the correlation component of KGE is important. It may indicate better prediction of hydrograph phase, rising limbs, and flood peak timing, which are highly relevant for early warning. On the other hand, the reduction in forecast variability may imply possible underestimation of peak discharge. I therefore suggest that the authors interpret the meaning of both improved correlation and reduced variability more carefully in the context of flood forecasting. Although the Conclusions identify the KGE decomposition result as one of the main improvements, the main text currently contains relatively little discussion of why the correlation improvement is especially important.
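The interaction between the two components can be illustrated with a small sketch. It assumes the standard KGE decomposition into correlation r, variability ratio alpha (simulated over observed standard deviation), and bias ratio beta (simulated over observed mean). The series are hypothetical and are constructed so that timing is perfect while variability is halved.

```python
from math import sqrt
from statistics import mean, pstdev


def kge_components(obs, sim):
    """Return (kge, r, alpha, beta) for two equal-length series."""
    mu_o, mu_s = mean(obs), mean(sim)
    sd_o, sd_s = pstdev(obs), pstdev(sim)
    cov = mean([(o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)])
    r = cov / (sd_o * sd_s)   # linear correlation (timing and shape)
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # bias ratio
    kge = 1.0 - sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return kge, r, alpha, beta


# Hypothetical hydrograph with one peak, and a forecast that keeps the
# mean and timing but damps deviations from the mean by half.
obs = [1.0, 2.0, 3.0, 4.0, 10.0]
sim = [2.5, 3.0, 3.5, 4.0, 7.0]

kge, r, alpha, beta = kge_components(obs, sim)
print(f"KGE={kge:.2f} r={r:.2f} alpha={alpha:.2f} beta={beta:.2f}")
print(f"observed peak={max(obs)}, forecast peak={max(sim)}")
```

In this constructed case the correlation is perfect, yet the forecast peak is lower than the observed peak, which is the concern raised above about reduced variability.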
If possible, it would also be helpful to include one or a few representative hydrograph examples, such as a basin where v1 missed the timing of a flood peak but v2, or v2 with GraphCast, captured it better. Such examples would help readers understand how the improvement in statistical metrics appears in actual forecast time series.
Finally, NSE and KGE alone do not directly evaluate several important aspects of flood disaster prediction, such as peak discharge, threshold exceedance, false alarms, and missed events. This limitation should at least be clearly acknowledged in the Conclusion or in a Limitations/Future Directions section.
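As an illustration of what threshold-based verification could look like, the sketch below counts hits, misses, and false alarms for exceedance of a flow threshold and derives the probability of detection (POD), false alarm ratio (FAR), and critical success index (CSI). The threshold and the series are hypothetical, and the helper name is an assumption rather than anything from the manuscript or its code.

```python
def exceedance_scores(obs, sim, threshold):
    """Return (pod, far, csi) for exceedance of a flow threshold."""
    hits = misses = false_alarms = 0
    for o, s in zip(obs, sim):
        obs_event = o >= threshold
        sim_event = s >= threshold
        if obs_event and sim_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif sim_event:
            false_alarms += 1
    nan = float("nan")
    pod = hits / (hits + misses) if hits + misses else nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    total = hits + misses + false_alarms
    csi = hits / total if total else nan
    return pod, far, csi


# Hypothetical daily flows (m^3/s) and an assumed flood threshold.
obs = [5.0, 12.0, 30.0, 8.0, 25.0, 6.0]
sim = [6.0, 22.0, 28.0, 21.0, 15.0, 4.0]

pod, far, csi = exceedance_scores(obs, sim, threshold=20.0)
print(f"POD={pod:.2f} FAR={far:.2f} CSI={csi:.2f}")
```

Scores of this kind depend on how the threshold is chosen, for example a return-period flow per basin, so they complement rather than replace NSE and KGE.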
Abstract:
In the Abstract, the system is described only as an “updated flood forecasting system”, but the name of the Google Global Flood Forecasting system is not explicitly stated. Although this is already included in the title, I think it would be useful to name the system explicitly in the Abstract, since the Abstract is often read independently.
In addition, the current Abstract does not clearly explain what technical changes were introduced in v2. I recommend adding one concise sentence summarizing the main technical updates, such as the replacement with the ME-LSTM architecture, improved integration of multiple meteorological input products and robustness to missing inputs, expanded training data through Caravan, and the inclusion of GraphCast meteorological forcings. This would help readers understand the technical basis for the reported improvement, rather than only seeing the performance outcome.
P2 L15: Alignment between the Introduction and the Results
The limitations of the previous system and the improvements introduced in the new system should be presented in a way that is more clearly aligned with the analyses in the Results section.
In the current Introduction, the v1-to-v2 upgrade is described as addressing three data-related challenges: training data availability, temporally limited data records, and input data distribution shifts. These are relevant points, but the Results section mainly discusses the improvements in terms of two components: improvements on the hydrological model side, including ME-LSTM and expanded training data, and improvements on the meteorological forcing side through the use of GraphCast.
I think the Introduction would be clearer if it first described the main limitations of v1 and then explained how v2 was designed to address them through both an improved model architecture and improved meteorological forecast inputs. This would make the narrative from motivation to methods and results more consistent.
If an analysis of the "ungauged to gauged" effect is added to the Results, please arrange this part of the Introduction so that it aligns with the updated Results section.
P2 L18: The study objectives should explicitly include performance evaluation.
At the end of the Introduction, the authors state that the two main objectives of the paper are to provide transparency about the progress and challenges of the operational flood forecasting system, and to facilitate research on ML-based flood forecasting by providing open-source resources. However, the main focus of the manuscript is the performance evaluation and benchmarking of the v2 system. I therefore suggest that the stated objectives should explicitly include evaluating the predictive performance of the v2 operational system against v1 and third-party benchmarks. This would make the objectives better aligned with the structure and conclusions of the manuscript.
Table 1:
Table 1 is useful for reproducibility because it provides the full list of static catchment attributes. However, the table is very long and mostly consists of an enumeration of input variables, which substantially interrupts the flow of Section 2.1.1. I suggest keeping a concise summary in the main text, including the number of attributes, data sources, major categories, and representative examples, and moving the full attribute list to the Supporting Information or an Appendix. This would improve readability without reducing reproducibility.
P6 L9:
The descriptions of the meteorological input data in Section 2.1.2 and the training settings in Section 2.3 are presented mainly as bullet lists. The use of bullet lists itself is not a problem. However, for a model description paper, it is important not only to state what was used, but also to explain why those design choices were made and what data-availability or operational constraints motivated them. I suggest adding more explanation of the rationale behind choices such as feature unioning, input feature dropout, noise injection, batch size, number of epochs, and batch limits. This would make the model design and training strategy easier to understand and reproduce conceptually.
Figure 3:
Figure 3 is important for explaining the forecast initialization artifact in the ED-LSTM, but in its current form it is not easy to identify where the unnatural behavior appears. I suggest that the authors indicate the transition point from the hindcast period to the forecast period more clearly, for example using arrows, annotations, or highlighting, and explicitly show which part of the predicted hydrograph corresponds to the artifact. It would also help readers if the authors showed, next to the problematic example, a case without a strong artifact or a corresponding ME-LSTM example where the issue is reduced.
More generally, figures with multiple panels should include panel labels such as (a), (b), and (c). This would make it easier to refer to specific panels in the text and captions.
P10 L9: ME-LSTM
The ME-LSTM is one of the central technical improvements in this manuscript, but the roles of the two LSTM layers are not sufficiently clear from the current text and Figure 4. My understanding is that the first LSTM layer represents the evolving hydrological state derived from the hindcast sequence, while the second LSTM layer combines this state information with forecast embeddings to predict future streamflow. However, the current description does not make clear whether the first layer is only used as an initialization mechanism, or whether it continues to update state information during the forecast period.
It is also unclear how the training loss is applied across the hindcast and forecast periods, and whether the forecast layer is specifically optimized for future lead-time predictions. These points are important for understanding how the ME-LSTM differs from the ED-LSTM handoff approach.
I therefore suggest that the authors explain more clearly, both in the text and in Figure 4, the different roles of the hindcast and forecast models in the ME-LSTM, the flow of information between the first and second LSTM layers, how state information is updated during the forecast period, and how the loss function is applied.
Figure 4:
In Figure 4, it is not clear which LSTM block corresponds to the first layer and which corresponds to the second layer. Since the text describes the ME-LSTM as a two-layer stacked LSTM, the first and second LSTM layers should be explicitly labelled in both the figure and the caption.
The meaning of “Output” in the figure should also be clarified. It is currently unclear whether this refers to the predicted streamflow, the parameters of the CMAL predictive distribution, or the deterministic mean discharge used for evaluation.
In addition, the handling of missing inputs, which is an important advantage of the ME-LSTM architecture, is not easy to understand from the current figure. I suggest making NaNs or missing input products more visually prominent, and clearly indicating which inputs are included in the masked mean operation and which inputs are excluded. This should also be explained explicitly in the figure caption.
Figure 5:
Figure 5 is one of the key figures for the global performance comparison. However, the current CDF panels show many lines corresponding to multiple model configurations and multiple lead times at the same time, making the figure difficult to interpret. I suggest reorganizing this figure, for example by separating the comparison among models from the comparison across lead times, either into different figures or different panels. Another option would be to show CDFs only for selected representative lead times, while presenting the full lead-time dependence using boxplots or median performance curves.
The legend font is also too small and should be enlarged. In addition, each panel should be labelled clearly, for example as (a), (b), (c), and (d), so that the text and caption can refer to the individual panels more easily.
Figure 8:
Figure 8 supports one of the central conclusions of the manuscript, but the upper panels, especially the upper-left panel, are difficult to interpret because the legend is insufficient. It is not clear from the figure alone what is being compared. The authors should more clearly indicate the correspondence among the v1 nowcast, the v2 forecasts at different lead times, and the gauged/ungauged settings, both in the figure and in the caption.
Section 4.4 Effect of Hydrological Characteristics
In Section 4.4, the authors analyze the relationship between hydrological characteristics and model performance. In addition to the current attribute-based analysis, it would be useful to show the spatial distribution of the v1-to-v2 skill improvement on a world map. This would help readers understand where the updated system improves most, and whether the improvements are concentrated in particular regions or hydroclimatic settings. If the effects of gauged and ungauged evaluation are mixed, it may be better to show separate maps for gauged and ungauged basins. This would also help clarify whether the spatial pattern of improvement is related to model generalization, local training data availability, or meteorological forcing improvements.
If such a spatial map is added to the main text, Figure 10 could potentially be moved to the Supplementary Information, since the map may provide a more direct and intuitive view of where the model improvement occurs globally.
Section 4.5 and Figure 11
Section 4.5 and Figure 11 provide a useful comparison between gauged and ungauged performance. However, the discussion could be expanded to better explain what this performance gap implies for the reliability of the system in ungauged basins. Since global flood forecasting often targets regions where local streamflow observations are limited or unavailable, the relative performance of ungauged predictions compared with gauged predictions is highly important. I suggest that the authors discuss more explicitly how large the ungauged penalty is, whether it varies by region or hydrological characteristics, and what this means for operational confidence in ungauged basins.
Figure 11
In the lower panel of Figure 11, the line corresponding to the improvement ratio or relative percentage change appears to be shown as a dashed line. However, this dashed line style is not represented in the legend. The authors should revise the legend so that the line styles and colors are consistent with the plotted data.