Enhancing flood forecasting reliability in data-scarce regions with a distributed hydrology-guided neural network framework
Abstract. Flood early warning systems are critical for reducing disaster impacts, yet their effectiveness remains limited in data-scarce regions such as Africa and South America. Existing global platforms – including GloFAS and the Google Flood Hub – exhibit low reliability in these areas, particularly for rare flood events and under strict timing constraints. Here, I demonstrate the potential of a distributed, hydrology-guided neural network framework, Bakaano-Hydro, to enhance flood forecasting reliability in data-scarce regions. The proposed framework integrates process-based runoff generation, topographic routing, and a Temporal Convolutional Network for streamflow simulation. Using a hindcast-based evaluation across 470 gauging stations from 1982 to 2016, I benchmark Bakaano-Hydro's flood detection skill against GloFAS and Google AI model across multiple return periods (1-, 2-, 5-, and 10-year) and timing tolerances (0–2 days). Results show that Bakaano-Hydro consistently achieves higher Critical Success Index (CSI), lower False Alarm Rate (FAR), and higher Probability of Detection (POD), even under exact-day (0-day) timing constraints. Its median CSI scores at 0-day tolerance exceed or match those of GloFAS and Google AI model under more lenient timing thresholds. These performance gains are statistically significant across diverse hydroclimatic regions, including arid and tropical basins, demonstrating the model's spatial generalization capacity. By coupling physical realism with machine learning generalizability, Bakaano-Hydro provides a reliable, interpretable, and open-source tool for enhancing flood forecasting in regions most vulnerable to climate extremes and least equipped with observational infrastructure.
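For reference, the skill scores named in the abstract (POD, FAR, CSI) are standard contingency-table ratios computed from event counts. A minimal sketch (illustrative only, not the paper's code; the example counts are hypothetical):

```python
# Standard flood-detection skill scores from a 2x2 contingency table.
# hits: events both observed and forecast; misses: observed but not forecast;
# false_alarms: forecast but not observed.

def detection_scores(hits: int, misses: int, false_alarms: int) -> dict:
    """Return POD, FAR and CSI; NaN when a denominator is zero."""
    pod = hits / (hits + misses) if (hits + misses) else float("nan")
    far = false_alarms / (hits + false_alarms) if (hits + false_alarms) else float("nan")
    csi = hits / (hits + misses + false_alarms) if (hits + misses + false_alarms) else float("nan")
    return {"POD": pod, "FAR": far, "CSI": csi}

# Hypothetical counts: 30 hits, 10 misses, 10 false alarms.
scores = detection_scores(hits=30, misses=10, false_alarms=10)
# POD = 30/40 = 0.75, FAR = 10/40 = 0.25, CSI = 30/50 = 0.60
```

Note that CSI penalizes both misses and false alarms, which is why it is the headline metric for rare-event detection.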
The research compares three forecasting models in regions with low gauge density that are typically underrepresented in traditional approaches. The comparison is based on return periods and each model's success or failure in capturing the corresponding flood events. Across all the metrics presented, the hybrid model outperforms the others. These results are highly valuable in supporting the use of such models in forecasting frameworks. However, two main points need to be addressed to fully validate the results.
The first point concerns the period used to define the number of successes of the hybrid model. The author used the training period to count these successes, which creates a biased metric that cannot be reliably used for comparison. Using the entire period to define return periods is acceptable, as it provides a good representation of them, but the success rate should be evaluated over an independent period to accurately estimate how the model performs on unseen data in operational settings.
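The recommended protocol can be sketched as follows (illustrative code with synthetic data, not the author's pipeline): the flood threshold may be derived from the full record, but detection skill must be counted only on hold-out years. The split year and distribution parameters below are hypothetical.

```python
# Sketch of an unbiased evaluation: threshold from the full record,
# skill scored only on an independent hold-out period.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1982, 2017)                        # 35 hypothetical annual maxima
obs = rng.gumbel(loc=100.0, scale=20.0, size=years.size)
sim = obs + rng.normal(0.0, 10.0, size=years.size)   # hypothetical model output

# Defining the return level from the FULL record is acceptable ...
threshold = np.quantile(obs, 0.8)                    # ~5-year return level

# ... but successes must be counted only on years the model never saw.
holdout = years >= 2010
obs_ev = obs[holdout] > threshold
sim_ev = sim[holdout] > threshold

hits = int(np.sum(obs_ev & sim_ev))
misses = int(np.sum(obs_ev & ~sim_ev))
false_alarms = int(np.sum(~obs_ev & sim_ev))
denom = hits + misses + false_alarms
csi = hits / denom if denom else float("nan")
```

Scoring over the training years instead would let the model "detect" events it was fitted to reproduce, inflating CSI relative to operational performance.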
The second point relates to the number of gauges used in training for each model. This can be problematic because it may lead to unfair comparisons. For example, it is not fair to compare a model trained with 100 gauges in a region to another trained with only 10 gauges. The author only mentions the number of gauges common to all the models, used for evaluation, but does not specify the number of gauges used during training or the training period used for each model.
Addressing these issues is essential to ensure the validity of the results. Therefore, we recommend returning the manuscript for major revisions.
Minor comments
Pag.2 - Line 4. Be more specific about the catastrophic events in Nigeria, Sudan, etc., because they may be unfamiliar to readers in other countries or continents.
Pag.3 - Line 9. In reality, the problem with lumped models is the spatial distribution of inputs, which is not necessarily solved by a higher-resolution process-based (PB) model. Even those models have issues fully characterizing all the processes at such high resolution, because many of the parameters and intermediate data do not exist at that resolution.
Pag.3 - Line 30-32. I am not sure Frederik Kratzert supports this statement in his recent manuscripts. I am fairly certain the Mass-Conserving LSTM demonstrated the opposite.
Pag.4 – Line 2-3. The preceding statement and reference do not support this claim; they only mention the need for higher resolution, which is not necessarily associated with the use of a process-based model.
Pag.4 – Line 21. Are 643 gauges sufficient to characterize two continents?
Pag.5 – Line 3-15. A more detailed characterization is needed to fully describe the variability: catchment areas, aridity, slopes, total annual precipitation, etc. Are these areas poorly represented in the training of the ML and process-based models?
Pag.5 – Line 16. Add a reference to the VegET method.
Pag.7 – Line 7-9. How much distortion does the 1 km resolution introduce for river networks whose widths are much smaller (~100 m)?
Pag.7- Line 17-18. Add a reference to support this statement.
Pag.7 – Line 19-23. This statement is not fair, because the CNN is used to convert the routed streamflow into the actual streamflow, which means it was never exposed to the memory-vanishing issue; after all, the process-based model is dealing with that.
Pag.7 – Line 28. Add a figure with the architecture and the inputs of the TCN.
Pag.8 – Line 1. Add more details about spatial periodicity.
Pag.8 – Line 23. How many gauges from the training of each model are present in the areas studied? What if some models used these gauges in training and others did not? This is essential for a fair comparison.
Pag.9 – Line 19-20. Why was only one distribution used when each catchment can have a very different distribution?
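The concern above can be made concrete: rather than imposing a single distribution everywhere, one can fit several candidate extreme-value distributions to each catchment's annual maxima and keep the best by a goodness-of-fit criterion. A hedged sketch with synthetic data (the candidate set and selection rule are illustrative choices, not the paper's method):

```python
# Per-catchment distribution selection: fit several candidates and keep
# the one with the smallest Kolmogorov-Smirnov statistic.
import numpy as np
from scipy import stats

CANDIDATES = {
    "gumbel_r": stats.gumbel_r,
    "genextreme": stats.genextreme,
    "lognorm": stats.lognorm,
}

def best_fit(annual_maxima: np.ndarray) -> str:
    """Return the name of the candidate with the smallest KS statistic."""
    ks_by_name = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(annual_maxima)
        ks_by_name[name] = stats.kstest(annual_maxima, dist.name, args=params).statistic
    return min(ks_by_name, key=ks_by_name.get)

# Hypothetical catchment: 35 years of Gumbel-distributed annual maxima.
rng = np.random.default_rng(42)
maxima = rng.gumbel(loc=200.0, scale=40.0, size=35)
chosen = best_fit(maxima)
```

Running this per catchment would let the return-period thresholds reflect each basin's own flood regime instead of a single global assumption.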
Pag.10 – Line 4. Why did you use the training period for the metric? This is a biased estimation. You must use the validation period.
Pag. 10 – Line 27. Be careful not to be overconfident about your results. They are only valid for the catchments studied; this does not imply true generalization.
Pag.11 – Line 6. Comparing with the testing period generates an overconfident metric. This is a biased analysis.
Figure 2. The color scheme misleads the reader: values near zero already indicate poor performance. Please plot on a 0-1 scale.
Pag.16 Line 12. Please place this information in context: add an aridity index or another descriptor to clearly define how arid these regions are.
Figure 6. “For each plot only basins for which differences among the three models were statistically significant are shown.” What does this mean? Significant differences with respect to which model? Does it mean you are plotting only the catchments where your model was significantly better than the others?
Pag. 19 Line 10-11. From your results it is clear that your model is better than the others; however, it is not clear where this improvement comes from, the PB or the ML component. It would be valuable to add the same analysis comparing your PB component (without the CNN) against GloFAS, given that both are PB.
Pag.20 Line 2-5. From my point of view, both approaches are valuable depending on the purpose. Observations are good for models that are implemented, or intended to be implemented, as an official tool. Simulations are good for presenting the prospect of applying the model in an operational framework after bias correction (fine-tuning). Therefore, neither approach is better than the other; they simply serve different purposes.
Pag.20 Line 6-7. I am not sure we can call this a new paradigm. Research applying hybrid models is abundant in the literature. Moreover, the idea of multi-representation approaches has already been mentioned in the literature.
Pag.20 Line 11-12. Be careful not to oversell your research; it is not clear that the diversity of basins studied allows you to make this statement.
Pag.20 Line 21. How interpretable is a hybrid model? Where do the good results come from, the PB or the ML component?
Pag. 21 Line 1. The generalization to ungauged basins is not well supported or explained in the manuscript. For example, how well does the model perform in regions with more extreme climates (northern Chile or southern Argentina)?
Pag.21 Line 3. From the point of view of a global analysis: is the Bakaano PB model running in an operational framework? If so, different countries could easily train a CNN model to fit the local information in each country. If this model and the data used to run it are not freely available, it will be very hard to implement in an operational framework.