Outrunning flash floods: XGBoost and sparse impact reports deliver global medium-range probabilistic forecasts of flash flood occurrence
Abstract. Flash floods are the world's most frequent and deadly type of flood. Yet, no medium-range forecasts of their occurrence exist over a continuous global domain – essential to fulfil the UN's "Early Warnings for All" target to protect everyone with early warning systems. This study addressed this gap in two phases. In a first phase, regional medium-range, data-driven forecasts of flash occurrence were developed by combining regional high-density, quality-controlled flash flood impact reports (e.g., NOAA's Storm Event Database over the Contiguous US) with global reanalysis and forecasts (e.g. from ERA5 for non-meteorological variables and ERA5-ecPoint for rainfall). Out of all the tested models, XGBoost gradient boosting achieved the best performance: it maintained high and constant discrimination skill across scores (e.g. ROC and Precision-Recall curves) and lead times, and forecast probabilities remained reliable below 10 % at day 1 and 2 % at day 5. In a second phase, a spatial-constrained sensitivity analysis evaluated how well the regional XGBoost model generalised to unseen regions. The sensitivity analysis revealed that a model trained on hydro-climatologically diverse and observation-dense sub-domains generalised better than those trained across the full domain with sparser data, suggesting a viable strategy for extending regionally trained forecasts of flash flood occurrence globally. Hence, this study provides the first empirical evidence that global, medium-range forecasts of flash flood occurrence are achievable with simple data-driven approaches and readily available data, closing one of the most pressing and long-standing gaps in modern hydrology.
The authors develop an XGBoost gradient-boosting model for probabilistic flash flood occurrence forecasts up to day 5, trained on the NOAA Storm Event Database (2001–2020) over CONUS and driven by ERA5 reanalysis/forecasts, with rainfall post-processed via the ecPoint statistical downscaling. They benchmark XGBoost against alternative gradient-boosting libraries, random forests, and a feed-forward neural network, and use SHAP for interpretation. A spatially constrained sensitivity analysis tests three strategies (uniform thinning, restricted-region labels with full-domain inputs, and restricted-region labels with restricted-region inputs) for transferring a regionally trained model to data-sparse domains. The 2024 Valencia flash flood is used as an out-of-distribution case study.
The motivating question — can a model trained on a data-rich region produce skilful flash flood forecasts elsewhere? — is timely and operationally important. The framing around the UN's "Early Warnings for All" target is appropriate, and the choice to predict impact occurrence (rather than discharge) is well-justified. However, several aspects of the methodology and interpretation require substantial strengthening before the central claims can be supported.
Major Comments:
1. The flash-flood definition is incompatible with the working spatial scale
Table 1 defines fluvial flash floods as occurring in catchments <500 km² with concentration times of minutes to hours. The model operates on the ERA5 reduced-Gaussian grid (~31 km, i.e., grid-box areas approaching 1000 km²). A single grid box typically contains many flash-flood-relevant catchments, and the model cannot resolve the catchment dynamics it claims to predict. This mismatch undermines both the construct validity of the predictand and the interpretation of "flash flood occurrence" at the grid scale. The authors briefly acknowledge resolution as a limitation in Section 6.4, but the issue deserves more prominent treatment up front: what exactly is being forecast at 31 km, and how does the binarisation procedure (Appendix A, Figure A1) reconcile point reports occurring in sub-grid catchments with a grid-cell yes/no label?
2. Population-density confounding may not be a peripheral artefact
The authors indicate that the SHAP-derived associations between high LAI / low SDFOR / saturated soils and flash flood occurrence likely reflect the spatial distribution of who reports, not where flash floods occur. But the implications go further than they acknowledge. If the model has learned a latent population-density proxy, then its skill outside CONUS will be highest precisely in densely populated, vegetated, low-relief regions and lowest in the data-sparse mountain and arid regions where flash flood mortality is concentrated globally (Libya 2023, Afghanistan 2024, parts of the Andes). This is the inverse of where additional warning capability is most needed. A useful diagnostic would be to retrain after stratifying by population density (for example, restricting positive labels to grid boxes with population below a threshold) and report whether the SHAP rankings change. If the rankings flip toward steeper terrain and sparser vegetation, the population-confounding hypothesis is confirmed, and the global transferability claims need to be substantially softened.
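To make the requested diagnostic concrete, the sketch below shows one possible way to compare feature rankings between the full model and a population-stratified retraining. It assumes per-sample SHAP values are already available as plain lists; the feature names and numbers are hypothetical placeholders, not values from the manuscript, and the helper names are illustrative only.

```python
def rank_features(shap_values):
    """Rank features by mean absolute SHAP value, most important first.

    shap_values: dict mapping feature name -> list of per-sample SHAP values.
    """
    mean_abs = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)


def rank_shifts(ranking_a, ranking_b):
    """Position change of each feature from ranking_a to ranking_b.

    Negative values mean the feature moved up in importance.
    """
    position_b = {name: i for i, name in enumerate(ranking_b)}
    return {name: position_b[name] - i for i, name in enumerate(ranking_a)}


# Hypothetical SHAP values for two trainings (all labels vs. low-population labels).
full_model = {"lai": [0.4, -0.6], "slope": [0.1, -0.1], "rain": [1.0, -0.8]}
low_population = {"lai": [0.1, -0.1], "slope": [0.5, -0.3], "rain": [0.9, -0.9]}

shifts = rank_shifts(rank_features(full_model), rank_features(low_population))
```

Reporting such shifts (or a rank correlation) for each population stratum would let readers judge how much of the learned signal depends on where reports are filed.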
3. The frequency bias of the forecasts is too large to be treated as a minor cost
Figure 12c shows FB ≈ 1.2 for reanalysis but ~2.5 for forecasts at all lead times — a 150% overprediction of yes-events. Section 6.1 frames the choice of balanced loss as preferable because weighted loss inflates false alarm rates twentyfold, but FB = 2.5 is not a minor cost; in an operational early-warning context, persistent overprediction at this magnitude erodes trust as severely as the false-alarm problem the authors warn against.
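For readers less familiar with the score, the arithmetic behind the 150% figure can be sketched as follows. The function uses the standard contingency-table definition of frequency bias; the counts are hypothetical and chosen only so that FB equals 2.5.

```python
def frequency_bias(hits, false_alarms, misses):
    """FB = forecast yes-events / observed yes-events."""
    observed_yes = hits + misses
    if observed_yes == 0:
        raise ValueError("FB is undefined when no events were observed")
    return (hits + false_alarms) / observed_yes


def overprediction_percent(fb):
    """Percentage by which forecast yes-events exceed observed yes-events."""
    return (fb - 1.0) * 100.0


# Hypothetical counts: 250 forecast yes-events against 100 observed yes-events.
fb = frequency_bias(hits=40, false_alarms=210, misses=60)
excess = overprediction_percent(fb)
```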
4. Cross-validation does not respect spatial and temporal structure
The repeated stratified k-fold procedure (Figure 9) assumes IID samples. Adjacent grid boxes within the same storm system are strongly spatially autocorrelated, and 24-hourly periods within a multi-day event are temporally autocorrelated. Random folds will place training and test samples from the same physical event in different folds, inflating the apparent skill. A spatial blocked CV (e.g., leave-region-out, or buffered spatial folds) and event-based temporal CV would give more honest performance estimates. This is particularly important because the central transferability claim requires accurate generalisation estimates.
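As an illustration of what a spatially blocked split could look like, the minimal sketch below assigns every sample in the same coarse latitude/longitude block to the same fold. The 5-degree block size, the round-robin fold assignment, and the function names are assumptions made for this example; the sketch does not implement buffering between blocks or event-based temporal grouping, both of which would be needed in practice.

```python
def spatial_block_id(lat, lon, block_deg=5.0):
    """Map a grid-box centre to the (row, column) index of a coarse block."""
    return (int((lat + 90.0) // block_deg), int((lon + 180.0) // block_deg))


def blocked_folds(samples, n_folds=5, block_deg=5.0):
    """Return one fold index per sample, keeping each block inside one fold.

    samples: list of (lat, lon) tuples for the grid-box centres.
    """
    block_ids = [spatial_block_id(lat, lon, block_deg) for lat, lon in samples]
    unique_blocks = sorted(set(block_ids))
    # Round-robin assignment of blocks to folds (an illustrative choice).
    fold_of_block = {block: i % n_folds for i, block in enumerate(unique_blocks)}
    return [fold_of_block[block] for block in block_ids]


# Two neighbouring grid boxes and one distant grid box (hypothetical coordinates).
folds = blocked_folds([(35.1, -97.2), (35.4, -97.9), (45.0, -120.0)])
```

With such a split, neighbouring grid boxes affected by the same storm can no longer appear on both sides of the train/test boundary.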
On a related note, multi-day event expansion may introduce label leakage and double-counting. For a hurricane producing flash floods over five consecutive days, this generates five training rows whose meteorological predictors and impact labels are not independent. This both inflates the apparent class frequency (mildly mitigating the imbalance the loss function is designed to handle) and introduces correlated samples into the training set. The authors should report the distribution of event durations and quantify how many "samples" actually correspond to independent meteorological events.
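One simple way to obtain the requested duration distribution is sketched below. It merges consecutive positive days within each grid box into a single run; this is a simplification adopted for illustration, since a real event also spans several grid boxes, and the row format and dates are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def event_durations(positive_rows):
    """Lengths (in days) of runs of consecutive positive days per grid box.

    positive_rows: iterable of (grid_box_id, date) pairs with a positive label.
    """
    days_by_box = defaultdict(set)
    for box, day in positive_rows:
        days_by_box[box].add(day)

    durations = []
    for days in days_by_box.values():
        ordered = sorted(days)
        run = 1
        for previous, current in zip(ordered, ordered[1:]):
            if current - previous == timedelta(days=1):
                run += 1
            else:
                durations.append(run)
                run = 1
        durations.append(run)
    return durations


# Hypothetical positive rows: a three-day run and two isolated days.
rows = [
    ("A", date(2020, 1, 1)), ("A", date(2020, 1, 2)), ("A", date(2020, 1, 3)),
    ("A", date(2020, 1, 10)), ("B", date(2020, 1, 2)),
]
durations = event_durations(rows)
```

Here five training rows reduce to three runs, and the ratio of rows to runs gives a first estimate of how much the sample count overstates the number of independent events.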
5. No baseline beyond the no-skill diagonal
An AUC-ROC of 0.8 is uninformative without a reference point. A logistic regression on the same features, a climatology of flash-flood frequency by month and grid box, or even the raw rainfall return-period exceedance probability used directly as the forecast probability would give a defensible baseline. The Valencia case (Section 5.2) actually provides a hint that the model's added value over raw rainfall exceedance is real (downstream propagation via the adjacent-grid-box feature), but a domain-wide quantitative baseline comparison is needed to support the headline claim that XGBoost adds skill.
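A minimal sketch of the climatological baseline suggested above is given here, together with one way of expressing skill relative to it. The Brier skill score is my choice for this example rather than a score taken from the manuscript, and the row format and numbers are hypothetical.

```python
from collections import defaultdict


def fit_climatology(train_rows):
    """Event frequency per (grid box, month) in the training period.

    train_rows: iterable of (grid_box_id, month, label) with label in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])  # [positives, total]
    for box, month, label in train_rows:
        counts[(box, month)][0] += label
        counts[(box, month)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


def brier_score(probs, labels):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def brier_skill_score(model_probs, baseline_probs, labels):
    """BSS = 1 - BS_model / BS_baseline; positive means the model beats the baseline."""
    reference = brier_score(baseline_probs, labels)
    if reference == 0:
        raise ValueError("Baseline is perfect; the skill score is undefined")
    return 1.0 - brier_score(model_probs, labels) / reference


# Hypothetical training rows and verification forecasts.
climatology = fit_climatology([("A", 7, 1), ("A", 7, 0), ("A", 7, 0), ("A", 7, 0)])
baseline = [climatology[("A", 7)], climatology[("A", 7)]]
bss = brier_skill_score([0.75, 0.25], baseline, [1, 0])
```

The same pattern applies to any other reference forecast, such as the raw rainfall exceedance probability, by swapping the baseline probabilities.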
6. The justification that "ensemble methods manage class imbalance effectively" is incomplete
L.206 cites Ayodele (2023) and Altalhan et al. (2025) for the claim, but neither shows that ensembles solve imbalance — at best, they manage it under appropriate hyperparameter and loss configurations.
7. ERA5 is the only meteorological input — but better data exist over CONUS
The CONUS is unusually well-observed by NEXRAD and MRMS, which provide quality-controlled rainfall at ~1 km / 2-min resolution. Using ERA5(-ecPoint) rainfall as the predictor over CONUS conflates two error sources: (i) ERA5's representation error, and (ii) the ML model's inability to map predictors to outcomes. A useful ablation would be to retrain the model with MRMS rainfall as input, holding everything else constant. If MRMS-driven skill is much higher, the bottleneck is the input data, not the model architecture, and operational improvements should be sought there. This would also clarify whether the improved performance with ecPoint reflects genuine sub-grid information or just bias correction.
8. Non-stationarity in the impact reports is unaddressed
Figure 2a shows a clear upward trend in geo-located reports from 2001 to 2024. This trend is dominated by changes in reporting (network expansion, smartphone-era citizen contributions, changes in NWS protocols) rather than by changes in flash flood frequency. Training on 2001–2020 and verifying on 2021–2024 means the model may be implicitly learning the reporting trend rather than the hazard. Please:
(i) Quantify the trend in reports per grid box (separately from event counts);
(ii) Test whether de-trending the labels changes model skill;
(iii) Discuss how non-stationarity in the labels themselves affects the validity of the FB and reliability scores.
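As a starting point for the first request, the trend in annual report counts for a grid box could be quantified with an ordinary least-squares slope, as in the sketch below. The counts are hypothetical, and a linear fit is only one of several reasonable choices (a rank-based trend test would be another).

```python
def linear_trend(years, counts):
    """Ordinary least-squares slope and intercept of counts against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("At least two distinct years are required")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical annual report counts for a single grid box.
slope, intercept = linear_trend([2001, 2002, 2003, 2004], [10, 12, 14, 16])
```

Mapping the slope per grid box, and comparing it with the slope of independent event counts, would show how much of the increase is attributable to reporting practice.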
Minor Comments:
1. The distinguishing criterion for flash floods vs. riverine floods should be clarified explicitly in the Introduction, not only in Table 1. At 31 km, the "rapid rise" / "small steep catchments" defining criteria do not directly apply to the predictand.
2. L.159–160: "The figure demonstrates the model's ability to capture this saturation pattern up to five days in advance." This is a qualitative visual statement. Please report a quantitative metric (e.g., spatial correlation, RMSE of PMSS forecast vs. reanalysis) for the demonstration to be persuasive.
3. L.183 (ecPoint): Is the procedure correctly described as generating 100 sub-grid realisations per ERA5 grid box, distilled to 99 percentiles (1st–99th)? That is essentially a 100× downscaling in the statistical sense. A one-sentence clarification would help readers unfamiliar with ecPoint.
4. LAI as a climatological field cannot capture post-fire or post-deforestation changes in vegetation, which are first-order drivers of debris flow and flash flood risk in the western US (e.g., Front Range) and in Mediterranean Europe. This limitation should be discussed alongside the LAI-as-seasonality-proxy issue raised in Section 6.2.
5. The 1-year and 50-year return period thresholds (Figures 6a, 7a) — please describe the derivation method (annual maxima, partial duration series, distribution fitted) in the main text or Appendix B.
6. Several static features are conspicuously absent: soil hydraulic properties beyond soil type code (saturated hydraulic conductivity, porosity), flow accumulation / upstream contributing area, and HAND (or another channel-network proximity metric). The 2025 Texas Hill Country event at Camp Mystic is a reminder that confluence geometry is first-order. These should at least be discussed as prospective additions in Section 6.5.
7. Figure 8 functions more as a textbook taxonomy than a results figure and could be moved to the supplement. Note also that in the standard ML taxonomy, "ensemble methods" comprise bagging, boosting, and stacking; feed-forward neural networks are not an ensemble method per se. The figure conflates "models tested" with "ensemble families" in a way that is mildly misleading.
8. Section 3.6: The TA2 design (labels in one region, model exposed to the full domain) treats absent labels as true non-events. The authors acknowledge this introduces bias. It is not surprising that TA2-1 (labels only in the data-poor west) collapses to no skill — there are far fewer positive cases there to begin with (Figure 2b shows 35.4% of all reports in SE alone). The TA2-1 result is partly a statement about CONUS report distribution, not just about training strategy. Consider rebalancing the TA2 design to vary label-source region while controlling for label count.
9. Figure 12 shows performance on both the training and verification datasets. The training-dataset overlay is useful for assessing the generalisation gap, but the figure is busy. Consider plotting training scores in the supplement and keeping the main figure focused on out-of-sample performance.
10. Throughout the figures: please replace machine-readable variable names (tp_prob_1, tp_prob_1_adj_gb, SDFOR, PMSS) with human-readable axis labels (e.g., "Probability of 1-yr RP rainfall exceedance"). Figure 13 in particular is opaque on first reading.
11. AUC-PR values (~0.01–0.03) approach the no-skill threshold for the imbalanced base rate. Please add the no-skill PR baseline value (= positive class frequency) explicitly to each PR panel, so readers can see the actual margin of skill.
12. Appendix A, Figure A2 caption uses "(d)" for what appears to be panel (b) in the figure — please check.
13. L.7: "flash occurrence" → "flash flood occurrence".
14. L.285: "the second approach (TA2) simulates the scenario of training a global model over the whole global domain" — should be the full CONUS domain, not the global domain.
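Two of the minor comments above (8 and 11) can be made concrete with a short sketch. The first function subsamples positive labels so that each label-source region contributes the same number of positives; the second gives the no-skill precision-recall baseline as the positive class frequency. Region names, counts, and the AUC-PR value are hypothetical, and the helper names are illustrative only.

```python
import random


def equalise_positive_counts(positives_by_region, seed=0):
    """Subsample each region's positives down to the smallest regional count.

    positives_by_region: dict mapping region name -> list of positive samples.
    """
    rng = random.Random(seed)
    target = min(len(rows) for rows in positives_by_region.values())
    return {
        region: rng.sample(rows, target)
        for region, rows in positives_by_region.items()
    }


def no_skill_pr_baseline(labels):
    """Precision of a no-skill forecast, equal to the positive class frequency."""
    return sum(labels) / len(labels)


# Hypothetical example: 100 positives in one region, 20 in another.
balanced = equalise_positive_counts({"SE": list(range(100)), "W": list(range(20))})

# Hypothetical verification labels with a 1% base rate and an AUC-PR of 0.02.
base_rate = no_skill_pr_baseline([1] * 10 + [0] * 990)
margin = 0.02 / base_rate
```

Drawing the base rate as a horizontal line on each PR panel, and repeating the TA2 experiments on count-matched labels, would separate the effect of training strategy from the effect of report density.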