Outrunning flash floods: XGBoost and sparse impact reports deliver global medium-range probabilistic forecasts of flash flood occurrence

Pillosu, Fatima M.; Claire, Mariana; Baugh, Calum; Pappenberger, Florian; Prudhome, Christel; Cloke, Hannah L.

doi:10.5194/egusphere-2026-1591

Preprints

https://doi.org/10.5194/egusphere-2026-1591

08 Apr 2026

Abstract. Flash floods are the world's most frequent and deadly type of flood. Yet, no medium-range forecasts of their occurrence exist over a continuous global domain – essential to fulfil the UN's "Early Warnings for All" target to protect everyone with early warning systems. This study addressed this gap in two phases. In a first phase, regional medium-range, data-driven forecasts of flash occurrence were developed by combining regional high-density, quality-controlled flash flood impact reports (e.g., NOAA's Storm Event Database over the Contiguous US) with global reanalysis and forecasts (e.g. from ERA5 for non-meteorological variables and ERA5-ecPoint for rainfall). Out of all the tested models, XGBoost gradient boosting achieved the best performance: it maintained high and constant discrimination skill across scores (e.g. ROC and Precision-Recall curves) and lead times, and forecast probabilities remained reliable below 10 % at day 1 and 2 % at day 5. In a second phase, a spatial-constrained sensitivity analysis evaluated how well the regional XGBoost model generalised to unseen regions. The sensitivity analysis revealed that a model trained on hydro-climatologically diverse and observation-dense sub-domains generalised better than those trained across the full domain with sparser data, suggesting a viable strategy for extending regionally trained forecasts of flash flood occurrence globally. Hence, this study provides the first empirical evidence that global, medium-range forecasts of flash flood occurrence are achievable with simple data-driven approaches and readily available data, closing one of the most pressing and long-standing gaps in modern hydrology.

Received: 23 Mar 2026 – Discussion started: 08 Apr 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 4878 KB)

Supplement (2100 KB)

Status: final response (author comments only)

RC1: 'Comment on egusphere-2026-1591', Zhi Li, 07 May 2026

The authors develop an XGBoost gradient-boosting model for probabilistic flash flood occurrence forecasts up to day 5, trained on the NOAA Storm Event Database (2001–2020) over CONUS and driven by ERA5 reanalysis/forecasts, with rainfall post-processed via the ecPoint statistical downscaling. They benchmark XGBoost against alternative gradient-boosting libraries, random forests, and a feed-forward neural network, and use SHAP for interpretation. A spatially constrained sensitivity analysis tests three strategies (uniform thinning, restricted-region labels with full-domain inputs, and restricted-region labels with restricted-region inputs) for transferring a regionally trained model to data-sparse domains. The 2024 Valencia flash flood is used as an out-of-distribution case study.

The motivating question — can a model trained on a data-rich region produce skilful flash flood forecasts elsewhere? — is timely and operationally important. The framing around the UN's "Early Warnings for All" target is appropriate, and the choice to predict impact occurrence (rather than discharge) is well-justified. However, several aspects of the methodology and interpretation require substantial strengthening before the central claims can be supported.
Major Comments:
1. The flash-flood definition is incompatible with the working spatial scale
Table 1 defines fluvial flash floods as occurring in catchments <500 km² with concentration times of minutes to hours. The model operates on the ERA5 reduced-Gaussian grid (~31 km, i.e., grid-box areas approaching 1000 km²). A single grid box typically contains many flash-flood-relevant catchments, and the model cannot resolve the catchment dynamics it claims to predict. This mismatch undermines both the construct validity of the predictand and the interpretation of "flash flood occurrence" at the grid scale. The authors briefly acknowledge resolution as a limitation in Section 6.4, but the issue deserves more prominent treatment up front: what exactly is being forecast at 31 km, and how does the binarisation procedure (Appendix A, Figure A1) reconcile point reports occurring in sub-grid catchments with a grid-cell yes/no label?
2. Population-density confounding may not be a peripheral artefact
The authors indicated that the SHAP-derived associations between high LAI / low SDFOR / saturated soils and flash flood occurrence likely reflect the spatial distribution of who reports, not where flash floods occur. But the implications go further than they acknowledge. If the model has learned a latent population-density proxy, then its skill outside CONUS will be highest precisely in densely populated, vegetated, low-relief regions and lowest in the data-sparse mountain and arid regions where flash flood mortality is concentrated globally (Libya 2023, Afghanistan 2024, parts of the Andes). This is the inverse of where additional warning capability is most needed. A useful diagnostic would be to retrain after stratifying by population density — for example, restricting positive labels to grid boxes with population below a threshold — and report whether the SHAP rankings change. If the rankings flip toward steeper terrain and sparser vegetation, the population-confounding hypothesis is confirmed, and the global transferability claims need to be substantially softened.
3. The frequency bias
Figure 12c shows FB ≈ 1.2 for reanalysis but ~2.5 for forecasts at all lead times — a 150% overprediction of yes-events. Section 6.1 frames the choice of balanced loss as preferable because weighted loss inflates false alarm rates twentyfold, but FB = 2.5 is not a minor cost; in an operational early-warning context, persistent overprediction at this magnitude erodes trust as severely as the false-alarm problem the authors warn against.
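For readers less familiar with the score, the frequency bias follows from the standard 2×2 contingency table as FB = (hits + false alarms) / (hits + misses). The sketch below uses made-up counts, chosen only to reproduce FB = 2.5 and not taken from the manuscript, to show why that value corresponds to 150 % more yes-forecasts than observed yes-events.

```python
def frequency_bias(hits: int, false_alarms: int, misses: int) -> float:
    """Frequency bias: number of yes-forecasts divided by number of yes-observations."""
    observed_yes = hits + misses
    if observed_yes == 0:
        raise ValueError("frequency bias is undefined without observed yes-events")
    return (hits + false_alarms) / observed_yes


def overprediction_percent(fb: float) -> float:
    """Excess of yes-forecasts over yes-observations, in percent."""
    return (fb - 1.0) * 100.0


# Hypothetical counts (illustrative assumption, not manuscript values).
fb = frequency_bias(hits=60, false_alarms=190, misses=40)
print(fb, overprediction_percent(fb))
```

FB alone does not say whether the extra yes-forecasts are near misses or unrelated false alarms, so it is best read alongside the false alarm ratio.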
4. Cross-validation does not respect spatial and temporal structure

The repeated stratified k-fold procedure (Figure 9) assumes IID samples. Adjacent grid boxes within the same storm system are strongly spatially autocorrelated, and 24-hourly periods within a multi-day event are temporally autocorrelated. Random folds will place training and test samples from the same physical event in different folds, inflating the apparent skill. A spatial blocked CV (e.g., leave-region-out, or buffered spatial folds) and event-based temporal CV would give more honest performance estimates. This is particularly important because the central transferability claim requires accurate generalisation estimates.
On a related note, multi-day event expansion may introduce label leakage and double-counting. For a hurricane producing flash floods over five consecutive days, this generates five training rows whose meteorological predictors and impact labels are not independent. This both inflates the apparent class frequency (mildly mitigating the imbalance the loss function is designed to handle) and introduces correlated samples into the training set. The authors should report the distribution of event durations and quantify how many "samples" actually correspond to independent meteorological events.
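One way to picture the blocked design suggested here is to give every grid-box-day a group identifier (a storm event, or a spatial block) and to keep each group inside a single fold. The sketch below is a minimal pure-Python illustration under that assumption; the event identifiers are hypothetical and the greedy balancing rule is a simplification, not the authors' procedure.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs such that no group is split across folds."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Greedy balancing: largest groups first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[group])

    for f in range(n_folds):
        test = sorted(folds[f])
        train = sorted(i for other in range(n_folds) if other != f for i in folds[other])
        yield train, test


# Hypothetical event identifiers: one id per storm system, shared by all
# grid-box-days that belong to it.
event_ids = ["ev1", "ev1", "ev1", "ev2", "ev2", "ev3", "ev4", "ev4"]
splits = list(grouped_folds(event_ids, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

scikit-learn's `GroupKFold` offers the same guarantee; a buffered spatial variant would additionally drop training samples within some distance of the test block.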
5. No baseline beyond the no-skill diagonal
AUC-ROC of 0.8 is a meaningless number without a reference point. A logistic regression on the same features, climatology of flash-flood frequency by month and grid box, or even raw rainfall return-period exceedance probability used directly as the forecast probability would give a defensible baseline. The Valencia case (Section 5.2) actually provides a hint that the model's added value over raw rainfall exceedance is real (downstream propagation via the adjacent-grid-box feature), but a domain-wide quantitative baseline comparison is needed to support the headline claim that XGBoost adds skill.
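The comparison asked for here amounts to scoring a reference forecast and the model forecast against the same labels. The sketch below computes AUC-ROC with the rank-sum (Mann-Whitney) identity and applies it to two toy forecasts; the labels and probabilities are invented for illustration, and "baseline" stands in for something like raw rainfall exceedance probability.

```python
def roc_auc(labels, scores):
    """AUC-ROC via the rank-sum identity, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one event and one non-event")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data (assumed for illustration only).
labels = [0, 0, 1, 0, 1, 0, 0, 1]
baseline = [0.1, 0.4, 0.3, 0.2, 0.8, 0.5, 0.1, 0.6]
model = [0.1, 0.2, 0.6, 0.1, 0.9, 0.65, 0.2, 0.7]

auc_baseline = roc_auc(labels, baseline)
auc_model = roc_auc(labels, model)
print(auc_baseline, auc_model, auc_model - auc_baseline)
```

Reporting the difference between the two scores, rather than the model score alone, is what gives a value such as 0.8 a reference point.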
6. The "ensemble methods manage class imbalance effectively" justification is incomplete

L.206 cites Ayodele (2023) and Altalhan et al. (2025) for the claim, but neither shows that ensembles solve imbalance — at best, they manage it under appropriate hyperparameter and loss configurations.
7. ERA5 is the only meteorological input — but better data exist over CONUS
The CONUS is unusually well-observed by NEXRAD and MRMS, which provide quality-controlled rainfall at ~1 km / 2-min resolution. Using ERA5(-ecPoint) rainfall as the predictor over CONUS conflates two error sources: (i) ERA5's representation error, and (ii) the ML model's inability to map predictors to outcomes. A useful ablation would be to retrain the model with MRMS rainfall as input, holding everything else constant. If MRMS-driven skill is much higher, the bottleneck is the input data, not the model architecture, and operational improvements should be sought there. This would also clarify whether the improved performance with ecPoint reflects genuine sub-grid information or just bias correction.

8. Non-stationarity in the impact reports is unaddressed
Figure 2a shows a clear upward trend in geo-located reports from 2001 to 2024. This trend is dominated by changes in reporting (network expansion, smartphone-era citizen contributions, changes in NWS protocols) rather than by changes in flash flood frequency. Training on 2001–2020 and verifying on 2021–2024 means the model may be implicitly learning the reporting trend rather than the hazard. Please:
Quantify the trend in reports per grid box (separately from event counts);

Test whether de-trending the labels changes model skill;

Discuss how non-stationarity in the labels themselves affects the validity of the FB and reliability scores.

Minor Comments:
1. The distinguishing criterion for flash floods vs. riverine floods should be clarified explicitly in the Introduction, not only in Table 1. At 31 km, the "rapid rise" / "small steep catchments" defining criteria do not directly apply to the predictand.
2. L.159–160: "The figure demonstrates the model's ability to capture this saturation pattern up to five days in advance." This is a qualitative visual statement. Please report a quantitative metric (e.g., spatial correlation, RMSE of PMSS forecast vs. reanalysis) for the demonstration to be persuasive.
3. L.183 (ecPoint): Is the procedure correctly described as generating 100 sub-grid realisations per ERA5 grid box, distilled to 99 percentiles (1st–99th)? That is essentially a 100× downscaling in the statistical sense. A one-sentence clarification would help readers unfamiliar with ecPoint.
4. LAI as a climatological field cannot capture post-fire or post-deforestation changes in vegetation, which are first-order drivers of debris flow and flash flood risk in the western US (e.g., Front Range) and in Mediterranean Europe. This limitation should be discussed alongside the LAI-as-seasonality-proxy issue raised in Section 6.2.
5. The 1-year and 50-year return period thresholds (Figures 6a, 7a) — please describe the derivation method (annual maxima, partial duration series, distribution fitted) in the main text or Appendix B.
6. Static features that are conspicuously absent: soil hydraulic properties beyond soil type code (saturated hydraulic conductivity, porosity), flow accumulation / upstream contributing area, and HAND (or another channel-network proximity metric). The 2025 Texas Hill Country event at Camp Mystic is a reminder that confluence geometry is first-order. These should at least be discussed as prospective additions in Section 6.5.
7. Figure 8 functions more as a textbook taxonomy than a results figure and could be moved to the supplement. Note also that in the standard ML taxonomy, "ensemble methods" comprise bagging, boosting, and stacking; feed-forward neural networks are not an ensemble method per se. The figure conflates "models tested" with "ensemble families" in a way that is mildly misleading.
8. Section 3.6: The TA2 design (labels in one region, model exposed to the full domain) treats absent labels as true non-events. The authors acknowledge this introduces bias. It is not surprising that TA2-1 (labels only in the data-poor west) collapses to no skill — there are far fewer positive cases there to begin with (Figure 2b shows 35.4% of all reports in SE alone). The TA2-1 result is partly a statement about CONUS report distribution, not just about training strategy. Consider rebalancing the TA2 design to vary label-source region while controlling for label count.
9. Figure 12 shows performance on both the training and verification datasets. The training-dataset overlay is useful for assessing the generalisation gap, but the figure is busy. Consider plotting training scores in the supplement and keeping the main figure focused on out-of-sample performance.
10. Throughout the figures: please replace machine-readable variable names (tp_prob_1, tp_prob_1_adj_gb, SDFOR, PMSS) with human-readable axis labels (e.g., "Probability of 1-yr RP rainfall exceedance"). Figure 13 in particular is opaque on first reading.
11. AUC-PR values (~0.01–0.03) approach the no-skill threshold for the imbalanced base rate. Please add the no-skill PR baseline value (= positive class frequency) explicitly to each PR panel, so readers can see the actual margin of skill.
12. Appendix A, Figure A2 caption uses "(d)" for what appears to be panel (b) in the figure — please check.
13. L.7: "flash occurrence" → "flash flood occurrence".
14. L.285: "the second approach (TA2) simulates the scenario of training a global model over the whole global domain" — should be the full CONUS domain, not the global domain.

Citation: https://doi.org/10.5194/egusphere-2026-1591-RC1
RC2: 'Comment on egusphere-2026-1591', Anonymous Referee #2, 09 Jul 2026

The manuscript addresses an important problem: developing medium-range probabilistic forecasts of flash-flood occurrence as a complement to existing catchment-scale or short-lead warning systems. Its use of impact reports as the prediction target, combined with global hydro-meteorological predictors and a relatively simple machine-learning framework, is a promising and practically relevant approach. The manuscript is generally well motivated and contains substantial verification and interpretation work. The main concerns relate to the interpretation and evidential scope of the results, particularly given the use of reported impacts as the target variable and the ambition to draw conclusions about global forecast feasibility.
Major comments
Evidence boundary for the global feasibility claim.

The manuscript’s strongest claim is that it provides empirical evidence for the feasibility of global, medium-range flash-flood occurrence forecasts. This is an important claim, but the evidential boundary is not fully clear. The quantitative verification is performed over the CONUS, while the extra-CONUS evidence appears to consist of a single qualitative case study, the 2024 Valencia event. The TA1–TA3 sensitivity experiments are informative, but they remain internal experiments within the same broad geographical domain and reporting system used for model development. It is therefore difficult to assess whether the results demonstrate global forecast feasibility in a general sense, or rather a CONUS-based feasibility demonstration with one out-of-domain illustration. This distinction matters because the abstract and conclusions use broad language suggesting that global, medium-range flash-flood occurrence forecasts are achievable (e.g. L12–14; conclusions around L607–614).
Interpretation of the prediction target: reported occurrence versus physical occurrence.

The model is trained against impact-report occurrence, defined as at least one flash-flood report within an ERA5 grid box and 24-hour period. This is a useful target, but it is not the same as physical flash-flood occurrence. The distinction matters because impact reports are affected by population density, exposure, reporting practices, and timing or location errors, as acknowledged in the data description.
This issue also affects the interpretation of the SHAP results. While the dominant role of rainfall is physically plausible, the counterintuitive behaviour of LAI and SDFOR suggests that the model may also be learning reporting structure, seasonality, or spatial exposure embedded in the database. The interpretation of the output as “flash-flood occurrence probability” therefore seems somewhat broader than the target variable itself supports, particularly when the model is discussed in the context of transfer beyond the training region.
Relatedly, the scale of the forecast product is easy to misinterpret. The Introduction motivates the study partly through the inability of existing systems to resolve small flashy catchments, whereas the proposed product is an ERA5-grid-box, 24-hour impact-occurrence probability forecast. This is a useful product definition, but it is different from resolving sub-grid catchment runoff processes or outlet-scale flash-flood dynamics.
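To make the product definition concrete, the target described above (at least one report within a grid box and 24-hour period) can be sketched as a mapping from point reports to grid-box-day keys. The sketch below rests on simplifying assumptions that are not from the manuscript: a regular 0.25° latitude/longitude grid instead of ERA5's native grid, calendar days in the timestamps' own time zone as the 24-hour periods, and invented report coordinates.

```python
import math
from datetime import datetime

CELL_DEG = 0.25  # assumed regular grid spacing, for illustration only


def grid_box_day(lat, lon, when):
    """Map a point report to a (row, col, date) key on a regular lat/lon grid."""
    row = math.floor((lat + 90.0) / CELL_DEG)
    col = math.floor((lon % 360.0) / CELL_DEG)
    return row, col, when.date()


def positive_labels(reports):
    """Grid-box-days with at least one report get label 1; every other box-day is 0."""
    return {grid_box_day(lat, lon, when) for lat, lon, when in reports}


# Hypothetical reports: (latitude, longitude, timestamp).
reports = [
    (35.10, -97.40, datetime(2019, 5, 20, 14, 0)),
    (35.20, -97.30, datetime(2019, 5, 20, 18, 30)),  # same box, same day
    (35.10, -97.40, datetime(2019, 5, 21, 2, 0)),    # same box, next day
]
labels = positive_labels(reports)
print(len(labels))
```

Two reports in different sub-grid catchments collapse into one positive label here, which is the sense in which the product describes reported impact occurrence per box and day rather than catchment-scale dynamics.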
Operational interpretation of skill and reliability under severe class imbalance.

The verification results suggest useful discrimination, especially in terms of AUC-ROC, but the operational meaning of the reported skill is less clear. The AUC-PR values are low in absolute terms but difficult to interpret without the positive-event base rate; the forecast frequency bias indicates over-forecasting, and reliability appears to hold mainly within low probability ranges, below about 10% at day 1 and about 2% by day 5 (around L338–350). For such a severely imbalanced warning problem, these details are important for interpreting the practical value of the forecasts. Without the grid-box-day sample size and the positive-event base rate, it is difficult to assess how informative the reported AUC-PR values are. The broad statements about skilful and reliable forecasts are therefore difficult to interpret in operational terms, especially where false alarms would matter for warning decisions.
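The dependence of AUC-PR on the base rate can be illustrated with a small calculation: a no-skill forecast has an expected AUC-PR equal to the positive-event frequency, so the informative quantity is the ratio between the two. The sketch below uses synthetic labels and scores (assumed for illustration, with no tied scores) and a simple average-precision estimate of the area under the PR curve.

```python
def base_rate(labels):
    """Positive-event frequency, i.e. the no-skill AUC-PR level."""
    return sum(labels) / len(labels)


def average_precision(labels, scores):
    """Mean of the precision values reached at each event, ranked by descending score."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


# Synthetic example: 200 grid-box-days, 2 events ranked 1st and 10th.
n = 200
labels = [0] * n
labels[0] = 1
labels[9] = 1
scores = [1.0 - i / n for i in range(n)]

rate = base_rate(labels)
ap = average_precision(labels, scores)
print(rate, ap, ap / rate)
```

With the sample size and base rate reported, the same ratio could be read directly off each PR panel.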
Ambiguity in the nested cross-validation workflow.

The description of the nested cross-validation workflow is ambiguous. The main text states that the inner loop is used for hyperparameter tuning and the outer loop for generalization assessment, but the Fig. 9 caption says that trial performance was measured over the outer test subset. This wording seems inconsistent with the usual nested-CV design, where the outer test fold should only be used for the outer-loop assessment. This may simply be a wording issue, but it is important because the nested design is central to avoiding data leakage and overoptimistic performance estimates.
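For reference, the usual nested design can be sketched in a few lines: the inner loop selects a hyperparameter using only the outer-training indices, and the outer test fold is touched once, for the final score. The toy threshold model, the weight hyperparameter `w`, and the data below are illustrative assumptions, not the manuscript's XGBoost workflow.

```python
def k_folds(indices, k):
    """Split indices into k interleaved folds and yield (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def fit_threshold(x, y, idx, w):
    """Toy model: a cut-off between the class means, weighted by hyperparameter w."""
    pos = [x[i] for i in idx if y[i]]
    neg = [x[i] for i in idx if not y[i]]
    return w * sum(pos) / len(pos) + (1 - w) * sum(neg) / len(neg)


def accuracy(threshold, x, y, idx):
    return sum((x[i] >= threshold) == bool(y[i]) for i in idx) / len(idx)


def nested_cv(x, y, candidates, k_outer=3, k_inner=2):
    outer_scores = []
    for outer_train, outer_test in k_folds(list(range(len(x))), k_outer):
        # Inner loop: tuning sees outer_train only, never outer_test.
        def inner_score(w):
            scores = [accuracy(fit_threshold(x, y, tr, w), x, y, val)
                      for tr, val in k_folds(outer_train, k_inner)]
            return sum(scores) / len(scores)

        best_w = max(candidates, key=inner_score)
        # Outer loop: refit on all outer-training data, score once on held-out data.
        model = fit_threshold(x, y, outer_train, best_w)
        outer_scores.append(accuracy(model, x, y, outer_test))
    return outer_scores


# Toy data (assumed): one predictor, binary labels.
x = [0.1, 0.2, 0.8, 0.9, 0.15, 0.25, 0.85, 0.95, 0.3, 0.05, 0.7, 0.75]
y = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
scores = nested_cv(x, y, candidates=[0.25, 0.5, 0.75])
print(scores)
```

If trial performance were measured on `outer_test` inside `inner_score`, the outer scores would no longer be independent of the tuning, which is the leakage risk the wording of the caption leaves open.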
Precision of novelty and contribution claims.

The manuscript makes broad claims about novelty and contribution, particularly through the “first empirical evidence” framing. The study’s contribution is important, but the exact scope of this “first” claim is not fully clear. The novelty appears to lie in the combination of impact-report targets, global predictors, medium-range lead times, and a continuous-domain forecasting framework, rather than in any single component alone. A clearer distinction between what is new in this study and what is already covered by existing regional, machine-learning, or global-rainfall-based flash-flood guidance work would help place the manuscript more precisely.
Minor comments
Figure, caption, and cross-reference consistency.

Fig. 18 is difficult to follow because the caption, text, and plotted panels do not appear to match. The figure contains panels a–g, whereas the caption describes only a–e. The text appears to refer to panels b–c as rainfall exceedance probabilities, panel d as the reanalysis prediction, and panels e–g as forecasts. The day-5 initialization/lead-time label also appears unclear.

In the TA1 description, the retained fractions are given as 90%, 50%, and 90% (L280–282), whereas Fig. 10 describes reductions of 10%, 50%, and 90%. The last retained fraction appears inconsistent and may be 10%.

The SHAP discussion refers to Fig. 13e–g, but Fig. 13 appears to contain only panels a–d.

The text appears to refer to Fig. 7a as an exceedance-probability panel, but Fig. 7a is the 50-year return-period threshold map. The intended reference seems to be Fig. 7b.

Some internal section references also appear inconsistent, for example the references to Section 4.1/4.2 and to results in Section 3.4.

Additional reporting details for interpretation and reproducibility.

The probability threshold selected by maximizing F1 is not clearly reported. Since binary forecasts and frequency bias depend on the thresholding procedure, the selected threshold value would be useful.

The normalization of the SHAP feature-importance percentages is unclear. The listed values exceed 100% when summed, so it is not obvious whether they are relative to the most important feature, normalized in another way, or reported on a different scale.

The workflow is described as repeated cross-validation, but Fig. 9 gives n_repeats = 1. This may be technically implemented through RepeatedStratifiedKFold, but the wording could be confusing.

The grid-box-day sample size and positive-event base rate would be useful to report together, especially for interpreting AUC-PR in the severely imbalanced setting.
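As an illustration of why the selected threshold matters, the sketch below scans the distinct forecast probabilities and returns the one with the highest F1. The labels and probabilities are invented; the manuscript's actual threshold and search procedure are not reported here, so this is only one plausible reading of "maximizing F1".

```python
def f1_at(threshold, labels, probs):
    """F1 score of the binary forecast obtained by cutting probabilities at threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and not y)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(labels, probs):
    """Return (threshold, F1) for the candidate threshold with the highest F1."""
    candidates = sorted(set(probs))
    return max(((t, f1_at(t, labels, probs)) for t in candidates),
               key=lambda pair: pair[1])


# Toy forecast probabilities and observed labels (assumed for illustration).
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.20, 0.40]

threshold, f1 = best_f1_threshold(labels, probs)
print(threshold, f1)
```

Because the binary forecasts, and therefore the frequency bias, change with this cut-off, quoting its value alongside the base rate would let readers reproduce the contingency tables.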

Wording and numerical consistency.

The neural-network training-time comparison is difficult to reconcile across the main text and Supplement. It is not always clear whether the reported values refer to individual trials, one outer fold, or the full tuning process.

Some broad rhetorical phrases, such as “closing one of the most long-standing gaps”, seem closely related to the evidence-boundary issue raised in the major comments and may benefit from more precise wording.

In the abstract, “flash occurrence” should presumably read “flash flood occurrence”.

The Valencia damage estimate of “16.5 billion” lacks a currency symbol.
Citation: https://doi.org/10.5194/egusphere-2026-1591-RC2

Supplement

https://doi.org/10.5194/egusphere-2026-1591-supplement

Viewed

Total article views: 665 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
337	300	28	665	117	23	20

Views and downloads (calculated since 08 Apr 2026)

Month	HTML	PDF	XML	Total
Apr 2026	182	84	15	281
May 2026	117	127	4	248
Jun 2026	17	17	4	38
Jul 2026	21	72	5	98

Viewed (geographical distribution)

Total article views: 593 (including HTML, PDF, and XML). Thereof 593 with geography defined and 0 with unknown origin.

Latest update: 22 Jul 2026

Short summary

This paper presents the first global, medium-range probabilistic forecasts of flash flood occurrence. A single XGBoost model trained on coarse impact reports achieves skilful predictions, challenging assumptions that complex architectures and high-resolution data are required. At a time of heightened attention to flash flood risk and early warning, this work demonstrates that skilful global forecasting is achievable in data-sparse regions where flash flood risk is highest.

