Outrunning flash floods: XGBoost and sparse impact reports deliver global medium-range probabilistic forecasts of flash flood occurrence
Abstract. Flash floods are the world's most frequent and deadly type of flood. Yet no medium-range forecasts of their occurrence exist over a continuous global domain, although such forecasts are essential to fulfil the UN's "Early Warnings for All" target of protecting everyone with early warning systems. This study addressed this gap in two phases. In the first phase, regional, medium-range, data-driven forecasts of flash flood occurrence were developed by combining regional high-density, quality-controlled flash flood impact reports (e.g. NOAA's Storm Events Database over the Contiguous US) with global reanalysis and forecasts (e.g. ERA5 for non-meteorological variables and ERA5-ecPoint for rainfall). Of all the tested models, XGBoost gradient boosting achieved the best performance: it maintained high and constant discrimination skill across scores (e.g. ROC and Precision-Recall curves) and lead times, and forecast probabilities remained reliable below 10 % at day 1 and below 2 % at day 5. In the second phase, a spatially constrained sensitivity analysis evaluated how well the regional XGBoost model generalised to unseen regions. The analysis revealed that models trained on hydro-climatologically diverse and observation-dense sub-domains generalised better than those trained across the full domain with sparser data, suggesting a viable strategy for extending regionally trained forecasts of flash flood occurrence globally. Hence, this study provides the first empirical evidence that global, medium-range forecasts of flash flood occurrence are achievable with simple data-driven approaches and readily available data, closing one of the most pressing and long-standing gaps in modern hydrology.
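The two verification properties named above, discrimination (ROC) and reliability, can be illustrated with a minimal, standard-library sketch. This is not the study's verification code: the function names `roc_auc` and `reliability_table`, the bin count, and the toy forecast/observation pairs are assumptions made purely for illustration.

```python
from collections import defaultdict


def roc_auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    Equals the probability that a randomly chosen event received a higher
    forecast probability than a randomly chosen non-event.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def reliability_table(y_true, y_prob, n_bins=10):
    """Per occupied probability bin: (mean forecast, observed frequency, count).

    A forecast is reliable in a bin when the mean forecast probability is
    close to the observed event frequency.
    """
    bins = defaultdict(list)
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_p, freq, len(pairs)))
    return rows


# Toy data (hypothetical): 1 = flash flood reported, 0 = no report.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.05, 0.06, 0.08, 0.45, 0.75]

auc = roc_auc(y_true, y_prob)
rows = reliability_table(y_true, y_prob)
print(f"ROC AUC: {auc:.3f}")
for mean_p, freq, count in rows:
    print(f"forecast={mean_p:.3f} observed={freq:.3f} n={count}")
```

For a rare event such as a flash flood, most forecasts fall in the lowest probability bins, so in practice those bins usually hold far more cases than the higher ones.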