Using machine learning for the prediction of flood-related 112 calls
Abstract. As weather-related disasters become more frequent and severe, there is a growing global push toward impact-based early warning systems, exemplified by initiatives such as EW4All. This transition positions machine learning (ML) and artificial intelligence (AI) as powerful tools for integrating meteorological hazard data with information on vulnerability and exposure into data-driven forecasting systems. In this work, we explore the use of 112 emergency calls as high-resolution impact proxies for an ML-based prediction problem. Specifically, we develop a model that combines rainfall-related weather data and static vulnerability-exposure layers to predict, at municipal and hourly resolution, whether flood-related impacts will occur in the next hour. The study covers more than six years (October 2018 to February 2025) in Catalonia, northeastern Spain.
To address the severe temporal class imbalance and the uncertainty inherent in emergency-call data, we define a custom walk-forward evaluation scheme that ensures the same number of positive samples across comparable time periods. We then divide municipalities into three population density groups (low, medium, and high) and train one model per group. This stratification enables us to evaluate performance across diverse population dynamics and varying data availability. The resulting models are compared against operational methodologies, such as climatology-based weather warnings issued by meteorological agencies. Our results show that the ML approach substantially improves on these baselines in two of the three groups. The model for the lowest-density group, however, struggles because impact data are scarce, highlighting a key roadblock for data-driven algorithm development in sparsely populated regions.
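As an illustration of the evaluation idea, the following is a minimal sketch of a walk-forward split in which every test block holds the same number of positive samples. It is not the scheme used in the study: the function names, the block count, and the rule that the final block absorbs any leftover positives are all assumptions made for the example.

```python
from typing import List, Tuple


def equal_positive_boundaries(labels: List[int], n_blocks: int) -> List[int]:
    """Return exclusive end indices of consecutive time blocks.

    `labels` must be sorted by time and contain 0/1 values. Each block
    closes right after its share of positives has been seen. Assumption:
    the last block absorbs any positives left over by integer division.
    """
    per_block = sum(labels) // n_blocks
    if per_block == 0:
        raise ValueError("not enough positive samples for this many blocks")
    boundaries = []
    seen = 0
    for i, y in enumerate(labels):
        seen += y
        # Close a block only on the positive that completes its quota.
        if y == 1 and seen % per_block == 0 and len(boundaries) < n_blocks - 1:
            boundaries.append(i + 1)
    boundaries.append(len(labels))
    return boundaries


def walk_forward_splits(labels: List[int], n_blocks: int) -> List[Tuple[range, range]]:
    """Expanding-window splits: train on all earlier blocks, test on the next."""
    bounds = equal_positive_boundaries(labels, n_blocks)
    splits = []
    for k in range(1, len(bounds)):
        train = range(0, bounds[k - 1])
        test = range(bounds[k - 1], bounds[k])
        splits.append((train, test))
    return splits


# Hypothetical hourly labels (1 = at least one flood-related call).
example_labels = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
for train_idx, test_idx in walk_forward_splits(example_labels, n_blocks=3):
    positives = sum(example_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Because block boundaries follow the positives rather than the calendar, test blocks can differ in duration while remaining comparable in the number of impact events they contain.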
To better understand the models and improve trust and explainability, we perform three sets of experiments: a feature importance analysis using SHAP (SHapley Additive exPlanations), ablation studies over different feature groups, and training models on individual feature sets. These results show how combining varied data sources (such as weather radar, station sensors, or call history) can yield more powerful predictions than using any single source in isolation.
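The ablation and single-feature-set experiments can be pictured with a short sketch. Everything here is hypothetical: the group and column names are invented, and `fit_and_score` stands for whatever routine trains a model on a given set of columns and returns a validation score. The toy scorer at the end only exercises the loop and is not a model.

```python
from typing import Callable, Dict, List, Tuple


def ablation_study(
    feature_groups: Dict[str, List[str]],
    fit_and_score: Callable[[List[str]], float],
) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Score the full feature set, then each group removed and each group alone."""
    all_columns = [c for cols in feature_groups.values() for c in cols]
    full_score = fit_and_score(all_columns)
    report = {}
    for name, cols in feature_groups.items():
        kept = [c for c in all_columns if c not in cols]
        report[name] = {
            # Ablation: train on everything except this group.
            "without_group": fit_and_score(kept),
            # Individual feature set: train on this group only.
            "group_only": fit_and_score(list(cols)),
        }
    return full_score, report


# Hypothetical feature groups and a toy stand-in for model training.
groups = {
    "radar": ["radar_max_1h"],
    "stations": ["gauge_sum_1h"],
    "calls": ["calls_prev_24h"],
}
toy_weights = {"radar_max_1h": 3, "gauge_sum_1h": 2, "calls_prev_24h": 1}


def toy_score(columns: List[str]) -> float:
    """Toy scorer: sums fixed weights instead of training a real model."""
    return float(sum(toy_weights[c] for c in columns))


full, report = ablation_study(groups, toy_score)
print(full, report)
```

Comparing `without_group` against the full score indicates how much a data source contributes in combination, while `group_only` indicates how far it gets on its own.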
Finally, we present a methodology to evaluate model behaviour across rainfall event stages, as performance is expected to vary throughout an event's evolution. We distinguish five stages based on observed rain in the previous and following hours: the first hour with rain, intermediate hours, the last hour with rain, the hours immediately after the event, and hours without rain. Evaluating all approaches following this framework adds a valuable dimension to the performance analysis and further improves explainability. The results demonstrate that our models outperform the baselines across all event stages, from the initial onset of rain to the hours after precipitation has stopped. This highlights the strong potential of even relatively simple ML pipelines to deliver timely, localized anticipation of weather-related impacts.
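The five stages can be sketched as a simple labelling rule over an hourly rainfall series for one municipality. This is an illustrative reading rather than the study's definition: the wet threshold, the length of the post-event window (`post_hours`), the stage names, and the choice to label a single-hour event as "first" are all assumptions.

```python
from typing import List


def label_event_stages(
    rain_mm: List[float],
    wet_threshold: float = 0.0,
    post_hours: int = 3,
) -> List[str]:
    """Assign one of five stage labels to each hour of a rainfall series.

    Assumptions: an hour is wet if rain exceeds `wet_threshold`; the
    post-event window covers `post_hours` dry hours after the last wet
    hour; hours outside the series are treated as dry.
    """
    wet = [r > wet_threshold for r in rain_mm]
    stages = []
    for t, is_wet in enumerate(wet):
        prev_wet = t > 0 and wet[t - 1]
        next_wet = t + 1 < len(wet) and wet[t + 1]
        if is_wet:
            if not prev_wet:
                stage = "first"          # onset of rain
            elif next_wet:
                stage = "intermediate"   # rain before and after
            else:
                stage = "last"           # rain stops after this hour
        else:
            # Dry hour: check whether rain fell within the recent window.
            recently_wet = any(wet[max(0, t - post_hours):t])
            stage = "post_event" if recently_wet else "dry"
        stages.append(stage)
    return stages


# Hypothetical hourly rainfall in millimetres.
example_rain = [0.0, 0.0, 1.2, 3.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
print(label_event_stages(example_rain, post_hours=2))
```

With such labels attached to every municipality-hour, any metric can be computed per stage simply by grouping predictions on the stage label.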