the Creative Commons Attribution 4.0 License.
Using machine learning for the prediction of flood-related 112 calls
Abstract. As weather-related disasters become more frequent and severe, there is a growing global push toward impact-based early warning systems, exemplified by initiatives such as EW4All. This transition positions machine learning (ML) and artificial intelligence (AI) as powerful tools for integrating meteorological hazard data with information on vulnerability and exposure into data-driven forecasting systems. In this work, we explore the use of 112 emergency calls as high-resolution impact proxies for an ML-based prediction problem. Specifically, we develop a model that combines rainfall-related weather data and static vulnerability-exposure layers to predict, at a municipal and hourly resolution, whether flood-related impacts will occur in the next hour. This study spans a period of over six years (October 2018 to February 2025) in Catalonia, northeastern Spain.
To address the severe temporal class imbalance and the uncertainty inherent in emergency call data, we define a custom walk-forward evaluation scheme that ensures the same number of positive samples across comparable time periods. We then distribute municipalities into three distinct population density groups (low, medium, and high) and train one model per group. This stratification enables us to evaluate performance across diverse population dynamics and varying data availability. The resulting models are compared against operational methodologies, such as climatology-based weather warnings issued by meteorological agencies. Our results show that the ML approach represents a substantial improvement in two of the three groups. The model for the lowest-density group, however, struggles due to a substantial lack of impact data, highlighting a key roadblock for data-driven algorithm development in sparsely populated regions.
To gain a more complete understanding and improve model trust and explainability, we perform a series of experiments: a feature importance analysis using SHAP (SHapley Additive exPlanations), ablation studies over different feature groups, and training models on individual feature sets. From these results, we can ascertain how the combination of varied data sources (such as weather radar, station sensors, or call history) can result in more powerful predictions than using single sources in isolation.
Finally, we present a methodology to evaluate model behaviour across rainfall event stages, as performance is expected to vary throughout an event's evolution. We distinguish five stages based on observed rain in the previous and following hours: the first hour with rain, intermediate hours, the last hour with rain, the hours immediately after the event, and hours without rain. Evaluating all approaches following this framework adds a valuable dimension to the performance analysis and further improves explainability. The results demonstrate that our models outperform the baselines across all event stages, from the initial onset of rain to the hours after precipitation has stopped. This highlights the strong potential of even relatively simple ML pipelines to deliver timely, localized anticipation of weather-related impacts.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2026-1253', Anonymous Referee #1, 29 Apr 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1253/egusphere-2026-1253-RC1-supplement.pdf
Citation: https://doi.org/10.5194/egusphere-2026-1253-RC1
-
AC1: 'Reply on RC1', Jordi Morales Casas, 13 May 2026
Dear Referee #1,
Thank you for taking the time to consider our work, and for the positive feedback and critiques provided in your comments. We have written a short reply below to continue the discussion while we wait for the Discussion period to end.
Sincerely,
The authors
Comment 1. Operational validity and target definition
The manuscript is framed as an impact-based early-warning study, but some modelling choices make the prediction task closer to detecting or updating already emerging impacts. In particular, the model uses flood-related emergency calls from the previous 1–3 hours as predictors, which may strongly benefit cases where impacts have already started. The authors should distinguish more clearly between predicting the first occurrence of impacts and predicting the continuation of ongoing impacts. Additional experiments without previous-call features, or separate results for “before first call” and “after first call” situations, would make the operational contribution clearer. The daily rain filter also requires clarification: if it uses full-day information retrospectively, the evaluation may not represent a real-time forecasting setting. The definition of the positive class, positive rate, class weighting, and evaluation metrics should also be stated more explicitly.
Response:
Operational validity is a major focus of our work, and we thank the reviewer for raising this issue. Regarding the use of emergency call data from the previous 1-3 hours as predictors, we acknowledge that these features contribute significantly to the models' performance. To quantify this, we conducted ablation experiments in Section 6.2.3 in which the models were trained without previous call data, and observed a change in CSI for the high-density model of -11.6 ± 4.95, which allowed us to identify these features as the most influential. We chose to keep them in the final configuration because they would be available during an operational deployment, potentially benefiting situations with unfolding impacts.
To address the distinction between predicting the first occurrence of impacts versus their continuation, we conducted the study in Section 6.2.4 (Model performance depending on rain stage). By presenting the performance of our high-density model (which obtained the best results and most consistent explanations) across different rainfall stages, we believe the operational concerns of the reviewer are answered. More specifically, we identify a performance drop at the start of a rainfall event, which can serve as a proxy for "before the first call", compared to the subsequent stages. We argue that using a physical magnitude (i.e., rainfall) instead of the appearance of the first impacts results in a more consistent definition of what an "event" is, and we hope this reasoning addresses the reviewer's concerns.
Regarding the real-time use of our methods, the retrospective daily rain filter was primarily used to accelerate training and mitigate class imbalance in our experiments. In an operational, real-time setting, the model can still be run every hour without issue, as even after filtering out entire days without significant precipitation, most of the dataset still consists of samples that registered neither rain nor impacts. In the revised version, however, we will make this clearer as the referee suggests. Furthermore, we will also revise the definitions of the remaining concepts raised by the referee.
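As an illustration of the rainfall-stage split referred to in this response, the five stages listed in the abstract could be assigned roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `label_rain_stages`, the wet-hour threshold, the two-hour post-event window, and the tie-break for single-hour events are all assumptions made for the example.

```python
from typing import List


def label_rain_stages(
    rain: List[float],
    threshold: float = 0.0,   # assumed: an hour is "wet" if rain > threshold
    post_hours: int = 2,      # assumed length of the "immediately after" window
) -> List[str]:
    """Label each hour of one municipality's hourly rainfall series.

    Stages: "first", "intermediate", "last", "post", "dry".
    A single isolated wet hour is labelled "first" (assumed tie-break).
    """
    wet = [r > threshold for r in rain]
    n = len(rain)
    stages: List[str] = []
    hours_since_rain = None  # stays None until the first wet hour is seen

    for t in range(n):
        prev_wet = t > 0 and wet[t - 1]
        next_wet = t + 1 < n and wet[t + 1]
        if wet[t]:
            hours_since_rain = 0
            if not prev_wet:
                stages.append("first")
            elif not next_wet:
                stages.append("last")
            else:
                stages.append("intermediate")
        else:
            if hours_since_rain is not None:
                hours_since_rain += 1
            if hours_since_rain is not None and hours_since_rain <= post_hours:
                stages.append("post")
            else:
                stages.append("dry")
    return stages


# Made-up series: two dry hours, a three-hour event, then four dry hours.
example = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
print(label_rain_stages(example))
```

Note that the "last" label depends on the following hour, so it can only be assigned retrospectively; in a live setting that stage would be known with a one-hour delay, which is one reason such a split is better suited to evaluation than to feature engineering.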
Comment 2. Evaluation design and comparability across temporal configurations
The walk-forward evaluation is reasonable in principle, but the current implementation makes comparisons across temporal configurations difficult to interpret. Although the authors control the number of positive samples, the validation and test periods differ across configurations. Thus, differences in performance may arise not only from model robustness or training history, but also from differences in the test data themselves, such as negative-sample counts, positive rates, rainfall regimes, event severity, seasonality, or spatial distribution of impacts. Because validation sets also differ, hyperparameter tuning and threshold selection are performed under different validation distributions. The results shown in Figure 7 also do not indicate a clear trend.
Response:
We acknowledge the reviewer's concerns regarding the temporal configurations employed. The irregular and sparse nature of impacts poses many challenges when evaluating these systems, and when choosing a particular temporal cross-validation scheme, one must also consider its associated trade-offs.
We experimented with multiple alternatives before deciding on our approach. For instance, fixing temporal periods (e.g., specific months) resulted in highly inconsistent numbers of both positive and negative samples per set, leading to high variance in performance metrics. On the other hand, fixing the total number of samples (positive and negative) led to sets with drastically different positive-to-negative ratios and varying temporal periods. Moreover, in both of these cases, we could not ensure that the validation and test sets contained any positive samples, a problem that was especially common in the medium- and low-density municipalities.
In the end, we chose to fix the number of positive instances (impacts) per set. When comparing the different schemes, this configuration yielded the most stable results with the lowest standard deviation in model metrics, providing a more reliable setting for comparing performance and interpretability across configurations. While we acknowledge that this leads to different test periods and background distributions, we have found that stabilizing the positive class seems to be the most effective way to derive meaningful insights from such an imbalanced dataset. We will update the manuscript to provide a more detailed justification of why we chose this methodology, and will additionally address the lack of clear trends in Figure 7.
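To make the idea of fixing the number of positive instances per set concrete, the following sketch cuts a time-ordered label sequence into consecutive blocks that each contain exactly `k` positives, then builds expanding-window folds from them. It is only one possible reading of the scheme described above: the function names, the expanding training window, and the choice to drop a trailing incomplete block are assumptions, and the labels are made up.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open index range [start, end)


def blocks_with_k_positives(labels: List[int], k: int) -> List[Span]:
    """Cut time-ordered 0/1 labels into consecutive blocks with exactly k positives.

    Each block ends at its k-th positive. Trailing samples that do not
    reach k positives are dropped (assumed behaviour).
    """
    blocks: List[Span] = []
    start, count = 0, 0
    for i, y in enumerate(labels):
        count += y
        if count == k:
            blocks.append((start, i + 1))
            start, count = i + 1, 0
    return blocks


def walk_forward_folds(labels: List[int], k: int) -> List[Dict[str, Span]]:
    """Expanding-window folds: train on all earlier blocks, then validate and
    test on the next two blocks, so every val/test set has k positives."""
    blocks = blocks_with_k_positives(labels, k)
    folds = []
    for i in range(1, len(blocks) - 1):
        folds.append({
            "train": (0, blocks[i - 1][1]),
            "val": blocks[i],
            "test": blocks[i + 1],
        })
    return folds


# Made-up hourly labels (1 = impact in the next hour), ordered in time.
labels = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
for fold in walk_forward_folds(labels, k=2):
    print(fold)
```

The example also shows the trade-off raised by the referee: the blocks span 5, 3, 5, and 4 samples, so the number of negatives (and the covered time period) differs between sets even though the positive count is constant. With municipality-hour samples, a real pipeline would additionally need to place the cuts on timestamp boundaries so that one hour is never split across two sets.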
Comment 3. Interpretation, uncertainty, and presentation
The SHAP, ablation, and rainfall-stage analyses are useful, but their interpretation should be more cautious. The low-density model has very weak and unstable performance, so feature-importance or sensitivity analyses for this group should not be used to support strong conclusions; similar caution is needed for the medium-density model. The authors should distinguish robust findings from the high-density model from exploratory findings in lower-density settings. In addition, SHAP and ablation results should not be interpreted causally, especially given the likely correlation among rainfall, radar, warning, and call-history predictors. The rainfall-stage analysis is interesting, but the claimed “warm-up” phase of the first hours needs clearer reasons. More generally, the manuscript would benefit from restructuring: methodological details in the Introduction should be reduced, the short Related Work section could be integrated or better aligned with the research gap, and several redundant or unclear sentences should be revised.
Response:
We appreciate these insightful notes regarding the interpretation of our explainability analyses and the overall structure of the manuscript. We fully agree that the performance of the medium- and, especially, low-density models is less stable. Consequently, we will revise the text to distinguish clearly between the more robust findings from the high-density model and the exploratory insights from the lower-density approaches, with the goal of tempering the conclusions and avoiding over-interpretation. Moreover, we will clarify that the SHAP and ablation results describe feature contributions within the model, and not necessarily causal relationships.
Regarding the rainfall-stage analysis, we will expand and better ground our discussion of the "warm-up" phase, and we will also implement the proposed restructuring (which we agree could be beneficial) and the remaining technical corrections.
Citation: https://doi.org/10.5194/egusphere-2026-1253-AC1
-
AC2: 'Reply on RC1 (Final)', Jordi Morales Casas, 22 Jun 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1253/egusphere-2026-1253-AC2-supplement.pdf
-
RC2: 'Comment on egusphere-2026-1253', Anonymous Referee #2, 14 Jun 2026
General comments
This manuscript presents a timely and practically relevant study on the use of machine learning to predict flood-related 112 emergency calls at municipal and hourly resolution in Catalonia. The topic fits well within the scope of impact-based forecasting and early-warning systems, and the use of emergency-call records as impact proxies is an interesting contribution. The manuscript is generally well structured, and the comparison with operational warning products makes the work relevant for both scientific and applied audiences.
However, I recommend major revision before publication. The study has a promising core, but several methodological and interpretive issues need to be clarified before the conclusions about operational impact prediction can be fully supported.
Major comments
- Clarify the operational forecasting setting, especially the use of previous emergency calls as predictors.
The manuscript presents the task as predicting whether flood-related impacts will occur in the next hour, for example in the Abstract and in Sect. 4.1, where the target is defined as the occurrence of one or more flood-related 112 calls in the following hour. However, Sect. 4.3.4 states that the model includes the number of calls received 1, 2, and 3 hours earlier as input features. This means that the model may perform particularly well when impacts have already started, rather than when the first impact is still to occur. This distinction is important for the claimed contribution to early warning. I suggest that the authors separate model performance for "first-impact" situations, where no calls have occurred in the preceding hours, and "continuation" situations, where calls have already been received. An additional experiment excluding previous-call features would also help clarify how much predictive skill comes from hydrometeorological and vulnerability-exposure information alone.
- Improve the fairness and transparency of the baseline comparison.
Sect. 5.2 compares the machine-learning model with AEMET warnings, SMC warnings, and the FF-EWS radar-based warning product. This is useful, but the comparison may not be fully balanced. In Sect. 4.3.2, the manuscript states that all SMC warnings are used as model input, including observation warnings, whereas Sect. 5.2 says that observation warnings are excluded from the SMC warning baseline because they are reactive. If reactive observation warnings are available to the machine-learning model but not to the baseline, the model may benefit from information that is closer to real-time event detection than prospective forecasting. The authors should clarify exactly which warning products are available at prediction time and provide a sensitivity experiment using only prospectively available warning information. This would strengthen the claim that the machine-learning model outperforms operational warning approaches.
- Justify the target thresholds and discuss their implications for comparing population-density groups.
In Sect. 4.1 and Table 1, the positive class is defined using different thresholds: 1 call per hour for low- and medium-density municipalities, and 3 calls per hour for high-density municipalities. While this may be reasonable because call volumes differ greatly across population-density groups, it also means that the three models are not predicting exactly the same level of impact. The high-density model is evaluated on a more severe impact definition than the other two models. This complicates interpretation of Table 2 and Fig. 8, where performance is compared across low-, medium-, and high-density groups. The authors should explain how these thresholds were selected, whether alternative thresholds were tested, and how sensitive the conclusions are to this modelling choice. It would also be useful to discuss whether the threshold should represent a fixed operational burden, a population-normalised impact rate, or a severity-based definition.
- Strengthen the evaluation by addressing event dependence and operational usefulness.
Sect. 5.3 describes a walk-forward evaluation with validation and test sets constructed to contain the same number of positive samples, and Fig. 6 illustrates this design. This is a thoughtful approach to temporal imbalance, but the evaluation is still based on municipality-hour samples. Flood impacts are likely clustered in time and space, so many positive samples may belong to the same rainfall event. As a result, the effective number of independent test cases may be smaller than suggested by the sample count. I recommend adding an event-based evaluation, for example by grouping impacts into rainfall or flood episodes and reporting whether the model detects event onset, peak impact periods, and affected areas. In addition, the practical alert burden should be reported, such as the number of alerts generated per event, per municipality, or per day. Metrics such as CSI, recall, and precision are useful, but they do not fully show how emergency managers would experience the model in operation.
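The separation into "first-impact" and "continuation" situations suggested in the first major comment could be evaluated along the following lines. This is an illustrative sketch only: the function names, the input layout (one entry per municipality-hour sample, with the number of calls in the preceding lookback window supplied as `calls_prev_window`), and the data are hypothetical. CSI is computed with its standard definition, TP / (TP + FP + FN).

```python
import math
from typing import Dict, List, Sequence


def csi(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Critical Success Index: TP / (TP + FP + FN); NaN if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = tp + fp + fn
    return tp / denom if denom else math.nan


def csi_by_situation(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    calls_prev_window: Sequence[int],
) -> Dict[str, float]:
    """CSI computed separately for samples with and without recent calls.

    "first_impact": no calls in the preceding window.
    "continuation": at least one call in the preceding window.
    """
    groups: Dict[str, List[List[int]]] = {
        "first_impact": [[], []],
        "continuation": [[], []],
    }
    for t, p, c in zip(y_true, y_pred, calls_prev_window):
        key = "continuation" if c > 0 else "first_impact"
        groups[key][0].append(t)
        groups[key][1].append(p)
    return {name: csi(t, p) for name, (t, p) in groups.items()}


# Made-up samples: labels, binary predictions, and calls in the previous 3 hours.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]
calls_prev_window = [0, 0, 2, 3, 1, 0, 0, 0]
print(csi_by_situation(y_true, y_pred, calls_prev_window))
```

In this toy data the score is 0.25 for first-impact samples and 1.0 for continuation samples, which is the kind of gap such a breakdown is meant to expose. The same grouping pattern could be reused with an event identifier as the key to obtain per-event scores or alert counts.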
Minor and technical comments
The manuscript would benefit from careful language editing. Examples include inconsistent numerical formatting such as "32.000 km2", minor grammatical issues, and some awkward or redundant phrasing. I also suggest replacing "112, the European equivalent of 911" with "112, the European emergency number." The description of the daily rain filter should also be clearer, especially regarding whether it uses information from the whole day retrospectively or whether it could be implemented in real time.
Recommendation
Overall, the manuscript addresses an important and promising topic, and the dataset is valuable. I recommend major revision, mainly to clarify the operational forecasting setting, improve baseline comparability, strengthen the evaluation, and moderate the interpretation of results across population-density groups.
Citation: https://doi.org/10.5194/egusphere-2026-1253-RC2
-
AC3: 'Reply on RC2', Jordi Morales Casas, 22 Jun 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1253/egusphere-2026-1253-AC3-supplement.pdf
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 626 | 307 | 56 | 989 | 43 | 53 |