This work is distributed under the Creative Commons Attribution 4.0 License.
Probabilistic flood hazard mapping for dike-breach floods via graph neural networks
Abstract. Flood hazard maps are essential for protection and emergency plans, yet their probabilistic application is constrained by the computational cost of numerical models. Deep learning surrogates can provide predictions that are orders of magnitude faster, but their use for uncertainty quantification in realistic settings and their ability to incorporate hydraulic structures remain largely unexplored. Studying deep learning surrogates for probabilistic flood maps is non-trivial because the lack of reference ground-truth data can lead to misleading confidence in predictions. Moreover, hydraulic structures are challenging to include due to their generally one-dimensional nature. In this work, we investigate the use of deep learning surrogates for realistic, large-scale flood simulations in case studies with hydraulic structures, under diverse boundary conditions. To this end, we employ the multi-scale shallow-water-equation graph neural network (mSWE-GNN), which transfers to different boundary conditions and locations and whose graph-based architecture allows structures such as canals, underpasses, and elevated elements to be represented as inputs. To address the lack of reference ground-truth data, we further introduce the average relative mass error (ARME), a mass-conservation-based criterion that helps identify physically plausible simulations. We apply the model to dike ring 41 in the Netherlands, generating probabilistic flood maps that account for uncertainties in breach location and breach outflow hydrographs. The model was trained on 30 simulations, generated with Delft3D, and evaluated against unseen benchmark simulations from the Dutch national flood catalogue, achieving a critical success index (CSI) of 73.6 % while running 10,000 times faster than the numerical simulator. The proposed ARME is negatively correlated with the CSI, with a Pearson correlation coefficient of −0.7, making it a useful indicator of simulation plausibility when evaluating unseen case studies. We obtained probabilistic flood maps by running 10,000 different flooding scenarios on a computational mesh of 180,000 cells in approximately 10 hours, with about half of the simulations classified as plausible based on the mass-conservation check. This framework offers a practical tool for rapid probabilistic flood hazard assessment and a way to prioritize detailed physical simulations, supporting more efficient and robust flood risk management.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-5582', Anonymous Referee #1, 10 Jan 2026
AC1: 'Reply on RC1', Roberto Bentivoglio, 29 Jan 2026
We thank the reviewer for the constructive comments, which helped us improve the clarity and presentation. We address each comment point-by-point below.
We indicate the reviewer’s comments in bold and the text modifications in blue.
1) Line 8: The “mSWE-GNN” developed by the authors stands for the “multi-scale hydraulic graph neural network”. While “SWE” may be short for shallow water equations, strictly speaking, it is not the same as “hydraulic”.
We agree that “SWE” refers to the shallow water equations, while “hydraulic” is a broader term. We modified “multi-scale hydraulic graph neural network” to “multi-scale shallow-water-equation graph neural network” as recommended.
2) Line 16, Figure 11, and Table 2: Pearson’s r is not a good measure of correlation in this case, because it is sensitive to outliers and is not applicable when two variables do not show a clear linear pattern. Spearman’s correlation coefficient could be a better choice in this case.
We agree with the reviewer that Pearson’s correlation can be sensitive to outliers and assumes linear dependence. In the experiments, all data pairs for which a clear correlation exists (i.e., between CSI and ARME) tend to follow a linear relationship, as also suggested by similar values of both coefficients. In the revised manuscript, we now report Spearman’s ρ both in line 16 and Figure 11. In Table 2, we kept both metrics to present a more complete picture of the correlations.
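For illustration, a minimal Python sketch of how the two coefficients can be compared side by side (the ARME and CSI arrays below are synthetic placeholders, not the paper’s data):

```python
# Minimal sketch: comparing Pearson's r and Spearman's rho for CSI vs. ARME.
# The arrays are placeholders standing in for the per-simulation values
# behind Figure 11 and Table 2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
arme = rng.uniform(0.0, 1.0, size=50)                             # hypothetical ARME values
csi = np.clip(0.9 - 0.5 * arme + rng.normal(0, 0.05, 50), 0, 1)  # hypothetical CSI values

r, _ = stats.pearsonr(arme, csi)     # sensitive to outliers, assumes linearity
rho, _ = stats.spearmanr(arme, csi)  # rank-based, robust to outliers
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

When the relationship is close to linear, as the authors argue for CSI vs. ARME, the two coefficients take similar values, which is why reporting both in Table 2 is informative.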
3) Lines 25-28: It should be noted that the uncertainty in model evaluation should not be ignored given the sampling uncertainty over limited space and time. The authors can refer to the paper below for more information about the limitations of some commonly used evaluation metrics in flood modeling.
We thank the reviewer for this suggestion. We now explicitly acknowledge the limitations of deterministic evaluation metrics in flood modeling, which gives further motivation for requiring probabilistic flood maps.
Lines 26-28 now read as:
“Further uncertainties can appear when quantifying the statistical fit of a model with limited data and treating metrics as deterministic (Huang and Merwade, 2024).”
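To make the cited limitation concrete, here is a minimal sketch of a percentile bootstrap over per-simulation scores, in the spirit of Huang and Merwade (2024); the score values and function name are illustrative assumptions, not the paper’s procedure:

```python
# Minimal sketch: percentile bootstrap to turn a "fixed number" metric into a
# confidence interval, by resampling the test simulations with replacement.
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-simulation scores."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([
        rng.choice(scores, size=n, replace=True).mean() for _ in range(n_boot)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

scores = np.array([0.74, 0.71, 0.78, 0.69, 0.73])  # hypothetical per-simulation CSI
lo, hi = bootstrap_ci(scores)
print(f"95% CI for mean CSI: [{lo:.3f}, {hi:.3f}]")
```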
4) Lines 30-32: Too many references are used here. It is suggested to remove some old ones.
As recommended by the reviewer, we removed the older references in lines 30-32, keeping up to a maximum of two entries per uncertain input.
These lines now read as:
“Building probabilistic hazard maps remains challenging as the number of uncertain variables can be large, particularly for dike breaching, where additional geotechnical properties must be considered. Uncertainties include breach location (D’Oria and Maranzoni, 2019; Westerhof et al., 2023), breach width (Mazzoleni et al., 2014; de Moel et al., 2014), breach development time (Apel et al., 2006; Ferrari et al., 2020), failure time (D’Oria and Maranzoni, 2019), and failure mechanism (D’Oria and Maranzoni, 2019; Mazzoleni et al., 2014).”
5) It is suggested to add a list of acronyms mentioned in the manuscript. The full term of the acronym only needs to be presented the first time it appears, e.g., ARME and CSI.
We appreciate this suggestion. We added a list of acronyms at the beginning of the manuscript. In addition, we ensured that all acronyms (e.g., ARME, CSI) are defined in full only upon their first appearance, in the abstract, and in the conclusions.
6) Figure 1: It would be helpful to explain the terms like “Z_ee” in the figure or figure caption.
We have updated the captions of Figures 1 and 3 to explicitly define the symbol Z_ee (elevation of elevated elements).
Figure 1 caption (modified part) now reads as:
“… a) Hydraulic structures such as elevated elements and canals are inputs. The difference between elevated elements (z_ee) and the terrain (z_i) is treated as an additional edge feature. …”
Figure 3 caption (modified part) now reads as:
“… For each edge (i,j) that intersects an elevated element, we determine the feature as the difference in elevation from that of the element (z_ee) to that of the source node i (z_i).”
7) Lines 113-114: What is “p” in the superscript, and how to determine the value of “p”.
We have clarified that p denotes the number of previous time steps used as dynamic inputs in the autoregressive formulation. This number is a hyperparameter, which we set to p=2 (lines 264-265) based on previous studies (Bentivoglio et al., 2025).
Lines 115-116 now read as:
“…U^t-p:t are dynamic node features for time steps t-p to t, with p indicating the number of previous time steps given as input, and E are edge features.”
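To illustrate this autoregressive use of p previous time steps, a minimal sketch (the `model`, `u_history`, and `edge_feats` names are hypothetical and do not correspond to the paper’s code):

```python
# Minimal sketch of an autoregressive rollout with p previous states as input:
# each prediction is appended to the state history and fed back as input.
import torch

def rollout(model, u_history, edge_feats, n_steps, p=2):
    """Autoregressively predict n_steps states from the last p known states."""
    states = list(u_history[-p:])                # U^{t-p:t}: p previous time steps
    preds = []
    for _ in range(n_steps):
        u_in = torch.stack(states[-p:], dim=0)   # dynamic node features
        u_next = model(u_in, edge_feats)         # predict the next hydraulic state
        preds.append(u_next)
        states.append(u_next)                    # feed the prediction back as input
    return torch.stack(preds, dim=0)
```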
8) Figure 3: Is it necessary to force the mesh to align with the boundaries of structures and riverbanks?
For the coarse meshes, it is not strictly necessary to align cell boundaries with structures and riverbanks, as the mesh only needs to roughly capture that flow runs in a preferential direction (given, for example, by a canal). Preliminary experiments showed little difference between enforcing this alignment or not.
9) Line 163: The u0_hat instead of u0 are the predicted hydraulic variables.
We thank the reviewer for identifying this inconsistency. The text has been corrected to use û₀ to indicate predicted hydraulic variables.
10) Figure 5: Please add the units for both longitudes and latitudes.
We have added the appropriate units to both longitude and latitude axes in Figure 5.
11) Figure 7(d): What is the definition of the roughness coefficient?
We have clarified that the roughness coefficient we used is based on White-Colebrook’s formula. It is the one employed for running the official simulations used by the Dutch government (VNK). The proposed model works independently of the roughness coefficient type.
Figure 7 caption now reads as:
“… d) represents the distribution of the White-Colebrook roughness coefficient, with higher values indicating urban areas”
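For readers unfamiliar with this formulation, the White-Colebrook relation (as implemented in Delft3D, to our understanding) converts a Nikuradse roughness length k into a Chézy coefficient C at water depth H:

```latex
C = 18 \log_{10}\!\left(\frac{12\,H}{k}\right)
```

so a larger k (e.g., in urban areas) gives a lower C and hence more flow resistance, consistent with the caption above.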
12) Table 1: The numbers after “±” are standard deviations or standard errors? For the MAE in the validation dataset, how could 1.41-1.72 < 0?
The values after “±” represent standard deviations, not standard errors. The standard deviation exceeds the mean because a few simulations have much larger errors, which skews the standard deviation above the mean.
We have clarified this explicitly in the table caption, which now reads as:
“Training and testing metrics for the mSWE-GNN on dike ring 41, reporting mean and standard deviation for the mean absolute error (MAE) and the critical success index (CSI) for a water depth threshold. HD073, ND234, and ND111 are three test locations along the dike perimeter and RP stands for return period. Arrows indicate whether higher (↑) or lower (↓) values are better.”
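For reference, a minimal sketch of how a CSI of this kind can be computed from wet/dry masks (the array names and threshold value are placeholders; the paper’s exact threshold is not reproduced here):

```python
# Minimal sketch of the critical success index (CSI) for flood extent:
# CSI = hits / (hits + misses + false alarms) on wet/dry masks derived
# from a water depth threshold.
import numpy as np

def csi(depth_pred, depth_ref, threshold=0.05):
    """CSI over cells, given predicted/reference depths and a wet/dry threshold."""
    wet_pred = depth_pred > threshold
    wet_ref = depth_ref > threshold
    hits = np.sum(wet_pred & wet_ref)            # correctly predicted wet cells
    misses = np.sum(~wet_pred & wet_ref)         # wet cells predicted dry
    false_alarms = np.sum(wet_pred & ~wet_ref)   # dry cells predicted wet
    return hits / (hits + misses + false_alarms)
```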
13) Line 344: Please correct the text “outlier Contrarily”.
Thank you for catching this typo. The text has been corrected to read simply “Contrarily”.
Citation: https://doi.org/10.5194/egusphere-2025-5582-AC1
RC2: 'Comment on egusphere-2025-5582', Anonymous Referee #2, 20 Jan 2026
This study applies the mSWE-GNN model to probabilistic dike-breach flood hazard mapping, incorporating hydraulic structures and introducing the ARME metric for model validation without ground-truth data. The research addresses an important practical problem and demonstrates significant computational speedup. However, several methodological and presentation issues require attention before publication.
General Comments
- There are distribution shifts between the training and testing datasets that may undermine the model’s generalizability. According to Table 1 and Figure 6, the training flood volumes are 220.1±131.4×10⁶ m³, while test location ND111 produces volumes of 395.71-974.55×10⁶ m³, far exceeding the training range. The MAE at ND111 reaches 247.81×10⁻² m, nearly ten times higher than the training error. (Also, Table 1 has two rows labeled “test”; please clarify how these correspond to test cases/locations.) Please discuss the model’s applicable range and extrapolation limits. If feasible, include higher-volume training samples or explore domain adaptation for out-of-distribution conditions.
- The theoretical foundation of the ARME metric requires further clarification. In Equation (5), the predicted volume may be negative in early simulation stages due to subtracting V₀ in Equation (4), and when Vₜ approaches zero, ARME may produce numerically unstable or meaningless results (see the sketch after this list). Please discuss ARME’s behavior during initial simulation phases and whether such issues were handled in the results, as they may bias the plausibility assessment.
- The finding that approximately 50% of simulations are classified as implausible (ARME>0.4) raises concerns about operational applicability. How can users determine prediction reliability when encountering new boundary conditions in practical applications? Does this high discard rate introduce bias in probability distribution estimates? The authors should analyze whether there are systematic differences in spatial distribution or boundary conditions between discarded and retained simulations, and discuss strategies for improving the plausibility rate.
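To illustrate the numerical edge case raised in the second comment, here is a minimal sketch of a mass-conservation check in the spirit of ARME; this is not the paper’s Equations (4)-(5), and the volume definitions, the epsilon guard, and all names are assumptions:

```python
# Minimal sketch of an ARME-like mass-conservation check: compare the predicted
# flooded volume against the cumulative breach inflow at each output step.
# `eps` guards the near-zero denominators flagged in the comment above.
import numpy as np

def relative_mass_error(vol_pred, vol_inflow, eps=1e-6):
    """Average relative volume error over time, skipping near-zero inflow steps."""
    vol_pred = np.asarray(vol_pred, dtype=float)
    vol_inflow = np.asarray(vol_inflow, dtype=float)
    valid = vol_inflow > eps                  # ignore the earliest, near-empty steps
    rel_err = np.abs(vol_pred[valid] - vol_inflow[valid]) / vol_inflow[valid]
    return rel_err.mean()
```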
Minor Comments
- What criteria guided the selection of dike-breach locations, and are they representative and physically justified?
- Figure 6(b) caption reads “Training and testing discharge hydrographs…”. Given validation in Figure 6(a), consider “Training and validation discharge hydrographs…” (or include testing if applicable).
- Pearson’s r (Line 16, Figure 11, Table 2) is outlier-sensitive and assumes linearity. Given Figure 11’s scatter, please justify Pearson’s r or report Spearman’s rank as a complementary robust measure.
- Lines 30–32 contain many references for one statement on dike-breach uncertainties; retain the most representative and recent.
- Cite more recent references (ideally last 10 years) when discussing computation costs.
- The 8-hour output resolution in Section 3.1 may be coarse for rapid flood-front dynamics. Briefly discuss how this choice affects ARME and CSI.
- Line 344: fix “outlier Contrarily” to “Contrarily” (or rephrase).
- The computational efficiency claim of “10,000 times faster” requires more detail, including the specific Delft3D configuration (number of CPU cores, parallelization) and whether data I/O time is included.
Citation: https://doi.org/10.5194/egusphere-2025-5582-RC2
AC2: 'Reply on RC2', Roberto Bentivoglio, 29 Jan 2026
RC1: 'Comment on egusphere-2025-5582', Anonymous Referee #1, 10 Jan 2026
The study employed the mSWE-GNN model developed by the authors to investigate its applicability for large-scale flood simulations that incorporate information on hydraulic structures, and introduced a metric based on mass conservation to evaluate model performance in the absence of ground-truth hydrologic data. Overall, this study is comprehensive and the findings are meaningful for more informed and efficient flood risk management. However, I have several comments and suggestions, as follows, to improve the current manuscript.
1) Line 8: The “mSWE-GNN” developed by the authors stands for the “multi-scale hydraulic graph neural network”. While “SWE” may be short for shallow water equations, strictly speaking, it is not the same as “hydraulic”.
2) Line 16, Figure 11, and Table 2: Pearson’s r is not a good measure of correlation in this case, because it is sensitive to outliers and is not applicable when two variables do not show a clear linear pattern. Spearman’s correlation coefficient could be a better choice in this case.
3) Lines 25-28: It should be noted that the uncertainty in model evaluation should not be ignored given the sampling uncertainty over limited space and time. The authors can refer to the paper below for more information about the limitations of some commonly used evaluation metrics in flood modeling.
Reference: “Beyond a fixed number: Investigating uncertainty in popular evaluation metrics of ensemble flood modeling using bootstrapping analysis” (https://doi.org/10.1111/jfr3.12982)
4) Lines 30-32: Too many references are used here. It is suggested to remove some old ones.
5) It is suggested to add a list of acronyms mentioned in the manuscript. The full term of the acronym only needs to be presented the first time it appears, e.g., ARME and CSI.
6) Figure 1: It would be helpful to explain the terms like “Z_ee” in the figure or figure caption.
7) Lines 113-114: What is “p” in the superscript, and how to determine the value of “p”?
8) Figure 3: Is it necessary to force the mesh to align with the boundaries of structures and riverbanks?
9) Line 163: The u0_hat instead of u0 are the predicted hydraulic variables.
10) Figure 5: Please add the units for both longitudes and latitudes.
11) Figure 7(d): What is the definition of the roughness coefficient?
12) Table 1: The numbers after “±” are standard deviations or standard errors? For the MAE in the validation dataset, how could 1.41-1.72 < 0?
13) Line 344: Please correct the text “outlier Contrarily”.