This work is distributed under the Creative Commons Attribution 4.0 License.
Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective
Abstract. Reliable urban flood prediction hinges on how datasets are designed, yet most existing research disproportionately emphasizes network architectures over data foundations. This study systematically investigates how dataset characteristics—scale, feature composition, and rainfall-event distribution—govern predictive performance and generalization in AI-based flood modeling. A physically calibrated hydrological–hydrodynamic model was employed to generate synthetic datasets with varied temporal lengths, input feature combinations (rainfall, infiltration, drainage), and rainfall-intensity distributions. A long short-term memory (LSTM) network, chosen for its widespread adoption and proven performance in hydrology, was applied as a representative benchmark to assess accuracy, computational cost, and bias under controlled conditions. Results identify: (1) a threshold effect of dataset length (~14,400 samples), beyond which performance gains plateau; (2) rainfall-intensity distribution as the dominant driver of generalization—training solely on light or extreme events induces systematic bias, whereas mixed-intensity datasets substantially enhance robustness; (3) ancillary features (infiltration and drainage) improve stability only when data are sufficiently abundant. These findings quantify trade-offs and pinpoint actionable design levers, offering general insights into dataset design for machine learning models in flood prediction and beyond. By clarifying critical dataset requirements, this study provides transferable guidance for building efficient and balanced datasets in hydrology and broader Earth system sciences.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4125', Anonymous Referee #1, 15 Feb 2026
- CC1: 'Comment on egusphere-2025-4125', Nima Zafarmomen, 11 Mar 2026
The manuscript presents a timely and well-structured investigation into an important but often overlooked issue in AI-based urban pluvial flood prediction: dataset design. Rather than focusing only on model architecture, the study systematically evaluates how dataset length, feature composition, and rainfall-intensity distribution affect predictive skill, computational cost, and generalization. Overall, this is a solid and relevant contribution, and I believe the manuscript is worth publishing after minor revision.
1) The manuscript should improve consistency in the statistical terminology. Section 2.6 refers to MANOVA, whereas Section 4.3 describes the analysis as multi-factor ANOVA. The authors should clarify which method was actually used and ensure the terminology is consistent throughout the manuscript.
2) The target variable should be described more consistently. In some places the manuscript defines the prediction target as inundation area, while elsewhere it discusses peak water level, runoff volume, and water depth.
3) Some of the conclusion wording is stronger and more conversational than is typical for a scientific paper. Phrases such as “respect the 14k-sample ceiling” and “start lean, enrich later” may be better rephrased in a more formal academic style. The ideas are useful, but the tone could be made more precise and neutral.
4) In addition, I strongly recommend that the authors consider citing the following recent and relevant paper, which is closely related to AI surrogate modeling in urban drainage networks and the prediction of node-level hydraulic states: Zafarmomen, N., Samadi, V., and Borgomeo, E. (2026). Spatiotemporal SWMM-LSTM surrogate modeling for efficient node-level water depth and inflow prediction in urban drainage networks. Cambridge University Press, published online 13 January 2026.
Citation: https://doi.org/10.5194/egusphere-2025-4125-CC1
- RC2: 'Comment on egusphere-2025-4125', Anonymous Referee #2, 12 Mar 2026
The manuscript addresses a highly relevant topic: the optimization of dataset design for deep learning models (specifically LSTM) in urban pluvial flood prediction. The study identifies a "threshold effect" regarding dataset size (~14,400 samples) and highlights that rainfall-intensity distribution is more critical for model generalization than raw data volume. While the work is interesting and well motivated, several critical methodological clarifications and improvements in the presentation of results are required before the manuscript can be recommended.
Specific Comments:
Introduction and Generalization: The motivation is clear and well-structured. However, line 87 states that insights from the LSTM model are expected to be transferred to other ML models. The authors should justify this claim, explaining why these findings (especially threshold effects) would apply to other models.
Section 2.2: An NSE is reported for the hydraulic model, but more detail is needed. What variable does this NSE represent (water levels, discharge, or flood extent)? At which control points and under what rainfall events was this validated?
- Line 134: Is there a specific reference for the IDF formula used?
- Line 147: The equation for a distributed model is an oversimplification as it neglects lateral flows between cells and surface storage. This needs a more rigorous explanation.
- Line 148: There is a typo in “III”; it should likely be "I" for Infiltration.
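On the IDF point above: for reference, one commonly used storm-intensity formula in Chinese urban drainage design codes is sketched below (it is an assumption on our part that the authors adopted this form; the manuscript should state and cite the exact formula it used):

```latex
q = \frac{167\, A_{1} \left( 1 + C \lg P \right)}{\left( t + b \right)^{n}}
```

where $q$ is the design rain intensity (L s$^{-1}$ ha$^{-1}$), $P$ the return period (yr), $t$ the rainfall duration (min), and $A_1$, $b$, $C$, $n$ locally fitted parameters.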
Section 2.3:
- Line 161: For consistency, "Configuration 4" should be labeled: “Rainfall (P) + Soil infiltration (I) + Pipe drainage (D)”.
- Line 172: Specify the unit for sequence length. Are these time steps or hours?
Section 3.3:
- Line 332: Data Leakage Prevention? The manuscript mentions the use of overlapping sliding windows. It is crucial to clarify whether the data split (train/val/test) was performed before or after generating these windows. If done after, there is a high risk of information leakage, which would invalidate the reported generalization performance.
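To make the concern concrete, here is a minimal sketch (hypothetical data and variable names, assuming numpy) of the leak-free protocol: split at the event level first, then build sliding windows within each split, so that no test timestep can appear inside a training window:

```python
import numpy as np

def make_windows(series, seq_len):
    """Build overlapping sliding windows (X) and next-step targets (y)."""
    X = np.stack([series[i:i + seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]
    return X, y

# Hypothetical example: two simulated rainfall events, 100 timesteps each.
rng = np.random.default_rng(0)
event_a = rng.random(100)   # stands in for one simulated event
event_b = rng.random(100)

# Leak-free protocol: split at the EVENT level first, window afterwards.
X_train, y_train = make_windows(event_a, seq_len=2)
X_test,  y_test  = make_windows(event_b, seq_len=2)

# The leaky protocol (to be avoided) would window the concatenated record
# first and split afterwards; windows straddling the split boundary then
# share timesteps between the training and test sets.
```

If the authors split after windowing, the reported generalization scores are optimistic and should be recomputed.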
Section 4.1:
- Line 402: The claim that performance decays after Level 4 is difficult to discern in the current figure for the low-intensity and mixed-intensity cases.
- Line 423: Does the model account for manhole overflows? This is a critical factor in urban pluvial flooding.
General comments:
- There is confusion between the use of MANOVA and ANOVA. The description suggests a factorial ANOVA for individual metrics, yet MANOVA is mentioned. Please clarify the exact statistical framework and how the covariance between multiple dependent variables was handled.
- A major omission is the time required to generate the dataset. Since InfoWorks ICM simulations are computationally intensive, the authors must provide details on the total simulation time, hardware used, and a comparison between the "data investment" time vs. the AI's real-time prediction advantage.
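On the ANOVA/MANOVA point: the distinction matters because a factorial ANOVA tests each metric separately, whereas MANOVA models the response metrics jointly through their covariance. A minimal numpy sketch (with hypothetical metric values) of the per-metric one-way F-statistic that the univariate route would compute for a single metric:

```python
import numpy as np

def one_way_anova_F(groups):
    """F-statistic for a one-way ANOVA across a list of 1-D sample arrays."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k = len(groups)                      # number of factor levels
    N = all_vals.size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Hypothetical NSE scores under three dataset-length levels.
levels = [np.array([0.70, 0.72, 0.71]),
          np.array([0.80, 0.82, 0.81]),
          np.array([0.81, 0.83, 0.82])]
F = one_way_anova_F(levels)
```

A MANOVA would instead test a multivariate statistic (e.g. Wilks' lambda) on the joint vector of metrics; the authors should state plainly which of the two frameworks was applied.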
Figures and Legibility:
- Figure 7: What does "Y0333333.1" represent? Legend labels should be descriptive and self-explanatory.
- Figures 8, 9, and 11: Figure 8 (Training time 1, 2, 3) is difficult to read and seems redundant. Consider merging the essential information into Figure 9 and removing Figure 11 if it does not provide unique insights.
- Figure 17: Clarify the “Value” axis. The figure as it stands is not sufficiently explanatory for the conclusions drawn immediately below it.
Discussion: The discussion lacks sufficient citations to back its claims, particularly in line 534. The results should be contextualized by comparing them with existing literature on dataset design in computational hydrology.
In conclusion, the paper presents a timely and valuable contribution to AI applications in hydrology by shifting the focus from “model architecture” to “dataset design.” However, several critical issues need to be addressed. First, the lack of transparency regarding the computational cost of data generation makes it difficult to assess the efficiency of the proposed framework, as the “cost of data” is as relevant as model accuracy. Second, the potential for data leakage in sliding-window approaches must be resolved to ensure the validity of the results. Finally, the hydrological representation is oversimplified, mass balance considerations and calibration procedures are described too briefly. Strengthening the physical basis of the synthetic data is essential. To recommend this work, the authors need to address these points and resolve the concerns regarding computational feasibility, data handling, and the hydrological basis of the simulations.
Citation: https://doi.org/10.5194/egusphere-2025-4125-RC2
Review of the paper “Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective.”
By Hao Hu et al.
In general, the introduction is well constructed and poses questions in a logical and relevant manner. There is indeed an advantage to using a synthetic database: beyond the lack of available data, it allows us to overcome the bias-variance dilemma that explains overfitting.
However, in the rest of the article, we note that the methodology is not correctly deployed. 1) The physics-based model that generated the hydrological responses is not fully presented or validated. This is important because, if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning. Furthermore, 2) the simulated rainfall is very unrealistic: the simulated rainfall intensities (100 mm/h) are significantly higher than those observed and applied to the physical model (1 mm/h, see Fig. 7). Such information should not have to be inferred from values read off a figure. The consequences must be considered: can a model calibrated with rainfall of 1 mm/h simulate responses to rainfall of 100 mm/h? What are the consequences? The simulated rainfall events all have the same duration; why is this? One might imagine that longer rainfall events would have a greater impact in the model because they deliver more water. The simulated rainfall therefore lacks the diversity needed to evaluate all possible states of the system. 3) Regarding the choice of the LSTM model: one does not choose a model because it is fashionable, but because it is suited to the problem at hand. Furthermore, LSTM is not the most widely used model; for high-intensity, rapid floods, the perceptron appears to be more commonly used. The latter is much less complex, consumes much less energy and thus contributes more to a sustainable world than LSTM.
Furthermore: 4) regarding the validation of the LSTM model, its performance can change drastically depending on the events used for validation, so it is recommended to evaluate the models with cross-validation, which is not done in the paper.
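A minimal sketch (hypothetical event labels, assuming numpy) of the event-wise cross-validation we have in mind, leaving one rainfall event out per fold:

```python
import numpy as np

def leave_one_event_out(event_ids):
    """Yield (train_idx, test_idx) pairs, holding out one event at a time."""
    event_ids = np.asarray(event_ids)
    for ev in np.unique(event_ids):
        test = np.where(event_ids == ev)[0]
        train = np.where(event_ids != ev)[0]
        yield train, test

# Hypothetical per-sample event labels: 3 events, 4 samples each.
labels = [0] * 4 + [1] * 4 + [2] * 4
folds = list(leave_one_event_out(labels))
```

Averaging the skill scores over such folds would show how sensitive the reported performance is to the particular events held out.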
All these elements suggest that the chosen methodology is not properly thought out. And we doubt the meaning that the results could have.
Regarding the quality of the presentation, it must be noted that the writing is poor. Several parts are redundant (e.g. the description of rainfall), while others are missing (how is generalisation assessed?). The notation changes throughout the article. Even the model output is not defined consistently; sometimes we have a flow rate (L127), sometimes a flooded area (L146) and sometimes a water height (L230). How is this possible? The only explanation I can see is that an AI wrote parts of it, and they were not properly corrected.
The conclusion provides little new information. Yes, the quality of the model is affected by the size of the database: we already know that. Yes, events of different types and intensities must be presented to the model during training: indeed, we already know that the training database must include a representation of the entire state space. It is also obvious that the calculation time increases with the length of the training database. On the other hand, no: a model that has learned from synthetic data, i.e. without noise or uncertainty, cannot be overfitted. The conclusions relating to this last point are therefore false, and explanations must be sought elsewhere.
Unfortunately, the simplistic nature of rainfall events does not allow us to explore other avenues or quantify thresholds; for example, how many events are needed to reach the plateau.
Unfortunately, even though the subject seemed interesting, in view of all these factors, it is very difficult to recommend this paper for publication. It needs to be completely rethought and rewritten.
Specific comments
The variability in maximum precipitation intensities is not representative of real cases. The range of 90 mm/h to 140 mm/h could have been distributed more widely, for example starting at 10 mm/h.
L128-135: there is a serious lack of information on the rainfall-runoff model; in particular, what is its response time? We would like to see the output hydrographs from this model. A Nash criterion of 0.5 may correspond either to simulations that reproduce the dynamics of the hydrosystem or to mediocre simulations with, for example, a very low peak flow/height that does not represent the observed situation. We would like to see a more detailed analysis of the model outputs for intense and very intense rainfall events. Is there saturation for the most intense events?
This is problematic because if the rainfall-runoff model overestimates the floods or oversimplifies the real situation (the functioning of the hydrosystem), the issue is clearly greatly simplified... this could even cast doubt on the reliability of the conclusions.
There is confusion between the sequence length at L171 and the LSTM sequence of 2 time steps (30 min).
L137: q denotes the intensity of the rain, which is a poor choice of notation, as q generally represents flow.
L139 specify the figure number instead of ‘this formula’.
L148, specify what ‘total precipitation’ is: the cumulative amount since the start of the event?
The problem with equation 2 is that time does not appear. If the system is truly dynamic and time-dependent, this must be indicated in the equations. And better define what ‘total precipitation’ means.
L149. The output of the hydrological model called ‘rainfall-runoff’ is in fact a flooded area. This is an approximation that is very far from reality and reflects an approximation that is not acceptable if it is not explained and justified. Once again, the model is described too quickly.
L156-166: we would like to have the Nash criteria for the model by category of experiments.
L172-177: specify the duration in minutes and the number of floods.
L178: it is good that there is a validation set, but how is the database divided into learning, testing and validation?
L172 to L177 represent 5 floods, 6 floods, 7 floods, etc., up to 10 floods. It is a detail, but at 90 mm per hour it is difficult to consider the rain as ‘light’, especially in an urban environment.
L202-203: I do not understand how a sequence of two time steps can help to approximate long-term memory. The LSTM is clearly misused, or rather it is not recommended for a synthetic system with a supposed response time of 2 time steps (information on this response time is lacking).
L 209: specify what helps to avoid overfitting? The batch, the learning rate or the epoch. Overfitting is avoided with regularisation methods. But a priori there should be no overfitting on a synthetic system, as there is no noise and no uncertainty.
Table 1: specify the units of time step, sequence and batch size more precisely.
2.5.1
L 220: why mention rainfall when the target is the flooded area? Also, if Y is the target measured by the criterion, the denominator should be Ymean, not X, which is not defined.
2.5.2 It is not specified what x represents. According to the text, it could be rainfall, but rainfall is referred to as q at L137.
2.5.2 Express the coefficient of determination using the same notation as the NRMSE. Why is water depth mentioned here? Isn't the target the flooded area?
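For reference, with $Y$ the target (flooded area), $\hat{Y}$ the prediction and $\bar{Y}$ the mean of the observations, consistent definitions of the two criteria would be:

```latex
\mathrm{NRMSE} = \frac{1}{\bar{Y}}
  \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^{2}},
\qquad
R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^{2}}
               {\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^{2}}
```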
2.5.3. It is a good idea to focus on training time, but then why not use a multilayer perceptron, which is much faster than LSTM in such a simplified case study?
2.6.1 Why are we talking about an effect? The concept should be defined and used consistently throughout the document. This entire section should be rewritten to make it understandable.
3.1: This entire section should be moved up to the chapter where the model is discussed.
Fig. 6: Provide a scale.
Provide a reference for InfoWorks ICM.
Specify what the three observed flows (not the flooded area) correspond to in Figure 7. Furthermore, we note in Figure 7 that the model completely misses the first peak. The observed rainfall intensities are significantly lower (0.6 mm) than the simulated artificial rainfall (100 mm/h). Why is this? Is it because the model does not accurately represent responses to light rainfall?
It is necessary to present a comprehensive and detailed overview of the performance of the hydrological model.
Equation 10 is more like that of a recurrent MLP...
Table 3: repetition.
Figure 8: is the increase in computation time proportional to the increase in signal length?
Figure 11 is difficult to read. Why does the normalised criterion vary so much? It is not consistent.
Figure 12: only the orange colour is visible.
Figure 17: which model are we talking about? The physical model or the LSTM? Review the ordinates: ‘value’ does not indicate the variable displayed.
We do not know how generalisation is measured. On which set?
Line 561: it is not possible to write "Controlled experiments with high-fidelity synthetic rainfall–inundation datasets". The physics-based model is poorly evaluated; the example shown (Fig. 7) completely misses the first event, which is far from high fidelity. The rest of the conclusion is therefore unproven. Even if my experience is in line with the authors' conclusions: enough events and different types are needed; these conclusions are neither new nor rigorously proven.
Conclusion: "but beyond approximately 14,400 sequences": no, it is 14,400 time steps.