Review of the paper "Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective."
By Hao Hu et al.
In general, the introduction is well constructed and poses questions in a logical and relevant manner. There is indeed an advantage to using a synthetic database: beyond the lack of available data, it allows us to overcome the bias-variance dilemma that explains overfitting.
However, in the rest of the article, the methodology is not correctly deployed:
1) The physics-based model that generated the hydrological responses is not fully presented or validated. This matters because if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning.
2) The simulated rainfall is very unrealistic: the simulated rainfall intensities (100 mm/h) are significantly higher than those observed and applied to the physical model (1 mm/h, see Fig. 7). The reader should not have to infer these values from a figure. The consequences must be considered: can a model calibrated with rainfall of 1 mm/h simulate responses to rainfall of 100 mm/h? What are the consequences? In addition, the simulated rainfall events all have the same duration; why? One might expect longer rainfall events to have a greater impact in the model because of their greater cumulative volume. The simulated rainfall therefore lacks the diversity needed to evaluate all possible states of the system.
3) Regarding the choice of the LSTM model: one does not choose a model because it is fashionable, but because it is suited to the problem at hand. Moreover, the LSTM is not the most widely used model; for high-intensity/rapid floods, the perceptron appears to be more common. The latter is much less complex, consumes much less energy, and thus contributes more to a sustainable world than the LSTM.
Furthermore: 4) regarding the validation of the LSTM model, its performance can change drastically depending on the events used for validation; it is therefore recommended to evaluate it with cross-validation, which is not done in the paper.
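To make the recommendation concrete, the sketch below shows event-wise cross-validation, where all time steps of a given rainfall event are held out together so the score reflects generalization to unseen events. This is only an illustration with placeholder data and a placeholder estimator, not the authors' actual dataset or LSTM:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Placeholder data: 10 synthetic rainfall events, 60 time steps each.
n_events, steps = 10, 60
X = rng.random((n_events * steps, 3))            # e.g. rainfall, infiltration, drainage
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(0, 0.1, n_events * steps)
groups = np.repeat(np.arange(n_events), steps)   # event label for every time step

# Event-wise k-fold: whole events are held out together, so the
# validation score is not inflated by within-event correlation.
scores = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    rmse = mean_squared_error(y[val_idx], model.predict(X[val_idx])) ** 0.5
    scores.append(rmse)

print(f"RMSE per fold: {np.round(scores, 3)}")
print(f"mean +/- std: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the mean and spread across folds, rather than a single validation score, would directly address the concern that results depend on which events happen to fall in the validation set.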
All these elements suggest that the chosen methodology was not properly thought through, and we doubt that the results are meaningful.
Regarding the quality of the presentation, it must be noted that the writing is poor. Several parts are redundant (e.g. the description of the rainfall), while others are missing (how is generalisation assessed?). The notation changes throughout the article. Even the model output is not defined consistently: sometimes it is a flow rate (L127), sometimes a flooded area (L146), and sometimes a water height (L230). How is this possible? The only explanation I can see is that an AI wrote parts of the text and they were not properly corrected.
The conclusion provides little new information. Yes, the quality of the model is affected by the size of the database: we already know that. Yes, events of different types and intensities must be presented to the model during training: indeed, we already know that the training database must cover the entire state space. It is also obvious that the computation time increases with the length of the training database. On the other hand, no: a model that has learned from synthetic data, i.e. without noise or uncertainty, cannot overfit. The conclusions relating to this last point are therefore false, and explanations must be sought elsewhere.
Unfortunately, the simplistic nature of the rainfall events does not allow other avenues to be explored or thresholds to be quantified; for example, how many events are needed to reach the plateau?
Unfortunately, even though the subject seemed interesting, in view of all these factors, it is very difficult to recommend this paper for publication. It needs to be completely rethought and rewritten.
Specific comments
The variability in maximum precipitation intensities is not representative of real cases. The range of 90 mm/h to 140 mm/h could have been distributed more widely, for example starting at 10 mm/h.
L128-135: there is a serious lack of information on the rainfall-runoff model; in particular, what is its response time? We would like to see the output hydrographs of this model. A Nash criterion of 0.5 may correspond either to simulations that reproduce the dynamics of the hydrosystem, or to mediocre simulations with, for example, a very low peak flow/height that does not represent the observed situation. We would like a more detailed analysis of the model outputs for intense and very intense rainfall events. Is there saturation for the most intense events?
This is problematic because if the rainfall-runoff model overestimates the floods or oversimplifies the real situation (the functioning of the hydrosystem), then the problem studied is greatly simplified... this could even cast doubt on the reliability of the conclusions.
There is confusion between the sequence length (L171) and the LSTM input sequence of 2 time steps (30 min).
L137: q denotes rainfall intensity, which is a poor choice of symbol, as q generally denotes discharge.
L139: specify the equation number instead of writing 'this formula'.
L148, specify what ‘total precipitation’ is: the cumulative amount since the start of the event?
The problem with Equation 2 is that time does not appear. If the system is truly dynamic and time-dependent, this must be reflected in the equations. 'Total precipitation' should also be better defined.
L149. The output of the hydrological model called ‘rainfall-runoff’ is in fact a flooded area. This is an approximation that is very far from reality and reflects an approximation that is not acceptable if it is not explained and justified. Once again, the model is described too quickly.
L156-166: we would like to have the Nash criteria for the model by category of experiments.
L172-177: specify the duration in minutes and the number of floods.
L178: it is good that there is a validation set, but how is the database divided into training, validation and test sets?
L172 to L177 represents 5 floods, 6 floods, 7 floods, etc., up to 10 floods. It is a detail, but at 90 mm per hour, it is difficult to consider the rain as ‘light’, especially in an urban environment.
L202-203: I do not understand how a sequence of two time steps can help to approximate long-term memory. The LSTM is clearly misused, or rather it is not recommended for a synthetic system with a supposed response time of 2 time steps (this response time is not stated).
L209: specify what helps to avoid overfitting: the batch size, the learning rate, or the number of epochs? Overfitting is avoided with regularisation methods. But a priori there should be no overfitting on a synthetic system, as there is no noise and no uncertainty.
Table 1: specify the units of the time step, sequence length and batch size more precisely.
2.5.1
L220: why mention rainfall when the target is the flooded area? Also, if Y is the target evaluated by the criterion, the denominator should be Ymean, not X, which is not defined.
2.5.2 It is not specified what x represents. According to the text, it could be rainfall, but rainfall is denoted q (L137).
2.5.2 Express the coefficient of determination using the same notation as the NRMSE. Why is water depth mentioned here? Is the target not the flooded area?
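As a suggestion only, both criteria could be written with a single consistent notation, with $Y_i$ the target flooded area, $\hat{Y}_i$ the prediction, and $\bar{Y}$ the mean of the target (the paper's own definitions may differ and should be stated explicitly):

```latex
\mathrm{NRMSE} = \frac{1}{\bar{Y}}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}
```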
2.5.3. It is a good idea to focus on training time, but then why not use a multilayer perceptron, which is much faster than LSTM in such a simplified case study?
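To make the comparison of training cost concrete, a minimal multilayer-perceptron baseline could be timed on the same flattened input window. Everything below (layer sizes, window length, the synthetic data) is a placeholder for illustration, not the authors' setup:

```python
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Placeholder dataset: each sample is a flattened 2-step input window
# (rainfall, infiltration, drainage at t-1 and t); target is flooded area.
X = rng.random((2000, 6))
y = X.sum(axis=1) + rng.normal(0, 0.05, 2000)

# Time the training of a small MLP as a cheap baseline.
t0 = time.perf_counter()
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X, y)
elapsed = time.perf_counter() - t0
print(f"training time: {elapsed:.2f} s, R^2 on training data: {mlp.score(X, y):.3f}")
```

Reporting such a baseline alongside the LSTM would show whether the recurrent architecture actually buys any accuracy for its extra computational cost on this simplified case study.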
2.6.1 Why are we talking about an effect? The concept should be defined and used consistently throughout the document. This entire section should be rewritten to make it understandable.
3.1: This entire section should be moved up to the chapter where the model is discussed.
Fig. 6: Provide a scale.
Provide a reference for InfoWorks ICM.
Specify what the three observed flows (not flooded areas) in Figure 7 correspond to. Furthermore, Figure 7 shows that the model completely misses the first peak. The observed rainfall intensities are significantly lower (0.6 mm) than the simulated artificial rainfall (100 mm/h). Why? Is it because the model does not accurately represent responses to light rainfall?
It is necessary to present a comprehensive and detailed overview of the performance of the hydrological model.
Equation 10 is more like that of a recurrent MLP...
Table 3: repetition.
Figure 8: is the increase in computation time proportional to the increase in signal length?
Figure 11 is difficult to read. Why does the normalised criterion vary so much? It is not consistent.
Figure 12: only the orange colour is visible.
Figure 17: which model are we talking about, the physical model or the LSTM? Review the ordinate axis: 'value' does not indicate the variable displayed.
We do not know how generalisation is measured. On which set?
Line 561: it is not possible to write "Controlled experiments with high-fidelity synthetic rainfall–inundation datasets". The physics-based model is poorly evaluated; the example shown (Fig. 7) completely misses the first event, which is far from high fidelity. The rest of the conclusion is therefore unproven. Even though my own experience agrees with the authors' conclusions (enough events of different types are needed), these conclusions are neither new nor rigorously proven.
Conclusion: "but beyond approximately 14,400 sequences": no, it is 14,400 time steps.