https://doi.org/10.5194/egusphere-2025-4125
14 Oct 2025
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective

Hao Hu, Fei Xiao, Peng Xu, Yuxuan Gao, Dongfang Liang, and Yizi Shang

Abstract. Reliable urban flood prediction hinges on how datasets are designed, yet most existing research disproportionately emphasizes network architectures over data foundations. This study systematically investigates how dataset characteristics—scale, feature composition, and rainfall-event distribution—govern predictive performance and generalization in AI-based flood modeling. A physically calibrated hydrological–hydrodynamic model was employed to generate synthetic datasets with varied temporal lengths, input feature combinations (rainfall, infiltration, drainage), and rainfall-intensity distributions. A long short-term memory (LSTM) network, chosen for its widespread adoption and proven performance in hydrology, was applied as a representative benchmark to assess accuracy, computational cost, and bias under controlled conditions. Results identify: (1) a threshold effect of dataset length (~14,400 samples), beyond which performance gains plateau; (2) rainfall-intensity distribution as the dominant driver of generalization—training solely on light or extreme events induces systematic bias, whereas mixed-intensity datasets substantially enhance robustness; (3) ancillary features (infiltration and drainage) improve stability only when data are sufficiently abundant. These findings quantify trade-offs and pinpoint actionable design levers, offering general insights into dataset design for machine learning models in flood prediction and beyond. By clarifying critical dataset requirements, this study provides transferable guidance for building efficient and balanced datasets in hydrology and broader Earth system sciences.
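
As an illustration of finding (2), the following is a minimal sketch of how a mixed-intensity training set might be assembled by drawing an equal number of rainfall events from each intensity class. It is not the authors' procedure: the class thresholds (10 and 30 mm/h), the event record layout, and the function names `intensity_class` and `mixed_intensity_sample` are all assumptions made for this example.

```python
import random
from collections import defaultdict


def intensity_class(peak_mm_per_h, light_max=10.0, extreme_min=30.0):
    """Bin an event by peak rainfall intensity.

    The thresholds are illustrative assumptions, not values from the study.
    """
    if peak_mm_per_h < light_max:
        return "light"
    if peak_mm_per_h < extreme_min:
        return "moderate"
    return "extreme"


def mixed_intensity_sample(events, per_class, seed=0):
    """Draw up to `per_class` events from each intensity class."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group events by their intensity class.
    by_class = defaultdict(list)
    for event in events:
        by_class[intensity_class(event["peak"])].append(event)

    # Sample without replacement from each class; take the whole class
    # if it holds fewer than `per_class` events.
    sample = []
    for label in ("light", "moderate", "extreme"):
        pool = by_class[label]
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample


# Hypothetical events: peak intensity in mm/h.
peaks = [2, 5, 8, 12, 18, 25, 35, 50, 80, 3, 40]
events = [{"id": i, "peak": p} for i, p in enumerate(peaks)]

train = mixed_intensity_sample(events, per_class=2)
```

Sampling per class rather than from the pooled events keeps light events, which are usually far more numerous in a rainfall record, from dominating the training set.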

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 25 Nov 2025)

Latest update: 14 Oct 2025
Short summary
Flooding in cities is becoming more frequent and damaging, yet accurate prediction remains difficult. This study shows that the way training data are designed strongly affects the reliability of machine learning forecasts. We find that more data are not always better, balanced rainfall conditions are essential, and extra features only help when enough data are available. These insights guide the design of efficient datasets for better flood early warning in cities with limited data.