Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective

Hu, Hao; Xiao, Fei; Xu, Peng; Gao, Yuxuan; Liang, Dongfang; Shang, Yizi

doi:10.5194/egusphere-2025-4125

Preprints

https://doi.org/10.5194/egusphere-2025-4125

Preprints

14 Oct 2025

| 14 Oct 2025

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective

Hao Hu, Fei Xiao, Peng Xu, Yuxuan Gao, Dongfang Liang, and Yizi Shang

Abstract. Reliable urban flood prediction hinges on how datasets are designed, yet most existing research disproportionately emphasizes network architectures over data foundations. This study systematically investigates how dataset characteristics—scale, feature composition, and rainfall-event distribution—govern predictive performance and generalization in AI-based flood modeling. A physically calibrated hydrological–hydrodynamic model was employed to generate synthetic datasets with varied temporal lengths, input feature combinations (rainfall, infiltration, drainage), and rainfall-intensity distributions. A long short-term memory (LSTM) network, chosen for its widespread adoption and proven performance in hydrology, was applied as a representative benchmark to assess accuracy, computational cost, and bias under controlled conditions. Results identify: (1) a threshold effect of dataset length (~14,400 samples), beyond which performance gains plateau; (2) rainfall-intensity distribution as the dominant driver of generalization—training solely on light or extreme events induces systematic bias, whereas mixed-intensity datasets substantially enhance robustness; (3) ancillary features (infiltration and drainage) improve stability only when data are sufficiently abundant. These findings quantify trade-offs and pinpoint actionable design levers, offering general insights into dataset design for machine learning models in flood prediction and beyond. By clarifying critical dataset requirements, this study provides transferable guidance for building efficient and balanced datasets in hydrology and broader Earth system sciences.

Received: 23 Aug 2025 – Discussion started: 14 Oct 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Hao Hu, Fei Xiao, Peng Xu, Yuxuan Gao, Dongfang Liang, and Yizi Shang

Status: open (until 25 Dec 2025)

Post a comment Subscribe to comment alert

Hao Hu, Fei Xiao, Peng Xu, Yuxuan Gao, Dongfang Liang, and Yizi Shang

Viewed

Total article views: 188 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
124	53	11	188	7	6

HTML: 124
PDF: 53
XML: 11
Total: 188
BibTeX: 7
EndNote: 6

Views and downloads (calculated since 14 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	104	26	4	134
Nov 2025	20	27	7	54

Cumulative views and downloads (calculated since 14 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	104	26	4	134
Nov 2025	20	27	7	54

Viewed (geographical distribution)

Total article views: 185 (including HTML, PDF, and XML) Thereof 185 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 26 Nov 2025

Short summary

Flooding in cities is becoming more frequent and damaging, yet accurate prediction remains difficult. This study shows that the way training data are designed strongly affects the reliability of machine learning forecasts. We find that more data are not always better, balanced rainfall conditions are essential, and extra features only help when enough data are available. These insights guide the design of efficient datasets for better flood early warning in cities with limited data.


Total:	0
HTML:	0
PDF:	0
XML:	0