the Creative Commons Attribution 4.0 License.
Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective
Abstract. Reliable urban flood prediction hinges on how datasets are designed, yet most existing research disproportionately emphasizes network architectures over data foundations. This study systematically investigates how dataset characteristics—scale, feature composition, and rainfall-event distribution—govern predictive performance and generalization in AI-based flood modeling. A physically calibrated hydrological–hydrodynamic model was employed to generate synthetic datasets with varied temporal lengths, input feature combinations (rainfall, infiltration, drainage), and rainfall-intensity distributions. A long short-term memory (LSTM) network, chosen for its widespread adoption and proven performance in hydrology, was applied as a representative benchmark to assess accuracy, computational cost, and bias under controlled conditions. Results identify: (1) a threshold effect of dataset length (~14,400 samples), beyond which performance gains plateau; (2) rainfall-intensity distribution as the dominant driver of generalization—training solely on light or extreme events induces systematic bias, whereas mixed-intensity datasets substantially enhance robustness; (3) ancillary features (infiltration and drainage) improve stability only when data are sufficiently abundant. These findings quantify trade-offs and pinpoint actionable design levers, offering general insights into dataset design for machine learning models in flood prediction and beyond. By clarifying critical dataset requirements, this study provides transferable guidance for building efficient and balanced datasets in hydrology and broader Earth system sciences.
Status: closed
-
RC1: 'Comment on egusphere-2025-4125', Anonymous Referee #1, 15 Feb 2026
-
AC1: 'Reply on RC1', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
-
AC4: 'Reply on RC1', Yizi Shang, 30 May 2026
Dear Reviewers,
Thank you for reviewing our manuscript. The comments have helped us strengthen the methodology presentation and clarify several points in the original text. We address each concern below; all corresponding revisions have been incorporated into the updated manuscript.
Below are our detailed, point-by-point responses to your major concerns (Part I) and specific comments (Part II).
Part I: Response to Major Comments
Comment 1. The physics-based model that generated the hydrological responses is not fully presented or validated. This is important because if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning.
Response: We agree — the synthetic data is only as reliable as the physical model behind it. In the updated manuscript, Section 2.2 now presents the InfoWorks ICM model with expanded detail and validation. We note that a physically based model of this catchment necessarily involves simplifications, but the validation results below demonstrate that these do not compromise the fidelity of the generated synthetic dataset.
1) The revised Section 2.2 includes a detailed overview of the 6,500 m² study area, covering buildings, squares, green spaces, roads, and drainage pipe networks.
2) We supplemented the initial calibration (Fig. 3) with a new validation phase covering three additional typical observed rainfall events (Table 2).
3) These supplementary validations yielded Nash-Sutcliffe Efficiency (NSE) values of 0.82, 0.75, and 0.88, demonstrating that the model accurately captures the dynamics of the hydrosystem.
4) To directly validate the LSTM model against physical reality, we added Fig. 13, which compares the inundation area hydrographs simulated by the InfoWorks ICM model and the LSTM model under real rainfall events.
5) Quantitative analysis of these real events reveals peak error percentages of approximately 3.8%, NSE values exceeding 0.8, and R² values above 0.9.
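For readers who want to check such figures against their own hydrographs, the two metrics quoted above can be sketched as follows. This is a minimal illustration using the standard Nash-Sutcliffe definition and a simple relative peak error; the function names and the sample values are hypothetical and are not taken from the manuscript.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed series is constant; NSE is undefined")
    return 1.0 - ss_res / ss_tot


def peak_error_percent(observed, simulated):
    """Absolute error of the simulated peak relative to the observed peak, in percent."""
    peak_obs = max(observed)
    return abs(max(simulated) - peak_obs) / peak_obs * 100.0


# Hypothetical hydrograph values, for illustration only (not from the manuscript).
obs = [0.0, 2.0, 6.0, 10.0, 7.0, 3.0, 1.0]
sim = [0.0, 2.5, 5.5, 9.5, 7.5, 3.0, 1.0]
print(round(nse(obs, sim), 3), round(peak_error_percent(obs, sim), 1))
```

An NSE of 1 means a perfect match, and 0 means the simulation is no better than predicting the observed mean.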
Comment 2. The simulated rainfall is very unrealistic (100 mm/h vs. 1 mm/h observed), and events all have the same duration.
Response: We apologize for the confusion regarding the rainfall units in the original manuscript. The observed rainfall measurements were presented in mm/min, which led to the apparent discrepancy when compared to the simulated design storms in mm/h. The updated manuscript now uses consistent units throughout. We recognize that these intensities are high, but they reflect the short-duration, extreme storm patterns characteristic of urban pluvial flooding in this region.
- The revised text clarifies that the observed storm intensities are consistent with the high-intensity nature of urban pluvial floods in the study area.
- The simulated storms utilize the Chicago design storm pattern (Eqs. 2–5), generating hydrographs for 1–10 year return periods specific to the region.
- Regarding the fixed duration: Each rainfall event is strictly set to a 24-hour duration followed by a 6-hour recession period (30 hours total).
- This controlled design isolates the specific effects of data length and rainfall intensity distribution on model performance. It prevents duration variations from introducing confounding factors into our statistical evaluation.
Comment 3. Justification for the LSTM model over a simpler Perceptron model.
Response: While a multilayer perceptron (MLP) is computationally lighter, short-duration, high-intensity urban flooding is a complex spatiotemporal nonlinear problem.
- We selected the LSTM network because it is designed to decode temporal dependencies and sequential rainfall-inundation responses, which are critical for minute-level high-frequency hydrological data.
- Our architecture uses a 30-minute sequence length and a 15-minute sliding window, capturing the dynamic response cycle of short-term rainfall-runoff.
- Furthermore, we constrained the model to a single-layer LSTM with 64 hidden neurons.
- This keeps training times efficient, averaging approximately 896 seconds on our hardware setup, addressing concerns about excessive energy consumption.
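To give a sense of scale for the single-layer, 64-unit design, the sketch below counts the trainable weights in one LSTM layer using the standard four-gate formulation. The input sizes 1 to 3 are an assumption that corresponds to feeding one, two, or three of the rainfall, infiltration, and drainage series; any output layer is left out because its shape is not given here.

```python
def lstm_param_count(input_size, hidden_size, biases_per_gate=1):
    """Trainable weights in one LSTM layer (four gates: input, forget, cell, output).

    Each gate has a weight matrix over the concatenated [input, hidden] vector
    plus `biases_per_gate` bias vectors of length `hidden_size`.
    """
    per_gate = hidden_size * (input_size + hidden_size) + biases_per_gate * hidden_size
    return 4 * per_gate


# Assumed input sizes: 1 (rainfall only) up to 3 (rainfall, infiltration, drainage).
for n_features in (1, 2, 3):
    print(n_features, lstm_param_count(n_features, hidden_size=64))
```

Under these assumptions the layer holds roughly 17,000 weights (16,896 to 17,408). Frameworks that keep two bias vectors per gate, as PyTorch's `nn.LSTM` does, report slightly more (set `biases_per_gate=2`).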
Comment 4. Lack of cross-validation for the LSTM model.
Response: This is an important point. We have restructured the evaluation framework to address it, as described below.
- The updated manuscript now employs a 5-fold cross-validation strategy across all dataset configurations (Section 2.4, Figs. 8–9).
- Data partitioning is conducted independently for each dataset configuration to prevent bias.
- The results are reported as the mean and standard deviation across the five folds, providing robust performance metrics that rule out anomalies caused by random validation splits.
- The corresponding box plots (Figs. 10–12) and data tables (Tables 6–8) confirm stable model performance across the folds.
Part II: Response to Specific Comments
1. The variability in maximum precipitation intensities is not representative (90 mm/h to 140 mm/h). Could start at 10 mm/h.
Response: The revised manuscript clarifies the categorization. In this study, the low-intensity training set consists of intensities of 90–120 mm/h, the high-intensity set spans 120–170 mm/h, and the mixed-intensity set spans 90–170 mm/h. These classifications are based on the historical threshold criteria for urban flooding in this catchment. Events starting at 10 mm/h in this area do not trigger meaningful surface inundation due to pipeline drainage capacity, which is why they are excluded from flood forecasting training. We agree that this range is narrow, but it reflects the actual flood-producing rainfall spectrum for this specific urban setting.
2.L128-135: Lack of information on the rainfall-runoff model, response time, and output hydrographs. Nash 0.5 may conceal mediocre simulations. Is there saturation for intense events?
Response: Supplementary validation for the physical model has been added to the revised manuscript. Table 2 now displays the observed vs. simulated peak flows, peak time errors, and NSE values for three independent events.
We also plotted the comparison of inundation area hydrographs (Fig. 13) under real rainfall events, demonstrating that the model accurately captures the rising limb, peak phase, and recession limb, proving the system is not oversimplified.
- Confusion between sequence length L171 and the LSTM sequence of 2 time steps (30 min).
Response: This nomenclature has been corrected throughout the revised manuscript. The fundamental sample unit is a 45-minute continuous sliding window sequence. The LSTM internally uses a 30-minute sequence length with a 15-minute step size. The "Dataset Length" variables (L1 to L6) refer to the total count of these sequence samples, ranging from 598 to 1,198.
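The stated range of 598 to 1,198 samples can be reproduced with simple window arithmetic. The sketch below is one reading inferred from figures given in this reply (30-hour events, 5 to 10 events, 1-minute sampling, a 45-minute window, a 15-minute step); it assumes the windows are drawn from the concatenated record, which the reply does not state explicitly.

```python
def count_windows(n_steps, window, stride):
    """Number of full windows of `window` steps taken every `stride` steps."""
    if n_steps < window:
        return 0
    return (n_steps - window) // stride + 1


STEPS_PER_EVENT = 30 * 60  # 30 h per event at an assumed 1-minute sampling interval
WINDOW = 45                # sample unit: 45-minute window
STRIDE = 15                # 15-minute step between consecutive windows

for n_events in (5, 10):
    total_steps = n_events * STEPS_PER_EVENT
    print(n_events, count_windows(total_steps, WINDOW, STRIDE))
```

This prints 598 for 5 events and 1,198 for 10 events. If windows were instead extracted separately within each event, the same settings would give 118 windows per event (590 to 1,180 in total), so the two readings differ slightly.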
- L137: q is the intensity of the rain, poorly chosen; L139: specify figure number; L148: specify ‘total precipitation’.
Response: The variables have been rewritten to align with standard hydrological nomenclature in the revised manuscript. Precipitation is now P_t, infiltration is I_t, drainage is D_t, and inundation area is Y_t.
Cumulative surface runoff volume is now defined mathematically via integration over time (Eq. 7) in the updated manuscript. All figure references have been corrected.
5. Problem with Eq. 2: time does not appear.
Response: The equations have been updated in the revised manuscript to reflect time dependencies. The water balance equation for the net surface runoff rate is now defined at time t as Q(t)=P(t)-I(t)-D(t).
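The water balance above, together with the cumulative runoff volume defined by integration over time, can be sketched numerically as follows. This is a minimal illustration, not the manuscript's Eq. 7: the trapezoidal rule, the unit time step, and the sample series are assumptions, and negative net rates are left as they are because the reply does not say whether they are clipped.

```python
def net_runoff_rate(p, i, d):
    """Q(t) = P(t) - I(t) - D(t), evaluated at each time step."""
    if not (len(p) == len(i) == len(d)):
        raise ValueError("P, I and D must have the same length")
    return [pt - it - dt for pt, it, dt in zip(p, i, d)]


def cumulative_volume(q, dt=1.0):
    """Running integral of Q over time using the trapezoidal rule."""
    if not q:
        return []
    total, out = 0.0, [0.0]
    for a, b in zip(q, q[1:]):
        total += 0.5 * (a + b) * dt
        out.append(total)
    return out


# Hypothetical rates in consistent units per time step (illustration only).
P = [0.0, 4.0, 10.0, 6.0, 2.0]   # precipitation
I = [0.0, 1.0, 2.0, 2.0, 1.0]    # infiltration
D = [0.0, 1.0, 3.0, 3.0, 1.0]    # drainage

Q = net_runoff_rate(P, I, D)
print(Q, cumulative_volume(Q))
```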
- L149: The output of the hydrological model called 'rainfall-runoff' is in fact a flooded area. This is an approximation.
Response: We agree and have included the exact mechanism mapping runoff to flooded area in the updated manuscript. Eq. 8 demonstrates how the instantaneous runoff rate is integrated into volume, then mapped to the inundation area Y(t) using the hydraulic conversion function established by the shallow water equations, DEM, surface slope, and depression storage. We note that this mapping, while an approximation, is standard practice in urban flood modeling and has been validated against observed events as shown in Fig. 13.
- L156-166: We would like to have the Nash criteria for the model by category of experiments.
Response: The Nash-Sutcliffe Efficiency (NSE) values for the physical model validation have been added in Table 2 of the revised manuscript, ranging from 0.75 to 0.88.
- L172-177: Specify duration in minutes and number of floods. At 90 mm per hour, it is difficult to consider the rain as ‘light’.
Response: 1) We have specified in the revised manuscript that the total duration configurations encompass 5 to 10 independent storm events, corresponding to cumulative durations of 150 to 300 hours. 2) We have also refined our terminology; we now refer to these events as "low-intensity" relative to the extreme storm design thresholds of the study area, rather than universally "light" rain.
- L178: How is the database divided into learning, testing and validation?
Response: The revised manuscript now employs a 5-fold cross-validation mechanism via stratified random sampling. In each iteration, one fold is held out for evaluation and the remaining four folds serve as the training set, ensuring no data leakage. We chose stratified sampling over simple random splits because it better preserves the distribution of rainfall intensities across training and test sets.
- L202-203: I do not understand how a sequence of two time steps can help approximate long-term memory.
Response: This was an error in the previous draft. The corrected manuscript now states that the sequence length is 30 time steps (representing 30 minutes, with a sampling frequency of 1 minute).
- L209: Specify what helps avoid overfitting? There should be no overfitting on a synthetic system.
Response: Although the data are synthetic, the system output still presents complex nonlinear mappings and varied temporal dynamics depending on the design storm. In the updated manuscript we describe using a learning rate of 0.005, 50 epochs, and the Adam optimizer with weight decay to control variance. We recognize that overfitting is less common with synthetic data, but our goal was to ensure reproducibility and avoid fitting to specific hydrograph shapes.
- Table 1: Specify units of time step, sequence, and batch size.
Response: Table 1 has been renumbered as Table 5 in the revised manuscript. Table 5 now includes a batch size of 32.
2.5.1. L220: Why mention rainfall when target is flooded area? Ymean not X.
Response: The NRMSE formula (Eq. 11) has been corrected in the revised manuscript so that the predicted, observed, and mean observed values all refer to the inundation area.
- 2.5.2: Express coefficient of determination using the same notation as NRMSE. Why is water depth mentioned?
Response: Eq. 12 has been updated in the revised manuscript to standardize the notation for R², and references to water depth have been replaced with "inundation area" to match the target variable.
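Because the formulas did not survive pasting, the sketch below shows one common way to compute the two metrics on an inundation-area series. It assumes NRMSE is the RMSE divided by the mean observed value and that R² takes the 1 - SS_res/SS_tot form; the manuscript's Eq. 11 and Eq. 12 may differ, and the sample values are hypothetical.

```python
import math


def nrmse(observed, predicted):
    """Root-mean-square error divided by the mean of the observed series."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rmse = math.sqrt(sum((y - yh) ** 2 for y, yh in zip(observed, predicted)) / n)
    return rmse / mean_obs


def r_squared(observed, predicted):
    """Coefficient of determination in the 1 - SS_res / SS_tot form."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical inundation areas in m² (illustration only).
y_obs = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
print(nrmse(y_obs, y_pred), r_squared(y_obs, y_pred))
```

In this form R² is algebraically the same quantity as the Nash-Sutcliffe Efficiency. If the manuscript instead uses the squared Pearson correlation, the two will not coincide in general.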
2.6.1: Why are we talking about an effect? The concept should be defined.
Response: We rewrote this section in the revised manuscript to introduce a formal Multi-factor Analysis of Variance (MANOVA) framework. The updated text defines statistical main effects, interaction effects, significance testing (F-value, p-value), and effect size estimation (partial and ).
3.1: Move entire section to the chapter where the model is discussed.
Response: Agreed. The physical model calibration and validation content has been moved to Section 2.2 in the revised manuscript.
Fig 6, 7, 8, 11, 12, 17 formatting and readability.
Response: All figures have been revised in the updated manuscript. We added proper scales and clearer colors. The confusing performance variance charts (formerly Fig 11/12) have been replaced with box plots showing 5-fold CV performance (Figs. 10–12). Figure 17 now labels axes and shows the linear growth of training time against data scale.
Line 561: "Controlled experiments with high-fidelity...". The physics-based model is poorly evaluated; the conclusion is unproven.
Response: With the additional physical model validation (Table 2 and the direct comparison against real events in Fig. 13, yielding NSE > 0.8), the model’s fidelity is now demonstrated in the revised manuscript. We recognize that no physical model is perfect, but these metrics support the use of the synthetic dataset for the subsequent LSTM experiments. The conclusions drawn from the synthetic data are therefore based on a more thoroughly characterized physical baseline.
Conclusion: "beyond approximately 14,400 sequences" no, it is 14,400 time steps.
Response: The revised manuscript now states that the threshold is 14,400 samples (where each sample is an extracted 45-minute time-series window).
We hope these revisions address your concerns. The updated manuscript reflects all changes discussed above. Thank you for your time and careful review.
Some formulas may not paste correctly. Please refer to the attached PDF file for the full and detailed response.
-
CC1: 'Comment on egusphere-2025-4125', Nima Zafarmomen, 11 Mar 2026
The manuscript presents a timely and well-structured investigation into an important but often overlooked issue in AI-based urban pluvial flood prediction: dataset design. Rather than focusing only on model architecture, the study systematically evaluates how dataset length, feature composition, and rainfall-intensity distribution affect predictive skill, computational cost, and generalization. Overall, this is a solid and relevant contribution, and I believe the manuscript is worth publishing after minor revision.
1) The manuscript should improve consistency in the statistical terminology. Section 2.6 refers to MANOVA, whereas Section 4.3 describes the analysis as multi-factor ANOVA. The authors should clarify which method was actually used and ensure the terminology is consistent throughout the manuscript.
2) The target variable should be described more consistently. In some places the manuscript defines the prediction target as inundation area, while elsewhere it discusses peak water level, runoff volume, and water depth.
3) Some of the conclusion wording is stronger and more conversational than is typical for a scientific paper. Phrases such as “respect the 14k-sample ceiling” and “start lean, enrich later” may be better rephrased in a more formal academic style. The ideas are useful, but the tone could be made more precise and neutral.
4) In addition, I do strongly recommend authors consider citing the following recent and relevant paper, which is closely related to AI surrogate modeling in urban drainage networks and prediction of node-level hydraulic states: Zafarmomen, N., Samadi, V., and Borgomeo, E. (2026). Spatiotemporal SWMM-LSTM surrogate modeling for efficient node-level water depth and inflow prediction in urban drainage networks. Cambridge University Press, published online 13 January 2026.
Citation: https://doi.org/10.5194/egusphere-2025-4125-CC1 -
AC5: 'Reply on CC1', Yizi Shang, 30 May 2026
Dear Reviewer,
Thank you very much for your positive assessment of our manuscript and for recognizing the value of our focus on dataset design in AI-based urban flood prediction. We greatly appreciate your constructive suggestions, which have been highly helpful in refining the clarity, consistency, and academic rigor of our paper.
Below, please find our detailed, point-by-point responses to your comments:
- The manuscript should improve consistency in the statistical terminology. Section 2.6 refers to MANOVA, whereas Section 4.3 describes the analysis as multi-factor ANOVA. The authors should clarify which method was actually used and ensure the terminology is consistent throughout the manuscript.
Response: We sincerely apologize for the inconsistency in the statistical terminology in the previous draft. We have clarified and corrected this throughout the revised manuscript. The actual method used was Multi-factor Analysis of Variance (Factorial ANOVA), as we independently analyzed the main and interaction effects for three separate dependent variables (training time, NRMSE, and R²), rather than conducting a joint multivariate analysis. We have thoroughly reviewed Sections 2.6, 3.3.4, and the Discussion to ensure that the term "Multi-factor Analysis of Variance" is used consistently and correctly.
- The target variable should be described more consistently. In some places the manuscript defines the prediction target as inundation area, while elsewhere it discusses peak water level, runoff volume, and water depth.
Response: Thank you for pointing out this ambiguity. We have carefully revised the manuscript to ensure absolute consistency. To clarify: the ultimate and sole prediction target of the LSTM model in this study is the Inundation Area (m²). The terms "runoff volume," "water depth," and "pipe flow" are intermediate physical variables generated during the mechanistic simulation process within InfoWorks ICM (as described in Eq. 7 and Eq. 8, where instantaneous runoff is mapped to the final inundation area). We have combed through the entire manuscript to strictly differentiate between the intermediate physical variables of the hydrodynamic model and the final predictive target of the deep learning model. Any confusing references to predicting "water depth" or "peak levels" by the LSTM have been uniformly corrected to "inundation area."
- Some of the conclusion wording is stronger and more conversational than is typical for a scientific paper. Phrases such as “respect the 14k-sample ceiling” and “start lean, enrich later” may be better rephrased in a more formal academic style. The ideas are useful, but the tone could be made more precise and neutral.
Response: We completely agree with your feedback regarding the tone. We have thoroughly revised the Discussion and Conclusion sections to remove conversational phrasing and ensure a formal, precise, and objective academic tone.
- Phrases like "respect the 14k-sample ceiling" have been formally rephrased to: "performance gains plateau once the sample size reaches a critical threshold of approximately 14,400 sequences under the current experimental setup."
- Phrases like "start lean, enrich later" have been replaced with precise methodological recommendations, such as: "under data scarcity conditions, a minimalist input structure should be maintained... whereas introducing auxiliary features is an effective strategy only when data resources are abundant."
- In addition, I do strongly recommend authors consider citing the following recent and relevant paper, which is closely related to AI surrogate modeling in urban drainage networks and prediction of node-level hydraulic states: Zafarmomen, N., Samadi, V., and Borgomeo, E. (2026). Spatiotemporal SWMM-LSTM surrogate modeling for efficient node-level water depth and inflow prediction in urban drainage networks. Cambridge University Press, published online 13 January 2026.
Response: We are very grateful for this excellent literature recommendation. This recent publication is indeed highly relevant to our work and provides critical, up-to-date context regarding the application of SWMM-LSTM surrogate models for fine-scale spatiotemporal and node-level hydraulic predictions. We have incorporated this citation into the Introduction (Section 1) when discussing the latest advancements in deep learning-based data-driven models and spatiotemporal prediction techniques, as well as in the Discussion to contextualize our findings within the broader trajectory of AI surrogate modeling in urban drainage systems.
Thank you once again for your insightful and encouraging review. We believe these minor revisions have significantly strengthened the professional presentation of our work, and we hope the revised manuscript now fully meets your expectations.
-
-
RC2: 'Comment on egusphere-2025-4125', Anonymous Referee #2, 12 Mar 2026
The manuscript addresses a highly relevant topic: the optimization of dataset design for Deep Learning models (specifically LSTM) in urban pluvial flood prediction. The study identifies a "threshold effect" regarding dataset size (~14,400 samples) and highlights that rainfall intensity distribution is more critical for model generalization than raw data volume. While the work is interesting and well-motivated, several critical methodological clarifications and improvements in the presentation of results are required before the manuscript can be recommended for publication.
Specific Comments:
Introduction and Generalization: The motivation is clear and well-structured. However, line 87 states that insights from the LSTM model are expected to be transferred to other ML models. The authors should justify this claim, explaining why these findings (especially threshold effects) would apply to other models.
Section 2.2: An NSE is reported for the hydraulic model, but more detail is needed. What variable does this NSE represent (water levels, discharge, or flood extent)? At which control points and under what rainfall events was this validated?
- Line 134: Is there a specific reference for the IDF formula used?
- Line 147: The equation for a distributed model is an oversimplification as it neglects lateral flows between cells and surface storage. This needs a more rigorous explanation.
- Line 148: There is a typo in “III”; it should likely be "I" for Infiltration.
Section 2.3:
- Line 161: For consistency, "Configuration 4" should be labeled: “Rainfall (P) + Soil infiltration (I) + Pipe drainage (D)”.
- Line 172: Specify the unit for sequence length. Are these time steps or hours?
Section 3.3:
- Line 332: Data Leakage Prevention? The manuscript mentions the use of overlapping sliding windows. It is crucial to clarify whether the data split (train/val/test) was performed before or after generating these windows. If done after, there is a high risk of information leakage, which would invalidate the reported generalization performance.
Section 4.1:
- Line 402: The claim that performance decays after Level 4 is difficult to discern in the current figure for low-intensity and mixed-intensity.
- Line 423: Does the model account for manhole overflows? This is a critical factor in urban pluvial flooding.
General comments:
- There is confusion between the use of MANOVA and ANOVA. The description suggests a factorial ANOVA for individual metrics, yet MANOVA is mentioned. Please clarify the exact statistical framework and how the covariance between multiple dependent variables was handled.
- A major omission is the time required to generate the dataset. Since InfoWorks ICM simulations are computationally intensive, the authors must provide details on the total simulation time, hardware used, and a comparison between the "data investment" time vs. the AI's real-time prediction advantage.
Figures and Legibility:
- Figure 7: What does "Y0333333.1" represent? Legend labels should be descriptive and self-explanatory.
- Figures 8, 9, and 11: Figure 8 (Training time 1, 2, 3) is difficult to read and seems redundant. Consider merging the essential information into Figure 9 and removing Figure 11 if it does not provide unique insights.
- Figure 17: Clarify the “Value” axis. The figure as it stands is not sufficiently explanatory for the conclusions drawn immediately below it.
Discussion: The discussion lacks sufficient citations to back its claims, particularly in line 534. The results should be contextualized by comparing them with existing literature on dataset design in computational hydrology.
In conclusion, the paper presents a timely and valuable contribution to AI applications in hydrology by shifting the focus from “model architecture” to “dataset design.” However, several critical issues need to be addressed. First, the lack of transparency regarding the computational cost of data generation makes it difficult to assess the efficiency of the proposed framework, as the “cost of data” is as relevant as model accuracy. Second, the potential for data leakage in sliding-window approaches must be resolved to ensure the validity of the results. Finally, the hydrological representation is oversimplified, and mass balance considerations and calibration procedures are described too briefly. Strengthening the physical basis of the synthetic data is essential. To recommend this work, the authors need to address these points and resolve the concerns regarding computational feasibility, data handling, and the hydrological basis of the simulations.
Citation: https://doi.org/10.5194/egusphere-2025-4125-RC2 -
AC3: 'Reply on RC2', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
-
AC6: 'Reply on RC2', Yizi Shang, 30 May 2026
Dear Reviewers,
Thank you for your comments on our manuscript, "Impact of dataset design on LSTM-based urban pluvial flood prediction: Length, feature dimensions, and rainfall stratification." Your suggestions have helped improve the rigor and clarity of our study.
We have addressed all comments in the revised manuscript. Below are our detailed, point-by-point responses:
Part I: Response to Specific Comments
- Introduction and Generalization: The motivation is clear and well-structured. However, line 87 states that insights from the LSTM model are expected to be transferred to other ML models. The authors should justify this claim, explaining why these findings (especially threshold effects) would apply to other models.
Response: We agree that claiming absolute generalizability to other machine learning models without empirical evidence lacks rigor. In the revised manuscript, we have redefined the LSTM as a "representative baseline sequence model" and revised the discussion and conclusion to clarify: "Although the patterns derived under the LSTM baseline framework offer reference values for other sequence learning models, broader generalization (e.g., extension to Graph Neural Networks) necessitates further empirical validation".
- Section 2.2: "An NSE is reported for the hydraulic model, but more detail is needed. What variable does this NSE represent... At which control points and under what rainfall events was this validated?"
Response: We have supplemented Section 2.2 with detailed information on the hydrodynamic model validation.
- Variables and Control Points: The NSE represents the pipe flow rate (m³/s) at key monitoring nodes within the drainage pipe network.
- Rainfall Events for Validation: In addition to the initial calibration, we added a supplementary validation phase for three typical observed rainfall events (with total rainfall volumes of 80 mm, 120 mm, and 50 mm, respectively) (Table 2), yielding NSE values of 0.82, 0.75, and 0.88. Furthermore, we added Figure 13 to directly compare the inundation area hydrographs simulated by InfoWorks ICM and the LSTM model under real rainfall events, demonstrating that the NSE values for the simulated events exceed 0.8.
- Line 134: "Is there a specific reference for the IDF formula used?"
Response: We have added the source and specific parameters of the storm intensity (IDF) formula. The Chicago design storm instantaneous rainfall intensity formula (Eq. 5) was fitted from local historical rainfall data: q = 1602 (1 + 1.037 lg p) / (t + 11.593)^0.681. We have also detailed the geographical and statistical significance of each empirical parameter (e.g., A1, C, b, n).
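The fitted intensity formula quoted in this response can be evaluated directly. The sketch below is a minimal illustration that reads lg as the base-10 logarithm, t as the duration in minutes, and p as the return period in years; these readings and the function name are assumptions, and the units of q are not stated in the reply.

```python
import math


def storm_intensity(t_min, return_period_yr):
    """q = 1602 (1 + 1.037 lg p) / (t + 11.593)^0.681, with lg read as log base 10."""
    if t_min < 0 or return_period_yr <= 0:
        raise ValueError("duration must be >= 0 and return period must be > 0")
    numerator = 1602.0 * (1.0 + 1.037 * math.log10(return_period_yr))
    return numerator / (t_min + 11.593) ** 0.681


# Intensity for a few durations across the 1 to 10 year return periods.
for p in (1, 5, 10):
    print(p, [round(storm_intensity(t, p), 1) for t in (5, 30, 60, 120)])
```

With this form the intensity falls as the duration grows and rises with the return period; a 10-year event is 2.037 times as intense as a 1-year event at any given duration.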
- Line 147: The equation for a distributed model is an oversimplification as it neglects lateral flows between cells and surface storage. This needs a more rigorous explanation.
Response: We have clarified that Equation 6 is strictly used to calculate the net surface runoff rate (excess infiltration rate) prior to concentration. The actual physical processes—2D spatial concentration, lateral flow between grid cells, and surface depression storage—are handled by the InfoWorks ICM 2D module based on shallow water equations and a DEM. This hydraulic conversion is formally expressed via Equation 8, which incorporates surface slope and depression storage.
- Line 148: "There is a typo in “III”; it should likely be "I" for Infiltration."
Response: We have corrected this typographical error and now uniformly use "I" to represent Soil Infiltration throughout the manuscript.
- Section 2.3 (Line 161): "For consistency, 'Configuration 4' should be labeled: 'Rainfall (P) + Soil infiltration (I) + Pipe drainage (D)'."
Response: Accepted. In Table 4 and the relevant text, Combination 4 has been standardized to: "Precipitation (P) + Soil Infiltration (I) + Pipe Drainage (D) → Inundation Area (Y)".
- Line 172: "Specify the unit for sequence length. Are these time steps or hours?"
Response: The temporal sampling frequency is once per minute; the LSTM sequence length of "30" represents 30 time steps (30 minutes). This has been clarified in the manuscript.
- Section 3.3 (Line 332): Data Leakage Prevention? The manuscript mentions the use of overlapping sliding windows. It is crucial to clarify whether the data split (train/val/test) was performed before or after generating these windows. If done after, there is a high risk of information leakage, which would invalidate the reported generalization performance.
Response: This is an important concern regarding data leakage. Our data partitioning logic is as follows: first, we randomly sample independent storm events from the total sample space; we then extract sliding windows within each selected independent event; finally, we employ 5-fold cross-validation. This ensures that the training and validation sets remain independent under random shuffling and that the partitioning prevents future information leakage. Additionally, the 6-hour recession period between events ensures physical independence between them.
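One way to make the ordering concrete is to assign whole events to folds first and only then cut windows inside each event. The sketch below assumes the folds are formed at the event level, which this response implies but does not state outright. The event data, window length, stride, and function names are placeholders, not the authors' code.

```python
import random


def sliding_windows(series, length=30, step=15):
    """Cut overlapping windows from a single event's time series."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, step)]


def event_level_folds(event_ids, k=5, seed=0):
    """Shuffle event IDs and deal them into k disjoint folds."""
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


# Ten hypothetical events, each 1800 one-minute steps (30 hours).
events = {e: list(range(1800)) for e in range(10)}
folds = event_level_folds(events.keys(), k=5)

for held_out in folds:
    train_ids = [e for e in events if e not in held_out]
    train = [(e, w) for e in train_ids for w in sliding_windows(events[e])]
    val = [(e, w) for e in held_out for w in sliding_windows(events[e])]
    # No event contributes windows to both sides of the split.
    assert not {e for e, _ in train} & {e for e, _ in val}
```

Because windows never cross an event boundary and each event belongs to exactly one fold, overlapping windows cannot appear on both sides of a split.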
- Section 4.1 (Line 402): The claim that performance decays after Level 4 is difficult to discern in the current figure for low-intensity and mixed-intensity.
Response: In the revised manuscript, we have corrected the term "decays," describing it more accurately as a "saturation effect" or "plateau." As shown in Section 3.3.1 and Figure 17, when data volume reaches L4 (~14,400 samples), model performance achieves a qualitative leap. Beyond L4, increasing data to L5 and L6 yields only marginal NRMSE reduction and minimal R² improvement—the gains plateau rather than decay.
- Line 423: Does the model account for manhole overflows? This is a critical factor in urban pluvial flooding.
Response: Yes, our coupled 1D–2D hydrodynamic model fully accounts for manhole overflows. When pipe drainage capacity reaches saturation, excess water overflows to the surface through manholes, and the evolution of its inundation area on the surface is calculated by the 2D module, as described in Sections 2.2 and 2.3.
Part II: Response to General Comments
- There is confusion between the use of MANOVA and ANOVA. The description suggests a factorial ANOVA for individual metrics, yet MANOVA is mentioned. Please clarify the exact statistical framework and how the covariance between multiple dependent variables was handled.
Response: We have corrected the statistical terminology in the revised manuscript: we conducted independent Multi-factor Analysis of Variance (Factorial ANOVA) for each dependent variable (training time, NRMSE, R²) to quantify the main and interaction effects of dataset length, rainfall level, and feature combinations, rather than a joint MANOVA. Results are detailed in Figures 20–22.
- A major omission is the time required to generate the dataset. Since InfoWorks ICM simulations are computationally intensive, the authors must provide details on the total simulation time, hardware used, and a comparison between the "data investment" time vs. the AI's real-time prediction advantage.
Response: This is a valid point. We have added the hardware specifications (Intel i9 + NVIDIA RTX 3090 GPU) and a comparison between the offline data generation cost and the AI's real-time prediction advantage. The LSTM training time is ~896 seconds (~15 minutes); once the training set is generated offline using the physical model, the AI model delivers predictions within seconds during deployment.
- Figures and Legibility:
Regarding "Y0333333.1" in Figure 7: This was a data label artifact from the original plotting. We have redrawn the legends for Figures 5, 6, and 7 with clear descriptive labels.
Regarding the redundancy and legibility of Figures 8, 9, and 11: We have significantly streamlined and consolidated the cross-validation charts. We removed the old, redundant, and hard-to-read training time graphs, integrating the core 5-fold cross-validation mechanism into Figure 8 (matrix plot) and Figure 9 (flowchart). Simultaneously, the validation performance of each configuration was transformed into highly intuitive box plots (Figures 10, 11, and 12), clearly displaying data distributions and means.
Regarding the "Value" axis in Figure 17: We have explicitly relabeled the vertical axes of Figure 17. The vertical axes of the three subplots are now clearly labeled "NRMSE", "R²", and "Training Time (s)", directly and clearly supporting the subsequent conclusion regarding performance saturation after reaching the L4 data volume threshold.
- Discussion: The discussion lacks sufficient citations to back its claims, particularly in line 534. The results should be contextualized by comparing them with existing literature on dataset design in computational hydrology.
Response: We have enriched the discussion and literature review, comparing our findings with existing research on dataset design in computational hydrology—including recent work on extreme event prediction and rainfall distribution in deep learning, which further supports the need for stratified mixed sampling under data-scarce conditions.
Conclusion:
We appreciate your recognition of the value of our research in "shifting the focus from model architecture to dataset design." By supplementing the hydrodynamic model validation, clarifying the data split mechanism, correcting statistical terminology, and optimizing figure legibility, we have addressed concerns regarding computational feasibility, data handling, and hydrological physical basis. We hope the revised manuscript meets the journal's publication requirements.
Some formulas may not paste correctly. Please refer to the attached PDF file for the full and detailed response.
-
AC2: 'Reply on RC1', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
-
RC1: 'Comment on egusphere-2025-4125', Anonymous Referee #1, 15 Feb 2026
Review of the paper “Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective.”
By Hao Hu et al.
In general, the introduction is well constructed and poses questions in a logical and relevant manner. There is indeed an advantage to using a synthetic database: beyond the lack of available data, it allows us to overcome the bias-variance dilemma that explains overfitting.
However, in the rest of the article, we note that the methodology is not correctly deployed: 1) the physics-based model that generated the hydrological responses is not fully presented or validated. This is important because if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning. Furthermore, 2) the simulated rainfall is very unrealistic, and simulated rainfall intensities (100 mm/h) are significantly higher than those observed and applied to the physical model (1 mm/h, see Fig. 7). It is not normal to learn this from values read on a figure. It is necessary to consider the consequences: can a model calibrated with rainfall of 1 mm/h simulate responses to rainfall of 100 mm/h? What are the consequences? The simulated rainfall events all have the same duration; why is this? One might imagine that longer rainfall events would have a greater impact in the model due to their greater power. The simulated rainfall therefore lacks the diversity needed to evaluate all possible states of the system. 3) Regarding the choice of the LSTM model, we do not choose a model because it is fashionable. We choose it because it is suited to the problem at hand. Furthermore, LSTM is not the most widely used model; it seems that for high-intensity/rapid floods, the perceptron is more commonly used. The latter is much less complex, consumes much less energy and contributes more to a sustainable world than LSTM.
Furthermore: 4) regarding the validation of the LSTM model, it can change drastically depending on the events used in the validation, so it is recommended to evaluate them with cross-validation, which is not done in the paper.
All these elements suggest that the chosen methodology is not properly thought out. And we doubt the meaning that the results could have.
Regarding the quality of the presentation, it must be noted that the writing is poor. Several parts are redundant (e.g. the description of rainfall), while others are missing (how is generalisation assessed?). The notation changes throughout the article. Even the model output is not defined consistently; sometimes we have a flow rate (L127), sometimes a flooded area (L146) and sometimes a water height (L230). How is this possible? The only explanation I can see is that an AI wrote parts of it, and they were not properly corrected.
The conclusion provides little new information: yes, the quality of the model is affected by the size of the database: we already know that; yes, events of different types and intensities must be presented to the training model. Indeed, we already know that the training database must include a representation of the entire state space. It is also obvious that the calculation time increases with the length of the training database. On the other hand, no: the model that has learned from synthetic data, i.e. without noise or uncertainty, cannot be overfitted. The conclusions relating to this last point are therefore false, and explanations must be sought elsewhere.
Unfortunately, the simplistic nature of rainfall events does not allow us to explore other avenues or quantify thresholds; for example, how many events are needed to reach the plateau.
Unfortunately, even though the subject seemed interesting, in view of all these factors, it is very difficult to recommend this paper for publication. It needs to be completely rethought and rewritten.
Specific comments
The variability in maximum precipitation intensities is not representative of real cases. 90mm/h to 140mm/h could have been distributed more widely, for example starting at 10 mm/h.
L128-135, there is a serious lack of information on the rainfall-runoff model; in particular, what is its response time? We would like to see the output hydrographs from this model. A Nash criterion of 0.5 may conceal simulations that reproduce the dynamics of the hydrosystem, or mediocre simulations with, for example, a very low peak flow/height, which does not represent the observed situation. We would like to see a more detailed analysis of the model outputs for intense and very intense rainfall events. Is there saturation for the most intense events?
This is problematic because if the rainfall-runoff model over-tops the floods or oversimplifies the real situation (the functioning of the hydrosystem), it is clear that the issue is then greatly simplified... this could even lead to doubts about the reliability of the conclusions.
Confusion between sequence length L171 and the LSTM sequence of 2 time steps (30 min).
L137, q is the intensity of the rain, which is poorly chosen, as q generally represents the flow.
L139 specify the figure number instead of ‘this formula’.
L148, specify what ‘total precipitation’ is: the cumulative amount since the start of the event?
The problem with equation 2 is that time does not appear. If the system is truly dynamic and time-dependent, this must be indicated in the equations. And better define what ‘total precipitation’ means.
L149. The output of the hydrological model called ‘rainfall-runoff’ is in fact a flooded area. This is an approximation that is very far from reality and reflects an approximation that is not acceptable if it is not explained and justified. Once again, the model is described too quickly.
L156-166: we would like to have the Nash criteria for the model by category of experiments.
L172-177: specify the duration in minutes and the number of floods.
L178: it is good that there is a validation set, but how is the database divided into learning, testing and validation?
L172 to L177 represents 5 floods, 6 floods, 7 floods, etc., up to 10 floods. It is a detail, but at 90 mm per hour, it is difficult to consider the rain as ‘light’, especially in an urban environment.
L202-203: I do not understand how a sequence of two time steps can help to approximate long-term memory. The LSTM is clearly misused, or rather it is not recommended for a synthetic system with a supposed response time of 2 time steps (the former information is lacking).
L 209: specify what helps to avoid overfitting? The batch, the learning rate or the epoch. Overfitting is avoided with regularisation methods. But a priori there should be no overfitting on a synthetic system, as there is no noise and no uncertainty.
Table 1: specify the units of time step, sequence and batch size more precisely.
2.5.1
L 220: why mention rainfall when the target is the flooded area? Also, if Y is the target measured by the criterion, the denominator should be Ymean, not X, which is not defined.
2.5.2 It is not specified what x represents. According to the text, it could be rainfall, but rainfall is referred to as Q L137
2.5.2 Express the coefficient of determination using the same notation as the NRMSE. Why is water depth mentioned here? Isn't the target the flooded area?
2.5.3. It is a good idea to focus on training time, but then why not use a multilayer perceptron, which is much faster than LSTM in such a simplified case study?
2.6.1 Why are we talking about an effect? The concept should be defined and used consistently throughout the document. This entire section should be rewritten to make it understandable.
3.1: This entire section should be moved up to the chapter where the model is discussed.
Fig. 6: Provide a scale.
Provide a reference for InfoWorks ICM.
Specify what the three observed flows (not the flooded area) correspond to in Figure 7. Furthermore, we note in Figure 7 that the model completely misses the first peak. The observed rainfall intensities are significantly lower (0.6 mm) than the simulated artificial rainfall (100 mm/h). Why is this? Is it because the model does not accurately represent responses to light rainfall?
It is necessary to present a comprehensive and detailed overview of the performance of the hydrological model.
Equation 10 is more like that of a recurrent MLP...
Table 3: repetition.
Figure 8: is the increase in computation time proportional to the increase in signal length?
Figure 11 is difficult to read. Why does the normalised criterion vary so much? It is not consistent.
Figure 12: only the orange colour is visible.
Figure 17: which model are we talking about? The physical model or the LSTM? Review the ordinates: ‘value’ does not indicate the variable displayed.
We do not know how generalisation is measured. On which set?
Line 561: it is not possible to write "Controlled experiments with high-fidelity synthetic rainfall–inundation datasets". The physics-based model is poorly evaluated; the example shown (Fig. 7) completely misses the first event, which is far from high fidelity. The rest of the conclusion is therefore unproven. Even if my experience is in line with the authors' conclusions: enough events and different types are needed; these conclusions are neither new nor rigorously proven.
Conclusion: "but beyond approximately 14,400 sequences" no, it is 14,400 time steps.
Citation: https://doi.org/10.5194/egusphere-2025-4125-RC1 -
AC1: 'Reply on RC1', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
-
AC4: 'Reply on RC1', Yizi Shang, 30 May 2026
Dear Reviewers,
Thank you for reviewing our manuscript. The comments have helped us strengthen the methodology presentation and clarify several points in the original text. We address each concern below; all corresponding revisions have been incorporated into the updated manuscript.
Below are our detailed, point-by-point responses to your major concerns (Part I) and specific comments (Part II).
Part I: Response to Major Comments
Comment 1. The physics-based model that generated the hydrological responses is not fully presented or validated. This is important because if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning.
Response: We agree — the synthetic data is only as reliable as the physical model behind it. In the updated manuscript, Section 2.2 now presents the InfoWorks ICM model with expanded detail and validation. We note that a physically based model of this catchment necessarily involves simplifications, but the validation results below demonstrate that these do not compromise the fidelity of the generated synthetic dataset.
1) The revised Section 2.2 includes a detailed overview of the 6,500 m² study area, covering buildings, squares, green spaces, roads, and drainage pipe networks.
2) We supplemented the initial calibration (Fig. 3) with a new validation phase covering three additional typical observed rainfall events (Table 2).
3) These supplementary validations yielded Nash-Sutcliffe Efficiency (NSE) values of 0.82, 0.75, and 0.88, demonstrating that the model accurately captures the dynamics of the hydrosystem.
4) To directly validate the LSTM model against physical reality, we added Fig. 13, which compares the inundation area hydrographs simulated by the InfoWorks ICM model and the LSTM model under real rainfall events.
5) Quantitative analysis of these real events reveals peak error percentages of approximately 3.8%, NSE values exceeding 0.8, and R² values above 0.9.
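For readers who want to reproduce this kind of check on their own hydrographs, the two headline metrics can be computed as in the sketch below. The NSE follows its standard definition. The peak error is assumed here to be the relative difference between simulated and observed peak values, which the reply does not define explicitly. The function names are hypothetical.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SS_res / SS_tot about the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def peak_error_percent(observed, simulated):
    """Relative peak difference in percent (assumed definition)."""
    peak_obs = max(observed)
    return 100.0 * abs(max(simulated) - peak_obs) / peak_obs
```

An NSE of 1 indicates a perfect match, while 0 means the simulation is no better than predicting the observed mean, so a high NSE alone does not guarantee that an individual peak is captured; this is why the peak error is reported alongside it.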
Comment 2. The simulated rainfall is very unrealistic (100 mm/h vs. 1 mm/h observed), and events all have the same duration.
Response: We apologize for the confusion regarding the rainfall units in the original manuscript. The observed rainfall measurements were presented in mm/min, which led to the apparent discrepancy when compared to the simulated design storms in mm/h. The updated manuscript now uses consistent units throughout. We recognize that these intensities are high, but they reflect the short-duration, extreme storm patterns characteristic of urban pluvial flooding in this region.
- The revised text clarifies that the observed storm intensities are consistent with the high-intensity nature of urban pluvial floods in the study area.
- The simulated storms utilize the Chicago design storm pattern (Eqs. 2–5), generating hydrographs for 1–10 year return periods specific to the region.
- Regarding the fixed duration: Each rainfall event is strictly set to a 24-hour duration followed by a 6-hour recession period (30 hours total).
- This controlled design isolates the specific effects of data length and rainfall intensity distribution on model performance. It prevents duration variations from introducing confounding factors into our statistical evaluation.
Comment 3. Justification for the LSTM model over a simpler Perceptron model.
Response: While a multilayer perceptron (MLP) is computationally lighter, short-duration, high-intensity urban flooding is a complex spatiotemporal nonlinear problem.
- We selected the LSTM network because it is designed to decode temporal dependencies and sequential rainfall-inundation responses, which are critical for minute-level high-frequency hydrological data.
- Our architecture uses a 30-minute sequence length and a 15-minute sliding window, capturing the dynamic response cycle of short-term rainfall-runoff.
- Furthermore, we constrained the model to a single-layer LSTM with 64 hidden neurons.
- This keeps training times efficient, averaging approximately 896 seconds on our hardware setup, addressing concerns about excessive energy consumption.
Comment 4. Lack of cross-validation for the LSTM model.
Response: This is an important point. We have restructured the evaluation framework to address it, as described below.
- The updated manuscript now employs a 5-fold cross-validation strategy across all dataset configurations (Section 2.4, Figs. 8–9).
- Data partitioning is conducted independently for each dataset configuration to prevent bias.
- The results are reported as the mean and standard deviation across the five folds, providing robust performance metrics that rule out anomalies caused by random validation splits.
- The corresponding box plots (Figs. 10–12) and data tables (Tables 6–8) confirm stable model performance across the folds.
Part II: Response to Specific Comments
1. The variability in maximum precipitation intensities is not representative (90mm/h to 140mm/h). Could start at 10 mm/h.
Response: The revised manuscript clarifies the categorization. In this study, the low-intensity training set consists of intensities of 90–120 mm/h, the high-intensity set spans 120–170 mm/h, and the mixed-intensity set spans 90–170 mm/h. These classifications are based on the historical threshold criteria for urban flooding in this catchment. Events starting at 10 mm/h in this area do not trigger meaningful surface inundation due to pipeline drainage capacity, which is why they are excluded from flood forecasting training. We agree that this range is narrow, but it reflects the actual flood-producing rainfall spectrum for this specific urban setting.
2. L128-135: Lack of information on the rainfall-runoff model, response time, and output hydrographs. Nash 0.5 may conceal mediocre simulations. Is there saturation for intense events?
Response: Supplementary validation for the physical model has been added to the revised manuscript. Table 2 now displays the observed vs. simulated peak flows, peak time errors, and NSE values for three independent events.
We also plotted the comparison of inundation area hydrographs (Fig. 13) under real rainfall events, demonstrating that the model accurately captures the rising limb, peak phase, and recession limb, proving the system is not oversimplified.
- Confusion between sequence length L171 and the LSTM sequence of 2 time steps (30 min).
Response: This nomenclature has been corrected throughout the revised manuscript. The fundamental sample unit is a 45-minute continuous sliding window sequence. The LSTM internally uses a 30-minute sequence length with a 15-minute step size. The "Dataset Length" variables (L1 to L6) refer to the total count of these sequence samples, ranging from 598 to 1,198.
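The relationship between time steps and window counts can be checked with simple arithmetic. The sketch below assumes 1-minute sampling, 30-hour events (24-hour storm plus 6-hour recession), 5 to 10 events per configuration, and windows of 45 steps taken every 15 steps over the concatenated series. Whether the manuscript derives its counts in exactly this way is an assumption; the function name is hypothetical.

```python
def n_windows(n_steps, window=45, stride=15):
    """Number of full windows of `window` steps taken every `stride` steps."""
    if n_steps < window:
        return 0
    return (n_steps - window) // stride + 1


STEPS_PER_EVENT = 30 * 60  # 30 hours at one sample per minute

# Columns: number of events, total time steps, number of 45-step windows.
for n_events in range(5, 11):
    total_steps = n_events * STEPS_PER_EVENT
    print(n_events, total_steps, n_windows(total_steps))
```

Under these assumptions the 5-event and 10-event configurations give 598 and 1,198 windows, matching the range quoted above, and the 8-event configuration spans 14,400 one-minute time steps.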
- L137: q is the intensity of the rain, poorly chosen; L139: specify figure number; L148: specify ‘total precipitation’.
Response: The variables have been rewritten to align with standard hydrological nomenclature in the revised manuscript. Precipitation is now P_t, infiltration is I_t, drainage is D_t, and inundation area is Y_t.
Cumulative surface runoff volume is now defined mathematically via integration over time (Eq. 7) in the updated manuscript. All figure references have been corrected.
5. Problem with Eq 2: time does not appear.
Response: The equations have been updated in the revised manuscript to reflect time dependencies. The water balance equation for the net surface runoff rate is now defined at time t as Q(t) = P(t) - I(t) - D(t).
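A minimal sketch of this time-dependent balance, together with the cumulative volume obtained by integrating the rate over time, is given below. It assumes all three series share the same units and time step, and it uses a simple rectangle rule for the integration. Negative net rates are kept as they are, since the reply does not say whether they are clipped. The function names are hypothetical.

```python
def net_runoff_rate(p, i, d):
    """Q(t) = P(t) - I(t) - D(t), evaluated step by step."""
    return [pt - it - dt for pt, it, dt in zip(p, i, d)]


def cumulative_volume(q, dt=1.0):
    """Running rectangle-rule integral of the net runoff rate."""
    total, out = 0.0, []
    for qt in q:
        total += qt * dt
        out.append(total)
    return out


# Illustrative values only (not taken from the manuscript).
rain = [2.0, 3.0, 1.0]
infiltration = [0.5, 0.5, 0.5]
drainage = [1.0, 1.0, 1.0]
q = net_runoff_rate(rain, infiltration, drainage)
print(q, cumulative_volume(q))
```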
- L149: The output of the hydrological model called 'rainfall-runoff' is in fact a flooded area. This is an approximation.
Response: We agree and have included the exact mechanism mapping runoff to flooded area in the updated manuscript. Eq. 8 demonstrates how the instantaneous runoff rate is integrated into volume, then mapped to the inundation area Y(t) using the hydraulic conversion function established by the shallow water equations, DEM, surface slope, and depression storage. We note that this mapping, while an approximation, is standard practice in urban flood modeling and has been validated against observed events as shown in Fig. 13.
- L156-166: We would like to have the Nash criteria for the model by category of experiments.
Response: The Nash-Sutcliffe Efficiency (NSE) values for the physical model validation have been added in Table 2 of the revised manuscript, ranging from 0.75 to 0.88.
- L172-177: Specify duration in minutes and number of floods. At 90 mm per hour, it is difficult to consider the rain as ‘light’.
Response: 1) We have specified in the revised manuscript that the total duration configurations encompass 5 to 10 independent storm events, corresponding to cumulative durations of 150 to 300 hours. 2) We have also refined our terminology; we now refer to these events as "low-intensity" relative to the extreme storm design thresholds of the study area, rather than universally "light" rain.
- L178: How is the database divided into learning, testing and validation?
Response: The revised manuscript now employs a 5-fold cross-validation mechanism via stratified random sampling. For each fold, the remaining four folds serve as the training set, ensuring no data leakage. We chose stratified sampling over simple random splits because it better preserves the distribution of rainfall intensities across training and test sets.
- L202-203: I do not understand how a sequence of two time steps can help approximate long-term memory.
Response: This was an error in the previous draft. The corrected manuscript now states that the sequence length is 30 time steps (representing 30 minutes, with a sampling frequency of 1 minute).
- L209: Specify what helps avoid overfitting? There should be no overfitting on a synthetic system.
Response: Although the data are synthetic, the system output still presents complex nonlinear mappings and varied temporal dynamics depending on the design storm. In the updated manuscript we describe using a learning rate of 0.005, 50 epochs, and the Adam optimizer with weight decay to control variance. We recognize that overfitting is less common with synthetic data, but our goal was to ensure reproducibility and avoid fitting to specific hydrograph shapes.
- Table 1: Specify units of time step, sequence, and batch size.
Response: Table 1 has been renumbered as Table 5 in the revised manuscript. Table 5 now includes a batch size of 32.
2.5.1. L220: Why mention rainfall when target is flooded area? Ymean not X.
Response: The NRMSE formula (Eq. 11) has been corrected in the revised manuscript so that its variables (predicted, observed, and mean observed values) all represent the inundation area.
- 2.5.2: Express coefficient of determination using the same notation as NRMSE. Why is water depth mentioned?
Response: Eq. 12 has been updated in the revised manuscript to standardize the notation for R², and references to water depth have been replaced with "inundation area" to match the target variable.
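To show what a shared notation for the two metrics might look like, the sketch below computes both from the same observed and predicted inundation-area series. It assumes the NRMSE is normalized by the mean observed value, as the reviewer suggested, and that R² is computed as 1 - SS_res/SS_tot; the exact forms of Eqs. 11 and 12 are not reproduced in this reply. The function names are hypothetical.

```python
import math


def nrmse(y_obs, y_pred):
    """RMSE divided by the mean observed value (assumed normalization)."""
    n = len(y_obs)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / n)
    return rmse / (sum(y_obs) / n)


def r_squared(y_obs, y_pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    y_mean = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - y_mean) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot
```

Writing both functions against the same pair of arguments keeps the observed series, the predicted series, and the observed mean identical across the two metrics.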
2.6.1: Why are we talking about an effect? The concept should be defined.
Response: We rewrote this section in the revised manuscript to introduce a formal Multi-factor Analysis of Variance (factorial ANOVA) framework. The updated text defines statistical main effects, interaction effects, significance testing (F-value, p-value), and effect size estimation (e.g., partial η²).
3.1: Move entire section to the chapter where the model is discussed.
Response: Agreed. The physical model calibration and validation content has been moved to Section 2.2 in the revised manuscript.
Fig 6, 7, 8, 11, 12, 17 formatting and readability.
Response: All figures have been revised in the updated manuscript. We added proper scales and clearer colors. The confusing performance variance charts (formerly Fig 11/12) have been replaced with box plots showing 5-fold CV performance (Figs. 10–12). Figure 17 now labels axes and shows the linear growth of training time against data scale.
Line 561: "Controlled experiments with high-fidelity...". The physics-based model is poorly evaluated; the conclusion is unproven.
Response: With the additional physical model validation (Table 2 and the direct comparison against real events in Fig. 13, yielding NSE > 0.8), the model’s fidelity is now demonstrated in the revised manuscript. We recognize that no physical model is perfect, but these metrics support the use of the synthetic dataset for the subsequent LSTM experiments. The conclusions drawn from the synthetic data are therefore based on a more thoroughly characterized physical baseline.
Conclusion: "beyond approximately 14,400 sequences" no, it is 14,400 time steps.
Response: The revised manuscript now states that the threshold is 14,400 samples (where each sample is an extracted 45-minute time-series window).
We hope these revisions address your concerns. The updated manuscript reflects all changes discussed above. Thank you for your time and careful review.
Some formulas may not paste correctly. Please refer to the attached PDF file for the full and detailed response.
-
CC1: 'Comment on egusphere-2025-4125', Nima Zafarmomen, 11 Mar 2026
The manuscript presents a timely and well-structured investigation into an important but often overlooked issue in AI-based urban pluvial flood prediction: dataset design. Rather than focusing only on model architecture, the study systematically evaluates how dataset length, feature composition, and rainfall-intensity distribution affect predictive skill, computational cost, and generalization. Overall, this is a solid and relevant contribution, and I believe the manuscript is worth publishing after minor revision.
1) The manuscript should improve consistency in the statistical terminology. Section 2.6 refers to MANOVA, whereas Section 4.3 describes the analysis as multi-factor ANOVA. The authors should clarify which method was actually used and ensure the terminology is consistent throughout the manuscript.
2) The target variable should be described more consistently. In some places the manuscript defines the prediction target as inundation area, while elsewhere it discusses peak water level, runoff volume, and water depth.
3) Some of the conclusion wording is stronger and more conversational than is typical for a scientific paper. Phrases such as “respect the 14k-sample ceiling” and “start lean, enrich later” may be better rephrased in a more formal academic style. The ideas are useful, but the tone could be made more precise and neutral.
4) In addition, I do strongly recommend authors consider citing the following recent and relevant paper, which is closely related to AI surrogate modeling in urban drainage networks and prediction of node-level hydraulic states: Zafarmomen, N., Samadi, V., and Borgomeo, E. (2026). Spatiotemporal SWMM-LSTM surrogate modeling for efficient node-level water depth and inflow prediction in urban drainage networks. Cambridge University Press, published online 13 January 2026.
Citation: https://doi.org/10.5194/egusphere-2025-4125-CC1 -
AC5: 'Reply on CC1', Yizi Shang, 30 May 2026
Dear Reviewer,
Thank you very much for your positive assessment of our manuscript and for recognizing the value of our focus on dataset design in AI-based urban flood prediction. We greatly appreciate your constructive suggestions, which have been highly helpful in refining the clarity, consistency, and academic rigor of our paper.
Below, please find our detailed, point-by-point responses to your comments:
- The manuscript should improve consistency in the statistical terminology. Section 2.6 refers to MANOVA, whereas Section 4.3 describes the analysis as multi-factor ANOVA. The authors should clarify which method was actually used and ensure the terminology is consistent throughout the manuscript.
Response: We sincerely apologize for the inconsistency in the statistical terminology in the previous draft. We have clarified and corrected this throughout the revised manuscript. The actual method used was Multi-factor Analysis of Variance (Factorial ANOVA), as we independently analyzed the main and interaction effects for three separate dependent variables (training time, NRMSE, and R²), rather than conducting a joint multivariate analysis. We have thoroughly reviewed Sections 2.6, 3.3.4, and the Discussion to ensure that the term "Multi-factor Analysis of Variance" is used consistently and correctly.
- The target variable should be described more consistently. In some places the manuscript defines the prediction target as inundation area, while elsewhere it discusses peak water level, runoff volume, and water depth.
Response: Thank you for pointing out this ambiguity. We have carefully revised the manuscript to ensure absolute consistency. To clarify: the ultimate and sole prediction target of the LSTM model in this study is the Inundation Area (m²). The terms "runoff volume," "water depth," and "pipe flow" are intermediate physical variables generated during the mechanistic simulation process within InfoWorks ICM (as described in Eq. 7 and Eq. 8, where instantaneous runoff is mapped to the final inundation area). We have combed through the entire manuscript to strictly differentiate between the intermediate physical variables of the hydrodynamic model and the final predictive target of the deep learning model. Any confusing references to predicting "water depth" or "peak levels" by the LSTM have been uniformly corrected to "inundation area."
- Some of the conclusion wording is stronger and more conversational than is typical for a scientific paper. Phrases such as “respect the 14k-sample ceiling” and “start lean, enrich later” may be better rephrased in a more formal academic style. The ideas are useful, but the tone could be made more precise and neutral.
Response: We completely agree with your feedback regarding the tone. We have thoroughly revised the Discussion and Conclusion sections to remove conversational phrasing and ensure a formal, precise, and objective academic tone.
Phrases like "respect the 14k-sample ceiling" have been formally rephrased to: "performance gains plateau once the sample size reaches a critical threshold of approximately 14,400 sequences under the current experimental setup."
Phrases like "start lean, enrich later" have been replaced with precise methodological recommendations, such as: "under data scarcity conditions, a minimalist input structure should be maintained... whereas introducing auxiliary features is an effective strategy only when data resources are abundant."
- In addition, I do strongly recommend authors consider citing the following recent and relevant paper, which is closely related to AI surrogate modeling in urban drainage networks and prediction of node-level hydraulic states: Zafarmomen, N., Samadi, V., and Borgomeo, E. (2026). Spatiotemporal SWMM-LSTM surrogate modeling for efficient node-level water depth and inflow prediction in urban drainage networks. Cambridge University Press, published online 13 January 2026.
Response: We are very grateful for this excellent literature recommendation. This recent publication is indeed highly relevant to our work and provides critical, up-to-date context regarding the application of SWMM-LSTM surrogate models for fine-scale spatiotemporal and node-level hydraulic predictions. We have incorporated this citation into the Introduction (Section 1) when discussing the latest advancements in deep learning-based data-driven models and spatiotemporal prediction techniques, as well as in the Discussion to contextualize our findings within the broader trajectory of AI surrogate modeling in urban drainage systems.
Thank you once again for your insightful and encouraging review. We believe these minor revisions have significantly strengthened the professional presentation of our work, and we hope the revised manuscript now fully meets your expectations.
-
RC2: 'Comment on egusphere-2025-4125', Anonymous Referee #2, 12 Mar 2026
The manuscript addresses a highly relevant topic: the optimization of dataset design for Deep Learning models (specifically LSTM) in urban pluvial flood prediction. The study identifies a "threshold effect" regarding dataset size (~14,400 samples) and highlights that rainfall intensity distribution is more critical for model generalization than raw data volume. While the work is interesting and well-motivated, several critical methodological clarifications and improvements in the presentation of results are required before it can be recommended.
Specific Comments:
Introduction and Generalization: The motivation is clear and well-structured. However, line 87 states that insights from the LSTM model are expected to be transferred to other ML models. The authors should justify this claim, explaining why these findings (especially threshold effects) would apply to other models.
Section 2.2: An NSE is reported for the hydraulic model, but more detail is needed. What variable does this NSE represent (water levels, discharge, or flood extent)? At which control points and under what rainfall events was this validated?
- Line 134: Is there a specific reference for the IDF formula used?
- Line 147: The equation for a distributed model is an oversimplification as it neglects lateral flows between cells and surface storage. This needs a more rigorous explanation.
- Line 148: There is a typo in “III”; it should likely be "I" for Infiltration.
Section 2.3:
- Line 161: For consistency, "Configuration 4" should be labeled: “Rainfall (P) + Soil infiltration (I) + Pipe drainage (D)”.
- Line 172: Specify the unit for sequence length. Are these time steps or hours?
Section 3.3:
- Line 332: Data Leakage Prevention? The manuscript mentions the use of overlapping sliding windows. It is crucial to clarify whether the data split (train/val/test) was performed before or after generating these windows. If done after, there is a high risk of information leakage, which would invalidate the reported generalization performance.
Section 4.1:
- Line 402: The claim that performance decays after Level 4 is difficult to discern in the current figure for low-intensity and mixed-intensity.
- Line 423: Does the model account for manhole overflows? This is a critical factor in urban pluvial flooding.
General comments:
- There is confusion between the use of MANOVA and ANOVA. The description suggests a factorial ANOVA for individual metrics, yet MANOVA is mentioned. Please clarify the exact statistical framework and how the covariance between multiple dependent variables was handled.
- A major omission is the time required to generate the dataset. Since InfoWorks ICM simulations are computationally intensive, the authors must provide details on the total simulation time, hardware used, and a comparison between the "data investment" time vs. the AI's real-time prediction advantage.
Figures and Legibility:
- Figure 7: What does "Y0333333.1" represent? Legend labels should be descriptive and self-explanatory.
- Figures 8, 9, and 11: Figure 8 (Training time 1, 2, 3) is difficult to read and seems redundant. Consider merging the essential information into Figure 9 and removing Figure 11 if it does not provide unique insights.
- Figure 17: Clarify the “Value” axis. The figure as it stands is not sufficiently explanatory for the conclusions drawn immediately below it.
Discussion: The discussion lacks sufficient citations to back its claims, particularly in line 534. The results should be contextualized by comparing them with existing literature on dataset design in computational hydrology.
In conclusion, the paper presents a timely and valuable contribution to AI applications in hydrology by shifting the focus from “model architecture” to “dataset design.” However, several critical issues need to be addressed. First, the lack of transparency regarding the computational cost of data generation makes it difficult to assess the efficiency of the proposed framework, as the “cost of data” is as relevant as model accuracy. Second, the potential for data leakage in sliding-window approaches must be resolved to ensure the validity of the results. Finally, the hydrological representation is oversimplified: mass balance considerations and calibration procedures are described too briefly. Strengthening the physical basis of the synthetic data is essential. To recommend this work, the authors need to address these points and resolve the concerns regarding computational feasibility, data handling, and the hydrological basis of the simulations.
Citation: https://doi.org/10.5194/egusphere-2025-4125-RC2 -
AC3: 'Reply on RC2', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
-
AC6: 'Reply on RC2', Yizi Shang, 30 May 2026
Dear Reviewers,
Thank you for your comments on our manuscript, "Impact of dataset design on LSTM-based urban pluvial flood prediction: Length, feature dimensions, and rainfall stratification." Your suggestions have helped improve the rigor and clarity of our study.
We have addressed all comments in the revised manuscript. Below are our detailed, point-by-point responses:
Part I: Response to Specific Comments
- Introduction and Generalization: The motivation is clear and well-structured. However, line 87 states that insights from the LSTM model are expected to be transferred to other ML models. The authors should justify this claim, explaining why these findings (especially threshold effects) would apply to other models.
Response: We agree that claiming absolute generalizability to other machine learning models without empirical evidence lacks rigor. In the revised manuscript, we have redefined the LSTM as a "representative baseline sequence model" and revised the discussion and conclusion to clarify: "Although the patterns derived under the LSTM baseline framework offer reference values for other sequence learning models, broader generalization (e.g., extension to Graph Neural Networks) necessitates further empirical validation".
- Section 2.2: "An NSE is reported for the hydraulic model, but more detail is needed. What variable does this NSE represent... At which control points and under what rainfall events was this validated?"
Response: We have supplemented Section 2.2 with detailed information on the hydrodynamic model validation.
- Variables and Control Points: The NSE represents the pipe flow rate (m³/s) at key monitoring nodes within the drainage pipe network.
- Rainfall Events for Validation: In addition to the initial calibration, we added a supplementary validation phase for three typical observed rainfall events (with total rainfall volumes of 80 mm, 120 mm, and 50 mm, respectively) (Table 2), yielding NSE values of 0.82, 0.75, and 0.88. Furthermore, we added Figure 13 to directly compare the inundation area hydrographs simulated by InfoWorks ICM and the LSTM model under real rainfall events, demonstrating that the NSE values for the simulated events exceed 0.8.
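For readers who want to check the reported values against their own hydrographs, the NSE quoted above can be computed as follows. This is a minimal sketch using the standard Nash-Sutcliffe definition; the function name `nse` and the plain-list inputs are illustrative assumptions, not the authors' code.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors) /
    (sum of squared deviations of the observations from their mean)."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("NSE is undefined when the observations are constant")
    return 1.0 - sse / sst
```

A perfect simulation gives 1, and a simulation that always returns the observed mean gives 0, which is why a single NSE value should be read alongside the hydrographs themselves.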
- Line 134: "Is there a specific reference for the IDF formula used?"
Response: We have added the source and specific parameters of the storm intensity (IDF) formula. The Chicago design storm instantaneous rainfall intensity formula (Eq. 5), q = 1602(1 + 1.037 lg p)/(t + 11.593)^0.681, was fitted from local historical rainfall data. We have also detailed the geographical and statistical significance of each empirical parameter (e.g., A1, C, b, n).
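The formula quoted in this response, q = 1602(1 + 1.037 lg p)/(t + 11.593)^0.681, can be evaluated directly. The sketch below assumes that `lg` denotes the base-10 logarithm, that `p` is the return period and that `t` is the rainfall duration; the response does not state the units of `p`, `t` or `q`, so none are asserted here, and the function name `storm_intensity` is hypothetical.

```python
import math

def storm_intensity(p, t):
    """Evaluate q = 1602 * (1 + 1.037 * lg p) / (t + 11.593) ** 0.681.

    Assumptions (not stated in the response): lg is log base 10,
    p is the return period and t is the rainfall duration.
    """
    if p <= 0 or t < 0:
        raise ValueError("p must be positive and t non-negative")
    return 1602.0 * (1.0 + 1.037 * math.log10(p)) / (t + 11.593) ** 0.681
```

Under these assumptions the intensity grows with the return period and falls with the duration, which is the expected behaviour of an intensity-duration-frequency curve.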
- Line 147: The equation for a distributed model is an oversimplification as it neglects lateral flows between cells and surface storage. This needs a more rigorous explanation.
Response: We have clarified that Equation 6 is strictly used to calculate the net surface runoff rate (excess infiltration rate) prior to concentration. The actual physical processes—2D spatial concentration, lateral flow between grid cells, and surface depression storage—are handled by the InfoWorks ICM 2D module based on shallow water equations and a DEM. This hydraulic conversion is formally expressed via Equation 8, which incorporates surface slope and depression storage.
- Line 148: "There is a typo in “III”; it should likely be "I" for Infiltration."
Response: We have corrected this typographical error and now uniformly use "I" to represent Soil Infiltration throughout the manuscript.
- Section 2.3 (Line 161): "For consistency, 'Configuration 4' should be labeled: 'Rainfall (P) + Soil infiltration (I) + Pipe drainage (D)'."
Response: Accepted. In Table 4 and the relevant text, Combination 4 has been standardized to: "Precipitation (P) + Soil Infiltration (I) + Pipe Drainage (D) → Inundation Area (Y)".
- Line 172: "Specify the unit for sequence length. Are these time steps or hours?"
Response: The temporal sampling frequency is once per minute; the LSTM sequence length of "30" represents 30 time steps (30 minutes). This has been clarified in the manuscript.
- Section 3.3 (Line 332): Data Leakage Prevention? The manuscript mentions the use of overlapping sliding windows. It is crucial to clarify whether the data split (train/val/test) was performed before or after generating these windows. If done after, there is a high risk of information leakage, which would invalidate the reported generalization performance.
Response: This is an important concern regarding data leakage. Our data partitioning logic is as follows: first, we randomly sample independent storm events from the total sample space; we then perform sliding window interception within each selected independent event; finally, we employ 5-fold cross-validation. This ensures that the training and validation sets remain independent during random shuffling and that the partitioning mechanism strictly prevents future information leakage. Additionally, the 6-hour recession period between events ensures physical independence between them.
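The ordering described in this response (assign whole events to folds first, cut windows inside each event afterwards) can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `make_windows`, `event_level_folds` and `build_split`, the toy `events` dictionary, and the choice of the last time step of each window as the target are all hypothetical.

```python
import random

def make_windows(features, target, seq_len=30):
    """Cut overlapping windows of seq_len steps from ONE event.
    Each sample pairs a feature window with the target at the window's last step."""
    n = len(features) - seq_len + 1
    return [(features[i:i + seq_len], target[i + seq_len - 1]) for i in range(max(n, 0))]

def event_level_folds(event_ids, k=5, seed=0):
    """Yield (train_ids, val_ids) so that every event sits in exactly one
    validation fold and never in both sides of the same fold."""
    ids = sorted(event_ids)
    random.Random(seed).shuffle(ids)
    for fold in range(k):
        val_ids = ids[fold::k]
        val_set = set(val_ids)
        train_ids = [e for e in ids if e not in val_set]
        yield train_ids, val_ids

def build_split(events, ids, seq_len=30):
    """Generate windows only AFTER the events have been assigned to a split."""
    samples = []
    for eid in ids:
        features, target = events[eid]
        samples.extend(make_windows(features, target, seq_len))
    return samples

# Toy data: 10 hypothetical events, each 40 one-minute steps long.
events = {e: ([float(e)] * 40, [float(e)] * 40) for e in range(10)}
folds = list(event_level_folds(events, k=5))
```

Because windows never span two events and no event is shared between training and validation within a fold, overlapping windows cannot carry information across the split. Splitting after windowing would not give that guarantee.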
- Section 4.1 (Line 402): The claim that performance decays after Level 4 is difficult to discern in the current figure for low-intensity and mixed-intensity.
Response: In the revised manuscript, we have corrected the term "decays," describing it more accurately as a "saturation effect" or "plateau." As shown in Section 3.3.1 and Figure 17, when data volume reaches L4 (~14,400 samples), model performance achieves a qualitative leap. Beyond L4, increasing data to L5 and L6 yields only marginal NRMSE reduction and minimal R² improvement—the gains plateau rather than decay.
- Line 423: Does the model account for manhole overflows? This is a critical factor in urban pluvial flooding.
Response: Yes, our coupled 1D–2D hydrodynamic model fully accounts for manhole overflows. When pipe drainage capacity reaches saturation, excess water overflows to the surface through manholes, and the evolution of its inundation area on the surface is calculated by the 2D module, as described in Sections 2.2 and 2.3.
Part II: Response to General Comments
- There is confusion between the use of MANOVA and ANOVA. The description suggests a factorial ANOVA for individual metrics, yet MANOVA is mentioned. Please clarify the exact statistical framework and how the covariance between multiple dependent variables was handled.
Response: We have corrected the statistical terminology in the revised manuscript: we conducted independent Multi-factor Analysis of Variance (Factorial ANOVA) for each dependent variable (training time, NRMSE, R²) to quantify the main and interaction effects of dataset length, rainfall level, and feature combinations, rather than a joint MANOVA. Results are detailed in Figures 20–22.
- A major omission is the time required to generate the dataset. Since InfoWorks ICM simulations are computationally intensive, the authors must provide details on the total simulation time, hardware used, and a comparison between the "data investment" time vs. the AI's real-time prediction advantage.
Response: This is a valid point. We have added the hardware specifications (Intel i9 + NVIDIA RTX 3090 GPU) and a comparison between the offline data generation cost and the AI's real-time prediction advantage. The LSTM training time is ~896 seconds (~15 minutes); once the training set is generated offline using the physical model, the AI model achieves second-level real-time predictions during deployment.
- Figures and Legibility:
Regarding "Y0333333.1" in Figure 7: This was a data label artifact from the original plotting. We have redrawn the legends for Figures 5, 6, and 7 with clear descriptive labels.
Regarding the redundancy and legibility of Figures 8, 9, and 11: We have significantly streamlined and consolidated the cross-validation charts. We removed the old, redundant, and hard-to-read training time graphs, integrating the core 5-fold cross-validation mechanism into Figure 8 (matrix plot) and Figure 9 (flowchart). Simultaneously, the validation performance of each configuration was transformed into highly intuitive box plots (Figures 10, 11, and 12), clearly displaying data distributions and means.
Regarding the "Value" axis in Figure 17: We have explicitly relabeled the vertical axes of Figure 17. The vertical axes of the three subplots are now clearly labeled "NRMSE", "R²", and "Training Time (s)", directly and clearly supporting the subsequent conclusion regarding performance saturation after reaching the L4 data volume threshold.
- Discussion: The discussion lacks sufficient citations to back its claims, particularly in line 534. The results should be contextualized by comparing them with existing literature on dataset design in computational hydrology.
Response: We have enriched the discussion and literature review, comparing our findings with existing research on dataset design in computational hydrology—including recent work on extreme event prediction and rainfall distribution in deep learning, which further supports the need for stratified mixed sampling under data-scarce conditions.
Conclusion:
We appreciate your recognition of the value of our research in "shifting the focus from model architecture to dataset design." By supplementing the hydrodynamic model validation, clarifying the data split mechanism, correcting statistical terminology, and optimizing figure legibility, we have addressed concerns regarding computational feasibility, data handling, and hydrological physical basis. We hope the revised manuscript meets the journal's publication requirements. Some formulas may not paste correctly. Please refer to the attached PDF file for the full and detailed response.
-
AC2: 'Reply on RC1', Yizi Shang, 10 May 2026
Dear Professor,
We sincerely appreciate the time and effort you and the reviewers have dedicated to providing valuable and constructive feedback on our manuscript. We have carefully read all the comments and thoroughly considered the concerns raised.
We have revised the manuscript accordingly to address all the requirements and suggestions. Please find attached a detailed, point-by-point response to your comments and those of the reviewers, along with the revised manuscript.
We hope that the revisions meet your expectations and that the manuscript is now suitable for publication. Thank you again for your time and continued guidance.
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,706 | 1,095 | 157 | 2,958 | 76 | 105 |
Review of the paper “Threshold Effects and Generalization Bias in AI-based Urban Pluvial Flood Prediction: Insights from a Dataset Design Perspective.”
By Hao Hu et al.
In general, the introduction is well constructed and poses questions in a logical and relevant manner. There is indeed an advantage to using a synthetic database: beyond the lack of available data, it allows us to overcome the bias-variance dilemma that explains overfitting.
However, in the rest of the article, we note that the methodology is not correctly deployed: 1) the physics-based model that generated the hydrological responses is not fully presented or validated. This is important because if the physical model does not adequately represent the complexity of the case study, the study loses much of its meaning. Furthermore, 2) the simulated rainfall is very unrealistic, and simulated rainfall intensities (100 mm/h) are significantly higher than those observed and applied to the physical model (1 mm/h, see Fig. 7). It is not normal to learn this from values read on a figure. It is necessary to consider the consequences: can a model calibrated with rainfall of 1 mm/h simulate responses to rainfall of 100 mm/h? What are the consequences? The simulated rainfall events all have the same duration; why is this? One might imagine that longer rainfall events would have a greater impact in the model due to their greater power. The simulated rainfall therefore lacks the diversity needed to evaluate all possible states of the system. 3) Regarding the choice of the LSTM model, we do not choose a model because it is fashionable. We choose it because it is suited to the problem at hand. Furthermore, LSTM is not the most widely used model; it seems that for high-intensity/rapid floods, the perceptron is more commonly used. The latter is much less complex, consumes much less energy and contributes more to a sustainable world than LSTM.
Furthermore: 4) regarding the validation of the LSTM model, it can change drastically depending on the events used in the validation, so it is recommended to evaluate them with cross-validation, which is not done in the paper.
All these elements suggest that the chosen methodology is not properly thought out. And we doubt the meaning that the results could have.
Regarding the quality of the presentation, it must be noted that the writing is poor. Several parts are redundant (e.g. the description of rainfall), while others are missing (how is generalisation assessed?). The notation changes throughout the article. Even the model output is not defined consistently; sometimes we have a flow rate (L127), sometimes a flooded area (L146) and sometimes a water height (L230). How is this possible? The only explanation I can see is that an AI wrote parts of it, and they were not properly corrected.
The conclusion provides little new information: yes, the quality of the model is affected by the size of the database: we already know that; yes, events of different types and intensities must be presented to the training model. Indeed, we already know that the training database must include a representation of the entire state space. It is also obvious that the calculation time increases with the length of the training database. On the other hand, no: the model that has learned from synthetic data, i.e. without noise or uncertainty, cannot be overfitted. The conclusions relating to this last point are therefore false, and explanations must be sought elsewhere.
Unfortunately, the simplistic nature of rainfall events does not allow us to explore other avenues or quantify thresholds; for example, how many events are needed to reach the plateau.
Unfortunately, even though the subject seemed interesting, in view of all these factors, it is very difficult to recommend this paper for publication. It needs to be completely rethought and rewritten.
Specific comments
The variability in maximum precipitation intensities is not representative of real cases. 90mm/h to 140mm/h could have been distributed more widely, for example starting at 10 mm/h.
L128-135, there is a serious lack of information on the rainfall-runoff model; in particular, what is its response time? We would like to see the output hydrographs from this model. A Nash criterion of 0.5 may conceal simulations that reproduce the dynamics of the hydrosystem, or mediocre simulations with, for example, a very low peak flow/height, which does not represent the observed situation. We would like to see a more detailed analysis of the model outputs for intense and very intense rainfall events. Is there saturation for the most intense events?
This is problematic because if the rainfall-runoff model over-tops the floods or oversimplifies the real situation (the functioning of the hydrosystem), it is clear that the issue is then greatly simplified... this could even lead to doubts about the reliability of the conclusions.
Confusion between sequence length L171 and the LSTM sequence of 2 time steps (30 min).
L137, q is the intensity of the rain, which is poorly chosen, as q generally represents the flow.
L139 specify the figure number instead of ‘this formula’.
L148, specify what ‘total precipitation’ is: the cumulative amount since the start of the event?
The problem with equation 2 is that time does not appear. If the system is truly dynamic and time-dependent, this must be indicated in the equations. And better define what ‘total precipitation’ means.
L149. The output of the hydrological model called ‘rainfall-runoff’ is in fact a flooded area. This is an approximation that is very far from reality and reflects an approximation that is not acceptable if it is not explained and justified. Once again, the model is described too quickly.
L156-166: we would like to have the Nash criteria for the model by category of experiments.
L172-177: specify the duration in minutes and the number of floods.
L178: it is good that there is a validation set, but how is the database divided into learning, testing and validation?
L172 to L177 represents 5 floods, 6 floods, 7 floods, etc., up to 10 floods. It is a detail, but at 90 mm per hour, it is difficult to consider the rain as ‘light’, especially in an urban environment.
L202-203: I do not understand how a sequence of two time steps can help to approximate long-term memory. The LSTM is clearly misused, or rather it is not recommended for a synthetic system with a supposed response time of 2 time steps (the former information is lacking).
L 209: specify what helps to avoid overfitting? The batch, the learning rate or the epoch. Overfitting is avoided with regularisation methods. But a priori there should be no overfitting on a synthetic system, as there is no noise and no uncertainty.
Table 1: specify the units of time step, sequence and batch size more precisely.
2.5.1
L 220: why mention rainfall when the target is the flooded area? Also, if Y is the target measured by the criterion, the denominator should be Ymean, not X, which is not defined.
2.5.2 It is not specified what x represents. According to the text, it could be rainfall, but rainfall is referred to as q at L137.
2.5.2 Express the coefficient of determination using the same notation as the NRMSE. Why is water depth mentioned here? Isn't the target the flooded area?
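For concreteness, the consistent notation requested in the comments on 2.5.1 and 2.5.2 could take the following form. This is a sketch under assumptions: $Y_i$ is the observed flooded area at time step $i$, $\hat{Y}_i$ the LSTM prediction, $\bar{Y}$ the mean of the observations and $N$ the number of samples. The manuscript's own definitions are not reproduced here, and it may use a different normalisation or the squared correlation coefficient for $R^2$.

```latex
\mathrm{NRMSE} = \frac{1}{\bar{Y}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{Y}_i - Y_i \right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2},
\qquad
\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i .
```

Written this way, both criteria refer only to the target and its prediction, so neither rainfall nor water depth needs to appear in their definitions.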
2.5.3. It is a good idea to focus on training time, but then why not use a multilayer perceptron, which is much faster than LSTM in such a simplified case study?
2.6.1 Why are we talking about an effect? The concept should be defined and used consistently throughout the document. This entire section should be rewritten to make it understandable.
3.1: This entire section should be moved up to the chapter where the model is discussed.
Fig. 6: Provide a scale.
Provide a reference for InfoWorks ICM.
Specify what the three observed flows (not the flooded area) correspond to in Figure 7. Furthermore, we note in Figure 7 that the model completely misses the first peak. The observed rainfall intensities are significantly lower (0.6 mm) than the simulated artificial rainfall (100 mm/h). Why is this? Is it because the model does not accurately represent responses to light rainfall?
It is necessary to present a comprehensive and detailed overview of the performance of the hydrological model.
Equation 10 is more like that of a recurrent MLP...
Table 3: repetition.
Figure 8: is the increase in computation time proportional to the increase in signal length?
Figure 11 is difficult to read. Why does the normalised criterion vary so much? It is not consistent.
Figure 12: only the orange colour is visible.
Figure 17: which model are we talking about? The physical model or the LSTM? Review the ordinates: ‘value’ does not indicate the variable displayed.
We do not know how generalisation is measured. On which set?
Line 561: it is not possible to write "Controlled experiments with high-fidelity synthetic rainfall–inundation datasets". The physics-based model is poorly evaluated; the example shown (Fig. 7) completely misses the first event, which is far from high fidelity. The rest of the conclusion is therefore unproven. Even if my experience is in line with the authors' conclusions: enough events and different types are needed; these conclusions are neither new nor rigorously proven.
Conclusion: "but beyond approximately 14,400 sequences" no, it is 14,400 time steps.