the Creative Commons Attribution 4.0 License.
INFLOW-AI v2.1: A Machine Learning Framework for Predicting Out-of-Sample Extreme Seasonal Flood Extents
Abstract. Forecasting flood extent during extreme events remains a critical challenge for hydrological modelling, particularly in data-scarce and highly dynamic floodplain systems. Accurate and timely forecasts of these events are essential for effective disaster preparedness and response. Traditional physically based methods are often not well-suited for modelling complex hydrodynamic systems, as they depend on fixed structural parameterisations of surface water processes, groundwater interactions, and evapotranspiration that are difficult to calibrate and scale in catchments with highly heterogeneous vegetation, climatology, and terrain. Machine learning approaches, which can learn nonlinear relationships directly from data without explicit physical parameterisation, offer a promising alternative for modelling flooding in these regions.
We present INFLOW-AI v2.1, a machine learning framework for predicting extreme seasonal flood extent beyond what was observed in the training set. To enhance predictive accuracy for these out-of-sample extreme events, the framework employs a two-stage neural network architecture that combines (1) extreme-sensitive temporal thresholds with (2) dynamic spatial predictions. The first stage employs transformer-based models with multi-headed attention mechanisms to capture long- and short-term hydrometeorological patterns in total flood extent over the past 36 dekads. To enable more effective detection of extremes, this stage predicts the first difference of the seasonal anomaly in total flood extent, rather than the raw total flood extent. The second stage then dynamically models spatial flooding patterns using a ConvLSTM to predict local inundation probabilities at 1 km resolution, with the basin-scale inundation extent predicted by the first stage used to constrain the spatial predictions. The model generates forecasts with a lead time of up to six dekads (two months).
A case study was conducted over the Sudd wetland in South Sudan, one of the world’s largest freshwater ecosystems, which has experienced unprecedented catastrophic flooding since June 2019, severely impacting Jonglei, Unity, and Upper Nile States. INFLOW-AI was tested on this catchment, demonstrating the two-stage model’s ability to predict extreme out-of-sample post-2019 flooding with exposure only to pre-2019 data. INFLOW-AI has been deployed operationally since the 2024 flood season (August–November) on the Joint Analysis System Meeting Infrastructure Needs (JASMIN), providing real-time predictions to humanitarian organisations and informing flood preparedness in South Sudan.
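The Stage 1 target described in the abstract, the first difference of the seasonal anomaly, can be illustrated with a minimal sketch. This is one plausible reading and not the authors' implementation: it assumes the anomaly is taken against a per-dekad climatology estimated from the training period only, and the function names `seasonal_anomaly_diff` and `reconstruct` are hypothetical. The manuscript's own equations may define the transform differently.

```python
import numpy as np

DEKADS_PER_YEAR = 36  # three dekads per month


def seasonal_anomaly_diff(y, n_train):
    """Return (climatology, anomaly, first difference of the anomaly).

    Assumption: the climatology is the per-dekad mean over the first
    n_train steps only, so the test period does not enter the transform.
    n_train should cover at least one full year (36 dekads).
    """
    y = np.asarray(y, dtype=float)
    season = np.arange(len(y)) % DEKADS_PER_YEAR
    clim = np.array([
        y[:n_train][season[:n_train] == s].mean()
        for s in range(DEKADS_PER_YEAR)
    ])
    anomaly = y - clim[season]
    d_anomaly = np.diff(anomaly)  # d_anomaly[i] = anomaly[i + 1] - anomaly[i]
    return clim, anomaly, d_anomaly


def reconstruct(last_anomaly, d_pred, clim, start_index):
    """Invert the transform for a forecast beginning at start_index.

    Predicted differences are accumulated onto the last known anomaly,
    then the climatology for each forecast dekad is added back.
    """
    d_pred = np.asarray(d_pred, dtype=float)
    season = (start_index + np.arange(len(d_pred))) % DEKADS_PER_YEAR
    return last_anomaly + np.cumsum(d_pred) + clim[season]
```

Under this reading, a six-dekad forecast issued at step `t0` would be rebuilt as `reconstruct(anomaly[t0 - 1], predicted_diffs, clim, t0)`, so small predicted increments can accumulate into a large raw extent.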
Status: final response (author comments only)
-
CC1: 'Comment on egusphere-2026-66', James Verdin, 29 Mar 2026
-
CC2: 'Reply on CC1', Jessica Rapson, 19 May 2026
Hi James,
An easy fix. Will do.
Citation: https://doi.org/10.5194/egusphere-2026-66-CC2
-
RC1: 'Comment on egusphere-2026-66', Anonymous Referee #1, 08 May 2026
Rapson et al. present INFLOW-AI v2.1, a two-stage ML framework for forecasting seasonal flood extent in the Sudd wetlands. The model development is solid, and the use of differenced seasonal anomalies as a predictive target is a clever and generalizable contribution. The results are plausible and the application is impactful.
I have three major comments:
- The Introduction and Section 2.2 frame the underperformance of physically based models in the Sudd as a paradigmatic limitation, motivating the shift to ML. However, the struggles of physical models in this region likely reflect data scarcity, observability constraints (e.g., sparse discharge gauges, uncertain wetland evapotranspiration, unobserved upstream dam releases), and calibration on pre-2019 conditions, rather than inherent structural failings of the modeling paradigm. I would encourage the authors to revise this framing accordingly.
- The Stage 1 transformer is rigorously benchmarked against a range of alternatives (Table 2), but the Stage 2 ConvLSTM is compared only against historic-fill and persistence-fill baselines. Vanilla ConvLSTM is well known to produce blurry predictions at longer forecast horizons, a limitation extensively documented since Shi et al. (2017) and addressed by more recent spatiotemporal architectures (e.g., PredRNN-v2, SimVP, Earthformer). I would encourage the authors to benchmark against at least one modern spatiotemporal architecture.
- The manuscript could benefit from restructuring. For example, the Background and Introduction could be merged to shorten the manuscript. Section 2.3 is essentially a methods description and would fit more naturally in Section 4. The Data section is overly long and detailed and sits awkwardly between Section 2.3 and the Methods (Section 4). Sections 4.1 and 4.2 are essentially data processing and visualization, and would be better placed in Section 3.
Minor comment:
The colormap used in Figure 14 is not explained in the caption. Please clarify what the colors encode.
Overall, this is a well-developed and operationally valuable contribution. Addressing the framing of physical models and providing more transparent validation of the ConvLSTM stage would substantially strengthen the manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-66-RC1
-
CC3: 'Reply on RC1', Jessica Rapson, 19 May 2026
Thank you for the substantive comment. To address the points:
- While the struggles of physical models in this region likely do also reflect data limitations, this is also true for the entirely data-driven machine learning models. As such, machine learning models don't really have any comparative advantage on this front. If anything, data limitations are *more* detrimental to models that rely solely on detecting empirical relationships in past data. Thus, the concerns you mention regarding sensor uncertainties and limited historical records don't make as much sense to include in the discussion of limitations of physical models when trying to motivate the use of a data-driven model. The same is true for pre-2019 calibration considerations (most machine learning models trained on pre-2019 data only would fail to predict the 2019 flooding). That being said, we can certainly make it clearer that the concerns discussed in the paper regarding physical models in this region are not an inherent structural failing of physical models, but rather a regionally specific limitation caused by violations of physical hydrological assumptions in the Sudd basin (especially regarding wetland dynamics and evapotranspiration).
- This is a fairly easy fix. We can benchmark against a 3D CNN (basic architecture) and SimVP (modern, widely cited architecture). In retrospect, I also think it makes sense to benchmark against the 10-year return period GloFAS data for the region, using the constant-fill methodology described in the paper (given that this is provided as a motivating example for needing dynamic spatial inundation forecasts).
- Agreed, these can be restructured to be more efficient.
Regarding the minor comment on Figure 14, the colours just encode categorical month. For clarity, these can be made a single colour.
Citation: https://doi.org/10.5194/egusphere-2026-66-CC3
-
RC2: 'Comment on egusphere-2026-66', Anonymous Referee #2, 20 May 2026
The manuscript presents a machine learning framework that has already been used in practice to predict flood extent in the Sudd wetlands. The study addresses an important humanitarian problem, and the operational deployment is impressive. The two-stage design, with a basin-scale temporal model followed by a ConvLSTM spatial model, is a reasonable approach, and the seasonal differencing idea is also interesting. However, I have several important concerns about the methodology, especially about how the main claim, predicting “out-of-sample extreme values” (OSEVs), is presented and tested. In my opinion, some of these issues are serious and should be addressed before the paper can be published. I would recommend major revisions.
Below are my major comments:
- This is my main concern: OSEV is defined in Eqs. 7–8 using the raw target, where any test value above 5σ from the training mean is considered extreme. Based on that definition, there are 35 such cases after 2019. However, the model is not trained on the raw target. It is trained and tested on the transformed seasonal anomaly, Δỹₜ. According to Figure 8c, there are no OSEVs in the test set under this transformed variable, and the training set even contains more extreme values than the test set in that space. So, the model is not predicting test cases that lie outside the training distribution. Instead, it is predicting values in a transformed space that appear to be within the training range, even though after reconstruction with Eq. 22 they become extreme in the original variable. That is a weaker claim than saying the model predicts truly out-of-sample extremes, but the paper often treats these two ideas as the same. Can the authors clarify the main claim here? Is it that (a) the transformation makes the test extremes look in-distribution to the model, which seems to be what is actually happening, or (b) the model truly extrapolates beyond the training extremes, which is what the title and abstract seem to suggest? Have the authors tested cases where Δỹₜ itself is extreme in the test set? Those would be the real out-of-sample extremes for the model’s actual prediction task.
- The temporal model uses the previous 36 dekads of inundation as input. Since flooding in the Sudd is highly persistent and can last for years, the model already receives very high inundation values during the post-2019 extreme period. In other words, even if the model was not trained on these extreme periods, its inputs already show that the system is in a high-flood state. This is not exactly data leakage, but it does affect how the results should be interpreted. The model is not predicting an extreme event from normal-looking conditions. Instead, it is predicting the continuation of an already extreme situation. This may also explain why the persistence baseline performs almost as well as the transformer in Table 2. Have the authors tested the transformer without inundation history as an input? That would help show how much of the OSEV performance comes from other predictors, such as rainfall, lake levels, and climate indices, and how much simply comes from persistence in the inundation record itself.
- Section 4.4 says that for deployment, the temporal and spatial models were retrained using all historical data, including the OSEV cases that were previously held out. This means the reported test results do not represent the same model that was later used to make the 2024/2025 forecasts in Figure 12 and share them with humanitarian partners. Has the deployed model been tested on any truly held-out data? If so, what was its performance?
- Eq. 17: Lₜ = MSE · SignLoss + 0.1 · SumLoss. The SignLoss multiplier of 20 (Eq. 18) is unusually heavy and the 0.1 weight on SumLoss appears arbitrary. How were the constants 20 and 0.1 chosen? Was a sensitivity analysis performed?
- Only one train/test split was used, with training before July 21, 2018 and testing after that date. Given the small effective sample size, this is a concern. Did the authors use any time-series cross-validation, such as a rolling-origin test, to see how sensitive the results are to the choice of cutoff date? Furthermore, the paper says that 10% of the training data were used for validation with a random split. In a time-series setting, this can cause leakage between training and validation. Was the validation set separated by time instead?
- The paper compares the model with persistence, linear interpolation, linear regression, lasso, random forest, and FFNN, but it does not compare it with some other important baselines. These include a standard LSTM or GRU, which are common in flood prediction studies and are also used in works cited by the paper, such as Google’s AI Flood model and Frame et al. (2022). It also does not compare the results with GloFAS or another physically based model for the same region. In addition, Section 2.2 discusses the FEWS NET LASSO plus constant-fill method, but the paper does not provide a direct comparison with that approach on the same test set. It would also be useful to compare against a simpler transformer model, without the custom loss or Monte Carlo dropout, to better understand which design choices really matter. Why was an LSTM not included, given how common it is in flood prediction research? Can you run a head-to-head comparison with FEWS NET's actual outputs for the operational period, since both are operational systems addressing the same problem?
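The distinction raised in the first major comment, extreme in the raw variable versus extreme in the transformed target, can be made concrete with a toy check. The sketch below is illustrative only: the series is synthetic, the one-sided 5σ rule follows the wording of the comment above and not the manuscript's Eqs. 7–8 directly, and the helper names are hypothetical. The anomaly is assumed to be taken against a training-only dekadal climatology.

```python
import numpy as np

DEKADS_PER_YEAR = 36


def exceeds_k_sigma(train, test, k=5.0):
    """Flag test values more than k training std devs above the training mean."""
    train = np.asarray(train, dtype=float)
    return np.asarray(test, dtype=float) > train.mean() + k * train.std()


def diff_of_seasonal_anomaly(y, n_train):
    """First difference of the anomaly against a training-only climatology."""
    y = np.asarray(y, dtype=float)
    season = np.arange(len(y)) % DEKADS_PER_YEAR
    clim = np.array([
        y[:n_train][season[:n_train] == s].mean()
        for s in range(DEKADS_PER_YEAR)
    ])
    return np.diff(y - clim[season])


# Synthetic series (not real data): 10 "years" of stable seasonal flooding,
# then 2 "years" in which extent rises by a modest amount every dekad.
n_train, n_test = 360, 72
t = np.arange(n_train + n_test)
y = 100 + 10 * np.sin(2 * np.pi * t / DEKADS_PER_YEAR) + 2 * np.sin(1.7 * t)
y[n_train:] += 1.5 * np.arange(1, n_test + 1)

# Extremes in the raw variable.
raw_flags = exceeds_k_sigma(y[:n_train], y[n_train:])

# Extremes in the transformed target; d[i] describes the step from i to i + 1,
# so the first n_train - 1 differences lie wholly inside the training period.
d = diff_of_seasonal_anomaly(y, n_train)
diff_flags = exceeds_k_sigma(d[:n_train - 1], d[n_train - 1:])

print("test steps extreme in raw space:", int(raw_flags.sum()))
print("test steps extreme in transformed space:", int(diff_flags.sum()))
```

In this toy series the raw test values cross the 5σ threshold while every per-dekad difference stays inside the training range, which is case (a) in the comment above. A test of case (b) would need periods where the differenced target itself exceeds the training extremes.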
Minor comments:
Section 7.1 says that this approach could also be useful for forecasting other spatio-temporal hydrometeorological events, such as rainfall, cyclones, and heatwaves. This feels like too strong a claim based on the evidence presented in the paper. Those processes behave very differently from flood extent in the Sudd, which changes slowly over time and has strong persistence. That is also one reason why the persistence baseline works so well here. Can the authors either soften this statement or provide stronger evidence that the seasonal differencing approach also works for faster-changing processes, where the target has much weaker autocorrelation?
The 70 HydroATLAS features are PCA'd to 16 components, but only the first PC is used in the spatial model. Why retain 16 in the pipeline if only 1 is used? Was the number of PCs itself tuned?
Page 4 line 1: "Slater et al., 202" truncated citation.
Page 12 line 9: "are primarily used or monitoring" has to be "for monitoring"
Section 5.2 is missing.
"CHRIPS" appears multiple times as a typo for "CHIRPS"
Many figures lack units, axis labels, or sufficient caption detail to be self-contained. Figure 13 in particular needs y-axis units.
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 845 | 836 | 68 | 1,749 | 53 | 108 |
In the References, please delete
U.S. Geological Survey. (2024, July 15). Flood_Modelling_v2.docx [Data file]. Famine Early Warning Systems Network (FEWS NET). https://edcftp.cr.usgs.gov/project/FEWSNET/spervez/SouthSudan_flooding/Flood_Modelling_v2.docx
and instead cite
2025. Kimberly Slinski, Md Shahriar Pervez, James P Verdin, Abheera Hazra, Amy McNally, Karyn M. Tabor, Laura Harrison, Shraddhanand Shukla, Chris C Funk, Chris Shitote and Michael E Budde. “Forecasting Major Flood Events in the Sudd Wetlands of South Sudan: Leveraging Satellite Datasets and Earth System Models for Science-Based Decision Support.” AGU Fall Meeting 2025, December 15-19, 2025, New Orleans, LA, USA.
Thank you.