the Creative Commons Attribution 4.0 License.
INFLOW-AI v2.1: A Machine Learning Framework for Predicting Out-of-Sample Extreme Seasonal Flood Extents
Abstract. Forecasting flood extent during extreme events remains a critical challenge for hydrological modelling, particularly in data-scarce and highly dynamic floodplain systems. Accurate and timely forecasts of these events are essential for effective disaster preparedness and response. Traditional physically based methods are often not well-suited for modelling complex hydrodynamic systems, as they depend on fixed structural parameterisations of surface water processes, groundwater interactions, and evapotranspiration that are difficult to calibrate and scale in catchments with highly heterogeneous vegetation, climatology, and terrain. Machine learning approaches, which can learn nonlinear relationships directly from data without explicit physical parameterisation, offer a promising alternative for modelling flooding in these regions.
We present INFLOW-AI v2.1, a machine learning framework for predicting extreme seasonal flood extent beyond what was observed in the training set. To enhance predictive accuracy for these out-of-sample extreme events, the framework employs a two-stage neural network architecture that combines (1) extreme-sensitive temporal thresholds with (2) dynamic spatial predictions. The first stage employs transformer-based models with multi-headed attention mechanisms to capture long- and short-term hydrometeorological patterns in total flood extent over the past 36 dekads. To enable more effective detection of extremes, this stage predicts the first difference of the seasonal anomaly in total flood extent, rather than the raw total flood extent. The second stage then dynamically models spatial flooding patterns using a ConvLSTM to predict local inundation probabilities at 1 km resolution, with the basin-scale inundation extent predicted by the first stage used to constrain the spatial predictions. The model generates forecasts with a lead time of up to six dekads (two months).
A case study was conducted over the Sudd wetland in South Sudan, one of the world’s largest freshwater ecosystems, which has experienced unprecedented catastrophic flooding since June 2019, severely impacting Jonglei, Unity, and Upper Nile States. INFLOW-AI was tested on this catchment, demonstrating the two-stage model’s ability to predict extreme out-of-sample post-2019 flooding with exposure only to pre-2019 data. INFLOW-AI has been deployed operationally since the 2024 flood season (August–November) on the Joint Analysis System Meeting Infrastructure Needs (JASMIN), providing real-time predictions to humanitarian organisations and informing flood preparedness in South Sudan.
Status: final response (author comments only)
-
CC1: 'Comment on egusphere-2026-66', James Verdin, 29 Mar 2026
-
CC2: 'Reply on CC1', Jessica Rapson, 19 May 2026
Hi James,
An easy fix. Will do.
Citation: https://doi.org/10.5194/egusphere-2026-66-CC2
-
AC2: 'Reply on CC1', Jessica Rapson, 16 Jun 2026
Hi James,
An easy fix. Will do.
Citation: https://doi.org/10.5194/egusphere-2026-66-AC2
-
RC1: 'Comment on egusphere-2026-66', Anonymous Referee #1, 08 May 2026
Rapson et al. present INFLOW-AI v2.1, a two-stage ML framework for forecasting seasonal flood extent in the Sudd wetlands. The model development is solid, and the use of differenced seasonal anomalies as a predictive target is a clever and generalizable contribution. The results are plausible and the application is impactful.
I have three major comments:
- The Introduction and Section 2.2 frame the underperformance of physically based models in the Sudd as a paradigmatic limitation, motivating the shift to ML. However, the struggles of physical models in this region likely reflect data scarcity, observability constraints (e.g., sparse discharge gauges, uncertain wetland evapotranspiration, unobserved upstream dam releases), and calibration on pre-2019 conditions, rather than inherent structural failings of the modeling paradigm. I would encourage the authors to revise this framing accordingly.
- The Stage 1 transformer is rigorously benchmarked against a range of alternatives (Table 2), but the Stage 2 ConvLSTM is compared only against historic-fill and persistence-fill baselines. Vanilla ConvLSTM is well known to produce blurry predictions at longer forecast horizons, a limitation extensively documented since Shi et al. (2017) and addressed by more recent spatiotemporal architectures (e.g., PredRNN-v2, SimVP, Earthformer). I would encourage the authors to benchmark against at least one modern spatiotemporal architecture.
- The manuscript could benefit from restructuring. For example, the Background and Introduction could be merged to shorten the manuscript. Section 2.3 is essentially a methods description and would fit more naturally in Section 4. The Data section is overly long and detailed and sits awkwardly between Section 2.3 and the Methods (Section 4). Sections 4.1 and 4.2 are essentially data processing and visualization, and would be better placed in Section 3.
Minor comment:
The colormap used in Figure 14 is not explained in the caption. Please clarify what the colors encode.
Overall, this is a well-developed and operationally valuable contribution. Addressing the framing of physical models and providing more transparent validation of the ConvLSTM stage would substantially strengthen the manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-66-RC1
-
CC3: 'Reply on RC1', Jessica Rapson, 19 May 2026
Thank you for the substantive comment. To address the points:
- The struggles of physical models in this region likely do reflect data limitations, but the same is true of entirely data-driven machine learning models, so machine learning has no comparative advantage on this front. If anything, data limitations are *more* detrimental to models that rely solely on detecting empirical relationships in past data. For that reason, sensor uncertainties and limited historical records do not fit well in a discussion of the limitations of physical models that is meant to motivate a data-driven model. The same applies to pre-2019 calibration (most machine learning models trained only on pre-2019 data would fail to predict the 2019 flooding). That said, we can certainly make it clearer that the concerns raised in the paper about physical models in this region do not reflect an inherent structural failing of physical models, but rather a regionally specific limitation caused by violations of physical hydrological assumptions in the Sudd basin (especially regarding wetland dynamics and evapotranspiration).
- This is a fairly easy fix. We can benchmark against a 3D CNN (basic architecture) and SimVP (modern, widely cited architecture). In retrospect, I also think it makes sense to benchmark against the 10-year return-period GloFAS data for the region, using the constant-fill methodology described in the paper (given that this is presented as a motivating example of the need for dynamic spatial inundation forecasts).
- Agreed, these can be restructured to be more efficient.
Regarding the minor comment on Figure 14, the colours just encode categorical month. For clarity, these can be made a single colour.
Citation: https://doi.org/10.5194/egusphere-2026-66-CC3
-
AC1: 'Reply on RC1', Jessica Rapson, 16 Jun 2026
Thank you for the substantive comment. To address the points:
- The struggles of physical models in this region likely do reflect data limitations, but the same is true of entirely data-driven machine learning models, so machine learning has no comparative advantage on this front. If anything, data limitations are *more* detrimental to models that rely solely on detecting empirical relationships in past data. For that reason, sensor uncertainties and limited historical records do not fit well in a discussion of the limitations of physical models that is meant to motivate a data-driven model. The same applies to pre-2019 calibration (most machine learning models trained only on pre-2019 data would fail to predict the 2019 flooding). That said, we can certainly make it clearer that the concerns raised in the paper about physical models in this region do not reflect an inherent structural failing of physical models, but rather a regionally specific limitation caused by violations of physical hydrological assumptions in the Sudd basin (especially regarding wetland dynamics and evapotranspiration).
- This is a fairly easy fix. We can benchmark against a 3D CNN (basic architecture) and SimVP (modern, widely cited architecture). In retrospect, I also think it makes sense to benchmark against the 10-year return-period GloFAS data for the region, using the constant-fill methodology described in the paper (given that this is presented as a motivating example of the need for dynamic spatial inundation forecasts).
- Agreed, these can be restructured to be more efficient.
Regarding the minor comment on Figure 14, the colours just encode categorical month. For clarity, these can be made a single colour.
Citation: https://doi.org/10.5194/egusphere-2026-66-AC1
-
RC2: 'Comment on egusphere-2026-66', Anonymous Referee #2, 20 May 2026
The manuscript presents a machine learning framework that has already been used in practice to predict flood extent in the Sudd wetlands. The study addresses an important humanitarian problem, and the operational deployment is impressive. The two-stage design, with a basin-scale temporal model followed by a ConvLSTM spatial model, is a reasonable approach, and the seasonal differencing idea is also interesting. However, I have several important concerns about the methodology, especially about how the main claim, predicting “out-of-sample extreme values” (OSEVs), is presented and tested. In my opinion, some of these issues are serious and should be addressed before the paper can be published. I would recommend major revisions.
Below are my major comments:
- This is my main concern: OSEV is defined in Eqs. 7–8 using the raw target, where any test value above 5σ from the training mean is considered extreme. Based on that definition, there are 35 such cases after 2019. However, the model is not trained on the raw target. It is trained and tested on the transformed seasonal anomaly, Δỹₜ. According to Figure 8c, there are no OSEVs in the test set under this transformed variable, and the training set even contains more extreme values than the test set in that space. So, the model is not predicting test cases that lie outside the training distribution. Instead, it is predicting values in a transformed space that appear to be within the training range, even though after reconstruction with Eq. 22 they become extreme in the original variable. That is a weaker claim than saying the model predicts truly out-of-sample extremes, but the paper often treats these two ideas as the same. Can the authors clarify the main claim here? Is it that (a) the transformation makes the test extremes look in-distribution to the model, which seems to be what is actually happening, or (b) the model truly extrapolates beyond the training extremes, which is what the title and abstract seem to suggest? Have the authors tested cases where Δỹₜ itself is extreme in the test set? Those would be the real out-of-sample extremes for the model’s actual prediction task.
- The temporal model uses the previous 36 dekads of inundation as input. Since flooding in the Sudd is highly persistent and can last for years, the model already receives very high inundation values during the post-2019 extreme period. In other words, even if the model was not trained on these extreme periods, its inputs already show that the system is in a high-flood state. This is not exactly data leakage, but it does affect how the results should be interpreted. The model is not predicting an extreme event from normal-looking conditions. Instead, it is predicting the continuation of an already extreme situation. This may also explain why the persistence baseline performs almost as well as the transformer in Table 2. Have the authors tested the transformer without inundation history as an input? That would help show how much of the OSEV performance comes from other predictors, such as rainfall, lake levels, and climate indices, and how much simply comes from persistence in the inundation record itself.
- Section 4.4 says that for deployment, the temporal and spatial models were retrained using all historical data, including the OSEV cases that were previously held out. This means the reported test results do not represent the same model that was later used to make the 2024/2025 forecasts in Figure 12 and share them with humanitarian partners. Has the deployed model been tested on any truly held-out data? If so, what was its performance?
- Eq. 17: Lₜ = MSE · SignLoss + 0.1 · SumLoss. The SignLoss multiplier of 20 (Eq. 18) is unusually heavy and the 0.1 weight on SumLoss appears arbitrary. How were the constants 20 and 0.1 chosen? Was a sensitivity analysis performed?
- Only one train/test split was used, with training before July 21, 2018 and testing after that date. Given the small effective sample size, this is a concern. Did the authors use any time-series cross-validation, such as a rolling-origin test, to see how sensitive the results are to the choice of cutoff date? Furthermore, the paper says that 10% of the training data were used for validation with a random split. In a time-series setting, this can cause leakage between training and validation. Was the validation set separated by time instead?
- The paper compares the model with persistence, linear interpolation, linear regression, lasso, random forest, and FFNN, but it does not compare it with some other important baselines. These include a standard LSTM or GRU, which are common in flood prediction studies and are also used in works cited by the paper, such as Google’s AI Flood model and Frame et al. (2022). It also does not compare the results with GloFAS or another physically based model for the same region. In addition, Section 2.2 discusses the FEWS NET LASSO plus constant-fill method, but the paper does not provide a direct comparison with that approach on the same test set. It would also be useful to compare against a simpler transformer model, without the custom loss or Monte Carlo dropout, to better understand which design choices really matter. Why was an LSTM not included, given how common it is in flood prediction research? Can you run a head-to-head comparison with FEWS NET's actual outputs for the operational period, since both are operational systems addressing the same problem?
Minor comments:
Section 7.1 says that this approach could also be useful for forecasting other spatio-temporal hydrometeorological events, such as rainfall, cyclones, and heatwaves. This feels like too strong a claim based on the evidence presented in the paper. Those processes behave very differently from flood extent in the Sudd, which changes slowly over time and has strong persistence. That is also one reason why the persistence baseline works so well here. Can the authors either soften this statement or provide stronger evidence that the seasonal differencing approach also works for faster-changing processes, where the target has much weaker autocorrelation?
The 70 HydroATLAS features are PCA'd to 16 components, but only the first PC is used in the spatial model. Why retain 16 in the pipeline if only 1 is used? Was the number of PCs itself tuned?
Page 4 line 1: "Slater et al., 202" truncated citation.
Page 12 line 9: "are primarily used or monitoring" has to be "for monitoring"
Section 5.2 is missing.
"CHRIPS" appears multiple times as a typo for "CHIRPS"
Many figures lack units, axis labels, or sufficient caption detail to be self-contained. Figure 13 in particular needs y-axis units.
-
AC3: 'Reply on RC2', Jessica Rapson, 16 Jun 2026
Thank you for the substantive comment. To respond:
1. Definition and interpretation of OSEVs
We agree that the manuscript could be clearer in distinguishing between extremes in the original inundation series and extremes in the transformed anomaly space used for model training. Our primary claim is closer to interpretation (a): the seasonal differencing transformation converts the post-2019 flood regime into a representation that is substantially more stationary and amenable to learning, allowing the model to reconstruct inundation levels that are extreme in the original variable despite not being extreme in the transformed space.
We do not claim that the model is extrapolating beyond the training distribution in the transformed anomaly space. Rather, the key result is that the transformation enables accurate forecasting of inundation levels that are extreme relative to the historical inundation record. We agree that some wording in the title, abstract, and discussion could be interpreted as implying direct extrapolation by the model, and we will revise the manuscript to make this distinction explicit.
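The distinction between the two spaces can be illustrated with a small synthetic sketch. It is not the manuscript's code or data: the series is invented, the 5σ rule follows the referee's description of Eqs. 7–8, and the helper names (`osev_mask`, `dekad_climatology`, `anomaly_first_difference`) and the exact form of the seasonal anomaly (value minus the training mean for the same dekad of the year) are assumptions made for illustration.

```python
import numpy as np

DEKADS_PER_YEAR = 36  # ten-day periods per year

def osev_mask(train, test, k=5.0):
    """Flag test values more than k training standard deviations above the training mean."""
    return test > train.mean() + k * train.std()

def dekad_climatology(train):
    """Mean of the training series for each dekad of the year (assumed anomaly baseline)."""
    idx = np.arange(len(train)) % DEKADS_PER_YEAR
    return np.array([train[idx == d].mean() for d in range(DEKADS_PER_YEAR)])

def anomaly_first_difference(series, climatology, start=0):
    """First difference of the seasonal anomaly; `start` is the time index of series[0]."""
    idx = (start + np.arange(len(series))) % DEKADS_PER_YEAR
    return np.diff(series - climatology[idx])

# Synthetic, deterministic series (not the Sudd record): ten training years with a
# seasonal cycle, then three test years in which extent ratchets up a little each dekad.
n_train, n_test = 10 * DEKADS_PER_YEAR, 3 * DEKADS_PER_YEAR
t = np.arange(n_train + n_test)
seasonal = 10.0 + 2.0 * np.sin(2 * np.pi * t / DEKADS_PER_YEAR)
wiggle = 0.3 * np.sin(1.7 * t)  # stand-in for short-term variability
ramp = np.concatenate([np.zeros(n_train), 0.15 * np.arange(1, n_test + 1)])
series = seasonal + wiggle + ramp
train, test = series[:n_train], series[n_train:]

clim = dekad_climatology(train)
d_train = anomaly_first_difference(train, clim)
d_test = anomaly_first_difference(test, clim, start=n_train)

raw_osev = osev_mask(train, test)        # extremes in the original variable
diff_osev = osev_mask(d_train, d_test)   # extremes in the transformed variable
print(int(raw_osev.sum()), int(diff_osev.sum()))
```

In this toy setting the test period contains values beyond 5σ in the raw series while none of the differenced anomalies exceed the corresponding threshold, which is the situation described under interpretation (a).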
2. Role of inundation persistence
We agree that the strong persistence of inundation dynamics in the Sudd is an important consideration when interpreting the results. It is correct that the model is often forecasting the continuation and evolution of an already anomalous flood state rather than predicting the onset of an extreme event from climatologically normal conditions.
However, the manuscript already includes extensive comparison against a persistence baseline precisely to assess whether the model provides predictive skill beyond persistence alone. The improvement over persistence reflects the contribution of the additional hydrometeorological predictors and the model's ability to capture dynamics beyond simple continuation of current conditions.
We nevertheless agree that an explicit ablation study would provide a clearer quantification of the contribution of non-autoregressive predictors. We will investigate adding an experiment in which inundation-history inputs are removed and performance is evaluated using only the external predictors. This will be included in the revised manuscript.
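One common way to express skill beyond persistence is an MSE-based skill score relative to the persistence forecast. The sketch below is illustrative only: the function names and the toy numbers are assumptions, and the manuscript's own metrics in Table 2 may be defined differently.

```python
import numpy as np

def persistence_forecast(series, lead):
    """Persistence: the forecast for time t + lead is the value observed at time t."""
    return series[:-lead]

def mse_skill_score(obs, model_pred, baseline_pred):
    """1 - MSE_model / MSE_baseline; positive values mean the model beats the baseline."""
    mse_model = np.mean((obs - model_pred) ** 2)
    mse_base = np.mean((obs - baseline_pred) ** 2)
    return 1.0 - mse_model / mse_base

# Toy numbers, purely illustrative.
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
lead = 1
obs = series[lead:]                               # values being forecast
baseline = persistence_forecast(series, lead)     # last observed value carried forward
model = np.array([2.5, 4.5, 6.5, 10.5])           # hypothetical model output

skill = mse_skill_score(obs, model, baseline)
print(round(skill, 4))
```

Running the same score on a model variant without inundation-history inputs would give a direct measure of how much skill the external predictors contribute on their own.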
3. Evaluation model vs. deployment model
The reported results are based on a fixed temporally held-out evaluation period intended to assess retrospective generalisation performance. After this evaluation, the operational deployment model was retrained on the full historical archive, including the previously held-out OSEV cases, so that the live forecasting system could make use of all available information.
We agree that this means the deployed model is not identical to the backtested evaluation model, and we will clarify this distinction more explicitly in the revised manuscript. The forecasts shown in Figure 12 should therefore be interpreted as operational deployment outputs rather than as additional held-out test results. Showing results for model accuracy with a rolling hold-out set should provide a strong indication of how well future deployments should perform. Rolling hold-out assessments will be included in the manuscript revision.
4. Loss-function weighting
The loss-function weights were selected empirically during model development to balance the competing objectives of minimising magnitude errors, preserving the direction of inundation change, and constraining aggregate volume errors. The relatively large SignLoss weighting reflects the operational importance of correctly predicting inundation expansion versus contraction, while the smaller SumLoss coefficient was chosen to keep its contribution comparable in scale to the other loss terms.
We agree that the rationale for these choices is not currently explained sufficiently clearly. We will add additional discussion of the weight selection process in the manuscript revision.
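A sensitivity check on the two constants could look something like the following NumPy sketch. Only the overall structure (MSE multiplied by a sign term, plus a weighted sum term, with constants 20 and 0.1) comes from the referee's description of Eqs. 17–18. The specific forms of `sign_loss` and `sum_loss` below are assumptions, since the exact definitions are not given in this discussion, and a real training loss would need a differentiable surrogate for the sign comparison.

```python
import numpy as np

def combined_loss(y_true, y_pred, sign_weight=20.0, sum_weight=0.1):
    """Illustrative loss with the structure MSE * SignLoss + sum_weight * SumLoss."""
    mse = np.mean((y_true - y_pred) ** 2)
    # Assumed form: penalty grows with the fraction of steps whose predicted
    # direction of change (expansion vs. contraction) is wrong.
    sign_mismatch = np.mean(np.sign(y_true) != np.sign(y_pred))
    sign_loss = 1.0 + sign_weight * sign_mismatch
    # Assumed form: absolute error in the total change summed over the window.
    sum_loss = np.abs(np.sum(y_true) - np.sum(y_pred))
    return mse * sign_loss + sum_weight * sum_loss

# Toy differenced-anomaly targets and predictions (hypothetical values).
y_true = np.array([0.5, -0.2, 0.3, 0.1])
y_pred = np.array([0.4, 0.1, 0.2, 0.1])

# Evaluate the same predictions under a small grid of weights.
grid = {
    (sw, uw): combined_loss(y_true, y_pred, sign_weight=sw, sum_weight=uw)
    for sw in (5.0, 20.0, 50.0)
    for uw in (0.01, 0.1, 1.0)
}
for (sw, uw), value in sorted(grid.items()):
    print(f"sign_weight={sw:5.1f} sum_weight={uw:4.2f} loss={value:.4f}")
```

In practice each grid point would require retraining and scoring on a temporally held-out validation set; the grid above only shows how the loss value itself responds to the constants.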
5. Train/test split and validation procedure
We agree that reliance on a single temporal split may make results somewhat sensitive to the chosen cutoff date. The pre-/post-2018 split was selected specifically to preserve a strictly held-out evaluation period containing the post-2019 extreme flooding events that motivated this study.
We also agree that rolling-origin (walk-forward) evaluation would provide a useful additional robustness check (as discussed above, this would also help provide an indication of performance for the deployed model, which would be most akin to a last-week-held-out model).
Regarding validation, we agree that random validation splits are not ideal in a time-series setting due to temporal autocorrelation. Importantly, the final test set remained strictly temporally held out and was never used for model selection. Nevertheless, we agree that temporally separated validation is more appropriate, and we will clarify the validation procedure in the revised manuscript and revise it if necessary to ensure strict temporal separation between training and validation data.
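A rolling-origin scheme with a temporally separated validation block can be sketched as below. The function name, parameters, and the toy sizes are assumptions for illustration, not the manuscript's configuration; indices are positions in a time-ordered sample array.

```python
def rolling_origin_splits(n_samples, initial_train, horizon, step, val_size):
    """Yield (train, val, test) index ranges in strict time order.

    The validation block is the last `val_size` samples before each forecast
    origin, so no validation sample precedes a training sample.
    """
    origin = initial_train
    while origin + horizon <= n_samples:
        train = range(0, origin - val_size)
        val = range(origin - val_size, origin)
        test = range(origin, origin + horizon)
        yield train, val, test
        origin += step

# Toy configuration: 100 time-ordered samples, first origin at index 60,
# six-step test windows, origin advanced by 12 samples per fold.
splits = list(rolling_origin_splits(100, initial_train=60, horizon=6, step=12, val_size=6))
for train, val, test in splits:
    print(f"train 0..{train[-1]}  val {val[0]}..{val[-1]}  test {test[0]}..{test[-1]}")
```

Because each input uses a 36-dekad history, samples near a block boundary share input data with the neighbouring block; leaving a gap between blocks is one way to limit that overlap.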
6. Additional baseline comparisons
We agree that additional baselines would strengthen the manuscript. Benchmarking against a standard LSTM or GRU is particularly reasonable given their widespread use in hydrological forecasting and their use in several studies cited in the manuscript. We will investigate adding these comparisons in the revision.
We also agree that comparison against a simpler transformer variant would help isolate the contribution of the custom loss function and uncertainty estimation framework.
With respect to physically based benchmarks, we agree that comparison against GloFAS is valuable. In retrospect, it also makes sense to benchmark against the 10-year return-period GloFAS product discussed in the manuscript using the same constant-fill methodology employed by FEWS NET, particularly because this operational workflow is presented as a motivating example for dynamic inundation forecasting.
Regarding FEWS NET, direct comparison against historical operational outputs is challenging because historical forecast archives are not consistently available (though we did try to access them). At minimum, we will add comparison against the LASSO plus constant-fill methodology described in Section 2.2 to the revised manuscript.
Regarding the minor comments:
1. Generalisability to other hazards
We agree that the current discussion is overly broad. We will soften the language in Section 7.1 and clarify that the demonstrated applicability of the framework is currently limited to slowly evolving spatio-temporal systems exhibiting strong persistence and seasonality. Broader applicability to phenomena such as cyclones, rainfall, or heatwaves remains speculative and requires separate evaluation.
2. HydroATLAS PCA components
We will clarify the PCA workflow and provide additional justification for the number of retained principal components. We will also explain why only the leading component was ultimately used by the spatial model and whether the number of retained components was tuned during development. In short, the spatial model grows exponentially with each newly added layer, and the first principal component captured enough information that additional components were not deemed necessary: these are static attributes, and most of the predictive value for the spatial model comes from the dynamic spatio-temporal data.
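The reduction from 70 static attributes to 16 components, with only the first passed on, can be sketched with a plain SVD-based PCA. The data here are random placeholders rather than HydroATLAS attributes, and the function name and standardisation step are assumptions; the explained-variance ratios are the quantity one would inspect when justifying how many components to keep.

```python
import numpy as np

def pca_scores(X, n_components):
    """Standardise columns, then project onto the leading principal components via SVD."""
    centred = X - X.mean(axis=0)
    scale = centred.std(axis=0)
    scale[scale == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    Z = centred / scale
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)  # variance fraction per component, descending
    return scores, explained[:n_components]

# Placeholder data: 500 grid cells, 70 static features sharing one latent factor.
rng = np.random.default_rng(0)
n_cells, n_features = 500, 70
latent = rng.normal(size=(n_cells, 1))
loadings = rng.normal(size=(1, n_features))
X = latent @ loadings + 0.5 * rng.normal(size=(n_cells, n_features))

scores, explained = pca_scores(X, n_components=16)
first_pc = scores[:, 0]              # the single component fed to the spatial model
cumulative = np.cumsum(explained)    # cumulative variance captured by 1..16 components
```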
In addition, per the remaining minor comments, we will:
- Correct the truncated citation ("Slater et al., 202").
- Correct "used or monitoring" to "used for monitoring".
- Restore the missing Section 5.2 heading/content.
- Correct all instances of "CHRIPS" to "CHIRPS".
- Revise figures and captions to ensure units, axis labels, and sufficient descriptive detail are provided throughout.
- Add y-axis units to Figure 13.
Thank you again for these detailed and constructive comments. The suggested clarifications, additional analyses, and benchmarking experiments will substantially strengthen the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-66-AC3
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 852 | 846 | 70 | 1,768 | 54 | 109 |
In the References, please delete
U.S. Geological Survey. (2024, July 15). Flood_Modelling_v2.docx [Data file]. Famine Early Warning Systems Network (FEWS NET). https://edcftp.cr.usgs.gov/project/FEWSNET/spervez/SouthSudan_flooding/Flood_Modelling_v2.docx
and instead cite
2025. Kimberly Slinski, Md Shahriar Pervez, James P Verdin, Abheera Hazra, Amy McNally, Karyn M. Tabor, Laura Harrison, Shraddhanand Shukla, Chris C Funk, Chris Shitote and Michael E Budde. "Forecasting Major Flood Events in the Sudd Wetlands of South Sudan: Leveraging Satellite Datasets and Earth System Models for Science-Based Decision Support." AGU Fall Meeting 2025, December 15-19, 2025, New Orleans, LA, USA.
Thank you.