Evaluating the Feasibility of Scaling the FIER Framework for Large-Scale Flood Inundation Prediction
Abstract. Floods are a recurring global threat, causing loss of life, property damage, and agricultural impacts. Accurate and timely flood inundation forecasts are crucial for effective disaster preparedness and mitigation. However, traditional flood forecasting methods often face challenges in computational demands and data requirements, particularly when applied to large geographic areas. This study presents a novel approach to scaling a data-driven flood forecasting framework, Forecasting Inundation Extents using REOF (Rotated Empirical Orthogonal Function), or FIER, to large geographic regions. FIER leverages historical satellite imagery and streamflow data to predict flood inundation extents without relying on complex hydrodynamic models. We demonstrate the effectiveness of applying FIER over a large geographic extent by using watershed boundaries to create individual FIER models and then mosaicking the results geographically to produce large-scale flood inundation predictions. The Upper Mississippi Alluvial Plain in the United States was used as a test region. We evaluated multiple watershed buffer sizes for generating the data-driven FIER models in order to reduce edge effects along watershed boundaries when mosaicking the individual FIER implementations. The watershed-based FIER method, coupled with different forecast lead times from the National Water Model operational streamflow forecasts, was used to accurately predict the extent of surface water for selected flood and low-flow use cases. Our results show that the scaled FIER approach using watersheds yields higher accuracy across several error metrics, including the Structural Similarity Index Measure (SSIM), RMSE, and MAE. The watershed-scaling approach produced SSIM values of 0.699–0.804, RMSE of 7.15–8.60, and MAE of 1.09–1.88, compared to a baseline area with SSIM of 0.643–0.693, RMSE of 8.112–11.681, and MAE of 1.969–1.989. We found that scaling FIER using a watershed approach yielded statistically significantly better performance than the baseline area, particularly when using watershed buffer sizes of 0–10 km and when applying a post-processing correction to the FIER outputs. This approach offers a promising solution for large-scale flood forecasting, particularly in data-scarce regions or ungauged basins. Future research will focus on refining the framework to incorporate additional hydrological variables and improve the accuracy of long-range flood inundation predictions.
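For context on the reported metrics, below is a minimal sketch of how SSIM, RMSE, and MAE are conventionally computed between two co-registered water-fraction rasters. The array names and the assumed 0–100 percent fraction range are illustrative assumptions, not details taken from the manuscript.

```python
# Hedged sketch: conventional computation of the error metrics named in the
# abstract, for two co-registered 2-D water-fraction rasters.
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_maps(predicted: np.ndarray, observed: np.ndarray) -> dict:
    """Return SSIM, RMSE, and MAE for a predicted vs. observed raster."""
    residual = predicted - observed
    return {
        # SSIM needs the data's dynamic range; water fractions are
        # assumed here to span 0-100 percent.
        "SSIM": structural_similarity(observed, predicted, data_range=100.0),
        "RMSE": float(np.sqrt(np.mean(residual ** 2))),
        "MAE": float(np.mean(np.abs(residual))),
    }

# Synthetic stand-in rasters, purely to make the sketch runnable.
rng = np.random.default_rng(0)
obs = rng.uniform(0, 100, size=(64, 64))
pred = np.clip(obs + rng.normal(0, 5, size=obs.shape), 0, 100)
print(evaluate_maps(pred, obs))
```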
Status: open (until 27 Dec 2024)
RC1: 'Comment on egusphere-2024-3491', Anonymous Referee #1, 26 Nov 2024
The manuscript describes research on a new approach for scaling a flood inundation prediction method to larger regions. The text is well written and has a good overall structure, with additional information included in an appendix. The topic is relevant, as scaling flood forecasting methods to enable coverage over large areas, while maintaining the spatial and temporal resolution, lead times and accuracy required to guide decisions prior to and during a flood event, is challenging and a topic of active research across the globe. This is also the case for the US, where this research has placed its study area. Furthermore, the fact that the code/scripts used for the study are made publicly available under an open-source license, and the explicit attention the research places on operational constraints and use, are commendable. This can enable the research to be applied in operational practice more easily (something that is often overlooked, even when stated as a driver or goal of a manuscript). These are all points that merit publication.
However, there are a few areas where the manuscript can (and in some cases should) be improved, as outlined below. This includes general comments on (parts of) the text, followed by detailed line-by-line comments.

General comments
The abstract gives a good and concise overview of the whole text, but a few things in it could be made clearer. Sometimes it may simply be a matter of word choice. For example, it is stated that FIER is a data-driven method but also a promising solution for data-scarce regions, so apparently these are different types of data. Stating specific numbers (e.g. error metrics, buffer sizes) in the abstract can sometimes help readers, but only if they can directly place them in context and gauge their value. For the error metrics it would help to know how these were calculated (e.g. what are the observations, and for which regions/times). For the buffer sizes some more clarification and context would help (e.g. are these large or small compared to other tested buffers, and what is the post-processing technique applied?). If this would make the abstract too long, consider removing the specific numbers and instead explaining things at a higher level.
There seems to be a mixed signal regarding different types of models. On one hand, the drawbacks of hydraulic/hydrodynamic models are listed, such as extensive parameter tuning and uncertainties in the input data. On the other hand, hydrological model output is mentioned as a good data source, while hydrological models suffer from very similar issues, at least in terms of parameter tuning and input/output uncertainties. The latter is recognized later in the text, but only as a potential explanation for issues with the described framework (more on that later). Describing this early on, and explaining why such output should still be used, would show that the authors recognize that these uncertainties are, with the current set-up, an inherent element of the framework.
The described method requires historical imagery and data. The authors begin their introduction with statements around climate change, population growth and urban expansion, rightly pointing to these to give relevance to the research. However, as a data-driven method, it is unclear how well the proposed method (including its post-processing correction step) could cope with such changes. This also holds for other changes directly influencing flood patterns, such as new infrastructure or (temporary) flood defenses. It would be good to make note of this in the text, even if the implications are uncertain.
It might help readers if a better description of the study area were given, including its hydrology, main flood drivers, relevant structures, and precipitation and streamflow inflows from outside the study domain. This could help readers gauge results (especially since some things remain unclear from the results and associated error metrics) and judge transferability to their own domains of interest.
The research is well structured, executed and documented, including the calculation of various error metrics. However, these (and the results in general) could do with more explanation, especially on their implications. For example, it would help readers if there was some insight on what certain RMSE values imply for flood forecasting purposes. Related, what is (or could be) the benchmark for each error metric? Without these, it is hard to judge if this indeed makes the described method suitable for operational flood forecasting purposes, while this is implied in the text. The use cases (and the maps shown there) definitely help in this regard, but also have the same issue with error metrics. This is understandably a hard task, as, for example, benchmarks for flood inundation maps are probably not available. But as long as it is unknown what would be required for operational flood forecasting and decision making, statements on the suitability of the described method for those purposes feel somewhat unsubstantiated.
The focus of the text is, understandably, very much on floods. It also includes a use case on low flows, but this is a relatively weak part of the manuscript. The study area was chosen because it is a flood-prone region, not because of its relevance for droughts. The reasoning behind the approach and the research also utilizes current challenges with flood (not drought) forecasting. Finally, as stated in some of the detailed comments on the low flows use case, the direct applicability of FIER for this use case is not clear. That being said, the fact that FIER can function well under low flow conditions, i.e. not producing bad results that would hinder its use and reduce trust in its results, speaks well for its operational potential. That is something that could be highlighted somewhere in the text as well.

Detailed comments
[Line 29] What do these specific buffer values imply? Are they relatively small or large? (see also general comments on abstract)

[Line 66] “hydrological model outputs” hasn’t been mentioned as input/use before, so the fact that these are “indeed” a “promising avenue” is unsubstantiated. This also relates to the general comment on (hydrological) model limitations and uncertainties.
[Line 96] “[…] without the complexities […] of hydrodynamic models”; that depends, as FIER itself is also rather complex. Hydrodynamic models are often offered as software packages which take away most of the underlying complexities for the user/modeler. How is this with FIER? How complex would it be for others to set this up, in comparison with commonly used hydrodynamic models?
[Line 96] “[…] without the […] computational needs of traditional hydrodynamic models” vs. “there is a computational challenge as the nature of developing the flood patterns requires loading data in memory for processing so applying FIER over large areas can be a challenge” (lines 81-82). It can be assumed that FIER’s computational costs are concentrated in its set-up/training, while those of traditional models are in their execution, but it would be good to make that distinction explicit somewhere.
[Line 141] As mentioned in the general comments, it would help readers if there was more information on the study area. Some suggestions follow below.
[Line 145] What are the (in)flows of these rivers into the study area?
[Line 146] Where are these reservoirs located and what are their characteristics (e.g. how much can they buffer to reduce downstream flooding)?
[Line 147] Is that localized rainfall (and snowmelt) within the study area or coming (as runoff and/or streamflow) from upstream? Does local precipitation have a strong influence on floods within the study area or is it mainly streamflow-driven?
[Line 148] “[…] leading to increases in streamflow […]” Figure 3 (line 275) seems to indicate the opposite, i.e. a downward trend in streamflow? Is this a short-term anomaly, error in the simulations, or something else?
[Line 155] “[…] section 2.3 Experimental Design […]” Should probably be 3.3.
[Line 174] “[…] 2012-01-20 to, […]” seems like the end date is missing.
[Line 196] How were the basins selected? Not all basins intersect with the baseline area, but there are also more basins on the sides that could have been included. For reproducibility and the readers’ understanding, it would help to make this clear.
[Line 201-202] How do these buffer sizes compare to the watershed (area) sizes? It would be interesting for others to know this when applying it to other watersheds, as there probably is some sort of relation there? Has that been tested/investigated? This is addressed in the discussion section (lines 458-462), but an indication of the above, or at least how the current values were chosen, might be helpful to readers.
[Line 202-203] Duplicate sentence, has been stated in the prior sentence already, so I’d suggest this one can be removed.
[Line 205] Why 99.9% and not 100%?
[Line 205-206] What percentage of data is kept (or removed) as part of those two thresholds, i.e. the initial 90% and later 99.9%?
[Line 205-206] And how were these cloud percentages assessed/calculated? It is known that identification of cloud cover in satellite imagery is not a trivial task due to the varying nature and characteristics of clouds. Certain clouds can also have more impact on certain bands of the imagery. There’s often a balance between over- and underestimation of clouds, implying that sometimes clouds are missed (and thus included in the imagery even if it is stated it’s 100% clear). What would be the impact of cloudy pixels present in the data when fitting FIER?
[Line 205-206] FIER has previously been tested with Sentinel-1 SAR imagery (https://doi.org/10.1016/j.envsoft.2023.105643), which of course has far fewer issues with clouds. Why was it decided to go with VIIRS here? This is briefly touched upon in the Discussion (lines 512-513), stating that VIIRS provides daily observations. However, there are likely quite a few days without data due to clouds (and cloud filtering), see also the two comments above, so were there other factors behind the decision for VIIRS?
[Line 259-261] Why 2019 and 2020? Streamflow seems to indicate a downward year-to-year trend in the operational hydrograph (2018-2024, Figure 3). Low-flow periods in later years are thus more extreme. This is not to say that taking the most extreme low flow is the best choice, but an explanation as to why 2019 and 2020 were chosen would help the reader understand the authors’ line of thought.
[Line 267-269] What is meant with “evaluating” the return periods (vs. the previously mentioned calculation of them)? It is clear that the NWM operational product is to be used operationally, and thus also to test/validate FIER, but that was already stated previously and this seems to indicate something else.
[Line 271-272] Similar as the questions above on low flows, why 2018-2020 for high flows / floods? This is more clear for floods, as we can see from Figure 3 that the 50-year event in 2019 is the most extreme in this reach, but it’s still interesting to know why this decision was made and to make that explicit in the text.
[Line 281-282] NWM has sub-daily forecasts for a reason; this is often a requirement, or at least standard practice, for operational flood forecasting, as forecasters and decision makers need it at this level. Flood forecasting with a daily timestep is often considered inadequate, especially for shorter lead times. There can be various reasons why it was decided to go with a daily time step for FIER, but it would be good to mention these and if/how it could be done on the same time step as NWM.
[Line 282-285] Similar question as above on the ensembles; although perhaps less relevant than the sub-daily timestep, ensembles do enable estimates of uncertainty and probabilistic forecasting. While assumptions can be made about why these were averaged for this study, it would be good to make that explicit in the text.
[Line 293-294] Is this computational performance? If so, would be good to make that explicit (and provide at least some kind of indication of what it means). If not, the sentence might be obsolete, as the following sentences describe performance well enough. Related, it is later stated (lines 465-466) that “[…] using more watersheds comes at the cost of increased computational complexity […]”, which opens up the question as to whether that is a significant increase, and if so, what the relation between choice of watersheds and computational complexity/burden is.
[Line 296-297] It might be better to split this sentence in two as the fact that “[…] mosaicked results begin to trend more closely aligned to the baseline […]” and “[…] are not that much lower than the SSIM metric at the lower buffer sizes.” are two distinct observations. One is a comparison of two different methods, the other a comparison of input parameters within one of those methods. At present, it can create confusion for the reader.
[Line 297-298] If this is indeed also true for RMSE and MAE, why not include that with the previous (split-up, see comment above) sentence? However, I’m not sure it really is. Roughly reading the graphs in Figure 4, the SSIM differences across buffer sizes are indeed small (about 0.02 on values around 0.7, i.e. roughly 3%), but for RMSE the relative difference is already a lot larger (about 0.8 on values around 7.8, i.e. over 10%) and for MAE larger still (about 0.3 on values around 1.75). Or was this statement referring to the first part of the previous sentence? (see comment above about confusion)
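For transparency, the rough percentages above can be reproduced as follows; the inputs are approximate readings from the graphs in Figure 4, not exact values from the paper.

```python
# Approximate graph readings: (spread across buffer sizes, typical value).
metrics = {
    "SSIM": (0.02, 0.7),
    "RMSE": (0.8, 7.8),
    "MAE": (0.3, 1.75),
}
for name, (spread, typical) in metrics.items():
    print(f"{name}: ~{100 * spread / typical:.1f}% relative spread")
# SSIM: ~2.9%, RMSE: ~10.3%, MAE: ~17.1%
```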
[Line 304-305] “[…] noticeable in the RMSE and RRMSE […]” Indeed it is, but what about SSIM? That also shows an interesting pattern, where the original scores lower while the mosaicked version shows higher values after post-processing (contrary to RMSE and RRMSE, where both score worse). This might be worth mentioning as well, along with an explanation of what could be behind it.
[Line 308-310] Indeed, and it would be worthwhile to investigate this further. What could cause this? Since FIER is a data-driven method (and so is the post-processing with CDF), there must be something in the data or the method itself that can explain this? Or is there still something that is not understood?
[Line 348-349] Indeed, but is this compared against the post-processed baseline as well? We’ve seen earlier that error metrics of the baseline deteriorate after post-processing, while they improve for the mosaicking approach, so testing the significance between these might not be that relevant. For the baseline, one would probably decide not to use the post-processing step? A fairer comparison would thus be against the original baseline.
[Line 355-367] Related to the above comment. Yes, there indeed seems to be merit in the watershed approach. But in the paragraphs above there was more caution, along with the notion that there can be more large errors with the post-processed results. Also, if (as questioned above) the tests here are indeed comparing the post-processed baseline and mosaics, the statement “[…] CDF matching further improves the accuracy […]” is based on a comparison that might be questionable, or should at least be explained better.
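For readers less familiar with the post-processing referenced in these comments, below is a minimal sketch of CDF (quantile) matching as it is commonly implemented; the manuscript’s exact procedure may well differ in its details.

```python
import numpy as np

def cdf_match(values, model_ref, obs_ref):
    """Map `values` from the model's historical distribution onto the
    observed distribution by matching empirical quantiles."""
    model_sorted = np.sort(np.ravel(model_ref))
    obs_sorted = np.sort(np.ravel(obs_ref))
    # Quantile of each value within the model's historical distribution...
    q = np.interp(np.ravel(values), model_sorted,
                  np.linspace(0.0, 1.0, model_sorted.size))
    # ...then look up the observed value at that same quantile.
    matched = np.interp(q, np.linspace(0.0, 1.0, obs_sorted.size), obs_sorted)
    return matched.reshape(np.shape(values))
```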
[Line 380-381] “[…] better error metric results (SSIM, RMSE, MAE) […]” Doesn’t RMSE increase with post-processing corrections?
[Line 391-392] “[…] however […]” Nitpicking on word choice, but wasn’t that (smaller extent with larger lead time) also already the case with the other event (top row in figure), even if the difference is only observed between medium and long-range? So these are similar, while ‘however’ seems to indicate the opposite?
[Line 391-392] There seem to be more differences between the medium and long-range versions, not necessarily (or at least not only) indicating smaller flood extents with the long-range version. Certain regions, especially in the Southern (downstream?) half, seem to have more water in the long-range version (even if the tributary there is not connected then). This implies more complex behavior than simply less water overall, and it might be good to reflect on that in the text.
[Line 392] “[…] suggesting more uncertainty with longer range forecasts.” Why would a smaller flood extent imply more uncertainty? Or does this follow from something else?
[Line 393-394] “[…] long-range forecast exhibits slightly higher RMSE […]” Isn’t RMSE for long-range (16.3) lower than for nowcast (16.9) and medium-range (17.3)?
[Line 396-397] “[…] even with extended lead times […]” In fact, it seems to suggest that FIER performs best with the long-range forecast, as this has the best error metrics across the board? As stated in the text, longer lead times usually have higher uncertainties, so this is an interesting pattern. It just being a ‘lucky shot’ (across the two events, as it also holds for the second case) seems unlikely (although it cannot be ruled out, two cases are not enough to come to definitive conclusions). That would imply there are certain elements, within the input data and/or within the FIER framework, that work better for the long-range forecast and it would be very interesting to identify those. At minimum, something along these lines should be reflected in the text.
[Line 398] “[…] even better […]” That is rather subjective. This implies that performance of the first event was already good, but was it? Results are much worse compared to what’s shown before (e.g. nearly all metrics off the charts for Figure 4). There has been no link with what the metrics truly imply so far (e.g. their influence on operational practice or otherwise, see also general comment on this), so whether they are indeed good is hard to judge. It also implies that it has improved for the second event, but that also depends on what’s being looked at. All metrics are worse for nowcast, but are indeed better for (especially) medium and long-range.
[Line 398-399] “[…] particularly for the long-range […]” Related to comment directly above. Performance of stated SSIM and RMSE are only slightly better (0.6676 vs. 0.6664 and 16.04 vs. 16.26, respectively).
[Line 400] “[…] capturing more frequent flood events with high accuracy […]” See comments above. With rather small differences between some of the error metrics of the 50 and 5-year return periods, and no implications on what those values truly mean, this statement feels out of place. It either needs more explanation to back this up or should be changed.
[Line 401-403] “[…] likely due to the errors in the NWM streamflow predictions […]” It would be worthwhile to assess this quantitatively. It can of course be expected that there are errors in those predictions, but how large are they and what is their influence on the “degradation in performance” of FIER? And could there be any other factors influencing this?
[Line 405, Figure 6] Including differences between observations and FIER would help the reader assess quality and gauge the calculated error metrics. Showing this in the same figure might not be feasible, but a different figure (perhaps in the Appendix / supplementary material) would be great.
[Line 405, Figure 6] It seems from the first case (top row) that there are streams / water areas which are being cut off at the basin boundary (i.e. on the Southwestern side). Looking at Figure 2, these indeed seem to drain into the main river later (further South, i.e. downstream?). This relates to the earlier question on how the basins were selected (line 196). Would this influence FIER results (and if so, how), as the flood patterns there are less related to those in the main channel (e.g. not driven by discharge used as input)?
[Line 419-420] “[…] demonstrates the lowest RMSE 420 (8.53) and MAE (0.89) […]” And RRMSE?
[Line 420] “[…] even with extended lead times.” In fact, ‘especially’ would be a better choice of word here than ‘even’, as it performs best across all metrics and for both cases? This relates to similar comments regarding lead time for the flood cases.
[Line 425-426] Related to earlier comment on Figure 6; a calculated comparison (e.g. difference map) would help the reader to follow this statement.
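To illustrate the kind of calculated comparison suggested here, a minimal sketch using synthetic stand-in rasters; in practice these would be the co-registered FIER output and the observed water-fraction map.

```python
# Sketch of the suggested observation-vs-prediction difference map.
# Synthetic stand-in arrays are used purely to make the sketch runnable.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
observed = rng.uniform(0, 100, size=(128, 128))            # stand-in raster
fier = np.clip(observed + rng.normal(0, 8, observed.shape), 0, 100)

diff = fier - observed  # positive = FIER wetter than the observation
plt.imshow(diff, cmap="RdBu_r", vmin=-30, vmax=30)
plt.colorbar(label="FIER minus observed water fraction (%)")
plt.title("Difference map (synthetic illustration)")
plt.savefig("difference_map.png", dpi=150)
```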
[Line 426] “[…] particularly for the nowcast and medium-range forecasts.” Isn’t that counterintuitive, as the error metrics seem to imply the opposite? Looking at the figures closely also seems to indicate the opposite, although it is hard to make a definitive statement without a calculated comparison (see comment directly above). The nowcast seems to have more small isolated water bodies (or noise?), which do not seem to be present in the observations?
[Line 426-428] “[…] using the streamflow forecasts […]” Don’t all of these use those forecasts? Or is this about the distinction between nowcast and forecasts?
[Line 426-428] “[…] show some minor deviations […]” Such as? Pointing to an example would help the reader here.
[Line 426-428] “[…] inherent uncertainties […]” Similar comment as previously on the flood case; it would be worthwhile to assess this quantitatively. It can of course be expected that there are errors in those predictions, but how large are they and what is their influence here? And could there be any other factors?
[Line 429] “[…] low flow forecasting.” Does FIER also forecast flows? These have not been shown so far. If so, they could be a valuable addition. If not, consider rephrasing to something akin to ‘during low flow conditions’. This also holds for the same wording in line 431.
[Line 431-433] FIER can forecast water fractions. During flood conditions these can serve as flood maps, forecasting where inundation could take place, which can definitely be useful information for decision makers. However, this might be less apparent during droughts. How would water fraction maps help “water management decisions, mitigating drought impacts, and ensuring sustainable water resource allocation”? What are the protocols or operational practice in this region? In many regions of the world these are based on streamflow, groundwater levels and/or lake/reservoir volumes, which are not an output of FIER. Outlining how the information from FIER would be used in practice would help strengthen this statement.
[Line 456-457] Figure 4 does not explicitly show that “discontinuities at watershed boundaries” or “abrupt transitions” are mitigated, it shows aggregated error metrics. We have in fact not seen an example of those discontinuities, so it might be worthwhile to include that.
[Line 464] “[…] allows for finer spatial resolution […]” Is FIER spatial resolution dependent on the area it covers (not the input data)? If so, this is new information and it would be good to include that earlier in the text.
[Line 479] “[…] calibration […]” Sorry for nitpicking again, but while FIER might not require calibration in the traditional sense, it has just been described that the “optimal watershed scale and buffer size” would “require careful evaluation”, which is not completely unlike calibration.
[Line 479] “[…] data independence […]” Hopefully the last nitpicking comment. Related to some previous comments on this; it’s a specific type of data that FIER is independent of, while it requires other data. Good to make that distinction.
[Line 481] one of “areas” or “regions” is redundant here, choose one.
[Line 485-487] Couldn’t changes in hydrologic conditions also affect the patterns learnt by FIER, thus causing results under changed conditions to be less accurate? Also, this would require a hydrological model coupled to FIER, as such conditions aren’t input for FIER itself? Finally, it’s not directly clear how such a thing could inform flood risk assessments or reservoir building/operations, but we might just have to wait for Do et al. to be published.
[Line 487] “[…] assessing the effectiveness of flood control measures […]” Won’t measures (e.g. new infrastructure, temporary defenses) directly affect the spatial extent of the flood and thus make FIER maps invalid? Related to a similar general comment.
[Line 489-490] Has it been tested if FIER can produce (accurate) maps for return periods, especially on the higher end (e.g. 100 year), that it hasn’t seen in the data yet? Because it can be those return periods which are most relevant regarding climate change and planning.
[Line 507-508] This is a very good point indeed. It might already help readers if a simple comparison with those type of models and NWM is made (e.g. spatial and temporal resolution, time step, lead times).
[Line 510-512] It might be good to mention whether this would be directly possible with the current FIER framework, or whether it would require some adjustments.
[Line 542] “[…] event-based forecasting […]” Maybe just semantics, but isn’t FIER more of a continuous operational forecasting approach, since there is no event-based spin-up or calibration involved? (which actually might have more value for operational use than a purely event-based approach)
[Line 554-555] Same comments as for lines 348-349.
[Line 568] Great that more information is included in an appendix.
[Line 575] It would help readers if the implications of positive and negative values in the RSM maps were briefly explained.
[Line 586-587] “[…] high NSE values (0.61, 0.77, and 0.63) […]” Whether a NSE value is ‘high’ can be subjective, with different research(ers) giving different classifications on NSE scores. Some indeed state values above 0.5 as ‘good’, while others reserve this for values above 0.65 (which is a threshold two of the three stated values don’t reach) or even higher. As such, this sentence could perhaps do with more careful phrasing.
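For reference, since the classification thresholds mentioned here differ between studies, the standard definition of the Nash–Sutcliffe efficiency:

```latex
% Q_t^{obs}: observed streamflow, Q_t^{sim}: simulated streamflow,
% \bar{Q}^{obs}: mean of the observations over the evaluation period.
NSE = 1 - \frac{\sum_{t=1}^{T}\left(Q_t^{\mathrm{obs}} - Q_t^{\mathrm{sim}}\right)^{2}}
               {\sum_{t=1}^{T}\left(Q_t^{\mathrm{obs}} - \bar{Q}^{\mathrm{obs}}\right)^{2}}
```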
[Line 590, Figure A1] The scatter plots of RTPCs 01 and 02 seem to indicate a strong link between the fitted model and (a single?) high value(s) in the data. The curve of the fit also seems to suggest that this results in an RTPC value that would normally belong to much lower streamflow values. Am I reading the graphs correctly? This is not mentioned in the text at all. Has the sensitivity and influence of this been tested (perhaps in previous FIER research)? What would the implications of this be?
[Line 590, Figure A1] The scatter plots of the RTPCs seem to indicate that different fitting functions might be applicable to different RTPCs. For example, the data in RTPC-01 seems to show a linear-break pattern. Has this been tested (perhaps in previous FIER research)? What would the implications of this be?
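To illustrate the alternative suggested here, a sketch of fitting a two-segment (‘linear break’) relation; the data, breakpoint, and parameter names are synthetic and purely illustrative, not taken from the manuscript.

```python
import numpy as np
from scipy.optimize import curve_fit

def piecewise_linear(x, x0, y0, k1, k2):
    """Two linear segments with slopes k1 and k2, joined at (x0, y0)."""
    return np.where(x < x0, y0 + k1 * (x - x0), y0 + k2 * (x - x0))

# Synthetic stand-in for a streamflow-vs-RTPC scatter with a break.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 200))
y = piecewise_linear(x, 6.0, 1.0, 0.1, 1.5) + rng.normal(0, 0.1, x.size)

params, _ = curve_fit(piecewise_linear, x, y, p0=[5.0, 1.0, 0.5, 0.5])
print("breakpoint x0, y0, slope1, slope2:", params)
```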
[Line 596-615] This seems contradictory to the main text, where we’ve seen that smaller buffer sizes yield better results. That is not mentioned here, while it warrants at least a mention, or better yet a good explanation.
[Line 605-606] Could this be because streamflow for the study area is mainly driven by what comes in from upstream through the main river channel(s)? And could this relationship be (more) affected by the buffer size if there were more locally generated streamflow? It might help the reader to gauge this if more information on the hydrology of the study area were provided (see earlier comments on this, e.g. line 141).
[Line 630-631] Great that this is made publicly available.
Citation: https://doi.org/10.5194/egusphere-2024-3491-RC1