This work is distributed under the Creative Commons Attribution 4.0 License.
Novel extensions to the Fisher copula to model flood spatial dependence over North America
Abstract. Taking into account the spatial dependence of floods is essential for an accurate assessment of fluvial flood risk. We propose novel extensions to the Fisher copula to statistically model the spatial structure of observed historical flood records across North America. These include a machine-learning-based XGBoost model that exploits the information contained in 130 catchment-specific covariates to predict discharge Kendall's τ coefficients between pairs of gauged and ungauged catchments. A novel conditional simulation strategy is used to simulate coherent flooding at all catchments efficiently. After subdividing North America into 14 hydrological regions and 1.8 million catchments, applying our methodology yields synthetic flood event sets whose spatial dependence, magnitudes and frequencies resemble those of the historical events. The different components of the model are validated using several measures of dependence and extremal dependence to compare the observed and simulated events. The resulting event set is further analyzed and supports the conclusions of a reference paper on flood spatial modeling. We find a non-trivial relationship between the spatial extent of a flood event and its peak magnitude.
Status: open (extended)
RC1: 'Comment on egusphere-2024-442', Anonymous Referee #1, 24 Mar 2024
This study highlights the importance of the spatial dependence of floods for accurately evaluating the risk of riverine flooding. It proposes innovative enhancements to the Fisher copula for statistically capturing the spatial arrangement of historical flood events recorded across North America. These enhancements feature an XGBoost model, which leverages 130 catchment-specific covariates to forecast the Kendall's τ coefficients for discharge between gauged and ungauged catchment pairs. To efficiently simulate coherent flood events across all catchments, a novel conditional simulation approach is proposed.
Overall, the methods appear sound and reliable, and the originality of the research is beyond doubt. The analyses in this study are well organized and the results are reasonable. In addition, the presentation of this article is generally clear. It is a valuable study and within the scope of this journal. Therefore, I would recommend publication of this paper after the following issues are fully addressed.
- One of the innovations of this study is the use of a machine learning model to predict the Kendall’s τ instead of the original regression model. What are the advantages of choosing XGBoost for prediction instead of a regression model? Is there any comparison of results to provide evidence? Or is there any relevant material to illustrate this?
- Lines 207-208: “Since we work with stations spanning a large spatial scale, some pairs of stations present a negative Kendall’s τ (15 % of the pairs for region British Columbia). Those values are replaced by zero before Kendall’s τ inversion, since it requires the τ coefficients to be positive.” Some values of Kendall’s τ are negative; however, the authors have artificially converted them to zero. Does this practice affect the results of the later calculations? Please furnish more details.
- P.8 - The Fisher copula is adopted to model the spatial dependence of riverine floods. What are the advantages of adopting this method over others in this case, such as regular copulas or vine copulas? Please add detailed elaboration.
- P.11 - The Generalized Extreme Value (GEV) distribution is used as the marginal distribution of annual maximum discharge in all catchments. Why was the GEV distribution chosen directly? Was it selected after comparison with other distribution functions?
- P.5 - This study omits analysis of regions 1, 2, 3, 4, 13 and 14. The methodology and results are presented for region 9 (British Columbia), and the results for regions 7 (St Lawrence), 8 (Prairie), 10 (East Coast) and 11 (Midwest) can be found in the supplementary information. What is the reason for selecting regions 7, 8, 9, 10 and 11 from these 14 regions for analysis? Is there anything unique about these regions? Why are the other regions omitted? And what about regions 5, 6, and 12? They are not mentioned in the paper.
- Line 215: “In this way, the correlation matrix Σ is extended to include all catchments, and the new parameters are used to simulate discharge at all catchments.” The tense of this sentence needs to be modified. Please consider changing to the past tense.
- Line 447-448: “Finally, compared to more parameterized models like neural networks, boosting models like XGBoost are much faster to train and yield satisfying results.” Is this conclusion derived from the comparison of the results calculated in this study? Or is it a regular characterization of XGBoost derived from other studies? Please provide some explanation to support this conclusion.
- Please adjust the font size in the images to make it larger and clearer, such as Figure 5, Figure 6, Figure 11 and so on.
Citation: https://doi.org/10.5194/egusphere-2024-442-RC1
AC1: 'Reply on RC1', Duy Anh Alexandre, 15 Apr 2024
Responses to referee #1
One of the innovations of this study is the use of a machine learning model to predict the Kendall’s τ instead of the original regression model. What are the advantages of choosing XGBoost for prediction instead of a regression model? Is there any comparison of results to provide evidence? Or is there any relevant material to illustrate this?
XGBoost differs from linear models such as simple regressions because it allows for nonlinearity by using decision trees as the base learner. This is particularly suitable for predictions involving hydrological processes, which are highly nonlinear. It is also an ensemble learner (combining many weak learners), which usually outperforms a single more complicated model. For example, for region 9 the cross-validated Kendall’s τ RMSE is 0.079 for our XGBoost model and 0.171 for a linear regression model. To further justify our choice of model, a recent study shows that among state-of-the-art machine learning models, tree-based models still outperform more complex deep learning models on tabular data [1].
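As a minimal illustration of why a boosted-tree model can beat a linear regression on this kind of task, here is a self-contained sketch on synthetic pairwise data. The covariates, sizes and target function are all invented, and scikit-learn's GradientBoostingRegressor stands in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the pairwise covariate table: each row describes
# a catchment pair, the target is that pair's Kendall's tau.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 10 covariates (the paper uses 130)
# Nonlinear ground truth (interaction + absolute value) that a linear
# model cannot represent but trees can approximate.
y = np.clip(0.5 * np.tanh(X[:, 0] * X[:, 1]) + 0.3 * np.abs(X[:, 2]), 0, 1)

boost = GradientBoostingRegressor(random_state=0)
linear = LinearRegression()

def cv_rmse(model):
    """5-fold cross-validated RMSE."""
    neg_mse = cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error")
    return float(np.sqrt(-neg_mse.mean()))

print(f"boosting RMSE: {cv_rmse(boost):.3f}, "
      f"linear RMSE: {cv_rmse(linear):.3f}")
```

On this toy target the boosted trees achieve a markedly lower cross-validated RMSE than the linear model, mirroring the gap reported above for region 9.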
Line 207-208: “Since we work with stations spanning a large spatial scale, some pairs of stations present a negative Kendall’s τ (15 % of the pairs for region British Columbia). Those values are replaced by zero before Kendall’s τ inversion, since it requires the τ coefficients to be positive.” The value of Kendall’s τ is supposed to be negative, however, the authors have artificially converted it to zero. Does this practice have an effect on the results of the later calculations? More details should be furnished please.
The inference of the Fisher copula parameters relies on a 1-to-1 relationship linking the empirical Kendall’s τ between pairs of observations to the corresponding entry of the correlation matrix Σ. However, this relationship is only monotonic for Kendall’s τ in the range from 0 to 1, as illustrated in figure 1 of Favre et al. (2018) [2]. Therefore, negative observed Kendall’s τ values are replaced by zero before Kendall’s τ inversion. For these pairs of stations, this is equivalent to assuming that there is in reality no discharge correlation and that the (small) observed negative correlation is spurious. In region 9, 15 % of the pairs present negative Kendall’s τ, but only 5 % have Kendall’s τ lower than -0.2. For the other regions, this percentage is even smaller.
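The clipping step can be sketched as follows. This is only an illustration on synthetic discharge records; the Gaussian-copula relation ρ = sin(πτ/2) stands in for the Fisher copula's own (numerically inverted) τ-to-ρ mapping:

```python
import numpy as np
from scipy.stats import kendalltau

# Toy discharge records at 4 stations (rows = years); in practice these
# are observed annual maxima at gauged stations.
rng = np.random.default_rng(1)
q = rng.multivariate_normal(np.zeros(4),
                            [[1.0, 0.6, 0.3, -0.1],
                             [0.6, 1.0, 0.4, 0.0],
                             [0.3, 0.4, 1.0, 0.2],
                             [-0.1, 0.0, 0.2, 1.0]], size=60)

n = q.shape[1]
tau = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        t, _ = kendalltau(q[:, i], q[:, j])
        tau[i, j] = tau[j, i] = max(t, 0.0)  # clip negative tau to zero

# Illustrative inversion: Gaussian-copula relation rho = sin(pi*tau/2),
# a stand-in for the Fisher copula's 1-to-1 tau-to-rho relationship.
sigma = np.sin(np.pi * tau / 2)
```

After clipping, every entry fed into the inversion is non-negative, which is exactly the condition the τ-to-ρ mapping requires.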
This modification does result in a correlation matrix Σ that is not positive definite, requiring an adjustment to restore positive definiteness, following the idea of Higham (2002). We assessed the deformation of Σ by comparing each of its entries before and after the adjustment. This is shown in figure 1 below for region 9.
We see that the overall effects of transforming negative Kendall’s τ to zero on Σ are:
- a slight dampening of the correlations above 0.4
- a larger distortion of the zero entries, adjusted to values in the range [-0.2, 0.2]
Overall, the mean absolute difference between the adjusted entries and the original entries is 0.062, which is deemed acceptable for our modeling purposes. As demonstrated in the manuscript, the simulated events are able to reproduce the spatial patterns and upper tail dependence of the observed events.
In conclusion, replacing negative observed Kendall’s τ values by zero does slightly modify the corresponding Fisher copula correlation matrix, but this is unlikely to have a negative impact on the ability of the simulated events to reproduce the characteristics of observed events.
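A minimal sketch of the positive-definiteness repair, in the spirit of Higham (2002): a single spectral projection onto the positive semidefinite cone followed by rescaling to a unit diagonal. The full algorithm alternates these projections until convergence; the example matrix is invented:

```python
import numpy as np

def nearest_correlation(A, eps=1e-8):
    """One pass of spectral projection toward the nearest correlation
    matrix: clip negative eigenvalues, then restore the unit diagonal.
    Higham (2002) iterates such projections; a single pass is often
    adequate for mildly indefinite matrices."""
    B = (A + A.T) / 2
    w, V = np.linalg.eigh(B)
    B = V @ np.diag(np.clip(w, eps, None)) @ V.T  # project onto PSD cone
    d = np.sqrt(np.diag(B))
    return B / np.outer(d, d)                     # unit diagonal again

# A mildly indefinite "correlation" matrix, as can arise after editing
# individual entries (e.g. clipping negative Kendall's tau values).
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
Sigma = nearest_correlation(A)
```

The rescaling step is a congruence transform, so it preserves positive semidefiniteness while restoring ones on the diagonal.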
P.8 - The Fisher copula is adopted to model the spatial dependence of riverine floods. What are the advantages of adopting this method over others in this case, such as regular copulas or vine copulas? Please add detailed elaboration.
The Fisher copula is suitable for modeling spatial dependence in the right tail of random variables because:
- It is non-symmetrical, allowing for asymmetry in the dependence patterns of the lower and upper tails (unlike the Gaussian or Student-t copula)
- It allows modeling positive upper tail dependence (unlike the Gaussian copula)
- The specification of pairwise dependence strength through a correlation matrix allows spatial interpolation to model ungauged catchments
- It has a fairly low number of parameters, which makes the model setup and inference much simpler than for a vine copula, for example. This is particularly important for operational reasons, as we are applying the model to thousands of catchments simultaneously.
Table 2 from [3] compares and summarizes the different characteristics of various copula models, including the Fisher copula.
P.11 - The Generalized Extreme Value (GEV) distribution is used as the marginal distribution of annual maximum discharge in all catchments. Why was the GEV distribution chosen directly? Was it selected after comparison with other distribution functions?
Historically, several different distributions have regularly been used to model annual maxima in climate and hydrological studies, including the Gumbel, GEV, Gamma and Log-Pearson type III distributions [4]. However, the GEV distribution is a common choice for modeling block extremes, with some studies showing its better performance and goodness-of-fit compared to the Gumbel, Log-Pearson III or Log-normal distributions. For river discharge, examples of such comparative studies are [5-7]. Moreover, the GEV distribution arises as the asymptotic limit distribution for block maxima, a result grounded in extreme value theory [8]. This gives better confidence in the ability of the GEV distribution to extrapolate values in the upper tail, where observations are typically rare or absent. Due to its flexible parameterization, the GEV distribution is able to capture a wide variety of right-tail behaviours (bounded, exponential decay or heavy-tailed), and we believe that its theoretical support and wide use in the hydrological community make it a suitable choice in our study.
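As a minimal illustration of fitting a GEV margin to an annual-maximum series (the data here are synthetic; note that scipy's shape parameter c is the negative of the hydrological convention ξ, so c < 0 corresponds to a heavy upper tail):

```python
import numpy as np
from scipy.stats import genextreme

# Synthetic annual-maximum discharge series (m^3/s) standing in for one
# catchment's observed record.
rng = np.random.default_rng(2)
annmax = genextreme.rvs(-0.1, loc=300, scale=80, size=60, random_state=rng)

# Fit the GEV by maximum likelihood.
c, loc, scale = genextreme.fit(annmax)

# 100-year return level: the 0.99 quantile of the fitted distribution.
q100 = genextreme.ppf(0.99, c, loc=loc, scale=scale)
print(f"shape c={c:.3f}, loc={loc:.1f}, scale={scale:.1f}, "
      f"100-yr level={q100:.1f}")
```

The fitted quantile function is what allows extrapolation beyond the observed record length, which is the main practical benefit of the GEV's theoretical grounding.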
P.5 - This study omits analysis of regions 1, 2, 3, 4, 13 and 14. The methodology and results are presented for region 9 (British Columbia), and the results for regions 7 (St Lawrence), 8 (Prairie), 10 (East Coast) and 11 (Midwest) can be found in the supplementary information. What is the reason for selecting regions 7, 8, 9, 10 and 11 from these 14 regions for analysis? Is there anything unique about these regions? Why are the other regions omitted? And what about regions 5, 6, and 12? They are not mentioned in the paper.
The North American continent was divided into 14 regions following the level 2 HydroBASINS product delimitation. Among those, results for regions 7 to 11 were presented in our work (region 9 in the manuscript and regions 7, 8, 10 and 11 in the supplementary information). The study’s main objectives were to present and validate the methodology developed, contributing to enhanced modeling of flood spatial dependence in North America. Therefore, the results were presented for a subset of regions as a way to validate the methodology under a variety of hydrological and climatological conditions and were not meant to be exhaustive. Regions 7 to 11 were prioritized owing to their higher population and gauge densities, which resulted in more accurate results. Besides, from an operational perspective for Geosapiens, these regions are more important because they cover all of the major urban centers in Canada, where our derived product will first be deployed. Results are nevertheless also available for regions 1 to 6, although their quality is lower, and some minor methodology changes are required to account for the absence of high-quality discharge data in these northern territories. We omitted analysis of regions 12 to 14 because:
- From an operational perspective, they are the regions with no overlap with the Canadian territory.
- They have considerable overlap with the Mexican territory, where quality data gathering is not currently undertaken.
Line 215: “In this way, the correlation matrix Σ is extended to include all catchments, and the new parameters are used to simulate discharge at all catchments.” The tense of this sentence needs to be modified. Please consider changing to the past tense.
This is modified in the manuscript.
Line 447-448: “Finally, compared to more parameterized models like neural networks, boosting models like XGBoost are much faster to train and yield satisfying results.” Is this conclusion derived from the comparison of the results calculated in this study? Or is it a regular characterization of XGBoost derived from other studies? Please provide some explanation to support this conclusion.
Neural network models were not tested in our study. Following the principle of parsimony, we first tested less parameterized models, namely Bayesian ridge regression (not presented) and boosting models. Since the predictive power of XGBoost was deemed satisfying, we did not see the need to test other, more heavily parameterized models. Besides, it is commonly known that tree-based models are faster to train than heavily parameterized neural network models. This can be seen, for example, in table 10.1 of [9], which compares characteristics of different machine learning methods and shows that decision trees are computationally more scalable than neural nets. A recent study [1] also finds that boosting models outperform neural nets on medium-sized tabular datasets, notably because they are more robust to uninformative features and less sensitive to the orientation of the data, while maintaining superior computational speed.
Please adjust the font size in the images to make it larger and clearer, such as Figure 5, Figure 6, Figure 11 and so on.
This is adjusted for figures 1, 5, 6, 11 and 12 in the manuscript.
References
[1] Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data? https://doi.org/10.48550/arXiv.2207.08815
[2] Favre, A.-C., Quessy, J.-F., and Toupin, M.-H.: The new family of Fisher copulas to model upper tail dependence and radial asymmetry: Properties and application to high-dimensional rainfall data, Environmetrics, 29, e2494, https://doi.org/10.1002/env.2494, 2018.
[3] Brunner, M. I., Furrer, R., and Favre, A.-C.: Modeling the spatial dependence of floods using the Fisher copula, Hydrology and Earth System Sciences, 23, 107–124, https://doi.org/10.5194/hess-23-107-2019, 2019.
[4] Nerantzaki S. D., Papalexiou S. M., Assessing extremes in hydroclimatology: A review on probabilistic methods, Journal of Hydrology, Volume 605, 2022, 127302, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2021.127302.
[5] Haktanir, T., Horlacher, H.B., 1993. Evaluation of various distributions for flood frequency analysis. Hydrol. Sci. J. 38, 15–32. https://doi.org/10.1080/02626669309492637.
[6] Moisello, U., 2007. On the use of partial probability weighted moments in the analysis of hydrological extremes. Hydrol. Process. 21, 1265–1279. https://doi.org/10.1002/hyp.6310.
[7] Gubareva, T.S., Gartsman, B.I., 2010. Estimating distribution parameters of extreme hydrometeorological characteristics by L-moments method. Water Resour. 37, 437–445. https://doi.org/10.1134/S0097807810040020.
[8] Coles, S., Bawa, J., Trenner, L., and Dorazio, P.: An introduction to statistical modeling of extreme values, vol. 208, Springer, 2001.
[9] Hastie, T., Tibshirani, R., and Friedman, J.: The Elements of Statistical Learning, Springer, 2009.
Citation: https://doi.org/10.5194/egusphere-2024-442-AC1
RC2: 'Reply on AC1', Anonymous Referee #1, 24 Apr 2024
The concerns we raised have all been properly addressed. I am now confident that this revised manuscript can be accepted for publication.
Citation: https://doi.org/10.5194/egusphere-2024-442-RC2