Landslide initiation thresholds in data-sparse regions: Application to landslide early warning criteria in Sitka, Alaska, USA
Abstract. Probabilistic models to inform landslide early warning systems often rely on rainfall totals observed during past events with landslides. However, these models are generally developed for broad regions using large catalogs, with dozens, hundreds, or even thousands of landslide occurrences. This study evaluates strategies for training landslide forecasting models with a scanty record of landslide-triggering events, which is a typical limitation in remote, sparsely populated regions. We train and evaluate 136 statistical models using a rainfall dataset that contains five landslide-triggering rainfall events recorded near Sitka, Alaska, USA, as well as >6,000 days of non-triggering rainfall (2002–2020). We use Akaike, Bayesian, and leave-one-out information criteria to compare models trained on cumulative precipitation at timescales ranging from 1 hour to 2 weeks, using both frequentist and Bayesian methods to estimate the daily probability and intensity of potential landslide occurrence (logistic regression and Poisson regression). We evaluate the best-fit models using leave-one-out validation as well as by testing on a withheld subset of the data. Despite this sparse landslide inventory, we find that probabilistic models can effectively distinguish days with landslides from days without. Although frequentist and Bayesian inference produce similar estimates of landslide hazard, they have different implications for use and interpretation: frequentist models are familiar and easy to implement, but Bayesian models capture the rare-events problem more explicitly and allow for better understanding of parameter uncertainty given the available data. Three-hour precipitation totals are the best predictor of elevated landslide hazard, and adding antecedent precipitation (days to weeks) does not improve model performance. This relatively short timescale, combined with the limited role of antecedent conditions, reflects the rapid drainage of porous colluvial soils on the very steep hillslopes around Sitka. We use the resulting estimates of daily landslide probability to establish two decision boundaries that define three levels of warning. With these decision boundaries, the frequentist logistic regression model is applied to National Weather Service quantitative precipitation forecasts in a real-time landslide early warning “dashboard” system (sitkalandslide.org). This dashboard provides accessible and data-driven situational awareness for community members and emergency managers.
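As a concrete illustration of the frequentist approach described in the abstract, the sketch below fits a logistic regression that maps a daily peak 3-hour precipitation total to a daily landslide probability. The data, variable names, and precipitation values are hypothetical stand-ins, not the Sitka record or the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data (NOT the Sitka record): peak 3-hour precipitation
# totals (mm) for ~6,000 non-triggering days and 5 landslide-triggering days.
rng = np.random.default_rng(42)
precip_3h = np.concatenate([
    rng.gamma(shape=1.5, scale=8.0, size=6000),       # non-triggering days
    np.array([55.0, 62.0, 48.0, 70.0, 58.0]),         # triggering days
])
landslide = np.concatenate([np.zeros(6000), np.ones(5)])

X = precip_3h.reshape(-1, 1)
model = LogisticRegression().fit(X, landslide)

# Daily probability of landslide occurrence implied by a forecast 3-h total of 40 mm
p = model.predict_proba([[40.0]])[0, 1]
print(f"estimated daily landslide probability at 40 mm / 3 h: {p:.4f}")
```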
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2023-25', Anonymous Referee #1, 07 Apr 2023
- AC1: 'Reply on RC1', Annette Patton, 31 May 2023
- RC2: 'Comment on egusphere-2023-25', Anonymous Referee #2, 18 Apr 2023
General comments
This study deals with the relevant problem of establishing an early-warning system at a small regional scale with few observations. This problem is approached by testing different models with different inputs and evaluating their robustness. I think the study will be interesting to early-warning system developers, as this system has been implemented in practice and therefore had to tackle many practical problems, from how to establish thresholds to how to issue warning levels.
Generally, the manuscript is well written and organized. My main criticism is that the comparison/evaluation/validation is at times confusing and not as streamlined as other parts of the manuscript. Although I very much appreciate that much effort is put into the validation, I think section 2.3 (and maybe 2.4) needs some justification for why so many different approaches are being taken. These should maybe also be presented more clearly later. You may have good reasons for choosing many different validation methods (leave-one-out for events, train/test split, different information criteria, etc.), but it’s not clear to me from the text and I think it will be confusing to readers. For example, there is leave-one-out for landslide events and a train/test split. Wouldn’t it be simpler, and of similar added value, to do leave-one-out with the recorded years (e.g. train with 2002-2018 and test with 2019)? Or why do you need the Brier skill score? Can’t you use the same skill score as for the other models and compare these? Anyway, I don’t think you go into details with the results of this part, so it could be cut. To systematically compare the predictive power, you could also compute the area under the curve and assess whether a model is better than random, as is commonly done to assess predictive model skill.
I think if these issues could be solved, the manuscript will be more accessible to readers and that the authors are in a good position to solve this. Please find more specific comments below.
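As a point of reference for the ROC AUC and Brier skill score suggested above, a minimal sketch is given below. The arrays are a toy illustration, not the study's data; the Brier skill score is computed here relative to a climatological (base-rate) reference forecast.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def brier_skill_score(y_true, p_hat):
    """Brier skill score relative to a climatological (base-rate) forecast."""
    bs = brier_score_loss(y_true, p_hat)
    bs_ref = brier_score_loss(y_true, np.full_like(p_hat, y_true.mean()))
    return 1.0 - bs / bs_ref   # > 0 means more skilful than always forecasting the base rate

# Toy illustration: ten days, one landslide day, with predicted daily probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
p_hat = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.05, 0.01, 0.02, 0.10, 0.60])

print("ROC AUC:", roc_auc_score(y_true, p_hat))            # 0.5 = no better than random
print("Brier skill score:", brier_skill_score(y_true, p_hat))
```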
Specific comments
L17: Please specify «136 statistical models». When only reading the abstract I cannot imagine what that means. How do models differ?
L18: does that mean that your data is in daily resolution? If not, I would count the number of non-triggering rainfall events instead of the days.
L23-25: seems more like a conclusion to me. I would state this later in the abstract.
L34-72: the intro nicely shows that there is a need for a LEWS in this region but that it’s difficult to establish with current methods
L53-54: I think this sentence should end with a citation
L153: general question: what is the added value of the number of landslides if you don’t know where it’s going to happen?
L217: So 1-day antecedent precip is the total daily rainfall at the day of the landslide?
L230: Please give a citation for these equations
L235: Please integrate this sentence in another paragraph where you discuss this problem
L247: more intuitive than what?
L271: What is a chain?
L276-279: how are these criteria defined?
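For reference, the standard definitions of the information criteria named in the abstract are given below, with k fitted parameters, n observations, maximized likelihood L-hat, and leave-one-out predictive densities p(y_i | y_{-i}); the exact way the manuscript computes them (particularly the Bayesian leave-one-out criterion) should be taken from its methods section.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}, \qquad
\mathrm{elpd}_{\mathrm{LOO}} = \sum_{i=1}^{n} \ln p\!\left(y_i \mid y_{-i}\right)
```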
L275-289: why was not the same validation performed on each model?
L332-335: I agree about the inadequacy of accuracy in imbalanced datasets, but ROC curves show exactly true alarms and false alarms in relative terms.
Fig. 3&4: please make sure these figures have the same layout (axis limits, labels, font size, order of plotting lines, etc.) (same for fig. 5&6). As these figures look very similar, you could consider e.g. only showing figs 3&5 here and move the others to the supplement.
Fig. 7/8: blue stands for better and red for worse. Better and worse compared to what?
Fig 11: Generally a very nice figure. Some comments:
- Maybe the color scheme for the landslide probability could be optimized, e.g. change in color where you set your threshold (at values mentioned in L508)
- Why is the lower threshold not slightly higher where recall is still 1?
- When following this line from left to right, are you sure you can increase recall and precision at the same time? This is a univariate model, right? I can’t think of how this would happen. Since you have 5 events, shouldn’t the steps be in intervals of 0.2 for precision? (See the sketch after this list.)
- Caption: I would simply refer to the equations for definitions of recall and precision, but there you could mention the alternative names (e.g. recall = true positive rate, precision = positive predictive value)
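Regarding the stepped behaviour of the precision-recall curve with only five positive days, the threshold-sweep sketch below uses synthetic probabilities (not the study's output) to show that recall can only take values that are multiples of 0.2, while precision varies much more finely as the decision threshold changes.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic daily probabilities (NOT the study's output): ~6,000 non-triggering
# days and 5 triggering days, the latter tending to receive higher probabilities.
rng = np.random.default_rng(0)
p_hat = np.concatenate([
    rng.uniform(0.0, 0.3, 6000),   # non-triggering days
    rng.uniform(0.2, 0.9, 5),      # triggering days
])
y = np.concatenate([np.zeros(6000), np.ones(5)])

precision, recall, thresholds = precision_recall_curve(y, p_hat)

# With only five positive days, recall = TP / 5 can only equal
# 0, 0.2, 0.4, 0.6, 0.8, or 1.0, so the curve advances in discrete steps.
for t, p, r in zip(thresholds[::1000], precision[::1000], recall[::1000]):
    print(f"threshold {t:.2f}: precision {p:.3f}, recall {r:.1f}")
```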
L579: yes, compared to some shallow landslide thresholds this is rather low. Could it also be because some of the events are debris flows? The most predictive thresholds for runoff-triggered debris flows can be at the 10-min timescale.
L610: please specify “hydrologic monitoring”. In this context, I assume soil moisture measurements.
L619: I would say with “few landslide events” instead of “without”. I don’t think you investigated threshold determination without triggering events.
L620-L625: you are of course right that negative events should be considered and in practice it may still be done only occasionally. However, this has been well-known for a while. The first ones I can think of are Staley et al. (2013, https://doi.org/10.1016/j.geomorph.2016.10.019) and Gariano et al. (2015, https://doi.org/10.1007/s11069-019-03830-x) and since then many others have adopted this procedure, some of them you cite earlier.
L639-644: isn’t this contradicting the earlier statement in L623-625 about the value of low precipitation totals? By using precision and recall you get rid of exactly these.
Citation: https://doi.org/10.5194/egusphere-2023-25-RC2
- AC2: 'Reply on RC2', Annette Patton, 31 May 2023
Annette I. Patton
Joshua J. Roering
Aaron Jacobs
Oliver Korup
Benjamin B. Mirus