the Creative Commons Attribution 4.0 License.
Landslide susceptibility mapping with explainable AI techniques: Evidence from Bavaria, Germany
Abstract. Landslides threaten infrastructure, ecosystems, and human safety, particularly in mountainous regions. Climate change with increasingly intense rainfall, together with growing populations and assets in hazard-prone areas, increases the need for accurate and interpretable landslide susceptibility assessments. This study presents a region-wide landslide susceptibility map modeled for the whole of Bavaria, Germany, based on more than 11,000 recorded landslide events. Using slope units, which are terrain-based spatial mapping entities following natural drainage lines and ridges, the model captures landslide-prone areas in a more terrain-consistent manner than traditional grid-based approaches. To generate the landslide susceptibility map, we employ a dense neural network architecture. The model is trained on the landslide inventory and a wide range of landslide-influencing factors derived from high-resolution topographic, geological, and land cover data and achieves strong predictive performance (ROC AUC = 0.953, PR AUC = 0.844). Model interpretability is approached using the SHapley Additive exPlanations (SHAP) framework, which provides both global and local insights into the factors influencing landslide susceptibility, revealing a strong predictive influence of geology, soil properties and terrain heterogeneity. The resulting susceptibility map is compared with an existing map, which is based on manual assessments, and shows good performance, particularly for deep-seated landslides. However, evaluation using newly recorded landslides reveals limitations in the model's generalizability. Many newly recorded events occur in regions that were underrepresented in the original inventory and are therefore wrongly assigned low susceptibility values. This demonstrates how spatial incompleteness and selection bias in landslide inventories directly propagate into susceptibility maps, leading to systematic underestimation of hazard.
Overall, this study highlights that while explainable machine learning enables robust and more interpretable regional susceptibility mapping, the quality and spatial completeness of landslide inventories are critical for reliable hazard assessment and mitigation.
Status: open (until 20 May 2026)
-
RC1: 'Comment on egusphere-2026-1647', Anonymous Referee #1, 20 Apr 2026
-
AC1: 'Reply on RC1', Veronika Buchauer, 13 May 2026
We sincerely thank the reviewer for the very positive and constructive assessment of our manuscript. We particularly appreciate the recognition of the study’s large data base and clear result presentation, as well as the helpful suggestions for further clarification. Please find our comment-by-comment answers to your comments in the attached document.
-
RC2: 'Comment on egusphere-2026-1647', Anonymous Referee #2, 29 Apr 2026
Dear authors and editor,
The article is well written, the research question is clear, the methodology is appropriate, the results are well presented, and the discussion is thorough. The manuscript is fit for publication after minor changes: there are no significant flaws in the presented research, only “food for thought” questions and suggestions, listed below, which may help improve the manuscript.
I have a few technical suggestions:
- Figures that show maps should have a north arrow.
- I could not find the size of the study area, i.e. its extent in square kilometres.
- It would be helpful to state which seven land cover classes make up the “land cover” variable.
- It would be beneficial to add a geological map, e.g. alongside figure 2: make it 2b and keep the current figure 2 as 2a.
- If possible, quantify the information about the landslide inventory, which is key to this article, e.g. the data in lines 120-123. Was some of the mapping done using HR LiDAR derivatives? What share of the inventory was field surveyed and verified? What time spans are covered? For an article focusing on the inventory, and considering the >11 000 landslides, more numbers should be given in my opinion.
A few questions for possible discussion; consider addressing some of them in the manuscript (and replying to the rest in the discussion):
- You mention 84 input features (LCFs) for the modelling. This seems unusually many. What was the motivation, and were any collinearity tests done beforehand or during the modelling?
- Why did you decide on the ANN method “only”? Were there preliminary tests or similar cases that made you prioritize this method alone? Another (quite different) method might show some of the presented results differently.
- How were the exemplary slope units for figure 8 selected?
- Did you consider using the LfU polygons as an LCF (i.e. a categorical LCF: present deep, new deep, etc., as presented in figure 9, left)?
- Could your model point out the locations where more inventory adjustments, corrections, or upgrades are needed? Can you elaborate on the uncertainty aspect of the modelling and why it was not quantified?
- It would be interesting to see spatial K-fold validation in this type of modelling; could you elaborate on the topic?
- You mention that stable slope units are in fact unknown, i.e. possibly stable or unstable, which is correct (they are not verified as negatives). If you could declare some slope units as 100% stable, would you prioritize them in the modelling? Has this been done in previous research and, if so, what results were reported?
- Your main conclusion is stated in lines 615-618. Can your research propose some novelty on this topic? If so, specify some ideas for future work and perspectives to mitigate the issue of the “unseen locations”.
Kind regards
Citation: https://doi.org/10.5194/egusphere-2026-1647-RC2
AC2: 'Reply on RC2', Veronika Buchauer, 13 May 2026
We would like to thank the reviewer for the careful and constructive evaluation of our manuscript. We are pleased that the research question, methodology, and results section were found to be sound. We took the technical suggestions seriously, as we believe they add meaningful value to the work. Please find our comment-by-comment responses to your comments in the attached document.
-
RC3: 'Comment on egusphere-2026-1647', Anonymous Referee #3, 13 May 2026
I have read your manuscript in great detail. I have the following major comments:
-
I see that the flat areas are not masked out in the slope units; is there a reason for that? In Figure 9, the susceptibility map used for comparison has a masked region. This is important because your model shows great variation in susceptibility overall, but I am not sure whether it shows similar variation within the mountainous regions alone (e.g. Figure 4). If all the mountainous regions have a susceptibility above 0.7, the analysis would not be robust. Therefore, I recommend masking out the flat regions.
-
The random train–test split likely introduces spatial leakage, so spatial cross-validation or blocked validation should be implemented to provide more realistic performance estimates.
-
The modeling framework incorrectly treats all non-landslide slope units as true negatives. You remove 11k of the 317k slope units and justify this as accounting for class imbalance, but does that really solve the strong class imbalance of 2.6% positives? By my calculation it only raises the share to 2.7%. I would expect a better way to handle such a strong class imbalance.
-
The ANN architecture is insufficiently justified and should be benchmarked against simpler baseline models such as logistic regression, random forests, or XGBoost and compared in greater detail.
-
I find it hard to see intuitively what the novelty of the work is. I know it concerns the role of the inventory, but this should be discussed in greater detail and the story must be made more coherent. At present the paper reads more like a new AI model for landslide susceptibility (which is not the case, as such models have been used widely). I recommend framing the story along the lines of this paper: https://link.springer.com/article/10.1007/s10064-005-0023-0
-
The threshold selection and evaluation framework should be strengthened using more rigorous imbalance-aware and calibration-sensitive performance metrics.
- The figure-1 workflow should be removed in my opinion. Steps such as "transform python code to R" are not actual methods but mere technical details, and a flowchart this long and detailed does not describe or justify the choice of method (which is done in the text). Therefore I recommend removing or simplifying this figure.
- In general, all maps should follow cartographic standards, which are missing throughout (e.g. coordinates, a north arrow, and consistent capitalization: in labels such as "Digital Elevation Model" and "shaded relief", either capitalize everything or use lowercase throughout).
-
The manuscript should discuss temporal inconsistencies between landslide inventories and environmental predictor datasets more explicitly.
- The study should ensure reproducibility by providing complete implementation details, hyperparameter ranges, code availability, and workflow documentation.
- I would remove the challenges in slope unit segmentation from the discussion, as the main focus of your manuscript is not segmentation but the impact of inventory incompleteness. Structuring the manuscript along that line, with one clear message, would give it greater scientific impact.
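The class-imbalance figures quoted in the comment on negatives above can be checked with a short calculation. This is a minimal sketch, assuming the rounded values from that comment (317,000 slope units, 11,000 removed, 2.6% positives) and assuming that every removed slope unit is a negative; the variable names are illustrative.

```python
# Rounded figures as quoted in the comment; exact counts may differ.
total_units = 317_000     # slope units before removal
removed_units = 11_000    # slope units removed (assumed here to be negatives only)
positive_share = 0.026    # share of positive slope units before removal

# The number of positives is unchanged; only the denominator shrinks.
positives = total_units * positive_share
share_after = positives / (total_units - removed_units)

print(f"positive share: {positive_share:.1%} -> {share_after:.1%}")
```

Under these assumptions the positive share rises by only about a tenth of a percentage point, so the removal barely changes the imbalance.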
Citation: https://doi.org/10.5194/egusphere-2026-1647-RC3
-
CC1: 'Comment on egusphere-2026-1647', Oliver Wigmore, 19 May 2026
This is an interesting and ambitious study that addresses an important applied problem. The slope unit framework is well implemented, and the dataset is impressively large. However, I believe there are several important methodological issues that warrant further consideration.
1) Performance metrics should be interpreted carefully
The ROC-AUC of 0.953 and PR-AUC of 0.844 are calculated on a random holdout, which does not account for the highly uneven spatial distribution of the inventory. Because positive cases are heavily concentrated in a few regions, the test set inevitably over-represents the same terrain already seen in training. These metrics therefore reflect within-inventory discrimination rather than generalisation across Bavaria, and consequently could be optimistic. A blocked spatial cross validation (e.g. holding out entire districts or grids) would be more meaningful. The “out-of-distribution” validation in Section 3.4 is effectively the only real test of generalisation, and the authors highlight poor performance here.
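A blocked spatial cross-validation of the kind suggested above could be sketched as follows. This is an illustration only, not the authors' workflow: the function names, the 10 km block size, and the synthetic centroid coordinates are assumptions, and holding out administrative districts instead of grid blocks would follow the same pattern.

```python
import random
from collections import defaultdict

def block_id(x, y, block_size=10_000.0):
    """Map a coordinate (in metres) to the grid block that contains it."""
    return (int(x // block_size), int(y // block_size))

def spatial_block_folds(coords, n_folds=5, block_size=10_000.0, seed=0):
    """Yield (train_idx, test_idx) pairs in which no grid block is split."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size)].append(i)
    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)  # assign blocks to folds at random
    for k in range(n_folds):
        test_blocks = set(block_keys[k::n_folds])
        test_idx = [i for b in block_keys if b in test_blocks for i in blocks[b]]
        train_idx = [i for b in block_keys if b not in test_blocks for i in blocks[b]]
        yield train_idx, test_idx

# Synthetic slope-unit centroids, purely for illustration.
rng = random.Random(42)
coords = [(rng.uniform(0, 100_000), rng.uniform(0, 100_000)) for _ in range(1_000)]
folds = list(spatial_block_folds(coords))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each test fold contains whole blocks only, so slope units from the same block never appear on both sides of a split. The block size would need to be chosen relative to the spatial clustering of the inventory.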
2) Feature set size and model complexity
84 predictors in a six-layer ANN is a lot with only ~8,000 positive cases (2.6% of ~300,000 slope units). Many of these predictors are highly intercorrelated (especially the topographic variables) and some are potentially redundant. While ANNs are generally less sensitive to multicollinearity, it still increases the dimensionality of the model and can make generalisation/transferability harder to demonstrate without sensitivity tests. It is unclear how much of the poor “out-of-distribution” performance reflects inventory bias versus model overspecification. The discussion attributes the generalisation failure primarily to inventory incompleteness, but the model architecture may also be contributing. A feature reduction or ablation study would help to disentangle these.
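A first screening step before any feature reduction could be a pairwise correlation check. The sketch below is hypothetical: the feature names, the synthetic values, and the 0.9 threshold are assumptions for illustration, not taken from the manuscript.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    names = sorted(features)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            r = pearson(features[name_a], features[name_b])
            if abs(r) >= threshold:
                pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Synthetic predictors: slope_max is built to track slope_mean closely,
# clay_content is drawn independently.
rng = random.Random(0)
slope_mean = [rng.uniform(0, 40) for _ in range(500)]
features = {
    "slope_mean": slope_mean,
    "slope_max": [1.5 * s + rng.gauss(0, 2) for s in slope_mean],
    "clay_content": [rng.uniform(0, 60) for _ in range(500)],
}
pairs = correlated_pairs(features)
print(pairs)
```

Pearson correlation only captures linear association between numeric predictors; categorical variables such as lithology would need a different measure.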
3) Negative sampling strategy
Slope units without a recorded landslide are treated as negatives across all of Bavaria, regardless of whether those areas have ever been systematically surveyed. For an incomplete inventory this is a recognised issue, because it converts “unknowns” into negative class labels, which the model then learns. The resulting underestimation in districts such as Erding and Landshut (on the updated inventory) is consistent with this. The authors identify and acknowledge this issue, but it would strengthen the practical value of the Bavaria‑wide map to treat it as a design constraint rather than only a post‑hoc explanation. One option is a sensitivity experiment that restricts negative sampling (or even training) to districts with completed systematic surveys and treats predictions elsewhere explicitly as extrapolation. This should be paired with an assessment of covariate shift, i.e., whether predictor distributions in application regions fall within the ranges represented in the training data.
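The covariate-shift assessment mentioned above could start with a simple range check per predictor. This is a minimal sketch with made-up values; the function name, the feature names, and the numbers are assumptions for illustration.

```python
def out_of_range_share(train_values, target_values):
    """Share of target values that fall outside the training min-max range."""
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in target_values if v < lo or v > hi)
    return outside / len(target_values)

# Hypothetical predictor values: training districts vs. an application region.
train = {"slope_deg": [5, 10, 20, 35], "elevation_m": [300, 900, 1500]}
target = {"slope_deg": [2, 8, 15, 40], "elevation_m": [350, 420, 480]}

shift = {name: out_of_range_share(train[name], target[name]) for name in train}
print(shift)
```

A range check is only a crude first screen: it flags extrapolation beyond the training extremes but not shifts in distribution shape within the range, for which a comparison of quantiles or a two-sample test would be needed.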
4) Reliability of the SHAP analysis
With 84 intercorrelated predictors, global SHAP rankings can be difficult to interpret. SHAP importance can be shared among correlated variables and rankings can be sensitive to the correlation structure, not only to genuine predictive signal. The finding that geological and soil variables outrank topographic ones is interesting, but may partly reflect that categorical variables are less intercorrelated amongst themselves than the many DEM-derived metrics. Consequently, each categorical variable receives a high SHAP score, while the DEM-derived variables effectively split their SHAP importance among themselves and so appear less important. A correlation analysis of the feature set and a more cautious framing of the SHAP results as exploratory rather than definitive would considerably strengthen this contribution. Specific mention of how SHAP is affected by multicollinearity and its limitations should also be included for clarity. The use of Group SHAP could also be explored, or at a basic level, simply summing SHAP scores across the variable groups (i.e. the sum of SHAP for all DEM-derived variables). This article may be useful here: https://doi.org/10.1002/aisy.202400304
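The basic option of summing SHAP scores across variable groups could look like the sketch below. The SHAP values, feature names, and group assignments are invented for illustration, and the function is a plain aggregation of mean absolute SHAP values, not the Group SHAP method itself.

```python
from collections import defaultdict

def grouped_importance(shap_rows, feature_groups):
    """Sum the mean absolute SHAP value of each feature within its group."""
    n = len(shap_rows)
    totals = defaultdict(float)
    for row in shap_rows:
        for feature, value in row.items():
            totals[feature_groups[feature]] += abs(value) / n
    return dict(totals)

# Invented per-sample SHAP values (one dict per slope unit).
shap_rows = [
    {"slope_mean": 0.10, "slope_max": 0.08, "curvature": -0.06, "lithology": 0.20},
    {"slope_mean": -0.12, "slope_max": -0.10, "curvature": 0.04, "lithology": -0.18},
]
feature_groups = {
    "slope_mean": "terrain",
    "slope_max": "terrain",
    "curvature": "terrain",
    "lithology": "geology",
}
importance = grouped_importance(shap_rows, feature_groups)
print(importance)
```

In this toy example lithology is the single most important feature, yet the terrain group as a whole outranks geology once its correlated members are summed. An alternative is to sum the signed SHAP values per sample before taking the mean absolute value, which gives smaller totals when contributions within a group cancel.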
Final thoughts
These comments are not intended to diminish what is clearly a carefully executed, well-written and ambitious piece of work, and I recognise the substantial effort that a study of this scale represents. However, the concerns raised above interact with each other in ways that compound their individual effects, and collectively they affect how confidently the results and interpretations can be presented. That said, they are addressable, and the underlying dataset and slope unit framework remain a genuinely valuable contribution. I hope these comments are useful to the authors as they consider potential revisions.
Citation: https://doi.org/10.5194/egusphere-2026-1647-CC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 275 | 118 | 21 | 414 | 14 | 23 |
Dear authors and editor,
It has been some time since I have reviewed an article requiring so few corrections. The topic is interesting, the results are based on a very large dataset, the ROC AUC value is very high, and the results are presented clearly. I enjoyed reading it. Therefore, there are practically no critical suggestions or corrections, but please consider the following: