Landslide susceptibility mapping with explainable AI techniques: Evidence from Bavaria, Germany

Buchauer, Veronika; Sapena, Marta; Geiß, Christian; Aravena Pelizari, Patrick; Taubenböck, Hannes

doi:10.5194/egusphere-2026-1647

Preprints

08 Apr 2026

Abstract. Landslides threaten infrastructure, ecosystems, and human safety, particularly in mountainous regions. Climate change, with increasingly intense rainfall, together with growing populations and assets in hazard-prone areas, increases the need for accurate and interpretable landslide susceptibility assessments. This study presents a region-wide landslide susceptibility map modeled for the whole of Bavaria, Germany, based on more than 11,000 recorded landslide events. Using slope units, which are terrain-based spatial mapping entities following natural drainage lines and ridges, the model captures landslide-prone areas in a more terrain-consistent manner than traditional grid-based approaches. To generate the landslide susceptibility map, we employ a dense neural network architecture. The model is trained on the landslide inventory and a wide range of landslide-influencing factors derived from high-resolution topographic, geological, and land cover data and achieves strong predictive performance (ROC AUC = 0.953, PR AUC = 0.844). Model interpretability is approached using the SHapley Additive exPlanations (SHAP) framework, which provides both global and local insights into the factors influencing landslide susceptibility, revealing a strong predictive influence of geology, soil properties and terrain heterogeneity. The resulting susceptibility map is compared with an existing map, which is based on manual assessments, and shows good performance, particularly for deep-seated landslides. However, evaluation using newly recorded landslides reveals limitations in the model's generalizability. Many newly recorded events occur in regions that were underrepresented in the original inventory and are therefore wrongly assigned low susceptibility values. This demonstrates how spatial incompleteness and selection bias in landslide inventories directly propagate into susceptibility maps, leading to systematic underestimation of hazard.
Overall, this study highlights that while explainable machine learning enables robust and more interpretable regional susceptibility mapping, the quality and spatial completeness of landslide inventories are critical for reliable hazard assessment and mitigation.

Received: 24 Mar 2026 – Discussion started: 08 Apr 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-1647', Anonymous Referee #1, 20 Apr 2026
Dear authors and editor,
It has been some time since I have reviewed an article requiring so few corrections. The topic is interesting, the results are based on a very large dataset, the ROC AUC value is very high, and the results are presented clearly. I enjoyed reading it. Therefore, there are practically no critical suggestions or corrections, but please consider the following:
Provide a geological map, both a general one and, optionally, a more detailed one for a smaller region.

slopeunits is a GRASS GIS tool, but this is not mentioned at all.

In Figure 3, when the DEM is zoomed in, visible squares appear across the region. Is this an artefact of a particular tool, a result of resampling to 5 × 5 m resolution, or is there another reason? Does this affect the results?

In section 2.3.1 (Segmentation of slope units), what procedure did you follow for the karst terrains? Delineation of slope units does not really work there, but there is no information on the geological map to allow further comment on this.

Line 238: Did you check the influence of several landslides within the same slope unit? Only the most recent was considered; however, it could also be the largest one or based on another criterion. What is the impact of this assumption?

Line 320: What is the basis for choosing these numbers of neurons?

Line 408: Geological units also do not have a major role (geochemical rock type is more important). What is the reason for this?

Figure 8: Move the right part of the figure below the left one (not side by side) and enlarge both (a) and (b), as the fonts in the figure are too small.
Citation: https://doi.org/10.5194/egusphere-2026-1647-RC1
- AC1: 'Reply on RC1', Veronika Buchauer, 13 May 2026
  
  We sincerely thank the reviewer for the very positive and constructive assessment of our manuscript. We particularly appreciate the recognition of the study’s large database and clear presentation of results, as well as the helpful suggestions for further clarification. Please find our point-by-point responses in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1647-AC1
RC2:
'Comment on egusphere-2026-1647', Anonymous Referee #2, 29 Apr 2026

Dear authors and editor,
The article is well written, the research question is clear, the methodology is appropriate, the results are well presented, and the discussion is thorough. The manuscript is fit for publication after minor changes: there are no significant flaws in the presented research, only “food for thought” questions and suggestions, listed below, that could help improve the manuscript.

I have a few technical suggestions:
- The figures that show maps should have a north arrow.
- I could not find the size of the study area, i.e. in square kilometres.
- It would be helpful to state which seven land cover classes are used as the “land cover” variable.
- It would be beneficial to add a geological map, e.g. alongside Figure 2: make it 2b and keep the current Figure 2 as 2a.
- If possible, try to quantify the information about the landslide inventory, which is key in this article, e.g. the data in lines 120-123. Were some landslides mapped from HR LiDAR derivatives? What share of the inventory is field surveyed and verified? What time spans are covered? For an article focusing on the inventory, some more numbers should be present in my opinion, considering the >11 000 landslides.

A few questions for possible discussion; consider adding some of them to the manuscript (and replying to the rest in the discussion):
- You mention 84 input features (LCFs) for the modelling. This seems unusually many. What was the motivation, and were any collinearity tests done beforehand or during the modelling?
- Why did you decide to use “only” the ANN method? Were there preliminary tests or similar cases that made you prioritize this method? Another (quite different) method might give different results from those you presented.
- How were the exemplary slope units for Figure 8 selected?
- Did you consider using the LfU polygons as an LCF (i.e. a categorical LCF: present deep, new deep, etc., as shown in Figure 9, left)?
- Could your model point out the locations where inventory adjustments, corrections, or upgrades are most needed? Can you elaborate on the uncertainty aspect of the modelling and why it was not quantified?
- It would be interesting to see spatial k-fold validation in this type of modelling; could you elaborate on this topic?
- You mention that stable slope units are in fact unknown, i.e. possibly stable or unstable, which is correct (they are not verified negatives). If you could declare some slope units as 100% stable, would you prioritize them in the modelling? Has this been done in previous research, and if so, what results were reported?
- Your main conclusion is stated in lines 615-618. Can your research propose some novelty on this topic? If so, specify some ideas for future work and perspectives to mitigate this issue with the “unseen locations”.
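
The collinearity question raised in the list above could be approached with a pairwise correlation screen before modelling. The sketch below is an illustration only: the feature names, the toy values, and the 0.9 threshold are assumptions and are not taken from the manuscript.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical per-slope-unit values (not from the study).
features = {
    "slope_mean": [5.0, 12.0, 20.0, 31.0, 38.0],
    "slope_max": [9.0, 19.0, 33.0, 47.0, 60.0],
    "clay_content": [30.0, 12.0, 25.0, 18.0, 22.0],
}

flagged = correlated_pairs(features)
print(flagged)
```

In this toy input only the two slope statistics are flagged. A real screen over 84 features would more likely inform grouping or pruning decisions than settle them.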

Kind regards

Citation: https://doi.org/10.5194/egusphere-2026-1647-RC2
- AC2: 'Reply on RC2', Veronika Buchauer, 13 May 2026
  
  We would like to thank the reviewer for the careful and constructive evaluation of our manuscript. We are pleased that the research question, methodology, and results section were found to be sound. We took the technical suggestions seriously, as we believe they add meaningful value to the work. Please find our point-by-point responses in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1647-AC2
RC3:
'Comment on egusphere-2026-1647', Anonymous Referee #3, 13 May 2026
I have read your manuscript in great detail. I have the following major comments:

I see that the flat areas are not masked out in the slope units. Is there a reason for that? If you look at Figure 9, the susceptibility map used for comparison has a masked region. This is important because your model shows great variation in susceptibility overall, but I am not sure whether it shows similar variation within the mountainous regions alone (e.g. Figure 4). If all the mountainous regions have susceptibility above 0.7, it would not be a robust analysis. Therefore, I recommend masking the flat regions out.

The random train–test split likely introduces spatial leakage, so spatial cross-validation or blocked validation should be implemented to provide more realistic performance estimates.

The modeling framework incorrectly treats all non-landslide slope units as true negatives. You remove 11k slope units (SU) out of 317k and justify this as accounting for class imbalance, but does it really solve the problem of a strong class imbalance with only 2.6% positives? By my calculation, it raises the share to 2.7%. I would expect a better way to handle such a strong class imbalance.
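
The arithmetic behind this comment can be reproduced in a few lines. The counts are the rounded figures quoted in this discussion, and the sketch assumes that all removed slope units are non-landslide units; the exact values in the manuscript may differ.

```python
# Rounded figures quoted in the review; exact manuscript values may differ.
total_units = 317_000     # slope units before removal
positive_share = 0.026    # reported share of landslide slope units
removed_units = 11_000    # units removed, assumed here to be all negatives

positives = total_units * positive_share
share_after = positives / (total_units - removed_units)

print(f"positives: {positives:.0f}")
print(f"share before: {positive_share:.1%}, after: {share_after:.1%}")
```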

The ANN architecture is insufficiently justified and should be benchmarked against simpler baseline models such as logistic regression, random forests, or XGBoost and compared in greater detail.

I find it really hard to see intuitively what the novelty of the work is. I know it concerns the role of the inventory, but this should be discussed in greater detail and the story made more coherent. At present the paper reads more like a new AI model for landslide susceptibility (which is not the case, as such models have been used a lot). I recommend framing the story along the lines of this paper: https://link.springer.com/article/10.1007/s10064-005-0023-0

The threshold selection and evaluation framework should be strengthened using more rigorous imbalance-aware and calibration-sensitive performance metrics.

The workflow in Figure 1 should be removed in my opinion. Items such as "transform python code to R" are not actual methods but mere technical details, and flowcharts with such a long flow and so much detail do not describe or justify the choice of method (which is done in the text). I therefore recommend removing or simplifying this figure.

In general, all maps should follow cartographic standards, which are currently missing (e.g. coordinates, a north arrow, and consistent capitalisation: in "Digital Elevation Model" and "shaded relief", either capitalise everything or use lowercase throughout).

The manuscript should discuss temporal inconsistencies between landslide inventories and environmental predictor datasets more explicitly.

The study should ensure reproducibility by providing complete implementation details, hyperparameter ranges, code availability, and workflow documentation.

I would remove the challenges in slope unit segmentation from the discussion, as they are not the main focus of your manuscript; the focus is the impact of inventory incompleteness. I would structure the manuscript along that line, with one clear message, which has better scientific impact.
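
The blocked validation recommended earlier in this review can be sketched as follows. This is a minimal illustration, assuming slope units are represented by centroid coordinates and grouped into square grid blocks; the coordinates, the 10 km block size, and the round-robin fold assignment are all hypothetical choices and do not describe the authors' setup.

```python
from collections import defaultdict


def block_id(x, y, block_size):
    """Map a coordinate to the index of the square grid block containing it."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_folds(coords, block_size, n_folds):
    """Yield (train_idx, test_idx) so that no block is split across the two."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size)].append(i)

    # Assign whole blocks to folds in round-robin order.
    fold_members = [[] for _ in range(n_folds)]
    for k, key in enumerate(sorted(blocks)):
        fold_members[k % n_folds].extend(blocks[key])

    for k in range(n_folds):
        test_idx = sorted(fold_members[k])
        train_idx = sorted(
            i for j, fold in enumerate(fold_members) if j != k for i in fold
        )
        yield train_idx, test_idx


# Hypothetical slope-unit centroids in metres (not from the study).
coords = [(500, 500), (1500, 700), (700, 9000), (12000, 300),
          (12500, 800), (11000, 11000), (11900, 10500), (300, 9800)]

folds = list(spatial_block_folds(coords, block_size=10_000, n_folds=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because every block lands entirely in either the training or the test set, neighbouring slope units cannot leak across the split. In practice a library routine such as scikit-learn's `GroupKFold`, with the block identifier as the group label, serves the same purpose.
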
Citation: https://doi.org/10.5194/egusphere-2026-1647-RC3
- AC3: 'Reply on RC3', Veronika Buchauer, 01 Jul 2026
  
  We thank the Editor for handling our manuscript and the Reviewer for the thorough and constructive review, which has helped us sharpen the paper's message and improve its rigor.
  In revising the manuscript, we made our central contribution more explicit, especially in the introduction, discussion, and conclusion. In response to the reviewer's concerns and suggestions, we included statistics on susceptibility levels within mountainous terrain, clarified the flat-area masking, and explained in more detail how the class imbalance was handled. We are also improving the figures and maps. In addition, we ran further experiments suggested by the reviewers, such as spatial cross-validation and benchmark modeling. The results of these experiments will be incorporated in the revised version of the manuscript.
  Our point-by-point responses are provided in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1647-AC3
CC1:
'Comment on egusphere-2026-1647', Oliver Wigmore, 19 May 2026

This is an interesting and ambitious study that addresses an important applied problem. The slope unit framework is well implemented, and the dataset is impressively large. However, I believe there are several important methodological issues that warrant further consideration.
1) Performance metrics should be interpreted carefully
The ROC-AUC of 0.953 and PR-AUC of 0.844 are calculated on a random holdout, which does not account for the highly uneven spatial distribution of the inventory. Because positive cases are heavily concentrated in a few regions, the test set inevitably over-represents the same terrain already seen in training. These metrics therefore reflect within-inventory discrimination rather than generalisation across Bavaria, and consequently could be optimistic. A blocked spatial cross validation (e.g. holding out entire districts or grids) would be more meaningful. The “out-of-distribution” validation in Section 3.4 is effectively the only real test of generalisation, and the authors highlight poor performance here.
2) Feature set size and model complexity
84 predictors in a six-layer ANN is a lot for only ~8,000 positive cases (2.6% of ~300,000 slope units). Many of these predictors are highly intercorrelated (especially the topographic variables) and some are potentially redundant. While ANNs are generally less sensitive to multicollinearity, it still increases the dimensionality of the model and can make generalisation/transferability harder to demonstrate without sensitivity tests. It is unclear how much of the poor “out-of-distribution” performance reflects inventory bias versus model overspecification. The discussion attributes the generalisation failure primarily to inventory incompleteness, but the model architecture may also be contributing. A feature reduction or ablation study would help to disentangle these.
3) Negative sampling strategy
Slope units without a recorded landslide are treated as negatives across all of Bavaria, regardless of whether those areas have ever been systematically surveyed. For an incomplete inventory this is a recognised issue, because it converts “unknowns” into negative class labels, which the model then learns. The resulting underestimation in districts such as Erding and Landshut (on the updated inventory) is consistent with this. The authors identify and acknowledge this issue, but it would strengthen the practical value of the Bavaria‑wide map to treat it as a design constraint rather than only a post‑hoc explanation. One option is to run a sensitivity experiment that restricts negative sampling (or even training) to districts with completed systematic surveys, and to treat predictions elsewhere explicitly as extrapolation. This should be paired with an assessment of covariate shift, i.e., whether predictor distributions in application regions fall within the ranges represented in the training data.
4) Reliability of the SHAP analysis
With 84 intercorrelated predictors, global SHAP rankings can be difficult to interpret. SHAP importance can be shared among correlated variables and rankings can be sensitive to the correlation structure, not only to genuine predictive signal. The finding that geological and soil variables outrank topographic ones is interesting, but may partly reflect that categorical variables are less intercorrelated amongst themselves than the many DEM-derived metrics. Consequently they receive a high SHAP score, while the DEM derived variables effectively split their SHAP ranking between themselves, appearing less important. A correlation analysis of the feature set and a more cautious framing of the SHAP results as exploratory rather than definitive would considerably strengthen this contribution. Specific mention of how SHAP is affected by multicollinearity and its limitations should also be included for clarity. The use of Group SHAP could also be explored, or at a basic level, simply summing SHAP scores across the variable groups (i.e. sum of SHAP for all DEM derived variables). This article may be useful here: https://doi.org/10.1002/aisy.202400304
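
The grouped aggregation suggested in point 4 can be illustrated with a small sketch. It sums per-sample SHAP values within each variable group and then averages the absolute group totals. The feature names, the group assignment, and the SHAP values are invented for illustration and are not results from the study.

```python
from collections import defaultdict


def grouped_importance(shap_rows, feature_names, feature_group):
    """Sum per-sample SHAP values within each group, then average magnitudes."""
    totals = defaultdict(float)
    for row in shap_rows:
        per_group = defaultdict(float)
        for name, value in zip(feature_names, row):
            per_group[feature_group[name]] += value
        for group, value in per_group.items():
            totals[group] += abs(value)
    return {group: total / len(shap_rows) for group, total in totals.items()}


# Hypothetical feature names, groups, and SHAP values (not from the study).
feature_names = ["slope_mean", "slope_max", "roughness", "rock_type"]
feature_group = {
    "slope_mean": "terrain",
    "slope_max": "terrain",
    "roughness": "terrain",
    "rock_type": "geology",
}
shap_rows = [
    [0.04, 0.03, 0.05, 0.10],
    [-0.02, -0.04, -0.03, 0.08],
    [0.05, 0.02, 0.04, -0.09],
]

importance = grouped_importance(shap_rows, feature_names, feature_group)
print(importance)
```

In this toy input, `rock_type` has the largest single-feature magnitude, yet the three terrain features together exceed it once grouped, which is the dilution effect the comment describes.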

Final thoughts
These comments are not intended to diminish what is clearly a carefully executed, well-written and ambitious piece of work, and I recognise the substantial effort that a study of this scale represents. However, the concerns raised above interact with each other in ways that compound their individual effects, and collectively they affect how confidently the results and interpretations can be presented. That said, they are addressable, and the underlying dataset and slope unit framework remain a genuinely valuable contribution. I hope these comments are useful to the authors as they consider potential revisions.

Citation: https://doi.org/10.5194/egusphere-2026-1647-CC1
- AC4: 'Reply on CC1', Veronika Buchauer, 01 Jul 2026
  
  We appreciate the time and effort spent by the author of the community comment, Dr. Oliver Wigmore, and the feedback on our work.
  
  The comments have helped us strengthen both the analysis and its presentation. In response, we are adding a spatial cross-validation to better reflect spatial generalization, we clarify the effective size of the feature set and the rationale for the model, and we frame the SHAP results more cautiously alongside a correlation analysis. Our point-by-point replies follow in the attached document.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1647-AC4

Viewed

Total article views: 642 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
409	197	36	642	18	30

Views and downloads (calculated since 08 Apr 2026)

Month	HTML	PDF	XML	Total
Apr 2026	230	98	18	346
May 2026	159	84	12	255
Jun 2026	7	6	4	17
Jul 2026	13	9	2	24

Latest update: 18 Jul 2026

Short summary

For Bavaria, Germany, we developed the first region-wide data-driven landslide susceptibility map, training a neural network on over 11,000 landslide events using terrain, geological, and land cover data. The model shows strong predictive performance, with geology and soil properties as the most influential features. Validation against newly recorded landslides shows that inventory incompleteness and selection bias translate directly into susceptibility underestimation in poorly mapped regions.

