How hidden variables limit the performance of shallow landslide susceptibility models
Abstract. Susceptibility mapping is critical in assessing shallow landslide hazard and sediment transport potential. Advancements in modelling techniques and the availability of high-resolution spatial data have continuously improved the performance of landslide susceptibility maps. Nevertheless, discrepancies between predicted susceptibility and observed landslide occurrence remain. In addition to shortcomings in model design and the incompleteness of landslide inventories, the accuracy and transferability of susceptibility models are critically limited by hidden variables, such as site-specific variability in soil development, that control the triggering process but are rarely available in inventories. Here we developed an extensive case study framework and applied it to two uniquely detailed inventories in order to quantify the role of hidden variables, as well as the effects of incomplete landslide inventories. The first inventory is a comprehensive regional dataset containing over 24,000 mapped landslides across 5,939 km², and the second is a field-validated dataset of 734 landslides that includes detailed documentation of hidden variables. We trained two Random Forest machine learning models using a wide range of explanatory variables, including topography, land cover, soil properties, and climate. The first model was optimized for the first dataset and achieved high predictive performance within its training domain (mean cross-validated area under the curve, AUC = 0.89). However, its performance decreased significantly (AUC = 0.74) when it was applied to the second dataset, highlighting limitations in transferability. The second model was optimized for the second dataset (AUC = 0.79). A comparison of the two models revealed that regional climatic and geologic data hindered transferability to remote regions because the relationship between available and hidden variables is not properly captured by the susceptibility model.
We further analysed the predicted susceptibility values as a function of the site-specific information collected in the second dataset, to quantitatively explore the role of hidden variables. The analysis suggested that variables related to (i) subsurface heterogeneity and (ii) vegetation complexity govern landslide initiation, but are rarely accounted for in susceptibility models. Specifically, the models underestimated susceptibility in poorly developed soils and areas with uniform forest layering. This study underscores the necessity of a process-based understanding, grounded in field observations, to capture the full complexity of the landslide failure mechanisms that are relevant to landslide susceptibility modelling.
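The two evaluation steps described in the abstract, scoring a model by AUC and then examining predicted susceptibility as a function of a field-recorded hidden variable, can be illustrated with a minimal sketch. It is not the study's code: the function names `auc` and `mean_score_by_class`, the class labels, and all numbers are hypothetical toy values chosen only to show the mechanics, and the AUC here is the standard rank-based estimate, assumed rather than confirmed to match the study's calculation.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen landslide
    sample (label 1) scores higher than a randomly chosen stable sample
    (label 0). Ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_score_by_class(records):
    """Mean predicted susceptibility at observed landslide sites, grouped
    by a hidden-variable class recorded in the field."""
    groups = defaultdict(list)
    for hidden_class, score in records:
        groups[hidden_class].append(score)
    return {cls: sum(vals) / len(vals) for cls, vals in groups.items()}


# Hypothetical toy data: 1 = landslide, 0 = stable, with model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc(labels, scores)

# Hypothetical landslide sites: (soil development class, predicted score).
landslide_sites = [
    ("poorly developed", 0.35),
    ("poorly developed", 0.45),
    ("well developed", 0.80),
    ("well developed", 0.90),
]
class_means = mean_score_by_class(landslide_sites)

print(f"AUC: {toy_auc:.3f}")
for cls, mean in sorted(class_means.items()):
    print(f"{cls}: mean predicted susceptibility {mean:.2f}")
```

In such a comparison, a class whose landslide sites receive a low mean score is one where the model tends to underestimate susceptibility, which is the kind of pattern the abstract reports for poorly developed soils.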