Feature Selection for Landslide Forecasting Models in the Southern Andes
Abstract. Forecasting rainfall-induced landslides (RILs) is crucial for early warning systems designed to mitigate the devastating impacts of these events on human lives, infrastructure, and the environment. Early warning based on dense instrumental networks currently requires large datasets so that machine learning models can identify precursor patterns. Topographic, lithological, vegetation, soil moisture, and climatic characteristics are among the variables most commonly used to train these models. However, no universal model design exists, so the requirements must be adapted to each context and to the variables available to characterise it. To develop a RIL forecasting model for the Southern Andes, this study gathers data from various local soil and climate databases and identifies the most relevant variables. Feature selection improves the design of machine learning models by reducing the dimensionality of the input data, enhancing computational efficiency, and helping to prevent overfitting. We assessed the impact of various features, both individually and in combination, on the performance of predictive models. Feature selection was performed with methods such as Classification and Regression Trees (CART) and Genetic Algorithms. A national landslide database was enriched with negative (non-landslide) examples using techniques such as buffer-controlled sampling, PU Bagging, and clustering. Several predictive models were tested. The results show that a consistent set of variables is the most significant for forecasting landslides across four regions of southern Chile.
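As a rough illustration of the tree-based side of feature selection, the sketch below ranks features by the largest Gini impurity decrease that a single threshold split can achieve, which is the criterion CART evaluates at each node. It is not the study's pipeline: it scores one split per feature rather than growing a full tree, the feature names (`rain_72h_mm`, `slope_deg`, `ndvi`) are hypothetical, and the data are synthetic, with the label driven by rainfall alone by construction.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all threshold splits on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    total_pos = sum(labels)
    best = 0.0
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        left_n, right_n = i, n - i
        p_left = left_pos / left_n
        p_right = (total_pos - left_pos) / right_n
        # Size-weighted impurity of the two child nodes.
        child = (left_n * 2.0 * p_left * (1.0 - p_left)
                 + right_n * 2.0 * p_right * (1.0 - p_right)) / n
        best = max(best, parent - child)
    return best


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [(name, best_split_gain([row[j] for row in rows], labels))
             for j, name in enumerate(names)]
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Synthetic, illustrative data only (hypothetical feature names and ranges).
rng = random.Random(0)
names = ["rain_72h_mm", "slope_deg", "ndvi"]
rows = [[rng.uniform(0, 100), rng.uniform(0, 45), rng.uniform(0, 1)]
        for _ in range(200)]
# Assumption for the demo: a "landslide" label that depends on rainfall only.
labels = [1 if row[0] > 50 else 0 for row in rows]

ranking = rank_features(rows, labels, names)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

Because the synthetic label depends only on rainfall, that feature ranks first. On real data, a full tree or a wrapper method such as a genetic algorithm would also capture interactions between features, which a single-split score cannot.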