the Creative Commons Attribution 4.0 License.
Predicting Slope Instabilities in Salvador, Brazil, using Machine Learning and Georeferenced Data
Abstract. Municipalities worldwide struggle with slope instability, a particularly pressing issue in cities such as Salvador, Brazil, where rugged terrain, escarpments, and complex geology create a high risk of instability. The complexity of the problem is evident in the variability of terrain properties, the bedrock’s inherited structural features, and anthropogenic action. This range of variables makes it well-suited to machine learning (ML) approaches for instability prediction. Although ML has experienced an impressive recent boost, only a few cases have applied ML to real-world instability events. In this paper, a data bank of hydromechanical properties of soils is used in conjunction with a digital terrain model (DTM) and different geo-referenced information, including rainfall, vegetation coverage, geological structures, sewage collection/treatment status, and residential density, to predict the occurrence of soil mass movements and related emergency calls to the municipality from the population living in risk areas. 13,522 emergency calls were considered during the period from 2020 to 2025. Excellent predictive performance, with an R² ≈ 0.98 consistently across both the validation and testing phases, was obtained in this original study. This strong result underscores the viability of machine learning as a powerful tool for this kind of problem, particularly within municipal warning systems.
Status: open (until 04 Jul 2026)
RC1: 'Comment on egusphere-2026-2056', Milad Basirifard, 04 Jun 2026
AC1: 'Reply on RC1', Sandro Machado, 08 Jun 2026
Please find below our responses (R) to the RC1 comments:
This manuscript presents an important and timely contribution to urban landslide-risk prediction by combining geotechnical data, a high-resolution DTM, rainfall records, geological structures, sewage-service information, vegetation cover, residential density, and municipal emergency-call records for Salvador, Brazil. The use of 13,522 emergency calls from 2020–2025 is particularly valuable, since operational municipal landslide datasets are rarely available at this scale. The reported performance is very high, with R² values exceeding 0.94 across scenarios and reaching approximately 0.98 in some validation/test configurations. However, several methodological aspects require clarification before the results can be considered fully robust for operational early-warning use.
R - The authors thank the reviewer for his words and valuable feedback. All comments are addressed in sequence below, and most of the reviewer's suggestions will be incorporated in the next version of the paper.
First, the definition of the output variable deserves more scrutiny. The model predicts a weighted emergency-call index rather than directly predicting slope failure occurrence. Confirmed and unconfirmed calls are combined using an RFU weighting factor, and the authors tune RFU, the time interval Ti, and the radius of influence Ri to maximize predictive performance. This is understandable from an operational perspective, but it also means that the target variable partly reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The manuscript should discuss this distinction more explicitly and avoid implying that the model purely predicts geotechnical failure.
R – The reviewer is right that the output variable reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The city of Salvador has implemented warning systems and direct communication channels to reach the population in risk areas. However, the population's perception of risk does not always correspond to instability events, and, when confirmed, landslides exhibit a wide range of scales and impacts, from economic losses to loss of human life. The complexity of the problem is therefore evident, involving geotechnical and geological aspects, the effects of anthropogenic actions, the quality and coverage of public services, and the human response when confronted with risk situations. Using landslides as the output variable would require deeper investigation of the slip surface geometry, determination of local soil properties, and back-analysis of the events. The authors believe that the population's perception of risk can be a valuable source of feedback: residents live in these risk areas and, unfortunately, most of them have already experienced instability events, which helps them develop a kind of natural perception that lets them recognize local indicators of instability and contact the municipal authorities. These aspects will be highlighted in the next version of the paper.
Second, the random train/validation/test split may overestimate model generalization because nearby slopes and repeated rainfall/event conditions are likely spatially and temporally autocorrelated. The database was randomly split into 60% training, 20% validation, and 20% testing after stratification of the output variable. For a geospatial hazard model, a stronger test would be spatial block cross-validation, temporal holdout validation, or testing on independent storm seasons/neighborhoods. Without such validation, the very high R² values may partly reflect spatial leakage or repeated event structures rather than true predictive skill for unseen areas or future events.
Fourth, the manuscript would benefit from additional evaluation metrics that are more relevant for warning systems. R², MAE, RMSE, and MSE are useful for regression, but early-warning decisions depend on threshold behavior: probability of detection, false-alarm ratio, precision-recall, ROC-AUC, confusion matrices at operational thresholds, and performance during extreme rainfall events. Since the conclusion emphasizes evacuation warnings and targeted alerts, the paper should demonstrate how the regression output would be converted into warning levels and how reliable those warnings would be.
R – The authors agree with the reviewer and acknowledge this limitation at the current stage of the project. Nearby slopes do indeed experience repeated or very similar rainfall conditions, as they are spatially and temporally autocorrelated. As discussed in the text, the rainfall events were interpolated in time and space from forty weather stations installed in the city, which makes rainfall similar across neighboring slopes at a given moment.
The authors, however, plan to perform temporal holdout validation during the interactive implementation of this model and to gradually replace the warning and emergency declaration criteria currently used by the municipality. We are now in the rainy period in the city. All data produced during this period will help with the model implementation task, serving as a temporal holdout validation (walk-forward validation using sliding windows to test memory depth, since most of the variables tend to evolve over time, e.g., the quality of the sewage collection system and the geometry of the slopes). At this stage, new metrics will be employed in addition to R², MAE, RMSE, and MSE. At least two thresholds must be defined at this stage, considering the predicted and observed values of the output variable: a warning state and an emergency state. Declaring an emergency without confirmed events (false alarms) can jeopardize the entire system, leading the population to gradually discredit it and exposing them to risk in future events, which underscores the importance of some of the aspects highlighted by the reviewer. However, the authors intend to present these developments in a future publication.
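As an illustration of the threshold-based metrics discussed here, the following is a minimal sketch of how probability of detection (POD) and false-alarm ratio (FAR) might be computed once a regression output is converted into warning levels. The function name, the toy values, and the thresholds 0.5 (warning) and 0.8 (emergency) are hypothetical and are not taken from the manuscript.

```python
def warning_scores(predicted, observed, threshold):
    """Count hits, misses, and false alarms at one threshold.

    A warning is "issued" when the predicted index reaches the threshold,
    and an event "occurred" when the observed index reaches it.
    """
    hits = misses = false_alarms = 0
    for p, o in zip(predicted, observed):
        warned = p >= threshold
        occurred = o >= threshold
        if warned and occurred:
            hits += 1
        elif occurred:
            misses += 1
        elif warned:
            false_alarms += 1
    # POD = hits / (hits + misses); FAR = false alarms / (hits + false alarms).
    pod = hits / (hits + misses) if (hits + misses) else None
    far = false_alarms / (hits + false_alarms) if (hits + false_alarms) else None
    return {"hits": hits, "misses": misses,
            "false_alarms": false_alarms, "pod": pod, "far": far}


# Hypothetical predicted and observed values of the output variable.
predicted = [0.1, 0.6, 0.9, 0.4, 0.7]
observed = [0.0, 0.8, 0.7, 0.6, 0.2]

warning = warning_scores(predicted, observed, threshold=0.5)    # assumed warning level
emergency = warning_scores(predicted, observed, threshold=0.8)  # assumed emergency level
print(warning)
print(emergency)
```

Evaluating both levels separately makes the cost of false alarms at the emergency level visible, which a single R² value does not show.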
Third, the radius of influence has a major effect on model performance. The authors report that Ri = 300 m gives the highest R², but also acknowledge that such a radius may merge emergency calls and geological information from adjacent slopes. This is a central issue, not a minor sensitivity result. Larger radii naturally smooth the target and aggregate more calls, which may increase R² while reducing slope-specific interpretability. The choice of Ri = 150 m as a physically consistent compromise should be justified more rigorously, perhaps using geomorphological criteria, event inventories, or sensitivity of false alarms/missed alarms.
R – The authors thank the reviewer for this observation and suggestion. In the revised version of this paper, the radius values are compared with the plan-view extent of the slopes and with the distances between the centroids of the nearest slope areas and the geological features (faults, foliations, and dikes). Using these measures, the authors believe a radius value can be easily chosen to prevent merging emergency calls
Fifth, feature importance based on XGBoost weight and gain should be interpreted cautiously. These metrics can be biased by correlated predictors, and this study includes many highly correlated rainfall windows as well as related FoS/probability/reliability variables. The paper would be stronger if the authors added SHAP analysis, permutation importance under grouped variables, or ablation tests that remove rainfall, geotechnical, anthropogenic, and geological feature groups separately.
R – The authors thank the reviewer for this important observation and agree that feature importance metrics based on XGBoost weight and gain should be interpreted with caution in the present study. The model variables include multiple cumulative rainfall windows as well as FoS, probability, and reliability, which are clearly interdependent, making it difficult to account accurately for their individual contributions to model performance.
However, the paper's main focus is the application of ML to a specific, practical problem rather than an assessment of different procedures for analyzing variable importance. Furthermore, the use of different cumulative rainfall windows is justified by differences in soil permeability: longer periods are more relevant for more impervious soils, for instance. Concerning the triad FoS/probability/reliability, for each situation (saturated or natural water content), probability and reliability depend on, and are redundant with, a given FoS. However, this dependency is highly nonlinear, and the variable scales are very different. Nevertheless, the authors recognize the potential value of complementary interpretability analyses and will evaluate the possibility of incorporating additional techniques, as suggested by the reviewer.
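For reference, the grouped permutation importance suggested by the reviewer can be sketched as follows. This is only an illustration under stated assumptions: the toy model, the column layout, and the group names are hypothetical stand-ins, not the authors' XGBoost model or variables. All columns of a group are shuffled with the same row permutation, so correlated predictors (such as several rainfall windows) are assessed jointly rather than one at a time.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def grouped_permutation_importance(predict, X, y, groups, seed=0):
    """Increase in MSE when all columns of a group are permuted together."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importance = {}
    for name, cols in groups.items():
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = [list(row) for row in X]
        # Apply the same row permutation to every column in the group.
        for i, src in enumerate(order):
            for c in cols:
                X_perm[i][c] = X[src][c]
        importance[name] = mse(y, [predict(row) for row in X_perm]) - baseline
    return importance


# Hypothetical toy data: two correlated rainfall windows and one unrelated column.
data_rng = random.Random(42)
X = []
for _ in range(50):
    rain_24h = data_rng.uniform(0, 100)
    rain_72h = rain_24h + data_rng.uniform(0, 50)
    unrelated = data_rng.uniform(0, 1)
    X.append([rain_24h, rain_72h, unrelated])


def toy_model(row):
    """Stand-in for a trained model; it uses only the rainfall columns."""
    return 0.01 * row[0] + 0.005 * row[1]


y = [toy_model(row) for row in X]
groups = {"rainfall": [0, 1], "unrelated": [2]}
scores = grouped_permutation_importance(toy_model, X, y, groups)
print(scores)
```

In this toy setting the unrelated group has no effect on the error, while permuting the rainfall group degrades it; with a real model, the same grouping could be applied to the rainfall, geotechnical, anthropogenic, and geological feature groups.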
Finally, reproducibility is limited. The manuscript states that code and data can be made available upon request and subject to cooperation agreements or municipal approval. While privacy concerns around emergency-call data are legitimate, the authors should consider releasing anonymized, aggregated, or synthetic benchmark datasets, together with the trained model workflow and hyperparameter settings. This would substantially improve transparency and allow independent verification.
R – Please note that the paper already shares much of the information used, e.g., the quality of the sewage services, the DTM, and the presence of residences and vegetation. The FoS/probability/reliability values and the code can be easily shared (the authors intend to do this in the next version of the paper). However, in addition to privacy concerns about emergency-call data, the Federal University of Bahia has partnership agreements with the Salvador Municipality that prohibit data sharing except in special circumstances. For this reason, the paper states that “... their disclosure to a third party is only possible after filtering and approval by the city administration.”
Citation: https://doi.org/10.5194/egusphere-2026-2056-AC1
CC1: 'Comment on egusphere-2026-2056', Mohammad Mokhtari, 08 Jun 2026
The study addresses a real and pressing urban hazard problem, and the dataset is impressive — 13,522 emergency calls over five years give genuine statistical depth. The R² ≈ 0.98 across both validation and testing phases is a strong result, and the multi-variable approach combining hydromechanical soil properties, rainfall, vegetation, geology, sewage infrastructure, and residential density reflects the genuine complexity of urban slope failure. The focus on emergency calls as the target variable is clever — it grounds the prediction in operationally meaningful outcomes rather than abstract hazard indices.
An R² of 0.98 is almost suspiciously high for a real-world geohazard prediction problem and warrants scrutiny. The key question is whether the model is genuinely predictive or whether it has been over-fitted to the training data. The paper claims consistency across validation and testing phases, which would address this concern — but the reviewer would want to see the train/test split methodology clearly described and ideally an independent holdout period not used in any parameter tuning.
Citation: https://doi.org/10.5194/egusphere-2026-2056-CC1
AC2: 'Reply on CC1', Sandro Machado, 08 Jun 2026
Please find below our responses (R) to the CC1 comments:
The study addresses a real and pressing urban hazard problem, and the dataset is impressive — 13,522 emergency calls over five years give genuine statistical depth. The R² ≈ 0.98 across both validation and testing phases is a strong result, and the multi-variable approach combining hydromechanical soil properties, rainfall, vegetation, geology, sewage infrastructure, and residential density reflects the genuine complexity of urban slope failure. The focus on emergency calls as the target variable is clever — it grounds the prediction in operationally meaningful outcomes rather than abstract hazard indices.
R - The authors sincerely thank CC1 for the kind words!
An R² of 0.98 is almost suspiciously high for a real-world geohazard prediction problem and warrants scrutiny. The key question is whether the model is genuinely predictive or whether it has been over-fitted to the training data. The paper claims consistency across validation and testing phases, which would address this concern — but the reviewer would want to see the train/test split methodology clearly described and ideally an independent holdout period not used in any parameter tuning.
R - At this stage of our study, our main concern was to avoid overfitting, which we addressed by randomly splitting the data into 60% training, 20% validation, and 20% testing after stratifying on the output variable. The authors are aware that time series modeling is challenging because, in many circumstances, as in our case, the problem is highly dynamic and evolves over time, so past data may not be adequate to model a new period. To give two examples, the DTM changes quickly in risk areas and in non-consolidated settlements, as does the quality of the public services. The authors, however, plan to perform temporal holdout validation during the interactive implementation of this model and to gradually replace the municipality's current warning and emergency declaration criteria. We are now in the rainy period in the city. All data produced during this period will help with the model implementation task, serving as a temporal holdout validation (walk-forward validation using sliding windows to test memory depth).
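The walk-forward scheme mentioned in this reply can be sketched as below. This is a minimal illustration, assuming the records are grouped into consecutive time periods (for example, months); the function name and the window sizes are hypothetical and not taken from the manuscript.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_periods: int, train_window: int, test_window: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_periods, test_periods) with training strictly before testing.

    The training window slides forward by `test_window` periods at each step,
    so every test period is predicted only from data that precede it.
    """
    start = 0
    while start + train_window + test_window <= n_periods:
        train_end = start + train_window
        train = list(range(start, train_end))
        test = list(range(train_end, train_end + test_window))
        yield train, test
        start += test_window


# Six consecutive periods, a three-period training window, one-period tests.
splits = list(walk_forward_splits(n_periods=6, train_window=3, test_window=1))
for train, test in splits:
    print("train:", train, "test:", test)
```

Repeating the evaluation with different `train_window` values is one way to probe memory depth, that is, how much history remains useful when variables such as sewage service quality or slope geometry change over time.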
Citation: https://doi.org/10.5194/egusphere-2026-2056-AC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 206 | 56 | 22 | 284 | 18 | 20 |
This manuscript presents an important and timely contribution to urban landslide-risk prediction by combining geotechnical data, a high-resolution DTM, rainfall records, geological structures, sewage-service information, vegetation cover, residential density, and municipal emergency-call records for Salvador, Brazil. The use of 13,522 emergency calls from 2020–2025 is particularly valuable, since operational municipal landslide datasets are rarely available at this scale. The reported performance is very high, with R² values exceeding 0.94 across scenarios and reaching approximately 0.98 in some validation/test configurations. However, several methodological aspects require clarification before the results can be considered fully robust for operational early-warning use.
First, the definition of the output variable deserves more scrutiny. The model predicts a weighted emergency-call index rather than directly predicting slope failure occurrence. Confirmed and unconfirmed calls are combined using an RFU weighting factor, and the authors tune RFU, the time interval Ti, and the radius of influence Ri to maximize predictive performance. This is understandable from an operational perspective, but it also means that the target variable partly reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The manuscript should discuss this distinction more explicitly and avoid implying that the model purely predicts geotechnical failure.
Second, the random train/validation/test split may overestimate model generalization because nearby slopes and repeated rainfall/event conditions are likely spatially and temporally autocorrelated. The database was randomly split into 60% training, 20% validation, and 20% testing after stratification of the output variable. For a geospatial hazard model, a stronger test would be spatial block cross-validation, temporal holdout validation, or testing on independent storm seasons/neighborhoods. Without such validation, the very high R² values may partly reflect spatial leakage or repeated event structures rather than true predictive skill for unseen areas or future events.
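A minimal sketch of the spatial block cross-validation mentioned above, assuming each sample has projected coordinates in metres. The function name, the 500 m block size, and the toy coordinates are hypothetical. Samples are assigned to square grid blocks, and whole blocks are held out together, so neighbouring points in the same block never appear on both sides of a split.

```python
def spatial_block_folds(coords, block_size_m, n_folds):
    """Return a list of (train_indices, test_indices), one pair per fold.

    Every sample in a given grid block goes to the same fold.
    """
    block_ids = [(int(x // block_size_m), int(y // block_size_m))
                 for x, y in coords]
    unique_blocks = sorted(set(block_ids))
    # Deterministic round-robin assignment of blocks to folds.
    fold_of_block = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    folds = []
    for k in range(n_folds):
        test = [i for i, b in enumerate(block_ids) if fold_of_block[b] == k]
        train = [i for i, b in enumerate(block_ids) if fold_of_block[b] != k]
        folds.append((train, test))
    return folds


# Hypothetical sample coordinates in metres.
coords = [(10, 20), (40, 60), (510, 30), (560, 80), (20, 520), (530, 540)]
folds = spatial_block_folds(coords, block_size_m=500, n_folds=2)
for train, test in folds:
    print("train:", train, "test:", test)
```

Points close to a block boundary can still be near neighbours across folds, so a buffer between training and test blocks may be needed in practice.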
Third, the radius of influence has a major effect on model performance. The authors report that Ri = 300 m gives the highest R², but also acknowledge that such a radius may merge emergency calls and geological information from adjacent slopes. This is a central issue, not a minor sensitivity result. Larger radii naturally smooth the target and aggregate more calls, which may increase R² while reducing slope-specific interpretability. The choice of Ri = 150 m as a physically consistent compromise should be justified more rigorously, perhaps using geomorphological criteria, event inventories, or sensitivity of false alarms/missed alarms.
Fourth, the manuscript would benefit from additional evaluation metrics that are more relevant for warning systems. R², MAE, RMSE, and MSE are useful for regression, but early-warning decisions depend on threshold behavior: probability of detection, false-alarm ratio, precision-recall, ROC-AUC, confusion matrices at operational thresholds, and performance during extreme rainfall events. Since the conclusion emphasizes evacuation warnings and targeted alerts, the paper should demonstrate how the regression output would be converted into warning levels and how reliable those warnings would be.
Fifth, feature importance based on XGBoost weight and gain should be interpreted cautiously. These metrics can be biased by correlated predictors, and this study includes many highly correlated rainfall windows as well as related FoS/probability/reliability variables. The paper would be stronger if the authors added SHAP analysis, permutation importance under grouped variables, or ablation tests that remove rainfall, geotechnical, anthropogenic, and geological feature groups separately.
Finally, reproducibility is limited. The manuscript states that code and data can be made available upon request and subject to cooperation agreements or municipal approval. While privacy concerns around emergency-call data are legitimate, the authors should consider releasing anonymized, aggregated, or synthetic benchmark datasets, together with the trained model workflow and hyperparameter settings. This would substantially improve transparency and allow independent verification.