the Creative Commons Attribution 4.0 License.
Predicting Slope Instabilities in Salvador, Brazil, using Machine Learning and Georeferenced Data
Abstract. Municipalities worldwide struggle with slope instability, a particularly pressing issue in cities such as Salvador, Brazil, where rugged terrain, escarpments, and complex geology create a high risk of instability. The complexity of the problem is evident in the variability of terrain properties, the bedrock’s inherited structural features, and anthropogenic action. This range of variables makes it well-suited to machine learning (ML) approaches for instability prediction. Although ML has experienced an impressive recent boost, only a few cases have applied ML to real-world instability events. In this paper, a data bank of hydromechanical properties of soils is used in conjunction with a digital terrain model (DTM) and different geo-referenced information, including rainfall, vegetation coverage, geological structures, sewage collection/treatment status, and residential density, to predict the occurrence of soil mass movements and related emergency calls to the municipality from the population living in risk areas. 13,522 emergency calls were considered during the period from 2020 to 2025. Excellent predictive performance, with an R² ≈ 0.98 consistently across both the validation and testing phases, was obtained in this original study. This strong result underscores the viability of machine learning as a powerful tool for this kind of problem, particularly within municipal warning systems.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2026-2056', Milad Basirifard, 04 Jun 2026
-
AC1: 'Reply on RC1', Sandro Machado, 08 Jun 2026
Please find below our responses (R) to the RC1 comments:
This manuscript presents an important and timely contribution to urban landslide-risk prediction by combining geotechnical data, a high-resolution DTM, rainfall records, geological structures, sewage-service information, vegetation cover, residential density, and municipal emergency-call records for Salvador, Brazil. The use of 13,522 emergency calls from 2020–2025 is particularly valuable, since operational municipal landslide datasets are rarely available at this scale. The reported performance is very high, with R² values exceeding 0.94 across scenarios and reaching approximately 0.98 in some validation/test configurations. However, several methodological aspects require clarification before the results can be considered fully robust for operational early-warning use.
R - The authors thank the reviewer for his kind words and valuable feedback. All comments are addressed below in sequence, and most of the reviewer's suggestions will be incorporated into the next version of the paper.
First, the definition of the output variable deserves more scrutiny. The model predicts a weighted emergency-call index rather than directly predicting slope failure occurrence. Confirmed and unconfirmed calls are combined using an RFU weighting factor, and the authors tune RFU, the time interval Ti, and the radius of influence Ri to maximize predictive performance. This is understandable from an operational perspective, but it also means that the target variable partly reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The manuscript should discuss this distinction more explicitly and avoid implying that the model purely predicts geotechnical failure.
R – The reviewer is right that the output variable reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The city of Salvador has implemented warning systems and direct communication channels to reach the population in risk areas. However, the population's perception of risk does not always correspond to instability events, and when confirmed, landslides exhibit a wide range of scales and impacts, from economic losses to loss of human life. The complexity of the problem is therefore evident, involving geotechnical and geological aspects, the effects of anthropogenic actions, the quality and coverage of public services, and the human response when confronted with risk situations. Using landslides as the output variable would require deeper investigation of the slip surface geometry, determination of local soil properties, and back-analysis of the events. The authors believe that the population's perception of risk can be a valuable source of feedback: residents live in these risk areas and, unfortunately, most of them have already experienced instability events, which gives them a kind of natural perception that lets them recognize local indicators of instability and contact the municipal authorities. These aspects will be highlighted in the next version of the paper.
Second, the random train/validation/test split may overestimate model generalization because nearby slopes and repeated rainfall/event conditions are likely spatially and temporally autocorrelated. The database was randomly split into 60% training, 20% validation, and 20% testing after stratification of the output variable. For a geospatial hazard model, a stronger test would be spatial block cross-validation, temporal holdout validation, or testing on independent storm seasons/neighborhoods. Without such validation, the very high R² values may partly reflect spatial leakage or repeated event structures rather than true predictive skill for unseen areas or future events.
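As an illustration of the spatially grouped validation suggested in this comment, the following is a minimal sketch, not the authors' procedure. It assumes each record carries projected coordinates in metres under the hypothetical keys `x` and `y`, and the 500 m block size is an arbitrary placeholder. Whole grid blocks are assigned to either the training or the test subset, so neighbouring slopes cannot be separated across the two.

```python
import random
from collections import defaultdict


def block_id(x, y, block_size_m=500.0):
    """Map projected coordinates (metres) to a square grid-block identifier."""
    return (int(x // block_size_m), int(y // block_size_m))


def spatial_block_split(records, test_fraction=0.2, block_size_m=500.0, seed=0):
    """Split records so that no spatial block appears in both subsets.

    records: list of dicts with hypothetical keys "x" and "y" (metres).
    test_fraction applies to the number of blocks, not the number of records.
    Returns (train_records, test_records).
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_id(rec["x"], rec["y"], block_size_m)].append(rec)

    ids = sorted(blocks)                     # deterministic order before shuffling
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(test_fraction * len(ids)))
    test_ids = set(ids[:n_test])

    train = [r for b in ids if b not in test_ids for r in blocks[b]]
    test = [r for b in ids if b in test_ids for r in blocks[b]]
    return train, test
```

The same idea extends to a validation subset by holding out a second set of blocks. Block size is a modelling choice and would need to be justified against the spatial range of the autocorrelation.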
Fourth, the manuscript would benefit from additional evaluation metrics that are more relevant for warning systems. R², MAE, RMSE, and MSE are useful for regression, but early-warning decisions depend on threshold behavior: probability of detection, false-alarm ratio, precision-recall, ROC-AUC, confusion matrices at operational thresholds, and performance during extreme rainfall events. Since the conclusion emphasizes evacuation warnings and targeted alerts, the paper should demonstrate how the regression output would be converted into warning levels and how reliable those warnings would be.
R – The authors agree with the reviewer and acknowledge this limitation at the current stage of the project. The nearby slopes have indeed repeated or very similar rainfall event conditions, as they are spatially and temporally autocorrelated. These events, as discussed in the text, were interpolated in time and space from forty weather stations installed in the city, thereby making rainfall events similar across neighboring slopes at a given moment.
The authors, however, plan to perform temporal holdout validation during the interactive implementation of this model and gradually replace the warning and emergency declaration criteria currently used by the municipality. The city is now in its rainy period. All data produced during this period will support the model implementation, serving as a temporal holdout validation (walk-forward validation with sliding windows to test memory depth, since most of the variables tend to evolve, e.g., the quality of the sewage collection system and the geometry of the slopes). At this stage, new metrics will be employed in addition to R², MAE, RMSE, and MSE. At least two thresholds must be defined at this stage, considering the predicted and experimental (observed) values of the output variable: a warning state and an emergency state. Declaring an emergency without confirmed events (false alarms) can jeopardize the entire system, leading the population to gradually discredit it and exposing them to risk in future events, which underscores the importance of the aspects highlighted by the reviewer. However, the authors intend to present these developments in a future publication.
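The walk-forward scheme with a sliding training window mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: periods are abstract indices (for example months or rainy seasons), and the function name and parameters are hypothetical. Running it with several `train_window` values is one way to probe how much history the model actually needs.

```python
def walk_forward_splits(n_periods, train_window, test_window=1):
    """Yield (train_periods, test_periods) index lists in chronological order.

    Each fold trains on the `train_window` periods immediately preceding the
    test block, so the oldest data slides out of the window as time advances.
    """
    start = 0
    while start + train_window + test_window <= n_periods:
        train = list(range(start, start + train_window))
        test = list(range(start + train_window,
                          start + train_window + test_window))
        yield train, test
        start += test_window  # advance by one test block


# Usage: six periods, three periods of memory, one period tested per fold.
for train, test in walk_forward_splits(6, train_window=3):
    print(train, "->", test)
```

Every test period lies strictly after its training periods, which is the property a random split does not guarantee.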
Third, the radius of influence has a major effect on model performance. The authors report that Ri = 300 m gives the highest R², but also acknowledge that such a radius may merge emergency calls and geological information from adjacent slopes. This is a central issue, not a minor sensitivity result. Larger radii naturally smooth the target and aggregate more calls, which may increase R² while reducing slope-specific interpretability. The choice of Ri = 150 m as a physically consistent compromise should be justified more rigorously, perhaps using geomorphological criteria, event inventories, or sensitivity of false alarms/missed alarms.
R – The authors thank the reviewer for this observation and suggestion. In the revised version of the paper, the radius values are compared with the plan-view extent of the slopes and with the distances between the centroids of the nearest slope areas and the geological features (faults, foliations, and dikes). Using these measures, the authors believe a radius value can be readily chosen to prevent merging emergency calls.
Fifth, feature importance based on XGBoost weight and gain should be interpreted cautiously. These metrics can be biased by correlated predictors, and this study includes many highly correlated rainfall windows as well as related FoS/probability/reliability variables. The paper would be stronger if the authors added SHAP analysis, permutation importance under grouped variables, or ablation tests that remove rainfall, geotechnical, anthropogenic, and geological feature groups separately.
R – The authors thank the reviewer for this important observation and agree that feature importance metrics based on XGBoost weight and gain should be interpreted with caution in the present study. The model variables include multiple cumulative rainfall windows as well as FoS, probability, and reliability, which are clearly interdependent, and this makes it difficult to account accurately for their individual contributions to model performance.
However, the paper's main focus is the application of ML to a specific, practical problem rather than an assessment of different procedures for analyzing variable importance. Furthermore, the use of different cumulative rainfall windows is justified by differences in soil permeability, which make longer periods more relevant for more impervious soils, for instance. Concerning the triad FoS/probability/reliability, for each situation (saturated or natural water content), probability and reliability depend on, and are redundant with, a given FoS. However, this dependency is highly nonlinear, and the variable scales are very different. Nevertheless, the authors recognize the potential value of complementary interpretability analyses and will evaluate the possibility of incorporating additional techniques, as suggested by the reviewer.
Finally, reproducibility is limited. The manuscript states that code and data can be made available upon request and subject to cooperation agreements or municipal approval. While privacy concerns around emergency-call data are legitimate, the authors should consider releasing anonymized, aggregated, or synthetic benchmark datasets, together with the trained model workflow and hyperparameter settings. This would substantially improve transparency and allow independent verification.
R – Please note that the paper already shares much of the information used, e.g., the quality of the sewage services, the DTM, and the presence of residences and vegetation. The FoS/probability/reliability values and the code can be easily shared (the authors intend to do this in the next version of the paper). However, in addition to privacy concerns about emergency-call data, the Federal University of Bahia has partnership agreements with the Salvador Municipality that prohibit data sharing except in special circumstances. For this reason, the paper states that “... their disclosure to a third party is only possible after filtering and approval by the city administration.”
Citation: https://doi.org/10.5194/egusphere-2026-2056-AC1
-
CC1: 'Comment on egusphere-2026-2056', Mohammad Mokhtari, 08 Jun 2026
The study addresses a real and pressing urban hazard problem, and the dataset is impressive — 13,522 emergency calls over five years give genuine statistical depth. The R² ≈ 0.98 across both validation and testing phases is a strong result, and the multi-variable approach combining hydromechanical soil properties, rainfall, vegetation, geology, sewage infrastructure, and residential density reflects the genuine complexity of urban slope failure. The focus on emergency calls as the target variable is clever — it grounds the prediction in operationally meaningful outcomes rather than abstract hazard indices.
An R² of 0.98 is almost suspiciously high for a real-world geohazard prediction problem and warrants scrutiny. The key question is whether the model is genuinely predictive or whether it has been over-fitted to the training data. The paper claims consistency across validation and testing phases, which would address this concern — but the reviewer would want to see the train/test split methodology clearly described and ideally an independent holdout period not used in any parameter tuning.
Citation: https://doi.org/10.5194/egusphere-2026-2056-CC1
-
AC2: 'Reply on CC1', Sandro Machado, 08 Jun 2026
Please find below our responses (R) to the CC1 comments:
The study addresses a real and pressing urban hazard problem, and the dataset is impressive — 13,522 emergency calls over five years give genuine statistical depth. The R² ≈ 0.98 across both validation and testing phases is a strong result, and the multi-variable approach combining hydromechanical soil properties, rainfall, vegetation, geology, sewage infrastructure, and residential density reflects the genuine complexity of urban slope failure. The focus on emergency calls as the target variable is clever — it grounds the prediction in operationally meaningful outcomes rather than abstract hazard indices.
R - The authors sincerely thank the commenter for the kind words!
An R² of 0.98 is almost suspiciously high for a real-world geohazard prediction problem and warrants scrutiny. The key question is whether the model is genuinely predictive or whether it has been over-fitted to the training data. The paper claims consistency across validation and testing phases, which would address this concern — but the reviewer would want to see the train/test split methodology clearly described and ideally an independent holdout period not used in any parameter tuning.
R - At this stage of the study, the main measure against overfitting was a random split into 60% training, 20% validation, and 20% testing after stratifying on the output variable. The authors are aware that time series modeling is challenging because, in many circumstances, as in this case, the problem is highly dynamic and evolves, so past data may not be adequate to model a new period. To give two examples, the DTM changes quickly in risk areas and in non-consolidated settlements, as does the quality of public services. The authors, however, plan to perform temporal holdout validation during the interactive implementation of this model and gradually replace the municipality's current warning and emergency declaration criteria. The city is now in its rainy period. All data produced during this period will support the model implementation, serving as a temporal holdout validation (walk-forward validation with sliding windows to test memory depth).
Citation: https://doi.org/10.5194/egusphere-2026-2056-AC2
-
RC2: 'Comment on egusphere-2026-2056', Farshad Bahootoroody, 10 Jun 2026
The manuscript presents a potentially valuable machine-learning framework for assessing urban slope instability in Salvador, Brazil. The integration of geotechnical data, probabilistic slope-stability indicators, rainfall records, geological structures, vegetation cover, residential density, sewage-service quality, stabilization measures, and municipal emergency calls is relevant to urban hazard management.
However, the current analysis does not yet demonstrate that the model can reliably predict future slope-instability events. The principal concern is the construction and validation of the machine-learning dataset. Slopes without emergency calls appear to be excluded, which removes true negative cases and limits the interpretation of the model. In addition, approximately two million derived observations are randomly divided into training, validation, and testing subsets despite spatial overlap among slope areas and temporal dependence among emergency-call clusters and rainfall periods. This approach may place closely related records in different subsets and artificially inflate the reported R² values.
The revised manuscript should clearly define the prediction target, distinguish the prediction of physical mass movements from the prediction of emergency-call intensity, include appropriate negative cases, and evaluate model performance using spatially grouped and temporally independent validation. The authors should also report event-detection metrics suitable for an early-warning system, compare the proposed model against simpler baseline approaches, clarify the operational forecasting horizon, and temper claims regarding real-time evacuation support until prospective predictive performance has been demonstrated.
Additional comments concerning the geotechnical assumptions, feature engineering, reproducibility, reference list, equations, figures, and technical corrections are provided in the attached referee report.
I recommend reconsideration after major revisions.
-
AC3: 'Reply on RC2', Sandro Machado, 11 Jun 2026
Please find below our responses (R) to the RC2 comments:
The manuscript presents a potentially valuable machine-learning framework for assessing urban slope instability in Salvador, Brazil. The integration of geotechnical data, probabilistic slope-stability indicators, rainfall records, geological structures, vegetation cover, residential density, sewage-service quality, stabilization measures, and municipal emergency calls is relevant to urban hazard management.
R - The authors thank the reviewer for his words, time, and valuable feedback. All comments have been replied to in order, and most of the reviewers' suggestions will be incorporated into the next version of the paper.
However, the current analysis does not yet demonstrate that the model can reliably predict future slope-instability events. The principal concern is the construction and validation of the machine-learning dataset. Slopes without emergency calls appear to be excluded, which removes true negative cases and limits the interpretation of the model. In addition, approximately two million derived observations are randomly divided into training, validation, and testing subsets despite spatial overlap among slope areas and temporal dependence among emergency-call clusters and rainfall periods. This approach may place closely related records in different subsets and artificially inflate the reported R² values.
R – The reviewer is right: slopes without emergency calls are excluded from the analysis. However, the system does not focus on true negative cases, because it was designed to forecast warnings and emergencies in order to prevent economic losses and loss of human life. Therefore, in the authors' opinion, there is no limitation to the model's interpretation because stable slopes and non-risk areas are not the main concern of the practical problem.
The number of observations changes with the radius of influence: for each call, the number of slopes falling inside the influence area depends on the radius used in the analysis (75, 100, 150, 200, or 300 m). The following text was added to the paper: “Increasing RFU, Ti, and Ri also increases the value of the output variable. Analysis with a higher Ri will encompass a larger area and involve more slopes per emergency call. The combinations of slopes and emergency calls, for the period of analysis, yielded approximately 872,000 and 11,000,000 observations for values of Ri of 75 m and 300 m, respectively, for each RFU, Ti, and Ri, which were modeled as described below. For the same radii, the mean number of slopes embraced by the influence area was 64.5 and 819.”
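The observation counts quoted above can be cross-checked against the 13,522 calls stated in the abstract. The short sketch below assumes, as the quoted text suggests, that each observation is one (call, slope) pair, so the total is roughly the number of calls times the mean number of slopes per influence area. The function name is hypothetical.

```python
def expected_observations(n_calls, mean_slopes_per_call):
    """Approximate number of (call, slope) pairs under the stated assumption."""
    return n_calls * mean_slopes_per_call


N_CALLS = 13_522  # emergency calls, 2020 to 2025 (from the abstract)

# (Ri in metres) -> (mean slopes per influence area, reported observations)
reported = {75: (64.5, 872_000), 300: (819.0, 11_000_000)}

for ri, (mean_slopes, n_reported) in reported.items():
    estimate = expected_observations(N_CALLS, mean_slopes)
    rel_diff = abs(estimate - n_reported) / n_reported
    print(f"Ri = {ri} m: estimate {estimate:,.0f}, "
          f"reported {n_reported:,}, relative difference {rel_diff:.2%}")
```

Both estimates fall within 1% of the reported totals, which is consistent with the rounding used in the quoted text.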
The revised manuscript should clearly define the prediction target, distinguish the prediction of physical mass movements from the prediction of emergency-call intensity, include appropriate negative cases, and evaluate model performance using spatially grouped and temporally independent validation. The authors should also report event-detection metrics suitable for an early-warning system, compare the proposed model against simpler baseline approaches, clarify the operational forecasting horizon, and temper claims regarding real-time evacuation support until prospective predictive performance has been demonstrated.
R – The authors agree with the reviewer on some of his observations. As explained in the previous answer to RC1, the output variable (emergency calls to the municipality) reflects human reporting behavior in response to risk exposure and does not always reflect actual physical instability. The paper accounts for this through the RFU weight (the failure/unconfirmed call ratio): "In order to take this into consideration in the analysis, confirmed calls were given a higher weight (RFU value) in the output variable value, which is calculated as the weighted sum of all the calls in the adopted time interval."
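To make the weighted-sum definition quoted above concrete, here is a minimal sketch under explicit assumptions that the thread does not confirm: unconfirmed calls weigh 1, confirmed calls weigh `rfu`, the time window covers the `ti_days` days ending at the evaluation day, and a call counts when it lies within `ri_m` metres of the slope reference point. The tuple layout and the default values of `rfu` and `ti_days` are hypothetical.

```python
import math


def call_index(slope_xy, t_day, calls, rfu=3.0, ti_days=7, ri_m=150.0):
    """Weighted sum of emergency calls near a slope within a time window.

    calls: iterable of (x, y, day, confirmed) tuples (hypothetical layout).
    Assumptions: unconfirmed calls weigh 1, confirmed calls weigh `rfu`,
    and the window covers the `ti_days` days ending at `t_day`.
    """
    sx, sy = slope_xy
    total = 0.0
    for x, y, day, confirmed in calls:
        in_radius = math.hypot(x - sx, y - sy) <= ri_m
        in_window = t_day - ti_days < day <= t_day
        if in_radius and in_window:
            total += rfu if confirmed else 1.0
    return total
```

Under these assumptions the index can only grow when RFU, Ti, or Ri increases, which matches the behaviour the authors describe for the output variable.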
The authors argue that each generated slope must be treated individually. Although the rainfall events in a given time interval and inside the radius of influence of the call are similar, each slope has its own geometry and, therefore, its own values of the triad FoS/probability of failure/reliability index for both situations, saturated and natural water content. Furthermore, individual slopes may intersect different geological features at different angles.
The city of Salvador has implemented warning systems and direct communication channels to reach the population in risk areas. However, the population's perception of risk does not always correspond to instability events. When confirmed, landslides exhibit a wide range of scales and impacts, from economic losses to loss of human life. Even the call location may differ substantially from the exact location of any instability, making it impossible to choose a specific section of the slope area to represent a call. Furthermore, the FoS varies enormously inside the same slope for the same call, and slope failures do not usually involve the entire slope area, but only small parts of it. Nevertheless, a high concentration of calls in a given neighborhood and time interval often reflects physical instability in some of the slopes within the influence area.
The complexity of the problem is therefore evident, involving geotechnical and geological aspects, the effects of anthropogenic actions, the quality and coverage of public services, and the human response when confronted with risk situations. However, the authors believe that the population's perception of risk can be a valuable source of feedback: residents live in these risk areas and, unfortunately, most of them have already experienced instability events, which gives them a kind of natural perception that lets them recognize local indicators of instability and contact the municipal authorities. These aspects will be highlighted in the next version of the paper.
Considering the temporally independent validation, the authors agree with the reviewer and acknowledge this limitation at the current stage of the project. The nearby slopes have indeed repeated or very similar rainfall event conditions, as they are spatially and temporally autocorrelated. These events, as discussed in the text, were interpolated in time and space from forty weather stations installed in the city, thereby making rainfall events similar across neighboring slopes at a given moment.
The authors, however, plan to perform temporal holdout validation during the interactive implementation of this model and gradually replace the municipality's current warning and emergency declaration criteria. The city is now in its rainy period. All data produced during this period will support the model implementation, serving as a temporal holdout validation (walk-forward validation with sliding windows to test memory depth, since most of the variables tend to evolve, e.g., the quality of the sewage collection system and the geometry of the slopes). At this stage, new metrics will be employed in addition to R², MAE, RMSE, and MSE. At least two thresholds must be defined at this stage, considering the predicted and experimental (observed) values of the output variable: a warning state and an emergency state. Declaring an emergency without confirmed events (false alarms) can jeopardize the entire system, leading the population to gradually discredit it and exposing them to risk in future events, underscoring the importance of the aspects highlighted by the reviewer. However, the authors intend to present these developments in a future publication.
Additional comments concerning the geotechnical assumptions, feature engineering, reproducibility, reference list, equations, figures, and technical corrections are provided in the attached referee report.
R – Thank you very much. The new version of the paper will address all of your comments.
Citation: https://doi.org/10.5194/egusphere-2026-2056-AC3
-
AC4: 'Reply on RC2', Sandro Machado, 16 Jun 2026
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 244 | 73 | 30 | 347 | 19 | 20 |
This manuscript presents an important and timely contribution to urban landslide-risk prediction by combining geotechnical data, a high-resolution DTM, rainfall records, geological structures, sewage-service information, vegetation cover, residential density, and municipal emergency-call records for Salvador, Brazil. The use of 13,522 emergency calls from 2020–2025 is particularly valuable, since operational municipal landslide datasets are rarely available at this scale. The reported performance is very high, with R² values exceeding 0.94 across scenarios and reaching approximately 0.98 in some validation/test configurations. However, several methodological aspects require clarification before the results can be considered fully robust for operational early-warning use.
First, the definition of the output variable deserves more scrutiny. The model predicts a weighted emergency-call index rather than directly predicting slope failure occurrence. Confirmed and unconfirmed calls are combined using an RFU weighting factor, and the authors tune RFU, the time interval Ti, and the radius of influence Ri to maximize predictive performance. This is understandable from an operational perspective, but it also means that the target variable partly reflects human reporting behavior, population exposure, accessibility, public awareness, and municipal response practices, not only physical instability. The manuscript should discuss this distinction more explicitly and avoid implying that the model purely predicts geotechnical failure.
Second, the random train/validation/test split may overestimate model generalization because nearby slopes and repeated rainfall/event conditions are likely spatially and temporally autocorrelated. The database was randomly split into 60% training, 20% validation, and 20% testing after stratification of the output variable. For a geospatial hazard model, a stronger test would be spatial block cross-validation, temporal holdout validation, or testing on independent storm seasons/neighborhoods. Without such validation, the very high R² values may partly reflect spatial leakage or repeated event structures rather than true predictive skill for unseen areas or future events.
Third, the radius of influence has a major effect on model performance. The authors report that Ri = 300 m gives the highest R², but also acknowledge that such a radius may merge emergency calls and geological information from adjacent slopes. This is a central issue, not a minor sensitivity result. Larger radii naturally smooth the target and aggregate more calls, which may increase R² while reducing slope-specific interpretability. The choice of Ri = 150 m as a physically consistent compromise should be justified more rigorously, perhaps using geomorphological criteria, event inventories, or sensitivity of false alarms/missed alarms.
Fourth, the manuscript would benefit from additional evaluation metrics that are more relevant for warning systems. R², MAE, RMSE, and MSE are useful for regression, but early-warning decisions depend on threshold behavior: probability of detection, false-alarm ratio, precision-recall, ROC-AUC, confusion matrices at operational thresholds, and performance during extreme rainfall events. Since the conclusion emphasizes evacuation warnings and targeted alerts, the paper should demonstrate how the regression output would be converted into warning levels and how reliable those warnings would be.
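The threshold-based scores named in this comment can be computed directly from a regression output once a warning threshold is chosen. The sketch below is illustrative only: the thresholds and the function name are hypothetical, and a warning is assumed to be issued when the predicted index reaches `pred_threshold`, with an event counted when the observed index reaches `obs_threshold`.

```python
def warning_scores(predicted, observed, pred_threshold, obs_threshold):
    """Contingency counts, probability of detection and false-alarm ratio."""
    hits = misses = false_alarms = correct_negatives = 0
    for p, o in zip(predicted, observed):
        warned = p >= pred_threshold   # model would issue a warning
        event = o >= obs_threshold     # an event actually occurred
        if warned and event:
            hits += 1
        elif warned:
            false_alarms += 1
        elif event:
            misses += 1
        else:
            correct_negatives += 1

    nan = float("nan")
    # POD: fraction of observed events that were warned.
    pod = hits / (hits + misses) if hits + misses else nan
    # FAR: fraction of issued warnings that were not followed by an event.
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    return {"hits": hits, "misses": misses, "false_alarms": false_alarms,
            "correct_negatives": correct_negatives, "pod": pod, "far": far}
```

Note that correct negatives can only be counted if slopes and periods without calls are present in the evaluation set.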
Fifth, feature importance based on XGBoost weight and gain should be interpreted cautiously. These metrics can be biased by correlated predictors, and this study includes many highly correlated rainfall windows as well as related FoS/probability/reliability variables. The paper would be stronger if the authors added SHAP analysis, permutation importance under grouped variables, or ablation tests that remove rainfall, geotechnical, anthropogenic, and geological feature groups separately.
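Permutation importance under grouped variables, one of the options listed in this comment, can be sketched without any ML library. This is a generic illustration, not the authors' workflow: `predict` stands for any fitted model wrapped as a callable, and the group names and column indices are hypothetical. All columns in a group are shuffled with the same row permutation, so correlated features (for example several rainfall windows) are disturbed together rather than one at a time.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def grouped_permutation_importance(predict, X, y, groups, n_repeats=5, seed=0):
    """Mean drop in R² when each feature group is permuted jointly.

    predict: callable mapping a list of rows to a list of predictions.
    groups:  dict of group name -> list of column indices shuffled together.
    """
    rng = random.Random(seed)
    baseline = r2_score(y, predict(X))
    importance = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)                  # one permutation per repeat
            X_perm = [list(row) for row in X]   # copy so X stays untouched
            for i, src in enumerate(order):
                for c in cols:
                    X_perm[i][c] = X[src][c]
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importance[name] = sum(drops) / n_repeats
    return importance
```

A group whose permutation leaves R² unchanged contributes nothing the model uses. For a fair estimate, the rows in `X` should come from held-out data rather than the training set.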
Finally, reproducibility is limited. The manuscript states that code and data can be made available upon request and subject to cooperation agreements or municipal approval. While privacy concerns around emergency-call data are legitimate, the authors should consider releasing anonymized, aggregated, or synthetic benchmark datasets, together with the trained model workflow and hyperparameter settings. This would substantially improve transparency and allow independent verification.