This preprint is distributed under the Creative Commons Attribution 4.0 License.
Addressing Class Imbalance in Soil Movement Predictions
Abstract. Landslides threaten human life and infrastructure, resulting in fatalities and economic losses. Monitoring stations provide valuable data for predicting soil movement, which is crucial in mitigating this threat. Accurately predicting soil movement from monitoring data is challenging due to its complexity and inherent class imbalance. This study develops machine learning (ML) models with oversampling techniques to address the class imbalance issue and build a robust soil movement prediction system. The dataset, comprising two years (2019–2021) of monitoring data from a landslide in Uttarakhand, was split in a 70:30 ratio for training and testing. To tackle the class imbalance problem, various oversampling techniques, including the Synthetic Minority Oversampling Technique (SMOTE), K-Means SMOTE, Borderline SMOTE, Support Vector Machine SMOTE, and Adaptive Synthetic Sampling (ADASYN), were applied to the dataset. Several ML models, namely Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Category Boosting (CatBoost), Long Short-Term Memory (LSTM), Multilayer Perceptron (MLP), and dynamic ensemble models, were trained and compared for soil movement prediction. Among these models, the dynamic ensemble model with K-Means SMOTE performed best in testing, with an accuracy, precision, and recall of 99.68 % each and an F1-score of 0.9968. The RF model with K-Means SMOTE was the second-best performer, achieving an accuracy, precision, and recall of 99.64 % each and an F1-score of 0.9964. These results show that ML models combined with class imbalance techniques have the potential to significantly improve soil movement predictions in landslide-prone areas.
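As a rough illustration of the workflow the abstract describes (oversample the minority classes on the training portion, then train and evaluate a classifier on a 70:30 split), the sketch below uses scikit-learn and imbalanced-learn. The synthetic data, the SMOTE variant shown, and all variable names are placeholders, not the authors' data or code.

```python
# Minimal sketch of the oversampling + classification workflow described in
# the abstract. The synthetic data is a placeholder for the monitoring
# features and the four soil-movement classes; it is NOT the authors' dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
# The paper's other variants (KMeansSMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN)
# are drop-in replacements from the same module.

# Imbalanced stand-in for the monitoring data (4 movement classes).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_classes=4, weights=[0.85, 0.08, 0.05, 0.02],
                           random_state=42)

# 70:30 train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample the minority classes on the training portion only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Train one of the compared models (Random Forest) and score on the
# untouched, still-imbalanced test set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test), digits=4))
```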
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2023-1417', Anonymous Referee #1, 19 Dec 2023
This study employed a complete pipeline to build several ML algorithms, also accounting for class imbalance and the effects that different oversampling algorithms have on model performance. The introduction and state of the art are clear and well described. The different ML algorithms are individually reported and described in perhaps too much detail. The monitored landslide is presented only in geographical terms; it might be useful to provide more details on the characteristics of the landslide. Different predictive features from an in-situ monitoring station were used to train and predict future landslide movements. The soil movements were split into four classes. Different SMOTE variants were then used to oversample the minority class. However, it is not clear whether the other two minority classes were oversampled or whether they were removed in subsequent analyses. The main results are synthesised in Tables 5 and 6. In my opinion, these two tables are not enough to convey the effect of oversampling. In most cases the data without oversampling returns better scores than the oversampled data in all the metrics, which does not allow the reader to understand the cause. Furthermore, scores so close to 1 might suggest data leakage between training and model testing. It would be worthwhile to revise the data-splitting procedure and implement the pipeline with cross-validation to avoid this issue. Chapter 7 is just conclusions; the critical investigation of results (i.e., the discussion) is completely missing.
Citation: https://doi.org/10.5194/egusphere-2023-1417-RC1
- AC1: 'Reply on RC1', Praveen Kumar, 04 Jan 2024
Dear Anonymous Referee #1,
Thank you for your positive and constructive comments. Please find attached a PDF file containing a documented list of changes we have made to the manuscript (marked R: in blue font). We have shortened the introduction of the ML algorithms, detailed the majority and minority class sampling used in the training and testing, revised the results by 5-fold cross-validation, and refined the discussion and conclusion of the findings. We hope these clarifications will improve the reader’s understanding of our work.
Kind Regards,
Praveen Kumar
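To illustrate the 5-fold cross-validation with oversampling confined to the training folds, as suggested by the referee and adopted in the reply above, here is a minimal sketch using imbalanced-learn's sampler-aware pipeline; the synthetic data and all names are placeholders, not the authors' actual pipeline.

```python
# Sketch of leakage-safe evaluation: the oversampler sits inside an
# imbalanced-learn Pipeline, so each cross-validation training fold is
# resampled independently and the held-out fold keeps its original
# (imbalanced) class distribution. X and y are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline

X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_classes=4, weights=[0.85, 0.08, 0.05, 0.02],
                           random_state=0)

pipe = Pipeline([
    ("oversample", SMOTE(random_state=0)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print(f"5-fold macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```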
- RC2: 'Reply on AC1', Anonymous Referee #1, 22 Jan 2024
Dear Authors, thank you for revising the manuscript.
In my opinion, it is now acceptable for publication. Please check the following sentence again, as it is repeated in the same paragraph: "Furthermore, the dynamic ensemble model incorporating SMOTE emerges as the second-best model in the test phase, showcasing high accuracy, precision, and recall rates of 0.993, 0.872, and 0.950, respectively, along with an F1 score of 0.907. This result reinforces the reliability and robustness of the model in tackling landslide prediction tasks."
Citation: https://doi.org/10.5194/egusphere-2023-1417-RC2
- AC2: 'Reply on RC2', Praveen Kumar, 22 Jan 2024
Dear Anonymous Referee #1,
Thank you very much for considering this manuscript for publication, and many thanks for the positive comments and reviews. We have carefully addressed your suggestions, and as you recommended, we have removed the repeated sentence from the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2023-1417-AC2
- RC3: 'Comment on egusphere-2023-1417', Anonymous Referee #2, 28 Feb 2024
This paper presents the development of machine learning (ML) models with oversampling techniques to address the class imbalance issue, which is essential to developing a robust soil movement prediction system.
The paper is well-written and easy to follow. I have some significant questions regarding the proposed methods:
(i) How much do the model parameter values change with different training data sets? The authors should also evaluate their method on different training sets.
(ii) The results should also contain the accuracy of each class. A truth table would be beneficial for understanding the method's performance (a minimal sketch of such per-class reporting follows this list).
(iii) The RF model shows 100 % performance on the training data, which might indicate overfitting. Please check for overfitting in the model.
(iv) The paper should describe the precautions the authors took to ensure that no information from the training samples was mixed with the testing samples.
(v) It would be good to compare, in a plot, results without a balanced dataset versus with a balanced dataset (using oversampling techniques). Also, discuss why no oversampling performs better than oversampling in some cases.
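The per-class reporting suggested in point (ii) could look like the minimal sketch below; the label and prediction arrays are invented placeholders, not the study's results.

```python
# Per-class reporting as suggested in point (ii): a confusion matrix plus
# per-class precision/recall/F1. The arrays below are invented placeholders
# standing in for the test labels and a model's predictions.
from sklearn.metrics import confusion_matrix, classification_report

y_test = [0, 0, 0, 0, 0, 1, 1, 2, 2, 3]   # true movement classes (placeholder)
y_pred = [0, 0, 0, 1, 0, 1, 1, 2, 0, 3]   # predicted classes (placeholder)

print(confusion_matrix(y_test, y_pred))    # rows = true class, columns = predicted
print(classification_report(y_test, y_pred, digits=3))
```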
Some minor comments:
Figure 1 text is hard to read. Please increase the font size of Figure 1.
Citation: https://doi.org/10.5194/egusphere-2023-1417-RC3
- AC3: 'Reply on RC3', Praveen Kumar, 18 Mar 2024
Dear Anonymous Referee #2,
Thank you for your positive and constructive comments. We have carefully considered your comments and made several revisions to the manuscript (marked Response: in blue font). Firstly, we conducted a parameter variation analysis on different datasets to assess how parameters change across datasets. Secondly, we refined the results by incorporating insights from 5-fold cross-validation. Lastly, we enhanced the discussion and conclusion sections to provide a clearer understanding of our findings. We believe that these revisions will significantly improve the clarity and impact of our work.
Kind Regards,
Praveen Kumar
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 259 | 121 | 33 | 413 | 21 | 18 |
Praveen Kumar
Priyanka Priyanka
Kala Venkata Uday
Varun Dutt