This work is distributed under the Creative Commons Attribution 4.0 License.
Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach
Abstract. Machine learning (ML) models have become key ingredients of digital soil mapping. To improve the interpretability of their predictions, diagnostic tools have been developed, such as the widely used local attribution approach known as SHAP (SHapley Additive exPlanations). However, analysing the prediction is only one part of the problem, and there is interest in gaining deeper insights into the drivers of the prediction uncertainty as well, i.e. in explaining why the ML model is confident given the chosen covariates' values, in addition to why it delivered some particular result. We show in this study how to apply SHAP to local prediction uncertainty estimates for a case of urban soil pollution, namely the presence of petroleum hydrocarbons in soil at Toulouse (France), which pose a health risk via vapour intrusion into buildings, direct soil ingestion, or groundwater contamination. To alleviate the computational burden posed by the multiple covariates (typically more than 10) and by the large number of grid points on the map (typically several tens of thousands), we propose an approach that combines screening analysis (to filter out non-influential covariates) with the grouping of dependent covariates by means of generic kernel-based dependence measures. Our results show that the drivers of the best-estimate prediction are not necessarily the ones that drive the confidence in that prediction, which justifies making decisions about data collection, covariate characterisation, and communication of the results accordingly.
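The screening and grouping steps both rely on a kernel-based dependence measure, the Hilbert-Schmidt Independence Criterion (HSIC). As a rough, generic illustration (a textbook estimator, not the authors' implementation), the following NumPy sketch computes the biased empirical HSIC with Gaussian kernels and the median bandwidth heuristic; it is near zero for independent samples and larger for dependent ones, which is what makes it usable for screening covariates:

```python
import numpy as np

def rbf_gram(x, sigma=None):
    """Gaussian-kernel Gram matrix for a 1-D sample; the bandwidth defaults
    to the median heuristic over the non-zero pairwise distances."""
    d2 = (x[:, None] - x[None, :]) ** 2
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y):
    """Biased empirical HSIC estimate trace(K H L H) / n^2, where H is the
    centring matrix: ~0 under independence, larger under dependence."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=300)
dep = hsic(x, x**2)                    # nonlinearly dependent pair
ind = hsic(x, rng.normal(size=300))    # independent pair
```

Note that the dependent pair here is nonlinear (x versus x squared), a case where a Pearson-correlation screen would fail but a kernel-based measure does not.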
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2024-323', Anonymous Referee #1, 19 Mar 2024
This is a review for the manuscript Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach by Rohmer et al. The authors use SHAP, a common tool for assessing machine learning predictions at a local scale, to investigate the contribution of covariates (or rather groups of covariates) to the uncertainty of a random forest model. It is well known that Shapley values are very expensive to compute, and so the authors propose to reduce the number of covariates to speed up computations. This is done before model training (a rather odd proposal) by using a statistical dependence test (i.e., HSIC), and then after model training by grouping covariates (again with the same dependence test). The main aim of investigating covariates with respect to the model's uncertainty is intriguing within the field of digital soil mapping, but the manuscript has some major flaws. Major concerns are related to the methodology of the entire covariate selection procedure as well as to the presented case study. The quality of the writing is also unfortunately poor.
Main methodological concerns
• My first criticism is related to the first step, that is, the elimination of covariates before model training. This is a common pitfall within machine learning in DSM. The problem is data leakage, which may cause bias, and this occurs when covariates are removed using the entire training data set rather than within each fold of, for example, a cross-validation. Note that any data preprocessing (e.g., normalisation) dealt with in such a way can lead to data leakage. Data leakage may also cause the model's uncertainty to be lower, and this is then also problematic if interpretable machine learning (IML) methods (like SHAP) are used to analyse the relationships between covariates and the model's uncertainty. In addition, with a model such as random forest, covariate selection is not really required, especially with so few covariates (i.e., 15). I invite the authors to refer to work such as that of Zhu et al. (2023) for guidance on data preparation so that data leakage is avoided.
• Linking to my previous point: if the goal is to speed up computations, then removing covariates should not be the first choice. In addition, in typical DSM projects the number of covariates is usually more than 100, so the presented case study, which has only 15 covariates, is not the best choice to showcase the proposed methodology. One could instead estimate Shapley values at a sample of grid cells, as in the Wadoux et al. (2023) paper, for example. Again, in many DSM projects maps are created over millions of grid cells, so the presented case study is not the best one to showcase this methodology. Therefore, to speed up computations with a small data set (like the one in this study), I would rather use a stronger machine to do the calculations than omit potentially important parts of my data. If that is not possible, then let the computations run for a few days.
• The grouping of covariates is a practical way of speeding up computation, but I am afraid it holds no meaning for DSM practitioners. The authors acknowledge this in the discussion, starting at Line 519. Doing inference on machine learning output with IML methods is hard enough. I cannot see how the grouping of covariates could hold much interpretive meaning.
• To sum up, exploring the relationship between covariates and model uncertainty is intriguing and worth pursuing. However, the paper's emphasis on reducing computation with (questionable?) methods distracts from its main goal. That is, I would have liked to see a more in-depth analysis of the covariates related to SHAP (prediction) versus SHAP (uncertainty), with more emphasis on whether we expect the same covariates to be related to both and on why different covariates appear for predictions versus uncertainty.
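The data-leakage pitfall raised in the first bullet above can be sketched in a few lines (generic NumPy, with standardisation standing in for the preprocessing step; the same per-fold pattern applies to covariate selection):

```python
import numpy as np

def kfold(n, k, seed=0):
    """Shuffled indices split into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def fit_scaler(X):
    """Learn preprocessing statistics on the training fold ONLY.
    The leaky anti-pattern would call this on all of X before splitting."""
    return X.mean(axis=0), X.std(axis=0)

def apply_scaler(X, mu, sd):
    return (X - mu) / sd

X = np.random.default_rng(1).normal(loc=5.0, size=(100, 3))
folds = kfold(len(X), k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    mu, sd = fit_scaler(X[train_idx])        # no test-fold information used
    Xtr = apply_scaler(X[train_idx], mu, sd)
    Xte = apply_scaler(X[test_idx], mu, sd)  # transformed with train statistics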
Some other concerns / suggestions
• The synthetic case study adds no value to the paper. I suggest removing it as the paper is already a bit long for the topic at hand.
• Section 3.1 is difficult to follow without knowledge of HSIC and some of the information in the many cited references. Consider restructuring the manuscript to include the essential methodology.
• Random forests are standard and already widely known in DSM. The sections on RF and QRF can be removed, and replaced with brief references to RF and QRF.
• The maps presented in this manuscript are of poor quality and not visually appealing. Captions and legends can also be improved. In Figure 3, show more information; not everyone is familiar with this region of France. The histogram is not very clear; the long right tail in particular could be enhanced visually.
• The general writing of the manuscript is poor. Some examples: the overuse of "etc.", too many brackets used to give additional information, brief introductions at each section.
• The mathematical writing can also be improved. For example, are the authors sure that the ML model is just y = f(x)? See Line 142.
• Figure 6 does not make sense. Why is there an arrow from Step 2 to 4?
References:
Wadoux, A., Saby, N., Martin, M. (2023). Shapley values reveal the drivers of soil organic carbon stock prediction. SOIL, 9, 21-38. doi: 10.5194/soil-9-21-2023.
Zhu et al. (2023). Machine Learning in Environmental Research: Common Pitfalls and Best Practices. https://pubs.acs.org/doi/10.1021/acs.est.3c00026
Citation: https://doi.org/10.5194/egusphere-2024-323-RC1
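The contrast the reviewer asks for, SHAP (prediction) versus SHAP (uncertainty), can be made concrete with a toy sketch (my own construction, not the paper's case study or code): exact interventional Shapley values computed by coalition enumeration, applied once to a point-prediction function and once to a simple heteroscedastic spread used as an uncertainty proxy. Enumeration is feasible here because there are only a handful of covariates:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley(f, x, background):
    """Exact interventional Shapley values for one point x:
    v(S) = mean over the background sample of f with features in S pinned to x."""
    d = len(x)
    def v(S):
        Xb = background.copy()
        Xb[:, list(S)] = x[list(S)]
        return f(Xb).mean()
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(d):
            for S in combinations(others, r):
                w = factorial(r) * factorial(d - r - 1) / factorial(d)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
predict = lambda X: 3 * X[:, 0] + X[:, 1]   # point prediction: driven by x0, x1
spread  = lambda X: np.abs(X[:, 2])         # uncertainty proxy: driven by x2 only
x = np.array([1.0, 1.0, 2.0])
phi_pred = shapley(predict, x, background)
phi_unc  = shapley(spread, x, background)
```

By construction, phi_pred assigns exactly zero to the third covariate while phi_unc assigns exactly zero to the first two: the drivers of the best estimate and of the confidence in it need not coincide, which is the paper's central point. On a map, this computation would be run at a subsample of grid cells, as the reviewer suggests.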
AC1: 'Reply on RC1', Jeremy Rohmer, 29 Apr 2024
We would like to thank Referee #1 for the constructive comments. We agree with most of the suggestions and have therefore modified the manuscript to take them on board. In the attached document we recall the reviews and reply to each comment in turn (our replies outlined in blue). The main corrections made to the manuscript are described in a specific section of each response.
RC2: 'Comment on egusphere-2024-323', Anonymous Referee #2, 11 Apr 2024
This manuscript is well written, clear and relevant, and presents methods that could provide stakeholders with valuable insights into where the uncertainty comes from: this has the potential to make uncertainty more concrete for them.
I appreciate the use of a synthetic test case, which makes the whole procedure a lot easier to understand.
I don’t have any major criticisms. I would be pleased to see this manuscript published after attention to the following minor details:
Line 44: However, at a local scale, these methods don’t (?) provide any information for a prediction at a certain spatial location.
Line 157: pushes the prediction uncertainty?
Line 442: I don’t see any circular pattern on the bottom middle panel of Figure 13 (in the bottom right one however, they are really clear).
Synthetic test case: couldn’t the fact that in Z1 the biggest contributor to uncertainty is Tmean-Tmax (and, respectively, that in Z2 the biggest contributor is Pwettest) be linked to the fact that these covariates have uniquely high (respectively low) values there that are not represented in the dataset? If you agree, this would in my opinion be interesting to put in the discussion.
Citation: https://doi.org/10.5194/egusphere-2024-323-RC2
AC2: 'Reply on RC2', Jeremy Rohmer, 29 Apr 2024
We would like to thank Referee #2 for the positive analysis and the constructive comments. We agree with most of the suggestions and, therefore, we have modified the manuscript to take on board their comments. We recall the reviews in the attached document and we reply to each of the comments in turn.
Peer review completion
Journal article(s) based on this preprint
Data sets
Data to run the synthetic test case, Hannah Meyer, https://github.com/HannaMeyer/CAST/tree/master/inst/extdata
Model code and software
R markdown (synthetic test case), Jeremy Rohmer, https://github.com/anrhouses/groupSHAP-uncertainty
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
357 | 82 | 26 | 465 | 18 | 16
Stephane Belbeze
Dominique Guyonnet