Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

Rohmer, Jeremy; Belbeze, Stephane; Guyonnet, Dominique

doi:10.5194/egusphere-2024-323

Preprints

https://doi.org/10.5194/egusphere-2024-323

21 Feb 2024

Abstract. Machine learning (ML) models have become key ingredients for digital soil mapping. To improve the interpretability of their prediction, diagnostic tools have been developed like the widely used local attribution approach known as ‘SHAP’ (SHapley Additive exPlanation). However, the analysis of the prediction is only one part of the problem and there is an interest in getting deeper insights into the drivers of the prediction uncertainty as well, i.e. to explain why the ML model is confident, given the set of chosen covariates’ values (in addition to why the ML model delivered some particular results). We show in this study how to apply SHAP to the local prediction uncertainty estimates for a case of urban soil pollution, namely the presence of petroleum hydrocarbon in soil at Toulouse (France), which poses a health risk via vapour intrusion into buildings, direct soil ingestion or groundwater contamination. To alleviate the computational burden posed by the multiple covariates (typically >10) and by the large number of grid points on the map (typically over several 10,000s), we propose to rely on an approach that combines screening analysis (to filter out non-influential covariates) and grouping of dependent covariates by means of generic kernel-based dependence measures. Our results show evidence that the drivers of the prediction best estimate are not necessarily the ones that drive the confidence in these predictions, hence justifying that decisions regarding data collection and covariates’ characterisation as well as communication of the results should be made accordingly.
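
The idea of attributing an uncertainty estimate, rather than a prediction, to covariates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_uncertainty`, the covariate names and the baseline point are hypothetical, and covariates outside a coalition are simply fixed at a baseline value, whereas SHAP implementations typically average over a background data set.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, x, baseline):
    """Exact Shapley values of value_fn at point x, relative to a baseline point.

    Covariates outside a coalition are fixed at their baseline value
    (a simplifying assumption made for this sketch).
    """
    names = list(x)
    d = len(names)
    phi = {}
    for target in names:
        others = [n for n in names if n != target]
        total = 0.0
        for size in range(d):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for coalition in combinations(others, size):
                without = {n: x[n] if n in coalition else baseline[n] for n in names}
                with_target = dict(without, **{target: x[target]})
                total += weight * (value_fn(with_target) - value_fn(without))
        phi[target] = total
    return phi


def toy_uncertainty(c):
    # Hypothetical uncertainty measure (e.g. a prediction-interval width);
    # in practice this would be returned by the fitted ML model.
    return 1.0 + 2.0 * c["dist_to_site"] + c["elevation"] * c["dist_to_site"]


x = {"elevation": 1.0, "dist_to_site": 2.0, "x_coord": 5.0}
baseline = {"elevation": 0.0, "dist_to_site": 0.0, "x_coord": 0.0}
phi = shapley_values(toy_uncertainty, x, baseline)
print(phi)
```

The contributions sum to the difference between the uncertainty at `x` and at the baseline, and a covariate that does not enter the uncertainty function (`x_coord` here) receives a zero contribution. The double loop visits every coalition, which is why the cost grows exponentially with the number of covariates.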

Received: 02 Feb 2024 – Discussion started: 21 Feb 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

30 Sep 2024

Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

Jeremy Rohmer, Stephane Belbeze, and Dominique Guyonnet

SOIL, 10, 679–697, https://doi.org/10.5194/soil-10-679-2024, 2024

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-323', Anonymous Referee #1, 19 Mar 2024

This is a review of the manuscript “Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach” by Rohmer et al. The authors use SHAP, a common tool for assessing machine learning predictions at the local scale, to investigate the contribution of covariates (or rather groups of covariates) to the uncertainty of a random forest model. It is well known that Shapley values are computationally very expensive, so the authors propose to reduce the number of covariates to speed up computations. This is done before model training (a rather odd proposal) by using a statistical dependence test (i.e., HSIC), and then after model training by grouping covariates (again with the same dependence test). The main aim of relating covariates to the model's uncertainty is intriguing within the field of digital soil mapping, but the manuscript has some major flaws. My major concerns relate to the methodology of the entire covariate selection procedure as well as to the presented case study. The quality of the writing is also unfortunately poor.
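
For readers unfamiliar with HSIC (the Hilbert-Schmidt independence criterion), the statistic behind the dependence test mentioned above can be sketched in a few lines. This is a minimal illustration, assuming Gaussian kernels with fixed, hypothetical bandwidths and the common biased (V-statistic) estimator trace(KHLH)/n²; it computes the statistic only, not the associated test, and is not the authors' code.

```python
import math
import random


def gaussian_gram(values, bandwidth):
    """Gram matrix of a Gaussian kernel on a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-centre a symmetric Gram matrix (H K H with H = I - 11^T / n)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y, bandwidth_x=1.0, bandwidth_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(x)
    k_centered = center(gaussian_gram(x, bandwidth_x))
    gram_y = gaussian_gram(y, bandwidth_y)
    return sum(k_centered[i][j] * gram_y[i][j]
               for i in range(n) for j in range(n)) / n ** 2


rng = random.Random(0)
x = [rng.uniform(-2.0, 2.0) for _ in range(150)]
y = [v * v + rng.gauss(0.0, 0.1) for v in x]   # nonlinear dependence on x
z = [rng.gauss(0.0, 1.0) for _ in range(150)]  # independent of x
print(hsic(x, y), hsic(x, z))
```

The dependent pair should give a clearly larger value than the independent pair, even though the relationship between `x` and `y` is symmetric and would largely escape a linear correlation coefficient.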

Main methodological concerns

• My first criticism relates to the first step, that is, the elimination of covariates before model training. This is a common pitfall within machine learning in DSM. The problem is data leakage, which may cause bias. It occurs when covariates are removed using the entire training data set rather than, for example, within each fold of a cross-validation. Note that any data preprocessing (e.g., normalisation) handled in this way can lead to data leakage. Data leakage may also cause the model’s uncertainty to be lower, which is problematic if interpretative machine learning (IML) methods (like SHAP) are used to analyse the relationships between covariates and the model’s uncertainty. In addition, with a model such as random forest, covariate selection is not really required, especially with so few covariates (i.e., 15). I invite the authors to refer to work such as that of Zhu et al. (2023) for guidance on data preparation so that data leakage is avoided.

• Linking to my previous point, if the goal is to speed up computations, then removing covariates should not be the first choice. In addition, typical DSM projects usually involve more than 100 covariates. The presented case study, which has only 15 covariates, is therefore not the best choice to showcase the proposed methodology. One could instead estimate Shapley values at a sample of grid cells, as in Wadoux et al. (2023). Again, in many DSM projects, maps are created over millions of grid cells, so the presented case study is not the best one to showcase this methodology. Therefore, to speed up computations with a small data set (like the one in this study), I would rather use a more powerful machine for the calculations than omit potentially important parts of my data. If that is not possible, then let the computations run for a few days.

• The grouping of covariates is a practical way of speeding up computation, but I am afraid it holds no meaning for DSM practitioners. The authors acknowledge this in the discussion, starting at Line 519. Doing inference on machine learning output with IML methods is hard enough. I cannot see how the grouping of covariates could hold much interpretive meaning.

• To sum up, exploring the relationship between covariates and model uncertainty is intriguing and worthwhile. However, the paper's emphasis on reducing computation with (questionable?) methods distracts from its main goal. That is, I would have liked to see a more in-depth analysis of covariates related to SHAP (prediction) vs SHAP (uncertainty). I would also have liked more emphasis on two questions: do we expect the same covariates to be related to both, and why do we see different covariates for predictions and for uncertainty?
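
The first concern above, screening covariates inside each cross-validation fold rather than on the full data set, can be sketched as follows. This is a minimal illustration under stated assumptions: absolute Pearson correlation stands in for the HSIC-based screening, a 1-nearest-neighbour regressor stands in for the random forest, and the data, the threshold and all function names are hypothetical.

```python
import random


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def screen(X, y, rows, threshold=0.5):
    """Keep covariates whose dependence score, computed on `rows` only, is large.

    Absolute Pearson correlation is a stand-in for an HSIC-based screening.
    """
    y_rows = [y[r] for r in rows]
    kept = []
    for j in range(len(X[0])):
        column = [X[r][j] for r in rows]
        if abs(pearson(column, y_rows)) > threshold:
            kept.append(j)
    return kept


def predict_1nn(X, y, train_rows, kept, query):
    # 1-nearest-neighbour regressor, a stand-in for the random forest
    nearest = min(train_rows,
                  key=lambda r: sum((X[r][j] - query[j]) ** 2 for j in kept))
    return y[nearest]


rng = random.Random(1)
n, k = 120, 4
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0.0, 0.5) for row in X]  # only covariate 0 matters

indices = list(range(n))
rng.shuffle(indices)
folds = [indices[i::k] for i in range(k)]

kept_per_fold, rmse_per_fold = [], []
for test_rows in folds:
    held_out = set(test_rows)
    train_rows = [r for r in indices if r not in held_out]
    kept = screen(X, y, train_rows)  # screening sees the training rows only
    errors = [predict_1nn(X, y, train_rows, kept, X[r]) - y[r] for r in test_rows]
    kept_per_fold.append(kept)
    rmse_per_fold.append((sum(e * e for e in errors) / len(errors)) ** 0.5)

print(kept_per_fold, rmse_per_fold)
```

The key design choice is that `screen` never sees the held-out rows, so the selected covariates may differ from fold to fold and the error estimate is not flattered by information from the test data.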

Some other concerns / suggestions

• The synthetic case study adds no value to the paper. I suggest removing it as the paper is already a bit long for the topic at hand.

• Section 3.1 is difficult to follow without the knowledge of HSIC and some of the information in the many cited references. Maybe just restructure the manuscript and include essential methodology.

• Random forests are standard and already widely known in DSM. The sections on RF and QRF can be removed, and replaced with brief references to RF and QRF.

• Maps presented in this manuscript are of poor quality and not visually appealing. Captions and legend can also be improved. With Figure 3, show more information. Not everyone is that familiar with this region in France. The histogram is not very clear, especially the long right tail can be enhanced visually.

• General writing of the manuscript is poor. Some examples: The overuse of “etc”, too many brackets to give additional information, brief introductions at each section.

• The mathematical writing can also be improved. For example, are the authors sure that the ML model is just y=f(x)? See Line 142.

• Figure 6 does not make sense. Why is there an arrow from Step 2 to 4?
References:

Wadoux, A., Saby, N., Martin, M. (2023). Shapley values reveal the drivers of soil organic carbon stock prediction. SOIL, 9, 21-38. doi: 10.5194/soil-9-21-2023.

Zhu et al. (2023). Machine Learning in Environmental Research: Common Pitfalls and Best Practices. https://pubs.acs.org/doi/10.1021/acs.est.3c00026.

Citation: https://doi.org/10.5194/egusphere-2024-323-RC1
- AC1: 'Reply on RC1', Jeremy Rohmer, 29 Apr 2024
  
  We would like to thank Referee #1 for the constructive comments. We agree with most of the suggestions and have therefore modified the manuscript to take their comments on board. In the attached document, we recall the reviews and reply to each of the comments in turn (outlined in blue). The main corrections made to the manuscript are described in a specific section of each response.
  
  Citation: https://doi.org/10.5194/egusphere-2024-323-AC1
RC2: 'Comment on egusphere-2024-323', Anonymous Referee #2, 11 Apr 2024

This manuscript is well written, clear and relevant, and presents methods that could provide stakeholders with valuable insights into where the uncertainty comes from: this has the potential to make uncertainty more concrete for them.
I appreciate the use of a synthetic test case, which makes the whole procedure a lot easier to understand.
I don’t have any major criticisms. I would be pleased to see this manuscript published after attention to the following minor details:
Line 44: However, at a local scale, these methods don’t (?) provide any information for a prediction at a certain spatial location.
Line 157: pushes the prediction uncertainty?
Line 442: I don’t see any circular pattern on the bottom middle panel of Figure 13 (in the bottom right one however, they are really clear).

Synthetic test case: could the fact that in Z1 the biggest contributor to uncertainty is Tmean-Tmax (and, respectively, that in Z2 the biggest contributor is Pwettest) be linked to the fact that these covariates take uniquely high (respectively low) values there, which are not represented in the dataset? If you agree, this would in my opinion be interesting to add to the discussion.

Citation: https://doi.org/10.5194/egusphere-2024-323-RC2
- AC2: 'Reply on RC2', Jeremy Rohmer, 29 Apr 2024
  
  We would like to thank Referee #2 for the positive analysis and the constructive comments. We agree with most of the suggestions and have therefore modified the manuscript to take their comments on board. In the attached document, we recall the reviews and reply to each of the comments in turn.
  
  Citation: https://doi.org/10.5194/egusphere-2024-323-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Reconsider after major revisions (14 May 2024) by Alexandre Wadoux

AR by Jeremy Rohmer on behalf of the Authors (25 Jun 2024): Author's response, Author's tracked changes, Manuscript

ED: Referee Nomination & Report Request started (01 Jul 2024) by Alexandre Wadoux

RR by Anonymous Referee #1 (16 Jul 2024)

Suggestions for revision or reasons for rejection

I commend the authors on an improved manuscript. The methodology section now reads better, and the various steps are also clearer. I am also pleased with the improved methodology in terms of performing the screening analysis within the cross-validation. However, even with the improvements, certain sections are still a bit cryptic. See additional concerns below (some are minor and others are more major). I based my second review on the tracked-changes document.
Line 260. The sentence starting with “Overall, the RF …” reads strange.
Line 280. Maybe rather: “Shapley values, as defined in Sect. 3.2, … ”. Avoid relying too heavily on brackets.
Lines 289-291. Try rewriting in more than one sentence. It is currently hard to follow. In addition, I am unsure what the authors mean by “… having uniquely high and low values…”.
Line 299-300. First part of the sentence does not make sense. The part with “to estimate the conditional mean, which is used as the best estimate of the prediction,…”. Rewrite, because this is technically wrong. How can the estimate of the conditional mean be the estimate of the prediction?
Line 301. I must admit I am getting lost with this part, and other readers may as well. The authors mention the difficulty related to the clustering (i.e., verb) of the observations. Because this term is also used in this manuscript to refer to the clustering algorithm, it is a poor choice for describing how the points are distributed. Could the authors clarify what they mean here? Are they referring to how the points are spatially distributed?
Line 301: I am also confused as to why this is a problem? Given that my understanding of the above point is right.
Lines 302-308. Is all of this necessary? Was this discussed in the methodology section? To make sure I understand: since the points are spatially clustered, that is, not well dispersed over the region, the authors define weights that are then used when observations are sampled to draw the bootstrap samples (i.e. trees). If my understanding is correct, then this all seems a bit unnecessary. Could the authors elaborate on why this is necessary? In addition, why bring in additional methodology that was not discussed in the previous sections? Also, what if the weights do not address the feature space well? Another question: is this step necessary when you include covariates that are used to address the spatial aspect of the data? I mean, you included covariates such as the coordinates and various distances. Can the authors highlight DSM studies where this has been done? Again, I am just trying to understand the motivation behind the methodology in these lines.
Line 337: “…covariates are retained in the construction of the RF model.” But the RF was already constructed if the cross-validation was performed. So why are covariates retained? What does this mean?
Line 361: Oh, I see: retained for the group-based SHAP. Is this what the authors meant at Line 337? If so, then make it clearer. If not, please explain.
Line 406: models, plural?
Line 406: This is also a very strange sentence, because the RF model cannot extrapolate. See this post, for example, which explains it (https://stats.stackexchange.com/questions/235189/random-forest-regression-not-predicting-higher-than-training-data). So again, all of this is a bit cryptic, and I am unsure what the authors mean (Lines 405-412). The authors reference here the paper by Takoutsing and Heuvelink. Note that the paragraph right above section 3.5 also states that RF cannot extrapolate beyond the training data.
Line 411: What limitations?
Lines 438-443: rewrite to include the long line-in reference in the quotes.
Line 456: extrapolation mode: odd way of stating that RF is used to make spatial extrapolations. See also in Line 411.
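
The referee's point on Lines 406 and 456, that a random forest cannot predict beyond the range of the training response, can be illustrated with a minimal sketch. The tiny bagged one-dimensional regression tree below is written for illustration only (it is not the authors' model, and the data and parameter values are hypothetical): every leaf returns a mean of training responses, so any prediction stays within the observed range.

```python
import random


def fit_tree(xs, ys, min_leaf=3):
    """Tiny 1-D regression tree, returned as nested tuples."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = None
    for cut in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[cut - 1] == xs[cut]:
            continue  # cannot split between identical covariate values
        left, right = ys[:cut], ys[cut:]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - mean_l) ** 2 for v in left)
               + sum((v - mean_r) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, cut)
    if best is None:
        return ("leaf", sum(ys) / len(ys))  # leaf value: mean of training responses
    cut = best[1]
    threshold = (xs[cut - 1] + xs[cut]) / 2.0
    return ("split", threshold,
            fit_tree(xs[:cut], ys[:cut], min_leaf),
            fit_tree(xs[cut:], ys[cut:], min_leaf))


def predict_tree(tree, x):
    while tree[0] == "split":
        tree = tree[2] if x <= tree[1] else tree[3]
    return tree[1]


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(xs)) for _ in xs]  # bootstrap sample
        forest.append(fit_tree([xs[r] for r in rows], [ys[r] for r in rows]))
    return forest


def predict_forest(forest, x):
    return sum(predict_tree(t, x) for t in forest) / len(forest)


xs = [0.2 * i for i in range(51)]  # covariate observed on [0, 10]
ys = [2.0 * x for x in xs]         # response observed on [0, 20]
forest = fit_forest(xs, ys)
inside, outside = predict_forest(forest, 10.0), predict_forest(forest, 20.0)
print(inside, outside, max(ys))
```

Although the underlying relationship is linear, the prediction at x = 20 equals the prediction at the edge of the training range (x = 10) and never exceeds the largest observed response.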

ED: Revision (19 Jul 2024) by Alexandre Wadoux

AR by Jeremy Rohmer on behalf of the Authors (12 Aug 2024): Author's response, Author's tracked changes, Manuscript

ED: Publish as is (13 Aug 2024) by Alexandre Wadoux

ED: Publish as is (13 Aug 2024) by Rémi Cardinael (Executive editor)

AR by Jeremy Rohmer on behalf of the Authors (20 Aug 2024): Manuscript

Data sets

Data to run the synthetic test case, Hanna Meyer, https://github.com/HannaMeyer/CAST/tree/master/inst/extdata

Model code and software

R markdown - synthetic test case, Jeremy Rohmer, https://github.com/anrhouses/groupSHAP-uncertainty

Viewed

Total article views: 2,231 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,555	586	90	2,231	108	163

Views and downloads (calculated since 21 Feb 2024)

Month	HTML	PDF	XML	Total
Feb 2024	84	19	3	106
Mar 2024	66	13	2	81
Apr 2024	66	16	11	93
May 2024	33	12	1	46
Jun 2024	46	8	2	56
Jul 2024	38	14	4	56
Aug 2024	46	4	8	58
Sep 2024	54	12	2	68
Oct 2024	24	22	2	48
Nov 2024	6	6	2	14
Dec 2024	10	12	0	22
Jan 2025	22	16	8	46
Feb 2025	22	12	0	34
Mar 2025	26	14	4	44
Apr 2025	22	26	2	50
May 2025	38	18	0	56
Jun 2025	28	24	0	52
Jul 2025	30	22	2	54
Aug 2025	78	36	0	114
Sep 2025	284	36	2	322
Oct 2025	54	26	2	82
Nov 2025	44	14	8	66
Dec 2025	70	36	2	108
Jan 2026	78	66	12	156
Feb 2026	72	40	6	118
Mar 2026	86	24	2	112
Apr 2026	65	12	0	77
May 2026	43	20	0	63
Jun 2026	13	2	0	15
Jul 2026	7	4	3	14

Viewed (geographical distribution)

Total article views: 2,237 (including HTML, PDF, and XML). Of these, 2,237 have a defined geography and 0 are of unknown origin.

Latest update: 29 Jul 2026

Short summary

Machine learning (ML) models have become key ingredients for digital soil mapping. To explain why the ML model is confident, we apply a popular method from the field of explainable artificial intelligence, based on Shapley values, to the prediction uncertainty of hydrocarbon pollutants in an urban soil. To alleviate the implementation difficulties (number of factors, complex relationships between the factors, high-resolution maps), a simple-but-efficient grouping approach is tested.
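
The grouping idea can be sketched by treating each group of covariates as a single player in the Shapley computation, so that the number of coalitions drops from 2^d (covariates) to 2^g (groups). This is a minimal illustration under stated assumptions, not the authors' implementation: the groups, the covariate names, the baseline point and `toy_uncertainty` are hypothetical, and absent groups are fixed at baseline values.

```python
from itertools import combinations
from math import factorial


def group_shapley(value_fn, x, baseline, groups):
    """Exact Shapley values where each group of covariates is one player."""
    labels = list(groups)
    g = len(labels)

    def evaluate(active):
        # Covariates of active groups take their value at x, others the baseline
        point = dict(baseline)
        for label in active:
            for name in groups[label]:
                point[name] = x[name]
        return value_fn(point)

    phi = {}
    for label in labels:
        others = [other for other in labels if other != label]
        total = 0.0
        for size in range(g):
            weight = factorial(size) * factorial(g - size - 1) / factorial(g)
            for coalition in combinations(others, size):
                total += weight * (evaluate(coalition + (label,)) - evaluate(coalition))
        phi[label] = total
    return phi


def toy_uncertainty(c):
    # Hypothetical uncertainty measure with an interaction inside one group
    return 0.5 + c["x_coord"] * c["y_coord"] + 2.0 * c["elevation"]


groups = {"position": ["x_coord", "y_coord"], "terrain": ["elevation", "slope"]}
x = {"x_coord": 1.0, "y_coord": 3.0, "elevation": 2.0, "slope": 7.0}
baseline = {name: 0.0 for name in x}
phi = group_shapley(toy_uncertainty, x, baseline, groups)
print(phi)
```

With two groups instead of four covariates, only 2² = 4 coalitions are evaluated instead of 2⁴ = 16. The price is that the interaction between `x_coord` and `y_coord` is attributed to the group as a whole rather than to either covariate.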