This work is distributed under the Creative Commons Attribution 4.0 License.
Estimation of local training data point densities to support the assessment of spatial prediction uncertainty
Abstract. Machine learning is frequently used in environmental and earth sciences to produce spatial or spatio-temporal predictions of environmental variables based on limited field samples – increasingly even on a global scale and far beyond the location of available training data. Since new geographic space often goes along with new environmental properties represented by the model's predictors, and since machine learning models do not perform well in extrapolation, this raises questions regarding the applicability of the trained models at the prediction locations.
Methods to assess the area of applicability of spatial prediction models have been recently suggested and applied. These are typically based on distances in the predictor space between the prediction data and the nearest reference data point to represent the similarity to the training data. However, we assume that the density of the training data in the predictor space, i.e. how well an environment is represented in a model, is highly decisive for the prediction quality and complements the consideration of distances.
We therefore suggest a local training data point density (LPD) approach. The LPD is a quantitative measure that indicates, for a new prediction location, how many similar reference data points have been included in the model training. Similarity here is defined by the dissimilarity threshold introduced by Meyer and Pebesma (2021), which is the maximum distance to a nearest training data point in the predictor space as observed during cross-validation. We assess the suitability of the approach in a simulation study and illustrate how the method can be used in real-world applications.
The simulation study indicated a positive relationship between LPD and prediction performance and highlighted the value of the approach compared to considering only the distance to the nearest data point. We therefore suggest calculating the LPD to support the assessment of prediction uncertainties.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-2730', Anonymous Referee #1, 07 Dec 2024
Thanks for the opportunity to review this article. The manuscript "Estimation of local training data point densities to support the assessment of spatial prediction uncertainty" presents a new methodology designed to improve the spatial prediction quality of ML models by additionally considering the training data point density. Overall, the work is solid enough for publication (with very high potential); the presentation quality (writing, data visualization, etc.) is good; the study background, research aim, and methodology introduction are clear; and the methodology has the potential to stimulate further innovation in spatial interpolation methods. However, I have the following major concerns and would recommend a 'major revision' for this work.
Major scientific concerns (need to explain or clarify or do more quantitative work):
1. It looks like the LPD is designed for very large-scale spatial interpolation. However, in the cases shown in Europe and South America, sampling point density is subject to the real-world distribution of stations (i.e. the locations of recording stations). Thus, the point density pattern is always fixed. Does the LPD also work for non-fixed sampling locations in more realistic cases? I noticed that you have presented a random pattern distribution in the simulation study, but such a random distribution at that large spatial scale might not be realistic due to physical conditions. Is it possible that the LPD can work for smaller-scale cases (in-situ sampling for soil, for instance)?
2. The ML model in this article is a random forest (RF). RF is indeed used intensively for spatial interpolation and mapping. However, one major concern is how spatial patterns are fed into RF as useful information. How does RF understand point density or use point density to improve model accuracy? Within the mechanism of RF and other ML models (tree shape, for instance), how is the tree shape changed by integrating the LPD? Does it necessarily lead to model improvement?
3. Another major concern is that the research significance / contribution (LPD+RF) is not linked directly with major spatial interpolation methods (especially traditional methods like Kriging), given the content presented. The main contribution is shown as the improvement from the LPD after comparison with the DI only.
4. It is scientists' common knowledge that no method is suitable for all cases. Readers and I also want to see for what kinds of spatial cases (patterns, scales, etc.) or scenarios the LPD approach is more powerful than others. Under what circumstances can the LPD be improved (to be addressed in the discussion as future work)?
Minor issues:
1. Please check the grammar issues throughout the article in the next version. Here are some examples of minor errors: "at a spatial resolution of 10 minutes" (Line 106); "with increasing dimensionality the computational and memory effort increases drastically." (Line 54, a little hard to understand).
2. Information on all 19 of your bio variables is missing. I think it would be great if a summary table could be provided to document these variables.
Citation: https://doi.org/10.5194/egusphere-2024-2730-RC1
AC2: 'Reply on RC1', Fabian Schumacher, 20 May 2025
We appreciate the referee's reflective and constructive feedback on our manuscript and the time and effort put into it. Below, we provide a detailed response to each comment.
Major Scientific Concerns
1. Applicability of LPD to non-fixed sampling locations and smaller-scale cases
We acknowledge the reviewer's concern regarding the applicability of the LPD approach beyond large-scale spatial interpolation with fixed sampling locations. While the case studies in the manuscript focus on large-scale applications, the LPD method is not inherently restricted to such contexts. It is based on the local density of training points in the predictor space rather than geographic space, making it applicable across sampling strategies, including non-fixed and smaller-scale settings. We will add further explanation to the discussion of the revised manuscript.
We demonstrate its applicability to local studies in the help file of the CAST package (?AOA). In this example, we use soil moisture and temperature logger data at the farm scale to show how the LPD can identify areas that are well or poorly covered by training data (in the predictor space).
2. How does RF incorporate LPD and influence tree structure?
The reviewer raises an important question about how Random Forest (RF) integrates spatial patterns and whether this necessarily leads to model improvement. While RF does not inherently encode spatial information—unless explicitly incorporated (see https://doi.org/10.5194/gmd-17-6007-2024)—it captures relationships within the predictor variables, which can include local density information.
The LPD does not influence the model itself and is not intended to be incorporated into the training process. Instead, it serves solely as a measure of training point density in the predictor data space for a new prediction location. Since it is not used during model training, it does not alter the tree structure.
Our goal is therefore not to improve model accuracy directly, but rather to enhance the reliability of predictions by identifying areas with high training coverage or by quantifying potential prediction accuracy loss related to low LPD.
We will add further clarification in the revised manuscript, emphasizing that while the LPD does not modify the tree structure (as it is not included in training), it offers a complementary measure of uncertainty.
3. Comparison with traditional spatial interpolation methods like Kriging
We appreciate the suggestion to better integrate our work within the broader context of spatial interpolation methods. However, our focus is on machine learning–based approaches, for which, unlike kriging (where the kriging variance quantifies spatial uncertainty based on spatial autocorrelation), no established method for uncertainty assessment exists. We believe that comparing the LPD to uncertainty assessment techniques used in interpolation methods is not particularly helpful, as the models rely on different assumptions and strategies. We prefer to maintain our focus on machine learning methods in this context.
Nevertheless, we see the benefit of working out the differences between the various approaches and will elaborate on this in the discussion of the revised manuscript.
4. Identifying scenarios where LPD is most effective and areas for improvement
We fully agree that no single method is universally optimal. In our manuscript we also compare it to the dissimilarity index that was proposed earlier. To provide a clearer guideline for practitioners, we will add information to the manuscript discussing the conditions under which LPD performs best. These include:
- Cases with heterogeneous predictor distributions where training data density varies significantly.
- Applications where it is crucial to assess model applicability beyond training locations.
- Multivariate predictor scenarios where statistical density measures such as kernel density estimation fail to capture data density effectively (e.g. because they become computationally expensive in multivariate cases).
Minor Issues
1. Grammar and clarity improvements
We appreciate the reviewer’s attention to detail and will carefully review the manuscript for grammatical and clarity improvements.
2. Missing summary table for 19 bio variables
We acknowledge this oversight and will include a summary table detailing all 19 bio variables used in our study. This table will provide variable names and descriptions and will be placed in the appendix.
We are grateful for the referee's insightful comments. We will incorporate the revisions and believe they will improve the clarity and impact of our work and strengthen the manuscript for publication.
Best regards,
Fabian Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
Citation: https://doi.org/10.5194/egusphere-2024-2730-AC2
CEC1: 'Comment on egusphere-2024-2730 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Dec 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
Your "Code and Data Availability" statement does not contain the link to a permanent repository with the code and data used to produce your manuscript. I am sorry to have to be so outspoken, but this is something completely unacceptable, forbidden by our policy, and your manuscript should have never been accepted for Discussions given such a flagrant violation of the policy. All the code and data must be published openly and freely to anyone in one of the repositories listed in our policy before submission of a manuscript.
Therefore, we are granting you a short time to solve this situation. You have to reply to this comment in a prompt manner with the information for the repositories containing all the models, code and data that you use to produce and replicate your manuscript. The reply must include the link and permanent identifier (e.g. DOI). Also, any future version of your manuscript must include the modified section with the new information.
Note that if you do not fix these problems as requested, we will have to reject your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2730-CEC1
AC1: 'Reply on CEC1', Fabian Schumacher, 11 Dec 2024
Dear Dr. Añel,
Thank you for your feedback and for bringing this important issue to our attention. We sincerely apologize for the mistake and any inconvenience it has caused.
We have now archived the contents of the GitHub repository referenced in our paper, which contains all scripts and data, as well as the repository containing the R package CAST on Zenodo. In the revised version of our manuscript, we will update the "Code and Data Availability" section to ensure full compliance with the journal's data policy. Specifically, we will provide links to permanent repositories containing all relevant code and data used in our study. The updated section will look like this:
The current version of the method to calculate the introduced Local Point Density (LPD) is available from the developer version of the R package CAST (https://github.com/HannaMeyer/CAST) under the GNU General Public Licence (GPL >= v2). The exact R package version of the implementation used to produce the results of this paper is CAST Version 1.0.2 which is published on CRAN (Meyer et al., 2024b) and Zenodo (https://doi.org/10.5281/zenodo.14362793). The exact version of the simulation study and the case study including all scripts as well as the input data used to run the models and produce the results and plots described in this paper is archived on Zenodo (https://doi.org/10.5281/zenodo.14356807) under the GNU General Public Licence (GPL v3).
In the current manuscript, CAST version 1.0.0 is cited. Since version 1.0.2, which addresses only minor issues, has now been archived, revised versions of the manuscript will reference this updated version.
We hope this update addresses the concerns raised and ensures our manuscript fully aligns with the journal's data policy. Please let us know if there is anything further we can do to resolve this matter.
Thank you for your understanding and guidance.
Best regards,
Fabian Schumacher
On behalf of all co-authors
Citation: https://doi.org/10.5194/egusphere-2024-2730-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 12 Dec 2024
Dear authors,
Many thanks for addressing the issues pointed out so quickly. We can consider now the current version of your manuscript in compliance with our code and data policy.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2730-CEC2
RC2: 'Comment on egusphere-2024-2730', Anonymous Referee #2, 12 May 2025
This paper is clear, well-structured and presents a metric for assessing ML-prediction validity which provides value to the discussion in the field. Work is open and reproducible. Code is usable, and comments on what to use look ok. I support the publication of this as-is.
Did find a couple of typos, though:
Line 102: "a prediction scenarios" should be either "prediction scenarios" or "a prediction scenario"
Line 149: "and and" -> "and"
Citation: https://doi.org/10.5194/egusphere-2024-2730-RC2
AC3: 'Reply on RC2', Fabian Schumacher, 20 May 2025
Thank you very much for your positive and encouraging review. We appreciate your support for the publication of our work.
Thank you also for pointing out the typos. We will correct both the issues you mentioned and will make sure to carefully proofread the final version to ensure clarity and correctness.
Best regards,
Fabian Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
Citation: https://doi.org/10.5194/egusphere-2024-2730-AC3
RC3: 'Comment on egusphere-2024-2730', Anonymous Referee #3, 21 May 2025
Overview
This study by Schumacher et al. proposes a new approach called local training data point density (LPD) to improve the evaluation of spatial prediction uncertainty in machine learning models. Building on the concept of the area of applicability (AOA) and the dissimilarity index (DI), the authors introduce LPD as a way to quantify the number of similar training data points within a defined similarity threshold in the predictor space. The logic and effectiveness of the method are well-explained through a simulation study and a real-world case on plant species richness in South America, using both random and clustered sampling scenarios. The results demonstrate that LPD complements DI and provides added insight into prediction uncertainty assessment. I think this approach is timely and important, especially for environmental applications where models are often applied in data-sparse regions.
Overall, I find the manuscript is well-written with clear experimental design and informative figures. The study is methodologically sound, and the code and data availability add to its reproducibility and value for the community. The discussion of limitations, such as the computational intensity of LPD and the need for further testing with categorical data, sets the stage for future research. I have only a few minor comments and suggestions listed below. Once these are addressed, I recommend the manuscript for publication.
Specific Comments
- The study focuses on random forest models as a representative approach. Other algorithms such as neural networks and gradient boosting may behave differently in terms of extrapolation and sensitivity to training data density, and I'm wondering if LPD can perform well with other algorithms. A discussion on this would benefit a broader readership.
- Real-world geoscience data can sometimes show unphysical teleconnections, meaning spurious correlations between geographically distant locations or variables that arise due to data artifacts rather than true physical processes. Will this pose a concern for the application of the LPD method?
- The simulation study is solid, but it only looks at one response variable based on principal component analysis. It would be more convincing if LPD were tested on a wider range of simulated cases like non-Gaussian or multimodal responses to provide a more comprehensive validation.
- Some technical terms (e.g., shape-constrained additive model) may be unfamiliar to some readers. Perhaps consider adding a brief explanation or reference.
- L13: comma is missing before "which".
Citation: https://doi.org/10.5194/egusphere-2024-2730-RC3
AC4: 'Reply on RC3', Fabian Schumacher, 29 May 2025
We would like to thank the reviewer for their positive and constructive feedback. Below, we respond point-by-point to the specific comments, which helped us to further clarify and improve the manuscript.
1. "The study focuses on random forest models... wondering if LPD can perform well with other algorithms."We agree that the behavior of different machine learning algorithms with respect to extrapolation and sensitivity to training data density can vary. To demonstrate that our LPD approach is not limited to random forest models, we have included an example using Support Vector Machines in the helpfile of the AOA function in our CAST package.
The results show that while the overall LPD patterns remain similar across algorithms, minor differences can emerge. These are primarily due to the varying ways in which different models internally weight predictor variables. Such differences are expected, as variable importance influences both model predictions and the LPD outcome.
When we neutralize these differences - by assigning equal weights to all variables and applying the same cross-validation strategy across models - the resulting LPD patterns become identical. This is because the LPD results are driven by the locations of the training data and the new data in the predictor space, as well as by the cross-validation design, rather than by the specific characteristics of the learning algorithm (see supplement figures).
We have added a brief discussion of this point to the revised manuscript to clarify that the LPD can be used with different algorithms: it is independent of the model's internal mechanics, as mainly the variable importances and the cross-validation folds are used for the LPD calculation, alongside the predictor values of the training data and the new data.
2. "Real-world geoscience data can show unphysical teleconnections... Will this pose a concern for the application of the LPD method?"We use the predictor variables selected by the model - weighted by their estimated relevance - without distinguishing between causal and non-causal relationships, because the model itself does not make that distinction either. If a non-causal (i.e., potentially unphysical) variable is relevant for the model's prediction, then being close in predictor space to the training data (i.e., having a high LPD) becomes even more important - because the model is extrapolating based on a spurious relationship, which is likely to fail outside known data. Such failures are typically detectable through spatial cross-validation strategies, which we recommend using to define the similarity threshold and to assess model performance under realistic prediction scenarios.
3. "The simulation study... would be more convincing if LPD were tested on a wider range of simulated cases like non-Gaussian or multimodal responses."Thank you for pointing this out. We agree that further testing with more complex or non-Gaussian response structures would enhance the validation. We will acknowledge in the discussion that evaluating LPD under, e.g. multimodal or skewed relationships is essential for broader generalizability. We propose this as a extension of our work and include it under “future development” in our discussion.
4. "Some technical terms (e.g., shape-constrained additive model) may be unfamiliar..."We added brief explanations at the mention of shape-constrained additive models (Section 2.3) to ensure accessibility to a broader readership.
We once again thank the reviewer for the thoughtful feedback and are confident that the revisions have strengthened the manuscript.
Sincerely,
Fabian Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 610 | 138 | 36 | 784 | 22 | 43 |