Estimation of local training data point densities to support the assessment of spatial prediction uncertainty
Abstract. Machine learning is frequently used in environmental and earth sciences to produce spatial or spatio-temporal predictions of environmental variables based on limited field samples – increasingly even on a global scale and far beyond the location of available training data. Since new geographic space often goes along with new environmental properties represented by the model's predictors, and since machine learning models do not perform well in extrapolation, this raises questions regarding the applicability of the trained models at the prediction locations.
Methods to assess the area of applicability of spatial prediction models have been recently suggested and applied. These are typically based on distances in the predictor space between the prediction data and the nearest reference data point to represent the similarity to the training data. However, we assume that the density of the training data in the predictor space, i.e. how well an environment is represented in a model, is highly decisive for the prediction quality and complements the consideration of distances.
We therefore suggest a local training data point density (LPD) approach. The LPD is a quantitative measure that indicates, for a new prediction location, how many similar reference data points were included in the model training. Similarity here is defined via the dissimilarity threshold introduced by Meyer and Pebesma (2021), which is the maximum distance to the nearest training data point in the predictor space as observed during cross-validation. We assess the suitability of the approach in a simulation study and illustrate how the method can be used in real-world applications.
The simulation study indicated a positive relationship between LPD and prediction performance and highlighted the value of the approach compared to considering the distance to the nearest data point only. We therefore suggest calculating the LPD to support the assessment of prediction uncertainties.
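To make the idea concrete, the sketch below counts, for each prediction location, the training points that lie within a given dissimilarity threshold in the scaled predictor space. This is a conceptual illustration only, not the CAST implementation described in the paper: the weighting of predictors by variable importance is omitted, the threshold is passed in as a plain number rather than derived from cross-validation, and the function name lpd_sketch as well as the toy predictor tables are purely illustrative.

```r
## Conceptual sketch of the LPD idea (not the CAST implementation):
## for each prediction location, count the training points whose distance
## in the scaled predictor space stays below a dissimilarity threshold.
## In the paper the threshold comes from cross-validation and predictors are
## weighted by variable importance; both are simplified away here.
lpd_sketch <- function(train, newdata, threshold) {
  # scale both data sets with the training means and standard deviations
  mu  <- colMeans(train)
  sdv <- apply(train, 2, sd)
  tr  <- scale(train,   center = mu, scale = sdv)
  nd  <- scale(newdata, center = mu, scale = sdv)

  # Euclidean distances between every prediction point and every training point
  n_new <- nrow(nd)
  d <- as.matrix(dist(rbind(nd, tr)))[seq_len(n_new), -seq_len(n_new), drop = FALSE]

  # LPD: number of training points within the threshold for each prediction point
  rowSums(d <= threshold)
}

# Illustrative use with two toy predictor tables and an assumed threshold of 0.5
set.seed(42)
train_pred <- data.frame(bio1 = rnorm(100), bio12 = rnorm(100))
new_pred   <- data.frame(bio1 = rnorm(10),  bio12 = rnorm(10))
lpd_sketch(train_pred, new_pred, threshold = 0.5)
```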
Status: open (until 09 Jan 2025)
RC1: 'Comment on egusphere-2024-2730', Anonymous Referee #1, 07 Dec 2024
Thanks for the opportunity to review this article. The manuscript "Estimation of local training data point densities to support the assessment of spatial prediction uncertainty" presents a new methodology designed to improve the spatial prediction quality of ML models by additionally considering the training data point density. Overall, the work is solid enough for publication (with very high potential); the presentation quality (writing, data visualization, etc.) is good; the study background, research aim, and methodology introduction are clear; and the methodology has the potential to stimulate further innovation in spatial interpolation methods. However, I have the following major concerns and would recommend a 'major revision' for this work.
Major scientific concerns (points that need to be explained, clarified, or supported by more quantitative work):
1. It looks like the LPD is designed for very large-scale spatial interpolation. However, in the cases shown for Europe and South America, the sampling point density is subject to the real-world distribution of stations (i.e. the locations of recording stations). Thus, the point density pattern is always fixed. Does the LPD also work for non-fixed sampling locations in more realistic cases? I noticed that you presented a randomly distributed sampling pattern in the simulation study, but such a random distribution at that large spatial scale might not be realistic due to physical conditions. Is it possible for the LPD to work for smaller-scale cases (in-situ sampling for soil, for instance)?
2. The ML model in this article is a random forest (RF). RF is indeed used intensively for spatial interpolation and mapping. However, one major concern is how spatial patterns are fed into RF as useful information. How does RF understand point density or use point density to improve model accuracy? Within the mechanism of RF and other ML models (tree shape, for instance), how is the tree shape changed by integrating the LPD? Does it necessarily lead to model improvement?
3. Another major concern is that the research significance/contribution (LPD+RF) is not linked directly to major spatial interpolation methods (especially traditional methods like kriging), given the content presented. The main contribution is shown as the improvement from the LPD after a comparison with the DI only.
4. It is scientists' common knowledge that no method is suitable for all cases. Readers and I would also like to see for what kinds of spatial cases (patterns, scales, etc.) or scenarios the LPD approach is more powerful than others, and under what circumstances the LPD could be improved (in the discussion, as future work).
Minor issues:
1. Please check the grammar issues throughout the article in the next version. Here are some examples of minor errors: "at a spatial resolution of 10 minutes" (Line 106); "with increasing dimensionality the computational and memory effort increases drastically." (Line 54, a little hard to understand).
2. Information on all 19 bioclimatic variables is missing. I think it would be great if a summary table could be provided to document these variables.
Citation: https://doi.org/10.5194/egusphere-2024-2730-RC1
CEC1: 'Comment on egusphere-2024-2730 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Dec 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html

Your "Code and Data Availability" statement does not contain the link to a permanent repository with the code and data used to produce your manuscript. I am sorry to have to be so outspoken, but this is something completely unacceptable, forbidden by our policy, and your manuscript should have never been accepted for Discussions given such flagrant violation of the policy. All the code and data must be published openly and freely to anyone in one of the repositories listed in our policy before submission of a manuscript.
Therefore, we are granting you a short time to solve this situation. You have to reply to this comment in a prompt manner with the information for the repositories containing all the models, code and data that you use to produce and replicate your manuscript. The reply must include the link and permanent identifier (e.g. DOI). Also, any future version of your manuscript must include the modified section with the new information.
Note that if you do not fix these problems as requested, we will have to reject your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2730-CEC1
AC1: 'Reply on CEC1', Fabian Schumacher, 11 Dec 2024
Dear Dr. Añel,
Thank you for your feedback and for bringing this important issue to our attention. We sincerely apologize for the mistake and any inconvenience it has caused.
We have now archived the contents of the GitHub repository referenced in our paper, which contains all scripts and data, as well as the repository containing the R package CAST on Zenodo. In the revised version of our manuscript, we will update the "Code and Data Availability" section to ensure full compliance with the journal's data policy. Specifically, we will provide links to permanent repositories containing all relevant code and data used in our study. The updated section will look like this:
The current version of the method to calculate the introduced Local Point Density (LPD) is available from the developer version of the R package CAST (https://github.com/HannaMeyer/CAST) under the GNU General Public Licence (GPL >= v2). The exact R package version of the implementation used to produce the results of this paper is CAST Version 1.0.2 which is published on CRAN (Meyer et al., 2024b) and Zenodo (https://doi.org/10.5281/zenodo.14362793). The exact version of the simulation study and the case study including all scripts as well as the input data used to run the models and produce the results and plots described in this paper is archived on Zenodo (https://doi.org/10.5281/zenodo.14356807) under the GNU General Public Licence (GPL v3).
In the current manuscript, CAST version 1.0.0 is cited. Since version 1.0.2, which addresses only minor issues, has now been archived, revised versions of the manuscript will reference this updated version.
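As orientation, a minimal usage sketch is given below. It assumes that aoa() in CAST >= 1.0.0 exposes an LPD switch (with arguments LPD and maxLPD) and returns the densities in an LPD element, as indicated by the package news; the exact argument and element names should be checked against the CAST 1.0.2 manual, and the toy caret model is purely illustrative.

```r
## Minimal usage sketch (assumed aoa() interface of CAST >= 1.0.0; please verify
## the LPD/maxLPD argument names and the LPD element against the CAST 1.0.2 manual).
library(CAST)
library(caret)

# toy training data with two predictors and a response
set.seed(1)
train_df   <- data.frame(x1 = runif(100), x2 = runif(100))
train_df$y <- train_df$x1 + rnorm(100, sd = 0.1)
new_df     <- data.frame(x1 = runif(500), x2 = runif(500))

# random forest trained with cross-validation, from which the dissimilarity
# threshold is derived
model <- train(y ~ x1 + x2, data = train_df, method = "rf",
               trControl = trainControl(method = "cv", number = 5))

# area of applicability with the local point density switched on
res <- aoa(newdata = new_df, model = model, LPD = TRUE)
summary(res$LPD)  # assumed element holding the LPD per prediction location
```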
We hope this update addresses the concerns raised and ensures our manuscript fully aligns with the journal's data policy. Please let us know if there is anything further we can do to resolve this matter.
Thank you for your understanding and guidance.
Best regards,
Fabian Schumacher
On behalf of all co-authors

Citation: https://doi.org/10.5194/egusphere-2024-2730-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 12 Dec 2024
Dear authors,
Many thanks for addressing the issues pointed out so quickly. We can consider now the current version of your manuscript in compliance with our code and data policy.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-2730-CEC2
Viewed

| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 143 | 32 | 12 | 187 | 2 | 4 |