https://doi.org/10.5194/egusphere-2024-2730
14 Nov 2024
Status: this preprint is open for discussion.

Estimation of local training data point densities to support the assessment of spatial prediction uncertainty

Fabian Lukas Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer

Abstract. Machine learning is frequently used in environmental and earth sciences to produce spatial or spatio-temporal predictions of environmental variables based on limited field samples – increasingly even on a global scale and far beyond the location of available training data. Since new geographic space often goes along with new environmental properties represented by the model's predictors, and since machine learning models do not perform well in extrapolation, this raises questions regarding the applicability of the trained models at the prediction locations.

Methods to assess the area of applicability of spatial prediction models have recently been suggested and applied. These are typically based on distances in the predictor space between the prediction data and the nearest reference data point to represent the similarity to the training data. However, we assume that the density of the training data in the predictor space, i.e. how well an environment is represented in a model, strongly influences the prediction quality and complements the consideration of distances.

We therefore suggest a local training data point density (LPD) approach. The LPD is a quantitative measure that indicates, for a new prediction location, how many similar reference data points have been included in the model training. Similarity here is defined by the dissimilarity threshold introduced by Meyer and Pebesma (2021), which is the maximum distance to the nearest training data point in the predictor space as observed during cross-validation. We assess the suitability of the approach in a simulation study and illustrate how the method can be used in real-world applications.

The simulation study indicated a positive relationship between LPD and prediction performance and highlighted the value of the approach compared with considering only the distance to the nearest data point. We therefore suggest the calculation of the LPD to support the assessment of prediction uncertainties.
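
The counting idea behind the LPD described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_point_density`, the use of unweighted Euclidean distance, and the toy data are assumptions made for illustration, and the dissimilarity threshold is simply passed in as a number rather than derived from cross-validation as in Meyer and Pebesma (2021).

```python
import numpy as np


def local_point_density(train, new, threshold):
    """Count, for each new point, the training points within `threshold`.

    Hypothetical helper: assumes predictors are already on comparable
    scales and that plain Euclidean distance in predictor space is used.
    """
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    # Pairwise differences, shape (n_new, n_train, n_predictors)
    diff = new[:, None, :] - train[None, :, :]
    # Euclidean distances in predictor space, shape (n_new, n_train)
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Number of training points within the threshold, per new point
    return (dist <= threshold).sum(axis=1)


# Toy predictor values (assumed, not from the preprint)
train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0]])
new = np.array([[0.05, 0.05], [2.0, 2.1], [5.0, 5.0]])

lpd = local_point_density(train, new, threshold=0.5)
print(lpd)  # [3 1 0]
```

In this toy example the first new point lies in a well-sampled region (three similar training points), the second is close to a single training point, and the third has no similar training point at all. A nearest-distance measure alone would not distinguish the first two cases as clearly, which is the complementarity the abstract argues for.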

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 09 Jan 2025)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2024-2730', Anonymous Referee #1, 07 Dec 2024
  • CEC1: 'Comment on egusphere-2024-2730 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Dec 2024
    • AC1: 'Reply on CEC1', Fabian Schumacher, 11 Dec 2024
      • CEC2: 'Reply on AC1', Juan Antonio Añel, 12 Dec 2024

Viewed

Total article views: 187 (including HTML, PDF, and XML)
  • HTML: 143
  • PDF: 32
  • XML: 12
  • Total: 187
  • BibTeX: 2
  • EndNote: 4
Latest update: 13 Dec 2024
Short summary
Machine learning is increasingly used in environmental sciences for spatial predictions, but its effectiveness is challenged when models are applied beyond the areas they were trained on. We propose a Local Training Data Point Density (LPD) approach that considers how well a model's environment is represented by training data. This method provides a valuable tool for evaluating model applicability and uncertainties, crucial for broader scientific and practical applications.