Preprints
https://doi.org/10.5194/egusphere-2024-2730
14 Nov 2024

Estimation of local training data point densities to support the assessment of spatial prediction uncertainty

Fabian Lukas Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer

Abstract. Machine learning is frequently used in environmental and earth sciences to produce spatial or spatio-temporal predictions of environmental variables based on limited field samples – increasingly even on a global scale and far beyond the location of available training data. Since new geographic space often goes along with new environmental properties represented by the model's predictors, and since machine learning models do not perform well in extrapolation, this raises questions regarding the applicability of the trained models at the prediction locations.

Methods to assess the area of applicability of spatial prediction models have recently been suggested and applied. These are typically based on the distance in the predictor space between a prediction data point and its nearest reference data point, which represents the similarity to the training data. However, we assume that the density of the training data in the predictor space, i.e. how well an environment is represented in a model, strongly affects prediction quality and complements the consideration of distances.

We therefore suggest a local training data point density (LPD) approach. The LPD is a quantitative measure that indicates, for a new prediction location, how many similar reference data points were included in the model training. Similarity is defined here by the dissimilarity threshold introduced by Meyer and Pebesma (2021), which is the maximum distance to the nearest training data point in the predictor space as observed during cross-validation. We assess the suitability of the approach in a simulation study and illustrate how the method can be used in real-world applications.
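
The counting idea behind the LPD can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `local_point_density`, the plain Euclidean distance on standardised predictors, and the use of `<=` for the comparison are assumptions made here for illustration, and the threshold is simply passed in rather than derived from cross-validation as described by Meyer and Pebesma (2021).

```python
import numpy as np


def local_point_density(train, new, threshold):
    """Hypothetical helper: count training points within `threshold` of each new point.

    Assumptions (not taken from the preprint): predictors are standardised with
    the training mean and standard deviation, distances are unweighted Euclidean,
    and `threshold` is supplied by the caller.
    """
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)

    # Standardise both sets using statistics of the training data only.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant predictors
    train_s = (train - mean) / std
    new_s = (new - mean) / std

    # Pairwise distances in predictor space, shape (n_new, n_train).
    diff = new_s[:, None, :] - train_s[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    lpd = (dist <= threshold).sum(axis=1)  # number of similar training points
    nearest = dist.min(axis=1)             # distance to the nearest training point
    return lpd, nearest


# One predictor, five training points; the second new point lies far outside them.
train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
new = [[2.0], [10.0]]
lpd, nearest = local_point_density(train, new, threshold=0.8)
print(lpd, nearest)  # LPD is 3 for the first point and 0 for the second
```

In this toy case the nearest-neighbour distance only says how far a new point is from a single training point, whereas the count also reflects how many training points support it, which is the complementary information the abstract argues for.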

The simulation study indicated a positive relationship between LPD and prediction performance and highlighted the value of the approach compared to considering only the distance to the nearest data point. We therefore suggest calculating the LPD to support the assessment of prediction uncertainties.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Journal article(s) based on this preprint

19 Dec 2025
Estimation of local training data point densities to support the assessment of spatial prediction uncertainty
Fabian Lukas Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
Geosci. Model Dev., 18, 10185–10202, https://doi.org/10.5194/gmd-18-10185-2025, 2025

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2024-2730', Anonymous Referee #1, 07 Dec 2024
    • AC2: 'Reply on RC1', Fabian Schumacher, 20 May 2025
  • CEC1: 'Comment on egusphere-2024-2730 - No compliance with the policy of the journal', Juan Antonio Añel, 08 Dec 2024
    • AC1: 'Reply on CEC1', Fabian Schumacher, 11 Dec 2024
      • CEC2: 'Reply on AC1', Juan Antonio Añel, 12 Dec 2024
  • RC2: 'Comment on egusphere-2024-2730', Anonymous Referee #2, 12 May 2025
    • AC3: 'Reply on RC2', Fabian Schumacher, 20 May 2025
  • RC3: 'Comment on egusphere-2024-2730', Anonymous Referee #3, 21 May 2025
    • AC4: 'Reply on RC3', Fabian Schumacher, 29 May 2025

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Fabian Schumacher on behalf of the Authors (25 Jun 2025): Author's response, Author's tracked changes, Manuscript
ED: Referee Nomination & Report Request started (05 Sep 2025) by Yongze Song
RR by Anonymous Referee #2 (12 Sep 2025)
RR by Anonymous Referee #1 (16 Sep 2025)
ED: Publish as is (18 Sep 2025) by Yongze Song
AR by Fabian Schumacher on behalf of the Authors (20 Oct 2025)

Viewed

Total article views: 1,298 (including HTML, PDF, and XML)
  • HTML: 1,058
  • PDF: 197
  • XML: 43
  • Total: 1,298
  • BibTeX: 32
  • EndNote: 54
Views and downloads calculated since 14 Nov 2024.

Viewed (geographical distribution)

Total article views: 1,284 (including HTML, PDF, and XML) Thereof 1,284 with geography defined and 0 with unknown origin.
Latest update: 19 Dec 2025

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary
Machine learning is increasingly used in environmental sciences for spatial predictions, but its effectiveness is challenged when models are applied beyond the areas they were trained on. We propose a Local Training Data Point Density (LPD) approach that considers how well a model's environment is represented by training data. This method provides a valuable tool for evaluating model applicability and uncertainties, crucial for broader scientific and practical applications.