the Creative Commons Attribution 4.0 License.
kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation
Abstract. Random and spatial Cross-Validation (CV) methods are commonly used to evaluate machine learning-based spatial prediction models, and the obtained performance values are often interpreted as map accuracy estimates. However, the appropriateness of such approaches is currently the subject of controversy. For the common case where no probability sample for validation purposes is available, in Milà et al. (2022) we proposed the Nearest Neighbour Distance Matching (NNDM) Leave-One-Out (LOO) CV method. This method produces a distribution of geographical Nearest Neighbour Distances (NND) between test and train locations during CV that matches the distribution of NND between prediction and training locations. Hence, it creates predictive conditions during CV that are comparable to what is required when predicting a defined area. Although NNDM LOO CV produced largely reliable map accuracy estimates in our analysis, as a LOO-based method, it cannot be applied to large datasets found in many studies.
Here, we propose a novel k-fold CV strategy for map accuracy estimation inspired by the concepts of NNDM LOO CV: the k-fold NNDM (kNNDM) CV. The kNNDM algorithm tries to find a k-fold configuration such that the Empirical Cumulative Distribution Function (ECDF) of NND between test and train locations during CV is matched to the ECDF of NND between prediction and training locations.
We tested kNNDM CV in a simulation study with different sampling distributions and compared it to other CV methods including NNDM LOO CV. We found that kNNDM CV performed similarly to NNDM LOO CV and produced reasonably reliable map accuracy estimates across sampling patterns with strong reductions in computation time for large sample sizes. Furthermore, we found a positive linear association between the quality of the match of the two ECDFs in kNNDM and the reliability of the map accuracy estimates.
kNNDM provided the advantages of our original NNDM LOO CV strategy while bypassing its sample size limitations.
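The core idea of the abstract — matching the ECDF of nearest-neighbour distances (NND) during CV to the ECDF of prediction-to-train NNDs — can be illustrated with a short sketch. This is an illustrative Python sketch, not the authors' R implementation (their code is linked below under "Model code and software"); the function names are made up here, and the mismatch between the two ECDFs is measured as the Wasserstein-1 distance, i.e. the area between the two curves:

```python
import numpy as np

def nnd(from_pts, to_pts):
    # Nearest-neighbour distance from each point in from_pts to to_pts.
    d = np.linalg.norm(from_pts[:, None, :] - to_pts[None, :, :], axis=2)
    return d.min(axis=1)

def nnd_cv(train_pts, folds):
    # For each training point, distance to the nearest point that is
    # NOT in its own fold -- the distances realised during k-fold CV.
    out = np.empty(len(train_pts))
    for f in np.unique(folds):
        test = folds == f
        out[test] = nnd(train_pts[test], train_pts[~test])
    return out

def w_statistic(sample_a, sample_b):
    # Wasserstein-1 distance between two empirical distributions,
    # computed as the area between their ECDFs.
    all_v = np.sort(np.concatenate([sample_a, sample_b]))
    cdf_a = np.searchsorted(np.sort(sample_a), all_v, side="right") / len(sample_a)
    cdf_b = np.searchsorted(np.sort(sample_b), all_v, side="right") / len(sample_b)
    return np.sum(np.abs(cdf_a - cdf_b)[:-1] * np.diff(all_v))
```

Under this reading, a fold configuration is good when `w_statistic(nnd_cv(train, folds), nnd(pred, train))` is small: the CV then probes the model at the same geographical distances at which it will actually have to predict.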
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
CC1: 'Comment on egusphere-2023-1308', Nils Tjaden, 07 Jul 2023
Just a quick hint that https://doi.org/10.1016/j.jag.2023.103364 was just published - may or may not be relevant for your discussion.
Citation: https://doi.org/10.5194/egusphere-2023-1308-CC1
RC1: 'Comment on egusphere-2023-1308', Italo Goncalves, 23 Aug 2023
The manuscript presents a much-needed methodology for cross-validation of spatial data. In my opinion, the strongest point is the use of the W statistic to identify the best CV split. However, there are a few points which I feel should be addressed in the discussion.
- The proposed methodology using clustering algorithms seems valid, but how can we know if it provides the best possible result? An algorithm that optimizes the W statistic directly as a function of the CV fold indices would be more desirable, instead of relying on the clustering algorithm's internal metric as a proxy. As a suggestion for future work, I recommend using a genetic algorithm to assign CV indices to the data points directly.
- The W statistic explained 60% of the variability in map accuracy, but would this be consistent across different datasets? At least one more case study would be needed to verify this.
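The reviewer's suggestion of optimising the W statistic directly over fold indices could be sketched roughly as follows. This is a hypothetical illustration, not the paper's method or a full genetic algorithm: it uses simple greedy point-wise reassignment, SciPy's `wasserstein_distance` as the ECDF mismatch, and made-up function names:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import wasserstein_distance

def cv_nnd(pts, folds):
    """Nearest-neighbour distance from each point to the nearest
    point outside its own fold."""
    d = cdist(pts, pts)
    out = np.empty(len(pts))
    for f in np.unique(folds):
        m = folds == f
        out[m] = d[np.ix_(m, ~m)].min(axis=1)
    return out

def optimise_folds(train, pred, k=5, iters=2000, seed=0):
    """Greedily move single points between folds whenever doing so
    shrinks the Wasserstein distance between the CV NND distribution
    and the prediction-to-train NND distribution."""
    rng = np.random.default_rng(seed)
    target = cdist(pred, train).min(axis=1)   # prediction-point NNDs
    folds = rng.integers(0, k, len(train))
    best = wasserstein_distance(cv_nnd(train, folds), target)
    for _ in range(iters):
        i = int(rng.integers(len(train)))
        old = folds[i]
        if (folds == old).sum() <= 1:         # keep every fold non-empty
            continue
        folds[i] = rng.integers(k)
        w = wasserstein_distance(cv_nnd(train, folds), target)
        if w <= best:
            best = w                           # keep the improving move
        else:
            folds[i] = old                     # revert
    return folds, best
```

A genetic algorithm, as the reviewer proposes, would replace the single-point moves with crossover and mutation over whole fold-index vectors; the objective function would stay the same.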
Minor comment:
Line 90: cross out "the".
Citation: https://doi.org/10.5194/egusphere-2023-1308-RC1
RC2: 'Comment on egusphere-2023-1308', Anonymous Referee #2, 23 Aug 2023
The study proposes a novel cross-validation method for spatial data that aims to deliver more representative measurements of spatial map accuracy than commonly-used methods. This is a relevant concern for GMD readers with the rise in use of machine learning methods for geoscientific modelling. Issues with model evaluation in the spatial setting have been identified in a number of recent studies. The paper is well-written and contributes a practical solution for a common issue.
In my opinion, the most exciting/innovative idea in this work is the concept of defining the evaluation method based on the desired data for which the model is intended to return predictions. This would require researchers to more carefully define the purpose of their models before and during the model creation process, which should be common practice. In reality, this is often not done, or done in a 'standard' way which doesn't accurately reflect the intended use of the model.
The method presented in this paper is a very practical solution to this, where the desired target dataset is an input of the evaluation algorithm and therefore researchers are required to clearly consider and define it. I think this is a significant contribution to model development methodology and should be more clearly emphasised in the manuscript. The possibilities, benefits and disadvantages of this concept could also be discussed - for example, when models are used in production, the prediction area is a moving target; would that require continual re-evaluation?
The paper suggests that kNNDM is, essentially, a computationally-cheaper alternative to the previously-published method by the authors, NNDM LOO. In the article, the only limitation of leave-one-out CV methods described is that of computational time. However, to my knowledge, even if computation is not considered, LOO CV methods may not be the optimum method due to higher variation in the resulting models (due to the bias-variance tradeoff). Could this explain why kNNDM 10-fold seems to perform better in the case of strong clusters (Figure 5)? For me, this would be more convincing than the computation speedup comparison, which is relatively trivial given that LOO CV is the most extreme version of k-fold CV.
Following on from this, it seems likely that the value of k would impact the results. Use of 10 folds is very common; is there theoretical justification for this? It would be useful to see some comparisons of the results with multiple values of k.
In Figure 1, it is shown that the W statistic will also be larger if training points are regularly distributed, as well as when clustered. Does this mean that the null hypothesis might be rejected for regularly distributed datapoints? Does this explain why NNDM LOO performed better for regularly distributed data (Figure 5)?
Finally, I would recommend testing the method on at least one additional dataset, as the results presumably depend on the spatial autocorrelation present in the dataset used.
Minor comment: I assume the hyperparameters of the models are not tuned as it is not mentioned, but this could be stated explicitly.
Citation: https://doi.org/10.5194/egusphere-2023-1308-RC2
AC1: 'Comment on egusphere-2023-1308', Jan Linnenbrink, 19 Oct 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1308/egusphere-2023-1308-AC1-supplement.pdf
Peer review completion
Model code and software
kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation Jan Linnenbrink, Carles Milà, Marvin Ludwig, and Hanna Meyer https://doi.org/10.6084/m9.figshare.23514135.v1
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
1,015 | 299 | 46 | 1,360 | 46 | 36
Cited
3 citations as recorded by crossref.
- kNNDM CV: k-fold nearest-neighbour distance matching cross-validation for map accuracy estimation J. Linnenbrink et al. 10.5194/gmd-17-5897-2024
- Random forests with spatial proxies for environmental modelling: opportunities and pitfalls C. Milà et al. 10.5194/gmd-17-6007-2024
- Adopting yield-improving practices to meet maize demand in Sub-Saharan Africa without cropland expansion F. Aramburu-Merlos et al. 10.1038/s41467-024-48859-0
Jan Linnenbrink
Carles Milà
Marvin Ludwig
Hanna Meyer