This work is distributed under the Creative Commons Attribution 4.0 License.
Random forests with spatial proxies for environmental modelling: opportunities and pitfalls
Abstract. Spatial proxies such as coordinates and Euclidean distance fields are often added as predictors in random forest models; however, their suitability in different predictive conditions has not yet been thoroughly assessed. We investigated (1) the conditions under which spatial proxies are suitable, (2) the reasons for such suitability, and (3) how proxy suitability can be assessed using cross-validation.
In a simulation and two case studies, we found that adding spatial proxies improved model performance when residual spatial autocorrelation was present and training samples were regularly or randomly distributed. Otherwise, including proxies was neutral or counterproductive, and it resulted in feature extrapolation for clustered samples. Random k-fold cross-validation systematically favoured models with spatial proxies, even when they were not appropriate.
As the benefits of spatial proxies are not universal, we recommend using spatial exploratory and validation analyses to determine their suitability, and considering alternative, inherently spatial RF-GLS models.
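The setup the abstract describes, adding coordinates and Euclidean distance fields (EDFs) as extra predictors to a random forest, can be sketched as follows. The paper's analyses are in R; this is a minimal illustrative Python version with simulated data and hypothetical reference points (the study-area corners):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: n samples with coordinates and one covariate.
n = 200
coords = rng.uniform(0, 100, size=(n, 2))               # x, y locations
x1 = rng.normal(size=n)                                 # environmental covariate
y = 2 * x1 + 0.05 * coords[:, 0] + rng.normal(size=n)   # smooth spatial trend + noise

# Euclidean distance fields: distance from each sample to a few fixed
# reference points (here the four corners of the study area).
refs = np.array([[0, 0], [0, 100], [100, 0], [100, 100]])
edf = np.linalg.norm(coords[:, None, :] - refs[None, :, :], axis=2)

# Stack the covariate, raw coordinates, and EDFs as predictors.
X = np.column_stack([x1, coords, edf])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(round(rf.score(X, y), 2))  # in-sample R², optimistic by construction
```

Whether these extra spatial columns help or hurt out of sample is exactly what the paper's simulations assess.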
Status: closed

RC1: 'Comment on egusphere-2024-138', Anonymous Referee #1, 07 Feb 2024
This manuscript takes random forests as an example to analyse spatial proxies such as coordinates and Euclidean distance fields in environmental modelling, which has positive value for spatial analysis based on machine learning models. However, the work has some significant shortcomings: (1) Like other models, random forests require a set of influencing or predictive factors. The use of proxies for environmental factors in spatial analysis models is therefore not specific to random forest models, and the authors are recommended to provide additional information on this point. (2) Coordinates and Euclidean distance fields are used as spatial proxies either because these spatial factors influence the target, or because spatial regions are needed to reflect certain undiscovered factors. This is determined by the specific application, and such a spatial proxy is reasonable even when the accuracy of some models does not appear to improve numerically. This important aspect was not taken into account in the manuscript. (3) It is meaningless to evaluate the superiority or inferiority of a given proxy solely based on the accuracy of the final results, without considering the specific problem. In summary, I recommend rejecting the manuscript.
Citation: https://doi.org/10.5194/egusphere-2024-138-RC1
RC2: 'Comment on egusphere-2024-138', Carsten F. Dormann, 07 Feb 2024
This study compares different approaches to address spatial autocorrelation in random forest analyses. Using simulated data, and two case studies, the authors assess prediction error and variable importance for differently clustered spatial data.
The study finds that the clustering of spatial data had a substantial effect on RMSE, and that the differences between the spatial random forest variants were smaller than that effect.
There are a few points I do, and a few I do not, like about this study. On the plus side, I think the comparison of the RF approaches is comprehensive and reflects nicely what people have been doing in the past. The evaluation against simulated data is how it should be (see caveat below), and I find the attempts to interpret performance using the AOA nice and useful. The case studies illustrate the application case well, and also the problems, particularly using lat/lon as a predictor.
My main criticisms are these:
1. The goal of the study does not become clear in the introduction and is confounded throughout the paper. To me, the structure should relate to the three “scenarios” one could have in mind for this study: interpolation, extrapolation and effect estimation (predictive inference). These targets are very different and need to be assessed differently, too. For example, regression kriging is an interpolation method (by design), while identifying importance is effect estimation. Extrapolation (to regions beyond the training data) is most often the target in the simulations presented here, but not always. I therefore find that the results sometimes confound these different issues, and I have trouble interpreting them.
This also relates to the CV strategy. For interpolation error, random CV is fine; for extrapolation it is not. Thus, if the authors find a difference between random CV and kNNDM CV, this may or may not be relevant, depending on the goal of the study.
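The distinction the reviewer draws can be illustrated with a toy experiment: with clustered samples and a spatially smooth target, random k-fold CV scores are typically more optimistic than spatially grouped CV. A minimal Python sketch, using k-means groups as a crude stand-in for kNNDM-style spatial folds (all data simulated for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Clustered samples with a spatially smooth target: nearby points share information.
n = 300
centers = rng.uniform(20, 80, size=(5, 2))
coords = centers[rng.integers(0, 5, n)] + rng.normal(scale=3, size=(n, 2))
y = np.sin(coords[:, 0] / 15) + np.cos(coords[:, 1] / 15) + rng.normal(scale=0.3, size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Random k-fold: test points sit next to training points -> optimistic error.
random_cv = cross_val_score(rf, coords, y, cv=KFold(5, shuffle=True, random_state=0),
                            scoring="neg_root_mean_squared_error")

# Spatially grouped folds: whole clusters are held out, so test locations
# are far from the training ones, mimicking spatial extrapolation.
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
spatial_cv = cross_val_score(rf, coords, y, cv=GroupKFold(5), groups=groups,
                             scoring="neg_root_mean_squared_error")

print(round(-random_cv.mean(), 2), round(-spatial_cv.mean(), 2))
```

Which of the two numbers is the "right" error estimate depends on whether the prediction target is interpolation or extrapolation, which is the reviewer's point.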
I think this problem pertains particularly to the introduction and the results and will not require much work to address.
2. Spatial autocorrelation is here entirely driven by the environmental variables, when in the real world it is also related to mass effects (dispersal, diffusion, contagion): the error is spatially autocorrelated, too. In my understanding, the authors did not address that. That is problematic, as Table 1’s second row is thus a model WITHOUT spatial autocorrelation in the residuals, as all predictors are present to correct for it. This is, from a statistical point of view, a no-problem data set. While that does in no way invalidate the simulations, it must be clearly communicated that this is NOT a situation one would even consider using spatial representation for: no residual SA, no problem.
Spatial error in the residuals has been simulated in previous such studies (since Dormann et al. 2007 Ecography), and is a bit annoying to fine-tune; it can be done, though, and I think it should be done (see also simData here: https://github.com/biometry/FReibier/tree/master).
On the back of such simulation, a GLS fitted to the correct model would give the best possible reference analysis; anything better than that would be a biased assessment of error.
3. Minimum RMSE according to simulations should be indicated in Fig. 3. Since the authors use the standard normal as error distribution, the best possible RMSE should also be 1. On average, it is closer to 2, but for the “complete, range 40” it looks as if it was below 1. That would be, well, surprising and important for interpretation: a fit into the spatial noise.
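The RMSE floor the reviewer refers to follows directly from the irreducible noise: with standard normal errors, even the true model cannot beat an out-of-sample RMSE of 1. A quick numerical check in Python (the signal f here is hypothetical; only the error distribution matters):

```python
import numpy as np

rng = np.random.default_rng(3)

# If y = f(x) + e with e ~ N(0, 1), even the true f cannot beat
# RMSE = sd(e) = 1 on new data: the noise is irreducible.
n = 100_000
x = rng.uniform(size=n)
f = np.sin(2 * np.pi * x)        # hypothetical true signal
y = f + rng.normal(size=n)       # standard normal error

rmse_true_model = np.sqrt(np.mean((y - f) ** 2))
print(round(rmse_true_model, 2))  # ≈ 1.0
```

Any reported RMSE below the error SD therefore signals that the model has fitted the noise rather than the signal.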
4. While I read about and like the kNNDM, I still prefer a truly independent test data set. Since the authors have invented the data, they could simply extend the area by doubling it to one side, and use that second half for validation in the sense of a true extrapolation. My feeling is that kNNDM will work well if the range of the data is much larger than the spatial autocorrelation, but not if the SA is large relative to the spatial extent. An extent of 100x100 is “only” 2.5 times larger than the range of 40. Thus, the sampled data points will fall within the SA range virtually always (Fig. 1.2). I am not convinced that this is an independent-enough test case.
5. More as a suggestion: The problem of using spatial coordinates or proxies is that they replace the causal predictors in a random forest due to collinearity. As a consequence, the importances are wrongly estimated. One can, for the simulated X, compute how well each predictor can be represented by the specific spatial predictors used. If X1 can be predicted by lat/lon or the EDFs or distances with an r of 0.8 (or so), then clearly they will compete for explanation and substantially bias importance estimates. That is the reason why the ME approach in spdep adds the PCNMs only to explain the RESIDUALS of the model, after fitting the non-space variables X1-X6.
So, I would be interested in seeing how well space can replace actual predictors. Either by reporting such “predictability of predictors by space” in the appendix, or by having another model entirely without X1-X6 in the comparison.
This would also tie in nicely with the difference between inter/extrapolation: spaceonly should work fine for inter, but fail for extrapolation.
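The suggested “predictability of predictors by space” check is straightforward to run: regress each covariate on the coordinates and record the cross-validated fit. A minimal Python sketch contrasting a spatially smooth covariate (like the simulated X1) with pure noise, both fabricated for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# A spatially smooth covariate vs a covariate with no spatial structure.
n = 400
coords = rng.uniform(0, 100, size=(n, 2))
x_smooth = np.sin(coords[:, 0] / 20) + np.cos(coords[:, 1] / 20)  # spatially structured
x_noise = rng.normal(size=n)                                      # no spatial structure

rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Cross-validated R² of "predict the covariate from the coordinates alone".
r2_smooth = cross_val_score(rf, coords, x_smooth, cv=5, scoring="r2").mean()
r2_noise = cross_val_score(rf, coords, x_noise, cv=5, scoring="r2").mean()

print(round(r2_smooth, 2), round(r2_noise, 2))
```

A high R² for a covariate signals that spatial proxies will compete with it for splits and bias its importance estimate, which is the collinearity mechanism described above.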
Overall, I think this is a nice paper almost as it is, and with a little bit more integration of WHY we would expect which approach to work better, and a clearer structuring of the purposes of the analysis, it will be just fine. IMHO it could be a greater paper if the authors would allow for spatial autocorrelation in the error term and try to get to the bottom of WHEN space affects inference (i.e. here: importance) of predictors (that is, investigate the effect of collinearity on interpolation, extrapolation, and variable importance).
Minor points:
L56: Also cite other people’s work here, much earlier, e.g. Le Rest et al. 2014 and whatever else we cited in Roberts et al. (2017 Ecography) on that topic.
L57: I find the restriction to RF too narrow. This is a logical and fundamental problem, not one specific to RF. Ploton et al. (2020) showed it for random forest, Kattenborn et al. (2022) for CNNs. It is the same problem of extrapolation in space with poor design for the CV.
L73: “scenarios” are what I called “target”, “goal” or “purpose”: Make clear what the goals are in the intro!
L178: Why would anybody use randomForest and not ranger? Much faster and hence less energy consumption.
What is the point of Fig. 7? I can see neither RMSEs nor biases nor anything else, so why look at these maps? Also, we are typically more impressed by high-resolution maps, even if they are completely wrong; map visualisation is thus often uninformative or even misleading.
What is the point of Fig. 8, apart from the funny lines in “A Coordinates”?
Tables 2 and 3: Where are the standard errors on these estimates? (Yes, I understood that some of them are a bit of a pain to compute for one of the models. Still, without an estimate of the error, how can the reader interpret a value of “0.92” vs “0.87”? They might well be the same value if SD = 0.4.)
I missed the discussion of some existing approaches to massage space into ML:
Hajjem, A., Bellavance, F., & Larocque, D. (2011). Mixed effects regression trees for clustered data. Statistics & Probability Letters, 81(4), 451–459. https://doi.org/10.1016/j.spl.2010.12.003
Hajjem, A., Bellavance, F., & Larocque, D. (2014). Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6), 1313–1328. https://doi.org/10.1080/00949655.2012.741599
Li, L., Girguis, M., Lurmann, F., Wu, J., Urman, R., Rappaport, E., Ritz, B., Franklin, M., Breton, C., Gilliland, F., & Habre, R. (2019). Cluster-based bagging of constrained mixed-effects models for high spatiotemporal resolution nitrogen oxides prediction over large regions. Environment International, 128, 310–323. https://doi.org/10.1016/j.envint.2019.04.057
Li, L., Lurmann, F., Habre, R., Urman, R., Rappaport, E., Ritz, B., Chen, J.-C., Gilliland, F. D., & Wu, J. (2017). Constrained mixed-effect models with ensemble learning for prediction of nitrogen oxides concentrations at high spatiotemporal resolution. Environmental Science & Technology, 51(17), 9920–9929. https://doi.org/10.1021/acs.est.7b01864
Zhan, Y., Luo, Y., Deng, X., Chen, H., Grieneisen, M. L., Shen, X., Zhu, L., & Zhang, M. (2017). Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmospheric Environment, 155, 129–139. https://doi.org/10.1016/j.atmosenv.2017.02.023
Kattenborn, T., Schiefer, F., Frey, J., Feilhauer, H., Mahecha, M. D., & Dormann, C. F. (2022). Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks. ISPRS Open Journal of Photogrammetry and Remote Sensing, 5, 100018. https://doi.org/10.1016/j.ophoto.2022.100018
Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., & Bretagnolle, V. (2014). Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography, 23, 811–820. https://doi.org/10.1111/geb.12161
Ploton, P., Mortier, F., Réjou-Méchain, M., Barbier, N., Picard, N., Rossi, V., Dormann, C., Cornu, G., Viennois, G., Bayol, N., Lyapustin, A., Gourlet-Fleury, S., & Pélissier, R. (2020). Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nature Communications, 11(1), Article 1. https://doi.org/10.1038/s41467-020-18321-y
Citation: https://doi.org/10.5194/egusphere-2024-138-RC2
AC1: 'Comment on egusphere-2024-138', Carles Milà, 02 May 2024
Data sets
Code and data for "Random forests with spatial proxies for environmental modelling: opportunities and pitfalls" Carles Milà https://zenodo.org/records/10495235
Viewed
HTML: 367 | PDF: 120 | XML: 24 | Total: 511 | BibTeX: 20 | EndNote: 11