Random forests with spatial proxies for environmental modelling: opportunities and pitfalls

Milà, Carles; Ludwig, Marvin; Pebesma, Edzer; Tonne, Cathryn; Meyer, Hanna

doi:https://doi.org/10.5194/egusphere-2024-138

Preprints

24 Jan 2024

Abstract. Spatial proxies such as coordinates and Euclidean distance fields are often added as predictors in random forest models; however, their suitability in different predictive conditions has not yet been thoroughly assessed. We investigated 1) the conditions under which spatial proxies are suitable, 2) the reasons for such adequacy, and 3) how proxy suitability can be assessed using cross-validation.

In a simulation and two case studies, we found that adding spatial proxies improved model performance when residual spatial autocorrelation was present and training samples were regularly or randomly distributed. Otherwise, including proxies was neutral or counterproductive, and resulted in feature extrapolation for clustered samples. Random k-fold cross-validation systematically favoured models with spatial proxies, even when they were not appropriate.

As the benefits of spatial proxies are not universal, we recommend using spatial exploratory and validation analyses to determine their suitability, and considering alternative inherently spatial RF-GLS models.
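As a rough illustration of what "spatial proxies" can look like in practice, the sketch below builds two proxy sets for a single sample location: the raw coordinates, and Euclidean distance fields, here assumed to be the distances to the four corners and the centre of a rectangular study area. The function name, the choice of reference points and the 100 x 100 extent are illustrative assumptions, not the authors' code.

```python
import math

def spatial_proxies(x, y, xmin=0.0, ymin=0.0, xmax=100.0, ymax=100.0):
    """Return coordinate and distance-field proxies for one location.

    Assumption: distance fields are distances to the four corners and
    the centre of the bounding box; other reference points are possible.
    """
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    centre = ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)
    edf = [math.hypot(x - cx, y - cy) for cx, cy in corners + [centre]]
    return {"coordinates": [x, y], "edf": edf}

proxies = spatial_proxies(30.0, 40.0)
# Extra columns that would sit alongside the environmental predictors.
proxy_row = proxies["coordinates"] + proxies["edf"]
```

Either proxy set would be appended to the environmental predictors before fitting the forest; whether that helps is exactly what the abstract says depends on the sampling design and on residual autocorrelation.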

Received: 16 Jan 2024 – Discussion started: 24 Jan 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: closed

RC1: 'Comment on egusphere-2024-138', Anonymous Referee #1, 07 Feb 2024

This manuscript takes random forests as an example to analyze spatial agents such as coordinates and Euclidean distance fields in environmental modeling, which has positive value for spatial analysis based on machine learning models. However, the work has some significant shortcomings: (1) Like other models, random forests require a set of influencing or predictive factors. The use of proxies for environmental factors in spatial analysis models is therefore not specific to random forest models, and it is recommended that the authors provide additional information on this point. (2) Coordinates and Euclidean distance fields are used as spatial factor proxies either because these spatial factors influence the target or because spatial regions are needed to reflect certain undiscovered factors. This is determined by the specific work, and such a spatial agent is undoubtedly reasonable even if the accuracy obtained in some models does not appear to improve numerically. This important aspect was not taken into account in the manuscript. (3) It is meaningless to evaluate the superiority or inferiority of a certain agent solely on the accuracy of the final results, without considering the specific problem. In summary, it is recommended to reject the manuscript.

Citation: https://doi.org/10.5194/egusphere-2024-138-RC1

RC2: 'Comment on egusphere-2024-138', Carsten F. Dormann, 07 Feb 2024

This study compares different approaches to address spatial autocorrelation in random forest analyses. Using simulated data, and two case studies, the authors assess prediction error and variable importance for differently clustered spatial data.
The study finds that clustering of spatial data had a substantial effect on RMSE, and that different spatial random forest versions differed less than that effect.

There are a few points I like, and a few I do not like, about this study. On the plus side, I think the comparison of the RF-approaches is comprehensive and reflects nicely what people have been doing in the past. The evaluation against simulated data is how it should be (see caveat below), and I find the attempts to interpret performance using AOA nice and useful. The case studies illustrate the application case well, and also the problems, particularly using lat/lon as predictor.

My main criticisms are these:

1. The goal of the study does not become clear in the introduction and is confounded throughout the paper. To me, the structure should relate to the three “scenarios” one could have in mind for this study: interpolation, extrapolation and effect estimation (predictive inference). These targets are very different and need to be assessed differently, too. For example, regression kriging is an interpolation method (by design), while identifying importance is effect estimation. Extrapolation (to regions beyond the training data) is explicitly most often the target in the simulations presented here, but not always. I find that the results hence sometimes confound the different issues, and I have trouble interpreting them.
This also relates to the CV strategy. For interpolation error, random CV is fine, for extrapolation it is not. Thus, if the authors find a difference between randomCV and kNNDM-CV, this may or may not be relevant, depending on the goal of the study.
I think this problem pertains particularly to the introduction and the results and will not require much work to address.
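The contrast between random CV and a spatially structured CV can be made concrete with a small sketch. It compares how far held-out points are from their nearest training point under random folds and under folds made of vertical strips. The strip design is a deliberately simple stand-in for spatial CV and is not kNNDM; the grid, fold count and function names are assumptions made for illustration.

```python
import math
import random

def random_folds(n, k, seed=1):
    """Assign n samples to k balanced folds at random."""
    folds = [i % k for i in range(n)]
    random.Random(seed).shuffle(folds)
    return folds

def strip_folds(points, k, xmin, xmax):
    """Assign samples to k vertical strips of equal width (simple block CV)."""
    width = (xmax - xmin) / k
    return [min(int((x - xmin) / width), k - 1) for x, _ in points]

def mean_test_to_train_distance(points, folds):
    """Mean distance from each held-out point to its nearest training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if folds[j] != folds[i]
        )
    return total / len(points)

# Hypothetical regular 20 x 20 sample grid with unit spacing.
points = [(float(i), float(j)) for i in range(20) for j in range(20)]
d_random = mean_test_to_train_distance(points, random_folds(len(points), 4))
d_block = mean_test_to_train_distance(points, strip_folds(points, 4, 0.0, 20.0))
```

Under random folds nearly every held-out point has a training neighbour one grid step away, which mimics interpolation; the strip folds push held-out points further from the training data, which is closer to an extrapolation setting.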

2. Spatial autocorrelation is entirely related to environmental variables, when in the real world it is also related to mass effects (dispersal, diffusion, contagion): the error is spatially autocorrelated, too. In my understanding, the authors did not address that. That is problematic, as Table 1’s second row is thus a model WITHOUT spatial autocorrelation in the residuals, as all predictors are present to correct for it. This is, from a statistical point of view, a no-problem data set. While that does in no way invalidate the simulations, it must be clearly communicated that this is NOT a situation one would even consider using spatial representation for: no residual SA, no problem.
Spatial error in the residuals has been simulated in previous such studies (since Dormann et al. 2007 Ecography), and is a bit annoying to fine-tune; it can be done, though, and I think it should be done (see also simData here: https://github.com/biometry/FReibier/tree/master).
On the back of such simulation, a GLS fitted to the correct model would give the best possible reference analysis; anything better than that would be a biased assessment of error.
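One common way to simulate a spatially autocorrelated error term is to draw from a multivariate normal with a distance-based covariance. The sketch below assumes an exponential covariance and uses a hand-written Cholesky factorisation so that it stays dependency-free. The covariance model, parameter values and function names are assumptions for illustration; this is not the simData routine referenced above.

```python
import math
import random

def exponential_cov(points, sill=1.0, corr_range=10.0):
    """Covariance matrix C_ij = sill * exp(-d_ij / corr_range)."""
    return [
        [sill * math.exp(-math.hypot(xi - xj, yi - yj) / corr_range)
         for xj, yj in points]
        for xi, yi in points
    ]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def simulate_spatial_error(points, sill=1.0, corr_range=10.0, seed=1):
    """Draw one realisation e = L z with z ~ N(0, I), so that Cov(e) = C."""
    rng = random.Random(seed)
    low = cholesky(exponential_cov(points, sill, corr_range))
    z = [rng.gauss(0.0, 1.0) for _ in points]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(points))]

# Hypothetical sample locations in a 100 x 100 study area.
rng = random.Random(0)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(40)]
error = simulate_spatial_error(points, sill=1.0, corr_range=10.0)
```

Adding such an error vector to the simulated signal would give a response with residual spatial autocorrelation even when all predictors are included in the model.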

3. Minimum RMSE according to simulations should be indicated in Fig. 3. Since the authors use the standard normal as error distribution, the best possible RMSE should also be 1. On average, it is closer to 2, but for the “complete, range 40” it looks as if it was below 1. That would be, well, surprising and important for interpretation: a fit into the spatial noise.
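The reasoning behind a best possible RMSE of 1 can be checked numerically. The sketch assumes that RMSE is computed against noisy observations and that the noise is independent standard normal; the linear signal and the biased competitor are hypothetical.

```python
import math
import random

def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

rng = random.Random(42)
n = 20000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
signal = [2.0 * xi for xi in x]                        # hypothetical true response
observed = [s + rng.gauss(0.0, 1.0) for s in signal]   # standard normal error

# Predicting with the true signal leaves only the noise: RMSE close to 1.
rmse_oracle = rmse(observed, signal)
# Any systematic misfit adds to that floor.
rmse_biased = rmse(observed, [s + 0.5 for s in signal])
```

Under these assumptions, an RMSE clearly below 1 on independent data would mean that part of the noise itself is being predicted, which is possible only if the noise carries exploitable structure or the evaluation target is not the noisy observation.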

4. While I read about and like the kNNDM, I still prefer a truly independent test data set. Since the authors have invented the data, they could simply extend the area by doubling it to one side, and use that second half for validation in the sense of a true extrapolation. My feeling is that kNNDM will work well if the range of data is much larger than the spatial autocorrelation, but not if SA is large relative to the spatial extent. An extent of 100x100 is “only” 2.5 times larger than the range of 40. Thus, the sampled data points will fall within the SA-range virtually always (Fig. 1.2). I am not convinced that this is an independent-enough test case.
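The suggested design, doubling the area to one side and holding out the second half, can be sketched as follows. The sample size, the uniform sampling and the variable names are assumptions; only the 100 x 100 extent and the range of 40 come from the comment above.

```python
import math
import random

SA_RANGE = 40.0  # autocorrelation range quoted in the comment above

rng = random.Random(7)
# Hypothetical doubled study area: 200 x 100 instead of 100 x 100.
samples = [(rng.uniform(0.0, 200.0), rng.uniform(0.0, 100.0)) for _ in range(400)]
train = [p for p in samples if p[0] < 100.0]    # original half
test = [p for p in samples if p[0] >= 100.0]    # added half, never used for fitting

def nearest_distance(point, others):
    """Distance from point to the closest location in others."""
    return min(math.hypot(point[0] - q[0], point[1] - q[1]) for q in others)

# Share of test points lying farther than the range from every training point.
beyond_range = sum(nearest_distance(p, train) > SA_RANGE for p in test) / len(test)
```

Test points close to the dividing line remain within the autocorrelation range of the training data, so only the share reported in `beyond_range` is independent in the sense the comment asks for.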

5. More as a suggestion: The problem of using spatial coordinates or proxies is that they replace the causal predictors in a random forest due to collinearity. As a consequence, the importances are wrongly estimated. One can, for the simulated X, compute how well each predictor can be represented by the specific spatial predictors used. If X1 can be predicted by lat/lon or the EDFs or distances with an r of 0.8 (or so), then clearly they will compete for explanation and substantially bias importance estimates. That is the reason why the ME-approach in spdep adds the PCNMs only to explain the RESIDUALs of the model, after fitting the non-space-variables X1-X6.
So, I would be interested in seeing how well space can replace actual predictors. Either by reporting such “predictability of predictors by space” in the appendix, or by having another model entirely without X1-X6 in the comparison.
This would also tie in nicely with the difference between inter/extrapolation: space-only should work fine for inter-, but fail for extrapolation.
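A simple way to quantify the "predictability of predictors by space" is to predict each covariate from location alone and correlate the result with the covariate. The sketch below swaps in a leave-one-out k-nearest-neighbour spatial average for the spatial model, which is an assumption made to keep the example dependency-free; a random forest on coordinates or distance fields would be the closer analogue. The simulated covariates are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def spatial_predictability(points, values, k=5):
    """Correlate a covariate with its leave-one-out k-NN spatial average."""
    fitted = []
    for i, (xi, yi) in enumerate(points):
        neighbours = sorted(
            (math.hypot(xi - xj, yi - yj), values[j])
            for j, (xj, yj) in enumerate(points) if j != i
        )[:k]
        fitted.append(sum(v for _, v in neighbours) / k)
    return pearson(values, fitted)

rng = random.Random(3)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
x_gradient = [x + rng.gauss(0.0, 5.0) for x, _ in points]  # smooth in space
x_noise = [rng.gauss(0.0, 1.0) for _ in points]            # no spatial structure

r_gradient = spatial_predictability(points, x_gradient)
r_noise = spatial_predictability(points, x_noise)
```

A covariate with a high value here, such as the simulated gradient, is one that spatial proxies could stand in for, and whose importance estimate would therefore be at risk.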

Overall, I think this is a nice paper almost as it is, and with a little more integration of WHY we would expect which approach to work better, and a clearer structuring of the purposes of the analysis, it will be just fine. IMHO it could be an even better paper if the authors allowed for spatial autocorrelation in the error term and tried to get to the bottom of WHEN space affects inference (i.e. here: importance) of predictors (that is, investigate the effect of collinearity on interpolation, extrapolation and variable importance).

Minor points:
L56: Also cite other people’s work here, much earlier, e.g. Le Rest et al. 2014 and whatever else we cited in Roberts et al. (2017 Ecography) on that topic.
L57: I find the restriction to RF too narrow. This is a logical and fundamental problem, not one specific to RF. Ploton et al. (2020) showed it for random forest, Kattenborn et al. (2022) for CNNs. It is the same problem of extrapolation in space with poor design for the CV.
L73: “scenarios” are what I called “target”, “goal” or “purpose”: Make clear what the goals are in the intro!
L178: Why would anybody use randomForest and not ranger? Much faster and hence less energy consumption.

What is the point of Fig. 7? I can see neither RMSEs nor biases nor anything else, so why look at these maps? Also, we are typically more impressed by high-resolution maps, even if they are completely wrong; map visualisation is thus either uninformative or misleading in many cases.
What is the point of Fig. 8, apart from the funny lines in “A Coordinates”?
Tables 2 and 3: Where are the standard errors on these estimates? (Yes, I understood that some of them are a bit of a pain to compute for one of the models. Still, without an estimate of the error, how can the reader interpret a value of “0.92” vs “0.87”? Might well be the same value if SD=0.4.)
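The point about missing standard errors can be illustrated with a short sketch that reports a mean and its standard error across replicates. The per-replicate values are invented so that the means come out at 0.92 and 0.87; they are not taken from the manuscript. The paired comparison assumes both models were evaluated on the same replicates.

```python
import math
import statistics

def mean_and_se(values):
    """Mean and standard error of the mean across replicates."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical per-replicate scores for two models (not from the manuscript).
model_a = [0.95, 0.60, 1.30, 0.85, 0.90]
model_b = [0.88, 0.57, 1.24, 0.81, 0.85]

mean_a, se_a = mean_and_se(model_a)
mean_b, se_b = mean_and_se(model_b)
# Assumption: replicates are shared, so differences can be taken pairwise.
mean_diff, se_diff = mean_and_se([a - b for a, b in zip(model_a, model_b)])
```

With these made-up numbers the two means differ by far less than their individual standard errors, yet the paired difference is estimated much more precisely, which is why reporting some measure of uncertainty changes how the table can be read.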

I missed the discussion of some existing approaches to massage space into ML:
Hajjem, A., Bellavance, F., & Larocque, D. (2011). Mixed effects regression trees for clustered data. Statistics & Probability Letters, 81(4), 451–459. https://doi.org/10.1016/j.spl.2010.12.003
Hajjem, A., Bellavance, F., & Larocque, D. (2014). Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6), 1313–1328. https://doi.org/10.1080/00949655.2012.741599
Li, L., Girguis, M., Lurmann, F., Wu, J., Urman, R., Rappaport, E., Ritz, B., Franklin, M., Breton, C., Gilliland, F., & Habre, R. (2019). Cluster-based bagging of constrained mixed-effects models for high spatiotemporal resolution nitrogen oxides prediction over large regions. Environment International, 128, 310–323. https://doi.org/10.1016/j.envint.2019.04.057
Li, L., Lurmann, F., Habre, R., Urman, R., Rappaport, E., Ritz, B., Chen, J.-C., Gilliland, F. D., & Wu, J. (2017). Constrained mixed-effect models with ensemble learning for prediction of nitrogen oxides concentrations at high spatiotemporal resolution. Environmental Science & Technology, 51(17), 9920–9929. https://doi.org/10.1021/acs.est.7b01864
Zhan, Y., Luo, Y., Deng, X., Chen, H., Grieneisen, M. L., Shen, X., Zhu, L., & Zhang, M. (2017). Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmospheric Environment, 155, 129–139. https://doi.org/10.1016/j.atmosenv.2017.02.023

Kattenborn, T., Schiefer, F., Frey, J., Feilhauer, H., Mahecha, M. D., & Dormann, C. F. (2022). Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks. ISPRS Open Journal of Photogrammetry and Remote Sensing, 5, 100018. https://doi.org/10.1016/j.ophoto.2022.100018

Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., & Bretagnolle, V. (2014). Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography, 23, 811–820. https://doi.org/10.1111/geb.12161

Ploton, P., Mortier, F., Réjou-Méchain, M., Barbier, N., Picard, N., Rossi, V., Dormann, C., Cornu, G., Viennois, G., Bayol, N., Lyapustin, A., Gourlet-Fleury, S., & Pélissier, R. (2020). Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nature Communications, 11(1), Article 1. https://doi.org/10.1038/s41467-020-18321-y

Citation: https://doi.org/10.5194/egusphere-2024-138-RC2

AC1: 'Comment on egusphere-2024-138', Carles Milà, 02 May 2024

Please find attached to this post our author comments.

Citation: https://doi.org/10.5194/egusphere-2024-138-AC1

Data sets

Code and data for "Random forests with spatial proxies for environmental modelling: opportunities and pitfalls" Carles Milà https://zenodo.org/records/10495235

Viewed

Total article views: 594 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
425	140	29	594	24	15

Views and downloads (calculated since 24 Jan 2024)

Month	HTML	PDF	XML	Total
Jan 2024	100	24	2	126
Feb 2024	79	23	8	110
Mar 2024	52	14	0	66
Apr 2024	40	19	5	64
May 2024	46	26	4	76
Jun 2024	67	16	7	90
Jul 2024	41	18	3	62

Viewed (geographical distribution)

Total article views: 612 (including HTML, PDF, and XML) Thereof 612 with geography defined and 0 with unknown origin.

Latest update: 26 Jul 2024

Short summary

Spatial proxies such as coordinates and distances are often included as predictors in random forest models for predictive mapping. In a simulation and two case studies, we investigated under which conditions this is appropriate. We found that spatial proxies are not always beneficial, and we therefore conclude that they should not be used as a default approach without careful consideration. We also give insights into the reasons behind their suitability, how to detect it, and potential alternatives.

