GEM-Forest: A Global satellite EMbedding–based map of forests and tree crops for 2020

Paluba, Daniel; Marsocci, Valerio; Onačillová, Katarína; Puerta Quintana, Yarin T.; Hastie, Adam

doi:10.5194/egusphere-2026-1401

Preprints

https://doi.org/10.5194/egusphere-2026-1401

19 Mar 2026 | Subsequently updated

Daniel Paluba, Valerio Marsocci, Katarína Onačillová, Yarin T. Puerta Quintana, and Adam Hastie

Abstract. The advent of big data in Earth observation, coupled with recent advances in Artificial Intelligence, has led to the development of geospatial embeddings. These compact, information-rich feature vectors are designed for direct use in machine learning (ML) applications across a wide range of downstream tasks, such as forest monitoring. Motivated by the limitations of existing global forest products and policy requirements like the EU Deforestation Regulation (EUDR), we (1) evaluate whether lightweight classifiers applied to satellite embeddings from Google DeepMind's AlphaEarth Foundations (AEF) can accurately map global forest and tree crop extents, and (2) introduce the resulting GEM-Forest dataset. GEM-Forest is a global satellite embedding–based dataset at 10 m spatial resolution for 2020 that provides a consistent classification across three classes: forest, non-forest, and tree crops. Our comparison of multiple ML approaches, ranging from linear models to neural networks, showed similar performance across classifiers, with linear models often outperforming more complex ones. This consistency indicates that the embeddings encode highly informative and linearly separable structures for global forest discrimination, including tree crop separation. Based on these findings, a linear Support Vector Machine was used to generate the final GEM-Forest dataset, which outperformed eight existing global forest, tree cover, or land cover maps on two global validation datasets and placed second on the JRC’s global forest validation dataset. Across all three datasets, the forest class achieved omission errors of 12–18% and commission errors of 16–21%, with overall accuracies from 88% to 92%. Misclassifications of tree crops as forests varied between 0.5% and 14.8%, with a producer’s accuracy above 85% for most tree crop datasets; the classification of European tree crops remains the most challenging.
Globally, GEM-Forest maps 3,919 million hectares (Mha) of forest for 2020, representing a 5.9% underestimation relative to FAO reports. This difference is partly attributed to the exclusion of unstocked forest areas from our forest definition, discrepancies in country-based forest definitions, and misclassification errors that occurred primarily within open forests and forest–shrubland transition zones. Overall, these results demonstrate that satellite embeddings combined with simple ML approaches support highly accurate, computationally efficient global forest and tree crop mapping. The open-access release of the GEM-Forest dataset and its ML model weights (both available in Paluba et al., 2026; DOI: https://doi.org/10.5281/zenodo.18921586) can support international policy decisions and allows the model to be transferred directly to other years.
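
The accuracy measures quoted in the abstract (omission error, commission error, producer's accuracy, overall accuracy) are all derived from a confusion matrix. The following is a minimal sketch of those standard definitions. The matrix, the names `CLASSES`, `MATRIX`, and `accuracy_metrics` are hypothetical, and the counts are invented for illustration; they are not results from the paper.

```python
# Hypothetical confusion matrix: rows = reference class, columns = mapped class.
# The counts are invented for illustration and are NOT taken from the paper.
CLASSES = ["forest", "non-forest", "tree crops"]
MATRIX = [
    [85, 10, 5],    # reference forest
    [12, 180, 8],   # reference non-forest
    [6, 4, 40],     # reference tree crops
]


def accuracy_metrics(matrix, classes):
    """Return overall accuracy plus per-class accuracy and error rates.

    Assumes every class has at least one reference and one mapped sample.
    """
    n = len(classes)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = {"overall_accuracy": correct / total}
    for i, name in enumerate(classes):
        reference_total = sum(matrix[i])                    # row sum
        mapped_total = sum(matrix[r][i] for r in range(n))  # column sum
        producers = matrix[i][i] / reference_total
        users = matrix[i][i] / mapped_total
        metrics[name] = {
            "producers_accuracy": producers,
            "omission_error": 1 - producers,    # reference samples missed
            "users_accuracy": users,
            "commission_error": 1 - users,      # mapped samples that are wrong
        }
    return metrics


if __name__ == "__main__":
    result = accuracy_metrics(MATRIX, CLASSES)
    print(f"overall accuracy: {result['overall_accuracy']:.3f}")
    for name in CLASSES:
        m = result[name]
        print(f"{name}: omission {m['omission_error']:.3f}, "
              f"commission {m['commission_error']:.3f}")
```

Read this way, an omission error of 12–18% for the forest class means that 12–18% of reference forest samples were mapped as another class, and a commission error of 16–21% means that 16–21% of samples mapped as forest belong to another class in the reference data.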

Received: 13 Mar 2026 – Discussion started: 19 Mar 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Interactive discussion

Status: closed

Version 2 | 19 Jun 2026

CC2:
'Comment on egusphere-2026-1401', Mohamed Bourriz, 08 Jul 2026

I have a methodological clarification regarding the use of AlphaEarth Embedding features. Since the AEF embeddings distributed as Cloud-Optimized GeoTIFFs are stored as signed 8-bit integer values, could the authors clarify whether the embeddings were de-quantized before being used in the modelling workflow?
This is particularly important because the raw 8-bit values are not directly equivalent to the analysis-ready embedding values. According to the AEF documentation, the raw integer values should be mapped back to floating-point embedding values using the provided de-quantization transformation.
Clarifying this preprocessing step would improve the reproducibility of the study and help readers better interpret the role and numerical meaning of the AEF covariates in the modelling results.

Citation: https://doi.org/10.5194/egusphere-2026-1401-CC2
- AC2: 'Reply on CC2', Daniel Paluba, 17 Jul 2026
  
  Thank you for this methodological question.
  Our extraction of AEF vector values for our training and testing points was conducted within the Google Earth Engine (GEE) environment using the official AEF satellite embedding repository (ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")). In GEE, the dataset is served natively as floating-point values ranging from -1.0 to +1.0, rather than the raw, signed 8-bit integers (int8) used for compressed storage and download distribution. We can therefore confirm that our modelling workflow used the dequantized, analysis-ready floating-point embedding values directly, and no additional value mapping was needed.
  We agree that stating this explicitly improves the reproducibility of our study. We will update Section 2.2 (in the Data and methods section) of the manuscript to state that the AEF dataset was accessed via GEE in its native floating-point format, so that the numerical scale of these covariates is clear.
  
  Citation: https://doi.org/10.5194/egusphere-2026-1401-AC2
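
For readers working from the int8 files rather than GEE, the de-quantization step raised in this exchange can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The specific mapping used here (divide by 127.5, square, restore the sign, and treat -128 as nodata) is an assumption that should be checked against the AEF documentation before use; the names `NODATA`, `SCALE`, `dequantize`, and `dequantize_vector` are hypothetical.

```python
import math

# ASSUMPTIONS (verify against the AEF documentation before relying on them):
NODATA = -128   # assumed sentinel marking masked pixels in the int8 files
SCALE = 127.5   # assumed scale factor of the quantization


def dequantize(raw):
    """Map one raw int8 value to a float, or None for the nodata sentinel.

    Assumed mapping: sign(raw) * (raw / SCALE) ** 2.
    """
    if raw == NODATA:
        return None
    scaled = raw / SCALE
    return math.copysign(scaled * scaled, raw)


def dequantize_vector(raw_values):
    """Apply dequantize to every band value of one embedding vector."""
    return [dequantize(v) for v in raw_values]


if __name__ == "__main__":
    # Hypothetical raw values for a few bands of a single pixel.
    print(dequantize_vector([0, 64, -64, 127, -128]))
```

Under this assumed mapping the output stays within -1.0 to +1.0, which is consistent with the floating-point range described in the reply above, while the raw integers span -127 to 127. A model trained on one scale therefore cannot be applied to inputs on the other without conversion.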
Version 1 | 19 Mar 2026

CC1:
'Comment on egusphere-2026-1401', Vanna Teck, 08 May 2026
Dear Authors,
Thank you for sharing this impressive work on global satellite embedding-based forest and tree crop mapping. I found the dataset and methodology very valuable, especially for large-scale applications and regional analysis.
I would like to provide a suggestion regarding the Cambodia region. Based on visual inspection, some areas appear to show potential overestimation of tree crops, particularly where deciduous forests may have been classified as tree crops. This issue seems noticeable in several locations, including:
Lat: 12.80694, Lon: 107.51690
Lat: 12.80536, Lon: 106.82905
Lat: 12.97114, Lon: 105.76146
Lat: 12.05924, Lon: 104.00265
Lat: 10.77449, Lon: 103.24736
Lat: 11.95709, Lon: 103.40392

In Cambodia, deciduous forests and some plantation systems can have similar seasonal spectral characteristics, which may contribute to confusion between natural forests and tree crop classes.
Thank you again for this excellent contribution. I look forward to seeing the future development of this dataset.
Citation: https://doi.org/10.5194/egusphere-2026-1401-CC1
- AC1: 'Reply on CC1', Daniel Paluba, 17 Jul 2026
  
  Dear Vanna Teck,
  Thank you very much for your kind words regarding our work and for taking the time to provide such detailed, constructive feedback.
  We appreciate you sharing these specific coordinate locations for Cambodia. We will investigate them to better understand the cause of the overestimation of the tree crop class in that region. We will use these insights to enhance the dataset in its future iterations.
  Thank you again for helping us improve this work!
  
  Citation: https://doi.org/10.5194/egusphere-2026-1401-AC1

Supplement

https://doi.org/10.5194/egusphere-2026-1401-supplement

Data sets

GEM-Forest: A Global satellite EMbedding–based map of forests and tree crops for 2020. Daniel Paluba, Valerio Marsocci, Katarína Onačillová, Yarin T. Puerta Quintana, and Adam Hastie. https://doi.org/10.5281/zenodo.18921586

Viewed

Total article views: 3,119 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
2,176	878	65	3,119	0	43	62

Views and downloads (calculated since 19 Mar 2026)

Month	HTML	PDF	XML	Total
Mar 2026	1,727	693	49	2,469
Apr 2026	240	100	6	346
May 2026	134	61	5	200
Jun 2026	16	12	3	31
Jul 2026	59	12	2	73

Viewed (geographical distribution)

Total article views: 3,092 (including HTML, PDF, and XML) Thereof 3,092 with geography defined and 0 with unknown origin.

Latest update: 25 Jul 2026

Short summary

We created a new global map of forests and tree crops for 2020 at 10 m resolution using satellite embeddings. After testing many machine learning methods, we found that simple linear models performed as well as or better than more complex ones. The forest/non-forest map achieves 92% overall accuracy and separates tree crops from forests with low confusion. This shows that satellite embeddings can support reliable and efficient global forest monitoring and inform international and national policies.

