GEM-Forest: A Global satellite EMbedding–based map of forests and tree crops for 2020
Abstract. The advent of big data in Earth Observation (EO), coupled with recent advances in Artificial Intelligence, has led to the development of geospatial embeddings: compact, information-rich feature vectors designed to be ready to use in machine learning (ML) applications for a wide range of downstream tasks, including forest monitoring. Motivated by the limitations of existing global forest products and by policy requirements such as the EU Deforestation Regulation (EUDR), we assess whether lightweight classifiers applied to satellite embeddings from the Google DeepMind AlphaEarth Foundations (AEF) model can accurately map global forest and tree crop extents. We introduce GEM-Forest, a global satellite embedding–based forest dataset at 10 m spatial resolution for 2020, and its associated products: GEM-FnF2020, a forest / non-forest (F/nF) classification, and GEM-TC2020, which further distinguishes non-forest areas containing tree crops. Using ∼47,000 globally distributed training samples covering all major biomes, collected through an automated approach combining multiple forest-related, land cover and tree crop datasets, we compared multiple ML approaches ranging from linear models to neural networks. Accuracy assessment on a global F/nF dataset with ∼21,000 samples showed similar performance across classifiers, with overall accuracies of 90–92 % and macro F1-scores of 0.89–0.90, and linear models often outperformed more complex approaches. Validation of the tree crop subclass across 10 datasets showed larger differences among ML models, with the highest accuracies achieved mostly by linear models. The consistently strong performance of linear models indicates that the embeddings encode highly informative and linearly separable structure for global F/nF discrimination, including tree-crop separation.
A linear Support Vector Machine was therefore used to generate GEM-FnF2020, which achieves an overall accuracy of 91 % and a macro F1-score of 0.90, with balanced omission and commission error rates for forests (15 % and 13 %, respectively). These results match or exceed existing global products, with most errors occurring in open forests and forest–shrubland transition zones. Residual misclassification of tree crops as forests in GEM-FnF2020 ranged from 0.5 % to 14.8 %, which demonstrates the importance of the tree crop subclass in the GEM-TC2020 map. GEM-TC2020 distinguishes agricultural tree crops with an overall accuracy above 85 % for most tree crops, while the classification of European tree crops remains the most challenging. Separating the tree crop class therefore significantly reduces the corresponding commission errors in the main GEM-FnF2020 product (0.5–14.8 %). Our proposed approach shows strong potential for temporal transferability across the 2017–2025 period covered by AEF embeddings. This capability would allow multi-year applications and change detection based on models trained for a single year, and exploring it is a key next step in our research. Overall, the findings demonstrate that AEF embeddings combined with simple ML approaches support accurate, transferable, and computationally efficient global forest mapping, with remaining limitations related to temporal resolution and feature interpretability. These results and the presented approach can support policy and regulatory decisions, including the EUDR, while the open-access release of the GEM-Forest datasets and trained models facilitates global use, further testing, and methodological development by the EO and forest monitoring communities.
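For readers less familiar with the accuracy measures quoted above, the following minimal sketch shows how overall accuracy, forest omission and commission error rates, and the macro F1-score relate to a two-class confusion matrix. It is an illustration only, not the evaluation code used in this study: the function name `binary_map_metrics` and the sample counts are hypothetical and do not reproduce the reported results.

```python
def binary_map_metrics(tp, fn, fp, tn):
    """Accuracy metrics for a forest / non-forest map from confusion-matrix counts.

    tp: reference forest mapped as forest
    fn: reference forest mapped as non-forest (forest omitted)
    fp: reference non-forest mapped as forest (forest committed)
    tn: reference non-forest mapped as non-forest
    """
    total = tp + fn + fp + tn
    overall_accuracy = (tp + tn) / total

    # Omission error of the forest class = 1 - producer's accuracy (recall).
    omission_forest = fn / (tp + fn)
    # Commission error of the forest class = 1 - user's accuracy (precision).
    commission_forest = fp / (tp + fp)

    def f1(true_pos, false_neg, false_pos):
        # F1 expressed directly in counts: 2TP / (2TP + FN + FP).
        return 2 * true_pos / (2 * true_pos + false_neg + false_pos)

    f1_forest = f1(tp, fn, fp)
    # For the non-forest class the roles of the two error types swap.
    f1_non_forest = f1(tn, fp, fn)
    macro_f1 = (f1_forest + f1_non_forest) / 2

    return {
        "overall_accuracy": overall_accuracy,
        "omission_forest": omission_forest,
        "commission_forest": commission_forest,
        "macro_f1": macro_f1,
    }


# Hypothetical counts chosen for illustration only.
metrics = binary_map_metrics(tp=85, fn=15, fp=10, tn=190)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the macro F1-score averages the per-class F1 values without weighting by class frequency, it is less dominated by the more abundant class than overall accuracy, which is why both measures are reported side by side.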