Use of Spatial Embeddings in Genosoil Identification
Abstract. Genosoils are minimally disturbed reference states within pedogenons, that is, soil units shaped by similar pedogenic processes within the Soil Security framework. They are central to assessing human impacts on soil functions, services, and resistance to threats. At present, genosoil delineation relies on the Human Modification Index (HMI), yet in intensively managed landscapes HMI thresholds may exclude all local pixels, leaving no local reference state available. Because the same pedogenon may occur across geographically distant regions, non-local occurrences may provide an alternative source of reference information. Using the United Kingdom as a case study, we tested whether satellite-derived spatial embeddings can detect genosoil signatures at 10 m resolution and whether these signatures can be transferred to regions with limited or absent local low-human-modification examples. We evaluated two satellite foundation-model embedding products, AlphaEarth and Tessera, across three contrasting pedogenons selected from the Global Pedogenon Map. Within each pedogenon, pixels with lower HMI values were generally more similar to the genosoil reference, indicating that the embeddings capture a reproducible low-modification surface-state signal. At the global scale, similarity to the UK genosoil was largely confined to biogeographically coherent regions. Cross-border substitution of local UK genosoil delineation was mostly limited, with meaningful partial recovery observed primarily in the highly modified agricultural pedogenon. These results indicate that satellite foundation-model embeddings can support higher-resolution genosoil delineation than is currently possible from global human modification products alone, extending the operational framework from 90 m to 10 m. They also suggest a pathway towards future genosoil identification frameworks that rely less on coarse disturbance proxies and more on validated surface-state similarity.
Pachón Maldonado et al. aim to identify genosoils (minimally disturbed soils in homogeneous soil geographic regions called pedogenons) using spatial embeddings (vectorized information summarizing land surface properties derived from remote sensing). The hypothesis is that undisturbed soils share similar land surface characteristics that can be represented, and transferred, using the spatial embeddings. The manuscript provides an extensive statistical analysis of different spatial datasets representing land surface properties, soil properties and human disturbances. The findings indicate limited geographical transferability of the spatial embeddings for genosoil identification.
I have three main comments regarding the manuscript: 1) I have some doubts about the quality of the datasets for their intended purpose, 2) the manuscript will be difficult for the more general soil-scientific audience of SOIL to understand, due to its strong statistical focus and insufficient connection to existing pedogenic frameworks, and 3) the methods and results should be better structured for improved clarity.
I have detailed my main comments below, followed by a list of minor comments and technical corrections.
---
1. Quality of datasets
1.1. The HMI dataset is used to identify undisturbed genosoils. However, especially for pedogenon 1564, I doubt whether this dataset actually shows minimally disturbed soils. Their minimal presence and highly scattered occurrence (Table 1, Fig. 3) raise the question of whether they are actually genosoils, or just phenosoils that are wrongly classified in the HMI dataset. An evaluation of Table S2 shows that the countries most similar to the UK also have a very small percentage of genosoils in their pedogenon, while countries with a much higher percentage of genosoils show lower similarity. This makes me wonder whether the HMI actually identifies genosoils for P1564, or whether wrongly classified agricultural soils or built-up areas are being compared with each other, which could also explain their similarity in a European context. A thorough evaluation of the accuracy of the HMI dataset is essential before the spatial embeddings of derived genosoils can be reliably compared.
1.2. Another concern is the uniformity of the pedogenons. Figure 5 shows occurrences of the same pedogenon in countries with a wide variety of climatic and topographic conditions, and the pedogenons sometimes contain soils formed under contrasting climatic conditions (Table S4). This heterogeneity will make it difficult to define one genosoil type, with a corresponding spatial embedding, for each pedogenon. This point is addressed in Section 3.6, but in my opinion in insufficient detail. I would like to see a more extensive discussion of how the quality of the datasets used could influence the outcomes of this study, and to see this reflected in the conclusions.
2. Understandability for general soil-scientific audience
2.1. The authors remark that spatial embeddings “do not directly encode pedogenesis” (lines 83-84). However, many of the land surface properties of the spatial embeddings correspond to the soil forming factors, where Tessera seems to mainly focus on organisms through land cover from Sentinel, while AlphaEarth seems to represent topography and climate as well (lines 76-82). I think it would benefit the paper to frame the spatial embeddings in the context of soil forming factors, or a comparable model, to connect to more familiar frameworks in soil science. It would also be interesting to see a discussion on how the lack of representation of the other factors, especially time, could have an influence on the outcomes.
2.2. In addition, despite the remark that spatial embeddings “do not directly encode pedogenesis” (lines 83-84), the authors suggest using the spatial embeddings to “flag pedogenons whose global extent may conflate distinct soil-forming environments” (lines 409-410). I think these two statements contradict each other and need revising.
2.3. Statistical terminology. The manuscript introduces various statistical concepts and metrics that are not consistently named and referenced. For example, lines 164-166 use the terms “internal coherence”, “cohesion” and “cosine similarity” to describe how representative a reference embedding is of its population. Throughout the rest of the manuscript, these terms are used interchangeably. Other important metrics are not defined by an equation, but only mentioned briefly in the main text (e.g. cosine distance, lines 176-177). I think the paper would benefit from a stricter use of statistical terminology and a clear overview of the definitions and descriptions of these terms, for example in a table.
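To illustrate the level of explicitness I have in mind, the following is a minimal sketch of how the three quantities could be defined unambiguously. It is not the authors' implementation: the function names are hypothetical, and treating "cohesion" as the mean cosine similarity of population members to the reference embedding is my own assumption about what the manuscript intends.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (range -1 to 1).

    Raises ZeroDivisionError if either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def mean_similarity_to_reference(population, reference):
    """Assumed 'cohesion' measure: mean cosine similarity of each
    population embedding to a single reference embedding."""
    sims = [cosine_similarity(vec, reference) for vec in population]
    return sum(sims) / len(sims)
```

If the manuscript uses a different definition for any of these terms (for example, pairwise similarity within the population rather than similarity to a reference), stating it in this form or as an equation would remove the ambiguity.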
3. Structure of the manuscript
Most Tables and Figures present results for three pedogenons and two spatial embeddings. However, their order of presentation is not consistent: some Figures present the pedogenons as rows (Fig. 3) and others as columns (Fig. 4). The order in which the pedogenons are presented also varies between Figures, Tables and their description in the text. I think the manuscript will be much clearer if the pedogenons and spatial embeddings are consistently presented and described in the same order throughout the Methods and Results.
---
Minor comments and technical corrections