Use of Spatial Embeddings in Genosoil Identification

Pachón Maldonado, Julio César; Padarian, José; Styc, Quentin; McBratney, Alex

doi:10.5194/egusphere-2026-1944

Preprints

20 Apr 2026

Status: this preprint is open for discussion and under review for the journal SOIL.

Abstract. Genosoils are minimally disturbed reference states within pedogenons, that is, soil units shaped by similar pedogenic processes within the Soil Security framework. They are central to assessing human impacts on soil functions, services, and resistance to threats. At present, genosoil delineation relies on the Human Modification Index (HMI), yet in intensively managed landscapes HMI thresholds may exclude all local pixels, leaving no local reference state available. Because the same pedogenon may occur across geographically distant regions, non-local occurrences may provide an alternative source of reference information. Using the United Kingdom as a case study, we tested whether satellite-derived spatial embeddings can detect genosoil signatures at 10 m resolution and whether these signatures can be transferred to regions with limited or absent local low-human-modification examples. We evaluated two satellite foundation-model embedding products, AlphaEarth and Tessera, across three contrasting pedogenons selected from the Global Pedogenon Map. Within each pedogenon, pixels with lower HMI values were generally more similar to the genosoil reference, indicating that the embeddings capture a reproducible low-modification surface-state signal. At the global scale, similarity to the UK genosoil was largely confined to biogeographically coherent regions. Cross-border substitution of local UK genosoil delineation was mostly limited, with meaningful partial recovery observed primarily in the highly modified agricultural pedogenon. These results indicate that satellite foundation-model embeddings can support higher-resolution genosoil delineation than is currently possible from global human modification products alone, extending the operational framework from 90 m to 10 m. They also suggest a pathway towards future genosoil identification frameworks that rely less on coarse disturbance proxies and more on validated surface-state similarity.

Received: 06 Apr 2026 – Discussion started: 20 Apr 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 01 Jun 2026)

RC1: 'Comment on egusphere-2026-1944', Marijn van der Meij, 08 May 2026
Pachón Maldonado et al. aim to identify genosoils (minimally disturbed soils in homogeneous soil geographic regions called pedogenons) using spatial embeddings (vectorized information summarizing land surface properties derived from remote sensing). The hypothesis is that undisturbed soils share similar land surface characteristics that can be represented, and transferred, using the spatial embeddings. The manuscript provides an extensive statistical analysis of different spatial datasets representing land surface properties, soil properties and human disturbances. The findings indicate limited geographical transferability of the spatial embeddings for genosoil identification.
I have three main comments regarding the manuscript: 1) I have some doubts about the quality of the datasets for their intended purpose, 2) the manuscript will be difficult for the more general soil-scientific audience of SOIL to understand because of its strong statistical focus and insufficient connection to existing pedogenic frameworks, and 3) the methods and results should be better structured for improved clarity.
I have detailed my main comments below, followed by a list of minor comments and technical corrections.
---
1. Quality of datasets
1.1. The HMI dataset is used to identify undisturbed genosoils. However, especially for pedogenon 1564, I have doubts whether this dataset actually shows minimally disturbed soils. Their minimal presence and highly scattered occurrence (Table 1, Fig. 3) raise the question of whether they are actually genosoils, or just wrongly classified phenosoils in the HMI dataset. An evaluation of Table S2 shows that the countries most similar to the UK also have a very small percentage of genosoils in their pedogenon, while countries with a much higher percentage of genosoils show lower similarity. This makes me wonder whether the HMI actually identifies genosoils for P1564, or whether wrongly classified agricultural soils or built-up areas are being compared with each other, which could also explain their similarity in a European context. A thorough evaluation of the accuracy of the HMI dataset is essential before spatial embeddings of derived genosoils can be reliably compared.
1.2. Another concern comes from the uniformity of the pedogenons. Figure 5 shows occurrences of the same pedogenon in countries with a wide variety of climatic and topographic conditions, and the pedogenons sometimes contain soils that are formed under contrasting climatic conditions (Table S4). This heterogeneity will make it difficult to define one genosoil type with a corresponding spatial embedding for each pedogenon. This point is addressed in Section 3.6, but in my opinion with insufficient detail. I would like to see a more extensive discussion of how the quality of the datasets used could influence the outcomes of this study, and also see this reflected in the conclusions.
2. Understandability for general soil-scientific audience
2.1. The authors remark that spatial embeddings “do not directly encode pedogenesis” (lines 83-84). However, many of the land surface properties of the spatial embeddings correspond to the soil forming factors, where Tessera seems to mainly focus on organisms through land cover from Sentinel, while AlphaEarth seems to represent topography and climate as well (lines 76-82). I think it would benefit the paper to frame the spatial embeddings in the context of soil forming factors, or a comparable model, to connect to more familiar frameworks in soil science. It would also be interesting to see a discussion on how the lack of representation of the other factors, especially time, could have an influence on the outcomes.
2.2. In addition, despite remarking that spatial embeddings “do not directly encode pedogenesis” (lines 83-84), the authors suggest using the spatial embeddings to “flag pedogenons whose global extent may conflate distinct soil-forming environments” (lines 409-410). I think these two statements contradict each other and need revising.
2.3. Statistical terminology. The manuscript introduces various statistical concepts and metrics that are not consistently named and referenced throughout the manuscript. For example, lines 164-166 use the terms “internal coherence”, “cohesion” and “cosine similarity” to describe how representative a reference embedding is of its population. Throughout the rest of the manuscript, these terms are used interchangeably. Other important metrics are not defined by an equation, but only mentioned briefly in the main text (e.g. cosine distance, lines 176-177). I think that the paper would benefit from stricter use of statistical terminology and a clear overview of the definitions and descriptions, for example in a table.
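To make the distinction between these terms concrete, the following is a minimal sketch of the textbook definitions of cosine similarity and cosine distance. The `cohesion` function is a hypothetical reading (mean cosine similarity of population members to a reference vector) and is an assumption for illustration, not necessarily the definition used in the manuscript; the two-dimensional vectors are toy data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def cohesion(reference, members):
    """Hypothetical 'cohesion': mean cosine similarity of members to a reference.

    This is an assumed definition for illustration only.
    """
    return sum(cosine_similarity(reference, m) for m in members) / len(members)

# Toy embeddings: one member identical to the reference, one orthogonal to it.
reference = [1.0, 0.0]
members = [[1.0, 0.0], [0.0, 1.0]]
print(cohesion(reference, members))  # 0.5
```

Under these definitions, similarity and distance carry the same information with opposite orientation, whereas cohesion is a population-level summary, which is one reason the three terms should not be used interchangeably.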
3. Structure of the manuscript
Most tables and figures present results for three pedogenons and two spatial embeddings. However, their order of presentation is not consistent: some figures present the pedogenons as rows (Fig. 3) and others as columns (Fig. 4). The order of presentation of the pedogenons also varies between figures, tables and their description in the text. I think the manuscript will be much clearer when the pedogenons and spatial embeddings are consistently presented and described in the same order throughout the Methods and Results.
---
Minor comments and technical corrections
The abstract contains a lot of technical terms which will not be understandable without reading the manuscript first (e.g. spatial embeddings, line 12; reproducible low-modification surface-state signal, lines 16-17). Please make sure that the abstract is in itself understandable to the audience of SOIL.

Figure 1. The caption should better describe what is visible in the two panels. Panel B would fit better in the Methodology than in the Introduction.

Section 2.1. Another argument for selecting these pedogenons is that they have different levels of occurrence and modification (lines 224-225), which could be mentioned here as well.

I was wondering whether the cohesions derived from AlphaEarth and Tessera are comparable, as the spatial embeddings have different dimensions and are based on different datasets. I can imagine that Tessera has less internal variation as it is based on less variable data, which could lead to a higher cohesion. Could you add a remark about this in the manuscript?

Line 160, 167. The terms L2-mean and L2-normalization need explanation.
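As an illustration of the kind of explanation requested, the sketch below shows one plausible reading of these two terms: L2-normalization scales a vector to unit Euclidean length, and an "L2-mean" averages unit vectors and renormalizes the result. The name `l2_mean` and this interpretation are assumptions; the manuscript may define the term differently.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def l2_mean(vectors):
    """Assumed reading of 'L2-mean': average the unit vectors, then renormalize.

    Normalizing first stops long vectors from dominating the average.
    """
    units = [l2_normalize(v) for v in vectors]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return l2_normalize(mean)

# Two toy embeddings of very different length contribute equally to the centroid.
centroid = l2_mean([[2.0, 0.0], [0.0, 5.0]])
print(centroid)
```

With unit-length vectors, the dot product equals the cosine similarity, which is presumably why the normalization step matters for the similarity metrics used later.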

Eq. 3. Can you explain more extensively what the Jaccard Index calculates and represents?
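For readers unfamiliar with the metric, the standard Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two hypothetical binary pixel masks; the mask names and values are invented for illustration, and which masks Eq. 3 actually compares is not assumed here.

```python
def jaccard_index(mask_a, mask_b):
    """Standard Jaccard index |A ∩ B| / |A ∪ B| for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Convention chosen here (an assumption): two empty masks count as identical.
    return intersection / union if union else 1.0

# Hypothetical masks over five pixels: 1 = pixel flagged as genosoil.
mask_from_hmi = [1, 1, 0, 0, 1]
mask_from_embedding = [1, 0, 0, 1, 1]
print(jaccard_index(mask_from_hmi, mask_from_embedding))  # 0.5
```

A value of 1 means the two delineations flag exactly the same pixels, and 0 means they share none; pixels flagged by neither mask do not affect the score.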

Line 192. Silhouette confidence intervals are mentioned here but not provided in the results. Can you use uniform confidence-interval widths for silhouettes and bootstrapping?
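One way to obtain uniform interval widths is to compute every interval with the same resampling routine and the same nominal level. The sketch below is a generic percentile bootstrap for the mean of any per-pixel score; the function name, the default of 2000 resamples, and the toy scores are assumptions, not the authors' procedure.

```python
import random

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper

# Toy per-pixel scores (e.g. silhouette values or cosine similarities).
scores = [0.2, 0.4, 0.6, 0.8]
low, high = bootstrap_ci(scores, level=0.95)
print(low, high)
```

Passing the same `level` for silhouettes and for the other bootstrapped metrics keeps the reported intervals directly comparable.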

Lines 216-219: Should be moved to methods, and Fig 1b should be referenced here.

Figure 3. This Figure needs some modification:
Axis descriptions and labels are not readable due to their size and because they sometimes overlap with the Figures.

Can you add the pedogenon code as a row label instead of as an inset in one of the panes?

Can you indicate where each area is located within the UK?

The details on the maps of P1564 are barely visible. Is there a way to improve the readability of these maps?

Table 2. Can you add values for coherence / cohesion, as these are mentioned in the text as well (e.g. lines 247-248, 250)?

Figure 4.
Shouldn’t the Y-axis label be “cosine distance from each country’s genosoil centroid” instead of “distance from UK genosoil centroid”?

Could you add a legend for the line widths instead of mentioning the scaling in the caption? You could, for example, group lines based on certain thresholds of pedogenon area. Also make the line width of the UK pedogenon consistent with the other line widths.

Figure 5. Can you use colors that are distinguishable for colorblind people and for black-and-white prints?

Line 337. You state that AlphaEarth shows a greater degree of local environmental context, which would be relevant for understanding soil formation. Yet, this spatial embedding scores systematically lower than Tessera. Could you discuss the reasons for this in the context of soil formation, data quality and the statistics used? See also comment 7.

Table 4. Could you use consistent rounding of the decimals?

Code availability. The referenced Zenodo page does not exist.

Table S2. Why are the surface areas of pedogenons and genosoils different for each embedding model? Aren’t these based on the independent HMI?

Table S6 is not referenced in the text.

Citation: https://doi.org/10.5194/egusphere-2026-1944-RC1
RC2: 'Comment on egusphere-2026-1944', Anonymous Referee #2, 08 May 2026

This is a very good study testing how two 10 m resolution, remote-sensing-based foundation models (AlphaEarth and Tessera) behave when used to identify three types of UK genosoil across different nations at the global scale. The hypothesis-testing experiments were well designed, examining the differing behaviors of AlphaEarth and Tessera from several angles and providing reasonable explanations for them. The limitations of the current study are presented at the end of the manuscript.

The three pedogenons tested in the current study each have very distinct characteristics. This is good, but it raises a further question: what about two pedogenons with similar characteristics, for example ones that are much closer geographically or taxonomically? How well would AlphaEarth or Tessera differentiate between them?

Another minor comment is on Figure 3: the bottom subfigures are difficult to read clearly.

Citation: https://doi.org/10.5194/egusphere-2026-1944-RC2

Supplement

https://doi.org/10.5194/egusphere-2026-1944-supplement

Data sets

Use of Spatial Embeddings in Genosoil Identification: code and tables Julio Pachon https://zenodo.org/records/19424156?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjU4NzA5MTQwLTk1Y2EtNGZmNS05MmMwLTcyNmNjMTZhODk5ZCIsImRhdGEiOnt9LCJyYW5kb20iOiIzYTcwNzg3YTQwNWIyODQwZjJkYWVhNTRhNDY5ZmNmNSJ9.6Dwt8tRhJQox3evJN0T63nh-wan6--UBnE6Ut8VptmMqf_b8CGBWJ-2PsFL31t5ks2U6TYhwcICYf0gZ5hYZmQ

Short summary

Identifying local soil references for research in heavily modified landscapes is challenging. We evaluated whether AI satellite models at 10 m resolution can detect undisturbed soils and whether the findings can be applied internationally. Pixels with less human disturbance aligned better with undisturbed references. Cross-border transfer was partly successful. Results indicate AI models may help find reference soils where local examples are scarce.

