Soil science-informed neural networks for soil organic carbon density modelling under scarce bulk density data
Abstract. Soil organic carbon (SOC) density is a key variable for quantifying soil carbon stocks, yet its modelling is challenged by sparse and inconsistent measurements of bulk density and coarse fragments relative to SOC content. Conventional digital soil mapping approaches typically model SOC density as a single target variable, thereby underutilising abundant SOC content data and overlooking physical relationships among soil properties. This study evaluates a soil science-informed neural network for SOC density prediction that explicitly constrains the SOC–BD relationship, and compares it with univariate and multivariate neural network architectures. Across sparsely sampled target variables, including SOC density, bulk density, and coarse fragments, the soil science-informed model achieves comparable or slightly improved prediction accuracy relative to multivariate and univariate models. Although it yields lower accuracy for SOC content, the soil science-informed model better preserves physically plausible SOC–BD joint distributions and generates smoother, more temporally stable SOC density trajectories. Overall, the results demonstrate that incorporating soil physical constraints into machine learning models adds value beyond univariate accuracy, improving robustness, plausibility, and temporal coherence of SOC density predictions under sparse data conditions. Moreover, the latent parameters inferred by the soil science-informed model improve model interpretability and offer additional soil science relevant insights beyond predictive accuracy.
I have read the manuscript, which describes soil science-informed neural networks that predict SOC density under sparse auxiliary data by leveraging multivariate learning and a soil-relation-informed ML architecture.
The topic is timely for the DSM/ML community, and the paper's central idea of using physical constraints to improve plausibility and robustness under missing BD/CF data is potentially valuable.
However, there are several issues in the mathematical formulation, unit consistency, evaluation design, and soil-science framing that need attention.
I outline some comments below.
- L.21: The Introduction presents SOC density as the DSM-driven stakeholder target. In practice, carbon accounting uses SOC stocks per unit area (depth-integrated), not volumetric SOC density alone.
- Depth handling (0–20 cm) and LUCAS 2018 exclusion: the study excluded the 0–10 and 20–30 cm layers to focus on 0–20 cm. Please explain whether separate 0–10 and 10–20 cm measurements exist and, if so, why they were not used.
- Unit consistency and dimensional correctness of SOC density (Eq. 1), with SOC content in g kg⁻¹, BD in g cm⁻³, and output in kg m⁻³: while the formula can be numerically correct if the implicit conversions cancel, it is not dimensionally transparent and is easy to misapply. I suggest either (a) rewriting Eq. (1) with explicit conversion constants, or (b) defining the variables explicitly as a mass fraction and kg m⁻³.
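To make the point concrete, Eq. (1) could be written with the conversions spelled out. The sketch below is only illustrative (the function name and the coarse-fragment term are my assumptions, not the authors' code):

```python
def soc_density_kg_m3(soc_g_per_kg, bd_g_per_cm3, cf_vol_frac=0.0):
    """SOC density with explicit unit conversions.

    soc_g_per_kg : SOC content [g kg^-1]
    bd_g_per_cm3 : bulk density of fine earth [g cm^-3]
    cf_vol_frac  : coarse fragments as a volume fraction [0-1]
    Returns SOC density in kg m^-3.
    """
    soc_frac = soc_g_per_kg / 1000.0      # g kg^-1 -> kg kg^-1 (mass fraction)
    bd_kg_m3 = bd_g_per_cm3 * 1000.0      # g cm^-3 -> kg m^-3
    return soc_frac * bd_kg_m3 * (1.0 - cf_vol_frac)
```

With SOC = 20 g kg⁻¹ and BD = 1.3 g cm⁻³ the two factors of 1000 cancel (0.02 × 1300 = 26 kg m⁻³), which is exactly why the missing constants are easy to overlook in the current formulation.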
Also, the motivation states that stakeholders care about SOC density more than content. In practice, stakeholders commonly want SOC content as well as SOC stock per unit area (e.g., Mg ha⁻¹ over a depth interval). Please temper this claim and clarify the end-use context (accounting, monitoring, agronomy, reporting).
- SOC–BD mechanistic constraint (Eq. 2) with SOM = 1.724 · SOC content: as written this is incorrect, because SOC content is in g kg⁻¹ (not a fraction), so SOM becomes of order 10–100+ rather than a dimensionless fraction, which makes (1 − SOM) negative.
The Federer reference is not the origin of this equation. The mixing equation is due to Adams, W. A. (1973): The effect of organic matter on the bulk and true densities of some uncultivated podzolic soils. Journal of Soil Science 24(1), 10–17.
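For reference, the Adams-type mixing model with SOM entered as a mass fraction can be sketched as follows. The packing densities below are illustrative placeholder values only; the manuscript should report the values it actually uses:

```python
def adams_bulk_density(soc_g_per_kg, bd_min=1.64, bd_om=0.224):
    """Two-component mixing model for bulk density (Adams 1973 form).

    SOM must enter as a mass fraction [0-1]; converting SOC from
    g kg^-1 first avoids the (1 - SOM) < 0 error flagged above.
    bd_min, bd_om: packing densities [g cm^-3] of the mineral and
    organic components -- placeholder values, not fitted parameters.
    """
    som_frac = 1.724 * soc_g_per_kg / 1000.0  # g kg^-1 -> mass fraction
    return 1.0 / (som_frac / bd_om + (1.0 - som_frac) / bd_min)
```

With the conversion in place the model behaves sensibly: BD equals the mineral packing density at SOC = 0 and decreases monotonically as SOC rises.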
- Coarse fragments: in LUCAS, CF is measured on a mass basis. How was it converted to the volume basis required by Eq. 2?
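One standard conversion is shown below; the rock particle density of 2.65 g cm⁻³ is my assumption, and the manuscript should state the value and formula actually used:

```python
def cf_mass_to_volume(cf_mass_frac, bd_fine_g_cm3, rho_cf_g_cm3=2.65):
    """Convert a coarse-fragment mass fraction (LUCAS reports mass basis)
    to the volume fraction required by Eq. 2.

    rho_cf_g_cm3: assumed rock particle density [g cm^-3].
    bd_fine_g_cm3: bulk density of the fine-earth fraction [g cm^-3].
    """
    v_cf = cf_mass_frac / rho_cf_g_cm3              # cm^3 of fragments per g of soil
    v_fine = (1.0 - cf_mass_frac) / bd_fine_g_cm3   # cm^3 of fine earth per g of soil
    return v_cf / (v_cf + v_fine)
```

Note that the volume fraction depends on the fine-earth BD itself, so if BD is predicted rather than measured this conversion couples the two targets and deserves explicit discussion.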
- L.85: The study promotes soil science-informed ML, yet at the same time uses 362 covariates from 15 groups, many of which are highly correlated, multi-scale, and partially redundant. This creates a tension between the stated philosophy and the modelling design.
- The cross-validation design likely suffers from leakage (repeated sites plus spatial autocorrelation). The paper states five-fold CV with random partitioning, but the dataset contains repeated measurements at the same sites across years. Random folds will almost certainly place the same site in both training and test folds (even if in different years), inflating performance and plausibility diagnostics.
Additionally, DSM with dense covariates typically demands spatially blocked CV (or at least spatial buffering) to avoid optimistic error estimates.
I recommend grouped CV by site ID so that all time points for a site stay in one fold, ideally combined with spatial blocking (e.g., spatial k-fold) to reflect mapping/generalisation performance.
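A minimal sketch of site-grouped folds is below (stdlib only; in practice scikit-learn's `GroupKFold` with site IDs as `groups` achieves the same):

```python
import random
from collections import defaultdict

def grouped_kfold(site_ids, n_splits=5, seed=0):
    """Grouped k-fold: every record from a site lands in exactly one fold,
    so repeated site-year measurements never straddle train and test."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    # Assign whole sites (not records) to folds round-robin after shuffling.
    fold_of_site = {s: i % n_splits for i, s in enumerate(sites)}
    folds = defaultdict(list)
    for idx, s in enumerate(site_ids):
        folds[fold_of_site[s]].append(idx)
    return [folds[k] for k in range(n_splits)]
```

Spatial blocking can then be layered on top by assigning sites to folds by spatial block rather than at random.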
- The targets are described as transformed to reduce skewness and constrained to [0, 1] "through log transformation and scaling using a standard scaler." A standard scaler does not constrain values to [0, 1]; it standardises to mean 0 and variance 1. Please correct the description. Was min–max scaling used instead?
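The difference is easy to demonstrate with illustrative helpers (not the authors' pipeline):

```python
import statistics

def standard_scale(xs):
    """Standardise to mean 0, sd 1 -- the output is NOT bounded to [0, 1]."""
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

def minmax_scale(xs):
    """Min-max scaling -- this is what actually maps values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

Any value below the mean comes out negative under a standard scaler, so the [0, 1] claim in the text must refer to a different transformation.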
- SOC density can only be truly validated where BD and CF measurements exist, and here they exist only for 2018. The study claims robustness under sparse BD availability, but the "sparse BD reconstruction" claim has not been rigorously validated: the evidence amounts to better internal consistency and smoother time series, which does not necessarily imply correct reconstruction when BD is truly absent.
- The temporal consistency filter appears logically inconsistent (likely a typo or a mis-specified threshold). The text assumes SOC changes of < 0.5 g kg⁻¹ yr⁻¹ but then applies a "conservative threshold of 50 g kg⁻¹ yr⁻¹ for the maximum absolute difference across measurements." This is confusing. Please justify the threshold with citations and show a sensitivity analysis (how many series are removed under alternative thresholds).
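A one-line sensitivity check would suffice; the helper below is hypothetical (series values in g kg⁻¹, thresholds illustrative):

```python
def removed_series(series, thr_g_per_kg):
    """Count SOC time series whose maximum absolute difference between
    any two measurements exceeds the threshold -- for tabulating how many
    series each candidate threshold (e.g. 0.5 vs 50) would discard."""
    return sum(1 for vals in series if max(vals) - min(vals) > thr_g_per_kg)
```

Reporting this count for a small grid of thresholds would make the filter's effect on the training set transparent.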
- Some reported units and plausibility statements are incorrect. Example: "extreme changes exceeding 60 g cm⁻³" for SOC density trajectories; SOC density is in kg m⁻³ (or equivalently g L⁻¹), not g cm⁻³.
- Small sample sizes make the stratified metrics unreliable (Table 3): Wetland has N = 2 yet reports R² = 0.90, which is not meaningful.
- MSE and R² are fine as headline metrics, but heavy tails and log transforms can distort their interpretation. Please add a residual analysis (bias by SOC quantiles and by BD quantiles).
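Such a diagnostic is cheap to compute; a stdlib sketch (function name and binning scheme are my assumptions):

```python
from statistics import quantiles

def bias_by_quantile(y_true, y_pred, n_bins=4):
    """Mean residual (pred - true) within quantile bins of the observed
    target, exposing systematic bias that a single global R^2 or MSE hides."""
    cuts = quantiles(y_true, n=n_bins)  # n_bins - 1 cut points

    def bin_of(v):
        return sum(v > c for c in cuts)

    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(y_true, y_pred):
        b = bin_of(t)
        sums[b] += p - t
        counts[b] += 1
    return [s / c if c else float('nan') for s, c in zip(sums, counts)]
```

Applied to back-transformed predictions, this would show whether the log transform induces under-prediction in the high-SOC tail.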
For the joint SOC–BD space, consider distance or coverage metrics, i.e., how much predicted probability mass lies outside the observed support.
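A crude support-coverage diagnostic could look like this (the grid resolution and cell construction are my assumptions; kernel-density or convex-hull variants would also work):

```python
def outside_support_fraction(obs, pred, n_cells=10):
    """Fraction of predicted (SOC, BD) pairs falling in grid cells of the
    joint space that contain no observation -- a simple 'predicted mass
    outside observed support' check for physical plausibility."""
    xs = [p[0] for p in obs]
    ys = [p[1] for p in obs]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)

    def cell(p):
        cx = min(int((p[0] - x0) / (x1 - x0) * n_cells), n_cells - 1)
        cy = min(int((p[1] - y0) / (y1 - y0) * n_cells), n_cells - 1)
        return (cx, cy)

    occupied = {cell(p) for p in obs}
    out = sum(1 for p in pred
              if not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1)
              or cell(p) not in occupied)
    return out / len(pred)
```

Comparing this fraction across the univariate, multivariate, and soil science-informed models would quantify the plausibility claim directly.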
- Uncertainty quantification is missing. Since the accuracy gains are modest, uncertainty reduction may be the key value proposition. Please provide at least one uncertainty estimate (e.g., MC dropout or deep ensembles) for the SOC density maps.
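As a minimal sketch, a deep ensemble only requires aggregating member predictions; the training loop is omitted and this is not the authors' architecture:

```python
from statistics import fmean, stdev

def ensemble_mean_std(member_preds):
    """Per-sample mean and spread across independently trained ensemble
    members; the spread is a first-order uncertainty estimate that could
    accompany the SOC density maps.

    member_preds: list of per-model prediction lists, all the same length.
    """
    n = len(member_preds[0])
    means, stds = [], []
    for j in range(n):
        col = [m[j] for m in member_preds]
        means.append(fmean(col))
        stds.append(stdev(col))
    return means, stds
```

Even this simple spread would let the authors test whether the soil science-informed constraints shrink predictive uncertainty where BD is unobserved.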
- L.109: Eq. 2 is not mechanistic; it is still an empirical (pedotransfer-type) relationship, so the "mechanistic" framing should be rephrased.