Modelling and Interpreting Thermal Stability Indices to Understand Soil Carbon Stabilization Using Soil Properties Data
Abstract. Soil organic carbon (SOC) sequestration and nutrient cycling are related to the susceptibility of soil organic matter to biological decomposition. Several studies have demonstrated associations between biological stability and thermal stability, as assessed using programmed pyrolysis. We sought to develop parsimonious machine learning (ML) models to predict SOC stability indices from measured soil properties. The study analyzed indices such as S1, S2, and S3; the oxygen index (OI); the hydrogen index (HI); and T50, which reflect SOC composition, thermal behaviour, and stability. A total of 203 soil samples collected at 0–15 cm depth increments from agricultural, forest, and wetland landscapes in New Brunswick, Canada, were analyzed. Feature selection techniques were used to optimize the predictive models, which were developed using random forest (RF). Correlation results revealed that HI was negatively associated with pH (r = –0.35) and bulk density (r = –0.33), whereas OI showed a positive correlation with pH (r = 0.34). Thermal indices were more strongly related to soil chemistry and texture, with S1 closely correlated with total carbon (r = 0.88) and nitrogen, and S2 negatively associated with sand (r = –0.66) but positively associated with clay (r = 0.24) and POXC (r = 0.84). T50 showed positive correlations with both pH (r = 0.48) and bulk density (r = 0.36), indicating greater thermal stability in higher pH, compacted soils, though these patterns varied by land use. The RF model predicted the S1–S3 indices with high accuracy (CCC = 0.83–0.86), while HI and OI were more difficult to model (CCC = 0.44–0.48), suggesting missing biological or environmental predictors; NH₄⁺ and POXC emerged as key predictors. Structural equation modeling (SEM), after addressing multicollinearity, supported a hypothesis-driven model that explained ~54% of T50 variation.
Clay, dissolved organic carbon, pH, and aluminum showed significant direct associations with T50 (β for pH = 0.44), whereas bulk density showed no meaningful relationship. Our study demonstrates that ML and SEM can reveal patterns and associations between soil properties and thermal stability indices, offering insight into SOC stability under a changing climate as well as presenting a framework for rapid estimation of SOC stability proxies.
Using a dataset of n = 203 samples from the province of New Brunswick in Eastern Canada, the authors propose a detailed modeling and analysis framework linking thermal stability indicators with soil organic matter (SOM) stability. The framework includes both a predictive model (a Random Forest (RF) implementation in R) and a more analytical model, based on Structural Equation Modeling (SEM), that aims to remove the black-box quality so often associated with machine learning predictions. The manuscript represents scientific work of both good quality and significance. Its presentation, however, needs to be revised and, at times, restructured.
The RF modeling section presents a serious effort to find the most parsimonious set of predictors, both by eliminating (multi)collinearity in the feature set before model training and by applying recursive feature elimination to remove less informative features. During RF model training and selection, a grid search was used to find the best-fitting hyperparameters. The authors also included an a posteriori analysis using partial dependence plots (PDPs) and individual conditional expectation (ICE) plots. A detailed analysis of the relationships between the main thermal indicators (S1, S2, S3, OI, HI, T50) is introduced before the modeling results, confirming pre-existing hypotheses about the links between SOM stability and thermal stability indicators.
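To illustrate the kind of detail that would make the collinearity step reproducible, here is a minimal sketch of one common approach, a greedy pairwise-correlation filter. It is not the authors' procedure: the function names, the 0.9 threshold, the variable names, and the toy values are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(features, threshold=0.9):
    """Greedy filter: keep a feature only if its |r| with every
    already-kept feature does not exceed the threshold.

    `features` maps a feature name to its column of values. The result
    depends on the order of the features, so that order should be reported.
    """
    kept = []
    for name in features:  # dicts preserve insertion order
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns, invented for illustration only (not data from the manuscript).
features = {
    "total_carbon": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_nitrogen": [0.11, 0.19, 0.31, 0.40, 0.52],  # nearly proportional to carbon
    "pH": [6.5, 5.0, 7.0, 5.5, 6.0],
}

selected = drop_collinear(features, threshold=0.9)
print(selected)
```

Even in this small sketch, the outcome depends on the threshold, the correlation measure, and the feature order, which is why these choices should be stated explicitly in the text.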
In the modeling section, more effort should be put into detailing the exact parameters and libraries used, and explanations of common methods (like PDPs and ICEs) should be minimal and concise. A significant effort has been made to make the paper accessible to a wider audience; however, the presentation of many modeling choices was omitted in doing so.
Regarding code and data availability, the authors have declared that the data will be made available upon reasonable request (due to privacy concerns). Will the code be accessible online? If not, even more detail needs to be provided.
Metrics have to be introduced systematically and, where appropriate, references and/or equations should be provided. Certain metrics are never introduced, others change their nomenclature throughout the text, and some depend on the code implementation used (making code availability even more important).
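As an example of implementation dependence, consider the concordance correlation coefficient (CCC) reported in the abstract. The sketch below is my own illustration, not the authors' code. It computes Lin's CCC as 2·cov / (var_obs + var_pred + (mean_obs − mean_pred)²) and shows that the value changes with the variance convention (1/n versus 1/(n − 1)). The function name, the `ddof` parameter, and the toy values are assumptions made for this example.

```python
from statistics import fmean


def concordance_correlation(observed, predicted, ddof=0):
    """Lin's concordance correlation coefficient.

    ddof=0 divides variances and covariance by n (population moments);
    ddof=1 divides by n - 1 (sample moments). Both conventions are in use,
    so the one chosen should be stated.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    divisor = len(observed) - ddof
    if divisor <= 0:
        raise ValueError("not enough observations for the chosen ddof")
    mean_o, mean_p = fmean(observed), fmean(predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed) / divisor
    var_p = sum((p - mean_p) ** 2 for p in predicted) / divisor
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / divisor
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


# Toy values, invented for illustration only (not data from the manuscript).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]  # perfectly correlated, but biased by +0.5

ccc_population = concordance_correlation(observed, predicted, ddof=0)
ccc_sample = concordance_correlation(observed, predicted, ddof=1)
print(round(ccc_population, 4), round(ccc_sample, 4))
```

In this toy case the Pearson correlation is exactly 1, yet the CCC is below 1 because of the constant bias, and the two conventions give slightly different values. With n = 203 the difference between conventions would be small, but stating the equation removes the ambiguity.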
In the results section, and particularly in the summary results of the thermal indicators of stability, the authors need to distinguish more clearly between the conclusions drawn from their own results and the overall consensus regarding the relationships between thermal indicators and SOM stability. In general, the results subsections (3.1, 3.2…) are very repetitive and contain a lot of redundant information. The authors should consider rephrasing and shortening their explanations of the results so that the same conclusions are not repeated many times. As in the modeling section, the results are at times very elaborate while significant statistical and modeling choices are omitted.
The authors need to interpret PDP and ICE results with caution, as these cannot always be taken at face value due to the marginal effects they represent (as the authors have explained well in the modeling section).
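A small sketch of why caution is warranted: a PDP is the pointwise average of the ICE curves, so opposite individual effects can cancel out. The model, data, and function names below are hypothetical and chosen only to make the cancellation exact; they have no connection to the manuscript's fitted RF.

```python
def toy_model(x1, x2):
    """Hypothetical fitted model with a pure interaction:
    the effect of x1 changes sign with x2."""
    return x1 * x2


def ice_curves(model, rows, grid):
    """One curve per observation: sweep x1 over the grid while keeping
    that observation's own x2 fixed."""
    return [[model(g, x2) for g in grid] for (_, x2) in rows]


def partial_dependence(curves):
    """PDP as the pointwise average of the ICE curves."""
    n = len(curves)
    return [sum(curve[j] for curve in curves) / n for j in range(len(curves[0]))]


# Toy observations (x1, x2), invented for illustration; x2 is symmetric around zero.
rows = [(0.0, -1.0), (0.5, -0.5), (1.0, 0.5), (1.5, 1.0)]
grid = [0.0, 1.0, 2.0]  # values of x1 at which the curves are evaluated

curves = ice_curves(toy_model, rows, grid)
pdp = partial_dependence(curves)

for (_, x2), curve in zip(rows, curves):
    print(f"ICE for x2={x2:+.1f}: {curve}")
print("PDP:", pdp)
```

Here the PDP is flat at zero even though every ICE curve has a clear slope, so a reader looking only at the PDP would wrongly conclude that x1 has no effect. Note also that the grid sweeps x1 independently of x2, so when predictors are correlated both plots can include combinations that never occur in the data.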
Detailed corrections are provided in a commented PDF. I urge the authors to pay attention to consistency in the introduction, repetition, and capitalization of acronyms and initialisms. The same goes for the introduction of new terms, chemical solutions, and references.