Towards resolving poor performance of mechanistic soil organic carbon models

Wang, Lingfei; Abramowitz, Gab; Wang, Ying-Ping; Pitman, Andy; Ciais, Philippe; Goll, Daniel S.

doi:10.5194/egusphere-2025-2545

Preprints

https://doi.org/10.5194/egusphere-2025-2545

19 Jun 2025

Abstract. The accuracy of soil organic carbon (SOC) models and their ability to capture the relationship between SOC and environmental variables are critical for reducing uncertainties in future projections of the soil carbon balance. In this study, we evaluate the performance of two state-of-the-art mechanistic SOC models, the vertically resolved MIcrobial-MIneral Carbon Stabilisation (MIMICS) and Microbial Explicit Soil Carbon (MES-C) models, against a machine learning (ML) approach. By applying multiple interpretable ML methods, we find that the poorer performance of the two mechanistic models is associated both with missing key variables and with the underrepresentation of the role of existing variables. Soil cation exchange capacity (CEC) is identified as an important predictor missing from mechanistic models, and soil texture is given more importance in models compared to observations. Although the overall relationships between SOC and individual predictors are reasonably captured, the varying sensitivity across the entire predictor range is not replicated by mechanistic models, most notably for net primary production (NPP). Observations exhibit a nonlinear relationship between NPP and SOC while models show a simplistic positive trend. Additionally, MES-C largely diminishes interacting effects of variable pairs, whereas MIMICS produces mismatches relating to the interactions between NPP and both soil temperature and moisture. Mechanistic models also fail to reproduce the interactions among soil moisture, soil texture, and soil pH, hindering our understanding of SOC stabilisation and destabilisation processes. Our study highlights the importance of improving the representation of environmental variables in mechanistic models to achieve a more accurate projection of SOC under future climate conditions.

Received: 30 May 2025 – Discussion started: 19 Jun 2025

Competing interests: At least one of the (co-)authors is a member of the editorial board of Biogeosciences. The peer-review process was guided by an independent editor, and the authors also have no other competing interests to declare.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3711 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Supplement (1008 KB)

Journal article(s) based on this preprint

09 Dec 2025

Using explainable AI to diagnose the representation of environmental drivers in process-based soil organic carbon models

Lingfei Wang, Gab Abramowitz, Ying-Ping Wang, Andy Pitman, Philippe Ciais, and Daniel S. Goll

Biogeosciences, 22, 7845–7863, https://doi.org/10.5194/bg-22-7845-2025, 2025

Interactive discussion

Status: closed

CC1:
'Comment on egusphere-2025-2545: Machine learning versus “mechanistic” modelling of soil carbon dynamics: Are current comparison attempts meaningful?', Philippe C. Baveye, 13 Jul 2025

Machine learning versus “mechanistic” modelling of soil carbon dynamics: Are current comparison attempts meaningful?
Philippe C. Baveye^1*
¹Saint Loup Research Institute, 7 rue des chênes, La Grande Romelière, 79600 Saint Loup Lamairé, France.
The manuscript that Wang et al. have submitted for possible publication in Biogeosciences raises a number of questions about the soundness of the comparison that these authors carry out between machine learning and the more traditional, “mechanistic” modelling of soil carbon dynamics.

First, a word of caution is in order about the use of the qualifier “mechanistic” used by the authors. This term conveys the notion that the models are based on detailed descriptions of actual mechanisms of processes taking place in soils. Formally, that may seem to be the case, but one could argue that for macroscopic or ecosystem-scale models like Century, RothC, and the like, it hides assumptions that may or may not correspond to reality. Take the decomposition of litter pools and SOC pools in both MIMICS and MES-C as an example. The authors consider that this decomposition follows a Michaelis-Menten kinetic equation. Heße et al. (2009), among others, showed some time ago that if one assumes that a Monod-type equation (mathematically identical to Michaelis-Menten) describes microbial growth at the microscale in a natural porous medium, the corresponding equation at the macroscopic scale, over soil volumes that encompass many microorganisms and pores, does not in general follow Monod-type kinetics. Only in special cases can one write a Monod-type equation at the macroscopic scale as an approximation. In soils, computer simulations using microscale models suggest that there is no evidence that a Monod formalism holds in general at the macroscale, even as an approximation (e.g., Falconer et al., 2015; Mbé et al., 2022). The same goes for most “mechanistic” descriptions included in MIMICS and MES-C. Unless there is clear experimental evidence that the “mechanistic” descriptions of processes at the macroscopic scale are reliable, or there is microscopic evidence and a robust upscaling approach that produces macroscopic descriptions of processes, macroscopic models based on so-called “mechanistic” descriptions are in fact entirely empirical and should be treated as such.
From that standpoint, following Baveye’s (2023) discussion of Schimel’s (2023) analysis, the label of “mechanistic” should be used with significant caution when referring to ecosystem-scale models of soil organic matter dynamics at this stage, since experimental data at the landscape- or ecosystem scales in support of some of the mechanisms included in the models are lacking and since the upscaling hurdle still needs to be addressed satisfactorily.
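The upscaling argument can be illustrated with a small numerical sketch. It is not taken from Heße et al. (2009) or from MIMICS or MES-C: the parameter values `V_MAX` and `K_M` and the two-microsite substrate distribution are hypothetical, chosen only to show that averaging a Michaelis-Menten rate over heterogeneous microsites need not equal the same equation applied to the mean substrate concentration.

```python
# Illustrative only: hypothetical parameters, not values from MIMICS or MES-C.
V_MAX = 1.0  # maximum rate (arbitrary units)
K_M = 2.0    # half-saturation constant (same units as the substrate)


def michaelis_menten(substrate):
    """Microscale rate at a single microsite."""
    return V_MAX * substrate / (K_M + substrate)


# Two equally sized microsites with very different substrate concentrations.
microsite_substrate = [0.2, 7.8]
mean_substrate = sum(microsite_substrate) / len(microsite_substrate)

# Rate obtained by averaging the microscale rates over the soil volume.
upscaled_rate = sum(michaelis_menten(s) for s in microsite_substrate) / len(
    microsite_substrate
)

# Rate obtained by applying the same equation to the mean concentration.
rate_at_mean = michaelis_menten(mean_substrate)

print(f"mean substrate   : {mean_substrate:.2f}")
print(f"average of rates : {upscaled_rate:.3f}")
print(f"rate at the mean : {rate_at_mean:.3f}")
```

Because the rate law is concave in the substrate concentration, the volume-averaged rate in this toy case is lower than the rate evaluated at the mean. Any macroscale parameters fitted to such a system would therefore depend on the substrate distribution, not only on the microscale kinetics.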

Beyond this point of terminology, a first question that the manuscript by Wang et al. (2025) raises concerns the specific choice of MIMICS and MES-C. Neither in this manuscript, nor in the previous one by Wang et al. (2024), is this choice justified. One would have expected the authors to first review all available “mechanistic” models available, determine their respective pros and cons, and on the basis of this state-of-the-art review, choose one or two of the models for comparison with machine learning. The fact that this eminently reasonable approach was apparently not adopted raises the question of the reasons why MIMICS was selected and subsequently modified into MES-C. In this respect, it is hard to avoid thinking that MIMICS was chosen because it requires very few input data. In Table 1 of the manuscript by Wang et al. (2025), it appears that it requires knowledge only of the soil temperature, soil moisture content, soil clay content, and net primary production. That is a puzzlingly limited set of data, which one could easily argue would be insufficient to account for the complexity of soil organic carbon dynamics. Nevertheless, if the objective that is pursued is a comparison with an approach based on machine learning, one perhaps does not have the luxury of using a more sophisticated process-based model, requiring a larger input data set. Indeed, machine learning demands very large amounts of data, as is evinced by the 37,691 SOC profiles considered by Wang et al. (2025). Any comparison attempt would rapidly become unmanageable if for each one of these profiles, in order to run process-based models, one needed more soil data than what can be readily retrieved from available soil databases. 
The downside of this is that, by choosing a simple, let alone simplistic, “mechanistic” model in what it is difficult not to view as a “strawman” strategy, one predetermines the ultimate outcome of the comparison, since the odds are good that this model will exhibit “poor performance”, for example, as noted by Wang et al. (2025), by showing “a simplistic positive trend” between the net primary production (NPP) and SOC, whereas in reality the observed relationship is nonlinear.

The next question concerns the soundness of comparing models that have different numbers of parameters, or involve different numbers of variables. Many years ago, modellers used to say that with 4 parameters, one could draw an elephant. With 5 parameters, one could make it wag its tail. The message was that a model with more degrees of freedom should always be expected to fit experimental data better than one with fewer degrees of freedom. From this perspective, it is not surprising that in Table 2 of Wang et al. (2025), RF_env, which involves 13 variables, leads to significantly better performance metrics than RF_imp and MES-C, which involve only 6 variables. (In this respect, the fact that MIMICs, involving only 4 variables, has poorer performance metrics than MES-C is perplexing.) That does not mean that the additional 7 variables considered by RF_env relative to RF_imp and MES-C are necessarily meaningful. Several authors have indeed demonstrated that statistically stronger patterns can emerge from spatial databases to which one has deliberately added utterly irrelevant information, for example, a painting or the photograph of a colleague (Fourcade et al., 2018; Behrens and Viscarra Rossel, 2020). In this context, it is not at all clear that machine learning, applied to large data sets, can always be useful in “guiding future model development”, as Wang et al. (2025) suggest. Thus when these authors consider that the CEC of soils, which is taken into account by the machine learning model RF_env and has a high PVI (Permutation Variable Importance), should be taken into account in future work on “mechanistic” models, it is unclear at this point how reliable this conclusion is, based as it is only on statistical considerations.
One might argue that the careful analysis of Julien and Tessier (2021) of the molecular and microscale processes involved in the fate of SOC in soils, leading to the conclusion that CEC is an important factor controlling it, offers far more convincing grounds for thinking that there may be something there worth investigating further.
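For readers unfamiliar with PVI, the following minimal sketch shows how such a score is typically computed: a predictor column is shuffled and the resulting increase in prediction error is recorded. The data, the stand-in `predict` function, and the helper names are all hypothetical; this is not the RF_env model or the authors' implementation. The point it illustrates is that the score measures how strongly a fitted model relies on a predictor, which is a statistical statement about the model rather than evidence of a mechanism.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, column, rng):
    """Increase in MSE after shuffling one predictor column."""
    baseline = mse(y, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)
    ]
    return mse(y, [predict(r) for r in permuted_rows]) - baseline


rng = random.Random(0)

# Hypothetical data: column 0 drives the response, column 1 is pure noise.
rows = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(500)]
y = [3.0 * r[0] + rng.gauss(0, 1) for r in rows]


def predict(row):
    """Stand-in for a fitted model; it happens to use only column 0."""
    return 3.0 * row[0]


importance = [permutation_importance(predict, rows, y, c, rng) for c in range(2)]
print(importance)
```

In this toy case the score for column 1 is zero only because the stand-in model ignores it. A flexible learner that had latched onto an irrelevant but spatially structured predictor would assign it a non-zero score, which is the concern raised above.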

An additional, even more basic question that the research of Wang et al. (2025) raises has to do with whether or not it is in fact meaningful to try to compare the so-called “mechanistic” modelling approach with one based on machine learning, insofar as the dynamics of SOC is concerned. The very large dataset assembled by Wang et al. (2025) evinces one of the key characteristics of machine learning algorithms, and more generally of data-driven and artificial intelligence approaches, namely that they require massive amounts of data (e.g., Baveye, 2021; Wadoux, 2025; Minasny and McBratney, 2025). When dealing with field soil properties, like the fate of their SOC, it is customary, in order to obtain the requisite amount of data to apply machine learning algorithms, to work with extended geographical areas. In such cases, the outcomes of the application of machine learning may be useful to address issues relevant at these large scales, but they may be inadequate to address issues at smaller spatial scales, like in a watershed, ecoregion, ecosystem, agricultural field, or even at specific locations in a field where an experiment is taking place. At these much smaller spatial scales, models attempting to describe soil processes explicitly and in detail appear far more appropriate.

In this context, those who still read the literature may recall a piece of research that had some notoriety 40 years ago. Using a large climatic database, Folkoff et al. (1981) produced a map of the pH of the surface horizon of soils within the conterminous United States that was based solely on climatic conditions. Although the outcome of their research had undeniable intellectual interest, it proved to be of little practical use, and as a result their article has been seldom cited since 1981. Knowing that the climate correlates well with soil pH does not help farmers manage the acidity in their field or governmental agencies assess, e.g., the local impact of acid rain on agricultural yields or environmental degradation. For those practical situations, a different kind of model of acidity-related soil processes is needed, taking into account the spatial heterogeneity of soils. Occasionally, the larger-scale, statistical observations may be useful. Worrall (2001), using observations of pesticide occurrence in 303 boreholes across 12 states in the midwest of the U.S., showed that the molecular topology of the pesticide molecules themselves is, in and of itself, a sufficient basis to discriminate globally between polluting and non-polluting pesticide compounds, regardless of the spatial variation of the soils through which the pesticides transit. On that basis, one might come up with nation-wide or regional regulations allowing the use of some pesticides and banning others (Baveye and Laba, 2015), but it would not be possible to predict whether and how fast a given pesticide compound applied to a given agricultural field will reach the underlying groundwater.

At the moment, a tremendous amount of hype is associated with machine learning, data-driven research, and artificial intelligence in the soil science and biogeosciences communities. There is therefore a good chance that the manuscript by Wang et al. (2025), whose title clearly advertises a bias in favour of these approaches, will be well received by at least a portion of researchers in these disciplines. The risk in this respect is that ill-inspired comparisons would further fuel the drive to accumulate massive amounts of potentially meaningless data to feed data-hungry algorithms. There may be limited uses for the latter, but it would be unwise, I think, to put too much emphasis on them, or to rely on them too closely to improve the process-based models that could be useful to us in practice. For the development of these process-based models, what we need is not massive amounts of data that are haphazardly assembled, but data that actually make sense.
References
Baveye, P. C. (2022). “Data‐driven” versus “question‐driven” soil research. European Journal of Soil Science, 73(1), e13159.
Baveye, P. C. (2023). Ecosystem-scale modelling of soil carbon dynamics: time for a radical shift of perspective?. Soil Biology and Biochemistry, 184, 109112.
Baveye, P. C., & Laba, M. (2015). Moving away from the geostatistical lamppost: Why, where, and how does the spatial heterogeneity of soils matter?. Ecological Modelling, 298, 24-38.
Behrens, T., & Viscarra Rossel, R. A. (2020). On the interpretability of predictors in spatial data science: The information horizon. Scientific Reports, 10(1), 16737.
Falconer, R. E., Battaia, G., Schmidt, S., Baveye, P., Chenu, C., & Otten, W. (2015). Microscale heterogeneity explains experimental variability and non-linearity in soil organic matter mineralisation. PloS one, 10(5), e0123774.
Folkoff, M. E., Meentemeyer, V., & Box, E. O. (1981). Climatic control of soil acidity. Physical Geography, 2(2), 116-124.
Fourcade, Y., Besnard, A. G., & Secondi, J. (2018). Paintings predict the distribution of species, or the challenge of selecting environmental predictors and evaluation statistics. Global Ecology and Biogeography, 27(2), 245-256.
Heße, F., Radu, F. A., Thullner, M., & Attinger, S. (2009). Upscaling of the advection–diffusion–reaction equation with Monod reaction. Advances in water resources, 32(8), 1336-1351.
Julien, J. L., & Tessier, D. (2021). Rôles du pH, de la CEC effective et des cations échangeables sur la stabilité structurale et l’affinité pour l’eau du sol. Etude et Gestion des sols, 28, 159-179.
Mbé, B., Monga, O., Pot, V., Otten, W., Hecht, F., Raynaud, X., ... & Garnier, P. (2022). Scenario modelling of carbon mineralization in 3D soil architecture at the microscale: Toward an accessibility coefficient of organic matter for bacteria. European Journal of Soil Science, 73(1), e13144.
Minasny, B., & McBratney, A. B. (2025). Machine Learning and Artificial Intelligence Applications in Soil Science. European Journal of Soil Science, 76(2), e70093.
Schimel, J. (2023). Modeling ecosystem-scale carbon dynamics in soil: the microbial dimension. Soil Biology and Biochemistry, 178, 108948.
Shi, Z., Crowell, S., Luo, Y., & Moore, B. (2018). Model structures amplify uncertainty in predicted soil carbon responses to climate change. Nature Communications, 9(1), 1–11. https://doi.org/10.1038/s41467-018-04526-9
Wadoux, A. M. C. (2025). Artificial intelligence in soil science. European Journal of Soil Science, 76(2), e70080.
Wang, L., Abramowitz, G., Wang, Y. P., Pitman, A., & Viscarra Rossel, R. A. (2024). An ensemble estimate of Australian soil organic carbon using machine learning and process-based modelling. Soil, 10(2), 619-636.
Wang, L., Abramowitz, G., Wang, Y. P., Pitman, A., Ciais, P., & Goll, D. S. (2025). Towards resolving poor performance of mechanistic soil organic carbon models. EGUsphere, 2025, 1-32.

Citation: https://doi.org/10.5194/egusphere-2025-2545-CC1
- CC2: 'Minor erratum on CC1', Philippe C. Baveye, 13 Jul 2025
  
  In the 4th paragraph, where the text reads "In this respect, the fact that MIMICs, involving only 4 variables, has poorer performance metrics than MES-C is perplexing", it should read instead "In this respect, the fact that MIMICs, involving only 4 variables, has performance metrics that are comparable to those of MES-C is perplexing"
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-CC2
- AC1: 'Reply on CC1', Lingfei Wang, 18 Jul 2025
  
  Dear Dr. Baveye,
  Thank you for your time and comments on our manuscript.
  In this study, we collected globally distributed SOC observations and applied both machine learning and microbial-explicit models to simulate SOC content. However, we would like to clarify that the primary aim of our work is not to compare the predictive performance of machine learning versus microbial-explicit models. Rather, our central objective is to explore how information derived from machine learning models can be used to identify potential shortcomings in microbial-explicit models and guide their improvement at the global scale. We will revise the manuscript to emphasize this point more clearly throughout.
  Regarding the use of the term “mechanistic”, we agree that many processes are not well represented or missing in the microbial-explicit soil carbon models we used in this study, and therefore the models are not fully mechanistic. We will adopt the term process-based models instead, clearly define the term, and revise the manuscript title to: “Towards Resolving the Poor Performance of Microbial-Explicit Soil Carbon Models.”
  Regarding the point you raised about the choice of process-based models and the kinetics of litter decomposition, MIMICS has been well validated at site level and has been shown to simulate litter decomposition and SOC responses to experimental warming quite well (Wieder et al., 2014; Wieder et al., 2015). MIMICS is also representative of soil modules deployed in global carbon cycle models, as it is used in, e.g., CLM (Wieder et al., 2018) and ORCHIDEE (Goll et al., 2023). These points support our decision to use MIMICS as a representative example of microbial-explicit models in this study and develop MES-C based on MIMICS to incorporate more advanced theories about soil carbon stabilisation and destabilisation. We will clarify these aspects in the revised manuscript.
  Regarding the difference in the number of predictors used by the random forest (RF) and microbial-explicit models, we agree that an RF model with more variables may naturally exhibit stronger performance. However, our intent was not to compare the models purely on predictive accuracy, but to use the RF model as a diagnostic tool to identify potential missing environmental drivers in microbial-explicit models. We will clarify this point in the revised manuscript.
  We appreciate your feedback, which has helped us refine the focus of our manuscript.
  Sincerely,
  Lingfei Wang
  On behalf of all co-authors
  
  Goll, D. S., Bauters, M., Zhang, H., Ciais, P., Balkanski, Y., Wang, R. and Verbeeck, H.: Atmospheric phosphorus deposition amplifies carbon sinks in simulations of a tropical forest in Central Africa, New Phytologist, 237, 2054-2068, https://doi.org/10.1111/nph.18535, 2023.
  Wieder, W., Grandy, A., Kallenbach, C. and Bonan, G.: Integrating microbial physiology and physio-chemical principles in soils with the MIcrobial-MIneral Carbon Stabilization (MIMICS) model, Biogeosciences, 11, 3899-3917, https://doi.org/10.5194/bg-11-3899-2014, 2014.
  Wieder, W., Grandy, A., Kallenbach, C., Taylor, P. and Bonan, G.: Representing life in the Earth system with soil microbial functional traits in the MIMICS model, Geoscientific Model Development, 8, 1789-1808, https://doi.org/10.5194/gmd-8-1789-2015, 2015.
  Wieder, W., Hartman, M. D., Sulman, B. N., Wang, Y. P., Koven, C. D. and Bonan, G. B.: Carbon cycle confidence and uncertainty: Exploring variation among soil biogeochemical models, Global Change Biology, 24, 1563-1579, https://doi.org/10.1111/gcb.13979, 2018.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC1
RC1:
'Comment on egusphere-2025-2545', Anonymous Referee #1, 18 Aug 2025
In their manuscript, titled ‘Towards resolving poor performance of mechanistic soil organic carbon models’, Wang et al. describe their results of a study comparing the performance of two mechanistic soil organic carbon models (MIMICS and the newly-proposed MES-C) to a machine learning approach (random forest). They use the mechanistic models and random forest models to predict the amount of soil organic carbon for multiple locations around the world, after which they compare model results to depth profiles of SOC from the WoSIS database. Based on their results, the authors conclude that random forest models consistently outperformed mechanistic models. In addition, they used interpretable machine learning methods to conclude that both mechanistic soil organic carbon models perform poorly due to lack of accounting for key variables (most notably CEC), while the role of existing variables is underrepresented. They compare the controlling factors in the mechanistic models to factors controlling the SOC content at the global scale, and find that, for example, the role of the clay and silt content of the soil is overestimated by the models, compared to observations. Furthermore, the authors assess the sensitivity of model outcomes along a range of predictor values, and study how pairs of predictors affect model outcomes, compared to observations.
The development and application of mechanistic soil organic matter models, among which MIMICS, has gained much importance over the past two decades. Therefore, the assessment of model performance and the relative importance of model drivers in making predictions of the SOC content, compared to observations, is needed to improve these models to increase confidence in their outcomes. I appreciate the effort done to collect the data, develop a new model and apply it, together with MIMICS, at the global scale, which is a challenging endeavor. The manuscript is well-written and well-structured. The abstract conveys the main messages and the introduction provides sufficient background to the topic, although I would encourage the authors to include more information about previous studies assessing controls of the SOC content at the global scale, which is the topic of their study. Although the methods section describes the performed analyses and data collection strategy, much information about the performed model simulations is missing to assess the quality of the performed simulations (see below). The results section concisely describes the results, but I would encourage the authors to show the model results in a graphical way, for example using scatterplots. The discussion touches upon the main results, but mainly sections 4.3 and 4.4 lack clear messages, and are rather a summing up of the relations the authors found.
Based on my evaluation I recommend major revisions, mainly due to (1) the lack of detailed information which would be necessary to evaluate the performed model simulations and (2) the nearly complete attribution of the mismatch between model results and measurements to the structure of the model, without accounting for uncertainties in input data at the global scale (see below). My main concerns about the manuscript are the following, while detailed feedback is provided below.
The authors propose a new mechanistic soil organic matter model that is more complex than MIMICS, as it also incorporates SOC in aggregates. However, the authors do not have any data on the distribution of measured SOC among different pools, so they have no data to either constrain the distribution of simulated SOC among the different model pools, or to evaluate this simulated distribution. Therefore, I invite the authors to justify the use of such a (complex) model at the global scale, given this lack of data. The evaluation of the simulation results based on data for only total SOC inevitably leads to overparameterization and equifinality. This is an important, but generally overlooked, aspect of environmental models, the consequences of which for the model simulations should be discussed in the manuscript.

The authors attribute the ‘poor performance’ of the mechanistic models exclusively to the structure of these models. However, there are many more uncertainties leading to a mismatch between model results and measurements:
Uncertainties related to model drivers, which were extracted from global data sets. For example, NPP data derived from MODIS cannot be expected to correctly represent C inputs from vegetation to the soil, as, for example, a substantial part of aboveground litter C will be respired before it can enter the soil. A model can never be better than the input data, so accounting for uncertainties in the data is important.

The applied mechanistic models are too complex compared to the data available for model evaluation, i.e., they are overparameterized. Six parameters are evaluated while only one output (total SOC) is evaluated (it seems, as this is not described clearly in the manuscript). This inevitably leads to large uncertainties about model results.

These aspects should be discussed, because it is evident that an overparameterized model with uncertain inputs will lead to a ‘poor performance’.
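The equifinality concern can be made concrete with a deliberately simple sketch. It uses a hypothetical two-pool, first-order model at steady state, not MIMICS or MES-C, and all parameter values and names are invented for illustration. Two different parameter sets reproduce the same total SOC while partitioning it very differently between pools, so total SOC alone cannot discriminate between them.

```python
def steady_state_pools(carbon_input, fraction_to_fast, k_fast, k_slow):
    """Steady-state stocks of a hypothetical two-pool first-order model.

    dC_fast/dt = f * I - k_fast * C_fast
    dC_slow/dt = (1 - f) * I - k_slow * C_slow
    Units are illustrative: I in kg C m-2 yr-1, k in yr-1, stocks in kg C m-2.
    """
    fast = fraction_to_fast * carbon_input / k_fast
    slow = (1.0 - fraction_to_fast) * carbon_input / k_slow
    return fast, slow


# Two hypothetical parameter sets with the same carbon input.
set_a = steady_state_pools(1.0, 0.5, k_fast=0.5, k_slow=0.05)
set_b = steady_state_pools(1.0, 0.5, k_fast=0.1, k_slow=0.5 / 6.0)

total_a = sum(set_a)
total_b = sum(set_b)

print(f"set A: fast={set_a[0]:.2f}, slow={set_a[1]:.2f}, total={total_a:.2f}")
print(f"set B: fast={set_b[0]:.2f}, slow={set_b[1]:.2f}, total={total_b:.2f}")
```

Both sets give the same total stock, yet the fast pool differs by a factor of five. Pool-level or turnover-time observations would be needed to tell them apart.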
The methods related to the application of the mechanistic models are not sufficiently described. For example:
Down to which depth were the simulations performed? To 1.5 m, as this is the maximum depth down to which measurements were available?

Were the simulations performed at a vertical resolution of 30 cm?

For how long were the simulations run? Was there a spin-up period, followed by a run with actual temperature and precipitation data?

In which programming language were the models coded?

With which time step were the simulations performed? Which solver was used to solve the differential equations?

How were locations with a different land use treated? Was there a different vertical resolution for C inputs per land use?

No sufficient information about model evaluation is provided. As the authors conclude that the mechanistic models perform poorly compared to machine learning approaches, a thorough model evaluation is important to support this conclusion. For example:
What was the simulated turnover time of SOC at different soil depths? Was this in line with measurements (e.g., DOI: 10.1126/science.aad4273)? Did this decrease with soil depth, as is generally observed?

How was the simulated amount of SOC distributed among the different pools? Was this distribution in line with observations?

The authors should justify why they chose to perform a global study, instead of focusing on a smaller region with less uncertainties about input data for the models and more detailed data for model evaluation.

Detailed feedback
Abstract
L23: Would be good to mention which ‘existing variables’ these are
L29: What do you mean with a ‘simple trend’?

Introduction
L45-55: It is not clear what the ‘effectiveness of land management strategies’ has to do with the ability to accurately estimate SOC content
L73: The study by Gurung et al. is a study on DayCent, so not applicable as a reference for the ‘poorly constrained parameter values of microbial explicit models’
L83: Georgiou et al. (2022) is not a modelling study, so not appropriate as a reference for this sentence.
L107: ‘accuracy’ of what?
L111: R2 is not a measure of model performance (see e.g. https://en.wikipedia.org/wiki/Anscombe%27s_quartet), so please remove this throughout the manuscript when used for this purpose.
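One way to see the limitation of R2 raised in the last comment is the sketch below. It assumes R2 is computed as the squared Pearson correlation between observed and predicted values (this page does not state which definition the manuscript uses), and the numbers are hypothetical rather than Anscombe's data. Predictions that are strongly biased can still yield an R2 of 1.

```python
import math


def pearson_r2(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


# Hypothetical SOC contents (g/kg) and a model that doubles them and adds 15.
observed = [5.0, 10.0, 20.0, 40.0, 80.0]
biased_prediction = [2.0 * o + 15.0 for o in observed]

r2 = pearson_r2(observed, biased_prediction)
error = rmse(observed, biased_prediction)

print(f"R2   = {r2:.3f}")
print(f"RMSE = {error:.1f} g/kg")
```

The correlation-based score is perfect although every prediction is far from its observation, which is why error-based measures such as RMSE or bias are needed alongside it.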

Methods
General methods: Does the WoSIS database contain SOC stocks, i.e., a combination of OC% and bulk density, for all profiles that were used? If not, how was OC% converted to SOC stocks? Although it is not described in the manuscript, I assume absolute amounts of SOC were simulated? To compare these to observations of OC% in WoSIS, these would need to be converted to OC%. How did this happen? This information should be provided in the manuscript.
L143-144: where did the values for the parameter to distribute litter inputs over metabolic and structural litter come from? As this is based on litter quality, was this value different for different types of vegetation?
L143-158: more information about the chosen version of MIMICS is needed. What did the soil moisture scalar look like? This is important information, as you discuss the limited ability of MIMICS to account for moisture later in the manuscript. How was bioturbation simulated? Was the magnitude of diffusion the same for every land use and vegetation type? Were the same parameter values used for all simulated depths? How was, for example, the generally-observed decrease in the rate of SOC cycling with depth accounted for?
L181-182: In MIMICS, doesn’t the physically-protected SOC pool represent mineral-associated OC?
L183: this knowledge is not ‘recent’ but has been known for decades
Section 1.2.1: as you describe a new model (although based on MIMICS), I encourage the authors to describe the equations that they added to MIMICS in the main manuscript, not only in the supplement.
Section 2.2.1: did you check for autocorrelation between the predictors before performing the analyses?
L223-225: did this ‘negative exponential function’ have different parameters for different vegetation types?
L233: weighted based on what?
L234: what do you mean by ‘standardized’?
L235: ‘mean annual values’: over which time period were these means calculated?
L248: what is ‘SOC profile data’? Only OC%, or also bulk density, to convert OC% into stocks?
L255: so all profiles that were used had data down to 1.5 m? Please clarify this in the manuscript
L257: what happened to 1x1 km grid cells that did not have any profiles in WoSIS? Were these cells not simulated?
L262: It is not clear how a cumulative SOC content (in g/kg) can be calculated, as you can only calculate a cumulative profile by summing absolute values, not concentrations. Please clarify.
L266-267: If no data below 60 cm were present, were these deeper layers also simulated? How did you resample the data to 0-30, 30-60, … intervals? Were OC% weighted based on bulk density? Or were averages of the OC% used? Please clarify.
Fig. 1: Why is there a sharp boundary latitude below which no data are present?
Section 2.3: More information about the parameter optimization process is needed:
- Which data were used to optimize model parameters? Only total SOC%? If so, how can you assess if the distribution of simulated SOC over the different model pools was correct?
- Which measure was optimized? RMSE?
- Which program was used to optimize model parameters?
- Did you use one constant desorption parameter for the entire globe, or per ‘cluster’? If so, the turnover rate of SOC is very unlikely to be simulated correctly, as this is different in different ecosystems/soil types/… Please clarify.
- Please provide the optimized values of the calibrated parameters.
L285: what do you mean by ‘relatively sensitive’? How was this sensitivity calculated?
L300: what do you mean by ‘for reference’
L302-303: R2 is not a measure for model performance, and should not be used as such (see above). Please remove and perform the analysis without this measure.
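Several of the methods comments above (General methods, L262, L266-267) turn on the difference between OC concentrations and SOC stocks. As a minimal sketch of that distinction, and not a description of what the manuscript actually did, the following Python snippet converts a concentration to a stock and resamples layers to a target depth interval using soil-mass weights. The function names, the layer tuple layout and the example values are all hypothetical; the only fixed ingredient is the unit conversion factor of 0.01 (g kg-1 × g cm-3 × cm gives kg C m-2 after multiplying by 0.01).

```python
def soc_stock_kg_m2(oc_g_per_kg, bulk_density_g_cm3, thickness_cm, coarse_frac=0.0):
    """SOC stock of one layer in kg C m-2.

    oc [g C / kg soil] * bd [g soil / cm3] * thickness [cm] = 1e-3 g C / cm2,
    and 1e-3 g cm-2 equals 0.01 kg m-2, hence the factor 0.01.
    """
    return oc_g_per_kg * bulk_density_g_cm3 * thickness_cm * (1.0 - coarse_frac) * 0.01


def mass_weighted_oc(layers, top_cm, bottom_cm):
    """Soil-mass-weighted mean OC (g/kg) over [top_cm, bottom_cm).

    layers: list of (layer_top_cm, layer_bottom_cm, oc_g_per_kg, bd_g_cm3).
    Returns None if no layer overlaps the requested interval.
    """
    carbon = 0.0
    soil = 0.0
    for l_top, l_bot, oc, bd in layers:
        overlap = min(l_bot, bottom_cm) - max(l_top, top_cm)
        if overlap <= 0:
            continue
        soil_mass = bd * overlap  # g soil per cm2 of column
        soil += soil_mass
        carbon += oc * soil_mass
    return carbon / soil if soil > 0 else None


# Hypothetical two-layer profile: (top, bottom, OC g/kg, BD g/cm3)
profile = [(0, 10, 30.0, 1.0), (10, 30, 15.0, 1.5)]

# Stocks are additive over depth; concentrations are not.
stock_0_30 = sum(soc_stock_kg_m2(oc, bd, bot - top) for top, bot, oc, bd in profile)
oc_mass_weighted = mass_weighted_oc(profile, 0, 30)
oc_thickness_weighted = sum(oc * (bot - top) for top, bot, oc, _ in profile) / 30.0
```

In this made-up profile the mass-weighted and thickness-weighted mean concentrations differ, which is why the weighting choice asked about in the comments matters when layers have different bulk densities.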

Results
Section 3.1: scatterplots (measured vs modelled) would be necessary to evaluate model performance. I encourage the authors to provide these.
L352: Clarify what you mean by ‘out-of-sample’
L357-358: how do you see in Table 2 that the predictability decreases with depth, since all measures decrease with depth? It would be better to use relative measures (scaled by the measured OC%).
Table 2: I encourage the authors to remove R2 from the table, as this is not a measure of model performance (a very bad simulation, for example a systematic over- or underestimation, can have an R2 value close to 1).
Fig. 3: how is it possible that in a) NPP has a ranking far below AMT, while in b) (same data but fewer predictor variables) this order is reversed?
Fig. 3: Is the variable Soil Temp different from AMT? Why is this variable not present in plot a)? Please clarify.
L547-549: While this is partially true, there is still a substantial correlation between clay+silt content and maximum potential mineral-associated SOC (see https://doi.org/10.5194/soil-10-275-2024). I encourage the authors to put some nuance to this statement.
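The concern about R2 raised in the comments on L302-303 and Table 2 can be illustrated with a short sketch. It assumes that R2 is computed as the squared Pearson correlation between observed and simulated values (the definition used in the manuscript is not stated in this discussion), and the numbers are invented for illustration only. A simulation that overestimates every observation by a factor of two then has a squared correlation of 1 while RMSE and bias are large, and a skill score of the form 1 - SS_res/SS_tot is negative.

```python
import math


def pearson_r2(obs, sim):
    """Squared Pearson correlation between observations and simulations."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def mean_bias(obs, sim):
    """Mean of (simulated - observed)."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)


def skill_1_minus_ssres_sstot(obs, sim):
    """1 - SS_res / SS_tot, evaluated directly on the simulated values."""
    mo = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Invented observations (e.g. g C per kg soil) and a systematic 2x overestimate
obs = [5.0, 10.0, 20.0, 40.0]
sim = [2.0 * o for o in obs]

r2 = pearson_r2(obs, sim)
error = rmse(obs, sim)
bias = mean_bias(obs, sim)
skill = skill_1_minus_ssres_sstot(obs, sim)
```

Reporting bias and an error measure alongside any correlation-based statistic, or showing the measured-versus-modelled scatterplots requested for Section 3.1, would expose this kind of systematic error.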

Discussion
L503 + 511 + 517-518: Please remove these statements or use a performance measure different than R2 (see above)
L534: how should this be done? A substantial part of CEC comes from organic matter itself (see for example Solly et al., which you cite), which is not discussed in the manuscript. Soils with a high SOM content typically have a higher CEC value, so it seems that including CEC as a model driver will not improve the mechanistic basis for these models. Please discuss.

Technical feedback
L50: ‘is’ => ‘was’
L77: ‘progress’ => ‘process’?
L78: ‘limited’ => ‘few’?
L88-89: something seems wrong with this sentence
L138: ‘variable’ => ‘variables’
You use AMT as an abbreviation for mean annual temperature, why not stick to the conventional MAT abbreviation?
L299: ‘permafrost’ => ‘frozen’?
L554: ‘perform’ => ‘performs’
Citation: https://doi.org/10.5194/egusphere-2025-2545-RC1
- AC2: 'Reply on RC1', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections, in particular by providing more information about previous studies on the dominant controls of SOC content at the global scale, giving a more detailed description of the model structure and the parameter optimisation process, and revising the discussion so that it avoids restating the results and places greater emphasis on consistent findings across methods, as well as on the limitations and uncertainties of this study. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC2
RC2:
'Comment on egusphere-2025-2545', Anonymous Referee #2, 20 Aug 2025

Wang et al. compare the performance of mechanistic SOC models and machine learning (ML) models in capturing the spatial variations of global SOC across different soil layers. Their findings reveal that some key environmental factors controlling SOC are still missing or have not been well represented in existing mechanistic SOC models. The topic of this study is very interesting and their findings provide valuable insights for improving existing mechanistic models. The manuscript is well organized and the results are well presented.
My main concerns are related to the inadequate description on the data and methods used in this study, as well as the interpretation of some analysis results:
1) Vegetation type strongly affects the stoichiometry and physicochemical properties of litter input. Did you consider the vegetation type at each observation site when dividing the NPP into different litter pools? If yes, please give a description in the Method section. In addition, have you tested an RF model that includes land use type as a predictor?
2) Both PVI and SHAP can be used to identify the relative importance of different environmental factors in controlling SOC. Why is PVI used here rather than SHAP? Would the key predictors for SOC be different if the authors used the SHAP approach?
3) I am wondering whether the analysis based on the ML approach can provide reliable implications for the controlling factors and underlying mechanisms associated with the formation and decomposition of SOC. For example, based on the results in Fig. 3, NPP seems to be less important than climatic factors in predicting SOC. Ignoring NPP in the ML model may have a small effect on the simulated SOC. But should we ignore NPP in the process-based model? In reality, NPP cannot be ignored at all, even if it ranks last in the PVI analysis. We should still develop process-based SOC models based on our understanding of SOC formation and decomposition processes, rather than on the statistical importance of environmental factors. My concern is which implications from the ML analysis are true in reality, and which implications should be incorporated into the process-based model.
4) Although this study analyses the environmental controls of SOC and the related mechanisms, there are still large limitations and uncertainties in the methods and data used. I strongly suggest that the authors add a section discussing the limitations and potential uncertainties of this study.
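For readers less familiar with the permutation variable importance (PVI) discussed in points 2 and 3, the idea can be sketched in a few lines of plain Python. This is a generic illustration, not the authors' implementation: the "model" is a hand-written stand-in for a trained random forest, the data are synthetic, and all names are hypothetical. PVI for a predictor is taken here as the mean increase in mean squared error after that predictor's column is shuffled.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions on rows X against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, col, n_repeats=10, seed=0):
    """Mean increase in MSE when column `col` of X is randomly permuted."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen column replaced
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Synthetic data: the target depends on predictor 0 only
data_rng = random.Random(1)
X = [[data_rng.uniform(0.0, 1.0), data_rng.uniform(0.0, 1.0)] for _ in range(50)]
y = [3.0 * row[0] for row in X]


def stand_in_model(row):
    """Hypothetical fitted model; it uses predictor 0 and ignores predictor 1."""
    return 3.0 * row[0]


pvi_used = permutation_importance(stand_in_model, X, y, col=0)
pvi_ignored = permutation_importance(stand_in_model, X, y, col=1)
```

The sketch also makes the caveat in point 3 concrete: PVI measures how much a particular fitted model relies on a predictor for prediction. A low score therefore does not by itself show that the variable is unimportant for the underlying process.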

Minor comments:
L22: The poor performance
L40: ## between plant carbon inputs and outputs through soil heterotrophic respiration and leaching ##. Or, ## between carbon inputs in forms of plant litter and root exudates and outputs through soil heterotrophic respiration and leaching ##.
Fig. 2: Why are there no observation sites in the Southern Hemisphere? Or is it because the map in Figure 2 has not been displayed correctly?
Fig. 3: The terrain factor is only represented by elevation. Have the authors tried to include more factors such as slope steepness?
Fig. 5. It seems the relationships between SOC and NPP revealed by MIMICS and MES-C model are also nonlinear. SOC increases with NPP quickly when NPP is smaller than 1000 g C m-2 yr-1, then increases slowly with increasing NPP.

Citation: https://doi.org/10.5194/egusphere-2025-2545-RC2
- AC4: 'Reply on RC2', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC4
RC3:
'Comment on egusphere-2025-2545', Anonymous Referee #3, 21 Aug 2025

Overall Comments:
This work compares drivers of soil organic carbon in observations (using two random forest models) and in two mechanistic models (MIMICS and the new MES-C model) with the goal of identifying poorly represented or missing drivers in mechanistic models. Overall, I think the goal of this work is admirable and the authors present some interesting results highlighting CEC as a missing driver and different types of relationships between NPP and SOC for observations (nonlinear) and models (linear) which provide areas for models to improve. The paper would benefit from clearer language, more detailed methods, and some re-organization. My comments below explain in more detail but I think the way the observations/random forest models are referred to needs to be more consistent and clear for readers; the choice of which models to present in each figure needs to be clarified; the description of MES-C (which seems to be a brand new model) needs to be much more detailed; and the discussion would benefit from a reorganization to either combine with the results or differentiate itself from the results to focus on more high level findings. I think this paper has a lot of great information in it that will be useful to readers once there is some more detail and organization.
Specific Comments:
Title: I feel this title is a little too broad for the paper scope, especially since only two relatively similar models are investigated. Perhaps something a bit more specific either about comparing to an ML model or about environmental drivers? For example, the title could be: “Comparative ability of two mechanistic soil biogeochemical models in representing environmental drivers of soil organic carbon” or something along those lines.
Abstract: It would be helpful to mention the scale of the study (globally-distributed) in the abstract.
Lines 50-51: I am not sure I agree with this statement. We’ve seen a proliferation of microbially-explicit SOC models in the last decade, many of which are cited throughout the introduction. Perhaps it’s better to highlight that these models are relatively under tested, especially compared to Century or RothC.
Line 69: It could be helpful here to mention that we might have more confidence in microbially-explicit models, given they represent microbes, which we know empirically to be important to SOC cycling. See Bradford et al., 2016 for commentary on having confidence in model projections: Bradford, M.A., Wieder, W.R., Bonan, G.B., Fierer, N., Raymond, P.A., Crowther, T.W., 2016. Managing uncertainty in soil carbon feedbacks to climate change. Nature Climate Change 6, 751-758.
Paragraph at line 104: I appreciate this paragraph overviewing benefits and drawbacks of ML/AI. One thing I feel is missing here is finding a sufficiently large training dataset to use AI/ML techniques to predict effects under climate change or conditions they are poorly trained on. Process-based models assume the underlying processes/equations won’t drastically change under novel conditions, such that equations in that model will still hold in unseen conditions, but my understanding is that AI/ML techniques don’t have that same strength. I admittedly have more knowledge on the process-based modeling side here, so I would be curious if the authors think this should be addressed.
Line 148: More current versions of MIMICS use reverse M-M kinetics (see Wieder et al., 2018 and forward). This could be considered in the discussion.
Wieder, W.R., Hartman, M.D., Sulman, B.N., Wang, Y.P., Koven, C.D., Bonan, G.B., 2018. Carbon cycle confidence and uncertainty: Exploring variation among soil biogeochemical models. Global change biology 24, 1563-1579.
Line 183: It seems MES-C is a new model the team developed – is that correct? I found it a bit surprising to have a new model in this work and not have that introduced in the introduction. Assuming this is indeed a new model, it would be helpful to have much more information on philosophy of the model, the equations that make up the model, what environmental variables it uses, and how parameter values were determined.
Lines 183-189: I am struggling with the fact that there is no particulate organic matter-like pool in MES-C. Could the authors address this? In the current structure there is no representation of “lower quality” or “chemically recalcitrant” soil organic matter, which is a considerable proportion of soil organic matter in many ecosystems.
Table 1: I suggest using MAP and MAT instead of AP and AMT, respectively, to make it easier for the reader to understand these drivers in the results.
Section 2.3: It seems the authors are choosing different parameter sets for each of the 12 clusters – is that correct? Parameters in both models have environmental drivers that modify fluxes and so these additional changes to parameters based on environmental differences seem a bit redundant. However, I also see how it could improve model fit. To that end, I would appreciate the authors reflecting on this choice and if kept, adding justification to the methods.
Line 304: Nice, I like this AIC approach for accounting for different numbers of predictors!
Line 314: Why not RF_env?
Line 315: I believe the following sections are the XAI techniques right? Could that be said explicitly here for the reader?
Line 352: Here is where I think the distinction between the random forest models of SOC used for comparison to MIMICS and MES-C and the random forest models used for XAI, which the reader has just finished reading about, gets confusing. In addition, the random forest models for SOC are later in this paragraph called “ML models”. I think consistent, clear and distinct language is going to be needed for the reader. I suggest either calling the initial random forest models “ML models” consistently or only referring to them as RFenv and RFinp so it’s super clear to the reader which random forest models are being referred to at any given time.
Table 2: Please clarify your SOC units (and thus the MAE and RMSE units). Grams of what per kilogram of what? Please clarify this throughout the paper.
Figure 3 and associated text: Panel (a) is RFenv and (b) is RFinp, right? As noted above, please use consistent language to help the reader. I am also wondering why (a) is stylized differently than the other panels. I understand why (a) might be vertical for space reasons, but I find it a little misleading that it looks different when it’s displaying the same information. I believe panels b-c should be colored as panel a for consistency and to help the reader understand that these plots are showing the same things for different models.
Figure 4 and associated text: Why is RFenv missing here?
Lines 395-397: Please be careful to not use causal language here and throughout. Instead, the authors could say: higher soil temperature was associated with lower SOC and higher silt and clay was associated with higher SOC.
Lines 395-404: I wonder if simple linear regression would aid in interpretation of these plots. Currently, it seems interpretation is based on the smoothness of the color gradient, and it could be more objective to use r-squared values here.
Section 3.4: Which observed data is presented here? RFenv or RFinp?
Figure 5: Could the short ticks at the bottom of each axis be described more? Which percentiles do they represent? Also, could the authors justify why high SOC values are not shown?
Section 4.2: It could be useful to comment in this section that CEC is not as widely or easily measured as clay and silt, making it harder to use in models applied at site scales where site data is generally used for model inputs. Thus, it might be suggested to measure that more frequently.
Line 563: Is the weaker response or sharp decrease more pronounced in MES-C?
Lines 563-567: I am not sure I agree with this. The authors seem to be conflating steady state associations between SOC and temperature with transient sensitivities, and I am not sure those would necessarily be the same.
Lines 583-586: I suggest softening this language since Cotrufo et al., 2013 is a prediction with mixed support in the literature.
Lines 591-594: How was litter quality handled in this study as a model input? Did it vary with space? The quality of litter is a very influential input in MIMICS (and presumably MES-C given their similar structure) and so addressing this input would be important here.
Discussion, overall: There seems to be quite a bit of restating the results in the discussion – I wonder if these could be combined to reduce repetition? Or, if not, could the discussion try to be a bit broader, focusing on consistent findings across methods, rather than covering each method individually, which lends itself to a restating of the results?
Line 647: I think this paper would benefit from a “Limitations” section that addresses the fact that only two SOC models that are relatively similar were addressed here and that it seems an important driver in at least MIMICS and maybe also MES-C is not evaluated (e.g., litter quality).
Line 649: Two random forest models?
Line 658: MIMICS doesn’t have pH representation – could that be stated clearly here?

Citation: https://doi.org/10.5194/egusphere-2025-2545-RC3
- AC3: 'Reply on RC3', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC3

Interactive discussion

Status: closed

CC1:
'Comment on egusphere-2025-2545: Machine learning versus “mechanistic” modelling of soil carbon dynamics: Are current comparison attempts meaningful?', Philippe C. Baveye, 13 Jul 2025

Machine learning versus “mechanistic” modelling of soil carbon dynamics: Are current comparison attempts meaningful?
Philippe C. Baveye^1*
¹Saint Loup Research Institute, 7 rue des chênes, La Grande Romelière, 79600 Saint Loup Lamairé, France.
The manuscript that Wang et al. have submitted for possible publication in Biogeosciences raises a number of questions about the soundness of the comparison that these authors carry out between machine learning and the more traditional, “mechanistic” modelling of soil carbon dynamics.

First, a word of caution is in order about the authors’ use of the qualifier “mechanistic”. This term conveys the notion that the models are based on detailed descriptions of actual mechanisms of processes taking place in soils. Formally, that may seem to be the case, but one could argue that for macroscopic or ecosystem-scale models like Century, RothC, and the like, it hides assumptions that may or may not correspond to reality. Take the decomposition of litter pools and SOC pools in both MIMICS and MES-C as an example. The authors consider that this decomposition follows a Michaelis-Menten kinetic equation. Heße et al. (2009), among others, showed some time ago that if one assumes that a Monod-type equation (mathematically identical to Michaelis-Menten) describes microbial growth at the microscale in a natural porous medium, the corresponding equation at the macroscopic scale, over soil volumes that encompass many microorganisms and pores, does not in general follow Monod-type kinetics. Only in special cases can one write a Monod-type equation at the macroscopic scale as an approximation. In soils, computer simulations using microscale models suggest that there is no evidence that a Monod formalism holds in general at the macroscale, even as an approximation (e.g., Falconer et al., 2015; Mbé et al., 2022). The same goes for most “mechanistic” descriptions included in MIMICS and MES-C. Unless there is clear experimental evidence that the “mechanistic” descriptions of processes at the macroscopic scale are reliable, or there is microscopic evidence and a robust upscaling approach that produces macroscopic descriptions of processes, macroscopic models based on so-called “mechanistic” descriptions are in fact entirely empirical and should be treated as such.
From that standpoint, following Baveye’s (2023) discussion of Schimel’s (2023) analysis, the label of “mechanistic” should be used with significant caution when referring to ecosystem-scale models of soil organic matter dynamics at this stage, since experimental data at the landscape- or ecosystem scales in support of some of the mechanisms included in the models are lacking and since the upscaling hurdle still needs to be addressed satisfactorily.
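One simple reason why a rate law that holds at the microscale need not hold in the same form at the macroscale is that averaging does not commute with a nonlinear function. The sketch below is only a toy illustration of that general point under made-up parameter values; it is not the upscaling analysis of Heße et al. (2009) and ignores transport entirely. It compares the mean of Michaelis-Menten rates over two hypothetical microsites with the Michaelis-Menten rate evaluated at the mean substrate concentration, and then computes the apparent half-saturation constant that a bulk-scale equation would need to reproduce the averaged rate.

```python
def michaelis_menten(substrate, vmax=1.0, km=10.0):
    """Michaelis-Menten rate: vmax * S / (km + S)."""
    return vmax * substrate / (km + substrate)


VMAX = 1.0
KM = 10.0

# Two hypothetical microsites with very different substrate concentrations
microsite_substrate = [1.0, 19.0]
mean_substrate = sum(microsite_substrate) / len(microsite_substrate)

# Rate obtained by averaging the microscale rates
mean_of_rates = sum(
    michaelis_menten(s, VMAX, KM) for s in microsite_substrate
) / len(microsite_substrate)

# Rate obtained by applying the same equation to the bulk (mean) concentration
rate_at_mean = michaelis_menten(mean_substrate, VMAX, KM)

# Apparent half-saturation constant needed at the bulk scale to match the
# averaged microscale rate while keeping vmax fixed
km_apparent = mean_substrate * (VMAX / mean_of_rates - 1.0)
```

Because the Michaelis-Menten function is concave in the substrate concentration, the averaged rate is lower than the rate at the mean concentration, and the apparent bulk-scale half-saturation constant depends on how substrate is distributed among microsites rather than being a fixed property of the process.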

Beyond this point of terminology, a first question that the manuscript by Wang et al. (2025) raises concerns the specific choice of MIMICS and MES-C. Neither in this manuscript, nor in the previous one by Wang et al. (2024), is this choice justified. One would have expected the authors to first review all available “mechanistic” models available, determine their respective pros and cons, and on the basis of this state-of-the-art review, choose one or two of the models for comparison with machine learning. The fact that this eminently reasonable approach was apparently not adopted raises the question of the reasons why MIMICS was selected and subsequently modified into MES-C. In this respect, it is hard to avoid thinking that MIMICS was chosen because it requires very few input data. In Table 1 of the manuscript by Wang et al. (2025), it appears that it requires knowledge only of the soil temperature, soil moisture content, soil clay content, and net primary production. That is a puzzlingly limited set of data, which one could easily argue would be insufficient to account for the complexity of soil organic carbon dynamics. Nevertheless, if the objective that is pursued is a comparison with an approach based on machine learning, one perhaps does not have the luxury of using a more sophisticated process-based model, requiring a larger input data set. Indeed, machine learning demands very large amounts of data, as is evinced by the 37,691 SOC profiles considered by Wang et al. (2025). Any comparison attempt would rapidly become unmanageable if for each one of these profiles, in order to run process-based models, one needed more soil data than what can be readily retrieved from available soil databases. 
The downside of this is that, by choosing a simple, let alone simplistic, “mechanistic” model in what it is difficult not to view as a “strawman” strategy, one predetermines the ultimate outcome of the comparison, since the odds are good that this model will exhibit “poor performance”, for example, as noted by Wang et al. (2025), by showing “a simplistic positive trend” between the net primary production (NPP) and SOC, whereas in reality the observed relationship is nonlinear.

The next question concerns the soundness of comparing models that have different numbers of parameters, or involve different numbers of variables. Many years ago, modellers used to say that with 4 parameters, one could draw an elephant. With 5 parameters, one could make it wag its tail. The message was that a model with more degrees of freedom should always be expected to fit experimental data better than one with fewer degrees of freedom. From this perspective, it is not surprising that in Table 2 of Wang et al. (2025), RF_env, which involves 13 variables, leads to significantly better performance metrics than RF_imp and MES-C, which involve only 6 variables (In this respect, the fact that MIMICs, involving only 4 variables, has poorer performance metrics than MES-C is perplexing). That does not mean that the additional 7 variables considered by RF_env relative to RF_imp and MES-C are necessarily meaningful. Several authors have indeed demonstrated that statistically stronger patterns can emerge from spatial databases to which one has deliberately added utterly irrelevant information, for example, a painting or the photograph of a colleague (Fourcade et al. , 2018; Behrens and Viscarra Rossel, 2020). In this context, it is not at all clear that machine learning, applied to large data sets, can always be useful in “guiding future model development”, as Wang et al. (2025) suggest. Thus when these authors consider that the CEC of soils, which is taken into account by the machine learning model RF_env and has a high PVI (Permutation Variable Importance), should be taken into account in future work on “mechanistic” models, it is unclear at this point how reliable this conclusion is, based as it is only on statistical considerations. 
One might argue that the careful analysis by Julien and Tessier (2021) of the molecular and microscale processes involved in the fate of SOC in soils, which leads to the conclusion that CEC is an important factor controlling it, provides far more convincing grounds for thinking that there may be something there worth investigating further.
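The degrees-of-freedom argument can be made concrete with the Akaike information criterion, which Referee #3 notes the manuscript uses to account for different numbers of predictors. The sketch below assumes the common least-squares form, AIC = n ln(RSS/n) + 2k up to an additive constant; whether the manuscript uses exactly this form is not stated in this discussion, and the sample size and residual sums of squares are invented purely for illustration.

```python
import math


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to a constant:
    n * ln(RSS / n) + 2 * k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


N_OBS = 100  # invented sample size

# Invented fits: the larger model reduces the residual sum of squares slightly
aic_small = aic_least_squares(rss=50.0, n_obs=N_OBS, n_params=6)
aic_large = aic_least_squares(rss=49.0, n_obs=N_OBS, n_params=13)

# Lower AIC is preferred; a small gain in fit does not justify 7 extra parameters
preferred = "small" if aic_small < aic_large else "large"
```

A penalised criterion of this kind addresses the point that a better raw fit is expected from a model with more predictors. It does not address the separate concern that a predictor can improve the fit substantially without being causally meaningful.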

An additional, even more basic question that the research of Wang et al. (2025) raises has to do with whether or not it is in fact meaningful to try to compare the so-called “mechanistic” modelling approach with one based on machine learning, insofar as the dynamics of SOC is concerned. The very large dataset assembled by Wang et al. (2025) evinces one of the key characteristics of machine learning algorithms, and more generally of data-driven and artificial intelligence approaches, namely that they require massive amounts of data (e.g., Baveye, 2021; Wadoux, 2025; Minasny and McBratney, 2025). When dealing with field soil properties, like the fate of their SOC, it is customary, in order to obtain the requisite amount of data to apply machine learning algorithms, to work with extended geographical areas. In such cases, the outcomes of the application of machine learning may be useful to address issues relevant at these large scales, but they may be inadequate to address issues at smaller spatial scales, like in a watershed, ecoregion, ecosystem, agricultural field, or even at specific locations in a field where an experiment is taking place. At these much smaller spatial scales, models attempting to describe soil processes explicitly and in detail appear far more appropriate.

In this context, those who still read the literature may recall a piece of research that had some notoriety 40 years ago. Using a large climatic database, Folkoff et al. (1981) produced a map of the pH of the surface horizon of soils within the conterminous United States that was based solely on climatic conditions. Although the outcome of their research had undeniable intellectual interest, it proved to be of little practical use, and as a result their article has been seldom cited since 1981. Knowing that the climate correlates well with soil pH does not help farmers manage the acidity in their field or governmental agencies assess, e.g., the local impact of acid rain on agricultural yields or environmental degradation. For those practical situations, a different kind of model of acidity-related soil processes is needed, taking into account the spatial heterogeneity of soils. Occasionally, the larger-scale, statistical observations may be useful. Worrall (2001), using observations of pesticide occurrence in 303 boreholes across 12 states in the midwest of the U.S., showed that the molecular topology of the pesticide molecules themselves is, in and of itself, a sufficient basis to discriminate globally between polluting and non-polluting pesticide compounds, regardless of the spatial variation of the soils through which the pesticides transit. On that basis, one might come up with nation-wide or regional regulations allowing the use of some pesticides and banning others (Baveye and Laba, 2015), but it would not be possible to predict whether and how fast a given pesticide compound applied to a given agricultural field will reach the underlying groundwater.

At the moment, a tremendous amount of hype is associated with machine learning, data-driven research, and artificial intelligence in the soil science and biogeosciences communities. There is therefore a good chance that the manuscript by Wang et al. (2025), whose title clearly advertises a bias in favour of these approaches, will be well received by at least a portion of researchers in these disciplines. The risk in this respect is that ill-inspired comparisons would further fuel the drive to accumulate massive amounts of potentially meaningless data to feed data-hungry algorithms. There may be limited uses for the latter, but it would be unwise, I think, to put too much emphasis on them, or to rely on them too closely to improve the process-based models that could be useful to us in practice. For the development of these process-based models, what we need is not massive amounts of data that are haphazardly assembled, but data that actually make sense.
References
Baveye, P. C. (2022). “Data‐driven” versus “question‐driven” soil research. European Journal of Soil Science, 73(1), e13159.
Baveye, P. C. (2023). Ecosystem-scale modelling of soil carbon dynamics: time for a radical shift of perspective?. Soil Biology and Biochemistry, 184, 109112.
Baveye, P. C., & Laba, M. (2015). Moving away from the geostatistical lamppost: Why, where, and how does the spatial heterogeneity of soils matter?. Ecological Modelling, 298, 24-38.
Behrens, T., & Viscarra Rossel, R. A. (2020). On the interpretability of predictors in spatial data science: The information horizon. Scientific Reports, 10(1), 16737.
Falconer, R. E., Battaia, G., Schmidt, S., Baveye, P., Chenu, C., & Otten, W. (2015). Microscale heterogeneity explains experimental variability and non-linearity in soil organic matter mineralisation. PloS one, 10(5), e0123774.
Folkoff, M. E., Meentemeyer, V., & Box, E. O. (1981). Climatic control of soil acidity. Physical Geography, 2(2), 116-124.
Fourcade, Y., Besnard, A. G., & Secondi, J. (2018). Paintings predict the distribution of species, or the challenge of selecting environmental predictors and evaluation statistics. Global Ecology and Biogeography, 27(2), 245-256.
Heße, F., Radu, F. A., Thullner, M., & Attinger, S. (2009). Upscaling of the advection–diffusion–reaction equation with Monod reaction. Advances in water resources, 32(8), 1336-1351.
Julien, J. L., & Tessier, D. (2021). Rôles du pH, de la CEC effective et des cations échangeables sur la stabilité structurale et l’affinité pour l’eau du sol. Etude et Gestion des sols, 28, 159-179.
Mbé, B., Monga, O., Pot, V., Otten, W., Hecht, F., Raynaud, X., ... & Garnier, P. (2022). Scenario modelling of carbon mineralization in 3D soil architecture at the microscale: Toward an accessibility coefficient of organic matter for bacteria. European Journal of Soil Science, 73(1), e13144.
Minasny, B., & McBratney, A. B. (2025). Machine Learning and Artificial Intelligence Applications in Soil Science. European Journal of Soil Science, 76(2), e70093.
Schimel, J. (2023). Modeling ecosystem-scale carbon dynamics in soil: the microbial dimension. Soil Biology and Biochemistry, 178, 108948.
Shi, Z., Crowell, S., Luo, Y., & Moore, B. (2018). Model structures amplify uncertainty in predicted soil carbon responses to climate change. Nature Communications, 9(1), 1–11. https://doi.org/10.1038/s41467-018-04526-9
Wadoux, A. M. C. (2025). Artificial intelligence in soil science. European Journal of Soil Science, 76(2), e70080.
Wang, L., Abramowitz, G., Wang, Y. P., Pitman, A., & Viscarra Rossel, R. A. (2024). An ensemble estimate of Australian soil organic carbon using machine learning and process-based modelling. Soil, 10(2), 619-636.
Wang, L., Abramowitz, G., Wang, Y. P., Pitman, A., Ciais, P., & Goll, D. S. (2025). Towards resolving poor performance of mechanistic soil organic carbon models. EGUsphere, 2025, 1-32.

Citation: https://doi.org/10.5194/egusphere-2025-2545-CC1
- CC2: 'Minor erratum on CC1', Philippe C. Baveye, 13 Jul 2025
  
  In the 4th paragraph, where the text reads "In this respect, the fact that MIMICs, involving only 4 variables, has poorer performance metrics than MES-C is perplexing", it should read instead "In this respect, the fact that MIMICs, involving only 4 variables, has performance metrics that are comparable to those of MES-C is perplexing"
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-CC2
- AC1: 'Reply on CC1', Lingfei Wang, 18 Jul 2025
  
  Dear Dr. Baveye,
  Thank you for your time and comments on our manuscript.
  In this study, we collected globally distributed SOC observations and applied both machine learning and microbial-explicit models to simulate SOC content. However, we would like to clarify that the primary aim of our work is not to compare the predictive performance of machine learning versus microbial-explicit models. Rather, our central objective is to explore how information derived from machine learning models can be used to identify potential shortcomings in microbial-explicit models and guide their improvement at the global scale. We will revise the manuscript to emphasize this point more clearly throughout.
  Regarding the use of the term “mechanistic”, we agree that many processes are not well represented or missing in the microbial-explicit soil carbon models we used in this study, and therefore the models are not fully mechanistic. We will adopt the term process-based models instead, clearly define the term, and revise the manuscript title to: “Towards Resolving the Poor Performance of Microbial-Explicit Soil Carbon Models.”
  Regarding the point you raised about the choice of process-based models and the kinetics of litter decomposition, MIMICS has been well validated at site level and has been shown to simulate litter decomposition and SOC responses to experimental warming quite well (Wieder et al., 2014; Wieder et al., 2015). MIMICS is also representative of the soil modules deployed in global carbon cycle models, as it is used in, e.g., CLM (Wieder et al., 2018) and ORCHIDEE (Goll et al., 2023). These points support our decision to use MIMICS as a representative example of microbial-explicit models in this study and to develop MES-C based on MIMICS to incorporate more advanced theories about soil carbon stabilisation and destabilisation. We will clarify these aspects in the revised manuscript.
  Regarding the difference in the number of predictors used by the random forest (RF) and microbial-explicit models, we agree that an RF model with more variables may naturally exhibit stronger performance. However, our intent was not to compare the models purely on predictive accuracy, but to use the RF model as a diagnostic tool to identify potential missing environmental drivers in microbial-explicit models. We will clarify this point in the revised manuscript.
  We appreciate your feedback, which has helped us refine the focus of our manuscript.
  Sincerely,
  Lingfei Wang
  On behalf of all co-authors
  
  Goll, D. S., Bauters, M., Zhang, H., Ciais, P., Balkanski, Y., Wang, R. and Verbeeck, H.: Atmospheric phosphorus deposition amplifies carbon sinks in simulations of a tropical forest in Central Africa, New Phytologist, 237, 2054-2068, https://doi.org/10.1111/nph.18535, 2023.
  Wieder, W., Grandy, A., Kallenbach, C. and Bonan, G.: Integrating microbial physiology and physio-chemical principles in soils with the MIcrobial-MIneral Carbon Stabilization (MIMICS) model, Biogeosciences, 11, 3899-3917, https://doi.org/10.5194/bg-11-3899-2014, 2014.
  Wieder, W., Grandy, A., Kallenbach, C., Taylor, P. and Bonan, G.: Representing life in the Earth system with soil microbial functional traits in the MIMICS model, Geoscientific Model Development, 8, 1789-1808, https://doi.org/10.5194/gmd-8-1789-2015, 2015.
  Wieder, W., Hartman, M. D., Sulman, B. N., Wang, Y. P., Koven, C. D. and Bonan, G. B.: Carbon cycle confidence and uncertainty: Exploring variation among soil biogeochemical models, Global Change Biology, 24, 1563-1579, https://doi.org/10.1111/gcb.13979, 2018.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC1
- RC1: 'Comment on egusphere-2025-2545', Anonymous Referee #1, 18 Aug 2025
In their manuscript, titled ‘Towards resolving poor performance of mechanistic soil organic carbon models’, Wang et al. describe the results of a study comparing the performance of two mechanistic soil organic carbon models (MIMICS and the newly-proposed MES-C) to a machine learning approach (random forest). They use the mechanistic models and random forest models to predict the amount of soil organic carbon for multiple locations around the world, after which they compare model results to depth profiles of SOC from the WoSIS database. Based on their results, the authors conclude that random forest models consistently outperformed mechanistic models. In addition, they used interpretable machine learning methods to conclude that both mechanistic soil organic carbon models perform poorly due to lack of accounting for key variables (most notably CEC), while the role of existing variables is underrepresented. They compare the controlling factors in the mechanistic models to factors controlling the SOC content at the global scale, and find that, for example, the role of the clay and silt content of the soil is overestimated by the models, compared to observations. Furthermore, the authors assess the sensitivity of model outcomes along a range of predictor values, and study how pairs of predictors affect model outcomes, compared to observations.
The development and application of mechanistic soil organic matter models, among which MIMICS, has gained much importance over the past two decades. Therefore, the assessment of model performance and the relative importance of model drivers in making predictions of the SOC content, compared to observations, is needed to improve these models to increase confidence in their outcomes. I appreciate the effort done to collect the data, develop a new model and apply it, together with MIMICS, at the global scale, which is a challenging endeavor. The manuscript is well-written and well-structured. The abstract conveys the main messages and the introduction provides sufficient background to the topic, although I would encourage the authors to include more information about previous studies assessing controls of the SOC content at the global scale, which is the topic of their study. Although the methods section describes the performed analyses and data collection strategy, much information about the performed model simulations is missing to assess the quality of the performed simulations (see below). The results section concisely describes the results, but I would encourage the authors to show the model results in a graphical way, for example using scatterplots. The discussion touches upon the main results, but mainly sections 4.3 and 4.4 lack clear messages, and are rather a summing up of the relations the authors found.
Based on my evaluation I recommend major revisions, mainly due to (1) the lack of detailed information which would be necessary to evaluate the performed model simulations and (2) the nearly complete attribution of the mismatch between model results and measurements to the structure of the model, without accounting for uncertainties in input data at the global scale (see below). My main concerns about the manuscript are the following, while detailed feedback is provided below.
The authors propose a new mechanistic soil organic matter model that is more complex than MIMICS, as it also incorporates SOC in aggregates. However, the authors do not have any data on the distribution of measured SOC among different pools, so they have no data to either constrain the distribution of simulated SOC among the different model pools, or to evaluate this simulated distribution. Therefore, I invite the authors to justify the use of such a (complex) model at the global scale, given this lack of data. The evaluation of the simulation results based on data for only total SOC inevitably leads to overparameterization and equifinality. This is an important, but generally overlooked, aspect of environmental models, the consequences of which for the model simulations should be discussed in the manuscript.
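The equifinality concern can be illustrated with a minimal sketch. It assumes a hypothetical model with two independent first-order pools (this is not MIMICS or MES-C), and all parameter values are made up for illustration. Two different parameter sets reproduce the same total SOC while distributing it very differently among the pools, so an evaluation against total SOC alone cannot tell them apart.

```python
import math

def steady_state_pools(carbon_input, fast_fraction, k_fast, k_slow):
    """Steady state of two parallel first-order pools (hypothetical toy model).

    Each pool obeys dC/dt = input_share - k * C, so C* = input_share / k.
    carbon_input is in kg C m-2 yr-1 and the rate constants are in yr-1.
    """
    fast = carbon_input * fast_fraction / k_fast
    slow = carbon_input * (1.0 - fast_fraction) / k_slow
    return fast, slow

# Two made-up parameter sets with the same carbon input.
set_a = steady_state_pools(0.4, fast_fraction=0.5, k_fast=0.5, k_slow=0.0125)
set_b = steady_state_pools(0.4, fast_fraction=0.8, k_fast=0.05, k_slow=0.008)

total_a = sum(set_a)
total_b = sum(set_b)

print("set A pools:", set_a, "total:", total_a)
print("set B pools:", set_b, "total:", total_b)
print("same total SOC:", math.isclose(total_a, total_b))
```

Only pool-resolved observations (or additional constraints such as turnover times) would distinguish the two parameter sets in this toy case.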

The authors attribute the ‘poor performance’ of the mechanistic models exclusively to the structure of these models. However, there are many more uncertainties leading to a mismatch between model results and measurements:
Uncertainties related to model drivers, which were extracted from global data sets. For example, NPP data derived from MODIS cannot be expected to correctly represent C inputs from vegetation to the soil, as, for example, a substantial part of aboveground litter C will be respired before it can enter the soil. A model can never be better than the input data, so accounting for uncertainties in the data is important.

The applied mechanistic models are too complex compared to the data available for model evaluation, i.e., they are overparameterized. Six parameters are evaluated while only one output (total SOC) is evaluated (it seems, as this is not described clearly in the manuscript). This inevitably leads to large uncertainties about model results.

These aspects should be discussed, because it is evident that an overparameterized model with uncertain inputs will lead to a ‘poor performance’.
The methods related to the application of the mechanistic models are not sufficiently described. For example:
Down to which depth were the simulations performed? To 1.5 m, as this is the maximum depth down to which measurements were available?

Were the simulations performed at a vertical resolution of 30 cm?

For how long were the simulations run? Was there a spin-up period, followed by a run with actual temperature and precipitation data?

In which programming language were the models coded?

With which time step were the simulations performed? Which solver was used to solve the differential equations?

How were locations with a different land use treated? Was there a different vertical resolution for C inputs per land use?
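To make the questions about spin-up, time step and solver concrete, the following is a minimal sketch under stated assumptions: a hypothetical single-pool model dC/dt = I - kC integrated with forward Euler, with made-up values for the input and decay rate. It does not represent how MIMICS or MES-C were actually run, which is exactly the information requested above.

```python
def spin_up(carbon_input, decay_rate, dt_years=1.0, tol=1e-6, max_steps=1_000_000):
    """Integrate dC/dt = I - k*C with forward Euler until the pool stops changing.

    Forward Euler is only stable here if decay_rate * dt_years < 2, which is one
    reason the time step and solver matter for reproducibility.
    Returns the final stock and the number of simulated years.
    """
    carbon = 0.0
    for step in range(1, max_steps + 1):
        change = (carbon_input - decay_rate * carbon) * dt_years
        carbon += change
        if abs(change) < tol:
            return carbon, step * dt_years
    raise RuntimeError("no steady state reached within max_steps")

# Hypothetical values: 0.3 kg C m-2 yr-1 input and a decay rate of 0.02 yr-1.
carbon_input = 0.3
stock, years = spin_up(carbon_input=carbon_input, decay_rate=0.02)

# At steady state the turnover time is stock divided by input (equal to 1/k).
turnover_time = stock / carbon_input

print("steady-state stock (kg C m-2):", round(stock, 3))
print("spin-up length (years):", years)
print("turnover time (years):", round(turnover_time, 2))
```

In this toy case the spin-up length depends on the tolerance and on the slowest rate constant, so both would need to be reported alongside the solver.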

No sufficient information about model evaluation is provided. As the authors conclude that the mechanistic models perform poorly compared to machine learning approaches, a thorough model evaluation is important to support this conclusion. For example:
What was the simulated turnover time of SOC at different soil depths? Was this in line with measurements (e.g., DOI: 10.1126/science.aad4273)? Did this decrease with soil depth, as is generally observed?

How was the simulated amount of SOC distributed among the different pools? Was this distribution in line with observations?

The authors should justify why they chose to perform a global study, instead of focusing on a smaller region with less uncertainties about input data for the models and more detailed data for model evaluation.

Detailed feedback
Abstract
L23: Would be good to mention which ‘existing variables’ these are
L29: What do you mean with a ‘simple trend’?

Introduction
L45-55: It is not clear what the ‘effectiveness of land management strategies’ has to do with the ability to accurately estimate SOC content
L73: The study by Gurung et al. is a study on DayCent, so it is not applicable as a reference for the ‘poorly constrained parameter values of microbial explicit models’
L83: Georgiou et al. (2022) is not a modelling study, so not appropriate as a reference for this sentence.
L107: ‘accuracy’ of what?
L111: R2 is not a measure of model performance (see e.g. https://en.wikipedia.org/wiki/Anscombe%27s_quartet), so please remove this throughout the manuscript when used for this purpose.
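The point about R2 can be illustrated with a small sketch. It assumes that R2 is computed as the squared Pearson correlation between observed and simulated values (the definition used in the manuscript is not stated here), and the numbers are made up. A simulation that systematically overestimates SOC still yields an R2 of 1, whereas RMSE and mean bias expose the error.

```python
import math

def pearson_r2(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov * cov / (var_obs * var_pred)

def rmse(obs, pred):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mean_bias(obs, pred):
    """Mean of (prediction - observation); positive means overestimation."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

# Made-up SOC observations (g C per kg soil) and a systematically biased simulation.
observed = [5.0, 10.0, 20.0, 40.0, 80.0]
simulated = [2.0 * o + 5.0 for o in observed]

r2 = pearson_r2(observed, simulated)
error = rmse(observed, simulated)
bias = mean_bias(observed, simulated)

print("R2:", round(r2, 6))
print("RMSE:", round(error, 2))
print("mean bias:", round(bias, 2))
```

If R2 is instead defined as the coefficient of determination (1 minus the ratio of residual to total sum of squares), it does penalise such bias, so stating the definition matters.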

Methods
General methods: Does the WoSIS database contain SOC stocks, i.e., a combination of OC% and bulk density, for all profiles that were used? If not, how was OC% converted to SOC stocks? Although it is not described in the manuscript, I assume absolute amounts of SOC were simulated? To compare these to observations of OC% in WoSIS, these would need to be converted to OC%. How did this happen? This information should be provided in the manuscript.
L143-144: where did the values for the parameter to distribute litter inputs over metabolic and structural litter come from? As this is based on litter quality, was this value different for different types of vegetation?
L143-158: more information about the chosen version of MIMICS is needed. What did the soil moisture scalar look like? This is important information, as you discuss the limited ability of MIMICS to account for moisture later in the manuscript. How was bioturbation simulated? Was the magnitude of diffusion the same for every land use and vegetation type? Were the same parameter values used for all simulated depths? How was, for example, the generally-observed decrease in the rate of SOC cycling with depth accounted for?
L181-182: In MIMICS, doesn’t the physically-protected SOC pool represent mineral-associated OC?
L183: this knowledge is not ‘recent’ but has been known for decades
Section 1.2.1: as you describe a new model (although based on MIMICS), I encourage the authors to describe the equations that they added to MIMICS in the main manuscript, not only in the supplement.
Section 2.2.1: did you check for autocorrelation between the predictors before performing the analyses?
L223-225: did this ‘negative exponential function’ have different parameters for different vegetation types?
L233: weighted based on what?
L234: what do you mean by ‘standardized’?
L235: ‘mean annual values’: over which time period were these means calculated?
L248: what is ‘SOC profile data’? Only OC%, or also bulk density, to convert OC% into stocks?
L255: so all profiles that were used had data down to 1.5 m? Please clarify this in the manuscript
L257: what happened to 1x1 km grid cells that did not have any profiles in WoSIS? Were these cells not simulated?
L262: It is not clear how a cumulative SOC content (in g/kg) can be calculated, as you can only calculate a cumulative profile by summing absolute values, not concentrations. Please clarify.
L266-267: If no data below 60 cm was present, were these deeper layers also simulated? How did you resample the data to 0-30, 30-60, … intervals? Were OC% weighted based on bulk density? Or were averages of the OC% used? Please clarify.
Fig. 1: Why is there a sharp boundary latitude below which no data are present?
Section 2.3:
More information about the parameter optimization process is needed:
Which data were used to optimize model parameters? Only total SOC%? If so, how can you assess if the distribution of simulated SOC over the different model pools was correct?

Which measure was optimized? RMSE?

Which program was used to optimize model parameters?

Did you use one constant desorption parameter for the entire globe, or per ‘cluster’? If so, the turnover rate of SOC is very unlikely to be simulated correctly, as this is different in different ecosystems/soil types/… Please clarify.

Please provide the optimized values of the calibrated parameters.

L285: what do you mean by ‘relatively sensitive’? How was this sensitivity calculated?
L300: what do you mean by ‘for reference’
L302-303: R2 is not a measure for model performance, and should not be used as such (see above). Please remove and perform the analysis without this measure.
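The questions on L262 and L266-267 about concentrations, stocks and resampling can be made concrete with a minimal sketch. It assumes horizons described by top depth, bottom depth, OC concentration (g C per kg soil) and bulk density (g cm-3), ignores coarse fragments, and uses made-up values; the function names are hypothetical and this is not the procedure used in the manuscript.

```python
import math

def soc_stock_kg_m2(oc_g_per_kg, bulk_density_g_cm3, thickness_cm):
    """Convert an OC concentration to a stock in kg C per m2 (no coarse fragments)."""
    # g C/kg soil -> kg C/kg soil; g/cm3 -> kg soil/m3; cm -> m
    return (oc_g_per_kg / 1000.0) * (bulk_density_g_cm3 * 1000.0) * (thickness_cm / 100.0)

def resample_layer(horizons, top_cm, bottom_cm):
    """Return (mass-weighted OC in g/kg, stock in kg C/m2) for one target interval.

    horizons is a list of (top_cm, bottom_cm, oc_g_per_kg, bulk_density_g_cm3).
    Returns None if no horizon overlaps the target interval.
    """
    soil_mass = 0.0  # kg soil per m2 within the interval
    stock = 0.0      # kg C per m2 within the interval
    for h_top, h_bottom, oc, bd in horizons:
        overlap = min(h_bottom, bottom_cm) - max(h_top, top_cm)
        if overlap <= 0:
            continue
        soil_mass += bd * 1000.0 * overlap / 100.0
        stock += soc_stock_kg_m2(oc, bd, overlap)
    if soil_mass == 0.0:
        return None
    return 1000.0 * stock / soil_mass, stock

# Made-up profile: a 0-10 cm horizon and a 10-30 cm horizon.
profile = [(0, 10, 30.0, 1.1), (10, 30, 15.0, 1.3)]

mean_oc, stock = resample_layer(profile, 0, 30)
thickness_weighted_oc = (30.0 * 10 + 15.0 * 20) / 30  # ignores bulk density

print("mass-weighted OC (g/kg):", round(mean_oc, 2))
print("thickness-weighted OC (g/kg):", thickness_weighted_oc)
print("stock 0-30 cm (kg C m-2):", round(stock, 2))
```

The two averaging choices differ whenever bulk density varies with depth, and only the stock is additive across layers, which is why the referee asks how a cumulative value in g/kg was obtained.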

Results
Section 3.1: scatterplots (measured vs modelled) would be necessary to evaluate model performance. I encourage the authors to provide these.
L352: Clarify what you mean by ‘out-of-sample’
L357-358: how do you see in Table 2 that the predictability decreases with depth, since all measures decrease with depth? Better would be to use relative measures (scaled by the measured OC%).
Table 2: I encourage the authors to remove R2 from the table, as this is not a measure of model performance (a very bad simulation, for example a systematic over- or underestimation, can have an R2 value close to 1).
Fig. 3: how is it possible that in a) NPP has a ranking far below AMT, while in b) (same data but fewer predictor variables) this order is reversed?
Fig. 3: Is the variable Soil Temp different from AMT? Why is this variable not present in plot a)? Please clarify.
L547-549: While this is partially true, there is still a substantial correlation between clay+silt content and maximum potential mineral-associated SOC (see https://doi.org/10.5194/soil-10-275-2024). I encourage the authors to put some nuance to this statement.

Discussion
L503 + 511 + 517-518: Please remove these statements or use a performance measure different than R2 (see above)
L534: how should this be done? A substantial part of CEC comes from organic matter itself (see for example Solly et al., which you cite), which is not discussed in the manuscript. Soils with a high SOM content typically have a higher CEC value, so it seems that including CEC as a model driver will not improve the mechanistic basis for these models. Please discuss.

Technical feedback
L50: ‘is’ => ‘was’
L77: ‘progress’ => ‘process’?
L78: ‘limited’ => ‘few’?
L88-89: something seems wrong with this sentence
L138: ‘variable’ => ‘variables’
You use AMT as an abbreviation for mean annual temperature, why not stick to the conventional MAT abbreviation?
L299: ‘permafrost’ => ‘frozen’?
L554: ‘perform’ => ‘performs’
Citation: https://doi.org/10.5194/egusphere-2025-2545-RC1
- AC2: 'Reply on RC1', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections, in particular by providing more information about previous studies on the dominant controls of SOC content at the global scale, giving a more detailed description of the model structure and parameter optimisation process, and revising the discussion so that it avoids restating the results and places greater emphasis on consistent findings across methods, as well as on the limitations and uncertainties of this study. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC2
- RC2: 'Comment on egusphere-2025-2545', Anonymous Referee #2, 20 Aug 2025

Wang et al. compare the performance of mechanistic SOC models and machine learning (ML) models in capturing the spatial variations of global SOC across different soil layers. Their findings reveal that some key environmental factors controlling SOC are still missing or have not been well represented in existing mechanistic SOC models. The topic of this study is very interesting and their findings provide valuable insights for improving existing mechanistic models. The manuscript is well organized and the results are well represented.
My main concerns are related to the inadequate description on the data and methods used in this study, as well as the interpretation of some analysis results:
1) Vegetation type strongly affects the stoichiometry and physicochemical properties of litter inputs. Did you consider the vegetation type at each observation site when dividing NPP into the different litter pools? If yes, please describe this in the Methods section. In addition, have you tested an RF model that includes land use type as a predictor?
2) Both PVI and SHAP can be used to identify the relative importance of different environmental factors in controlling SOC. Why is PVI used here rather than SHAP? Would the key predictors of SOC be different if the authors used the SHAP approach?
3) I wonder whether the ML-based analysis can provide reliable implications about the controlling factors and underlying mechanisms associated with the formation and decomposition of SOC. For example, based on the results in Fig. 3, NPP seems to be less important than climatic factors in predicting SOC. Ignoring NPP in the ML model may have a small effect on the simulated SOC. But should we ignore NPP in the process-based model? In reality, NPP cannot be ignored at all, even if it ranks last in the PVI analysis. We should still develop process-based SOC models based on our understanding of SOC formation and decomposition processes, rather than on the statistical importance of environmental factors. My concern is which implications of the ML analysis hold in reality, and which should be incorporated into the process-based model.
4) Although this study analyses the environmental controls on SOC and the related mechanisms, there are still large limitations and uncertainties in the methods and data used. I strongly suggest that the authors add a section discussing the limitations and potential uncertainties of this study.
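Points 2) and 3) turn on what a permutation-based importance actually measures. The sketch below is a minimal, hand-written permutation importance on a hypothetical toy model with synthetic data; the function names, coefficients and value ranges are assumptions for illustration and are unrelated to the manuscript's RF models. It shows that the score reflects how much prediction error grows when a predictor is shuffled, which depends on the predictor's range and fitted effect rather than on whether the process is mechanistically indispensable.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when one predictor column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical SOC response: strong temperature effect, weak NPP effect."""
    temperature, npp, unused = row
    return 60.0 - 2.0 * temperature + 0.01 * npp

# Synthetic predictors: temperature (deg C), NPP (g C m-2 yr-1), an unused variable.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 30), data_rng.uniform(100, 1500), data_rng.uniform(0, 1)]
     for _ in range(200)]
y = [toy_model(row) for row in X]

importance = permutation_importance(toy_model, X, y)
for name, score in zip(["temperature", "NPP", "unused"], importance):
    print(name, round(score, 2))
```

In this toy setting NPP receives a much lower score than temperature even though it enters the response directly, which is consistent with the referee's caution against reading a low rank as grounds for dropping a driver from a process-based model.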

Minor comments:
L22: The poor performance
L40: ## between plant carbon inputs and outputs through soil heterotrophic respiration and leaching ##. Or, ## between carbon inputs in forms of plant litter and root exudates and outputs through soil heterotrophic respiration and leaching ##.
Fig. 2 Why are there no observation sites in the Southern Hemisphere? Or is it because the map in Figure 2 is not displayed correctly?
Fig.3 Terrain factor is only represented by elevation. Have the authors tried to include more factors such as slope steepness?
Fig. 5. It seems the relationships between SOC and NPP revealed by MIMICS and MES-C model are also nonlinear. SOC increases with NPP quickly when NPP is smaller than 1000 g C m-2 yr-1, then increases slowly with increasing NPP.

Citation: https://doi.org/10.5194/egusphere-2025-2545-RC2
- AC4: 'Reply on RC2', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC4
- RC3: 'Comment on egusphere-2025-2545', Anonymous Referee #3, 21 Aug 2025

Overall Comments:
This work compares drivers of soil organic carbon in observations (using two random forest models) and in two mechanistic models (MIMICS and the new MES-C model) with the goal of identifying poorly represented or missing drivers in mechanistic models. Overall, I think the goal of this work is admirable and the authors present some interesting results highlighting CEC as a missing driver and different types of relationships between NPP and SOC for observations (nonlinear) and models (linear) which provide areas for models to improve. The paper would benefit from clearer language, more detailed methods, and some re-organization. My comments below explain in more detail but I think the way the observations/random forest models are referred to needs to be more consistent and clear for readers; the choice of which models to present in each figure needs to be clarified; the description of MES-C (which seems to be a brand new model) needs to be much more detailed; and the discussion would benefit from a reorganization to either combine with the results or differentiate itself from the results to focus on more high level findings. I think this paper has a lot of great information in it that will be useful to readers once there is some more detail and organization.
Specific Comments:
Title: I feel this title is a little too broad for the paper scope, especially since only two relatively similar models are investigated. Perhaps something a bit more specific either about comparing to an ML model or about environmental drivers? For example, the title could be: “Comparative ability of two mechanistic soil biogeochemical models in representing environmental drivers of soil organic carbon” or something along those lines.
Abstract: It would be helpful to mention the scale of the study (globally-distributed) in the abstract.
Lines 50-51: I am not sure I agree with this statement. We’ve seen a proliferation of microbially-explicit SOC models in the last decade, many of which are cited throughout the introduction. Perhaps it’s better to highlight that these models are relatively under tested, especially compared to Century or RothC.
Line 69: It could be helpful here to mention that we might have more confidence in microbially-explicit models, given they represent microbes, which we know empirically to be important to SOC cycling. See Bradford et al., 2016 for commentary on having confidence in model projections: Bradford, M.A., Wieder, W.R., Bonan, G.B., Fierer, N., Raymond, P.A., Crowther, T.W., 2016. Managing uncertainty in soil carbon feedbacks to climate change. Nature Climate Change 6, 751-758.
Paragraph at line 104: I appreciate this paragraph overviewing benefits and drawbacks of ML/AI. One thing I feel is missing here is finding a sufficiently large training dataset to use AI/ML techniques to predict effects under climate change or conditions they are poorly trained on. Process-based models assume the underlying processes/equations won’t drastically change under novel conditions, such that equations in that model will still hold in unseen conditions, but my understanding is that AI/ML techniques don’t have that same strength. I admittedly have more knowledge on the process-based modeling side here, so I would be curious if the authors think this should be addressed.
Line 148: More current versions of MIMICS use reverse M-M kinetics (see Wieder et al., 2018 and forward). This could be considered in the discussion.
Wieder, W.R., Hartman, M.D., Sulman, B.N., Wang, Y.P., Koven, C.D., Bonan, G.B., 2018. Carbon cycle confidence and uncertainty: Exploring variation among soil biogeochemical models. Global change biology 24, 1563-1579.
Line 183: It seems MES-C is a new model the team developed – is that correct? I found it a bit surprising to have a new model in this work and not have that introduced in the introduction. Assuming this is indeed a new model, it would be helpful to have much more information on philosophy of the model, the equations that make up the model, what environmental variables it uses, and how parameter values were determined.
Lines 183-189: I am struggling with the fact that there is no particulate organic matter-like pool in MES-C. Could the authors address this? In the current structure there is no representation of “lower quality” or “chemically recalcitrant” soil organic matter, which is a considerable proportion of soil organic matter in many ecosystems.
Table 1: I suggest using MAP and MAT instead of AP and AMT, respectively, to make it easier for the reader to understand these drivers in the results.
Section 2.3: It seems the authors are choosing different parameter sets for each of the 12 clusters – is that correct? Parameters in both models have environmental drivers that modify fluxes and so these additional changes to parameters based on environmental differences seem a bit redundant. However, I also see how it could improve model fit. To that end, I would appreciate the authors reflecting on this choice and if kept, adding justification to the methods.
Line 304: Nice, I like this AIC approach for accounting for different numbers of predictors!
Line 314: Why not RF_env?
Line 315: I believe the following sections are the XAI techniques right? Could that be said explicitly here for the reader?
Line 352: Here is where I think the random forest models of SOC for comparison to MIMICS and MES-C and the random forest models used for XAI, that the reader just finished reading about, get confusing. In addition, the random forest models for SOC are later in this paragraph called “ML models”. I think using consistent, clear and distinct language is going to be needed for the reader. I suggest either calling the initial random forest models “ML models” consistently or only referring to them as RFenv and RFinp so it's super clear to the reader which random forest models are being referred to at any given time.
Table 2: Please clarify your SOC units (and thus the MAE and RMSE units). Grams of what per kilogram of what? Please clarify this throughout the paper.
Figure 3 and associated text: Panel (a) is RFenv and (b) is RFinp right? As noted above, please use consistent language to help the reader. I am also wondering why (a) is stylized differently than the other panels? I understand why (a) might be vertical for space reasons but I find it a little misleading that it looks different when it's displaying the same information. I believe panels b-c should be colored as panel a for consistency and to help the reader understand that these plots are showing the same things for different models.
Figure 4 and associated text: Why is RFenv missing here?
Lines 395-397: Please be careful to not use causal language here and throughout. Instead, the authors could say: higher soil temperature was associated with lower SOC and higher silt and clay was associated with higher SOC.
Lines 395-404: I wonder if simple linear regression would aid in interpretation of these plots. Currently, it seems interpretation is based on the smoothness of the color gradient, and it could be more objective to use r-squared values here.
Section 3.4: Which observed data is presented here? RFenv or RFinp?
Figure 5: Could the short ticks at the bottom of each axis be described more? Which percentiles do they represent? Also, could the authors justify why high SOC values are not shown?
Section 4.2: It could be useful to comment in this section that CEC is not as widely or easily measured as clay and silt, making it harder to use in models applied at site scales where site data is generally used for model inputs. Thus, it might be suggested to measure that more frequently.
Line 563: Is the weaker response or sharp decrease more pronounced in MES-C?
Lines 563-567: I am not sure I agree with this. The authors seem to be conflating steady state associations between SOC and temperature with transient sensitivities, and I am not sure those would necessarily be the same.
Lines 583-586: I suggest softening this language since Cotrufo et al., 2013 is a prediction with mixed support in the literature.
Lines 591-594: How was litter quality handled in this study as a model input? Did it vary with space? The quality of litter is a very influential input in MIMICS (and presumably MES-C given their similar structure) and so addressing this input would be important here.
Discussion, overall: There seems to be quite a bit of restating the results in the discussion – I wonder if these could be combined to reduce repetition? Or, if not, could the discussion try to be a bit broader, focusing on consistent findings across methods, rather than covering each method individually, which lends itself to a restating of the results?
Line 647: I think this paper would benefit from a “Limitations” section that addresses the fact that only two SOC models that are relatively similar were addressed here and that it seems an important driver in at least MIMICS and maybe also MES-C is not evaluated (e.g., litter quality).
Line 649: Two random forest models?
Line 658: MIMICS doesn’t have pH representation – could that be stated clearly here?

Citation: https://doi.org/10.5194/egusphere-2025-2545-RC3
- AC3: 'Reply on RC3', Lingfei Wang, 18 Sep 2025
  
  We would like to thank the reviewer for their time and feedback on this manuscript. We will revise the manuscript thoroughly according to the reviewer’s comments on all sections. Please find our point-by-point responses in the attached PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2545-AC3

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Reconsider after major revisions (06 Oct 2025) by Akihiko Ito

AR by Lingfei Wang on behalf of the Authors (05 Nov 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (20 Nov 2025) by Akihiko Ito

AR by Lingfei Wang on behalf of the Authors (21 Nov 2025) Manuscript

Journal article(s) based on this preprint

09 Dec 2025

Using explainable AI to diagnose the representation of environmental drivers in process-based soil organic carbon models

Lingfei Wang, Gab Abramowitz, Ying-Ping Wang, Andy Pitman, Philippe Ciais, and Daniel S. Goll

Biogeosciences, 22, 7845–7863, https://doi.org/10.5194/bg-22-7845-2025, 2025

Supplement

https://doi.org/10.5194/egusphere-2025-2545-supplement

Viewed

Total article views: 4,282 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
3,532	633	117	4,282	207	109	131

Views and downloads (calculated since 19 Jun 2025)

Month	HTML	PDF	XML	Total
Jun 2025	220	58	18	296
Jul 2025	256	66	12	334
Aug 2025	494	62	12	568
Sep 2025	1,358	50	20	1,428
Oct 2025	168	28	4	200
Nov 2025	126	66	10	202
Dec 2025	222	38	4	264
Jan 2026	118	66	22	206
Feb 2026	138	42	2	182
Mar 2026	156	76	6	238
Apr 2026	59	39	1	99
May 2026	193	31	2	226
Jun 2026	18	6	1	25
Jul 2026	6	5	3	14

Viewed (geographical distribution)

Total article views: 4,278 (including HTML, PDF, and XML) Thereof 4,278 with geography defined and 0 with unknown origin.

Latest update: 18 Jul 2026

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary

Accurate estimates of global soil organic carbon (SOC) content and its spatial pattern are critical for future climate change mitigation. However, the most advanced mechanistic SOC models struggle to do this task. Here we apply multiple explainable machine learning methods to identify missing variables and misrepresented relationships between environmental factors and SOC in these models, offering new insights to guide model development for more reliable SOC predictions.

