Comparing the MEMS v1 model performance with MCMC and 4DEnVar calibration methods over a continental soil inventory

Viskari, Toni; Quaife, Tristan; Fahl, Fernando; Zhang, Yao; Lugato, Emanuele

doi:10.5194/egusphere-2025-4999

Preprints

https://doi.org/10.5194/egusphere-2025-4999

Preprints

26 Jan 2026

| 26 Jan 2026

Comparing the MEMS v1 model performance with MCMC and 4DEnVar calibration methods over a continental soil inventory

Toni Viskari, Tristan Quaife, Fernando Fahl, Yao Zhang, and Emanuele Lugato

Abstract. An abundant amount of different data is required to calibrate soil organic carbon (SOC) models to represent ecosystems at large-scale. However, due to challenges related to model state projections, this calibration becomes very computationally heavy with traditional calibration methods. In this work, we test 4-Dimensional Ensemble Variational data assimilation (4DEnVar) method to parameterize the MEMS v1 SOC model using data from the LUCAS soil sampling network and compare its performance against MCMC calibration. Comparing the total SOC projections from both parameterizations to the validation datasets showed similar improvements even though the produced parameter sets differed. A thorough analysis revealed that the detailed SOC states were not similar to a degree that is meaningful for future predictions, but we also lacked information to determine which parameter set was closer to the truth. Our results here establish 4DEnVar as an applicable calibration method for SOC models but also highlight the need for more nuanced validation methods as well careful examination how different data sets affect the model calibration.

Received: 09 Oct 2025 – Discussion started: 26 Jan 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Toni Viskari, Tristan Quaife, Fernando Fahl, Yao Zhang, and Emanuele Lugato

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-4999', Anonymous Referee #1, 02 Mar 2026
In their article, ‘Comparing the MEMS v1 model performance with MCMC and 4DEnVar calibration methods over a continental soil inventory’, Viskari et al. compare two different parameter optimisation methods, Markov Chain Monte Carlo (MCMC) and the 4-Dimensional Ensemble Variational data assimilation method (4DEnVar), to optimise the MEMS v1 model.
Using SOC and carbon fraction data (POM vs MAOM) from the LUCAS 2009 soil inventory, the authors calibrated selected model parameters for 322 soil samples for which the POM and MAOM fractions were known, and analysed how these parameters influence steady-state SOC projections for 17,430 other LUCAS data points. The study includes a twin experiment (to assess if the algorithms were able to find the correct parameter set), two calibration scenarios using different assumptions about the fraction of net primary production entering the soil, and a large-scale validation.
The authors report that both calibration approaches produce similar results despite yielding different parameter sets and a different distribution of simulated SOM between POM and MAOM. They also explore how NPP-related assumptions alter calibration outcomes. Their results highlight the sensitivity of SOC model calibration to litter input assumptions and the implications of parameter differences for projected POM and MAOM distributions across Europe.
Obtaining parameter values through calibration for large amounts of data can be a computationally very costly procedure, as pointed out by the authors. Therefore, the evaluation of different methods to obtain suitable parameter values more efficiently is a valuable effort that can speed up parameterisation in the future. A strong point of the manuscript is that it does not only describe positive results, but also focusses on the pitfalls of model calibration, such as obtaining different parameter values that perform equally well, or a different simulated distribution of SOM among different simulated pools. These aspects of the model calibration process are often ignored in the literature and making modellers aware of these is highly important to advance this field.
The manuscript is well written and the results are clearly presented, although I missed a more quantitative evaluation of the calibration and validation results. I think it is an important contribution to the field of SOC development, which is regularly confronted with limitations when large amounts of data need to be used for model calibration. I hope my feedback can improve the quality of the manuscript, and make some aspects more clear to the readers.
Throughout my feedback, I mention certain published articles. These have been chosen based on their scientific relevance, and I leave it up to the authors whether they want to include these in their manuscript or not.
My main feedback is the following:
Throughout the manuscript the authors use the wording ‘to a meaningful degree’, without specifying what this means. This should be done, as it seems this term is used with the same meaning as ‘significantly different’, but it is not clear which criteria the authors use when applying this term.

One of the main outcomes of the manuscript is that the different calibration methods resulted in similar model outcomes (Fig. 3; although there are some notable differences as shown in Fig. 4) with different parameter values (Fig. 2). In addition, the projections show clear differences in the distribution of SOC between POC and MAOC (Fig. 7). Both are a clear example of equifinality, an important but often overlook concept in SOM modelling, and environmental modelling in general. I encourage the authors to have a look at this concept, and include this in their manuscript as it is highly relevant to interpret their results. The following articles could be used as a guide: Sierra et al. (2015; https://doi.org/10.1016/j.soilbio.2015.08.012), Marschmann et al. (2019; https://doi.org/10.1016/j.envsoft.2019.104518), Beven et al. (2006 ; https://doi.org/10.1016/j.jhydrol.2005.07.007), Van de Broek et al. (2025, https://doi.org/10.5194/bg-22-1427-2025), and Luo et al. (2017 ; https://www.jstor.org/stable/26155933)

Similarly, the parameters selected to be optimized are likely to be ‘not-identifiable’, meaning that different combinations of these parameters can lead to a similar model output (as observed by the authors). The authors would have been able to draw stronger conclusions about the comparison between the calibration techniques if only ‘identifiable parameters’, with only one solution, would have been optimised. I encourage the authors to discuss the implications of this, for example using the articles mentioned in the previous point, in addition to Guillaume et al. (2019; https://doi.org/10.1016/j.envsoft.2019.07.007) and Lam et al. (2022; https://doi.org/10.1016/j.matcom.2022.03.020)

The description of the 4DEnVar method is very technical, and difficult to understand for a non-expert. As the difference between this method and MCMC is a core aspect of the manuscript, I would encourage the authors to include a paragraph where the 4DEnVar method is explained in layman terms, with the differences with MCMC being highlighted.

The discussions and conclusions would benefit from a quantitative description of both calibration and validation results for both methods using multiple error metrics, which is currently lacking. As a result, the reader currently has to rely on only the plots to interpret the results.

Specific comments
--- Abstract ---
General: one of the main parts of the manuscript is the assessment of how the portion of NPP serving as C inputs affects model parameters and performance, but this is not mentioned in the abstract. I would encourage the authors to do so, so this is clear to the reader from the start.
--- Introduction ---
L 32-33: This understanding has not been ‘recently advanced’, as SOM fractionation is a practice that has been well-established for over three decades (see, for example, Cambardella et al. (1992; https://doi.org/10.2136/sssaj1992.03615995005600030017x))
L 40-41: instead of mentioning only two such models, it would be worthwhile to acknowledge that many similar non-linear models exist (see, for example, Chandel et al. (2023; https://doi.org/10.1029/2023JG007436) and Le Noë et al. (2023; https://doi.org/10.1038/s43247-023-00830-5))
L 53-55: That is correct, but a solution to this problem is to simulate 14C and evaluate this against measurements of d14C, so that both the stocks and turnover times are simulated correctly.
L 78-79: this sentence needs more explanation to be understandable by the reader
--- Methods and data ---
L 97: it seems the model is applied to 20 cm, as the LUCAS data contains data down to 20 cm. Please explicitly state this in the manuscript.
Fig. 1: it would be interesting to see where the 322 data points that were used for calibration were located. Can these be highlighted?
L112: it would be interesting for the reader to see the model structure of MEMS. Perhaps put a graph showing this in the supplement?
L 119: ‘were considered in this work’: what does that mean? Please clarify.
L122: please mention for which land use these default parameters were obtained. Are they readily applicable to your simulated forest, grassland and cropland ecosystems?
L124-127: these equations are very difficult to understand given the generic names of the carbon pools, and the lack of a graph showing the model structure and flows of C between the simulated pools. I suggest to authors to improve this.
L131-132: a couple words of explanation on the STANDCARB model are needed for readers not familiar with this model to understand.
L 145: please explain what you mean with ‘prior values’. Does this have the same meaning as the prior in a Bayesian calibration?
Table 1: (1) it would be more intuitive for the reader if the pool names (C5, C8, etc.) would be replaced by names of the pools such at POC, DOC, etc. As it is now, this table is difficult to interpret by readers not familiar with the MEMS model. (2) Please clarify what the minimum and maximum values are. (3) Please mention the units of the values. (4) What is meant with the baseline values?
L151: also here, a graph of the conceptual model of MEMS would help the reader understand how litter inputs are distributed among the model pools.
L 154-157: also here, the equations are not straightforward to interpret because of the use of C1, C2, etc. Better would be to use pool names that are understandable for the reader.
Table 2: it would be good to also explain in the caption what fsol, flig and fdoc are, so the table is understandable by itself
L 200: this section is very technical and difficult to understand for a non-expert. I encourage the authors to start this section with a paragraph that explain in simple words how this method works, and how it differs from MCMC. As this is central to your study, it is important that readers can understand how this method works.
L 277: the approach of performing all optimizations separately for different values of fdoc needs more explanation for the reader to understand why this was necessary
L282-284: something seems to be wrong with this sentence, it is not clear.
L 285-286: please find a better way to mention the initial size of the state variables, perhaps in a table in the supplement.
--- Results ---
L 340-341: is there an explanation why in the twin experiment, the algorithm found the same parameter values for both optimization methods, while this was not the case when the real data were used?
Figure 2: Please use more informative names for the parameters. As it is now, names as k5, k8 etc. are not intuitive for the reader and they will not be able to interpret this plot without going back to the methods section.
L 348: please clarify what you mean by ‘expected values’
L 349: please clarify what you mean by ‘differ meaningfully’. What criterion do you use for this? Please do so throughout the manuscript where this expression is used.
L 359-361: this sentence is very difficult to understand, please clarify
Table 3: (1) what do you mean by ‘expected values’? (2) What are the ‘baseline parameters’?
Figure 4: the MCMC method is not able to simulate the whole range in observed SOC, while both the MCMC method 4DEnVar systematically overestimate the MAOM:SOM ratio (with 4DEnVar not being able to simulate the whole range in measured ratios). As these are calibration results, I would have expected the models to perform better, at least without clear biases. Can a reason be that the ranges in the values of calibration parameters weren’t large enough (which is difficult to check by the reader because of the generic parameter names)? Also, please add to the labels on the x-axes that these are the modelled results.
Figure 6: please provide a time unit for NPP on the x-axis
L 412-425: the interpretation the performance of both methods would benefit from a more quantitative detailed description of the validation results. For example, a scatterplot of modelled versus measured SOM for the validation dataset, combined with different error measures. Currently Fig. 7C is the only figure where the reader can see the model error for the validation dataset.
Figure 7A&B: please add to the labels on the x-axes that these are the modelled results.
--- Discussion ---
L 444-445: I wouldn’t say it’s striking that the parameter sets differ from each other, this is an often-observed characteristic of equifinality (see above). I suggest the authors discuss this in more detail.
L 445-446: it’s not clear to me how both parameter sets ‘perform equally well with the validation dataset’, as no error measures for this have been provided, and Fig. 7 shows that there are clear differences between the simulation results for both methods. Therefore, I suggest the authors quantify model performance for the validation dataset, and explain why they interpret the validation results as being equally well between both methods. In addition, a good test of the effect of the different parameter sets would be to run your validation sites into a predictive mode, using for example an artificial increase in temperature for a couple of decades. If both methods result in a similar change in SOC for each site, you can say they ‘perform equally well’, but if both parameter sets results in a different change in SOC for each site, you can conclude that the different parameter sets have a different effect when moving away from the steady-state solution.
L 454-465: it’s not clear why these differences in parameter sets are attributed to the measurements (MAOM and total SOC), and not to a potentially inappropriate model structure, parameters values that were fixed incorrectly, or the existence of multiple mimima in the error space. What is the reason for not questioning these aspects of the model calibration process?
L 520-521: this statement is difficult to verify, as neither the error distribution nor the error measures are quantified for the validation (or calibration) results. For example, it is not possible for the reader to asses by which percentage the validation results are off, as only absolute numbers are shown in the plots (for example, Fig. 5 and 7C).
L536-537: SOC data alone is indeed often not sufficient to evaluate the performance of SOM models, see Guo et al. (2022; https://doi.org/10.1016/j.soilbio.2022.108780) or Braakhekke et al. (2014; https://doi.org/10.1002/2013JG002420)
L539: assessing the thermal stability of SOM is not the same as a fractionation into POM and MAOM (although they may be related), so the reference by Delahaie et al. is not appropriate as an example of a more efficient fractionation into POM and MAOM.
--- Conclusion ---
L552-554: the conclusion that both methods produce an ‘as good validation performance’ needs to be supported by a quantitative assessment.
L556-557: ‘[…] to notably impact future projections’: have such analyses been performed? That would be the ultimate proof to assess how the different parameter sets affected the model performance.
L 557: I wouldn’t call an increase in the portion of NPP going into the soil from 15% to 35 % a ‘slight change’, as this is more than a doubling
Technical comments
L 46: remove ‘that’
L 96: incorrect formatting of the references
L 111: remove the ‘/’
L 161-162: ‘equations from 17 to 20’: these numbers do not seem to be correct
L 213: split ‘anobservation’
L 224 - 225: ‘equation 10’: seems the wrong number
L 269-270: ‘from with the model’: something seems to be wrong with this
L 281: soi => soil
L 282: matrixl => matrix
L 283: ‘inputted’ is not correct English
L 284 : remove the two dots
L 287-288: no => not
L 322: incorrect formatting of the reference
L 416: we can see a similar differences => we can see similar differences
L 421: illustrated => illustrates
Citation: https://doi.org/10.5194/egusphere-2025-4999-RC1
- AC1: 'Reply on RC1', Toni Viskari, 13 May 2026
  
  For ease of formatting and including figures, we have attached our response to these comments as a PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-4999-AC1
RC2:
'Comment on egusphere-2025-4999', Okiria Emmanuel, 07 Apr 2026

General Comments
In their manuscript, Viskari et al. tackle one of the most significant bottlenecks in large-scale soil organic carbon (SOC) modelling: the immense computational cost of parameter calibration. By comparing the traditional Markov Chain Monte Carlo (MCMC) algorithm with the novel 4-Dimensional Ensemble Variational (4DEnVar) data assimilation method, the authors provide a highly valuable methodological benchmark. Using the LUCAS 2009 soil inventory, they successfully demonstrate that 4DEnVar can achieve comparable validation performance to MCMC at a fraction of the computational cost, while also revealing how different algorithms navigate conflicting measurement incentives/trade-offs (Total SOC vs. MAOM fraction) and hidden model assumptions (litter input fractions).
The manuscript’s willingness to explore the pitfalls of calibration—specifically equifinality, and boundary-hitting parameter estimations (parameters hitting the ceiling)—makes it an important and refreshing contribution to the field.
However, while the conceptual framework and ultimate findings are strong, the manuscript currently suffers from some mathematical imprecisions/misrepresentations in the Methods section (probably mistakes in equation cross-referencing), as well as a major contradiction between the text and the visual data in the Results section. Moreover, in some cases, key variables are left undefined, mathematical notations in the 4D-Var equations are mismatched (cross-referencing error probably), and the interpretation of the kernel density plots mischaracterises the parameter uncertainty (I will apologise in advance if my understanding of these figures is off). Furthermore, the reliance on data not shown for foundational steps of the experiment hinders the reproducibility expected.
In my humble view, addressing the specific comments below will greatly improve the mathematical rigor, visual clarity, and overall readability of the manuscript.
The detailed responses are annotated directly in the attached pdf version of the manuscript.

Citation: https://doi.org/10.5194/egusphere-2025-4999-RC2
- AC2: 'Reply on RC2', Toni Viskari, 13 May 2026
  
  We have attached the responses to these comments as a pdf to ensure formatting and easy inclusion of figures.
  
  Citation: https://doi.org/10.5194/egusphere-2025-4999-AC2
RC3:
'Comment on egusphere-2025-4999', Anonymous Referee #3, 11 Apr 2026
This manuscript by Viskari et al. presents the comparison between the established MCMC and emergent 4DEnVar calibration approaches for the MEMS v1 soil organic model using the continental-scale soil inventory dataset. The authors calibrate the model using a subset of sites where both total SOC and physical fractions were available, and validate these calibration across over 17,000 sites at a continental scale. Then they include a parameter experiment (changed from 0.15 to 0.35) to test the effects of litter input assumption on resulting parameters sets and model projections.
The authors find that MCMC and 4DEnVar calibration methods could achieve comparable performance in terms of total SOC validation with different parameter sets and different internal representation of the SOC state, while the later method is far more efficient. Changes in litter input assumptions affect both calibration methods but lead to general similar behaviors. Thus, 4DEnVar would be a potential tool for efficient calibration, yet more concrete evaluation approach for SOC modelling is needed.
This manuscript makes valuable contributions to the core issue of equifinality in the field of modelling by linking structural discrepancy between calibrating against total SOC and SOC fractions. Comparison between two methods is a practical finding for any modeler considering adopting these method and datasets.
The manuscript is generally well written. The general comments and specific comments are as following:
General comments
This study claims that both calibration methods produce ‘improvements’ (see in abstract, introduction, discussion) which is a ‘relative’ term. ‘Improvements’ are absent in the result section. The authors don’t quantitatively demonstrate what these improvements are, relative to what baseline and what magnitude. I recommend adding a systematic comparison by metrics of model performance under the default parameter set and calibrated parameters sets, as well as the simulations with change of f_doc.

Other than the relative performance of the two methods, both results and discussion are quite qualitative without providing specific numbers from the results. This manuscript would be strengthened by grounding its interpretive statements in quantitative statistics.

This manuscript presents several critical interpretive points that rest on ‘not shown’ analyses. Most of them are not minor analysis, e.g. the twin experiment, the relationship between high SOC projections and environmental factors. I strongly suggest the authors to include these results as supplementary materials.

This manuscript demonstrates the very important issue of equifinality in modelling. This includes the equifinality between the methods where they generate similar performance with different parameter sets, and the structural equifinality where similar performance was found with different underlying process representations. It would benefit a lot if the authors could put more discussions on this very important concept.

The author repeatedly uses the term “meaningful” to describe differences in results or changes. This term is quite vague, and I am unsure whether it refers to statistical significance or some other meaning.

The clarity of the result section would be improved if the authors could restructure by the order of two objects. It would be nice to at least have subheadings or any explicit sentences to navigate the readers.

Specific comments
Line 9-19: It would be beneficial to mention the second object and its findings of this manuscript, i.e. how NPP litter assumptions affecting the calibration and projections, to ensure that the abstract reflects the full scope of this research.

Line 18: as well -> as well as

Line 69-80: While motivation for this study lies in the computation challenge of MCMC method, the introduction of 4DEnVar method would be expected to be introduced in a way that emphasizing on its relative computation cost with concrete references or statistics to strengthen the motivation of this work.

Line 81-82: It is unclear for the reader why choose MEMS v1 model for this study, as this model is mentioned only at line 81. Would be nice to introduce this model already after mentioning the need to separate different SOC fractions.

Line 82-84: The advantage of using the LUCAS dataset can be briefly summarized. As a side focus of this paper, I would also like to know the challenges and limitations of using large-scale datasets for model development.

Line 85-86: From the previous context, I can’t see why can draw the hypothesis of ‘fit to the same degree’.

Line 87-91: The two objectives of this study are not articulately explained. Why this ‘simple experiment’ important in the context of ‘comparison of calibration methods’, and how is the first and two objectives inform each other?

Line 95-102: I wonder how presentative the subset of 350 samples is, which might be a big source of validation error. Would you please show the distribution of this subset in Figure 1 and provide more information regarding the contained context.

Line 105-109: I wonder what is the intention to include agricultural soils into simulation, where the-steady state assumption might not hold true?

Function 8-11: What is r^eco?

Line 162: Wrong equations reference?

Line 212: Wrong equation reference?

Line 224: Wrong equation reference?

Line 268-270: Hard for me to follow the description of the twin experiment. What are the baseline parameters for the parameters you calibrate in this model? How do you generate the perturbed parameter set?

Line 299-304: Please provide the full name of the MODIS product. And not clear for me that why the NPP dynamics are not expected to meaningfully affect the modelling results.

Line 363: I think the baseline parameters are not presented here? Also I wonder if you could provide uncertainty measures in this table? 0.3s5 -> 0.35.

Line 376-377: I wonder if you could provide qualitative statistics for the error distribution. As the difference between two methods with f_doc=0.35 can’t be intuitively recognized.

Line 401-403: The spatial pattern in Figure 5 seems to be coincided with the contexts in Line 404-408, and conclusions in 422-425. I wonder if you could discuss a bit more the regional patterns and related driver data limitation, model structure limitation, impacts in model performance evaluation in the discussion section. To make it more concreate in depicting the challenge in modelling with large-scale dataset.

Line 444-453: looks more like results for me, and these are not brought up in the results section.

Line 466-492: I wonder if you could provide a bit more information on the potential solutions to solve the prior impacts on calibration. So that this part would be more useful for practitioners considering using 4DEnVar model for similar applications.

Line 500-511: Current discussion mainly focuses on the challenges we are facing right now. It would be beneficial to put it in a broaden ecological context, and discuss about the ecological outcomes of changing the f_doc parameter. This might help to assess whether the chosen f_doc parameter is ecological reasonable.

Line 552-554: I wonder if this sentence should be stated more carefully, by specify the NPP assumption, while MCMC resulted in lower J under certain circumstance.

Line 557-559: The conclusion of the second object of this work is rather vague, and didn’t explicitly summarize the ecological meaning. Otherwise, it looks like just a repetition of the first object’s finding, without drawing importance of the second object.

Line 562: Might be nice to add one sentence at the end and return to the broaden context of SOC modelling goals.
Citation: https://doi.org/10.5194/egusphere-2025-4999-RC3
- AC3: 'Reply on RC3', Toni Viskari, 13 May 2026
  
  The responses to the reviewer comments are in the attached pdf to ensure formatting and correct inclusion of figures.
  
  Citation: https://doi.org/10.5194/egusphere-2025-4999-AC3

Toni Viskari, Tristan Quaife, Fernando Fahl, Yao Zhang, and Emanuele Lugato

Viewed

Total article views: 1,471 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
956	410	105	1,471	82	158

HTML: 956
PDF: 410
XML: 105
Total: 1,471
BibTeX: 82
EndNote: 158

Views and downloads (calculated since 26 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	305	100	35	440
Feb 2026	167	87	31	285
Mar 2026	339	146	19	504
Apr 2026	101	44	9	154
May 2026	44	33	11	88
Jun 2026	0

Cumulative views and downloads (calculated since 26 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	305	100	35	440
Feb 2026	167	87	31	285
Mar 2026	339	146	19	504
Apr 2026	101	44	9	154
May 2026	44	33	11	88
Jun 2026	0

Viewed (geographical distribution)

Total article views: 1,455 (including HTML, PDF, and XML) Thereof 1,455 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 02 Jun 2026

Short summary

In this work we examined how different assumptions regarding soil carbon model calibration affect the resulting model performance. We found that how the litter inputs are set have a meaningful impact on the calibrated model parameters. Furthermore, two calibration methods produced parameter sets that differed meaningfully from each other but fit the validation dataset equally well. These results raise meaningful questions how we evaluate soil carbon model performance.


Total:	0
HTML:	0
PDF:	0
XML:	0