Comparing the MEMS v1 model performance with MCMC and 4DEnVar calibration methods over a continental soil inventory
Abstract. Abundant and diverse data are required to calibrate soil organic carbon (SOC) models to represent ecosystems at large scales. However, due to challenges related to model state projections, this calibration becomes computationally very heavy with traditional calibration methods. In this work, we test the 4-Dimensional Ensemble Variational data assimilation (4DEnVar) method to parameterize the MEMS v1 SOC model using data from the LUCAS soil sampling network and compare its performance against MCMC calibration. Comparing the total SOC projections from both parameterizations to the validation datasets showed similar improvements even though the produced parameter sets differed. A thorough analysis revealed that the detailed SOC states were not similar to a degree that is meaningful for future predictions, but we also lacked information to determine which parameter set was closer to the truth. Our results establish 4DEnVar as an applicable calibration method for SOC models but also highlight the need for more nuanced validation methods as well as careful examination of how different data sets affect the model calibration.
In their article, ‘Comparing the MEMS v1 model performance with MCMC and 4DEnVar calibration methods over a continental soil inventory’, Viskari et al. compare two parameter optimisation methods, Markov Chain Monte Carlo (MCMC) and 4-Dimensional Ensemble Variational data assimilation (4DEnVar), for calibrating the MEMS v1 model.
Using SOC and carbon fraction data (POM vs MAOM) from the LUCAS 2009 soil inventory, the authors calibrated selected model parameters for 322 soil samples for which the POM and MAOM fractions were known, and analysed how these parameters influence steady-state SOC projections for 17,430 other LUCAS data points. The study includes a twin experiment (to assess if the algorithms were able to find the correct parameter set), two calibration scenarios using different assumptions about the fraction of net primary production entering the soil, and a large-scale validation.
The authors report that both calibration approaches produce similar results despite yielding different parameter sets and a different distribution of simulated SOM between POM and MAOM. They also explore how NPP-related assumptions alter calibration outcomes. Their results highlight the sensitivity of SOC model calibration to litter input assumptions and the implications of parameter differences for projected POM and MAOM distributions across Europe.
Obtaining parameter values through calibration for large amounts of data can be a computationally very costly procedure, as pointed out by the authors. Therefore, the evaluation of different methods to obtain suitable parameter values more efficiently is a valuable effort that can speed up parameterisation in the future. A strong point of the manuscript is that it not only describes positive results, but also focusses on the pitfalls of model calibration, such as obtaining different parameter values that perform equally well, or a different simulated distribution of SOM among the simulated pools. These aspects of the model calibration process are often ignored in the literature, and making modellers aware of them is highly important to advance this field.
The manuscript is well written and the results are clearly presented, although I missed a more quantitative evaluation of the calibration and validation results. I think it is an important contribution to the field of SOC model development, which is regularly confronted with limitations when large amounts of data need to be used for model calibration. I hope my feedback can improve the quality of the manuscript and make some aspects clearer to the readers.
Throughout my feedback, I mention certain published articles. These have been chosen based on their scientific relevance, and I leave it up to the authors whether they want to include these in their manuscript or not.
My main feedback is the following:
Specific comments
--- Abstract ---
General: one of the main parts of the manuscript is the assessment of how the portion of NPP serving as C inputs affects model parameters and performance, but this is not mentioned in the abstract. I would encourage the authors to do so, so this is clear to the reader from the start.
--- Introduction ---
L 32-33: This understanding has not been ‘recently advanced’, as SOM fractionation is a practice that has been well-established for over three decades (see, for example, Cambardella et al. (1992; https://doi.org/10.2136/sssaj1992.03615995005600030017x))
L 40-41: instead of mentioning only two such models, it would be worthwhile to acknowledge that many similar non-linear models exist (see, for example, Chandel et al. (2023; https://doi.org/10.1029/2023JG007436) and Le Noë et al. (2023; https://doi.org/10.1038/s43247-023-00830-5))
L 53-55: That is correct, but a solution to this problem is to simulate 14C and evaluate it against measurements of Δ14C, so that both the stocks and the turnover times are simulated correctly.
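For reference, a minimal sketch of the radiocarbon mass balance I have in mind, for a single pool with first-order decomposition rate k (generic notation, not MEMS-specific):

```latex
\frac{dC}{dt} = I - kC, \qquad
\frac{d(FC)}{dt} = F_{\mathrm{atm}}(t)\, I - (k + \lambda)\, FC
```

Here C is the pool stock, I the carbon input, F the sample 14C/12C ratio relative to the standard (fraction modern), F_atm(t) the atmospheric ratio, and λ ≈ 1/8267 yr⁻¹ the radioactive decay constant. Because λ is fixed and known, matching measured Δ14C constrains the turnover rate k independently of the stock C.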
L 78-79: this sentence needs more explanation to be understandable by the reader
--- Methods and data ---
L 97: it seems the model is applied to a depth of 20 cm, as the LUCAS samples only cover the top 20 cm. Please state this explicitly in the manuscript.
Fig. 1: it would be interesting to see where the 322 data points that were used for calibration were located. Can these be highlighted?
L112: it would be interesting for the reader to see the model structure of MEMS. Perhaps put a graph showing this in the supplement?
L 119: ‘were considered in this work’: what does that mean? Please clarify.
L122: please mention for which land use these default parameters were obtained. Are they readily applicable to your simulated forest, grassland and cropland ecosystems?
L124-127: these equations are very difficult to understand given the generic names of the carbon pools, and the lack of a graph showing the model structure and the flows of C between the simulated pools. I suggest the authors improve this.
L131-132: a couple of words of explanation on the STANDCARB model are needed for readers not familiar with this model.
L 145: please explain what you mean by ‘prior values’. Does this have the same meaning as the prior in a Bayesian calibration?
Table 1: (1) it would be more intuitive for the reader if the pool names (C5, C8, etc.) were replaced by names of the pools such as POC, DOC, etc. As it is now, this table is difficult to interpret for readers not familiar with the MEMS model. (2) Please clarify what the minimum and maximum values are. (3) Please mention the units of the values. (4) What is meant by the baseline values?
L151: also here, a graph of the conceptual model of MEMS would help the reader understand how litter inputs are distributed among the model pools.
L 154-157: also here, the equations are not straightforward to interpret because of the use of C1, C2, etc. It would be better to use pool names that are understandable for the reader.
Table 2: it would be good to also explain in the caption what fsol, flig and fdoc are, so the table is understandable on its own.
L 200: this section is very technical and difficult to understand for a non-expert. I encourage the authors to start this section with a paragraph that explains in simple terms how this method works, and how it differs from MCMC (see the sketch below for the kind of summary I have in mind). As this is central to your study, it is important that readers can understand how this method works.
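To make this concrete: as I understand it, 4DEnVar replaces MCMC’s many sequential model evaluations with a linear-algebra solve in the subspace spanned by an ensemble of model runs, the ensemble perturbations standing in for a tangent-linear model. A minimal, generic sketch of such an ensemble-variational update (all names are illustrative; this is not the authors’ implementation):

```python
import numpy as np

def envar_update(theta_ens, y_pred_ens, y_obs, r_var):
    """One generic ensemble-variational analysis step (illustrative sketch).

    theta_ens  : (n_ens, n_par) prior parameter ensemble
    y_pred_ens : (n_ens, n_obs) model predictions for each ensemble member
    y_obs      : (n_obs,) observations
    r_var      : (n_obs,) observation error variances
    """
    n_ens = theta_ens.shape[0]
    theta_mean = theta_ens.mean(axis=0)
    y_mean = y_pred_ens.mean(axis=0)
    X = (theta_ens - theta_mean).T   # parameter perturbations, (n_par, n_ens)
    Y = (y_pred_ens - y_mean).T      # predicted-obs perturbations, (n_obs, n_ens)
    r_inv = np.diag(1.0 / r_var)
    d = y_obs - y_mean               # innovation
    # Minimise J(w) = (n_ens - 1)/2 * w'w + 1/2 * (d - Yw)' R^-1 (d - Yw)
    a = (n_ens - 1) * np.eye(n_ens) + Y.T @ r_inv @ Y
    w = np.linalg.solve(a, Y.T @ r_inv @ d)
    return theta_mean + X @ w        # analysis (posterior mean) parameters
```

A paragraph contrasting this one-shot (or iterated) solve with the thousands of sequential model runs an MCMC chain requires would, in my view, immediately convey why 4DEnVar is attractive for large datasets.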
L 277: the approach of performing all optimisations separately for different values of fdoc needs more explanation for the reader to understand why this was necessary
L282-284: something seems to be wrong with this sentence; its meaning is not clear.
L 285-286: please find a better way to mention the initial size of the state variables, perhaps in a table in the supplement.
--- Results ---
L 340-341: is there an explanation why, in the twin experiment, both optimisation methods found the same parameter values, while this was not the case when the real data were used?
Figure 2: Please use more informative names for the parameters. As it is now, names as k5, k8 etc. are not intuitive for the reader and they will not be able to interpret this plot without going back to the methods section.
L 348: please clarify what you mean by ‘expected values’
L 349: please clarify what you mean by ‘differ meaningfully’. What criterion do you use for this? Please do so throughout the manuscript where this expression is used.
L 359-361: this sentence is very difficult to understand, please clarify
Table 3: (1) what do you mean by ‘expected values’? (2) What are the ‘baseline parameters’?
Figure 4: the MCMC method is not able to simulate the whole range in observed SOC, while both the MCMC and 4DEnVar methods systematically overestimate the MAOM:SOM ratio (with 4DEnVar not being able to simulate the whole range in measured ratios). As these are calibration results, I would have expected the models to perform better, at least without clear biases. Could one reason be that the ranges of the calibration parameter values were not large enough (which is difficult for the reader to check because of the generic parameter names)? Also, please add to the labels on the x-axes that these are the modelled results.
Figure 6: please provide a time unit for NPP on the x-axis
L 412-425: the interpretation of the performance of both methods would benefit from a more quantitative and detailed description of the validation results, for example a scatterplot of modelled versus measured SOM for the validation dataset, combined with different error measures. Currently, Fig. 7C is the only figure where the reader can see the model error for the validation dataset.
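As a minimal sketch of the kind of evaluation I have in mind (variable names are placeholders, not the authors’ code):

```python
import numpy as np
import matplotlib.pyplot as plt

def evaluate(obs, mod, label):
    """Report simple error measures and add points to a 1:1 scatterplot."""
    resid = mod - obs
    rmse = np.sqrt(np.mean(resid ** 2))      # root mean square error
    bias = np.mean(resid)                    # mean error
    r2 = np.corrcoef(obs, mod)[0, 1] ** 2    # squared Pearson correlation
    print(f"{label}: RMSE = {rmse:.2f}, bias = {bias:.2f}, r2 = {r2:.2f}")
    plt.scatter(obs, mod, s=5, alpha=0.3, label=label)

# obs_soc, soc_mcmc, soc_4denvar: validation observations and the two
# model projections (placeholders for the authors' data).
# evaluate(obs_soc, soc_mcmc, "MCMC")
# evaluate(obs_soc, soc_4denvar, "4DEnVar")
# lims = [obs_soc.min(), obs_soc.max()]
# plt.plot(lims, lims, "k--")               # 1:1 line
# plt.xlabel("Measured SOC"); plt.ylabel("Modelled SOC"); plt.legend()
```

Reporting such measures separately for the MCMC and 4DEnVar projections would let readers judge whether the two methods indeed perform ‘equally well’.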
Figure 7A&B: please add to the labels on the x-axes that these are the modelled results.
--- Discussion ---
L 444-445: I wouldn’t say it’s striking that the parameter sets differ from each other; this is an often-observed characteristic of equifinality (see my general comments above). I suggest the authors discuss this in more detail.
L 445-446: it’s not clear to me how both parameter sets ‘perform equally well with the validation dataset’, as no error measures for this have been provided, and Fig. 7 shows that there are clear differences between the simulation results of both methods. Therefore, I suggest the authors quantify model performance for the validation dataset, and explain why they interpret the validation results as equally good for both methods. In addition, a good test of the effect of the different parameter sets would be to run your validation sites in a predictive mode, using for example an artificial increase in temperature over a couple of decades. If both parameter sets result in a similar change in SOC for each site, you can say they ‘perform equally well’; if they result in different changes in SOC, you can conclude that the different parameter sets have different effects when moving away from the steady-state solution (a sketch of this test follows below).
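Purely as an illustration of this predictive-mode test, with hypothetical names (run_mems, params_mcmc, params_4denvar and validation_sites are placeholders, not existing functions or objects):

```python
import numpy as np

def delta_soc(run_model, params, sites, d_temp=2.0, years=50):
    """Per-site change in SOC after `years` under a temperature offset d_temp,
    relative to an unperturbed run started from the same steady state."""
    base = np.array([run_model(params, site, d_temp=0.0, years=years)
                     for site in sites])
    warm = np.array([run_model(params, site, d_temp=d_temp, years=years)
                     for site in sites])
    return warm - base

# d_mcmc = delta_soc(run_mems, params_mcmc, validation_sites)
# d_4denvar = delta_soc(run_mems, params_4denvar, validation_sites)
# If the distributions of d_mcmc and d_4denvar diverge, the two parameter
# sets are not equivalent once the system leaves the steady state.
```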
L 454-465: it’s not clear why these differences in parameter sets are attributed to the measurements (MAOM and total SOC), and not to a potentially inappropriate model structure, parameter values that were fixed incorrectly, or the existence of multiple minima in the error space. What is the reason for not questioning these aspects of the model calibration process?
L 520-521: this statement is difficult to verify, as neither the error distribution nor the error measures are quantified for the validation (or calibration) results. For example, it is not possible for the reader to assess by which percentage the validation results are off, as only absolute numbers are shown in the plots (for example, Fig. 5 and 7C).
L536-537: SOC data alone is indeed often not sufficient to evaluate the performance of SOM models, see Guo et al. (2022; https://doi.org/10.1016/j.soilbio.2022.108780) or Braakhekke et al. (2014; https://doi.org/10.1002/2013JG002420)
L539: assessing the thermal stability of SOM is not the same as a fractionation into POM and MAOM (although they may be related), so the reference by Delahaie et al. is not appropriate as an example of a more efficient fractionation into POM and MAOM.
--- Conclusion ---
L552-554: the conclusion that both methods produce an ‘as good validation performance’ needs to be supported by a quantitative assessment.
L556-557: ‘[…] to notably impact future projections’: have such analyses been performed? That would be the ultimate test of how the different parameter sets affect the model performance.
L 557: I wouldn’t call an increase in the portion of NPP going into the soil from 15 % to 35 % a ‘slight change’, as this is more than a doubling
Technical comments
L 46: remove ‘that’
L 96: incorrect formatting of the references
L 111: remove the ‘/’
L 161-162: ‘equations from 17 to 20’: these numbers do not seem to be correct
L 213: split ‘anobservation’
L 224 - 225: ‘equation 10’: this seems to be the wrong number
L 269-270: ‘from with the model’: something seems to be wrong with this phrase
L 281: soi => soil
L 282: matrixl => matrix
L 283: ‘inputted’ is awkward; ‘input’ is the usual past tense
L 284: remove the two dots
L 287-288: no => not
L 322: incorrect formatting of the reference
L 416: we can see a similar differences => we can see similar differences
L 421: illustrated => illustrates