This work is distributed under the Creative Commons Attribution 4.0 License.
Emulating grid-based forest carbon dynamics using machine learning: an LPJ-GUESS v4.1.1 application
Abstract. The assessment of forest-based climate change mitigation strategies relies on computationally intensive scenario analyses, particularly when dynamic vegetation models are coupled with socio-economic models in multi-model frameworks. In this study, we developed surrogate models for the LPJ-GUESS dynamic global vegetation model to accelerate the prediction of carbon stocks and fluxes, enabling quicker scenario optimization within a multi-model coupling framework. We trained two machine learning methods: random forest and neural network. We assessed and compared the emulators using performance metrics and Shapley-based explanations. Our emulation approach accurately captured global and biome-specific forest carbon dynamics, closely replicating the outputs of LPJ-GUESS for both historical (1850–2014) and future (2015–2100) periods under various climate scenarios. Among the two trained emulators, the neural network extrapolated better at the end of the century for carbon stocks and fluxes, and provided more physically consistent predictions, as verified by Shapley values. Overall, the emulators reduced the simulation execution time by 97 %, bridging the gap between complex process-based models and the need for scalable and fast simulations. This offers a valuable tool for scenario analysis in the context of climate change mitigation, forest management, and policy development.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Geoscientific Model Development.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: final response (author comments only)
CC1: 'Comment on egusphere-2024-4064', Thomas Oberleitner, 26 Feb 2025
Thank you for this interesting preprint. Here are some notes and suggestions regarding the presentation of the results.
1. The preprint compares the predictive performance and parameter/response relationships of random forests and neural networks. The benefit of comparing two high-capacity/complexity models on tabular data is not clear, as it is self-evident that both can achieve good results given proper handling. In fact, because regularization in NNs can be more difficult, they are generally outperformed by easier-to-use off-the-shelf models such as random forests and gradient boosting [1].
Furthermore, we don't know of any literature in ML research supporting the claim that NNs would be inherently better at intra- or extrapolation unless they incorporate domain-specific properties [2], for example the network architectures used in physics-informed NNs. Nor do the results suggest systematically better generalization performance. In table 3, RF models show higher R² than NNs for all RCPs, and the slightly better performance in table 4 could be due to the choice of hyperparameters, random seeds, etc.
We therefore suggest removing the model comparison and focusing on the RF model.
2. The NRMSE in the model summary seems redundant, since another scale-free metric, R², is already provided. Additionally, NRMSE is highly sensitive to outliers, whereas R² is much less affected.
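The deflation effect mentioned in remark 2 can be illustrated with toy numbers; a minimal sketch, assuming range normalization of the RMSE (hypothetical data, not the manuscript's values):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 500)
y_hat = y + rng.normal(0.0, 0.5, 500)          # uniformly mediocre fit

def nrmse(y_true, y_pred):
    # RMSE normalized by the observed range (one common convention)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

nrmse_clean = nrmse(y, y_hat)

# Add a single extreme observation that the model happens to predict well:
# per-point errors are essentially unchanged, but the observed range in the
# denominator explodes, deflating the metric.
y_out, y_hat_out = y.copy(), y_hat.copy()
y_out[0], y_hat_out[0] = 50.0, 50.0
nrmse_outlier = nrmse(y_out, y_hat_out)

print(nrmse_clean, nrmse_outlier)   # the outlier deflates NRMSE severalfold
```

The fit quality is identical on 499 of 500 points, yet NRMSE drops by roughly an order of magnitude, which is the sense in which low NRMSE values can be misleading.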
3. The point that the emulator reproduces LPJ-GUESS outputs well is made rather strongly in section 4. For example, line 215: “… emulators were able to generalize to LPJ-GUESS outputs produced with climate projections not included in the training data without a significant decline in performance”. It is not clear what “significant” refers to, nor is this evident from table 3 and 4, which show low R² values for most responses. The next sentence highlights the low NRMSE values, which could be deflated due to outliers (see remark 2).
We suggest more careful wording regarding emulator performance, and putting it into a more applied context, i.e., by highlighting its efficiency in a specific task. While the emulator reproduces average outputs of LPJ-GUESS well (figures 2 and 4), it most probably cannot reproduce extreme outputs of the process model. An analysis of residuals can help verify that. The potential inaccuracies in predicting non-average responses should then be noted somewhere, as the emulator seems to be intended as a highly efficient proxy for the process model.
4. The attribution of importance to features using Shapley values in the way it is presented could be misleading in the presence of correlations. This is a property of all data-driven models trained on correlated data, which is why all measures of importance are affected by this to varying degrees (e.g., total information gain in random forests, coefficients in linear models, etc.). In our experience, climate and other data used to train process model emulators are highly correlated and have a major effect on explanations. This can scramble the importance ranking of correlated features and even flip their Shapley value sign [3].
Furthermore, the text does not mention the ranking method for the features, which makes it hard to compare with the SHAP plots. Provided the authors stick to Shapley values, having the rank number included in the feature names in the plot would help to understand the conclusions drawn in the text.
We recommend adding a correlation analysis, removing correlated features, and/or weakening the language and inferences drawn from them. In many cases, feature selection algorithms can help in removing correlated features.
Alternatively, global explanations of feature importance could be used to rank features or supplement the Shapley results, such as contribution to loss function, global information gain, permutation importance, etc. As mentioned above, such measures are also not robust against correlations, but they might warp the results in a less drastic way. For some ML models, they are directly incorporated into feature selection algorithms [4].
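The attribution instability under correlation can be seen even in a linear model; a self-contained toy sketch (hypothetical data, not from the study), where two nearly collinear features split credit arbitrarily between refits even though only one truly drives the response:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
b1s, b2s = [], []
for _ in range(200):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)    # correlation > 0.999 with x1
    y = x1 + 0.1 * rng.normal(size=n)      # only x1 truly drives y
    X = np.column_stack([x1, x2])
    (b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
    b1s.append(b1)
    b2s.append(b2)

b1s, b2s = np.array(b1s), np.array(b2s)
# Individual attributions swing wildly between refits, while their combined
# effect is tightly determined near 1 — importance rankings (and signs) for
# either feature alone are therefore unreliable.
print("std(b1) =", b1s.std(), "std(b2) =", b2s.std())
print("std(b1 + b2) =", (b1s + b2s).std())
```

Shapley values inherit the same ambiguity on correlated inputs, which is why a correlation screen before interpretation is worthwhile.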
Minor remarks
a. In the NRMSE equation (2), the term under the square root in the numerator should be divided by n.
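With that correction, equation (2) would read as follows (assuming the manuscript normalizes by the observed range; substitute the mean or standard deviation in the denominator if a different normalization is used):

```latex
\mathrm{NRMSE} = \frac{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}}{y_{\max} - y_{\min}}
```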
References
[1] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, “Deep Neural Networks and Tabular Data: A Survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 6, pp. 7499–7519, Jun. 2024, doi: 10.1109/TNNLS.2022.3229161.
[2] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999, doi: 10.1109/72.788640.
[3] K. Aas, M. Jullum, and A. Løland, “Explaining individual predictions when features are dependent: More accurate approximations to Shapley values,” Artif. Intell., vol. 298, p. 103502, Sep. 2021, doi: 10.1016/j.artint.2021.103502.
[4] "CatBoost Feature Selection." Accessed: Feb. 26, 2025. [Online]. Available: https://catboost.ai/docs/en/concepts/python-reference_catboost_select_features
Citation: https://doi.org/10.5194/egusphere-2024-4064-CC1
AC2: 'Reply on CC1', Carolina Natel, 25 Mar 2025
RC1: 'Comment on egusphere-2024-4064', Joe Melton, 18 Mar 2025
Natel and coauthors are interested in developing emulators to allow easier integration of ecosystem models (like LPJ-GUESS) in broader frameworks that couple multiple models together. LPJ-GUESS is sufficiently computationally expensive that an emulator could be valuable for the multimodel frameworks (in particular LandSyMM). They use two machine-learning-based approaches: random forests (RF) and neural networks (NN). Both emulators were trained using LPJ-GUESS outputs for some historical and future simulations. The emulators differed in both their performance and the main variables they were sensitive to, but both were much faster than LPJ-GUESS itself.
The paper is generally well written and easy to follow. The work falls well within GMD's area of interest. I think the work is suitable for publication but have several questions that I would like to see answered beforehand.
Main comments:
1. I liked that the authors used two different ML-based approaches in their emulators and then attempted to understand/interpret what each emulator was sensitive to. This is valuable information but it felt like only half the story. What was missing was what LPJ-GUESS is sensitive to. If the point of the emulators was to allow cheaper approximation of LPJ-GUESS (the 'model') then the most important thing is that the emulator is responding in the same manner and to the same variables as LPJ-GUESS. There are many plots showing trajectories of pools and fluxes for LPJ-GUESS and the emulators (e.g. Fig 2 and 4) but no similar plots showing sensitivity of LPJ-GUESS as there is of the emulators (e.g. Fig 6). I realize this is more challenging with LPJ-GUESS since it is inherently a completely different kind of model, but I struggle to understand how one can trust either emulator without knowing if it is actually mimicking the model's sensitivities (which under this circumstance has to be assumed to be perfect).
2. I am quite skeptical of the claim that (L 342) 'NNs excel at modeling continuous relationships, making them more capable of generalizing to unseen data, particularly when extrapolation is required'. My read of Muckley et al. (2023) does not support this contention. Muckley et al. test the performance of linear regressions and black-box RF and NN models. For the interpolation tasks, the linear model was poor, but when it came to extrapolation it could outperform the black-box models in some of the tests (~40%). The authors state that '{linear regressions}... may be desirable over complex algorithms in many extrapolation problems because of their superior interpretability...'. Lakshminarayanan et al. (2017) nicely demonstrate this for a toy example using a NN, whereby the NN extrapolates poorly (their Fig 1 - left panel shows the bounds of 5 NNs). They show that an uncertainty bound encompassing the true function can be created via an ensemble technique. So, getting to my main concern: given that NNs do not extrapolate well (same with decision-tree-based methods), how can we trust the NN/RF models when they are forced to extrapolate? The approach here has no way to place uncertainty bounds on the emulator results, so it can extrapolate (poorly) without warning to the user. I don't expect the authors to fix this problem right now, but I would like to see more discussion about this difficulty and how it could be addressed for emulators, as their use is becoming more common.
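The ensemble idea from Lakshminarayanan et al. (2017) can be sketched in a few lines; this toy uses bootstrap-refit polynomial regressors as stand-ins for the NN/RF emulators, and is an illustration of the principle, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = rng.uniform(-3, 3, 300)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=300)

# Train an ensemble of models on bootstrap resamples of the training data.
ensemble = []
for _ in range(20):
    idx = rng.integers(0, len(x_train), len(x_train))
    ensemble.append(np.polyfit(x_train[idx], y_train[idx], deg=5))

def predict(x):
    # Mean prediction plus the ensemble spread as an approximate
    # uncertainty bound.
    preds = np.array([np.polyval(c, x) for c in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)

_, spread_in = predict(np.array([0.0]))    # inside the training range
_, spread_out = predict(np.array([6.0]))   # extrapolation

# The spread blows up outside the training domain, flagging to the user
# that the prediction there should not be trusted.
print("spread in-domain:", spread_in[0], "| extrapolating:", spread_out[0])
```

The same bookkeeping (multiple retrained emulators, spread as a warning flag) would apply to the RF/NN emulators, at a roughly linear cost in training time.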
Minor Comments:
Supplement - Can you explain more about the disturbance interval of 100 years and how that is applied? Also with fire off, the disturbance is then what? I see land use change is also not used.
L 84 - This sentence is a bit confusing; it is unclear what the features were selected for.
L 112 - I wonder about the influence of this post-processing step for non-negative C stocks. How often did this come up? Were the instances where this came up for regions with very low stocks such that the inaccuracy would be small? (e.g. true value is 0.1 kg C/m2, so going negative is fairly reasonable but if it was really supposed to be 10 kg C/m2 then that is a big problem). This also demonstrates a problem with using an off-the-shelf NN whereby it has no knowledge of boundaries that one like a physics-informed ML model could.
L 137 - sampled by grid cell, time, or ?
L 153 - with the meteorology/climate of MPI..., not the actual climate model itself.
L 155 - Does LPJ-GUESS not have a dependence for computational cost on the number of PFTs present or the soil permeable depth?
L 158 - If it takes 5000 sims to train the emulator but actually running the model that uses the emulator only happens 1000 times then you may end up with no net benefit. Also, sorry if I missed it, how many simulations did you need to train the emulator?
L 207 - Could you give the real values in addition to the percentages? It would be nice to see how much these cost in clock time (acknowledging it is system dependent).
L 215 - I think the lack of decline in performance is simply due to training with the most extreme ends of the scenarios. This ensured that you were interpolating as much as possible. This is likely the only reasonable approach, given that these techniques do not extrapolate well (see one of my main comments). But it means that the emulator always requires retraining for new scenarios, and the scenarios always need to be more extreme than what the actual system should realistically experience. I think some aspects of this bear mentioning.
L 218 - 'greatest accuracy' - by NN? Unclear as written.
Fig 2 and 4 - What about adding new plots presenting these and the fluxes as cumulative plots, so the impact of over/under-predicting over time is visible? The fluxes are important as they affect how much C the land surface takes up/releases; the stocks, as they change how much C is emitted during disturbance or land use change. A cumulative plot can show the effect across the simulated period.
Lakshminarayanan, B., Pritzel, A., and Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles, arXiv [stat.ML], 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, https://arxiv.org/pdf/1612.01474
Citation: https://doi.org/10.5194/egusphere-2024-4064-RC1
AC3: 'Reply on RC1', Carolina Natel, 09 Apr 2025
CEC1: 'Comment on egusphere-2024-4064', Juan Antonio Añel, 21 Mar 2025
Dear authors,
A couple of remarks regarding compliance with the code and data policy of the journal. First, from the text in the manuscript it is my understanding that the modifications to the LPJ-GUESS code are only available through GitHub. If this is the case, we cannot accept it. GitHub is not valid as a long-term repository for scientific purposes, and GitHub itself recommends using Zenodo for this, providing an integration that allows GitHub repositories to be migrated to Zenodo. Please clarify this situation, store your modifications in one of the suitable repositories according to the policy of the journal, reply to this comment with the new link and DOI, and modify the text of the manuscript accordingly.
Also, for ISIMIP3b you cite the ISIMIP site. We cannot accept this, as again it is not a suitable repository. I have seen that a Zenodo repository exists for different ISIMIP versions: https://zenodo.org/records/4686991. Therefore, you should modify the text with a valid Zenodo repository that contains the referred data, including the DOI and not only the inline citation.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-4064-CEC1
AC1: 'Reply on CEC1', Carolina Natel, 21 Mar 2025
Dear Juan A. Añel,
Thank you for your feedback and for pointing out the necessary adjustments.
In response to your suggestion, we have now published the LPJ-GUESS source code, including code modifications, on Zenodo at https://zenodo.org/records/15065248. We will revise the text of the Code and Data Availability section to reflect this change.
Regarding the citation of the ISIMIP3b dataset, we would like to clarify that the repository cited in the manuscript is the official source of the dataset, and is publicly accessible via the DOI: 10.48364/ISIMIP.842396.1.
The Zenodo link you provided refers to a code repository related to bias correction of the dataset, and does not host the dataset itself. While we would also like to provide a Zenodo repository and link for the dataset used, the size of the raw dataset we downloaded (~1.3 TB) exceeds the capacity of open repositories, including Zenodo.
However, we would like to reiterate that all pre-processed data inputs (including the ISIMIP climate data) and code necessary to reproduce this study, and to adapt the emulation approach to other climate data (e.g. future ISIMIP versions) are already available and cited in the manuscript via https://zenodo.org/records/14230951 and https://zenodo.org/records/14231373.
We will also revise the text of the manuscript to include the DOI for the ISIMIP3b dataset repository:
"The ISIMIP3b bias-adjusted atmospheric climate data used in our simulations are publicly available via 10.48364/ISIMIP.842396.1 under the CC0 1.0 Universal Public Domain Dedication".
We hope that these revisions address your concerns, but please let us know if any additional changes are needed. We greatly appreciate your time and effort in reviewing our manuscript and look forward to your feedback.
Best regards,
Carolina, on behalf of all co-authors.
Citation: https://doi.org/10.5194/egusphere-2024-4064-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 21 Mar 2025
Dear authors,
First, things like "the official repository" are irrelevant when assessing compliance with the policy of the journal. That a site is "official" does not make it a trusted repository for long-term archival of the data, and it is the obligation of the authors to provide the assets necessary to replicate the work they are trying to publish. We can understand the issue that you mention about the size of the dataset. However, your reasoning that no public repository can host your dataset is not true. For example, Zenodo limits each repository to 50 GB; however, there is no limit to the number of repositories that you can create. Therefore, in your case you could store the data in, for example, Zenodo, creating around 26 different repositories.
We appreciate your willingness to comply with the policy, and the quick reply to the concerns that we have raised. In this case I am open to accepting that you cannot easily migrate the data from the website that currently hosts it, and we appreciate that it has a DOI. Therefore, we can now consider your manuscript in compliance with the policy of the journal. However, it would be good if you took the time to store the data in a repository that better serves the purpose of its preservation and therefore the future replicability of your work.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-4064-CEC2
RC2: 'Comment on egusphere-2024-4064', Anonymous Referee #2, 22 Mar 2025
Carolina Natel et al. developed both a Random Forest model and a neural network model to emulate the dynamics of ecosystem carbon fluxes and carbon pool changes. These machine learning models have been widely applied to land models or components in the past and have consistently demonstrated effectiveness. Similarly, this study shows reasonable performance in emulating the target variables. The paper is well-written, with well-documented data and code. Overall, this is a solid modeling paper. Below, I have a few specific comments:
1. Interpretability vs. Physical Consistency
My primary concern is the balance between model interpretability and physical consistency. While SHAP has been used to interpret the ML models, it does not ensure that the emulators capture established physical knowledge embedded within the land model (in this case, LPJ-GUESS). I encourage the authors to further explore this aspect, as it is fundamentally important to understand the functional relationship emerging from ML emulators.
For example, in land models, atmospheric CO₂ concentration is a key driver of vegetation productivity, while temperature (T) strongly influences soil carbon stocks. Ideally, such first-order relationships should also be reflected in the trained ML emulators. One way to test this would be to leverage factorial LPJ-GUESS simulations with:
(a) Future SSP climate scenarios + historical atmospheric CO2 levels.
(b) Future SSP CO2 levels + historical climate conditions (e.g., repeated climate from 2010–2020).
If the trained ML emulators can reproduce the results of these factorial runs, it would provide strong evidence that the emulators have captured critical relationships between environmental drivers and carbon dynamics.
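The bookkeeping for this factorial check is simple; a sketch using a stand-in analytic emulator (the real check would call the trained RF/NN, and all names, columns, and numbers here are illustrative):

```python
import numpy as np

# Hypothetical emulator: response rises with CO2 and falls with warming.
# Input columns: [temperature, normalized CO2].
def emulator(X):
    return 0.8 * X[:, 1] - 0.3 * X[:, 0]

hist_climate = np.full(50, 14.0)             # repeated historical temperature
ssp_climate = np.linspace(14.0, 18.0, 50)    # warming trajectory
hist_co2 = np.full(50, 1.0)                  # historical CO2 held fixed
ssp_co2 = np.linspace(1.0, 2.0, 50)          # rising CO2 trajectory

co2_only = emulator(np.column_stack([hist_climate, ssp_co2]))   # run (b)
clim_only = emulator(np.column_stack([ssp_climate, hist_co2]))  # run (a)

# Single-factor effects, to be compared curve-by-curve against the same
# factorial LPJ-GUESS runs: agreement would indicate the emulator captured
# the first-order driver relationships.
co2_effect = co2_only - co2_only[0]
climate_effect = clim_only - clim_only[0]
print(co2_effect[-1], climate_effect[-1])
```

If the emulator's single-factor curves diverge from the process model's, that localizes which driver relationship the emulator failed to learn.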
2. Justification for Annual Time Step
Further justification is needed regarding the choice of an annual time step. Land models typically operate at much finer temporal resolutions (e.g., daily, hourly, or at least monthly). It would be helpful to explain why the annual scale was selected and how potential loss of information at shorter timescales may affect the emulator's performance.
3. Capturing Inter-Annual Variability
Given the focus on annual time steps, evaluating the emulator’s ability to capture inter-annual variability in carbon fluxes (in addition to long-term trends) would be an important validation metric. Although the training is performed at the grid cell level (random sampling), it may also be valuable to include spatially aggregated fluxes (e.g., global/regional totals) as part of the loss function. This could improve the model's ability to represent inter-annual variability at regional or global scales.
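The suggested composite objective could be sketched as below; the weight alpha, the shapes, and the area weighting are all illustrative assumptions, not anything from the manuscript:

```python
import numpy as np

def composite_loss(y_true, y_pred, area_weights, alpha=0.5):
    """Per-grid-cell MSE plus an MSE term on the spatially aggregated flux,
    so the emulator is also penalized for missing inter-annual variability
    of the regional/global total.
    y_true, y_pred: (n_years, n_cells); area_weights: (n_cells,)."""
    cell_mse = np.mean((y_true - y_pred) ** 2)
    global_true = (y_true * area_weights).sum(axis=1)   # total per year
    global_pred = (y_pred * area_weights).sum(axis=1)
    global_mse = np.mean((global_true - global_pred) ** 2)
    return cell_mse + alpha * global_mse

rng = np.random.default_rng(3)
y_true = rng.normal(size=(10, 100))
w = np.full(100, 1.0 / 100)
loss_same = composite_loss(y_true, y_true, w)                               # 0
loss_noisy = composite_loss(y_true, y_true + 0.1 * rng.normal(size=(10, 100)), w)
print(loss_same, loss_noisy)
```

For the NN this drops in directly as a training loss; for the RF it could still serve as a model-selection criterion across hyperparameter settings.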
4. Treatment of Disturbance Intensity
The results show disturbance as one of the most important features. However, it remains unclear how disturbance intensity (e.g., fractional area burned by wildfire, or land-use/land-cover change) is handled. How does the ML model represent partially disturbed grid cells? Additional clarification on this point would be needed.
5. Recommendations for future work: Since land models simulate continuous, time-dependent changes in carbon fluxes and pools, it may be worthwhile to explore time-series ML models (e.g., RNNs, LSTMs, or Transformers) in future work. Such models could potentially outperform static models like Random Forests and ANNs by better capturing temporal dependencies and dynamics.
Citation: https://doi.org/10.5194/egusphere-2024-4064-RC2
AC4: 'Reply on RC2', Carolina Natel, 09 Apr 2025