ML-IAM v1.0: Emulating Integrated Assessment Models With Machine Learning
Abstract. Integrated Assessment Models (IAMs) are essential tools for projecting future environmental variables under diverse environmental, economic, and technological scenarios. However, their computational intensity limits accessibility and application scope. We present ML-IAM v1.0, the first machine learning emulator trained on the IPCC AR6 Scenarios Database to replicate IAM functionality across diverse model families. Our best-performing model, XGBoost, achieves an R² of 0.97 against original IAM data, outperforming the more complex models Long Short-Term Memory (LSTM) and Temporal Fusion Transformer (TFT). ML-IAM v1.0 generates results for 2,000 scenarios in 50 seconds and can produce predictions for any IAM family. This enables rapid exploration of climate scenarios, complementing traditional IAMs with efficient, scalable computation.
Status: open (until 06 Mar 2026)
- RC1: 'Comment on egusphere-2025-5305', Anonymous Referee #1, 23 Jan 2026
Summary
This paper presents ML-IAM v1.0, a machine learning emulator capable of replicating the outputs of diverse Integrated Assessment Models (IAMs). The authors compare three architectures and show that the tree-based XGBoost model achieves high accuracy while reducing computational runtimes from hours to seconds. The manuscript makes a solid contribution to the literature: it addresses a critical computational bottleneck in the field and provides a practical, open-source tool that facilitates rapid scenario exploration. The manuscript is well written and well structured, and it takes a sound approach to separating exogenous and endogenous variables. Furthermore, the paper shares both data and code and provides extensive implementation details, improving reproducibility.
However, while the tool itself is valuable for the modeling community, the machine learning methodology used to benchmark the models is weak. The conclusion that tree-based models inherently outperform transformers and other architectures for this task is not fully supported by the experimental setup: the chosen configuration handicaps the deep learning baselines through a restrictive hyperparameter search and simplistic imputation. The "failures" of the deep learning models should therefore be framed as a result of specific configuration choices rather than an intrinsic incompatibility with IAM data.
Strengths
The primary strength of this work is its practical utility. Unlike previous emulators that focused on single models, this work successfully learns from 95 model families. The resulting tool allows fast, model-agnostic scenario generation, which can be important for researchers in the field. The inclusion of an interactive Emulation Viewer enhances the accessibility of the results. Additionally, the paper is highly reproducible: the separation of historical data from projections is handled rigorously, preventing the data-leakage issues that often plague time-series emulation papers. Finally, the paper is clearly written and follows a coherent structure.
Methodological Weaknesses and Limitations
Machine Learning Benchmarking
Despite the success of the XGBoost implementation, the machine learning benchmarking requires critical contextualization. From a machine learning perspective, the baselines are limited and somewhat arbitrary. Specifically, the hyperparameter search space for the Temporal Fusion Transformer (TFT) listed in Table C1 is restrictive, exploring layer dimensions of only [16, 32, 64]; these are arguably too small to capture complex dependencies in a transformer architecture. Given the dataset characteristics, the transformer comparison offers limited generalizable insight for the ML community. Additionally, the TFT received only 20 search iterations compared to 50 for XGBoost, and was tuned on a single validation split rather than with 5-fold cross-validation. Each model's hyperparameter sweep also minimizes a different metric. Consequently, the poor performance of the TFT is likely an artifact of this inconsistent configuration and limited search rather than evidence that transformers are unsuitable for IAM emulation. The manuscript should avoid broad claims that tree-based models outperform deep learning approaches in general and instead clarify that they outperformed deep learning as configured in this specific study. Non-deep-learning algorithms could also be added as baselines. I advise the authors to state explicitly that this comparison is illustrative rather than exhaustive and that, while XGBoost provides good results, significant additional research is needed to understand how ML can best be applied to IAM emulation and interpolation. A like-for-like tuning protocol, sketched below, would make the comparison more defensible.
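To illustrate what I mean by a like-for-like protocol, here is a minimal sketch in Python. It is not the authors' setup: the estimators, search spaces, and data are placeholders (an MLP stands in for the deep baseline, since the TFT is not scikit-learn-compatible), but both model classes receive the same 50-iteration budget, the same 5-fold cross-validation, and the same metric.

```python
# Hypothetical sketch: a shared search budget, CV protocol, and metric for
# both model classes. Estimators, search spaces, and data are placeholders.
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 12)), rng.normal(size=500)  # stand-in data

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shared CV protocol
budget, metric = 50, "neg_root_mean_squared_error"    # shared budget/metric

searches = {
    "xgboost": RandomizedSearchCV(
        XGBRegressor(),
        {"max_depth": randint(3, 12),
         "learning_rate": loguniform(1e-3, 0.3),
         "n_estimators": randint(100, 1500)},
        n_iter=budget, cv=cv, scoring=metric, random_state=0),
    "neural_net": RandomizedSearchCV(
        MLPRegressor(max_iter=2000),
        # Layer widths beyond [16, 32, 64], so capacity is not capped a priori.
        {"hidden_layer_sizes": [(64,), (128,), (256,), (256, 128)],
         "learning_rate_init": loguniform(1e-4, 1e-2)},
        n_iter=budget, cv=cv, scoring=metric, random_state=0),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, round(-search.best_score_, 3), search.best_params_)
```

Under such a protocol, any remaining performance gap would be far more informative about the architectures themselves.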
Furthermore, the data imputation strategy introduces a potential bias against the neural network models. While XGBoost handles sparsity natively, the authors employed variable-specific median imputation for the LSTM and TFT models. IAM scenarios may rely on distinct, internally consistent narratives in which variables deviate intentionally from the median; median imputation likely suppresses the signal in these scenarios, disproportionately penalizing the neural networks and further weakening the claim that transformers are ill-suited for this task. This limitation should be explicitly acknowledged in the text as a confounding factor in the model comparison. Scenario-aware imputation techniques should also be studied, as sketched below.
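For concreteness, a sketch of two scenario-aware alternatives, assuming a long-format table with hypothetical scenario/year/emissions columns (the names and toy values are invented for illustration):

```python
# Hypothetical sketch: two imputation strategies that preserve more
# scenario-level structure than a global per-variable median.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "scenario": ["s1"] * 5 + ["s2"] * 5,
    "year": list(range(2020, 2070, 10)) * 2,
    "emissions": [10.0, np.nan, 6.0, np.nan, 2.0,
                  40.0, 38.0, np.nan, 30.0, 26.0],
})

# (a) Within-scenario interpolation: gaps are filled from the same
# narrative's trajectory rather than from the dataset-wide median.
df["emissions_interp"] = (
    df.groupby("scenario")["emissions"]
      .transform(lambda s: s.interpolate(limit_direction="both"))
)

# (b) Multivariate, model-based imputation using correlated columns.
cols = ["year", "emissions"]
df["emissions_mice"] = IterativeImputer(random_state=0).fit_transform(df[cols])[:, 1]
print(df)
```

Even reporting how sensitive the neural baselines are to the imputation choice would help separate architecture effects from preprocessing effects.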
Clarification of "Emulation" Scope
The paper positions ML-IAM as an "emulator" of IAMs, but the authors should more carefully distinguish what the model can and cannot do. Since ML-IAM is trained on existing IAM outputs, it is essentially interpolating within the space of scenarios already generated by the IAM community. The critical question is whether ML-IAM can generate meaningful predictions for scenario configurations that fall outside the training distribution, for example novel policy combinations, extreme GDP trajectories, or technology-cost assumptions not represented in the training data. The paper would benefit from explicit out-of-distribution testing to characterize generalization limits (for instance, a leave-one-model-family-out evaluation, as sketched below), or at least a clear statement that the emulator's validity is bounded by the scenario space present in the training data. Relatedly, the claim that the method generates predictions "for any IAM family" requires clarification: it can only produce predictions styled after IAM families present in the training data.
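A minimal sketch of such a test, assuming scenario metadata records the source model family (the family names, features, and targets here are stand-ins, not the paper's data):

```python
# Hypothetical sketch: hold out one model family at a time to measure how
# performance degrades outside the training distribution.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=600)
families = rng.choice(["family_A", "family_B", "family_C"], size=600)

for train, test in LeaveOneGroupOut().split(X, y, groups=families):
    model = XGBRegressor().fit(X[train], y[train])
    held_out = families[test][0]
    r2 = r2_score(y[test], model.predict(X[test]))
    print(f"held-out {held_out}: R2 = {r2:.3f}")
```

A sharp drop in held-out-family performance would quantify exactly where the interpolation claim ends.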
Interpretability Analysis
The analysis using SHAP values is somewhat over-interpreted. While SHAP is a useful tool for feature attribution, it is not a complete Explainable AI (XAI) framework, and attributions for deep networks can be unstable. The discussion implies causal relationships, but these may simply be correlations identified by the XAI tool. The analysis also raises the question of what the model is actually learning: underlying climate-economy dynamics, or pattern-matching based on which IAMs report which variables? I advise a more cautious interpretation of the SHAP values, particularly those of the deep learning methods. Complementary approaches such as sensitivity analysis or partial dependence plots (sketched below) would provide more robust insight into what these models are learning. At minimum, I request that the final manuscript acknowledge the limitations of SHAP more candidly.
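As an example of what a complementary check could look like, a sketch of one-dimensional partial dependence; the model, features, and data are toy placeholders, and the same call applies to any fitted scikit-learn-compatible regressor:

```python
# Hypothetical sketch: partial dependence as a complementary, more stable
# view of a fitted model's input-output behavior than per-sample attributions.
import numpy as np
from sklearn.inspection import partial_dependence
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

model = XGBRegressor().fit(X, y)
for f in (0, 1):
    res = partial_dependence(model, X, features=[f], kind="average")
    curve = res["average"][0]
    print(f"feature {f}: partial-dependence range = {curve.max() - curve.min():.2f}")
```

Agreement between SHAP rankings and such averaged response curves would lend the interpretability claims considerably more weight.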
Physical and Structural Limitations
Physical Consistency
The evaluation relies primarily on correlation and RMSE metrics. However, IAM outputs must satisfy physical and economic constraints (e.g., energy balance, non-negative emissions for certain sectors, plausible relationships between GDP growth and energy demand). The manuscript does not report whether ML-IAM predictions satisfy basic physical constraints, whether there are scenarios where the emulator produces physically implausible outputs, or how predictions behave at trajectory endpoints where extrapolation errors may compound. I recommend adding an analysis of physical consistency, perhaps including examples of failure cases; even a simple post-hoc audit of the kind sketched below would be informative. Physics-informed approaches (PINNs) are mentioned as future work, but an explicit analysis in the present paper would strengthen it.
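A sketch of such a post-hoc audit; the variable names, balance relation, and tolerance are invented for illustration, and the paper would need its own constraint set:

```python
# Hypothetical sketch: count emulator predictions that violate simple
# physical/economic constraints.
import numpy as np

def check_consistency(pred, tol=1e-2):
    """Count predictions violating simple physical/economic constraints."""
    issues = {}
    # Non-negativity for quantities that cannot go below zero.
    issues["negative_primary_energy"] = int(np.sum(pred["primary_energy"] < 0))
    # Balance: final energy should not exceed primary energy.
    gap = pred["final_energy"] - pred["primary_energy"]
    issues["final_exceeds_primary"] = int(np.sum(gap > tol))
    return issues

pred = {  # stand-in emulator output for four scenario-years
    "primary_energy": np.array([500.0, 480.0, 450.0, -3.0]),
    "final_energy": np.array([350.0, 495.0, 320.0, -10.0]),
}
print(check_consistency(pred))
# {'negative_primary_energy': 1, 'final_exceeds_primary': 1}
```

Reporting violation rates across the test set, and especially near trajectory endpoints, would directly address the plausibility question.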
Regional Independence
The decision to treat regions independently creates a model that ignores inter-regional interactions such as trade flows, carbon leakage, and energy-market equilibrium. While this assumption may be necessary for computational tractability, it significantly limits the emulator's validity and should be stated prominently as a structural caveat.
Uncertainty Quantification
The paper notes that ML-IAM enables uncertainty quantification via Monte Carlo sampling (line 267), but the emulator's own uncertainty is underexplored: ML-IAM provides point predictions without uncertainty estimates. For policy-relevant applications, users need to understand prediction confidence and the separation between epistemic and aleatoric uncertainty. The authors should at least discuss how uncertainty could be incorporated in future work (e.g., quantile regression, ensemble approaches, or Bayesian methods); a minimal quantile-regression sketch is given below.
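A sketch of the quantile-regression route; the data are random placeholders, and any quantile-capable learner could stand in for the emulator's backbone:

```python
# Hypothetical sketch: prediction intervals via quantile regression, one of
# the suggested routes to uncertainty estimates.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

# One model per quantile: lower bound, median, upper bound.
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
          for q in (0.05, 0.5, 0.95)}

X_new = np.array([[2.0], [5.0], [8.0]])
lo, med, hi = (models[q].predict(X_new) for q in (0.05, 0.5, 0.95))
for x, l, m, h in zip(X_new[:, 0], lo, med, hi):
    print(f"x={x:.1f}: median {m:+.2f}, 90% interval [{l:+.2f}, {h:+.2f}]")
```

Even such simple intervals would let users distinguish confident predictions from poorly constrained regions of the scenario space.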
Specific Comments and Technical Corrections
Recommendation: accept subject to minor revisions, focused on textual clarification of limitations. The requested revisions primarily involve removing general claims about the model comparison and explicitly stating the boundaries of the emulator's applicability.