the Creative Commons Attribution 4.0 License.
ML-IAM v1.0: Emulating Integrated Assessment Models With Machine Learning
Abstract. Integrated Assessment Models (IAMs) are essential tools for projecting future environmental variables under diverse environmental, economic, and technological scenarios. However, their computational intensity limits accessibility and application scope. We present ML-IAM v1.0, the first machine learning emulator trained on the IPCC AR6 Scenarios Database to replicate IAM functionality across diverse model families. Our best-performing model, XGBoost, achieves an R² of 0.97 against original IAM data, outperforming the more complex models Long Short-Term Memory (LSTM) and Temporal Fusion Transformer (TFT). ML-IAM v1.0 generates results for 2,000 scenarios in 50 seconds and can produce predictions for any IAM family. This enables rapid exploration of climate scenarios, complementing traditional IAMs with efficient, scalable computation.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-5305', Anonymous Referee #1, 23 Jan 2026
AC1: 'Response to RC1', Yen Shin, 06 May 2026
Summary Response
RC1 Summary: The ML benchmarking methodology is weak. The conclusion that tree-based models inherently outperform transformers is not fully supported.
We appreciate this important critique. We have substantially revised the manuscript to (1) expand the TFT hyperparameter search space, (2) revise all claims about model comparison generality, and (3) explicitly frame our ML comparison as illustrative rather than exhaustive. Detailed responses to each sub-point follow below.
Methodological Weaknesses and Limitations
Machine Learning Benchmarking
The hyperparameter search space for TFT is restrictive (layer dimensions only [16, 32, 64]). TFT received only 20 search iterations compared to 50 for XGBoost, with validation split rather than 5-fold cross-validation.
Thank you for this comment. We have substantially expanded the TFT hyperparameter search space. The revised configuration searches over hidden dimensions of {128, 256, 512}, LSTM layer counts of {1, 2, 3}, dropout rates of {0.1, 0.2, 0.3}, and learning rates of {0.001, 0.01}. The number of search iterations is now 50, approximately matching XGBoost (56 trials across three stages). The revised Table D1 reports all updated hyperparameter search spaces.
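The revised search space described above can be expressed compactly. The following is a minimal sketch of random sampling over that discrete space using only the standard library; the sampling loop and function name are illustrative, not the authors' actual tuning code (which may use a dedicated framework).

```python
import random

# Revised TFT search space as reported in the response (see Table D1).
TFT_SEARCH_SPACE = {
    "hidden_dim": [128, 256, 512],
    "lstm_layers": [1, 2, 3],
    "dropout": [0.1, 0.2, 0.3],
    "learning_rate": [0.001, 0.01],
}

def sample_configs(space, n_trials, seed=0):
    """Draw n_trials random configurations from a discrete search space."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]

# 50 trials, approximately matching XGBoost's 56 trials across three stages.
configs = sample_configs(TFT_SEARCH_SPACE, n_trials=50)
```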
The manuscript should avoid broad claims that tree-based models outperform DL approaches in general.
We agree with this comment. We have revised all such claims throughout the manuscript. For example, the abstract now states conditionally: "while the Temporal Fusion Transformer (TFT) underperforms under the configurations tested." In the Results section (Section 3.1), we now state: "this three-model comparison is illustrative rather than exhaustive, and should not be taken as evidence that any architecture class is inherently unsuitable for IAM emulation."
Non-DL algorithms could also be tested as baselines. The ML comparison should be described as illustrative rather than exhaustive.
We agree and have added explicit language in Section 3.1 acknowledging this — "More broadly, this three-model comparison is illustrative rather than exhaustive, and should not be taken as evidence that any architecture class is inherently unsuitable for IAM emulation."
[Additional Revision] Lag feature parity across architectures.
In the original submission, only XGBoost received explicit output lag features as inputs, while LSTM and TFT were given exogenous covariates and static categorical embeddings only, on the assumption that their architectural temporal mechanisms (recurrent state and attention, respectively) would adequately capture dependencies on past outputs. Prompted by the reviewer's emphasis on a fair architectural comparison, we tested whether providing the same explicit output lag features to LSTM and TFT would change the picture. Contrary to our initial expectation, this seemingly small change improved R² by 0.1–0.2 for both sequence models, as reported in the new ablation paragraph in Section 3.1. We tentatively interpret this as evidence that, in this short-horizon, covariate-heavy regime, the available exogenous inputs do not by themselves carry enough information to fully determine the outputs; explicit lag features then act as a strong proxy for unobserved IAM internal state, and this proxy is apparently more useful than what the architectures' implicit temporal mechanisms can extract on their own. As a consequence, XGBoost and LSTM now achieve nearly indistinguishable performance, and we have correspondingly revised the manuscript to frame the architectural comparison as illustrative rather than as evidence of intrinsic superiority of any single architecture. We thank the reviewer for prompting this experiment, which led to a more informative comparison and a finding we did not anticipate.
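To make the lag-feature parity concrete, the sketch below shows one way explicit output lags can be constructed per trajectory so that lags never leak across scenarios. The column and key names are illustrative placeholders, not the authors' actual schema.

```python
import pandas as pd

def add_output_lags(df, output_cols, n_lags=1):
    """Append lagged copies of each output column, computed within each
    (model_family, scenario, region) trajectory so that a lag never
    crosses scenario boundaries. Illustrative sketch only."""
    keys = ["model_family", "scenario", "region"]
    df = df.sort_values(keys + ["year"]).copy()
    for col in output_cols:
        for lag in range(1, n_lags + 1):
            df[f"{col}_lag{lag}"] = df.groupby(keys)[col].shift(lag)
    return df

# Toy example: one scenario, one output variable.
toy = pd.DataFrame({
    "model_family": ["A"] * 3, "scenario": ["s1"] * 3, "region": ["World"] * 3,
    "year": [2020, 2030, 2040], "co2": [40.0, 30.0, 10.0],
})
lagged = add_output_lags(toy, ["co2"])
```

The first timestep of each trajectory receives a missing lag value, which the pipeline's imputation step would then handle.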
Imputation Bias
Median imputation likely suppresses signal for neural networks, disproportionately penalizing them. This should be acknowledged as a confounding factor.
We agree and have acknowledged this explicitly in the revised Section 2.3.2: “median-filled values may introduce residual noise relative to XGBoost's native sparsity handling”.
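For readers unfamiliar with the preprocessing step in question, the sketch below illustrates median imputation paired with binary missingness indicator columns. It is a minimal, generic sketch under assumed column names, not the paper's actual pipeline code.

```python
import numpy as np
import pandas as pd

def median_impute_with_flags(X):
    """Fill NaNs with per-column medians and append 0/1 missingness
    indicator columns. Generic sketch of the approach discussed in
    Section 2.3.2; the real pipeline may differ in detail."""
    medians = X.median()                      # NaNs are skipped by default
    flags = X.isna().astype(int).add_suffix("_missing")
    return pd.concat([X.fillna(medians), flags], axis=1), medians

# Hypothetical feature matrix with gaps.
X = pd.DataFrame({"gdp": [1.0, np.nan, 3.0], "pop": [7.0, 8.0, np.nan]})
X_filled, medians = median_impute_with_flags(X)
```

The indicator columns let a model learn that a value was filled rather than observed, partially mitigating (but not eliminating) the confounding the reviewer describes.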
Clarification of "Emulation" Scope
The paper should distinguish what ML-IAM can and cannot do. Can it generate meaningful predictions outside the training distribution? The claim "for any IAM family" requires clarification.
This is an important point. We have clarified ML-IAM's scope in two places. In Section 2.3.1, we explain that "performance reflects generalization to new scenarios within known model families, regions, and scenario categories, rather than generalization to entirely novel ones." In Section 5, we add that "ML-IAM has not been validated for input values outside the ranges observed in the training data (e.g., extreme GDP trajectories, policy assumptions absent from the AR6 database), and predictions in such regimes should be treated with caution." We have also changed "for any IAM family" to "diverse IAM families" throughout.
Interpretability Analysis
SHAP analysis is over-interpreted. Attributions may reflect correlations, not causal relationships. Complementary approaches like sensitivity analysis or partial dependence plots would provide more robust insights.
We agree that our SHAP discussion should be more cautious. We have revised the interpretability sections in both Section 2.4 and Section 3.2 to tone down our arguments and added an explicit caveat in Section 2.4 that "SHAP values indicate feature associations rather than causal relationships, and DeepExplainer attributions for neural networks can vary with the choice of background samples; the patterns reported below should be interpreted as suggestive." We also moved detailed SHAP implementation information to the Appendix.
Physical Consistency
The manuscript does not report whether ML-IAM predictions satisfy basic physical constraints (e.g., energy balance, non-negative emissions).
Thank you for raising this. Physical consistency analysis is an important direction. In this paper, we focus on nine output variables (six energy sources and three GHG emissions) that do not form closed physical relationships such as energy balance, so this check is not directly applicable to the current output set. The ML-IAM framework itself readily extends to additional endogenous variables in the database (Section 2.1.1), and we view systematic physical consistency analysis on a broader output set as valuable future work outside this paper's scope. We are aware of ongoing efforts in this direction by other groups. The revised Conclusions now reflect this framing: "The current output set also does not form closed physical relationships such as energy balance, so systematic physical consistency analysis was outside this paper's scope. The framework readily extends to additional output variables, and applying such checks across a broader output set is valuable future work."
Regional Independence
Treating regions independently ignores inter-regional interactions such as trade flows and carbon leakage.
We agree with this comment and acknowledge this limitation in Section 5 (Conclusions), where we note that “Our current ML-IAM treats regions independently, omitting inter-regional interactions such as trade flows and grid connections. Incorporating regional connectivity through architectures like graph neural networks could address this gap (Kipf, 2016), which can learn how interconnected regions influence each other.” The current independent treatment is a simplification given the dataset composition (70% are aggregated regions rather than country-level regions).
Uncertainty Quantification
ML-IAM provides point predictions without uncertainty estimates. The authors should discuss how uncertainty could be incorporated.
We agree that incorporating uncertainty quantification is an important improvement for policy-relevant applications. We now discuss how it can be incorporated in future research in Section 5, noting that "ML-IAM does not yet provide uncertainty estimates alongside its point predictions, an important extension for policy-relevant applications" and that "quantifying prediction uncertainty—for example, by training multiple models, using the spread across model families as a proxy, or applying conformal prediction to obtain distribution-free prediction intervals—would be valuable for policy-relevant applications".
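Of the extensions listed, split conformal prediction is the most mechanical to add on top of an existing point predictor. The sketch below shows the generic recipe under assumed inputs; it is not code from the paper.

```python
import numpy as np

def conformal_interval(calib_residuals, y_pred, alpha=0.1):
    """Split conformal prediction: a finite-sample-corrected quantile of
    absolute residuals on a held-out calibration set yields
    distribution-free prediction intervals with ~(1 - alpha) coverage.
    Generic sketch of the extension discussed in Section 5."""
    n = len(calib_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(calib_residuals), level)
    return y_pred - q, y_pred + q

# Hypothetical emulator errors on a calibration split.
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.0, 500)
lo, hi = conformal_interval(residuals, y_pred=np.array([10.0]))
```

The appeal for policy-relevant use is that coverage holds without distributional assumptions on the emulator's errors, provided calibration and test scenarios are exchangeable.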
Specific Comments and Technical Corrections
Abstract/Conclusion phrasing: Modify statements claiming XGBoost outperforms complex models.
We agree. The abstract now reads: "Both XGBoost and Long Short-Term Memory (LSTM) achieve an R² of 0.97 against original IAM data, while the Temporal Fusion Transformer (TFT) underperforms under the configurations tested."
Imputation Discussion (Section 2.3.2): Acknowledge that median imputation might hinder signal for LSTM/TFT.
We agree. Refer to the comment on imputation bias above.
Train/Test Split Balance (Lines 174–176): Are there concerns about model family imbalance?
We have expanded the data splitting description in Section 2.3.1 to clarify the splitting strategy. We now explain that splitting is at the individual scenario level, meaning model families may appear across splits. We acknowledge that this means performance reflects generalization to new scenarios within known model families rather than to entirely novel ones. We further acknowledge that the AR6 database is not balanced across model families: some families (notably REMIND) contribute substantially more scenarios than others, so the model implicitly weights its learning toward better-represented families. The model family categorical identifier (Section 2.1.2) is intended to mitigate this by allowing the model to learn family-specific response patterns rather than averaging across them. Systematic leave-one-family-out evaluation to quantify how individual family removals affect predictions is left for future work.
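The scenario-level splitting strategy described above can be sketched as follows; record fields and the split fraction are illustrative, not the authors' exact implementation.

```python
import random

def split_by_scenario(records, test_frac=0.2, seed=42):
    """Split at the individual-scenario level, as in Section 2.3.1:
    all rows of a scenario land in the same split, while a model
    family may appear in both splits. Illustrative sketch."""
    scenarios = sorted({r["scenario"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(scenarios)
    n_test = max(1, int(len(scenarios) * test_frac))
    test_ids = set(scenarios[:n_test])
    train = [r for r in records if r["scenario"] not in test_ids]
    test = [r for r in records if r["scenario"] in test_ids]
    return train, test

# Toy records: 10 scenarios x 2 years, two model families.
records = [{"scenario": f"s{i}",
            "model_family": "REMIND" if i % 2 else "IMAGE",
            "year": y} for i in range(10) for y in (2030, 2050)]
train, test = split_by_scenario(records)
```

Because the split key is the scenario rather than the model family, test performance measures generalization to unseen scenarios within known families, exactly as stated in the response.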
Figure 3 Readability: Consider using a discrete color map for Year.
We appreciate this suggestion. However, the Year variable spans too many distinct values for a discrete color map to remain legible, so we retained the continuous color map to preserve temporal progression information. We did improve readability by enlarging the font size and placing a single legend on the far right of the figure (now Figure 3).
Figure 4 Uncertainty Representation: Define what the shading represents.
We have revised the Figure 4 caption to explicitly define the shading. The relevant portion now reads: "Solid lines show IAM outputs and dashed lines of matching color show emulator predictions; the shaded band between them equals the absolute prediction error at each year." The shaded region therefore represents the per-scenario, per-year emulation error rather than a confidence interval or uncertainty estimate.
Speculative Claims (Lines 274–277): The suggestion about learning specific dynamics across model families is speculative.
This is true. We have revised this passage to specify that incorporating scenario-level metadata as additional input features could be helpful, and frame it explicitly as a future extension rather than a current capability. The revised text now reads: "Future extensions could incorporate additional scenario-level metadata available in the AR6 database—such as whether COVID-19 recovery assumptions or rapid technology transition narratives were included—as explicit input features."
Reproducibility: Include computational requirements and use tools like CodeCarbon.
We have added a new figure (Figure 5) reporting computational requirements for each pipeline phase (hyperparameter search, training, inference) across all three architectures, and included GPU specifications in the figure caption. We agree that CodeCarbon would be useful for tracking energy use. However, since CodeCarbon measurements require re-running the pipeline and training for the current version is already complete, we will integrate it into the pipeline from the next version.
Citation: https://doi.org/10.5194/egusphere-2025-5305-AC1
CEC1: 'Comment on egusphere-2025-5305 - No compliance with the policy of the journal', Juan Antonio Añel, 11 Feb 2026
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived your data in a web page that does not comply with the requirements of the journal. Namely, for the AR6 data the Zenodo repository does not contain it, but links to an external site: "The data is available for download at the AR6 Scenario Explorer hosted by IIASA."
The GMD review process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends. Please, therefore, publish your code and data in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible. We cannot have manuscripts under discussion that do not comply with our policy.
The 'Code and Data Availability’ section must also be modified to cite the new repository locations, and corresponding references added to the bibliography.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-5305-CEC1
CC1: 'Reply on CEC1', Haewon McJeon, 13 Feb 2026
Dear Editor,
Thank you for bringing this issue to our attention. We have investigated the matter carefully.
The IPCC AR6 Scenario Database is not our own data; rather, it is third-party raw data we used to train our model. This database is published under the Creative Commons Attribution 4.0 International License by IIASA. However, IIASA's terms of use explicitly state that:
"... it is not permitted to republish (i.e., make available for download or otherwise distribute) a substantial portion (or the whole) of the scenario ensemble data without written permission from IIASA."
As our study uses the full AR6 Scenario Database, we are not permitted to re-host the data independently on Zenodo without written permission from the data owners, which would constitute a license violation.
While it is indeed unfortunate that the data is under restricted access at Zenodo, we would like to clarify that the AR6 Scenario Database is, in fact, fully accessible for direct download without registration requirements. Specifically, reviewers and readers can access and download the complete dataset by:
- Visiting: https://data.ece.iiasa.ac.at/ar6/#/downloads
- Selecting "login as guest" (no account registration required)
- Selecting “AR6_Scenarios_Database…” v1.1 files with the release year (email address needed to receive the download link)
Furthermore, the AR6 Scenario Database is a versioned, institutionally maintained dataset. The specific version used in our study (v1.1) is permanently identified and accessible at the above URL. We believe this constitutes a persistently accessible version of the data in the sense required by GMD policy, hosted by IIASA (International Institute for Applied Systems Analysis), a well-established international research institution with a more than 50-year history of providing long-term archival support.
The database is also formally cited via its Zenodo DOI (https://doi.org/10.5281/zenodo.7197970, Byers et al., 2022), which provides a persistent identifier for the exact version we used.
We propose to update the "Code and Data Availability" section as follows:
---
Code and data availability. The source code for ML-IAM v1.0 is permanently archived on Zenodo at https://doi.org/10.5281/zenodo.17390678 (Shin et al., 2025b). The supporting data files (base year mappings and input/output variable classifications) are archived separately at
https://doi.org/10.5281/zenodo.17390113 (Shin et al., 2025a). The code is also available on GitHub at https://github.com/YenShin1891/ml-iam.
The IPCC AR6 Scenario Database (Byers et al., 2022) is available for direct download at https://data.ece.iiasa.ac.at/ar6/#/downloads (requires email address; accessible with or without account creation). The Zenodo record at https://doi.org/10.5281/zenodo.7197970 (restricted access) provides the formal citation and DOI.
---
We believe this updated section, together with the explanation above, demonstrates that the AR6 Scenario Database is accessible to reviewers and the community in a manner consistent with GMD policy.
Please let us know if further clarification is needed.
Best regards,
Haewon McJeon
Associate Professor, Korea Advanced Institute of Science and Technology
Citation: https://doi.org/10.5194/egusphere-2025-5305-CC1
CEC2: 'Reply on CC1', Juan Antonio Añel, 13 Feb 2026
Dear authors,
First, I would like to point out that we cannot accept IIASA as a server to host the assets related to manuscripts submitted to GMD, as it is not a trusted repository for compliance with the scientific method. Namely, it does not appear to have a published policy for data preservation over many years or decades (some flexibility exists over the precise length of preservation, but the policy must exist). It is unfortunate that the authors of the mentioned dataset are not sharing it without restrictions, which compromises compliance with the scientific method, the provenance of materials, and the replicability of works based on them. However, as you are not the direct authors of it and cannot take any action to share the data, and you have already performed the work, we can consider the private Zenodo repository shared by IIASA as sufficient. Please remove any link to the IIASA site from your manuscript, as it does not serve the work's compliance with the scientific method, and keep only the Zenodo repository.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-5305-CEC2
CC2: 'Reply on CEC2', Haewon McJeon, 13 Feb 2026
Dear Editor,
Thank you for your understanding and for providing a path forward.
We will proceed as requested: removing all IIASA links from the manuscript and pointing exclusively to the Zenodo repository for the related assets.
Best regards,
Haewon McJeon
Citation: https://doi.org/10.5194/egusphere-2025-5305-CC2
RC2: 'Comment on egusphere-2025-5305', Anonymous Referee #2, 28 Feb 2026
Main comments
The paper presents an attempt to develop an emulator of Integrated Assessment Models (IAMs). As the authors wrote, while emulation is extensively used in climate science, its application to IAMs is relatively nascent. The paper is relevant, interesting, and has the potential to contribute to the literature; however, I have significant reservations and doubts.
IAM emulation is a rapidly evolving field, a topic currently being undertaken by several research groups. The authors failed to discuss Xiong et al. (2025), which appears to be the first systematic attempt to emulate multiple IAMs. The Xiong study relies on the ENGAGE dataset (Riahi et al., 2021), one of the richest IAM datasets included in the AR6 scenario database. The Xiong study directly addresses “the core challenge of predictive emulation across diverse IAM families” mentioned on Line 49.
I suggest that the authors integrate the Xiong study into the discussion (e.g., starting at Line 42) and provide a comparative analysis of their approach versus the Xiong approach, particularly regarding underlying data, methodology, target variables, performance, reliability, and potential applications. I also bring the authors’ attention to Xiong and Tanaka (2025), which further applies Xiong’s emIAM approach to scenario extension beyond 2100. Given these existing works, the authors must avoid overstating the novelty and articulate their unique contribution more clearly.
Another issue is that it is currently unclear to me how the authors’ ML-IAM can be effectively utilized in practice. The introduction focuses heavily on technical challenges and lacks scientific motivation. The authors claim several potential applications at the end of the paper. For example, “Researchers can now optimize for multiple targets simultaneously—such as achieving specific temperature goals while maximizing sustainable development outcomes—through grid searches across millions of parameter combinations that would be computationally prohibitive with traditional IAMs” (Line 268). On Line 274, the authors state “Future extensions could enable ML-IAM to learn specific dynamics from individual models—such as COVID-19 impacts or rapid technological transitions captured by some IAMs but not others—and propagate these patterns across model families, potentially enriching the scenario landscape beyond what any single IAM provides.” However, it is not obvious how the ML-IAM, given the input variables in Table A1, supports these goals.
IAMs typically produce a least-cost emission pathway for a given carbon budget by optimization. Can the proposed IAM emulator be used in the same way as the original IAMs? The authors are well positioned to demonstrate selected applications. I highly recommend including a few concrete examples of applications as a proof-of-concept. This would provide necessary evidence that the emulator works as intended, especially outside of the range of training data. Such demonstrations would enhance the case as the current work has only limited validations.
Finally, the paper is highly technical and does not seem to consider the journal's broad audience. Many ML terms are introduced with citations but lack conceptual explanations. I raise several such examples in my detailed comments, but my comments are far from exhaustive. I suggest the authors move highly technical specifications to the Appendix and provide clearer, intuitive explanations for non-specialists in the main text to ensure the work is accessible to the wider community.
Detailed comments
Line 27: With the exception of FaIR, the models mentioned here have a long history as “simple climate model” or “reduced-complexity climate model” (Romero-Prieto et al., 2026). They directly represent physical and biogeochemical processes, without relying on ML techniques. A clear distinction should be made between these physical emulators and the ML-based emulators discussed in this paper.
Line 42: See my major comments regarding the omission of relevant literature (Xiong et al., 2025).
Line 51: Also see my major comments on the need for scientific motivation in the introduction.
Line 65: This statement raises concerns about what the emulator actually captures, given that IAMs are highly diverse and behave differently. How does the emulator handle "inter-model spread" and "intra-model spread"?
Line 82: This work addresses different gases and regions, but not different sectors. Please provide a rationale for not including different sectors.
Line 85: I highly recommend moving Table A1 to the main paper, as input and output are essential information for describing the emulator.
Line 92: Please provide conceptual explanations for “mixed-effects modeling.”
Line 100: Harmonization typically influences near-term data (Gidden et al., 2019). Please explain how harmonization effects were treated in the emulator.
Line 114: The definition of these terms can be made much earlier in the paper, at the first instance of the appearance of these terms.
Line 122: Please provide conceptual explanations for “tabular regression.”
Line 124: Many IAMs (e.g., REMIND-MAgPIE and MESSAGE) are optimization-based and produce pathways based on constraints such as a carbon budget. Please describe how the IAM emulator treats these optimization targets.
Line 141: This sub-section is highly technical (see my major comment).
Line 157: See my comment above.
Line 180: See my comment above.
Line 198: See my comment above.
Line 211: For better readability, Figure 3 should be moved to the section where it is actually discussed (not Section 2.4).
Line 229: Please define the “original IAM projections” in the Figure 4 caption. Which models or model families are presented in the figure? I suggest showing results for marker IAMs, such as REMIND-MAgPIE, IMAGE, and MESSAGE to more clearly demonstrate reproducibility.
Line 229: The orange shaded zone in the left panel shows a large discrepancy between the original and reproduced pathways. If I understood correctly, the original scenario shows a zero-emission pathway without net negative emissions, while the reproduced scenario shows net negative emissions (or vice versa). The emulator seems to confuse these two types of pathways. I wonder how the ML-IAM distinguishes between these two fundamentally different pathways, which can occur under the same remaining carbon budget.
Line 243: Please define “missingness indicators.”
Line 266: See my major comment on the practical utility of the ML IAM.
References
Gidden, M. J., Riahi, K., Smith, S. J., Fujimori, S., Luderer, G., Kriegler, E., . . . Takahashi, K. (2019). Global emissions pathways under different socioeconomic scenarios for use in CMIP6: a dataset of harmonized emissions trajectories through the end of the century. Geosci. Model Dev., 12(4), 1443-1475. doi:10.5194/gmd-12-1443-2019
Riahi, K., Bertram, C., Huppmann, D., Rogelj, J., Bosetti, V., Cabardos, A.-M., . . . Zakeri, B. (2021). Cost and attainability of meeting stringent climate targets without overshoot. Nature Climate Change, 11(12), 1063-1069. doi:10.1038/s41558-021-01215-2
Romero-Prieto, A., Mathison, C., & Smith, C. (2026). Review of climate simulation by Simple Climate Models. Geosci. Model Dev., 19(1), 115-165. doi:10.5194/gmd-19-115-2026
Xiong, W., Tanaka, K., Ciais, P., Johansson, D. J. A., & Lehtveer, M. (2025). emIAM v1.0: an emulator for integrated assessment models using marginal abatement cost curves. Geosci. Model Dev., 18(5), 1575-1612. doi:10.5194/gmd-18-1575-2025
Xiong, W., & Tanaka, K. (2025). Extending Integrated Assessment Model scenarios until 2150 using an emulation approach. arXiv (preprint), 2512.06026 doi:10.48550/arXiv.2512.06026
Citation: https://doi.org/10.5194/egusphere-2025-5305-RC2
AC2: 'Response to RC2', Yen Shin, 06 May 2026
Main Comments
Missing Literature: Xiong et al. (2025) / emIAM
The authors failed to discuss Xiong et al. (2025), which appears to be the first systematic attempt to emulate multiple IAMs.
We thank the reviewer for bringing this important work to our attention. We have integrated both Xiong et al. (2025) and Xiong and Tanaka (2025) into the revised manuscript and added a new comparison table (Table 1) that systematically compares ML-IAM, emIAM, as well as Deep-IAM (Li et al., 2025) across key dimensions including underlying data, methodology, target variables, performance, and potential applications.
Practical Utility and Scientific Motivation
It is unclear how ML-IAM can be effectively utilized in practice. The introduction lacks scientific motivation.
Thanks for this comment. We have added a motivating sentence in the introduction: "A fast, cross-family emulator would enable systematic scenario exploration—such as large-ensemble sensitivity analysis and multi-objective policy screening—that remains computationally prohibitive with traditional IAMs."
For practical applications, we have revised the Discussion (Section 4) to clarify the concrete mechanism: users vary input assumptions via grid search and screen emulated outputs against policy-relevant criteria such as scenario categories (C1–C8).
Can the proposed emulator be used in the same way as the original IAMs? I recommend including concrete application examples.
Thanks for this recommendation. The proposed emulator is designed as a fast cross-family surrogate for systematic scenario exploration, and as such it complements the original IAMs but is not used in the same way as the original IAMs. We have included a few concrete application examples in the Discussion (Section 4): "controlled experiments varying single parameters while holding others constant, comprehensive sensitivity analysis across thousands of scenario variations, uncertainty quantification through Monte Carlo sampling, and interactive policy exploration with real-time feedback."
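The "controlled experiments varying single parameters while holding others constant" workflow can be sketched as a grid search over emulator inputs followed by a policy screen. The `emulate` function below is a hypothetical stand-in for the trained ML-IAM predictor, and all parameter names and thresholds are illustrative assumptions.

```python
from itertools import product

def emulate(inputs):
    """Hypothetical stand-in for the trained ML-IAM predictor; the real
    emulator maps full scenario inputs to nine output variables."""
    return 40.0 - 0.5 * inputs["carbon_price"] + 0.1 * inputs["gdp_growth"]

# Controlled experiment: sweep carbon price while holding GDP growth
# constant, then screen emulated emissions against a policy criterion.
grid = {"carbon_price": [0, 50, 100, 150], "gdp_growth": [2.0]}
runs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
results = [(r, emulate(r)) for r in runs]
compliant = [r for r, co2 in results if co2 <= 10.0]  # illustrative threshold
```

Because each emulator call is milliseconds rather than hours, the same loop scales to the thousands of scenario variations mentioned in Section 4, and Monte Carlo sampling of the inputs follows the same pattern.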
The paper is highly technical and does not consider the journal's broad audience.
We appreciate this comment, and we have made several changes to improve accessibility: moved highly technical specifications to the Appendix, added conceptual explanations for ML terminology (e.g., describing TFT's attention mechanism as "a method for dynamically identifying which inputs are most relevant at each prediction step"), and provided a new summary table of variables (Table 2) in the main text.
Detailed Comments
Line 27: Distinguish between physical emulators (MAGICC, FaIR, OSCAR) and ML-based emulators.
This is an important distinction. We now clarify this in Section 1 (Introduction): "It is important to distinguish two classes of emulators. Physical emulators [...] such as MAGICC (Meinshausen et al., 2011), FaIR (Smith et al., 2018), and OSCAR (Gasser et al., 2017) approximate the dynamics of more complex Earth system models through a small set of simplified physical equations with parameters calibrated against process-based simulations. ML-based emulators, by contrast, learn input-output mappings directly from data without imposing explicit physical equations [...]. ML-IAM belongs to the ML-based class, applied to the IAM domain rather than to physical climate variables."
Line 42: See major comments on Xiong et al.
We have addressed this in response above regarding emIAM integration and Table 1.
Line 51: Need for scientific motivation in introduction.
Agree. Added motivating sentences about systematic scenario exploration.
Line 65: How does the emulator handle inter-model and intra-model spread?
This is an important point. We added text in Section 2.1: "Specifically, inter-model spread, the variation in outputs across different IAM families for the same scenario, is captured through the model family categorical identifier (Section 2.1.2), while intra-model spread, the variation across scenarios within a single IAM family, is captured through the exogenous input variables that differ across scenarios."
Line 82: Rationale for not including different sectors.
We added: "We focus on aggregate-level variables rather than sector-disaggregated outputs (e.g., Emissions|CO2|Energy, Final Energy|Industry) because sectoral variables exhibit substantially lower reporting coverage across model families and inconsistent sector definitions between IAMs, which would introduce excessive missing data and ambiguity in cross-model training."
Line 85: Move Table A1 to main paper.
We agree that this information is essential. However, due to space constraints we have not moved the full Table A1 into the main text. Instead, we added a summary table (Table 2) in Section 2.1.1 that categorizes all input and output variables by role and category, and we direct readers to the complete variable list in Appendix Table A1.
Line 92: Provide conceptual explanations for "mixed-effects modeling."
Thanks for catching this. Since mixed-effects modeling is future work enabled by ML-IAM rather than an analysis we performed, retaining the brief mention in Section 2.1.2 (Methods) risked misleading readers. We have therefore removed it from Section 2.1.2 and consolidated the discussion, including the conceptual explanation requested, into Section 5 (Conclusions). The revised Section 5 now reads: "Mixed-effects modeling is a statistical framework that decomposes observed variation into fixed effects (systematic patterns shared across groups) and random effects (group-specific deviations). It is useful for IAM analysis because, with scenario inputs treated as fixed effects and model family as a random effect, this framework isolates scenario-driven signals shared across IAMs from individual model biases [...]."
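For concreteness, the fixed/random-effects decomposition described above can be sketched on synthetic data. This is a toy two-step estimator, not the mixed-effects machinery of a statistics package and not an analysis from the manuscript: within-family demeaning recovers the shared slope (the fixed effect), and per-family mean residuals recover the family-specific intercepts (the random effects).

```python
import numpy as np

# Toy illustration of the fixed/random-effects idea on synthetic data
# (illustrative only; not an analysis performed in the manuscript).
rng = np.random.default_rng(0)
n_families, n_scen = 4, 50
x = rng.uniform(0.0, 1.0, (n_families, n_scen))   # scenario input (fixed effect)
true_offsets = rng.normal(0.0, 2.0, n_families)   # family deviations (random effects)
y = 3.0 * x + true_offsets[:, None] + rng.normal(0.0, 0.1, x.shape)

# Step 1: demean within each family so family intercepts cancel,
# then estimate the shared slope by pooled least squares.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
slope = (xd * yd).sum() / (xd ** 2).sum()

# Step 2: attribute each family's mean residual to its random intercept.
est_offsets = (y - slope * x).mean(axis=1)
```

With scenario inputs as fixed effects and model family as a random effect, the estimated slope isolates the scenario-driven signal shared across families, while the estimated offsets capture family-specific biases.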
Line 100: How were harmonization effects treated?
The AR6 Scenarios Database contains original, unharmonized IAM outputs. Harmonization is a separate post-processing step that aligns IAM outputs to a common historical reference, primarily for downstream use as CMIP climate model inputs. Since ML-IAM aims to emulate IAM behavior itself, our models are trained on unharmonized outputs, preserving each IAM family's characteristic response patterns. If harmonization effects are desired, the procedure can be applied to ML-IAM predictions after emulation since it is not computationally intensive. We now discuss this choice in Section 2.1: "We use the database's original, unharmonized outputs [...]. Because ML-IAM aims to emulate IAM behavior directly, training on unharmonized outputs preserves each family's characteristic response patterns. Harmonization can still be applied to ML-IAM predictions afterward if desired, since the procedure itself is not computationally intensive."
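A minimal sketch of the post-hoc harmonization step mentioned above, under the simplest possible assumption (a constant offset aligning the base-year value to a historical reference; the ratio and convergence methods used in practice are not shown, and the numbers are invented):

```python
import numpy as np

# Toy offset harmonization applied *after* emulation, as described above.
# Values are illustrative, not from the AR6 Scenarios Database.
emulated = np.array([38.0, 40.0, 35.0, 30.0])  # e.g., GtCO2 for 2020-2050
historical_2020 = 37.0                          # assumed historical reference

offset = historical_2020 - emulated[0]          # align base year to reference
harmonized = emulated + offset                  # shift the whole trajectory
```

Because the operation is a cheap array transform, applying it to emulator output rather than baking it into training keeps each IAM family's characteristic response patterns intact.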
Line 114: Define exogenous/endogenous terms earlier.
We agree. Moved definitions to the first paragraph of the Methods section.
Line 122: Provide conceptual explanations for "tabular regression."
Thanks for pointing this out. We added: "i.e., predicting a target from a fixed set of input features at a single point in time, without sequential dependencies (Borisov et al., 2024)."
Line 124: How does the emulator treat optimization targets (carbon budgets) used by optimization-based IAMs?
ML-IAM currently does not distinguish optimization-based from recursive-dynamic IAMs. We were unable to address this because the AR6 Scenarios Database does not report carbon budgets or other optimization targets as standardized metadata across models; even if we wanted to include such inputs explicitly, the underlying data do not currently support it. The model family categorical input partially captures the differences between optimization-based and recursive-dynamic approaches by allowing the model to learn family-specific response patterns.
Lines 141, 157, 180, 198: Highly technical sub-sections.
Thanks for pointing this out. We have reorganized the Methods section to improve readability, moved detailed technical specifications to the Appendix, and added intuitive explanations for a broader readership.
Line 211: Move Figure 3 to the section where it is discussed.
We agree. Figure 3 has been moved to Section 3.1 (Performance Evaluation), where it is first introduced and discussed.
Line 229: Define "original IAM projections" in Figure 4 caption. Show results for marker IAMs.
We have implemented both suggestions in the revised Figure 4. The caption now defines "original IAM projections" as the unmodified outputs from the AR6 Scenarios Database used as the reference against which emulator predictions are compared. The figure further removes ambiguity by encoding the two trajectories with distinct line styles: solid lines for IAM outputs and dashed lines of matching color for the corresponding emulator predictions. The revised figure additionally overlays the SSP1-2.6 marker scenario (IMAGE 3.0.1) as a thick black line. This is the only SSP marker scenario falling within the C3 category (>67% chance of limiting warming to 2°C) shown in this panel, so emulator performance for the canonical 2°C SSP marker can be read directly against the IAM ensemble.
Line 229: Large discrepancy in the orange shaded zone — emulator confuses zero-emission vs. net-negative emission pathways.
We thank the reviewer for raising this. With scale-aware imputation and target interpolation prior to lag-feature construction in the revised setup, the discrepancy in the orange shaded zone is substantially reduced in the revised Figure 4. We would also note that small trajectory-end deviations of similar magnitude appear even in non-net-zero pathways such as C6 and above, as can be observed in our emulation viewer at https://mliam.dev. This reflects the autoregressive nature of the rollout, in which per-step errors accumulate over time. We agree with the reviewer that the inputs do not on their own fully determine whether a deeply mitigative pathway settles at net zero or crosses into net-negative territory; encoding BECCS deployment magnitude, non-CCS CDR pathways such as direct air capture and afforestation, and scenario-level narrative metadata as additional inputs would address this gap in future work.
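The error-accumulation mechanism invoked above can be shown with a deliberately simple sketch (toy linear dynamics and an invented coefficient bias, not the actual emulator): a one-step predictor fed its own outputs drifts further from the true trajectory over the early rollout steps.

```python
# Toy autoregressive rollout (illustrative only, not ML-IAM itself):
# a one-step emulator with a tiny coefficient bias, fed its own output,
# accumulates error over the rollout horizon.
true_coef, emul_coef = 0.90, 0.905   # assumed toy dynamics and bias
y_true = [100.0]
y_emul = [100.0]
for _ in range(30):
    y_true.append(true_coef * y_true[-1])
    y_emul.append(emul_coef * y_emul[-1])   # emulator sees its own prediction

errors = [abs(a - b) for a, b in zip(y_true, y_emul)]
```

Even a per-step bias of 0.5% compounds over the horizon, which is why trajectory-end deviations are the expected failure mode of autoregressive rollouts.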
Line 243: Define "missingness indicators."
We added the definition of this term where it first appears in Section 2.3.2: binary indicator variables that flag whether each input was originally reported or imputed.
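The construction defined above can be sketched in a few lines (toy array, not the actual feature pipeline): median-impute each input column and append a binary flag recording whether each value was originally reported or imputed.

```python
import numpy as np

# Sketch of missingness indicators on a toy input matrix
# (illustrative only; not the manuscript's preprocessing code).
X = np.array([[1.0, np.nan],
              [3.0, 5.0],
              [np.nan, 7.0]])

mask = np.isnan(X).astype(float)        # 1 = value was imputed
medians = np.nanmedian(X, axis=0)       # variable-specific medians
X_imp = np.where(np.isnan(X), medians, X)
X_aug = np.hstack([X_imp, mask])        # imputed features + indicators
```

The indicator columns let a model distinguish a genuinely reported median-like value from a filled-in placeholder.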
Line 266: See major comment on practical utility.
As addressed in our earlier response regarding practical utility, we have revised the Discussion (Section 4) to clarify the concrete mechanism: users vary input assumptions via grid search and screen the emulated outputs against policy-relevant criteria such as scenario categories (C1–C8).
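A minimal sketch of that grid-search-and-screen workflow follows. The `emulate` function, its inputs, and the 2 °C threshold are all hypothetical stand-ins, not the actual ML-IAM API:

```python
from itertools import product

# Hypothetical screening workflow (names and numbers are invented):
# enumerate input assumptions on a grid, emulate each combination,
# and keep those meeting a policy-relevant criterion.
def emulate(gdp_growth, carbon_price):
    # placeholder for a trained emulator's 2100-warming prediction (degC)
    return 3.0 - 0.005 * carbon_price - 0.1 * gdp_growth

grid = product([1.0, 2.0, 3.0], [0, 100, 200])   # % GDP growth x $/tCO2
kept = [(g, p) for g, p in grid if emulate(g, p) <= 2.0]  # e.g., 2 degC screen
```

Because each emulator call is cheap, a full factorial grid over input assumptions can be screened in seconds, which is the mechanism the revised Discussion describes.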
Citation: https://doi.org/10.5194/egusphere-2025-5305-AC2
- AC2: 'Response to RC2', Yen Shin, 06 May 2026
Summary
This paper presents ML-IAM v1.0, a machine learning emulator capable of replicating the outputs of diverse Integrated Assessment Models (IAMs). The authors compare three architectures and show that the tree-based XGBoost model achieves high accuracy while reducing computational runtimes from hours to seconds. This manuscript makes a solid contribution to the literature. It addresses a critical computational bottleneck in the field and provides a practical, open-source tool that facilitates rapid scenario exploration. The manuscript is well written, well structured, and uses a sound approach to separating exogenous and endogenous variables. Furthermore, the paper shares both data and code and provides extensive implementation details, improving reproducibility.
However, while the tool itself is valuable for the modeling community, the machine learning methodology used to benchmark the models is weak. The conclusion that tree-based models inherently outperform transformers and other models for this task is not fully supported by the experimental setup. The chosen experiments hinder the deep learning baselines through restrictive hyperparameters and simplistic imputation. The "failures" of the machine learning models should be framed as a result of specific configuration choices rather than an intrinsic incompatibility with IAM data.
Strengths
The primary strength of this work is its practical utility. Unlike previous emulators that focused on single models, this work successfully learns from 95 model families. The resulting tool allows for fast, model-agnostic scenario generation, which can be valuable for researchers in the field. The inclusion of an interactive Emulation Viewer enhances the accessibility of the results. Additionally, the paper is highly reproducible. The separation of historical data from projections is handled rigorously, preventing the data leakage issues that often plague time-series emulation papers. Finally, the paper is very clearly written and follows a coherent structure.
Methodological Weaknesses and Limitations
Machine Learning Benchmarking
Despite the success of the XGBoost implementation, the machine learning benchmarking requires critical contextualization. From a machine learning perspective, the baselines are very limited and arbitrary. Specifically, the hyperparameter search space for the Temporal Fusion Transformer (TFT) listed in Table C1 is restrictive, exploring layer dimensions of only [16, 32, 64]. These dimensions are arguably too small to capture complex dependencies in a transformer architecture. Given the dataset characteristics, the transformer comparison offers limited generalizable insights for the ML community. Additionally, TFT received only 20 search iterations compared to 50 for XGBoost, and was tuned with a single validation split rather than 5-fold cross-validation. Each model's hyperparameter sweep is different and minimizes a different metric. Consequently, the poor performance of the TFT is likely an artifact of this inconsistent configuration and limited search rather than any evidence that transformers are not suitable for IAM emulation. The manuscript should avoid broad claims that tree-based models outperform DL approaches in general, and instead clarify that they outperformed DL as configured in this specific study. In addition, non-DL algorithms could also be tested as baselines. I advise the authors to explicitly mention that this ML comparison is more illustrative than exhaustive and that, while XGBoost provides good results, significant additional research is needed to fully understand how ML can best be applied for IAM emulation and interpolation.
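The evenly configured protocol called for above can be sketched schematically. The scoring functions below are dummies standing in for real model training, and every name and number is invented; the point is only that each candidate gets the same iteration budget, the same folds, and the same objective:

```python
import random

# Schematic of a fairness-controlled hyperparameter search: identical
# budget, folds, and objective for every candidate model. The RMSE
# functions are dummies, not real model training.
random.seed(0)
FOLDS = range(5)   # shared validation folds
N_ITER = 50        # shared search budget

def tree_rmse(params, fold):   # stand-in for a tree model's fold RMSE
    return 0.01 * (params["depth"] - 6) ** 2 + 0.10 + 0.01 * fold

def net_rmse(params, fold):    # stand-in for a network's fold RMSE
    return 1e-5 * (params["width"] - 128) ** 2 + 0.12 + 0.01 * fold

def search(score_fn, sample_params):
    def cv(params):            # same cross-validated objective for all models
        return sum(score_fn(params, f) for f in FOLDS) / len(FOLDS)
    return min(cv(sample_params()) for _ in range(N_ITER))

best_tree = search(tree_rmse, lambda: {"depth": random.randint(2, 12)})
best_net = search(net_rmse, lambda: {"width": random.choice([16, 32, 64, 128, 256])})
```

Under such a protocol, any remaining performance gap can be attributed to the models rather than to asymmetries in the tuning procedure.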
Furthermore, the data imputation strategy introduces a potential bias against the neural network models. While XGBoost handles sparsity natively, the authors employed variable-specific median imputation for the LSTM and TFT models. IAM scenarios may rely on distinct, internally consistent narratives where variables deviate intentionally from the median. Using median imputation likely suppresses the signal in these scenarios, disproportionately penalizing the neural networks, further weakening the claim that transformers are ill-suited for this task. This limitation should be explicitly acknowledged in the text as a confounding factor in the model comparison. Additional imputation techniques should have been studied.
Clarification of "Emulation" Scope
The paper positions ML-IAM as an "emulator" of IAMs, but the authors should more carefully distinguish what the model can and cannot do. Since ML-IAM is trained on existing IAM outputs, it is essentially interpolating within the space of scenarios already generated by the IAM community. The critical question is whether ML-IAM can generate meaningful predictions for scenario configurations that fall outside the training distribution. For example: novel policy combinations, extreme GDP trajectories, or technology cost assumptions not represented in the training data. The paper would benefit from explicit out-of-distribution testing to characterize generalization limits, or, at least, a clear statement that the emulator's validity is bounded by the scenario space present in the training data. Relatedly, the claim that the method generates predictions "for any IAM family" requires clarification: it can only produce predictions styled after IAM families present in the training data.
Interpretability Analysis
The analysis using SHAP values is somewhat over-interpreted. While SHAP is a useful tool for feature attribution, it is not a complete Explainable AI (XAI) framework, and attributions for deep networks can be unstable. The discussion implies causal relationships, but these may simply be correlations identified by the XAI tool. This analysis raises questions about what the model is actually learning: underlying climate-economy dynamics, or pattern-matching based on which IAMs report which variables? I advise a more cautious interpretation of SHAP values, particularly those of deep learning methods. Complementary approaches like sensitivity analysis or partial dependence plots would provide more robust insights into what these models are learning. I request that the final manuscript at least acknowledge the limitations of using SHAP more candidly.
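The partial-dependence complement suggested above amounts to a short computation (the toy model and data here are invented for illustration): sweep one feature over a grid while holding every other feature at its observed values, and average the predictions.

```python
import numpy as np

# One-dimensional partial dependence on a toy model (illustrative only;
# not the manuscript's emulator or its actual features).
def model(X):                        # stand-in emulator: y = 2*x0 + x1^2
    return 2 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

def partial_dependence(model, X, feature, grid):
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v           # fix the swept feature everywhere
        pd.append(model(Xv).mean())  # average over the observed data
    return np.array(pd)

grid = np.linspace(-2.0, 2.0, 5)
pd0 = partial_dependence(model, X, 0, grid)   # linear in x0, slope 2
```

Unlike a per-prediction attribution, the resulting curve summarizes the model's average response to one input, which makes it a useful cross-check on SHAP-based claims.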
Physical and Structural Limitations
Physical Consistency
The evaluation relies primarily on correlation and RMSE metrics. However, IAM outputs must satisfy physical and economic constraints (e.g., energy balance, non-negative emissions for certain sectors, plausible relationships between GDP growth and energy demand). The manuscript does not report whether ML-IAM predictions satisfy basic physical constraints, whether there are scenarios where the emulator produces physically implausible outputs, or how predictions behave at trajectory endpoints where extrapolation errors may compound. I recommend adding analysis of physical consistency, perhaps including examples of failure cases. This is partly mentioned in future work (PINNs) but explicit analysis on this would benefit the paper.
Regional Independence
The decision to treat regions independently creates a model that ignores inter-regional interactions, such as trade flows, carbon leakage, and energy market equilibrium. While this assumption is necessary for computational tractability, it significantly limits the emulator's validity.
Uncertainty Quantification
The paper notes that ML-IAM enables uncertainty quantification via Monte Carlo sampling (line 267), but the emulator's own uncertainty is underexplored. ML-IAM provides point predictions without uncertainty estimates. For policy-relevant applications, users need to understand prediction confidence and separation between epistemic and aleatoric uncertainty. The authors should at least discuss how uncertainty could be incorporated in future work (e.g., quantile regression, ensemble approaches, or Bayesian methods).
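Of the options listed above, quantile regression rests on a compact idea that can be checked directly: the pinball loss is minimized by the target quantile, so fitting one model per quantile yields a prediction interval. The sketch below verifies only the constant-predictor case on synthetic residuals (an illustration, not the authors' method):

```python
import numpy as np

# Pinball (quantile) loss: minimized when `pred` equals the q-quantile
# of y. Synthetic residuals only; not data from the manuscript.
def pinball(y, pred, q):
    e = y - pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 10_000)    # toy residuals

# Brute-force the best constant predictor at q = 0.9 over a grid.
candidates = np.linspace(-3.0, 3.0, 601)
best90 = candidates[np.argmin([pinball(y, c, 0.9) for c in candidates])]
```

Replacing the constant with a regression model (or an ensemble of them at several q values) is the standard route to per-scenario prediction intervals.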
Specific Comments and Technical Corrections
Recommendation: Accept subject to minor revisions, focused on significant textual clarifications of limitations. The revisions requested are primarily about removing claims regarding model comparison generality and explicitly stating the boundaries of the emulator's applicability.