Developing Guidelines for Working with Multi-Model Ensembles in CMIP
Abstract. Earth System Models (ESMs) are the key tool for studying the climate under changing conditions. Over recent decades, it has become established practice not to rely on projections from a single model but to combine various ESMs in multi-model ensembles (MMEs) to improve robustness and quantify the uncertainty of the projections. Data access for MME studies has been fundamentally facilitated by the World Climate Research Programme's Coupled Model Intercomparison Project (CMIP), a collaborative effort bringing together ESMs from modelling communities all over the world. Despite the CMIP standardisation processes, addressing specific research questions using MMEs requires unique ensemble design, analysis, and interpretation choices. Based on the collective expertise within the Fresh Eyes on CMIP initiative, mainly composed of early-career researchers engaged in CMIP, we have identified common issues and questions encountered while working with climate MMEs. In this project, we provide a comprehensive literature review addressing these questions. We provide statistics tracing the development of the climate MME analysis field over recent decades, and, synthesising existing studies, we outline guidelines regarding model evaluation, model dependence, weighting methods, and uncertainty treatment. We summarize a collection of useful resources for MME studies, review common questions and strategies, and finally outline emerging scientific trends, such as the integration of machine learning (ML) techniques, single model initial-condition large ensembles (SMILEs), and computational resource considerations. We thereby strive to support researchers working with climate MMEs, particularly in the upcoming 7th phase of CMIP.
Status: open (until 07 Dec 2025)
- RC1: 'Comment on egusphere-2025-4744', Anonymous Referee #1, 02 Nov 2025
This is an interesting paper that summarizes MMEs in CMIP. I would recommend eventual publication subject to revision, particularly on the editorial side, with the aim of providing clearer goals and more focus.
Major suggestions worth considering:
(1) While this is a very nice, thorough, comprehensive review of various methods/ideas, the paper is far too long and needs some heavy trimming. I would ask the authors to dig deep and try to cut out some sections and shoot for 30-40 pages max (in the current format, it is 52 pages). Otherwise, I fear that the length will scare off readers and it will not have the desired impact. Some questions I would ask for each section are: “Is this section essential to the key goal of this paper?” “Would our key goal be harmed if it were cut?” Part of doing this may be scoping out the key goals a bit more by providing very clear questions this team wants to answer. Having a trusted colleague whose #1 goal would be to cut words/sections, and who is not part of the original authorship team, may also be helpful. Some suggestions as I was reading:
— I’m a little biased (pun intended) but the bias section (pages 12-15, section 2.2) seems ripe for the chopping block, as it is not immediately relevant to MMEs but to models in general and is well covered by past literature. It comes off as a laundry list of random studies and is a bit tedious.
— I also found pages 29-30 (section 3.3) to be mostly redundant with some of the messages already covered earlier in the paper (e.g. check that a model performs well for a certain region/variable, MMEs allow for structural differences, uncertainty needs to be accounted for, etc.). That these lessons also apply to extremes could be briefly mentioned in another section.
— There is a very long section on the treatment of outliers (pages 30-34) which I suspect could be much reduced.
— I generally found the ML section to be too detailed and too lengthy. Pages 38-45 are dedicated to a deep dive on ML methods, which seems to deviate from the overall emphasis of the paper: providing info on how to work with MMEs in CMIP. Truthfully, it feels like it belongs in a separate paper. I’m not proposing that ML be removed, but perhaps a more top-level overview of the pros and cons of ML for evaluating and working with MMEs would suffice.
(2) In addition to reducing the scope and length of this paper, I would also ask the authors to consider mentioning that there are other “climate MMEs” in the form of initialized climate models used for seasonal climate prediction. These are not the same as “initial condition ensembles” (ICE): in prediction, there is more emphasis on assimilating the most accurate ocean/atmosphere/land state (this does not require building an assimilation system from scratch, which is time- and computing-intensive; some prediction models are therefore initialized from reanalyses that are separate from the prediction system).
Initialized climate models have the advantage of producing more frequent predictions that can be verified more immediately, and they are still climate models in the sense that correctly implementing the radiative/boundary forcings is essential, especially beyond the influence of synoptic weather and subseasonal variability like the MJO. While I realize the main goals of CMIP are distinct, I think trying to merge “MME” lessons from these two worlds would be valuable, mostly because some challenges are the same, such as how to optimally combine MMEs and validate against observations. I would recommend reading the references below and perhaps adding a few lessons from this “initialized climate prediction MME” community that may be shared with CMIP.
Kirtman, B. P., and coauthors, 2014: The North American Multimodel Ensemble: Phase-1 Seasonal-to-Interannual Prediction; Phase-2 toward Developing Intraseasonal Prediction. Bull. Amer. Meteor. Soc., 95, 585–601.
DelSole, T., J. Nattala, and M. K. Tippett (2014), Skill improvement from increased ensemble size and model diversity, Geophys. Res. Lett., 41, 7331–7342.
Becker, E., Kirtman, B. P., & Pegion, K. (2020). Evolution of the North American multi-model ensemble. Geophysical Research Letters, 47, e2020GL087408.
Buontempo, C., and coauthors, 2022: The Copernicus Climate Change Service: Climate Science in Action. Bull. Amer. Meteor. Soc., 103, E2669–E2687.
Becker, E. J., and coauthors, 2022: A Decade of the North American Multimodel Ensemble (NMME): Research, Application, and Future Directions. Bull. Amer. Meteor. Soc., 103, E973–E995.
Min, Y.-M., Lim, C.-M., Yoo, J.-H., Kim, H.-J., Kryjov, V. N., Jeong, D., et al. (2025). A diachronic assessment of advances in seasonal forecasting: Evolution of the APCC multi-model ensemble prediction system over the last two decades. Geophysical Research Letters, 52, e2025GL116416.
(3) Somewhat related to #2, I think the CMIP community needs to think more about how to link climate scenarios with actionable climate services, where societal and financial decisions are being made based on trust (or lack thereof) in initialized climate models. I would like this smart group of authors to ponder the idea that one of the reasons the IPCC and others may have lost some clout in recent years is that not enough people see this activity as immediately relevant to what they experience. For a subset of people, CMIP feels too far off to be relevant and too uncertain to make decisions on. Therefore, I see an opportunity to build additional credibility/trust by linking the initialized MME world with the CMIP/LE MME world. A helpful example is GFDL SPEAR, which provides seasonal forecasts once a month and also provides a large ensemble (LE).
https://www.gfdl.noaa.gov/spear/
https://www.gfdl.noaa.gov/spear_large_ensembles/
A user could then conceivably make linkages (and build trust) between the performance of the model they use for practical decisions and the more far-off scenarios that the same model produces. My recommendation on this paper is not conditional on implementing this suggestion; if this group isn’t comfortable addressing this issue, it is fine to ignore.
——————
Minor notes:
Overarching comment: I’m not sure if this is a journal requirement, but it is fairly challenging dealing with line numbers that only go from 0 to 100 and repeat. I would follow standard practice and only have one set of line numbers that do not repeat.
Page 3, Line 74-75: “Climate model evaluation” should distinguish between projections and predictions.
Page 5, lines 29: “climate MME” isn’t sufficiently clear (see #2 above).
Page 5-6: The subsection breakdown seems to be listed out twice (40-42 and 55-58) which is a bit confusing. I would cover the outline in one place.
Page 7, line 81: I would argue that NCEP/NCAR should not be used because it is very dated. In the stratospheric community in particular, the use of NCEP/NCAR has been discouraged.
Page 7, line 98-00: See point #2 above. We can verify climate models with some frequency.
Page 7, line 98 to Page 8, line 44: “Performance oriented evaluation”: there needs to be some commentary on the types of quantities that can be compared against observations. Since these simulations are free-running, there is no constraint tethering them to observations, which limits this sort of comparison; comparisons can only be made in a general sense, e.g. what a pattern looks like in the mean/climatology, etc.
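For concreteness, a minimal sketch of this kind of general comparison, an area-weighted pattern correlation between a model's climatological mean field and an observed one, might look like the following (the field arrays, grid, and function name are illustrative placeholders, not from the paper under review):

```python
import numpy as np

def pattern_correlation(model_clim, obs_clim, lats):
    """Area-weighted, centred pattern correlation between two
    climatological fields of shape (nlat, nlon)."""
    # cosine-of-latitude area weights, normalised to sum to one
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(model_clim)
    w = w / w.sum()

    def centre(x):
        # remove the area-weighted mean
        return x - np.sum(w * x)

    mc, oc = centre(model_clim), centre(obs_clim)
    return np.sum(w * mc * oc) / np.sqrt(np.sum(w * mc**2) * np.sum(w * oc**2))

# Purely synthetic example fields on a 1-degree grid
lats = np.linspace(-89.5, 89.5, 180)
rng = np.random.default_rng(0)
obs_clim = rng.normal(size=(180, 360))
model_clim = obs_clim + 0.5 * rng.normal(size=(180, 360))
print(f"pattern correlation: {pattern_correlation(model_clim, obs_clim, lats):.2f}")
```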
Page 8, lines 20-30: What parameters are adjusted? Is this calibration the same as model tuning, or something different? If it is different, I would distinguish the two.
Page 9, line 31: The reader needs to be reminded what “this assumption” refers to here. Which assumption?
Page 9, line 39: See #2-3 above. One alternative is to actually run these models in a way that can be validated against observations. Those that do well in this space may build trust for their use in projections.
Page 10, lines 77-79: One could imagine that more coordination on processes would result in a larger group of scientists developing preferences for certain methods over others. This may also result in more model similarity and could reduce the diversity of models, so how can one protect against that? [future question?]
— Reading ahead to page 15 it seems that this problem may be observed, so it does suggest some thought should be given to how to protect against too much model convergence.
Section 2.2. I know every study cannot be included but this one feels fairly critical to mention given the importance of ENSO. Planton et al. (2021) https://journals.ametsoc.org/view/journals/bams/102/2/BAMS-D-19-0337.1.xml
Section 2.2. I have to admit this section seems unfocused. I thought the goal was to explain how using multi-model approaches can help diagnose and resolve climate biases. It comes off as a bit of a laundry list of model biases, so I would suggest starting over and writing a more concise section with a few (maybe 2-3) concrete examples of how folks leveraged multiple models to diagnose these biases. In other words, how could they see and understand a bias better using multiple models vs. just a single model?
It feels like there is some repetition on page 16 (47-54) and page 18 (62-70). I also don’t see Figure 3, which is referenced (just Figure 2). I would clean this up a bit.
Page 20 (lines 21-30): There should be a clear statement in this paper about how retrospectively picking winners is tricky unless proper cross-validation procedures are applied. I’m also curious how much more accurate the predictions are; can something be cited here that quantifies this a bit more (line 23)?
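For illustration only, a small sketch (entirely synthetic data, placeholder names) of how leave-one-out cross-validation exposes the optimism of retrospectively picking a single “winner” model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_years = 10, 40
obs = rng.normal(size=n_years)                                   # pseudo-observations
models = obs + rng.normal(scale=1.0, size=(n_models, n_years))   # pseudo-model output

# Naive "winner picking": choose the model with the lowest RMSE over the full record
full_rmse = np.sqrt(((models - obs) ** 2).mean(axis=1))
naive_winner = full_rmse.argmin()

# Leave-one-out cross-validation: pick the winner on n-1 years, score it on the held-out year
cv_errors = []
for t in range(n_years):
    train = np.delete(np.arange(n_years), t)
    winner = np.sqrt(((models[:, train] - obs[train]) ** 2).mean(axis=1)).argmin()
    cv_errors.append((models[winner, t] - obs[t]) ** 2)

print(f"in-sample RMSE of naive winner:          {full_rmse[naive_winner]:.2f}")
print(f"cross-validated RMSE of winner selection: {np.sqrt(np.mean(cv_errors)):.2f}")
```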
Page 20 line 29: typo— capitalized O.
Page 21: I feel a sentence is needed to explain *why* SMILEs improve uncertainty estimates in climate projections. Otherwise it’s unclear how they are an improvement over a polynomial fit.
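As one way of making the contrast concrete, a minimal sketch (synthetic data only, names are placeholders) of the two estimates being compared: internal variability from the spread across SMILE members at each time versus from residuals around a polynomial fit to a single realisation:

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1950, 2101)
t = years - years.mean()                                   # centred time axis for the fit
forced = 0.0002 * (years - 1950) ** 2                      # synthetic forced warming signal (K)
members = forced + rng.normal(scale=0.15, size=(50, years.size))  # toy 50-member SMILE

# SMILE estimate: internal variability as member spread at each time, averaged over time
sigma_smile = members.std(axis=0).mean()

# Single-realisation estimate: residuals around a 4th-order polynomial fit
one_member = members[0]
fit = np.polyval(np.polyfit(t, one_member, 4), t)
sigma_polyfit = (one_member - fit).std()

print(f"internal variability from SMILE spread:      {sigma_smile:.3f} K")
print(f"internal variability from polyfit residuals: {sigma_polyfit:.3f} K")
```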
Page 22 (lines 99-05): I think we need a clearer definition of what is meant by an “emergent constraint.” As written, it appears to be just plotting x against y across models. What’s missing is that the relationship needs to be strong to be an effective constraint (one could imagine a situation with low correlations), and that the present-day observations should be plotted in the figure to help us understand how the real world is actually operating in the context of the various models (or model ensemble).
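To make that definition concrete, a short sketch (synthetic numbers, not from the paper) of what a complete emergent-constraint calculation involves: an inter-model regression of the future quantity y on the observable present-day quantity x, a check that the relationship is strong, and the application of the observed value of x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_models = 25
x = rng.normal(loc=1.0, scale=0.3, size=n_models)      # present-day observable, one value per model
y = 2.0 * x + rng.normal(scale=0.2, size=n_models)     # future quantity, one value per model

# Strength of the inter-model relationship: a weak correlation means no useful constraint
slope, intercept, r, p, stderr = stats.linregress(x, y)
print(f"inter-model correlation r = {r:.2f} (p = {p:.3f})")

# Apply the present-day observation to constrain y
x_obs = 1.1
y_constrained = slope * x_obs + intercept
print(f"unconstrained ensemble mean y: {y.mean():.2f} ± {y.std():.2f}")
print(f"constrained estimate of y:     {y_constrained:.2f}")
```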
Page 27 (11-13): I’m not clear why it needs to be less than half the initial sample. Is this test among multiple model ensembles? So you don’t overly favor one model? If so, this needs to be stated up front.
Page 27 (line 19): An alternative to “internal variability is low” is “the signal-to-noise ratio is high.”
Page 28 (lines 31-32): As written, this is a bit unclear. Are you saying to average together each distinct model ensemble to create an ensemble mean and THEN average across multiple models, or to just average together all available members pooled across multiple models? There is a distinction.
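The distinction could be illustrated with a toy example such as the following (made-up ensemble sizes and values): averaging each model's members first gives every model equal weight, whereas pooling all members weights models by their ensemble size:

```python
import numpy as np

# Made-up ensembles: model A with 3 members, model B with 50 members
model_a = np.array([1.0, 1.2, 0.8])
model_b = np.full(50, 2.0)

# Option 1: ensemble mean per model, then average across models (equal model weight)
mean_of_means = np.mean([model_a.mean(), model_b.mean()])

# Option 2: pool all members together (models weighted by ensemble size)
pooled_mean = np.concatenate([model_a, model_b]).mean()

print(f"mean of per-model ensemble means: {mean_of_means:.2f}")  # 1.50
print(f"pooled-member mean:               {pooled_mean:.2f}")    # ~1.94
```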
Page 34 (lines 03-06): I’m not totally sure what this is saying and I think this needs to be rewritten. How does identifying values in the tails of the distribution help find models that represent more extreme events? All models with ensembles produce distributions that have tails. Are you talking about single member models?
Page 38 (lines 19-20): In the leading/intro section to ML, I think it’s worth explicitly mentioning that the advances in the modeling space have really been on weather timescales, in part because there are sufficiently large training datasets. Climate, and especially climate change, provides fewer samples overall and more “out of sample” situations, and ML performance may be limited if the training dataset does not contain the extremes. So while, yes, ML has potential to improve these models, this may be much harder in the climate change domain than it is when applied to weather.
Page 42-43: Some font issues.
Page 46 (lines 47-49): I might mention “single forcing large ensembles” (SFLE), which partition the forcings into different components (GHG, aerosols, biomass, etc.): https://www.cesm.ucar.edu/working-groups/climate/simulations/cesm2-single-forcing-le . Otherwise it can be difficult to diagnose which aspect of the forcing is leading to a response.