Developing Guidelines for Working with Multi-Model Ensembles in CMIP
Abstract. Earth System Models (ESMs) are the key tool for studying the climate under changing conditions. Over recent decades, it has become established practice not to rely on projections from a single model but to combine various ESMs in multi-model ensembles (MMEs) to improve robustness and quantify the uncertainty of the projections. Data access for MME studies has been fundamentally facilitated by the World Climate Research Programme's Coupled Model Intercomparison Project (CMIP), a collaborative effort bringing together ESMs from modelling communities all over the world. Despite the CMIP standardisation processes, addressing specific research questions using MMEs requires unique ensemble design, analysis, and interpretation choices. Based on the collective expertise within the Fresh Eyes on CMIP initiative, mainly composed of early-career researchers engaged in CMIP, we have identified common issues and questions encountered while working with climate MMEs. In this project, we provide a comprehensive literature review addressing these questions. We provide statistics tracing the development of the climate MME analysis field over recent decades, and, synthesising existing studies, we outline guidelines regarding model evaluation, model dependence, weighting methods, and uncertainty treatment. We summarize a collection of useful resources for MME studies, review common questions and strategies, and finally outline emerging scientific trends, such as the integration of machine learning (ML) techniques, single model initial-condition large ensembles (SMILEs), and computational resource considerations. We thereby strive to support researchers working with climate MMEs, particularly in the upcoming 7th phase of CMIP.
Status: open (until 07 Dec 2025)
- RC1: 'Comment on egusphere-2025-4744', Anonymous Referee #1, 02 Nov 2025
This is an interesting paper that summarizes MMEs in CMIP. I would recommend eventual publication subject to revision, particularly on the editorial side, with the aim of providing clearer goals and more focus.
Major suggestions worth considering:
(1) While this is a very nice, thorough, comprehensive review of various methods/ideas, the paper is far too long and needs some heavy trimming. I would ask the authors to dig deep and try to cut out some sections and shoot for 30-40 pages max (in the current format, it is 52 pages). Otherwise, I fear that the length will scare off readers and it will not have the desired impact. Some questions I would ask for each section are: “Is this section essential to the key goal of this paper?” “Would our key goal be harmed if it were cut?” Part of doing this may be scoping out the key goals a bit more by providing very clear questions this team wants to answer. Having a trusted colleague whose #1 goal would be to cut words/sections, and who is not part of the original authorship team, may also be helpful. Some suggestions as I was reading:
— I’m a little biased (pun intended) but the bias section (pages 12-15, section 2.2) seems ripe for the chopping block, as it is not immediately relevant to MMEs but to models in general and is well covered by past literature. It comes off as a laundry list of random studies and is a bit tedious.
— I also found pages 29-30 (section 3.3) to be mostly redundant with some of the messages already covered earlier in the paper (e.g. check that a model performs well for a certain region/variable, MMEs allow for structural differences, uncertainty needs to be accounted for, etc.). That these lessons also apply to extremes could be briefly mentioned in another section.
— There is a very long section on the treatment of outliers (pages 30-34) which I suspect could be much reduced.
— I generally found the ML section to be too detailed and too lengthy. Pages 38-45 are dedicated to a deep dive on ML methods, which seems to deviate from the overall emphasis of the paper: providing info on how to work with MMEs in CMIP. Truthfully, it feels like it belongs in a separate paper. I’m not proposing that ML be removed, but perhaps a more top-level overview of the pros and cons of ML for evaluating and working with MMEs would suffice.
(2) In addition to reducing the scope and length of this paper, I would also ask the authors to consider mentioning that there are other “climate MMEs” in the form of initialized climate models used for seasonal climate prediction. These are not the same as “initial condition ensembles” (ICE): in prediction, there is more emphasis on assimilating the most accurate ocean/atmosphere/land state (this does not require building an assimilation system from scratch, which is time- and computing-intensive; some prediction models are therefore initialized from reanalyses that are separate from the prediction system).
Initialized climate models have the advantage of producing more frequent predictions that can be verified more immediately, and they are still climate models in the sense that correctly implementing the radiative/boundary forcings is essential, especially beyond the influence of synoptic weather and subseasonal variability like the MJO. While I realize the main goals of CMIP are distinct, I think trying to merge “MME” lessons from these two worlds would be valuable, mostly because some challenges are the same, such as how to optimally combine MMEs and validate against observations. I would recommend reading the references below and perhaps adding a few lessons from this “initialized climate prediction MME” community that may be shared with CMIP.
Kirtman, B. P., and coauthors, 2014: The North American Multimodel Ensemble: Phase-1 Seasonal-to-Interannual Prediction; Phase-2 toward Developing Intraseasonal Prediction. Bull. Amer. Meteor. Soc., 95, 585–601.
DelSole, T., J. Nattala, and M. K. Tippett (2014), Skill improvement from increased ensemble size and model diversity, Geophys. Res. Lett., 41, 7331–7342.
Becker, E., Kirtman, B. P., & Pegion, K. (2020). Evolution of the North American multi-model ensemble. Geophysical Research Letters, 47, e2020GL087408.
Buontempo, C., and coauthors, 2022: The Copernicus Climate Change Service: Climate Science in Action. Bull. Amer. Meteor. Soc., 103, E2669–E2687.
Becker, E. J., and coauthors, 2022: A Decade of the North American Multimodel Ensemble (NMME): Research, Application, and Future Directions. Bull. Amer. Meteor. Soc., 103, E973–E995.
Min, Y.-M., Lim, C.-M., Yoo, J.-H., Kim, H.-J., Kryjov, V. N., Jeong, D., et al. (2025). A diachronic assessment of advances in seasonal forecasting: Evolution of the APCC multi-model ensemble prediction system over the last two decades. Geophysical Research Letters, 52, e2025GL116416.
(3) Somewhat related to #2, I think the CMIP community needs to think more about how to link climate scenarios with actionable climate services, where societal and financial decisions are being made based on trust (or lack thereof) in initialized climate models. I would like this smart group of authors to ponder the idea that one of the reasons the IPCC and others may have lost some clout in recent years is that not enough people see this activity as immediately relevant to what they experience. For a subset of people, CMIP feels too far off to be relevant and too uncertain to make decisions on. Therefore, I see an opportunity to build additional credibility/trust by linking the initialized MME world with the CMIP/LE MME world. A helpful example is GFDL SPEAR, which provides seasonal forecasts once a month and also provides a large ensemble (LE).
https://www.gfdl.noaa.gov/spear/
https://www.gfdl.noaa.gov/spear_large_ensembles/
A user could then conceivably make linkages (and build trust) between the performance of the model they use for practical decisions and the more far-off scenarios that the same model produces. My recommendation on this paper is not conditional on implementing this suggestion; if this group isn’t comfortable addressing this issue, it is fine to ignore.
——————
Minor notes:
Overarching comment: I’m not sure if this is a journal requirement, but it is fairly challenging dealing with line numbers that only go from 0 to 100 and repeat. I would follow standard practice and only have one set of line numbers that do not repeat.
Page 3, Line 74-75: “Climate model evaluation” should distinguish between projections and predictions.
Page 5, lines 29: “climate MME” isn’t sufficiently clear (see #2 above).
Page 5-6: The subsection breakdown seems to be listed out twice (40-42 and 55-58) which is a bit confusing. I would cover the outline in one place.
Page 7, line 81: I would argue that NCEP/NCAR should not be used because it is very dated. In the stratospheric community in particular, the use of NCEP/NCAR has been discouraged.
Page 7, line 98-00: See point #2 above. We can verify climate models with some frequency.
Page 7, line 98 to Page 8, line 44: “Performance oriented evaluation”: there needs to be some commentary on the types of quantities that can be compared against observations. Since these simulations are free-running, there is no constraint tethering them to observations, which limits this sort of comparison; comparisons can only be made in a general sense, e.g. what a pattern looks like in the mean/climatology, etc.
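For concreteness, a minimal sketch of this kind of general comparison, an area-weighted pattern correlation between a model's climatological mean field and an observed one, might look like the following (the field arrays, grid, and function name are illustrative placeholders, not from the paper under review):

```python
import numpy as np

def pattern_correlation(model_clim, obs_clim, lats):
    """Area-weighted, centred pattern correlation between two
    climatological fields of shape (nlat, nlon)."""
    # cosine-of-latitude area weights, normalised to sum to one
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(model_clim)
    w = w / w.sum()

    def centre(x):
        # remove the area-weighted mean
        return x - np.sum(w * x)

    mc, oc = centre(model_clim), centre(obs_clim)
    return np.sum(w * mc * oc) / np.sqrt(np.sum(w * mc**2) * np.sum(w * oc**2))

# Purely synthetic example fields on a 1-degree grid
lats = np.linspace(-89.5, 89.5, 180)
rng = np.random.default_rng(0)
obs_clim = rng.normal(size=(180, 360))
model_clim = obs_clim + 0.5 * rng.normal(size=(180, 360))
print(f"pattern correlation: {pattern_correlation(model_clim, obs_clim, lats):.2f}")
```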
Page 8, lines 20-30: What parameters are adjusted? Is this calibration the same as model tuning, or something different? If it is different, I would distinguish the two.
Page 9, line 31: The reader needs to be reminded what “this assumption” refers to here. Which assumption?
Page 9, line 39: See #2-3 above. One alternative is to actually run these models in a way that can be validated against observations. Those that do well in this space may build trust for their use in projections.
Page 10, lines 77-79: One could imagine that more coordination on processes would result in a larger group of scientists developing preferences for certain methods over others. This may also result in more model similarity and could reduce the diversity of models, so how can one protect against that? [future question?]
— Reading ahead to page 15 it seems that this problem may be observed, so it does suggest some thought should be given to how to protect against too much model convergence.
Section 2.2. I know every study cannot be included but this one feels fairly critical to mention given the importance of ENSO. Planton et al. (2021) https://journals.ametsoc.org/view/journals/bams/102/2/BAMS-D-19-0337.1.xml
Section 2.2. I have to admit this section seems unfocused. I thought the goal was to explain how using multi-model approaches can help diagnose and resolve climate biases. It comes off as a bit of a laundry list of model biases, so I would suggest starting over and writing a more concise section with a few (maybe 2-3) concrete examples of how folks leveraged multiple models to diagnose these biases. In other words, how could they see and understand a bias better using multiple models vs. just a single model?
It feels like there is some repetition on page 16 (47-54) and page 18 (62-70). I also don’t see Figure 3, which is referenced (just Figure 2). I would clean this up a bit.
Page 20 (lines 21-30): There should be a clear statement in this paper about how retrospectively picking winners is tricky unless proper cross-validation procedures are applied. I’m also curious how much more accurate the predictions are; can something be cited here that quantifies this a bit more (line 23)?
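For illustration only, a small sketch (entirely synthetic data, placeholder names) of how leave-one-out cross-validation exposes the optimism of retrospectively picking a single “winner” model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_years = 10, 40
obs = rng.normal(size=n_years)                                   # pseudo-observations
models = obs + rng.normal(scale=1.0, size=(n_models, n_years))   # pseudo-model output

# Naive "winner picking": choose the model with the lowest RMSE over the full record
full_rmse = np.sqrt(((models - obs) ** 2).mean(axis=1))
naive_winner = full_rmse.argmin()

# Leave-one-out cross-validation: pick the winner on n-1 years, score it on the held-out year
cv_errors = []
for t in range(n_years):
    train = np.delete(np.arange(n_years), t)
    winner = np.sqrt(((models[:, train] - obs[train]) ** 2).mean(axis=1)).argmin()
    cv_errors.append((models[winner, t] - obs[t]) ** 2)

print(f"in-sample RMSE of naive winner:          {full_rmse[naive_winner]:.2f}")
print(f"cross-validated RMSE of winner selection: {np.sqrt(np.mean(cv_errors)):.2f}")
```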
Page 20 line 29: typo— capitalized O.
Page 21: I feel a sentence is needed to explain *why* SMILEs improve uncertainty estimates in climate projections. Otherwise it’s unclear how they are an improvement over a polynomial fit.
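As one way of making the contrast concrete, a minimal sketch (synthetic data only, names are placeholders) of the two estimates being compared: internal variability from the spread across SMILE members at each time versus from residuals around a polynomial fit to a single realisation:

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1950, 2101)
t = years - years.mean()                                   # centred time axis for the fit
forced = 0.0002 * (years - 1950) ** 2                      # synthetic forced warming signal (K)
members = forced + rng.normal(scale=0.15, size=(50, years.size))  # toy 50-member SMILE

# SMILE estimate: internal variability as member spread at each time, averaged over time
sigma_smile = members.std(axis=0).mean()

# Single-realisation estimate: residuals around a 4th-order polynomial fit
one_member = members[0]
fit = np.polyval(np.polyfit(t, one_member, 4), t)
sigma_polyfit = (one_member - fit).std()

print(f"internal variability from SMILE spread:      {sigma_smile:.3f} K")
print(f"internal variability from polyfit residuals: {sigma_polyfit:.3f} K")
```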
Page 22 (lines 99-05): I think we need a clearer definition of what is meant by an “emergent constraint.” As written, it appears to be just plotting x against y across models. What’s missing is that the relationship needs to be strong to be an effective constraint (one could imagine a situation with low correlations), and that the present-day observations should be plotted in the figure to help us understand how the real world is actually operating in the context of the various models (or model ensemble).
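To make that definition concrete, a short sketch (synthetic numbers, not from the paper) of what a complete emergent-constraint calculation involves: an inter-model regression of the future quantity y on the observable present-day quantity x, a check that the relationship is strong, and the application of the observed value of x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_models = 25
x = rng.normal(loc=1.0, scale=0.3, size=n_models)      # present-day observable, one value per model
y = 2.0 * x + rng.normal(scale=0.2, size=n_models)     # future quantity, one value per model

# Strength of the inter-model relationship: a weak correlation means no useful constraint
slope, intercept, r, p, stderr = stats.linregress(x, y)
print(f"inter-model correlation r = {r:.2f} (p = {p:.3f})")

# Apply the present-day observation to constrain y
x_obs = 1.1
y_constrained = slope * x_obs + intercept
print(f"unconstrained ensemble mean y: {y.mean():.2f} ± {y.std():.2f}")
print(f"constrained estimate of y:     {y_constrained:.2f}")
```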
Page 27 (11-13): I’m not clear why it needs to be less than half the initial sample. Is this test among multiple model ensembles? So you don’t overly favor one model? If so, this needs to be stated up front.
Page 27 (line 19): An alternative to “internal variability is low” is “the signal-to-noise ratio is high.”
Page 28 (lines 31-32): As written, this is a bit unclear. Are you saying to average together each distinct model ensemble to create an ensemble mean and THEN average across multiple models, or to just average together all available members pooled across multiple models? There is a distinction.
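The distinction could be illustrated with a toy example such as the following (made-up ensemble sizes and values): averaging each model's members first gives every model equal weight, whereas pooling all members weights models by their ensemble size:

```python
import numpy as np

# Made-up ensembles: model A with 3 members, model B with 50 members
model_a = np.array([1.0, 1.2, 0.8])
model_b = np.full(50, 2.0)

# Option 1: ensemble mean per model, then average across models (equal model weight)
mean_of_means = np.mean([model_a.mean(), model_b.mean()])

# Option 2: pool all members together (models weighted by ensemble size)
pooled_mean = np.concatenate([model_a, model_b]).mean()

print(f"mean of per-model ensemble means: {mean_of_means:.2f}")  # 1.50
print(f"pooled-member mean:               {pooled_mean:.2f}")    # ~1.94
```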
Page 34 (lines 03-06): I’m not totally sure what this is saying and I think this needs to be rewritten. How does identifying values in the tails of the distribution help find models that represent more extreme events? All models with ensembles produce distributions that have tails. Are you talking about single member models?
Page 38 (lines 19-20): In the leading/intro section to ML, I think it’s worth explicitly mentioning that the advances in the modeling space have really been on weather timescales, in part because there are sufficiently large training datasets. Climate, and especially climate change, provides fewer samples overall and more “out of sample” situations, and ML performance may be limited if the training dataset does not contain the extremes. So while, yes, ML has potential to improve these models, this may be much harder in the climate change domain than it is when applied to weather.
Page 42-43: Some font issues.
Page 46 (lines 47-49): I might mention “single forcing large ensembles” (SFLE), which partition the forcings into different components (GHG, aerosols, biomass, etc.): https://www.cesm.ucar.edu/working-groups/climate/simulations/cesm2-single-forcing-le . Otherwise it can be difficult to diagnose which aspect of the forcing is leading to a response.