This work is distributed under the Creative Commons Attribution 4.0 License.
Rapid Evaluation Framework for the CMIP7 Assessment Fast Track
Abstract. As Earth system models (ESMs) grow in complexity and in volumes of output data, there is an increasing need for rapid, comprehensive evaluation of their scientific performance. The upcoming Assessment Fast Track for the Seventh Phase of the Coupled Model Intercomparison Project (CMIP7) will require an expeditious response from model analyses designed to inform and drive integrated Earth system assessments. To meet this challenge, the Rapid Evaluation Framework (REF), a community-driven platform for benchmarking and performance assessment of ESMs, was designed and developed. The initial implementation of the REF, constructed to meet the near-term needs of the CMIP7 Assessment Fast Track, builds upon community evaluation and benchmarking tools. The REF runs within a containerized workflow for portability and reproducibility and is aimed at generating and organizing diagnostics covering a variety of model variables. The REF leverages best-available observational datasets to provide assessments of model fidelity across a collection of diagnostics. All diagnostics were identified and selected with community involvement and consultation. Operational integration with the Earth System Grid Federation (ESGF) will permit automated execution of the REF for specific diagnostics as soon as model data are published on ESGF by the originating modelling centres. The REF is designed to be portable across a range of current computational platforms to facilitate use by modelling centres for assessing the evolution of model versions or gauging the relative performance of CMIP simulations before they are published on ESGF. When integrated into production simulation workflows, results from the REF provide immediate quantitative feedback that allows model developers and scientists to quickly identify model biases and performance issues. After the REF is released to the community, its subsequent development and support will be prioritized by an international consortium of scientists and engineers, enabling a broader impact across Earth science disciplines. For instance, the REF will facilitate improvements to models and reductions in uncertainties for projections since ESMs are the main tool for studying the global Earth system. Production of reproducible diagnostics and community-based assessments are the key features of the REF that help to inform mitigation and adaptation policies.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Geoscientific Model Development.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 23 Oct 2025)
- RC1: 'Comment on egusphere-2025-2685', Anonymous Referee #1, 05 Sep 2025
- RC2: 'Comment on egusphere-2025-2685', Anonymous Referee #2, 17 Oct 2025
General remarks:
This paper presents a newly developed tool to enable a rapid assessment of simulations produced in the context of CMIP7. The primary objective is to provide a minimum set of diagnostics on model simulations published for the AR7 Fast Track intercomparison exercise. Such an assessment will be of great value for CMIP model data users, particularly those involved in climate services. However, the scope of the tool is considerably broader. It is intended for use in a wider range of contexts - for example, by model developers during the model development phase - and could be extended to other contexts: future CMIP phases, regional intercomparisons such as CORDEX, ISIMIP, etc. The tool appears to have been designed on a sound conceptual basis, with a clear separation of concerns, adaptability to diverse computing environments and ease of integration for new diagnostics. Furthermore, the paper outlines several promising perspectives for extending its capabilities.
This tool is undoubtedly of great interest for the community and the authors should be acknowledged for this significant community effort. This tool’s focus on interoperability with existing metrics packages represents a particularly valuable contribution to the modelling community, and this aspect could be further emphasized in the paper.
Such a tool is desirable; however, it should also be emphasized that its adoption may change the paradigm of multi-model intercomparison exercises. Indeed, once modellers are aware of which metrics users will focus on to select models, it is almost inevitable that they will concentrate their efforts on improving model performance against those specific metrics at the expense of other processes. This dynamic needs to be explicitly acknowledged, and modellers must avoid excessive model tuning (“overtuning”).
It is therefore crucial that the package is absolutely clear on the characterisation of the target metric uncertainty. For mean climate aspects, this target metric uncertainty primarily depends on observational uncertainty. Although the paper briefly discusses observational uncertainty, it mainly approaches it as the uncertainty associated with individual datasets. In my view, the uncertainty of single observational products does not fully capture the range of observational uncertainty. A more robust estimate could be obtained by considering an ensemble of different observational products, as sketched below. For trend estimates, another important source of target uncertainty arises from internal variability. In future versions of the tool, I strongly recommend placing greater emphasis on the quantification and communication of this target metric uncertainty.
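To illustrate the kind of ensemble-based estimate I have in mind, the following sketch computes a single metric against several observational products and reports the spread as a first-order target-metric uncertainty. The `global_rmse` helper, the product names, and the synthetic fields are illustrative assumptions, not REF code.

```python
# Illustrative sketch (not REF code): estimate target-metric uncertainty as the
# spread of a metric computed against several observational products.
import numpy as np
import xarray as xr

def global_rmse(model: xr.DataArray, obs: xr.DataArray) -> float:
    """Area-weighted global RMSE between two fields on the same lat/lon grid."""
    weights = np.cos(np.deg2rad(model["lat"]))
    return float(np.sqrt(((model - obs) ** 2).weighted(weights).mean(("lat", "lon"))))

# Placeholder fields standing in for a model climatology and an ensemble of
# observational products regridded to a common grid (names are hypothetical).
lat, lon = np.linspace(-89, 89, 90), np.linspace(0, 358, 180)
def field(offset: float) -> xr.DataArray:
    data = 288.0 + offset + np.random.randn(lat.size, lon.size)
    return xr.DataArray(data, coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

model_clim = field(0.0)
obs_products = {"productA": field(0.2), "productB": field(-0.1), "productC": field(0.4)}

scores = {name: global_rmse(model_clim, obs) for name, obs in obs_products.items()}
# The spread across products gives a first-order target-metric uncertainty,
# complementing the uncertainty reported for any single product.
print(f"RMSE = {np.mean(list(scores.values())):.2f} ± {np.std(list(scores.values())):.2f} K")
```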
I would also recommend clarifying the types of simulations on which the REF focuses. My understanding is that it will primarily rely on historical-type simulations. However, since ECS and TCR are among the target metrics, this implies that academic simulations such as 1% CO₂ and abrupt4xCO₂ experiments are also expected. When modellers use the tool for model development, is it straightforward to select which simulations are to be evaluated? Additionally, is there a mechanism to ensure consistency between the simulation setup and the metric being calculated?
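For context, the sketch below shows the standard way these two metrics are tied to the idealized experiments: Gregory regression on abrupt-4xCO2 output for ECS, and the warming around year 70 of the 1pctCO2 run for TCR. It assumes annual, global-mean anomalies relative to piControl and is a textbook-style illustration, not necessarily the REF's exact implementation.

```python
# Sketch of the standard ECS/TCR estimators (not necessarily the REF's exact method).
# Inputs are assumed to be annual, global-mean anomalies relative to piControl.
import numpy as np

def ecs_gregory(tas_anom_4x: np.ndarray, net_toa_anom_4x: np.ndarray) -> float:
    """ECS from abrupt-4xCO2 via Gregory regression: fit N = F + lambda*T;
    the x-intercept (-F/lambda) is the 4xCO2 equilibrium warming, halved for ECS."""
    lam, forcing = np.polyfit(tas_anom_4x, net_toa_anom_4x, 1)
    return -forcing / lam / 2.0

def tcr(tas_anom_1pct: np.ndarray) -> float:
    """TCR from 1pctCO2: warming averaged over years 61-80, i.e. around CO2 doubling."""
    return float(np.mean(tas_anom_1pct[60:80]))
```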
Specific questions:
Abstract
Sentence “For instance, the REF will facilitate improvements to models and reductions in uncertainties for projections since ESMs are the main tool for studying the global Earth system.”
I am not convinced that the REF will significantly reduce uncertainties in ESM projections. On the centennial timescale, uncertainties are primarily associated with model uncertainties and scenario uncertainties. In the case of model-related uncertainties, it has not been demonstrated that they can be substantially reduced through model selection. It may be more appropriate to rephrase the statement to suggest that the REF could contribute to enhancing the confidence in model projections and selection.
Introduction
In the first sentence, ESMs are stated as the primary tools to study interactions between the atmosphere, biosphere, … This is correct as long as centennial to millennial time scales are considered, but probably not for longer time scales. It would be worth indicating a time range.
Section 2.1
It is stated that “Releasing the REF before model outputs are submitted to ESGF offers modelling centres the opportunity to systematically assess their models and to make targeted improvements during the development phase” (ln 129-131)
This is certainly the case, but at the same time, it will most likely change the paradigm of model development for CMIP exercises. It is hardly avoidable that models will tend to match the REF metrics, leading to a reduction of ensemble errors on these specific metrics. However, does this necessarily imply a genuine reduction in uncertainty? Could it occur for undesirable reasons? Since the REF includes metrics on model trends, models will likely converge toward similar historical trends and TCR values. At the same time, estimations of historical trends are influenced by internal variability, while accurately assessing real-world internal variability remains a challenge and is often based on model-derived estimates (Simpson et al., 2025; Gyuleva et al., 2025). Is it truly desirable for models to converge on observed historical trends? Similarly, TCR remains an unknown parameter—does this approach risk pushing all models toward a TCR within the medium range of previous CMIP exercises? Is that desirable? While I recognize the necessity of the REF tool, I believe this paper should highlight that it will, to some extent, reshape the model development paradigm. This potential impact should be carefully considered when analyzing future model ensembles.
Section 3
Three primary potential applications are stated (ln 196), but the text is not clear about what these three applications are. It would help readability if they were clearly stated here and then developed.
Section 4
Four open-source packages have been used as a starting point, yet no explanation is provided regarding the criteria for selecting these particular packages. Even if the choice was made purely for practical reasons, it would be valuable to include this information.
I consider CMEC to be a major development within the REF tool. I suggest emphasizing this aspect more prominently in the paper, particularly in the abstract. As noted, there exists a wide diversity of benchmarking packages in the community, but they often lack interoperability and can be challenging to install or implement. Consequently, CMEC has the potential to play a key role in establishing standards and facilitating their use by modelling centres.
Figure 2 is a bit unclear to me; the relationship between the Redis block and the database seems particularly indirect. Why isn’t there any arrow leaving the Redis block?
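My guess, and it is only an assumption about the intended design rather than something stated in the paper, is that Redis acts as a task broker in a conventional worker-queue pattern, in which case the missing outgoing arrow could simply reflect that workers pull jobs from the broker and write results to the database themselves. A minimal Celery-style sketch of that pattern (the helper functions are hypothetical):

```python
# Illustrative only (an assumption about a typical design, not the REF's actual code):
# a minimal Celery task queue backed by Redis. Workers pull tasks from the broker
# and persist results to the database themselves, so no data flows out of Redis
# directly, which would explain a broker block with no outgoing arrows.
from celery import Celery

app = Celery("diagnostics", broker="redis://localhost:6379/0")

def compute_metrics(dataset_id: str) -> dict:
    # Hypothetical stand-in for running a diagnostic on one dataset.
    return {"dataset": dataset_id, "rmse": 0.0}

def save_to_database(result: dict) -> None:
    # Hypothetical stand-in for writing a result row to the results database.
    print("saved:", result)

@app.task
def run_diagnostic(dataset_id: str) -> None:
    # The worker, not Redis, performs the computation and the database write.
    save_to_database(compute_metrics(dataset_id))
```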
Miscellaneous questions/remarks:
ln 68-69: Model variables are described as changing during the model execution. In that case, what about fixed fields? They are probably not metrics scrutinized by the REF, but they could be part of a diagnostic calculation. Do you consider that they are still “variables”, or do you have another term for them?
ln 140-141: the sentence is a bit awkward; consider rephrasing.
ln 207: “of the REF is that it be usable by” consider rephrasing.
ln 237 : I have not seen CMEC defined before in the text.
ln 275: Could you explain briefly what “a functional relationship metric” is?
Gyuleva, G., Knutti, R., and Sippel, S.: Combination of Internal Variability and Forced Response Reconciles Observed 2023–2024 Warming, Geophysical Research Letters, 52, e2025GL115270, https://doi.org/10.1029/2025GL115270, 2025.
Simpson et al.: Confronting Earth System Model trends with observations, Science Advances, https://doi.org/10.1126/sciadv.adt8035, last access: 16 May 2025.
Citation: https://doi.org/10.5194/egusphere-2025-2685-RC2
Data sets
Observational Data for use by the REF, Earth System Grid Federation (ESGF), https://esgf.github.io/nodes.html
Model code and software
Climate Rapid Evaluation Framework (REF), REF Delivery Team, https://github.com/Climate-REF/climate-ref
Comments
Section 4.3: I’m not exactly sure of the difference between the ‘variables’ described here and the ‘diagnostics’ referred to in Table 1. For example, all the diagnostics in the Land and Land Ice Realm are objects I would think of as variables output by a land model (GPP, LAI, etc.). I do see how some of the diagnostics (like ECS/TCR, etc., in the ES realm) would be computed from variables and not be variables themselves. Perhaps it would be good to clarify that there might or might not be duplication/cross-over between ‘diagnostics’ and ‘variables’.
Section 5.3: this is more of a philosophical thought about subsetting models for particular scientific purposes and/or regional downscaling – for an end user, it would be tempting to just grab the models that perform well on certain regions/realms of interest, but if those same models are way off in other regions/realms, would that be a fair way to subset models? Perhaps this is more of a question for the community as a whole to come up with the best way of pulling a subset of models to look at specific questions (e.g. have some kind of penalty for not doing well for other regions/variables).
Table B1: (i) I am curious why cSoil was included but not cVeg? Is there a plan to include more datasets such as those from GEDI? (ii) would it be possible to add the dates over which the reference datasets are available? I am familiar with the ILAMB land datasets and I think they only go up to 2014 – have there been extensions to these for more recent years?
Section C1 1.1: what is the frequency of this calculation? i.e. annual/decadal/something else?
Minor Edits
Line 114: “produced, processes” -> “produced, and processes”
Line 117: what is a “realm” referring to here? These are defined in line 295 but should be introduced earlier here.
Figure 1 caption: spell out the acronyms (e.g. I am not sure what ‘DAG’ is)
Line 161: “a” -> “an”
Line 175: “at” -> “by the”
Line 259 -> add comma between “(ENSO)” and “CLIVAR”
Line 734: ref should be in parentheses
Line 743: insert “by” after “influenced”
Line 773: insert “by” after “obtained”
Line 780: I am not sure what is meant by “brought direction”
Line 792: “patterns” -> “pattern”
Line 802: insert “in” after “patterns”