This work is distributed under the Creative Commons Attribution 4.0 License.
Rapid Evaluation Framework for the CMIP7 Assessment Fast Track
Abstract. As Earth system models (ESMs) grow in complexity and in volumes of output data, there is an increasing need for rapid, comprehensive evaluation of their scientific performance. The upcoming Assessment Fast Track for the Seventh Phase of the Coupled Model Intercomparison Project (CMIP7) will require an expeditious response from model analyses designed to inform and drive integrated Earth system assessments. To meet this challenge, the Rapid Evaluation Framework (REF), a community-driven platform for benchmarking and performance assessment of ESMs, was designed and developed. The initial implementation of the REF, constructed to meet the near-term needs of the CMIP7 Assessment Fast Track, builds upon community evaluation and benchmarking tools. The REF runs within a containerized workflow for portability and reproducibility and is aimed at generating and organizing diagnostics covering a variety of model variables. The REF leverages best-available observational datasets to provide assessments of model fidelity across a collection of diagnostics. All diagnostics were identified and selected with community involvement and consultation. Operational integration with the Earth System Grid Federation (ESGF) will permit automated execution of the REF for specific diagnostics as soon as model data are published on ESGF by the originating modelling centres. The REF is designed to be portable across a range of current computational platforms to facilitate use by modelling centres for assessing the evolution of model versions or gauging the relative performance of CMIP simulations before they are published on ESGF. When integrated into production simulation workflows, results from the REF provide immediate quantitative feedback that allows model developers and scientists to quickly identify model biases and performance issues. After the REF is released to the community, its subsequent development and support will be prioritized by an international consortium of scientists and engineers, enabling a broader impact across Earth science disciplines. For instance, the REF will facilitate improvements to models and reductions in uncertainties for projections since ESMs are the main tool for studying the global Earth system. Production of reproducible diagnostics and community-based assessments are the key features of the REF that help to inform mitigation and adaptation policies.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Geoscientific Model Development.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2685', Anonymous Referee #1, 05 Sep 2025
AC1: 'Reply on RC1', Forrest M. Hoffman, 07 Apr 2026
We, the authors, thank Anonymous Referee #1 for the thoughtful comments, which identified areas that required additional clarity, and for the minor edits, which assisted us in making the article more readable. The manuscript was improved as a result of revisions in response to this helpful review. Responses to each point of the review are interspersed with the review comments below.
Section 4.3: I’m not exactly sure of the difference between the ‘variables’ described here and the ‘diagnostics’ referred to in Table 1. For example, all the diagnostics in the Land and Land Ice Realm are objects I would think of as variables output by a land model (GPP, LAI etc). I do see how some of the diagnostics (like ECS/TCR etc in the ES realm) would be computed from variables and not be variables themselves. Perhaps it would be good to clarify that there might or might not be duplication/cross-over between ‘diagnostics’ and ‘variables’.
The definitions of Model Variables and Diagnostics provided in Section 1 distinguish the two terms from each other. In particular, some diagnostics employ a single variable, like GPP or LAI, while others, like Precipitation minus Evaporation (P−E), require multiple variables. However, to better clarify this point, the first sentence of the second paragraph of Section 4.3 was modified, and a second sentence was added as follows.
An opportunity regarding the REF was submitted to the Data Request Task Team’s open call to ensure that the model variables required by the REF for generating the REF diagnostics were clearly identified and that modelling centers wishing to use the REF with their simulations would have a checklist of model variables needed for producing the REF diagnostics. While some REF diagnostics use only a single model variable, others require two or more model variables, as described in the definitions in Section 1.
Section 5.3: this is more of a philosophical thought about subsetting models for particular scientific purposes and/or regional downscaling – for an end user, it would be tempting to just grab the models that perform well on certain regions/realms of interest, but if those same models are way off in other regions/realms, would that be a fair way to subset models? Perhaps this is more of a question for the community as a whole to come up with the best way of pulling a subset of models to look at specific questions (e.g. have some kind of penalty for not doing well for other regions/variables).
This is an important consideration, and to assist analysts in making such decisions, most of the REF diagnostics include multiple metrics and graphical output, like contour maps and Taylor diagrams, instead of only a single scalar score that can mask such details. To be more explicit, the following sentence was added to the paragraph in Section 5.3.
Of course, analysts must consider a wide range of factors when making such selections, including performance on a variety of metrics for relevant model variables (across spatial regions and through time) that may go beyond those incorporated into the initial version of the REF.
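To make concrete what a Taylor diagram encodes, here is a minimal illustrative sketch (ours, not code from the REF or its constituent packages) of the three statistics that define a point on the diagram:

```python
import numpy as np

def taylor_statistics(model, reference):
    """Statistics defining a point on a Taylor diagram, for two 1-D
    arrays of field values on the same grid."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    corr = np.corrcoef(model, reference)[0, 1]      # pattern correlation
    sigma_norm = model.std() / reference.std()      # normalized std. dev.
    # Centered RMS error: remove the means before differencing.
    crmse = np.sqrt(np.mean(((model - model.mean())
                             - (reference - reference.mean())) ** 2))
    return corr, sigma_norm, crmse / reference.std()
```

A single scalar score would collapse these three quantities into one number; presenting them separately is what lets an analyst see, for example, a model with the right spatial variance but a displaced pattern.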
Table B1: (i) I am curious why cSoil was included but not cVeg? Is there a plan to include more datasets such as those from GEDI? (ii) would it be possible to add the dates over which the reference datasets are available? I am familiar with the ILAMB land datasets and I think they only go up to 2014 – have there been extensions to these for more recent years?
(i) As explained in Section 2.2, the diagnostics, and therefore the corresponding variables, included in the initial version of the REF were prioritized through a community survey process. The objective was to have the community identify the top four to seven diagnostics to include for each of the five Earth system realms. The results of the community survey identified the soil carbon (cSoil) diagnostic as having the highest priority for the Land & Land Ice realm. Total carbon in vegetation, or above- plus belowground biomass (cVeg), was not prioritized in the top seven diagnostics for Land & Land Ice. However, adding this diagnostic could easily be accomplished in a future version of the REF since the ILAMB package already includes a biomass diagnostic. Plans for future versions also include bringing in more reference datasets, including newer remote sensing products.
(ii) Reference data used in the REF span multiple time ranges, and the expectation is that these data will be regularly updated by extending their time ranges as appropriate or by replacing them with improved data sets as they become available and evaluated by the research community. In Table A1, we will add time ranges for reference data included in the first version of the REF.
Section C1 1.1: what is the frequency of this calculation? i.e. annual/decadal/something else?
Both the Antarctic annual mean and Arctic September rate of sea ice area loss per degree of warming are calculated using a single value each year. The description in Section C1 1.1 was modified to be explicit as follows.
The metric is calculated by regressing the time-series of sea ice area on global mean temperature on an annual basis. The annual mean sea ice area is used for the Antarctic, while the September-mean sea ice area is used for the Arctic; both are regressed against annual global mean temperature.
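For concreteness, here is a minimal sketch of that calculation (our illustration; the argument names and units are assumptions, not the REF's API):

```python
import numpy as np

def sea_ice_loss_per_degree(sia, gmst):
    """Slope of an ordinary least-squares regression of one sea ice area
    value per year (annual mean for the Antarctic, September mean for
    the Arctic) on annual global mean surface temperature.

    sia  : annual sea ice area series, e.g. in 10^6 km^2
    gmst : annual global mean surface temperature series, in K or degC
    """
    slope, _intercept = np.polyfit(gmst, sia, deg=1)
    return slope  # typically negative: area lost per degree of warming
```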
Minor Edits
Line 114: “produced, processes” -> “produced, and processes”
Corrected.
Line 117: what is a “realm” referring to here? These are defined in line 295 but should be introduced earlier here.
Definition was moved and introduced at this earlier location.
Figure 1 caption: spell out the acronyms (e.g. I am not sure what ‘DAG’ is)
Abbreviations and acronyms are now defined in the figure caption.
Line 161: “a” -> “an”
Corrected.
Line 175: “at” -> “by the”
This was intended to precede a citation, which has now been corrected.
Line 259 -> add comma between “(ENSO)” and “CLIVAR”
Corrected.
Line 734: ref should be in parentheses
Corrected.
Line 743: insert “by” after “influenced”
Corrected.
Line 773: insert “by” after “obtained”
Corrected.
Line 780: I am not sure what is meant by “brought direction”
Corrected to “broad direction”.
Line 792: “patterns” -> “pattern”
Corrected.
Line 802: insert “in” after “patterns”
Corrected.
Citation: https://doi.org/10.5194/egusphere-2025-2685-AC1
AC2: 'Reply on RC2', Forrest M. Hoffman, 08 Apr 2026
We, the authors, thank Anonymous Referee #2 for the helpful comments that identified the need for improved descriptions of a few elements of the REF. We also appreciate the suggested minor edits. The manuscript was improved as a result of revisions in response to this review. Responses to each point of the review are interspersed with the review comments below.
This paper presents a newly developed tool to enable a rapid assessment of simulations produced in the context of CMIP7. The primary objective is to provide a minimum set of diagnostics on model simulations published for the AR7 Fast Track intercomparison exercise. Such an assessment will be of great value for CMIP model data users, particularly those involved in climate services. However, the scope of the tool is considerably broader. It is intended for use in a wider range of contexts - for example, by model developers during the model development phase - and could be extended to other contexts: future CMIP phases, regional intercomparisons such as CORDEX, ISIMIP, etc. The tool appears to have been designed on a sound conceptual basis, with a clear separation of concerns, adaptability to diverse computing environments, and ease of integration for new diagnostics. Furthermore, the paper outlines several promising perspectives for extending its capabilities.
This tool is undoubtedly of great interest for the community and the authors should be acknowledged for this significant community effort. This tool’s focus on interoperability with existing metrics packages represents a particularly valuable contribution to the modelling community, and this aspect could be further emphasized in the paper.
Thank you for the comment and for recognizing the importance of designing the REF to be interoperable with existing evaluation packages. Indeed, the aspect of interoperability among community efforts could be better emphasized. The abstract has been slightly extended as follows to strengthen this point.
Production of reproducible diagnostics and community-based assessments are the key features of the REF that help to inform mitigation and adaptation policies. Furthermore, providing interoperability with existing evaluation packages assures that contributions from previous community efforts will be available for use in future model intercomparison projects.
Such a tool is desirable; however, it should also be emphasized that its adoption may change the paradigm of multi-model intercomparison exercises. Indeed, once modellers are aware of which metrics users will focus on to select models, it is almost inevitable that they will concentrate their efforts on improving model performance against those specific metrics at the expense of other processes. This dynamic needs to be explicitly acknowledged. However, modellers must avoid excessive model tuning (“overtuning”).
This comment gives us the opportunity to stress once again that the community development of a standardized set of metrics and diagnostics does not imply a classification of “better” and “worse” models. The point here is to allow analysts to objectively characterize the strengths and weaknesses of models, so that modelers can work on improving the model where possible, and analysts can identify which models are more appropriate than others for the study of specific science questions. To address this question for the reader, we have added the following sentence in Section 5.3.
While it is hypothetically possible that modelers could ``tune'' or ``overtune'' their models to score well on a small set of specific diagnostics, this approach to improving model performance in the REF is of limited practical utility since the reference data are not consistent with each other and the relatively large number of diagnostics and metrics makes such an optimization impossible for a physics-constrained model.
It is therefore crucial that the package is very clear on the characterisation of the target metric uncertainty. For mean climate aspects, this target metric uncertainty primarily depends on observational uncertainty. Although the paper briefly discusses observational uncertainty, it mainly approaches it as the uncertainty associated with individual datasets. In my view, the uncertainty of single observational products does not fully capture the range of observational uncertainty. A more robust estimate could be obtained by considering an ensemble of different observational products. For trend estimates, another important source of target uncertainty arises from internal variability. In future versions of the tool, I strongly recommend placing greater emphasis on the quantification and communication of this target metric uncertainty.
We thank the referee for this thoughtful suggestion. Obtaining quantified uncertainties for individual reference datasets is difficult; however, in the initial version of the REF, we are including at least one diagnostic that includes a quantified uncertainty. In exploring observational uncertainties, we have identified a variety of sources of uncertainty, most of which are difficult to constrain, even for a single variable. Future versions of the REF are likely to incorporate multiple reference datasets for a given variable, as is already common in some of the community packages included in the REF, offering another method for understanding uncertainty for a single variable. We agree that the REF should evolve further toward developing more robust estimates of uncertainties from a wide variety of reference data and visualizing these uncertainties alongside the reference data and model results. The REF already permits multiple initial condition realizations to be analyzed as a way of quantifying internal model variability.
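As a rough illustration of both points, each reduces to evaluating the same scalar metric repeatedly and examining the spread; a minimal sketch (ours, with precomputed metric values assumed as input):

```python
import numpy as np

def metric_spread(values):
    """Spread of a scalar metric evaluated several times: across
    reference products (a crude bound on observational uncertainty)
    or across initial-condition realizations (internal variability)."""
    v = np.asarray(values, dtype=float)
    return {"mean": v.mean(),
            "std": v.std(ddof=1) if v.size > 1 else 0.0,
            "range": (v.min(), v.max())}

# e.g., bias of one model against three precipitation reference products:
# metric_spread([0.31, 0.18, 0.42])
```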
I would also recommend clarifying the types of simulations on which the REF focuses. My understanding is that it will primarily rely on historical-type simulations. However, since ECS and TCR are among the target metrics, this implies that academic simulations such as 1% CO₂ and abrupt4xCO₂ experiments are also expected. When modellers use the tool for model development, is it straightforward to select which simulations are to be evaluated? Additionally, is there a mechanism to ensure consistency between the simulation setup and the metric being calculated?
The list of simulations relevant to each of the diagnostics in the REF is now provided in Appendix B1. When modelers use the REF for model development, they can simply identify the model outputs that are to be evaluated. The REF uses the metadata provided in the CMORized netCDF files to ensure consistency between simulations and the metrics being calculated.
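To sketch how that metadata check can work (illustrative only; the diagnostic-to-experiment mapping below is hypothetical, not the REF's actual registry):

```python
import xarray as xr

# Hypothetical mapping from diagnostics to the experiment_id values they
# require; the REF derives this information from its diagnostic definitions.
REQUIRED_EXPERIMENTS = {
    "ecs": {"abrupt-4xCO2", "piControl"},
    "tcr": {"1pctCO2", "piControl"},
    "mean_state": {"historical"},
}

def available_experiments(paths):
    """Collect experiment_id from the global attributes of CMORized files."""
    found = set()
    for path in paths:
        with xr.open_dataset(path) as ds:
            found.add(ds.attrs.get("experiment_id"))
    return found

def runnable_diagnostics(paths):
    """Return the diagnostics whose required experiments are all present."""
    found = available_experiments(paths)
    return [name for name, needed in REQUIRED_EXPERIMENTS.items()
            if needed <= found]
```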
Specific questions:
Abstract
Sentence “For instance, the REF will facilitate improvements to models and reductions in uncertainties for projections since ESMs are the main tool for studying the global Earth system.”
I am not convinced that the REF will significantly reduce uncertainties in ESM projections. On the centennial timescale, uncertainties are primarily associated with model uncertainties and scenario uncertainties. In the case of model-related uncertainties, it has not been demonstrated that they can be substantially reduced through model selection. It may be more appropriate to rephrase the statement to suggest that the REF could contribute to enhancing the confidence in model projections and selection.
This is an excellent point, and one that many of the authors have experienced first hand. We have changed the sentence in the abstract as follows.
For instance, the REF will facilitate improvements to models and enhance the confidence in model projections through process-based selection of models based on their performance with respect to observations.
Introduction
In the first sentence, ESMs are stated as the primary tools to study interactions between the atmosphere, biosphere, etc. This is correct as long as centennial to millennial timescales are considered, but probably not for longer timescales. It would be worth indicating a time range.
We agree that running ESMs over paleoclimatic timescales is computationally difficult and that mesoscale models are excellent for higher resolutions over short timescales. We changed the first sentence of the introduction as follows.
Earth system models (ESMs) are the primary tools for the scientific community to study interactions at a wide range of scales, from sub-daily to millennial, between the atmosphere, land, ocean, cryosphere, and biosphere in the Earth system and how it responds to human-induced and natural forcings
Section 2.1
It is stated that “Releasing the REF before model outputs are submitted to ESGF offers modelling centres the opportunity to systematically assess their models and to make targeted improvements during the development phase” (ln 129-131)
This is certainly the case, but at the same time, it will most likely change the paradigm of model development for CMIP exercises. It is hardly avoidable that models will tend to match the REF metrics, leading to a reduction of ensemble errors on these specific metrics. However, does this necessarily imply a genuine reduction in uncertainty? Could it occur for undesirable reasons? Since the REF includes metrics on model trends, models will likely converge toward similar historical trends and TCR values. At the same time, estimations of historical trends are influenced by internal variability, while accurately assessing real-world internal variability remains a challenge and is often based on model-derived estimates (Simpson et al., 2025; Gyuleva et al., 2025). Is it truly desirable for models to converge on observed historical trends? Similarly, TCR remains an unknown parameter—does this approach risk pushing all models toward a TCR within the medium range of previous CMIP exercises? Is that desirable? While I recognize the necessity of the REF tool, I believe this paper should highlight that it will, to some extent, reshape the model development paradigm. This potential impact should be carefully considered when analyzing future model ensembles.
These are all excellent questions, many of which our authors have discussed. Since modeling centers have access to evaluation tools, datasets, and model output and assessments from CMIP6 and prior activities, they already routinely use that information to evaluate their new models. The existence of the REF does not change the potential for the undesirable social effects of model convergence. The beauty of the REF is that these sorts of benchmarking activities will be consistent, made centrally available for potential model users, and employ regularly updated and curated observational reference data. As noted above, the breadth of the diagnostics and metrics included in the REF will preclude large-scale convergence across model components and realms. Overtuning for a few related variables will almost certainly produce poor performance for other variables. To be more explicit on these points, we added the following text in Section 5.3.
We emphasize that while the REF provides a suite of valuable diagnostics of model performance, it cannot replace dedicated analyses for each diagnostic that investigate in detail the mechanisms behind different model behaviors. Model improvements must come from such mechanistic understanding, and the REF may be useful in highlighting where to conduct such a detailed investigation.
Section 3
Three primary potential applications are stated (ln 196), but the text is not that clear on what these three applications are. It would help readability if they were clearly stated here and then developed.
We agree. We restructured the text introducing Section 3 as follows.
Considering the REF scope, stakeholder consultation, and initial implementation plan for the Assessment Fast Track simulations, three primary potential applications for the REF were identified: providing information about the scientific performance of ESMs to stakeholders and policymakers, focusing on key sensitivity indicators, and enhancing equality in model data access.
Section 4
Four open-source packages have been used as a starting point, yet no explanation is provided regarding the criteria for selecting these particular packages. Even if the choice was made purely for practical reasons, it would be valuable to include this information.
Excellent suggestion. We have modified the introductory sentence of Section 4.1 as follows.
The open source evaluation and benchmarking packages described below—ESMValTool, PMP, and ILAMB & IOMB—were chosen for inclusion in the first version of the REF because they were relatively mature packages at the time and, combined, they permit comprehensive model evaluation across the atmosphere, ocean and sea ice, land and land ice, Earth system, and impacts and adaptation realms.
I consider CMEC to be a major development within the REF tool. I suggest emphasizing this aspect more prominently in the paper, particularly in the abstract. As noted, there exists a wide diversity of benchmarking packages in the community, but they often lack interoperability and can be challenging to install or implement. Consequently, CMEC has the potential to play a key role in establishing standards and facilitating their use for modelling centers.
We agree that CMEC has provided a key technological framework for combining disparate benchmarking packages together into a cohesive system. We have modified a sentence in the abstract as follows.
The initial implementation of the REF, constructed to meet the near-term needs of the CMIP7 Assessment Fast Track, builds upon four disparate community evaluation and benchmarking tools that are coupled together using the Coordinated Model Evaluation Capabilities (CMEC) framework.
Figure 2 is a bit unclear to me, the relationship between redis and database is particularly indirect. Why isn’t there any arrow leaving the redis block?
Unfortunately, the reviewed manuscript included an early draft of Figure 2. It has been replaced with a more complete and accurate figure of the system design.
Miscellaneous questions/remarks:
ln 68-69: Model variables are described as changing during the model execution. In that case, what about fixed fields? These are probably not metrics scrutinized by the REF, but they could be part of a diagnostic calculation. Do you consider that they are still “Variables”, or do you have another term for them?
Diagnostics targeting fixed quantities and model parameters have not been considered. We think that comparisons of fixed quantities (e.g., topography, bathymetry, constant astronomical factors) should be treated as a matter for model design intercomparison, rather than for the evaluation of model performance.
ln 140-141: the sentence is a bit weird, consider rephrasing.
We agree. We restructured the sentence as follows.
These conventions ensure that data are consistently structured, with uniform variable naming, units, and metadata, facilitating easy comparison across model outputs.
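As an illustration of what those conventions enable, here is a minimal sketch of a name-and-units check (the canonical table below is a small assumed excerpt, not the full CMIP standard):

```python
import xarray as xr

# A few canonical CMIP variable names and units (illustrative excerpt).
CANONICAL_UNITS = {
    "tas": "K",           # near-surface air temperature
    "pr": "kg m-2 s-1",   # precipitation flux
    "gpp": "kg m-2 s-1",  # gross primary production of carbon
}

def check_variable(path, name):
    """Confirm a CMORized file exposes `name` with its canonical units."""
    with xr.open_dataset(path) as ds:
        if name not in ds:
            return f"{name}: not found"
        units = ds[name].attrs.get("units")
        expected = CANONICAL_UNITS.get(name)
        if expected is not None and units != expected:
            return f"{name}: units {units!r} differ from expected {expected!r}"
    return f"{name}: ok"
```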
ln 207: “of the REF is that it be usable by” consider rephrasing.
The sentence has been restructured as follows.
A key design goal of the REF is that it can be used by modeling centers, research institutions, and individual scientists to enable validation of ESM output prior to publication of simulations on ESGF, intercomparison of model results, and general purpose analysis
ln 237 : I have not seen CMEC defined before in the text.
Agreed. We have now spelled out Coordinated Model Evaluation Capabilities (CMEC).
ln 275: Could you explain a bit what “a functional relationship metric” is?
Certainly we can. We have simplified the identified sentence and added a second sentence that provides an illustrative example as follows.
Functional relationship metrics within ILAMB and IOMB evaluate the degree to which model variable-to-variable relationships correspond to those of observational data. For example, the relationship between gross primary production (GPP) and mean annual temperature in the model should correspond well with the same relationship extracted from observational data.
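A minimal sketch of such a comparison (our illustration; ILAMB's actual scoring is more elaborate) bins GPP by mean annual temperature in both the model and the observations and measures how far apart the two binned curves lie:

```python
import numpy as np

def binned_mean(x, y, bin_edges):
    """Mean of y within each bin of x (NaN where a bin is empty)."""
    idx = np.digitize(x, bin_edges)
    return np.array([y[idx == i].mean() if np.any(idx == i) else np.nan
                     for i in range(1, len(bin_edges))])

def relationship_rmse(t_mod, gpp_mod, t_obs, gpp_obs, bin_edges):
    """RMS difference between the model and observed GPP-vs-temperature
    curves, skipping bins that are empty in either dataset."""
    mod = binned_mean(t_mod, gpp_mod, bin_edges)
    obs = binned_mean(t_obs, gpp_obs, bin_edges)
    valid = ~np.isnan(mod) & ~np.isnan(obs)
    return np.sqrt(np.mean((mod[valid] - obs[valid]) ** 2))
```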
Gyuleva, G., Knutti, R., and Sippel, S.: Combination of Internal Variability and Forced Response Reconciles Observed 2023–2024 Warming, Geophysical Research Letters, 52, e2025GL115270, https://doi.org/10.1029/2025GL115270, 2025.
Simpson, I. R., et al.: Confronting Earth System Model trends with observations, Science Advances, https://doi.org/10.1126/sciadv.adt8035, last access: 16 May 2025.
Citation: https://doi.org/10.5194/egusphere-2025-2685-AC2
RC2: 'Comment on egusphere-2025-2685', Anonymous Referee #2, 17 Oct 2025
Citation: https://doi.org/10.5194/egusphere-2025-2685-RC2
Data sets
Observational Data for use by the REF, Earth System Grid Federation (ESGF): https://esgf.github.io/nodes.html
Model code and software
Climate Rapid Evaluation Framework (REF), REF Delivery Team: https://github.com/Climate-REF/climate-ref