the Creative Commons Attribution 4.0 License.
Monitoring and benchmarking Earth system model simulations with ESMValTool v2.12.0
Abstract. Earth system models (ESMs) are important tools to improve our understanding of present-day climate and to project climate change under different plausible future scenarios. For this, ESMs are continuously improved and extended resulting in more complex models. Particularly during the model development phase, it is important to continuously monitor how well the historical climate is reproduced and to systematically analyze, evaluate, understand, and document possible shortcomings. For this, putting model biases relative to observations into the context of deviations shown by other state-of-the-art models greatly helps to assess which biases need to be addressed with higher priority. Here, we introduce the new capability of the open-source community-developed Earth System Model Evaluation Tool (ESMValTool) to monitor running or benchmark existing simulations with observations in the context of results from the Coupled Model Intercomparison Project (CMIP). To benchmark model output, ESMValTool calculates metrics such as the root-mean-square error, the Pearson correlation coefficient, or the Earth mover’s distance relative to reference datasets. This is directly compared to the same metric calculated for an ensemble of models such as the one provided by CMIP6, which provides a statistical measure for the range of values that can be considered typical for state-of-the-art ESMs. Results are displayed in different types of plots such as map plots or time series with different techniques such as stippling (maps) or shading (time series) used to visualize the typical range of values for a given metric from the model ensemble used for comparison. Automatic downloading of CMIP results from the Earth System Grid Federation (ESGF) makes application of ESMValTool for benchmarking of individual model simulations, for example in preparation of CMIP7, easy and very user friendly.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1518', Anonymous Referee #1, 16 Jul 2024
Review of “Monitoring and benchmarking Earth system model simulations with ESMValTool v2.12.0” by Lauer et al.
This paper documents recent updates to one of the climate and Earth system model evaluation software packages, ESMValTool. Considering the popularity of the tool, it is important to keep its capabilities updated and well documented as a community resource. I think the paper is well organized. I only have a few minor comments, as follows.
I wonder if the authors could clarify more explicitly what the new capabilities in this specific version of the tool are, compared to the previous version described in the published paper. Are the metrics in section 2.2 all new metrics that were not available in the previous version?
Line 44 to 47 “For this, for example results from the Coupled Model Intercomparison Phase 5 and 6 (Eyring et al., 2016; Taylor et al., 2012) can be used to get an overview of which biases can be considered “acceptable for now”, and which would need more attention and more detailed analysis and comparisons with observations.”: As there have been several tools being developed for such purposes, I wonder if it would be beneficial to provide a few references as examples: e.g., PCMDI Metrics Package (Lee et al., 2024, GMD), ILAMB (with proper reference), etc.
Line 88 “For all metrics, an unweighted and weighted version exists” and sections 2.2.2 through 2.2.4: I wonder what the rationale was for including unweighted metrics. While it is fair to include both methods as options, I think the weighted metrics might be better considered as the “default” method. Is there any practical use case for unweighted metrics?
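The reviewer's point about weighted versus unweighted metrics can be made concrete with a small sketch (synthetic data, not ESMValTool code): on a regular latitude-longitude grid, area weights proportional to cos(latitude) downweight polar cells, so a field with a strong polar signal gives clearly different weighted and unweighted global means.

```python
import numpy as np

# Illustrative sketch (not ESMValTool code): area weighting on a regular
# lat-lon grid, where grid-cell area is proportional to cos(latitude).
lat = np.linspace(-87.5, 87.5, 36)   # cell-centre latitudes
lon = np.linspace(0.0, 350.0, 36)
w = np.cos(np.deg2rad(lat))[:, None] * np.ones((1, lon.size))

# A field with a strong polar signal: unweighted averaging overstates it,
# because polar cells cover far less area than their grid-point count suggests.
field = np.where(np.abs(lat)[:, None] > 60.0, 10.0, 1.0) * np.ones((1, lon.size))

unweighted_mean = field.mean()
weighted_mean = np.sum(w * field) / np.sum(w)
print(unweighted_mean, weighted_mean)  # 4.0 vs roughly 2.2
```

The weighted mean is the physically meaningful one here, which supports the reviewer's suggestion to treat weighted metrics as the default.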
Line 149 “The default value in ESMValTool is n=100”: Are the bins distributed equally and sized evenly between the min/max of the PDF? I guess this might be the case, but it won't be harmful to clarify it.
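Assuming the binning works the way the reviewer guesses (n equal-width bins spanning the min/max of the data, which is NumPy's default histogram behaviour), a quick check looks like this; whether ESMValTool does exactly this is the open question.

```python
import numpy as np

# Sketch of the assumption the reviewer asks about: n equal-width bins
# spanning the min/max of the data (numpy's default behaviour when an
# integer bin count is given).
rng = np.random.default_rng(1)
data = rng.normal(size=1000)
n = 100
counts, edges = np.histogram(data, bins=n)

widths = np.diff(edges)
print(edges[0] == data.min(), edges[-1] == data.max(), np.allclose(widths, widths[0]))
```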
Line 155 “2.3.1 Observation datasets”: I think the word “reference” might more inclusively represent the datasets listed in this section. Some reanalysis datasets are discussed alongside, but they are often preferred to be differentiated from instrument-based observations.
Line 170 “(Ecmwf, 2000)”: Please capitalize all letters for ECMWF.
Line 557: The ERA5 reference is placed where it is not in right alphabetical order in the reference list.
Citation: https://doi.org/10.5194/egusphere-2024-1518-RC1
- AC1: 'Reply on RC1', Axel Lauer, 19 Sep 2024
RC2: 'Comment on egusphere-2024-1518', Anonymous Referee #2, 22 Jul 2024
This paper provides a description of some new functionality for ESMValTool. The ESMValTool is a key piece of software in the CMIP world, and these regular updates with new capabilities are extremely valuable. I am not a direct user of ESMValTool, so my comments here will be from a neutral perspective, but I will say that I'm more likely to become a user now that I've read this description. The paper is easy to read and understand, the topic is well within the scope of GMD by presenting new capabilities in a major software package, and these new capabilities are substantial enough to warrant a publication. From my reading, there are a few things that could be made more clear. The main weakness in my opinion is that the example problem used to show the new features seems both overly contrived and also not that compelling as an example of how to detect mistakes. I would not suggest getting rid of it entirely, but it might be useful to add another example. A few additional detailed comments are provided below.
1. The text in the abstract (and elsewhere in the paper) makes some assumptions about how model development proceeds. Specifically, there is clearly and explicitly an assumption that 'historical' simulations are continuously produced during development. From my understanding, this is true for some modeling centers, but not for all. Several centers I am aware of use the pre-industrial climate much more extensively during development. It is also important to note that some (maybe most?) model development happens in individual component models rather than in the coupled system. This is evident even in the EMAC example, which uses an AMIP setup. The discussion in this paper is very strongly focused on diagnostics/metrics of the atmosphere; does ESMValTool work as well for other components (land, ocean, sea-ice, land-ice)?
2. Twice within the first several sentences (lines 30 and 39), biology/chemistry/biogeochemistry are mentioned. This early text heavily stresses the aspects of ESMs that go beyond GCMs/AOGCMs. The description raises the big issue of fitness for purpose of these models (line 36) and the need to evaluate them carefully. This is all good and true, but is somewhat disconnected from the rest of the paper, which is focused on traditional physical quantities in the atmosphere. Without some example of metrics for chemistry/biology/biogeochemistry, I would suggest revising these first couple of paragraphs to more accurately set up what will be presented.
3. In Section 2.1, the ESMValTool is very briefly introduced. Too briefly, I think. While a detailed introduction to the software can be skipped, there are several things that would be useful if described here. First are the system requirements for running ESMValTool. Second are the dependencies that ESMValTool needs. In terms of dependencies, maybe this could be an abbreviated discussion if there are many Python packages needed, but it would be good to know if MPI, netCDF, etc. are required to be installed on the system, and if there are any compilers or languages that are needed.
4. At line 70, and again in the discussion, the use of wildcards is mentioned. Without having run ESMValTool, I was a little confused about this feature. It might be nice to have some kind of example to show what this really means for a user.
5. There are just a couple of small things that I'd recommend for Section 2.2. Lines 90-91 state that the lengths of time intervals are used as weights. This sounds correct, but a little more clarification would be useful. If using monthly means, does this mean that the different lengths of months are used to correctly weight seasonal and annual averages? What metadata is needed for ESMValTool to determine the lengths (e.g., in cases where there could be leap years, or if a model uses a 360-day year)? Does the weighting deal with non-uniform time intervals outside of monthly means? For the spatial weighting, the text says it uses grid cell area, but it is unclear if that is calculated or provided by the user. This seems especially important for non-uniform grids where the grid cells may not be rectangular. The use of "absolute bias" is possibly a little confusing, especially later (around line 360) when the "absolute value of the absolute bias" is used. My solution, which might not be optimal, would be to refer to X - R as the bias and to |X - R| as the absolute bias. I wondered whether it would be easier to only present equations for the weighted metrics and remind readers that normalized weights sum to unity and uniform weights sum to N.
6. The description of the EMD mostly makes sense, but I wonder whether Equation 7 and the description that follows is actually the special case of the 1-dimensional Wasserstein metric? I was comparing with some resources on the web (e.g., Wikipedia), and the description usually uses an infimum operator over the set of possible joint probability distributions (gamma). The description after the equation seems to have already done the minimization to get the optimal transport matrix. Maybe I'm just not understanding what the `min` operator in Equation 7 actually means? Maybe it is the minimization over all pdfs gamma in the n x m space? I'm not sure how the constraints are imposed, though; maybe only through the marginals? In which case, maybe this really is equivalent to the infimum. I did take a look at the Rubner et al. reference, but their description is in terms of the distance matrix and the flow matrix, and is similar to the other descriptions I looked at. Sorry, this is probably me being thick, but if Equation 7 could be made a little more clear in terms of the operator, that would be great. Since I'm on about it, I might as well also ask how the distance is assigned in this implementation. Specifically, since the EMD is only applied to 1-d histograms, I wonder if the distance between every bin is just 1 or if it is the difference between each bin center value and every other bin center value?
7. A few questions about the datasets described in Section 2.3. First, I was going to ask if these datasets are being distributed with ESMValTool, but then I saw at the end that they are not. It's nice that downloading and processing scripts are supplied; I imagine that some of those datasets require some kind of registration to get access? Are all the processing tools just using the same functions that are built into ESMValTool, or are they totally separate? The reasons for choosing these data are not really given, and all of them have several valid alternatives. Especially if these are being advocated as good reference data for use in evaluating ESMs, it might be good to add some words noting why these might be preferred over their alternatives. In the GPCP-SG data, I was not sure what the "SG" actually means. For the ISCCP-FH data, are there spurious trends through the late 1990s in the fluxes like in the ISCCP-H data?
8. The EMAC experiment and the presentation of the example analysis did not seem all that convincing. It was surprising to me that replacing the SST with the zonal mean SST didn't lead to much, much larger errors. I wonder if some of the zonal asymmetries of the real world are being smoothed by the coarse resolution of EMAC (is the resolution stated?). At line 290, the small error in the seasonal cycle of SST is noted for the first 5 years in EMAC, but this isn't explained for readers who might not be familiar with these kinds of model experiments. The EMAC run is being forced with observed SST and ice, but is being compared to coupled runs that produce their own SST, so it isn't a fair comparison. Actually, I was wondering why Figure 3b shows such large RMSE given the prescribed SST? Is it because the reference data set is calculated over a different time interval, or is it because EMAC has some bias in 2 m temperature that makes it more different from SST than in most models? The text seems to follow a line of thinking that this EMAC example is similar to what would be done during model development, but I don't think that a preliminary "quick look" analysis of a model run would split the dataset into time periods. I think a more plausible storyline for this analysis would be that the user takes all 10 years of simulation and runs ESMValTool on that, and only upon noting something that looks suspicious would they go back and start looking at different time periods. I'd suggest putting the metrics for the full 10 years on the plots.
9. An aspect of the example analysis that I thought was missing is a little more detail about the choices that the ESMValTool user needs to make. Are there decisions about how regions are defined (e.g., the tropics), what land/sea mask to use, and the temporal sampling (of test case versus reference)? How does that look in a user's "recipe"? Also, nothing is noted about how ESMValTool handles things when something doesn't work. What happens if the reference dataset doesn't match the specified time period? What happens if the CMIP model data download fails or is very slow? What happens if a specified variable is not available in the test case or the reference case? These are probably in the documentation, but it'd be nice to get an idea here to set some expectations.
10. In the figures with stippling, are there options for how to handle the stippling? I'd imagine that most science applications would want to swap the stippling convention to emphasize where there is robust agreement among data sets. And there would also be users who would want to completely mask the stippled area instead of just obscuring it.
11. For monitoring a simulation, as discussed around line 367 and elsewhere and throughout the EMAC examples, does ESMValTool actually provide utilities for doing these time slices? That is, can the user specify to produce all diagnostics for particular time periods or time chunk sizes (e.g., 5 years)? Or is it the responsibility of the user to repeatedly apply ESMValTool to their data subsets?
12. For the portrait plot, are there options to specify how to display the results? It's easy to imagine use cases that would have more test cases than CMIP models. For example, one might want to put the CMIP results in one figure and all the test cases in another one. Can more than 2 reference data sets be used?
13. During the wrap-up of the paper, around line 448, the netCDF output from ESMValTool is mentioned. From previous reading, I think I remember that ESMValTool produces plots and maybe some kind of browsable output? Maybe it would be worth mentioning what the outputs of ESMValTool are.
Citation: https://doi.org/10.5194/egusphere-2024-1518-RC2
- AC2: 'Reply on RC2', Axel Lauer, 19 Sep 2024
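On the reviewer's question about the 1-D EMD: for one-dimensional distributions, the Wasserstein-1 distance has a closed form that needs no explicit minimisation over transport plans; it equals the integral of the absolute difference between the two cumulative distributions, with the ground distance given by the separation of the bin centres. A small self-contained sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

# For 1-D histograms on a common set of bin centres, the EMD
# (1-Wasserstein distance) reduces to the integral of |CDF_p - CDF_q|,
# so no explicit optimisation over transport matrices is needed.
centres = np.linspace(0.0, 9.0, 10)    # bin centres, spacing 1.0
p = np.zeros(10); p[2] = 1.0           # all mass in bin 2
q = np.zeros(10); q[5] = 1.0           # all mass in bin 5

spacing = centres[1] - centres[0]
emd = np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * spacing
print(emd)   # mass 1 moved a distance of 3 bin widths -> EMD = 3.0
```

This also answers the ground-distance question for the 1-D case: if the distance is the difference between bin centre values (as assumed here), moving all mass three bins costs 3; if every pair of bins had distance 1, it would cost 1 regardless of separation.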
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
355 | 98 | 155 | 608 | 14 | 19 |