the Creative Commons Attribution 4.0 License.
New framework for benchmarking decadal predictions leveraging the PCMDI Metrics Package with interactive visualization
Abstract. Reliable climate predictions across multiple timescales are increasingly critical as climate-related risks continue to rise. With the growing number and diversity of climate prediction systems, systematic intercomparison has become essential. Here, we present a comprehensive evaluation framework based on the PCMDI Metrics Package to assess the performance of multiple decadal climate prediction systems. Unlike uninitialized simulations, initialized predictions exhibit bias and predictive skill that evolve with forecast lead time. To address this, we introduce (1) model-by-lead-time portrait plots, which efficiently summarize metrics of global temperature, precipitation, and Arctic/Antarctic sea-ice extent, and (2) an HTML-based interactive visualization platform that provides detailed regional and seasonal diagnostics of model bias, skill scores, and ensemble spread for each model and lead time. Comparisons with uninitialized simulations further quantify the relative impacts of initialization and external forcing on prediction skill. The proposed framework provides a scalable and transparent approach for multi-model climate prediction assessments and can be readily extended to a wide range of operational and research forecasting systems.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Geoscientific Model Development.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 29 Apr 2026)
- RC1: 'Comment on egusphere-2026-958', Anonymous Referee #1, 31 Mar 2026
- RC2: 'Comment on egusphere-2026-958', Anonymous Referee #2, 08 Apr 2026
This manuscript presents a framework and results from a new benchmarking tool for decadal prediction experiments that is based on the PCMDI Metrics Package. The new benchmarking tool fills a critical gap in the ecosystem of standardized benchmarking software by being engineered to handle aspects of prediction datasets, such as forecast lead time. With this tool, there is a significant opportunity to understand how initialization impacts prediction skill, as well as how quickly long-term mean biases in climate models emerge in shorter decadal-scale experiments. The initial application of the tool to DCPP experiments reveals several interesting themes. Long-term mean biases tend to emerge during the decadal simulations for the three primary climate variables considered in this study: surface air temperature, precipitation, and sea ice extent. Trend biases also emerge in decadal predictions in some key processes, including wintertime mid-latitude precipitation in the Northern Hemisphere. The decadal prediction models also struggle to represent seasonal trends in Antarctic sea ice extent, but the results of the benchmarking analysis suggest that biases are reduced at shorter timescales, benefiting from the model initialization. Overall, the results suggest that much more work is needed to improve the fidelity of precipitation simulation. The manuscript is well written, and the results demonstrate the novelty and utility of this new benchmarking approach. After some revision, this manuscript and the benchmarking tool will be important contributions to the field.
Major Comments:
1. The manuscript is a bit light on context in some key areas. The second-to-last paragraph of the introduction falls short of demonstrating what software and approaches already exist, and why this approach is novel. I would recommend a more thorough review of past efforts and existing tools to better frame the new work. Similarly, benchmarking diagnostics differ distinctly from process-based diagnostics, where the latter are better suited to understanding why the models perform the way that they do. This is an important caveat that needs to be covered in both the introduction and the conclusions. Benchmarking diagnostics are sometimes suggestive of possible reasons for biases in climate variables, but are not definitive.
2. The introduction mentions the importance of model drift (third-to-last paragraph), but drift is largely ignored throughout the rest of the manuscript. The methods section would benefit from a more in-depth discussion of model drifts in decadal simulations and how the benchmarking tool handles them.
3. Statistical significance is largely absent from the benchmarking metrics. The benchmarking tool would benefit greatly by incorporating measures of statistical significance, for example, by shading/stippling on portrait plots and maps.
Minor Comments:
Lines 104-105: The regional approach is fine, but the comment about regridding errors is curious. Ocean model grid remapping is not an intractable problem.
Section 2.4: This section on the display interface should be expanded. More details on how it is done, why it is important, and the design choices made would be helpful.
Figures 1, 2, 5, 6, 9, 12, 13: The “LY” labels might be better suited along the bottom of the plot. They are a bit lost with the plot titles.
Figure 10: It is difficult to see the yellow contours in these panels.
Citation: https://doi.org/10.5194/egusphere-2026-958-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 130 | 63 | 15 | 208 | 12 | 19 |
This manuscript presents a multi-model evaluation framework for initialized decadal climate prediction, implemented within the PCMDI Metrics Package, featuring two diagnostic tools: (1) a model-by-lead-time portrait plot and (2) an interactive HTML visualization platform, to assess how biases and prediction skill evolve with forecast lead time across temperature, precipitation, and sea ice. Application of the framework reveals that temperature biases drift toward each model's climatology over time, precipitation biases reflect systematic model physics errors, and sea-ice skill degrades rapidly with lead time while remaining closely coupled to temperature prediction quality. Overall, this work offers great practical value to the decadal predictability community; the publicly accessible diagnostics provide a useful reference that researchers can readily draw on for their own work. It is completely understandable that a single manuscript could not provide detailed mechanistic explanations for every systematic bias identified, and I expect this framework will serve as a springboard for many future studies in this area.
Minor comments:
1. The authors state in Lines 149–150 that forecasts are expected to drift toward the model's biased climatology as lead time increases. However, Figure 1 shows that tropical TAS biases in most models actually decrease with lead time rather than grow, which is somewhat counterintuitive, since one would expect small biases at short lead times that would then amplify as the forecast drifts toward the model climatology. Could the authors comment on this?
2. Line 160: It would be helpful to include a brief discussion of the physical reasons behind the tendency for models to exhibit a wet bias in the tropics and a dry bias in the mid-latitudes, rather than relying solely on the brief attribution to the summer ITCZ at Line 196.
3. Figure 1: Comparing TAS and PR results, TAS biases appear to evolve with lead time while PR biases remain largely constant across all initializations. A short comment on why precipitation biases are more stable with lead time than temperature biases would be great.