the Creative Commons Attribution 4.0 License.
Actionable reporting of CPU-GPU performance comparisons: Insights from a CLUBB case study
Abstract. Graphics Processing Units (GPUs) are becoming increasingly central to high-performance computing (HPC), but fair comparison with central processing units (CPUs) remains challenging, particularly for applications that can be subdivided into smaller workloads. Traditional metrics such as speedup ratios can overstate GPU advantages and obscure the conditions under which CPUs are competitive, as they depend strongly on workload choice. We introduce two peak-based performance metrics, the Peak Ratio Crossover (PRC) and the Peak-to-Peak Ratio (PPR), which provide clearer comparisons by accounting for the best achievable performance of each device. Using a case study of the Cloud Layers Unified by Binormals (CLUBB) standalone model, we demonstrate these metrics in practice, show how they can guide execution strategy, and examine how they shift under factors that affect workload. We further analyze how implementation choices and code structure influence these metrics, showing how they enable performance comparisons to be expressed in a concise and actionable way, while also helping identify which optimization efforts should be prioritized to meet different performance goals.
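The peak-based metrics described above can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: it assumes PPR is the ratio of the two devices' best achievable throughputs, and PRC is the smallest workload at which the GPU's throughput first reaches the CPU's peak. The throughput numbers in the example are made up, not results from the paper.

```python
# Minimal sketch (not from the paper): computing peak-based comparison
# metrics from measured throughput curves (columns per second vs. workload).

def peak_to_peak_ratio(gpu_throughput, cpu_throughput):
    """PPR: ratio of each device's best achievable throughput,
    wherever each peak occurs along the workload axis."""
    return max(gpu_throughput) / max(cpu_throughput)

def peak_ratio_crossover(workloads, gpu_throughput, cpu_throughput):
    """PRC (assumed definition): smallest workload at which the GPU's
    throughput first reaches the CPU's peak; None if it never does."""
    cpu_peak = max(cpu_throughput)
    for w, g in zip(workloads, gpu_throughput):
        if g >= cpu_peak:
            return w
    return None

# Illustrative (made-up) throughput curves: the CPU saturates at a small
# batch, while the GPU needs a large batch of columns to be fully utilized.
workloads = [64, 256, 1024, 4096, 16384]   # columns per batch
cpu = [100, 180, 200, 200, 200]            # columns/s
gpu = [50, 150, 400, 900, 1000]            # columns/s

print(peak_to_peak_ratio(gpu, cpu))               # 5.0
print(peak_ratio_crossover(workloads, gpu, cpu))  # 1024
```

In this toy example the GPU's peak throughput is 5x the CPU's (PPR = 5.0), but the GPU only overtakes the CPU's peak once the batch reaches 1024 columns (the PRC), which is the kind of information a single speedup number hides.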
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4435', Georgiana Mania, 05 Dec 2025
AC1: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
Thanks for the review and the positivity. We've addressed all the comments for the revised manuscript.

A broader cross-vendor GPU comparison would definitely be interesting, especially given the ever-evolving landscape of GPU architectures and compilation/optimization methods. This concern about presenting unfair or incomplete comparisons of available hardware was a large part of why this paper focuses on the performance metrics and reporting methods rather than on specific performance comparisons between different hardware configurations.

The comment about the execution model was a bit too vague. It was meant to be specifically about the execution model, which does indeed mirror the execution model when CLUBB is embedded in a GCM, but could easily be interpreted as a more general statement about performance extrapolation. We kept the original phrasing about the execution model but added a clarification immediately afterward to make clear that absolute timings may not extrapolate well to host-model runs.

Excellent idea for the plots, thank you. This makes them much easier to interpret. We revised the figures to use the same color scheme for results that use the same number of vertical levels in both Fig. 3 and Fig. 4.

The wording did seem too general in the OpenACC-vs-OpenMP section. We revised it so that the comparison is explicitly framed as applying to the NVIDIA GPUs and nvfortran compiler stack used in this study. We also clarified that the OpenMP directives were obtained through Intel's migration tool from the OpenACC source, which could also have an impact on the performance of the OpenMP version.

Citation: https://doi.org/10.5194/egusphere-2025-4435-AC1
AC2: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
RC2: 'Comment on egusphere-2025-4435', Anonymous Referee #2, 09 Feb 2026
This paper gives a detailed performance analysis of the CLUBB cloud and turbulence parameterization, including a new GPU OpenACC-based port. CLUBB is an important and expensive component used by modern atmospheric models. The performance work described in this paper is excellent. The authors give a thorough evaluation of both GPU and CPU performance as a function of workload (number of vertical physics columns per device), compilers, precision, vertical levels, and loop structure. They go to great lengths to present a fair CPU vs GPU comparison, comparing well-performing CPU code to similarly well-performing GPU code. They use several metrics (PPR, PRC, ATR, DRC) and carefully discuss the strengths and weaknesses of CPUs and GPUs. The thoroughness and fairness of the paper puts it above many other papers which present misleading GPU speedup numbers.

The paper is well written with clear arguments.
I only have two minor suggestions:
1. Section 2, line 92: The wording of the sentence containing the Sun et al. 2023 reference could be clarified. At first reading, I assumed the authors were saying that the column loop changes in CLUBB were described in Sun et al. 2023, but I think the CLUBB changes are due to this work and Sun et al. 2023 was discussing similar work in a different parameterization (PUMAS).

2. Conclusions: I think most readers will be interested in how the CLUBB performance numbers will impact global atmospheric models, in typical regimes which are running at the limit of strong scaling to get the maximum possible throughput, or running on fewer nodes to maximize efficiency and maximize ensemble throughput. The authors' detailed benchmarks and metrics will allow any motivated reader to answer this question. But I think it would save the readers a little time if the authors added a short discussion explaining the relation between PPR and PRC and these strong-scaling or maximum-efficiency global model configurations.

Citation: https://doi.org/10.5194/egusphere-2025-4435-RC2
AC3: 'Reply on RC2', Gunther Huebler, 10 Mar 2026
Thank you for the positive review and constructive comments. We've made updates to the manuscript in response.

Good call with the Sun et al. reference; the original wording was ambiguous. We revised the sentence to make clear that Sun et al. (2023) describes an analogous restructuring in the separate PUMAS codebase, not the CLUBB changes mentioned in this paper.

This is an excellent point about the connection to strong/weak scaling. The way we've defined and used the metrics does relate to the strong-scaling vs. weak-scaling analyses that are common in the global model community, but we hadn't explicitly made this connection in the manuscript. The weak-scaling connection is fairly direct: the PPR compares each device at its own most favorable workload and assumes we can keep the devices fully utilized, which is exactly what you would want to do in a weak-scaling analysis. The connection to the strong-scaling analysis is a little more subtle, because that usually involves subdividing a fixed problem size across different numbers of cores/devices, which is not exactly the same as subdividing a fixed problem size into different batches, but it is still close. In the revised manuscript, we added this interpretation in two places: first in the introduction, where PPR and PRC are initially defined, and again in the summary sentence following Fig. 2, where the paper discusses which metric is most applicable in different situations.
Model code and software
GitHub repo of CLUBB code Gunther Huebler and Vincent Larson https://github.com/larson-group/clubb_release/tree/clubb_performance_testing
Zenodo archive of CLUBB code and profiling results Gunther Huebler https://doi.org/10.5281/zenodo.17081296
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 460 | 584 | 39 | 1,083 | 31 | 27 |
The paper advances the science by introducing two new metrics, Peak Ratio Crossover (PRC) and Peak-to-Peak Ratio (PPR), which allow for a better comparison between CPU and GPU performance of a given application. The authors demonstrate the benefits of using these metrics on a single-column parametrisation of turbulence and clouds, Cloud Layers Unified by Binormals (CLUBB), which exposes several levels of parallelism that can be exploited differently by heterogeneous architectures. Several use cases show the impact of batch size, precision, asynchronous execution, device type and coding optimisations on the application's throughput, expressed in columns per second. The paper is well written and the claims are well supported by experiments and profiling data which naturally drive the conclusions. I would recommend the publication of the manuscript.
There are a few comments, listed below, which could be tackled by the authors in a minor revision: