This work is distributed under the Creative Commons Attribution 4.0 License.
Actionable reporting of CPU-GPU performance comparisons: Insights from a CLUBB case study
Abstract. Graphics Processing Units (GPUs) are becoming increasingly central to high-performance computing (HPC), but fair comparison with central processing units (CPUs) remains challenging, particularly for applications that can be subdivided into smaller workloads. Traditional metrics such as speedup ratios can overstate GPU advantages and obscure the conditions under which CPUs are competitive, as they depend strongly on workload choice. We introduce two peak-based performance metrics, the Peak Ratio Crossover (PRC) and the Peak-to-Peak Ratio (PPR), which provide clearer comparisons by accounting for the best achievable performance of each device. Using a performance case study of the Cloud Layers Unified by Binormals (CLUBB) standalone model, we demonstrate these metrics in practice, show how they can guide execution strategy, and examine how they shift under factors that affect workload. We further analyze how implementation choices and code structure influence these metrics, showing how they enable performance comparisons to be expressed in a concise and actionable way, while also helping identify which optimization efforts should be prioritized to meet different performance goals.
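The abstract does not spell out the metric definitions, so the following is a minimal illustrative sketch only, under two assumptions: that the Peak-to-Peak Ratio (PPR) compares each device's best throughput across the tested workloads, and that the Peak Ratio Crossover (PRC) is the smallest workload at which the GPU first exceeds the CPU's peak. All workloads and throughputs below are hypothetical.

```python
# Illustrative sketch only; the PPR/PRC definitions here are assumptions,
# not the paper's reference formulation.

def peak_metrics(workloads, cpu_tput, gpu_tput):
    """workloads: columns per device; *_tput: measured throughput in columns/s."""
    cpu_peak = max(cpu_tput)   # best the CPU achieves at any tested workload
    gpu_peak = max(gpu_tput)   # best the GPU achieves at any tested workload
    ppr = gpu_peak / cpu_peak  # assumed Peak-to-Peak Ratio
    # Assumed Peak Ratio Crossover: first workload where the GPU beats the
    # CPU's peak throughput (None if it never does over the tested range).
    prc = next((w for w, g in zip(workloads, gpu_tput) if g > cpu_peak), None)
    return ppr, prc

# Hypothetical throughput curves (columns per second) vs. columns per device.
workloads      = [64, 256, 1024, 4096, 16384]
cpu_cols_per_s = [5.0e4, 5.2e4, 5.3e4, 5.3e4, 5.2e4]
gpu_cols_per_s = [1.0e4, 3.8e4, 9.0e4, 1.6e5, 1.9e5]

ppr, prc = peak_metrics(workloads, cpu_cols_per_s, gpu_cols_per_s)
print(f"PPR ~ {ppr:.1f}x, PRC ~ {prc} columns per device")
```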
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4435', Georgiana Mania, 05 Dec 2025
- RC2: 'Comment on egusphere-2025-4435', Anonymous Referee #2, 09 Feb 2026
This paper gives a detailed performance analysis of the CLUBB cloud and turbulence parameterization, including a new GPU OpenACC-based port. CLUBB is an important and expensive component used by modern atmospheric models. The performance work described in this paper is excellent. The authors give a thorough evaluation of both GPU and CPU performance as a function of workload (number of vertical physics columns per device), compilers, precision, vertical levels, and loop structure. They go to great lengths to present a fair CPU vs GPU comparison, comparing well-performing CPU code to similarly well-performing GPU code. They use several metrics (PPR, PRC, ATR, DRC) and carefully discuss the strengths and weaknesses of CPUs and GPUs. The thoroughness and fairness of the paper puts it above many other papers which present misleading GPU speedup numbers. The paper is well written with clear arguments.
I only have two minor suggestions:
1. Section 2: line 92:
The wording of the sentence containing the Sun et al. 2023 reference could be clarified. At first reading, I assumed the authors were saying that the column loop changes in CLUBB were described in Sun et al. 2023, but I think the CLUBB changes are due to this work, and Sun et al. 2023 was discussing similar work in a different parameterization (PUMAS).
2. Conclusions:
I think most readers will be interested in how the CLUBB performance numbers will impact global atmospheric models, in typical regimes which are running at the limit of strong scaling to get the maximum possible throughput, or running on fewer nodes to maximize efficiency and maximize ensemble throughput. The authors' detailed benchmarks and metrics will allow any motivated reader to answer this question. But I think it would save the readers a little time if the authors added a short discussion explaining the relation between PPR and PRC and these strong-scaling or maximum-efficiency global model configurations.
Citation: https://doi.org/10.5194/egusphere-2025-4435-RC2
Model code and software
GitHub repo of CLUBB code Gunther Huebler and Vincent Larson https://github.com/larson-group/clubb_release/tree/clubb_performance_testing
Zenodo archive of CLUBB code and profiling results Gunther Huebler https://doi.org/10.5281/zenodo.17081296
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 315 | 399 | 25 | 739 | 28 | 22 |
The paper advances the science by introducing two new metrics, Peak Ratio Crossover (PRC) and Peak-to-Peak Ratio (PPR), which allow for a better comparison between CPU and GPU performance of a given application. The authors demonstrate the benefits of using these metrics on a single-column parametrisation of turbulence and clouds, Cloud Layers Unified by Binormals (CLUBB), which exposes several levels of parallelism that can be exploited differently by heterogeneous architectures. Several use cases show the impact of batch size, precision, asynchronous execution, device type and coding optimisations on the application's throughput, expressed in columns per second. The paper is well written and the claims are well supported by experiments and profiling data which naturally drive the conclusions. I would recommend the publication of the manuscript.
There are a few comments, listed below, which could be tackled by the authors in a minor revision: