the Creative Commons Attribution 4.0 License.
Actionable reporting of CPU-GPU performance comparisons: Insights from a CLUBB case study
Abstract. Graphics Processing Units (GPUs) are becoming increasingly central to high-performance computing (HPC), but fair comparison with central processing units (CPUs) remains challenging, particularly for applications that can be subdivided into smaller workloads. Traditional metrics such as speedup ratios can overstate GPU advantages and obscure the conditions under which CPUs are competitive, as they depend strongly on workload choice. We introduce two peak-based performance metrics, the Peak Ratio Crossover (PRC) and the Peak-to-Peak Ratio (PPR), which provide clearer comparisons by accounting for the best achievable performance of each device. Using a case study of the Cloud Layers Unified by Binormals (CLUBB) standalone model, we demonstrate these metrics in practice, show how they can guide execution strategy, and examine how they shift under factors that affect workload. We further analyze how implementation choices and code structure influence these metrics, showing how they enable performance comparisons to be expressed in a concise and actionable way, while also helping identify which optimization efforts should be prioritized to meet different performance goals.
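The abstract defines the two metrics only informally. As a rough illustration of the idea, the sketch below computes them from hypothetical throughput-versus-batch-size measurements; the numbers are invented, and the reading of PRC as the smallest workload at which GPU throughput exceeds the CPU's peak is an assumption from context, not the paper's formal definition.

```python
# Illustrative sketch only: the paper's formal definitions may differ.
# Hypothetical throughput measurements (columns/second) versus workload
# (columns per batch), for one CPU and one GPU configuration.
cpu = {64: 5.0e4, 256: 1.2e5, 1024: 1.5e5, 4096: 1.6e5, 16384: 1.6e5}
gpu = {64: 2.0e4, 256: 8.0e4, 1024: 3.0e5, 4096: 9.0e5, 16384: 1.1e6}

cpu_peak = max(cpu.values())
gpu_peak = max(gpu.values())

# Peak-to-Peak Ratio: each device is compared at its own most
# favorable workload, i.e. best achievable vs. best achievable.
ppr = gpu_peak / cpu_peak

# Peak Ratio Crossover (as we read the abstract): the smallest workload
# at which the GPU's throughput exceeds the CPU's peak throughput.
prc = min(n for n, t in sorted(gpu.items()) if t > cpu_peak)

print(f"PPR = {ppr:.2f}, PRC = {prc} columns")
```

With these made-up curves the GPU's peak throughput is several times the CPU's, but the GPU only overtakes the CPU's peak once batches reach roughly a thousand columns, which is exactly the kind of workload-dependent nuance a single speedup number hides.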
Status: closed
RC1: 'Comment on egusphere-2025-4435', Georgiana Mania, 05 Dec 2025
The paper advances the science by introducing two new metrics, Peak Ratio Crossover (PRC) and Peak-to-Peak Ratio (PPR), which allow for a better comparison between CPU and GPU performance of a given application. The authors demonstrate the benefits of using these metrics on a single-column parametrisation of turbulence and clouds, Cloud Layers Unified by Binormals (CLUBB), which exposes several levels of parallelism that can be exploited differently by heterogeneous architectures. Several use cases show the impact of batch size, precision, asynchronous execution, device type and coding optimisations on the application's throughput, expressed in columns per second. The paper is well written and the claims are well supported by experiments and profiling data which naturally drive the conclusions. I would recommend the publication of the manuscript.
There are a few comments, listed below, which could be tackled by the authors in a minor revision:
- The GPU in the title is a broad topic, while most of the findings are based on experiments with NVIDIA GPUs. It would be interesting to see if all the conclusions still hold on MI250x (given different warp size and different generated assembly code). That is more for the future, not needed to be added for the current manuscript.
- Line 107: "This setup .. closely mirrors the execution model.." - This is usually only partially true because a standalone has a smaller memory footprint in terms of the instruction cache.
- Fig 3 / page 9 could be improved if the plots with the same number of levels would have the same colour, since the vertical levels are in focus here. E.g. AMD7763_2x128_34nz and A100_4x4_34nz.
- Section 5.4 OpenACC vs OpenMP - Firstly, the experiments were done with the NVIDIA compiler, which is known to favour OpenACC (e.g. the amount of optimisations for it is significantly larger than for OpenMP), so double-checking the results with another compiler could bring new information. Secondly, Fortran + OpenMP is known to perform better than Fortran + OpenACC on AMD hardware (using the AMD ROCm compiler stack), so without a similar experiment on AMD, I would rephrase the findings in lines 297-306 as limited to NVIDIA hardware using the NVIDIA compiler family.
- Since the manuscript is not anonymised, maybe the authors can write their full names in lines 518-520.
Citation: https://doi.org/10.5194/egusphere-2025-4435-RC1
AC1: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
Thanks for the review and the positivity. We've addressed all the comments for the revised manuscript.

A broader cross-vendor GPU comparison would definitely be interesting, especially given the ever-evolving landscape of GPU architectures and compilation/optimization methods. This concern of presenting unfair/incomplete comparisons of available hardware was a large part of why this paper focuses on the performance metrics and reporting methods rather than on the specific performance comparisons between different hardware configurations.

The comment about the execution model was a bit too vague. It was meant to be specifically about the execution model, which does indeed mirror the execution model when CLUBB is embedded in a GCM, but could easily be interpreted as a more general statement about performance extrapolation. We kept the original phrasing about the execution model, but added a clarification immediately afterward to make clear that absolute timings may not extrapolate well to host-model runs.

Excellent idea for the plots, thank you. This makes them much easier to interpret. We revised the figures to use the same color scheme for results which use the same number of vertical levels in both Fig. 3 and Fig. 4.

The wording did seem too general in the OpenACC-vs-OpenMP section. We revised it so that the comparison is explicitly framed as applying to the NVIDIA GPUs and nvfortran compiler stack used in this study. We also clarified that the OpenMP directives were obtained through Intel's migration tool from the OpenACC source - which could also have an impact on the performance of the OpenMP version.

Citation: https://doi.org/10.5194/egusphere-2025-4435-AC1

AC2: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
RC2: 'Comment on egusphere-2025-4435', Anonymous Referee #2, 09 Feb 2026
This paper gives a detailed performance analysis of the CLUBB cloud and turbulence parameterization, including a new GPU OpenACC-based port. CLUBB is an important and expensive component used by modern atmospheric models. The performance work described in this paper is excellent. The authors give a thorough evaluation of both GPU and CPU performance as a function of workload (number of vertical physics columns per device), compilers, precision, vertical levels and loop structure. They go to great lengths to present a fair CPU vs GPU comparison, comparing well-performing CPU code to similarly well-performing GPU code. They use several metrics (PPR, PRC, ATR, DRC) and carefully discuss the strengths and weaknesses of CPUs and GPUs. The thoroughness and fairness of the paper puts it above many other papers which present misleading GPU speedup numbers. The paper is well written with clear arguments.
I only have two minor suggestions:
1. Section 2: line 92:
The wording of the sentence containing the Sun et al. 2023 reference could be clarified. At first reading, I assumed the authors were saying that the column loop changes in CLUBB were described in Sun et al. 2023, but I think the CLUBB changes are due to this work and Sun et al. 2023 was discussing similar work in a different parameterization (PUMAS).
2. Conclusions:
I think most readers will be interested in how the CLUBB performance numbers will impact global atmospheric models, in typical regimes which are running at the limit of strong scaling to get the maximum possible throughput, or running on fewer nodes to maximize efficiency and maximize ensemble throughput. The authors' detailed benchmarks and metrics will allow any motivated reader to answer this question. But I think it would save the readers a little time if the authors added a short discussion explaining the relation between PPR and PRC and these strong-scaling or maximum-efficiency global model configurations.
Citation: https://doi.org/10.5194/egusphere-2025-4435-RC2
AC3: 'Reply on RC2', Gunther Huebler, 10 Mar 2026
Thank you for the positive review and constructive comments. We've made updates to the manuscript in response.

Good call with the Sun et al. reference; the original wording was ambiguous. We revised the sentence to make clear that Sun et al. (2023) describes an analogous restructuring in the separate PUMAS codebase, not the CLUBB changes mentioned in this paper.

This is an excellent point about the connection to strong/weak scaling. The way we've defined and used the metrics does relate to the strong-scaling vs weak-scaling analyses that are common in the global model community, but we hadn't explicitly made this connection in the manuscript. The weak-scaling connection is pretty direct -- the PPR compares each device at its own most favorable workload, and assumes we can keep the devices fully utilized, which is exactly what you would want to do in a weak-scaling analysis. The connection to the strong-scaling analysis is a little more subtle, because that usually involves subdividing a fixed problem size across different numbers of cores/devices, which is not exactly the same as subdividing a fixed problem size into different batches, but is still pretty close. In the revised manuscript, we added this interpretation in two places: first in the introduction, where PPR and PRC are initially defined, and again in the summary sentence following Fig. 2, where the paper discusses which metric is most applicable in different situations.
Model code and software
GitHub repo of CLUBB code Gunther Huebler and Vincent Larson https://github.com/larson-group/clubb_release/tree/clubb_performance_testing
Zenodo archive of CLUBB code and profiling results Gunther Huebler https://doi.org/10.5281/zenodo.17081296
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 477 | 602 | 39 | 1,118 | 32 | 28 |