the Creative Commons Attribution 4.0 License.
Actionable reporting of CPU-GPU performance comparisons: Insights from a CLUBB case study
Abstract. Graphics Processing Units (GPUs) are becoming increasingly central to high-performance computing (HPC), but fair comparison with central processing units (CPUs) remains challenging, particularly for applications that can be subdivided into smaller workloads. Traditional metrics such as speedup ratios can overstate GPU advantages and obscure the conditions under which CPUs are competitive, as they depend strongly on workload choice. We introduce two peak-based performance metrics, the Peak Ratio Crossover (PRC) and the Peak-to-Peak Ratio (PPR), which provide clearer comparisons by accounting for the best achievable performance of each device. Using a case study of the Cloud Layers Unified by Binormals (CLUBB) standalone model, we demonstrate these metrics in practice, show how they can guide execution strategy, and examine how they shift under factors that affect workload. We further analyze how implementation choices and code structure influence these metrics, showing how they enable performance comparisons to be expressed in a concise and actionable way, while also helping identify which optimization efforts should be prioritized to meet different performance goals.
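The abstract defines the two metrics only informally. As a rough illustration of the idea, the sketch below computes them from hypothetical throughput-versus-batch-size measurements; the numbers are invented, and the reading of PRC as the smallest workload at which GPU throughput exceeds the CPU's peak is an assumption from context, not the paper's formal definition.

```python
# Illustrative sketch only: the paper's formal definitions may differ.
# Hypothetical throughput measurements (columns/second) versus workload
# (columns per batch), for one CPU and one GPU configuration.
cpu = {64: 5.0e4, 256: 1.2e5, 1024: 1.5e5, 4096: 1.6e5, 16384: 1.6e5}
gpu = {64: 2.0e4, 256: 8.0e4, 1024: 3.0e5, 4096: 9.0e5, 16384: 1.1e6}

cpu_peak = max(cpu.values())
gpu_peak = max(gpu.values())

# Peak-to-Peak Ratio: each device is compared at its own most
# favorable workload, i.e. best achievable vs. best achievable.
ppr = gpu_peak / cpu_peak

# Peak Ratio Crossover (as we read the abstract): the smallest workload
# at which the GPU's throughput exceeds the CPU's peak throughput.
prc = min(n for n, t in sorted(gpu.items()) if t > cpu_peak)

print(f"PPR = {ppr:.2f}, PRC = {prc} columns")
```

With these made-up curves the GPU's peak throughput is several times the CPU's, but the GPU only overtakes the CPU's peak once batches reach roughly a thousand columns, which is exactly the kind of workload-dependent nuance a single speedup number hides.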
Status: closed
RC1: 'Comment on egusphere-2025-4435', Georgiana Mania, 05 Dec 2025
The paper advances the science by introducing two new metrics, Peak Ratio Crossover (PRC) and Peak-to-Peak Ratio (PPR), which allow for a better comparison between CPU and GPU performance of a given application. The authors demonstrate the benefits of using these metrics on a single-column parametrisation of turbulence and clouds, Cloud Layers Unified by Binormals (CLUBB), which exposes several levels of parallelism that can be exploited differently by heterogeneous architectures. Several use cases show the impact of batch size, precision, asynchronous execution, device type and coding optimisations on the application's throughput, expressed in columns per second. The paper is well written and the claims are well supported by experiments and profiling data which naturally drive the conclusions. I would recommend the publication of the manuscript.
There are a few comments, listed below, which could be tackled by the authors in a minor revision:
- The GPU in the title is a broad topic, while most of the findings are based on experiments with NVIDIA GPUs. It would be interesting to see if all the conclusions still hold on MI250x (given different warp size and different generated assembly code). That is more for the future, not needed to be added for the current manuscript.
- Line 107: "This setup .. closely mirrors the execution model.." - This is usually only partially true because a standalone has a smaller memory footprint in terms of the instruction cache.
- Fig 3 / page 9 could be improved if the plots with the same number of levels would have the same colour, since the vertical levels are in focus here. E.g. AMD7763_2x128_34nz and A100_4x4_34nz.
- Section 5.4 OpenACC vs OpenMP - Firstly, the experiments were done with the NVIDIA compiler, which is known to favour OpenACC (e.g. the amount of optimisations for it is significantly larger than for OpenMP), so double-checking the results with another compiler could bring new information. Secondly, Fortran + OpenMP is known to perform better than Fortran + OpenACC on AMD hardware (using the AMD ROCm compiler stack), so without a similar experiment on AMD, I would rephrase the findings in lines 297-306 as limited to NVIDIA hardware using the NVIDIA compiler family.
- Since the manuscript is not anonymised, maybe the authors can write their full names in lines 518-520.
Citation: https://doi.org/10.5194/egusphere-2025-4435-RC1
AC1: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
Thanks for the review and the positivity. We've addressed all the comments for the revised manuscript.

A broader cross-vendor GPU comparison would definitely be interesting, especially given the ever-evolving landscape of GPU architectures and compilation/optimization methods. This concern of presenting unfair/incomplete comparisons of available hardware was a large part of why this paper focuses on the performance metrics and reporting methods rather than on the specific performance comparisons between different hardware configurations.

The comment about the execution model was a bit too vague. It was meant to be specifically about the execution model, which does indeed mirror the execution model when CLUBB is embedded in a GCM, but could easily be interpreted as a more general statement about performance extrapolation. We kept the original phrasing about the execution model, but added a clarification immediately afterward to make clear that absolute timings may not extrapolate well to host-model runs.

Excellent idea for the plots, thank you. This makes them much easier to interpret. We revised the figures to use the same color scheme for results which use the same number of vertical levels in both Fig. 3 and Fig. 4.

The wording did seem too general in the OpenACC-vs-OpenMP section. We revised it so that the comparison is explicitly framed as applying to the NVIDIA GPUs and nvfortran compiler stack used in this study. We also clarified that the OpenMP directives were obtained through Intel's migration tool from the OpenACC source - which could also have an impact on the performance of the OpenMP version.

Citation: https://doi.org/10.5194/egusphere-2025-4435-AC1

AC2: 'Reply on RC1', Gunther Huebler, 10 Mar 2026
RC2: 'Comment on egusphere-2025-4435', Anonymous Referee #2, 09 Feb 2026
This paper gives a detailed performance analysis of the CLUBB cloud and turbulence parameterization, including a new GPU OpenACC-based port. CLUBB is an important and expensive component used by modern atmospheric models. The performance work described in this paper is excellent. The authors give a thorough evaluation of both GPU and CPU performance as a function of workload (number of vertical physics columns per device), compilers, precision, vertical levels and loop structure. They go to great lengths to present a fair CPU vs GPU comparison, comparing well-performing CPU code to similarly well-performing GPU code. They use several metrics (PPR, PRC, ATR, DRC) and carefully discuss the strengths and weaknesses of CPUs and GPUs. The thoroughness and fairness of the paper puts it above many other papers which present misleading GPU speedup numbers. The paper is well written with clear arguments.
I only have two minor suggestions:
1. Section 2: line 92:
The wording of the sentence containing the Sun et al. 2023 reference could be clarified. At first reading, I assumed the authors were saying that the column loop changes in CLUBB were described in Sun et al. 2023, but I think the CLUBB changes are due to this work and Sun et al. 2023 was discussing similar work in a different parameterization (PUMAS).
2. Conclusions:
I think most readers will be interested in how the CLUBB performance numbers will impact global atmospheric models, in typical regimes which are running at the limit of strong scaling to get the maximum possible throughput, or running on fewer nodes to maximize efficiency and maximize ensemble throughput. The authors' detailed benchmarks and metrics will allow any motivated reader to answer this question. But I think it would save the readers a little time if the authors added a short discussion explaining the relation between PPR and PRC and these strong-scaling or maximum-efficiency global model configurations.
Citation: https://doi.org/10.5194/egusphere-2025-4435-RC2
AC3: 'Reply on RC2', Gunther Huebler, 10 Mar 2026
Thank you for the positive review and constructive comments. We've made updates to the manuscript in response.

Good call with the Sun et al. reference; the original wording was ambiguous. We revised the sentence to make clear that Sun et al. (2023) describes an analogous restructuring in the separate PUMAS codebase, not the CLUBB changes mentioned in this paper.

This is an excellent point about the connection to strong/weak scaling. The way we've defined and used the metrics does relate to the strong-scaling vs weak-scaling analyses that are common in the global model community, but we hadn't explicitly made this connection in the manuscript. The weak-scaling connection is pretty direct -- the PPR compares each device at its own most favorable workload, and assumes we can keep the devices fully utilized, which is exactly what you would want to do in a weak-scaling analysis. The connection to the strong-scaling analysis is a little more subtle, because that usually involves subdividing a fixed problem size across different numbers of cores/devices, which is not exactly the same as subdividing a fixed problem size into different batches, but is still pretty close. In the revised manuscript, we added this interpretation in two places: first in the introduction, where PPR and PRC are initially defined, and again in the summary sentence following Fig. 2, where the paper discusses which metric is most applicable in different situations.
Model code and software
GitHub repo of CLUBB code Gunther Huebler and Vincent Larson https://github.com/larson-group/clubb_release/tree/clubb_performance_testing
Zenodo archive of CLUBB code and profiling results Gunther Huebler https://doi.org/10.5281/zenodo.17081296
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 477 | 602 | 39 | 1,118 | 32 | 28 |