Actionable reporting of CPU-GPU performance comparisons: Insights from a CLUBB case study

Huebler, Gunther; Larson, Vincent E.; Dennis, John; Voelz, Sheri

doi:10.5194/egusphere-2025-4435

03 Nov 2025

Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Abstract. Graphics Processing Units (GPUs) are becoming increasingly central to high-performance computing (HPC), but fair comparison with central processing units (CPUs) remains challenging, particularly for applications that can be subdivided into smaller workloads. Traditional metrics such as speedup ratios can overstate GPU advantages and obscure the conditions under which CPUs are competitive, as they depend strongly on workload choice. We introduce two peak-based performance metrics, the Peak Ratio Crossover (PRC) and the Peak-to-Peak Ratio (PPR), which provide clearer comparisons by accounting for the best achievable performance of each device. Using a case study of the performance of the Cloud Layers Unified by Binormals (CLUBB) standalone model, we demonstrate these metrics in practice, show how they can guide execution strategy, and examine how they shift under factors that affect workload. We further analyze how implementation choices and code structure influence these metrics, showing how they enable performance comparisons to be expressed in a concise and actionable way, while also helping identify which optimization efforts should be prioritized to meet different performance goals.
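The contrast between workload-dependent speedup ratios and a peak-based comparison can be illustrated with a minimal sketch. The abstract does not give the exact definitions of the metrics, so the code below assumes one plausible reading of the PPR (best GPU throughput divided by best CPU throughput, each taken over all tested batch sizes) and does not attempt to define the PRC. The throughput numbers and function names are hypothetical and are not results or code from the paper.

```python
# Hypothetical throughput measurements (columns per second) keyed by batch size.
# These values are invented for illustration; they are NOT results from the paper.
cpu_throughput = {64: 9000.0, 256: 12000.0, 1024: 11000.0, 4096: 10000.0}
gpu_throughput = {64: 1500.0, 256: 6000.0, 1024: 20000.0, 4096: 36000.0}


def speedup_by_workload(cpu, gpu):
    """Conventional speedup: GPU/CPU throughput at the same batch size."""
    return {n: gpu[n] / cpu[n] for n in sorted(cpu) if n in gpu}


def peak_to_peak_ratio(cpu, gpu):
    """Assumed reading of the PPR: best GPU throughput over best CPU throughput,
    where each device is evaluated at its own most favourable batch size."""
    return max(gpu.values()) / max(cpu.values())


speedups = speedup_by_workload(cpu_throughput, gpu_throughput)
ppr = peak_to_peak_ratio(cpu_throughput, gpu_throughput)

for batch_size, ratio in speedups.items():
    print(f"batch size {batch_size:5d}: same-workload speedup = {ratio:.2f}")
print(f"peak-to-peak ratio (assumed definition) = {ppr:.2f}")
```

With these made-up numbers the same-workload speedup ranges from below 1 to well above the peak-to-peak value, depending only on which batch size is reported, whereas the peak-based ratio is a single number that compares each device at its best.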

Received: 10 Sep 2025 – Discussion started: 03 Nov 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 29 Jan 2026)

RC1: 'Comment on egusphere-2025-4435', Georgiana Mania, 05 Dec 2025
The paper advances the science by introducing two new metrics, Peak Ratio Crossover (PRC) and Peak-to-Peak Ratio (PPR), which allow for a better comparison between CPU and GPU performance of a given application. The authors demonstrate the benefits of using these metrics on a single-column parametrisation of turbulence and clouds, Cloud Layers Unified by Binormals (CLUBB), which exposes several levels of parallelism that can be exploited differently by heterogeneous architectures. Several use cases show the impact of batch size, precision, asynchronous execution, device type and coding optimisations on the application's throughput, expressed in columns per second. The paper is well written, and the claims are well supported by experiments and profiling data which naturally drive the conclusions. I would recommend the publication of the manuscript.
There are a few comments, listed below, which could be tackled by the authors in a minor revision:

The GPU in the title is a broad topic, while most of the findings are based on experiments with NVIDIA GPUs. It would be interesting to see whether all the conclusions still hold on the MI250x (given the different warp size and different generated assembly code). That is more a matter for future work and does not need to be added to the current manuscript.

Line 107: "This setup ... closely mirrors the execution model ..." - This is usually only partially true, because a standalone model has a smaller memory footprint in terms of the instruction cache.

Fig. 3 / page 9 could be improved if the plots with the same number of levels had the same colour, since the vertical levels are in focus here, e.g. AMD7763_2x128_34nz and A100_4x4_34nz.

Section 5.4 OpenACC vs OpenMP - Firstly, the experiments were done with the NVIDIA compiler, which is known to favour OpenACC (e.g. the amount of optimisations for it is significantly larger than for OpenMP), so double-checking the results with another compiler could bring new information. Secondly, Fortran + OpenMP is known to perform better than Fortran + OpenACC on AMD hardware (using the AMD ROCm compiler stack), so without a similar experiment on AMD, I would rephrase the findings in lines 297-306 as limited to NVIDIA hardware using the NVIDIA compiler family.

Since the manuscript is not anonymised, maybe the authors can write their full names in lines 518-520.

Citation: https://doi.org/10.5194/egusphere-2025-4435-RC1

Model code and software

GitHub repo of CLUBB code. Gunther Huebler and Vincent Larson. https://github.com/larson-group/clubb_release/tree/clubb_performance_testing

Zenodo archive of CLUBB code and profiling results. Gunther Huebler. https://doi.org/10.5281/zenodo.17081296

Short summary

Central processing units (CPUs) and graphics processing units (GPUs) are different devices that suit different kinds of work. Using a climate modeling component, we provide a clearer way to tell which device type is faster for a given task. This matters because runs usually use only one device type. Our results are actionable: they guide device choice, report performance gains fairly, highlight code areas to improve, and show how code structure and optimization can change conclusions.