The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures

Petersen, Mark R.; Asay-Davis, Xylar S.; Barthel, Alice M.; Begeman, Carolyn Branecky; Bishnu, Siddhartha; Brus, Steven R.; Jones, Philip W.; Kang, Hyun-Gyu; Kim, Youngsung; Mametjanov, Azamat; O’Neill, Brian; Ringel, Kieran K.; Smith, Katherine M.; Sreepathi, Sarat; Van Roekel, Luke P.; Waruszewski, Maciej

doi:10.5194/egusphere-2025-3500

Preprints

https://doi.org/10.5194/egusphere-2025-3500

Preprints

24 Oct 2025

| 24 Oct 2025

Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures

Mark R. Petersen, Xylar S. Asay-Davis, Alice M. Barthel, Carolyn Branecky Begeman, Siddhartha Bishnu, Steven R. Brus, Philip W. Jones, Hyun-Gyu Kang, Youngsung Kim, Azamat Mametjanov, Brian O’Neill, Kieran K. Ringel, Katherine M. Smith, Sarat Sreepathi, Luke P. Van Roekel, and Maciej Waruszewski

Abstract. Here we introduce Omega, the Ocean Model for E3SM Global Applications. Omega is a new ocean model designed to run efficiently on high performance computing (HPC) platforms, including exascale heterogeneous architectures with accelerators, such as Graphics Processing Units (GPUs). Omega is written in C++ and uses the Kokkos performance portability library. These were chosen because they are well-supported, and will help future-proof Omega for upcoming HPC architectures. Omega will eventually replace the Model for Prediction Across Scales-Ocean (MPAS-Ocean) in the US Department of Energy's Energy Exascale Earth System Model (E3SM). Omega runs on unstructured horizontal meshes with variable-resolution capability and implements the same horizontal discretization as MPAS-Ocean. In this paper, we document the design and performance of Omega Version 0.1.0 (Omega-V0), which solves the shallow water equations with passive tracers and is the first step towards the full primitive equation ocean model. On Central Processing Units (CPUs), Omega-V0 is 1.4 times faster than MPAS-Ocean with the same configuration. Omega-V0 is more efficient on GPUs than CPUs on a per-watt basis–by a factor of 5.3 on Frontier and 3.6 on Aurora, two of the world's fastest exascale computers.

Received: 20 Jul 2025 – Discussion started: 24 Oct 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Status: open (until 04 Jan 2026)

Post a comment Subscribe to comment alert

RC1:
'Comment on egusphere-2025-3500', Seiya Nishizawa, 01 Dec 2025 reply
Comments on
Title: The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures
Authors: Mark R. Petersen et al.
MS No.: egusphere-2025-3500
MS type: Model description paper

General Comments
This manuscript describes Omega-V0.1.0, a new C++/Kokkos-based ocean model for E3SM targeting performance portability across heterogeneous CPU/GPU architectures. The paper provides a clear scientific motivation for the rewrite from MPAS-Ocean, presents the governing equations and discretization in sufficient detail, and includes a broad set of verification tests and multi-platform performance benchmarks. The performance results, especially on multiple exascale-class GPU systems, are a valuable contribution to the community and align well with the objectives of GMD model description papers.
That said, several clarifications are still needed to strengthen reproducibility and to help readers interpret key results. In particular, the paper should provide more concrete explanations of why OpenACC offloading was limited in MPAS-Ocean, supply missing experimental details for the benchmarks, and expand the discussion of some performance claims (for example, regular versus unstructured mesh equivalence, CPU–GPU work partitioning). I also encourage the authors to discuss how the current performance conclusions are expected to extend to Omega-V1 when more complex physical parameterizations are added.
Overall, the manuscript is strong and suitable for publication after minor-to-moderate revisions focused on clarification and consistency.

Specific Comments
Programming model taxonomy (around line 41):

The authors list four competing GPU programming approaches. Given the focus on portability, it would be useful to briefly mention recent language-standard based parallel models (for example, C++ and Fortran standard parallelism), and position them relative to the four categories already listed.

Limitations of OpenACC in MPAS-Ocean (around line 56):

The manuscript explains that only about half of MPAS-Ocean could be accelerated with OpenACC and that this led to small kernels and poor throughput. Please add a concise, concrete explanation of which specific structural aspects of MPAS-Ocean prevented directive-based offload (for example, dynamic data structures, or control-flow complexity).

Engagement of domain scientists and transition strategy (around line 106):

The text states that Omega was developed by a small group mainly composed of domain scientists, and that Kokkos abstractions were simplified for legibility. Given that Omega-V1/V2 will require substantial physics and infrastructure development, it would be valuable to comment on how the Omega developer community is expected to grow (e.g., anticipated contributors from E3SM and the broader ocean/atmosphere community) and on practical strategies for enabling uptake by scientists less familiar with C++.

Halo-exchange benchmark reproducibility (Figure 1 discussion around line 210):

The description of GPU-aware MPI and the observed 4–6× speedup is clear, but key experimental parameters are missing. Please specify halo width, number and type of variables communicated per step, whether variables were packed separately or aggregated, and total and per-call message sizes.

Contiguous vertical chunk strategy and layout choice (around line 262):

Please clarify whether you tested or considered other memory orderings of the 3-D fields in Kokkos (changing which index is contiguous), and why the current choice (vertical index contiguous) is expected to be optimal across CPU and GPU architectures. In particular, for vertically dependent physics, non-coalesced access on GPUs could become a bottleneck; a short justification or discussion of tested layouts would be helpful.

Regular Cartesian vs. unstructured spherical mesh performance (line 450):

The claim that performance is “equivalent” between regular Cartesian and unstructured spherical meshes is not explained. Please clarify what metric “equivalent” refers to and why indirect or irregular accesses in unstructured meshes do not measurably degrade performance.

Grid size notation and horizontal cell counts (line 450):

The mesh is described as a regular hexagonal grid, and the test cases are labeled as 1024×1024×96 and 2048×2048×96. However, the mapping between the “1024×1024” notation and the reported horizontal cell counts (approximately one million and four million, respectively) is not obvious for a hexagonal mesh. Please add a brief explanation of what the 1024 and 2048 represent and how these translate to the stated horizontal cell numbers.

CPU–GPU work partitioning in GPU runs (around line 470):

The manuscript notes full utilization of CPUs and GPUs. Please describe how workload sharing between CPU and GPU is determined: automatic or manually tuned.

Different CPU counts in Table 5 GPU vs. CPU-only runs:

Table 5 uses fewer CPUs in GPU simulations than in CPU-only simulations. Please explain why the CPU count differs.

Figure 7 missing reference:

Figure 7 is not cited in the text. Please either reference and explain it or remove it.

Convergence comparison with and without FCT (around line 362):

Omega’s tracer transport tests are conducted without FCT, whereas the manuscript reports the MPAS-Ocean convergence rate only for the FCT case (2.42). To enable a clearer like-for-like comparison, please also provide the MPAS-Ocean convergence rate without FCT and discuss whether that baseline is comparable to Omega’s 1.36 rate.

CPU runtime identity across machines (Figure 18):

CPU runtimes on Frontier and Perlmutter are identical despite different compilers being used. Please double-check and add a brief comment confirming correctness if intended.

Why Omega outperforms MPAS-Ocean on GPUs (around line 527):

The reported GPU speedups of Omega over MPAS-Ocean are very large. However, the benchmark configuration targets a relatively simple shallow-water system with passive tracers and does not include the more complex, branching-heavy physical parameterizations that often challenge directive-based approaches. For such a comparatively regular workload, one might expect OpenACC to achieve reasonably high GPU efficiency as well. It is therefore unclear why the performance gap remains so dramatic. Please expand the discussion to identify which kernels or design choices dominate the difference (e.g., memory layout, kernel fusion/granularity, indirect addressing, communication overlap, or data movement), and explain concretely why OpenACC fails to reach similar efficiency for this specific configuration.

Absolute performance metrics:

The performance analysis is currently presented almost entirely in terms of relative comparisons (across machines and against MPAS-Ocean). While these are useful, the absence of absolute performance metrics makes it difficult to assess efficiency against hardware limits or to compare with other studies. Please add at least one absolute metric (e.g., achieved memory bandwidth/FLOPS, or fraction of peak) to complement the relative results and strengthen the performance section.

Performance outlook when physical parameterizations are added:

Omega-V0 benchmarks a relatively regular, shallow-water workload with passive tracers. Omega-V1 is expected to include more complex processes such as vertical advection and mixing, equation of state, pressure computation, and physics parameterizations. These additions often introduce more branching, irregular memory access, and heterogeneous kernel costs than the current configuration. Please include a short discussion on how the present performance conclusions are expected to translate to Omega-V1. For example:

which future modules are anticipated to be performance-critical or memory-bound,

whether the current kernel design strategy (functor granularity, vertical chunking, policy choices) is expected to remain optimal, and

how OpenACC vs. Kokkos performance might change once less regular physics kernels are introduced.

Even a qualitative outlook would help readers assess the generality of the current performance results.

Technical Corrections
Replace nonstandard capitalization “Nvidia” → “NVIDIA” consistently.

Define abbreviations at first use (SSH).

Fix minor typos:

“kinetic energy andpotential” → “kinetic and potential energy” (line 285)

“analytic solution 10.” → “analytic solution (Figure 10)” (line 392)

“Simulation” → “simulation” (line 307).

Replace e-notation in prose (for example, “1.5e06”) with “1.5 × 10⁶”.

The sentence starting “The tendency terms are …” (around line 286) largely repeats the content of the preceding paragraph (around lines 259–). Please remove it or merge the two passages to avoid redundancy.

Figure 2 aims to show equivalent multi-dimensional iteration in Fortran (left) and Omega abstractions (right), but the right panel uses A(i,j,k)=i*j+k whereas the left uses i+j+k. This appears to be a typo and should be made consistent.

Overall recommendation: Minor revision. The required changes are mainly clarification for reproducibility and a small set of consistency and formatting fixes, with an added request to outline how performance expectations extend to Omega-V1 physics.

Reply
Citation: https://doi.org/10.5194/egusphere-2025-3500-RC1
RC2:
'Comment on egusphere-2025-3500', Anonymous Referee #2, 18 Dec 2025 reply
The manuscript “The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures” describes the design and testing, and demonstrates the performance and performance-portability of Omega ocean model which is supposed to replace MPAS-Ocean in the future. The model solves the SWE with passive tracers and is therefore in the early phase of developments.
The manuscript is well organized and has all the contents that I would expect from a model documentation paper for modern architecture. It describes usage of Kokkos in a Ocean model, which is probably one of the first attempts. I recommend publication, as I have only a few minor questions/comments and suggestions.
Writing style: I strongly suggest the authors to review the introduction. It reads more like a personal experience than a scientific introduction. Specifically, the three options for model port is described in 3 large paragraphs expressing in details the concerns of the group involved. I think the community is well past that stage and does not need a long introduction of status-quo.

Line 53 and few other places- good to avoid using “our group”, “we”, if possible.

Why did the authors decide to publish at this early stage? Why not wait for the full ocean model to be ready?

Line 54: Authors write that MPAS-Ocean is only half OpenACC ported because of the code structure. Could you please elaborate?

Lines 90+: “The choice of Kokkos required our …..”- it again reflects your experience. See if the sentence can be rephrased to make it more objective.

Line 105+: It seems that the team has added additional layer of abstraction on top of kokkos. What will happen when these people leave? Is there a strategy behind?

Related, for my own understanding: Why did you decide to write the code on your own (domain experts) when there is a trend in the community to get it written by the software engineers?

11: is not discrete but the line above it says it is. Please check.

Line 164: “operator convergence rates” – do you mean grid convergence of individual operators? Please check.

Lines 190+ on multiple domain decomposition- I understand the benefits of supporting multiple domain decomposition but that sentence is not clear to me. Could you please rephrase?

Line 202: “We have” – try without we.

Trott et al., 2022b has been cited a few times for Kokkos. Please check.

Line 251+, for my own understanding: It is mentioned that a manual tuning of kernel execution resulted in 10-20% performance gain as opposed to MDRangePolicy. How and when did the team realize that more can be achieved avoiding what kokkos offers?

Lines around 275: Author mention that the ability to fuse or to not fuse functor is advantageous as it allows one to try out things. I am wondering if it is practically possible if one has several functors and possible combinations to try out?

Figure 18 and related text: the comparison with MPAS-Ocean on GPUs is unfair since it is not full ported and likely not tuned. I think it is fine to use it as a reference but I would suggest mentioning it explicitly in the text.

Reply
Citation: https://doi.org/10.5194/egusphere-2025-3500-RC2

Viewed

Total article views: 462 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
242	196	24	462	18	16

HTML: 242
PDF: 196
XML: 24
Total: 462
BibTeX: 18
EndNote: 16

Views and downloads (calculated since 24 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	84	30	4	118
Nov 2025	68	59	9	136
Dec 2025	90	107	11	208

Cumulative views and downloads (calculated since 24 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	84	30	4	118
Nov 2025	68	59	9	136
Dec 2025	90	107	11	208

Viewed (geographical distribution)

Total article views: 470 (including HTML, PDF, and XML) Thereof 470 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 27 Dec 2025

Short summary

Ocean models are used to predict currents, temperature, and salinity of the earth’s oceans, much like weather forecasting. As supercomputer hardware changes with evolving technology, models must be updated, and sometimes rewritten. Here we document Omega, a new ocean model that was designed to run on the world’s fastest supercomputers. Testing shows that Omega accurately solves the model equations, and runs efficiently on many different computer architectures, including exascale computers.


Total:	0
HTML:	0
PDF:	0
XML:	0