This preprint is distributed under the Creative Commons Attribution 4.0 License.
The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures
Abstract. Here we introduce Omega, the Ocean Model for E3SM Global Applications. Omega is a new ocean model designed to run efficiently on high-performance computing (HPC) platforms, including exascale heterogeneous architectures with accelerators such as Graphics Processing Units (GPUs). Omega is written in C++ and uses the Kokkos performance portability library; these were chosen because they are well supported and will help future-proof Omega for upcoming HPC architectures. Omega will eventually replace the Model for Prediction Across Scales-Ocean (MPAS-Ocean) in the US Department of Energy's Energy Exascale Earth System Model (E3SM). Omega runs on unstructured horizontal meshes with variable-resolution capability and implements the same horizontal discretization as MPAS-Ocean. In this paper, we document the design and performance of Omega Version 0.1.0 (Omega-V0), which solves the shallow water equations with passive tracers and is the first step towards the full primitive equation ocean model. On Central Processing Units (CPUs), Omega-V0 is 1.4 times faster than MPAS-Ocean with the same configuration. On a per-watt basis, Omega-V0 is more efficient on GPUs than on CPUs, by a factor of 5.3 on Frontier and 3.6 on Aurora, two of the world's fastest exascale computers.
Status: open (until 31 Dec 2025)
- RC1: 'Comment on egusphere-2025-3500', Seiya Nishizawa, 01 Dec 2025
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 180 | 119 | 14 | 313 | 11 | 9 |
Comments on
Title: The Ocean Model for E3SM Global Applications: Omega Version 0.1.0. A New High-Performance Computing Code for Exascale Architectures
Authors: Mark R. Petersen et al.
MS No.: egusphere-2025-3500
MS type: Model description paper
General Comments
This manuscript describes Omega-V0.1.0, a new C++/Kokkos-based ocean model for E3SM targeting performance portability across heterogeneous CPU/GPU architectures. The paper provides a clear scientific motivation for the rewrite from MPAS-Ocean, presents the governing equations and discretization in sufficient detail, and includes a broad set of verification tests and multi-platform performance benchmarks. The performance results, especially on multiple exascale-class GPU systems, are a valuable contribution to the community and align well with the objectives of GMD model description papers.
That said, several clarifications are still needed to strengthen reproducibility and to help readers interpret key results. In particular, the paper should provide more concrete explanations of why OpenACC offloading was limited in MPAS-Ocean, supply missing experimental details for the benchmarks, and expand the discussion of some performance claims (for example, regular versus unstructured mesh equivalence, CPU–GPU work partitioning). I also encourage the authors to discuss how the current performance conclusions are expected to extend to Omega-V1 when more complex physical parameterizations are added.
Overall, the manuscript is strong and suitable for publication after minor-to-moderate revisions focused on clarification and consistency.
Specific Comments
The authors list four competing GPU programming approaches. Given the focus on portability, it would be useful to briefly mention recent language-standard-based parallel models (for example, C++ and Fortran standard parallelism) and to position them relative to the four categories already listed.
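For concreteness, this is the kind of construct I mean (a minimal, illustrative C++ example, not tied to Omega):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Standard-parallel loop: no directives and no portability library.
// With a suitable compiler (e.g. nvc++ -stdpar) such loops can be
// offloaded to a GPU directly from standard C++.
void scaleTracer(std::vector<double>& tracer, double factor) {
  std::for_each(std::execution::par_unseq, tracer.begin(), tracer.end(),
                [factor](double& value) { value *= factor; });
}
```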
The manuscript explains that only about half of MPAS-Ocean could be accelerated with OpenACC and that this led to small kernels and poor throughput. Please add a concise, concrete explanation of which specific structural aspects of MPAS-Ocean prevented directive-based offload (for example, dynamic data structures, or control-flow complexity).
The text states that Omega was developed by a small group mainly composed of domain scientists, and that Kokkos abstractions were simplified for legibility. Given that Omega-V1/V2 will require substantial physics and infrastructure development, it would be valuable to comment on how the Omega developer community is expected to grow (e.g., anticipated contributors from E3SM and the broader ocean/atmosphere community) and on practical strategies for enabling uptake by scientists less familiar with C++.
The description of GPU-aware MPI and the observed 4–6× speedup is clear, but key experimental parameters are missing. Please specify halo width, number and type of variables communicated per step, whether variables were packed separately or aggregated, and total and per-call message sizes.
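For clarity about what is being timed, this is the pattern I assume is meant by "GPU-aware" (an illustrative sketch only; the buffer names, packing strategy, and message structure are my assumptions, and are exactly the details the text should state):

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Hypothetical halo exchange with GPU-aware MPI: the packed send/receive
// buffers are Kokkos views resident in device memory, and their raw device
// pointers are handed to MPI directly, avoiding a staging copy to the host.
void exchangeHalo(const Kokkos::View<double*>& sendBuf,
                  const Kokkos::View<double*>& recvBuf,
                  int neighborRank, MPI_Comm comm) {
  MPI_Request reqs[2];
  MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_DOUBLE,
            neighborRank, 0, comm, &reqs[0]);
  MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_DOUBLE,
            neighborRank, 0, comm, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

Reporting the per-call message size (halo width times number of packed variables times the element size) would let readers judge whether the 4–6× gain comes mainly from avoiding host staging or from message aggregation.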
Please clarify whether you tested or considered other memory orderings of the 3-D fields in Kokkos (changing which index is contiguous), and why the current choice (vertical index contiguous) is expected to be optimal across CPU and GPU architectures. In particular, for vertically dependent physics, non-coalesced access on GPUs could become a bottleneck; a short justification or discussion of tested layouts would be helpful.
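To make the layout question concrete, here is a minimal Kokkos sketch (illustrative only, not Omega's code), assuming fields are indexed as (cell, vertical level):

```cpp
#include <Kokkos_Core.hpp>

// Vertical index contiguous (the choice described in the manuscript):
using VertFast = Kokkos::View<double**, Kokkos::LayoutRight>;
// Cell index contiguous (the alternative layout):
using CellFast = Kokkos::View<double**, Kokkos::LayoutLeft>;

// A cell-parallel kernel that walks each column. With VertFast, one thread
// reads a contiguous column, but neighbouring threads (neighbouring cells)
// are nVertLevels elements apart, so GPU accesses are not coalesced across
// threads; with CellFast the same kernel would be coalesced.
void columnSum(VertFast field, Kokkos::View<double*> colSum) {
  const int nVertLevels = static_cast<int>(field.extent(1));
  Kokkos::parallel_for("columnSum", field.extent(0),
      KOKKOS_LAMBDA(const int iCell) {
        double sum = 0.0;
        for (int k = 0; k < nVertLevels; ++k) sum += field(iCell, k);
        colSum(iCell) = sum;
      });
}
```

A sentence on which of these trade-offs was measured (or why it is unimportant for Omega's kernels) would address the concern.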
The claim that performance is “equivalent” between regular Cartesian and unstructured spherical meshes is not explained. Please clarify what metric “equivalent” refers to and why indirect or irregular accesses in unstructured meshes do not measurably degrade performance.
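For illustration, this is the access-pattern difference I have in mind (a sketch, not Omega code; the connectivity-array name is assumed):

```cpp
#include <Kokkos_Core.hpp>

// Regular grid: the east neighbour is found by arithmetic on the indices.
KOKKOS_INLINE_FUNCTION
double eastNeighbor(const Kokkos::View<const double**>& h, int i, int j) {
  return h(i + 1, j);
}

// Unstructured mesh: the neighbour index is first gathered from a
// connectivity table, an indirect (potentially irregular) memory access.
KOKKOS_INLINE_FUNCTION
double neighborOnCell(const Kokkos::View<const double*>& h,
                      const Kokkos::View<const int**>& cellsOnCell,
                      int iCell, int j) {
  return h(cellsOnCell(iCell, j));
}
```

Explaining why the extra gather is hidden (for example, by the cost of the vertical loop within each cell, or by cache reuse of the connectivity arrays) would make the "equivalent" claim verifiable.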
The mesh is described as a regular hexagonal grid, and the test cases are labeled as 1024×1024×96 and 2048×2048×96. However, the mapping between the “1024×1024” notation and the reported horizontal cell counts (approximately one million and four million, respectively) is not obvious for a hexagonal mesh. Please add a brief explanation of what the 1024 and 2048 represent and how these translate to the stated horizontal cell numbers.
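If, as I suspect, "1024×1024" counts hexagonal cells along the two periodic directions of the planar mesh (an assumption on my part), the numbers do line up:

$$
1024 \times 1024 = 1\,048\,576 \approx 1.0 \times 10^{6} \text{ cells}, \qquad
2048 \times 2048 = 4\,194\,304 \approx 4.2 \times 10^{6} \text{ cells},
$$

each with 96 vertical levels. An explicit statement to this effect in the text would remove the ambiguity.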
The manuscript notes full utilization of CPUs and GPUs. Please describe how the workload sharing between CPUs and GPUs is determined, i.e., whether it is set automatically or manually tuned.
Table 5 uses fewer CPUs in GPU simulations than in CPU-only simulations. Please explain why the CPU count differs.
Figure 7 is not cited in the text. Please either reference and explain it or remove it.
Omega’s tracer transport tests are conducted without FCT, whereas the manuscript reports the MPAS-Ocean convergence rate only for the FCT case (2.42). To enable a clearer like-for-like comparison, please also provide the MPAS-Ocean convergence rate without FCT and discuss whether that baseline is comparable to Omega’s 1.36 rate.
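For reference, I assume both rates are computed with the standard two-resolution estimate

$$
p = \frac{\ln(e_{h_1}/e_{h_2})}{\ln(h_1/h_2)},
$$

where $e_{h}$ is the error norm at grid spacing $h$ (or with a least-squares fit of $\log e$ versus $\log h$ over several resolutions). Stating the norm and the resolution pairs used for both models would make the comparison unambiguous.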
CPU runtimes on Frontier and Perlmutter are identical despite different compilers being used. Please double-check and add a brief comment confirming correctness if intended.
The reported GPU speedups of Omega over MPAS-Ocean are very large. However, the benchmark configuration targets a relatively simple shallow-water system with passive tracers and does not include the more complex, branching-heavy physical parameterizations that often challenge directive-based approaches. For such a comparatively regular workload, one might expect OpenACC to achieve reasonably high GPU efficiency as well. It is therefore unclear why the performance gap remains so dramatic. Please expand the discussion to identify which kernels or design choices dominate the difference (e.g., memory layout, kernel fusion/granularity, indirect addressing, communication overlap, or data movement), and explain concretely why OpenACC fails to reach similar efficiency for this specific configuration.
The performance analysis is currently presented almost entirely in terms of relative comparisons (across machines and against MPAS-Ocean). While these are useful, the absence of absolute performance metrics makes it difficult to assess efficiency against hardware limits or to compare with other studies. Please add at least one absolute metric (e.g., achieved memory bandwidth/FLOPS, or fraction of peak) to complement the relative results and strengthen the performance section.
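As one concrete option (a suggestion, not a requirement), the achieved fraction of peak memory bandwidth per time step could be reported:

$$
f_{\mathrm{BW}} = \frac{B_{\text{moved}} / t_{\text{step}}}{B_{\text{peak}}},
$$

where $B_{\text{moved}}$ is the data volume read and written per step (estimated from the fields touched by each kernel, or measured with hardware counters), $t_{\text{step}}$ is the wall-clock time per step, and $B_{\text{peak}}$ is the peak or STREAM bandwidth of the node.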
Omega-V0 benchmarks a relatively regular, shallow-water workload with passive tracers. Omega-V1 is expected to include more complex processes such as vertical advection and mixing, the equation of state, pressure computation, and physics parameterizations. These additions often introduce more branching, irregular memory access, and heterogeneous kernel costs than the current configuration. Please include a short discussion of how the present performance conclusions are expected to translate to Omega-V1, for example whether the reported GPU speedups and per-watt advantages are expected to persist once these more irregular kernels are added.
Even a qualitative outlook would help readers assess the generality of the current performance results.
Technical Corrections
Overall recommendation: Minor revision. The required changes are mainly clarification for reproducibility and a small set of consistency and formatting fixes, with an added request to outline how performance expectations extend to Omega-V1 physics.