the Creative Commons Attribution 4.0 License.
A GPU-parallelization of the neXtSIM-DG dynamical core (v0.3.1)
Abstract. The cryosphere plays a crucial role in Earth’s climate system, making accurate sea ice simulation essential for improving climate projections. To achieve higher resolution simulations, graphics processing units (GPUs) have become increasingly appealing due to their higher floating-point peak performance and superior energy efficiency compared to CPUs. However, harnessing the full theoretical performance of GPUs often requires significant effort in redesigning algorithms and careful implementation. Recently, several frameworks have emerged, aiming to simplify general-purpose GPU programming. In this study, we evaluate multiple such frameworks, including CUDA, SYCL, Kokkos, and PyTorch, for the parallelization of neXtSIM-DG, a finite-element-based dynamical core for sea ice. Based on our assessment of usability and performance, CUDA demonstrates the best performance, while Kokkos is a suitable option for its robust heterogeneous computing capabilities. Our complete implementation of the momentum equation using Kokkos achieves a sixfold speedup on the GPU compared to our OpenMP-based CPU code, while maintaining competitiveness when run on the CPU. Additionally, we explore the impact of different discretization orders and the use of lower precision floating-point types on the GPU, showing that switching to single precision can further accelerate sea ice codes.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2024-2539', Till Rasmussen, 25 Nov 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2539/egusphere-2024-2539-RC1-supplement.pdf
-
AC2: 'Reply on RC1', Robert Jendersie, 19 Dec 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2539/egusphere-2024-2539-AC2-supplement.pdf
-
RC2: 'Comment on egusphere-2024-2539', Anonymous Referee #2, 28 Nov 2024
In the present manuscript ‘A GPU-parallelization of the neXtSIM-DG dynamical core (v0.3.1)’ the authors test and evaluate different GPU programming frameworks based on their sea ice model dynamical core neXtSIM-DG.
Many modeling groups in the weather and climate community and beyond are facing similar problems as the neXtSIM-DG developers. Developing portable code that achieves good performance on various hardware architectures without limiting the productivity of the (scientific) developers too much is a major challenge. Therefore, the thorough analysis of the different available GPU programming frameworks presented here is of great value to the community. The study is well written and I would recommend publication in GMD after a few issues have been addressed as listed below.
- Line 370 in Section 4.1 states ‘These results indicate that the AMD ecosystem is still less mature’. However, to validate this statement and to give a complete picture for AMD GPUs as well, it would have been nice to have a HIP implementation as a baseline against which to compare the other implementations, similar to the CUDA implementation for NVIDIA GPUs.
- Table 1: What hardware was used for these measurements and how many OpenMP threads were used?
- Figures 2 and 10 and lines 354 and 470: Again, how many OpenMP threads were used for the OpenMP reference simulation? And what backend was used for Kokkos on CPUs? OpenMP as well? And if yes, with the same number of threads as the reference OpenMP simulation?
- Line 206: LLVM/Clang provides a set of debugging flags (e.g. https://openmp.llvm.org/design/Runtimes.html#libomptarget-info) which can provide precise information about each block of memory and potential problems. Also, for types that are not trivially copyable, OpenMP 5.0 offers the option of using declare mapper to define how they are mapped. Wouldn’t that have been an option here?
Technical corrections:
- Table 2: ‘AdaptiveCPP’ is used here to indicate the SYCL implementation but the name is too generic. AdaptiveCPP is also the name of the compiler and it can also compile native OpenMP or other parallel APIs. I would suggest replacing ‘AdaptiveCPP’ with ‘SYCL-AdaptiveCPP’.
- Figure 2: Why is TorchInductor marked with an ‘*’ in the legend of the right panel?
- Line 16: impact on long-term processes
- Line 42: is -> it
- Line 88: often often -> Remove one
Citation: https://doi.org/10.5194/egusphere-2024-2539-RC2
-
AC1: 'Reply on RC2', Robert Jendersie, 19 Dec 2024
Publisher’s note: the content of this comment was removed on 2 January 2025 since the comment was posted by mistake.
Citation: https://doi.org/10.5194/egusphere-2024-2539-AC1
-
AC3: 'Reply on RC2', Robert Jendersie, 19 Dec 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2539/egusphere-2024-2539-AC3-supplement.pdf
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
208 | 72 | 169 | 449 | 4 | 10 |