This work is distributed under the Creative Commons Attribution 4.0 License.
Comparing the Performance of Julia on CPUs versus GPUs and Julia-MPI versus Fortran-MPI: a case study with MPAS-Ocean (Version 7.1)
Abstract. Some programming languages are easy to develop at the cost of slow execution, while others are fast at run time but much more difficult to write. Julia is a programming language that aims to be the best of both worlds – a development and production language at the same time. To test Julia’s utility in scientific high-performance computing (HPC), we built an unstructured-mesh shallow water model in Julia and compared it against an established Fortran-MPI ocean model, MPAS-Ocean, as well as a Python shallow water code. Three versions of the Julia shallow water code were created: for a single-core CPU; for graphics processing units (GPU); and for Message Passing Interface (MPI) CPU clusters. Comparing identical simulations revealed that our first version of the Julia model was 13 times faster than Python using Numpy, where both used an unthreaded single-core CPU. Further Julia optimizations, including static typing and removing implicit memory allocations, provided an additional 10–20x speed-up of the single-core CPU Julia model. The GPU-accelerated Julia code attained a speed-up of 230–380x compared to the single-core CPU Julia code. Parallelized Julia-MPI performance was identical to Fortran-MPI MPAS-Ocean for low processor counts, and ranged from 2x faster to 2x slower for higher processor counts. Our experience is that Julia development is fast and convenient for prototyping, but that Julia requires further investment and expertise to be competitive with compiled codes. We provide advice on Julia code optimization for HPC systems.
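The two single-core optimizations named in the abstract, static typing and removing implicit memory allocations, can be sketched in Julia as follows. This is an illustrative example only; the struct and function names are hypothetical and are not taken from the paper's MPAS-Ocean Julia code:

```julia
# Hypothetical sketch of the abstract's two optimizations: static typing
# and avoiding implicit allocations. Names are illustrative, not from the paper.

# Unoptimized: the untyped field forces boxed values, and the expression
# `s.h .+ dt .* tend` allocates a fresh array on every time step.
mutable struct StateSlow
    h            # abstract (untyped) field defeats compiler specialization
end

step_slow(s::StateSlow, tend, dt) = (s.h = s.h .+ dt .* tend; s)

# Optimized: a concrete field type enables specialized compiled code, and the
# fused in-place broadcast (via @.) updates existing storage with no temporary.
mutable struct StateFast
    h::Vector{Float64}   # concrete (static) type
end

function step_fast!(s::StateFast, tend::Vector{Float64}, dt::Float64)
    @. s.h += dt * tend   # expands to s.h .+= dt .* tend: fused, allocation-free
    return s
end
```

In practice the allocation difference can be checked with `@allocated` or `BenchmarkTools.@btime`; the in-place version should report zero heap allocations per step.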
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2023-57', Anonymous Referee #1, 25 Mar 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-57/egusphere-2023-57-RC1-supplement.pdf
- AC1: 'Reply on RC1', Mark R. Petersen, 02 Jul 2023
-
RC2: 'Comment on egusphere-2023-57', Anonymous Referee #2, 22 May 2023
The authors have made significant contributions by developing a shallow water solver in the Julia language and comparing its performance with a solver written in Fortran. Furthermore, they have successfully implemented their solver on a GPU, demonstrating a remarkable speed-up. While the overall results appear promising, I would suggest considering the following points to further enhance the paper:
1. In section 3.2, it would greatly enhance the paper to include a table comparing the specifications of the CPU and GPU used in the simulations. This table should provide a comprehensive comparison of various factors, such as FLOPS (Floating-Point Operations Per Second) and memory bandwidth, specifically for both 32-bit and 64-bit computations. Additionally, it would be valuable to summarize the versions of the toolchain that were utilized during these computations. This information will give readers a better understanding of the hardware and software environment in which the simulations were conducted, allowing for a more comprehensive evaluation of the results.
2. In section 3.2, it would be beneficial to include a comparison of the performance between the Julia code and the Fortran code in a single-core execution. This comparison will provide readers with insights into the optimization of the Julia code for serial computation.
3. In Section 3.2, the authors mentioned that all codes were executed in double precision and highlighted the faster simulation on the NVIDIA RTX8000 GPU compared to the CPU. However, it is important to consider that the RTX8000 is primarily designed for consumer applications and may exhibit slower performance in double-precision computation. To provide a more comprehensive evaluation, it would be valuable to compare the computation on a GPU targeted at high-performance computing (HPC), such as the NVIDIA Tesla A100, which is known for its robust double-precision performance and is specifically designed to excel in HPC workloads. Otherwise, please compare all simulations in single precision.
4. In section 3.3, it is evident that Julia-MPI outperformed Fortran-MPI in terms of computation, but it took more time for communication. To provide a clearer understanding of the experimental setup, it would be beneficial to specify the Fortran compiler and the Julia version, along with the related toolchain, that were employed in the study. Additionally, it is important to mention the specific version of the MPI library used for both the Fortran-MPI and Julia-MPI implementations. This information will help readers better comprehend the underlying MPI libraries utilized in each case and the potential impact they may have had on the communication performance.
Moreover, it is worth exploring the possibility that different MPI libraries might have been employed for the Fortran and Julia codes. If this is the case, it should be explicitly stated in the paper, along with the versions of the MPI libraries used for each implementation. Clarifying this aspect will enable readers to consider any discrepancies or optimizations associated with the MPI libraries employed in the Fortran and Julia implementations.
5. I think hyper-threading may be disabled on supercomputers. It would be helpful to omit the hyper-threaded CPU performance in section 3.3.
Citation: https://doi.org/10.5194/egusphere-2023-57-RC2
- AC2: 'Reply on RC2', Mark R. Petersen, 02 Jul 2023
- AC3: 'Reply on RC2', Mark R. Petersen, 02 Jul 2023
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
715 | 394 | 20 | 1,129 | 14 | 5
Cited
2 citations as recorded by crossref.