This work is distributed under the Creative Commons Attribution 4.0 License.
Comparing the Performance of Julia on CPUs versus GPUs and Julia-MPI versus Fortran-MPI: a case study with MPAS-Ocean (Version 7.1)
Abstract. Some programming languages are easy to develop at the cost of slow execution, while others are fast at run time but much more difficult to write. Julia is a programming language that aims to be the best of both worlds – a development and production language at the same time. To test Julia’s utility in scientific high-performance computing (HPC), we built an unstructured-mesh shallow water model in Julia and compared it against an established Fortran-MPI ocean model, MPAS-Ocean, as well as a Python shallow water code. Three versions of the Julia shallow water code were created: for a single-core CPU; for graphics processing units (GPU); and for Message Passing Interface (MPI) CPU clusters. Comparing identical simulations revealed that our first version of the Julia model was 13 times faster than Python using Numpy, where both used an unthreaded single-core CPU. Further Julia optimizations, including static typing and removing implicit memory allocations, provided an additional 10–20x speed-up of the single-core CPU Julia model. The GPU-accelerated Julia code attained a speed-up of 230–380x compared to the single-core CPU Julia code. Parallelized Julia-MPI performance was identical to Fortran-MPI MPAS-Ocean for low processor counts, and ranged from 2x faster to 2x slower for higher processor counts. Our experience is that Julia development is fast and convenient for prototyping, but that Julia requires further investment and expertise to be competitive with compiled codes. We provide advice on Julia code optimization for HPC systems.
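The two single-core optimizations named in the abstract, static typing and removing implicit memory allocations, can be sketched in Julia as follows. This is an illustrative example only; the struct and function names are hypothetical and are not taken from the paper's MPAS-Ocean Julia code:

```julia
# Hypothetical sketch of the abstract's two optimizations: static typing
# and avoiding implicit allocations. Names are illustrative, not from the paper.

# Unoptimized: the untyped field forces boxed values, and the expression
# `s.h .+ dt .* tend` allocates a fresh array on every time step.
mutable struct StateSlow
    h            # abstract (untyped) field defeats compiler specialization
end

step_slow(s::StateSlow, tend, dt) = (s.h = s.h .+ dt .* tend; s)

# Optimized: a concrete field type enables specialized compiled code, and the
# fused in-place broadcast (via @.) updates existing storage with no temporary.
mutable struct StateFast
    h::Vector{Float64}   # concrete (static) type
end

function step_fast!(s::StateFast, tend::Vector{Float64}, dt::Float64)
    @. s.h += dt * tend   # expands to s.h .+= dt .* tend: fused, allocation-free
    return s
end
```

In practice the allocation difference can be checked with `@allocated` or `BenchmarkTools.@btime`; the in-place version should report zero heap allocations per step.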
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2023-57', Anonymous Referee #1, 25 Mar 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-57/egusphere-2023-57-RC1-supplement.pdf
- AC1: 'Reply on RC1', Mark R. Petersen, 02 Jul 2023
-
RC2: 'Comment on egusphere-2023-57', Anonymous Referee #2, 22 May 2023
The authors have made significant contributions by developing a shallow water solver in the Julia language and comparing its performance with a solver written in Fortran. Furthermore, they have successfully implemented their solver on a GPU, demonstrating a remarkable speed-up. While the overall results appear promising, I would suggest considering the following points to further enhance the paper:
1. In section 3.2, it would greatly enhance the paper to include a table comparing the specifications of the CPU and GPU used in the simulations. This table should provide a comprehensive comparison of various factors, such as FLOPS (Floating-Point Operations Per Second) and memory bandwidth, specifically for both 32-bit and 64-bit computations. Additionally, it would be valuable to summarize the versions of the toolchain that were utilized during these computations. This information will give readers a better understanding of the hardware and software environment in which the simulations were conducted, allowing for a more comprehensive evaluation of the results.
2. In section 3.2, it would be beneficial to include a comparison of the performance between the Julia code and the Fortran code in a single-core execution. This comparison will provide readers with insights into the optimization of the Julia code for serial computation.
3. In Section 3.2, the authors mentioned that all codes were executed in double precision and highlighted the faster simulation on the NVIDIA RTX8000 GPU compared to the CPU. However, it is important to consider that the RTX8000 is primarily designed for consumer applications and may exhibit slower performance in double-precision computation. To provide a more comprehensive evaluation, it would be valuable to compare the computation on a GPU targeted at high-performance computing (HPC), such as the NVIDIA Tesla A100, which is known for its robust double-precision performance and is specifically designed to excel in HPC workloads. Otherwise, please compare all simulations in single precision.
4. In section 3.3, it is evident that Julia-MPI outperformed Fortran-MPI in terms of computation, but it took more time for communication. To provide a clearer understanding of the experimental setup, it would be beneficial to specify the Fortran compiler and the Julia version, along with the related toolchain, that were employed in the study. Additionally, it is important to mention the specific version of the MPI library used for both the Fortran-MPI and Julia-MPI implementations. This information will help readers better comprehend the underlying MPI libraries utilized in each case and the potential impact they may have had on the communication performance.
Moreover, it is worth exploring the possibility that different MPI libraries might have been employed for the Fortran and Julia codes. If this is the case, it should be explicitly stated in the paper, along with the versions of the MPI libraries used for each implementation. Clarifying this aspect will enable readers to consider any discrepancies or optimizations associated with the MPI libraries employed in the Fortran and Julia implementations.
5. I think hyper-threading may be disabled on supercomputers. It would be helpful to omit the hyper-threaded CPU performance in section 3.3.
Citation: https://doi.org/10.5194/egusphere-2023-57-RC2
- AC2: 'Reply on RC2', Mark R. Petersen, 02 Jul 2023
- AC3: 'Reply on RC2', Mark R. Petersen, 02 Jul 2023
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
715 | 394 | 20 | 1,129 | 14 | 5
Cited
2 citations as recorded by crossref.