the Creative Commons Attribution 4.0 License.
Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of MPTRAC v2.6
Abstract. Lagrangian particle dispersion models are indispensable tools for the study of atmospheric transport processes. However, Lagrangian transport simulations can become numerically expensive when large numbers of air parcels are involved. To accelerate these simulations, we made considerable efforts to port the Massive-Parallel Trajectory Calculations (MPTRAC) model to graphics processing units (GPUs). Here we discuss performance optimizations of the major bottleneck of the GPU code of MPTRAC, the advection kernel. Timeline, roofline, and memory analyses of the baseline GPU code revealed that the application is memory-bound and performance suffers from near-random memory access patterns. By changing the data structure of the horizontal wind and vertical velocity fields of the global meteorological data driving the simulations from Structure of Arrays (SoA) to Array of Structures (AoS), and by introducing a sorting method for better memory alignment of the particle data, performance was greatly improved. We evaluated the performance on NVIDIA A100 GPUs of the Jülich Wizard for European Leadership Science (JUWELS) Booster module at the Jülich Supercomputing Center, Germany. For our largest test case, transport simulations with 10^8 particles driven by the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis, we found that the runtime for the full set of physics computations was reduced by 75 %, including a reduction of 85 % for the advection kernel. In addition to demonstrating the benefits of code optimization for GPUs, we show that the runtime of CPU-only simulations is also improved. For our largest test case, we found a runtime reduction of 34 % for the physics computations, including a reduction of 65 % for the advection kernel.
The code optimizations discussed here bring the MPTRAC model closer to applications on upcoming exascale high performance computing systems, and will also be of interest for optimizing the performance of other models using particle methods.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-2547', Anonymous Referee #1, 01 Feb 2024
The manuscript is a follow-up to Hoffmann et al. (2022), where the adaptation of MPTRAC to GPU processing using OpenACC was described and demonstrated. The present work describes two types of optimization producing significant speed-up for both GPU and CPU versions of the code.
The manuscript is well written and clear both in the methods and results and should be published.
I have only a few minor comments and questions for the authors.
1) Section 3 : Although the scope of this work is technical, a few more words about the type of tracer / molecule and processes considered here would be useful for the sake of completeness.
2) It is unclear whether ERA5 needs to be used at its maximum spatial and temporal resolution for all transport applications, in particular for large-scale transport. Using the full vertical resolution is certainly a good choice, but the horizontal and temporal resolution might be reduced, at least for the horizontal wind, with limited impact in many cases.
2) L176: What are NVTX markers? This seems to be an NVIDIA feature for profiling.
3) L192 : The arithmetic intensity, which is perhaps not a common notion, needs to be defined.
4) No indication is given about overlapping data transfers with calculations. I do not know how this applies to the architecture considered here, but it is a source of optimization on computers with cache memory. Perhaps it is done automatically, but it deserves to be mentioned.
5) I guess that the code runs in a configuration where the nodes are reserved for a single user and not time-shared. This also deserves to be mentioned.
6) Figure 3 is hardly readable. I am not color blind, but I see no red, and the blue is difficult to distinguish from the green without zooming into the figure. It seems that the baseline dots have a green contour and the optimized dots have a blue contour. This figure needs to be improved.
7) Figure 4: Is it possible to remove the 0.00 B/s channels, or to indicate that this is output from NVIDIA tools that cannot be beautified?
8) L251: Since two wind fields are required for time interpolation, why not align time in a float uvw[EX][EY][EZ][2][3] or float uvw[EX][EY][EZ][3][2] 5-D structure?
9) Is there any impact of aligning the tracer data?
10) It is not said whether the sorting is done by copying the tracer arrays or by using a permutation index without moving the data. Certainly the copy has the advantage of avoiding random access to the tracer data for the threads.
Citation: https://doi.org/10.5194/egusphere-2023-2547-RC1 -
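For readers unfamiliar with the term raised in comment 3 of RC1, the arithmetic intensity is conventionally defined in the roofline model as the ratio of floating-point work to memory traffic. The symbols below are chosen here for illustration and are not taken from the manuscript: $W$ is the number of floating-point operations of a kernel, $Q$ the number of bytes it moves to and from memory, $P_\mathrm{peak}$ the peak compute rate, and $B_\mathrm{peak}$ the peak memory bandwidth.

```latex
\begin{align}
  I &= \frac{W}{Q}
      \quad \left[\frac{\text{FLOP}}{\text{byte}}\right], \\
  P_\mathrm{attainable} &= \min\left(P_\mathrm{peak},\; I \cdot B_\mathrm{peak}\right).
\end{align}
```

A kernel is called memory-bound when $I \cdot B_\mathrm{peak} < P_\mathrm{peak}$, which is the regime the abstract reports for the baseline advection kernel.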
RC2: 'Comment on egusphere-2023-2547', Anonymous Referee #2, 03 Feb 2024
Review of 'Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of MPTRAC v2.6'
General Comments: The manuscript titled "Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of MPTRAC v2.6" by Hoffmann et al. presents a comprehensive study on the optimization of the MPTRAC Lagrangian particle dispersion model for improved performance on graphics processing units (GPUs). The research focuses on overcoming the challenges associated with the memory-bound nature of the advection kernel in MPTRAC by employing two primary optimization strategies: restructuring the layout of meteorological data and introducing a sorting algorithm for aligned memory access of particle data. The study is methodically sound and well-structured and makes a significant contribution to the field of Earth system modeling, particularly in enhancing the efficiency of Lagrangian transport simulations.
The study not only demonstrates significant advancements in the performance of the MPTRAC model but also provides insights that could be applied to other Lagrangian particle dispersion models. The manuscript is well-written, thoroughly researched, and presents its findings in a clear and accessible manner. It is recommended for publication after considering minor suggestions.
1) It would be beneficial if the optimization technique regarding the overlapped execution of data transfers and computations were mentioned.
2) A brief discussion on the limitations or challenges faced during the optimization process would add depth to the study.
3) In Fig.3, the “red” dots are difficult to distinguish from the “orange” dots. The figure would be more readable if there were more obvious color differences between “red” and “orange” and between “blue” and “green”.
4) In Fig. 4, some words, for instance, “* Global *” and “* Compression,” are not fully visible. I guess the figure might be produced by the NVIDIA software, but it would be beneficial to address this display issue for clarity, though this does not affect the integrity of the results presented.
5) As a reader, I am interested in the specific method of sorting the data: whether they are sorted by assigning an index to the arrays or by creating a new set of data based on the target elements.
Citation: https://doi.org/10.5194/egusphere-2023-2547-RC2 -
RC3: 'Comment on egusphere-2023-2547', Anonymous Referee #3, 04 Feb 2024
General Comments
The manuscript provides detailed memory access optimization schemes for the Lagrangian particle dispersion model MPTRAC (Hoffmann et al., 2022), especially on GPUs. The motivations of the two optimization schemes are clearly analyzed and the optimization processes are thoroughly demonstrated and tested, making the research sound and rigorous. The use of the Array of Structures method and the particle data sorting method provides new insights into memory optimization for Earth system model simulations on GPUs.
The manuscript is well written and should be published after consideration of some minor questions.
Specific Comments
1) The colors of the dots in Fig. 3 are hard to distinguish. It would help if the figure were explained in more detail, in particular the comparison of the results of the two model versions.
2) The research investigates how the performance of the optimization scales with problem size. I’m curious whether the tests were also run with other numbers of GPU cores and whether the optimized model shows a similar runtime improvement.
3) Data communication among GPU cores is commonly necessary in parallel sorting. I’m interested in how much communication is introduced by the particle sorting through the Thrust library and how it varies with the number of GPUs used. Does it become notable under certain GPU number settings?
4) I assume that the simulation results of the optimized model are bitwise identical with those of the base model, which is commonly achieved in model optimization work. I think this is worth mentioning in the manuscript.
Citation: https://doi.org/10.5194/egusphere-2023-2547-RC3 -
AC1: 'Comment on egusphere-2023-2547', Lars Hoffmann, 02 Apr 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2023-2547/egusphere-2023-2547-AC1-supplement.pdf
Data sets
Supplementary material to "Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of MPTRAC v2.6", Lars Hoffmann, https://doi.org/10.5281/zenodo.10065785
Model code and software
Massive-Parallel Trajectory Calculations (MPTRAC) v2.6 L. Hoffmann et al. https://doi.org/10.5281/zenodo.10067751
Cited
Kaveh Haghighi Mood
Andreas Herten
Markus Hrywniak
Jiri Kraus
Jan Clemens
Mingzhao Liu