The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514)
Abstract. Classical numerical models for the global atmosphere, as used for numerical weather forecasting or climate research, have been developed for conventional central processing unit (CPU) architectures. This now hinders the employment of such models on current top-performing supercomputers, which achieve their computing power with hybrid architectures, mostly using graphics processing units (GPUs). Thus, scientific applications of such models are also restricted to the lesser computing power of CPUs. Here we present the development of a GPU-enabled version of the ICON atmosphere model (ICON-A), motivated by a research project on the quasi-biennial oscillation (QBO), a global-scale wind oscillation in the equatorial stratosphere that depends on a broad spectrum of atmospheric waves, which originate from tropical deep convection. Resolving the relevant scales, from a few km to the size of the globe, is a formidable computational problem, which can only be realized now on top-performing supercomputers. This motivated porting ICON-A, in the specific configuration needed for the research project, in a first step to the GPU architecture of the Piz Daint computer at the Swiss National Supercomputing Centre, and in a second step to the Juwels-Booster computer at the Forschungszentrum Jülich. On Piz Daint the ported code achieves a single-node GPU vs. CPU speed-up factor of 6.3 and now allows global experiments at a horizontal resolution of 5 km on 1024 computing nodes with 1 GPU per node, with a turnover of 48 simulated days per day. On Juwels-Booster the more modern hardware in combination with an upgraded code base allows for simulations at the same resolution on 128 computing nodes with 4 GPUs per node and a turnover of 133 simulated days per day. Additionally, the code remains functional on CPUs, as demonstrated by additional experiments on the Levante compute system at the German Climate Computing Center. While the application shows good weak scaling making also higher resolved global simulations possible, the strong scaling on GPUs is relatively weak, which limits the options to increase turnover with more nodes. Initial experiments demonstrate that the ICON-A model can simulate downward propagating QBO jets, which are driven by wave-mean flow interaction.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2022-152', Anonymous Referee #1, 27 May 2022
The paper discusses porting of the ICON-A model to the GPU. An effective speedup of 6.3x is obtained in a node-to-node comparison, which is very good in my opinion.
Note: Some of the questions/comments may not be relevant as I am writing notes as I go through the paper and understand it better. I will probably send a second round of comments after I go through the paper once more.
Line 15: Why is the strong scaling poor?
Line 55: The literature review on GPU-ported weather models is not exhaustive. The entire WRF dycore and select physics packages have been ported to the GPU, gaining speedups of 5x to 7x, and NEPTUNE's dycore (NUMA) has been ported to the GPU, gaining speedups of up to 10x. I am sure MPAS has had OpenACC work too.
Line 70: Please provide details (names) of the "Sapphire" physics packages ported to the GPU. Was there a study to choose the physics packages needed for high-resolution grid simulations?
Line 75: What is the time integration method used in ICON-A? I assume it is a horizontally-explicit vertically-implicit (HEVI) scheme.
Line 245: This is a very good decision that is often missed. If one decides to run, for example, a particular physics scheme on the CPU, then all the speedup will go down the drain.
Line 275: I am curious whether ICON-A has adaptive mesh refinement (AMR) capability? If so, is that handled directly on the GPU?
Line 295: Deep subroutine calls are indeed a problem, but it is still possible to get an efficient kernel. A difficulty with that approach is if a lot of temporary arrays (e.g. 1D) are used in the inner subroutines, in which case the compiler will do GPU memory allocations there. What I found very helpful is to move those out of the inner loops to the topmost loop. Other than that, if the nested subroutines are simple, it should be equally efficient without the need for a major rewrite.
Line 320: Kernels vs. parallel loop. Kernels is often the safest option when you start porting the code. You can add `acc loop independent` to the loops to effectively get a `parallel loop`. There should not be any performance difference after that.
Line 385: Use of atomics is good, but the second approach looks simpler. Have you also tried "coloring" schemes where each thread constructs an independent portion of the list?
Line 415: Indirect addressing (lists) is not good for the GPU, and I wonder if the fact that it is used throughout the dycore suggests that it should be rewritten for better performance on the GPU, although a 6x speedup is already very good?
Line 460: GPU-aware MPI should be the right choice here, otherwise the cost of data transfer between CPU/GPU will kill performance.
Line 695: I agree b4b tests do not make sense for CPU/GPU validation. But a figure for both single- and double-precision calculations representing the deviation between CPU/GPU is a good target. Have you also tried PGI's auto-compare feature? It does redundant computations on both CPU and GPU, and notifies where they are diverging for a given level of tolerance.
Line 820: Could this be a reason for the less than optimal strong scaling results?
Line 840: I am confused because I thought the 6.3x speedup is over the end-to-end computation including all of physics and dynamics? But here you present speedups of individual kernels. I am afraid Amdahl's law plays a significant role here, as the least performing kernel often governs overall speedup.
Line 845: Could you give details on what percentage of time is spent in the land-surface model specifically, but for others as well? If the overall speedup is indeed 6.5x, land-surface should not be taking a significant portion.
Figure 5: Radiation is often the most time-consuming and I am not surprised you get the most speedup from it. Some models often do not compute radiation every time step for this reason. Would you say that the GPU-ported code offers a chance to do it every time step for more accurate results?
Could you also provide figures for speedup, weak and strong scaling for the end-to-end run without the subcomponents? The figure is cluttered at the moment and I would like to see one figure highlighting the overall results.
Line 900: The land-surface model showing very poor strong scaling could be the cause of the model's overall poor strong scaling then?
Citation: https://doi.org/10.5194/egusphere-2022-152-RC1
- AC1: 'Reply on RC1', Marco Giorgetta, 07 Jul 2022
Reply to the reviewer comment 1 (RC1) from 2022-05-27
Dear reviewer
Thank you very much for your review comment on our article. We are happy about your positive judgment of the effective 6x CPU-to-GPU speed-up in the node-to-node comparison.
We have improved the manuscript based on your comments, especially related to your comments about the cluttered figures. These figures have been redrawn. Figure 5 now shows the time-to-solution and the cumulative strong scaling of the integration timer for the R2B7 benchmarks alone, combining the results from the three machines in a single panel to facilitate the comparison between the machines, while Figure 6 now shows time-to-solution and cumulative strong scaling for the components, with separate panels for each machine similar to the old Figure 5. The text of Section "6 Benchmarking results" has been changed accordingly, now discussing first the overall results from Figure 5 before continuing with the results for the components based on Figure 6. The discussion of the strong scaling is followed by a short section on the weak scaling, based on the tabulated results alone. We hope that these changes have removed the cluttering. The stronger separation of the discussion of the integration as a whole from the components is certainly beneficial. Thanks for the recommendation!
A pdf file (generated by latexdiff) of the differences between the old and new manuscript, which includes changes related to your comments, is attached.
For all other comments please find our point-to-point replies below.
Best regards,
Marco Giorgetta
Abstract
RC1: Line 15: Why is the strong scaling poor?
Probably you refer to the statement on line 17: "While the application shows good weak scaling making also higher resolved global simulations possible, the strong scaling on GPUs is relatively weak, ...". This judgment is based on findings presented in section 6. We found that the turn-over (or time compression) achieved with the base setups cannot be increased very much in an efficient way by increasing the number of compute nodes, i.e. by exploiting strong scaling, for example towards the goal of 1 simulated year per day. In practice the strong scaling on the GPU machines allowed the number of nodes of the base setups to be doubled only once or twice with still acceptable efficiency. In contrast, on the CPU machine even six doublings were still efficient, but there we start from a considerably lower turn-over rate, so that no high turn-over can be achieved. It is this one or two doublings on GPUs compared to six doublings on CPU that led to the statement "... the strong scaling on GPUs is relatively weak ...". The data for the strong scaling of each doubling step, S_s, are given in Table 3, and the cumulative strong scaling, S_{s,cum}, over multiple doubling steps is shown in Figure 5 for the whole model ("integrate") as well as for the model components.
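For reference, a sketch of the per-doubling and cumulative strong-scaling measures in the sense used above (assumed standard definitions; the exact formulas used in the paper are those given with Table 3):

```latex
% Strong scaling of one node-doubling step and its cumulative value
% (assumed definitions; t(n) is the time-to-solution on n nodes, n_0 the base node count):
\[
  S_s(n) = \frac{t(n)}{2\,t(2n)}, \qquad
  S_{s,\mathrm{cum}} = \prod_{\text{doublings}} S_s
                     = \frac{t(n_0)}{(n/n_0)\,t(n)} ,
\]
% so S_s = 1 corresponds to perfect strong scaling for a single doubling step.
```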
1 Introduction
RC1: Line 55: The literature review on GPU-ported weather models is not exhaustive. The entire WRF dycore and select physics packages have been ported to the GPU, gaining speedups of 5x to 7x, and NEPTUNE's dycore (NUMA) has been ported to the GPU, gaining speedups of up to 10x. I am sure MPAS has had OpenACC work too.
Indeed, the literature review is not exhaustive. Since we have developed an end-to-end GPU implementation for real scientific runs, we are most interested in other complete implementations. However, we do respect the hard work which has gone into WRF and MPAS, and will add references to Huang et al. (2015) for WRF, and Kim et al. (2021) for MPAS (which incidentally reports a speedup of 2.38x).
2 Model configuration for QUBICC experiments
RC1: Line 70: Please provide details (names) of the "Sapphire" physics packages ported to the GPU. Was there a study to choose the physics packages needed for high-resolution grid simulations?
More details on the physics parameterizations comprising the ICON "Sapphire" physics are given further down in section 2.5 and its subsections. The text now refers to these subsections. We did not make a separate study to choose this set of parameterizations for our high resolution experiments. Rather we proceeded in two steps. First we excluded parameterizations which are not suitable at high resolution, because they were originally developed for significantly larger cell sizes, and second we used parameterizations which qualitatively correspond to those used in the operational setups of the ICON model at the German Weather Service (DWD) for global and regional weather predictions at 13 and 2 km resolution, respectively. The RTE+RRTMGP radiation scheme is a more modern version of the RRTMG code used for many years by DWD (and just recently replaced by the ECrad model). The cloud microphysics code is the same as in operational use. The vertical diffusion scheme is different but relies on similar theory and discretization. Similar arguments can be made for the land scheme.
RC1: Line 75: What is the time integration method used in ICON-A, I assume it is horizontally-explicit vertically-implicit (HEVI) scheme.
Yes, it is horizontally explicit and vertically implicit (HEVI). Details are described in the documentation of the dynamical core by Zaengl et al. (2015).
4 Porting ICON to GPUs
RC1: Line 245: This is a very good decision that is often missed. If one decides to run for example a particular physics scheme on the CPU, then all the speedup will go down the drain.
Indeed, we knew from the beginning that the only possibility for speedup was to port all code within the time loop. Practically it meant that a considerable part of the model code had to be ported, including the more demanding dynamics and tracer transport codes.
RC1: Line 275: I am curious if ICON-A has adaptive mesh refinement (AMR) capability? If so, is that handled directly on the GPU?
ICON-A has a static 'nesting' capability (telescopic grids of increasing resolutions over areas of interest), which is now being ported to GPUs for future applications. The key word here is 'static', which contrasts with AMR, which is implicitly dynamic. Therefore ICON-A does not have AMR capability.
RC1: Line 295: Deep subroutine calls are indeed a problem but it is still possible to get an efficient kernel. A difficulty with that approach is if a lot of temporary arrays (e.g. 1D) are used in the inner subroutines, in which case, the compiler will do GPU memory allocations there. What I found very helpful, is to move those out of the inner loops to the topmost loop. Other than that, if the nested subroutines are simple, it should be equally efficient without the need for major rewrite.
We considered moving the innermost loops upwards in the call tree, but decided against it: this goes in the direction of a full code rewrite, and the spirit of OpenACC is exactly to avoid such extensive refactoring of the code. Interestingly the parallelization over the two innermost dimensions (nproma and nlev) is sufficient for good performance if there are only one, or very few, blocks of nproma. The more critical performance bottleneck is actually the function call overhead of GPU routines designated with ACC ROUTINE SEQ. This call overhead is considerable and is only avoided if the compiler is capable of inlining the code. This is not always the case, and proper inlining is the subject of ongoing optimizations.
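As a minimal sketch of the pattern discussed here (routine and variable names are ours, not ICON's): a point-wise physics routine marked with ACC ROUTINE SEQ is called from a collapsed nproma x nlev parallel loop; if the compiler does not inline the call, each invocation pays the call overhead mentioned above.

```fortran
MODULE mo_seq_call_sketch
CONTAINS
  SUBROUTINE point_update(t, q)
    !$ACC ROUTINE SEQ
    REAL, INTENT(INOUT) :: t, q
    ! Placeholder for a point-wise parameterization update.
    t = t + 0.1*q
    q = 0.9*q
  END SUBROUTINE point_update

  SUBROUTINE physics_block(nproma, nlev, temp, qv)
    INTEGER, INTENT(IN)    :: nproma, nlev
    REAL,    INTENT(INOUT) :: temp(nproma,nlev), qv(nproma,nlev)
    INTEGER :: jc, jk
    ! Parallelizing over the two innermost dimensions is sufficient when
    ! there are only one or very few nproma blocks per GPU.
    !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
    DO jk = 1, nlev
      DO jc = 1, nproma
        CALL point_update(temp(jc,jk), qv(jc,jk))
      END DO
    END DO
  END SUBROUTINE physics_block
END MODULE mo_seq_call_sketch
```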
RC1: Line 320: Kernels vs parallel loop. Kernels is often the safest option when you start porting the code. You can add a `acc loop independent` to the loops to effectively get a `parallel loop`. There should not be any performance difference after that.
This depends entirely on the degree of support for KERNELS in the OpenACC compiler. For example, we understand GCC has very limited support for KERNELS. Since our work emphasizes the code portability aspect, we only used KERNELS for simple code cases, e.g., initialization of arrays using array syntax.
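A small illustration of the distinction (array names are invented for this sketch): KERNELS for simple array-syntax statements, and an explicit PARALLEL LOOP for hand-written loops.

```fortran
SUBROUTINE kernels_vs_parallel(nproma, nlev, z_flux, z_tend, rdz)
  INTEGER, INTENT(IN)    :: nproma, nlev
  REAL,    INTENT(INOUT) :: z_flux(nproma,nlev), z_tend(nproma,nlev)
  REAL,    INTENT(IN)    :: rdz(nproma,nlev)
  INTEGER :: jc, jk

  ! KERNELS: the compiler analyses the array-syntax initialization itself.
  !$ACC KERNELS DEFAULT(PRESENT)
  z_flux(:,:) = 0.0
  z_tend(:,:) = 0.0
  !$ACC END KERNELS

  ! PARALLEL LOOP: explicit mapping of the double loop onto gangs/vectors.
  !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
  DO jk = 1, nlev
    DO jc = 1, nproma
      z_tend(jc,jk) = z_flux(jc,jk) * rdz(jc,jk)
    END DO
  END DO
END SUBROUTINE kernels_vs_parallel
```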
RC1: Line 385: Use of atomics is good, but the second approach looks simpler. Have you also tried "coloring" schemes where each thread constructs an independent portion of the list.
Atomics performed poorly in the two OpenACC compilers we worked with: Nvidia and CCE. Anecdotally, the performance of atomics is now improving, however we had long since moved to the parallel scan method based on the CUB library. While this is specific to Nvidia GPUs, a library with similar interface is also available for AMD GPUs. The atomic functionality in OpenACC is actually quite simple to use, but we would only consider reactivating that code if there were guarantees of commensurate performance with CUB.
Coloring schemes only work well in a static context, e.g., when dependencies on Cartesian grids are consistently NSEW neighbors, thus cells can be divided into “red” and “black”, with each color being updated in two separate steps. Coloring will also work for more complex *static* dependencies. However, in this case the dependencies are determined dynamically through evaluating the fields for particular conditions, e.g., some velocity condition being exceeded. Therefore, one would have to create the dependency tree, calculate the coloring, then perform the updates in multiple steps (there could be many dependencies for any given cell). Thus for ICON it is much more efficient to perform the parallel scan, as mentioned above and in the paper. This solution works sufficiently well to avoid performance bottlenecks.
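For illustration, a sketch of the ATOMIC variant discussed above (the production code uses the CUB-based parallel scan instead; names are invented): each thread that finds a cell satisfying the condition appends its index to a shared list, so the resulting order is nondeterministic.

```fortran
SUBROUTINE build_index_list(n, w, w_crit, list_len, idx_list)
  INTEGER, INTENT(IN)  :: n
  REAL,    INTENT(IN)  :: w(n), w_crit
  INTEGER, INTENT(OUT) :: list_len, idx_list(n)
  INTEGER :: jc, pos

  list_len = 0
  !$ACC PARALLEL LOOP GANG VECTOR COPYIN(w) COPY(list_len, idx_list) PRIVATE(pos)
  DO jc = 1, n
    IF (w(jc) > w_crit) THEN
      ! Reserve one slot in the shared list; the order of entries is
      ! nondeterministic, which is one reason reproducibility needs care.
      !$ACC ATOMIC CAPTURE
      list_len = list_len + 1
      pos = list_len
      !$ACC END ATOMIC
      idx_list(pos) = jc
    END IF
  END DO
END SUBROUTINE build_index_list
```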
RC1: Line 415: Indirect addressing (lists) are not good for the GPU, and I wonder if the fact that it is used through the dycore suggests that it should be re-written for better performance on the GPU, although 6x speedup is already very good?
The performance loss due to indirect addressing is subject to discussion, with the NIM team (Govett et al., 2017) reporting limited overhead if the data are laid out properly. Since the icosahedral grid is quasi-regular, attempts were made by our team to avoid the indirect addressing. However, these ultimately would have led to the development of a completely new model, and were thus discarded.
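To make the addressing pattern concrete, here is a simplified sketch of a typical dycore stencil (a cell divergence; names and array shapes are simplified relative to ICON): the edge values are gathered through index and block lookup tables rather than with regular strides.

```fortran
SUBROUTINE cell_divergence(nproma, nlev, iidx, iblk, geofac_div, vn, div)
  INTEGER, INTENT(IN)  :: nproma, nlev
  INTEGER, INTENT(IN)  :: iidx(nproma,3), iblk(nproma,3)   ! edge index/block tables
  REAL,    INTENT(IN)  :: geofac_div(nproma,3)             ! divergence coefficients
  REAL,    INTENT(IN)  :: vn(nproma,nlev,*)                ! edge-normal wind (idx,lev,blk)
  REAL,    INTENT(OUT) :: div(nproma,nlev)
  INTEGER :: jc, jk, je

  !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
  DO jk = 1, nlev
    DO jc = 1, nproma
      div(jc,jk) = 0.0
      !$ACC LOOP SEQ
      DO je = 1, 3
        ! Indirect (gather) access: the memory location depends on the
        ! lookup tables, not on jc and jk alone.
        div(jc,jk) = div(jc,jk) + geofac_div(jc,je) * vn(iidx(jc,je), jk, iblk(jc,je))
      END DO
    END DO
  END DO
END SUBROUTINE cell_divergence
```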
RC1: Line 460: GPU aware MPI should be the right choice here, otherwise the cost of data transfer between CPU/GPU will kill performance.
You are correct. While we saw only minor gains with GPU-aware MPI for Piz Daint, which has nodes with a single GPU and a single CPU, there are *much* larger gains (a factor of 2x) on architectures with multiple GPUs per node, e.g., Juwels Booster.
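A minimal sketch of the GPU-aware path (corresponding to the use_g2g mode mentioned in the paper; the routine shown here is illustrative, not ICON's actual communication layer): HOST_DATA hands the device address of the halo buffer directly to MPI, so no device-to-host copy is needed.

```fortran
SUBROUTINE halo_isend(buf, n, dest, tag, comm, request)
  USE mpi
  INTEGER, INTENT(IN)  :: n, dest, tag, comm
  REAL(8), INTENT(IN)  :: buf(n)        ! halo buffer, already present on the GPU
  INTEGER, INTENT(OUT) :: request
  INTEGER :: ierr

  ! With GPU-aware MPI the device pointer is passed straight to MPI_Isend.
  !$ACC HOST_DATA USE_DEVICE(buf)
  CALL MPI_Isend(buf, n, MPI_DOUBLE_PRECISION, dest, tag, comm, request, ierr)
  !$ACC END HOST_DATA
END SUBROUTINE halo_isend
```

Without GPU-aware MPI, the buffer would first need an `!$ACC UPDATE HOST(buf)` before the send (and a matching update on the receive side), which is exactly the extra traffic the g2g path avoids.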
5 Validation
RC1: Line 695: I agree b4b tests do not make sense for CPU/GPU validation. But a figure for both single and double precision calculations representing the deviation between CPU/GPU is a good target. Have you also tried PGI's auto-compare feature? It does redundant computations on both CPU and GPU, and notifies where they are diverging for a given level of tolerance.
We have tried Nvidia's PCAST functionality in certain debugging situations and have also implemented (see Section 5.1) a mode to compare the code running concurrently on different MPI processes, one running sequentially on CPU, with the others running on GPU, to given relative and/or absolute error thresholds. However, due to the chaotic nature of the underlying problem, this technique is only viable for debugging one or very few time steps. Ultimately the errors grow beyond any round-off bound, similar to what happens for a perturbed system, thus justifying the use of tolerance testing (Section 5.3).
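For illustration, the essence of such a tolerance comparison is a check like the following (a simplified sketch with invented names; the actual ICON infrastructure described in Sect. 5.1 compares whole model states across MPI processes):

```fortran
LOGICAL FUNCTION fields_agree(n, a_cpu, a_gpu, rel_tol, abs_tol)
  INTEGER, INTENT(IN) :: n
  REAL(8), INTENT(IN) :: a_cpu(n), a_gpu(n), rel_tol, abs_tol
  INTEGER :: i

  fields_agree = .TRUE.
  DO i = 1, n
    ! Combined absolute/relative criterion, evaluated element by element.
    IF (ABS(a_cpu(i) - a_gpu(i)) > abs_tol + rel_tol*ABS(a_cpu(i))) THEN
      fields_agree = .FALSE.
      RETURN
    END IF
  END DO
END FUNCTION fields_agree
```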
6 Benchmarking Results
RC1: Line 820: Could this be a reason for the less than optimal strong scaling results?
Strong scaling is not optimal because, with increasing number of GPUs, the local block size simply becomes too small for it to run efficiently. This is less of a problem on CPU. We have tried to emphasize this point in the paper, e.g., "strong scaling on GPUs depends sensitively on the ability to maintain sufficient work for each node as the node count is increased."
RC1: Line 840: I am confused because I thought the 6.3x speedup is over the end-to-end computation including all of physics and dynamics?
Yes, for the single node setup we measure a speed-up of 6.4x for the entire time loop. On line 840 we only point out the lower limit of such a speed-up that would make any sense. If we think about comparing the performance on one node with 1 CPU + 1 GPU to one node with 2 CPUs, a GPU over CPU speed-up should be significantly higher than 2 to make the GPU port a reasonable investment.
RC1: But here you present speedups of individual kernels. I am afraid Amdahl's law plays a significant role here, as the least performing kernel often governs overall speedup.
Yes, we think it is important to also present speedups of individual components. Amdahl's Law does play a role in the overall scalability, but it should not be assumed that the components in Figure 4 have similar overall execution times. In particular, the time spent in the "land physics" (3x speedup) is insignificant compared to the time spent in dynamics (6.3x) or radiation (7.4x). To make this clear we have redrawn Figure 4. Panel (a) of the new figure shows the relative costs of all model components as a percentage of the time-to-solution of the integration. Obviously radiation and dynamics dominate the costs, while land, cloud microphysics and vertical diffusion have only small shares in the costs. Panel (b) shows the speed-up of the integration and the components. Note that the "atmospheric physics" used in the old version of Figure 4 is no longer shown. Instead we show the results for all individual components, as used already in Figure 5.
RC1: Line 845: Could you give details on what percentage of time is spent in the land surface model specifically, but for others as well? If the overall speedup is indeed 6.5x, land-surface should not be taking a significant portion.
Absolutely correct (see above). The new Figure 4 now shows the relative costs of the components. The text in the manuscript is adjusted accordingly. Concerning the land scheme, which has the poorest scaling, here are the numbers for the relative costs: land on GPU: 4.3%, land on CPU: 2.0%.
RC1: Figure 5: Radiation is often the most time-consuming and I am not surprised you get the most speedup from it.
Unfortunately this is not automatically given. For the GPU port of ICON-A a first attempt was made to port the PSrad scheme, then the default radiation scheme of ICON-A, with OpenACC directives in a similar way as it worked for other physics parameterizations. This however did not reach any acceptable speedup. This was attributed to the code and data structures, where the additional dimension for the spectral resolution plays an important role. This led to the replacement of the PSrad scheme by the RTE+RRTMGP scheme, whose developers have paid attention to these challenges from the beginning of the code development. RTE+RRTMGP also includes a small number of separate codes for CPU and GPU to account for the respective needs.
But you are certainly right that radiation can reach a high speedup, simply because it can offer a lot of work and parallelism for a GPU, if the code is structured such that this can be exploited.
RC1: Some models often do not compute radiation every time step for this reason. Would you say that the GPU ported code offers a chance to do it every time step for more accurate results?
The setup used here computes radiative fluxes only every 18th time step. But still radiation is the most costly component when measured on a single node. Changing this to a setup with radiation computed every time step would result in 8 times higher integration costs. For our science case - the simulation of the QBO - we do not expect any advantage from such an increased frequency of the computation of radiative fluxes, and we cannot afford an 8-fold increase in costs. But we agree that the computation of all physical processes with equal frequency would generally be more appropriate.
RC1: Could you also provide figures for speedup, weak and strong scaling for the end-to-end run without the sub-components? The figure is cluttered at the moment and I would like to see one figure highlighting the overall results.
The figures have been redrawn and the text adjusted, see our main reply.
RC1: Line 900: Land-surface model showing very poor strong scaling could be the cause of the model's over poor strong scaling then?
This is true to some extent, but the main reason, as mentioned above, is that the strong scaling is limited by the work available for the GPU as the local block size decreases.
Also, if the land scheme scaled "perfectly" on GPUs, the whole model would still have poor strong scaling when more than quadrupling the number of nodes. This can be seen from Figure 5, which shows the scaling as well as the time to solution. The reason is that the components dynamics, transport, cloud microphysics and vertical diffusion, which together dominate the time-to-solution compared to land, also have a scaling that is insufficient to achieve good strong scaling.
- RC2: 'Comment on egusphere-2022-152', Italo Epicoco, 13 Jul 2022
The authors described the activity of porting the ICON-A atmospheric model to a GPU-based parallel architecture using a directive-based approach for the parallelisation, in order to take full advantage of exascale architectures and improve scientific outcomes by resolving physical and climate processes down to the scale of a few kilometers.
First of all, I would like to express my full appreciation for a really well written manuscript and for a wealth of information and details. However, some points can be better clarified / discussed:
- The GPU approach followed by the authors leverages OpenACC, although OpenMP is mentioned in some parts of the manuscript (lines 499, 833). My questions are:
+ is the initial version of ICON-A parallelized with MPI + OpenMP? If so, it should be explicitly mentioned at the beginning when the description of the model is given.
+ taking into account that OpenMP v5 supports GPU offloading, supporting not only NVIDIA but also Intel GPUs, why did the authors choose to use OpenACC instead of OpenMP? And finally, once the choice fell on OpenACC, why is OpenMP kept inside the code?
The authors should clarify these aspects to better justify their choices.
- in Section 4.4 Line 520 the authors first introduce the concept of reproducibility, which will be discussed in more detail later in Section 5. In Section 4.4 the meaning of the word "reproducibility" is not clear. Are the authors referring to bit-identity reproducibility or tolerance reproducibility? How was reproducibility evaluated in the context of physical parametrization (Sec 4.4)?
Moreover, the authors use the "ACC LOOP SEQ" directive to "fix the order of the summands", but it is not clear why this is needed; what is the correct order to do a summation? Considering that the round-off error is inherently present in the code, even in the sequential version of the code, why should summation follow the "LOOP SEQ" order?
- In Section 5 the authors deeply discussed the validation techniques available for ICON. Namely, in Sect 5.3 the tolerance testing is presented, which consists of evaluating an ensemble obtained by perturbing the state variables with a uniform error of the order of magnitude 10^-14. My comment here is that the main source of divergence in the outputs, when implementing a parallel version of a code, is due to the round-off error that can grow after several time steps. In order to evaluate the effect and impact of the round-off error it is probably best to create an ensemble by changing the order in which the grid cells are evaluated, such as by shuffling the arrays with the grid cells.
- In Section 6.5.1 the authors should provide a comment on why the radiation exhibits a super-linear strong scalability on Piz Daint and not on Juwels-Booster nor on Levante.
Why do the transport and the vertical diffusion have a super-linear strong scalability on Levante? The vertical diffusion has a really strange and counterintuitive behaviour since its scalability curve increases with the number of nodes.
- In Section 4.3.1 Line 322 the sentence "There are code divergences in the non-hydrostatic solver" is a bit misleading since it is not clear whether it refers to thread divergence or code differences between CPU and GPU.
- Listing 2 reports an example to explain the use of scalars on GPUs instead of arrays, but the transformation of a 2D array into a scalar is not fully clear; namely, in the expression for the scalar (lines 331-333) the index jk-1 is used, while in the expression on lines 336-338 the index jk is used. Moreover, it is also unclear whether the "z_w_concorr_mc_m1" values are used/needed after the do loop; if these values are not used outside the loop, probably the scalar transformation is also useful for the CPU case.
- In the abstract and in the conclusions the authors write that the model exhibits a good weak scalability. But after a careful reading, and according to what is stated in Section 6.5, "ICON exhibits very good weak-scaling for a 16-fold increase in node count", a complete weak scalability analysis has not actually been provided, as the weak scalability has been evaluated only for the case of a 16-fold increase. I suggest that the same comment is also reported in the abstract and conclusions.
More cosmetic comments, suggestions and typos:
- In the abstract, line 8, it is better to use "kilometres" instead of "km"
- Line 17-18: there is a pun in the sentence... the weak scalability is good and the strong scalability is weak
- Line 116: "(black)" should be "(blue)"
- Line 272: "data types" should probably be "data structures"
- Listing 6 and Listing 7 have exactly the same caption. I suggest merging the two listings or differentiating the captions
- Line 702: "ptest" mode is mentioned here for the first time, it would help if, in the same sentence, the authors anticipate that the mode is described in the following section.
- Line 787: "Ss and Ss" should be "Ss and Sw"
- Line 919: "1 SDPD" should probably be "1 SYPD"
Citation: https://doi.org/10.5194/egusphere-2022-152-RC2
- AC2: 'Reply on RC2', Marco Giorgetta, 20 Jul 2022
Reply to the reviewer comment 2 (RC2) from 2022-07-13
Dear Italo Epicoco
Thank you very much for your review comment on our article and your appreciation of the manuscript. Please find below our responses to your comments.
A pdf file (generated by latexdiff) of the differences between the current manuscript and the manuscript that you reviewed is attached. Please note that the current manuscript also includes changes made in response to the comments of reviewer 1. This specifically relates to Section 6 up to and including Subsection 6.4.5; these parts have been modified to account for the redrawn Figures 4, 5, and 6, which now present the results in a different order.
Best regards,
Marco Giorgetta
RC2: 'Comment on egusphere-2022-152', Italo Epicoco, 13 Jul 2022
The authors described the activity of porting the ICON-A atmospheric model to a GPU-based parallel architecture using a directive-based approach for the parallelisation, in order to take full advantage of exascale architectures and improve scientific outcomes by resolving physical and climate processes down to the scale of a few kilometers.
First of all, I would like to express my full appreciation for a really well written manuscript and for a wealth of information and details.
However some points can be better clarified / discussed
- The GPU approach followed by the authors leverages OpenACC, although OpenMP is mentioned in some parts of the manuscript (lines 499, 833). My questions are:
+ is the initial version of ICON-A parallelized with MPI + OpenMP? If so, it should be explicitly mentioned at the beginning when the description of the model is given.
Yes, this is the case. The standard applications of ICON-A on the CPU machines at DKRZ (or elsewhere) use mixed MPI + OpenMP parallelization.
+ taking into account that OpenMP v5 supports GPU offloading, supporting not only NVIDIA but also Intel GPUs, why did the authors choose to use OpenACC instead of OpenMP? And finally, once the choice fell on OpenACC, why is OpenMP kept inside the code?
At the time when we started the GPU port on the Piz Daint computer at CSCS, the only directive based option that existed, and was powerful enough for the ICON code, was the OpenACC implementation for the PGI/Nvidia compiler. There was no OpenMP5 implementation available.
But in the meanwhile additional work has started on a GPU port based on OpenMP 5 targeting ICON applications on the LUMI supercomputer. This work is still ongoing.
OpenMP is kept in the code because the ICON code is still in use on CPU machines. Thus we currently have MPI x OpenMP for CPUs and MPI x OpenACC for GPU machines with OpenACC/PGI support. (Ongoing work currently investigates if the OpenACC port of ICON can be modified such that it works also on AMD/Cray systems.)
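As a small illustration of this coexistence (loop and names invented for the sketch): the same routine carries OpenMP directives for CPU builds and OpenACC directives for GPU builds, and each compiler ignores the directives of the other programming model.

```fortran
SUBROUTINE add_tendency(nproma, nlev, nblks, field, tend, dt)
  INTEGER, INTENT(IN)    :: nproma, nlev, nblks
  REAL,    INTENT(INOUT) :: field(nproma,nlev,nblks)
  REAL,    INTENT(IN)    :: tend(nproma,nlev,nblks), dt
  INTEGER :: jb, jk, jc

  ! CPU: OpenMP threads over the block loop. GPU: OpenACC offload of the
  ! inner loops (OpenMP threading is then typically reduced to one thread).
!$OMP PARALLEL DO PRIVATE(jb, jk, jc)
  DO jb = 1, nblks
    !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
    DO jk = 1, nlev
      DO jc = 1, nproma
        field(jc,jk,jb) = field(jc,jk,jb) + dt*tend(jc,jk,jb)
      END DO
    END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE add_tendency
```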
The authors should clarify these aspects to better justify their choices.
The introduction is now extended on line 62 as follows:
... In the CPU case applications shall continue to use the proven parallelization by MPI domain decomposition mixed with OpenMP multi-threading, while in the GPU case parallelization should now combine the MPI domain decomposition with OpenACC directives for the parallelization on the GPU. OpenACC was chosen because this was the only practical option on the GPU compute systems used in the presented work and described below. Consequently the resulting ICON code presented here now includes OpenMP and OpenACC directives.
- in Section 4.4 Line 520 the authors first introduce the concept of reproducibility, which will be discussed in more detail later in Section 5. In Section 4.4 the meaning of the word "reproducibility" is not clear. Are the authors referring to bit-identity reproducibility or tolerance reproducibility? How was reproducibility evaluated in the context of physical parametrization (Sec 4.4)?
Yes, "reproducibility" means bit-wise reproducibility. The text is changed on lin 533 to clarify this:
... This bit-wise reproducibility is important in the model development process because it facilitates the detection of unexpected changes of model results, as further discussed in Sect. 5.
Moreover, the authors use the "ACC LOOP SEQ" directive to "fix the order of the summands" but it is not clear why this is needed; what is the correct order to do a summation? Considering that the round-off error is inherently present in the code, even in the sequential version of the code, why should summation follow the "LOOP SEQ" order?
The critical point is that we want to ensure that the sequence is maintained so that round-off effects remain unchanged if the computation is repeated, with the goal of obtaining a bit-wise reproducible code.
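A minimal sketch of this pattern (invented names): the vertical accumulation is forced to run sequentially per column so the order of the summands, and hence the round-off, is the same in every run, while the horizontal loop remains parallel.

```fortran
SUBROUTINE column_sum(nproma, nlev, field, colsum)
  INTEGER, INTENT(IN)  :: nproma, nlev
  REAL,    INTENT(IN)  :: field(nproma,nlev)
  REAL,    INTENT(OUT) :: colsum(nproma)
  INTEGER :: jc, jk

  !$ACC PARALLEL LOOP GANG VECTOR DEFAULT(PRESENT)
  DO jc = 1, nproma
    colsum(jc) = 0.0
    ! LOOP SEQ fixes the order of the summands, making the result
    ! bit-wise reproducible from run to run.
    !$ACC LOOP SEQ
    DO jk = 1, nlev
      colsum(jc) = colsum(jc) + field(jc,jk)
    END DO
  END DO
END SUBROUTINE column_sum
```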
- In Section 5 the authors deeply discussed the validation techniques available for ICON. Namely, in Sect 5.3 the tolerance testing is presented, which consists of evaluating an ensemble obtained by perturbing the state variables with a uniform error of the order of magnitude 10^-14. My comment here is that the main source of divergence in the outputs, when implementing a parallel version of a code, is due to the round-off error that can grow after several time steps. In order to evaluate the effect and impact of the round-off error it is probably best to create an ensemble by changing the order in which the grid cells are evaluated, such as by shuffling the arrays with the grid cells.
Certainly there exists more than one method for creating an ensemble of simulations. But a variation of the order of the evaluation of the grid columns alone would not change any result in the ICON-A integration, in the absence of bugs in the parallelization or blocking. This bit-wise reproducibility with respect to parallelization (number of MPI processes, OMP threads) and blocking is regularly tested. Therefore the simplest method was to add numerical noise to the state variables.
- In Section 6.5.1 the authors should provide a comment on why the radiation exhibits a super linear strong scalability on PizDaint and not on Juwels-Booster neither on Levante.
This is explained in section "6.5.1 Strong scaling of components". The reason is that the sub-blocking length for radiation, controlled by the rcc parameter, can be used on Piz Daint and Juwels-Booster to maintain or increase the work load per radiation call, within some limits, while the work load per call for all other processes simply decreases by a factor of 2 per node doubling, as does the number of columns per compute domain. Differences between Piz Daint and Juwels-Booster exist in the number of doubling steps where a factor of 2 reduction in rcc can be avoided. This limit is reached when rcc has grown to the number of points per domain. For Piz Daint this limit is reached only at the highest parallelization (1024 nodes), while on Juwels-Booster this limit is reached already on 32 nodes. This provides the simplest explanation for the nearly constant strong scaling of radiation, even slightly above 1, on Piz Daint, and the transition from a constant strong scaling just below 1 up to 32 nodes to a decaying strong scaling on Juwels-Booster. But we do not know why the effect of maintaining or increasing the radiation work load even leads to super linear strong scaling on Piz Daint, while we find only a strong scaling of just below 1 for Juwels Booster. Maybe it is related to differences in the overhead costs for setting up the parallel loops on the GPU.
On Levante we use nproma = rcc = 32 for all experiments, so that the work load per radiation call is also fixed. Here the strong scaling of radiation depends on other factors, as is also the case for all other processes.
To clarify this, the section "6.2 Optimization parameters", with respect to the sub-blocking for the radiation, and the section "6.4.3 Strong scaling of components" have been modified. Note that these sections have already been changed in response to referee 1.
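A sketch of the sub-blocking idea for the radiation (driver and names invented; the actual interface is the RTE+RRTMGP one): the columns of a compute domain are passed to the radiation in chunks of length rcc, so the work per call can be held roughly constant even when the number of columns per node shrinks under strong scaling.

```fortran
SUBROUTINE radiation_driver(ncol, rcc, tsfc, flux)
  INTEGER, INTENT(IN)  :: ncol, rcc        ! total columns and sub-block length
  REAL,    INTENT(IN)  :: tsfc(ncol)
  REAL,    INTENT(OUT) :: flux(ncol)
  INTEGER :: jcs, jce

  DO jcs = 1, ncol, rcc
    jce = MIN(jcs + rcc - 1, ncol)
    ! One radiation call per sub-block of up to rcc columns.
    CALL radiation_sub_block(jce - jcs + 1, tsfc(jcs:jce), flux(jcs:jce))
  END DO
CONTAINS
  SUBROUTINE radiation_sub_block(n, ts, fl)
    INTEGER, INTENT(IN)  :: n
    REAL,    INTENT(IN)  :: ts(n)
    REAL,    INTENT(OUT) :: fl(n)
    fl = 5.67E-8 * ts**4   ! placeholder for the actual radiative transfer
  END SUBROUTINE radiation_sub_block
END SUBROUTINE radiation_driver
```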
Why do the transport and the vertical diffusion have a super-linear strong scalability on Levante? The vertical diffusion has a really strange and counterintuitive behaviour since its scalability curve increases with the number of nodes.
This is indeed peculiar. From our log files and timing data we cannot derive an explanation. Our speculation is that this results from cache effects. We did not investigate this behavior further, first of all because these effects do not distort the overall scaling behavior shown in Figure 5, where the main difference occurs between the GPUs on the one hand and the CPU on the other hand. Secondly, diagnosing cache efficiency is non-trivial and can become a study on its own. Thus we only point out these behaviors in the manuscript on lines 915 to 955.
- In Section 4.3.1 Line 322 the sentence "There are code divergences in the non-hydrostatic solver" is a bit misleading since it is not clear whether it refers to thread divergence or code differences between CPU and GPU.
Code differences between CPU and GPU are meant. The text in the manuscript is changed to make this clear (line 331).
- Listing 2 reports an example to explain the use of scalars on GPUs instead of arrays, but the transformation of 2D array into a scalar is not fully clear; namely, in the expression for the scalar (line 331-333) the index jk-1 is used while in the expression (lines 336-338) the index jk is used. Moreover is also unclear whether the "z_w_concorr_mc_m1" values are used/needed after the do loop; if these values are not used outside the loop probably the scalar transformation is also useful for the CPU case.
Indeed, Listing 2 was simplified to a point where it no longer sufficiently illustrated the intended point of using registers to replace arrays. We have now added the full loop, in which both the scalars z_w_concorr_mc_m0 and z_w_concorr_mc_m1 are calculated and consumed, and we have added Fortran comments explaining the code which they replaced. This should provide a full explanation of this optimization.
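For readers without the listing at hand, the optimization amounts to the following pattern (simplified relative to the actual Listing 2; only the z_w_concorr_mc_m0/_m1 names are taken from it, the surrounding arithmetic is illustrative): two vertical levels of a temporary are held in scalars, which the GPU can keep in registers, instead of in a full (nproma, nlev) work array.

```fortran
SUBROUTINE vert_avg(nproma, nlev, w_con, w_half)
  INTEGER, INTENT(IN)  :: nproma, nlev
  REAL,    INTENT(IN)  :: w_con(nproma,nlev)
  REAL,    INTENT(OUT) :: w_half(nproma,nlev)
  INTEGER :: jc, jk
  REAL    :: z_w_concorr_mc_m0, z_w_concorr_mc_m1   ! replace a (nproma,nlev) temporary

  !$ACC PARALLEL LOOP GANG VECTOR DEFAULT(PRESENT) PRIVATE(z_w_concorr_mc_m0, z_w_concorr_mc_m1)
  DO jc = 1, nproma
    !$ACC LOOP SEQ
    DO jk = 2, nlev
      z_w_concorr_mc_m1 = w_con(jc,jk-1)   ! value at level jk-1
      z_w_concorr_mc_m0 = w_con(jc,jk)     ! value at level jk
      w_half(jc,jk) = 0.5*(z_w_concorr_mc_m1 + z_w_concorr_mc_m0)
    END DO
  END DO
END SUBROUTINE vert_avg
```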
- In the abstract and in conclusions the authors write that the model exhibits a good weak scalability. But after a careful reading and according on what is stated in Section 6.5, "ICON exhibits very good weak-scaling for a 16-fold increase in node count", actually, a complete weak scalability analysis has not been provided as the weak scalability has been evaluated only in the case of 16-fold increase. I suggest that the same comment is also report in the abstract and conclusion .
The abstract and conclusions now include: "... over the tested 16-fold increase in grid size and node count ..." so that the statements are now more precise.
More cosmetic comments, suggestions and typos:
- In the abstract, line 8, it is better to use "kilometres" instead of "km"
Done
- Line 17-18: there is a pun in the sentence... the weak scalability is good and the strong scalability is weak
Changed to "... good weak scaling ..., the strong scaling on GPUs is relatively poor, ..."
- Line 116: "(black)" should be "(blue)"
Thanks for spotting the error, corrected.
- Line 272: "data types" should probably be "data structures"
Yes, changed.
- Listing 6 and Listing 7 have exactly the same caption. I suggest to merge together both listing or to differentiate the captions
Thanks for pointing out the issues with these two listings. Both are needed to illustrate the two possible communication modes (use_g2g), but the UPDATE was missing in Listing 6. Now all is fixed: the listings illustrate the two possibilities and the captions have been changed accordingly.
- Line 702: "ptest" mode is mentioned here for the first time, it would help if, in the same sentence, the authors anticipate that the mode is described in the following section.
Done. This sentence is now followed by: "Details for these methods are given in the following subsections."
- Line 787: "Ss and Ss" should be "Ss and Sw"
Thanks for spotting this error, corrected.
- Line 919: "1 SDPD" should probably be "1 SYPD"
Thanks for spotting the error, corrected.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2022-152', Anonymous Referee #1, 27 May 2022
The paper dicusses porting of the ICON-A model to the GPU. An effective speedup of 6.3x times is obtained in a node-to-node comparison, which is very good in my opinion.
Note: Some of the questions/comments may not be relevant as I am writing notes as I go through the paper and understand it better.
I will probably send a second round of comments after I go through the paper once more.Line 15: Why is the strong scaliing poor?
Linne 55: The literature review on GPU ported weather models is not exhaustive. WRF entire dycore and select physics packages
have been porteed to the GPU gaining speedup of 5x to 7x, NEPTUNE's dycore (NUMA) has been ported to the GPU gaining speedup of
upoto 10x. I am sure MPAS has had OpenACC work too.Line 70: Please provide details (names) of the "Sapphire" physics packages ported to the GPU. Was there a study to choose the physics
packages needed for high-resolutiion grid simulations?Line 75: What is the time integration method used in ICON-A, I assume it is horizontally-explicit vertically-implicit (HEVI) scheme.
Line 245: This is a very good decision that is often missed. If one decides to run for example a particular physics scheme on the CPU,
then all the speedup will go down the drain.Line 275: I am curious if ICON-A has adaptive mesh refiinement (AMR) capability? If so, is that handled directly on the GPU?
Line 295: Deep subroutine calls are indeed a problem but it is still possible to get an efficient kernel. A difficulty with
that approach is if a lot of temporary arrays (e.g. 1D) are used in the inner subroutines, in which case, the compiler will
do GPU memory allocations there. What I found very helpful, is to move those out of the inner loops to the topmost loop.
Other than that, if the nested subroutines are siimple, it should be equally efficient without the needd for major rewrite.Line 320: Kernels vs parallel loop. Kernels is often the safest option when you start porting the code. You can add a `acc loop independent`
to the loops to effectively get a `parallel loop`. There should not be any performannce difference afterr that.Line 385: Use of atomics is good, but the second approach looks simpler. Have you also tried "coloring" schemes where each thread
constructs an independent portion of the list.Line 415: Indirect addressing (lists) are not good for the GPU, and I wonder if the fact that it is used throught the dycore
suggests that it should be re-written for better performance on the GPU, although 6x speedup is already very good?Line 460: GPU aware MPI should be the right choice here, otherwise the cost of data transfer between CPU/GPU will kill performance.
Line 695: I agree b4b test do not make sense for CPU/GPU validation. But a figure for both single and double precision calculations
representing the deviation betweeen CPU/GPU is a good target. Have you also tried PGI's auto-compare feature? It does redundant computations
on both CPU and GPU, and notifies where they are diverging for a given level of tolerance.Line 820: Could this be a reason for the less than optimal strong scaling results?
Line 840: I am confused because I thought the 6.3x speedup is over the end-to-end computation including all of physics and dynamics?
But here you present speedups of indiviidual kernels. I am afraid Amdahl's law plays a significant roler here, as the least performing kernel
often governs overall speedup.Line 845: Could you give details how much percentage of time is spent in the landsurface model specifically, but for others as well?
If the overall speedup is is indeed 6.5x, land-surface should not be taking a significant portion.Figure 5: Radiation is often the most-time consuming and I am not surprized you get the most speedup from it.
Some models often do not compute radiation every time step for this reason. Would you say that the GPU ported code
offers a chance to do it every time step for more accurate results?Could you also provide figures for speedup, weak and strong scaling for end-to-end run without the subcomponents?
The figure is cluttered at the moment and I would like to see one figure highliting the overall results.Line 900: Land-surface model showing very poor strong scaling could be the cause of the model's over poor strong scaling then?
Citation: https://doi.org/10.5194/egusphere-2022-152-RC1 -
AC1: 'Reply on RC1', Marco Giorgetta, 07 Jul 2022
Reply to the reviewer comment 1 (RC1) from 2022-05-27
Dear reviewer
Thank you very much for your review comment on our article. We are happy about your positive judgment of the effective 6x CPU-to-GPU speed-up in the node-to-node comparison.
We have improved the manuscript based on your comments, especially related to your comments about the cluttered figures. These figure have been redrawn. Figure 5 now shows the time-to-solution and the cumulative strong scaling of the integration timer for the R2B7 benchmarks alone, combining the results from the three machines in a single panel to facilitate the comparison between the machines, while Figure 6 now shows time-to-solution and cumulative strong scaling for the components, with separate panels for each machine similar to the old Figure 5. The text of Section "6 Benchmarking results" has been changed accordingly, discussing now first the overall results from Figure 5 before continuing with the results for the components based on Figure 6. The discussion of the strong scaling is followed by a short section on the weak scaling, based on the tabulated results alone. We hope that these changes have removed the cluttering. The stronger separation of the discussion of the integration as a whole from the components is certainly beneficial. Thanks for the recommendation!
A pdf file (generated by latexdiff) of the differences between the old and new manuscript, which includes changes related to your comments, is attached.
For all other comments please find our point-to-point replies below.
Best regards,
Marco Giorgetta
AbstractRC1: Line 15: Why is the strong scaling poor?
Probably you refer to the statement on line 17: "While the application shows good weak scaling making also higher resolved global simulations possible, the strong scaling on GPUs is relatively weak, ...". This judgment is based on findings presented in section 6. We found that the turn-over (or time compression) achieved with the base setups cannot be increased very much in an efficient way by increasing the number of compute nodes, i.e. by exploiting strong scaling, for example towards the goal of 1 simulated year per day. In practice the strong scaling on the GPU machines allowed to double once or twice the number of nodes of the base setups with still acceptable efficiency. In contrast, on the CPU machine even a 6 times doubling was still efficient, but here we start from a considerably lower turn-over rate so that no high turn-over can be achieved. It is this once or twice doubling on GPUs compared to 6 times doubling on CPU that lead to the statement "... the strong scaling on GPUs is relatively weak ...". The data for the strong scaling for each doubling step, S_s, are given in Table 3. And the cumulative strong scaling, S_{s,cum}, over multiple doubling steps is shown in Figure 5 for the whole model ("integrate") as well as for the model components.
1 IntroductionRC1: Line 55: The literature review on GPU ported weather models is not exhaustive. WRF entire dycore and select physics packages have been ported to the GPU gaining speedup of 5x to 7x, NEPTUNE's dycore (NUMA) has been ported to the GPU gaining speedup of up to 10x. I am sure MPAS has had OpenACC work too.
Indeed, the literature review is not exhaustive. Since we have implemented an end-to-end GPU implementation for real scientific runs, we are most interested in other complete implementations. However, we do respect the hard work which has gone into WRF and MPAS, and will add references to Huang, et al. (2015) for WRF, and Kim, et al. (2021) for MPAS (which incidentally reports a speedup of 2.38x).
2 Model configuration for QUBICC experimentsRC1: Line 70: Please provide details (names) of the "Sapphire" physics packages ported to the GPU. Was there a study to choose the physics packages needed for high-resolution grid simulations?
More details for the physics parameterizations comprising the ICON "Sapphire" physics are given further down in section 2.5 and its subsections. The text now refers to these subsections. We did not make a separate study to chose this set of parameterizations for our high resolution experiments. Rather we proceeded in 2 steps. First we excluded parameterizations which are not suitable at high resolution, because they were originally developed for significantly larger cell sizes, and secondly we use parameterizations which qualitatively correspond to the parameterizations used in the operational setups of the ICON model used by the German Weather Service (DWD) for global and regional weather predictions at 13 and 2 km resolution, respectively. The RTE+RRTMGP radiation scheme is a more modern version of the RRTMG code used for many years by DWD (and just recently replaced by the ECrad model). The cloud microphysics code is the same as in operational use. The vertical diffusion scheme is different but relies on similar theory and discretization. Similar arguments can be made for the land scheme.
RC1: Line 75: What is the time integration method used in ICON-A, I assume it is horizontally-explicit vertically-implicit (HEVI) scheme.
Yes, it is horizontally explicit and vertically implicit (HEVI). Details are described in the documentation of the dynamical core by Zaengl et al. (2015).
4 Porting ICON to GPUs
RC1: Line 245: This is a very good decision that is often missed. If one decides to run for example a particular physics scheme on the CPU, then all the speedup will go down the drain.
Indeed, we knew from the beginning that the only possibility for speedup was to port all code within the time loop. Practically it meant that a considerable part of the model code had to be ported, including the more demanding dynamics and tracer transport codes.
RC1: Line 275: I am curious if ICON-A has adaptive mesh refinement (AMR) capability? If so, is that handled directly on the GPU?
ICON-A has a static 'nesting' capability (telescopic grids of increasing resolutions over areas of interest), which is now being ported to GPUs for future applications. The key word here is 'static', which contrasts with AMR, which is implicitly dynamic. Therefore ICON-A does not have AMR capability.
RC1: Line 295: Deep subroutine calls are indeed a problem but it is still possible to get an efficient kernel. A difficulty with that approach is if a lot of temporary arrays (e.g. 1D) are used in the inner subroutines, in which case, the compiler will do GPU memory allocations there. What I found very helpful, is to move those out of the inner loops to the topmost loop. Other than that, if the nested subroutines are simple, it should be equally efficient without the need for major rewrite.
We considered moving the innermost loops upwards in the call tree, but decided against it: this goes in the direction of a full code rewrite, and the spirit of OpenACC is exactly to avoid such extensive refactoring of the code. Interestingly the parallelization over the two innermost dimensions (nproma and nlev) is sufficient for good performance if there are only one, or very few, blocks of nproma. The more critical performance bottleneck is actually the function call overhead of GPU routines designated with ACC ROUTINE SEQ. This call overhead is considerable and is only avoided if the compiler is capable of inlining the code. This is not always the case, and proper inlining is the subject of ongoing optimizations.
RC1: Line 320: Kernels vs parallel loop. Kernels is often the safest option when you start porting the code. You can add a `acc loop independent` to the loops to effectively get a `parallel loop`. There should not be any performance difference after that.
This depends entirely on the degree of support for KERNELS in the OpenACC compiler. For example, we understand GCC has very limited support for KERNELS. Since our work emphasizes the code portability aspect, we only used KERNELS for simple code cases, e.g., initialization of arrays using array syntax.
RC1: Line 385: Use of atomics is good, but the second approach looks simpler. Have you also tried "coloring" schemes where each thread constructs an independent portion of the list.
Atomics performed poorly in the two OpenACC compilers we worked with: Nvidia and CCE. Anecdotally, the performance of atomics is now improving, however we had long since moved to the parallel scan method based on the CUB library. While this is specific to Nvidia GPUs, a library with similar interface is also available for AMD GPUs. The atomic functionality in OpenACC is actually quite simple to use, but we would only consider reactivating that code if there were guarantees of commensurate performance with CUB.
Coloring schemes only work well in a static context, e.g., when dependencies on Cartesian grids are consistently NSEW neighbors, thus cells can be divided into “red” and “black”, with each color being update in two separate steps. Coloring will also work for more complex *static* dependencies. However, in this case the dependencies are determined dynamically through evaluating the fields for particular conditions, e.g., some velocity condition exceeded. Therefore, one would have to create the dependency tree, calculate the coloring, then perform the updates in multiple steps (there could be many dependencies for any given cell). Thus for ICON it is much more efficient to perform the parallel scan, as mentioned above and in the paper. This solution works sufficiently to avoid performance bottlenecks.
RC1: Line 415: Indirect addressing (lists) are not good for the GPU, and I wonder if the fact that it is used through the dycore suggests that it should be re-written for better performance on the GPU, although 6x speedup is already very good?
The performance loss due to indirect addressing is subject to discussion, with the NIM team (Govett et al., 2017) reporting limited overhead if the data are layed out properly. Since the icosahedral grid is quasi-regular, attempts were made by our team to avoid the indirect addressing. However, these ultimately would have lead to the development of a completely new model, and were thus discarded.
RC1: Line 460: GPU aware MPI should be the right choice here, otherwise the cost of data transfer between CPU/GPU will kill performance.
You are correct. While we saw only minor gains with GPU-aware MPI for Piz Daint, which has nodes with a single GPU and single CPU, there are *much* larger gains (a factor 2x) on architectures with multiple GPU nodes and a single CPU, e.g., Juwels Booster.
5 ValidationRC1: Line 695: I agree b4b test do not make sense for CPU/GPU validation. But a figure for both single and double precision calculations representing the deviation between CPU/GPU is a good target. Have you also tried PGI's auto-compare feature? It does redundant computations on both CPU and GPU, and notifies where they are diverging for a given level of tolerance.
We have tried Nvidia's PCAST functionality in certain debugging situations and have also implemented (see Section 5.1) a mode to compare the code running concurrently on different MPI processes, one running sequentially on the CPU and the others running on GPUs, to given relative and/or absolute error thresholds. However, due to the chaotic nature of the underlying problem, this technique is only viable for debugging one or very few time steps. Ultimately the errors grow beyond any round-off bound, similar to what happens for a perturbed system, thus justifying the use of tolerance testing (Section 5.3).
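Schematically, the per-field check behind such a comparison could look as follows (illustrative only; the actual ICON comparison infrastructure is more involved and works across MPI processes):

```fortran
LOGICAL FUNCTION fields_agree(n, f_cpu, f_gpu, atol, rtol)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL,    INTENT(IN) :: f_cpu(n), f_gpu(n)
  REAL,    INTENT(IN) :: atol, rtol   ! absolute and relative thresholds

  ! Field passes if every point agrees within the combined tolerance.
  fields_agree = ALL( ABS(f_gpu - f_cpu) <= atol + rtol * ABS(f_cpu) )
END FUNCTION fields_agree
```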
6 Benchmarking Results
RC1: Line 820: Could this be a reason for the less than optimal strong scaling results?
Strong scaling is not optimal because, with increasing number of GPUs, the local block size simply becomes too small for it to run efficiently. This is less of a problem on CPU. We have tried to emphasize this point in the paper, e.g., "strong scaling on GPUs depends sensitively on the ability to maintain sufficient work for each node as the node count is increased."
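As a rough illustration (assuming the ~5 km setup corresponds to ICON's R2B9 grid with about 21 million cells): on 1024 single-GPU nodes each GPU holds only about 20,000 grid columns, and every further doubling of the node count halves this number, so the parallel work per GPU quickly becomes too small to keep the device busy.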
RC1: Line 840: I am confused because I thought the 6.3x speedup is over the end-to-end computation including all of physics and dynamics?
Yes, for the single node setup we measure a speed-up of 6.4x for the entire time loop. On line 840 we only point out the lower limit above which such a speed-up makes sense at all. If we compare the performance of one node with 1 CPU + 1 GPU to that of one node with 2 CPUs, the GPU over CPU speed-up should be significantly higher than 2 to make the GPU port a reasonable investment.
But here you present speedups of individual kernels. I am afraid Amdahl's law plays a significant role here, as the least performing kernel often governs overall speedup.
Yes, we think it is important to also present speedups of individual components. Amdahl's Law does play a role in the overall scalability, but it should not be assumed that the components in Figure 4 have similar overall execution times. In particular, the time spent in the "land physics" (3x speedup) is insignificant compared to the time spent in dynamics (6.3x) or radiation (7.4x). To make this clear we have redrawn Figure 4. Panel (a) of the new figure shows the relative costs of all model components as a percentage of the time-to-solution of the integration. Obviously radiation and dynamics dominate the costs, while land, cloud microphysics and vertical diffusion have only small shares in the costs. Panel (b) shows the speed-up of the integration and the components. Note that the "atmospheric physics" used in the old version of Figure 4 is no longer shown. Instead we show the results for all individual components, as used already in Figure 5.
RC1: Line 845: Could you give details on what percentage of time is spent in the land surface model specifically, but for others as well? If the overall speedup is indeed 6.5x, land-surface should not be taking a significant portion.
Absolutely correct (see above). The new Figure 4 now shows the relative costs of the components. The text in the manuscript is adjusted accordingly. Concerning the land scheme, which has the poorest scaling, here are the numbers for the relative costs: land on GPU: 4.3%, land on CPU: 2.0%.
Figure 5: Radiation is often the most-time consuming and I am not surprised you get the most speedup from it.
Unfortunately this is not automatically given. For the GPU port of ICON-A a first attempt was made to port the PSrad scheme, at that time the default radiation scheme for ICON-A, with OpenACC directives in a similar way as it worked for other physics parameterizations. This however did not reach any acceptable speedup. This was attributed to the code and data structures, where the additional dimension for the spectral resolution plays an important role. This led to the replacement of the PSrad scheme by the RTE+RRTMGP scheme, whose developers have paid attention to these challenges from the beginning of the code development. RTE+RRTMGP also includes a small number of separate codes for CPU and GPU to account for the respective needs.
But you are certainly right that radiation can reach a high speedup, simply because it can offer a lot of work and parallelism for a GPU, if the code is structured such that this can be exploited.
RC1: Some models often do not compute radiation every time step for this reason. Would you say that the GPU ported code offers a chance to do it every time step for more accurate results?
The setup used here computes radiative fluxes only every 18th time step. But still, radiation is the most costly component when measured on a single node. Changing this to a setup with radiation computed every time step would result in 8 times higher integration costs. For our science case - the simulation of the QBO - we do not expect any advantage from such an increased frequency of the computation of radiative fluxes, and we cannot afford an 8-fold increase in costs. But we agree that the computation of all physical processes with equal frequency would generally be more appropriate.
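As a rough consistency check (a simple cost model, not a measurement): if radiation at its reduced calling frequency accounts for a fraction f of the single-node integration time, computing it in every of the 18 steps scales the total cost by roughly (1 - f) + 18 f; the quoted factor of about 8 corresponds to f of about 0.4, i.e. radiation already making up roughly 40 % of the cost at the reduced frequency.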
RC1: Could you also provide figures for speedup, weak and strong scaling for end-to-end run without the sub-components? The figure is cluttered at the moment and I would like to see one figure highlighting the overall results.
The figures have been redrawn and the text adjusted, see our main reply.
RC1: Line 900: Land-surface model showing very poor strong scaling could be the cause of the model's overall poor strong scaling then?
This is true to some extent, but the main reason, as mentioned above, is that the strong scaling is limited by the work available for the GPU as the local block size decreases.
Also, if - on GPUs - the land scheme scaled "perfectly", the whole model would still show poor strong scaling when the number of nodes is more than quadrupled. This can be seen from Figure 5, which shows the scaling as well as the time to solution. The reason is that the components dynamics, transport, cloud microphysics and vertical diffusion, which together dominate the time-to-solution compared to land, also scale insufficiently to achieve good strong scaling.
AC1: 'Reply on RC1', Marco Giorgetta, 07 Jul 2022
RC2: 'Comment on egusphere-2022-152', Italo Epicoco, 13 Jul 2022
The authors described the activity of porting the ICON-A atmospheric model to a GPU-based parallel architecture using a directive based approach for the parallelisation in order to take full advantage of exascale architectures and improve scientific outcomes resolving physical and climate processes down to the scale of a few kilometers.
First of all, I would like to express my full appreciation for a really well written manuscript and for a wealth of information and details. However, some points can be better clarified / discussed:
- The GPU approach followed by the authors leverages on OpenACC, although OpenMP is mentioned in some parts of the manuscript (lines 499, 833). My questions are:
+ is the initial version of ICON-A parallelized with MPI + OpenMP? If so, it should be explicitly mentioned at the beginning when the description of the model is given.
+ taking into account that OpenMP v5 supports GPU offloading, supporting not only NVIDIA but also Intel GPUs, why did the authors choose to use OpenACC instead of OpenMP? And finally, once the choice fell on OpenACC, why is OpenMP kept inside the code?
The authors should clarify these aspects to better justify their choices.
- in Section 4.4 Line 520 the authors first introduce the concept of reproducibility which will be better discussed later in Sections 5. In Section 4.4 the meaning of the word "reproducibility" is not clear. Are the authors referring to the bit-identity reproducibility or tolerance-reproducibility? How was reproducibility evaluated in the context of physical parametrization (Sec 4.4)?
Moreover, the authors use the "ACC LOOP SEQ" directive to "fix the order of the summands", but it is not clear why this is needed; what is the correct order in which to do a summation? Considering that the round-off error is inherently present in the code, even in the sequential version, why should the summation follow the "LOOP SEQ" order?
- In Section 5 the authors deeply discussed the validation techniques available for ICON. Namely, in Sect 5.3 the tolerance testing is presented, which consists of evaluating an ensemble obtained by perturbing the state variables with a uniform error of the order of magnitude 10^-14. My comment here is that the main source of divergence in the outputs, when implementing a parallel version of a code, is due to the round-off error that can grow after several time steps. In order to evaluate the effect and impact of the round-off error it is probably best to create an ensemble by changing the order in which the grid cells are evaluated, such as by shuffling the arrays with the grid cells.
- In Section 6.5.1 the authors should provide a comment on why the radiation exhibits a super-linear strong scalability on PizDaint and not on Juwels-Booster nor on Levante.
Why do the transport and the vertical diffusion have a super-linear strong scalability on Levante? The vertical diffusion has a really strange and counterintuitive behaviour since its scalability curve increases with the number of nodes.
- In Section 4.3.1 Line 322 the sentence "There are code divergences in the non-hydrostatic solver" is a bit misleading since it is not clear whether it refers to thread divergence or code differences between CPU and GPU.
- Listing 2 reports an example to explain the use of scalars on GPUs instead of arrays, but the transformation of a 2D array into a scalar is not fully clear; namely, in the expression for the scalar (lines 331-333) the index jk-1 is used, while in the expression on lines 336-338 the index jk is used. Moreover, it is also unclear whether the "z_w_concorr_mc_m1" values are used/needed after the do loop; if these values are not used outside the loop, the scalar transformation is probably also useful for the CPU case.
- In the abstract and in the conclusions the authors write that the model exhibits a good weak scalability. But after a careful reading, and according to what is stated in Section 6.5 ("ICON exhibits very good weak-scaling for a 16-fold increase in node count"), a complete weak scalability analysis has not been provided, as the weak scalability has been evaluated only for a 16-fold increase. I suggest that the same qualification is also reported in the abstract and conclusions.
More cosmetic comments, suggestions and typos:
- In the abstract, line 8, it is better to use "kilometres" instead of "km"
- Line 17-18: there is a pun in the sentence... the weak scalability is good and the strong scalability is weak
- Line 116: "(black)" should be "(blue)"
- Line 272: "data types" should probably be "data structures"
- Listing 6 and Listing 7 have exactly the same caption. I suggest to merge together both listing or to differentiate the captions
- Line 702: "ptest" mode is mentioned here for the first time, it would help if, in the same sentence, the authors anticipate that the mode is described in the following section.
- Line 787: "Ss and Ss" should be "Ss and Sw"
- Line 919: "1 SDPD" should probably be "1 SYPD"
Citation: https://doi.org/10.5194/egusphere-2022-152-RC2
AC2: 'Reply on RC2', Marco Giorgetta, 20 Jul 2022
Reply to the reviewer comment 2 (RC2) from 2022-07-13
Dear Italo Epicoco
Thank you very much for your review comment on our article and your appreciation of the manuscript. Please find below our responses to your comments.
A pdf file (generated by latexdiff) of the differences between the current manuscript and the manuscript that you reviewed is attached. Please note that the current manuscript also includes changes made in response to the comments of reviewer 1. This specifically relates to the sections from Section 6 up to and including Subsection 6.4.5, which have been modified to account for the redrawn Figures 4, 5, and 6, which now present the results in a different order.
Best regards,
Marco Giorgetta
RC2: 'Comment on egusphere-2022-152', Italo Epicoco, 13 Jul 2022
The authors described the activity of porting the ICON-A atmospheric model to a GPU-based parallel architecture using a directive based approach for the parallelisation in order to take full advantage of exascale architectures and improve scientific outcomes resolving physical and climate processes down to the scale of a few kilometers.
First of all, I would like to express my full appreciation for a really well written manuscript and for a wealth of information and details.
However some points can be better clarified / discussed
- The GPU approach followed by the authors leverages on OpenACC, although OpenMP is mentioned in some parts of the manuscript (lines 499, 833). My questions are:
+ is the initial version of ICON-A parallelized with MPI + OpenMP? If so, it should be explicitly mentioned at the beginning when the description of the model is given.
Yes, this is the case. The standard applications of ICON-A on the CPU machines at DKRZ (or elsewhere) use a mixed MPI + OpenMP parallelization.
+ taking into account that OpenMP v5 supports GPU offloading, supporting not only NVIDIA but also Intel GPUs, why did the authors choose to use OpenACC instead of OpenMP? And finally, once the choice fell on OpenACC, why is OpenMP kept inside the code?
At the time when we started the GPU port on the Piz Daint computer at CSCS, the only directive based option that existed, and was powerful enough for the ICON code, was the OpenACC implementation for the PGI/Nvidia compiler. There was no OpenMP5 implementation available.
In the meantime, however, additional work has started on a GPU port based on OpenMP5, targeting ICON applications on the LUMI supercomputer. This work is still ongoing.
OpenMP is kept in the code because the ICON code is still in use on CPU machines. Thus we currently have MPI x OpenMP for CPUs and MPI x OpenACC for GPU machines with OpenACC/PGI support. (Ongoing work currently investigates if the OpenACC port of ICON can be modified such that it also works on AMD/Cray systems.)
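As an illustration of this coexistence (hypothetical names, not actual ICON code): OpenMP threads over the block loop in CPU builds, while OpenACC parallelizes within each block in GPU builds; whichever directive family is not enabled by the compiler is treated as plain comments.

```fortran
SUBROUTINE add_tendency(nblks, nproma, nlev, dt, tend, field)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: nblks, nproma, nlev
  REAL,    INTENT(IN)    :: dt, tend(nproma,nlev,nblks)
  REAL,    INTENT(INOUT) :: field(nproma,nlev,nblks)
  INTEGER :: jb, jk, jc

  ! CPU builds: OpenMP threads over the block loop.
!$OMP PARALLEL DO PRIVATE(jk,jc)
  DO jb = 1, nblks
    ! GPU builds: OpenACC parallelizes within each block.
    !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
    DO jk = 1, nlev
      DO jc = 1, nproma
        field(jc,jk,jb) = field(jc,jk,jb) + dt * tend(jc,jk,jb)
      END DO
    END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE add_tendency
```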
The authors should clarify these aspects to better justify their choices.
The introduction is now extended on line 62 as follows:
... In the CPU case applications shall continue to use the proven parallelization by MPI domain decomposition mixed with OpenMP multi-threading, while in the GPU case parallelization should now combine the MPI domain decomposition with OpenACC directives for the parallelization on the GPU. OpenACC was chosen because this was the only practical option on the GPU compute systems used in the presented work and described below. Consequently the resulting ICON code presented here includes now OpenMP and OpenACC directives.
- in Section 4.4 Line 520 the authors first introduce the concept of reproducibility which will be better discussed later in Sections 5. In Section 4.4 the meaning of the word "reproducibility" is not clear. Are the authors referring to the bit-identity reproducibility or tolerance-reproducibility? How was reproducibility evaluated in the context of physical parametrization (Sec 4.4)?
Yes, "reproducibility" means bit-wise reproducibility. The text is changed on lin 533 to clarify this:
... This bit-wise reproducibility is important in the model development process because it facilitates the detection of unexpected changes of model results, as further discussed in Sect. 5.
Moreover, the authors use the "ACC LOOP SEQ" directive to "fix the order of the summands", but it is not clear why this is needed; what is the correct order in which to do a summation? Considering that the round-off error is inherently present in the code, even in the sequential version, why should the summation follow the "LOOP SEQ" order?
The critical point is that we want to ensure that the sequence is maintained so that round-off effects remain unchanged if the computation is repeated, with the goal to obtain a bit-wise reproducible code.
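A minimal sketch of this pattern (illustrative names, data assumed present on the GPU): the horizontal loop is parallelized, while the vertical accumulation is kept sequential with LOOP SEQ so that the order of the summands, and hence the round-off, is identical in every run.

```fortran
SUBROUTINE column_mass(nproma, nlev, dz, rho, q, col_sum)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: nproma, nlev
  REAL,    INTENT(IN)  :: dz(nproma,nlev), rho(nproma,nlev), q(nproma,nlev)
  REAL,    INTENT(OUT) :: col_sum(nproma)
  INTEGER :: jc, jk

  ! Parallelism over columns; the vertical sum stays sequential so that
  ! the summation order, and hence the round-off, never changes.
  !$ACC PARALLEL LOOP GANG VECTOR DEFAULT(PRESENT)
  DO jc = 1, nproma
    col_sum(jc) = 0.0
    !$ACC LOOP SEQ
    DO jk = 1, nlev
      col_sum(jc) = col_sum(jc) + dz(jc,jk) * rho(jc,jk) * q(jc,jk)
    END DO
  END DO
END SUBROUTINE column_mass
```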
- In Section 5 the authors deeply discussed the validation techniques available for ICON. Namely, in Sect 5.3 the tolerance testing is presented, which consists of evaluating an ensemble obtained by perturbing the state variables with a uniform error of the order of magnitude 10^-14. My comment here is that the main source of divergence in the outputs, when implementing a parallel version of a code, is due to the round-off error that can grow after several time steps. In order to evaluate the effect and impact of the round-off error it is probably best to create an ensemble by changing the order in which the grid cells are evaluated, such as by shuffling the arrays with the grid cells.
Certainly there exists more than one method for creating an ensemble of simulations. But a variation of the order of the evaluation of the grid columns alone would not change any result of the ICON-A integration, in the absence of bugs in the parallelization or blocking. This bit-wise reproducibility with respect to parallelization (number of MPI processes, OMP threads) and blocking is regularly tested. Therefore the simplest method was to add numerical noise to the state variables.
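For illustration, one ensemble member could be generated like this (a sketch with hypothetical names; whether the noise is applied as a relative or absolute perturbation is an assumption made here, not a statement about the actual ICON implementation):

```fortran
SUBROUTINE perturb_state(n, field)
  ! Relative noise of order 1e-14 added to a state variable to generate
  ! one ensemble member (illustrative only).
  IMPLICIT NONE
  INTEGER, PARAMETER :: wp = KIND(1.0d0)
  INTEGER,  INTENT(IN)    :: n
  REAL(wp), INTENT(INOUT) :: field(n)
  REAL(wp) :: r(n)

  CALL RANDOM_NUMBER(r)   ! r uniform in [0,1)
  field = field * (1.0_wp + 1.0e-14_wp * (2.0_wp*r - 1.0_wp))
END SUBROUTINE perturb_state
```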
- In Section 6.5.1 the authors should provide a comment on why the radiation exhibits a super-linear strong scalability on PizDaint and not on Juwels-Booster nor on Levante.
This is explained in section "6.5.1 Strong scaling of components". The reason is that the sub-blocking length for radiation, controlled by the rcc parameter, can be used on Piz Daint and Juwels-Booster to maintain or increase the work load per radiation call, within some limits, while the work load per call for all other processes simply decreases by a factor of 2 per node doubling, as does the number of columns per compute domain. Differences between Piz Daint and Juwels-Booster exist in the number of doubling steps where a factor of 2 reduction in rcc can be avoided. This limit is reached when rcc has grown to the number of points per domain. For Piz Daint this limit is reached only at the highest parallelization (1024 nodes), while on Juwels-Booster this limit is reached already on 32 nodes. This provides the simplest explanation for the nearly constant strong scaling of radiation, even slightly above 1, on Piz Daint, and the transition from a constant strong scaling just below 1 up to 32 nodes to a decaying strong scaling on Juwels-Booster. But we do not know why the effect of maintaining or increasing the radiation work load even leads to super linear strong scaling on Piz Daint, while we find only a strong scaling of just below 1 for Juwels Booster. Maybe it is related to differences in the overhead costs for setting up the parallel loops on the GPU.
On Levante we use nproma = rcc = 32 for all experiments so that the work load per radiation call is also fixed. Here the strong scaling of radiation depends on other factors, as is also the case for all other processes.
To clarify this, the section "6.2 Optimization parameters", with respect to the sub-blocking for the radiation, and the section "6.4.3 Strong scaling of components" have been modified. Note that these sections have already been changed in response to referee 1.
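Schematically, the sub-blocking works as follows (hypothetical names, not the actual ICON radiation interface): the local columns are processed in chunks of rcc columns per radiation call, so the work per call can be held roughly constant, within limits, as the per-node domain shrinks.

```fortran
SUBROUTINE radiation_driver(ncols_local, rcc)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: ncols_local, rcc
  INTEGER :: jcs, jce

  ! Hand the local columns to the radiation in chunks of rcc columns.
  DO jcs = 1, ncols_local, rcc
    jce = MIN(jcs + rcc - 1, ncols_local)
    CALL radiation_subblock(jcs, jce)
  END DO

CONTAINS
  SUBROUTINE radiation_subblock(jcs, jce)
    INTEGER, INTENT(IN) :: jcs, jce
    ! ... compute radiative fluxes for columns jcs..jce (placeholder) ...
  END SUBROUTINE radiation_subblock
END SUBROUTINE radiation_driver
```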
Why do the transport and the vertical diffusion have a super-linear strong scalability on Levante? The vertical diffusion has a really strange and counterintuitive behaviour since its scalability curve increases with the number of nodes.
This is indeed peculiar. From our log files and timing data we cannot derive an explanation. Our speculation is that this results from cache effects. We did not investigate this behavior further: first, these effects do not distort the overall scaling behavior shown in Figure 5, where the main difference occurs between the GPUs on the one hand and the CPU on the other hand; secondly, diagnosing cache efficiency is non-trivial and can become a study of its own. Thus we only point out these behaviors in the manuscript on lines 915 to 955.
- In Section 4.3.1 Line 322 the sentence "There are code divergences in the non-hydrostatic solver" is a bit misleading since it is not clear whether it refers to thread divergence or code differences between CPU and GPU.
Code differences between CPU and GPU are meant. The text in the manuscript is changed to make this clear (line 331).
- Listing 2 reports an example to explain the use of scalars on GPUs instead of arrays, but the transformation of a 2D array into a scalar is not fully clear; namely, in the expression for the scalar (lines 331-333) the index jk-1 is used, while in the expression on lines 336-338 the index jk is used. Moreover, it is also unclear whether the "z_w_concorr_mc_m1" values are used/needed after the do loop; if these values are not used outside the loop, the scalar transformation is probably also useful for the CPU case.
Indeed, Listing 2 was simplified to a point where it no longer sufficiently illustrated the intended point of using registers to replace arrays. We have now added the full loop, in which both the scalars z_w_concorr_mc_m0 and z_w_concorr_mc_m1 are calculated and consumed, and we have added Fortran comments explaining the code which they replaced. This should provide a full explanation of this optimization.
- In the abstract and in the conclusions the authors write that the model exhibits a good weak scalability. But after a careful reading, and according to what is stated in Section 6.5 ("ICON exhibits very good weak-scaling for a 16-fold increase in node count"), a complete weak scalability analysis has not been provided, as the weak scalability has been evaluated only for a 16-fold increase. I suggest that the same qualification is also reported in the abstract and conclusions.
The abstract and conclusions now include: "... over the tested 16-fold increase in grid size and node count ..." so that the statements are now more precise.
More cosmetic comments, suggestions and typos:
- In the abstract, line 8, it is better to use "kilometres" instead of "km"
Done
- Line 17-18: there is a pun in the sentence... the weak scalability is good and the strong scalability is weak
Changed to "... good weak scaling ..., the strong scaling on GPUs is relatively poor, ..."
- Line 116: "(black)" should be "(blue)"
Thanks for spotting the error, corrected.
- Line 272: "data types" should probably be "data structures"
Yes, changed.
- Listing 6 and Listing 7 have exactly the same caption. I suggest to merge together both listing or to differentiate the captions
Thanks for pointing out the issues with these two listings. Both are needed to illustrate the two possible communication modes (use_g2g), but the UPDATE was missing in Listing 6. Now all is fixed: they illustrate the two possibilities, and the captions have been changed accordingly.
- Line 702: "ptest" mode is mentioned here for the first time, it would help if, in the same sentence, the authors anticipate that the mode is described in the following section.
Done. This sentence is now followed by: "Details for these methods are given in the following subsections."
- Line 787: "Ss and Ss" should be "Ss and Sw"
Thanks for spotting this error, corrected.
- Line 919: "1 SDPD" should probably be "1 SYPD"
Thanks for spotting the error, corrected.
Model code and software
The ICON-A model for direct QBO simulations on GPUs Marco A. Giorgetta https://doi.org/10.17617/3.5CYUFN
Marco A. Giorgetta
William Sawyer
Xavier Lapillonne
Panagiotis Adamidis
Dmitry Alexeev
Valentin Clément
Remo Dietlicher
Jan Frederik Engels
Monika Esch
Henning Franke
Claudia Frauen
Walter M. Hannah
Benjamin R. Hillman
Luis Kornblueh
Philippe Marti
Matthew R. Norman
Robert Pincus
Sebastian Rast
Daniel Reinert
Reiner Schnur
Uwe Schulzweida
Bjorn Stevens