This work is distributed under the Creative Commons Attribution 4.0 License.
Earth system modeling on Modular Supercomputing Architectures: coupled atmosphere-ocean simulations with ICON 2.6.6-rc
Abhiraj Bishnoi
Catrin I. Meyer
René Redler
Norbert Eicker
Helmuth Haak
Lars Hoffmann
Daniel Klocke
Luis Kornblueh
Estela Suarez
Abstract. The confrontation of complex Earth system model (ESM) codes with novel supercomputing architectures poses challenges to efficient modelling and job submission strategies. The modular setup of these models naturally fits a modular supercomputing architecture (MSA), which tightly integrates heterogeneous hardware resources into a larger and more flexible high performance computing (HPC) system. While parts of the ESM codes can easily take advantage of the increased parallelism and communication capabilities of modern Graphics Processing Units (GPUs), others lag behind due to long development cycles or are better suited to run on classical CPUs due to their communication and memory usage patterns. To better cope with these imbalances in the development of the model components, we performed benchmark campaigns on the Jülich Wizard for European Leadership Science (JUWELS) modular HPC system. We enabled the weather and climate model ICOsahedral Nonhydrostatic (ICON) to run in a coupled atmosphere-ocean setup, in which the ocean and the model I/O run on the CPU Cluster while the atmosphere is simulated simultaneously on the GPUs of the JUWELS Booster (ICON-MSA). Both atmosphere and ocean run globally at a resolution of 5 km. In our test case, an optimal configuration in terms of model performance (core hours per simulation day) was found for the combination of 84 GPU nodes on the JUWELS Booster module and 80 CPU nodes on the JUWELS Cluster module, of which 63 nodes were used for the ocean simulation and the remaining 17 nodes were reserved for I/O. With this configuration, the waiting times of the coupler were minimized. Compared to a simulation performed on CPUs only, the MSA approach reduces energy consumption by 59 % with comparable runtimes. ICON-MSA is able to scale up to a significant portion of the JUWELS system, making best use of the available computing resources. A maximum throughput of 170 simulation days per day (SDPD) was achieved when running ICON on 335 JUWELS Booster nodes and 268 Cluster nodes.
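For readers who want to relate the quoted throughput and node counts, the following minimal sketch converts them into node-hours per simulated day. This is an illustrative proxy only, using just the numbers quoted in the abstract; the paper's own metric is core hours per simulation day and may be accounted differently.

```python
# Minimal sketch (assumption: node-hours as a proxy; the paper reports core hours).
def node_hours_per_sim_day(booster_nodes: int, cluster_nodes: int, sdpd: float) -> float:
    """Wall-clock node-hours spent per simulated day at a given throughput (SDPD)."""
    wall_hours_per_sim_day = 24.0 / sdpd  # hours of wall clock needed per simulated day
    return (booster_nodes + cluster_nodes) * wall_hours_per_sim_day

# Largest configuration quoted in the abstract: 335 Booster + 268 Cluster nodes at 170 SDPD.
print(round(node_hours_per_sim_day(335, 268, 170.0)))  # ~85 node-hours per simulated day
```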
Status: open (until 12 Oct 2023)
CC1: 'Comment on egusphere-2023-1476', Marco Giorgetta, 22 Aug 2023
Thanks for posting the interesting results on the heterogeneous model runs, using CPU and GPU nodes! My question is about the turnover in simulated days per day (SDPD) that is reported in Table 3. For instance on line 3:
(1) 168 Booster nodes for the atmosphere + 126 Cluster nodes for the ocean yielding 130 SDPD. This can be compared to the turnover for an ICON atmosphere-only model run at the same horizontal resolution, also on Booster nodes, see The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514) (https://gmd.copernicus.org/articles/15/6985/2022/). Here we have among others the following case presented in Table 3:
(2) 128 Booster nodes for the atmosphere yielding 133 SDPD. While the turnover numbers are similar, there are also differences:
- (2) uses 191 levels compared to 90 in (1)
- (2) uses 128/168 or ca. 75% of the Booster nodes used by (1)
Altogether, it seems to me that the coupled model should be able to achieve a significantly higher turnover if the turnover of the atmosphere model is the limiting factor. What could be the reason for achieving only 130 SDPD with 168 Booster nodes for the atmosphere?
The only obvious difference that could explain why the atmosphere in the coupled setup is less performant is the nproma value, which seems smaller than necessary. (1) uses nproma = 32981, while (2) uses nproma = 42690, despite the fact that (a) (2) needs more memory for the atmosphere, due to a larger number of levels, and (b) (2) has only about 75% of the nodes, and thus computer memory, that is available to (1).
This means that probably more turnover can be achieved. Linear scaling of (2) from 128 to 168 nodes would yield 174 SDPD, though this still accounts for the 191 levels used in (2). Thus an even larger turnover should be possible for (1). Probably this also means that the energy consumption per simulated day could be reduced further.
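For illustration, a minimal sketch of the linear-scaling estimate above; all inputs are the numbers quoted in this comment, while the optional per-level rescaling is an extra assumption added here for illustration and not part of the original estimate.

```python
# Minimal sketch of the linear-scaling estimate; the per-level rescaling is an added assumption.
from typing import Optional

def scaled_sdpd(sdpd_ref: float, nodes_ref: int, nodes_new: int,
                levels_ref: Optional[int] = None,
                levels_new: Optional[int] = None) -> float:
    """Scale throughput linearly with node count; optionally rescale for vertical
    levels, assuming cost is roughly proportional to the number of levels."""
    sdpd = sdpd_ref * nodes_new / nodes_ref
    if levels_ref is not None and levels_new is not None:
        sdpd *= levels_ref / levels_new
    return sdpd

# (2): 133 SDPD on 128 Booster nodes with 191 levels, scaled to the 168 nodes of (1):
print(round(scaled_sdpd(133, 128, 168)))           # ~174 SDPD, still for 191 levels
print(round(scaled_sdpd(133, 128, 168, 191, 90)))  # rough upper bound at the 90 levels of (1)
```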
I hope this motivates some thoughts about the possible causes of the small turnover found here for (1), compared to what seems possible based on (2).
Citation: https://doi.org/10.5194/egusphere-2023-1476-CC1
AC1: 'Reply on CC1', Olaf Stein, 04 Sep 2023
Thanks for your comment and question on the model performance of our modular GPU-CPU setup, which addresses the key issue of model effectiveness on heterogeneous supercomputing architectures. If the reviewers agree, we will incorporate some of the points discussed here into a final version of the manuscript.
We are well aware of the Giorgetta et al. (2022) study presenting results from ICON atmosphere-only simulations on JUWELS Booster, which used the same architecture and a similar model resolution and setup to ours for ICON-A. Indeed, we see less than 50% of their throughput in terms of SDPD when comparing to linearly scaled – in terms of GPU nodes and vertical levels – results from their work. This is somewhat lower than we expected, but our simulations have to account for the overhead needed for data exchange between GPU and CPU in the course of the atmosphere-ocean coupling. Moreover, we could not use OpenMPI in the modular setup, although it is estimated to be about 15% faster than ParaStationMPI on JUWELS, based on ICON atmosphere-only simulations. This is because the Intel compiler, which is our only compiler option for ICON-O, does not work properly together with OpenMPI.
For our benchmark simulations we adjusted nproma as described in Giorgetta et al. (2022): nproma was chosen to be as large as possible, such that all cell grid points of a computational domain, including first- and second-level halo points, fit into a single block (this value was derived from the model output log file). This yields nproma values that decrease with the number of JUWELS Booster nodes used. For 119 nodes, nproma was set to 46156, which compares well to the value of 42690 used in Giorgetta et al. (2022) with 128 nodes. We also modified the sub-chunking parameter for radiation, rcc, which was set to 4000 in all our simulations, but found no significant changes in model runtime.
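As a minimal illustration of the blocking logic described above (not actual ICON code; the function name and structure are ours), the cells of a process-local domain, including halo cells, are grouped into blocks of length nproma, so choosing nproma at least as large as the local cell count puts everything into a single block.

```python
# Illustrative sketch of nproma blocking; names and structure are assumptions, not ICON code.
import math

def number_of_blocks(local_cells_incl_halos: int, nproma: int) -> int:
    """Number of nproma-sized blocks needed for all local cells (the last block may be padded)."""
    return math.ceil(local_cells_incl_halos / nproma)

# Using the value quoted above for 119 Booster nodes (taken from the model log file):
local_cells = 46156
print(number_of_blocks(local_cells, 46156))  # 1 block: nproma as large as the local domain
print(number_of_blocks(local_cells, 32981))  # 2 blocks with the smaller nproma discussed in CC1
```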
Alternatively, our results can be compared to those of Hohenegger et al. (2023), who present coupled ICON simulations in a model setup (G_AO_5km) that is equivalent to ours. In contrast to our study, they used CPU nodes only on the DKRZ HPC systems Mistral and Levante. One can compare the numbers from Hohenegger et al. to our CPU-only run on JUWELS Cluster, where we get 76 SDPD on 860 nodes (of which 780 nodes are used by ICON-A). Our throughput is about 35-40% of their results on similar node numbers on Levante CPUs, but those CPUs are more performant than the JUWELS Cluster CPUs by a factor of 2-3, and additionally we have to consider the exchange overhead between CPU and GPU on different devices.
References:
Giorgetta, M. A., et al.: The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514), Geosci. Model Dev., 2022, 1–46, https://doi.org/10.5194/egusphere-2022-152, 2022.
Hohenegger, C., et al.: ICON-Sapphire: simulating the components of the Earth System and their interactions at kilometer and subkilometer scales, Geosci. Model Dev., 16, 779–811, https://doi.org/10.5194/gmd-2022-171, 2023.
Citation: https://doi.org/10.5194/egusphere-2023-1476-AC1
RC1: 'Comment on egusphere-2023-1476', Anonymous Referee #1, 27 Sep 2023
The study of Bishnoi et al. shows on the one hand that it is possible to run ICON on a heterogeneous architecture, with the atmosphere part running entirely on GPUs and the ocean part and the input and output on dedicated CPU nodes, and on the other hand that, in terms of energy consumption, ICON on the heterogeneous architecture performs 59 % better than on a pure CPU-based architecture. The study is definitely very relevant for the atmospheric model community, since the ICON model is actively used by a large number of institutes. Because of the ever-increasing share of boosters in new supercomputers, it is necessary that at least large parts of ICON can run on boosters (e.g. GPUs). This is especially important with respect to the already existing and future exascale computers. The study therefore demonstrates this feasibility. The fact that the performance is additionally improved, due to a lower energy consumption, can be considered a success and is well worth presenting in a publication. Furthermore, the study also shows that the current architecture of the JUWELS system is definitely useful. Therefore, I can definitely recommend publishing the presented study and think that it is absolutely suitable for GMD.
I have only one general remark:
Since ICON is a scientific test case, I would find it useful to present at least a few scientific results in a short subsection, showing whether they are identical or almost identical regardless of whether ICON runs on a homogeneous (cluster) architecture or on a heterogeneous architecture (cluster/booster). Perhaps a monthly average of the temperature or the zonal wind could be presented here.
And some minor remarks:
Line 13/14: “… was found for the combination 84 GPU nodes on the JUWELS Booster module and 80 CPU nodes on the JUWELS Cluster module …” --> “… was found for the combination 84 GPU nodes on the JUWELS Booster module to simulate the atmosphere and 80 CPU nodes on the JUWELS Cluster module …”
Line 42: “by a factor of 1 million” --> “by a factor of more than 1 million”
Line 57/58: I would also introduce the acronyms DKRZ, C2SM, and KIT here.
Line 61/62: “the performance of the ocean component on CPUs is still satisfactory”. Does it make sense to say it that way? Shouldn't one rather write "it is not yet possible to simulate the ocean on GPUs"? Later you write that there is a project for it.
Line 78: “Jülich Wizard for European Leadership Science (JUWELS) Jülich Supercomputing Centre (2019)” --> “Jülich Wizard for European Leadership Science (JUWELS, Jülich Supercomputing Centre, 2019)”
Line 83-89: I would suggest: “In Sect. 2 we provide a comprehensive description of the ICON model and its specific setup. Sect. 3 presents a brief overview of the MSA, starting with an introduction to the concept (Sect. 3.1), the presentation of the modular hardware and software architecture of the JUWELS system at JSC (Sect. 3.2), and the strategy for porting the ICON model to the MSA, with a detailed explanation of the rationale behind each decision we made (Sect. 3.3). In Sect. 4 results from our analyses for finding a sweet spot configuration for ICON, the comparison to a non-modular setup, and strong scaling tests are provided. In Sect. 5 specific challenges and considerations associated with porting such complex codes as ICON to the MSA are discussed, and in Sect. 6 the summary and conclusions of this study are provided.”
Line 95: “of this paper”. I would rather write (also in all other cases further down in the text) “of this publication” or “of this study”.
Line 96: I would also mention R2B09 here.
Line 96: “Thus, the grids” -> "Thus, the horizontal grids”
Line 117: The start date of the ICON simulation is 20 January 2020. What is the end date?
Line 186: How many CPUs and how many cores has one cluster node?
Line 197: “Ozone” --> “ozone”
Line 200: “ICON ESM” --> “ICON-ESM”
Line 217: I would delete “naturally”
Line 226: “… homogeneous hardware platform.” --> “… homogeneous hardware platform, using only CPUs.”
Line 219: “nodes” --> “CPU nodes”
Line 225-240: I would not speak of a modular approach or case; this is in my opinion confusing, because ICON also has modules. I would suggest changing “modular” to “heterogeneous”, and “non-modular” and “standard” to “homogeneous”. That would also be relevant in the rest of the text (Sect. 4 and 5).
Line 233: “... for each component.” --> ”for each model component (atmosphere, ocean).”
Line 267: “4 GPUs and 4 cores per node, with 1 GPU per core.” Is this correct? I thought there are 2 CPUs per node, i.e. I think 48 cores, or?
Line 268: “(48)” --> “(48 cores/node)”
Line 269: “(85/80)” --> “(85 Booster nodes/80 Cluster nodes)”
Figure 3: Please change “JUWELS Cluster/ Booster nodes” to “JUWELS Booster / Cluster nodes” in the caption
Line 286: “MSA case and …” --> ”MSA case (63 nodes) and …”
Line 289: “both” --> “all”
Line 294: “Runtimes for ICON-A are longer than for ICON-O and determine the overall runtime” --> “With 85 nodes for ICON-A and 63 nodes for ICON-O, runtimes for ICON-A are much longer than for ICON-O and determine the overall runtime”
Line 295: I would delete “For the non-modular setup”
Line 295: “ICON-O” --> “ICON-A”
Line 296: “ICON-O” --> “ICON-A”
Line 301: How many SDPDs are simulated with 780/63 nodes? Maybe you can still integrate this value in Table 2.
Table 2: I would find it useful to include not only the final result (780/63) in the table, but also the results of (84/63) and the steps in between.
Line 334: What is the reason for this (to use only 1 core/GPU)?
Line 317: I would insert “(see Fig. 4)” at the end of the sentence.
Line 332: “Figure 5” --> “Fig. 5”
Line 332: Why don't you increase the number of I/O nodes during your scaling test?
Fig. 4 (left): The orange line shows a decrease in the speedup from 237 nodes to 355 nodes, but in Table 2 there is still a decrease in the Int. time for the atmospheric component, so there should be at least a small increase in the speedup.
Line 379: “In particular, for our test case we found that …” --> "In particular, for our test case, a coupled ICON simulation, we found that …”
Line 384: “… ICON-A is running …” --> “… ICON-A is already running …”
Line 401: Please delete “Hallo Olaf, wir testen jetzt.…”
Citation: https://doi.org/10.5194/egusphere-2023-1476-RC1
Viewed
- HTML: 310
- PDF: 125
- XML: 11
- Total: 446
- BibTeX: 8
- EndNote: 3