This work is distributed under the Creative Commons Attribution 4.0 License.
Earth system modeling on Modular Supercomputing Architectures: coupled atmosphere-ocean simulations with ICON 2.6.6-rc
Abhiraj Bishnoi
Catrin I. Meyer
René Redler
Norbert Eicker
Helmuth Haak
Lars Hoffmann
Daniel Klocke
Luis Kornblueh
Estela Suarez
Abstract. The confrontation of complex Earth system model (ESM) codes with novel supercomputing architectures poses challenges to efficient modelling and job submission strategies. The modular setup of these models naturally fits a modular supercomputing architecture (MSA), which tightly integrates heterogeneous hardware resources into a larger and more flexible high performance computing (HPC) system. While parts of the ESM codes can easily take advantage of the increased parallelism and communication capabilities of modern Graphics Processing Units (GPUs), others lag behind due to long development cycles or are better suited to run on classical CPUs due to their communication and memory usage patterns. To better cope with these imbalances in the development of the model components, we performed benchmark campaigns on the Jülich Wizard for European Leadership Science (JUWELS) modular HPC system. We enabled the weather and climate model ICOsahedral Nonhydrostatic (ICON) to run in a coupled atmosphere-ocean setup, in which the ocean and the model I/O run on the CPU Cluster while the atmosphere is simulated simultaneously on the GPUs of the JUWELS Booster (ICON-MSA). Both atmosphere and ocean run globally at a resolution of 5 km. In our test case, an optimal configuration in terms of model performance (core hours per simulation day) was found for the combination of 84 GPU nodes on the JUWELS Booster module and 80 CPU nodes on the JUWELS Cluster module, of which 63 nodes were used for the ocean simulation and the remaining 17 nodes were reserved for I/O. With this configuration, the waiting times of the coupler were minimized. Compared to a simulation performed on CPUs only, the MSA approach reduces energy consumption by 59 % with comparable runtimes. ICON-MSA is able to scale up to a significant portion of the JUWELS system, making best use of the available computing resources. A maximum throughput of 170 simulation days per day (SDPD) was achieved when running ICON on 335 JUWELS Booster nodes and 268 Cluster nodes.
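For readers who want to relate the quoted throughput and node counts, the following minimal sketch converts them into node-hours per simulated day. This is an illustrative proxy only, using just the numbers quoted in the abstract; the paper's own metric is core hours per simulation day and may be accounted differently.

```python
# Minimal sketch (assumption: node-hours as a proxy; the paper reports core hours).
def node_hours_per_sim_day(booster_nodes: int, cluster_nodes: int, sdpd: float) -> float:
    """Wall-clock node-hours spent per simulated day at a given throughput (SDPD)."""
    wall_hours_per_sim_day = 24.0 / sdpd  # hours of wall clock needed per simulated day
    return (booster_nodes + cluster_nodes) * wall_hours_per_sim_day

# Largest configuration quoted in the abstract: 335 Booster + 268 Cluster nodes at 170 SDPD.
print(round(node_hours_per_sim_day(335, 268, 170.0)))  # ~85 node-hours per simulated day
```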
Status: open (until 12 Oct 2023)
CC1: 'Comment on egusphere-2023-1476', Marco Giorgetta, 22 Aug 2023
Thanks for posting the interesting results on the heterogeneous model runs, using CPU and GPU nodes! My question is about the turnover in simulated days per day (SDPD) that is reported in Table 3. For instance on line 3:
(1) 168 Booster nodes for the atmosphere + 126 Cluster nodes for the ocean yielding 130 SDPD. This can be compared to the turnover for an ICON atmosphere-only model run at the same horizontal resolution, also on Booster nodes, see The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514) (https://gmd.copernicus.org/articles/15/6985/2022/). Here we have among others the following case presented in Table 3:
(2) 128 Booster nodes for the atmosphere yielding 133 SDPD. While the turnover numbers are similar, there are also differences:
- (2) uses 191 levels compared to 90 in (1)
- (2) uses 128/168 or ca. 75% of the Booster nodes used by (1)
Altogether, it seems to me that the coupled model should be able to achieve a significantly higher turnover if the turnover of the atmosphere model is the limiting factor. What could be the reason for achieving only 130 SDPD with 168 Booster nodes for the atmosphere?
The only obvious difference that could explain why the atmosphere in the coupled setup is less performant is the nproma value, which seems smaller than necessary. (1) uses nproma = 32981, while (2) uses nproma = 42690, despite the fact that (a) (2) needs more memory for the atmosphere, due to a larger number of levels, and (b) (2) has only about 75% of the nodes, and thus computer memory, that is available to (1).
This means that probably more turnover can be achieved. Linear scaling of (2) from 128 to 168 nodes would yield 174 SDPD, though this still accounts for the 191 levels used in (2). Thus an even larger turnover should be possible for (1). Probably this also means that the energy consumption per simulated day could be reduced further.
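For illustration, a minimal sketch of the linear-scaling estimate above; all inputs are the numbers quoted in this comment, while the optional per-level rescaling is an extra assumption added here for illustration and not part of the original estimate.

```python
# Minimal sketch of the linear-scaling estimate; the per-level rescaling is an added assumption.
from typing import Optional

def scaled_sdpd(sdpd_ref: float, nodes_ref: int, nodes_new: int,
                levels_ref: Optional[int] = None,
                levels_new: Optional[int] = None) -> float:
    """Scale throughput linearly with node count; optionally rescale for vertical
    levels, assuming cost is roughly proportional to the number of levels."""
    sdpd = sdpd_ref * nodes_new / nodes_ref
    if levels_ref is not None and levels_new is not None:
        sdpd *= levels_ref / levels_new
    return sdpd

# (2): 133 SDPD on 128 Booster nodes with 191 levels, scaled to the 168 nodes of (1):
print(round(scaled_sdpd(133, 128, 168)))           # ~174 SDPD, still for 191 levels
print(round(scaled_sdpd(133, 128, 168, 191, 90)))  # rough upper bound at the 90 levels of (1)
```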
I hope this motivates some thoughts about the possible causes of the small turnover found here for (1), compared to what seems possible based on (2).
Citation: https://doi.org/10.5194/egusphere-2023-1476-CC1
AC1: 'Reply on CC1', Olaf Stein, 04 Sep 2023
Thanks for your comment and question on the model performance of our modular GPU-CPU setup, which addresses the key issue of model effectiveness on heterogeneous supercomputing architectures. If the reviewers agree, we will incorporate some of the points discussed here into a final version of the manuscript.
We are well aware of the Giorgetta et al. (2022) study presenting results from ICON atmosphere-only simulations on JUWELS Booster, which used the same architecture and a similar model resolution and setup to ours for ICON-A. Indeed, we see less than 50% of their throughput in terms of SDPD when comparing to linearly scaled – in terms of GPU nodes and vertical levels – results from their work. This is somewhat lower than we expected, but our simulations have to account for the overhead needed for data exchange between GPU and CPU in the course of the atmosphere-ocean coupling. Moreover, we could not use OpenMPI in the modular setup, although it is estimated to be about 15% faster than ParaStationMPI on JUWELS, based on ICON atmosphere-only simulations. This is because the Intel compiler, which is our only compiler option for ICON-O, does not work properly together with OpenMPI.
For our benchmark simulations we adjusted nproma as described in Giorgetta et al. (2022): nproma was chosen to be as large as possible, such that all cell grid points of a computational domain, including first- and second-level halo points, fit into a single block (this value was derived from the model output log file). This yields nproma values that decrease with the number of JUWELS Booster nodes used. For 119 nodes, nproma was set to 46156, which compares well to the value of 42690 used in Giorgetta et al. (2022) with 128 nodes. We also modified the sub-chunking parameter for radiation, rcc, which was set to 4000 in all our simulations, but found no significant changes in model runtime.
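As a minimal illustration of the blocking logic described above (not actual ICON code; the function name and structure are ours), the cells of a process-local domain, including halo cells, are grouped into blocks of length nproma, so choosing nproma at least as large as the local cell count puts everything into a single block.

```python
# Illustrative sketch of nproma blocking; names and structure are assumptions, not ICON code.
import math

def number_of_blocks(local_cells_incl_halos: int, nproma: int) -> int:
    """Number of nproma-sized blocks needed for all local cells (the last block may be padded)."""
    return math.ceil(local_cells_incl_halos / nproma)

# Using the value quoted above for 119 Booster nodes (taken from the model log file):
local_cells = 46156
print(number_of_blocks(local_cells, 46156))  # 1 block: nproma as large as the local domain
print(number_of_blocks(local_cells, 32981))  # 2 blocks with the smaller nproma discussed in CC1
```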
Alternatively, our results can be compared to those of Hohenegger et al. (2023), who present coupled ICON simulations in a model setup (G_AO_5km) that is equivalent to ours. In contrast to our study, they used CPU nodes only on the DKRZ HPC systems Mistral and Levante. One can compare the numbers from Hohenegger et al. to our CPU-only run on JUWELS Cluster, where we get 76 SDPD on 860 nodes (of which 780 nodes are used by ICON-A). Our throughput is about 35-40% of their results on similar node numbers on Levante CPUs, but those CPUs are more performant than the JUWELS Cluster CPUs by a factor of 2-3, and additionally we have to consider the exchange overhead between CPU and GPU on different devices.
References:
Giorgetta, M. A., et al.: The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514), Geosci. Model Dev., 2022, 1–46, https://doi.org/10.5194/egusphere-2022-152, 2022.
Hohenegger, C., et al.: ICON-Sapphire: simulating the components of the Earth System and their interactions at kilometer and subkilometer scales, Geosci. Model Dev., 16, 779–811, https://doi.org/10.5194/gmd-2022-171, 2023.
Citation: https://doi.org/10.5194/egusphere-2023-1476-AC1
RC1: 'Comment on egusphere-2023-1476', Anonymous Referee #1, 27 Sep 2023
The study of Bishnoi et al. shows on the one hand that it is possible to run ICON on a heterogeneous architecture, with the atmosphere part running entirely on GPUs and the ocean part and the input and output on dedicated CPU nodes, and on the other hand that, in terms of energy consumption, ICON on the heterogeneous architecture performs 59 % better than on a pure CPU-based architecture. The study is definitely very relevant for the atmospheric model community, since the ICON model is actively used by a large number of institutes. Because of the ever-increasing share of boosters in new supercomputers, it is necessary that at least large parts of ICON can run on boosters (e.g. GPUs). This is especially important with respect to the already existing and future exascale computers. The study therefore demonstrates this feasibility. The fact that the performance is additionally improved, due to a lower energy consumption, can be considered a success and is well worth presenting in a publication. Furthermore, the study also shows that the current architecture of the JUWELS system is definitely useful. Therefore, I can definitely recommend publishing the presented study and think that it is absolutely suitable for GMD.
I have only one general remark:
Since ICON is a scientific test case, I would find it useful to present at least a few scientific results in a short subsection, showing whether they are identical or almost identical regardless of whether ICON runs on a homogeneous (cluster) architecture or on a heterogeneous architecture (cluster/booster). Perhaps a monthly average of the temperature or the zonal wind could be presented here.
And some minor remarks:
Line 13/14: “… was found for the combination 84 GPU nodes on the JUWELS Booster module and 80 CPU nodes on the JUWELS Cluster module …” --> “… was found for the combination 84 GPU nodes on the JUWELS Booster module to simulate the atmosphere and 80 CPU nodes on the JUWELS Cluster module …”
Line 42: “by a factor of 1 million” --> “by a factor of more than 1 million”
Line 57/58: I would also introduce the acronyms DKRZ, C2SM, and KIT here.
Line 61/62: “the performance of the ocean component on CPUs is still satisfactory”. Does it make sense to say it that way? Shouldn't one rather write "it is not yet possible to simulate the ocean on GPUs"? Later you write that there is a project for it.
Line 78: “Jülich Wizard for European Leadership Science (JUWELS) Jülich Supercomputing Centre (2019)” --> “Jülich Wizard for European Leadership Science (JUWELS, Jülich Supercomputing Centre, 2019)”
Line 83-89: I would suggest: “In Sect. 2 we provide a comprehensive description of the ICON model and its specific setup. Sect. 3 presents a brief overview of the MSA, starting with an introduction to the concept (Sect. 3.1), the presentation of the modular hardware and software architecture of the JUWELS system at JSC (Sect. 3.2), and the strategy for porting the ICON model to the MSA, with a detailed explanation of the rationale behind each decision we made (Sect. 3.3). In Sect. 4 results from our analyses for finding a sweet spot configuration for ICON, the comparison to a non-modular setup, and strong scaling tests are provided. In Sect. 5 specific challenges and considerations associated with porting such complex codes as ICON to the MSA are discussed, and in Sect. 6 the summary and conclusions of this study are provided.”
Line 95: “of this paper”. I would rather write (also in all other cases further down in the text) “of this publication” or “of this study”.
Line 96: I would also mention R2B09 here.
Line 96: “Thus, the grids” -> "Thus, the horizontal grids”
Line 117: The start date of the ICON simulation is 20 January 2020. What is the end date?
Line 186: How many CPUs and how many cores has one cluster node?
Line 197: “Ozone” --> “ozone”
Line 200: “ICON ESM” --> “ICON-ESM”
Line 217: I would delete “naturally”
Line 226: “… homogeneous hardware platform.” --> “… homogeneous hardware platform, using only CPUs.”
Line 219: “nodes” --> “CPU nodes”
Line 225-240: I would not speak of a modular approach or case; this is in my opinion confusing, because ICON also has modules. I would suggest changing “modular” to “heterogeneous”, and “non-modular” and “standard” to “homogeneous”. That would also be relevant in the rest of the text (Sect. 4 and 5).
Line 233: “... for each component.” --> ”for each model component (atmosphere, ocean).”
Line 267: “4 GPUs and 4 cores per node, with 1 GPU per core.” Is this correct? I thought there are 2 CPUs per node, i.e. I think 48 cores, or?
Line 268: “(48)” --> “(48 cores/node)”
Line 269: “(85/80)” --> “(85 Booster nodes/80 Cluster nodes)”
Figure 3: Please change “JUWELS Cluster/ Booster nodes” to “JUWELS Booster / Cluster nodes” in the caption
Line 286: “MSA case and …” --> ”MSA case (63 nodes) and …”
Line 289: “both” --> “all”
Line 294: “Runtimes for ICON-A are longer than for ICON-O and determine the overall runtime” --> “With 85 nodes for ICON-A and 63 nodes for ICON-O, runtimes for ICON-A are much longer than for ICON-O and determine the overall runtime”
Line 295: I would delete “For the non-modular setup”
Line 295: “ICON-O” --> “ICON-A”
Line 296: “ICON-O” --> “ICON-A”
Line 301: How many SDPDs are simulated with 780/63 nodes? Maybe you can still integrate this value in Table 2.
Table 2: I would find it useful to include not only the final result (780/63) in the table, but also the results of (84/63) and the steps in between.
Line 334: What is the reason for this (to use only 1 core/GPU)?
Line 317: I would insert “(see Fig. 4)” at the end of the sentence.
Line 332: “Figure 5” --> “Fig. 5”
Line 332: Why don't you increase the number of I/O nodes during your scaling test?
Fig. 4 (left): The orange line shows a decrease in the speedup from 237 nodes to 355 nodes, but in Table 2 there is still a decrease in the Int. time for the atmospheric component, so there should be at least a small increase in the speedup.
Line 379: “In particular, for our test case we found that …” --> "In particular, for our test case, a coupled ICON simulation, we found that …”
Line 384: “… ICON-A is running …” --> “… ICON-A is already running …”
Line 401: Please delete “Hallo Olaf, wir testen jetzt.…”
Citation: https://doi.org/10.5194/egusphere-2023-1476-RC1
Viewed
- HTML: 310
- PDF: 125
- XML: 11
- Total: 446
- BibTeX: 8
- EndNote: 3