Toward Exascale Climate Modelling: A Python DSL Approach to ICON’s (Icosahedral Non-hydrostatic) Dynamical Core (icon-exclaim v0.2.0)
Abstract. A refactored atmospheric dynamical core of the ICON model, implemented in GT4Py, a Python-based domain-specific language designed for performance portability across heterogeneous CPU-GPU architectures, is presented. Integrated within the existing Fortran infrastructure, the GT4Py core achieves throughput slightly exceeding that of the optimized OpenACC version, reaching up to 213 simulation days per day when using a quarter of CSCS's ALPS GPUs.
A multi-tiered testing strategy has been implemented to ensure numerical correctness and scientific reliability of the model code. Validation has been performed through global aquaplanet and prescribed sea-surface-temperature simulations to demonstrate the model's capability to simulate the mesoscale and its interaction with the larger scale at km-scale grid spacing. This work establishes a foundation for an architecture-agnostic ICON global climate and weather model, and highlights poor strong scaling as a potential bottleneck in scaling toward exascale performance.
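As a purely illustrative aside (not code from icon-exclaim), the following minimal NumPy sketch shows the kind of neighbor-based, gather-and-reduce computation on an unstructured icosahedral grid that a GT4Py field operator expresses once and then compiles for either CPUs or GPUs; the grid sizes, the connectivity table, and the field names below are hypothetical.

```python
import numpy as np

# Toy unstructured grid: 8 cells, 12 edges; all values below are made up.
n_cells, n_edges = 8, 12
rng = np.random.default_rng(0)

# Hypothetical cell-to-edge connectivity table (each cell touches three edges).
cell_to_edge = rng.integers(0, n_edges, size=(n_cells, 3))

# A field defined on edges, standing in for e.g. a normal velocity component.
vn_on_edges = rng.random(n_edges)

def edges_to_cell_mean(edge_field: np.ndarray, c2e: np.ndarray) -> np.ndarray:
    """Average an edge-based field onto cells via the connectivity table."""
    return edge_field[c2e].mean(axis=1)

cell_mean = edges_to_cell_mean(vn_on_edges, cell_to_edge)
print(cell_mean.shape)  # -> (8,)
```

In GT4Py itself this pattern would be written declaratively against named dimensions and connectivities, and the chosen backend, rather than the model code, decides how the iteration is mapped onto the target hardware.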
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4808', Anonymous Referee #1, 11 Nov 2025
- AC1: 'Reply on RC1', Anurag Dipankar, 10 Dec 2025
RC2: 'Comment on egusphere-2025-4808', Anonymous Referee #2, 17 Nov 2025
This is a clear, well-written paper describing a gt4py implementation of the ICON dynamical core, running in the existing ICON Fortran modeling system and enabling km-scale atmospheric simulations on the ALPS GPU supercomputer. The authors describe their porting approach, including thorough testing from the kernel level up to full-physics simulations. They provide a sober analysis of the potential of GPUs and their strong-scaling limitations. I only have minor comments:
1. Section 4.3: what is "the implementation of horizontal blocking"? Does that refer to the loop blocking in the Fortran loops (which was removed in the Python code)?
2. Section 4.3: "...testing is tricky as the results are different due to rounding..." The authors have a good port testing strategy in the presence of roundoff error, but this statement implies that these rounding differences are unavoidable. The E3SM dycore porting work (Bertagna et al. GMD 2019 and Bertagna et al. SC2020) showed that it is possible to obtain BFB agreement between CPUs and GPUs with careful coding, allowing for a different porting approach which simplifies some aspects of code porting (see the illustrative sketch after this comment list).
3. Section 5.1: For the final model, I assume all significant code is running on the GPUs, with the dycore using gt4py and the physics using OpenACC. I believe this is implied, but I didn't see it clearly stated. Were there any software challenges running the two different GPU programming models in the same executable?
4. Line 400: "GT4Py synchronization". I know of two types of synchronization: across MPI nodes, as well as synchronization among thread teams running on the GPU. Which is this referring to?
5. Section 5.1: How does the gt4py code compare with the Fortran code on CPUs? It would be interesting to add CPU-only performance numbers to Figure 7.
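To make the tolerance question in comment 2 concrete, here is a minimal, hypothetical comparison helper of the kind a port-verification suite might use; the tolerance values, field names, and synthetic data are assumptions for illustration, not the thresholds used in the manuscript.

```python
import numpy as np

def fields_match(reference: np.ndarray, candidate: np.ndarray,
                 rtol: float = 1e-12, atol: float = 0.0,
                 require_bitwise: bool = False) -> bool:
    """Compare a ported field against a reference field.

    With require_bitwise=True this enforces bit-for-bit (BFB) agreement, as in
    the E3SM dycore ports cited above; otherwise it tolerates compiler- and
    hardware-dependent rounding differences within a relative tolerance.
    """
    if require_bitwise:
        return np.array_equal(reference, candidate)
    return np.allclose(reference, candidate, rtol=rtol, atol=atol)

# Illustrative usage with synthetic data standing in for CPU and GPU results.
ref = np.linspace(0.0, 1.0, 1000)
gpu = ref * (1.0 + 1e-15)                                 # mimic a last-bit rounding difference
assert fields_match(ref, gpu)                             # passes within tolerance
assert not fields_match(ref, gpu, require_bitwise=True)   # fails a bitwise check
```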
Citation: https://doi.org/10.5194/egusphere-2025-4808-RC2
- AC2: 'Reply on RC2', Anurag Dipankar, 10 Dec 2025
RC3: 'Comment on egusphere-2025-4808', Anonymous Referee #3, 01 Dec 2025
The paper is about a re-implementation of the ICON dynamical core using GT4Py, a domain-specific language embedded in Python. The work is carried out in the EXCLAIM project, for which the paper presents the outcomes of the first phase. It describes the porting approach taken when rewriting the dynamical core in GT4Py, the testing strategy followed during the work, an evaluation of the computational performance, and the scientific validation of the new code.
I found the paper was written in an accessible way, with a clear and sensible structure that covers all relevant angles of this development. The dynamical core rewritten in GT4Py is a remarkable milestone, and the approach, which utilised a very thorough testing procedure, was well designed to avoid mistakes as much as possible. I would recommend a few minor edits to improve the overall presentation, which I list below with reference to the relevant sections of the text:
The abstract presents a specific throughput number without specifying the configuration or resolution it refers to. I would either add more details or leave it at the statement that the GT4Py core exceeds ICON OpenACC performance, without giving a specific number.
The overview of current performance numbers in the paragraph in ll. 58ff is a wild mixture of very different configurations and resolutions. The intention is likely to take stock of how close current ESMs get to the 1 SYPD target, but this gets lost in the presentation. I would suggest making this a little more focused, ideally using a more like-for-like comparison. Moreover, most numbers are presented without references (NICAM, IFS-FESOM, ICON@1.25km). Some should stem from the GB submissions (https://dl.acm.org/doi/10.1145/3712285.3771789 and https://dl.acm.org/doi/10.1145/3712285.3771790), but it is irritating to see them published in this preprint before the availability of the original papers, particularly when no reference is given.
In l. 97ff the three-phase nature of EXCLAIM is mentioned, but no further information about the planned content of phases 2 and 3 is provided. Does this correspond to the deliverables shown in Figure? In the same paragraph, it is stated that the rewrite is "driven by the existing Fortran driver", which I did not understand until much later. Maybe this could be described in a form that makes it clearer that it is embedded into the existing Fortran framework, replacing calls to the dynamical core routines.
Figure 1 is a useful illustration of the GT4Py code generation pipeline. I suspect not every reader may be familiar with the acronyms "GTIR" and "GTFN" used therein, which could be spelled out in the caption. GTIR is clarified later in the text, but GTFN remains unclear.
In l. 154f, three execution modes for running GT4Py are mentioned. Which of these are used here? Given that this is embedded into Fortran, I suspect this requires AOT?
I did not immediately recognize the term "Fortran+" in l. 169 as the introduction of nomenclature. Maybe putting this in quotes would be helpful?
The description of the refactoring work in Sec. 4 is well written with an appropriate level of detail. The formatting of Listing 2 is unfortunate, with a page break between the listing and the caption - this should be rectified before final publication.
I agree on the readability angle in l. 278 but I did not understand the reason why only Python should allow in-line documentation through docstrings. I would argue that this could be done in any language, including Fortran.
The resolution of Fig. 4 seems a little low; it shows some artifacts in my print-out.
The hierarchy of testing levels appears well thought-out and seems effective to cover testing from a fine-grained stencil-loop level to full system regression. How much of this is automatic and when is it run? How expensive are these tests (in core-h or similar)?
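As a purely illustrative sketch of how such a hierarchy is often automated (this is not the project's actual test suite), pytest markers can separate cheap stencil-level checks, run on every commit, from expensive full-model regressions run on a schedule; the marker names, tolerance, and toy stencil below are assumptions.

```python
import numpy as np
import pytest

def edges_to_cell_mean(edge_field: np.ndarray, c2e: np.ndarray) -> np.ndarray:
    """Toy stand-in for a ported stencil (gather edge values, average onto cells)."""
    return edge_field[c2e].mean(axis=1)

@pytest.mark.unit  # hypothetical marker: cheap, seconds-scale, run on every commit
def test_stencil_against_reference():
    rng = np.random.default_rng(1)
    c2e = rng.integers(0, 12, size=(8, 3))
    edge_field = rng.random(12)
    # In a real suite this would be serialized reference data from the Fortran model.
    reference = edge_field[c2e].mean(axis=1)
    np.testing.assert_allclose(edges_to_cell_mean(edge_field, c2e), reference, rtol=1e-12)

@pytest.mark.regression  # hypothetical marker: expensive end-to-end run, scheduled nightly
def test_full_model_regression():
    pytest.skip("placeholder for a full-model regression run on the GPU partition")
```

Markers like these would typically be registered in the project's pytest configuration so that CI can select, for example, `-m unit` per commit and `-m regression` on a schedule.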
The presented performance numbers are promising. However, in Fig. 7 either the plot colours or caption are wrong. The caption claims "GT4Py (dashed yellow) is about 10% faster than the Fortran+ (solid yellow)", while the plot suggests this to be the other way round. For the blue colours, dashed/solid seem to be reversed, so I suspect this may simply be a mistake in the caption.
Given the substantial performance speed-up claims from the speed-of-light implementation: is there a specific pattern/generic improvement that accounts for this improvement? Or is it a mixture of several different changes?
Since I'm not an expert on the scientific evaluation presented in Sec. 6, I cannot give substantial feedback on this part.
Citation: https://doi.org/10.5194/egusphere-2025-4808-RC3
- AC3: 'Reply on RC3', Anurag Dipankar, 10 Dec 2025
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 304 | 113 | 22 | 439 | 13 | 11 |
This manuscript presents stage one of a multi-tiered plan to support heterogeneous (mixed CPU/GPU) architectures for running the ICON model. The authors utilize GT4Py, a domain-specific language, to modernize the ICON dynamics core from the existing Fortran code base. The outcome is a more performant code, which is also easier to read and develop compared to the equivalent Fortran implementation. The paper is well written and well reasoned, demonstrating promising results that are on par with the current state of GPU-ready Earth System modeling. I recommend that this manuscript be published, as I have only a few minor questions and technical corrections to suggest.
First, I want to commend the authors for their attention to (a) the hardware-based challenges that arise when running these models at scale, and (b) the importance of robust testing. In my experience, these topics are not typically the most exciting to discuss, but they are essential considerations for any group undertaking a similar effort.
Minor Comments:
Introduction
Section 2
Section 3
Section 4