Operational numerical weather prediction with ICON on GPUs (version 2024.10)
Abstract. Numerical weather prediction and climate models require continuous adaptation to take advantage of advances in high-performance computing hardware. This paper presents the port of the ICON model to GPUs using OpenACC compiler directives for numerical weather prediction applications. In the context of an end-to-end operational forecast application, we adopted a full-port strategy: the entire workflow, from physical parameterizations to data assimilation, was analyzed and ported to GPUs as needed. Performance tuning and mixed-precision optimization yield a 5.6x speed-up compared to the CPU baseline in a socket-to-socket comparison. The ported ICON model meets strict requirements for time-to-solution and meteorological quality, enabling MeteoSwiss to become the first national weather service to run ICON operationally on GPUs with its ICON-CH1-EPS and ICON-CH2-EPS ensemble forecasting systems. We discuss key performance strategies, operational challenges, and the broader implications of transitioning community models to GPU-based platforms.
Competing interests: Author Dmitry Alexeev is employed by NVIDIA. The other authors have no competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 20 Oct 2025)
- RC1: 'Comment on egusphere-2025-3585', Anonymous Referee #1, 25 Sep 2025
- RC2: 'Comment on egusphere-2025-3585', Anonymous Referee #2, 03 Oct 2025
This paper outlines the porting of the ICON NWP model to GPU execution and the performance gains arising from it in the two regional configurations of the model at MeteoSwiss. The paper is well structured and easy to read, with a description of the porting strategies and outcomes, and a sufficient level of depth for the technical implementation of the OpenACC directives. It is very encouraging to see the performance gains from what I imagine to be a significant effort, especially when it includes training the ICON community developers in the concepts of GPU parallelism and OpenACC implementation!
With the overall structure and presentation in place, I think the paper would benefit from more specific details in some parts to support the arguments made. Please see my comments below. I look forward to finding out more details about this valuable effort.
Section 1, lines 20-24: Having a GPU based model was mentioned as an advantage in the context of emerging Machine Learning for weather forecasting. The rest of the paper focuses on the computational performance gains from GPU porting, so I am wondering where the mention of ML application fits. Are there plans to incorporate ML algorithms and applications into ICON in the future, and how would this affect the porting strategy presented in the paper? The conclusion briefly mentions it, but it is not very specific.
Section 2, lines 59-61: I find this sentence too unspecific to verify: "This approach, combined with advanced data structures and efficient communication protocols, ensures scalability and optimal performance on massively parallel super-computing architectures." What is the context and baseline for the claim that the data structures are advanced? Is it in comparison to what was applied in the same programming language but in earlier incarnations of the model, or in a similar model? Are they advanced in comparison to other languages? As for the efficient communication protocols, some specifics on what is used and why would also aid understanding.
Section 3.1, line 68: “directive-based approach”, and “GPU-specific language”), I believe (compound adjectives).
Section 3.1, line 69: “its” instead of “it’s”, I believe (possessive pronoun).
Section 3.1 and later Section 4.3 (choice of porting approach and challenges): It is stated that “…approach was decided over a re-write in a GPU specific language like CUDA or a DSL, mainly because of it’s (sic!) broader acceptance by the ICON community”. Does this mean that the directives were added by hand by the ICON developers (domain scientists and RSEs, mentioned in Section 4.3)? I appreciate that re-writing an NWP model in something like CUDA is heavy-handed, however I am wondering what the main obstacle was against adopting a DSL approach, especially as it can facilitate porting by reducing the need for manual intervention. The section 4.3 mentions training domain scientists “in the basics of the GPU port, OpenACC, and tools for GPU verification”, and that the ICON developer community “comprises many scientists without a formal computer science background”. OpenACC and GPU port is not exactly the easiest thing to teach to such audience, and I am not sure that it would be more difficult to teach a DSL approach to them. I am not saying that the approach here was right or wrong, I am just wondering what motivated it (e.g. existing DSL tools not being mature enough to generate GPU parallelisations, or perhaps being difficult to teach, or there is uncertainty in maintenance / funding). I appreciate that when it comes to DSLs there is a decision whether to invest in the tool itself and who maintains it.
Section 3.1, lines 85-86: This can read as if the conclusion that porting only isolated kernels to GPU yields little benefit applies to atmospheric models in general, and I think this may not be entirely correct, as the performance profiles of different models, and therefore the optimisation strategies, can vary. I see that the paper from Adamidis et al. used ICON as the case study, so I assume the decision to go for the full-port strategy here, instead of porting individual kernels, was heavily based on the performance profiling of that case? If so, it would be worth mentioning that.
Sections 3.1 and 3.2: If possible, I would advocate for placing Listing 1 to Section 3.2, as this is where the terms in Listing 1 are explained fully. The reference to Listing 1 in Section 3.1 could be adjusted as e.g. “see Section 3.2, Listing 1”.
tolerance validation with probtest.
Section 3.2, lines 109-110: I am curious how the additional logical argument, lacc, is introduced into low-level shared routines. Would it be possible to add a small code listing to illustrate this?
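For instance, something along the lines of the following sketch would already answer my question (the routine and variable names here are made up by me for illustration, not taken from ICON): a shared low-level routine receives an optional logical flag and enables its OpenACC region only when the caller runs on the GPU.

    SUBROUTINE shared_helper(n, a, b, lacc)
      INTEGER, INTENT(IN)           :: n
      REAL,    INTENT(INOUT)        :: a(n)
      REAL,    INTENT(IN)           :: b(n)
      LOGICAL, OPTIONAL, INTENT(IN) :: lacc
      LOGICAL :: lzacc
      INTEGER :: i

      ! Default to CPU execution if the caller does not pass the flag.
      lzacc = .FALSE.
      IF (PRESENT(lacc)) lzacc = lacc

      ! The IF clause offloads the loop to the GPU only when lzacc is true;
      ! otherwise it runs on the host.
      !$ACC PARALLEL LOOP GANG VECTOR DEFAULT(PRESENT) IF(lzacc)
      DO i = 1, n
        a(i) = a(i) + b(i)
      END DO
    END SUBROUTINE shared_helper

A short listing of this kind, with the actual ICON conventions, would make the mechanism much easier to follow.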
Section 3.3: The introduction gives an overview of different optimization strategies for ICON, outlining when it is possible to apply them. I think some references to what strategies were utilised mostly in what parts of the model in the following sections (e.g. dynamics, transport, physics) would be very useful. For instance, I would imagine that quite a few loops over horizontal sub-domains would have the same bounds (up to redundant computation for the difference).
Section 3.6.2, line 195: What is the “reduced grid” in the ecRad scheme?
Section 3.6.2, Figure 2: It is not quite clear to me what the figure illustrates here. Are these two ICON arrays mapping into the same ecRad data structure? Or is it the same ICON array but mapping of its different nproma portions?
Section 3.8, Figure 3 and lines 254-259: What are the red circles in Figure 3? It would also be useful to reference elements of the figure (red circles, blue and green squares) in the text referring to the figure, as it is not quite clear what happens when.
Section 3.8, lines 263-265: It would be useful to have some more specifics on the cost of data transfers between GPU and CPU, as this cost seems to have been deemed acceptable in comparison to porting the Data Assimilation to GPU.
Section 3.9: Again, I would be curious about the mechanism and the cost of data transfers between GPU and CPU when outputting diagnostics at designated intervals. Would it be possible to provide some estimates and how they affect model times? Also, is the point of completing all the diagnostics calculations on GPU prior to sending the data to CPU a kind of “synchronisation” point for the model (or parts of it)?
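To illustrate the kind of detail I am after (this is only my own sketch of one common pattern, not a statement about how ICON implements its output): the device-to-host copy can be staged asynchronously, with an explicit wait acting as the synchronisation point just before the CPU writes the data.

    SUBROUTINE stage_diagnostics_for_output(nproma, nlev, diag)
      ! Illustrative only: diag is assumed to be device-resident,
      ! e.g. created in an enclosing ACC DATA region.
      INTEGER, INTENT(IN)    :: nproma, nlev
      REAL,    INTENT(INOUT) :: diag(nproma, nlev)

      ! Launch the device-to-host transfer on an asynchronous queue so that
      ! independent GPU work can overlap with the copy.
      !$ACC UPDATE HOST(diag) ASYNC(1)

      ! The wait is the actual synchronisation point: only after it may the
      ! CPU-side output routines read diag safely.
      !$ACC WAIT(1)
    END SUBROUTINE stage_diagnostics_for_output

Clarifying whether ICON follows such a pattern, and what it costs, would answer my question.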
Section 4.1, lines 289-290: “A short and computationally inexpensive configuration is used to run the perturbed ensemble and determine the expected spread of each variable”. I assume the same configuration is calculated on CPU and GPU, as later indicated?
Section 4.1, lines 293-295: “set of configurations which are supported on GPU using a reduced domain size. Every change to the model must pass this validation step to be accepted.” Is the change to the model tested on the same reduced domain size? How is the reduced domain size chosen to be sure it is an adequate representative of behaviour on operational model sizes?
Section 5: The two operational NWP ICON configurations presented here both refer to regional domains, as well as the results presented afterwards. Does this mean that the optimisation strategies presented in Section 3 were chosen for the regional configurations of the model? Or were they more general and applied for global and regional configurations? It would be good to clarify that.
Section 6.1: Is a single ensemble member run on two GPU nodes?
Sections 6.2.1-6.2.9: Are the performance improvements for the total model run (including parts computed on CPU plus data transfers)?
Section 6.3: Are the scaling results presented here for the total model run (including CPU-computed parts and data transfer)?
Citation: https://doi.org/10.5194/egusphere-2025-3585-RC2
RC1: 'Comment on egusphere-2025-3585', Anonymous Referee #1, 25 Sep 2025
The authors present the adaptation of the ICON weather forecasting model to GPU execution and its use in operations at MeteoSwiss. The paper describes the ICON model, presents in depth the technical implementation of the GPU port using OpenACC directives, and covers all components that are in use at MeteoSwiss and had to be adapted. It includes a presentation of the validation method and of the use in operations at MeteoSwiss, describes the optimisation work to improve performance, and shows benchmark results. The completion and operational use of ICON on GPUs is a great achievement and I congratulate the authors on completing this substantial milestone. The paper is a good documentation of the outcome of this endeavour and the text is well-structured, understandable, and comprehensive. However, in several parts the presentation leaves out details or appears incomplete, and the work might thus be improved by amending these. I list below observations and recommendations, in the order of the paper, for consideration by the authors:
The introduction lists several other efforts to adapt existing weather and climate models to GPUs but leaves out DSL-based efforts that aim at providing efficient CPU and GPU support through several code generation backends. This includes, e.g., PSyclone/LFric (UKMO/STFC) or GT4py-based models, such as ICON-EXCLAIM (MeteoSwiss/ETHZ), PMAP (ECMWF/ETHZ), PACE (AI2). Although this serves more purposes than GPU porting, the performance portability to heterogeneous hardware is a guiding principle for such approaches.
In terms of effort, the fact that weather and climate models often comprise millions of lines of code has been mentioned prominently, but no indication is given of which order of magnitude ICON falls into, or what share of ICON had to be adapted additionally for the use at MeteoSwiss.
The description of ICON's parallelization is very handwavy and statements such as "advanced data structures and efficient communication protocols ensure scalability" would be more believable with some accompanying words what makes these advanced and efficient or references for this.
Listing 1 is presumably intended as a first glimpse at what the OpenACC code looks like, but it references some concepts that are (partially) explained only much later, such as nproma, async, or the OpenACC parallelization loop annotations in general. Since the text explains other OpenACC concepts later, it would be useful to readers without knowledge of the directives to briefly describe what these comprise, how they are mapped to the GPU hardware in terms of the parallelization/execution model, and data movement/availability.
In ll. 85-86, the authors claim that low arithmetic intensity is responsible for the fact that porting isolated kernels to GPU would offer little benefit. I don't agree with this causal link for two reasons: (1) I consider arithmetic intensity a useful metric that allows to estimate the potential for performance benefits from GPU execution of a code, but because GPUs do not only excel in executing large amounts of floating point instructions but also provide a much higher memory bandwidth than most CPU architectures, also low AI codes can benefit from GPU execution. (2) In my opinion, the question whether porting of individual kernels can be sufficient for performance improvements is mostly related to the performance profile of the full application. A flat performance profile with multiple kernels that contribute similar runtime shares makes it difficult to create measurable performance gains, while an application with a dominant kernel may benefit from just porting that one kernel if it has the potential to gain from GPU execution.
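To make the first point concrete, the standard roofline model bounds the attainable performance P of a kernel with arithmetic intensity I (flop/byte) as

    P(I) = \min\bigl( P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}} \bigr)

so for bandwidth-bound kernels (small I) the GPU-over-CPU speed-up approaches the ratio of the memory bandwidths B_GPU / B_CPU rather than the ratio of peak flop rates, which is why low-AI codes can still benefit considerably from GPU execution.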
In the explanation of the loop collapse (ll. 116ff.), I would think it would be useful to relate the impact of the loop collapse to the SIMT execution concept of GPUs. In combination with the fast switching of lightweight threads, this makes it intuitively clear why more parallelism is so essential on GPUs to keep execution units occupied - a fact that is often overlooked by developers that are more familiar with SIMD execution models.
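A brief example (my own sketch with illustrative names, not the paper's code) may help to anchor that explanation: collapsing the vertical and horizontal loops exposes nlev*nproma independent iterations that the SIMT hardware can distribute over many lightweight threads, whereas parallelising only the outer loop would leave most execution units idle.

    SUBROUTINE add_tendency(nproma, nlev, dt, tend, field)
      INTEGER, INTENT(IN)    :: nproma, nlev
      REAL,    INTENT(IN)    :: dt
      REAL,    INTENT(IN)    :: tend(nproma, nlev)
      REAL,    INTENT(INOUT) :: field(nproma, nlev)
      INTEGER :: jc, jk

      ! COLLAPSE(2) merges the vertical (jk) and horizontal (jc) loops into
      ! one large iteration space that is spread over many GPU threads.
      !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
      DO jk = 1, nlev
        DO jc = 1, nproma
          field(jc, jk) = field(jc, jk) + dt * tend(jc, jk)
        END DO
      END DO
    END SUBROUTINE add_tendency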
In l. 133: "generally consistent across most NVIDIA architectures": I think "across most NVIDIA architecture generations" is what is meant here?
In the discussion of the radiation scheme, it is not clear to me what is meant by the "reduced grid": Does ICON perform radiation computations on a lower resolution grid, or is this a reference to the reduced number of gridpoints within an nproma block? The description of the nproma_sub blocks is also somewhat repeated between the two paragraphs in ll. 193 to 206. I would advise some editorial changes to streamline the presentation in this section.
ll. 231-232 claims that WORKER VECTOR performed better than standard VECTOR loops but gives no evidence why that might be the case. Does this result in a difference in the launch configuration/block size that the NVIDIA runtime chooses for this particular loop? Or is there a different explanation?
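For concreteness, the two variants I have in mind look roughly as follows (my own sketch with made-up names, not the paper's code); the additional WORKER level typically changes the thread-block shape the runtime chooses, which is one possible explanation that could be confirmed or ruled out.

    SUBROUTINE worker_vector_example(nproma, nlev, a, b)
      INTEGER, INTENT(IN)    :: nproma, nlev
      REAL,    INTENT(INOUT) :: a(nproma, nlev)
      REAL,    INTENT(IN)    :: b(nproma, nlev)
      INTEGER :: jc, jk

      ! Variant A: gang parallelism over levels, plain vector loop over cells.
      !$ACC PARALLEL LOOP GANG DEFAULT(PRESENT)
      DO jk = 1, nlev
        !$ACC LOOP VECTOR
        DO jc = 1, nproma
          a(jc, jk) = a(jc, jk) + b(jc, jk)
        END DO
      END DO

      ! Variant B: adding WORKER parallelism on the inner loop, which can
      ! lead the runtime to pick a different launch configuration.
      !$ACC PARALLEL LOOP GANG DEFAULT(PRESENT)
      DO jk = 1, nlev
        !$ACC LOOP WORKER VECTOR
        DO jc = 1, nproma
          a(jc, jk) = 2.0 * a(jc, jk) + b(jc, jk)
        END DO
      END DO
    END SUBROUTINE worker_vector_example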
The porting strategy in Section 3 focusses exclusively on computational aspects but excludes the management of data buffers on CPU and GPU and the implementation of any data transfers between these. Given that A100 GPUs are the target platform, it is unlikely that a unified memory model is used, but is there use of managed memory or are all GPU allocations and transfers explicit via OpenACC data directives? Is there any use of automatic allocations in acc routines, or are any other means in place to improve memory handling, e.g., via the use of pool allocators?
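For example, it would help to state whether the data handling follows the usual fully explicit pattern sketched below (my own illustration with made-up names, not a statement about ICON), or whether managed memory, automatic allocations, or pool allocators play a role.

    SUBROUTINE data_region_example(nproma, nlev, ntimesteps, prog, tend)
      INTEGER, INTENT(IN)    :: nproma, nlev, ntimesteps
      REAL,    INTENT(INOUT) :: prog(nproma, nlev)
      REAL,    INTENT(IN)    :: tend(nproma, nlev)
      INTEGER :: jc, jk, jt

      ! Explicit data region: fields are copied to the GPU once, stay
      ! resident during time stepping, and are transferred back explicitly.
      !$ACC DATA COPYIN(prog, tend)
      DO jt = 1, ntimesteps
        !$ACC PARALLEL LOOP GANG VECTOR COLLAPSE(2) DEFAULT(PRESENT)
        DO jk = 1, nlev
          DO jc = 1, nproma
            prog(jc, jk) = prog(jc, jk) + tend(jc, jk)
          END DO
        END DO
      END DO
      ! Selective device-to-host transfer, e.g. for CPU-side output.
      !$ACC UPDATE HOST(prog)
      !$ACC END DATA
    END SUBROUTINE data_region_example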
The probabilistic testing method described in Sec. 4.1 seems similar to the Ensemble Consistency Testing methodology used by CESM and MPAS. How does the probtest method used here differ from this approach?
I like Section 4.3 that discusses some of the challenges related to porting a community code and the role of domain scientists. It does not become clear, however, how the feedback and adoption was from this less technical group of developers: Even with additional training, were subsequent code contributions into already ported components always suitable for GPU execution? Or do they sometimes/often require further adaptation? Moreover, with OpenMP and OpenACC directives in the same code base, readability of the code often suffers. Is this an issue, particularly with domain scientists?
The description of the GPU system in Sec. 5.1 was not sufficiently clear to me, in particular what the 42 GPU nodes represent: Is this the number of nodes required for one ensemble member (I don't think so), or the number of nodes required to run all ensemble members in parallel, or simply the size of the production cluster? If the latter, how many nodes are in use in total when producing an ensemble forecast?
Is my understanding correct that the benchmark configuration in Section 6.1 does not include any I/O (nor Data Assimilation)? Since these parts are CPU resident (per Fig. 1/Fig. 3), it would be interesting to also have an assessment of the additional cost incurred by the necessary device-to-host data transfers, and whether any work has been done to optimise these - e.g., via the use of pinned memory buffers.
Section 6.1 does not explicitly state the compiler choice (and which version). For OpenACC execution the obvious choice is NVHPC, but it only becomes implicitly clear that this is likely also the compiler used for the baseline results on CPU. Also, note that the optimization flag "-O" (in contrast to "-O2") disables SIMD vectorization, which tends to improve safety with respect to floating point exceptions but may significantly limit performance on modern CPUs and thus potentially result in an artificially reduced baseline performance, particularly since the GPUs will still use FMA instructions etc.:
From the NVHPC man page:
-O   Sets the optimization level to 2, with no SIMD vectorization enabled. All level 1 optimizations are performed. In addition, traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer.
-O2  All -O optimizations are performed. In addition, more advanced optimizations such as SIMD code generation, cache alignment and partial redundancy elimination are enabled.
The vague mention of "an issue" with OpenMP-MPI is also somewhat unsatisfactory; could this be elaborated further, and has it been brought up with the vendor or resolved in a later compiler version? Lastly, the bandwidth given in l. 406 is presumably the peak bandwidth of the main memory?
Out of curiosity: Do "-O3" or "-fastmath" change the accuracy of the results in a way that is picked up by probtest? Or does it change them to unphysical values?
With regard to the optimisation flags, a common optimisation strategy on NVIDIA hardware is to limit the number of GPU registers per thread using "-gpu=maxregcount:<n>". That helps to increase occupancy and can be applied at a per-source-file level to boost the performance of poorly performing kernels. Has this been tested and found not to provide any performance improvement?
Beyond the opt-rank-distribution described in Sec. 6.2.6, has the use of MPS to oversubscribe GPUs with multiple MPI ranks been explored? While this obviously reduces performance of kernels that are well saturating the hardware, it may help to improve occupancy for lower performing kernels.
I am unfamiliar with the Slurm setting "-distribution=plane=4" - could this be explained in Sec. 6.2.6?
The use of mixed precision gives surprisingly little performance improvement. IFS has achieved about a 40% runtime improvement by switching to single precision (see https://doi.org/10.1002/qj.4181). The gain here is likely limited by the number of fields that are still kept in double precision. What is the current share of operations/fields in single/double precision? Automatic tooling may help to identify fields and operations that are sensitive to numerical accuracy, see for example https://github.com/aopp-pred/rpe or https://doi.org/10.1007/978-3-031-32041-5_20.
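As a rough back-of-the-envelope model (my own estimate, assuming a purely bandwidth-bound code in which a fraction f of the memory traffic is halved by the switch to single precision), the expected speed-up is

    S(f) = \frac{1}{1 - f/2}

so the roughly 40 % cost reduction reported for IFS corresponds to f ≈ 0.8 under this crude model, whereas a substantially smaller single-precision share quickly limits the achievable gain (e.g. S(0.3) ≈ 1.18, i.e. about 15 % runtime reduction).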
The timing results reported in Table 1 are helpful in assessing the performance gains from the described optimisations. I would recommend adding a bar plot that presents the runtime and the improvement for each optimisation described in Sec. 6, thus illustrating the gains achieved. The presented numbers also include only the dycore and physics: what runtime share is attributed to the other components shown in Figure 1 (e.g., "Infrastructure" and "Diagnostics")?
Section 6.4 claims that this now enables "more efficient and cost-effective use of computational resources" - compared to what? This raises the question of whether the original objective when embarking on the GPU porting work has been achieved, e.g., is the production of forecasts now cheaper/faster/more energy-efficient than on CPU?
Typos/minor remarks:
- l.70 "Further more" -> "Furthermore"
- Caption of Fig. 1: This should likely read "After Initialization on the CPU..."
- l. 311: "In contract" -> "In contrast"
- l. 507/508: two subsequent sentences start with "In particular"