A highly-efficient automated optimization approach for kilometer-level resolution Earth system models on heterogeneous many-core supercomputers
Abstract. As coupled Earth system models advance, it becomes increasingly feasible to attain higher spatial resolutions, thereby enabling more precise simulations and predictions of the evolution of the Earth system. Consequently, there is an urgent demand for highly efficient optimization of extensive scientific programs on more power-efficient heterogeneous many-core systems. This study introduces a highly efficient optimization approach tailored for kilometer-level resolution Earth System Models (ESMs) operating on heterogeneous many-core supercomputers. Leveraging scalable model configurations and innovative tripolar ocean/sea-ice grids that bolster spatial accuracy and computational efficiency, we first establish a series of high resolutions (HRs) within a single component (either the atmosphere or the ocean) while maintaining a fixed resolution for the other, resulting in notable enhancements in both model performance and efficacy. Furthermore, we have devised an OpenMP tool specifically optimized for the new Sunway supercomputer, facilitating automated code optimization. Our approach is designed to be non-intrusive, minimizing the need for manual code alterations while ensuring both performance gains and code consistency. We adopt a hybrid parallelization strategy combining Athread and OpenMP, achieving full parallel coverage for code segments with a runtime proportion exceeding 1 %. After optimization, the atmosphere, ocean, and sea-ice models achieve speedups of 4.43×, 1.86×, and 2.43×, respectively. Consequently, the overall simulation performance of the 5-km/3-km coupled model reaches 222 SDPD. This achievement makes multiple decadal scientific numerical simulations with such HR coupled configurations feasible. Our work signifies a pivotal advancement in Earth system modeling, providing a robust framework for high-resolution climate simulations on more ubiquitous (next-generation) heterogeneous supercomputing platforms, such as GPUs, with minimal additional effort.
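In the throughput units used throughout the discussion below, the quoted figure of 222 simulated days per day (SDPD) corresponds to roughly 0.61 simulated years per day (SYPD):

$222\ \mathrm{SDPD} = \tfrac{222}{365}\ \mathrm{SYPD} \approx 0.61\ \mathrm{SYPD}$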
Status: open (until 02 Jan 2026)
- RC1: 'Comment on egusphere-2025-5297', Anonymous Referee #1, 01 Dec 2025
- RC2: 'Comment on egusphere-2025-5297', Anonymous Referee #2, 05 Dec 2025
Summary
The manuscript presents the porting of the coupled climate model CESM2 to the new Sunway supercomputer. The porting strategy combines OpenMP and Athread to map threads to the heterogeneous CPU cores available on the system. Some computationally expensive code sections are further manually refactored. Global simulations at different kilometer-scale resolutions are conducted, down to a grid spacing of 5 km for the atmosphere component and 3 km for the ocean part. Once deployed to the full machine, the highest-resolution simulation attains a throughput of 222 SDPD.
General comments
I am sorry for the following statement, but I prefer to be clear from the very beginning: this manuscript in its current shape does not deserve publication. I will lay out my major concerns here; more specific comments will be provided in the next section.
- As pointed out also by Referee #1, the link to previous efforts to adapt CESM to Sunway systems is not clear. For instance, the authors cite Zhang et al. (2020), describing the porting of CESM1.3 to the previous Sunway supercomputer. Why is the optimized CESM1.3 not used as the starting point for optimizing CESM2? Does CESM2 benefit from any optimization strategy devised for CESM1.3? Why is a different optimization approach (based on OpenMP) pursued here? The authors might have very valid arguments, but these should be clearly stated in the text.
- The title is misleading and vague. From my perspective, OpenMP cannot be deemed an "automated" software adaptation tool, since compiler directives must be inserted into the code manually (see the minimal sketch after this list of comments); it is rather a "high-level" programming paradigm. For this same reason, and because of the fine-grained refactoring at the Athread level presented in Section 3.4, the porting cannot be flagged as "non-intrusive" (as stated in Section 3.2). Moreover, while the title alludes to a generic porting and optimization approach, I believe that the proposed optimization toolchain is closely tailored to the peculiar hardware architecture of the Sunway chips; I doubt the overall porting strategy can be applied to other machines and other models as-is. Finally, it is not clear what "highly-efficient" means: with respect to which metric is the methodology efficient? Energy-to-solution, time-to-solution, hardware resource utilization, …?
- The authors use the expression “ultra-high-resolution” to refer to the 5v3 model configuration, which I find exaggerated. Simply “high-resolution” or “kilometer-scale” would be more appropriate.
- The authors do not provide a well-defined baseline to assess the performance of the optimized code. The actual speed-up offered by the CPEs can only be measured by deploying the unoptimized code (i.e. the code executing on the MPEs only) and the optimized code on the same number of nodes. However, Fig. 1 only shows results up to 18000 processes, while the 5v3 simulation conducted with the optimized code is scaled up to the full machine. More comments will be provided in the next section.
- The syntax requires a comprehensive review. Many sentences are hard to follow and their meaning is not always clear. Moreover, while some acronyms are expanded multiple times (e.g., “CESM” at L378-379), others are not explained at all (e.g., “gld/gst” at L135). Given the target audience of the journal, any terminology belonging to the technical jargon must be explained.
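To make the point about "automated" versus "high-level" concrete, here is a generic, hypothetical OpenMP fragment in C (not taken from the manuscript). The directive has to be written by a developer who has verified that the loop iterations are independent; only the thread creation and scheduling are delegated to the compiler and runtime.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double u[N], rhs[N];   /* hypothetical state and tendency arrays */
    const double dt = 0.1;

    /* The pragma below is inserted by hand: the programmer must know that
       the iterations are independent before asking the compiler to run
       them in parallel. Nothing about this step is automatic. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        u[i] += dt * rhs[i];      /* simple explicit update, purely illustrative */
    }

    printf("u[0] = %f\n", u[0]);
    return 0;
}
```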
Specific comments
- Section 2.1: Since the target Sunway supercomputer is rather new and not much information can be found on the internet, it would be useful to have more details about, e.g., clock frequency, system interconnect, software stack.
- Section 2.4:
- Table 1: I assume the table reports the total number of grid points for each model component. If so, there must be a typo, since the atmosphere component appears to feature the same number of grid points for 9v5 as for 25v10.
- How do you compute the throughput of each individual component? It seems to me that the SDPD of each component is proportional to its runtime fraction, which does not make any sense to me (see the worked example after this list of comments).
- L246-248: The reported increase factors in computational cost assume the time step is kept fixed, i.e., a time step satisfying the CFL condition at high resolution is also used at the coarse resolution. Please clarify.
- L248-251: This is too vague. The statement that the increase in compute cost is inversely proportional to the square of the grid spacing only holds if both the time step and the number of processes are kept constant; the second condition is clearly not met here (see the scaling estimate after this list of comments). Ultimately, could you explain how you choose the number of processes for each configuration?
- L256: The authors say that "the load balancing […] is particularly important for POP". Could you please elaborate on this point? How does the domain decomposition work? What are the criteria guiding the automatic decomposition? Why is load balancing so problematic for POP?
- L257-261: This is not clear to me.
- Section 3.1:
- L271-273: Is it correct that the dynamical core does not entail any vertical dependency?
- L298-300: I don't have much experience with land-surface schemes, but the schemes I know involve many parameters and fields, but calculations are fast nonetheless since only a few layers into the soil are needed. Does this hold true for your surface parametrization as well?
- Section 3.3:
- The meaning of technical terms like “swgcc”, “hybrid-LTO” and “SWACC” should be explained to engage a broader audience.
- It is not clear to me how the xfort.py script is invoked. Since it takes the compiler arguments, does it act as a compiler wrapper invoked by the user, or is it rather used by the compiler under the hood?
- Section 3.4 – Figure 5: The authors present the speed-up provided by OpenMP+OATH compared to the fine-grained refactoring at the Athread level. But what is the number of grid points? How do the speed-ups vary with the grid size? Did you apply the fine-grained refactoring to other stencils as well? Moreover, I would expect the Athread optimizations to be performance-neutral at worst, while they can lead to performance regressions on some stencils. How do you explain this?
- Section 4.1:
- Figure 6: Do the panels refer to the model configuration 5v3?
- L463-465: The authors here say that 5v3 running on MPEs only achieves 1.7 SDPD. However, Figure 1 shows that the throughput is already 4.34 SDPD when using only 18000 processes.
- Section 4.2 – Figure 7: Although the use of CPEs in POP2 leads to a small gain in terms of time-to-solution (comparing the two rightmost bars), the throughput leaps from 131 SDPD to 222 SDPD. Is there a mistake in the plot, or am I missing something?
- Section 4.3 – Figure 8: In the weak scaling plots (panels (a) and (b)), the choice of the number of processes is not clear to me. If NE30 is run with 1800 processes, shouldn’t NE120 use 7200 processes, i.e. 4-times more processes? And is the time-step kept fixed?
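A worked example of the component-throughput convention questioned above (a common convention, sketched here only to make the objection concrete; the manuscript may define it differently): if a component accounts for $T_i$ wall-clock seconds per simulated day within the coupled run, its implied throughput is

$\mathrm{SDPD}_i = 86400\,\mathrm{s} / T_i$,

so a component with a smaller runtime fraction should report a higher SDPD, not a lower one, which is why a number proportional to the runtime fraction looks suspicious.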
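Similarly, the cost argument behind the time-stepping comments can be made explicit with a standard estimate (not specific to the manuscript): refining the horizontal grid spacing from $\Delta x$ to $\Delta x / r$ multiplies the number of horizontal grid points by $r^2$, and if the time step is also shortened by the factor $r$ to respect the CFL condition, the cost per simulated day at a fixed process count scales as

$\mathrm{cost}(\Delta x / r) / \mathrm{cost}(\Delta x) \sim r^2 \times r = r^3$,

whereas it scales only as $r^2$ when the time step is held fixed. For a five-fold refinement (e.g. 25 km to 5 km) this is the difference between a factor of 125 and a factor of 25.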
Citation: https://doi.org/10.5194/egusphere-2025-5297-RC2
- CEC1: 'Comment on egusphere-2025-5297 - No compliance with the policy of the journal', Juan Antonio Añel, 07 Dec 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
In your "Code and Data Availability" statement you say that the optimized code and data that you use for your work are available upon request. I am sorry to have to be so outspoken, but this is completely unacceptable and forbidden by our policy, and your manuscript should never have been accepted for Discussions given such a violation of the policy of the journal, which very clearly states that all the code and data necessary to replicate a manuscript must be published openly and freely to anyone before submission.
Additionally, you have failed to provide a repository for the CESM2 model, linking a GitHub site instead. GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other long-term archival and publishing alternatives, such as Zenodo. Again, this is very clearly pointed out in the policy of the journal linked above.
Therefore, we are granting you a short time to solve this situation. You have to reply to this comment in a prompt manner with the information for the repositories containing all the models, code and data that you use to produce and replicate your manuscript. The reply must include the link and permanent identifier (e.g. DOI). Also, any future version of your manuscript must include the modified section with the new information.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-5297-CEC1
RC1: 'Comment on egusphere-2025-5297', Anonymous Referee #1, 01 Dec 2025
The authors present a porting of a version of the CESM climate model to the Sunway supercomputer, targeting horizontal resolutions of 5km (atmosphere) and 3km (ocean). The portability work affects several components of CESM, including atmosphere, ocean, sea ice, and the component coupler. The results presented show good speedup when using the CPEs of the full Sunway supercomputer, compared to the case where the model only runs on the MPEs, as well as good weak and strong scaling.
I have a few comments regarding the clarity of certain statements/sections, but before that, I think my biggest concern, which makes me think the work may need a major revision, is that there are already efforts in the literature that aim to port CESM (including CAM and POP) to run efficiently on the Sunway supercomputer, including previous works by some of the authors in 2016 and 2021. From reading the paper, it seems to me that the authors do not use the previous porting as a starting point, but rather undertake a completely new effort. This seems like a waste, and this choice has to be explained and justified in detail; otherwise this is just presenting work that is very similar to what was already done. Why not build on previous successes? Would previous work not scale to these resolutions? Or is this in fact the same work, just extended to the CICE and MCT components? Also, how does the speedup of this work compare to the previous porting efforts, if those versions of CESM were to be run at these resolutions? Hopefully, this can be cleared up easily by the authors, but the importance of this detail is why I select the "major revision" bullet.
Below are some additional areas that could use some clarification and/or be expanded a bit more.
0. In general, I think the paper could benefit from a grammar/syntax check, especially in the first couple of sections. Most text editors won't flag anything, as the words used do in fact exist (these are not misspellings), but a few more careful reads (or a better grammar/syntax check tool) would probably help identify areas that need fixes. Related to this, NICAM's expanded name includes "Iscosajedral", but it should be "Icosahedral".
1. In the literature review, the authors mention Golaz 2022 as a model that accommodates up to 276,000 GPUs and achieves 0.97 SYPD. However, the Golaz 2022 work uses E3SM v2, which is a CPU-only model, and the performance experiments shown in that paper are performed on the Chrysalis supercomputer, located at Argonne National Laboratory. Perhaps the authors were thinking of the Taylor 2023 work (doi: 10.1145/3581784.3627044), which achieved 1.26 SYPD on Frontier, using however only 65,536 AMD GPUs. Also, when comparing the current work's performance with leading modeling efforts, it may be best not to use the Bertagna 2020 work, as that only has the dycore ported to GPU. A better comparison within the E3SM umbrella would be the aforementioned Taylor 2023 paper, or Donahue 2024 (doi: 10.1029/2024MS004314).
2. In section 2.1, it would be nice to give some more interesting details on the new Sunway supercomputer, or to provide in this section a reference to another work with such details. For instance, the clock speed of MPEs vs CPEs, or their nominal power consumption. These details would help put in perspective the claim made in section 4.1 that "the highest resolution model now performs at nearly one simulation year per day, compared to 1.7 SDPD with the previous version that utilized MPEs only".
3. In section 2.4 the authors claim that going from 25 km to 5 km they would expect a drop in efficiency by a factor of 25. However, the factor of 25 only accounts for the increase in spatial resolution, while CFL constraints also impose a decrease in the maximum allowed time step. Are the authors using the same dt for all simulations, meaning that they pick a dt that works for the finest resolution and use that also for the coarse resolutions? If so, this should probably be stated, to avoid CFL-related confusion.
4. In section 3, I would consider dropping the master/slave terminology in favor of less "controversial" terms. For instance, they could use "main thread" (or "primary thread") and "worker threads". Other good suggestions can be found by browsing the CS community. While "master" is still widely used, the term "slave" is definitely falling out of favor...
5. In section 3.4, in particular figure 5, we see the performance boost of individual CAM/POP portions when switching from the O2ATH framework to a pure Athread one. I am not an expert on POP, but the three atm dycore subroutines reported have different runtimes. It may be helpful to add a table (perhaps side by side with the figure) showing the runtime of these portions of the code (for the base case, without either O2ATH or Athread), to help weigh the different speedup bars (see the relation after this list of comments).
6. In section 3.6 the authors discuss "binary consistency" across different compilers and/or supercomputers. By this, I assume they mean the bit-wise value of the generated output, and not the binary executable, but perhaps this should be clarified. Either way, I don't think there is much interest across models in retaining a bit-for-bit reproducible solution across different compilers, let alone different machines (but I may be wrong). Usually, this can be achieved for subsequent versions of the code on the SAME machine with the SAME compiler, but even that can be challenging, especially if one uses solvers that make use of random algorithms, or iterative algorithms involving many global reductions (a minimal illustration of the reduction-order issue follows this list of comments). On the other hand, an ensemble-based Deep Learning approach is very reasonable, and definitely more interesting for the community. Since it seems the authors developed their own framework, I think it would be more interesting to devote the full 3.6 section to this, maybe giving some more detail that could be useful for other centers/models.
7. I'm curious as to how the OMEGA kernel acceleration is achieved. Does libvnest implement a divide-and-conquer approach? It would be nice to share a few more details, to allow other projects to learn from this effort.
8. In section 4.1 the authors claim that "the highest resolution model now performs at nearly one simulation year per day". In the rest of the 4.x sections the highest number we see (for the full model) is 222 SDPD, which is ~0.6 SYPD. I would not say that 0.6 is "nearly one SYPD"... It is possible (though the paragraph does not suggest that) that they referred to the performance of the ATM component only, which fig 8c shows to achieve 331 SDPD; if so, it should be made clear. If, instead, they were referring to the full model, I would consider rephrasing the claim, as 0.6 is not very close to 1. To be clear, I am not belittling the relevance of the 222 SDPD; I am just saying that I don't see a need to use misleading words.
9. In section 4.2 the plot line shows a 1.83x improvement in SDPD in the last optimization step. However, the bar plot of the wall seconds does not seem to decrease by much (if at all). Where does the boost from 131 to 222 SDPD come from, given that the wall-seconds bar plots seem unchanged?
10. In section 4.3:
a) when commenting on fig 8, the authors first claim that the smallest feasible scale for NE480 was 28,800 processes (in agreement with fig 8c), but then they say that efficiency is ~47% when scaling from 14,400 to 460,800 processes. So, did 14,400 work? Or is this a typo?
b) the authors say that "a further decline to ~72% occurs when scaling beyond half of the full capacity". However, looking at fig 8a, I see an efficiency of 69% at 115,200 CPEs, that is, a lower efficiency, and happening earlier than scaling beyond half capacity. Either this discrepancy comes from data older/newer than the data in the plot, or it refers to different data. Either way, this needs to be clarified/fixed.
c) the authors say "the efficiency of POP is measured at 9.59 SDPD". I don't understand what it means to measure efficiency at 9.59 SDPD. Also, none of the plots below show the 9.59 SDPD number, so it's hard to link this number to the rest of the section and the plots. This sentence needs to be clarified.
d) the authors say that the peak performance is 50 SDPD, achieved with one fourth of the machine, and that using the full machine it drops to 30 SDPD. They speculate that this may be due to "incorrect setup of cases", but it is not clear what they mean by that. Also, these numbers are nowhere in the plots. Were these initial experiments before another round of optimizations? This needs some clarification.
e) related to the above, the authors then claim that "by fine-tuning CPL and CICE at the machine full size it is possible to achieve significant improvement". Does this mean that the large-scale runs contain ad-hoc optimizations that are not used at lower resolution? What kind of fine-tuning was needed? As stated, this bit gives no insight that could help other projects experiencing similar performance drops at large scale...
11. In section 4.4, figure 9 claims to show both ocean surface vorticity as well as sea ice concentration/deformation. However, I only see one colorbar, with a legend showing the units of vorticity. How should one read the sea ice concentration? The presence of green-ish and white-ish areas is also unclear. I suspect the white area is sea ice (but how does one deduce concentration/deformation?), and the green-ish area is just a tight overlap of areas with (small) positive and negative vorticity, which makes blue and yellow pixels sit very near each other, causing a green effect. Still, some clarification may help the reader.
12. In the "code and data availability" section at the end, the authors point to the ESCOMP org for the CESM code, and mention that the optimized code and the raw data are available upon request. I find this a bit underwhelming. I would prefer to see the optimized code, the input/run scripts, and any other input data used for the experiments publicly available. They could easily create a Zenodo (or similar) snapshot, as done in previous efforts by several climate modeling centers (as well as in previous works of some of the authors).
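As a note on item 5 above: the standard way to combine such per-kernel speedups into an overall figure (a textbook relation, included here only to motivate the request for base runtimes) is

$S_\mathrm{overall} = \left( \sum_i f_i / s_i \right)^{-1}$,

where $f_i$ is the runtime fraction of kernel $i$ in the baseline and $s_i$ its individual speedup; without the $f_i$, the bars in figure 5 cannot be weighted into an overall gain.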
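And as a minimal illustration of the reproducibility point in item 6 (a generic sketch, not code from the manuscript): floating-point addition is not associative, so changing the order of a global sum, as parallel reductions typically do across different machines or process counts, can change the last bits of the result.

```c
#include <stdio.h>

int main(void) {
    /* Summing the same three numbers in two different orders gives results
       that differ, because floating-point addition is not associative.
       Parallel reductions generally do not guarantee a fixed order. */
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    double left_to_right = (a + b) + c;  /* exactly 1.0 */
    double right_to_left = a + (b + c);  /* 0.0 with IEEE-754 doubles */

    printf("%.17g vs %.17g\n", left_to_right, right_to_left);
    return 0;
}
```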
Finally, one comment on the keyword "automated" featured in the title. It is not clear to me how the optimization is "automated". Do the authors refer to the fact that the compiler does all the optimization work, once the proper pragma directives are inserted? If so, I find this a poor justification of the word "automated". Using OpenMP does not really qualify as "automated optimization": it is no more automated than the creation of the binary executable when running "make"; it is just the compiler doing what it is designed to do. Moreover, the authors explicitly say that they also needed precise fine-grained optimizations to squeeze out performance, which does not really chime with the "automated" tone of the title. To be clear, there is nothing wrong with having to manually modify/refactor selected areas of the code, and there is nothing wrong with relying on the compiler for vectorization and/or threading choices. But I would not call this an "automated optimization".