swLICOM: the multi-core version of an ocean general circulation model on the new generation Sunway supercomputer and its kilometer-scale application
Abstract. Global ocean general circulation models (OGCMs) with kilometer-scale resolution are of great significance for understanding the climate effects of mesoscale and submesoscale eddies. To address the exponential growth in computational and storage demands associated with kilometer-scale simulation of global OGCMs, we develop an enhanced and deeply optimized OGCM, namely swLICOM, on the new generation Sunway supercomputer. We design a novel split I/O scheme that effectively partitions tripole grid data across processes for reading and writing, resolving the I/O bottleneck encountered in kilometer-scale resolution simulation. We also develop a new domain decomposition strategy that effectively removes land points to enhance the simulation capability. In addition, we upgrade the code translation tool swCUDA to convert the LICOM3 CUDA kernels to Sunway kernels efficiently. With further optimization using mixed precision, we achieve a peak performance of 453 Simulated Days per Day (SDPD) with 59 % parallel efficiency at 1 km resolution, scaling up to 25 million cores. A simulation with 2 km horizontal resolution shows that swLICOM is capable of capturing vigorous mesoscale eddies and active submesoscale phenomena.
Status: open (until 25 Dec 2025)
- RC1: 'Comment on egusphere-2025-2231', Anonymous Referee #1, 13 Oct 2025
- AC1: 'Reply on RC1', Kai Xu, 03 Dec 2025
Thank you very much for your thorough review and the constructive comments on our manuscript. We sincerely appreciate the time and effort you have devoted to providing these insightful suggestions, which have significantly improved the quality of our work. We have carefully considered all the points raised. Below, we provide a point-by-point response to your comments:
1. Line 36: It is unclear who or what “Kinaco” refers to. Please clarify.
Response: Thanks. Kinaco in Line 36 is a non-hydrostatic ocean model that was developed for high-resolution numerical ocean studies. We will add further explanations and citations in the revised paper as follows.
- Yamagishi, T. and Matsumura, Y.: GPU Acceleration of a Non-hydrostatic Ocean Model with a Multigrid Poisson/Helmholtz Solver, Procedia Computer Science, 80, 1658–1669, https://doi.org/10.1016/j.procs.2016.05.502, International Conference on Computational Science 2016 (ICCS 2016), 6–8 June 2016, San Diego, California, USA, 2016.
- Matsumura, Y. and Hasumi, H.: A non-hydrostatic ocean model with a scalable multigrid Poisson solver, Ocean Modell., 24, 15–28, https://doi.org/10.1016/j.ocemod.2008.05.001, 2008.
2. Line 58: LICOM2-GPU, LICOM3-HIP, and LICOM3-CUDA are model versions, not heterogeneous supercomputers; please adjust the wording accordingly.
Response: Thanks. The clarification will be modified in the revised paper as follows.
“The development of LICOM for heterogeneous supercomputers is evidenced by three key versions: LICOM2-GPU (Jiang et al., 2019), LICOM3-HIP (Wang et al., 2021), and LICOM3-CUDA (Wei et al., 2023), each specifically ported to a different computing architecture.”
3. Section 2.2: The paper refers to the Sunway system as a “heterogeneous” architecture, but this is not clearly explained. Please clarify that heterogeneity arises from two distinct core types within each chip, the general-purpose MPEs and lightweight CPEs with separate memory hierarchies and instruction sets, rather than from separate CPU and GPU components. The section would also benefit from citing one or more detailed references on the SW26010 Pro system architecture.
Response: Thanks. We will clarify that heterogeneity arises from two distinct core types within each chip, the general-purpose MPEs and lightweight CPEs with separate memory hierarchies and instruction sets, rather than from separate CPU and GPU components. The reference, which contains the details of the SW26010 Pro architecture, will be added to the article.
Lin, R., Yuan, X., Xue, W., Yin, W., Yao, J., Shi, J., Sun, Q., Song, C., and Wang, F.: 5 ExaFlop/s HPL-MxP Benchmark with Linear Scalability on the 40-Million-Core Sunway Supercomputer, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, Association for Computing Machinery, New York, NY, USA, ISBN 9798400701092, https://doi.org/10.1145/3581784.3607030, 2023.
4. Section 2.2: Please indicate the overall size of the Sunway supercomputer (e.g., total nodes, processors, or cores) to give readers a clearer sense of the system scale used for the simulations presented here.
Response: Thanks. The Sunway OceanLight is equipped with more than 100,000 SW26010 Pro processors, each with 390 cores, and is one of the fastest supercomputers globally. An additional clarification will be included in the revised manuscript.
5. Line 132: Please clarify what specific programming challenges are referred to, e.g., related to memory hierarchy, data communication between CPEs and MPEs, or algorithm adaptation to the Sunway architecture.
Response: Thanks. The primary challenge in optimizing for the SW26010 Pro processor stems from its heterogeneous architecture. In this architecture, the Management Processing Element (MPE) handles inter-process communication and controls the overall application workflow. The main computing power, however, resides in the Core Processing Elements (CPEs). Each CPE is equipped with a manually managed Local Data Memory (LDM) that offers access speeds comparable to the L1 cache. CPEs can communicate directly via Remote Memory Access (RMA). To leverage the CPEs' computational capacity, code executed initially on the MPE must be ported to an Athread kernel. The MPE is responsible for launching this kernel and subsequently waiting for its completion. Consequently, effectively leveraging the unique characteristics of the CPEs is the key to achieving high performance.
6. Line 139: The term “Athread kernel” refers to the parallel programming model on Sunway, but most readers may not be familiar with it. Please provide a brief explanation of Athread and its role in parallel execution.
Response: Thanks. The Athread programming model is a parallel programming model for the Sunway architecture. It provides an abstraction that maps closely to the Sunway hardware and offers explicit control over the DMA (Direct Memory Access) controller on the CPEs. This allows programmers to efficiently move data between the main memory (controlled by the MPE) and the Local Data Memory (LDM) of each CPE, which is crucial for overcoming memory bandwidth bottlenecks. In a typical execution flow, the main program runs on an MPE; the MPE calls the athread_spawn interface to create “slave” threads that execute a specified function on the CPEs. All threads in the team execute the same function, but on different portions of the data. The Athread model provides synchronization primitives (e.g., barriers) to coordinate these threads, and the MPE calls athread_join to wait for the kernel execution to finish.
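For illustration, a minimal sketch of this MPE/CPE flow is given below. It is a hedged example rather than swLICOM code: the kernel name scale_kernel, the argument struct, and the slave_ naming convention are assumptions, and a real kernel would additionally stage its data into LDM with DMA (athread_get/athread_put).

```c
/* Illustrative sketch only; kernel and struct names are hypothetical, and the
 * Athread signatures are simplified assumptions based on the workflow described
 * above. The two parts would normally share a common header and be compiled
 * separately for the MPE and the CPEs. */

/* ---- CPE ("slave") side ---- */
#include <slave.h>

typedef struct { double *a; double s; int n; } kernel_arg_t;

void scale_kernel(void *p) {
    kernel_arg_t *arg = (kernel_arg_t *)p;
    int tid   = _MYID;                       /* CPE thread id within the core group */
    int chunk = (arg->n + 63) / 64;          /* static partition across 64 CPEs      */
    int beg   = tid * chunk;
    int end   = (beg + chunk < arg->n) ? beg + chunk : arg->n;
    /* A real kernel would first DMA a[beg:end] into LDM (athread_get), compute on
     * the LDM buffer, and write the result back (athread_put). */
    for (int i = beg; i < end; ++i)
        arg->a[i] *= arg->s;
}

/* ---- MPE ("master") side ---- */
#include <athread.h>
extern void slave_scale_kernel(void *);      /* CPE entry point; the slave_ prefix
                                                naming convention is an assumption  */

void scale_on_cpes(double *a, double s, int n) {
    kernel_arg_t arg = { a, s, n };
    athread_init();                          /* set up the CPE thread team
                                                (once per program in practice)      */
    athread_spawn(scale_kernel, &arg);       /* launch the kernel on all CPEs       */
    athread_join();                          /* MPE waits until the CPEs finish     */
}
```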
7. Figure 1: The text and labels in Fig. 1a are too small to read clearly when printed. Please enlarge the figure or adjust the layout for better legibility.
Response: Thanks. The text and labels in Fig. 1a will be modified in the revised paper.
8. Line 156: Suggest placing JK decomposition in quotation marks (“JK decomposition”) to indicate it is a specific term introduced by the authors.
Response: Thanks. Quotation marks will be added in the revised paper.
9. Lines 186 and Fig. 5: The discussion of IJ, IK, and WKK decomposition is confusing. Please clarify how these decomposition strategies differ and what “WKK” specifically represents.
Response: Thanks. In the LICOM model, space is discretized into 3-D grid points. Horizontal grid points are indexed by (I, J), and each horizontal point carries multiple vertical levels, indexed by K. Most data structures (arrays) in LICOM are three-dimensional arrays with the layout (I, J, K). Fortran arrays are stored in column-major order, so elements along dimension I are contiguous in memory. Because LICOM contains different computational patterns, different decomposition schemes are used for different patterns. For example, “JK decomposition” means that the computation is decomposed by assigning tasks with different J and K ranges to different CPEs. WKK is simply a variable name appearing in Figure 5a, not a decomposition scheme.
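To illustrate the idea, the sketch below shows how a “JK decomposition” might map the flattened (J, K) index space onto the 64 CPEs of a core group while keeping the contiguous I dimension innermost. This is a hypothetical example rather than the swLICOM implementation: the indexing macro, the field u, and the static tiling are assumptions for illustration.

```c
/* Hypothetical sketch of a "JK decomposition" on one core group: the (J, K)
 * plane is flattened and split among the 64 CPEs, and each CPE sweeps the full
 * I dimension in its innermost loop, since LICOM arrays use Fortran
 * column-major (I, J, K) layout and I is contiguous in memory. */
#define NCPE 64
#define IDX(i, j, k, imax, jmax) ((i) + (imax) * ((j) + (jmax) * (k)))  /* column-major */

void jk_decomp_kernel(int tid, double *u, int imax, int jmax, int kmax) {
    int ncol = jmax * kmax;                  /* total number of (j, k) columns   */
    int per  = (ncol + NCPE - 1) / NCPE;     /* columns assigned to each CPE     */
    int beg  = tid * per;
    int end  = (beg + per < ncol) ? beg + per : ncol;

    for (int c = beg; c < end; ++c) {        /* this CPE's (j, k) range          */
        int k = c / jmax;
        int j = c % jmax;
        for (int i = 0; i < imax; ++i)       /* contiguous I innermost           */
            u[IDX(i, j, k, imax, jmax)] *= 0.5;   /* placeholder computation     */
    }
}
```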
10. Line 214: Please clarify the phrase “across tens of thousands of machines.” Do you mean compute nodes?
Response: Thanks. Yes, we refer to the compute nodes. We will clarify the phrase in the revised manuscript.
11. Line 228: The term “Canuto parametrization” appears without prior introduction or reference. Please briefly explain or cite the source when first mentioning it.
Response: Thanks. “Canuto parametrization” refers to the vertical viscosity and diffusivity schemes in LICOM. A reference for the "Canuto parametrization" will be added in the revised paper as follows.
Canuto, V. M., Howard, A., Cheng, Y., and Dubovikov, M. S.: Ocean Turbulence. Part II: Vertical Diffusivities of Momentum, Heat, Salt, Mass, and Passive Scalars, J. Phys. Oceanogr., 32, 240–264, https://doi.org/10.1175/1520-0485(2002)032<0240:OTPIVD>2.0.CO;2, 2002.
12. Line 245: It appears that an equation is missing at this point in the manuscript.
Response: Thanks. We will add the equation back in the revised manuscript.
13. Tables 1 and 3: The timestep units (presumably seconds) are missing. Please also explain why all configurations use the same timestep despite large differences in horizontal resolution. Typically, finer grids require smaller timesteps for stability.
Response: Thanks. The timestep unit is seconds, and the tables will be updated accordingly in the revised paper. To ensure a fair comparison in our scalability tests, we used the time step of the highest-resolution simulation for all configurations, so that resolution is the only factor that changes across these experiments.
14. Sections 4.3–4.6: These sections are quite brief. Consider merging them into one cohesive section summarizing the scaling and benchmarking results to improve readability.
Response: Thanks. We will merge Sections 4.3–4.6 into one section in the revised paper.
15. Line 276: The term “super large parallel scale” likely refers to the largest simulations conducted in this study, but please state this explicitly to avoid ambiguity.
Response: Thanks. We will revise "super large parallel scale" to "large parallel scale" to avoid ambiguity. At our largest parallel scale, the 1 km resolution configuration, the new domain decomposition saves more than 13 million cores.
16. Figures 11–13: The units of the displayed quantities (e.g., sea surface height, temperature, salinity) are missing. Please add appropriate units to the color bars or captions.
Response: Thanks. The units of the displayed quantities will be added to captions.
17. Code and Data Availability: The “project website” and the citation “Xu (2025)” both seem to refer to the same Zenodo record (10.5281/zenodo.15494635). Please clarify whether these are distinct (e.g., project page vs. archived version) or consolidate them to avoid redundancy.
Response: Thanks. We will consolidate “project website” and the citation “Xu (2025)” to avoid redundancy.
18. Technical corrections
A careful proofreading or light English edit is recommended to improve readability and ensure consistent terminology.
Please follow the Copernicus manuscript composition guidelines for capitalization, abbreviations, and formatting when referring to Figures, Tables, and Sections:
https://publications.copernicus.org/for_authors/manuscript_preparation.html
Line 106: Please correct or complete the reference “Y.Q. et al.” to match the proper citation format.
Line 114: The degree symbol (°) is missing, please add.
Line 176: The sentence beginning “Inout the attribute is used…” should be revised for clarity, e.g., “The inout attribute indicates whether the array is read-only or modified within the kernel.”
Line 221: Please fix the broken equation references (“equation ??”).
Line 265: The sentence beginning “Whenever the…” is unclear or incomplete; please revise.
Line 277: The manuscript frequently uses “mix precision,” but the correct term is “mixed precision.” Please revise throughout.
Line 336: Replace "double-only implementation" with "double-precision implementation" for accuracy.
Response: Thanks. We will correct the errors based on your suggestions. The revisions will include careful proofreading, consolidation of code references, clarification of terminology ("large parallel scale"), addition of the requested citation, and full compliance with the Copernicus formatting guidelines for Figures, Tables, and Sections.
Once again, we would like to express our sincere gratitude for your thoughtful comments and guidance. We believe these revisions have substantially improved the manuscript. We hope that our responses and the revised manuscript will meet with your approval.
Citation: https://doi.org/10.5194/egusphere-2025-2231-AC1
Review of “swLICOM: the multi-core version of an ocean general circulation model on the new generation Sunway supercomputer and its kilometer-scale application” by Kai Xu et al.
General comments
This study presents swLICOM, a high-performance, multi-core version of the LASG/IAP Climate System Ocean Model (LICOM3) optimized for the new-generation Sunway supercomputer. It enables kilometer-scale global ocean simulations, which are critical for resolving mesoscale and submesoscale eddies that influence ocean circulation and climate.
The authors introduce several key innovations: an automatic CUDA-to-Sunway code translation tool (swCUDA) for efficient porting, a domain decomposition method that removes land grid points, a split I/O scheme to alleviate data bottlenecks, and mixed-precision computing to balance accuracy and performance. These optimizations allow swLICOM to achieve up to 453 simulated days per day (SDPD) with 59% efficiency at 1 km resolution using over 25 million cores. The model captures vigorous mesoscale and submesoscale features, demonstrating excellent scalability and efficiency.
Overall, the paper is clearly written and well-structured, effectively communicating substantial technical work. The study shows significant and comprehensive efforts to enhance the computational performance of LICOM when ported to the Sunway system. The methods are sound, and the results convincingly support the claims. I recommend publication in GMD after minor revisions addressing the specific points below.
Specific comments
Line 36: It is unclear who or what “Kinaco” refers to. Please clarify.
Line 58: LICOM2-GPU, LICOM3-HIP, and LICOM3-CUDA are model versions, not heterogeneous supercomputers; please adjust the wording accordingly.
Section 2.2: The paper refers to the Sunway system as a “heterogeneous” architecture, but this is not clearly explained. Please clarify that heterogeneity arises from two distinct core types within each chip, the general-purpose MPEs and lightweight CPEs with separate memory hierarchies and instruction sets, rather than from separate CPU and GPU components. The section would also benefit from citing one or more detailed references on the SW26010 Pro system architecture.
Section 2.2: Please indicate the overall size of the Sunway supercomputer (e.g., total nodes, processors, or cores) to give readers a clearer sense of the system scale used for the simulations presented here.
Line 132: Please clarify what specific programming challenges are referred to, e.g., related to memory hierarchy, data communication between CPEs and MPEs, or algorithm adaptation to the Sunway architecture.
Line 139: The term “Athread kernel” refers to the parallel programming model on Sunway, but most readers may not be familiar with it. Please provide a brief explanation of Athread and its role in parallel execution.
Figure 1: The text and labels in Fig. 1a are too small to read clearly when printed. Please enlarge the figure or adjust the layout for better legibility.
Line 156: Suggest placing JK decomposition in quotation marks (“JK decomposition”) to indicate it is a specific term introduced by the authors.
Lines 186 and Fig. 5: The discussion of IJ, IK, and WKK decomposition is confusing. Please clarify how these decomposition strategies differ and what “WKK” specifically represents.
Line 214: Please clarify the phrase “across tens of thousands of machines.” Do you mean compute nodes?
Line 228: The term “Canuto parametrization” appears without prior introduction or reference. Please briefly explain or cite the source when first mentioning it.
Line 245: It appears that an equation is missing at this point in the manuscript.
Tables 1 and 3: The timestep units (presumably seconds) are missing. Please also explain why all configurations use the same timestep despite large differences in horizontal resolution. Typically, finer grids require smaller timesteps for stability.
Line 276: The term “super large parallel scale” likely refers to the largest simulations conducted in this study, but please state this explicitly to avoid ambiguity.
Sections 4.3–4.6: These sections are quite brief. Consider merging them into one cohesive section summarizing the scaling and benchmarking results to improve readability.
Figures 11–13: The units of the displayed quantities (e.g., sea surface height, temperature, salinity) are missing. Please add appropriate units to the color bars or captions.
Code and Data Availability: The “project website” and the citation “Xu (2025)” both seem to refer to the same Zenodo record (10.5281/zenodo.15494635). Please clarify whether these are distinct (e.g., project page vs. archived version) or consolidate them to avoid redundancy.
Technical corrections
A careful proofreading or light English edit is recommended to improve readability and ensure consistent terminology.
Please follow the Copernicus manuscript composition guidelines for capitalization, abbreviations, and formatting when referring to Figures, Tables, and Sections:
https://publications.copernicus.org/for_authors/manuscript_preparation.html
Line 106: Please correct or complete the reference “Y.Q. et al.” to match the proper citation format.
Line 114: The degree symbol (°) is missing, please add.
Line 176: The sentence beginning “Inout the attribute is used…” should be revised for clarity, e.g., “The inout attribute indicates whether the array is read-only or modified within the kernel.”
Line 221: Please fix the broken equation references (“equation ??”).
Line 265: The sentence beginning “Whenever the…” is unclear or incomplete; please revise.
Line 277: The manuscript frequently uses “mix precision,” but the correct term is “mixed precision.” Please revise throughout.
Line 336: Replace “double-only implementation” with “double-precision implementation” for accuracy.