GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework
Abstract. The three-dimensional unstructured-mesh finite-element atmospheric dynamical framework is gaining significance owing to its flexibility in representing complex topography and its capability for multi-scale simulations at high resolution. However, this framework faces substantial computational bottlenecks. Unlike structured-grid models, the unstructured finite element method (FEM) must frequently access irregular mesh connectivity among nodes, edges, and elements, causing indirect memory addressing, poor data locality, and severe memory-bandwidth bottlenecks on conventional CPU architectures. Consequently, element-wise computations and global assembly are the primary contributors to runtime in high-resolution simulations.
This study develops a GPU-parallel implementation of the Fluidity-Atmosphere dynamical core to address these challenges. GPU-oriented data structures and optimized kernels are designed to efficiently leverage the computing power of GPUs. These kernels enable parallelized element integration and provide efficient solvers for fixed-size matrices; a parallel assembly strategy enhances memory throughput during global sparse-matrix construction. On the NVIDIA A100 GPU, the optimized kernels achieve speedups of over 100× for element-wise computations and up to 389.02× for global matrix assembly, yielding an overall acceleration of 8.57× with four message passing interface (MPI) processes. The proposed framework demonstrates that tailored GPU parallelization is effective in overcoming the computational bottlenecks of unstructured FEM-based atmospheric models, facilitating high-resolution simulations on heterogeneous architectures.
General Assessment
This manuscript presents a GPU-accelerated implementation of the Fluidity-Atmosphere dynamical core, with a focus on two major computational bottlenecks in unstructured finite-element atmospheric models: element-wise computations and global sparse matrix assembly. The authors design GPU-oriented data structures and optimized CUDA kernels, achieving substantial kernel-level speedups (up to ~100–900×) and an overall acceleration of up to 8.57× in a hybrid MPI+GPU configuration.
The topic is timely and relevant. Unstructured-mesh FEM atmospheric models are increasingly important due to their geometric flexibility and suitability for high-resolution simulations over complex terrain. However, their computational inefficiency remains a major obstacle. This work addresses a meaningful gap by targeting GPU acceleration in a realistic dynamical core rather than in isolated kernels.
Overall, the manuscript is well organized, and the implementation appears technically sound. The reported performance improvements are significant, and the inclusion of code and data is consistent with GMD’s reproducibility standards. These aspects make the work potentially valuable to the community.
However, despite these strengths, several aspects of the manuscript could be further strengthened to improve clarity and rigor. In particular, the discussion of performance results would benefit from a more detailed interpretation, especially regarding the gap between kernel-level and end-to-end speedups. In addition, some implementation aspects—such as portability and parameter choices—would benefit from further clarification. Addressing these points would help provide a more complete and transparent assessment of the proposed approach.
I therefore recommend minor revision before publication.
Major Comments
1. Gap Between Kernel-Level and End-to-End Performance
The manuscript reports very high kernel-level speedups (up to ~900×), while the overall application speedup is approximately 8.57×. This discrepancy is expected in complex applications, but it is not sufficiently discussed in the current manuscript.
In particular, the contribution of non-accelerated components—such as the sparse linear solver and CPU–GPU data transfer—is not clearly quantified. As a result, it remains unclear which parts of the workflow dominate the runtime after GPU acceleration.
The authors are encouraged to provide a clearer breakdown of the total runtime and to discuss the limiting factors for end-to-end performance.
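To illustrate why such a breakdown matters, the gap can be framed with a simple Amdahl-style estimate (the runtime fractions below are hypothetical, chosen only for illustration, not taken from the manuscript): if roughly 90% of the original runtime is accelerated ~100× while the remaining 10% (linear solver, CPU–GPU transfers, host code) is unchanged, the overall speedup is bounded near 9×, broadly consistent with the reported 8.57×.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style estimate: `fractions` are shares of the original
    runtime, `speedups` the factor each share is accelerated by."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))

# Hypothetical split: 90% of runtime accelerated 100x on the GPU,
# 10% (solver + data transfers) remaining at CPU speed.
print(round(overall_speedup([0.9, 0.1], [100.0, 1.0]), 2))  # ~9.17
```

A table of such measured fractions, before and after acceleration, would make the limiting factors immediately apparent.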
2. Limited Discussion on Portability and Generality
The current implementation is closely tied to CUDA and NVIDIA GPUs. Although the manuscript briefly mentions the possibility of porting the approach to other platforms (e.g., via HIP), this point is only touched upon and would benefit from a more explicit discussion.
A short discussion of the dependence on CUDA-specific features (for example, atomic operations and memory hierarchy), the potential challenges in porting, and the expected generality of the proposed approach would improve the completeness of the manuscript.
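To make the portability concern concrete: global assembly in unstructured FEM presumably relies on atomic additions, since neighbouring elements accumulate into the same global matrix entries concurrently. A serial Python sketch of this scatter-add pattern (with a hypothetical two-element mesh; not code from the manuscript) is:

```python
from collections import defaultdict

def assemble(elements, element_matrices):
    """Scatter-add local element matrices into a global sparse matrix.

    On a GPU, each element would be processed by its own thread, and the
    `+=` below must become an atomic add (CUDA atomicAdd, or its HIP
    counterpart) because elements sharing a node write to the same entry.
    """
    global_matrix = defaultdict(float)  # (row, col) -> value
    for nodes, ke in zip(elements, element_matrices):
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                global_matrix[(i, j)] += ke[a][b]
    return dict(global_matrix)

# Hypothetical 1D mesh of two linear elements sharing node 1.
elements = [(0, 1), (1, 2)]
ke = [[1.0, -1.0], [-1.0, 1.0]]
K = assemble(elements, [ke, ke])
# The shared diagonal entry (1, 1) receives contributions from both elements.
```

Discussing whether the implementation depends on such atomics (and on floating-point atomic performance, which differs across vendors) would clarify the porting effort.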
Minor Comments
1. Consistency between abstract and reported results
The abstract reports an overall speedup of 8.57×, while later sections mention 10.28× under certain configurations. This inconsistency may cause confusion.
The authors are encouraged to clearly specify the conditions under which each value is obtained and ensure consistent reporting throughout the manuscript.
2. Kernel launch configuration
Thread block sizes are described as being “empirically tuned,” but no further details are provided.
It would be useful to report the selected block sizes, briefly describe the tuning procedure (e.g., the range of configurations tested), and comment on how sensitive kernel performance is to these choices.
Recommendation
Based on the comments above, I recommend minor revision before the manuscript can be considered for publication.
The work is relevant and technically promising, but the manuscript would benefit from a clearer discussion of end-to-end performance limitations, a more complete treatment of portability, and several clarifications in the presentation of results and implementation details.
I hope these comments are helpful to the authors in improving the clarity and completeness of the manuscript.