GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework
Abstract. The three-dimensional unstructured-mesh finite-element atmospheric dynamical framework is gaining significance owing to its flexibility in representing complex topography and its capability for multi-scale simulations at high resolution. However, this framework faces substantial computational bottlenecks. Unlike structured-grid models, the unstructured finite element method (FEM) must frequently access irregular mesh connectivity among nodes, edges, and elements, causing indirect memory addressing, poor data locality, and severe memory-bandwidth bottlenecks on conventional CPU architectures. Consequently, element-wise computations and global assembly are the primary contributors to runtime in high-resolution simulations.
This study develops a GPU-parallel implementation of the Fluidity-Atmosphere dynamical core to address these challenges. GPU-oriented data structures and optimized kernels are designed to efficiently leverage the computing power of GPUs. These kernels enable parallelized element integration and provide efficient solvers for fixed-size matrices; a parallel assembly strategy enhances memory throughput during global sparse-matrix construction. On the NVIDIA A100 GPU, the optimized kernels achieve speedups of over 100× for element-wise computations and up to 389.02× for global matrix assembly, yielding an overall acceleration of 8.57× with four message passing interface (MPI) processes. The proposed framework demonstrates that tailored GPU parallelization is effective in overcoming the computational bottlenecks of unstructured FEM-based atmospheric models, facilitating high-resolution simulations on heterogeneous architectures.
General Assessment
This manuscript presents a GPU-accelerated implementation of the Fluidity-Atmosphere dynamical core, with a focus on two major computational bottlenecks in unstructured finite-element atmospheric models: element-wise computations and global sparse matrix assembly. The authors design GPU-oriented data structures and optimized CUDA kernels, achieving substantial kernel-level speedups (up to ~100–900×) and an overall acceleration of up to 8.57× in a hybrid MPI+GPU configuration.
The topic is timely and relevant. Unstructured-mesh FEM atmospheric models are increasingly important due to their geometric flexibility and suitability for high-resolution simulations over complex terrain. However, their computational inefficiency remains a major obstacle. This work addresses a meaningful gap by targeting GPU acceleration in a realistic dynamical core rather than in isolated kernels.
Overall, the manuscript is well organized, and the implementation appears technically sound. The reported performance improvements are significant, and the inclusion of code and data is consistent with GMD’s reproducibility standards. These aspects make the work potentially valuable to the community.
However, despite these strengths, several aspects of the manuscript could be further strengthened to improve clarity and rigor. In particular, the discussion of performance results would benefit from a more detailed interpretation, especially regarding the gap between kernel-level and end-to-end speedups. In addition, some implementation aspects—such as portability and parameter choices—would benefit from further clarification. Addressing these points would help provide a more complete and transparent assessment of the proposed approach.
I therefore recommend minor revision before publication.
Major Comments
1. Gap Between Kernel-Level and End-to-End Performance
The manuscript reports very high kernel-level speedups (up to ~900×), while the overall application speedup is approximately 8.57×. This discrepancy is expected in complex applications, but it is not sufficiently discussed in the current manuscript.
In particular, the contribution of non-accelerated components—such as the sparse linear solver and CPU–GPU data transfer—is not clearly quantified. As a result, it remains unclear which parts of the workflow dominate the runtime after GPU acceleration.
The authors are encouraged to provide a clearer breakdown of the total runtime and to discuss the limiting factors for end-to-end performance.
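To illustrate why such a breakdown matters, the gap can be framed with a simple Amdahl-style estimate (the runtime fractions below are hypothetical, chosen only for illustration, not taken from the manuscript): if roughly 90% of the original runtime is accelerated ~100× while the remaining 10% (linear solver, CPU–GPU transfers, host code) is unchanged, the overall speedup is bounded near 9×, broadly consistent with the reported 8.57×.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style estimate: `fractions` are shares of the original
    runtime, `speedups` the factor each share is accelerated by."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))

# Hypothetical split: 90% of runtime accelerated 100x on the GPU,
# 10% (solver + data transfers) remaining at CPU speed.
print(round(overall_speedup([0.9, 0.1], [100.0, 1.0]), 2))  # ~9.17
```

A table of such measured fractions, before and after acceleration, would make the limiting factors immediately apparent.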
2. Limited Discussion on Portability and Generality
The current implementation is closely tied to CUDA and NVIDIA GPUs. Although the manuscript briefly mentions the possibility of porting the approach to other platforms (e.g., via HIP), this point is only touched upon and would benefit from a more explicit discussion.
A short discussion of the dependence on CUDA-specific features (for example, atomic operations and memory hierarchy), the potential challenges in porting, and the expected generality of the proposed approach would improve the completeness of the manuscript.
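To make the portability concern concrete: global assembly in unstructured FEM presumably relies on atomic additions, since neighbouring elements accumulate into the same global matrix entries concurrently. A serial Python sketch of this scatter-add pattern (with a hypothetical two-element mesh; not code from the manuscript) is:

```python
from collections import defaultdict

def assemble(elements, element_matrices):
    """Scatter-add local element matrices into a global sparse matrix.

    On a GPU, each element would be processed by its own thread, and the
    `+=` below must become an atomic add (CUDA atomicAdd, or its HIP
    counterpart) because elements sharing a node write to the same entry.
    """
    global_matrix = defaultdict(float)  # (row, col) -> value
    for nodes, ke in zip(elements, element_matrices):
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                global_matrix[(i, j)] += ke[a][b]
    return dict(global_matrix)

# Hypothetical 1D mesh of two linear elements sharing node 1.
elements = [(0, 1), (1, 2)]
ke = [[1.0, -1.0], [-1.0, 1.0]]
K = assemble(elements, [ke, ke])
# The shared diagonal entry (1, 1) receives contributions from both elements.
```

Discussing whether the implementation depends on such atomics (and on floating-point atomic performance, which differs across vendors) would clarify the porting effort.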
Minor Comments
1. Consistency between abstract and reported results
The abstract reports an overall speedup of 8.57×, while later sections mention 10.28× under certain configurations. This inconsistency may cause confusion.
The authors are encouraged to clearly specify the conditions under which each value is obtained and ensure consistent reporting throughout the manuscript.
2. Kernel launch configuration
Thread block sizes are described as being “empirically tuned,” but no further details are provided.
It would be useful to report the selected block sizes, briefly describe the tuning procedure (e.g., the range of configurations tested), and comment on how sensitive kernel performance is to these choices.
Recommendation
Based on the comments above, I recommend minor revision before the manuscript can be considered for publication.
The work is relevant and technically promising, but the manuscript would benefit from a clearer discussion of end-to-end performance limitations, a more complete treatment of portability, and several clarifications in the presentation of results and implementation details.
I hope these comments are helpful to the authors in improving the clarity and completeness of the manuscript.