This work is distributed under the Creative Commons Attribution 4.0 License.
GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework
Abstract. The three-dimensional unstructured-mesh finite-element atmospheric dynamical framework is gaining significance owing to its flexibility in representing complex topography and its capability for multi-scale simulation at high resolution. However, this framework suffers from substantial performance bottlenecks. Unlike structured-grid models, the unstructured finite-element method (FEM) must frequently access irregular mesh connectivity among nodes, edges, and elements, causing indirect memory addressing, poor data locality, and severe memory-bandwidth bottlenecks on conventional CPU architectures. Consequently, element-wise computations and global assembly are the primary contributors to runtime in high-resolution simulations.
This study develops a GPU-parallel implementation of the Fluidity-Atmosphere dynamical core to address these challenges. GPU-oriented data structures and optimized kernels are designed to exploit the computing power of GPUs efficiently. These kernels parallelize element integration and provide efficient solvers for fixed-size matrices, while a parallel assembly strategy enhances memory throughput during global sparse-matrix construction. On the NVIDIA A100 GPU, the optimized kernels achieve speedups of over 100× for element-wise computations and up to 389.02× for global matrix assembly, resulting in an overall acceleration of 8.57× with four message passing interface (MPI) processes. The proposed framework demonstrates that tailored GPU parallelization can overcome the computational bottlenecks of unstructured FEM-based atmospheric models, facilitating high-resolution simulations on heterogeneous architectures.
Status: open (until 02 May 2026)
- RC1: 'Comment on egusphere-2026-695', Anonymous Referee #1, 29 Mar 2026
- RC2: 'Comment on egusphere-2026-695', Anonymous Referee #2, 10 Apr 2026
General Comments: This paper presents a systematic GPU porting and optimization effort for two key computational hotspots—element-wise computations and global matrix assembly—in the unstructured-mesh finite-element atmospheric model Fluidity-Atmosphere. The authors implement CUDA kernels on NVIDIA A100 GPUs and demonstrate significant speedups using a 3D mountain wave test case. Overall, the work is solid, the results are clearly presented, and the contribution offers practical value to the atmospheric modeling community, particularly for those working with unstructured mesh frameworks.
Specific comments:
- In Section 4.3, the authors use atomic operations to handle concurrent updates during parallel assembly and report speedups up to 389×. Given that multiple elements often share nodes in unstructured meshes, atomic contention may affect performance. It would be helpful to briefly comment on the observed contention level (e.g., cache hit rate, warp efficiency) or explain why this does not become a bottleneck in the current test case (e.g., hardware capabilities of the A100).
- The Introduction highlights anisotropic adaptive meshing as a key feature of Fluidity-Atmosphere. However, the GPU performance evaluation is conducted with a static mesh. Please clarify whether the current GPU implementation supports adaptive mesh refinement (AMR) and, if so, what additional overheads it introduces.
- Several kernels achieve speedups of two to three orders of magnitude (Tables 5 and 9), yet the single-process end-to-end speedup is 3.36× (Table 10). This gap is understandable given that the linear solver remains on the CPU. It would be valuable to include a brief analysis of how the runtime distribution across modules changes after GPU acceleration — specifically, what proportion of time the linear solver now occupies. This would clearly illustrate the "bottleneck shift" and motivate future work on GPU-based solvers.
- The claimed "10.28× speedup" in the abstract is inconsistent with the data presented in Table 10: the single-GPU configuration achieves only 3.36× speedup, while the 10.28× figure corresponds to the 4 MPI + GPU hybrid configuration. The abstract does not specify the configuration to which the reported speedup applies. Please clearly state the applicable conditions for the reported speedup, either in the abstract or in the main text. Furthermore, the speedup achieved by the 4 MPI + GPU configuration relative to a single CPU is substantially higher than that of a single GPU relative to a single CPU. This nonlinear behavior requires a proper explanation (e.g., limited scalability of multi-process CPU execution, higher GPU efficiency on larger problem sizes) rather than being presented merely as a data point without analysis.
- In Tables 4–6, consider adding a brief note on the number of kernel calls and the mesh size used in these tests to aid reproducibility.
- The profiling data in Table 1 is presented, but its role in guiding optimization priorities is not discussed. Please clarify how this timing breakdown informed the selection of modules for GPU acceleration.
- Appendices: The CUDA code snippets are valuable for community reuse. Please verify that they are complete and consistent with the descriptions in the main text.
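The contention concern raised in the parallel-assembly comment above can be illustrated with a small sketch. The connectivity and element values below are hypothetical toy data, not taken from the manuscript; on a GPU, each `+=` into a shared global entry would correspond to an `atomicAdd`, and the node valence (the number of elements writing to a node) bounds how many atomics serialize on a single memory location.

```python
from collections import Counter

# Hypothetical toy connectivity: each element lists the global node
# indices it touches (four nodes per tetrahedral element).
elements = [
    (0, 1, 2, 3),
    (1, 2, 3, 4),
    (2, 3, 4, 5),
    (0, 2, 3, 5),
]

# Placeholder element-local contributions, standing in for the result
# of element-wise integration.
local_vals = [[1.0, 1.0, 1.0, 1.0] for _ in elements]

# Global assembly by scatter-add: on a GPU, each element would be
# handled by one thread, and the += below would be an atomicAdd
# whenever two elements share a node.
n_nodes = 1 + max(n for el in elements for n in el)
global_vec = [0.0] * n_nodes
for el, vals in zip(elements, local_vals):
    for node, v in zip(el, vals):
        global_vec[node] += v  # contended update on shared nodes

# Contention proxy: how many elements write to each node. The maximum
# valence bounds the number of atomics serialized on one location.
valence = Counter(n for el in elements for n in el)
max_contention = max(valence.values())
```

With this toy mesh, nodes 2 and 3 are shared by all four elements, so at most four atomic updates would contend on one entry; reporting such a valence statistic for the real mesh is one way to address the referee's question.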
Citation: https://doi.org/10.5194/egusphere-2026-695-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 165 | 56 | 16 | 237 | 18 | 17 |
General Assessment
This manuscript presents a GPU-accelerated implementation of the Fluidity-Atmosphere dynamical core, with a focus on two major computational bottlenecks in unstructured finite-element atmospheric models: element-wise computations and global sparse matrix assembly. The authors design GPU-oriented data structures and optimized CUDA kernels, achieving substantial kernel-level speedups (up to ~100–900×) and an overall acceleration of up to 8.57× in a hybrid MPI+GPU configuration.
The topic is timely and relevant. Unstructured-mesh FEM atmospheric models are increasingly important due to their geometric flexibility and suitability for high-resolution simulations over complex terrain. However, their computational inefficiency remains a major obstacle. This work addresses a meaningful gap by targeting GPU acceleration in a realistic dynamical core rather than in isolated kernels.
Overall, the manuscript is well organized, and the implementation appears technically sound. The reported performance improvements are significant, and the inclusion of code and data is consistent with GMD’s reproducibility standards. These aspects make the work potentially valuable to the community.
However, despite these strengths, several aspects of the manuscript could be further strengthened to improve clarity and rigor. In particular, the discussion of performance results would benefit from a more detailed interpretation, especially regarding the gap between kernel-level and end-to-end speedups. In addition, some implementation aspects—such as portability and parameter choices—would benefit from further clarification. Addressing these points would help provide a more complete and transparent assessment of the proposed approach.
I therefore recommend minor revision before publication.
Major Comments

1. Gap Between Kernel-Level and End-to-End Performance
The manuscript reports very high kernel-level speedups (up to ~900×), while the overall application speedup is approximately 8.57×. This discrepancy is expected in complex applications, but it is not sufficiently discussed in the current manuscript.
In particular, the contribution of non-accelerated components—such as the sparse linear solver and CPU–GPU data transfer—is not clearly quantified. As a result, it remains unclear which parts of the workflow dominate the runtime after GPU acceleration.
The authors are encouraged to provide a clearer breakdown of the total runtime and to discuss the limiting factors for end-to-end performance.
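The gap between kernel-level and end-to-end speedup discussed above follows directly from Amdahl's law; a small sketch makes the point concrete. The 0.9 accelerated fraction below is an illustrative assumption, not a value measured from the manuscript.

```python
def amdahl_overall(f_accel, kernel_speedup):
    """Overall speedup when a fraction f_accel of the original runtime
    is accelerated by kernel_speedup (Amdahl's law)."""
    return 1.0 / ((1.0 - f_accel) + f_accel / kernel_speedup)

# Even a ~900x kernel speedup yields under 10x overall if only 90% of
# the runtime is accelerated: the remaining 10% (e.g. the CPU-resident
# linear solver and host-device transfers) dominates.
overall = amdahl_overall(0.9, 900.0)

# Conversely, in the limit of infinite kernel speedup, an 8.57x
# end-to-end result implies roughly 1 - 1/8.57 ~ 88% of the original
# runtime was accelerated.
f_implied = 1.0 - 1.0 / 8.57
```

A runtime breakdown of this kind would let readers verify which fraction of the workflow remains serial after GPU acceleration.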
2. Limited Discussion on Portability and Generality
The current implementation is closely tied to CUDA and NVIDIA GPUs. Although the manuscript briefly mentions the possibility of porting the approach to other platforms (e.g., via HIP), this point is only touched upon and would benefit from a more explicit discussion.
A short discussion of the dependence on CUDA-specific features (for example, atomic operations and memory hierarchy), the potential challenges in porting, and the expected generality of the proposed approach would improve the completeness of the manuscript.
Minor Comments

1. Consistency between abstract and reported results
The abstract reports an overall speedup of 8.57×, while later sections mention 10.28× under certain configurations. This inconsistency may cause confusion.
The authors are encouraged to clearly specify the conditions under which each value is obtained and ensure consistent reporting throughout the manuscript.
2. Kernel launch configuration
Thread block sizes are described as being “empirically tuned,” but no further details are provided.
It would be useful to report the chosen block sizes and briefly describe how they were tuned.
Recommendation
Based on the comments above, I recommend minor to moderate revision before the manuscript can be considered for publication.
The work is relevant and technically promising, but the manuscript would benefit from a clearer discussion of end-to-end performance limitations, a more complete treatment of portability, and several clarifications in the presentation of results and implementation details.
I hope these comments are helpful to the authors in improving the clarity and completeness of the manuscript.