GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework
Abstract. The three-dimensional unstructured-mesh finite-element atmospheric dynamical framework is gaining significance owing to its flexibility in representing complex topography and its capability for multi-scale simulation at high resolution. However, this framework faces substantial computational bottlenecks. Unlike structured-grid models, the unstructured finite-element method (FEM) must frequently access irregular mesh connectivity among nodes, edges, and elements, causing indirect memory addressing, poor data locality, and substantial memory-bandwidth bottlenecks on conventional CPU architectures. Consequently, element-wise computations and global assembly are the primary contributors to runtime in high-resolution simulations.
This study develops a GPU-parallel implementation of the Fluidity-Atmosphere dynamical core to address these challenges. GPU-oriented data structures and optimized kernels are designed to exploit the computing power of GPUs efficiently. These kernels parallelize element integration and provide efficient solvers for small fixed-size matrices; a parallel assembly strategy enhances memory throughput during global sparse-matrix construction. On the NVIDIA A100 GPU, the optimized kernels achieve speedups of over 100× for element-wise computations and up to 389.02× for global matrix assembly, yielding an overall acceleration of 8.57× with four Message Passing Interface (MPI) processes. The proposed framework demonstrates that tailored GPU parallelization is effective in overcoming the computational bottlenecks of unstructured FEM-based atmospheric models, facilitating high-resolution simulations on heterogeneous architectures.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-695', Anonymous Referee #1, 29 Mar 2026
- AC1: 'Reply on RC1', Ximeng Fu, 30 Apr 2026
We sincerely thank the reviewer for the careful and thorough evaluation of our manuscript and for the highly positive assessment of our work. We are deeply encouraged that the reviewer finds the topic timely and relevant, the manuscript well organized, and the implementation technically sound.
We also greatly appreciate the reviewer’s constructive and insightful comments, particularly regarding the interpretation of end-to-end performance limitations, hardware portability, and the clarity of our implementation details. These suggestions have helped us to significantly improve the rigor, transparency, and completeness of the manuscript.
In response, we have made substantial revisions, including:
(1) adding a detailed runtime distribution analysis (Table 10) to explicitly quantify the "bottleneck shift" toward the CPU-bound linear solver;
(2) expanding the discussion on performance portability and hardware-specific tuning across different GPU architectures;
(3) ensuring absolute consistency in the reported end-to-end speedups (8.57×) and explaining the nonlinear scaling behavior; and
(4) providing precise quantitative rationale for our kernel launch configurations (block sizes of 128 and 256) based on SM occupancy profiling.
All changes have been incorporated into the revised manuscript and are clearly indicated. In the document we provide a detailed, point-by-point response to each comment.
- RC2: 'Comment on egusphere-2026-695', Anonymous Referee #2, 10 Apr 2026
General Comments: This paper presents a systematic GPU porting and optimization effort for two key computational hotspots—element-wise computations and global matrix assembly—in the unstructured-mesh finite-element atmospheric model Fluidity-Atmosphere. The authors implement CUDA kernels on NVIDIA A100 GPUs and demonstrate significant speedups using a 3D mountain wave test case. Overall, the work is solid, the results are clearly presented, and the contribution offers practical value to the atmospheric modeling community, particularly for those working with unstructured mesh frameworks.
Specific comments:
- In Section 4.3, the authors use atomic operations to handle concurrent updates during parallel assembly and report speedups up to 389×. Given that multiple elements often share nodes in unstructured meshes, atomic contention may affect performance. It would be helpful to briefly comment on the observed contention level (e.g., cache hit rate, warp efficiency) or explain why this does not become a bottleneck in the current test case (e.g., hardware capabilities of the A100). (A minimal sketch of the atomic-update pattern in question is given after this list.)
- The Introduction highlights anisotropic adaptive meshing as a key feature of Fluidity-Atmosphere. However, the GPU performance evaluation is conducted with a static mesh. Please clarify whether the current GPU implementation supports AMR, and what additional overheads it introduces.
- Several kernels achieve speedups of two to three orders of magnitude (Tables 5 and 9), yet the single-process end-to-end speedup is 3.36× (Table 10). This gap is understandable given that the linear solver remains on CPU. It would be valuable to include a brief analysis of how the runtime distribution across modules changes after GPU acceleration, specifically what proportion of time the linear solver now occupies. This would clearly illustrate the "bottleneck shift" and motivate future work on GPU-based solvers.
- The claimed "10.28× speedup" in the abstract is inconsistent with the data presented in Table 10: the single-GPU configuration achieves only 3.36× speedup, while the 10.28× figure corresponds to the 4 MPI + GPU hybrid configuration. The abstract does not specify the configuration to which the reported speedup applies. Please clearly state the applicable conditions for the reported speedup, either in the abstract or in the main text. Furthermore, the speedup achieved by the 4 MPI + GPU configuration relative to a single CPU is substantially higher than that of a single GPU relative to a single CPU. This nonlinear behavior requires a proper explanation (e.g., limited scalability of multi-process CPU execution, higher GPU efficiency on larger problem sizes) rather than being presented merely as a data point without analysis.
- To aid reproducibility, consider adding to Tables 4–6 a brief note on the number of kernel calls and the mesh size used in these tests.
- The profiling data in Table 1 is presented, but its role in guiding optimization priorities is not discussed. Please clarify how this timing breakdown informed the selection of modules for GPU acceleration.
- Appendices: The CUDA code snippets are valuable for community reuse. Please verify that they are complete and consistent with the descriptions in the main text.
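For reference, the atomic-update pattern discussed in the first comment above can be sketched as follows. This is a minimal illustration, not the authors' actual kernel; identifiers such as csr_pos (a hypothetical precomputed map from each local matrix entry to its slot in the global CSR value array) are not from the manuscript.

```cuda
// Minimal sketch of atomic scatter-assembly: one thread per tetrahedral
// element; the 4x4 local matrix is added into the global CSR value array
// with atomicAdd so that threads sharing a node do not race.
__global__ void assemble_p1_atomic(const double* __restrict__ local_mats, // [n_ele * 16]
                                   const int*    __restrict__ csr_pos,    // [n_ele * 16]
                                   double*       __restrict__ vals,       // global CSR values
                                   int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // element index
    if (e >= n_ele) return;
    for (int i = 0; i < 16; ++i)                     // 4x4 local P1 matrix
        atomicAdd(&vals[csr_pos[e * 16 + i]], local_mats[e * 16 + i]);
}
```

On devices of compute capability 6.0 and newer, including the A100, atomicAdd on double is a native hardware instruction, and in a tetrahedral mesh each global slot is shared by only a bounded number of elements, which limits contention in practice.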
Citation: https://doi.org/10.5194/egusphere-2026-695-RC2
- AC2: 'Reply on RC2', Ximeng Fu, 30 Apr 2026
We sincerely thank the reviewer for the careful and thorough evaluation of our manuscript and for the positive assessment of our work. We are encouraged that the reviewer finds the study to be solid, well-presented, and of practical value to the atmospheric modeling community.
We also greatly appreciate the reviewer’s constructive and insightful comments, particularly regarding performance interpretation, consistency of reported results, and reproducibility. These suggestions have helped us to significantly improve the clarity, rigor, and completeness of the manuscript.
In response, we have made substantial revisions, including:
(1) clarifying the performance implications of atomic operations and explaining our algorithmic choices based on A100 hardware capabilities;
(2) discussing the compatibility of our GPU implementation with adaptive mesh refinement (AMR) and quantifying the minimal data transfer overheads;
(3) introducing a detailed runtime distribution analysis to clearly illustrate the "bottleneck shift" towards the CPU-bound linear solver;
(4) ensuring consistency in the reported speedups and providing a robust explanation for the nonlinear scaling observed in the hybrid MPI+GPU configurations;
(5) improving reproducibility by explicitly including mesh sizes and kernel invocation counts in the relevant table captions;
(6) strengthening the logical flow by explicitly stating how the initial profiling data guided our GPU optimization priorities;
and (7) verifying the appendix code snippets to ensure complete consistency with the main text.
All changes have been incorporated into the revised manuscript and are clearly indicated.
In the document we provide a detailed, point-by-point response to each comment.
- RC3: 'Comment on egusphere-2026-695', Anonymous Referee #3, 30 Apr 2026
General Assessment: In their manuscript "GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework" the authors discuss the implementation of several performance-critical computational kernels of the code named in the title on GPUs using CUDA. This is followed by verification and performance benchmarking.
I second the first two reviewers in their opinion that this is a timely and relevant topic. I also would like to emphasise that I think that the work the authors have performed on implementation and testing is sound and worth reporting.
However, it is also my opinion that the paper as such is in need of definite improvement. It leaves several important technical aspects unclear. These need to be addressed before publication.
Detailed Comments:
* Please specify explicitly in the paper at some point what kind of finite element is used in the discretisation and in the tests of the code. Thus far, the only explicit statement in the manuscript is that you are using continuous elements. However, that still leaves a lot of possibilities, especially in the context of a mixed-element discretisation for a CFD-type problem. This information is an important one, as it makes a significant difference computationally and w.r.t. performance optimisation whether you are using a P1 Lagrange element pair, a P3 one, or maybe a conforming Crouzeix-Raviart element. From the text (e.g. Fig. 2 or the mentioning of a 4 x 4 local element matrix on p. 12) it seems that you are using a classical P1 element.
* In your experiments you measure runtimes and provide speed-up values comparing your GPU implementation to the CPU one. On the one hand, the time-to-solution is, of course, a highly important aspect, since this is what directly affects the user. However, in order to evaluate the quality and potential of your optimisations, other aspects would also be of significant interest.
a) As you correctly state in Table 2, the theoretical FP64 peak performance of the A100 PCIe is 9.7 TFLOP/s. What percentage of this is reached by your GPU code?
b) You repeatedly mention the "high bandwidth and massive parallelism" of GPUs. However, from Table 5 I would deduce that roughly a factor of 20 of the speed-up results from optimisations such as "loop unrolling" and "templating". These would also benefit the CPU code, of course. But, if I understand correctly, these are not in the version of the CPU code used for comparison, correct?
c) In line 342 you mention "serial CPU execution"; does that mean you are comparing a CPU code running on a single one of the AMD EPYC's 64 cores to the GPU version? Does this hold for all results apart from the ones given in Table 9?
d) What is the meaning of "CPU" in Table 9? Are you talking about a setup
with 4 physical EPYC CPUs and one GPU here, or is this implying that you
use MPI to parallelise the CPU part of the code and run it on 4 cores
of the EPYC's 64 cores?
* You repeatedly mention the high bandwidth of GPUs. It would be nice to explain in more detail what you are referring to with this. Accessing host memory from the GPU through PCIe has a bandwidth of 32 GB/s (following Table 2). This is significantly slower than the maximal bandwidth for memory access by the AMD EPYC, the latter being around 204.8 GB/s. So you are presumably referring to the bandwidth to the A100's HBM (device memory), which is 1.56 TB/s, aren't you? I'd also suggest adding these two hardware characteristics to Table 2.
* As a note, the general trend in finite-element simulations for HPC seems to be to discard assembling a global matrix altogether and to use matrix-free methods instead; see e.g. Kronbichler et al. (2012, 2019), Rudi (2015) or May (2015) to name a few. It would be appreciable to at least comment on this in your contribution.
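For context, the matrix-free approach mentioned here avoids the global matrix entirely and applies the operator element by element. A minimal sketch, under the assumption of precomputed 4x4 P1 element matrices; all names are hypothetical and not from the manuscript:

```cuda
// Sketch of matrix-free operator application: instead of assembling a
// global matrix, compute y += A*x element by element, gathering x at the
// element's nodes and scattering the result back with atomicAdd.
__global__ void apply_matrix_free(const int*    __restrict__ ele2node, // [n_ele * 4]
                                  const double* __restrict__ Ae,       // [n_ele * 16]
                                  const double* __restrict__ x,
                                  double*       __restrict__ y,
                                  int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_ele) return;
    const int* nod = &ele2node[e * 4];
    for (int i = 0; i < 4; ++i) {
        double acc = 0.0;
        for (int j = 0; j < 4; ++j)
            acc += Ae[e * 16 + i * 4 + j] * x[nod[j]];  // gather
        atomicAdd(&y[nod[i]], acc);                     // scatter
    }
}
```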
* One of your key arguments (see e.g. line 162) is that element-wise computations are strictly local without cross-element dependencies. While that is mostly valid for continuous elements, I take from Li (2021) that Fluidity-Atmosphere is using a mixed DG/CG discretisation. Now in DG you typically have coupling to neighbouring elements, because your bilinear forms contain averages, jumps and fluxes on or over the facets of your tetrahedrons. Can you comment on that?
* Concerning Algorithm 1 I have several questions:
a) As you appear to be using P1 elements, the gradients of the shape functions are constant. So why do you need to recompute these at every Gauss point?
b) You do not explicitly state it, but it seems to me that you are using a standard affine mapping from the reference to the actual element. In this case the Jacobian matrix is also constant. Thus, it can be computed once for the element and need not be recomputed at each Gauss point?
c) In general, precomputing and storing some quantities can improve performance; see e.g. Kronbichler (2012), Boehm (2025). Have you considered this, or can you comment on it?
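To make points a) and b) concrete: for an affine P1 tetrahedron, both the Jacobian and the shape-function gradients are constant per element, so they can be computed once outside the Gauss-point loop. A minimal sketch with hypothetical names, not the manuscript's Algorithm 1:

```cuda
// For an affine P1 tet the Jacobian J = [x1-x0 | x2-x0 | x3-x0] is
// constant, so it and the physical shape-function gradients can be
// computed once per element instead of at every Gauss point.
__device__ void p1_constant_gradients(const double xyz[4][3], // vertex coords
                                      double grad[4][3],      // d(phi_i)/dx_d
                                      double* detJ)
{
    double J[3][3];
    for (int j = 0; j < 3; ++j)           // column j = vertex (j+1) - vertex 0
        for (int d = 0; d < 3; ++d)
            J[d][j] = xyz[j + 1][d] - xyz[0][d];
    *detJ = J[0][0]*(J[1][1]*J[2][2]-J[1][2]*J[2][1])
          - J[0][1]*(J[1][0]*J[2][2]-J[1][2]*J[2][0])
          + J[0][2]*(J[1][0]*J[2][1]-J[1][1]*J[2][0]);
    const double s = 1.0 / *detJ;
    double inv[3][3];                     // J^{-1} via the adjugate
    inv[0][0] = (J[1][1]*J[2][2]-J[1][2]*J[2][1])*s;
    inv[0][1] = (J[0][2]*J[2][1]-J[0][1]*J[2][2])*s;
    inv[0][2] = (J[0][1]*J[1][2]-J[0][2]*J[1][1])*s;
    inv[1][0] = (J[1][2]*J[2][0]-J[1][0]*J[2][2])*s;
    inv[1][1] = (J[0][0]*J[2][2]-J[0][2]*J[2][0])*s;
    inv[1][2] = (J[0][2]*J[1][0]-J[0][0]*J[1][2])*s;
    inv[2][0] = (J[1][0]*J[2][1]-J[1][1]*J[2][0])*s;
    inv[2][1] = (J[0][1]*J[2][0]-J[0][0]*J[2][1])*s;
    inv[2][2] = (J[0][0]*J[1][1]-J[0][1]*J[1][0])*s;
    // Reference gradients of P1: phi_0 = 1-r-s-t, phi_1 = r, phi_2 = s, phi_3 = t.
    const double gref[4][3] = {{-1,-1,-1},{1,0,0},{0,1,0},{0,0,1}};
    for (int i = 0; i < 4; ++i)           // chain rule: d(phi)/dx_d = sum_k gref_k * inv[k][d]
        for (int d = 0; d < 3; ++d)
            grad[i][d] = gref[i][0]*inv[0][d] + gref[i][1]*inv[1][d]
                       + gref[i][2]*inv[2][d];
}
```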
* Concerning your statement in Section 3.4 on the function-call overhead of these lightweight functions: is there no option to force the compiler to do inlining? Also, I fail to see why there should be a problem with indirections in the case of using LAPACK's DSPEV? This is dense linear algebra after all. Also, a 6 x 6 matrix of doubles occupies 288 bytes; shouldn't that easily fit into a typical 32K L1 data cache?
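On the inlining question: CUDA does provide a qualifier to force inlining of small device functions. A minimal illustration (the helper name is hypothetical):

```cuda
// __forceinline__ instructs the compiler to inline a small __device__
// helper, avoiding per-call overhead in hot element loops.
__forceinline__ __device__ double dot3(const double* a, const double* b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```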
* With respect to the global matrix assembly, could you please provide more details on the procedure? Where does the assembly happen? Is the matrix first assembled in device memory and afterwards copied to the host memory, so that it can be used in PETSc? Or is the strategy different? I do not see that this is explained anywhere, and there is also no mention of something in this direction in Section 5.3.3.
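One possible strategy matching the description in this comment, sketched purely for illustration (the manuscript does not confirm this is the actual procedure; the sketch assumes a real-valued PETSc build with 32-bit PetscInt):

```cuda
// Sketch: assemble the CSR value array in device memory, copy it back to
// the host, and insert the rows into a PETSc matrix for the solver.
#include <vector>
#include <cuda_runtime.h>
#include <petscmat.h>

void push_csr_to_petsc(Mat A, int nrows,
                       const int* row_ptr, const int* col_idx, // host CSR pattern
                       const double* d_vals, int nnz)          // device CSR values
{
    std::vector<double> h_vals(nnz);
    cudaMemcpy(h_vals.data(), d_vals, nnz * sizeof(double),
               cudaMemcpyDeviceToHost);
    for (int r = 0; r < nrows; ++r) {
        int n = row_ptr[r + 1] - row_ptr[r];
        MatSetValues(A, 1, &r, n, &col_idx[row_ptr[r]],
                     &h_vals[row_ptr[r]], INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
}
```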
* Concerning your validation strategy, I must admit that I find this not overly convincing. I agree with you that computations on the CPU and GPU and with different codes will not give bitwise identical results, which in the case of AMR indeed might lead to different meshes developing. However, assuming that both approaches lead to a sufficiently accurate approximate solution, it should be perfectly okay to compare these, as this is a FEM approach, shouldn't it? Maybe I do not understand how the "significant interpolation errors" arise? Anyway, wouldn't it be more convincing to use a fixed fine mesh for both cases and perform simulations that include the non-stationary initial part?
* What precisely is shown in Table 3? What is ele_val_scalar, or what does it denote?
* Appendix A, B & C: Showing these codelets is generally a nice idea. However, it would be more helpful if there was an explanation of the details of the template parameters and the function parameters.
Can you explain why __restrict__ is used in dot_product_t, but not for the pointers in matmul_t? I'd assume that at least the memory buffer C must not alias with a or b?
Also, in matmul_t, line 428, it looks as if access to C and b was strided? Is this behaviour not counter-intuitive performance-wise in the innermost loop?
In line 446 it seems a little bit fishy that the template arguments in the call to matmul_t are int literals and do not depend on the template arguments of dshape_tensor_dshape_unroll_t?
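For illustration, a version of matmul_t with uniformly applied __restrict__ qualifiers, as the comment suggests. This is a sketch assuming row-major storage, not the appendix code itself:

```cuda
// Sketch: __restrict__ on all three pointers promises no aliasing, which
// matters most for the output buffer C; the innermost loop is unit-stride
// on a but strided on b, as discussed above.
template <int M, int N, int K>
__device__ void matmul_t(const double* __restrict__ a,   // M x K
                         const double* __restrict__ b,   // K x N
                         double*       __restrict__ C)   // M x N
{
    #pragma unroll
    for (int i = 0; i < M; ++i)
        #pragma unroll
        for (int j = 0; j < N; ++j) {
            double acc = 0.0;
            #pragma unroll
            for (int k = 0; k < K; ++k)
                acc += a[i * K + k] * b[k * N + j];
            C[i * N + j] = acc;
        }
}
```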
Line 6: While element-wise computations and global assembly definitely belong to the "primary contributors to the runtime", they are not the only ones. Especially, as you later also mention (line 52), solving the sparse linear system itself is a significant contributor, too. The same holds for line 45. Thus, you might want to make your formulation more consistent.
Line 11: You mention a speed-up of over 100 here, but with respect to what?
Line 32: "FEM ... preserves conservation": I am not fully sure I understand this statement. Also, I feel qualified to state that not each and every FE discretisation is e.g. locally mass-conserving, so maybe re-think and re-formulate this statement.
Line 64: "However, the computation of elements and subsequent matrix assembly may pose significant challenges." Could you elaborate on why that is, or what aspect you are referring to?
Line 76: It would be beneficial to the reader to have an example of what an "element-wise computation" is already here.
Line 92/93: Try to consistently use "non-linear" or "nonlinear".
Line 98: "MPIs" -> "MPI"
Line 107: "its [PETSc's] contribution to the overall runtime was relatively small." Could this be quantified? Because in your kernel tests I see speed-ups between 100 and 900, but that is not reflected in the overall solution time between the CPU-only and the CPU+GPU code. So, where does the runtime then go, if not into solving the linear system?
Line 112: "shows that" -> "shows"
Line 166: I failed to understand which parameters can be tuned manually.
Line 277: "In GPU" -> "In our GPU"?
Line 279: "computes": isn't this a lookup of the global index in an array?
Line 334: What is a "scalar data structure"?
Line 346: "function template optimization" sounds like a technical term, but AFAIK it isn't? You presumably mean optimization via the use of template functions?
References:
There are several occurrences of entries such as this:
https://doi.org/https://doi.org/10.1016/j.procs.2010.04.121
You should probably ensure that the DOI entries in your BibTeX file do not contain a resolver part, but only a pure DOI.
---
Kronbichler, 2012, A generic interface for parallel cell-based finite element operator application, 10.1016/j.compfluid.2012.04.012
Kronbichler, 2019, Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators, 10.1145/3325864
Rudi, 2015, An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth’s Mantle, 10.1145/2807591.2807675
Boehm, 2025, Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids, 10.1137/24m1653756
Citation: https://doi.org/10.5194/egusphere-2026-695-RC3
General Assessment
This manuscript presents a GPU-accelerated implementation of the Fluidity-Atmosphere dynamical core, with a focus on two major computational bottlenecks in unstructured finite-element atmospheric models: element-wise computations and global sparse matrix assembly. The authors design GPU-oriented data structures and optimized CUDA kernels, achieving substantial kernel-level speedups (up to ~100–900×) and an overall acceleration of up to 8.57× in a hybrid MPI+GPU configuration.
The topic is timely and relevant. Unstructured-mesh FEM atmospheric models are increasingly important due to their geometric flexibility and suitability for high-resolution simulations over complex terrain. However, their computational inefficiency remains a major obstacle. This work addresses a meaningful gap by targeting GPU acceleration in a realistic dynamical core rather than in isolated kernels.
Overall, the manuscript is well organized, and the implementation appears technically sound. The reported performance improvements are significant, and the inclusion of code and data is consistent with GMD’s reproducibility standards. These aspects make the work potentially valuable to the community.
However, despite these strengths, several aspects of the manuscript could be further strengthened to improve clarity and rigor. In particular, the discussion of performance results would benefit from a more detailed interpretation, especially regarding the gap between kernel-level and end-to-end speedups. In addition, some implementation aspects—such as portability and parameter choices—would benefit from further clarification. Addressing these points would help provide a more complete and transparent assessment of the proposed approach.
I therefore recommend minor revision before publication.
Major Comments
1. Gap Between Kernel-Level and End-to-End Performance
The manuscript reports very high kernel-level speedups (up to ~900×), while the overall application speedup is approximately 8.57×. This discrepancy is expected in complex applications, but it is not sufficiently discussed in the current manuscript.
In particular, the contribution of non-accelerated components—such as the sparse linear solver and CPU–GPU data transfer—is not clearly quantified. As a result, it remains unclear which parts of the workflow dominate the runtime after GPU acceleration.
The authors are encouraged to provide a clearer breakdown of the total runtime and to discuss the limiting factors for end-to-end performance.
2. Limited Discussion on Portability and Generality
The current implementation is closely tied to CUDA and NVIDIA GPUs. Although the manuscript briefly mentions the possibility of porting the approach to other platforms (e.g., via HIP), this point is only touched upon and would benefit from a more explicit discussion.
A short discussion of the dependence on CUDA-specific features (for example, atomic operations and memory hierarchy), the potential challenges in porting, and the expected generality of the proposed approach would improve the completeness of the manuscript.
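As an illustration of the kind of porting path meant here, a thin compile-time shim can map the CUDA runtime calls onto their HIP counterparts. This is a sketch; the gpu* macro names are hypothetical:

```cuda
// Portability shim: most CUDA runtime calls used in such a port have
// one-to-one HIP counterparts, selected at compile time.
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc       hipMalloc
  #define gpuMemcpy       hipMemcpy
  #define gpuMemcpyHtoD   hipMemcpyHostToDevice
  #define gpuDeviceSync   hipDeviceSynchronize
#else
  #include <cuda_runtime.h>
  #define gpuMalloc       cudaMalloc
  #define gpuMemcpy       cudaMemcpy
  #define gpuMemcpyHtoD   cudaMemcpyHostToDevice
  #define gpuDeviceSync   cudaDeviceSynchronize
#endif
```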
Minor Comments
1. Consistency between abstract and reported results
The abstract reports an overall speedup of 8.57×, while later sections mention 10.28× under certain configurations. This inconsistency may cause confusion.
The authors are encouraged to clearly specify the conditions under which each value is obtained and ensure consistent reporting throughout the manuscript.
2. Kernel launch configuration
Thread block sizes are described as being “empirically tuned,” but no further details are provided. It would be useful to report the chosen block sizes and the quantitative rationale behind them (for example, SM occupancy or profiling measurements).
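One quantitative way to justify such a choice is CUDA's occupancy API, which suggests the block size maximizing SM occupancy for a given kernel. A minimal sketch; the kernel here is a hypothetical stand-in:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: one thread per element.
__global__ void element_kernel(int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_ele) return;
    // ... per-element work ...
}

void launch_with_occupancy(int n_ele)
{
    int min_grid = 0, block = 0;
    // Returns the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, element_kernel, 0, 0);
    int grid = (n_ele + block - 1) / block;
    element_kernel<<<grid, block>>>(n_ele);
}
```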
Recommendation
Based on the comments above, I recommend minor to moderate revision before the manuscript can be considered for publication.
The work is relevant and technically promising, but the manuscript would benefit from a clearer discussion of end-to-end performance limitations, a more complete treatment of portability, and several clarifications in the presentation of results and implementation details.
I hope these comments are helpful to the authors in improving the clarity and completeness of the manuscript.