GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework
Abstract. The three-dimensional unstructured-mesh finite-element atmospheric dynamical framework is gaining significance owing to its flexibility in representing complex topography and its capability for multi-scale simulation at high resolution. However, this framework faces substantial computational bottlenecks. Unlike structured-grid models, the unstructured finite-element method (FEM) must frequently access irregular mesh connectivity among nodes, edges, and elements, causing indirect memory addressing, poor data locality, and substantial memory-bandwidth bottlenecks on conventional CPU architectures. Consequently, element-wise computations and global assembly are the primary contributors to runtime in high-resolution simulations.
This study develops a GPU-parallel implementation of the Fluidity-Atmosphere dynamical core to address these challenges. GPU-oriented data structures and optimized kernels are designed to exploit the computing power of GPUs efficiently. These kernels parallelize element integration and provide efficient solvers for small fixed-size matrices; a parallel assembly strategy enhances memory throughput during global sparse-matrix construction. On the NVIDIA A100 GPU, the optimized kernels achieve speedups of over 100× for element-wise computations and up to 389.02× for global matrix assembly, yielding an overall acceleration of 8.57× with four Message Passing Interface (MPI) processes. The proposed framework demonstrates that tailored GPU parallelization is effective in overcoming the computational bottlenecks of unstructured FEM-based atmospheric models, facilitating high-resolution simulations on heterogeneous architectures.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-695', Anonymous Referee #1, 29 Mar 2026
- AC1: 'Reply on RC1', Ximeng Fu, 30 Apr 2026
We sincerely thank the reviewer for the careful and thorough evaluation of our manuscript and for the highly positive assessment of our work. We are deeply encouraged that the reviewer finds the topic timely and relevant, the manuscript well organized, and the implementation technically sound.
We also greatly appreciate the reviewer’s constructive and insightful comments, particularly regarding the interpretation of end-to-end performance limitations, hardware portability, and the clarity of our implementation details. These suggestions have helped us to significantly improve the rigor, transparency, and completeness of the manuscript.
In response, we have made substantial revisions, including:
(1) adding a detailed runtime distribution analysis (Table 10) to explicitly quantify the "bottleneck shift" toward the CPU-bound linear solver;
(2) expanding the discussion on performance portability and hardware-specific tuning across different GPU architectures;
(3) ensuring absolute consistency in the reported end-to-end speedups (8.57×) and explaining the nonlinear scaling behavior; and
(4) providing precise quantitative rationale for our kernel launch configurations (block sizes of 128 and 256) based on SM occupancy profiling.
All changes have been incorporated into the revised manuscript and are clearly indicated. In the document we provide a detailed, point-by-point response to each comment.
- RC2: 'Comment on egusphere-2026-695', Anonymous Referee #2, 10 Apr 2026
General Comments: This paper presents a systematic GPU porting and optimization effort for two key computational hotspots—element-wise computations and global matrix assembly—in the unstructured-mesh finite-element atmospheric model Fluidity-Atmosphere. The authors implement CUDA kernels on NVIDIA A100 GPUs and demonstrate significant speedups using a 3D mountain wave test case. Overall, the work is solid, the results are clearly presented, and the contribution offers practical value to the atmospheric modeling community, particularly for those working with unstructured mesh frameworks.
Specific comments:
- In Section 4.3, the authors use atomic operations to handle concurrent updates during parallel assembly and report speedups up to 389×. Given that multiple elements often share nodes in unstructured meshes, atomic contention may affect performance. It would be helpful to briefly comment on the observed contention level (e.g., cache hit rate, warp efficiency) or explain why this does not become a bottleneck in the current test case (e.g., hardware capabilities of the A100). (A minimal sketch of the atomic-update pattern in question is given after this list.)
- The Introduction highlights anisotropic adaptive meshing as a key feature of Fluidity-Atmosphere. However, the GPU performance evaluation is conducted with a static mesh. Please clarify whether the current GPU implementation supports AMR, and what additional overheads it introduces.
- Several kernels achieve speedups of two to three orders of magnitude (Tables 5 and 9), yet the single-process end-to-end speedup is 3.36× (Table 10). This gap is understandable given that the linear solver remains on CPU. It would be valuable to include a brief analysis of how the runtime distribution across modules changes after GPU acceleration, specifically what proportion of time the linear solver now occupies. This would clearly illustrate the "bottleneck shift" and motivate future work on GPU-based solvers.
- The claimed "10.28× speedup" in the abstract is inconsistent with the data presented in Table 10: the single-GPU configuration achieves only 3.36× speedup, while the 10.28× figure corresponds to the 4 MPI + GPU hybrid configuration. The abstract does not specify the configuration to which the reported speedup applies. Please clearly state the applicable conditions for the reported speedup, either in the abstract or in the main text. Furthermore, the speedup achieved by the 4 MPI + GPU configuration relative to a single CPU is substantially higher than that of a single GPU relative to a single CPU. This nonlinear behavior requires a proper explanation (e.g., limited scalability of multi-process CPU execution, higher GPU efficiency on larger problem sizes) rather than being presented merely as a data point without analysis.
- To aid reproducibility, consider adding to Tables 4–6 a brief note on the number of kernel calls and the mesh size used in these tests.
- The profiling data in Table 1 is presented, but its role in guiding optimization priorities is not discussed. Please clarify how this timing breakdown informed the selection of modules for GPU acceleration.
- Appendices: The CUDA code snippets are valuable for community reuse. Please verify that they are complete and consistent with the descriptions in the main text.
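For reference, the atomic-update pattern discussed in the first comment above can be sketched as follows. This is a minimal illustration, not the authors' actual kernel; identifiers such as csr_pos (a hypothetical precomputed map from each local matrix entry to its slot in the global CSR value array) are not from the manuscript.

```cuda
// Minimal sketch of atomic scatter-assembly: one thread per tetrahedral
// element; the 4x4 local matrix is added into the global CSR value array
// with atomicAdd so that threads sharing a node do not race.
__global__ void assemble_p1_atomic(const double* __restrict__ local_mats, // [n_ele * 16]
                                   const int*    __restrict__ csr_pos,    // [n_ele * 16]
                                   double*       __restrict__ vals,       // global CSR values
                                   int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // element index
    if (e >= n_ele) return;
    for (int i = 0; i < 16; ++i)                     // 4x4 local P1 matrix
        atomicAdd(&vals[csr_pos[e * 16 + i]], local_mats[e * 16 + i]);
}
```

On devices of compute capability 6.0 and newer, including the A100, atomicAdd on double is a native hardware instruction, and in a tetrahedral mesh each global slot is shared by only a bounded number of elements, which limits contention in practice.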
Citation: https://doi.org/10.5194/egusphere-2026-695-RC2
- AC2: 'Reply on RC2', Ximeng Fu, 30 Apr 2026
We sincerely thank the reviewer for the careful and thorough evaluation of our manuscript and for the positive assessment of our work. We are encouraged that the reviewer finds the study to be solid, well-presented, and of practical value to the atmospheric modeling community.
We also greatly appreciate the reviewer’s constructive and insightful comments, particularly regarding performance interpretation, consistency of reported results, and reproducibility. These suggestions have helped us to significantly improve the clarity, rigor, and completeness of the manuscript.
In response, we have made substantial revisions, including:
(1) clarifying the performance implications of atomic operations and explaining our algorithmic choices based on A100 hardware capabilities;
(2) discussing the compatibility of our GPU implementation with adaptive mesh refinement (AMR) and quantifying the minimal data transfer overheads;
(3) introducing a detailed runtime distribution analysis to clearly illustrate the "bottleneck shift" towards the CPU-bound linear solver;
(4) ensuring consistency in the reported speedups and providing a robust explanation for the nonlinear scaling observed in the hybrid MPI+GPU configurations;
(5) improving reproducibility by explicitly including mesh sizes and kernel invocation counts in the relevant table captions;
(6) strengthening the logical flow by explicitly stating how the initial profiling data guided our GPU optimization priorities;
and (7) verifying the appendix code snippets to ensure complete consistency with the main text.
All changes have been incorporated into the revised manuscript and are clearly indicated.
In the document we provide a detailed, point-by-point response to each comment.
- RC3: 'Comment on egusphere-2026-695', Anonymous Referee #3, 30 Apr 2026
General Assessment: In their manuscript "GPU-accelerated Finite-Element Method for the Three-dimensional Unstructured Mesh Atmospheric Dynamic Framework" the authors discuss the implementation of several performance-critical computational kernels of the code named in the title on GPUs using CUDA. This is followed by verification and performance benchmarking.
I second the first two reviewers in their opinion that this is a timely and relevant topic. I also would like to emphasise that I think that the work the authors have performed on implementation and testing is sound and worth reporting.
However, it is also my opinion that the paper as such is in need of definite improvement. It leaves several important technical aspects unclear. These need to be addressed before publication.
Detailed Comments:
* Please specify explicitly in the paper at some point what kind of finite element is used in the discretisation and in the tests of the code. Thus far, the only explicit statement in the manuscript is that you are using continuous elements. However, that still leaves a lot of possibilities, especially in the context of a mixed-element discretisation for a CFD-type problem. This information is an important one, as it makes a significant difference computationally and w.r.t. performance optimisation whether you are using a P1 Lagrange element pair, a P3 one, or maybe a conforming Crouzeix-Raviart element. From the text (e.g. Fig. 2 or the mentioning of a 4 x 4 local element matrix on p. 12) it seems that you are using a classical P1 element.
* In your experiments you measure runtimes and provide speed-up values comparing your GPU implementation to the CPU one. On the one hand, the time-to-solution is, of course, a highly important aspect, since this is what directly affects the user. However, in order to evaluate the quality and potential of your optimisations, other aspects would also be of significant interest.
a) As you correctly state in Table 2, the theoretical FP64 peak performance of the A100 PCIe is 9.7 TFLOP/s. What percentage of this is reached by your GPU code?
b) You repeatedly mention the "high bandwidth and massive parallelism" of GPUs. However, from Table 5 I would deduce that roughly a factor of 20 of the speed-up results from optimisations such as "loop unrolling" and "templating". These would also benefit the CPU code, of course. But, if I understand correctly, these are not in the version of the CPU code used for comparison, correct?
c) In line 342 you mention "serial CPU execution"; does that mean you are comparing a CPU code running on a single one of the AMD EPYC's 64 cores to the GPU version? Does this hold for all results apart from the ones given in Table 9?
d) What is the meaning of "CPU" in Table 9? Are you talking about a setup
with 4 physical EPYC CPUs and one GPU here, or is this implying that you
use MPI to parallelise the CPU part of the code and run it on 4 cores
of the EPYC's 64 cores?
* You repeatedly mention the high bandwidth of GPUs. It would be nice to explain in more detail what you are referring to with this. Accessing host memory from the GPU through PCIe has a bandwidth of 32 GB/s (following Table 2). This is significantly slower than the maximal bandwidth for memory access by the AMD EPYC, the latter being around 204.8 GB/s. So you are presumably referring to the bandwidth to the A100's HBM (device memory), which is 1.56 TB/s, aren't you? I'd also suggest adding these two hardware characteristics to Table 2.
* As a note, the general trend in finite-element simulations for HPC seems to be to discard assembling a global matrix altogether and to use matrix-free methods instead; see e.g. Kronbichler et al. (2012, 2019), Rudi (2015) or May (2015) to name a few. It would be appreciable to at least comment on this in your contribution.
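For context, the matrix-free approach mentioned here avoids the global matrix entirely and applies the operator element by element. A minimal sketch, under the assumption of precomputed 4x4 P1 element matrices; all names are hypothetical and not from the manuscript:

```cuda
// Sketch of matrix-free operator application: instead of assembling a
// global matrix, compute y += A*x element by element, gathering x at the
// element's nodes and scattering the result back with atomicAdd.
__global__ void apply_matrix_free(const int*    __restrict__ ele2node, // [n_ele * 4]
                                  const double* __restrict__ Ae,       // [n_ele * 16]
                                  const double* __restrict__ x,
                                  double*       __restrict__ y,
                                  int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_ele) return;
    const int* nod = &ele2node[e * 4];
    for (int i = 0; i < 4; ++i) {
        double acc = 0.0;
        for (int j = 0; j < 4; ++j)
            acc += Ae[e * 16 + i * 4 + j] * x[nod[j]];  // gather
        atomicAdd(&y[nod[i]], acc);                     // scatter
    }
}
```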
* One of your key arguments (see e.g. line 162) is that element-wise computations are strictly local without cross-element dependencies. While that is mostly valid for continuous elements, I take from Li (2021) that Fluidity-Atmosphere is using a mixed DG/CG discretisation. Now in DG you typically have coupling to neighbouring elements, because your bilinear forms contain averages, jumps and fluxes on or over the facets of your tetrahedrons. Can you comment on that?
* Concerning Algorithm 1 I have several questions:
a) As you appear to be using P1 elements, the gradients of the shape functions are constant. So why do you need to recompute these at every Gauss point?
b) You do not explicitly state it, but it seems to me that you are using a standard affine mapping from the reference to the actual element. In this case the Jacobian matrix is also constant. Thus, it can be computed once for the element and need not be recomputed at each Gauss point?
c) In general, precomputing and storing some quantities can improve performance; see e.g. Kronbichler (2012), Boehm (2025). Have you considered this, or can you comment on it?
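To make points a) and b) concrete: for an affine P1 tetrahedron, both the Jacobian and the shape-function gradients are constant per element, so they can be computed once outside the Gauss-point loop. A minimal sketch with hypothetical names, not the manuscript's Algorithm 1:

```cuda
// For an affine P1 tet the Jacobian J = [x1-x0 | x2-x0 | x3-x0] is
// constant, so it and the physical shape-function gradients can be
// computed once per element instead of at every Gauss point.
__device__ void p1_constant_gradients(const double xyz[4][3], // vertex coords
                                      double grad[4][3],      // d(phi_i)/dx_d
                                      double* detJ)
{
    double J[3][3];
    for (int j = 0; j < 3; ++j)           // column j = vertex (j+1) - vertex 0
        for (int d = 0; d < 3; ++d)
            J[d][j] = xyz[j + 1][d] - xyz[0][d];
    *detJ = J[0][0]*(J[1][1]*J[2][2]-J[1][2]*J[2][1])
          - J[0][1]*(J[1][0]*J[2][2]-J[1][2]*J[2][0])
          + J[0][2]*(J[1][0]*J[2][1]-J[1][1]*J[2][0]);
    const double s = 1.0 / *detJ;
    double inv[3][3];                     // J^{-1} via the adjugate
    inv[0][0] = (J[1][1]*J[2][2]-J[1][2]*J[2][1])*s;
    inv[0][1] = (J[0][2]*J[2][1]-J[0][1]*J[2][2])*s;
    inv[0][2] = (J[0][1]*J[1][2]-J[0][2]*J[1][1])*s;
    inv[1][0] = (J[1][2]*J[2][0]-J[1][0]*J[2][2])*s;
    inv[1][1] = (J[0][0]*J[2][2]-J[0][2]*J[2][0])*s;
    inv[1][2] = (J[0][2]*J[1][0]-J[0][0]*J[1][2])*s;
    inv[2][0] = (J[1][0]*J[2][1]-J[1][1]*J[2][0])*s;
    inv[2][1] = (J[0][1]*J[2][0]-J[0][0]*J[2][1])*s;
    inv[2][2] = (J[0][0]*J[1][1]-J[0][1]*J[1][0])*s;
    // Reference gradients of P1: phi_0 = 1-r-s-t, phi_1 = r, phi_2 = s, phi_3 = t.
    const double gref[4][3] = {{-1,-1,-1},{1,0,0},{0,1,0},{0,0,1}};
    for (int i = 0; i < 4; ++i)           // chain rule: d(phi)/dx_d = sum_k gref_k * inv[k][d]
        for (int d = 0; d < 3; ++d)
            grad[i][d] = gref[i][0]*inv[0][d] + gref[i][1]*inv[1][d]
                       + gref[i][2]*inv[2][d];
}
```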
* Concerning your statement in Section 3.4 on the function-call overhead of these lightweight functions: is there no option to force the compiler to do inlining? Also, I fail to see why there should be a problem with indirections in the case of using LAPACK's DSPEV? This is dense linear algebra after all. Also, a 6 x 6 matrix of doubles occupies 288 bytes; shouldn't that easily fit into a typical 32K L1 data cache?
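On the inlining question: CUDA does provide a qualifier to force inlining of small device functions. A minimal illustration (the helper name is hypothetical):

```cuda
// __forceinline__ instructs the compiler to inline a small __device__
// helper, avoiding per-call overhead in hot element loops.
__forceinline__ __device__ double dot3(const double* a, const double* b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```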
* With respect to the global matrix assembly, could you please provide more details on the procedure? Where does the assembly happen? Is the matrix first assembled in device memory and afterwards copied to the host memory, so that it can be used in PETSc? Or is the strategy different? I do not see that this is explained anywhere, and there is also no mention of something in this direction in Section 5.3.3.
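One possible strategy matching the description in this comment, sketched purely for illustration (the manuscript does not confirm this is the actual procedure; the sketch assumes a real-valued PETSc build with 32-bit PetscInt):

```cuda
// Sketch: assemble the CSR value array in device memory, copy it back to
// the host, and insert the rows into a PETSc matrix for the solver.
#include <vector>
#include <cuda_runtime.h>
#include <petscmat.h>

void push_csr_to_petsc(Mat A, int nrows,
                       const int* row_ptr, const int* col_idx, // host CSR pattern
                       const double* d_vals, int nnz)          // device CSR values
{
    std::vector<double> h_vals(nnz);
    cudaMemcpy(h_vals.data(), d_vals, nnz * sizeof(double),
               cudaMemcpyDeviceToHost);
    for (int r = 0; r < nrows; ++r) {
        int n = row_ptr[r + 1] - row_ptr[r];
        MatSetValues(A, 1, &r, n, &col_idx[row_ptr[r]],
                     &h_vals[row_ptr[r]], INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
}
```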
* Concerning your validation strategy, I must admit that I find this not overly convincing. I agree with you that computations on the CPU and GPU and with different codes will not give bitwise identical results, which in the case of AMR indeed might lead to different meshes developing. However, assuming that both approaches lead to a sufficiently accurate approximate solution, it should be perfectly okay to compare these, as this is a FEM approach, shouldn't it? Maybe I do not understand how the "significant interpolation errors" arise? Anyway, wouldn't it be more convincing to use a fixed fine mesh for both cases and perform simulations that include the non-stationary initial part?
* What precisely is shown in Table 3? What is ele_val_scalar, or what does it denote?
* Appendix A, B & C: Showing these codelets is generally a nice idea. However, it would be more helpful if there was an explanation of the details of the template parameters and the function parameters.
Can you explain why __restrict__ is used in dot_product_t, but not for the pointers in matmul_t? I'd assume that at least the memory buffer C must not alias with a or b?
Also, in matmul_t, line 428, it looks as if access to C and b was strided? Is this behaviour not counter-intuitive performance-wise in the innermost loop?
In line 446 it seems a little bit fishy that the template arguments in the call to matmul_t are int literals and do not depend on the template arguments of dshape_tensor_dshape_unroll_t?
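For illustration, a version of matmul_t with uniformly applied __restrict__ qualifiers, as the comment suggests. This is a sketch assuming row-major storage, not the appendix code itself:

```cuda
// Sketch: __restrict__ on all three pointers promises no aliasing, which
// matters most for the output buffer C; the innermost loop is unit-stride
// on a but strided on b, as discussed above.
template <int M, int N, int K>
__device__ void matmul_t(const double* __restrict__ a,   // M x K
                         const double* __restrict__ b,   // K x N
                         double*       __restrict__ C)   // M x N
{
    #pragma unroll
    for (int i = 0; i < M; ++i)
        #pragma unroll
        for (int j = 0; j < N; ++j) {
            double acc = 0.0;
            #pragma unroll
            for (int k = 0; k < K; ++k)
                acc += a[i * K + k] * b[k * N + j];
            C[i * N + j] = acc;
        }
}
```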
Line 6: While element-wise computations and global assembly definitely belong to the "primary contributors to the runtime", they are not the only ones. Especially, as you later also mention (line 52), solving the sparse linear system itself is a significant contributor, too. The same holds for line 45. Thus, you might want to make your formulation more consistent.
Line 11: You mention a speed-up of over 100 here, but with respect to what?
Line 32: "FEM ... preserves conservation": I am not fully sure I understand this statement. Also, I feel qualified to state that not each and every FE discretisation is e.g. locally mass-conserving, so maybe re-think and re-formulate this statement.
Line 64: "However, the computation of elements and subsequent matrix assembly may pose significant challenges." Could you elaborate on why that is, or what aspect you are referring to?
Line 76: It would be beneficial to the reader to have an example of what an "element-wise computation" is already here.
Line 92/93: Try to consistently use "non-linear" or "nonlinear".
Line 98: "MPIs" -> "MPI"
Line 107: "its [PETSc's] contribution to the overall runtime was relatively small." Could this be quantified? Because in your kernel tests I see speed-ups between 100 and 900, but that is not reflected in the overall solution time between the CPU-only and the CPU+GPU code. So, where does the runtime then go, if not into solving the linear system?
Line 112: "shows that" -> "shows"
Line 166: I failed to understand which parameters can be tuned manually.
Line 277: "In GPU" -> "In our GPU"?
Line 279: "computes": isn't this a lookup of the global index in an array?
Line 334: What is a "scalar data structure"?
Line 346: "function template optimization" sounds like a technical term, but AFAIK it isn't? You presumably mean optimization via the use of template functions?
References:
There are several occurrences of entries such as this:
https://doi.org/https://doi.org/10.1016/j.procs.2010.04.121
You should probably ensure that the DOI entries in your BibTeX file do not contain a resolver part, but only a pure DOI.
---
Kronbichler, 2012, A generic interface for parallel cell-based finite element operator application, 10.1016/j.compfluid.2012.04.012
Kronbichler, 2019, Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators, 10.1145/3325864
Rudi, 2015, An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth’s Mantle, 10.1145/2807591.2807675
Boehm, 2025, Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids, 10.1137/24m1653756
Citation: https://doi.org/10.5194/egusphere-2026-695-RC3
General Assessment
This manuscript presents a GPU-accelerated implementation of the Fluidity-Atmosphere dynamical core, with a focus on two major computational bottlenecks in unstructured finite-element atmospheric models: element-wise computations and global sparse matrix assembly. The authors design GPU-oriented data structures and optimized CUDA kernels, achieving substantial kernel-level speedups (up to ~100–900×) and an overall acceleration of up to 8.57× in a hybrid MPI+GPU configuration.
The topic is timely and relevant. Unstructured-mesh FEM atmospheric models are increasingly important due to their geometric flexibility and suitability for high-resolution simulations over complex terrain. However, their computational inefficiency remains a major obstacle. This work addresses a meaningful gap by targeting GPU acceleration in a realistic dynamical core rather than in isolated kernels.
Overall, the manuscript is well organized, and the implementation appears technically sound. The reported performance improvements are significant, and the inclusion of code and data is consistent with GMD’s reproducibility standards. These aspects make the work potentially valuable to the community.
However, despite these strengths, several aspects of the manuscript could be further strengthened to improve clarity and rigor. In particular, the discussion of performance results would benefit from a more detailed interpretation, especially regarding the gap between kernel-level and end-to-end speedups. In addition, some implementation aspects—such as portability and parameter choices—would benefit from further clarification. Addressing these points would help provide a more complete and transparent assessment of the proposed approach.
I therefore recommend minor revision before publication.
Major Comments
1. Gap Between Kernel-Level and End-to-End Performance
The manuscript reports very high kernel-level speedups (up to ~900×), while the overall application speedup is approximately 8.57×. This discrepancy is expected in complex applications, but it is not sufficiently discussed in the current manuscript.
In particular, the contribution of non-accelerated components—such as the sparse linear solver and CPU–GPU data transfer—is not clearly quantified. As a result, it remains unclear which parts of the workflow dominate the runtime after GPU acceleration.
The authors are encouraged to provide a clearer breakdown of the total runtime and to discuss the limiting factors for end-to-end performance.
2. Limited Discussion on Portability and Generality
The current implementation is closely tied to CUDA and NVIDIA GPUs. Although the manuscript briefly mentions the possibility of porting the approach to other platforms (e.g., via HIP), this point is only touched upon and would benefit from a more explicit discussion.
A short discussion of the dependence on CUDA-specific features (for example, atomic operations and memory hierarchy), the potential challenges in porting, and the expected generality of the proposed approach would improve the completeness of the manuscript.
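As an illustration of the kind of porting path meant here, a thin compile-time shim can map the CUDA runtime calls onto their HIP counterparts. This is a sketch; the gpu* macro names are hypothetical:

```cuda
// Portability shim: most CUDA runtime calls used in such a port have
// one-to-one HIP counterparts, selected at compile time.
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc       hipMalloc
  #define gpuMemcpy       hipMemcpy
  #define gpuMemcpyHtoD   hipMemcpyHostToDevice
  #define gpuDeviceSync   hipDeviceSynchronize
#else
  #include <cuda_runtime.h>
  #define gpuMalloc       cudaMalloc
  #define gpuMemcpy       cudaMemcpy
  #define gpuMemcpyHtoD   cudaMemcpyHostToDevice
  #define gpuDeviceSync   cudaDeviceSynchronize
#endif
```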
Minor Comments
1. Consistency between abstract and reported results
The abstract reports an overall speedup of 8.57×, while later sections mention 10.28× under certain configurations. This inconsistency may cause confusion.
The authors are encouraged to clearly specify the conditions under which each value is obtained and ensure consistent reporting throughout the manuscript.
2. Kernel launch configuration
Thread block sizes are described as being “empirically tuned,” but no further details are provided. It would be useful to report the chosen block sizes and the quantitative rationale behind them (for example, SM occupancy or profiling measurements).
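One quantitative way to justify such a choice is CUDA's occupancy API, which suggests the block size maximizing SM occupancy for a given kernel. A minimal sketch; the kernel here is a hypothetical stand-in:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: one thread per element.
__global__ void element_kernel(int n_ele)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_ele) return;
    // ... per-element work ...
}

void launch_with_occupancy(int n_ele)
{
    int min_grid = 0, block = 0;
    // Returns the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, element_kernel, 0, 0);
    int grid = (n_ele + block - 1) / block;
    element_kernel<<<grid, block>>>(n_ele);
}
```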
Recommendation
Based on the comments above, I recommend minor to moderate revision before the manuscript can be considered for publication.
The work is relevant and technically promising, but the manuscript would benefit from a clearer discussion of end-to-end performance limitations, a more complete treatment of portability, and several clarifications in the presentation of results and implementation details.
I hope these comments are helpful to the authors in improving the clarity and completeness of the manuscript.