This work is distributed under the Creative Commons Attribution 4.0 License.
SWEpy: An Open-Source GPU-Accelerated Solver for Near-Field Inundation and Far-Field Tsunami Modeling
Abstract. We present SWEpy, a new open-source, GPU-accelerated Python finite-volume (FV) software designed to solve the Saint-Venant system of shallow water equations (SWE) on unstructured triangular grids. SWEpy is designed for flexibility and performance, implementing a well-balanced, positivity-preserving, higher-order central-upwind FV scheme intended for tsunami wave propagation, flooding, and dam-break scenarios, among others.
To reduce the numerical diffusion frequently found in this class of FV schemes, we use a second-order WENO reconstruction operator together with a third-order strong-stability-preserving Runge-Kutta time integrator. A modular software architecture is presented that can support a range of initial and boundary conditions and source terms.
SWEpy's performance, stability, and accuracy are verified using canonical benchmarks, including Synolakis' conical island and Bryson's flow over a Gaussian bump, and further demonstrated in large-scale simulations of the 1959 Malpasset Dam failure and the Mw8.8 2010 Maule tsunami. SWEpy delivers high-resolution results on consumer-grade hardware, offering a user-friendly platform for both research and operational forecasting.
Status: open (until 18 Dec 2025)
- RC1: 'Comment on egusphere-2025-3900', Anonymous Referee #1, 03 Dec 2025
- RC2: 'Comment on egusphere-2025-3900', Anonymous Referee #2, 06 Dec 2025
Summary

The article introduces SWEpy, an open-source, Python-based, GPU-accelerated finite-volume solver for the shallow-water equations on unstructured triangular grids, targeting tsunami propagation, flooding, and dam-break scenarios. The code uses a central-upwind flux formulation with WENO spatial reconstruction and SSP Runge–Kutta time integration to reduce numerical diffusion while attempting to ensure well-balancing and positivity preservation. The paper details the numerics and some of the algorithmic design of the CuPy-based code, providing additional validation and verification examples.

Considering the article and the code as available today, there are several shortcomings that do not yet support the conclusions and ambition set out in the article. The largest issues include (details are in the comments):
- The background provided lacks many references regarding the state of shallow water solvers, especially in the HPC field. The packages that are mentioned are not entirely representative and also have some errors in their stated properties. For instance, a major code that is GPU-enabled and has many of the advantages listed for SWEpy has been authored by a group including Manuel Castro-Diaz.
- The modularity of the code and the algorithm design is mentioned various times as an advantage, but this is never fully explained or demonstrated. Reading through the code itself, it is not obvious how this modularity is designed or how one would go about implementing additional modules, beyond the idea that Python allows this. Furthermore, the design of the algorithms on the GPU is not discussed, especially how this would enable easy extensions on a GPU. The design of the GPU components needs to be discussed more fully, including how CuPy works and how it communicates between the CPU and GPU seamlessly. For example, what design decisions needed to be made for the GPU?
- The verification and validation results are incomplete in several critical ways. Tests demonstrating conservation and well-balancing are missing. The results also show a degradation in order of convergence, but this issue is not addressed satisfactorily, especially for a smooth problem. Other issues are mentioned but not satisfactorily explained, such as the loss of stability in the forward Euler and WENO runs for the Maule tsunami.
While these are significant, it is clear that this is a substantial step towards a working code that demonstrates the goals laid out in the article. The article could either address these concerns directly or shift its focus towards the implementation of the GPU code and its performance characteristics, minimizing the discussion of numerical methods that are already well established. For that reason, I would recommend at least major revisions that address the over-arching issues with the article (and the code) as outlined above, or a reconsideration and resubmission with a change of focus towards the implementation details.

Comments and Suggestions

- The introduction is missing many details. The discussion of recent work on CPU/GPU computing, especially in Python, is out of date. For example, CuPy has undergone a major effort from NVIDIA to bring in direct support.
- (line 54) The reasons for choosing unstructured versus structured grids are well known, but this characterization omits adaptive grids, something that both structured and unstructured grids can take advantage of.
- (table 1) There are a few notable packages missing from this list, such as MOST, ADCIRC, and exaHYPe, and some inaccuracies in the table, such as GeoClaw's support for GPUs (GeoClaw also does not use an HLLE solver, but that is more complex). I might suggest adding a column that distinguishes parallelism support and what type of support is provided (e.g., OpenMP, MPI, OpenACC, CUDA, CuPy, Kokkos). You could still maintain the GPU column to distinguish this, as that is the focus of SWEpy.
- (line 61) This is a bit of an over-generalization and overstates the cost of the common types of Riemann solvers used by the packages in table 1. The point about central schemes being simpler is generally true, however.
- (line 86) Here, high-order reconstruction is mentioned in the context of near-field wet/dry fronts. This ends up being limited to first order, and for good reason. It is also not clear that a high-order reconstruction near the shore is helpful in most cases anyway when using the shallow water equations without non-hydrostatic effects included. This high-order reconstruction is a good argument for using GPUs, but that point is not directly made.
- While mentioned that it would be "easy", it is not clear how additional source terms would be added, either from the article or the code, especially with higher order reconstructions that may be important to these additional source terms.
- (line 310) The near-shore well-balancing is often where the largest errors can occur. Quantifying what a "slight imbalance" amounts to seems extremely relevant.
- It appears that the time steps are adaptive? If so, what is the penalty for having to compute each time step? This is especially relevant for GPU computing, where determining the stable step size requires a global reduction over the mesh.
- The CFL restriction is significant for higher-order WENO schemes. Can you say what the empirical stability is in practice for these simulations?
- (fig 5) Not sure that this is necessary, or maybe can be simplified for the article? In its stead, it may be more helpful to illustrate the modularization by showing how the code itself is split up into functions in the code.
- (fig 5) Imposition of the boundary conditions is essential, and with a wider stencil, can be tricky to implement. Later, you mention that there are issues with the BCs. Can you better quantify what is being done and how the "leaky" boundary effect could be better addressed rather than expanding the boundaries away from the area of interest? This is often not practical in many cases of interest. A common example that can be difficult is imposition of an incoming wave while allowing a wave to exit without reflection.
- (line 423) A stiff (implicit) solver would be very non-trivial to implement on a GPU.
- Modularity does enable additional capabilities, but saying that this would be easy while maintaining efficiency is rarely true. Some trade-off needs to be made, so a discussion of this would be warranted if this is a major feature of the code as proposed. This design question is very difficult to evaluate given the current state of the code. Diagrams illustrating how the code is modularized and composed would make this a salient point, but at present this is very difficult to evaluate.
- On an older GPU, or even on high-performance GPUs, the data associated with the grid can become large and lead to significant slow-downs. Was this addressed, or were the problems that were run not large enough for this to occur? Furthermore, memory locality is not assured for even moderately complex unstructured grids and can be particularly challenging for a GPU. Did this have an impact on the results presented?
- It would be much easier to evaluate the output in 2D rather than 3D in most of the cases presented.
- (table 2) The loss of order of convergence is significant and needs to be better addressed.
- (fig 9) Label the figure axes with the actual thing being plotted, e.g., surface depth. This is also not a gauge plot, so I am not sure that the caption is correct.
- (fig 9) Why is there water being plotted on top of the island? There are additional suspicious looking areas that are wet for all the methods.
- (table 3) It would be helpful to report L^infty or max errors for these types of problems (you could use a shifted norm for these).
- The conical island problem uses a first-order time method; why? To isolate the spatial reconstruction, a high-order time-marching scheme should be used instead.
- A test demonstrating well-balancing and conservation is critical for verification of the methods.
- Malpasset Benchmark
- Were refinement studies done given the claim that the resolution has an impact on the comparison?
- The boundary problem and false wet/dry problems seem significant. Can this be better addressed than what is listed?
- I am not sure that the Malpasset benchmark shows that SWEpy could be used to replicate results from more careful validation studies. Additional comparisons may need to be done, and a demonstration that the boundary issues are not impactful would be helpful.
- Maule Benchmark
- Make a note to the reader that FE == forward Euler, not finite element, which is implied but may lead to confusion.
- Why is the CFL 0.25 rather than at least 1/3?
- The explanation of high-frequency waves in the FE+WENO run at low resolution does not make sense (fig 18). Instead, the low resolution in both time and space should have stabilized the run. What is going on here?
- The quantification as to why the high-order WENO scheme looks more diffusive than HySEA seems important for demonstrating the advantages of the method proposed. Again, a convergence study would help to illuminate the issue.
- Given the importance of GPU computing, having only one benchmark comparing the implementations is insufficient. We also are left wondering if this is a good comparison. Was the CPU code really tuned to a GPU? What about a code that is fast on a CPU (HySEA)? Scalability is essential for these problems and nothing is presented. What about a better CPU and GPU, would these differences be the same?
- Discussion of how you used CuPy as a replacement for NumPy is very sparse. This is a critical factor that should be held up as a reason to do what you did and would significantly add to the impact of the article. What exactly needed to change (e.g., was it just dropping in `import cupy as np`, or something more)? A minimal illustration of such a drop-in pattern is sketched after this list.
- Code suggestions:
- The code has no documentation. It was not clear how to even run the examples: which scripts need to be called? How does the input work? Jupyter notebooks or something similar would be perfect for this. There is also no explanation of the input format.
- The GitHub repository is large, and the files included seem like they could be auto-generated.
- There are several issues with the code itself, such as remaining hard-coded links.
- How does one run the code? Cloning the repository gives no indication of how to actually run anything.
- The license choice can significantly limit adoption and use beyond academic and government entities. This is not to suggest that you must change it, but at this early stage of the package you may want to consider moving to a more permissive license.
- Jupyter notebooks would greatly enhance this demonstration.
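The sketch below is a generic illustration (not code from SWEpy; the function `shallow_water_step` and all of its contents are hypothetical) of the NumPy/CuPy drop-in pattern referred to above, where a single array-module handle `xp` selects the backend and the remaining design questions concern synchronization, reductions, and host/device copies:

```python
# Minimal, hypothetical sketch of a NumPy/CuPy backend-switch pattern (not SWEpy's actual design).
# All numerical kernels act only through the array-module handle `xp`, so the same
# code runs on the CPU (NumPy) or the GPU (CuPy) without duplication.
import numpy as np

try:
    import cupy as cp
    xp = cp if cp.is_available() else np     # fall back to NumPy if no CUDA device is present
except ImportError:
    xp = np

def shallow_water_step(h, hu, hv, dt, dx):
    """Illustrative (and physically incomplete) work: elementwise ops plus one reduction."""
    u = xp.where(h > 1e-8, hu / h, 0.0)
    v = xp.where(h > 1e-8, hv / h, 0.0)
    speed = xp.sqrt(u**2 + v**2) + xp.sqrt(9.81 * h)
    cfl_dt = 0.5 * dx / float(xp.max(speed))  # global reduction: forces a device synchronization on the GPU
    return min(dt, cfl_dt)

h = xp.ones(1_000_000)
hu = xp.zeros_like(h)
hv = xp.zeros_like(h)
dt = shallow_water_step(h, hu, hv, dt=1.0, dx=10.0)

# Host/device transfers still need explicit handling at the I/O boundary:
h_host = cp.asnumpy(h) if xp is not np else h   # copy back to CPU memory before writing output
```

Even with such a pattern, GPU-specific design decisions remain (when to synchronize, when to copy to the host, how to keep the unstructured-grid connectivity arrays resident on the device), and those are exactly the decisions the comment above asks to be discussed.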
Citation: https://doi.org/10.5194/egusphere-2025-3900-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 342 | 249 | 16 | 607 | 11 | 11 |
General comments: I think the manuscript is concise and well-written. It presents a code that seeks to fill a gap in the current shallow water modeling landscape and uses novel numerical implementations to improve accuracy. The code is tested on commercial hardware and this could make it a valuable tool for education and research when high-performance computing resources are unavailable. I have some questions and comments regarding the code's efficiency and parallelism and some of the results presented in the manuscript. Once these are addressed, I think the manuscript will be acceptable for publication.
Specific comments/questions:
Question/comment 1: Does the code feature any CPU parallelism through either multi-core or multi-thread capabilities? I infer from table 1 that the code is not MPI-capable; is it thus safe to say that the code has no CPU parallelism outside the vectorized SIMD instructions? If this is the case, it should be clearly stated, especially when discussing simulation execution time.
Question/comment 2: Timing results are presented for the Maule test case. Is there any reason that no timing results are presented for the other benchmarks? If there isn't, it might be valuable to include a table with timings for CPU and GPU for the different benchmarks and at different resolutions.
Question/comment 3: It is mentioned in the manuscript that the code has high throughput on modest hardware. I find this very valuable, but there doesn't seem to be any quantification of this throughput. Would it be possible to include a plot or a table describing the code's throughput compared to the maximum throughput of the hardware it was tested on? I think this would make a compelling case for the code's efficiency.
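As an illustration of the kind of quantification being suggested, throughput can be reported as cell-updates per second and compared against a simple memory-bandwidth ceiling. The sketch below uses placeholder numbers only (mesh size, step count, wall time, bytes per update, and peak bandwidth are all hypothetical, not SWEpy measurements):

```python
# Hypothetical throughput estimate with placeholder values (not SWEpy measurements).
n_cells = 2_000_000      # cells in the mesh
n_steps = 50_000         # time steps taken
wall_time_s = 900.0      # measured wall-clock time in seconds

updates_per_s = n_cells * n_steps / wall_time_s
print(f"Throughput: {updates_per_s / 1e6:.1f} Mcell-updates/s")

# A crude roofline-style bound: if each cell update streams `bytes_per_update`
# through device memory, the peak bandwidth caps the achievable update rate.
bytes_per_update = 200.0     # rough estimate of bytes read + written per cell per step
peak_bw_bytes_s = 300e9      # e.g. ~300 GB/s for a mid-range consumer GPU
bound_updates_per_s = peak_bw_bytes_s / bytes_per_update
print(f"Bandwidth ceiling: {bound_updates_per_s / 1e6:.1f} Mcell-updates/s "
      f"({100 * updates_per_s / bound_updates_per_s:.0f}% of the ceiling reached)")
```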
Question/comment 4: Following my first question, if there isn't any CPU parallelism outside SIMD instructions, is the code eventually intended to feature MPI or multi-threading capabilities? Most consumer hardware has CPUs with multiple cores that are multi-threading capable, and I think many users who might not have access to a GPU could benefit from this (even on consumer hardware, GPUs remain quite an expensive component).
Question/comment 5: SWEpy uses the CuPy library for its GPU implementation, and CuPy utilizes CUDA. Does this make the code operational only on CUDA-compatible GPUs? If so, it should be clearly stated in the text.
Question/comment 6: Following my previous question, if the code only runs on CUDA-compatible GPUs, do you plan to make the code functional on GPUs that aren't CUDA-compatible?
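By way of illustration only (this is a generic sketch, not SWEpy code, and the function name `require_cuda_device` is hypothetical), an explicit start-up check is one way to make such a CUDA requirement visible to users instead of letting the code fail inside a kernel call:

```python
# Generic sketch of an explicit start-up check for a CUDA-capable device (not SWEpy code).
import sys

def require_cuda_device():
    """Return the CuPy module if a usable CUDA device exists, otherwise exit with a clear message."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() == 0:   # raises or returns 0 if no device/driver is present
            raise RuntimeError("no CUDA device found")
        return cp
    except Exception as err:                        # ImportError, CUDA runtime errors, ...
        sys.exit(f"This solver requires a CUDA-compatible GPU with CuPy installed ({err}).")

cp = require_cuda_device()
print("CUDA compute capability of device 0:", cp.cuda.Device(0).compute_capability)
```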
Question/comment 7: For the Maule test case, the manuscript says that the discrepancies between the HySEA simulation and the SWEpy simulation are due to interpolation errors, projection errors, and resolution differences. The manuscript then states that studying these factors is left to future work. However, given the timings mentioned in the manuscript, it does not seem infeasible to compare the SWEpy and HySEA simulations at the same resolution. Would this be possible? If it is, I think it would be valuable to see the impact of the additional resolution on the errors observed.
Question/comment 8: The use of the WENO reconstruction clearly shows better preservation of gradients and flow features, and from the manuscript it seems that this can be done using polynomials of any order. Is there a reason, outside of keeping the stencils simpler, that only second-order polynomials were used? Have you made any attempts using higher-order polynomial reconstruction (for example with order 3 or 4), and if so, did you see any additional improvements?
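For context, the sketch below shows the generic one-dimensional form of a low-order WENO reconstruction, in which two linear sub-stencils are blended with nonlinear weights. It is a textbook illustration, not SWEpy's unstructured-grid operator, but it indicates where a higher-order extension would enter: more sub-stencils, higher-degree candidate polynomials, and wider smoothness indicators.

```python
# Generic 1-D WENO reconstruction of the left-biased interface value u_{i+1/2}
# from cell averages, using two linear sub-stencils with nonlinear weights.
# Illustrative sketch only; NOT SWEpy's unstructured-grid reconstruction operator.
import numpy as np

def weno_left_interface(u, eps=1e-6):
    """Return u_{i+1/2} (left-biased) for interior cells i = 1 .. n-2 of the cell-average array u."""
    um, uc, up = u[:-2], u[1:-1], u[2:]          # u_{i-1}, u_i, u_{i+1}

    # Candidate linear reconstructions evaluated at the interface x_{i+1/2}
    p0 = -0.5 * um + 1.5 * uc                    # sub-stencil {i-1, i}
    p1 =  0.5 * uc + 0.5 * up                    # sub-stencil {i, i+1}

    # Smoothness indicators and nonlinear weights (ideal linear weights 1/3 and 2/3)
    beta0 = (uc - um) ** 2
    beta1 = (up - uc) ** 2
    a0 = (1.0 / 3.0) / (eps + beta0) ** 2
    a1 = (2.0 / 3.0) / (eps + beta1) ** 2
    w0, w1 = a0 / (a0 + a1), a1 / (a0 + a1)

    return w0 * p0 + w1 * p1

# Cell averages of a step: the reconstruction stays essentially non-oscillatory.
u = np.where(np.linspace(0.0, 1.0, 50) < 0.5, 1.0, 0.0)
print(weno_left_interface(u)[20:30])
```

Near a discontinuity the weight of the sub-stencil that crosses it collapses towards zero, which is what suppresses the oscillations a single high-degree polynomial would produce; that mechanism is independent of the polynomial degree chosen.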
Technical comments/corrections:
Figure 1: The figure implies that data is saved for output every time the RK iterations are completed. The text itself says that the output time is user determined. I think the figure could be improved by including this information.
Figure 1: I think the figure could benefit from also highlighting that the code can perform divergence and stagnation detection.