the Creative Commons Attribution 4.0 License.
asQ: parallel-in-time finite element simulations using ParaDiag for geoscientific models and beyond
Abstract. Modern high performance computers are massively parallel; for many PDE applications spatial parallelism saturates long before the computer’s capability is reached. Parallel-in-time methods enable further speedup beyond spatial saturation by solving multiple timesteps simultaneously to expose additional parallelism. ParaDiag is a particular approach to parallel-in-time based on preconditioning the simultaneous timestep system with a perturbation that allows block diagonalisation via a Fourier transform in time. In this article, we introduce asQ, a new library for implementing ParaDiag parallel-in-time methods, with a focus on applications in the geosciences, especially weather and climate. asQ is built on Firedrake, a library for the automated solution of finite element models, and the PETSc library of scalable linear and nonlinear solvers. This enables asQ to build ParaDiag solvers for general finite element models and provide a range of solution strategies, making testing a wide array of problems straightforward. We use a quasi-Newton formulation that encompasses a range of ParaDiag methods, and expose building blocks for constructing more complex methods. The performance and flexibility of asQ are demonstrated on a hierarchy of linear and nonlinear atmospheric flow models. We show that ParaDiag can offer promising speedups and that asQ is a productive testbed for further developing these methods.
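As a rough illustration of the diagonalisation idea mentioned in the abstract (this is not asQ code), the sketch below treats the scalar ODE u' = lam * u discretised with backward Euler. The all-at-once matrix is perturbed into an alpha-circulant matrix, which a weighted discrete Fourier transform in time diagonalises, so the coupled solve reduces to independent scalar divisions. The names `dft`, `alpha_circulant_solve`, `alpha_circulant_apply`, `nt`, `dt`, `lam` and `alpha` are assumptions chosen for this example, and the naive transform stands in for an FFT library.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; an FFT would be used in practice."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for j in range(n):
        total = sum(values[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                    for k in range(n))
        out.append(total / n if inverse else total)
    return out


def alpha_circulant_solve(first_column, alpha, rhs):
    """Solve C x = rhs, where C is alpha-circulant with the given first column."""
    n = len(first_column)
    gamma = [alpha ** (j / n) for j in range(n)]  # diagonal weights
    # (1) Weighted transform in time of the matrix column and right-hand side.
    eigenvalues = dft([g * c for g, c in zip(gamma, first_column)])
    rhs_hat = dft([g * r for g, r in zip(gamma, rhs)])
    # (2) In frequency space every entry decouples: independent solves.
    x_hat = [r / e for r, e in zip(rhs_hat, eigenvalues)]
    # (3) Inverse transform and removal of the weights.
    return [x / g for x, g in zip(dft(x_hat, inverse=True), gamma)]


def alpha_circulant_apply(first_column, alpha, x):
    """Multiply by the dense alpha-circulant matrix (used only to check the solve)."""
    n = len(first_column)
    return [sum(first_column[(j - k) % n] * (alpha if j < k else 1.0) * x[k]
                for k in range(n))
            for j in range(n)]


# Backward Euler for u' = lam * u: diagonal 1/dt - lam, subdiagonal -1/dt.
nt, dt, lam, alpha = 8, 0.1, -1.0, 1e-2
first_column = [1.0 / dt - lam, -1.0 / dt] + [0.0] * (nt - 2)
rhs = [1.0] * nt

x = alpha_circulant_solve(first_column, alpha, rhs)
applied = alpha_circulant_apply(first_column, alpha, x)
residual = max(abs(a - r) for a, r in zip(applied, rhs))
print(f"max residual: {residual:.2e}")
```

For a PDE, each scalar division in step (2) would become a spatial block solve, and those blocks are what can be distributed across processors.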
Status: open (until 03 Feb 2025)
-
CEC1: 'Comment on egusphere-2024-3699', Juan Antonio Añel, 27 Dec 2024
Dear authors,
I would like to kindly request that, in any potentially reviewed versions of your manuscript, you include in the Code and Data Availability section the information on the Zenodo repositories relevant to your work that you included in the system during your submission. Currently only one of them is cited. Therefore, you should add:
https://zenodo.org/records/14198294
https://zenodo.org/records/14198329
Thanks,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-3699-CEC1
AC1: 'Reply on CEC1', Josh Hope-Collins, 02 Jan 2025
Dear Juan,
The latest arXiv version of the manuscript includes references to all three Zenodo records listed in the assets for this submission. An earlier version of the manuscript on arXiv did not include citations to the zenodo record. When I follow the arXiv link from the "Abstract" tab for this submission it takes me to the most recent version - please let me know if this is not the case for you and I will update the link to point to the specific arXiv version.
The text for the code and data availability section reads:
All code used in this manuscript is free and open source. The asQ library is available at https://github.com/firedrakeproject/asQ and is released under the MIT license. The data in this manuscript was generated using the Python scripts in (Hope-Collins et al. (2024b), also available in the asQ repository), the Singularity container in (Hope-Collins et al., 2024a), and the versions of Firedrake, PETSc, and their dependencies in (firedrake zenodo, 2024).
Hope-Collins et al. (2024b) is the first record you requested (the Python scripts used in the manuscript), and Hope-Collins et al. (2024a) is the second record (the Singularity container used to run the scripts).
Many thanks
Josh Hope-Collins
Citation: https://doi.org/10.5194/egusphere-2024-3699-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 02 Jan 2025
Dear authors,
We cannot accept links to GitHub sites. GitHub sites do not comply with our policy, as they are not long-term repositories valid for scientific publication. GitHub itself recommends using other repositories for scientific publication, and even provides an integration with Zenodo for this purpose. Therefore, please store the asQ library in a repository that complies with our policy, and reply to this comment with its link and permanent handle (e.g. DOI). Also, remember to include the information on the new repository in any potentially reviewed version of your manuscript.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-3699-CEC2
AC2: 'Reply on CEC2', Josh Hope-Collins, 03 Jan 2025
Dear Juan,
The version (git commit) of asQ used in this paper is now recorded in the Zenodo archive https://doi.org/10.5281/zenodo.14592039. This record will be cited in the updated version of the manuscript.
Many thanks
Josh Hope-Collins
Citation: https://doi.org/10.5194/egusphere-2024-3699-AC2
-
RC1: 'Comment on egusphere-2024-3699', Anonymous Referee #1, 02 Jan 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-3699/egusphere-2024-3699-RC1-supplement.pdf
-
RC2: 'Comment on egusphere-2024-3699', Anonymous Referee #2, 07 Jan 2025
This paper introduces the asQ library for implementing ParaDiag parallel-in-time methods, focusing on applications related to weather and climate. The paper is really well written, and I was impressed by both the asQ library design and numerical results. The authors gave a nice intro on other relevant research with a comprehensive set of references. They also did a good job of describing the asQ library and positioning it relative to existing methods and software. The numerical examples are known to be difficult to solve parallel-in-time, yet excellent speedups were achieved in all cases. The authors are aware of the current limitations of ParaDiag, but have several ideas for future research to expand its applicability. I only have a few minor comments.
Minor comments:
- Page 2, paragraph before 1.1. Change to "a survey of previous research".
- Page 6, after (13). Change to "inverse x = P^{-1}b" or "inverse of Px = b"? This is what the algorithm seems to be computing.
- Page 10, second sentence after (36). Change to "backward Euler" (no "s").
- Page 10, "time_partition = [2, 2, 2, 2]". This description grows with P_t. This may not be a problem for a while (or ever, since roundoff grows with N_t), but I was curious. Is there a P_t independent way to describe the partition in asQ? Are there any other things about the interface that don't scale?
- Page 16, description of strong scaling. I was really confused by this until I saw the subsequent comment about nonlinear problems and the "full timeseries of N_T timesteps" broken up into windows. I'm not sure why you don't describe the linear problems in the same way, especially given that you use the term "windows" when discussing the numerical results. Related to this, is the time given for the serial-in-time method in Figure 3 for 16,384 timesteps (which you could call N_T)? Also in Figure 3, are you plotting T_p/N_t or (T_p/N_t)*N_w (where T_p is the time to solve N_t timesteps in a window)? It looks like maybe you don't compute the full timeseries in the linear case, and instead compute only one window of size N_t and simply assume (reasonably) that the full N_T timesteps would be N_w times larger. If this is all true, I think it would be less confusing to present it this way because it lines up exactly with the standard definition of strong scaling. If I've misunderstood, then some additional detail is needed.
- Page 27, 5.2.1. Delete one of "been" in the second sentence.
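On the question about `time_partition = [2, 2, 2, 2]` above: for uniform partitions, a description whose size does not depend on P_t could be expanded into the explicit list by a small helper. The following is a plain-Python sketch only; `uniform_time_partition`, `nt` and `pt` are hypothetical names introduced for illustration and are not part of asQ's interface.

```python
def uniform_time_partition(nt, pt):
    """Split nt timesteps evenly over pt time-parallel ranks.

    Hypothetical helper: returns a list with one entry per rank,
    each entry being the number of timesteps held by that rank.
    """
    if nt <= 0 or pt <= 0 or nt % pt != 0:
        raise ValueError("nt must be a positive multiple of pt")
    return [nt // pt] * pt


time_partition = uniform_time_partition(nt=8, pt=4)
print(time_partition)  # [2, 2, 2, 2]
```

Only the two integers need to be written down by the user, while the list that grows with P_t is generated on demand.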
Citation: https://doi.org/10.5194/egusphere-2024-3699-RC2
RC3: 'Comment on egusphere-2024-3699', Anonymous Referee #3, 10 Jan 2025
General comments.
The paper introduces asQ, a new software library built on top of the Firedrake software, for rapid testing and development of the ParaDiag parallel-in-time algorithm. While tailored towards earth system modeling, the equations to be solved can be specified via the Unified Form Language (UFL), making the framework quite general. The sections of the paper provide a detailed introduction to ParaDiag, the numerical algorithm, including a comprehensive review of the relevant literature. Then, the asQ library is described, including the space-time parallelization paradigm used. The reader is walked through an example of how to solve the heat equation with asQ to illustrate the steps that are needed. Wallclock scaling is shown for different numerical examples. While speedups for linear problems are extremely good, the fact that an averaged Jacobian needs to be used in ParaDiag results in much more modest speedups for nonlinear problems which are, however, very much in line with what other PinT methods deliver.
Overall, I think this is a strong, timely and well-written paper. Providing high-quality software libraries is exactly what is required to push parallel-in-time integration methods to broader use, and there are still only a few codes that do this. Therefore, asQ and the paper clearly fill an important need. Explanations in the paper are very clear and should be understandable for non-PinT-experts. The numerical results are convincingly linked to theory while the shown speedups are well explained using the provided performance model. I have only a few fairly minor issues that should be addressed before publication. The code used is provided open-source, numerical experiments can be reproduced via Python scripts, and a complete installation is available as a Singularity container.
Specific comments.
- I am not entirely sure I understand why the transpose and thus all-to-all communication is needed. Is this a consequence of the way the method is implemented or is this intrinsic to ParaDiag?
If I understand correctly, in Steps 1 and 3 on p. 6 only a (I)FFT in time is required for every spatial DoF, but the (I)FFT for each spatial DoF is independent from all others. Since the number of parallel copies of one spatial DoF (that is, N_t) is small(ish), is it not possible to store all the temporal copies of a given spatial DoF on the same node?
Basically, let us say the spatial parallelization breaks down the spatial domain into four sub-domains Omega_1, Omega_2, Omega_3, Omega_4. Then say we use four parallel time steps. Cannot Node 1 hold the four temporal copies of Omega_1, Node 2 four copies of Omega_2 etc? This way, the (I)FFT in time can be computed without sending messages. This could even enable some "hybrid" space-time parallelization where the (I)FFT are parallelized with OpenMP or some other shared memory paradigm to avoid having to transfer full solutions in time. Is this possible in principle but not within the used framework, just very challenging to implement or is my understanding incorrect?
I am not suggesting that the authors implement this but I would be interested in some more details on why the transpose is required and if, theoretically, there are ways around it.
- Eq. (24), I am not sure I understand what the Lip operator means. Is it the minimum L that satisfies a Lipschitz condition for f(u) - Nabla f(u_tilde) * u? But as what, as a function of u with fixed u_tilde? Maybe it would be clearer to simply spell out the condition that kappa needs to satisfy.
- Some of the figures are a bit hard to read in print. I would suggest slightly thicker lines and slightly larger markers. The yellow coloured lines are also sometimes difficult to see in print; maybe a darker colour would help.
- There are a lot of parameters to keep track of (N_x, N_t, k_s, k_p, m_s, m_p, ...) and I found myself going back and forth a lot to find their definitions. Having them all summarized in a concise table for quick reference would help a lot.
- The description of the main steps of ParaDiag in the paragraph after Eq. (27) could probably be summarised in pseudo-code.
Technical corrections/minor comments.
- The unit in which wallclock time is measured does not seem to be stated and is missing from the axis labels in the figures.
- There is a very recent preprint discussing integration of parallel-in-time functionality into the Nektar++ code. This effort probably deserves to be mentioned in the discussion of the (few) general purpose software packages supporting PinT.
Xing, Jacques Y. and Moxey, David and Cantwell, Chris D., Enhancing the Nektar++ Spectral/Hp Element Framework for Parallel-in-Time Simulations. Available at SSRN: https://ssrn.com/abstract=5010907 or http://dx.doi.org/10.2139/ssrn.5010907
Citation: https://doi.org/10.5194/egusphere-2024-3699-RC3
Model code and software
Python scripts for "asQ: parallel-in-time finite element simulations using ParaDiag for geoscientific models and beyond" Joshua Hope-Collins, Abdalaziz Hamdan, Werner Bauer, Lawrence Mitchell, and Colin Cotter https://doi.org/10.5281/zenodo.14198293
Singularity container for "asQ: parallel-in-time finite element simulations using ParaDiag for geoscientific models and beyond" Joshua Hope-Collins, Abdalaziz Hamdan, Werner Bauer, Lawrence Mitchell, and Colin Cotter https://doi.org/10.5281/zenodo.14198328
Software used in "asQ: parallel-in-time finite element simulations using ParaDiag for geoscientific models and beyond" Firedrake team https://doi.org/10.5281/zenodo.14205088
Viewed
Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.
- HTML: 135
- PDF: 0
- XML: 0
- Total: 135
- BibTeX: 0
- EndNote: 0