Parallelized Domain Decomposition for Multi-Dimensional Lagrangian Random Walk, Mass-Transfer Particle Tracking Schemes

Schauer, Lucas; Schmidt, Michael J.; Engdahl, Nicholas B.; Pankavich, Stephen D.; Benson, David A.; Bolster, Diogo

doi:https://doi.org/10.5194/egusphere-2022-781

Preprints

https://doi.org/10.5194/egusphere-2022-781

Preprints

22 Aug 2022

| 22 Aug 2022

Parallelized Domain Decomposition for Multi-Dimensional Lagrangian Random Walk, Mass-Transfer Particle Tracking Schemes

Lucas Schauer, Michael J. Schmidt, Nicholas B. Engdahl, Stephen D. Pankavich, David A. Benson, and Diogo Bolster

Abstract. Lagrangian particle tracking schemes allow a wide range of flow and transport processes to be simulated accurately, but a major challenge is numerically implementing the inter-particle interactions in an efficient manner. This article develops a multi-dimensional, parallelized domain decomposition (DDC) strategy for mass-transfer particle tracking (MTPT) methods in which particles exchange mass dynamically. We show that this can be efficiently parallelized by employing large numbers of CPU cores to accelerate run times. In order to validate the approach and our theoretical predictions we focus our efforts on a well known benchmark problem with pure diffusion, where analytical solutions in any number of dimensions are well established. In this work, we investigate different procedures for tiling the domain in two and three dimensions, (2-d and 3-d), as this type of formal DDC construction is currently limited to 1-d. An optimal tiling is prescribed based on physical problem parameters and the number of available CPU cores, as each tiling provides distinct results in both accuracy and run time. We further extend the most efficient technique to 3-d for comparison, leading to an analytical discussion of the effect of dimensionality on strategies for implementing DDC schemes. Increasing computational resources (cores) within the DDC method produces a trade-off between inter-node communication and on-node work. For an optimally subdivided diffusion problem, the 2-d parallelized algorithm achieves nearly perfect linear speedup in comparison with the serial run up to around 2700 cores, reducing a 5-hour simulation to 8 seconds, while the 3-d algorithm maintains appreciable speedup up to 1700 cores.

Received: 11 Aug 2022 – Discussion started: 22 Aug 2022

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Download & links

Preprint (PDF, 2501 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Preprint (2501 KB)

Download & links

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

03 Feb 2023

Parallelized domain decomposition for multi-dimensional Lagrangian random walk mass-transfer particle tracking schemes

Lucas Schauer, Michael J. Schmidt, Nicholas B. Engdahl, Stephen D. Pankavich, David A. Benson, and Diogo Bolster

Geosci. Model Dev., 16, 833–849, https://doi.org/10.5194/gmd-16-833-2023,https://doi.org/10.5194/gmd-16-833-2023, 2023

Short summary

Lucas Schauer, Michael J. Schmidt, Nicholas B. Engdahl, Stephen D. Pankavich, David A. Benson, and Diogo Bolster

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2022-781', Anonymous Referee #1, 12 Sep 2022

The manuscript falls outside the scope of EGUsphere. This should go to Computational Geosciences or JCP.

Citation: https://doi.org/10.5194/egusphere-2022-781-RC1
- AC1:
  'Reply to RC1', Lucas Schauer, 21 Sep 2022
  
  The journal aims and scope clearly indicate that the purpose of GMD is to allow publication of papers regarding numerical simulations of earth system components. The diffusion and/or heat equation are ubiquitous throughout the vast coupled partial differential equations typically used to describe earth system process. If this were a “Model description paper” then the reviewer’s position could be considered correct. However, we point to the description of GMD manuscript types (https://www.geoscientific-model-development.net/about/manuscript_types.html#item2), which, for “Development and Technical papers”, specifically invites “…model improvements such as the speed or accuracy of numerical integration schemes…”. This is precisely what is done in our submission. The problem of efficiently solving interacting, Lagrangian particle simulations is difficult, and we offer an approach with scalable efficiency across parallel computing architectures (multi-processor and multi-node clusters). The approach also offers the unique and novel ability to predict scaling behavior, which is especially difficult in most complex, parallelized models. These reasons clearly support inclusion of this work in GMD as it is 1) within the overall scope, and 2) specifically fits within the defined article types.
  
  Citation: https://doi.org/10.5194/egusphere-2022-781-AC1
  - EC1: 'Reply on AC1', David Ham, 11 Oct 2022
    
    For the record, the authors are correct. This manuscript is clearly in scope for GMD.
    
    Citation: https://doi.org/10.5194/egusphere-2022-781-EC1
RC2: 'Comment on egusphere-2022-781', Anonymous Referee #2, 23 Sep 2022

This paper highlights a multi-dimensional, parallel domain decomposition method for mass-transfer particle tracking methods. The authors focus on demonstrating the parallel scalability of a two- and three-dimensional "checkerboard" partitioning, and highlight a theoretical analysis and prediction of scalability at high core counts as their main novel contribution. A grand claim of reducing a "5-hour simulation to 8 seconds" is made to highlight the near-linear scalability of the chosen method.

The paper is well written and gives a detailed and concise overview of the problem definition and theoretical context in which the problem is set. A clear description of the particular class of particle tracking algorithm is provided and appropriate comparisons to other related fields are made. The performance analysis and theoretical prediction of scalability, however, neglects certain established principles (for example weak and strong scaling) and attempts to re-derive some well-known performance analysis steps; for example using an arbitrary 100-core baseline to normalise super-linear speed-ups, instead of normalising relative speed-up over memory sockets or nodes.

Unfortunately, however, the lack of any performance quantification of the baseline result, in particular with respect to shared-memory scaling, casts doubt over the achievements presented. Achieving near-perfect scaling is, after all, much easier with an unoptimised baseline. [1] While the authors are clearly aware of the redundant ghost particle communication within a shared memory space, no attempt is made to quantify single-node/single-socket performance. In particular, despite several of the dominant algorithm components being likely to contain memory-intensive operations (sparse matrix solves, KD-tree lookups, potentially redundant communication buffer copies), no attempt is made to measure or quantify memory-bandwidth utilisation, which could potentially explain some of the observed scaling behaviour (eg. the drop-off in Fig. 9 possibly?). To compound this further, several of the

performance graphs are plotting wall time against number of particles per partition, rather than time vs number of core/sockets/nodes, which makes reasoning about scaling behaviour counter-intuitive; and the description of the benchmarking hardware is very loose.

Nevertheless, the authors clearly demonstrate that their implementation of the dominant KD-tree lookups scales as N log(N), and that at high core counts their predicted parallel efficiency can be achieved. The theoretical derivation and comparison of parallel efficiency and scaling behaviour and subsequent mapping to achieved results adds value to the paper.

Overall, I find the topic of the paper and its novel contribution, as stated in the introduction, to be relevant for GMD. The paper has strong potential, but requires significant revisions before I can recommend acceptance, in particular due to the quality of the performance analysis. While I appreciate that fitting shared-memory parallelism is beyond the scope of the benchmarking code, I believe the paper would be significantly improved if the authors would establish single-node performance more rigorously, especially with regards to memory bandwidth utilisation, before focusing on scaling behaviour across multiple sockets/nodes and modelling of algorithmic scaling behaviour.

[1] https://blogs.fau.de/hager/archives/5260

Detailed list of comments:

==========================

* Overly strong statements, like "reducing a 5-hour simulation to 8 seconds" (pg 2, l. 17) are undermined by the fact that single-node parallelism is not explored, benchmarked or optimised for.

* Section 2.3: The test hardware configuration is less than ideal for the given analysis, as the test nodes vary significantly ("8-24 cores with clock speeds ranging from 2.5GHz-3.06GHz"). For a rigorous performance analysis a fixed subset or partition of nodes with identical core types should be used, and the CPU model, as well as available memory needs be stated here.

* While I appreciate that availability may be a limiting factor here, industrial compilers (Intel / Cray / Nvidia) are likely to achieve better single-node performance than gfortran -O3. If available on the test system, investigating reporting whether or not that changes the baseline (single core / single node) runs would add weight to the performance investigation. This can should be considered optional.

* As clearly stated in section 5, accesses to particle of a neighbouring processors still go through a full MPI-driven halo exchange protocol, thus likely incurring significant memory movement that could further be optimised. Without appropriate treatment of shared memory parallelism, either via OpenMP or p-threads, or explicit MPI shared memory programming (one-sided) this is likely to incur significant performance overheads that should be accounted for and evalauted in scalability studies such as this.

* Section 6.1 Cost analysis: This seems like a rather lengthy way to explain and derive commonly known scaling behaviour (linear weak scaling with constant per-processor work and KD-tree lookup scaling as n log(n)). I wonder if the manuscript can be made more concise by stating these facts and offering evidence that indeed the implementation behaves as expected?

* Figure 6: While Fig. 6 (a) does appear to match the predicted growth very well, the slope in Fig. 6 (b) is arguably steeper than the prediction. Could this be clarified or commented on?

* Line 335-338; this concept is commonly referred to as "weak scaling" and should be named as such.

* Figure 8: The x-scale should be inverted. Furthermore, this figure does not actually add any value to the discussion, as the inverse relationship between total particles and local particles is trivial.

* Similarly, Fig 7. shows multiple instances of weak-scaling snapshots that are nowhere near the regime limits. Please consider replacing Fig 7. and Fig. 8 with an actual strong-scaling graph (wall time vs. number of cores) for total particle values chosen in Fig 7.

* Figure 9.: It is increasingly hard to think about scaling behaviour with an implicit number of processors. Please plot wall time against number of cores when assessing strong scaling!

* Fig 9: Are the drop-off points related to scaling beyond a single memory socket? The near-linear scaling beyond that point (counter-intuitively to the left!) suggests that scaling behaviour is memory-bandwidth limited and and the addition of memory-sockets determines the real scaling factor. But again, this is hard to assess accurately without appropriate strong-scaling graphs.

* Units of time on y-axis missing for almost all wall time plots

* Section 7.2; line 467: This highlights one of the major shortcomings of the paper! While the single-core base case does not enter the MPI routine, it does represent an appropriate baseline for single-node/socket shared memory scaling. The authors not only neglect to account for this, but instead raise the baseline to an arbitrary 100-core baseline. To improve the quality of the paper, I recommend doing a specific shared-memory scaling analysis to determine if indeed scaling within a single memory socket follows the expected trends, and/or if the redundant use of ghost exchanges within a shared memory space affects performance. Then, once a single-socket/single-node baseline is established, scaling behaviour across multiple nodes/sockets can be analysed, thus giving a more grounded baseline for the speed-up plots in Fig. 15.

* Line 482 and onward: The technical term for "above perfect efficiency" is "super-linear speed-ups". The explicit explanation of equation (30) might also be superfluous; instead a more detailed explanation of how super-linear speed-ups are achieved would certainly elevate the paper.

* Section 8 outlines the lack of shared-memory parallelism; the question on line 522 already indicates that the authors are aware of this issue. Unfortunately, to my mind, this issue impacts the validity of the findings, since the demonstrated scaling behaviour (and the "hours to seconds" claim") can easily be put down to an insufficiently optimised baseline run. As no attempt is made to quantify the performance of the implementation with respect to single-node performance this casts doubt over the importance of the findings.

* Future consideration: Latency hiding by overlapping ghost communication with compute work; possibly across time steps. Not for this work, but as a follow-on optimisation.

Citation: https://doi.org/10.5194/egusphere-2022-781-RC2

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2022-781', Anonymous Referee #1, 12 Sep 2022

The manuscript falls outside the scope of EGUsphere. This should go to Computational Geosciences or JCP.

Citation: https://doi.org/10.5194/egusphere-2022-781-RC1
- AC1:
  'Reply to RC1', Lucas Schauer, 21 Sep 2022
  
  The journal aims and scope clearly indicate that the purpose of GMD is to allow publication of papers regarding numerical simulations of earth system components. The diffusion and/or heat equation are ubiquitous throughout the vast coupled partial differential equations typically used to describe earth system process. If this were a “Model description paper” then the reviewer’s position could be considered correct. However, we point to the description of GMD manuscript types (https://www.geoscientific-model-development.net/about/manuscript_types.html#item2), which, for “Development and Technical papers”, specifically invites “…model improvements such as the speed or accuracy of numerical integration schemes…”. This is precisely what is done in our submission. The problem of efficiently solving interacting, Lagrangian particle simulations is difficult, and we offer an approach with scalable efficiency across parallel computing architectures (multi-processor and multi-node clusters). The approach also offers the unique and novel ability to predict scaling behavior, which is especially difficult in most complex, parallelized models. These reasons clearly support inclusion of this work in GMD as it is 1) within the overall scope, and 2) specifically fits within the defined article types.
  
  Citation: https://doi.org/10.5194/egusphere-2022-781-AC1
  - EC1: 'Reply on AC1', David Ham, 11 Oct 2022
    
    For the record, the authors are correct. This manuscript is clearly in scope for GMD.
    
    Citation: https://doi.org/10.5194/egusphere-2022-781-EC1
RC2: 'Comment on egusphere-2022-781', Anonymous Referee #2, 23 Sep 2022

This paper highlights a multi-dimensional, parallel domain decomposition method for mass-transfer particle tracking methods. The authors focus on demonstrating the parallel scalability of a two- and three-dimensional "checkerboard" partitioning, and highlight a theoretical analysis and prediction of scalability at high core counts as their main novel contribution. A grand claim of reducing a "5-hour simulation to 8 seconds" is made to highlight the near-linear scalability of the chosen method.

The paper is well written and gives a detailed and concise overview of the problem definition and theoretical context in which the problem is set. A clear description of the particular class of particle tracking algorithm is provided and appropriate comparisons to other related fields are made. The performance analysis and theoretical prediction of scalability, however, neglects certain established principles (for example weak and strong scaling) and attempts to re-derive some well-known performance analysis steps; for example using an arbitrary 100-core baseline to normalise super-linear speed-ups, instead of normalising relative speed-up over memory sockets or nodes.

Unfortunately, however, the lack of any performance quantification of the baseline result, in particular with respect to shared-memory scaling, casts doubt over the achievements presented. Achieving near-perfect scaling is, after all, much easier with an unoptimised baseline. [1] While the authors are clearly aware of the redundant ghost particle communication within a shared memory space, no attempt is made to quantify single-node/single-socket performance. In particular, despite several of the dominant algorithm components being likely to contain memory-intensive operations (sparse matrix solves, KD-tree lookups, potentially redundant communication buffer copies), no attempt is made to measure or quantify memory-bandwidth utilisation, which could potentially explain some of the observed scaling behaviour (eg. the drop-off in Fig. 9 possibly?). To compound this further, several of the

performance graphs are plotting wall time against number of particles per partition, rather than time vs number of core/sockets/nodes, which makes reasoning about scaling behaviour counter-intuitive; and the description of the benchmarking hardware is very loose.

Nevertheless, the authors clearly demonstrate that their implementation of the dominant KD-tree lookups scales as N log(N), and that at high core counts their predicted parallel efficiency can be achieved. The theoretical derivation and comparison of parallel efficiency and scaling behaviour and subsequent mapping to achieved results adds value to the paper.

Overall, I find the topic of the paper and its novel contribution, as stated in the introduction, to be relevant for GMD. The paper has strong potential, but requires significant revisions before I can recommend acceptance, in particular due to the quality of the performance analysis. While I appreciate that fitting shared-memory parallelism is beyond the scope of the benchmarking code, I believe the paper would be significantly improved if the authors would establish single-node performance more rigorously, especially with regards to memory bandwidth utilisation, before focusing on scaling behaviour across multiple sockets/nodes and modelling of algorithmic scaling behaviour.

[1] https://blogs.fau.de/hager/archives/5260

Detailed list of comments:

==========================

* Overly strong statements, like "reducing a 5-hour simulation to 8 seconds" (pg 2, l. 17) are undermined by the fact that single-node parallelism is not explored, benchmarked or optimised for.

* Section 2.3: The test hardware configuration is less than ideal for the given analysis, as the test nodes vary significantly ("8-24 cores with clock speeds ranging from 2.5GHz-3.06GHz"). For a rigorous performance analysis a fixed subset or partition of nodes with identical core types should be used, and the CPU model, as well as available memory needs be stated here.

* While I appreciate that availability may be a limiting factor here, industrial compilers (Intel / Cray / Nvidia) are likely to achieve better single-node performance than gfortran -O3. If available on the test system, investigating reporting whether or not that changes the baseline (single core / single node) runs would add weight to the performance investigation. This can should be considered optional.

* As clearly stated in section 5, accesses to particle of a neighbouring processors still go through a full MPI-driven halo exchange protocol, thus likely incurring significant memory movement that could further be optimised. Without appropriate treatment of shared memory parallelism, either via OpenMP or p-threads, or explicit MPI shared memory programming (one-sided) this is likely to incur significant performance overheads that should be accounted for and evalauted in scalability studies such as this.

* Section 6.1 Cost analysis: This seems like a rather lengthy way to explain and derive commonly known scaling behaviour (linear weak scaling with constant per-processor work and KD-tree lookup scaling as n log(n)). I wonder if the manuscript can be made more concise by stating these facts and offering evidence that indeed the implementation behaves as expected?

* Figure 6: While Fig. 6 (a) does appear to match the predicted growth very well, the slope in Fig. 6 (b) is arguably steeper than the prediction. Could this be clarified or commented on?

* Line 335-338; this concept is commonly referred to as "weak scaling" and should be named as such.

* Figure 8: The x-scale should be inverted. Furthermore, this figure does not actually add any value to the discussion, as the inverse relationship between total particles and local particles is trivial.

* Similarly, Fig 7. shows multiple instances of weak-scaling snapshots that are nowhere near the regime limits. Please consider replacing Fig 7. and Fig. 8 with an actual strong-scaling graph (wall time vs. number of cores) for total particle values chosen in Fig 7.

* Figure 9.: It is increasingly hard to think about scaling behaviour with an implicit number of processors. Please plot wall time against number of cores when assessing strong scaling!

* Fig 9: Are the drop-off points related to scaling beyond a single memory socket? The near-linear scaling beyond that point (counter-intuitively to the left!) suggests that scaling behaviour is memory-bandwidth limited and and the addition of memory-sockets determines the real scaling factor. But again, this is hard to assess accurately without appropriate strong-scaling graphs.

* Units of time on y-axis missing for almost all wall time plots

* Section 7.2; line 467: This highlights one of the major shortcomings of the paper! While the single-core base case does not enter the MPI routine, it does represent an appropriate baseline for single-node/socket shared memory scaling. The authors not only neglect to account for this, but instead raise the baseline to an arbitrary 100-core baseline. To improve the quality of the paper, I recommend doing a specific shared-memory scaling analysis to determine if indeed scaling within a single memory socket follows the expected trends, and/or if the redundant use of ghost exchanges within a shared memory space affects performance. Then, once a single-socket/single-node baseline is established, scaling behaviour across multiple nodes/sockets can be analysed, thus giving a more grounded baseline for the speed-up plots in Fig. 15.

* Line 482 and onward: The technical term for "above perfect efficiency" is "super-linear speed-ups". The explicit explanation of equation (30) might also be superfluous; instead a more detailed explanation of how super-linear speed-ups are achieved would certainly elevate the paper.

* Section 8 outlines the lack of shared-memory parallelism; the question on line 522 already indicates that the authors are aware of this issue. Unfortunately, to my mind, this issue impacts the validity of the findings, since the demonstrated scaling behaviour (and the "hours to seconds" claim") can easily be put down to an insufficiently optimised baseline run. As no attempt is made to quantify the performance of the implementation with respect to single-node performance this casts doubt over the importance of the findings.

* Future consideration: Latency hiding by overlapping ghost communication with compute work; possibly across time steps. Not for this work, but as a follow-on optimisation.

Citation: https://doi.org/10.5194/egusphere-2022-781-RC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Lucas Schauer on behalf of the Authors (17 Nov 2022) Author's response Author's tracked changes Manuscript

ED: Reconsider after major revisions (01 Dec 2022) by David Ham

Dear Mr Schauer,

Thank you for the submission of your revised manuscript. I am in a slightly unfortunate position here in that the first reviewer, as you know, misconstrued the scope of the journal. The second reviewer provided a very thorough review, but have indicated that they do not wish to review a revised manuscript. Since I think that the second reviewer did a good job of identifying the issues in the manuscript, I don't intend to send this manuscript to a further reviewer as this will just delay the process. In addition, the manuscript is firmly in my area of expertise so I'm simply going to compare your revised manuscript to the issues raised by reviewer 2.

I note that you've gone through the more technical issues raised by reviewer 2, and that part is good. Where we still have a significant issue is this part of the review:

"Unfortunately, however, the lack of any performance quantification of the baseline result, in particular with respect to shared-memory scaling, casts doubt over the achievements presented. Achieving near-perfect scaling is, after all, much easier with an unoptimised baseline."

I think you have possibly misinterpreted what the reviewer is saying here. What the reviewer is asking for is evidence that the scalar code, executed on 1 core (or maybe one socket or one node), is actually fast. In this context, "fast" means effectively using a significant percentage of either the available memory bandwidth or the available floating point operations on the hardware. On modern hardware, Lagrangian particle tracking is probably a bandwidth-limited operation (i.e. the limit is the amount of data which can be moved to the processor, not the number of floating point operations which can be achieved). Since bandwidth is shared among cores, it might therefore be appropriate to measure the bandwidth achieved across either a socket or a node. The outcome is often shown as a roofline plot (see, for example, figure 7 in https://doi.org/10.1177/1094342020945005).

The reason that the reviewer is asking for this is that if your base algorithm implementation is poor (e.g. if you're achieving a couple of percent of the hardware bandwidth) then the speedups reported are artificially high since the costs of parallelisation are hidden by the slow base algorithm implementation. Conversely, if you are achieving tens of percent of the available bandwidth then the parallel results reported are indeed an indication of a good scheme implemented well.

So what this manuscript needs is some sort of indicator that the base performance uses a significant part of the resources available on one compute unit (whether that is a core, a socket, or a node). If you are unsure how to do this, I am happy to answer questions on this matter.

Yours,

David

Hide

AR by Lucas Schauer on behalf of the Authors (13 Jan 2023) Author's response Author's tracked changes Manuscript

ED: Publish subject to technical corrections (13 Jan 2023) by David Ham

AR by Lucas Schauer on behalf of the Authors (13 Jan 2023) Manuscript

Journal article(s) based on this preprint

03 Feb 2023

Parallelized domain decomposition for multi-dimensional Lagrangian random walk mass-transfer particle tracking schemes

Lucas Schauer, Michael J. Schmidt, Nicholas B. Engdahl, Stephen D. Pankavich, David A. Benson, and Diogo Bolster

Geosci. Model Dev., 16, 833–849, https://doi.org/10.5194/gmd-16-833-2023,https://doi.org/10.5194/gmd-16-833-2023, 2023

Short summary

Lucas Schauer, Michael J. Schmidt, Nicholas B. Engdahl, Stephen D. Pankavich, David A. Benson, and Diogo Bolster

Viewed

Total article views: 524 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
419	92	13	524	1	2

HTML: 419
PDF: 92
XML: 13
Total: 524
BibTeX: 1
EndNote: 2

Views and downloads (calculated since 22 Aug 2022)

Month	HTML	PDF	XML	Total
Aug 2022	103	30	4	137
Sep 2022	122	24	6	152
Oct 2022	85	17	2	104
Nov 2022	42	8	1	51
Dec 2022	31	5	0	36
Jan 2023	33	8	0	41
Feb 2023	3	0	3
Mar 2023	0
Apr 2023	0
May 2023	0
Jun 2023	0
Jul 2023	0
Aug 2023	0
Sep 2023	0
Oct 2023	0
Nov 2023	0
Dec 2023	0
Jan 2024	0
Feb 2024	0
Mar 2024	0
Apr 2024	0
May 2024	0
Jun 2024	0
Jul 2024	0
Aug 2024	0
Sep 2024	0

Cumulative views and downloads (calculated since 22 Aug 2022)

Month	HTML	PDF	XML	Total
Aug 2022	103	30	4	137
Sep 2022	122	24	6	152
Oct 2022	85	17	2	104
Nov 2022	42	8	1	51
Dec 2022	31	5	0	36
Jan 2023	33	8	0	41
Feb 2023	3	0	3
Mar 2023	0
Apr 2023	0
May 2023	0
Jun 2023	0
Jul 2023	0
Aug 2023	0
Sep 2023	0
Oct 2023	0
Nov 2023	0
Dec 2023	0
Jan 2024	0
Feb 2024	0
Mar 2024	0
Apr 2024	0
May 2024	0
Jun 2024	0
Jul 2024	0
Aug 2024	0
Sep 2024	0

Viewed (geographical distribution)

Total article views: 498 (including HTML, PDF, and XML) Thereof 498 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 04 Sep 2024

Download

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Preprint (2501 KB)
Metadata XML

Short summary

We develop a multi-dimensional, parallelized domain decomposition strategy for mass-transfer particle tracking methods in two and three dimensions, investigate different procedures for decomposing the domain, and prescribe an optimal tiling based on physical problem parameters and the number of available CPU cores. For an optimally subdivided diffusion problem, the parallelized algorithm achieves nearly perfect linear speedup in comparison with the serial run up to thousands of cores.


Total:	0
HTML:	0
PDF:	0
XML:	0