This work is distributed under the Creative Commons Attribution 4.0 License.
An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)
Abstract. High-resolution models are essential for simulating small-scale processes and topographical features, which play a crucial role in understanding meteorological and oceanic events as well as climatic patterns. High-resolution modeling requires substantial improvements in the parallel scalability of a model to reduce runtime, while massive parallelism entails intensive communication. Point-to-point communication is widely used for neighborhood communication in earth system models because of its flexibility. The distributed graph topology, first introduced in MPI version 2.2, provides a scalable and informative communication method. It has demonstrated significant speedups over point-to-point communication on a variety of synthetic and real-world communication graph datasets, but its application to neighborhood communication in earth system models has rarely been studied. In this study, we implemented neighborhood communication using both the traditional point-to-point communication method and the distributed graph communication topology, and compared their performance in a case study of the Finite Volume WAve Modeling (FVWAM). Across all tests with 512 to 32,768 processes, the distributed graph communication topology achieved communication time speedups of 1.28 to 5.63 over the point-to-point communication method. For operational global wave forecasts with 1,024 processes, the runtime of the FVWAM was reduced by 40.2 % when the point-to-point communication method was replaced by the distributed graph communication topology.
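For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal, illustrative sketch (not the FVWAM source; the neighbour lists are placeholders) of how a halo-exchange pattern can be declared as an MPI distributed graph topology, including the reorder flag that lets the library remap ranks:

```c
#include <mpi.h>

/* Illustrative sketch (not the FVWAM code): declaring a halo-exchange
 * pattern as an MPI distributed graph topology. Each rank names the
 * neighbours it receives from (sources) and sends to (destinations);
 * reorder = 1 allows the MPI library to permute ranks to better match
 * the underlying network. */
int build_halo_comm(MPI_Comm comm, int nsrc, const int *sources,
                    int ndst, const int *destinations, MPI_Comm *graph_comm)
{
    int reorder = 1; /* permit the library to remap ranks */
    return MPI_Dist_graph_create_adjacent(
        comm,
        nsrc, sources, MPI_UNWEIGHTED,      /* ranks we receive halos from */
        ndst, destinations, MPI_UNWEIGHTED, /* ranks we send halos to      */
        MPI_INFO_NULL, reorder, graph_comm);
}
```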
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2024-2515', Anonymous Referee #1, 27 Dec 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2515/egusphere-2024-2515-RC1-supplement.pdf
-
AC1: 'Reply on RC1', Renbo Pang, 15 Jan 2025
Dear Reviewer,
We would like to sincerely thank you for your thorough and constructive review of our manuscript, "An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)". Your insightful comments have been invaluable in improving the quality of our work. Please find our detailed responses to each of your comments in the attached PDF file.
Sincerely,
Renbo PANG, on behalf of the co-authors
-
RC2: 'Comment on egusphere-2024-2515', Anonymous Referee #2, 06 Jan 2025
Paper Review: "An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)"
This paper presents an implementation of halo exchanges in the FVWAM model using MPI's distributed graph topology and a performance comparison against a baseline implementation using point-to-point communication primitives.
The paper provides a detailed comparison from tests with 512 to 32,768 processes and shows that the speedup from the distributed graph topology, with and without reordered processes, ranged from 1.28 to 5.63.
There is some important context missing from the article that could shed light on the significance of the performance improvements:
* What is the network interconnect and topology of the target system?
* Were experiments repeated with different node allocations assuming there is a batch system scheduling resources?
* Were the different experiment types (point-to-point, distributed, distributed with reordering) conducted using the same node allocation for consistency?
* Were the experiments at each processor count (e.g., 512) conducted multiple times to rule out network variability and interference from other traffic on the network?
* Is there any performance variability across runs?
* Can the authors elaborate on the differences, if any, between their approach and using the MPI-3 neighbourhood collectives? (See the sketch after this list.)
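For context on the last question: distributed graph communicators are exactly what the MPI-3 neighbourhood collectives operate on. A hedged sketch (illustrative buffers and counts, not the paper's code) of one halo-exchange step through MPI_Neighbor_alltoallv on such a communicator:

```c
#include <mpi.h>

/* Sketch: one halo-exchange step via an MPI-3 neighbourhood collective
 * on a communicator built with MPI_Dist_graph_create_adjacent.
 * Per-neighbour counts and displacements are placeholders; a real
 * model fills them from its mesh partition. */
void halo_exchange(MPI_Comm graph_comm,
                   const double *sendbuf, const int *sendcounts, const int *sdispls,
                   double *recvbuf, const int *recvcounts, const int *rdispls)
{
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           graph_comm);
}
```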
The authors allude to the following factors as the primary contributors to the improvement:
> First, as the number of processes increases, the volume of exchanged data decreases, thereby reducing the speedup ratio achieved by the distributed graph communication topology.
It appears that the application is network bandwidth bound at low processor counts. It would be very enlightening to provide details of the communication volume and interconnect specifications to confirm if that's the case.
> Second, received data are continuously searched and inserted into wave action (N) at once in the distributed graph communication topology, which can improve cache hit rates.
The presumption about improved cache hit rates can be confirmed by obtaining hardware performance counter information. I'm skeptical that cache performance played such a big role. The improvement could be better explained by the MPI library implementation ordering the communication operations optimally.
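One way to test that hypothesis, offered as a hedged sketch rather than a prescription: instrument the exchange region with PAPI hardware counters (event availability varies by CPU; PAPI_L1_DCM is a common preset):

```c
#include <papi.h>

/* Hedged sketch: count data-cache misses around a region of interest
 * (e.g., the halo exchange plus the unpack into the wave action array)
 * to test the cache-hit-rate hypothesis with hardware counters. */
long long measure_cache_misses(void (*region)(void))
{
    int eventset = PAPI_NULL;
    long long misses = 0;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM); /* L1 data-cache misses */
    PAPI_start(eventset);
    region();                              /* the code under test */
    PAPI_stop(eventset, &misses);
    return misses;
}
```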
The biggest weakness of this work is the limited performance data from just one platform and MPI implementation. It would greatly strengthen the work if the performance optimization could be demonstrated on multiple machines with different interconnects and topologies, using recent versions of community standard libraries (OpenMPI, MPICH) or recent vendor implementations. It would make the case for neighbourhood collectives for earth system workloads stronger.
There is prior work illustrating performance improvements from reordering MPI processes by taking network topology into account, e.g., https://dl.acm.org/doi/10.1145/2851553.2851575
What is the current reordering strategy, in case I missed it? Did the authors consider any advanced reordering strategies?
The paper refers to pre-posting receives using MPI_Irecv. However, they mention:
> An alternative is to call the non-blocking communication interface MPI_Isend for sending data, but it is infrequently utilized due to the increased complexity it introduces to the sending operation
It's not inherently that complex, as many applications use non-blocking sends effectively; a minimal sketch of the pattern follows below. I wonder what the performance impact would be if the authors used non-blocking operations.
I was looking forward to this article hoping to hear about novel techniques that could improve communication performance at scale. However, it was slightly disappointing to see that the benefit of the proposed optimization dramatically tapers off as we go from a low (512) to a high (32,768) number of processes.
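The pre-posted non-blocking pattern referred to above, as a minimal sketch (neighbour lists, buffers, and counts are illustrative placeholders, not the authors' implementation):

```c
#include <mpi.h>

/* Sketch of the point-to-point pattern discussed above: pre-post all
 * receives, issue non-blocking sends, then wait on everything. */
void halo_exchange_p2p(MPI_Comm comm, int nneigh, const int *neighbours,
                       double **sendbuf, double **recvbuf, const int *counts)
{
    MPI_Request reqs[2 * nneigh];           /* C99 variable-length array */
    for (int i = 0; i < nneigh; i++)        /* pre-post receives */
        MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[i]);
    for (int i = 0; i < nneigh; i++)        /* non-blocking sends */
        MPI_Isend(sendbuf[i], counts[i], MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[nneigh + i]);
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
}
```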
On heterogeneous GPU-based supercomputers like the Frontier exascale system, the number of nodes is relatively low (9,408) due to the fat-node architecture, compared to CPU-based supercomputers like Fugaku (158,976 nodes). In this overall context, the benefit of a communication optimization is more relevant at scale, when there are potentially hundreds of thousands of MPI endpoints (e.g., roughly 600k on Fugaku with 4 MPI ranks per node mapping optimally to the NUMA domains there).
Ref: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
https://www.fujitsu.com/global/about/innovation/fugaku/specifications/
Pg 15, lines 295-302:
To conclude, I understand the authors' motivation to improve the performance of their production simulations and the relative significance for their workload. Additional performance data would be highly informative and would make this work more generally applicable.
Minor comments:
Pg 2, line 29: There are better references than Sukhija et al., 2022 for the Frontier exascale supercomputer. I suggest using one of the papers from the Supercomputing conference: https://dl.acm.org/doi/abs/10.1145/3581784.3607089
> Sukhija, N., Bautista, E., Butz, D., and Whitney, C.: Towards anomaly detection for monitoring power consumption in HPC facilities, in: Proceedings of the 14th International Conference on Management of Digital EcoSystems, pp. 1–8, 2022
The performance sections in the paper are a bit verbose and redundant in pointing to the information in the figures. It might be better to succinctly highlight the results and elaborate further on the reasons behind the improvement.
Citation: https://doi.org/10.5194/egusphere-2024-2515-RC2
Data sets
Datasets and source codes related to this paper Renbo Pang et al. https://zenodo.org/doi/10.5281/zenodo.13325957
Model code and software
Source codes of three versions of the FVWAM Renbo Pang et al. https://github.com/victor-888888/fvwam