This work is distributed under the Creative Commons Attribution 4.0 License.
An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)
Abstract. High-resolution models are essential for simulating small-scale processes and topographical features, which play a crucial role in understanding meteorological and oceanic events as well as climatic patterns. High-resolution modeling requires substantial improvements in the parallel scalability of a model to reduce runtime, while massive parallelism entails intensive communication. Point-to-point communication is widely used for neighborhood communication in earth system models because of its flexibility. The distributed graph topology, first introduced in MPI version 2.2, provides a scalable and informative communication method. It has demonstrated significant speedups over point-to-point communication on a variety of synthetic and real-world communication graph datasets, but its application to neighborhood communication in earth system models has rarely been studied. In this study, we implemented neighborhood communication using both the traditional point-to-point communication method and the distributed graph communication topology, and compared their performance in a case study of the Finite Volume WAve Modeling (FVWAM). Across all tests with 512 to 32,768 processes, the distributed graph communication topology achieved communication time speedups of 1.28 to 5.63 over the point-to-point communication method. For operational global wave forecasts with 1,024 processes, the runtime of the FVWAM was reduced by 40.2 % when the point-to-point communication method was replaced by the distributed graph communication topology.
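For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal, illustrative sketch (not the FVWAM source; the neighbour lists are placeholders) of how a halo-exchange pattern can be declared as an MPI distributed graph topology, including the reorder flag that lets the library remap ranks:

```c
#include <mpi.h>

/* Illustrative sketch (not the FVWAM code): declaring a halo-exchange
 * pattern as an MPI distributed graph topology. Each rank names the
 * neighbours it receives from (sources) and sends to (destinations);
 * reorder = 1 allows the MPI library to permute ranks to better match
 * the underlying network. */
int build_halo_comm(MPI_Comm comm, int nsrc, const int *sources,
                    int ndst, const int *destinations, MPI_Comm *graph_comm)
{
    int reorder = 1; /* permit the library to remap ranks */
    return MPI_Dist_graph_create_adjacent(
        comm,
        nsrc, sources, MPI_UNWEIGHTED,      /* ranks we receive halos from */
        ndst, destinations, MPI_UNWEIGHTED, /* ranks we send halos to      */
        MPI_INFO_NULL, reorder, graph_comm);
}
```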
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2024-2515', Anonymous Referee #1, 27 Dec 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2515/egusphere-2024-2515-RC1-supplement.pdf
-
AC1: 'Reply on RC1', Renbo Pang, 15 Jan 2025
Dear Reviewer,
We would like to sincerely thank you for your thorough and constructive review of our manuscript, "An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)". Your insightful comments have been invaluable in improving the quality of our work. Please find our detailed responses to each of your comments in the attached PDF file.
Sincerely,
Renbo PANG, on behalf of the co-authors
-
RC2: 'Comment on egusphere-2024-2515', Anonymous Referee #2, 06 Jan 2025
Paper Review: "An Effective Communication Topology for Performance Optimization: A Case Study of the Finite Volume WAve Modeling (FVWAM)"
This paper presents an implementation of halo exchanges in the FVWAM model using MPI's distributed graph topology and a performance comparison against a baseline implementation using point-to-point communication primitives.
The paper provides a detailed comparison from tests with 512 to 32,768 processes and shows that the speedup from the distributed graph topology, with and without reordered processes, ranged from 1.28 to 5.63.
There is some important context missing from the article that could shed light on the significance of the performance improvements:
* What is the network interconnect and topology of the target system?
* Were experiments repeated with different node allocations assuming there is a batch system scheduling resources?
* Were the different experiment types (point-to-point, distributed, distributed with reordering) conducted using the same node allocation for consistency?
* Were the experiments at each processor count (e.g., 512) conducted multiple times to rule out network variability and interference from other traffic on the network?
* Is there any performance variability across runs?
* Can the authors elaborate on the differences, if any, between their approach and using the MPI-3 neighbourhood collectives? (See the sketch after this list.)
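For context on the last question: distributed graph communicators are exactly what the MPI-3 neighbourhood collectives operate on. A hedged sketch (illustrative buffers and counts, not the paper's code) of one halo-exchange step through MPI_Neighbor_alltoallv on such a communicator:

```c
#include <mpi.h>

/* Sketch: one halo-exchange step via an MPI-3 neighbourhood collective
 * on a communicator built with MPI_Dist_graph_create_adjacent.
 * Per-neighbour counts and displacements are placeholders; a real
 * model fills them from its mesh partition. */
void halo_exchange(MPI_Comm graph_comm,
                   const double *sendbuf, const int *sendcounts, const int *sdispls,
                   double *recvbuf, const int *recvcounts, const int *rdispls)
{
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           graph_comm);
}
```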
The authors allude to the following factors as the primary contributors to the improvement:
> First, as the number of processes increases, the volume of exchanged data decreases, thereby reducing the speedup ratio achieved by the distributed graph communication topology.
It appears that the application is network bandwidth bound at low processor counts. It would be very enlightening to provide details of the communication volume and interconnect specifications to confirm if that's the case.
> Second, received data are continuously searched and inserted into wave action (N) at once in the distributed graph communication topology, which can improve cache hit rates.
The presumption about improved cache hit rates can be confirmed by obtaining hardware performance counter information. I'm skeptical that cache performance played such a big role. The improvement could be better explained by the MPI library implementation ordering the communication operations optimally.
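One way to test that hypothesis, offered as a hedged sketch rather than a prescription: instrument the exchange region with PAPI hardware counters (event availability varies by CPU; PAPI_L1_DCM is a common preset):

```c
#include <papi.h>

/* Hedged sketch: count data-cache misses around a region of interest
 * (e.g., the halo exchange plus the unpack into the wave action array)
 * to test the cache-hit-rate hypothesis with hardware counters. */
long long measure_cache_misses(void (*region)(void))
{
    int eventset = PAPI_NULL;
    long long misses = 0;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM); /* L1 data-cache misses */
    PAPI_start(eventset);
    region();                              /* the code under test */
    PAPI_stop(eventset, &misses);
    return misses;
}
```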
The biggest weakness of this work is the limited performance data from just one platform and MPI implementation. It would greatly strengthen the work if the performance optimization could be demonstrated on multiple machines with different interconnects and topologies, using recent versions of community standard libraries (OpenMPI, MPICH) or recent vendor implementations. It would make the case for neighbourhood collectives for earth system workloads stronger.
There is prior work illustrating performance improvements from reordering MPI processes by taking network topology into account, e.g., https://dl.acm.org/doi/10.1145/2851553.2851575
What is the current reordering strategy, in case I missed it? Did the authors consider any advanced reordering strategies?
The paper refers to pre-posting receives using MPI_Irecv. However, they mention:
> An alternative is to call the non-blocking communication interface MPI_Isend for sending data, but it is infrequently utilized due to the increased complexity it introduces to the sending operation
It's not inherently that complex, as many applications use non-blocking sends effectively; a minimal sketch of the pattern follows below. I wonder what the performance impact would be if the authors used non-blocking operations.
I was looking forward to this article hoping to hear about novel techniques that could improve communication performance at scale. However, it was slightly disappointing to see that the benefit of the proposed optimization dramatically tapers off as we go from a low (512) to a high (32,768) number of processes.
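The pre-posted non-blocking pattern referred to above, as a minimal sketch (neighbour lists, buffers, and counts are illustrative placeholders, not the authors' implementation):

```c
#include <mpi.h>

/* Sketch of the point-to-point pattern discussed above: pre-post all
 * receives, issue non-blocking sends, then wait on everything. */
void halo_exchange_p2p(MPI_Comm comm, int nneigh, const int *neighbours,
                       double **sendbuf, double **recvbuf, const int *counts)
{
    MPI_Request reqs[2 * nneigh];           /* C99 variable-length array */
    for (int i = 0; i < nneigh; i++)        /* pre-post receives */
        MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[i]);
    for (int i = 0; i < nneigh; i++)        /* non-blocking sends */
        MPI_Isend(sendbuf[i], counts[i], MPI_DOUBLE,
                  neighbours[i], 0, comm, &reqs[nneigh + i]);
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
}
```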
On heterogeneous GPU-based supercomputers like the Frontier exascale system, the number of nodes is relatively low (9,408) due to the fat-node architecture, compared to CPU-based supercomputers like Fugaku (158,976 nodes). In this overall context, the benefit of a communication optimization is more relevant at scale, when there are potentially hundreds of thousands of MPI endpoints (e.g., roughly 600k on Fugaku with 4 MPI ranks per node mapping optimally to the NUMA domains there).
Ref: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
https://www.fujitsu.com/global/about/innovation/fugaku/specifications/
Pg 15, lines 295-302:
To conclude, I understand the authors' motivation to improve the performance of their production simulations and the relative significance for their workload. Additional performance data would be highly informative and would make this work more generally applicable.
Minor comments:
Pg 2, line 29: There are better references than Sukhija et al., 2022 for the Frontier exascale supercomputer. I suggest using one of the papers from the Supercomputing conference: https://dl.acm.org/doi/abs/10.1145/3581784.3607089
> Sukhija, N., Bautista, E., Butz, D., and Whitney, C.: Towards anomaly detection for monitoring power consumption in HPC facilities, in: Proceedings of the 14th International Conference on Management of Digital EcoSystems, pp. 1–8, 2022
The performance sections in the paper are a bit verbose and redundant in pointing to the information in the figures. It might be better to succinctly highlight the results and elaborate further on the reasons behind the improvement.
Citation: https://doi.org/10.5194/egusphere-2024-2515-RC2
Data sets
Datasets and source codes related to this paper Renbo Pang et al. https://zenodo.org/doi/10.5281/zenodo.13325957
Model code and software
Source codes of three versions of the FVWAM Renbo Pang et al. https://github.com/victor-888888/fvwam