- Andreas Irmler (ORCID: orcid.org/0000-0003-0525-7772),
- Raghavendra Kanakagiri,
- Sebastian T. Ohlmann,
- Edgar Solomonik &
- Andreas Grüneis
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14100)
Abstract
We propose an algorithm that aims at minimizing the inter-node communication volume for distributed and memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves significantly improved performance for matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes compared to conventional implementations that do not use node-aware processor grids. Our implementation shows good performance when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to the discussion of the performance for matrix-matrix multiplication, we also investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm is also able to improve the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
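The key idea from the abstract can be illustrated with a small sketch. This is not the paper's actual CTF implementation; it is a hypothetical toy model that counts how many peers in a processor-grid row live on a different node under a default rank ordering versus a node-aware ordering, assuming 16 ranks, 4 ranks per node, and a 4x4 process grid (all numbers chosen for illustration only).

```python
# Toy model: ranks are filled node by node (ranks 0-3 on node 0, etc.).
# In SUMMA-like algorithms, broadcasts run along grid rows/columns, so
# keeping one grid dimension intra-node cuts inter-node traffic.

RANKS, PPN = 16, 4  # hypothetical: 16 MPI ranks, 4 ranks per node
PR, PC = 4, 4       # hypothetical 4x4 processor grid

def node_of(rank):
    """Node hosting a rank, under node-by-node rank placement."""
    return rank // PPN

def inter_node_row_pairs(coord_of_rank):
    """Count rank pairs that share a grid row but sit on different nodes."""
    rows = {}
    for rank in range(RANKS):
        row, _col = coord_of_rank(rank)
        rows.setdefault(row, []).append(rank)
    count = 0
    for members in rows.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if node_of(members[i]) != node_of(members[j]):
                    count += 1
    return count

# Default mapping: column-major placement -> each row spans all four nodes.
default_map = lambda rank: (rank % PR, rank // PR)
# Node-aware mapping: the PC = PPN ranks of one node form one grid row.
node_aware_map = lambda rank: (rank // PC, rank % PC)

print(inter_node_row_pairs(default_map))     # all 24 row pairs cross nodes
print(inter_node_row_pairs(node_aware_map))  # 0: row traffic stays intra-node
```

In an MPI code, the node-aware variant would correspond to choosing the process-grid layout so that one communicator dimension coincides with the set of ranks sharing a node (discoverable, e.g., via `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`); the paper generalizes this idea to the multi-dimensional grids used for tensor contractions.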
Notes
- 1. CTF-def and CTF-na can be run with https://github.com/airmler/ctf, branch node-awareness, commit ID 2f32bd6.
- 2. https://github.com/eth-cscs/COSMA.git, commit ID fe98d3eb.
References
Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
Aggarwal, A., Chandra, A.K., Snir, M.: On communication latency in PRAM computations. In: Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 11–21 (1989)
Bartlett, R.J., Musiał, M.: Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79, 291–352 (2007)
Bienz, A., Gropp, W.D., Olson, L.N.: Node aware sparse matrix-vector multiplication. J. Parallel Distrib. Comput. 130, 166–178 (2019)
Bienz, A., Gropp, W.D., Olson, L.N.: Reducing communication in algebraic multigrid with multi-step node aware communication. Int. J. High Perform. Comput. Appl. 34(5), 547–561 (2020)
Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, MT, USA (1969)
Chan, E., Heimlich, M., Purkayastha, A., Van De Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput.: Pract. Experience 19(13), 1749–1783 (2007)
Choi, J., Dongarra, J., Pozo, R., Walker, D.: ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers. In: The Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 120–127 (1992)
Demmel, J., et al.: Communication-optimal parallel recursive rectangular matrix multiplication. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 261–272 (2013)
Irmler, A., Kanakagiri, R., Ohlmann, S.T., Solomonik, E., Grüneis, A.: Artifact overview document for Euro-Par 2023 paper: Optimizing distributed tensor contractions using node-aware processor grids. https://doi.org/10.6084/m9.figshare.23548113
Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
Kwasniewski, G., Kabić, M., Besta, M., VandeVondele, J., Solcà, R., Hoefler, T.: Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019. Association for Computing Machinery, New York (2019)
Lockhart, S., Bienz, A., Gropp, W., Olson, L.: Performance analysis and optimal node-aware communication for enlarged conjugate gradient methods. ACM Trans. Parallel Comput. 10, 1–25 (2023)
Lockhart, S., Bienz, A., Gropp, W.D., Olson, L.N.: Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures. Parallel Comput. 116, 103021 (2023)
McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24, 287–297 (1999)
Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 90–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_10
Solomonik, E., Matthews, D., Hammond, J.R., Stanton, J.F., Demmel, J.: A massively parallel tensor contraction framework for coupled-cluster computations. J. Parallel Distrib. Comput. 74, 3176–3190 (2014)
Thakur, R., Gropp, W.D.: Improving the performance of collective operations in MPICH. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840, pp. 257–267. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39924-7_38
Van De Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Experience 9(4), 255–274 (1997)
Author information
Authors and Affiliations
Institute for Theoretical Physics, TU Wien, Vienna, Austria
Andreas Irmler & Andreas Grüneis
Max Planck Computing and Data Facility, Garching, Germany
Sebastian T. Ohlmann
University of Illinois at Urbana-Champaign, Champaign, USA
Raghavendra Kanakagiri & Edgar Solomonik
Corresponding author
Correspondence to Andreas Irmler.
Editor information
Editors and Affiliations
University of Glasgow, Glasgow, UK
José Cano
University of Cyprus, Nicosia, Cyprus
Marios D. Dikaiakos
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Chalmers University of Technology, Gothenburg, Sweden
Miquel Pericàs
University of Manchester, Manchester, UK
Rizos Sakellariou
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Irmler, A., Kanakagiri, R., Ohlmann, S.T., Solomonik, E., Grüneis, A. (2023). Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4
eBook Packages: Computer Science, Computer Science (R0)