Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition, and in seemingly unrelated problems such as counting the paths through a graph.[1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).
Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n^3 field operations to multiply two n × n matrices over that field (Θ(n^3) in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. As of September 2025, the best bound on the asymptotic complexity of a matrix multiplication algorithm is O(n^2.371339) time, given by Alman, Duan, Williams, Xu, Xu, and Zhou.[2] However, this algorithm is a galactic algorithm because of the large constants and cannot be realized practically.
The definition of matrix multiplication is that if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix with entries

c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p.
From this, a simple algorithm can be constructed which loops over the indices i from 1 through n and j from 1 through p, computing the above using a nested loop:
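For concreteness, a minimal Python sketch of this triple loop is shown below; the function name `matmul_naive` and the use of nested lists for the matrices are illustrative choices, not part of any particular library.

```python
def matmul_naive(A, B):
    """Multiply an n x m matrix A by an m x p matrix B directly from the definition."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):              # rows of A
        for j in range(p):          # columns of B
            s = 0
            for k in range(m):      # inner product of row i of A with column j of B
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```

For example, `matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.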
This algorithm takes time Θ(nmp) (in asymptotic notation).[1] A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size n × n, in which case the running time is Θ(n^3), i.e., cubic in the size of the dimension.[3]

The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm;[1] which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.
In particular, in the idealized case of a fully associative cache consisting of M bytes and b bytes per cache line (i.e. M/b cache lines), the above algorithm is sub-optimal for A and B stored in row-major order. When n > M/b, every iteration of the inner loop (a simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n^3) cache misses in the worst case. As of 2010, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.[4]
The optimal variant of the iterative algorithm for A and B in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size √M by √M:[4][5]
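A possible rendering of the tiled loop nest in Python follows; the tile width `T` stands in for √M and is treated here as a tuning parameter, and the square-matrix restriction is only for brevity.

```python
def matmul_tiled(A, B, T):
    """Tiled multiplication of two n x n matrices (nested lists), with tile width T.

    T plays the role of sqrt(M): one T x T tile each of A, B and C is meant to
    fit in fast memory at the same time.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):            # tile row of C
        for jj in range(0, n, T):        # tile column of C
            for kk in range(0, n, T):    # tile along the shared dimension
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```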
In the idealized cache model, this algorithm incurs only Θ(n^3/(b√M)) cache misses; the divisor b√M amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses.[4]
An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2^n × 2^n for some n. The matrix product is now

AB = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix},
which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication c_{11} = a_{11}b_{11} as its base case.
The complexity of this algorithm as a function of n is given by the recurrence[3]

T(1) = \Theta(1), \qquad T(n) = 8T(n/2) + \Theta(n^2),
accounting for the eight recursive calls on matrices of size n/2 and Θ(n^2) to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution Θ(n^3), the same as the iterative algorithm.[3]
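As an illustration, the recursion can be written in a few lines of Python with NumPy slicing; the power-of-two assumption mirrors the partitioning above, and the function name is hypothetical.

```python
import numpy as np

def matmul_dac(A, B):
    """Divide-and-conquer product of two n x n NumPy arrays, n a power of two."""
    n = A.shape[0]
    if n == 1:
        return A * B                     # base case: the scalar product c11 = a11*b11
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = matmul_dac(A11, B11) + matmul_dac(A12, B21)   # eight recursive products
    C[:h, h:] = matmul_dac(A11, B12) + matmul_dac(A12, B22)   # and four block additions
    C[h:, :h] = matmul_dac(A21, B11) + matmul_dac(A22, B21)
    C[h:, h:] = matmul_dac(A21, B12) + matmul_dac(A22, B22)
    return C
```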
A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice[4] splits matrices in two instead of four submatrices, as follows.[6] Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.
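The sketch below conveys the idea of this binary splitting in Python: it halves whichever of the three dimensions is largest and falls back to a direct product below an assumed cutoff. It is a sketch of the technique, not the exact formulation of the cited paper.

```python
import numpy as np

def matmul_split2(A, B, cutoff=64):
    """Recursive product that always splits the largest of the three dimensions in half."""
    n, m = A.shape
    p = B.shape[1]
    if max(n, m, p) <= cutoff:
        return A @ B                                   # small enough: multiply directly
    if n >= m and n >= p:                              # split the rows of A (and of C)
        h = n // 2
        return np.vstack([matmul_split2(A[:h], B, cutoff),
                          matmul_split2(A[h:], B, cutoff)])
    if p >= m:                                         # split the columns of B (and of C)
        h = p // 2
        return np.hstack([matmul_split2(A, B[:, :h], cutoff),
                          matmul_split2(A, B[:, h:], cutoff)])
    h = m // 2                                         # split the shared dimension and add
    return (matmul_split2(A[:, :h], B[:h], cutoff) +
            matmul_split2(A[:, h:], B[h:], cutoff))
```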
The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the recursive algorithm is cache-oblivious:[6] there is no tuning parameter required to get optimal cache performance, and it behaves well in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space.[4] (The simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.)
The number of cache misses incurred by this algorithm, on a machine with M lines of ideal cache, each of size b bytes, is bounded by[6]: 13

\Theta\left( m + n + p + \frac{mn + np + mp}{b} + \frac{mnp}{b\sqrt{M}} \right).

Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2×2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of O(n^{log_2 7}) ≈ O(n^2.807). Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm,[7] but it is faster in cases where n > 100 or so[1] and appears in several libraries, such as BLAS.[8] It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
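A compact NumPy sketch of Strassen's scheme is given below for square matrices whose size is a power of two; the cutoff below which it reverts to the ordinary product is an assumed tuning parameter.

```python
import numpy as np

def strassen(A, B, cutoff=128):
    """Strassen multiplication of two n x n NumPy arrays, n a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                                   # fall back to the ordinary product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)        # seven block products ...
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7                      # ... recombined with additions only
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Each level of recursion issues only seven recursive calls instead of eight, which is what drives the exponent down from 3 to log_2 7.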
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the big-O notation has its merits. A table that compares key aspects of the improved version based on recursive multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, n gives the dimensions of the matrix and M designates the memory size.
| Year | Reference | #matrix multiplications per step | #matrix additions per step | total arithmetic operations | total I/O-complexity |
|---|---|---|---|---|---|
| 1969 | Strassen[9] | 7 | 18 | | |
| 1971 | Winograd[10] | 7 | 15 | | |
| 2017 | Karstadt, Schwartz[11] | 7 | 12 | | |
| 2023 | Schwartz, Vaknin[12] | 7 | 12 | | |
It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert[13] showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work, Beniamini et al.[14] applied this base-change trick to more general decompositions than 2×2-block matrices and improved the leading constant for their run times.
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The matrix multiplication exponent, usually denoted ω, is the smallest real number for which any two n × n matrices over a field can be multiplied together using n^{ω + o(1)} field operations. The current best bound on ω is ω < 2.371339, by Alman, Duan, Williams, Xu, Xu, and Zhou.[2] This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990.[15] The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two k × k matrices with fewer than k^3 multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the big-O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.[16][17] Victor Pan proposed so-called feasible sub-cubic matrix multiplication algorithms with an exponent slightly above 2.77, but in return with a much smaller hidden constant coefficient.[18]
Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices A, B and C, verifies in Θ(n^2) time whether AB = C.
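A short NumPy sketch of this verification is shown below, for matrices over an exact domain such as the integers; with floating-point entries, a tolerance would have to replace the exact comparison.

```python
import numpy as np

def freivalds(A, B, C, trials=20, rng=None):
    """Probabilistically check whether AB = C in O(n^2) time per trial.

    Each trial multiplies by a random 0/1 vector; if AB != C, a single trial
    detects the mismatch with probability at least 1/2, so `trials` repetitions
    make the error probability at most 2**-trials.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    for _ in range(trials):
        r = rng.integers(0, 2, size=n)          # random vector with entries in {0, 1}
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                        # definitely not equal
    return True                                 # probably equal
```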
In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not.[19] Operations were restricted to the non-commutative ground field (normal arithmetic) and the finite field of two elements (mod 2 arithmetic). The best "practical" (explicit low-rank decomposition of a matrix multiplication tensor) algorithm found ran in O(n^2.778) time.[20] Finding low-rank decompositions of such tensors (and beyond) is NP-hard; optimal multiplication even for 3×3 matrices remains unknown, even over a commutative field.[20] On 4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen's algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5×5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4×4 algorithm, and separately tweaked DeepMind's 96-step 5×5 algorithm down to 95 steps in mod 2 arithmetic and to 97[21] in normal arithmetic.[22] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are based on the fact that the eight recursive matrix multiplications in

\begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}
can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode:[23]
Procedure multiply(C, A, B): if the matrices are 1 × 1, set c_{11} ← a_{11} · b_{11}; otherwise, partition A, B, C and a temporary matrix T into quadrants, fork the eight half-size multiplications (four writing into the quadrants of C, four into the quadrants of T), join, and then call add(C, T).
Procedure add(C, T) adds T into C, element-wise: in the base case, a single scalar addition; otherwise, partition C and T into quadrants, fork the four quadrant-wise additions, and join.
Here, fork is a keyword that signals a computation may be run in parallel with the rest of the function call, while join waits for all previously "forked" computations to complete. partition achieves its goal by pointer manipulation only.
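A rough Python rendering of this structure is sketched below, using threads to stand in for fork and join; in CPython it mainly illustrates the dependency structure rather than delivering real speedup, which in practice would come from a work-stealing runtime such as Cilk. The cutoff and power-of-two assumption are illustrative.

```python
import threading
import numpy as np

def multiply(C, A, B, cutoff=128):
    """Fork-join style computation of C = A*B for n x n matrices (n a power of two).

    The eight half-size products are forked as threads and joined, then the
    temporary matrix T is added into C.
    """
    n = A.shape[0]
    if n <= cutoff:                       # base case: multiply directly
        C[:] = A @ B
        return
    h = n // 2
    T = np.empty_like(C)                  # temporary matrix for four of the products
    quads = lambda M: (M[:h, :h], M[:h, h:], M[h:, :h], M[h:, h:])
    A11, A12, A21, A22 = quads(A)
    B11, B12, B21, B22 = quads(B)
    C11, C12, C21, C22 = quads(C)
    T11, T12, T21, T22 = quads(T)
    jobs = [threading.Thread(target=multiply, args=(dst, X, Y, cutoff))
            for dst, X, Y in [(C11, A11, B11), (C12, A11, B12),
                              (C21, A21, B11), (C22, A21, B12),
                              (T11, A12, B21), (T12, A12, B22),
                              (T21, A22, B21), (T22, A22, B22)]]
    for j in jobs:                        # fork the eight recursive products
        j.start()
    for j in jobs:                        # join: wait for all of them to finish
        j.join()
    C += T                                # add(C, T), element-wise
```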
This algorithm has a critical path length of Θ(log^2 n) steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of Θ(n^3/log^2 n) on any real computer. The algorithm isn't practical due to the communication cost inherent in moving data to and from the temporary matrix T, but a more practical variant achieves Θ(n^2) speedup, without using a temporary matrix.[23]

On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the communication bandwidth. The naïve algorithm using three nested loops uses Ω(n^3) communication bandwidth.
Cannon's algorithm, also known as the 2D algorithm, is a communication-avoiding algorithm that partitions each input matrix into a block matrix whose elements are submatrices of size √(M/3) by √(M/3), where M is the size of fast memory.[24] The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to O(n^3/√M), which is asymptotically optimal (for algorithms performing Ω(n^3) computation).[25][26]
In a distributed setting with p processors arranged in a √p by √p 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting O(n^2/√p) words, which is asymptotically optimal assuming that each node stores the minimum O(n^2/p) elements.[26] This can be improved by the 3D algorithm, which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.[27] This algorithm transmits O(n^2/p^{2/3}) words per processor, which is asymptotically optimal.[26] However, this requires replicating each input matrix element p^{1/3} times, and so requires a factor of p^{1/3} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.[27] "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.[28] On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed.[29]
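To make the communication pattern of Cannon's algorithm concrete, the following sequential NumPy sketch simulates a √p × √p grid of blocks, performing the initial skew and then √p rounds of multiply-and-shift. In a real distributed implementation each block would live on its own processor and the shifts would be messages between mesh neighbours; the function name and the parameter q = √p are illustrative.

```python
import numpy as np

def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid of blocks (n must be divisible by q).

    Each grid position (i, j) would be one processor in a 2D mesh; here the whole
    grid is stepped sequentially so the communication pattern is easy to follow.
    """
    n = A.shape[0]
    s = n // q                                            # block size
    blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s].copy()
    # Initial skew: row i of A is shifted left by i, column j of B is shifted up by j.
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((s, s), dtype=A.dtype) for _ in range(q)] for _ in range(q)]
    for _ in range(q):                                    # q multiply-and-shift rounds
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]           # local block product
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
    return np.block(Cb)
```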

There are a variety of algorithms for multiplication on meshes. For multiplication of two n×n matrices on a standard two-dimensional mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3n-2 steps, although this is reduced to half this number for repeated computations.[30] The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes.
The result is even faster on a two-layered cross-wired mesh, where only 2n-1 steps are needed.[31] The performance improves further for repeated computations leading to 100% efficiency.[32] The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure.[33]
In a 3D mesh with n^3 processing elements, two matrices can be multiplied in O(log n) time using the DNS algorithm.[34]
The Coppersmith–Winograd algorithm is not practical, due to the very large hidden constant in the upper bound on the number of multiplications required.
Even if someone manages to prove one of the conjectures—thereby demonstrating that ω = 2—the wreath product approach is unlikely to be applicable to the large matrix problems that arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent.