Automatically Tuned Linear Algebra Software

From Wikipedia, the free encyclopedia
Software library for linear algebra
Stable release: 3.10.3 / July 28, 2016
Written in: C, FORTRAN 77
Type: Software library
License: BSD License
Website: math-atlas.sourceforge.net

Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of the BLAS APIs for C and FORTRAN 77.

ATLAS is often recommended as a way to automatically generate an optimized BLAS library. While its performance often trails that of specialized libraries written for one specific hardware platform, it is often the first or even only optimized BLAS implementation available on new systems and is a large improvement over the generic BLAS available at Netlib. For this reason, ATLAS is sometimes used as a performance baseline for comparison with other products.

ATLAS runs on most Unix-like operating systems and on Microsoft Windows (using Cygwin). It is released under a BSD-style license without advertising clause, and many well-known mathematics applications including MATLAB, Mathematica, Scilab, SageMath, and some builds of GNU Octave may use it.

Functionality


ATLAS provides a full implementation of the BLAS APIs as well as some additional functions from LAPACK, a higher-level library built on top of BLAS. In BLAS, functionality is divided into three groups called levels 1, 2 and 3.

  • Level 1 contains vector operations of the form
    y ← αx + y
    as well as scalar dot products and vector norms, among other things.
  • Level 2 contains matrix-vector operations of the form
    y ← αAx + βy
    as well as solving Tx = y for x with T being triangular, among other things.
  • Level 3 contains matrix-matrix operations of the form
    C ← αAB + βC
    as well as solving B ← αT⁻¹B for triangular matrices T, among other things.

Optimization approach


The optimization approach is called Automated Empirical Optimization of Software (AEOS), which identifies four fundamental approaches to computer-assisted optimization, of which ATLAS employs three:[1]

  1. Parameterization—searching over the parameter space of a function, used for blocking factor, cache edge, etc.
  2. Multiple implementation—searching through various approaches to implementing the same function, e.g., for SSE support before intrinsics made them available in C code
  3. Code generation—programs that write programs incorporating what knowledge they can about what will produce the best performance for the system
  • Optimization of the level 1 BLAS uses parameterization and multiple implementation
Every ATLAS level 1 BLAS function has its own kernel. Since it would be difficult to maintain thousands of cases in ATLAS, there is little architecture-specific optimization for the level 1 BLAS. Instead, multiple implementation is relied upon to allow compiler optimization to produce a high-performance implementation for the system.
  • Optimization of the level 2 BLAS uses parameterization and multiple implementation
With N² data and N² operations to perform, the function is usually limited by memory bandwidth, and thus there is not much opportunity for optimization.
All routines in the ATLAS level 2 BLAS are built from two Level 2 BLAS kernels:
  • GEMV—matrix-vector multiply update:
    y ← αAx + βy
  • GER—general rank-1 update from an outer product:
    A ← αxyᵀ + A
  • Optimization of the level 3 BLAS uses code generation and the other two techniques
Since there are N³ operations with only N² data, there are many opportunities for optimization.

Level 3 BLAS


Most of the Level 3 BLAS is derived from GEMM, so that is the primary focus of the optimization.

O(n³) operations vs. O(n²) data

The intuition that the n³ operations will dominate over the n² data accesses only works for roughly square matrices. The real measure should be some kind of surface area to volume. The difference becomes important for very non-square matrices.

Can it afford to copy?


Copying the inputs allows the data to be arranged in a way that provides optimal access for the kernel functions, but this comes at the cost of allocating temporary space, and an extra read and write of the inputs.

So the first question GEMM faces is, can it afford to copy the inputs?

If so,

  • Put into block major format with good alignment
  • Take advantage of user contributed kernels and cleanup
  • Handle the transpose cases with the copy: make everything into TN (transpose - no-transpose)
  • Deal with α in the copy

If not,

  • Use the nocopy version
  • Make no assumptions on the stride of matrices A and B in memory
  • Handle all transpose cases explicitly
  • No guarantee about alignment of data
  • Support α specific code
  • Run the risk of TLB issues, bad strides, etc.

The actual decision is made through a simple heuristic which checks for "skinny cases".

Cache edge


For second-level cache blocking a single cache edge parameter is used. The high level chooses an order in which to traverse the blocks: ijk, jik, ikj, jki, kij, kji. These need not be the same order as the product is done within a block.

Typically chosen orders are ijk or jik. For jik the ideal situation would be to copy A and the NB-wide panel of B. For ijk, swap the roles of A and B.

Choosing the bigger of M or N for the outer loop reduces the footprint of the copy. But for large K, ATLAS does not even allocate such a large amount of memory. Instead it defines a parameter, Kp, to give the best use of the L2 cache, and panels are limited to Kp in length. It first tries to allocate (in the jik case) M·Kp + NB·Kp + NB·NB. If that fails it tries 2·Kp·NB + NB·NB. (If that fails it uses the no-copy version of GEMM, but this case is unlikely for reasonable choices of cache edge.) Kp is a function of the cache edge and NB.

LAPACK


When integrating the ATLAS BLAS with LAPACK, an important consideration is the choice of blocking factor for LAPACK. If the ATLAS blocking factor is small enough, the blocking factor of LAPACK could be set to match that of ATLAS.

To take advantage of recursive factorization, ATLAS provides replacement routines for some LAPACK routines. These simply overwrite the corresponding LAPACK routines from Netlib.

Need for installation


Installing ATLAS on a particular platform is a challenging process which is typically done by a system vendor or a local expert and made available to a wider audience.

For many systems, architectural default parameters are available; these are essentially saved searches plus the results of hand tuning. If the arch defaults work they will likely get 10–15% better performance than the install search. On such systems the installation process is greatly simplified.

References

  1. ^ R. Clint Whaley; Antoine Petitet; Jack J. Dongarra (2001). "Automated Empirical Optimization of Software and the ATLAS Project" (PDF). Parallel Computing. 27 (1–2): 3–35. CiteSeerX 10.1.1.35.2297. doi:10.1016/S0167-8191(00)00087-9. Retrieved 2006-10-06.

