HPAC/tccg
The Tensor Contraction Code Generator (TCCG) generates high-performance (parallel and) vectorized C code for tensor contractions.
From a computational perspective, tensors can be interpreted as higher-dimensional matrices or simply as multidimensional arrays; likewise, tensor contractions are a generalization of matrix-matrix multiplication to higher dimensions. For instance, A[i,k], B[k,j] and C[i,j] denote two-dimensional tensors (i.e., matrices), and C[i,j] = A[i,k] * B[k,j] represents a tensor contraction where the sum over 'k' as well as the loops over 'i' and 'j' are implicit. Further examples of tensor contractions are: C[i0,j0,j1] = A[i0,k0] * B[j1,k0,j0]; C[i0,j0,j1,i1] = A[i0,k0,i1] * B[j1,k0,j0]; C[i0,j0,j1,i1] = A[k0,i0,k1,i1] * B[k1,j1,k0,j0]; ...
Current version: v0.1.2
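To make the implicit-summation notation concrete, here is a minimal reference implementation (plain Python, not TCCG-generated code) of the contraction C[i,j] = A[i,k] * B[k,j], with the sum over the contracted index k written out explicitly:

```python
# Reference (unoptimized) implementation of the contraction
# C[i,j] = A[i,k] * B[k,j], i.e., ordinary matrix-matrix multiplication.

def contract_ik_kj(A, B):
    I, K = len(A), len(A[0])
    J = len(B[0])
    C = [[0.0] * J for _ in range(I)]
    for i in range(I):          # free index i (appears in C)
        for j in range(J):      # free index j (appears in C)
            for k in range(K):  # contracted index k (summed over)
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(contract_ik_kj(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The higher-dimensional contractions listed above follow the same pattern, only with more free and contracted indices.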
- TCCG generates high-performance vectorized C code
- TCCG generates code based on three different approaches:
- GEMM-like Tensor-Tensor Multiplication (GETT): This novel approach to tensor contractions is at the core of our latest publication (see below).
- Transpose-Transpose-GEMM-Transpose (TTGT)
- Loops-over-GEMM (LoG)
- Shared-memory parallelism
- TTGT, LoG, GETT
- Support for single- and double-precision
- Auto-Fine-Tuning:
- Automatically explores a search space of promising implementation candidates
- The fastest candidate will be selected and returned automatically
- A performance model guides the search
- The search space can be limited by the user (via the --maxImplementations=N command line argument)
- Support for multiple instruction sets:
- AVX2: GETT, TTGT, LoG
- AVX512: GETT, TTGT, LoG (experimental)
- CUDA: TTGT, LoG
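As a rough illustration of the TTGT (Transpose-Transpose-GEMM-Transpose) idea — a hand-written Python sketch, not TCCG's actual generated code — the contraction C[i0,j0,j1] = A[i0,k0] * B[j1,k0,j0] can be reduced to a single matrix-matrix multiplication by first transposing B so that its contracted index comes first and its free indices are flattened:

```python
# TTGT sketch for C[i0,j0,j1] = A[i0,k0] * B[j1,k0,j0]:
#   1) transpose B into Bt[k0][j0*J1 + j1] (explicit transpose step),
#   2) perform one matrix-matrix multiplication C_mat = A * Bt (GEMM),
#   3) unflatten C_mat into the three-dimensional result C[i0][j0][j1].

def ttgt(A, B, I0, J0, J1, K0):
    # Step 1: explicit transposition B[j1][k0][j0] -> Bt[k0][j0*J1 + j1]
    Bt = [[0.0] * (J0 * J1) for _ in range(K0)]
    for j1 in range(J1):
        for k0 in range(K0):
            for j0 in range(J0):
                Bt[k0][j0 * J1 + j1] = B[j1][k0][j0]
    # Step 2: plain GEMM, C_mat[i0][c] = sum_k0 A[i0][k0] * Bt[k0][c]
    Cmat = [[sum(A[i0][k0] * Bt[k0][c] for k0 in range(K0))
             for c in range(J0 * J1)] for i0 in range(I0)]
    # Step 3: unflatten C_mat back into C[i0][j0][j1]
    return [[[Cmat[i0][j0 * J1 + j1] for j1 in range(J1)]
             for j0 in range(J0)] for i0 in range(I0)]
```

The transposes in steps 1 and 3 are exactly the overhead that GETT avoids by packing sub-tensors directly into the caches (see below).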
GETT's advantages are manifold:
- GETT-based code is fully vectorized and exploits the cache hierarchy.
- Sub-tensors are packed into the caches as needed. Thus, GETT avoids the explicit transposition overhead incurred by TTGT.
- The stride-one index is preserved while packing the sub-tensors into a specified level of the cache hierarchy.
- No additional workspace is required (except for small buffers which fit into the caches).
- The arithmetic intensity is retained for any given tensor contraction.
While GETT exhibits excellent performance across a wide range of tensor contractions, its performance for bandwidth-bound tensor contractions is especially outstanding.
For further information, please see our paper (referenced below).
In order to use TCCG, a working C compiler, some BLAS library (e.g., Intel's MKL), and the High-Performance Tensor Transposition (HPTT) library are required:
- Intel's ICC (>= v15.0, recommended) or g++ (>= v4.8, experimental)
- Some BLAS library (e.g., BLIS, ATLAS)
- High-Performance Tensor Transposition (HPTT) library
- Python (tested with v2.7.5 and v2.7.9)
- Tensor Contraction Library (TCL) (OPTIONAL)
Clone the repository into a desired directory and change to that location:
git clone https://github.com/HPAC/tccg.git
cd tccg
Install TCCG:
python setup.py install --user
Export the TCCG_ROOT environment variable (add to your .bashrc):
export TCCG_ROOT=$(pwd)
Set up your BLAS library within $TCCG_ROOT/config.cfg (default: mkl).
You might have to add the installed location to your PATH environment variable:
export PATH=$PATH:~/.local/bin
Please run tccg --help to get an overview of TCCG's parameters.
Here is an exemplary input file to TCCG:
C[a,b,i,j] = A[i,m,a] * B[m,j,b]
a = 24
b = 24
i = 24
j = 24
m = 24
TCCG command line arguments:
tccg --arch=avx2 --numThreads=1 --floatType=s example.tccg
Further examples (.tccg files) can be generated via:
python benchmark/benchmark.py
TCCG provides a benchmark for tensor contractions.
python benchmark.py
This will generate the input files (.tccg) for TCCG for each of the test cases within the benchmark. The tensor contractions within the benchmark are collected from four different publications to cover a broad range of use cases (see paper, Sec. 7.1); this being said, we don't claim that this benchmark is exhaustive in any sense. If you think that the benchmark is missing certain tensor contractions or sizes, please feel free to contribute to the benchmark.
Since this benchmark may evolve over time and to make comparisons easier, please refer to the current version of the benchmark.
Benchmark version: v0.1
The product of the sizes corresponding to the free indices of each input tensor needs to be a multiple of 24. This limitation will be lifted in a future version of GETT.
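This constraint can be checked ahead of time. Below is a small hypothetical helper (not part of TCCG) that verifies it for the example contraction C[a,b,i,j] = A[i,m,a] * B[m,j,b] shown above, where an input tensor's free indices are those that also appear in the output:

```python
# Hypothetical helper (not part of TCCG): check GETT's current requirement
# that the product of the free-index sizes of each input tensor is a
# multiple of 24. The free indices of an input tensor are those indices
# that also appear in the output tensor C (i.e., the non-contracted ones).

def free_index_product_ok(output_indices, input_indices, sizes, multiple=24):
    product = 1
    for idx in input_indices:
        if idx in output_indices:  # idx is free, not contracted
            product *= sizes[idx]
    return product % multiple == 0

# Example from above: C[a,b,i,j] = A[i,m,a] * B[m,j,b], all sizes 24.
sizes = {'a': 24, 'b': 24, 'i': 24, 'j': 24, 'm': 24}
print(free_index_product_ok('abij', 'ima', sizes))  # True: A's free product is 24*24 = 576
print(free_index_product_ok('abij', 'mjb', sizes))  # True: B's free product is 24*24 = 576
```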
In case you want to refer to TCCG as part of a research paper, please cite the following article:
@article{tccg2016a,
  author        = {Paul Springer and Paolo Bientinesi},
  title         = {{Design of a high-performance GEMM-like Tensor-Tensor Multiplication}},
  archivePrefix = "arXiv",
  eprint        = {1607.00145},
  primaryClass  = "quant-ph",
  journal       = {CoRR},
  year          = {2016},
  issue_date    = {July 2016},
  url           = {http://arxiv.org/abs/1607.00145}
}
V0.2.0:
- GETT is now also parallelized
- This branch now uses the High-Performance Tensor Transposition (HPTT) library, which significantly reduces the compile time
We are happy about any feedback or feature requests. Please contact springer@aices.rwth-aachen.de.
We also welcome any contributions to the code base or the benchmark.