Bruce-Lee-LY/cuda_hgemm


Several optimization methods for half-precision general matrix multiplication (HGEMM) on Tensor Cores, implemented with the WMMA API and MMA PTX instructions. The computation is the expression below, where matrices A (M * K), B (K * N), and C (M * N) are all FP16. By exploring various matrix tiling and optimization methods, the current kernels reach at least 95% of cuBLAS performance for dimensions from 256 to 16384, and exceed cuBLAS in many scenarios.

C (M * N) = A (M * K) * B (K * N)

[figure: hgemm]
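
For reference, a minimal unoptimized CUDA kernel for this expression is shown below. The kernel name, launch shape, and FP32 accumulation are illustrative assumptions, not code from this repository.

#include <cuda_fp16.h>

// Naive FP16 GEMM reference: one thread per element of C.
// Accumulates in FP32 for clarity; the optimized kernels may differ.
__global__ void hgemm_naive(const half *A, const half *B, half *C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        acc += __half2float(A[row * K + k]) * __half2float(B[k * N + col]);
    }
    C[row * N + col] = __float2half(acc);
}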

Optimization Method

  • Tiling: a 256 * 128 block tile and a 64 * 64 warp tile
  • Coalescing Access: wide (vectorized) instructions for global memory access
  • Data Reuse: staging tiles of matrices A and B in shared memory so their data is reused within the block
  • Async Copy: asynchronous, non-blocking copy instructions for moving data from global to shared memory (see the Pg2s sketch below)
  • Bank Conflict: padding for the WMMA API and a permuted shared-memory layout for the MMA PTX path to eliminate bank conflicts (see the WMMA sketch below)
  • L2 Cache: a swizzled block-scheduling order to increase the L2 cache hit ratio (see the swizzle sketch below)
  • Register Reuse: traversing the tiles inside a warp in a "right-left-right-left" serpentine order so register contents carry over between steps
  • Pg2s: a double-buffering scheme that prefetches from global memory to shared memory
  • Ps2r: a double-buffering scheme that prefetches from shared memory to registers
  • Stage: a multi-buffering (multi-stage) scheme that prefetches from global memory to shared memory
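
Below is a warp-level sketch of the WMMA path combining shared-memory data reuse with the padding trick against bank conflicts. The 16 x 16 x 16 tile, the +8-half pad, and the one-warp-per-block launch are simplifying assumptions for illustration; the repository's kernels use the larger tiles listed above.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C. Padding the shared-memory leading
// dimension (+8 halves here, an assumption) breaks the power-of-two stride
// that would otherwise cause bank conflicts. Launch with
// <<<dim3(N / 16, M / 16), 32>>> and M, N, K multiples of 16.
__global__ void wmma_16x16_sketch(const half *A, const half *B, half *C,
                                  int M, int N, int K) {
    __shared__ __align__(32) half sA[16][16 + 8];
    __shared__ __align__(32) half sB[16][16 + 8];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> fc;
    wmma::fill_fragment(fc, __float2half(0.0f));

    int tile_row = blockIdx.y * 16;
    int tile_col = blockIdx.x * 16;

    for (int k = 0; k < K; k += 16) {
        // Stage the A and B tiles in shared memory (data reuse).
        for (int i = threadIdx.x; i < 16 * 16; i += 32) {
            int r = i / 16, c = i % 16;
            sA[r][c] = A[(tile_row + r) * K + (k + c)];
            sB[r][c] = B[(k + r) * N + (tile_col + c)];
        }
        __syncthreads();
        wmma::load_matrix_sync(fa, &sA[0][0], 16 + 8);  // ldm includes the pad
        wmma::load_matrix_sync(fb, &sB[0][0], 16 + 8);
        wmma::mma_sync(fc, fa, fb, fc);
        __syncthreads();
    }
    wmma::store_matrix_sync(&C[tile_row * N + tile_col], fc, N,
                            wmma::mem_row_major);
}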
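
The MMA PTX path issues the tensor-core instruction directly instead of going through WMMA fragments, which is what allows the permuted shared-memory layout. A thin wrapper over the Ampere m16n8k16 FP16 instruction could look like this; the ldmatrix loads, fragment packing, and the permuted layout itself are omitted.

#include <cstdint>

// d = a * b + c on a 16x8x16 FP16 tile, executed by one full warp.
// Each uint32_t packs two halves (f16x2); the per-thread register-to-fragment
// mapping is defined by the PTX ISA for mma.m16n8k16 (sm_80 and newer).
__device__ __forceinline__ void mma_m16n8k16_f16(uint32_t d[2],
                                                 const uint32_t a[4],
                                                 const uint32_t b[2],
                                                 const uint32_t c[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(c[0]), "r"(c[1]));
}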
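
A sketch of the Pg2s double buffering with the CUDA 11 pipeline intrinsics (which compile to cp.async on Ampere): while one shared-memory buffer is being consumed, the copy filling the other is already in flight. The tile size, the 32-thread block shape, and the placeholder compute loop are assumptions.

#include <cuda_fp16.h>
#include <cuda_pipeline.h>

constexpr int TILE = 256;  // halves per tile: 32 threads x 8 halves each

// Two shared-memory buffers alternate: buffer `cur` is consumed while
// buffer `cur ^ 1` is filled by non-blocking cp.async copies.
__global__ void pg2s_sketch(const half *A, half *out, int K) {
    // K is assumed to be a multiple of TILE.
    __shared__ __align__(16) half sA[2][TILE];
    int lane = threadIdx.x;  // blockDim.x == 32 assumed
    int cur = 0;

    // Prefetch the first tile: one 16-byte (8-half) async copy per thread.
    __pipeline_memcpy_async(&sA[cur][lane * 8], &A[lane * 8], 16);
    __pipeline_commit();

    float acc = 0.0f;
    for (int k = 0; k < K; k += TILE) {
        bool has_next = (k + TILE) < K;
        if (has_next) {  // issue the next copy before computing (overlap)
            __pipeline_memcpy_async(&sA[cur ^ 1][lane * 8],
                                    &A[k + TILE + lane * 8], 16);
            __pipeline_commit();
        }
        __pipeline_wait_prior(has_next ? 1 : 0);  // current tile has landed
        __syncthreads();

        for (int i = 0; i < TILE; ++i)  // placeholder for the real MMA work
            acc += __half2float(sA[cur][i]);

        __syncthreads();  // all threads done reading before the buffer is reused
        cur ^= 1;
    }
    if (lane == 0) out[blockIdx.x] = __float2half(acc);
}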
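
One common form of the L2 swizzle is to remap the linear block index so that blocks scheduled close together compute neighboring tiles of C and thus hit the same rows of A and columns of B in L2. The grouped mapping below is a sketch of that general technique, not necessarily this repository's exact mapping; GROUP_M is an assumed tuning parameter.

// Maps a linear block id to a (tile_m, tile_n) position using a grouped
// order, so consecutively scheduled blocks reuse the same rows of A.
__device__ inline void swizzle_tile(int bid, int grid_m, int grid_n,
                                    int &tile_m, int &tile_n) {
    const int GROUP_M = 8;  // tiles per group along M (assumption)
    int group_size = GROUP_M * grid_n;
    int group = bid / group_size;
    int first_m = group * GROUP_M;
    int rows = min(grid_m - first_m, GROUP_M);  // last group may be shorter
    int local = bid % group_size;
    tile_m = first_m + local % rows;
    tile_n = local / rows;
}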

Compile

Environment

  • OS: Linux
  • CMake Version: >= 3.12
  • GCC Version: >= 4.8
  • CUDA Version: >= 11.0
  • Others: gflags, ccache

sudo apt-get install libgflags-dev ccache

Clone

git clone https://github.com/Bruce-Lee-LY/cuda_hgemm.git

Build

NVIDIA A100

cd cuda_hgemm
./build.sh -a 80 -t Release -b OFF
./build.sh -a 80 -t Debug -b OFF

RTX3080Ti / RTX3090 / RTX A6000

cd cuda_hgemm
./build.sh -a 86 -t Release -b OFF
./build.sh -a 86 -t Debug -b OFF

Run Sample

./run_sample.sh

Performance

Process the data in the log and plot it as a line chart.

cd tools/performance
./performance.sh

RTX3090

  • CUDA Version: 11.3

The best performance that can be achieved.

[figure: best_throughput]

Performance achieved by current optimization methods.

[figure: throughput]

RTX A6000

  • CUDA Version: 11.3

The best performance that can be achieved.

[figure: best_throughput]
