# rocWMMA
rocWMMA is a C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations leveraging AMD GPU hardware. rocWMMA makes it easier to break down MMA problems into fragments and distribute block-wise MMA operations in parallel across GPU wavefronts. Our API consists of a header library that you can use to compile MMA acceleration directly into GPU kernel device code. This can benefit from compiler optimization in the generation of kernel assembly, and doesn't incur the additional overhead of linking to external runtime libraries or launching separate kernels.
The rocWMMA API is designed from a wave perspective, where block-wise API calls are processed by GPU threads coordinated together as a group, or a wave. Importantly, individual threads may only access a portion of a fragment, and there is no guaranteed layout or locality of this data. Thread access to fragment data is analogous to the vector register access of individual threads. Fragments and block-wise operations can therefore be thought of as a wave of grouped data buffers and operations.
Thread blocks that contain multiple waves have the added benefit of resource sharing and cooperation. For rocWMMA, this is especially useful for moving data that may be shared across multiple waves. The rocWMMA Coop API facilitates cooperation between multiple waves when loading and storing opaque data fragments. The order and locality of fragment data contribution per wave is not guaranteed, and can be thought of as a distributed wave. Typically, waves cooperate in moving data from global memory to shared memory (LDS), then load their own full fragment copies from LDS.
Memory addresses are treated as 1D arrays. The rocWMMA API can opaquely handle moving data between both global and shared memory address locations. The library code includes utilities for mapping 2D grid and matrix coordinates to 1D array coordinates, and supports either row-major or column-major data layouts. Block-wise dimensions (BlockM/N/K) familiar from block-wise general matrix multiply (GEMM) algorithms are supported directly through the availability of matrix instructions for the target architecture. Likewise, you can vary mixed-precision data types for input, output, and accumulation fragments (as listed in the supported configurations section).
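To make the block-wise model concrete, here is a minimal sketch of a single-wave kernel that multiplies one BlockM x BlockK tile of A by one BlockK x BlockN tile of B and accumulates in fp32. The tile sizes, layouts, strides, and kernel name are illustrative choices, and host setup and error handling are omitted:

```cpp
#include <rocwmma/rocwmma.hpp>

constexpr int BlockM = 16;
constexpr int BlockN = 16;
constexpr int BlockK = 16;

// One wave computes a single BlockM x BlockN output tile: D = A x B.
__global__ void block_mma(rocwmma::float16_t const* a, // BlockM x BlockK, row-major
                          rocwmma::float16_t const* b, // BlockK x BlockN, col-major
                          float* d,                    // BlockM x BlockN, row-major
                          int lda, int ldb, int ldd)
{
    // Block-wise fragments: each thread in the wave holds a slice of the tile.
    rocwmma::fragment<rocwmma::matrix_a, BlockM, BlockN, BlockK,
                      rocwmma::float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, BlockM, BlockN, BlockK,
                      rocwmma::float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, BlockM, BlockN, BlockK, float> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);             // broadcast 0 to all elements
    rocwmma::load_matrix_sync(fragA, a, lda);          // wave-wide load of the A tile
    rocwmma::load_matrix_sync(fragB, b, ldb);          // wave-wide load of the B tile
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // fragAcc += fragA x fragB
    rocwmma::store_matrix_sync(d, fragAcc, ldd, rocwmma::mem_row_major);
}
```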
rocWMMA is a header library that includes test and sample projects to validate and demonstrate API use. GEMM is used as the primary validation given the heavy precedent for the library. However, we are expanding our portfolio to demonstrate more diverse rocWMMA uses.
You can use rocWMMA with the following hardware:
- AMD CDNA class GPU featuring matrix core support: gfx908, gfx90a as 'gfx9'
- AMD RDNA3 class GPU featuring AI acceleration support: gfx1100, gfx1101, gfx1102 as 'gfx11'
Double-precision (FP64) data type support requires gfx90a.
rocWMMA software requirements include:
- ROCm stack minimum version 5.4
- rocm-cmake minimum version 0.8.0 for ROCm 5.3
- C++ 17
- CMake >=3.6
- OpenMP
Optional:
- rocBLAS minimum version 2.46.0 for ROCm 5.4 (for rocBLAS validation and benchmarks)
- Doxygen (for building documentation)
To build our documentation locally, use the following commands:

```bash
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
- Wave Size: Wave32 (gfx11), Wave64 (gfx9)
- Matrix Layout <LayoutA, LayoutB, LayoutC, LayoutD> (N = col major, T = row major)

  `<N, N, N, N>`, `<N, N, T, T>`, `<N, T, N, N>`, `<N, T, T, T>`, `<T, N, N, N>`, `<T, N, T, T>`, `<T, T, N, N>`, `<T, T, T, T>`
- Thread Block Sizes <TBlockX, TBlockY>
TBlockX must be a multiple of WaveSize
For GEMM, rocWMMA focuses on thread blocks of up to 4 waves for optimal resource usage and occupancy. Larger thread block sizes are possible but are not officially supported.
  WS = Wave Size: `<WS, 1>`, `<WS, 2>`, `<WS, 4>`, `<WS*2, 1>`, `<WS*2, 2>`, `<WS*4, 1>`
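As a hedged illustration of these constraints, the host-side helper below chooses launch dimensions where TBlockX is a multiple of the wave size. The assumption that each wave owns one BlockM x BlockN output tile, and the names `makeGrid` and `gemm_kernel`, are ours, not the library's:

```cpp
#include <hip/hip_runtime.h>

constexpr uint32_t WaveSize = 64; // Wave64 on gfx9; use 32 on gfx11
constexpr uint32_t BlockM   = 32;
constexpr uint32_t BlockN   = 32;

// Compute a grid for an M x N problem, assuming each wave produces one
// BlockM x BlockN output tile and M, N divide evenly into tiles.
dim3 makeGrid(uint32_t m, uint32_t n, dim3 blockDim)
{
    uint32_t wavesX = blockDim.x / WaveSize; // waves along X; TBlockX % WaveSize == 0
    return dim3((m / BlockM) / wavesX, (n / BlockN) / blockDim.y);
}

// Usage (illustrative):
//   dim3 blockDim(WaveSize * 2, 2);            // <WS*2, 2>: 4 waves per block
//   dim3 gridDim = makeGrid(M, N, blockDim);
//   hipLaunchKernelGGL(gemm_kernel, gridDim, blockDim, 0, 0, /* args */);
```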
Data Types <Ti / To / Tc> = <Input Type / Output Type / Compute Type>
Input Type = Matrix A/B
Output Type = Matrix C/D
Compute Type = math / accumulation type
gfx11 only supports BlockM/N = 16
| Ti / To / Tc | BlockM | BlockN | BlockK range (powers of 2) | Notes |
|---|---|---|---|---|
| i8 / i32 / i32 | 16 | 16 | [16, 256] | |
| i8 / i32 / i32 | 32 | 32 | [8, 128] | |
| i8 / i8 / i32 | 16 | 16 | [16, 256] | |
| i8 / i8 / i32 | 32 | 32 | [8, 128] | |
| f16 / f32 / f32 | 16 | 16 | [16, 256] | |
| f16 / f32 / f32 | 32 | 32 | [8, 128] | |
| f16 / f16 / f32 | 16 | 16 | [16, 256] | |
| f16 / f16 / f32 | 32 | 32 | [8, 128] | |
| f16 / f16 / f16* | 16 | 16 | [16, 256] | *= CDNA native f32 accumulation downcasted to fp16 |
| f16 / f16 / f16* | 32 | 32 | [8, 128] | |
| __half / f32 / f32 | 16 | 16 | [16, 256] | |
| __half / f32 / f32 | 32 | 32 | [8, 128] | |
| __half / __half / f32 | 16 | 16 | [16, 256] | |
| __half / __half / f32 | 32 | 32 | [8, 128] | |
| __half / __half / __half* | 16 | 16 | [16, 256] | *= CDNA native f32 accumulation downcasted to __half |
| __half / __half / __half* | 32 | 32 | [8, 128] | |
| bf16 / f32 / f32 | 16 | 16 | [8, 256] | |
| bf16 / f32 / f32 | 32 | 32 | [4, 128] | |
| bf16 / bf16 / f32 | 16 | 16 | [8, 256] | |
| bf16 / bf16 / f32 | 32 | 32 | [4, 128] | |
| bf16 / bf16 / bf16* | 16 | 16 | [8, 256] | *= CDNA native f32 accumulation downcasted to bf16 |
| bf16 / bf16 / bf16* | 32 | 32 | [4, 128] | |
| f32 / f32 / f32* | 16 | 16 | [4, 256] | *= Supported only on gfx9 |
| f32 / f32 / f32* | 32 | 32 | [2, 128] | |
| f64 / f64 / f64* | 16 | 16 | [4, 256] | *= Supported only on gfx90a+ |
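For example, the bf16 / f32 / f32 row above maps onto fragment declarations like the following sketch. The BlockK of 8 is chosen from the supported range, and the layouts are illustrative:

```cpp
#include <rocwmma/rocwmma.hpp>

// <Ti / To / Tc> = <bf16 / f32 / f32> at BlockM = BlockN = 16, BlockK = 8:
// A/B fragments carry the input type; the accumulator carries the compute type.
__device__ void declare_bf16_f32_f32()
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 8,
                      rocwmma::bfloat16_t, rocwmma::col_major> fragA;  // Ti
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 8,
                      rocwmma::bfloat16_t, rocwmma::row_major> fragB;  // Ti
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 8, float> fragAcc; // Tc
}
```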
`rocwmma::fill_fragment`: Broadcasts a desired value to all elements in the fragment.
`rocwmma::load_matrix_sync` / `rocwmma::store_matrix_sync`: Loads or stores data from memory according to the matrix layout:

- Matrix A layout: loads and stores matrix columns in the K direction (Matrix A = M x K, fragA = BlockM x BlockK)
- Matrix B layout: loads and stores matrix rows in the K direction (Matrix B = K x N, fragB = BlockK x BlockN)
- Matrix C layout: loads and stores matrix rows in a vector width of 4 (Matrix C = M x N, fragAcc = BlockM x BlockN)
Fragments are stored in packed registers in optimal load and store patterns. In-register elements have no guaranteed order, which optimizes loading and storing efficiency.
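Because of this, per-thread element access should treat the fragment as an opaque, unordered slice. As a hedged sketch, assuming rocWMMA fragments follow the familiar WMMA convention of exposing `num_elements` and an `x` accessor, a uniform elementwise operation can still be applied safely:

```cpp
#include <rocwmma/rocwmma.hpp>

// Apply a uniform elementwise op to an accumulator fragment; correctness
// must not depend on which tile elements a given thread happens to hold.
__device__ void relu_tile(
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float>& fragAcc)
{
    for(uint32_t i = 0; i < fragAcc.num_elements; ++i)
    {
        fragAcc.x[i] = fragAcc.x[i] > 0.0f ? fragAcc.x[i] : 0.0f; // fused ReLU
    }
}
```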
`rocwmma::mma_sync`: The MMA operation is performed on fragment data. The outer product of fragment A elements with fragment B elements is added to the accumulator fragment.
`rocwmma::synchronize_workgroup`: Flow control for synchronization across multiple wavefronts in a workgroup. It also ensures the synchronization of shared and global memory accesses across wavefronts.
`rocwmma::load_matrix_coop_sync` / `rocwmma::store_matrix_coop_sync`: Loads or stores data from memory according to the matrix layout, splitting the operation among the wave members of a workgroup:

- Matrix A layout: loads and stores matrix columns in the K direction (Matrix A = M x K, fragA = BlockM x BlockK)
- Matrix B layout: loads and stores matrix rows in the K direction (Matrix B = K x N, fragB = BlockK x BlockN)
- Matrix C layout: loads and stores matrix rows in a vector width of 4 (Matrix C = M x N, fragAcc = BlockM x BlockN)
Each contributing wave handles partial operation data, so split parameters should be consistent across subsequent operations. For example, cooperatively moving data from global to shared memory should use the same split parameters for the global load and the subsequent local store, as in the sketch below.
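The following is a hedged sketch of that flow: all waves in a block split a global-to-LDS staging copy of one A tile, then each wave loads its own full fragment from LDS. The wave index derivation, LDS extent, and exact coop-API overloads are illustrative assumptions:

```cpp
#include <rocwmma/rocwmma.hpp>

constexpr int BlockM = 16;
constexpr int BlockN = 16;
constexpr int BlockK = 16;

__global__ void coop_stage_a(rocwmma::float16_t const* a, int lda)
{
    __shared__ rocwmma::float16_t ldsA[BlockM * BlockK];

    using FragA = rocwmma::fragment<rocwmma::matrix_a, BlockM, BlockN, BlockK,
                                    rocwmma::float16_t, rocwmma::row_major>;

    // Identify this wave within the workgroup (assumes a 1D thread block).
    uint32_t waveIndex = threadIdx.x / warpSize;
    uint32_t waveCount = blockDim.x / warpSize;

    // Same split parameters for the cooperative global load and LDS store.
    FragA stage;
    rocwmma::load_matrix_coop_sync(stage, a, lda, waveIndex, waveCount);
    rocwmma::store_matrix_coop_sync(ldsA, stage, BlockK, waveIndex, waveCount);

    // Make LDS writes visible to all waves before reading back.
    rocwmma::synchronize_workgroup();

    // Each wave now loads its own full copy of the tile from LDS.
    FragA fragA;
    rocwmma::load_matrix_sync(fragA, ldsA, BlockK);
}
```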
Create and track a rocWMMA fork.
Clone your fork:
```bash
git clone -b develop https://github.com/<your_fork>/rocWMMA.git
.githooks/install
git checkout -b <new_branch>
...
git add <new_work>
git commit -m "What was changed"
git push origin <new_branch>
...
```
Create a pull request to the ROCmSoftwarePlatform/rocWMMA develop branch.
Await CI and approval feedback.
Once approved, merge.
You must install the GitHooks, because commits trigger clang-format formatting checks.
Option | Description | Default value |
---|---|---|
AMDGPU_TARGETS | Build code for specific GPU target(s) | gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx1100;gfx1101;gfx1102 |
ROCWMMA_BUILD_TESTS | Build Tests | ON |
ROCWMMA_BUILD_SAMPLES | Build Samples | ON |
ROCWMMA_BUILD_DOCS | Build doxygen documentation from code | OFF |
ROCWMMA_BUILD_ASSEMBLY | Generate assembly files | OFF |
ROCWMMA_BUILD_VALIDATION_TESTS | Build validation tests | ON (requires ROCWMMA_BUILD_TESTS=ON) |
ROCWMMA_BUILD_BENCHMARK_TESTS | Build benchmark tests | OFF (requires ROCWMMA_BUILD_TESTS=ON) |
ROCWMMA_BUILD_EXTENDED_TESTS | Build extended testing coverage | OFF (requires ROCWMMA_BUILD_TESTS=ON) |
ROCWMMA_VALIDATE_WITH_ROCBLAS | Use rocBLAS for validation tests | ON (requires ROCWMMA_BUILD_VALIDATION_TESTS=ON) |
ROCWMMA_BENCHMARK_WITH_ROCBLAS | Include rocBLAS benchmarking data | OFF (requires ROCWMMA_BUILD_BENCHMARK_TESTS=ON) |
By default, the project is configured in release mode and is linked against rocBLAS for validating results. Here are some configuration examples:
Configuration | Command |
---|---|
Basic | CC=hipcc CXX=hipcc cmake -B<build_dir> . |
Targeting gfx908 | CC=hipcc CXX=hipcc cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack- |
Debug build | CC=hipcc CXX=hipcc cmake -B<build_dir> . -DCMAKE_BUILD_TYPE=Debug |
Build without rocBLAS (default on) | CC=hipcc CXX=hipcc cmake -B<build_dir> . -DROCWMMA_VALIDATE_WITH_ROCBLAS=OFF -DROCWMMA_BENCHMARK_WITH_ROCBLAS=OFF |
After configuration, build with `cmake --build <build_dir> -- -j<nproc>`.
The build time for all projects can take several minutes. To reduce it, you can:

- Target a specific GPU (e.g., `gfx908:xnack-`)
- Use lots of threads (e.g., `-j64`)
- Select `ROCWMMA_BUILD_ASSEMBLY=OFF`
- Select `ROCWMMA_BUILD_DOCS=OFF`
- Select `ROCWMMA_BUILD_EXTENDED_TESTS=OFF`
- Specify either `ROCWMMA_BUILD_VALIDATION_TESTS` or `ROCWMMA_BUILD_BENCHMARK_TESTS` as ON, and the other as OFF
- For investigating particular kernels, build the ad hoc test with the parameters you are interested in
Manually build specific tests:

```bash
cd <build_dir>
make <target_name> -j64
```

Where `<target_name>` is one of the following:

| `<target_name>` | Description |
|---|---|
| rocwmma_unit_tests | Build all rocWMMA unit tests |
| rocwmma_gemm_tests_validate | Build all GEMM validation tests |
| rocwmma_gemm_tests_bench | Build all GEMM benchmark tests |
| rocwmma_dlrm_tests_validate | Build all deep learning recommendation model (DLRM) validation tests |
| rocwmma_dlrm_tests_bench | Build all DLRM benchmark tests |
| rocwmma_samples | Build all rocWMMA samples |
| Individual target name (`contamination_test`, `simple_sgemm`, etcetera) | Build an individual rocWMMA test or sample |

You can also manually reduce tests or test case coverage.
rocWMMA library features are showcased, validated, and benchmarked (if applicable) in GoogleTest applications.
With CTest enabled, you can run the entire test suite by running CTest in the build folder:

```bash
cd <build_dir>
ctest --output-on-failure
```
Otherwise, individual tests can be run as follows:
Unit tests for loading and storing APIs to verify that data boundaries are not crossed and pristine data remains untouched.
Run the validation:
<build_dir>/test/unit/contamination_test
Unit tests for vector cross-lane operations.
Run the validation:
<build_dir>/test/unit/cross_lane_ops_test
Tests the `rocwmma::fill_fragment` API function for all supported configurations. Tests broadcasting of a desired value to all elements in the fragment.
Run the validation:
<build_dir>/test/unit/fill_fragment_test
Unit test for I/O shape metadata generation on the host machine.
Run the validation:
<build_dir>/test/unit/io_shape_test
Unit test for I/O traits metadata generation on the host machine.
Run the validation:
<build_dir>/test/unit/io_traits_test
Unit tests for the internal collect and scatter matrix element to register mapping transforms.
Run the validation:
<build_dir>/test/unit/layout_test
Tests the `rocwmma::load_matrix_sync` and `rocwmma::store_matrix_sync` API functions for all supported configurations. Tests proper emplacement of data during loads and stores.
Run the validation:
```bash
<build_dir>/test/unit/load_store_matrix_sync_test
<build_dir>/test/unit/load_store_matrix_coop_sync_test
```
Unit tests for the utility class used to calculate transforms and offsets between grid, matrix, and data coordinate systems.
Run the validation:
<build_dir>/test/unit/map_util_test
Unit tests for internal vector iteration and navigation during access and storage.
Run the validation:
<build_dir>/test/unit/vector_iterator_test
Unit tests for internal vector storage.
Run the validation:
<build_dir>/test/unit/vector_test
Implements a GEMM blocking algorithm using rocWMMA for all supported parametric configurations. Validates against a CPU algorithm, or rocBLAS if available. Validation runs are performed on a reduced subset of matrix sizes, while benchmark runs are performed on a larger set of matrix sizes. Extended tests provide comprehensive parameter coverage and larger matrix sizes.
As of rocWMMA v0.8, we've introduced a GEMM kernel naming nomenclature to allow compact representation and quick identification of kernel implementation features. The rocWMMA GEMM test library includes kernels that support the following features:
- PGR# - Prefetch Global Read lookup stages. PGR0 = no global read prefetch; PGR1 = single-stage global read prefetch.
- LB# - LDS buffer count. LB0 = no LDS usage; LB2 = 2 LDS buffers used for swap.
- MP# - MFMA instruction priority. MP0 = default MFMA instruction priority of 0; MP1 = raise MFMA instruction priority to 1.
- MB - Multiple output blocks targeted per wave
- SB - Single output block targeted per wave
- NC - Non-cooperative load/store
- CP - Cooperative load/store
- BLK - Cooperative load/store per block tile
- WV - Cooperative load/store per wave tile
- WG - Cooperative load/store per macro tile
- `gemm_PGR0_LB0_MP0_SB_NC`: The simplest blocked GEMM example, which targets one output block of matrix multiplication per wave. No prefetch, no LDS usage, default MFMA prioritization, single block output, and non-cooperative.
- `gemm_PGR0_LB0_MP0_MB_NC`: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. No prefetch, no LDS usage, default MFMA prioritization, multiple block output, and non-cooperative.
- `gemm_PGR1_LB2_MP0_MB_CP_BLK`: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. This kernel leverages shared memory to implement a data prefetching pipeline and collaborates with other waves to improve performance. Implements single-stage prefetch, double LDS buffering, default MFMA prioritization, multiple block output, and is block-tile collaborative in global read and local write.
- `gemm_PGR1_LB2_MP0_MB_CP_WV`: Like the BLK kernel, but wave-tile collaborative in global read and local write.
- `gemm_PGR1_LB2_MP0_MB_CP_WG`: Like the BLK kernel, but macro-tile collaborative in global read and local write.
- Ad hoc test: An executable that focuses on a specific set of kernel parameters. This is used as a quick mock-up of a situational investigation of a particular GEMM kernel.
Validation tests are postfixed with `-validate`. Benchmark tests are postfixed with `-bench`.
Run the validation tests:
```bash
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_SB_NC-validate
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_MB_NC-validate
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_BLK-validate
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_WV-validate
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_WG-validate
```
Run the benchmark only:
Note that benchmark runs can take several hours to complete.
```bash
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_SB_NC-bench
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_MB_NC-bench
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_BLK-bench
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_WV-bench
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_WG-bench
```
Run the ad hoc test:
```bash
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_SB_NC_ad_hoc-validate
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_MB_NC_ad_hoc-validate
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_ad_hoc-validate
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_SB_NC_ad_hoc-bench
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_MB_NC_ad_hoc-bench
<build_dir>/test/gemm/gemm_PGR1_LB2_MP0_MB_CP_ad_hoc-bench
```
| Compact | Verbose | Description |
|---|---|---|
| -os <output_file>.csv | --output_stream <output_file>.csv | Stream GEMM testing output to a CSV file |
| | --omit | Omits certain outputs: 1 = SKIPPED tests; 2 = FAILED tests; 4 = PASSED tests; 8 = all non-gtest output |
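For example, using the flags from the table above (the test binary and mask value here are illustrative):

```bash
# Stream benchmark output to a CSV file and omit SKIPPED-test output (mask 1).
<build_dir>/test/gemm/gemm_PGR0_LB0_MP0_SB_NC-bench -os results.csv --omit 1
```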
To reduce test run time:

- Use GoogleTest filters to target specific test names:

  ```bash
  <test_exe> --gtest_filter=*name_filter*
  ```

- Manually adjust the test case coverage
- Use ad hoc tests to focus on specific parameters
- Select `ROCWMMA_BUILD_EXTENDED_TESTS=OFF`
These are standalone, practical use cases for the rocWMMA API. They have minimal dependencies and represent a targeted application with a fixed set of parameters.
Use MMA to demonstrate rocWMMA API usage in the context of wave-level GEMM computation, in both simplified and optimized versions.
Simple GEMM algorithm demonstration without LDS memory usage and no transpose.
simple_sgemm calculates D = Alpha * A x B + Beta * C with fp32 inputs and output.
simple_dgemm calculates D = Alpha * A x B + Beta * C with fp64 inputs and output.
simple_hgemm calculates D = Alpha * A x B + Beta * C with fp16 inputs and output.
Includes a simple CPU validation and benchmark.
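As a hedged sketch of what such a CPU validation computes (our own minimal reference, not the sample's code), with row-major A (m x k), B (k x n), and C/D (m x n):

```cpp
// Reference GEMM: D = alpha * (A x B) + beta * C, all matrices row-major.
void gemm_cpu(int m, int n, int k, float alpha, float const* A, float const* B,
              float beta, float const* C, float* D)
{
    for(int i = 0; i < m; ++i)
    {
        for(int j = 0; j < n; ++j)
        {
            float acc = 0.0f;
            for(int h = 0; h < k; ++h)
            {
                acc += A[i * k + h] * B[h * n + j]; // dot(row i of A, col j of B)
            }
            D[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```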
Run a simple GEMM sample:
```bash
<build_dir>/samples/simple_sgemm
<build_dir>/samples/simple_dgemm
<build_dir>/samples/simple_hgemm
```
To implement and measure performance of matrix multiply-accumulate (D = Alpha * A x B + Beta * C) with user-defined configurations on a GPU.
It contains the best performing version of the multi-block GEMM algorithm, with LDS memory, macro-tile collaboration, data reuse, and an optimized pipeline, configured with the finest parameters for larger sizes (1K and greater).
perf_sgemm calculates D = Alpha * A x B + Beta * C with fp32 inputs and output.
perf_dgemm calculates D = Alpha * A x B + Beta * C with fp64 inputs and output.
perf_hgemm calculates D = Alpha * A x B + Beta * C with fp16 inputs and output.
Includes a simple CPU validation and benchmark.
Run a perf GEMM sample:
```bash
<build_dir>/samples/perf_sgemm
<build_dir>/samples/perf_dgemm
<build_dir>/samples/perf_hgemm
```
Simple MMA with a vector demonstration, without LDS or transpose.
Calculates Y = alpha * (A) * X + beta * Y with fp32 inputs and output.
Includes a simple CPU validation and benchmark.
A = Matrix of size m * k (col-major)
X = Vector of size k * 1 (col-major)
Y = accumulator of size m * 1 (col-major)
Run the `sgemv` sample:
<build_dir>/samples/simple_sgemv
Simple MMA with a vector demonstration, without LDS or transpose.
Calculates Y = alpha * (A) * X + beta * Y with fp64 inputs and output.
Includes a simple CPU validation and benchmark.
A = Matrix of size m * k (row-major)
X = Vector of size k * 1 (col-major)
Y = accumulator of size m * 1 (row-major)
Run the `dgemv` sample:
<build_dir>/samples/simple_dgemv
Simple deep learning recommendation model (DLRM) for machine learning. Implements both forward and backward passes on fp16 inputs and outputs.
Includes a simple CPU validation and benchmark.
Run the `simple_dlrm` sample:
<build_dir>/samples/simple_dlrm
The HIP runtime compilation (hipRTC) environment allows simultaneous compilation, loading, and running of device code on AMD GPUs. The rocWMMA library is compatible with hipRTC, so you can leverage it for runtime-generated kernels. A simple GEMM sample is included to demonstrate compatibility. For more information, refer to the HIP API reference.
The `rocwmma::bfloat16_t` data type is not currently supported in hipRTC.
A simple GEMM algorithm demonstrating runtime compilation (hipRTC) compatibility. Calculates D = Alpha * A x B + Beta * C with mixed-precision fp16 inputs and fp32 output. Includes code compilation, module loading, and kernel launch with the hipRTC API, with simple CPU validation and benchmarking.
Run the `hipRTC_gemm` sample:
<build_dir>/samples/hipRTC_gemm
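For orientation, here is a hedged sketch of the hipRTC flow the sample exercises. The kernel string and names are placeholders, not the sample's code, and error checking is omitted:

```cpp
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <vector>

int main()
{
    // Placeholder kernel source; the real sample compiles a rocWMMA GEMM.
    const char* src = R"(
        extern "C" __global__ void k(float* x) { x[threadIdx.x] *= 2.0f; }
    )";

    // Compile the source string at runtime.
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, src, "k.cpp", 0, nullptr, nullptr);
    hiprtcCompileProgram(prog, 0, nullptr); // check hiprtcGetProgramLog on failure

    // Extract the generated code object.
    size_t codeSize;
    hiprtcGetCodeSize(prog, &codeSize);
    std::vector<char> code(codeSize);
    hiprtcGetCode(prog, code.data());
    hiprtcDestroyProgram(&prog);

    // Load the code object as a module and fetch the kernel handle.
    hipModule_t module;
    hipFunction_t kernel;
    hipModuleLoadData(&module, code.data());
    hipModuleGetFunction(&kernel, module, "k");
    // ... set up arguments and launch with hipModuleLaunchKernel ...
    return 0;
}
```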