# gpu-arch-microbenchmark

Dissecting NVIDIA GPU Architecture.
- install

Install the [turingas](https://github.com/daadaada/turingas) assembler (fetched as a submodule by `--recursive`):

```bash
git clone --recursive git@github.com:sjfeng1999/gpu-arch-microbenchmark.git
cd turingas
python setup.py install
```

- build & run

Build with CMake, assemble the SASS kernels for your GPU architecture (`70`, `75`, or `80`), then run any of the benchmark binaries:

```bash
mkdir build && cd build
cmake .. && make
python ../compile_sass.py -arch=(70|75|80)
./(memory_latency|reg_bankconflict|...)
```
- memory latency

Memory | Unit | Turing RTX-2070 (TU104) |
---|---|---|
Global | cycle | 1000 ~ 1200 |
TLB | cycle | 472 |
L2 | cycle | 236 |
L1 | cycle | 32 |
Shared | cycle | 23 |
Constant | cycle | 448 |
Constant L2 | cycle | 62 |
Constant L1 | cycle | 4 |

- the constant L1 cache is as fast as a register.
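The latencies above are measured with dependent loads (pointer chasing), so each access must wait for the previous one to return. A minimal CUDA-level sketch of the technique, assuming a buffer sized to the cache level under test (the repository's actual probes are hand-written SASS via turingas):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pointer-chasing latency probe: each load's address depends on the
// previous load's value, so cycles/iteration approximate load latency.
__global__ void chase(const unsigned *buf, int iters,
                      unsigned *sink, long long *cycles) {
    unsigned idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        idx = buf[idx];                  // serialized dependent loads
    }
    long long stop = clock64();
    *sink = idx;                         // keep the chain alive
    *cycles = (stop - start) / iters;
}

int main() {
    const int n = 1024, iters = 1 << 16;
    unsigned h[n];
    for (int i = 0; i < n; ++i) h[i] = (i + 1) % n;   // stride-1 ring
    unsigned *buf, *sink; long long *cyc;
    cudaMalloc(&buf, n * sizeof(unsigned));
    cudaMalloc(&sink, sizeof(unsigned));
    cudaMalloc(&cyc, sizeof(long long));
    cudaMemcpy(buf, h, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    chase<<<1, 1>>>(buf, iters, sink, cyc);
    long long result;
    cudaMemcpy(&result, cyc, sizeof(result), cudaMemcpyDeviceToHost);
    printf("~%lld cycles per dependent load\n", result);
    return 0;
}
```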
- memory bandwidth within a single thread

Load | Unit | Turing RTX-2070 |
---|---|---|
Global LDG.128 | GB/s | 194.12 |
Global LDG.64 | GB/s | 140.77 |
Global LDG.32 | GB/s | 54.18 |
Shared LDS.128 | GB/s | 152.96 |
Shared LDS.64 | GB/s | 30.58 |
Shared LDS.32 | GB/s | 13.32 |
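LDG.128/64/32 are the SASS encodings produced by `float4`, `float2`, and scalar `float` loads; wider loads move the same bytes with fewer instructions, which is why LDG.128 wins within one thread. A hedged sketch of the 128-bit case (illustrative, not the repository's exact kernel):

```cuda
// Single-thread copy using 128-bit accesses: float4 loads/stores
// typically compile to LDG.E.128 / STG.E.128.
__global__ void copy128(const float4 *__restrict__ src,
                        float4 *__restrict__ dst, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}
```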
- global memory bandwidth across (64 blocks × 256 threads)

Access Pattern | Unit | Turing RTX-2070 |
---|---|---|
LDG.32 | GB/s | 246.65 |
LDG.32 Group1 Stride1 | GB/s | 118.73(2X) |
LDG.32 Group2 Stride2 | GB/s | 119.08(2X) |
LDG.32 Group4 Stride4 | GB/s | 117.11(2X) |
LDG.32 Group8 Stride8 | GB/s | 336.27 |
LDG.64 | GB/s | 379.24 |
LDG.64 Group1 Stride1 | GB/s | 126.40(2X) |
LDG.64 Group2 Stride2 | GB/s | 124.51(2X) |
LDG.64 Group4 Stride4 | GB/s | 398.84 |
LDG.64 Group8 Stride8 | GB/s | 371.28 |
LDG.128 | GB/s | 391.83 |
LDG.128 Group1 Stride1 | GB/s | 125.25(2X) |
LDG.128 Group2 Stride2 | GB/s | 402.55 |
LDG.128 Group4 Stride4 | GB/s | 394.22 |
LDG.128 Group8 Stride8 | GB/s | 396.10 |
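The Group/Stride variants scatter each warp's addresses; the "(2X)" rows appear to be patterns whose requests split into twice the memory transactions, halving effective bandwidth. A generic sketch of how such patterns can be generated (the repository's exact indexing may differ):

```cuda
// Grouped-stride read: lanes within a group of GROUP threads stay
// contiguous, but groups are placed GROUP * STRIDE elements apart.
// The fewer 32-byte sectors a warp's 32 addresses share, the more
// transactions each request costs.
template <int GROUP, int STRIDE>
__global__ void strided_read(const float *__restrict__ src,
                             float *__restrict__ dst, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = (tid / GROUP) * GROUP * STRIDE + (tid % GROUP);
    if (idx < n) dst[tid] = src[idx];
}
```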
- cache linesize

Cache | Unit | Turing RTX-2070 (TU104) |
---|---|---|
L2 Linesize | bytes | 64 |
L1 Linesize | bytes | 32 |
Constant L2 Linesize | bytes | 256 |
Constant L1 Linesize | bytes | 32 |
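Linesize is usually inferred with a stride sweep of the pointer-chase ring: cycles per load step up once the stride crosses a line boundary, because every access then fetches a fresh line. A sketch of the host-side ring builder (`build_ring` is an illustrative helper, not a repository function; it plugs into the latency probe above):

```cuda
// Build a chase ring with a given element stride. Sweep the stride and
// re-run the latency kernel: the cycles-per-load curve steps up when
// stride * sizeof(unsigned) exceeds the cache line size.
void build_ring(unsigned *h, int n_elems, int stride_elems) {
    for (int i = 0; i < n_elems; ++i)
        h[i] = (unsigned)((i + stride_elems) % n_elems);
}
```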
- instruction CPI and register bank conflict

Instruction | Unit | Conflict | Without Conflict | Reg Reuse | Double Reuse |
---|---|---|---|---|---|
FFMA | cycle | 3.516 | 2.969 | 2.938 | 2.938 |
IADD3 | cycle | 3.031 | 2.062 | 2.031 | 2.031 |
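The conflict and reuse columns require controlling SASS register allocation, which is exactly what turingas provides: operands drawn from the same register bank collide, and the reuse cache avoids a re-read. From CUDA you can only approximate the issue rate with a throughput probe like this sketch:

```cuda
// FFMA issue-rate (CPI) probe: four independent accumulator chains hide
// the FFMA latency, so the loop measures throughput, not dependency
// latency. Register banking is up to the compiler here, so this cannot
// reproduce the conflict/reuse columns of the table.
__global__ void ffma_cpi(float a, float b, long long *cycles, float *sink) {
    float x0 = a, x1 = a + 1.f, x2 = a + 2.f, x3 = a + 3.f;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i) {
        x0 = fmaf(x0, a, b);
        x1 = fmaf(x1, a, b);
        x2 = fmaf(x2, a, b);
        x3 = fmaf(x3, a, b);
    }
    long long stop = clock64();
    *cycles = (stop - start) / (256 * 4);    // ~cycles per FFMA
    *sink = x0 + x1 + x2 + x3;               // keep results live
}
```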
- shared memory load latency

Shared Load | Unit | Turing RTX-2070 (TU104) |
---|---|---|
Single | cycle | 23 |
Vector2 X 2 | cycle | 27 |
Conflict Strided | cycle | 41 |
Conflict-Free Strided | cycle | 32 |
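Shared memory has 32 four-byte banks, and lanes that hit the same bank serialize. A sketch of the conflict vs conflict-free strided cases (to time it, wrap the probed load in a clock64() loop as in the latency probe above):

```cuda
// One warp reads shared memory with a configurable stride.
// stride = 32: all 32 lanes map to bank 0 (32-way conflict, slow row).
// stride = 33: lane * 33 mod 32 == lane, one lane per bank (conflict-free).
__global__ void smem_strided(float *out, int stride) {
    __shared__ float buf[32 * 33];
    int lane = threadIdx.x;                 // launch with <<<1, 32>>>
    for (int i = lane; i < 32 * 33; i += 32)
        buf[i] = (float)i;                  // fill the buffer
    __syncwarp();
    out[lane] = buf[lane * stride];         // the probed load
}
```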
- TODO
  - warp scheduling
  - L1/L2 cache associativity (n-way, k-set)
- reference
  - Jia, Zhe, et al. "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking." arXiv preprint arXiv:1804.06826 (2018).
  - Jia, Zhe, et al. "Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking." arXiv preprint arXiv:1903.07486 (2019).
  - Yan, Da, Wei Wang, and Xiaowen Chu. "Optimizing Batched Winograd Convolution on GPUs." Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2020. (turingas)