Dissecting NVIDIA GPU Architecture


Prerequisites

  1. install the turingas compiler

    git clone --recursive git@github.com:sjfeng1999/gpu-arch-microbenchmark.git
    cd gpu-arch-microbenchmark/turingas
    python setup.py install

Usage

  1. mkdir build && cd build
  2. cmake .. && make
  3. python ../compile_sass.py -arch=(70|75|80)
  4. ./(memory_latency|reg_bankconflict|...)

Microbenchmark

1. Memory Latency

| Device | Latency | Turing RTX-2070 (TU104) |
|---|---|---|
| Global Latency | cycle | 1000 ~ 1200 |
| TLB Latency | cycle | 472 |
| L2 Latency | cycle | 236 |
| L1 Latency | cycle | 32 |
| Shared Latency | cycle | 23 |
| Constant Latency | cycle | 448 |
| Constant L2 Latency | cycle | 62 |
| Constant L1 Latency | cycle | 4 |
  • Constant L1 cache is as fast as a register.
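
The repository's latency kernels are hand-written in SASS via turingas, but the idea can be sketched in plain CUDA: a single thread walks a dependent chain of indices (each load's address comes from the previous load), so the average cycle count per hop approximates the latency of whichever level of the hierarchy the chain fits in. The kernel below is an illustrative sketch; its name and interface are assumptions, not the repository's code.

    // Illustrative pointer-chasing latency probe (not the repository's SASS kernel).
    // 'next' holds a dependent index chain sized and strided to target one cache level.
    __global__ void chase_latency(const unsigned *next, unsigned steps,
                                  unsigned *sink, long long *cycles)
    {
        unsigned idx = 0;
        long long start = clock64();
        for (unsigned i = 0; i < steps; ++i)
            idx = next[idx];                  // each load depends on the previous one
        long long stop = clock64();
        *sink = idx;                          // keep the chain from being optimized away
        *cycles = (stop - start) / steps;     // ~cycles per dependent load
    }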

2. Memory Bandwidth

  1. memory bandwidth within one thread
| Device | Bandwidth | Turing RTX-2070 |
|---|---|---|
| Global LDG.128 | GB/s | 194.12 |
| Global LDG.64 | GB/s | 140.77 |
| Global LDG.32 | GB/s | 54.18 |
| Shared LDS.128 | GB/s | 152.96 |
| Shared LDS.64 | GB/s | 30.58 |
| Shared LDS.32 | GB/s | 13.32 |
  2. global memory bandwidth across (64 blocks * 256 threads)
| Device | Bandwidth | Turing RTX-2070 |
|---|---|---|
| LDG.32 | GB/s | 246.65 |
| LDG.32 Group1 Stride1 | GB/s | 118.73 (2X) |
| LDG.32 Group2 Stride2 | GB/s | 119.08 (2X) |
| LDG.32 Group4 Stride4 | GB/s | 117.11 (2X) |
| LDG.32 Group8 Stride8 | GB/s | 336.27 |
| LDG.64 | GB/s | 379.24 |
| LDG.64 Group1 Stride1 | GB/s | 126.40 (2X) |
| LDG.64 Group2 Stride2 | GB/s | 124.51 (2X) |
| LDG.64 Group4 Stride4 | GB/s | 398.84 |
| LDG.64 Group8 Stride8 | GB/s | 371.28 |
| LDG.128 | GB/s | 391.83 |
| LDG.128 Group1 Stride1 | GB/s | 125.25 (2X) |
| LDG.128 Group2 Stride2 | GB/s | 402.55 |
| LDG.128 Group4 Stride4 | GB/s | 394.22 |
| LDG.128 Group8 Stride8 | GB/s | 396.10 |
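
The LDG.32/64/128 rows come from SASS-level kernels, but the effect of load width is easy to picture in CUDA C: a float4 copy compiles to 128-bit LDG/STG instructions and moves four times the data per instruction of a scalar float copy. The sketch below is an assumption-level illustration (kernel name and launch shape are not from the repository); launching it with a single thread mirrors the per-thread table, while a 64-block, 256-thread launch mirrors the second.

    // Illustrative 128-bit-load copy (compiles to LDG.E.128 / STG.E.128);
    // replacing float4 with float gives the 32-bit variant for comparison.
    __global__ void copy128(const float4 *__restrict__ src,
                            float4       *__restrict__ dst, size_t n)
    {
        size_t i      = blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (; i < n; i += stride)
            dst[i] = src[i];                  // one 128-bit load + store per element
    }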

3. Cache Linesize

| Device | Linesize | Turing RTX-2070 (TU104) |
|---|---|---|
| L2 Linesize | bytes | 64 |
| L1 Linesize | bytes | 32 |
| Constant L2 Linesize | bytes | 256 |
| Constant L1 Linesize | bytes | 32 |
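
One common way to expose a line size (a sketch under assumptions, not necessarily the repository's method) is to touch one word and then time a dependent load delta bytes away: the second load stays at the owning level's hit latency until delta crosses the line boundary, then jumps to the next level's latency.

    // Illustrative line-size probe: sweep delta_words from the host and watch
    // for the latency step. clock64() bracketing is approximate in CUDA C;
    // the repository's SASS kernels time this more precisely.
    __global__ void linesize_probe(const unsigned *buf, unsigned delta_words,
                                   long long *cycles, unsigned *sink)
    {
        unsigned a = buf[0];                      // bring the first line into cache
        long long start = clock64();
        unsigned b = buf[(a & 0u) + delta_words]; // depends on 'a', so loads stay ordered
        long long stop = clock64();
        *cycles = stop - start;
        *sink = a + b;                            // keep both loads live
    }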

4. Reg Bankconflict

| Instruction | CPI | conflict | without conflict | reg reuse | double reuse |
|---|---|---|---|---|---|
| FFMA | cycle | 3.516 | 2.969 | 2.938 | 2.938 |
| IADD3 | cycle | 3.031 | 2.062 | 2.031 | 2.031 |

5. Shared Bankconflict

| Memory Load | Latency | Turing RTX-2070 (TU104) |
|---|---|---|
| Single | cycle | 23 |
| Vector2 X 2 | cycle | 27 |
| Conflict Strided | cycle | 41 |
| Conflict-Free Strided | cycle | 32 |
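
Shared memory on these parts is split into 32 four-byte banks, so a warp whose lanes access addresses 32 words apart lands every lane in the same bank and serializes (the "Conflict Strided" row), while padding each row by one word spreads the lanes across all banks again (the "Conflict-Free Strided" row). The kernel below is a generic illustration of that layout trick, not the repository's benchmark.

    // Illustrative bank-conflict demo: column accesses on a 32x32 array hit one
    // bank 32 times; a 32x33 array shifts each row by one bank and avoids it.
    __global__ void shared_stride(float *out)
    {
        __shared__ float mat[32][32];            // stride-32 column: 32-way conflict
        __shared__ float pad[32][33];            // +1 word of padding per row
        int lane = threadIdx.x;                  // assumes a single 32-thread warp

        mat[lane][0] = (float)lane;              // all 32 lanes hit bank 0
        pad[lane][0] = (float)lane;              // lanes spread across 32 banks
        __syncthreads();

        out[lane] = mat[0][lane] + pad[0][lane]; // row accesses are conflict-free
    }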

Instruction Efficiency

Roadmap

  • warp schedule
  • L1/L2 cache n-way k-set

Citation

  • Jia, Zhe, et al. "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking." arXiv preprint arXiv:1804.06826 (2018).
  • Jia, Zhe, et al. "Dissecting the NVIDIA Turing T4 GPU via microbenchmarking." arXiv preprint arXiv:1903.07486 (2019).
  • Yan, Da, Wei Wang, and Xiaowen Chu. "Optimizing batched Winograd convolution on GPUs." Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020. (turingas)
