Using well-known CNN models in PyTorch, we run benchmarks on various GPUs.


lamikr/pytorch-gpu-benchmark

 
 


Comparison of training and inference speed of different GPUs with various CNN models in PyTorch.

List of tested AMD and NVIDIA GPUs: see "List of Benchmarked GPUs".

Example Results

The following benchmark results were generated with the command: ./show_benchmarks_resuls.sh

The graph shows the 7700S results with both PyTorch 2.3.1 and PyTorch 2.4.0. The PyTorch 2.4.0 build from ROCM SDK Builder contains optimized flash attention support for the AMD RX 7700S (and other gfx1100/gfx1101/gfx1102 and gfx1103 cards).

Resnet Benchmark for Half-type

Benchmark Execution

Benchmarking All GPUs

This command uses PyTorch to detect all GPUs, runs the benchmark on each of them separately, and finally runs the benchmark that uses all of the GPUs together.

./run_benchmarks.sh
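The sequence described above can be sketched in Python as follows. This is only an illustration, not the content of run_benchmarks.sh: the helper names `detect_gpu_count` and `plan_benchmark_commands` are hypothetical, and the exact arguments the script passes are an assumption based on the `-i` and `-g` options described in this README. `torch.cuda.device_count()` is the standard PyTorch call for counting visible GPUs.

```python
def detect_gpu_count():
    """Return the number of GPUs PyTorch can see, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def plan_benchmark_commands(gpu_count, script="benchmark_models.py"):
    """Build one command per GPU, then one command that uses all GPUs.

    Assumption: "-i" selects the GPU index and "-g" gives the GPU count,
    as described in this README; the real script may differ.
    """
    commands = [
        ["python3", script, "-i", str(index), "-g", "1"]
        for index in range(gpu_count)
    ]
    # The combined run only makes sense when more than one GPU is present.
    if gpu_count > 1:
        commands.append(["python3", script, "-g", str(gpu_count)])
    return commands
```

Calling `plan_benchmark_commands(detect_gpu_count())` would then give the list of commands to run one after another, for example with `subprocess.run`.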

Benchmarking One GPU

This command shows how to run the benchmark on a single GPU by using the -i parameter.

python3 benchmark_models.py -i 1 -g 1

The first GPU has index 0, the second has index 1, and so on.
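As a rough sketch of how the two options could interact, the snippet below parses `-i` and `-g` with `argparse` and returns the GPU indices that would be used. The function name `select_devices`, the default values, and the use of -1 for "no index given" are assumptions made for illustration; the actual parsing in benchmark_models.py may be different.

```python
import argparse


def select_devices(argv):
    """Return the list of GPU indices implied by the -i and -g options.

    Assumed behavior: if -i is given, only that GPU is used; otherwise
    GPUs 0 .. g-1 are used together.
    """
    parser = argparse.ArgumentParser(description="GPU selection sketch")
    parser.add_argument("-i", type=int, default=-1,
                        help="index of the single GPU to benchmark (0-based)")
    parser.add_argument("-g", type=int, default=1,
                        help="total number of GPUs to use")
    args = parser.parse_args(argv)
    if args.i >= 0:
        return [args.i]
    return list(range(args.g))
```

With this sketch, the arguments `-i 1 -g 1` select only the second GPU, while `-g 2` without `-i` selects GPUs 0 and 1.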

Benchmark Results

  • New results are stored under the "new_results" folder
  • Existing old results are under the results folder
  • After running the benchmarks, you can create a pull request on GitHub to get them merged
  • You can view the results of new benchmarks by adding the name of the result file to plot_benchmarks.py and then running the show_benchmarks.sh script.

List of Benchmarked GPUs

  • AMD_Radeon_RX_6800
  • AMD_Radeon_RX_7900_XTX
  • AMD_Radeon_RX_7700S (Framework 16 laptop discrete GPU)
  • AMD_Radeon_780M (Framework 16 laptop iGPU)
  • 1080TI
  • TITAN XP
  • TITAN V
  • 2080TI
  • Titan RTX
  • RTX 2060
  • RTX 3090
  • A100-PCIE
  • A100-SXM4

Specification

| Graphics Card Name | GTX 1080 Ti | TITAN XP | TITAN V | RTX 2060 | RTX 2080 Ti | TITAN RTX | A100-PCIE | RTX 3090 |
|---|---|---|---|---|---|---|---|---|
| Process | 16 nm | 16 nm | 12 nm | 12 nm | 12 nm | 12 nm | 7 nm | 8 nm |
| Die Size | 471 mm² | 471 mm² | 815 mm² | 445 mm² | 754 mm² | 754 mm² | 826 mm² | 628 mm² |
| Transistors | 11,800 million | 11,800 million | 21,100 million | 10,800 million | 18,600 million | 18,600 million | 54,200 million | 28,300 million |
| CUDA Cores | 3584 | 3840 | 5120 | 1920 | 4352 | 4608 | 6912 | 10496 |
| Tensor Cores | None | None | 640 | 240 | 544 | 576 | 432 | 328 |
| Clock (base) | 1481 MHz | 1405 MHz | 1200 MHz | 1365 MHz | 1350 MHz | 1350 MHz | 765 MHz | 1395 MHz |
| FP16 (half) | 177.2 GFLOPS | 189.8 GFLOPS | 29,798 GFLOPS | 12.90 TFLOPS | 26,895 GFLOPS | 32.62 TFLOPS | 77.97 TFLOPS | 35.58 TFLOPS |
| FP32 (float) | 11,340 GFLOPS | 12.15 TFLOPS | 14,899 GFLOPS | 6.451 TFLOPS | 13,448 GFLOPS | 16.31 TFLOPS | 19.49 TFLOPS | 35.58 TFLOPS |
| FP64 (double) | 354.4 GFLOPS | 379.7 GFLOPS | 7,450 GFLOPS | 201.6 GFLOPS | 420.2 GFLOPS | 509.8 GFLOPS | 9.746 TFLOPS | 556.0 GFLOPS |
| Memory | 11 GB GDDR5X | GDDR5X | 12 GB HBM2 | 6 GB GDDR6 | 11 GB GDDR6 | 24 GB GDDR6 | 40 GB HBM2e | 24 GB GDDR6X |
| Memory Interface | 352-bit | 384-bit | 3072-bit | 192-bit | 352-bit | 384-bit | 5120-bit | 384-bit |
| Memory Bandwidth | 484 GB/s | 547.6 GB/s | 653 GB/s | 336.0 GB/s | 616 GB/s | 672.0 GB/s | 1,555 GB/s | 936.2 GB/s |
| Price | $699 US | $1,199 US | $2,999 US | $349 US | $1,199 US | $2,499 US | | $1,499 US |
| Release Date | Mar 10th, 2017 | Apr 6th, 2017 | Dec 7th, 2017 | Jan 7th, 2019 | Sep 20th, 2018 | Dec 18th, 2018 | Jun 22nd, 2020 | Sep 1st, 2020 |

reference site

  1. Single and multi GPU with batch size 12: compare training and inference speed of SqueezeNet, VGG-16, VGG-19, ResNet18, ResNet34, ResNet50, ResNet101, ResNet152, DenseNet121, DenseNet169, DenseNet201, DenseNet161, mobilenet, mnasnet, ...

  2. Experiments are performed with three data types: single precision, double precision, and half precision.

  3. Plots are made with plotly.

Usage

./run_benchmarks.sh

Results

Requirement

  • python>=3.6 (for f-string formatting)
  • torchvision
  • torch>=1.0.0
  • pandas
  • psutil
  • plotly(for plot)
  • cufflinks(for plot)

Environment

  • PyTorch version: 2.3
  • Number of GPUs on current device: 4
  • CUDA version: 10.0
  • cuDNN version: 7601
  • nvcr.io/nvidia/pytorch:20.10-py3 (Docker container used for A100 and 3090)

Change Log

  • 2024/07/22
    • Benchmarks can now also be run on AMD GPUs
    • The ./run_benchmarks.sh script now uses PyTorch to query the GPU count; it first runs the tests for each device separately and then runs them using all GPUs simultaneously
    • New benchmark results are saved to the new_results/<gpu_index>/<gpu_name> folder
    • Added a new "-i" option which can be used to specify which GPU to use
    • If the GPU index is not specified with the -i option but the total GPU count specified by the -g option is > 1, the tests are run using all GPUs simultaneously
  • 2021/02/27
    • Added results for RTX 3090
    • Added results for RTX 2060 (thanks to gutama)
  • 2021/01/07
    • Added results for TITAN XP
  • 2021/01/05
    • Added results for A100-PCIE (PR #14)
  • 2021/01/04
    • Added results for A100 SXM4
    • Added results for Titan RTX
    • Edited coding style of benchmark_model
      • f-formatting
      • save option for json
    • Edited test.sh for bash shell
    • Edited README.md
  • 2020/09/01
    • Added results for Windows 10
    • Edited README.md
  • 2020/01/17
    • Edited coding style and fixed some bugs
    • Changed plot method
    • Added results of various model experiments (2080 Ti only)
  • 2019/01/09
    • PR: fixed typos (thanks to johmathe)
    • Added requirements.txt
    • Added result figures
    • Added ('TkAgg') for CLI
    • Added multi-GPU results (DGX Station)
  • 2021/02/27
  • 2021/01/05, thanks to kirk86 (PR #14)
  • 2021/01/04
  • thanks to olixu
  • based on 2020/01/17 update

Comparison between networks (single GPU)

Each network is fed 12 images with dimensions 224x224x3. For training, the durations of 20 forward and backward passes are averaged. For inference, the durations of 20 forward passes are averaged. 5 warm-up steps are performed first; they do not count towards the final result.
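The measurement scheme described above (5 untimed warm-up steps, then the average over 20 timed passes) can be sketched as below. This is a generic illustration rather than the code in benchmark_models.py; the function name `average_duration_ms` and its parameters are hypothetical. On a GPU, kernels run asynchronously, so a synchronization call such as `torch.cuda.synchronize` would typically be passed as `sync` so that each timing covers the finished work.

```python
import time


def average_duration_ms(step, warmup=5, repeats=20, sync=None):
    """Average wall-clock time of `step` in milliseconds.

    `step` is any callable performing one pass (forward only for inference,
    forward plus backward for training). `sync` is an optional callable that
    waits for pending device work, e.g. torch.cuda.synchronize.
    """
    # Warm-up passes are executed but not timed.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # make sure the pass has actually finished
        durations.append((time.perf_counter() - start) * 1000.0)
    return sum(durations) / len(durations)
```

For example, `average_duration_ms(lambda: model(batch), sync=torch.cuda.synchronize)` would time inference for a hypothetical `model` and `batch` already placed on the GPU.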

I conducted the experiment using two RTX 2080ti.

| Mode | gpu | precision | densenet121 | densenet161 | densenet169 | densenet201 | resnet101 | resnet152 | resnet18 | resnet34 | resnet50 | squeezenet1_0 | squeezenet1_1 | vgg16 | vgg16_bn | vgg19 | vgg19_bn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | TITAN V | single | 56.17 ms | 120.7 ms | 72.59 ms | 93.35 ms | 84.59 ms | 119.5 ms | 16.69 ms | 28.27 ms | 50.54 ms | 15.30 ms | 9.857 ms | 72.85 ms | 80.95 ms | 85.55 ms | 94.42 ms |
| Inference | TITAN V | single | 17.49 ms | 39.33 ms | 23.63 ms | 30.93 ms | 23.96 ms | 34.22 ms | 4.827 ms | 8.428 ms | 14.27 ms | 4.565 ms | 2.765 ms | 22.94 ms | 25.41 ms | 27.55 ms | 30.28 ms |
| Training | TITAN V | double | 139.8 ms | 387.4 ms | 175.9 ms | 224.5 ms | 509.9 ms | 720.0 ms | 94.21 ms | 194.6 ms | 271.7 ms | 68.38 ms | 31.18 ms | 1463. ms | 1484. ms | 1993. ms | 2016. ms |
| Inference | TITAN V | double | 47.68 ms | 170.5 ms | 60.73 ms | 78.43 ms | 317.7 ms | 448.6 ms | 60.26 ms | 129.9 ms | 159.8 ms | 42.37 ms | 11.95 ms | 1261. ms | 1266. ms | 1745. ms | 1751. ms |
| Training | TITAN V | half | 43.79 ms | 75.16 ms | 57.53 ms | 70.88 ms | 47.82 ms | 67.43 ms | 10.48 ms | 17.19 ms | 29.08 ms | 13.15 ms | 9.390 ms | 36.03 ms | 46.84 ms | 41.16 ms | 52.65 ms |
| Inference | TITAN V | half | 11.87 ms | 22.88 ms | 16.04 ms | 20.70 ms | 12.80 ms | 18.11 ms | 3.085 ms | 5.116 ms | 7.608 ms | 3.694 ms | 2.329 ms | 10.96 ms | 13.26 ms | 12.72 ms | 15.17 ms |
| Training | 1080ti | single | 77.18 ms | 164.0 ms | 99.66 ms | 127.6 ms | 112.8 ms | 158.7 ms | 22.48 ms | 36.80 ms | 68.87 ms | 20.56 ms | 13.29 ms | 101.8 ms | 114.1 ms | 119.9 ms | 133.2 ms |
| Inference | 1080ti | single | 23.53 ms | 51.53 ms | 31.82 ms | 41.73 ms | 33.02 ms | 47.02 ms | 6.426 ms | 10.97 ms | 20.17 ms | 7.174 ms | 4.370 ms | 33.73 ms | 37.25 ms | 39.95 ms | 44.12 ms |
| Training | 1080ti | double | 779.5 ms | 2522. ms | 940.4 ms | 1196. ms | 2410. ms | 3546. ms | 463.3 ms | 969.9 ms | 1216. ms | 259.9 ms | 131.5 ms | 4227. ms | 4271. ms | 5475. ms | 5522. ms |
| Inference | 1080ti | double | 47.68 ms | 275.2 ms | 1157. ms | 328.6 ms | 414.9 ms | 1080. ms | 1589. ms | 181.1 ms | 390.8 ms | 529.6 ms | 110.9 ms | 49.96 ms | 2094. ms | 2103. ms | 2775. ms |
| Training | 1080ti | half | 43.79 ms | 70.00 ms | 148.4 ms | 89.43 ms | 113.6 ms | 151.0 ms | 219.5 ms | 21.00 ms | 34.84 ms | 76.24 ms | 19.60 ms | 13.18 ms | 91.60 ms | 105.9 ms | 108.1 ms |
| Inference | 1080ti | half | 18.62 ms | 42.26 ms | 25.27 ms | 33.01 ms | 27.49 ms | 38.88 ms | 5.645 ms | 9.765 ms | 16.26 ms | 5.869 ms | 3.576 ms | 30.69 ms | 33.22 ms | 36.71 ms | 39.51 ms |

| Mode | gpu | precision | resnet18 | resnet34 | resnet50 | resnet101 | resnet152 | densenet121 | densenet169 | densenet201 | densenet161 | squeezenet1_0 | squeezenet1_1 | vgg16 | vgg16_bn | vgg19_bn | vgg19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | RTX 2080ti(1) | single | 16.36 ms | 28.44 ms | 49.63 ms | 81.40 ms | 115.1 ms | 57.69 ms | 75.18 ms | 91.69 ms | 112.7 ms | 14.49 ms | 9.108 ms | 75.86 ms | 85.42 ms | 98.43 ms | 88.05 ms |
| Inference | RTX 2080ti(1) | single | 4.894 ms | 8.624 ms | 14.65 ms | 24.57 ms | 35.15 ms | 16.70 ms | 21.94 ms | 28.89 ms | 34.64 ms | 4.704 ms | 2.765 ms | 23.70 ms | 26.25 ms | 30.82 ms | 28.03 ms |
| Training | RTX 2080ti(1) | double | 367.9 ms | 755.4 ms | 939.9 ms | 1844. ms | 2702. ms | 593.5 ms | 724.3 ms | 921.3 ms | 1916. ms | 187.8 ms | 94.99 ms | 3251. ms | 3277. ms | 4265. ms | 4238. ms |
| Inference | RTX 2080ti(1) | double | 165.0 ms | 328.5 ms | 436.4 ms | 831.0 ms | 1196. ms | 213.8 ms | 266.0 ms | 339.5 ms | 910.7 ms | 82.71 ms | 35.79 ms | 1702. ms | 1708. ms | 2280. ms | 2274. ms |
| Training | RTX 2080ti(1) | half | 13.17 ms | 22.25 ms | 35.46 ms | 57.50 ms | 81.38 ms | 51.11 ms | 66.88 ms | 80.20 ms | 88.37 ms | 17.87 ms | 35.75 ms | 53.16 ms | 63.06 ms | 72.75 ms | 61.95 ms |
| Inference | RTX 2080ti(1) | half | 3.423 ms | 5.662 ms | 9.035 ms | 14.51 ms | 20.52 ms | 13.47 ms | 17.54 ms | 22.51 ms | 27.10 ms | 4.280 ms | 2.397 ms | 16.14 ms | 18.14 ms | 19.76 ms | 17.89 ms |
| Training | RTX 2080ti(2) | single | 16.92 ms | 29.51 ms | 51.46 ms | 84.90 ms | 120.0 ms | 58.13 ms | 75.96 ms | 92.47 ms | 117.6 ms | 14.95 ms | 9.255 ms | 78.95 ms | 88.71 ms | 102.3 ms | 91.67 ms |
| Inference | RTX 2080ti(2) | single | 5.107 ms | 8.976 ms | 15.18 ms | 25.60 ms | 36.60 ms | 17.02 ms | 22.40 ms | 29.46 ms | 36.72 ms | 4.852 ms | 2.786 ms | 24.76 ms | 27.25 ms | 32.05 ms | 29.27 ms |
| Training | RTX 2080ti(2) | double | 381.9 ms | 781.5 ms | 971.6 ms | 1900. ms | 2777. ms | 610.6 ms | 744.7 ms | 948.1 ms | 1974. ms | 191.9 ms | 97.27 ms | 3317. ms | 3350. ms | 4357. ms | 4329. ms |
| Inference | RTX 2080ti(2) | double | 171.8 ms | 341.7 ms | 449.5 ms | 849.5 ms | 1231. ms | 221.1 ms | 275.2 ms | 352.5 ms | 938.9 ms | 83.66 ms | 36.48 ms | 1715. ms | 1721. ms | 2294. ms | 2289. ms |
| Training | RTX 2080ti(2) | half | 13.57 ms | 22.97 ms | 36.55 ms | 59.10 ms | 83.81 ms | 51.74 ms | 68.35 ms | 81.21 ms | 89.46 ms | 15.75 ms | 35.46 ms | 55.28 ms | 65.43 ms | 75.75 ms | 64.62 ms |
| Inference | RTX 2080ti(2) | half | 3.520 ms | 5.837 ms | 9.272 ms | 14.93 ms | 21.13 ms | 13.38 ms | 18.71 ms | 22.40 ms | 26.82 ms | 4.446 ms | 2.406 ms | 16.29 ms | 17.91 ms | 20.90 ms | 19.14 ms |
  • Results obtained with code prior to 2020/01/17
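When reading the tables, it can help to convert the per-batch times into throughput and into a speedup between precisions. The short sketch below does this for the TITAN V resnet50 training figures listed above (50.54 ms in single precision, 29.08 ms in half precision) with the batch size of 12 stated earlier. The helper names are hypothetical and are not part of the repository.

```python
BATCH_SIZE = 12  # images per pass, as stated in this README


def images_per_second(batch_time_ms, batch_size=BATCH_SIZE):
    """Convert an average per-batch time in milliseconds to images per second."""
    return batch_size / (batch_time_ms / 1000.0)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is compared with the baseline."""
    return baseline_ms / candidate_ms


# TITAN V, resnet50, training: values copied from the table above.
single_ms = 50.54
half_ms = 29.08

print(f"single: {images_per_second(single_ms):.1f} images/s")
print(f"half:   {images_per_second(half_ms):.1f} images/s")
print(f"half vs single speedup: {speedup(single_ms, half_ms):.2f}x")
```

The same two helpers apply to any pair of cells in the tables, for example to compare two GPUs at the same precision.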

contribute

If you want to contribute results from an additional environment, please add them as a subfolder in fig.
