39. Benchmark (DGX-1)
• Two fully connected quads, connected at corners
• 160 GB/s per GPU bidirectional to peers
• Load/store access to peer memory
• Full atomics to peer GPUs
• High-speed copy engines for bulk data copy
• PCIe to/from CPU
DGX-1: dual 20-core Intel® Xeon® E5-2698 v4 @ 2.2 GHz, 8x Tesla GP100
40. TensorFlow: Deep Learning Training
An open-source software library for numerical computation using data flow graphs.
VERSION: 1.0
ACCELERATED FEATURES: full framework accelerated
SCALABILITY: multi-GPU and multi-node
More information: https://www.tensorflow.org/
[Figure: TensorFlow training speedup on a server with 8x P100 16 GB (NVLink or PCIe) vs. a server with 8x K80, for AlexNet, GoogleNet, ResNet-50, ResNet-152, and VGG16; avg. speedups of 2.5x and 3x for the two P100 configurations. GPU servers: single Xeon E5-2690 v4 @ 2.6 GHz with GPU configs as shown; Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5, NCCL 1.6.1; dataset: ImageNet; batch sizes: AlexNet 128, GoogleNet 256, ResNet-50 64, ResNet-152 32, VGG-16 32.]
47. Data parallelism (synchronous)
[Figure: two GPUs each hold a full model replica (Layer 1 through Layer N, weights w). Each GPU runs the forward pass on its own inputs against labels such as "dog", "human", "cat", "monkey", computes the loss and error, and back-propagates Δy, Δx, and Δw through the layers. The gradients are exchanged, and each GPU then updates its weights independently, keeping the replicas identical.]
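To make the pattern in the figure concrete, here is a minimal NumPy sketch of one synchronous data-parallel step. The model, loss, data, and names such as `local_gradient` and `LR` are invented for illustration; only the communication pattern (per-GPU gradients, an all-reduce average, then identical local weight updates) follows the slide.

```python
# A minimal sketch of one synchronous data-parallel step, with NumPy
# arrays standing in for per-GPU model replicas. The model and loss are
# toy placeholders; the point is the all-reduce of gradients followed
# by independent but identical weight updates.
import numpy as np

N_WORKERS = 2          # GPU 1 and GPU 2 in the figure
LR = 0.01              # learning rate (placeholder value)

rng = np.random.default_rng(0)
w = rng.standard_normal(4)                 # shared initial weights
replicas = [w.copy() for _ in range(N_WORKERS)]

def local_gradient(weights, batch):
    """Toy stand-in for forward + backward on one worker's mini-batch."""
    x, y = batch
    err = x @ weights - y                  # "error" at the loss function
    return x.T @ err / len(y)              # gradient of a squared loss

# each worker gets its own shard of the global batch
batches = [(rng.standard_normal((8, 4)), rng.standard_normal(8))
           for _ in range(N_WORKERS)]

# 1) every worker computes gradients on its own data
grads = [local_gradient(wk, b) for wk, b in zip(replicas, batches)]

# 2) all-reduce: average the gradients across workers
avg_grad = sum(grads) / N_WORKERS

# 3) every worker applies the same update, so replicas stay identical
for i in range(N_WORKERS):
    replicas[i] -= LR * avg_grad

assert np.allclose(replicas[0], replicas[1])
```

Because every replica sees the same averaged gradient, the independent updates are deterministic copies of each other; that is what makes the "update weights independently" step in the figure safe.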
48. Multi-GPU training performance
NVIDIA DGX-1, Chainer 1.17.0 with a multi-process patch.
[Figure, left: speedup over 1 GPU vs. number of GPUs (1 to 8) for AlexNet, VGG-D, and ResNet; batch size per GPU: AlexNet 768, VGG-D 32, ResNet 120.]
[Figure, right: time per batch (VGG-D) relative to 1 GPU at 1, 2, 4, and 8 GPUs, broken down into Forward, Backward, Allreduce, and Update.]
DGX-1's NVLink is not well utilized: Chainer's all-reduce implementation is a naïve "gather and broadcast", as the sketch below illustrates.
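The bottleneck called out on this slide is easy to see in simulation. Below is a minimal NumPy sketch of a naïve "gather and broadcast" all-reduce of the kind the slide attributes to Chainer 1.17: every worker's full buffer is funneled through a single root, so the root's links carry traffic proportional to the number of GPUs while the other links sit idle. The function name and buffer contents are illustrative, not Chainer's actual code.

```python
# A minimal sketch of a naive "gather and broadcast" all-reduce: gather
# every buffer to one root, reduce there, broadcast the result back.
# The root moves O(n * S) bytes for n workers and buffer size S, which
# is why this pattern cannot exploit DGX-1's many NVLink paths.
import numpy as np

def naive_allreduce(bufs, root=0):
    """All-reduce (sum) via gather to `root` + broadcast from `root`."""
    gathered = [b.copy() for b in bufs]        # gather: n-1 transfers in
    total = np.sum(gathered, axis=0)           # reduce on the root GPU
    return [total.copy() for _ in bufs]        # broadcast: n-1 transfers out

bufs = [np.full(4, float(i)) for i in range(8)]   # 8 simulated workers
out = naive_allreduce(bufs)
assert all(np.array_equal(o, out[0]) for o in out)
print(out[0])   # [28. 28. 28. 28.] = sum over 0..7
```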
53. NCCL implementation
• 1 CPU and 4 GPUs (PCIe)
Ring algorithm: most collectives are amenable to a bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan].
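The following is a minimal NumPy simulation of the ring all-reduce from Patarasuk and Yuan, as a sketch of the algorithm rather than NCCL's actual implementation: each buffer is split into n chunks, a reduce-scatter pass of n-1 steps leaves each rank holding one fully reduced chunk, and an all-gather pass of n-1 steps circulates those chunks around the ring.

```python
# Ring all-reduce simulation: every rank sends only 2(n-1)/n of the
# data in total, so the limit is per-link bandwidth rather than a
# single root, regardless of whether the links are PCIe or NVLink.
import numpy as np

def ring_allreduce(bufs):
    """All-reduce (sum) over equal-length 1-D buffers, one per rank."""
    n = len(bufs)
    chunks = [np.array_split(b.astype(float), n) for b in bufs]

    # reduce-scatter: at step s, rank r sends chunk (r - s) to rank r+1,
    # which accumulates it; after n-1 steps rank r owns chunk (r+1) fully
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] += chunks[r][c]

    # all-gather: at step s, rank r forwards chunk (r + 1 - s) to rank
    # r+1, which overwrites its stale copy with the reduced value
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c].copy()

    return [np.concatenate(ch) for ch in chunks]

bufs = [np.arange(8.0) + r for r in range(4)]      # 4 simulated GPUs
out = ring_allreduce(bufs)
assert all(np.array_equal(o, sum(bufs)) for o in out)
```

Compared with the gather-and-broadcast sketch above, the per-rank traffic drops from O(n·S) at the root to 2(n-1)/n·S on every link, which is why the ring is bandwidth-optimal.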
54. NCCL implementation
• 2 CPUs and 8 GPUs (QPI and PCIe)
Ring algorithm: most collectives are amenable to a bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan].
55. NCCL performance
[Figure: bandwidth at different problem sizes (4 Maxwell GPUs) for All-Gather, All-Reduce, Reduce-Scatter, and Broadcast.]
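One way to read bandwidth plots like these is to convert a measured all-reduce time into an implied per-link "bus bandwidth", assuming the ring model from the previous slides in which each rank moves 2(n-1)/n of the buffer. The sketch below is a reading aid, not part of the slide: the function name and the 64 MB / 10 ms figures are hypothetical examples, not measurements.

```python
# A small helper for interpreting all-reduce timings. `algo_bw` is the
# naive data-size-over-time figure; multiplying by 2(n-1)/n gives the
# per-link traffic a bandwidth-optimal ring actually has to carry.
def allreduce_bus_bandwidth(bytes_per_rank, seconds, n_ranks):
    """Per-link bandwidth implied by an all-reduce of `bytes_per_rank`."""
    algo_bw = bytes_per_rank / seconds                 # data / time
    return algo_bw * 2 * (n_ranks - 1) / n_ranks       # ring correction

# e.g. 64 MB all-reduced across 4 GPUs in 10 ms (hypothetical numbers)
bw = allreduce_bus_bandwidth(64 * 2**20, 10e-3, 4)
print(f"{bw / 1e9:.1f} GB/s bus bandwidth")            # ~10.1 GB/s
```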