
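GUINNESS: A Deep Learning Development Environment for FPGAs
Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Masayuki Shimoda, Shimpei Sato
Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology
Reconfigurable Systems workshop (Reconf研), September 2017 @ Dwango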

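Outline
• Research background
• Convolutional Neural Network (CNN)
• Optimization techniques for binarized CNNs
• GUINNESS, a deep learning development environment dedicated to FPGAs
• Experimental results
• Summary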

Specifications expected for automotive use
Cloud                           | Edge
Many classes (1000s)            | Few classes (<10)
Large workloads                 | Frame rates (15-30 FPS)
High efficiency (Performance/W) | Low cost & low power (1W-5W)
Server form factor              | Custom form factor
J. Freeman (Intel), "FPGA Acceleration in the era of high level design", 2017

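Thermal throttling
• Suppresses excessive heat generation under high load
• Improves safety and protects equipment from heat damage
• TM2: power supply and clock frequency are lowered; TM1: intermittent execution
• Built into CPUs, GPUs, SSDs, and similar devices
• Performance drops, and the device may even shut down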

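Deep learning on embedded (edge) devices
• Problems with the cloud
  • Network latency
  • Privacy
  • Security
• Training is assumed to be done online; only inference runs on the device
• Points to consider
  • Compute capability
  • Battery
  • Cooling fan
  • Battery life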

Artificial Neuron (AN)
$y = f(u)$, $u = \sum_{i=0}^{N} w_i x_i$, with $x_0 = 1$ so that $w_0$ acts as the bias
• $x_i$: Input signal
• $w_i$: Weight
• $u$: Internal state
• $f(u)$: Activation function (Sigmoid, ReLU, etc.)
• $y$: Output signal
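The neuron definition can be sketched in a few lines of Python. This is only an illustration: the function names `neuron`, `relu`, and `sigmoid` and the numeric inputs are made up for the example and do not come from the slides.

```python
import math

def neuron(x, w, f):
    """Artificial neuron: y = f(u) with u = sum_i w_i * x_i and x_0 = 1."""
    assert len(x) == len(w)
    assert x[0] == 1  # x_0 is fixed to 1 so that w_0 acts as the bias
    u = sum(wi * xi for wi, xi in zip(w, x))  # internal state
    return f(u)

def relu(u):
    return max(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Illustrative values only (hypothetical, not from the slides).
x = [1, 0.5, -2.0]   # x_0 = 1 followed by two inputs
w = [0.1, 0.4, 0.3]  # w_0 is the bias
y_relu = neuron(x, w, relu)
y_sigmoid = neuron(x, w, sigmoid)
```

Passing the activation as a function keeps the weighted sum separate from the choice of `f`, which is the same separation the slide draws between `u` and `f(u)`.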

Deep Neural Network
[Figure: a deep neural network with output classes happy, sad, mad, curious]
Source: imotionsglobal.com

Convolution operation
[Figure, shown step by step over four slides: each of two 4x4 input maps is multiplied element-wise by its own kernel (K=3 in this example), all products are summed, and the window slides to produce a 2x2 output]
Input 1     Input 2
1 0 1 1     1 1 1 0
1 1 1 0     0 1 1 0
0 1 0 0     0 0 0 0
1 1 0 1     1 0 1 1
Kernel 1    Kernel 2
x1 x0 x1    x0 x0 x1
x0 x1 x0    x1 x0 x1
x0 x0 x1    x1 x1 x1
Output
5 3
6 4

Convolution as performed in a CNN
• Extends the AN to two dimensions
[Figure: the convolution example with two 4x4 inputs, K=3 kernels, and the 2x2 output 5 3 / 6 4]
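The sliding-window sum in the figure can be reproduced with a short sketch. The input and kernel values below are read off the slide's flattened figure as two 4x4 input maps and two 3x3 kernels. That reading is an assumption, and the function name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(channels, kernels):
    """Sum over channels of a sliding-window product (no padding, stride 1)."""
    k = len(kernels[0])
    size = len(channels[0])
    out_size = size - k + 1
    out = [[0] * out_size for _ in range(out_size)]
    for r in range(out_size):
        for c in range(out_size):
            acc = 0
            for ch, ker in zip(channels, kernels):
                for i in range(k):
                    for j in range(k):
                        acc += ch[r + i][c + j] * ker[i][j]
            out[r][c] = acc
    return out

# Values as read from the slide figure (assumed layout).
inputs = [
    [[1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]],
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [1, 0, 1, 1]],
]
kernels = [
    [[1, 0, 1], [0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [1, 0, 1], [1, 1, 1]],
]
result = conv2d_valid(inputs, kernels)  # [[5, 3], [6, 4]]
```

Each output element is the weighted sum of one artificial neuron whose inputs are the 3x3 window, which is the sense in which the convolution extends the AN to two dimensions.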

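Binarized neural network
• Multiplication of binary (-1/+1) values
• The multiplier is replaced by an XNOR gate
Multiplication (-1/+1)     XNOR (0/1)
x1  x2 | Y                 x1 x2 | Y
-1  -1 | +1                0  0  | 1
-1  +1 | -1                0  1  | 0
+1  -1 | -1                1  0  | 0
+1  +1 | +1                1  1  | 1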
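The equivalence of the two truth tables can be checked exhaustively. The sketch below assumes the encoding -1 -> 0 and +1 -> 1 implied by the tables; the helper names `encode`, `decode`, and `xnor` are made up for the example.

```python
def encode(v):
    """Map -1 -> 0 and +1 -> 1."""
    return (v + 1) // 2

def decode(b):
    """Map 0 -> -1 and 1 -> +1."""
    return 2 * b - 1

def xnor(a, b):
    """1 when the two bits agree, 0 otherwise."""
    return 1 - (a ^ b)

# Each row: (x1, x2, product, XNOR result decoded back to -1/+1)
table = [(x1, x2, x1 * x2, decode(xnor(encode(x1), encode(x2))))
         for x1 in (-1, 1) for x2 in (-1, 1)]
```

For all four input pairs the product and the decoded XNOR agree, so a one-bit XNOR gate can stand in for the multiplier.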

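Effect of the binarized CNN
• Replacing short-precision (4 to 8 bit) values with binary values → compressed memory bandwidth
• Replacing multipliers with XNOR gates → reduced circuit area
[Figure: binarized neuron with inputs x1 ... xn, weights w1 ... wn, bias w0, sum Y, activation fsgn(Y), output z]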
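A binarized neuron of this kind can be sketched with bit-packed words, XNOR, and a population count. The packing order, the helper names `pack` and `binarized_neuron`, and the convention that a sum of exactly 0 maps to +1 are assumptions of this example, not details given on the slide.

```python
def pack(values):
    """Pack a list of -1/+1 values into an int, one bit per value (+1 -> 1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binarized_neuron(x_bits, w_bits, n, bias=0):
    """z = fsgn(Y), where Y = sum_i w_i * x_i + bias over n packed -1/+1 values."""
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask   # bitwise XNOR replaces n multipliers
    ones = bin(agree).count("1")        # population count replaces the adder inputs
    y = 2 * ones - n + bias             # convert the count back to a -1/+1 sum
    return 1 if y >= 0 else -1          # assumed convention: fsgn(0) = +1

x = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, -1]
z = binarized_neuron(pack(x), pack(w), len(x))
reference = sum(a * b for a, b in zip(x, w))  # plain -1/+1 dot product
```

Storing one bit per value instead of 4 to 8 bits is where the memory-bandwidth saving comes from; the XNOR plus popcount is where the multipliers disappear.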

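Less memory → better power efficiency
• Power grows with the distance between memory and the arithmetic units (distance ∝ power)
→ Power efficiency improves if the data can be stored in the FPGA's on-chip memory
J. Emer et al., "Tutorial on Hardware Architectures for Deep Neural Networks," MICRO-49, 2016.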

Countering the loss of recognition accuracy
• Introduce batch normalization (BatchNormalization)
[Figure: classification error (%) vs. # of epochs (1 to 200) for (a) a float32-precision CNN and (b) a binarized CNN; curves: plain binarization and the proposed method; annotation: about 6% error (using VGG-16)]
H. Nakahara et al., "A memory-based binarized convolutional deep neural network," FPT2016, pp285-288, 2016.

Normalization for Binarized DNN
• BatchNorm: normalizing the result of MAC operations (mean, variance, scaling, shift)
• Batch normalization is necessary for the binarized CNN to improve its accuracy
[Figure: error rate (%) vs. epoch (1 to 200), without BN and with BN]
H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, "A Memory-Based Realization of a Binarized Deep Convolutional Neural Network," The International Conference on Field-Programmable Technology (FPT 2016), pp.273-76, 2016.

Circuit with batch normalization
• Batch normalization is implemented with fixed-point adders and multipliers
[Figure labels: XNOR gate, adder tree, batch normalization, sign bit]

Realizing batch normalization with a bias
• The output of batch normalization is the input to the sign function
→ A constant factor can be ignored
• The input to batch normalization is an integer value
→ To integer

Circuit equivalent to batch normalization
[Figure: BatchNorm replaced by an equivalent circuit]
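The reduction of batch normalization to an integer bias can be sketched as follows. The sketch assumes the standard batch-normalization form `gamma * (u - mu) / sigma + beta` with `sigma = sqrt(variance + epsilon)`, a positive scaling factor `gamma`, and the convention `sign(0) = +1`. None of these details is spelled out on the slides, and the function names are hypothetical.

```python
import math
from fractions import Fraction

def sign(v):
    return 1 if v >= 0 else -1  # assumed convention: sign(0) = +1

def bn_then_sign(u, mu, sigma, gamma, beta):
    """Reference: batch normalization followed by the sign function."""
    return sign(gamma * (u - mu) / sigma + beta)

def integer_bias(mu, sigma, gamma, beta):
    """Integer bias b with sign(u + b) == bn_then_sign(u, ...) for integer u.

    Dividing by gamma / sigma > 0 does not change the sign (constant factor
    ignored), and because u is an integer the remaining offset can be floored.
    """
    return math.floor(beta * sigma / gamma - mu)

def bias_then_sign(u, bias):
    return sign(u + bias)

# Illustrative parameters (hypothetical); Fractions keep the check exact.
mu, sigma = Fraction(7, 3), Fraction(5, 2)
gamma, beta = Fraction(3, 4), Fraction(1, 5)
b = integer_bias(mu, sigma, gamma, beta)
mismatches = [u for u in range(-20, 21)
              if bn_then_sign(u, mu, sigma, gamma, beta) != bias_then_sign(u, b)]
```

With this substitution the fixed-point multipliers of the batch-normalization stage are no longer needed at inference time; only one integer addition per neuron remains. A negative `gamma` would flip the comparison and is not handled in this sketch.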

Dedicated circuit for the binarized CNN
• Customized arithmetic: 1-bit multiply-accumulate operations
• Dedicated pipeline
[Figure: a binarized feature map x00 ... x44 (L=5, K=3) is streamed into a shift register (2L+K bits) holding x00 ... x22; 9 binarized MACs (EXNORs + adder tree) read the register together with the binarized weight memory; an integer bias from the integer bias memory is added and the sign bit is written back; write control logic and a counter sequence the operation]
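The shift-register datapath can be modelled in software as a rough sketch. The tap positions `i * L + j`, the raster streaming order, and the rule that a sum of exactly 0 yields sign bit 1 are assumptions made for this model; the function name and the test data are hypothetical.

```python
from collections import deque

def stream_binarized_conv(fmap, weights, bias=0):
    """Stream an L x L map of 0/1 bits through a (2L+K)-entry shift register."""
    L, K = len(fmap), len(weights)
    reg = deque(maxlen=2 * L + K)  # index 0 holds the oldest pixel
    out = []
    for r in range(L):
        for c in range(L):
            reg.append(fmap[r][c])  # one pixel enters per cycle
            # A full K x K window is available once the register is full
            # and the newest pixel is not in the first K-1 columns.
            if len(reg) == reg.maxlen and c >= K - 1:
                ones = 0
                for i in range(K):
                    for j in range(K):
                        x = reg[i * L + j]               # fixed register tap
                        ones += 1 - (x ^ weights[i][j])  # XNOR
                y = 2 * ones - K * K + bias              # -1/+1 sum plus integer bias
                out.append(1 if y >= 0 else 0)           # sign bit
    return out

# Illustrative 5x5 binarized feature map and 3x3 binarized kernel.
fmap = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
]
weights = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
streamed = stream_binarized_conv(fmap, weights)
```

For L=5 and K=3 the register holds 13 bits, exactly the span x00 ... x22 shown in the figure, so every pixel is read from the feature map only once.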

Bottleneck
• Convolutional layer → # of MAC operations
• Fully connected layer → weight memory
J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," ISFPGA2016.

Replacing the Internal FC Layers with a Binarized Average Pooling Layer
[Figure: two pipelines from an input image to the label "car". Baseline: repeated CONV + max pooling, feature maps, max pooling, fully connected layers. Proposed: repeated CONV + max pooling, feature maps, binarized average pooling, flatten, fully connected layer.]

Distribution of trained weights
[Figure: histogram of # of weights (0 to 70000) vs. weight value (-1 to 1); after binarization "-1" (50%) and "+1" (50%)]
• Binarized weights are balanced
• Binarized internal values are balanced
[Figure: input image, repeated CONV + max pooling, feature maps, max pooling, flatten, fully connected layers, "car"]

Distribution of trained weights
[Figure: histogram of # of weights (0 to 70000) vs. weight value (-1 to 1); after binarization "-1" (50%) and "+1" (50%)]
• Binarized weights are balanced
• Binarized internal values are balanced
• The outputs are also balanced
→ 1's count operation for binarized internal values
x w | Y
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 1
• Training: binarized average pooling
• Inference (FPGA): 1's counter
[Figure: example of 1's counting over binarized values, and a binarized neuron with inputs x1 ... xn, weights w1 ... wn, bias w0, sum Y, activation fsgn(Y), output z]
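The correspondence between binarized average pooling during training and a 1's counter at inference can be sketched as below. The tie rule (a mean of exactly 0 maps to +1) and both function names are assumptions of this example.

```python
def binarized_avg_pool_training(fmap):
    """Training view: average the -1/+1 values of one feature map, then binarize."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    return 1 if mean >= 0 else -1  # assumed convention: a mean of 0 gives +1

def ones_counter_inference(bits):
    """Inference view: count the 1 bits and compare with half the map size."""
    flat = [b for row in bits for b in row]
    ones = sum(flat)
    return 1 if 2 * ones >= len(flat) else 0

# Illustrative 4x4 feature map in -1/+1 form and its 0/1 encoding.
fmap = [[1, -1, 1, 1], [-1, 1, 1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]]
bits = [[(v + 1) // 2 for v in row] for row in fmap]
train_out = binarized_avg_pool_training(fmap)
infer_out = ones_counter_inference(bits)
```

Because the mean of n values in -1/+1 form is `(2 * ones - n) / n`, its sign depends only on the number of ones, so the FPGA needs a counter and a comparison rather than an averaging circuit.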

Comparison of model sizes
           | Baseline                                    | Proposed
Layer      | Dim.  | In F maps | Out F maps | Weight [bits] | Dim.  | In F maps | Out F maps | Weight [bits]
Iconv      | 32x32 | 3         | 64         | 1.7K          | 32x32 | 3         | 64         | 1.7K
Bconv      | 32x32 | 64        | 64         | 36.8K         | 32x32 | 64        | 64         | 36.8K
Max Pool   | 16x16 | 64        | 64         |               | 16x16 | 64        | 64         |
Bconv      | 16x16 | 64        | 128        | 73.7K         | 16x16 | 64        | 128        | 73.7K
Bconv      | 16x16 | 128       | 128        | 147.4K        | 16x16 | 128       | 128        | 147.4K
Max Pool   | 8x8   | 128       | 128        |               | 8x8   | 128       | 128        |
Bconv      | 8x8   | 128       | 256        | 294.9K        | 8x8   | 128       | 256        | 294.9K
Bconv      | 8x8   | 256       | 256        | 589.8K        | 8x8   | 256       | 256        | 589.8K
Max Pool   | 4x4   | 256       | 256        |               | 4x4   | 256       | 256        |
BFC        | 1x1   | 4096      | 4096       | 16.7M         | (Binarized Average Pooling)
BFC        | 1x1   | 4096      | 4096       | 16.7M         |
BFC        | 1x1   | 4096      | 10         | 40.9K         | 1x1   | 256       | 10         | 2.5K
(fc total) |       |           |            | (33.6M)       |       |           |            | (2.5K)
Total      |       |           |            | 34.7M         |       |           |            | 1.5M
Error Rate | 18.6%                                       | 18.2%
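The per-layer weight sizes in the table can be recomputed with a small sketch. It assumes 3x3 kernels for every convolutional layer and one bit per weight, including the Iconv layer; neither assumption is stated on the slide, but both reproduce the listed per-layer figures when truncated to one decimal. The function names are hypothetical.

```python
def conv_weight_bits(in_maps, out_maps, k=3, bits_per_weight=1):
    """Weight storage of a K x K convolutional layer in bits."""
    return in_maps * out_maps * k * k * bits_per_weight

def fc_weight_bits(in_units, out_units, bits_per_weight=1):
    """Weight storage of a fully connected layer in bits."""
    return in_units * out_units * bits_per_weight

# (input feature maps, output feature maps) for Iconv and the five Bconv layers
conv_layers = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]
conv_bits = [conv_weight_bits(i, o) for i, o in conv_layers]

baseline_fc_bits = [fc_weight_bits(4096, 4096),
                    fc_weight_bits(4096, 4096),
                    fc_weight_bits(4096, 10)]
proposed_fc_bits = [fc_weight_bits(256, 10)]
```

The 4x4x256 feature maps flatten to 4096 inputs in the baseline, whereas pooling each map to a single bit leaves only 256 inputs, which is why the final layer shrinks from about 40.9K to about 2.5K bits and the two 4096x4096 layers disappear.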

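What is GUINNESS?
• Short for "A GUI based neural network synthesizer"
• Trains on images prepared by the user and generates a bitstream of the inference circuit for an FPGA
• Training and circuit synthesis require only GUI operations, so both hardware engineers and algorithm engineers can easily deploy deep learning on an FPGA
• No code needs to be written at all
Tokyo Tech. Nakahara Lab.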

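GUINNESS (current version)
• CNN parameters (depth, width) and layer types can be specified (recommended parameters can be loaded)
• Users can train on their own data
• Training can be saved and resumed
• Training parameters are preset
• Specifying the target FPGA board is enough to generate the bitstream automatically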

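GUINNESS Tool Flow
[Figure: tool flow operated by the GUI. Inputs: image data, label data, and the CNN spec (file labels .pkl, .txt, .py), generated from images. Training by Chainer (trained by GPU) produces the binarized CNN weights (.model). Chainer-to-C++ and model-to-text steps produce PL code (.cpp), PS code (.cpp), and binarized weights (.txt). Within SDSoC, gcc builds the executable (.elf) for the PS and HLS builds the bitstream (.bit) for the PL of the Zynq FPGA, which also contains BRAM.]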

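GUINNESS recommended environment
• Recommended computing environment
  • GPU (GTX1070 or better) + CUDA 8.0
  • Multi-GPU and cloud environments can also be supported (on request)
  • Pre-trained CNNs can be loaded (on request)
  • 16 GB of memory or more
  • Only Ubuntu 14.04 LTS is supported
  • Xilinx SDSoC 2016.3, 2016.4
  • Chainer 1.17.0 to 1.21.0
• Supported FPGA boards (more to be added; custom designs on request)
  • Digilent Zybo, Zedboard
  • Xilinx ZC702, ZCU102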

Comparison with existing FPGA implementations
Implementation (Year) | Zhao et al. [1] (2017) | FINN [2] (2017)      | Ours
FPGA Board (FPGA)     | Zedboard (XC7Z020)     | PYNQ board (XC7Z020) | Zedboard (XC7Z020)
Clock (MHz)           | 143                    | 166                  | 143
#LUTs                 | 46900                  | 42833                | 14509
#18Kb BRAMs           | 94                     | 270                  | 32
#DSP Blocks           | 3                      | 32                   | 1
Test Error            | 12.27%                 | 19.90%               | 18.20%
Time [msec] (FPS)     | 5.94 (168)             | 2.24 (445)           | 2.37 (420)
Power [W]             | 4.7                    | 2.5                  | 2.3
FPS/Watt              | 35.7                   | 178.0                | 182.6
FPS/LUT               | 35.8x10^-4             | 103.9x10^-4          | 289.4x10^-4
FPS/BRAM              | 1.8                    | 1.6                  | 13.1
[1] R. Zhao et al., "Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs," ISFPGA, 2017.
[2] Y. Umuroglu, et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference," ISFPGA, 2017.
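The derived efficiency rows follow from the raw rows of the table, as the sketch below shows. The LUT and BRAM counts are read from the slide's fused digit run as 46900/94, 42833/270, and 14509/32; that split is an assumption, though it reproduces the listed ratios. The dictionary layout and function name are made up for the example.

```python
# Raw figures as read from the comparison table (split of fused digits assumed).
designs = {
    "Zhao et al.": {"fps": 168, "power_w": 4.7, "luts": 46900, "brams": 94},
    "FINN":        {"fps": 445, "power_w": 2.5, "luts": 42833, "brams": 270},
    "Ours":        {"fps": 420, "power_w": 2.3, "luts": 14509, "brams": 32},
}

def derived_metrics(d):
    """Throughput normalized by power and by FPGA resources."""
    return {
        "fps_per_watt": d["fps"] / d["power_w"],
        "fps_per_lut_e4": d["fps"] / d["luts"] * 1e4,  # in units of 10^-4 FPS per LUT
        "fps_per_bram": d["fps"] / d["brams"],
    }

metrics = {name: derived_metrics(d) for name, d in designs.items()}

for name, m in metrics.items():
    print(f"{name}: {m['fps_per_watt']:.1f} FPS/W, "
          f"{m['fps_per_lut_e4']:.1f}e-4 FPS/LUT, {m['fps_per_bram']:.1f} FPS/BRAM")
```

Normalizing by LUTs and BRAMs shows where the three designs differ most: throughput is similar between FINN and this work, but the resource usage is not.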

Comparison with Embedded Platforms (VGG11 Forwarding)
Platform            | CPU               | GPU          | FPGA
Device              | ARM Cortex-A57    | Maxwell GPU  | Zynq7020
Clock Freq.         | 1.9 GHz           | 998 MHz      | 143.78 MHz
Memory              | 16 GB eMMC Flash  | 4 GB LPDDR4  | 4.9 Mb BRAM
Time [msec] (FPS)   | 4210.0 (0.23)     | 27.23 (36.7) | 2.37 (421.9)
Power [W]           | 7                 | 17           | 2.3
Efficiency [FPS/W]  | 0.032             | 2.2          | 182.6
Design Time [Hours] | 72                | 72           | 75

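Summary
• Developed an integrated development environment for deep learning
• An environment specialized for binarized CNNs
  • Optimization techniques dedicated to inference
  • Training method
• Compared with embedded CPUs and GPUs
  • Faster and more power efficient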

https://github.com/HirokiNakahara/GUINNESS

A Docker image is available
• GUINNESS can be run on Azure and similar cloud services
