CN106228240A

Info

Publication number: CN106228240A
Application number: CN201610615714.2A
Authority: CN
Inventors: 王展雄; 周光朕; 冯瑞
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2016-07-30
Filing date: 2016-07-30
Publication date: 2016-12-14
Anticipated expiration: 2036-07-30
Also published as: CN106228240B

Abstract

The invention belongs to the technical fields of digital image processing and pattern recognition. Specifically, it is a method for implementing a deep convolutional neural network on an FPGA. The hardware platform is a Xilinx ZYNQ-7030 programmable SoC, which integrates an FPGA and an ARM Cortex-A9 processor. The method first loads the trained network model parameters into the FPGA, then preprocesses the input data on the ARM side and transfers the result to the FPGA. The FPGA performs the convolution and downsampling computations of the deep convolutional neural network, forms a feature vector, and transfers it to the ARM side, which completes the feature classification. The invention uses the fast parallel processing and very low power consumption of the FPGA to implement the convolution computation, the most complex part of the deep convolutional neural network model, and thereby greatly improves efficiency and reduces power consumption while preserving the accuracy of the algorithm.

Description

Implementation method of deep convolutional neural network based on FPGA

Technical field

The invention belongs to the technical fields of digital image processing and pattern recognition, and in particular relates to a method for implementing a deep convolutional neural network model on an FPGA hardware platform.

Background

With the rapid development of computer technology and the Internet, data volumes are growing explosively, and intelligent analysis and processing of massive data has become the key to exploiting its value. Artificial intelligence is an effective means of discovering valuable information in massive data, and in recent years it has achieved breakthroughs in application fields such as computer vision, speech recognition, and natural language processing. Deep learning models based on deep convolutional neural networks are a typical representative.

Convolutional neural networks (CNNs) are inspired by neuroscience research. After more than 20 years of evolution, they have achieved remarkable theoretical and practical results in fields such as pattern recognition and human-machine competition. In the well-known human-machine Go match, AlphaGo, an artificial intelligence system based on CNNs combined with Monte Carlo tree search, defeated the world Go champion Lee Sedol 4:1. A typical CNN model consists of two parts: a feature extractor and a classifier. The feature extractor generates a low-dimensional feature vector from the input data that is robust to variations in the data. This vector is fed to the classifier (usually based on a traditional artificial neural network), which produces the classification result for the input data.

In implementing a convolutional neural network model, convolution accounts for 90% of the computation of the whole model [1], so efficient computation of the convolutional layers is the key to greatly improving the computational efficiency of CNN models, and hardware acceleration of the convolution is an effective approach.

At present, the industry commonly uses GPU clusters to implement deep learning models. Large-scale parallel computing on GPUs has delivered remarkable efficiency and performance for deep neural networks. However, the high power consumption of GPUs restricts their large-scale deployment and has become a bottleneck for the practical adoption of deep convolutional neural network models. FPGAs offer high-performance parallel computing and very low power consumption, and implementing deep learning models on FPGAs is an inevitable direction of development in this field.

At present, there are three main schemes for implementing CNNs on FPGAs:

(1) Use a soft-core CPU for control, with the FPGA providing algorithm acceleration;

(2) Use the hard-core ARM Cortex-A9 CPU embedded in an SoC for control, with the FPGA providing algorithm acceleration;

(3) Use a cloud server together with FPGAs for algorithm acceleration.

Each scheme has its own advantages and disadvantages, and a different acceleration scheme can be chosen for each application scenario.

In a deep convolutional neural network, the convolutional layers account for more than 90% of the computation and form the key link between successive stages of the network, so their computational efficiency directly affects the performance of the implementation. However, implementing convolution on an FPGA is difficult, mainly in the following respects:

(1) Deep learning models are still largely at the stage of academic research, and large-scale industrial application still requires much algorithm and model optimization. The models therefore have to be optimized continually to suit different application scenarios, which requires a very deep understanding of deep learning theory and algorithms;

(2) FPGA development is based on low-level hardware description languages and suits relatively stable algorithm models; constantly changing deep learning models are therefore hard to implement on FPGAs;

(3) Implementing a deep convolutional neural network on an FPGA requires extensive FPGA engineering experience. The operating clock frequency of the FPGA and the output latency of modules such as multipliers pull in opposite directions: the higher the clock frequency, the longer the module's output latency; the lower the clock frequency, the shorter the latency. Relatively balanced parameters have to be found through manual experiments guided by engineering experience.

Summary of the invention

The purpose of the invention is to provide a high-efficiency, low-power method for implementing a deep convolutional neural network model, in order to solve the high power consumption and low efficiency of current GPU- or CPU-based deep learning models.

The invention optimizes the FPGA hardware design, effectively reduces resource consumption, and makes it possible to implement a deep convolutional neural network model on a low-end FPGA hardware platform.

In the method provided by the invention, the hardware platform is a Xilinx ZYNQ-7030 programmable SoC, which integrates an FPGA and an ARM Cortex-A9 processor. The method first loads the trained network model parameters into the FPGA, then preprocesses the input data on the ARM side and transfers the result to the FPGA. The FPGA performs the convolution and downsampling computations of the deep convolutional neural network, forms a feature vector, and transfers it to the ARM side, which completes the feature classification. The method comprises four processes: model parameter loading, input data preprocessing, convolution and downsampling computation, and classification computation:

1. The model parameter loading process is:

(1) Train the deep convolutional neural network model offline;

(2) Load the trained model parameters on the ARM side;

(3) Transfer the model parameters to the FPGA;

2. The input data preprocessing process is:

(1) Normalize the input data;

(2) Transfer the result to the FPGA;

(3) Store it in Block RAM on the FPGA side;

3. The convolution and downsampling computation process is:

(1) Initialize the convolution pipeline;

(2) Compute the convolution;

(3) Compute the pooling (downsampling);

(4) Reinitialize the convolution pipeline and repeat the convolution and downsampling for further layers;

4. The classification computation process is:

(1) Transfer the feature vector back to the ARM side;

(2) Evaluate the classification model;

(3) Output the classification result.

The details are as follows:

Step 1. Load the trained model parameters

(1) On the ARM side, load the parameters of the deep convolutional neural network model trained offline;

(2) Transfer the trained model parameters to the FPGA side;

(3) On the FPGA side, buffer the parameters through a FIFO and store them in Block RAM (block random access memory);

Step 2. Preprocessing for the deep convolutional neural network model

(1) Normalize the input data so that it meets the requirements of the model's convolution operations;

(2) Transfer the normalized data from the ARM side to the FPGA side over the APB bus;

(3) On the FPGA side, buffer the normalized data through a FIFO and store it in Block RAM;

Step 3. Convolution and downsampling computation

A deeply pipelined implementation is designed for the convolutional layers, which carry the largest computational load in the deep convolutional neural network model. Suppose the network model has H convolutional layers and pooling layers. The input of the h-th convolutional layer (h = 1, 2, …, H) is T matrices of m×m 32-bit floating-point numbers, and its output is S matrices of (m-n+1)×(m-n+1) 32-bit floating-point numbers. The convolution kernels are K matrices of n×n 32-bit floating-point numbers (n ≤ m). The sliding window over the input data is n×n, with a horizontal stride of 1 and a vertical stride of 1.

(1) Initialize the convolution pipeline

Define n+1 data buffer registers P₀, P₁, …, P_{n-1}, P_n, each holding m values. Of these, the n registers P_{(i-1)%(n+1)+0}, P_{(i-1)%(n+1)+1}, …, P_{(i-1)%(n+1)+n-1} hold the i-th n×m sub-matrix (i = 1, 2, …, m-n+1) of the t-th input data matrix (t = 1, 2, …, T), where % denotes the remainder. The register index wraps around: if (i-1)%(n+1)+x > n for some x = 0, 1, …, n-1, the index continues from 0, then 1, and so on, so the index is effectively ((i-1)+x)%(n+1). If n < m, register P_{(i-1)%(n+1)+n} (with the same wrap-around) holds row i+n of the input data matrix. It is loaded in parallel with the convolution computation, which reduces FPGA idle cycles and improves computational efficiency.

Define one kernel buffer register W, which holds the n×n weights of the k-th convolution kernel (k = 1, 2, …, K).
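The wrap-around register indexing can be sketched in a few lines of Python. This only illustrates the index arithmetic and is not the hardware description used in the invention; the function names `window_registers` and `prefetch_register` are hypothetical, and the closed form ((i-1)+x) % (n+1) is one reading of the wrap-around rule stated in the text.

```python
def window_registers(i, n):
    """Indices of the n registers that hold sub-matrix i (1-based), in row order.

    Assumption: the wrap-around rule is equivalent to ((i - 1) + x) % (n + 1).
    """
    return [((i - 1) + x) % (n + 1) for x in range(n)]


def prefetch_register(i, n):
    """Index of the spare register that receives row i + n while window i is in use."""
    return ((i - 1) + n) % (n + 1)


# Example with n = 5 (six registers P0..P5).
for i in (1, 2, 3, 7):
    print(i, window_registers(i, 5), prefetch_register(i, 5))
```

For every i, the spare register is the one register that is not part of the current window, which is what allows it to be loaded while the window is being convolved.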

(2) Computation of the h-th convolutional layer

Convolve the t-th input data matrix of the h-th convolutional layer with the k-th convolution kernel, and activate the result with the Sigmoid function.

Specifically, while each convolution is being computed, the spare data buffer register P_{(i-1)%(n+1)+n} is loaded with row i+n of the input, which becomes part of the buffered input for the convolution of the (i+1)-th sub-matrix. The registers are thus reused cyclically as the window moves down the matrix.

On the FPGA side, the Sigmoid function is built from floating-point IP cores to activate the convolution result. The Sigmoid function is f(x) = 1/(1 + e^{-x}). The specific steps are:

As described above, the input is an m×m floating-point matrix, the kernel is an n×n floating-point matrix, the sliding window is n×n, and the horizontal and vertical strides are both 1, so the convolution result is an (m-n+1)×(m-n+1) floating-point matrix. The bias b11 (a parameter of the offline-trained model) is added to each element of the matrix, the Sigmoid function is applied, and the resulting (m-n+1)×(m-n+1) floating-point matrix is stored in Block RAM.

After one convolution is complete, the kernel buffer register W is reinitialized and the next convolution is performed. Repeating this cycle yields S floating-point matrices of size (m-n+1)×(m-n+1), which are stored in Block RAM.
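As a software reference for the convolution, bias, and Sigmoid activation of one input matrix with one kernel, the following minimal Python sketch may help. It rests on stated assumptions: the kernel is applied without flipping, Python floats stand in for 32-bit floating-point values, and the name `conv_layer` is hypothetical. It says nothing about how the FPGA pipeline schedules the work.

```python
import math


def sigmoid(x):
    """Standard logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def conv_layer(image, kernel, bias):
    """Slide an n x n kernel over an m x m image with stride 1, add the bias,
    and apply the Sigmoid. Returns an (m-n+1) x (m-n+1) matrix.

    Assumption: the kernel is not flipped (cross-correlation form).
    """
    m, n = len(image), len(kernel)
    size = m - n + 1
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            acc = 0.0
            for a in range(n):
                for b in range(n):
                    acc += image[r + a][c + b] * kernel[a][b]
            row.append(sigmoid(acc + bias))
        out.append(row)
    return out


# A 4x4 input of ones with a 2x2 kernel of 0.25 and bias -1 gives sigmoid(0) = 0.5.
result = conv_layer([[1.0] * 4 for _ in range(4)], [[0.25] * 2 for _ in range(2)], -1.0)
```

Calling `conv_layer` once per kernel mirrors the cycle in which register W is reloaded before each convolution.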

(3) Computation of the h-th pooling layer

Pool the output of the h-th convolutional layer. The result is S floating-point matrices of size [(m-n+1)/2]×[(m-n+1)/2], which are stored in Block RAM. The specific steps are: slide a 2×2 window with a stride of 2 over the convolution output and pool by average downsampling, that is, sum the elements of each 2×2 block and take their mean. This yields S floating-point matrices of size [(m-n+1)/2]×[(m-n+1)/2], which serve as the input matrices of the (h+1)-th convolutional layer.
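The average downsampling step can be written compactly in Python. This is a minimal sketch; the name `average_pool_2x2` is hypothetical, and the square brackets in the size formula are read here as integer (floor) division.

```python
def average_pool_2x2(matrix):
    """Average each non-overlapping 2x2 block (window 2x2, stride 2)."""
    size = len(matrix) // 2  # assumption: [x/2] denotes floor division
    return [
        [
            (matrix[2 * r][2 * c] + matrix[2 * r][2 * c + 1]
             + matrix[2 * r + 1][2 * c] + matrix[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(size)
        ]
        for r in range(size)
    ]


pooled = average_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
print(pooled)  # [[3.5, 5.5], [11.5, 13.5]]
```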

Step 4. Classification computation

The results of the convolution and pooling computations are transferred back to the ARM side for classification. The specific steps are: the FPGA side sends the matrices of convolution and pooling results held in Block RAM through a FIFO buffer and over the APB bus to the ARM side; the ARM side completes the classification using the Softmax operation and outputs the classification result for the input data.
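The classification on the ARM side can be illustrated with a small Python sketch. The text only states that a Softmax operation is used; the linear scoring step and the names `classify`, `weights`, and `biases` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Score each class with a linear layer (assumed), then apply softmax.

    weights: one row of coefficients per class (hypothetical layout).
    Returns (index of the most probable class, list of class probabilities).
    """
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


label, probs = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy call the scores are 1 and 2, so the second class (index 1) is selected.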

The main features of the method are:

(1) A deep convolutional neural network model is implemented on a low-end FPGA;

(2) Pipelined computation accelerates the convolutions in the deep convolutional neural network model;

(3) Control is handled by the ARM processor embedded in the SoC, giving small size, low power consumption, and high efficiency, so the method can be widely applied in embedded systems.

The invention uses the fast parallel processing and very low power consumption of the FPGA to implement the convolution computation, the most complex part of the deep convolutional neural network model, and greatly improves efficiency while preserving the accuracy of the algorithm. Compared with traditional CPU- or GPU-based implementations of deep convolutional neural networks, the method increases computation speed while greatly reducing power consumption, addressing the long computation time or high power consumption of CPU or GPU implementations.

Description of the drawings

Figure 1. Flow chart of the FPGA-based deep convolutional neural network implementation.

Figure 2. MNIST database (partial).

Figure 3. Schematic diagram of matrix transposition.

Figure 4. Schematic diagram of the pipelined computation.

Figure 5. Schematic diagram of the convolution computation.

Figure 6. Structure of the deep convolutional neural network.

Figure 7. Schematic diagram of the downsampling computation.

Figure 8. Simulation results of the FPGA-based deep convolutional neural network model.

Figure 9. Measured classification result for the digit "7" (MNIST database).

Detailed description

The following explains, with reference to the drawings, a specific implementation in which the method of the invention is used to realize a handwritten character recognition algorithm with a deep convolutional neural network model on an FPGA hardware platform. The model consists of an input layer I, a first convolutional layer C1, a first downsampling layer S1, a second convolutional layer C2, a second downsampling layer S2, and a fully connected Softmax layer. The input image is 28×28; the first convolutional layer contains one 5×5 convolution kernel, and the second convolutional layer contains three 5×5 convolution kernels.
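The feature-map sizes of this network follow directly from the rules given earlier: a 5×5 kernel with stride 1 reduces a side of length m to m-5+1, and 2×2 pooling with stride 2 halves it. The short Python sketch below traces them; the function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, kernel_size, num_conv_layers, pool=2):
    """Return (layer name, side length) pairs for alternating conv and pooling layers."""
    sizes = [("I", input_size)]
    size = input_size
    for h in range(1, num_conv_layers + 1):
        size = size - kernel_size + 1   # valid convolution, stride 1
        sizes.append(("C%d" % h, size))
        size //= pool                   # 2x2 average pooling, stride 2
        sizes.append(("S%d" % h, size))
    return sizes


print(feature_map_sizes(28, 5, 2))
# [('I', 28), ('C1', 24), ('S1', 12), ('C2', 8), ('S2', 4)]
```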

The specific steps by which the handwritten character recognition algorithm based on the deep convolutional neural network model is implemented on the FPGA are shown in Figure 1.

1. Load the trained model parameters

First, the CNN functions in DeepLearnToolbox-master are taken as a reference and modified: the convolution function is rewritten, and the network is set to 5 layers (one input layer, two convolutional layers, and two downsampling layers). The first convolutional layer has one 5×5 kernel and the second has three 5×5 kernels; both downsampling layers use a 2×2 sliding window with a stride of 2; the number of training passes is set to 10. The deep convolutional neural network is trained in Matlab, the trained weights and biases are then loaded on the ARM side, and finally the trained model parameters are transferred to the FPGA side, buffered through a FIFO, and stored in Block RAM.

2. Preprocessing

The MNIST handwritten digit image shown in Figure 2 is read into memory, each pixel is divided by 255 for normalization, and the image is then transposed as shown in Figure 3.
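The preprocessing amounts to two operations, sketched below in Python under the assumption that the image arrives as a list of rows of integer pixel values in the range 0 to 255; the name `preprocess` is hypothetical.

```python
def preprocess(pixels):
    """Normalize 8-bit pixel values to [0, 1] and transpose the image."""
    normalized = [[p / 255.0 for p in row] for row in pixels]
    # zip(*rows) regroups the values column by column, i.e. transposes the matrix.
    return [list(column) for column in zip(*normalized)]


sample = preprocess([[0, 255], [51, 102]])
# Rows become columns: the second input row (51, 102) ends up in the second column.
```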

3. Transfer the preprocessing result to the FPGA

The preprocessing result is transferred to the FPGA side over the APB bus of the ZYNQ-7030 SoC, buffered through a FIFO, and stored in Block RAM.

4. Initialize the convolution pipeline

As shown in Figure 4, define six data buffer registers P₀, P₁, P₂, P₃, P₄, P₅, each able to hold 28 floating-point values. Five of them, P_{(i-1)%(5+1)+0}, P_{(i-1)%(5+1)+1}, …, P_{(i-1)%(5+1)+5-1}, hold the i-th 5×28 sub-matrix (i = 1, 2, …, 24) of the input image matrix, where % denotes the remainder. The register index wraps around: if (i-1)%(5+1)+x > 5 for some x = 0, 1, …, 4, the index continues from 0, then 1, and so on. Register P_{(i-1)%(5+1)+5} (with the same wrap-around) holds row i+5 of the input image matrix.

Define one kernel buffer register W, which holds the 5×5 weights of the single convolution kernel of the first convolutional layer.

5. Compute the first convolutional layer

Convolve the input image matrix with the single kernel of the first convolutional layer, and activate the result with the Sigmoid function.

While the convolution is being computed, the spare data buffer register P_{(i-1)%(5+1)+5} is loaded with row i+5 of the input, which becomes part of the buffered input for the convolution of the (i+1)-th sub-matrix. The registers are thus reused cyclically, as shown in Figure 5.
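The behaviour of the six row registers over a full 28×28 image can be simulated in Python to check that each window sees exactly rows i to i+4. This is a sequential sketch of what the hardware does in parallel; the name `stream_windows` is hypothetical, and the wrap-around index is assumed to be ((i-1)+x) % (n+1).

```python
def stream_windows(image, n):
    """Yield (i, window) for each band of n consecutive rows, using n + 1 row registers."""
    m = len(image)
    regs = [None] * (n + 1)
    # Initial fill: rows 1..n go to registers P0..P(n-1).
    for r in range(n):
        regs[r] = image[r]
    for i in range(1, m - n + 2):  # i = 1 .. m-n+1
        # Prefetch row i+n (1-based) into the spare register, if that row exists.
        # In hardware this load overlaps with the convolution of window i.
        if i + n <= m:
            regs[((i - 1) + n) % (n + 1)] = image[i + n - 1]
        window = [regs[((i - 1) + x) % (n + 1)] for x in range(n)]
        yield i, window


# Label every pixel with its position so the rows are distinguishable.
image = [[r * 100 + c for c in range(28)] for r in range(28)]
bands = list(stream_windows(image, 5))
print(len(bands))  # 24
```

Each prefetch overwrites only the register whose row has just left the window, so no row is read from Block RAM more than once per pass.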

On the FPGA side, the Sigmoid function is built from floating-point IP cores to activate the convolution result. The Sigmoid function is f(x) = 1/(1 + e^{-x}).

The specific steps are:

As described above, the input image is a 28×28 floating-point matrix, the kernel is a 5×5 floating-point matrix, the sliding window is 5×5, and the horizontal and vertical strides are both 1, so the convolution result is a 24×24 floating-point matrix. The bias b11 (a parameter of the offline-trained model) is added to each element of the matrix, the Sigmoid function is applied, and the resulting 24×24 floating-point matrix is stored in Block RAM.

After this single convolution, the result is one 24×24 floating-point matrix, stored in Block RAM.

6. Compute the first pooling layer

Pool the output of the first convolutional layer, as shown in Figure 6. The result is one 12×12 floating-point matrix, stored in Block RAM. The specific steps are: slide a 2×2 window with a stride of 2 over the convolution output and pool by average downsampling, that is, sum the elements of each 2×2 block and take their mean. This yields one 12×12 floating-point matrix, which serves as the input matrix of the second convolutional layer, as shown in Figure 7.

7. Reinitialize the convolution pipeline

As shown in Figure 4, reinitialize the six data buffer registers P₀, P₁, P₂, P₃, P₄, P₅, each now holding 12 floating-point values. Five of them, P_{(i-1)%(5+1)+0}, P_{(i-1)%(5+1)+1}, …, P_{(i-1)%(5+1)+5-1}, hold the i-th 5×12 sub-matrix (i = 1, 2, …, 8) of the input matrix, where % denotes the remainder. The register index wraps around: if (i-1)%(5+1)+x > 5 for some x = 0, 1, …, 4, the index continues from 0, then 1, and so on. Register P_{(i-1)%(5+1)+5} (with the same wrap-around) holds row i+5 of the input matrix.

Reinitialize the kernel buffer register W with the 5×5 weights of the first kernel of the second convolutional layer.

8. Compute the second convolutional layer

Convolve the input data matrix of the second convolutional layer with the first kernel of that layer, and activate the result with the Sigmoid function.

Reinitialize the kernel buffer register W with the 5×5 weights of the second kernel of the second convolutional layer, convolve the input data matrix of the second convolutional layer with that kernel, and activate the result with the Sigmoid function.

Reinitialize the kernel buffer register W with the 5×5 weights of the third kernel of the second convolutional layer, convolve the input data matrix of the second convolutional layer with that kernel, and activate the result with the Sigmoid function.

While each convolution is being computed, the spare data buffer register P_{(i-1)%(5+1)+5} is loaded with row i+5 of the input, which becomes part of the buffered input for the convolution of the (i+1)-th sub-matrix. The registers are thus reused cyclically, as shown in Figure 5.

The specific steps are: as described above, the input is a 12×12 floating-point matrix, the kernels are three 5×5 floating-point matrices, the sliding window is 5×5, and the horizontal and vertical strides are both 1, so the convolution result is three 8×8 floating-point matrices. The biases b21, b22, and b23 (parameters of the offline-trained model) are added to each element of the respective matrices, the Sigmoid function is applied, and the resulting three 8×8 floating-point matrices are stored in Block RAM.

After the convolutions of the second layer are complete, the result is three 8×8 floating-point matrices, stored in Block RAM.
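With both convolutional layers specified, the earlier statement that convolution dominates the computation can be checked for this small network by counting multiply-accumulate operations. The Python sketch below is a rough estimate under assumptions: the classifier is taken to be a single fully connected layer from the 3×4×4 = 48 pooled values to the 10 MNIST digit classes, and pooling, bias, and activation costs are ignored. The name `conv_macs` is hypothetical.

```python
def conv_macs(in_size, kernel_size, num_kernels):
    """Multiply-accumulates for stride-1 valid convolution of one input map."""
    out = in_size - kernel_size + 1
    return num_kernels * out * out * kernel_size * kernel_size


c1 = conv_macs(28, 5, 1)   # 1 kernel, 24x24 outputs, 25 MACs each = 14400
c2 = conv_macs(12, 5, 3)   # 3 kernels, 8x8 outputs, 25 MACs each = 4800
fc = 3 * 4 * 4 * 10        # assumed: 48 features times 10 classes = 480
share = (c1 + c2) / (c1 + c2 + fc)
print(round(share, 4))     # 0.9756
```

Under these assumptions the convolutions account for roughly 97.6% of the multiply-accumulates, consistent with the figure of more than 90% quoted in the background section.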

9. Compute the second pooling layer

Pool the output of the second convolutional layer, as shown in Figure 6. The result is three 4×4 floating-point matrices, stored in Block RAM. The specific steps are: slide a 2×2 window with a stride of 2 over the convolution output and pool by average downsampling, that is, sum the elements of each 2×2 block and take their mean. This yields three 4×4 floating-point matrices, which serve as the input of the Softmax layer, as shown in Figure 7.
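Before classification, the three 4×4 matrices have to be arranged into a single feature vector of 48 values. The text does not specify the ordering, so the Python sketch below assumes map-by-map, row-by-row order; the name `flatten_feature_maps` is hypothetical.

```python
def flatten_feature_maps(maps):
    """Concatenate feature maps into one vector (assumed order: map, then row, then column)."""
    return [value for fmap in maps for row in fmap for value in row]


# Three 4x4 maps filled with 0.0, 1.0 and 2.0 respectively.
maps = [[[float(k)] * 4 for _ in range(4)] for k in range(3)]
features = flatten_feature_maps(maps)
print(len(features))  # 48
```

Whatever ordering is chosen must match the one used when the classifier weights were trained.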

10. Classification computation

The results of the convolution and pooling computations are transferred back to the ARM side for classification. The specific steps are: the FPGA side sends the matrices of convolution and pooling results held in Block RAM through a FIFO buffer and over the APB bus to the ARM side; the ARM side completes the classification using the Softmax operation and outputs the classification result for the input image.

Figure 8 shows the simulation result of processing the image of the digit "7" from the MNIST database with the above method.

Figure 9 shows the measured classification result of processing the image of the digit "7" from the MNIST database with the above method.

References

[1] Cong J, Xiao B. Minimizing Computation in Convolutional Neural Networks[M]// Artificial Neural Networks and Machine Learning – ICANN 2014. Springer International Publishing, 2014:33-7.

[2] Farabet C, Poulet C, Han J Y, et al. CNP: An FPGA-based processor for Convolutional Networks[J]. International Conference on Field Programmable Logic & Applications, 2009:32-37.

[3] Gokhale V, Jin J, Dundar A, et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks[C]// IEEE Embedded Vision Workshop. 2014:696-701.

[4] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks[C]// ACM/SIGDA International Symposium. 2015:161-170.

[5] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[J]. Advances in Neural Information Processing Systems, 2012, 25(2):2012.

[6] Farabet C, Martini B, Corda B, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision[J]. 2011, 9(6):109-116.

[7] Matai J, Irturk A, Kastner R. Design and Implementation of an FPGA-Based Real-Time Face Recognition System[C]// IEEE, International Symposium on Field-Programmable Custom Computing Machines. 2011:97-100.

[8] Sankaradas M, Jakkula V, Cadambi S, et al. A Massively Parallel Coprocessor for Convolutional Neural Networks[C]// IEEE International Conference on Application-Specific Systems, Architectures and Processors. IEEE Computer Society, 2009:53-60.