CN109711533B

Movatterモバイル変換

Info

Publication number: CN109711533B
Application number: CN201811561899.9A
Authority: CN
Inventors: 石光明; 汪振宇; 汪芳羽; 谢雪梅
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2023-04-28
Anticipated expiration: 2038-12-20
Also published as: CN109711533A

Abstract

The invention discloses a convolutional neural network acceleration system based on an FPGA, which mainly solves the problems of the prior art that the internal structure is solidified, the type of the FPGA is limited and the processing speed is low. The parameter storage submodule stores configuration parameters and network weights, and the total control submodule reads the stored parameters and weights and writes the parameters and weights into the configuration register group and the network calculation submodule group to finish setting and initializing the internal connection structure and the cache size; the buffer submodule stores the original input data or the intermediate processing result and transmits the data or the intermediate processing result to the calculation submodule group, and the calculation of multiply-accumulate, downsampling and nonlinear activation functions is finished periodically under the control of the calculation control submodule group. The invention has the advantages of configurable internal structure, support of various FPGAs, lower power consumption and high processing speed.

Description

Translated fromChinese

基于FPGA的卷积神经网络加速系统FPGA-based Convolutional Neural Network Acceleration System

技术领域technical field

本发明属于计算机技术领域，主要涉及一种卷积神经网络，可用于构建基于FPGA的卷积神经网络加速系统。The invention belongs to the technical field of computers, and mainly relates to a convolutional neural network, which can be used to construct an FPGA-based convolutional neural network acceleration system.

背景技术Background technique

随着深度学习的发展，卷积神经网络在科研以及工业应用领域都取得了较好的成果，但是卷积神经网络在带来更好的效果的同时相较于许多传统算法也带来了计算量上的急剧增加，由于该算法存在大量并行性计算的特点，传统的通用处理器CPU不适合处理该算法，而目前应用较广的GPU则存在高功耗的显著问题。FPGA作为一种低功耗计算密集型可编程器件可以很好地服务于卷积神经网络算法。With the development of deep learning, convolutional neural networks have achieved good results in both scientific research and industrial applications, but convolutional neural networks have brought better results compared to many traditional algorithms. Due to the characteristics of a large number of parallel calculations in this algorithm, the traditional general-purpose processor CPU is not suitable for processing this algorithm, and the currently widely used GPU has a significant problem of high power consumption. As a low-power computing-intensive programmable device, FPGA can well serve convolutional neural network algorithms.

目前，基于FPGA的卷积神经网络实现在架构上多采用CPU+FPGA的形式，主要只利用FPGA对卷积神经网络中卷积层的计算进行加速，而将其余部分交于CPU处理，没有将FPGA和网络卷积层以外的部分充分结合；在流程上多采用从外部存储器读取数据至CPU或FPGA上处理，完成一个网络层的处理后，将中间结果写入外部存储，然后进行下一层处理前的准备工作，再从外部存储读取数据进行处理，如此往复直至得到最终结果，这种方式将数据在片内和片外反复搬运，每层的计算之间没有形成流水线，在功耗和速度上都受到较大的限制；在应用上，限制使用者只能在CPU端进行软件开发，而整个系统中FPGA对外封闭，型号和片上系统结构固化，使用者不能按需选择FPGA型号和调整FPGA上的系统结构。At present, FPGA-based convolutional neural network implementations mostly adopt the form of CPU+FPGA in terms of architecture, mainly only using FPGA to accelerate the calculation of the convolutional layer in the convolutional neural network, and handing over the rest to the CPU for processing. The FPGA and the parts other than the network convolution layer are fully integrated; in the process, data is often read from the external memory to the CPU or FPGA for processing. After completing the processing of a network layer, the intermediate results are written to the external storage, and then proceed to the next step. The preparatory work before the layer processing, and then read the data from the external storage for processing, and so on until the final result is obtained. In this way, the data is repeatedly transported on-chip and off-chip. There is no pipeline between the calculations of each layer. Both power consumption and speed are greatly restricted; in terms of application, users are limited to software development on the CPU side, while the FPGA in the entire system is closed to the outside world, and the model and system-on-chip structure are fixed, so users cannot choose FPGA models as needed And adjust the system structure on the FPGA.

发明内容Contents of the invention

本发明的目的在于针对上述现有技术的不足，提出一种基于FPGA的卷积神经网络加速系统，用以构建流水线形式的加速系统，降低功耗，提高卷积神经网络运算速度，实现对FPGA的灵活运用。The purpose of the present invention is to address the above-mentioned deficiencies in the prior art, and propose a convolutional neural network acceleration system based on FPGA, which is used to build an acceleration system in the form of a pipeline, reduce power consumption, improve the calculation speed of convolutional neural network, and realize the optimization of FPGA. flexible use.

为实现上述目的，本发明基于FPGA的卷积神经网络加速系统，其特征在于，包括：In order to achieve the above object, the FPGA-based convolutional neural network acceleration system of the present invention is characterized in that, comprising:

参数存储子模块，用于存储卷积神经网络的权值参数以及配置参数；The parameter storage sub-module is used to store the weight parameters and configuration parameters of the convolutional neural network;

总控制子模块，用于控制整体工作状态和初始化其它功能子模块；The overall control sub-module is used to control the overall working state and initialize other functional sub-modules;

配置寄存器组，用于控制各网络计算子模块组中的各种计算子模块的连接关系和工作模式及缓存子模块的缓存上限；The configuration register group is used to control the connection relationship and working mode of various computing sub-modules in each network computing sub-module group and the cache upper limit of the cache sub-module;

网络计算子模块组，用于完成卷积神经网络中的各种基本运算；The network computing sub-module group is used to complete various basic operations in the convolutional neural network;

缓存子模块，用于存储计算的中间结果；The cache submodule is used to store the intermediate results of the calculation;

计算控制子模块组，用于控制网络计算子模块组中的不同计算子模块完成各种基本运算；The computing control sub-module group is used to control different computing sub-modules in the network computing sub-module group to complete various basic operations;

各子模块的连接关系如下：The connection relationship of each sub-module is as follows:

总控制子模块通过内部数据、控制以及地址线与参数存储子模块连接；通过内部数据和控制线与配置寄存器组连接；通过内部控制线与网络计算子模块组连接；通过参数输入和地址输出端口与卷积神经网络外部连接；The total control sub-module is connected to the parameter storage sub-module through internal data, control and address lines; connected to the configuration register group through internal data and control lines; connected to the network computing sub-module group through internal control lines; through parameter input and address output ports External connection with convolutional neural network;

参数存储子模块通过内部数据线与网络计算子模块组连接；通过内部地址线与计算控制子模块组连接；通过参数输入端口与卷积神经网络外部连接；The parameter storage sub-module is connected to the network computing sub-module group through the internal data line; connected to the computing control sub-module group through the internal address line; connected to the convolutional neural network externally through the parameter input port;

网络计算子模块组通过内部数据线与缓存子模块连接；通过内部控制线与计算控制子模块组连接；通过数据输出端口与卷积神经网络外部连接；The network computing sub-module group is connected to the cache sub-module through the internal data line; connected to the computing control sub-module group through the internal control line; connected to the convolutional neural network externally through the data output port;

缓存子模块通过数据输入和状态信号输出端口与卷积神经网络外部连接；The cache sub-module is externally connected to the convolutional neural network through data input and status signal output ports;

配置寄存器组通过内部控制线与网络计算子模块组连接；The configuration register group is connected with the network computing sub-module group through the internal control line;

计算控制子模块组状态信号输入端口与卷积神经网络外部连接。The state signal input port of the computing control submodule group is externally connected to the convolutional neural network.

本发明具有如下优点：The present invention has the following advantages:

1.本发明由于是基于FPGA内的基本资源设计的，这些资源在不同型号的FPGA内普遍存在，且本发明对资源的开销不是很大，小于很多型号的FPGA的资源总量，因此可以在多种型号的FPGA上使用，对FPGA型号的限制更小；1. the present invention is owing to be based on the basic resource design in FPGA, these resources are ubiquitous in the FPGA of different models, and the present invention is not very large to the expense of resource, is less than the resource total amount of the FPGA of many models, therefore can be in It is used on various types of FPGAs, and the restrictions on FPGA models are smaller;

2.本发明由于模块外部提供了连接到缓存子模块的状态信号输出和数据输入端口，连接到计算控制子模块的状态信号输入端口以及连接到网络计算子模块的数据输出端口，因此多个模块可以通过这些端口级联来协同工作，实现更复杂卷积神经网络，可拓展性更好，使用更灵活；2. The present invention provides the state signal output and data input ports connected to the cache sub-module, the state signal input port connected to the calculation control sub-module and the data output port connected to the network calculation sub-module outside the module, so multiple modules These ports can be cascaded to work together to realize a more complex convolutional neural network with better scalability and more flexible use;

3.本发明由于提供了可外部直接读写的参数存储子模块，只需将配置参数以及网络的权值写入参数存储子模块，由总控制子模块和配置寄存器组自动完成功能结构的配置以及权值的加载，因此可以方便地配置卷积神经网络的整体模块，以适应多种网络结构的需求；3. Since the present invention provides a parameter storage sub-module that can be directly read and written externally, it only needs to write configuration parameters and network weights into the parameter storage sub-module, and the configuration of the functional structure is automatically completed by the general control sub-module and the configuration register group And the loading of weights, so the overall module of the convolutional neural network can be easily configured to meet the needs of various network structures;

4.本发明由于在网络计算子模块组集成了卷积神经网络不同层的功能，使卷积神经网络内的运算都能在FPGA上以并行运算的方式实现，提高了运算速度；4. Since the present invention integrates the functions of different layers of the convolutional neural network in the network computing sub-module group, the calculations in the convolutional neural network can be realized in parallel on the FPGA, which improves the calculation speed;

5.本发明由于对卷积神经网络的不同层采用多个计算子模块分别来进行计算，并且各计算子模块以流水线的方式进行工作，提高了连续处理多张图片的工作效率；5. The present invention uses a plurality of calculation sub-modules to calculate separately for different layers of the convolutional neural network, and each calculation sub-module works in a pipelined manner, which improves the work efficiency of continuous processing of multiple pictures;

6.本发明由于采用乒乓双缓存的方式存储数据，使得网络每一层的读写与计算和前一层的读写与计算可以同时进行，减少了等待数据的时间，进一步提高了数据处理的速度；6. Since the present invention stores data in the way of ping-pong double buffering, the reading, writing and calculation of each layer of the network and the reading, writing and calculation of the previous layer can be carried out simultaneously, which reduces the time for waiting for data and further improves the efficiency of data processing. speed;

仿真结果表明，本发明在基于MINIST数据集的分类任务中，对于单张MINIST图像的平均处理速度达到i7 CPU处理器的75倍。The simulation results show that in the classification task based on the MINIST data set, the average processing speed of a single MINIST image reaches 75 times that of an i7 CPU processor.

附图说明Description of drawings

此处所说明的附图用来提供对本发明的进一步理解，构成本申请的一部分，本发明的示意性实施例及其说明用于解释本发明，并不构成对本发明的不当限定。The accompanying drawings described here are used to provide a further understanding of the present invention and constitute a part of the application. The schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute improper limitations to the present invention.

图1是本发明的整体结构框图；Fig. 1 is an overall structural block diagram of the present invention;

图2是本发明中卷积计算子模块、卷积控制子模块、卷积输入缓存子模块和非线性激活子模块的连接及结构示意图；Fig. 2 is a schematic diagram of the connection and structure of the convolution calculation submodule, convolution control submodule, convolution input buffer submodule and nonlinear activation submodule in the present invention;

图3是本发明中卷积输入缓存子模块中缓存单元的结构示意图；Fig. 3 is a schematic structural view of the cache unit in the convolution input cache submodule in the present invention;

图4是本发明中池化计算子模块、池化控制子模块和池化输入缓存子模块的连接及结构示意图；Fig. 4 is a schematic diagram of the connection and structure of the pooling calculation sub-module, the pooling control sub-module and the pooling input cache sub-module in the present invention;

图5是本发明中全连接计算子模块、全连接控制子模块、全连接输入缓存子模块和非线性激活子模块的连接及结构示意图；Fig. 5 is a schematic diagram of the connection and structure of the fully connected calculation submodule, the fully connected control submodule, the fully connected input buffer submodule and the nonlinear activation submodule in the present invention;

图6是本发明中卷积计算子模块、池化计算子模块和全连接计算子模块的工作状态转移图。Fig. 6 is a working state transition diagram of the convolution calculation sub-module, the pooling calculation sub-module and the full connection calculation sub-module in the present invention.

具体实施方式Detailed ways

以下将结合附图，对本发明的技术方案进行详细说明。The technical solutions of the present invention will be described in detail below in conjunction with the accompanying drawings.

参照图1，本发明包括：参数存储子模块1、总控制子模块2、配置寄存器组3、网络计算子模块组4、缓存子模块5、计算控制子模块组6、参数输入端7、地址输出端8、数据输入端9、数据输出端10、状态信号输入端11和状态信号输出端12。其中：Referring to Fig. 1, the present invention includes: parameter storage submodule 1, general control submodule 2, configuration register group 3, network computing submodule group 4, cache submodule 5, calculation control submodule group 6, parameter input terminal 7, address Output terminal 8 , data input terminal 9 , data output terminal 10 , status signal input terminal 11 and status signal output terminal 12 . in:

参数存储子模块1，用于存储卷积神经网络的权值以及配置参数并供其它子模块读取，其分别通过一路内部数据线、一路内部控制线、一路内部地址线与总控制子模块2连接，通过两路内部数据线连接与网络计算子模块组4连接，通过一路内部地址线与计算控制子模块组6连接，通过一路数据线与参数输入端口7连接；The parameter storage sub-module 1 is used to store the weights and configuration parameters of the convolutional neural network and be read by other sub-modules, which respectively pass one internal data line, one internal control line, one internal address line and the total control sub-module 2 Connect, connect to the network computing sub-module group 4 through two internal data lines, connect to the computing control sub-module group 6 through one internal address line, and connect to the parameter input port 7 through one data line;

总控制子模块2，用于控制本发明的整体工作状态和初始化其它功能子模块，其分别通过一路内部数据线和一路内部控制线与配置寄存器组3连接，通过两路内部控制线与网络计算子模块组4连接，通过一路数据线与参数输入端口7连接，通过一路地址线与地址输出端口8连接；The total control sub-module 2 is used to control the overall working state of the present invention and initialize other functional sub-modules, which are respectively connected to the configuration register group 3 through one internal data line and one internal control line, and connected to the network computing through two internal control lines The sub-module group 4 is connected, connected to the parameter input port 7 through a data line, and connected to the address output port 8 through an address line;

配置寄存器组3，用于控制各网络计算子模4块组中的各种计算子模块的连接关系和工作模式及缓存子模块5的缓存上限和阈值，其通过一路内部控制线与网络计算子模块组4连接；Configuration register group 3 is used to control the connection relationship and working mode of various computing sub-modules in each network computing sub-module group of 4 and the cache upper limit and threshold of cache sub-module 5, which communicates with the network computing sub-module through one internal control line. Module group 4 connections;

网络计算子模块组4，用于完成卷积神经网络中的各种基本运算，其分别与缓存子模块5和计算控制子模块组6连接。该所述网络计算子模块组4，包括第一卷积计算子模块41、第二卷积计算子模块42、第一池化计算子模块43、第二池化计算子模块44、全连接计算子模块45、第一非线性激活子模块46、第二非线性激活子模块47和第三非线性激活子模块48；The network calculation sub-module group 4 is used to complete various basic operations in the convolutional neural network, which are respectively connected with the cache sub-module 5 and the calculation control sub-module group 6 . The network calculation submodule group 4 includes a first convolution calculation submodule 41, a second convolution calculation submodule 42, a first pooling calculation submodule 43, a second pooling calculation submodule 44, and a full connection calculation Submodule 45, firstnonlinear activation submodule 46, second nonlinear activation submodule 47 and thirdnonlinear activation submodule 48;

缓存子模块5，用于缓存输入各计算子模块的数据，其分别与计算子模块组4和计算控制子模块组6连接，其内部包括第一卷积输入缓存子模块51、第二卷积输入缓存子模块52、第一池化输入缓存子模块53、第二池化输入缓存子模块54和全连接输入缓存子模块55；The cache sub-module 5 is used to cache the data input to each calculation sub-module, which is respectively connected to the calculation sub-module group 4 and the calculation control sub-module group 6, and includes a first convolution input cache sub-module 51, a second convolution Input cache submodule 52, first pooledinput cache submodule 53, second pooled input cache submodule 54, and fully connected input cache submodule 55;

计算控制子模块组6，用于控制各种计算子模块在不同工作状态间切换从而完成各种基本运算，其分别与网络计算子模块组4和缓存子模块5连接，其内部包括第一卷积控制子模块61、第二卷积控制子模块62、第一池化控制子模块63、第二池化控制子模块64和全连接控制子模块65。The computing control sub-module group 6 is used to control various computing sub-modules to switch between different working states to complete various basic operations, which are respectively connected to the network computing sub-module group 4 and the cache sub-module 5, and its interior includes the first volume A product control submodule 61 , a second convolution control submodule 62 , a first pooling control submodule 63 , a second pooling control submodule 64 and a full connection control submodule 65 .

参照图2，网络计算子模块组4中的第一卷积计算子模块41、缓存子模块5中的第一卷积输入缓存子模块51以及计算控制子模块组6中的第一卷积控制子模块61和第一非线性激活子模块46的连接关系如下：Referring to Fig. 2, the first convolution calculation submodule 41 in the network calculation submodule group 4, the first convolution input cache submodule 51 in the cache submodule 5, and the first convolution control in the calculation control submodule group 6 The connection relationship between the submodule 61 and the firstnonlinear activation submodule 46 is as follows:

卷积输入缓存子模块51的两路输出数据连接到MUX选择器，MUX选择器的输出连接到卷积计算子模块41，卷积输入缓存子模块51的状态信号Full连接到卷积控制子模块61，卷积控制子模块61的控制信号Sel连接到MUX选择器，卷积计算子模块41的输出连接到非线性激活子模块46。其中：卷积输入缓存子模块51内部包括2个缓存组，每个缓存组包含6个缓存单元；卷积计算子模块41内包括6个加法树和36个卷积核。The two-way output data of the convolution input buffer submodule 51 is connected to the MUX selector, the output of the MUX selector is connected to the convolution calculation submodule 41, and the state signal Full of the convolution input buffer submodule 51 is connected to the convolution control submodule 61 , the control signal Sel of the convolution control submodule 61 is connected to the MUX selector, and the output of the convolution calculation submodule 41 is connected to thenonlinear activation submodule 46 . Wherein: the convolution input cache sub-module 51 includes 2 cache groups, and each cache group contains 6 cache units; the convolution calculation sub-module 41 includes 6 addition trees and 36 convolution kernels.

参照图3，卷积输入缓存子模块51中的每个缓存单元由5个首尾相连的FIFO队列构成。其中，FIFO1到FIFO4为行队列，FIFO5为主队列，每个队列的计数信号Count表示当前队列内的数据量。在缓存数据时，首先将输入Din写入主队列FIFO5，若主队列FIFO5的计数信号Count5未达到配置寄存器设置的上限，则主队列FIFO5不输出内部数据，反之主队列FIFO5的输出内部数据到第四行队列FIFO4的输入端，第四行队列FIFO4开始读入数据，当第四行队列FIFO4的计数信号Count4达到设置上限后，第四行队列FIFO4输出内部数据到第三行队列FIFO3的输入端，依次进行直到第一行队列FIFO1的输出端开始输出数据，则数据的缓存完成，此时卷积输入缓存子模块51向卷积控制子模块61发送状态信号Full，通知卷积控制子模块61启动卷积计算子模块41工作。每个缓存单元的5个FIFO队列并行输出，每个缓存组里的6个缓存单元并行输出，根据具体需求的不同，通过配置寄存器改变卷积计算子模块41并行输入的个数，即选择一个缓存组中的部分或全部6个缓存单元的输出作为卷积计算子模块41的输入，未被选择的缓存单元的输出被置零。Referring to FIG. 3 , each buffer unit in the convolution input buffer submodule 51 is composed of 5 end-to-end FIFO queues. Among them, FIFO1 to FIFO4 are row queues, FIFO5 is the main queue, and the count signal Count of each queue indicates the amount of data in the current queue. When caching data, first write the input Din into the main queue FIFO5, if the count signal Count5 of the main queue FIFO5 does not reach the upper limit set by the configuration register, the main queue FIFO5 will not output internal data, otherwise the output internal data of the main queue FIFO5 will go to the second The input terminal of the four-row queue FIFO4, the fourth row queue FIFO4 starts to read data, when the counting signal Count4 of the fourth row queue FIFO4 reaches the upper limit, the fourth row queue FIFO4 outputs internal data to the input terminal of the third row queue FIFO3 , proceed in sequence until the output end of the first line queue FIFO1 starts to output data, then the buffering of the data is completed, at this time the convolution input buffer submodule 51 sends a status signal Full to the convolution control submodule 61, and notifies the convolution control submodule 61 Start the convolution calculation sub-module 41 to work. The 5 FIFO queues of each cache unit are output in parallel, and the 6 cache units in each cache group are output in parallel. According to different specific requirements, the number of parallel inputs of the convolution calculation sub-module 41 can be changed through the configuration register, that is, one can be selected. The outputs of some or all of the six cache units in the cache group are used as the input of the convolution calculation sub-module 41 , and the outputs of unselected cache units are set to zero.

卷积计算子模块41中每个卷积核是由25个乘法器组成的乘法阵列，每个卷积核在每个时钟周期内可计算25个数据和对应权值的乘积，每6个卷积核并行输出到一个加法树进行求和，根据具体需求的不同，可通过配置寄存器改变卷积计算子模块41的并行输出个数，即选择部分或全部6个加法树的输出作为卷积计算子模块的输出，未被选择的加法树的输出被置零。Each convolution kernel in the convolution calculation sub-module 41 is a multiplication array composed of 25 multipliers, and each convolution kernel can calculate the product of 25 data and corresponding weights in each clock cycle, and every 6 volumes The product kernel is output in parallel to an addition tree for summation. According to different specific requirements, the number of parallel outputs of the convolution calculation sub-module 41 can be changed through the configuration register, that is, the output of some or all of the six addition trees is selected as the convolution calculation. The output of the submodule, the output of the additive tree that is not selected is set to zero.

参照图4，网络计算子模块组4中的第一池化计算子模块43、缓存子模块5中的第一池化输入缓存子模块53以及计算控制子模块组6中的第一池化控制子模块63的连接关系如下：Referring to Fig. 4, the first pooling computing submodule 43 in the network computing submodule group 4, the first poolinginput buffering submodule 53 in the cache submodule 5, and the first pooling control in the computing control submodule group 6 The connection relation of submodule 63 is as follows:

池化输入缓存子模块53的两路输出连接到池化计算子模块43的四个数据输入端，每一路连接两个数据输入端，池化输入缓存子模块53的状态信号Full连接到池化控制子模块63，池化控制子模块63的控制信号Ena连接到池化计算子模块43的两个控制输入端。其中，池化输入缓存子模块53内包括2个FIFO队列FIFO6和FIFO7；池化计算子模块43内包括2个比较器、2个加法器和1个MUX选择器。The two outputs of the poolinginput cache submodule 53 are connected to the four data input terminals of the pooling calculation submodule 43, each of which is connected to two data input terminals, and the state signal Full of the poolinginput cache submodule 53 is connected to the pooling The control sub-module 63 , the control signal Ena of the pooling control sub-module 63 is connected to the two control input terminals of the pooling calculation sub-module 43 . Wherein, the poolinginput buffer sub-module 53 includes two FIFO queues FIFO6 and FIFO7; the pooling calculation sub-module 43 includes two comparators, two adders and one MUX selector.

池化输入缓存子模块53的输入信号Din分别连接到2个FIFO队列的输入端，当输入信号Din传来有效数据时，首先打开本模块的第一队列FIFO6的输入端，将数据写入FIFO6中，当写入FIFO6中的数据量达到设置阈值后，FIFO6的输入端关闭，本模块的第二队列FIFO7的输入端打开，将数据写入FIFO7中，当写入FIFO7中的数据量达到设置阈值后，FIFO7的输入端关闭，再次打开本模块的第一队列FIFO6的输入端，将数据再次写入FIFO6，按照此规律输入数据轮流写入这两个FIFO队列FIFO6和FIFO7。当这两个FIFO队列内的数据量都超过设置阈值时，池化输入缓存子模块53向池化控制子模块63发送状态信号Full，池化控制子模块63收到状态信号Full后打开缓存与计算子模块间的数据通路，这两个FIFO队列FIFO6和FIFO7一起向池化计算子模块43发送数据。The input signal Din of the poolinginput buffer sub-module 53 is respectively connected to the input terminals of the two FIFO queues. When valid data is transmitted from the input signal Din, the input terminal of the first queue FIFO6 of this module is first opened, and the data is written into FIFO6. , when the amount of data written into FIFO6 reaches the set threshold, the input end of FIFO6 is closed, and the input end of the second queue FIFO7 of this module is opened, and the data is written into FIFO7. When the amount of data written into FIFO7 reaches the set threshold After the threshold, the input terminal of FIFO7 is closed, and the input terminal of the first queue FIFO6 of this module is opened again, and the data is written into FIFO6 again. According to this rule, the input data is written into the two FIFO queues FIFO6 and FIFO7 in turn. When the amount of data in these two FIFO queues all exceeds the setting threshold, the poolinginput buffer submodule 53 sends a status signal Full to the pooling control submodule 63, and the pooling control submodule 63 opens the cache and Computing the data path between the sub-modules, the two FIFO queues FIFO6 and FIFO7 send data to the pooled computing sub-module 43 together.

池化计算子模块43有两种工作模式，若由配置寄存器设置的工作模式为最大池化模式，则本模块中只有两个比较器工作，第一比较器C1在每个时钟周期对两个新的输入数据的大小进行比较并输出较大值，第二比较器C2在每个时钟周期对两个新的输入数据的大小进行比较并输出较大值，其中一个输入为0或比较器C2前一个时钟周期的输出，另一个输入为比较器C1的输出，照此工作方式，每经过两个时钟周期可得到4个输入数据中的最大值；若工作模式被设置为均值池化模式，则本模块中只有两个加法器工作，第一加法器A1在每个时钟周期对两个新的输入数据进行求和并输出和值，第二加法器A2在每个时钟周期对两个新的输入数据进行求和并输出和值，其中一个输入为0或加法器A2前一个时钟周期的输出，另一个输入为加法器A1的输出，照此工作方式，每经过两个时钟周期可以得到4个输入数据之和，再舍弃二进制结果的低2位，则可得到4个输入数据的均值。The pooling calculation sub-module 43 has two kinds of operating modes. If the operating mode set by the configuration register is the maximum pooling mode, then only two comparators work in this module, and the first comparator C1 pairs two comparators in each clock cycle. The size of the new input data is compared and outputs a larger value, and the second comparator C2 compares the size of two new input data and outputs a larger value in each clock cycle, one of the inputs is 0 or the comparator C2 The output of the previous clock cycle, and the other input is the output of comparator C1. According to this working method, the maximum value of the 4 input data can be obtained every two clock cycles; if the working mode is set to mean pooling mode, Then only two adders work in this module. The first adder A1 sums two new input data and outputs the sum value in each clock cycle, and the second adder A2 sums the two new input data in each clock cycle. Sum the input data and output the sum value. One of the inputs is 0 or the output of the previous clock cycle of the adder A2, and the other input is the output of the adder A1. In this way, every two clock cycles can be obtained The sum of the 4 input data, and discarding the lower 2 bits of the binary result, the mean value of the 4 input data can be obtained.

参照图5，网络计算子模块组4中的全连接计算子模块45、缓存子模块5中的全连接输入缓存55以及计算控制子模块组6中的全连接控制子模块65和第三非线性激活子模块48的连接关系如下：Referring to Fig. 5, the fully connected computing submodule 45 in the network computing submodule group 4, the fully connected input cache 55 in the cache submodule 5, and the fully connected control submodule 65 and the third nonlinearity in the computing control submodule group 6 The connection relation ofactivation submodule 48 is as follows:

全连接输入缓存子模块55的6路输出连接到全连接计算子模块45的6个输入端，全连接控制子模块65的2个选择控制信号Sel1和Sel2连接到全连接输入缓存子模块55，全连接控制子模块65的选择控制信号Sel3连接到全连接计算子模块45，全连接计算子模块45的输出连接到非线性激活子模块48的输入端。其中，全连接输入缓存子模块55内包括两组FIFO队列，每组6个队列，12个输入选择器和6个输出选择器；全连接计算子模块45内包括6个乘法器，7个寄存器，1个MUX选择器和1个加法树。The 6 outputs of the fully connected input buffer submodule 55 are connected to the 6 input terminals of the fully connected calculation submodule 45, and the 2 selection control signals Sel1 and Sel2 of the fully connected control submodule 65 are connected to the fully connected input buffer submodule 55, The selection control signal Sel3 of the fully connected control submodule 65 is connected to the fully connected calculation submodule 45 , and the output of the fully connected calculation submodule 45 is connected to the input terminal of thenonlinear activation submodule 48 . Among them, the fully connected input buffer sub-module 55 includes two groups of FIFO queues, each group has 6 queues, 12 input selectors and 6 output selectors; the fully connected computing sub-module 45 includes 6 multipliers and 7 registers , 1 MUX selector and 1 adder tree.

全连接输入缓存子模块55中每个FIFO队列的输入与一个输入选择器MUXI的输出连接，每个FIFO队列的输出分别连接到一个与自身相连的MUXI选择器和一个输出选择器MUXO，第一组的6个FIFO队列与第二组的6个FIFO队列一一对应，每两个相对应的FIFO队列的输出连接到同一个MUXO选择器，6个MUXO的输出连接到全连接计算子模块45的输入。全连接控制子模块65通过选择控制信号Sel1控制每个MUXI选择器输出与之相连接的输入信号Din或者FIFO队列的输出信号，通过选择控制信号Sel2选择两组FIFO队列的其中一组6个输出作为全连接输入缓存子模块55的输出。The input of each FIFO queue in the fully connected input buffer submodule 55 is connected to the output of an input selector MUXI, and the output of each FIFO queue is respectively connected to a MUXI selector connected to itself and an output selector MUXO, the first The 6 FIFO queues of the first group are in one-to-one correspondence with the 6 FIFO queues of the second group, the outputs of each two corresponding FIFO queues are connected to the same MUXO selector, and the outputs of the 6 MUXOs are connected to the fully connected computing sub-module 45 input of. The full connection control sub-module 65 controls each MUXI selector to output the input signal Din connected to it or the output signal of the FIFO queue by selecting the control signal Sel1, and selects one group of 6 outputs of two groups of FIFO queues by selecting the control signal Sel2 As the output of the fully connected input cache sub-module 55.

全连接计算子模块45在每个工作周期读入7个或6个权值，并将它们依次存储在寄存器Reg1到Reg7或寄存器Reg1到Reg6中，全连接计算子模块45的6个输入值分别与寄存器Reg1到Reg6中的6个权值相乘得到6个乘积并送入加法树，全连接控制子模块65通过Sel3选择Reg7或者加法树前一时钟周期的求和结果作为加法树的第7个输入，加法树每个时钟周期对7个输入求和，并将经过多个时钟周期求和结果送入非线性激活子模块48得到输出数据Dout。The fully-connected calculation sub-module 45 reads 7 or 6 weights in each working cycle, and stores them in registers Reg1 to Reg7 or registers Reg1 to Reg6 in turn, and the 6 input values of the fully-connected calculation sub-module 45 are respectively Multiply with the 6 weights in registers Reg1 to Reg6 to obtain 6 products and send them to the addition tree. The full connection control submodule 65 selects Reg7 or the summation result of the previous clock cycle of the addition tree through Sel3 as the seventh of the addition tree. input, the addition tree sums the 7 inputs every clock cycle, and sends the summation result after multiple clock cycles to thenonlinear activation sub-module 48 to obtain the output data Dout.

参照图6，网络计算子模块组4内的两个卷积计算子模块41和42、两个池化计算子模块43和44及全连接计算子模块45的工作状态如下：Referring to FIG. 6 , the working states of the two convolution computing sub-modules 41 and 42, the two pooling computing sub-modules 43 and 44 and the full connection computing sub-module 45 in the network computing sub-module group 4 are as follows:

如图6(a)所示，两个卷积计算子模块41和42工作状态的切换方式相同，以第一卷积计算子模块41为例，对其具体工作原理进行说明：卷积计算子模块41的初始工作状态为休眠状态，卷积计算子模块41等待第一卷积输入缓存子模块51加载数据，若数据加载完成，则卷积计算子模块41进入准备状态，反之保持当前状态；进入准备状态后，第一卷积控制子模块61中的计时器开始计时，若计时器达到阈值，则卷积计算子模块41进入写入状态，反之保持当前状态；进入写入状态后，卷积控制子模块61中列计数器开始计数，若列计数器未达到阈值，则卷积计算子模块41保持当前状态，反之卷积控制子模块61中的行计数器加1，再判断行计数器是否达到阈值，若行计数器达到阈值，则卷积计算子模块41进入休眠状态，反之进入准备状态。As shown in Figure 6(a), the switching modes of the working states of the two convolution calculation sub-modules 41 and 42 are the same. Taking the first convolution calculation sub-module 41 as an example, its specific working principle is explained: the convolution calculation sub-module The initial working state of the module 41 is a dormant state, and the convolution calculation submodule 41 waits for the first convolution input buffer submodule 51 to load data. If the data loading is completed, the convolution calculation submodule 41 enters the ready state, otherwise maintains the current state; After entering the ready state, the timer in the first convolution control submodule 61 starts counting. If the timer reaches the threshold, the convolution calculation submodule 41 enters the write state, otherwise maintains the current state; after entering the write state, the volume The column counter in the product control sub-module 61 starts counting, if the column counter does not reach the threshold, the convolution calculation sub-module 41 maintains the current state, otherwise the row counter in the convolution control sub-module 61 adds 1, and then judges whether the row counter reaches the threshold , if the line counter reaches the threshold, the convolution calculation sub-module 41 enters the dormant state, otherwise enters the ready state.

如图6(b)所示，两个池化计算子模块43和44工作状态的切换方式相同，以第一池化计算子模块43为例，对其具体工作原理进行说明：池化计算子模块43初始工作状态为休眠状态，池化计算子模块43等待第一池化输入缓存子模块53加载数据，若数据加载完成，则池化计算子模块43进入开始状态，反之保持当前状态；进入开始状态后，再经过一个时钟周期的时间，池化计算子模块43进入一轮运算状态；进入一轮运算状态之后，再经过一个时钟周期的时间，池化计算子模块43进入二轮运算状态；进入二轮运算状态之后，再经过一个时钟周期的时间，池化计算子模块43进入写入状态；进入写入状态后，第一池化控制模块63判断池化输入缓存子模块53是否为空，若为空，则池化计算子模块43进入休眠状态，反之进入开始状态。As shown in Figure 6(b), the switching modes of the two pooled calculation sub-modules 43 and 44 are the same. Taking the first pooled calculation sub-module 43 as an example, its specific working principle is explained: the pooled calculation sub-module The initial working state of the module 43 is a dormant state, and the pooling calculation submodule 43 waits for the first poolinginput cache submodule 53 to load data. If the data loading is completed, the pooling calculation submodule 43 enters the start state, otherwise maintains the current state; enters After the start state, after one clock cycle, the pooling calculation submodule 43 enters a round of computing state; after entering the round computing state, after one clock cycle, the pooling computing submodule 43 enters a second round of computing state ; After entering the second round of computing state, and then through the time of one clock cycle, the pooling calculation submodule 43 enters the writing state; after entering the writing state, the first pooling control module 63 judges whether the poolinginput cache submodule 53 is Empty, if it is empty, the pooling calculation sub-module 43 enters the sleep state, otherwise enters the start state.

如图6(c)所示，全连接计算子模块45的初始工作状态为休眠状态，等待全连接输入缓存子模块55加载数据，若数据加载完成，则全连接计算子模块45进入读取状态，反之保持当前状态；进入读取状态后，全连接控制子模块65中的读计数器开始倒计数，若读计数器归零，则全连接计算子模块45进入乘累加状态，反之保持当前状态；进入乘累加状态后，全连接控制子模块65中的乘累加计数器减1，再判断乘累加计数器是否归零，若乘累加计数器归零，则全连接计算子模块45进入写入状态，反之进入写入状态；当进入写入状态后，全连接控制子模块65中的输出计数器减1，再判断输出计数器是否归零，若输出计数器归零，则全连接计算子模块45进入休眠状态，反之进入读取状态。As shown in Figure 6(c), the initial working state of the fully connected computing sub-module 45 is a dormant state, waiting for the fully connected input buffer sub-module 55 to load data, and if the data loading is completed, the fully connected computing sub-module 45 enters the reading state , otherwise maintain the current state; after entering the read state, the read counter in the fully connected control submodule 65 starts counting down, if the read counter returns to zero, then the fully connected calculation submodule 45 enters the multiplication and accumulation state, otherwise maintains the current state; enter After multiplying and accumulating the state, the multiplying and accumulating counter in the fully connected control submodule 65 is decremented by 1, and then it is judged whether the multiplying and accumulating counter returns to zero. enter state; when entering the write state, the output counter in the fully connected control submodule 65 is decremented by 1, and then it is judged whether the output counter returns to zero. If the output counter returns to zero, the fully connected calculation submodule 45 enters the dormant state, otherwise enters Read status.

以上描述仅是本发明的一个具体实例，并未构成对本发明的任何限制，显然对于本领域的专业人员来说，在了解了本发明内容和原理后，都可以在不背离本发明原理、结构的情况下，进行形式和细节上的各种修改和改变，但是这些基于本发明思想的修正和改变仍在本发明的权利要求保护范围之内。The above description is only a specific example of the present invention, and does not constitute any limitation to the present invention. Obviously, for those skilled in the art, after understanding the contents and principles of the present invention, they can Various modifications and changes in form and details are made under the circumstances of the present invention, but these modifications and changes based on the idea of the present invention are still within the protection scope of the claims of the present invention.