CN108805272A - A kind of general convolutional neural networks accelerator based on FPGA - Google Patents

A kind of general convolutional neural networks accelerator based on FPGA

Info

Publication number
CN108805272A
CN108805272A (Application CN201810413101.XA)
Authority
CN
China
Prior art keywords
buffer area
convolution
convolution kernel
fpga
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810413101.XA
Other languages
Chinese (zh)
Inventor
陆生礼
泮雯雯
庞伟
范雪梅
李宇峰
杨文韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Original Assignee
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University - Wuxi Institute Of Technology Integrated Circuits, Southeast University
Priority to CN201810413101.XA
Publication of CN108805272A
Legal status: Pending

Abstract

Translated from Chinese

The invention discloses an FPGA-based general-purpose convolutional neural network accelerator, comprising an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer. The convolution accelerator is implemented on an FPGA and contains N convolution calculation sub-units; the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer. The convolution calculator reads the data in the feature map buffer and the convolution kernel buffer to perform the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels; the segmented result buffer stores the accumulation results of each level output by the convolution calculator. The invention supports a variety of convolutional neural network structures, offers good generality, requires few on-chip storage resources, and incurs low communication overhead.

Description

Translated from Chinese

A general-purpose convolutional neural network accelerator based on FPGA

Technical Field

The present invention relates to the technical fields of electronic information and deep learning, and in particular to a general-purpose convolutional neural network accelerator.

Background Art

In recent years, deep neural networks have received extensive attention; convolutional neural network models in particular have been widely applied in computer vision, for example in image classification, face detection and recognition, and text recognition. In 2013, MIT Technology Review ranked deep learning, represented by convolutional neural networks, first among its ten breakthrough technologies. Inspired by the human visual system, convolutional neural network algorithms use the convolution operation to model the receptive fields of neurons. Deep neural networks are extremely compute-intensive, reaching tens of GOPS in tasks such as detection and recognition. Besides the computation, neural networks have millions or even hundreds of millions of parameters that must be stored, so real-time detection and recognition with deep neural networks has had to rely on high-performance multi-core CPUs (Central Processing Units) and GPUs (Graphics Processing Units); for devices with limited power and volume, especially mobile devices (such as robots, consumer electronics, and smart cars), porting neural network models is almost impossible. Building dedicated acceleration circuits from general-purpose devices to meet the computing and storage requirements of convolutional neural networks is therefore a feasible path.

At present, besides the GPU, the mainstream acceleration hardware comprises FPGAs and ASICs (Application-Specific Integrated Circuits). Although an ASIC offers high performance and low power consumption, it must be custom-designed for each specific application, giving low design flexibility and high up-front development cost. In recent years, the development of high-level synthesis tools for FPGAs has greatly eased FPGA design, substantially shortening the development cycle without sacrificing performance. As programmable standard devices adaptable to different functions, FPGAs avoid such high development costs and retain a degree of flexibility. In addition, FPGAs are compact, flexible, low-power, and highly parallel, which suits neural network workloads well; applying FPGAs on mobile platforms to perform the convolution computation of convolutional neural networks is therefore an effective solution.

Summary of the Invention

To solve the technical problems described in the background above, the present invention aims to provide an FPGA-based general-purpose convolutional neural network accelerator that supports various convolutional neural network structures, offers good generality, requires few on-chip storage resources, and incurs low communication overhead.

To achieve the above technical objective, the technical solution of the present invention is as follows:

An FPGA-based general-purpose convolutional neural network accelerator comprises an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer. The convolution calculator is implemented on an FPGA and contains N convolution calculation sub-units; the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer. The MCU is connected to the external memory; it reads input data from the external memory and sends the convolution results back to it. The feature map buffer and the convolution kernel buffer cache the feature map and convolution kernel data read from the MCU over the AXI4 bus interface. The address generator produces the read addresses of the feature map buffer, and the data read from the feature map buffer at those addresses is fed into the convolution calculator. The convolution calculator reads the data in the feature map buffer and the convolution kernel buffer, performs the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels. The segmented result buffer contains N/2 − 1 result sub-buffers; it stores the accumulation results of each level output by the convolution calculator and transfers them to the MCU over the AXI4 bus interface. The state controller governs the working states of the whole accelerator and realizes the transitions between them.

In a preferred embodiment of the above solution, the write addresses of the feature map buffer and the convolution kernel buffer are generated by accumulation in the state controller, and the read address of the convolution kernel buffer is generated by incrementing once every four clocks.

In a preferred embodiment of the above solution, the generation rule of the address generator follows the unrolling of the convolution matrix: the read addresses of the feature map buffer are produced from the feature map size, the convolution kernel size, and the convolution window stride.

In a preferred embodiment of the above solution, the convolution calculator accumulates the results of adjacent convolution calculation sub-units over log2 N levels, and stores the 2nd-, 3rd-, 4th-, 5th-, ..., log2 N-th level accumulation results in the segmented result buffer.

In a preferred embodiment of the above solution, the segmented result buffer contains N/2 − 1 result sub-buffers, divided into groups of N/4, N/8, N/16, N/32, ..., 1, which store in turn the accumulation results of the 2nd, 3rd, 4th, 5th, ..., log2 N-th levels.
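
As a worked check of this grouping (our arithmetic; the sub-buffer count itself was lost in extraction and is recovered from the group sizes), summing the groups gives the total number of result sub-buffers:

$$\frac{N}{4} + \frac{N}{8} + \cdots + 2 + 1 = \frac{N}{2} - 1,$$

which for the N = 64 embodiment described below is 16 + 8 + 4 + 2 + 1 = 31, matching the 31 FIFOs of the segmented result buffer.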

In a preferred embodiment of the above solution, the convolution calculator is a general-purpose fixed-point calculator. The data to be processed enter the convolution calculator in parallel, N pairs of convolution kernels and feature maps per clock, with DSP blocks serving as the arithmetic units performing multiply-accumulate. The convolution calculator outputs a convolution result on the tenth clock: the first nine clocks perform the multiply-accumulate operations, and on the tenth clock the bias is added and the result is sent to the subsequent multi-level accumulator.
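
A minimal Python sketch of this ten-clock schedule for one 3 × 3 window (an illustrative model only; names such as `pcu_3x3` are ours, not the patent's):

```python
def pcu_3x3(window, kernel, bias):
    """Model one convolution calculation sub-unit (PCU) for a 3x3 kernel.

    Clocks 1-9: one multiply-accumulate per clock over the nine
    (feature, weight) pairs; clock 10: add the bias and emit the result
    to the multi-level accumulator.
    """
    acc = 0
    for clk in range(9):                  # clocks 1..9: MAC operations
        acc += window[clk] * kernel[clk]
    return acc + bias                     # clock 10: bias add, output
```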

In a preferred embodiment of the above solution, the AXI4 bus interface concatenates several data words into a single transfer to increase throughput.

Beneficial effects of the above technical solution:

The invention shortens the communication time between the FPGA and the MCU and speeds up data transfer; the address generator effectively avoids re-sending duplicate data, further reducing data streaming time. The block feature map size is planned according to the FPGA's on-chip storage so that resource utilization is maximized. The result caching mechanism is improved to be more general: it is no longer constrained by the number of network layers and avoids unnecessary buffering of intermediate values.

The invention has been verified in a road-condition detection application based on the Yolo algorithm: at a working frequency of 100 MHz, with parallelism N = 64 and 16-bit fixed-point data, the real-time detection speed on 640 pixels × 640 pixels images reaches 3-4 FPS.

Description of the Drawings

Fig. 1 is a schematic structural diagram of the present invention;

Fig. 2 is a schematic diagram of the data buffer structure of the present invention;

Fig. 3 is a working schematic diagram of the convolution calculator of the present invention;

Fig. 4 is a schematic diagram of the accumulation of the convolution results of the present invention.

Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 shows the structure of the general-purpose convolutional neural network accelerator designed by the present invention; taking parallelism N = 64 as an example, it works as follows.

The MCU first reads the 16-bit input feature maps, convolution kernels, and biases from the external memory, concatenates every four 16-bit parameters in order into one 64-bit word, and sends it through the direct memory access controller and the AXI4 bus into the data buffers (i.e., the feature map buffer and the convolution kernel buffer).
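
A hedged sketch of this 4-to-1 packing (plain Python, not the actual bus logic; the lane order, with the first parameter in the low bits, is our assumption — the patent only fixes the 4 × 16-bit to 64-bit ratio):

```python
def pack64(params16):
    """Pack four 16-bit parameters, in order, into one 64-bit AXI4 word."""
    assert len(params16) == 4
    word = 0
    for i, p in enumerate(params16):
        word |= (p & 0xFFFF) << (16 * i)  # parameter i -> bits [16i, 16i+15]
    return word

# e.g. pack64([0x0001, 0x0002, 0x0003, 0x0004]) == 0x0004000300020001
```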

Referring to Fig. 2, the caching order of the data to be processed is controlled by the state controller: the feature map data is cached first. According to the size of each buffer, the block feature map is set to 42 × 42 with a bit width of 16 bits. Since the AXI4 bus width is set to 64 bits, the SRAM write and read widths are in a 4:1 ratio; a simple dual-port RAM is chosen as the feature map sub-buffer, with a write width of 64 bits and write depth of 441, and a read width of 16 bits and read depth of 1764. The feature map buffer consists of 64 such sub-buffers, so one data transfer can accept 42 × 42 × 64 feature map parameters. After the feature map data has been received, the state controller switches to the convolution kernel caching mode. The convolution kernel size is 3 × 3 and each kernel is paired with one bias parameter, so 10 16-bit parameters are cached per kernel; since the AXI4 bus transfers 64 bits, two 16-bit zero words are padded, giving 12 16-bit parameters in total, and the convolution kernel buffer has both read and write widths of 64 bits with a depth of 3. The convolution kernel buffer consists of 64 such sub-buffers, matched one-to-one with the feature map sub-buffers.
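
The sizing arithmetic above can be checked directly; a short consistency check (ours, not part of the patent):

```python
TILE = 42                       # 42x42 block feature map, 16-bit entries
entries = TILE * TILE           # 16-bit words per feature map sub-buffer
assert entries == 1764          # read side: 16 bits wide, depth 1764
assert entries // 4 == 441      # write side: 64 bits wide, depth 1764/4 = 441

words = 9 + 1                   # 3x3 kernel weights + 1 bias = 10 words
padded = words + 2              # pad with two zero words -> 12 words
assert padded * 16 // 64 == 3   # exactly three 64-bit beats: kernel RAM depth 3
```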

After data caching completes, the state controller enters the compute mode and starts the address generator, which, according to the convolution window stride, produces addresses such as 0, 1, 2, 42, 43, 44, 84, 85, 86; 1, 2, 3, 43, 44, 45, 85, 86, 87; ...; 1677, 1678, 1679, 1719, 1720, 1721, 1761, 1762, 1763. At the same time the buffer read enables are asserted, so that convolution kernel parameters and feature map data are read simultaneously from the kernel and feature map buffers and fed into the convolution calculator for the convolution operation.
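
A minimal sketch of the address generation rule (assuming stride 1 on the 42 × 42 tile with a 3 × 3 window, as in this embodiment) reproduces the sequences listed above:

```python
def fmap_read_addresses(fm=42, k=3, stride=1):
    """Yield feature-map RAM read addresses, one kxk window at a time."""
    for row in range(0, fm - k + 1, stride):      # window top-left row
        for col in range(0, fm - k + 1, stride):  # window top-left column
            for kr in range(k):
                for kc in range(k):
                    yield (row + kr) * fm + (col + kc)

addrs = list(fmap_read_addresses())
assert addrs[:9] == [0, 1, 2, 42, 43, 44, 84, 85, 86]
assert addrs[9:18] == [1, 2, 3, 43, 44, 45, 85, 86, 87]
assert addrs[-9:] == [1677, 1678, 1679, 1719, 1720, 1721, 1761, 1762, 1763]
```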

Referring to Fig. 3, the data to be processed enter the convolution calculator in parallel, 64 pairs of convolution kernels and feature maps per clock. For a 3 × 3 convolution, the calculator needs ten clocks to produce a result: the first nine clocks perform the multiply-accumulate operations, and on the tenth clock the bias is added and the result is sent to the subsequent accumulators, with the level-by-level accumulation results cached in the segmented convolution result buffer.

Referring to Fig. 4, there are 64 convolution calculation sub-units (Processing Convolution Units, PCUs). Each PCU adds the bias on the tenth clock and outputs its result to the subsequent accumulators, and the level-by-level accumulation results are cached in the segmented convolution result buffer. The segmented convolution result buffer consists of 31 FIFOs, divided in turn into groups of 16, 8, 4, 2, and 1, to store the level-by-level accumulation results of the convolution calculation units; the write enable of the segmented convolution result buffer is given by the convolution calculator, and the readout logic is controlled by the state controller. The read and write widths of the segmented convolution result buffer are determined by the AXI4 bus data width and the fixed-point convolution width respectively; in this embodiment the AXI4 read/write data width is 64 bits and the fixed-point convolution is 16 bits, so results are written into the FIFOs as 16-bit data and read out as 64-bit data. The level-by-level accumulation proceeds as follows: on the eleventh clock the results of each pair of adjacent PCUs are added, giving 32 level-1 sums; on the twelfth clock each pair of adjacent level-1 sums is fed to the level-2 accumulators, and the resulting 16 level-2 sums are stored in FIFOs; and so on, through six accumulation levels in total, until a single value is output, equivalent to the sum of all 64 PCU convolution results.
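
A Python model of this six-level pairwise accumulation (our sketch; only levels 2 through 6 are buffered, the 32 level-1 sums being consumed immediately):

```python
def accumulate_tree(pcu_results):
    """Pairwise-reduce 64 PCU outputs; return the buffered levels 2..6."""
    assert len(pcu_results) == 64
    buffered = []                         # 16 + 8 + 4 + 2 + 1 = 31 FIFO values
    level = pcu_results
    for depth in range(1, 7):             # six accumulation levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        if depth >= 2:                    # the 32 level-1 sums are not stored
            buffered.append(list(level))
    return buffered                       # groups of 16, 8, 4, 2, 1 sums
```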

After the computation finishes, the state controller jumps to the sending state; according to the CPU instruction it reads the corresponding FIFO data, applies the ReLU operation if the instruction requires it, and sends the data over the AXI4 bus back to the MCU to be stored in the external memory.

Compared with accelerating the convolution operation directly, unrolling the convolution into vector multiply-accumulate operations weakens the impact of mismatches between the network structure and the accelerator structure. The state transition conditions of the state controller governing the above steps in this embodiment are as follows:

On initialization the state machine enters the waiting state, awaiting the MCU instruction signal (ram_flag). When ram_flag is 001 it enters the write-convolution-kernel state; when ram_flag is 010 it enters the write-feature-map state. After the data caching finishes and the MCU sends the computation-start instruction, the state machine jumps into the convolution computation state; when the computation finishes it jumps into the result-sending state; when the send state ends it returns to the waiting state, awaiting the next data write.
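
Rendered as a small transition function (a sketch; the state names are ours, while the ram_flag encodings follow the text):

```python
def next_state(state, ram_flag=0b000, done=False):
    """State transitions of the controller described above."""
    if state == "WAIT":
        if ram_flag == 0b001: return "WRITE_KERNEL"
        if ram_flag == 0b010: return "WRITE_FMAP"
    elif state in ("WRITE_KERNEL", "WRITE_FMAP"):
        if ram_flag & 0b100:              # 1XX: MCU issues compute-start
            return "COMPUTE"
    elif state == "COMPUTE" and done:     # all windows processed
        return "SEND"
    elif state == "SEND" and done:        # results drained to the bus
        return "WAIT"
    return state
```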

Write feature map: upon entering this state, the enables of all feature map RAMs are asserted. When the data-valid signal on the AXI4 bus is asserted, data reception begins; the RAM write address accumulates from 0, and each time it reaches 441 it is cleared and accumulates from 0 again, indicating that the next block feature map is being cached, until all 64 feature map RAMs are filled. On completion, the controller waits for the MCU instruction signal ram_flag; when ram_flag is 1XX it enters the computation state.

Write convolution kernel: upon entering this state, the enables of all convolution kernel RAMs are asserted. When the data-valid signal on the AXI4 bus is asserted, data reception begins; the RAM write address accumulates from 0, and each time it reaches 3 it is cleared and accumulates from 0 again, indicating that the next convolution kernel is being cached, until all 64 convolution kernel RAMs are filled. On completion, the controller waits for the MCU instruction signal ram_flag; when ram_flag is 1XX it enters the computation state.

Convolution computation: upon entering this state, data is read in parallel from the feature map RAMs and the convolution kernel RAMs. The read address of the convolution kernel RAM starts from 0 and increments by 1 every 4 clocks, restarting after it reaches 3; the read address of the feature map RAM is produced by the address generator, and the corresponding data is read from each RAM according to these addresses. After every 9 accumulations, one clock is stalled to stay synchronized with the PCU output. When the feature map RAM read address reaches 1763, all data has been processed; both sets of read addresses are then reset, and on the next clock the controller enters the result-sending state.
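
The kernel-RAM read schedule can be sketched as follows (the wrap back to address 0 once the counter reaches 3 is our reading of the text, consistent with the depth-3 kernel RAM whose valid addresses are 0..2):

```python
def kernel_read_addr(clk):
    """Kernel RAM read address: starts at 0, +1 every 4 clocks, resets at 3."""
    return (clk // 4) % 3

assert [kernel_read_addr(c) for c in range(12)] == [0]*4 + [1]*4 + [2]*4
```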

Result sending: the data to send is selected according to the layer indication (from the external processor), which determines over how many convolution results the required data is accumulated and hence from which FIFO to fetch; then, according to the RELU signal (from the external processor), the ReLU operation is applied or skipped before the data is sent to the bus.
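
The optional activation on the way out is just a conditional ReLU; a one-line sketch (ours):

```python
def send_value(x, relu):
    """Apply ReLU when the external RELU signal is set, then put x on the bus."""
    return max(x, 0) if relu else x
```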

The embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (7)

1. An FPGA-based general-purpose convolutional neural network accelerator, characterized in that it comprises an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer; the convolution calculator is implemented on an FPGA and contains N convolution calculation sub-units, the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer; the MCU is connected to the external memory, reads input data from the external memory, and sends the convolution results to the external memory; the feature map buffer and the convolution kernel buffer cache the feature map and convolution kernel data read from the MCU over the AXI4 bus interface; the address generator generates the read addresses of the feature map buffer, and the data read from the feature map buffer at those addresses is fed into the convolution calculator; the convolution calculator reads the data in the feature map buffer and the convolution kernel buffer, performs the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels; the segmented result buffer contains N/2 − 1 result sub-buffers for storing the accumulation results of each level output by the convolution calculator, and transfers these results to the MCU over the AXI4 bus interface; the state controller controls the working states of the whole accelerator and realizes the transitions between them.
CN201810413101.XA | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA | Pending | CN108805272A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810413101.XA (CN108805272A) | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810413101.XA (CN108805272A) | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA

Publications (1)

Publication Number | Publication Date
CN108805272A | 2018-11-13

Family

ID=64093618

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810413101.XA (CN108805272A, Pending) | A kind of general convolutional neural networks accelerator based on FPGA | 2018-05-03 | 2018-05-03

Country Status (1)

Country | Link
CN (1) | CN108805272A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20110029471A1 (en)* | 2009-07-30 | 2011-02-03 | Nec Laboratories America, Inc. | Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN106875012A (en)* | 2017-02-09 | 2017-06-20 | 武汉魅瞳科技有限公司 | A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN107392309A (en)* | 2017-09-11 | 2017-11-24 | 东南大学—无锡集成电路技术研究所 | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN107657581A (en)* | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | A convolutional neural network (CNN) hardware accelerator and acceleration method

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109146067B (en)* | 2018-11-19 | 2021-11-05 | 东北大学 | An FPGA-based Policy Convolutional Neural Network Accelerator
CN109146067A (en)* | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA
CN109598338B (en)* | 2018-12-07 | 2023-05-19 | 东南大学 | An FPGA-Based Computationally Optimized Convolutional Neural Network Accelerator
CN109598338A (en)* | 2018-12-07 | 2019-04-09 | 东南大学 | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN110147251A (en)* | 2019-01-28 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Architecture, chip and computing method for computing neural network models
CN109934339B (en)* | 2019-03-06 | 2023-05-16 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
CN109934339A (en)* | 2019-03-06 | 2019-06-25 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
CN110097174A (en)* | 2019-04-22 | 2019-08-06 | 西安交通大学 | Implementation method, system and device of convolutional neural network based on FPGA and line output priority
CN110097174B (en)* | 2019-04-22 | 2021-04-20 | 西安交通大学 | Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110334801A (en)* | 2019-05-09 | 2019-10-15 | 苏州浪潮智能科技有限公司 | A kind of hardware-accelerated method, apparatus, equipment and the system of convolutional neural networks
WO2020258528A1 (en)* | 2019-06-25 | 2020-12-30 | 东南大学 | Configurable universal convolutional neural network accelerator
CN111104124A (en)* | 2019-11-07 | 2020-05-05 | 北京航空航天大学 | A rapid deployment method of convolutional neural network based on Pytorch framework on FPGA
CN111104124B (en)* | 2019-11-07 | 2021-07-20 | 北京航空航天大学 | A rapid deployment method of convolutional neural network based on Pytorch framework on FPGA
WO2021109699A1 (en)* | 2019-12-04 | 2021-06-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, device, chip and data processing method
CN110991634B (en)* | 2019-12-04 | 2022-05-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, equipment, chip and data processing method
CN110991634A (en)* | 2019-12-04 | 2020-04-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, equipment, chip and data processing method
US20220051088A1 (en)* | 2019-12-04 | 2022-02-17 | Tencent Technology (Shenzhen) Company Ltd | Artificial intelligence accelerator, artificial intelligence acceleration device, artificial intelligence acceleration chip, and data processing method
CN113095471B (en)* | 2020-01-09 | 2024-05-07 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model
CN113095471A (en)* | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model
CN110928693B (en)* | 2020-01-23 | 2021-01-15 | 飞诺门阵(北京)科技有限公司 | Computing equipment and resource allocation method
US11706076B2 | 2020-01-23 | 2023-07-18 | Novnet Computing System Tech Co., Ltd. | Computer system with computing devices, communication device, task processing device
CN110928693A (en)* | 2020-01-23 | 2020-03-27 | 飞诺门阵(北京)科技有限公司 | Computing equipment and resource allocation method
CN111416743A (en)* | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium
CN111860540A (en)* | 2020-07-20 | 2020-10-30 | 深圳大学 | FPGA-based neural network image feature extraction system
CN111860540B (en)* | 2020-07-20 | 2024-01-12 | 深圳大学 | Neural network image feature extraction system based on FPGA
CN114548361A (en)* | 2020-11-24 | 2022-05-27 | 三星电子株式会社 | Neural network device and method of operating the same
CN112508184B (en)* | 2020-12-16 | 2022-04-29 | 重庆邮电大学 | Design method of fast image recognition accelerator based on convolutional neural network
CN112508184A (en)* | 2020-12-16 | 2021-03-16 | 重庆邮电大学 | Design method of fast image recognition accelerator based on convolutional neural network
CN112801285B (en)* | 2021-02-04 | 2024-01-26 | 南京微毫科技有限公司 | FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
CN112801285A (en)* | 2021-02-04 | 2021-05-14 | 南京微毫科技有限公司 | High-resource-utilization-rate CNN accelerator based on FPGA and acceleration method thereof
CN113869494A (en)* | 2021-09-28 | 2021-12-31 | 天津大学 | High-level synthesis based neural network convolution FPGA embedded hardware accelerator
CN114638352A (en)* | 2022-05-18 | 2022-06-17 | 成都登临科技有限公司 | A processor architecture, processor and electronic device
CN115481721A (en)* | 2022-09-02 | 2022-12-16 | 浙江大学 | A Novel Psum Computing Circuit for Convolutional Neural Networks
CN117131910A (en)* | 2023-02-23 | 2023-11-28 | 华东师范大学 | A convolution accelerator based on the expansion of the RISC-V instruction set architecture and a method for accelerating convolution operations

Similar Documents

Publication | Title
CN108805272A (en) | A kind of general convolutional neural networks accelerator based on FPGA
US11544191B2 (en) | Efficient hardware architecture for accelerating grouped convolutions
CN104915322B (en) | A kind of hardware-accelerated method of convolutional neural networks
CN109934339B (en) | A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
CN107392309A (en) | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN105843775B (en) | On piece data divide reading/writing method, system and its apparatus
US20210216871A1 (en) | Fast Convolution over Sparse and Quantization Neural Network
US20190026626A1 (en) | Neural network accelerator and operation method thereof
CN110458279A (en) | An FPGA-based binary neural network acceleration method and system
CN111459877A (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration
CN104899182A (en) | Matrix multiplication acceleration method for supporting variable blocks
CN110348574A (en) | A general convolutional neural network acceleration structure and design method based on ZYNQ
WO2022037257A1 (en) | Convolution calculation engine, artificial intelligence chip, and data processing method
CN107169563A (en) | Processing system and method applied to two-value weight convolutional network
CN111768458A (en) | Sparse image processing method based on convolutional neural network
CN108897716A (en) | By memory read/write operation come the data processing equipment and method of Reduction Computation amount
CN117632844A (en) | Reconfigurable AI algorithm hardware accelerator
Zhang et al. | YOLOv3-tiny object detection SOC based on FPGA platform
Wang et al. | Accelerating on-line training of LS-SVM with run-time reconfiguration
CN111445019A (en) | A device and method for realizing channel shuffling operation in packet convolution
CN107273099A (en) | A kind of AdaBoost algorithms accelerator and control method based on FPGA
CN109710562A (en) | A configurable and high-speed FPGA configuration circuit and implementation method based on SELECTMAP
CN114691083B (en) | Matrix multiplication circuit, method and related product
Huang et al. | A low-bit quantized and hls-based neural network fpga accelerator for object detection
Zhang et al. | Three-level memory access architecture for FPGA-based real-time remote sensing image processing system

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181113
