CN108805272A - A kind of general convolutional neural networks accelerator based on FPGA - Google Patents

A kind of general convolutional neural networks accelerator based on FPGA

Info

Publication number
CN108805272A
CN108805272A (Application CN201810413101.XA)
Authority
CN
China
Prior art keywords
buffer area
convolution
convolution kernel
fpga
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810413101.XA
Other languages
Chinese (zh)
Inventor
陆生礼
泮雯雯
庞伟
范雪梅
李宇峰
杨文韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Original Assignee
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University - Wuxi Institute Of Technology Integrated Circuits, Southeast University
Priority to CN201810413101.XA
Publication of CN108805272A
Legal status: Pending

Abstract

Translated from Chinese

The invention discloses an FPGA-based general-purpose convolutional neural network accelerator, comprising an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer. The convolution accelerator is implemented on an FPGA and contains N convolution calculation sub-units; the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer. The convolution calculator reads the data in the feature map buffer and the convolution kernel buffer to perform the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels; the segmented result buffer stores the accumulation results of each level output by the convolution calculator. The invention supports a variety of convolutional neural network structures, offers good generality, requires few on-chip storage resources, and incurs low communication overhead.

Description

Translated from Chinese

A general-purpose convolutional neural network accelerator based on FPGA

Technical Field

The present invention relates to the technical fields of electronic information and deep learning, and in particular to a general-purpose convolutional neural network accelerator.

Background Art

In recent years, deep neural networks have received extensive attention; convolutional neural network models in particular have been widely applied in computer vision, for example in image classification, face detection and recognition, and text recognition. In 2013, MIT Technology Review ranked deep learning, represented by convolutional neural networks, first among its ten breakthrough technologies. Inspired by the human visual system, convolutional neural network algorithms use the convolution operation to model the receptive fields of neurons. Deep neural networks are extremely compute-intensive, reaching tens of GOPS in tasks such as detection and recognition. Besides the computation, neural networks have millions or even hundreds of millions of parameters that must be stored, so real-time detection and recognition with deep neural networks has had to rely on high-performance multi-core CPUs (Central Processing Units) and GPUs (Graphics Processing Units); for devices with limited power and volume, especially mobile devices (such as robots, consumer electronics, and smart cars), porting neural network models is almost impossible. Building dedicated acceleration circuits from general-purpose devices to meet the computing and storage requirements of convolutional neural networks is therefore a feasible path.

At present, besides the GPU, the mainstream acceleration hardware comprises FPGAs and ASICs (Application-Specific Integrated Circuits). Although an ASIC offers high performance and low power consumption, it must be custom-designed for each specific application, giving low design flexibility and high up-front development cost. In recent years, the development of high-level synthesis tools for FPGAs has greatly eased FPGA design, substantially shortening the development cycle without sacrificing performance. As programmable standard devices adaptable to different functions, FPGAs avoid such high development costs and retain a degree of flexibility. In addition, FPGAs are compact, flexible, low-power, and highly parallel, which suits neural network workloads well; applying FPGAs on mobile platforms to perform the convolution computation of convolutional neural networks is therefore an effective solution.

Summary of the Invention

To solve the technical problems described in the background above, the present invention aims to provide an FPGA-based general-purpose convolutional neural network accelerator that supports various convolutional neural network structures, offers good generality, requires few on-chip storage resources, and incurs low communication overhead.

To achieve the above technical objective, the technical solution of the present invention is as follows:

An FPGA-based general-purpose convolutional neural network accelerator comprises an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer. The convolution calculator is implemented on an FPGA and contains N convolution calculation sub-units; the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer. The MCU is connected to the external memory; it reads input data from the external memory and sends the convolution results back to it. The feature map buffer and the convolution kernel buffer cache the feature map and convolution kernel data read from the MCU over the AXI4 bus interface. The address generator produces the read addresses of the feature map buffer, and the data read from the feature map buffer at those addresses is fed into the convolution calculator. The convolution calculator reads the data in the feature map buffer and the convolution kernel buffer, performs the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels. The segmented result buffer contains N/2 − 1 result sub-buffers; it stores the accumulation results of each level output by the convolution calculator and transfers them to the MCU over the AXI4 bus interface. The state controller governs the working states of the whole accelerator and realizes the transitions between them.

In a preferred embodiment of the above solution, the write addresses of the feature map buffer and the convolution kernel buffer are generated by accumulation in the state controller, and the read address of the convolution kernel buffer is generated by incrementing once every four clocks.

In a preferred embodiment of the above solution, the generation rule of the address generator follows the unrolling of the convolution matrix: the read addresses of the feature map buffer are produced from the feature map size, the convolution kernel size, and the convolution window stride.

In a preferred embodiment of the above solution, the convolution calculator accumulates the results of adjacent convolution calculation sub-units over log2 N levels, and stores the 2nd-, 3rd-, 4th-, 5th-, ..., log2 N-th level accumulation results in the segmented result buffer.

In a preferred embodiment of the above solution, the segmented result buffer contains N/2 − 1 result sub-buffers, divided into groups of N/4, N/8, N/16, N/32, ..., 1, which store in turn the accumulation results of the 2nd, 3rd, 4th, 5th, ..., log2 N-th levels.
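
As a worked check of this grouping (our arithmetic; the sub-buffer count itself was lost in extraction and is recovered from the group sizes), summing the groups gives the total number of result sub-buffers:

$$\frac{N}{4} + \frac{N}{8} + \cdots + 2 + 1 = \frac{N}{2} - 1,$$

which for the N = 64 embodiment described below is 16 + 8 + 4 + 2 + 1 = 31, matching the 31 FIFOs of the segmented result buffer.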

In a preferred embodiment of the above solution, the convolution calculator is a general-purpose fixed-point calculator. The data to be processed enter the convolution calculator in parallel, N pairs of convolution kernels and feature maps per clock, with DSP blocks serving as the arithmetic units performing multiply-accumulate. The convolution calculator outputs a convolution result on the tenth clock: the first nine clocks perform the multiply-accumulate operations, and on the tenth clock the bias is added and the result is sent to the subsequent multi-level accumulator.
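
A minimal Python sketch of this ten-clock schedule for one 3 × 3 window (an illustrative model only; names such as `pcu_3x3` are ours, not the patent's):

```python
def pcu_3x3(window, kernel, bias):
    """Model one convolution calculation sub-unit (PCU) for a 3x3 kernel.

    Clocks 1-9: one multiply-accumulate per clock over the nine
    (feature, weight) pairs; clock 10: add the bias and emit the result
    to the multi-level accumulator.
    """
    acc = 0
    for clk in range(9):                  # clocks 1..9: MAC operations
        acc += window[clk] * kernel[clk]
    return acc + bias                     # clock 10: bias add, output
```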

In a preferred embodiment of the above solution, the AXI4 bus interface concatenates several data words into a single transfer to increase throughput.

Beneficial effects of the above technical solution:

The invention shortens the communication time between the FPGA and the MCU and speeds up data transfer; the address generator effectively avoids re-sending duplicate data, further reducing data streaming time. The block feature map size is planned according to the FPGA's on-chip storage so that resource utilization is maximized. The result caching mechanism is improved to be more general: it is no longer constrained by the number of network layers and avoids unnecessary buffering of intermediate values.

The invention has been verified in a road-condition detection application based on the Yolo algorithm: at a working frequency of 100 MHz, with parallelism N = 64 and 16-bit fixed-point data, the real-time detection speed on 640 pixels × 640 pixels images reaches 3-4 FPS.

Description of the Drawings

Fig. 1 is a schematic structural diagram of the present invention;

Fig. 2 is a schematic diagram of the data buffer structure of the present invention;

Fig. 3 is a working schematic diagram of the convolution calculator of the present invention;

Fig. 4 is a schematic diagram of the accumulation of the convolution results of the present invention.

Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 shows the structure of the general-purpose convolutional neural network accelerator designed by the present invention; taking parallelism N = 64 as an example, it works as follows.

The MCU first reads the 16-bit input feature maps, convolution kernels, and biases from the external memory, concatenates every four 16-bit parameters in order into one 64-bit word, and sends it through the direct memory access controller and the AXI4 bus into the data buffers (i.e., the feature map buffer and the convolution kernel buffer).
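
A hedged sketch of this 4-to-1 packing (plain Python, not the actual bus logic; the lane order, with the first parameter in the low bits, is our assumption — the patent only fixes the 4 × 16-bit to 64-bit ratio):

```python
def pack64(params16):
    """Pack four 16-bit parameters, in order, into one 64-bit AXI4 word."""
    assert len(params16) == 4
    word = 0
    for i, p in enumerate(params16):
        word |= (p & 0xFFFF) << (16 * i)  # parameter i -> bits [16i, 16i+15]
    return word

# e.g. pack64([0x0001, 0x0002, 0x0003, 0x0004]) == 0x0004000300020001
```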

Referring to Fig. 2, the caching order of the data to be processed is controlled by the state controller: the feature map data is cached first. According to the size of each buffer, the block feature map is set to 42 × 42 with a bit width of 16 bits. Since the AXI4 bus width is set to 64 bits, the SRAM write and read widths are in a 4:1 ratio; a simple dual-port RAM is chosen as the feature map sub-buffer, with a write width of 64 bits and write depth of 441, and a read width of 16 bits and read depth of 1764. The feature map buffer consists of 64 such sub-buffers, so one data transfer can accept 42 × 42 × 64 feature map parameters. After the feature map data has been received, the state controller switches to the convolution kernel caching mode. The convolution kernel size is 3 × 3 and each kernel is paired with one bias parameter, so 10 16-bit parameters are cached per kernel; since the AXI4 bus transfers 64 bits, two 16-bit zero words are padded, giving 12 16-bit parameters in total, and the convolution kernel buffer has both read and write widths of 64 bits with a depth of 3. The convolution kernel buffer consists of 64 such sub-buffers, matched one-to-one with the feature map sub-buffers.
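
The sizing arithmetic above can be checked directly; a short consistency check (ours, not part of the patent):

```python
TILE = 42                       # 42x42 block feature map, 16-bit entries
entries = TILE * TILE           # 16-bit words per feature map sub-buffer
assert entries == 1764          # read side: 16 bits wide, depth 1764
assert entries // 4 == 441      # write side: 64 bits wide, depth 1764/4 = 441

words = 9 + 1                   # 3x3 kernel weights + 1 bias = 10 words
padded = words + 2              # pad with two zero words -> 12 words
assert padded * 16 // 64 == 3   # exactly three 64-bit beats: kernel RAM depth 3
```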

After data caching completes, the state controller enters the compute mode and starts the address generator, which, according to the convolution window stride, produces addresses such as 0, 1, 2, 42, 43, 44, 84, 85, 86; 1, 2, 3, 43, 44, 45, 85, 86, 87; ...; 1677, 1678, 1679, 1719, 1720, 1721, 1761, 1762, 1763. At the same time the buffer read enables are asserted, so that convolution kernel parameters and feature map data are read simultaneously from the kernel and feature map buffers and fed into the convolution calculator for the convolution operation.
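
A minimal sketch of the address generation rule (assuming stride 1 on the 42 × 42 tile with a 3 × 3 window, as in this embodiment) reproduces the sequences listed above:

```python
def fmap_read_addresses(fm=42, k=3, stride=1):
    """Yield feature-map RAM read addresses, one kxk window at a time."""
    for row in range(0, fm - k + 1, stride):      # window top-left row
        for col in range(0, fm - k + 1, stride):  # window top-left column
            for kr in range(k):
                for kc in range(k):
                    yield (row + kr) * fm + (col + kc)

addrs = list(fmap_read_addresses())
assert addrs[:9] == [0, 1, 2, 42, 43, 44, 84, 85, 86]
assert addrs[9:18] == [1, 2, 3, 43, 44, 45, 85, 86, 87]
assert addrs[-9:] == [1677, 1678, 1679, 1719, 1720, 1721, 1761, 1762, 1763]
```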

Referring to Fig. 3, the data to be processed enter the convolution calculator in parallel, 64 pairs of convolution kernels and feature maps per clock. For a 3 × 3 convolution, the calculator needs ten clocks to produce a result: the first nine clocks perform the multiply-accumulate operations, and on the tenth clock the bias is added and the result is sent to the subsequent accumulators, with the level-by-level accumulation results cached in the segmented convolution result buffer.

Referring to Fig. 4, there are 64 convolution calculation sub-units (Processing Convolution Units, PCUs). Each PCU adds the bias on the tenth clock and outputs its result to the subsequent accumulators, and the level-by-level accumulation results are cached in the segmented convolution result buffer. The segmented convolution result buffer consists of 31 FIFOs, divided in turn into groups of 16, 8, 4, 2, and 1, to store the level-by-level accumulation results of the convolution calculation units; the write enable of the segmented convolution result buffer is given by the convolution calculator, and the readout logic is controlled by the state controller. The read and write widths of the segmented convolution result buffer are determined by the AXI4 bus data width and the fixed-point convolution width respectively; in this embodiment the AXI4 read/write data width is 64 bits and the fixed-point convolution is 16 bits, so results are written into the FIFOs as 16-bit data and read out as 64-bit data. The level-by-level accumulation proceeds as follows: on the eleventh clock the results of each pair of adjacent PCUs are added, giving 32 level-1 sums; on the twelfth clock each pair of adjacent level-1 sums is fed to the level-2 accumulators, and the resulting 16 level-2 sums are stored in FIFOs; and so on, through six accumulation levels in total, until a single value is output, equivalent to the sum of all 64 PCU convolution results.
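
A Python model of this six-level pairwise accumulation (our sketch; only levels 2 through 6 are buffered, the 32 level-1 sums being consumed immediately):

```python
def accumulate_tree(pcu_results):
    """Pairwise-reduce 64 PCU outputs; return the buffered levels 2..6."""
    assert len(pcu_results) == 64
    buffered = []                         # 16 + 8 + 4 + 2 + 1 = 31 FIFO values
    level = pcu_results
    for depth in range(1, 7):             # six accumulation levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        if depth >= 2:                    # the 32 level-1 sums are not stored
            buffered.append(list(level))
    return buffered                       # groups of 16, 8, 4, 2, 1 sums
```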

After the computation finishes, the state controller jumps to the sending state; according to the CPU instruction it reads the corresponding FIFO data, applies the ReLU operation if the instruction requires it, and sends the data over the AXI4 bus back to the MCU to be stored in the external memory.

Compared with accelerating the convolution operation directly, unrolling the convolution into vector multiply-accumulate operations weakens the impact of mismatches between the network structure and the accelerator structure. The state transition conditions of the state controller governing the above steps in this embodiment are as follows:

On initialization the state machine enters the waiting state, awaiting the MCU instruction signal (ram_flag). When ram_flag is 001 it enters the write-convolution-kernel state; when ram_flag is 010 it enters the write-feature-map state. After the data caching finishes and the MCU sends the computation-start instruction, the state machine jumps into the convolution computation state; when the computation finishes it jumps into the result-sending state; when the send state ends it returns to the waiting state, awaiting the next data write.
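
Rendered as a small transition function (a sketch; the state names are ours, while the ram_flag encodings follow the text):

```python
def next_state(state, ram_flag=0b000, done=False):
    """State transitions of the controller described above."""
    if state == "WAIT":
        if ram_flag == 0b001: return "WRITE_KERNEL"
        if ram_flag == 0b010: return "WRITE_FMAP"
    elif state in ("WRITE_KERNEL", "WRITE_FMAP"):
        if ram_flag & 0b100:              # 1XX: MCU issues compute-start
            return "COMPUTE"
    elif state == "COMPUTE" and done:     # all windows processed
        return "SEND"
    elif state == "SEND" and done:        # results drained to the bus
        return "WAIT"
    return state
```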

Write feature map: upon entering this state, the enables of all feature map RAMs are asserted. When the data-valid signal on the AXI4 bus is asserted, data reception begins; the RAM write address accumulates from 0, and each time it reaches 441 it is cleared and accumulates from 0 again, indicating that the next block feature map is being cached, until all 64 feature map RAMs are filled. On completion, the controller waits for the MCU instruction signal ram_flag; when ram_flag is 1XX it enters the computation state.

Write convolution kernel: upon entering this state, the enables of all convolution kernel RAMs are asserted. When the data-valid signal on the AXI4 bus is asserted, data reception begins; the RAM write address accumulates from 0, and each time it reaches 3 it is cleared and accumulates from 0 again, indicating that the next convolution kernel is being cached, until all 64 convolution kernel RAMs are filled. On completion, the controller waits for the MCU instruction signal ram_flag; when ram_flag is 1XX it enters the computation state.

Convolution computation: upon entering this state, data is read in parallel from the feature map RAMs and the convolution kernel RAMs. The read address of the convolution kernel RAM starts from 0 and increments by 1 every 4 clocks, restarting after it reaches 3; the read address of the feature map RAM is produced by the address generator, and the corresponding data is read from each RAM according to these addresses. After every 9 accumulations, one clock is stalled to stay synchronized with the PCU output. When the feature map RAM read address reaches 1763, all data has been processed; both sets of read addresses are then reset, and on the next clock the controller enters the result-sending state.
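
The kernel-RAM read schedule can be sketched as follows (the wrap back to address 0 once the counter reaches 3 is our reading of the text, consistent with the depth-3 kernel RAM whose valid addresses are 0..2):

```python
def kernel_read_addr(clk):
    """Kernel RAM read address: starts at 0, +1 every 4 clocks, resets at 3."""
    return (clk // 4) % 3

assert [kernel_read_addr(c) for c in range(12)] == [0]*4 + [1]*4 + [2]*4
```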

Result sending: the data to send is selected according to the layer indication (from the external processor), which determines over how many convolution results the required data is accumulated and hence from which FIFO to fetch; then, according to the RELU signal (from the external processor), the ReLU operation is applied or skipped before the data is sent to the bus.
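
The optional activation on the way out is just a conditional ReLU; a one-line sketch (ours):

```python
def send_value(x, relu):
    """Apply ReLU when the external RELU signal is set, then put x on the bus."""
    return max(x, 0) if relu else x
```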

The embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (7)

1. An FPGA-based general-purpose convolutional neural network accelerator, characterized in that it comprises an MCU, an AXI4 bus interface, an address generator, a state controller, a feature map buffer, a convolution kernel buffer, a convolution calculator, and a segmented result buffer; the convolution calculator is implemented on an FPGA and contains N convolution calculation sub-units, the feature map buffer and the convolution kernel buffer contain N feature map sub-buffers and N convolution kernel sub-buffers respectively, and each convolution calculation sub-unit is paired with one feature map sub-buffer and one convolution kernel sub-buffer; the MCU is connected to the external memory, reads input data from the external memory, and sends the convolution results to the external memory; the feature map buffer and the convolution kernel buffer cache the feature map and convolution kernel data read from the MCU over the AXI4 bus interface; the address generator generates the read addresses of the feature map buffer, and the data read from the feature map buffer at those addresses is fed into the convolution calculator; the convolution calculator reads the data in the feature map buffer and the convolution kernel buffer, performs the convolution calculation, and accumulates the results of adjacent convolution calculation sub-units over multiple levels; the segmented result buffer contains N/2 − 1 result sub-buffers for storing the accumulation results of each level output by the convolution calculator, and transfers these results to the MCU over the AXI4 bus interface; the state controller controls the working states of the whole accelerator and realizes the transitions between them.
CN201810413101.XA | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA | Pending | CN108805272A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810413101.XA (CN108805272A) | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810413101.XA (CN108805272A) | 2018-05-03 | 2018-05-03 | A kind of general convolutional neural networks accelerator based on FPGA

Publications (1)

Publication Number | Publication Date
CN108805272A | 2018-11-13

Family

ID=64093618

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810413101.XA (CN108805272A, Pending) | A kind of general convolutional neural networks accelerator based on FPGA | 2018-05-03 | 2018-05-03

Country Status (1)

Country | Link
CN (1) | CN108805272A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20110029471A1 (en)* | 2009-07-30 | 2011-02-03 | Nec Laboratories America, Inc. | Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN106875012A (en)* | 2017-02-09 | 2017-06-20 | 武汉魅瞳科技有限公司 | A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN107392309A (en)* | 2017-09-11 | 2017-11-24 | 东南大学—无锡集成电路技术研究所 | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN107657581A (en)* | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | A convolutional neural network (CNN) hardware accelerator and acceleration method

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109146067B (en)* | 2018-11-19 | 2021-11-05 | 东北大学 | An FPGA-based Policy Convolutional Neural Network Accelerator
CN109146067A (en)* | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA
CN109598338B (en)* | 2018-12-07 | 2023-05-19 | 东南大学 | An FPGA-Based Computationally Optimized Convolutional Neural Network Accelerator
CN109598338A (en)* | 2018-12-07 | 2019-04-09 | 东南大学 | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN110147251A (en)* | 2019-01-28 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Architecture, chip and computing method for computing neural network models
CN109934339B (en)* | 2019-03-06 | 2023-05-16 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
CN109934339A (en)* | 2019-03-06 | 2019-06-25 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
CN110097174A (en)* | 2019-04-22 | 2019-08-06 | 西安交通大学 | Implementation method, system and device of convolutional neural network based on FPGA and line output priority
CN110097174B (en)* | 2019-04-22 | 2021-04-20 | 西安交通大学 | Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110334801A (en)* | 2019-05-09 | 2019-10-15 | 苏州浪潮智能科技有限公司 | A kind of hardware-accelerated method, apparatus, equipment and the system of convolutional neural networks
WO2020258528A1 (en)* | 2019-06-25 | 2020-12-30 | 东南大学 | Configurable universal convolutional neural network accelerator
CN111104124A (en)* | 2019-11-07 | 2020-05-05 | 北京航空航天大学 | A rapid deployment method of convolutional neural network based on Pytorch framework on FPGA
CN111104124B (en)* | 2019-11-07 | 2021-07-20 | 北京航空航天大学 | A rapid deployment method of convolutional neural network based on Pytorch framework on FPGA
WO2021109699A1 (en)* | 2019-12-04 | 2021-06-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, device, chip and data processing method
CN110991634B (en)* | 2019-12-04 | 2022-05-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, equipment, chip and data processing method
CN110991634A (en)* | 2019-12-04 | 2020-04-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence accelerator, equipment, chip and data processing method
US20220051088A1 (en)* | 2019-12-04 | 2022-02-17 | Tencent Technology (Shenzhen) Company Ltd | Artificial intelligence accelerator, artificial intelligence acceleration device, artificial intelligence acceleration chip, and data processing method
CN113095471B (en)* | 2020-01-09 | 2024-05-07 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model
CN113095471A (en)* | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model
CN110928693B (en)* | 2020-01-23 | 2021-01-15 | 飞诺门阵(北京)科技有限公司 | Computing equipment and resource allocation method
US11706076B2 | 2020-01-23 | 2023-07-18 | Novnet Computing System Tech Co., Ltd. | Computer system with computing devices, communication device, task processing device
CN110928693A (en)* | 2020-01-23 | 2020-03-27 | 飞诺门阵(北京)科技有限公司 | Computing equipment and resource allocation method
CN111416743A (en)* | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium
CN111860540A (en)* | 2020-07-20 | 2020-10-30 | 深圳大学 | FPGA-based neural network image feature extraction system
CN111860540B (en)* | 2020-07-20 | 2024-01-12 | 深圳大学 | Neural network image feature extraction system based on FPGA
CN114548361A (en)* | 2020-11-24 | 2022-05-27 | 三星电子株式会社 | Neural network device and method of operating the same
CN112508184B (en)* | 2020-12-16 | 2022-04-29 | 重庆邮电大学 | Design method of fast image recognition accelerator based on convolutional neural network
CN112508184A (en)* | 2020-12-16 | 2021-03-16 | 重庆邮电大学 | Design method of fast image recognition accelerator based on convolutional neural network
CN112801285B (en)* | 2021-02-04 | 2024-01-26 | 南京微毫科技有限公司 | FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
CN112801285A (en)* | 2021-02-04 | 2021-05-14 | 南京微毫科技有限公司 | High-resource-utilization-rate CNN accelerator based on FPGA and acceleration method thereof
CN113869494A (en)* | 2021-09-28 | 2021-12-31 | 天津大学 | High-level synthesis based neural network convolution FPGA embedded hardware accelerator
CN114638352A (en)* | 2022-05-18 | 2022-06-17 | 成都登临科技有限公司 | A processor architecture, processor and electronic device
CN115481721A (en)* | 2022-09-02 | 2022-12-16 | 浙江大学 | A Novel Psum Computing Circuit for Convolutional Neural Networks
CN117131910A (en)* | 2023-02-23 | 2023-11-28 | 华东师范大学 | A convolution accelerator based on the expansion of the RISC-V instruction set architecture and a method for accelerating convolution operations

Similar Documents

Publication | Title
CN108805272A (en) | A kind of general convolutional neural networks accelerator based on FPGA
US11544191B2 (en) | Efficient hardware architecture for accelerating grouped convolutions
CN104915322B (en) | A kind of hardware-accelerated method of convolutional neural networks
CN109934339B (en) | A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
CN107392309A (en) | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN105843775B (en) | On piece data divide reading/writing method, system and its apparatus
US20210216871A1 (en) | Fast Convolution over Sparse and Quantization Neural Network
US20190026626A1 (en) | Neural network accelerator and operation method thereof
CN110458279A (en) | An FPGA-based binary neural network acceleration method and system
CN111459877A (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration
CN104899182A (en) | Matrix multiplication acceleration method for supporting variable blocks
CN110348574A (en) | A general convolutional neural network acceleration structure and design method based on ZYNQ
WO2022037257A1 (en) | Convolution calculation engine, artificial intelligence chip, and data processing method
CN107169563A (en) | Processing system and method applied to two-value weight convolutional network
CN111768458A (en) | Sparse image processing method based on convolutional neural network
CN108897716A (en) | By memory read/write operation come the data processing equipment and method of Reduction Computation amount
CN117632844A (en) | Reconfigurable AI algorithm hardware accelerator
Zhang et al. | YOLOv3-tiny object detection SOC based on FPGA platform
Wang et al. | Accelerating on-line training of LS-SVM with run-time reconfiguration
CN111445019A (en) | A device and method for realizing channel shuffling operation in packet convolution
CN107273099A (en) | A kind of AdaBoost algorithms accelerator and control method based on FPGA
CN109710562A (en) | A configurable and high-speed FPGA configuration circuit and implementation method based on SELECTMAP
CN114691083B (en) | Matrix multiplication circuit, method and related product
Huang et al. | A low-bit quantized and hls-based neural network fpga accelerator for object detection
Zhang et al. | Three-level memory access architecture for FPGA-based real-time remote sensing image processing system

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181113
