CN114723029A - DCNN accelerator based on hybrid multi-row data flow strategy - Google Patents

DCNN accelerator based on hybrid multi-row data flow strategy

Info

Publication number
CN114723029A
Authority
CN
China
Prior art keywords
data
processing module
convolution
input
convolution processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210482658.5A
Other languages
Chinese (zh)
Other versions
CN114723029B (en)
Inventor
黄以华
罗聪慧
黄文津
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202210482658.5A
Publication of CN114723029A
Application granted
Publication of CN114723029B
Legal status: Active (current)
Anticipated expiration


Abstract

Translated from Chinese

The invention discloses a DCNN accelerator based on a hybrid multi-row data flow strategy, formed by stacking multiple convolution processing modules. Each convolution processing module contains multiple parallel computing unit arrays, computing buffers, and data buffers. Data is transferred between adjacent convolution processing modules in units of rows; the row data are stored in the data buffers, read out in sequence, rearranged, and sent to the computing buffers for the computing unit arrays to operate on. Each computing unit array computes a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM. By adjusting the parallelism of the computing unit arrays in the convolution processing modules, the off-chip bandwidth usage can be tuned, solving the problem that existing layer-by-layer pipelined accelerators cannot optimize off-chip bandwidth.

Description

Translated from Chinese
A DCNN accelerator based on a hybrid multi-row data flow strategy

Technical Field

The invention relates to the technical field of electronic information and deep learning, and more particularly to a DCNN accelerator based on a hybrid multi-row data flow strategy.

Background

In the wave of artificial intelligence development in recent years, deep convolutional neural networks (DCNNs) have shown better performance than traditional algorithms in fields such as object detection, semantic segmentation, face recognition, speech recognition, and computer-aided medical diagnosis. As a result, DCNNs have received very extensive attention and research.

Because it can fully exploit the intra-layer and inter-layer parallelism of a DCNN model, the layer-by-layer pipelined system architecture is widely used in FPGA-based DCNN accelerators. In this architecture, the computing paradigm of the convolutional-layer computation tasks (row-by-row or layer-by-layer) determines how many times the weight data are read from off-chip DRAM, which in turn determines the accelerator's off-chip bandwidth. However, existing layer-by-layer pipelined architectures all use a fixed computing paradigm, so the throughput of the accelerator is limited by the off-chip bandwidth and it is difficult to reduce off-chip bandwidth usage through on-chip storage.
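As an illustration of this relationship (not part of the patent text), the following sketch uses a simplified model in which a layer's weights are re-fetched from DRAM once for every group of output rows computed while those weights are held on chip; the layer shape, the rows-per-pass values, and 16-bit weights are hypothetical.

```python
def weight_traffic_bytes(n, c, k, h_out, rows_per_pass, bytes_per_weight=2):
    """Total bytes of weight data fetched from DRAM for one convolution layer."""
    weight_bytes = n * c * k * k * bytes_per_weight      # one full copy of the layer's kernels
    passes = -(-h_out // rows_per_pass)                  # ceil(h_out / rows_per_pass) weight passes
    return weight_bytes * passes

# Hypothetical layer: 256 output channels, 128 input channels, 3x3 kernels, 56 output rows.
for r in (1, 2, 4, 8, 56):
    mb = weight_traffic_bytes(n=256, c=128, k=3, h_out=56, rows_per_pass=r) / 2**20
    print(f"rows per pass = {r:2d} -> weight traffic ~ {mb:8.1f} MiB")
```

Under this simplified model, computing more output rows per weight pass reduces the number of weight re-reads but requires more on-chip row buffering, which is exactly the trade-off a fixed computing paradigm cannot adjust.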

The prior art discloses a convolution computing device based on hybrid parallelism. The device includes: an input module configured to acquire input convolution data and corresponding parameters, determine the convolution shape from the input convolution data, and extract the feature map size, convolution kernel size, and number of channels of the input convolution data; a simulation module configured to derive the degree of parallelism corresponding to the input convolution data from the data features extracted by the input module, the data features including the convolution shape and parameters; an on-chip processor including multiple parallel processing modules; a grouping control module, connected to each processing module, configured to divide all processing modules on the on-chip processor into G groups according to the degree of parallelism, where G equals the degree of parallelism and each group contains the same number of processing modules; and a mapping module, connected to each processing module, configured to control the data and parameters fed to each processing module according to the degree of parallelism, the input convolution data, and the corresponding parameters. Processing modules in the same group receive the same parameters but different data, while processing modules in different groups receive different parameters; each processing module performs the accelerated convolution on its input data and parameters and outputs the result. This scheme likewise finds it difficult to reduce off-chip bandwidth usage through on-chip storage.

Summary of the Invention

The present invention provides a DCNN accelerator based on a hybrid multi-row data flow strategy, which solves the problem that existing layer-by-layer pipelined DCNN accelerators cannot optimize off-chip bandwidth.

To solve the above technical problem, the technical solution of the present invention is as follows:

A DCNN accelerator based on a hybrid multi-row data flow strategy includes a convolution processing unit and a fully connected processing unit, wherein:

The convolution processing unit handles the convolution computation part of the DCNN model. It includes multiple convolution processing modules, a bypass convolution processing module, and a branch processing module. The convolution processing modules are connected in sequence, and their number equals the number L of convolutional layers of the DCNN model. Each convolution processing module has r_i rows of input data and r_{i+1} rows of output data, and the number of input rows of a convolution processing module equals the number of output rows of the previous convolution processing module. The number of input rows of the bypass convolution processing module equals the number of output rows of the first convolution processing module; the output of the bypass convolution processing module is the input of the branch processing module, and the output of the branch processing module is the input of the last convolution processing module. The branch processing module handles the branch part of the deep convolutional neural network.

The output row data of the last convolution processing module are output to the fully connected processing unit, which processes the fully connected layer part of the deep convolutional neural network.

Preferably, a pooling processing module is connected between adjacent convolution processing modules; the pooling processing module processes the pooling layer part of the deep convolutional neural network.

Preferably, the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, where ΔT_i is given by the expression shown in the original figure (not reproduced here). In that expression, Ps_j is the stride of the pooling processing module after the j-th convolution processing module, S_j is the stride corresponding to the j-th convolution processing module, the convolution kernel corresponding to the input feature map has size N_i × C_i × K_i × K_i, and the padding is pad_i.

Preferably, each convolution processing module includes an input data buffer, multiple parallel computing buffers, multiple parallel computing unit arrays, and an output data buffer, wherein:

The input data buffer reads and stores the data in the output data buffer of the previous convolution processing module; the multiple parallel computing buffers read data from the input data buffer; the inputs of the multiple parallel computing unit arrays are the data in the computing buffers; and the outputs of the multiple parallel computing unit arrays are stored in the output data buffer.

Data are transferred between adjacent convolution processing modules in units of rows. The row data are stored in the data buffers, read out in sequence, rearranged, and written into the computing buffers for the computing unit arrays to operate on. Each computing unit array is responsible for computing a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM.

Preferably, each of the multiple parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units. Each computing unit is a multiply-accumulate tree with W_{w,i} inputs; intermediate results are buffered in a dual-port RAM, and the final results are buffered in a RAM that serves as the data source of the input data buffer of the next convolution processing module.

Preferably, each computing unit performs small matrix multiplications W_rb × I_rpb in sequence and thereby realizes the large matrix multiplication W_r × I_rp; the final result is a single row of output feature map data.

Preferably, the computing paradigm of each computing unit is based on Toeplitz matrix multiplication: the input feature map data are converted into a Toeplitz matrix, and the input feature map data processed by each PE array correspond to one column block of the Toeplitz matrix. In the hybrid multi-row data flow strategy, all parallel computing unit arrays share the same weight parameters and each processes Ifmap data from a different column block of the Toeplitz matrix, thereby optimizing bandwidth usage.
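A minimal sketch of this paradigm (not from the patent; the im2col layout, shapes, and names are assumptions): the input feature map is unrolled into a Toeplitz (im2col) matrix, and several computing unit arrays reuse one copy of the weight matrix, each multiplying it with the column block that corresponds to a different output row.

```python
import numpy as np

def im2col_row(ifmap, k, stride, out_row):
    """Build the Toeplitz column block for a single output row.

    ifmap: (C, H, W) input feature map (already padded).
    Returns a (C*k*k, W_out) matrix whose columns are the unrolled
    receptive fields of the pixels in output row `out_row`.
    """
    c, h, w = ifmap.shape
    w_out = (w - k) // stride + 1
    cols = np.empty((c * k * k, w_out), dtype=ifmap.dtype)
    r0 = out_row * stride
    for j in range(w_out):
        patch = ifmap[:, r0:r0 + k, j * stride:j * stride + k]
        cols[:, j] = patch.reshape(-1)
    return cols

def conv_rows_shared_weights(ifmap, weights, stride, out_rows):
    """Compute several output rows with one copy of the weights.

    weights: (N, C, k, k). Each entry of `out_rows` plays the role of one
    parallel PE array; all of them reuse the same weight matrix W_r.
    """
    n, c, k, _ = weights.shape
    w_r = weights.reshape(n, c * k * k)               # weight matrix, read once
    return [w_r @ im2col_row(ifmap, k, stride, r)     # one (N, W_out) row block per "PE array"
            for r in out_rows]

# Hypothetical shapes: 8 output channels, 3 input channels, 3x3 kernel, stride 1.
ifmap = np.random.rand(3, 10, 10).astype(np.float32)
weights = np.random.rand(8, 3, 3, 3).astype(np.float32)
rows = conv_rows_shared_weights(ifmap, weights, stride=1, out_rows=[0, 1])  # 2 "PE arrays"
print([r.shape for r in rows])   # [(8, 8), (8, 8)]
```

Because the weight matrix is read once and reused across all column blocks, adding parallel arrays increases the number of output rows produced per weight fetch, which is how bandwidth usage is reduced.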

Preferably, the computing resource usage #PE Mult of each computing unit is:

#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, and I_{w,i} must satisfy the constraint given in the original figure (not reproduced here). In that constraint, Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module, and ΔT_1 is the clock-cycle interval at which the external data source feeds one row of data into the accelerator; ΔT_1 must satisfy a further condition (also given as a figure in the original) involving TRP_obj, the target throughput of the accelerator design.

Preferably, the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffer by channel. The number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where #RowDataBuffer is given by the expressions in the original figures (not reproduced here), and

DataIn_0' = K_i + S_i(r_{i+1} - 1) + GCD(r_{i+1}·S_i, r_i)·(r_i' - 1)

where GCD(r_{i+1}·S_i, r_i) is the greatest common divisor of r_i and r_{i+1}·S_i, r_i' and (r_{i+1}·S_i)' are two coprime positive integers, and pad_i is the padding corresponding to the output feature map.
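As a minimal numeric sketch (not from the patent), assuming that r_i' is obtained by dividing r_i by GCD(r_{i+1}·S_i, r_i), which makes r_i' and (r_{i+1}·S_i)' coprime as required, DataIn_0' can be evaluated as follows; the parameter values are hypothetical.

```python
from math import gcd

def data_in0_prime(k_i, s_i, r_i, r_ip1):
    """DataIn_0' = K_i + S_i*(r_{i+1} - 1) + GCD(r_{i+1}*S_i, r_i) * (r_i' - 1).

    Assumption: r_i' = r_i / GCD(r_{i+1}*S_i, r_i), so that r_i' and
    (r_{i+1}*S_i)' are coprime.
    """
    g = gcd(r_ip1 * s_i, r_i)
    r_i_prime = r_i // g
    return k_i + s_i * (r_ip1 - 1) + g * (r_i_prime - 1)

# Hypothetical module parameters: 3x3 kernel, stride 1, r_i = 4 input rows, r_{i+1} = 2 output rows.
print(data_in0_prime(k_i=3, s_i=1, r_i=4, r_ip1=2))   # 3 + 1*(2-1) + 2*(2-1) = 6
```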

Preferably, when the pooling processing module responsible for the pooling operation is used to process the output data of a convolution processing module, its number of input rows equals the number of output rows of that convolution processing module, and the number of data buffers it contains satisfies:

Hout_i × #RowPoolingBuffer

where #RowPoolingBuffer is given by the expression in the original figure (not reproduced here).

Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows:

The present invention introduces a hybrid multi-row computing paradigm (data flow strategy) into the layer-by-layer pipelined system architecture. Through flexible configuration of the data flow strategy, an efficient balance between on-chip storage and off-chip bandwidth can be achieved, which improves the design flexibility of the layer-by-layer pipelined system architecture and raises its theoretical throughput upper bound.

Description of the Drawings

FIG. 1 is a schematic diagram of the overall framework of the accelerator of the present invention.

FIG. 2 is a schematic diagram of the hybrid multi-row data flow strategy.

FIG. 3 is a schematic diagram of the framework of the convolution processing module.

FIG. 4 is a schematic diagram of the hardware structure of the computing unit array.

FIG. 5 is a schematic diagram of the data storage order of the input data buffer.

FIG. 6 is a schematic diagram of the convolution computing paradigm based on the Toeplitz matrix.

FIG. 7 is a schematic diagram of the order in which the computing unit array processes weight data and Toeplitz matrix data.

Detailed Description

The accompanying drawings are for illustrative purposes only and should not be construed as limiting this patent.

In order to better illustrate the embodiments, some parts in the drawings may be omitted, enlarged, or reduced and do not represent the size of the actual product.

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solutions of the present invention are further described below with reference to the accompanying drawings and embodiments.

Embodiment 1

This embodiment provides a DCNN accelerator based on a hybrid multi-row data flow strategy, as shown in FIG. 1, including a convolution processing unit and a fully connected processing unit, wherein:

The convolution processing unit handles the convolution computation part of the DCNN model. It includes multiple convolution processing modules, a bypass convolution processing module, and a branch processing module. The convolution processing modules are connected in sequence, and their number equals the number L of convolutional layers of the DCNN model. Each convolution processing module has r_i rows of input data and r_{i+1} rows of output data, and the number of input rows of a convolution processing module equals the number of output rows of the previous convolution processing module. The number of input rows of the bypass convolution processing module equals the number of output rows of the first convolution processing module; the output of the bypass convolution processing module is the input of the branch processing module, and the output of the branch processing module is the input of the last convolution processing module. The branch processing module handles the branch part of the deep convolutional neural network.

The output row data of the last convolution processing module are output to the fully connected processing unit, which processes the fully connected layer part of the deep convolutional neural network.

In FIG. 1, BottleNeck is the branch processing module used in networks such as ResNet; FPM (Fully Connected Process Module) is the fully connected processing unit, used to process the fully connected layers of the network; CPM (Convolution Process Module) is the convolution processing unit, used to process the convolution computation part of the network; External Memory is off-chip storage, i.e., storage outside the FPGA chip, implemented as the DDR on the FPGA development board.

Embodiment 2

On the basis of Embodiment 1, this embodiment further discloses the following:

When the DCNN model contains pooling layers, a pooling processing module is connected between adjacent convolution processing modules; the pooling processing module processes the pooling layer part of the deep convolutional neural network.

When the pooling layer parameters are PK_i = 2 and Ps_i = 2, the output of each pooling buffer in the pooling module is connected to a comparator (Comp), and three Comps form a 4-input comparator tree whose output is connected to the data buffer input of CCM_{i+1}. The parallelism of the comparator trees is Hin_{i+1}. The pooling module outputs data in parallel to the r_{i+1} row data buffers of CCM_{i+1}, so the output data dimension of the pooling module is r_{i+1} × Hin_{i+1}.
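A minimal functional sketch (not from the patent; the names and window layout are assumptions) of a 4-input comparator tree for 2×2 max pooling with PK = 2 and Ps = 2: three pairwise comparators reduce the four values of each pooling window to one output.

```python
def comp(a, b):
    """One comparator: select the larger of two values."""
    return a if a >= b else b

def comparator_tree_4(w):
    """4-input comparator tree built from three comparators (2x2 max pooling)."""
    return comp(comp(w[0], w[1]), comp(w[2], w[3]))

def max_pool_2x2(fmap):
    """Apply the comparator tree to every non-overlapping 2x2 window (stride 2)."""
    h, w = len(fmap), len(fmap[0])
    return [[comparator_tree_4([fmap[r][c],     fmap[r][c + 1],
                                fmap[r + 1][c], fmap[r + 1][c + 1]])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

# Hypothetical 4x4 feature-map row block.
fmap = [[1, 5, 2, 0],
        [3, 4, 7, 8],
        [6, 2, 1, 1],
        [0, 9, 3, 5]]
print(max_pool_2x2(fmap))   # [[5, 8], [9, 5]]
```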

Embodiment 3

On the basis of Embodiment 1 or Embodiment 2, this embodiment further discloses the following:

Assume the convolutional neural network has L convolutional layers. For the i-th layer, the input feature map (Ifmap) has width and height Hin_i, C_i input channels, and N_i output channels; the output feature map (Ofmap) has width and height Hout_i; the convolution kernel corresponding to the Ifmap has size N_i × C_i × K_i × K_i, stride S_i, and padding pad_i. Without loss of generality, assume there is a pooling layer (PL_i) between adjacent convolutional layers, with pooling filter size PK_i and pooling stride Ps_i, for all 1 ≤ i ≤ L.

The layer-by-layer pipelined DCNN accelerator contains multiple convolution processing modules, whose number equals the number L of convolutional layers of the DCNN model; the overall framework is shown in FIG. 1. Each convolution processing module has r_i rows of input data and r_{i+1} rows of output data, and the number of input rows of a convolution processing module equals the number of output rows of the previous convolution processing module (see FIG. 2). The external data source feeds one row of input feature map data to the accelerator every ΔT_1 clock cycles. Every r_2·S_1·ΔT_1 clock cycles, the first CCM outputs r_2 rows of data to the next CCM; similarly, every r_{i+1}·S_i·ΔT_i clock cycles, CCM_i completes the computation of r_{i+1} rows of Ofmap data, where ΔT_i is given by the expression in the original figure (not reproduced here).

In the PE array, each computing unit (PE) is a multiply-accumulate tree structure (see FIG. 4), where W_{h,i} and I_{w,i} are the height and width of the PE array and W_{w,i} is the number of input ports of the multiply-accumulate tree. Intermediate results are buffered in a dual-port RAM, and the final results are buffered in a RAM, which serves as the data source of the data buffer of the next CCM. Each PE array has an associated ping-pong PE buffer that caches the input data of the PE array; these data come from the data buffer of the convolution processing module. The data buffers cache the output data of the previous convolution processing module; the data stored in each data buffer have the same row position and column position in the input feature map and are arranged in the data buffer by channel (see FIG. 5).
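A minimal sketch (not from the patent; names and sizes are assumptions) of one W_w-input multiply-accumulate-tree PE: each cycle it multiplies W_w weight/input pairs, reduces them with an adder tree, and accumulates the partial sum in a buffer that stands in for the dual-port RAM.

```python
class MacTreePE:
    """One computing unit: a W_w-input multiply-accumulate tree with a partial-sum buffer."""

    def __init__(self, w_w, num_outputs):
        self.w_w = w_w
        self.psum = [0.0] * num_outputs        # stands in for the dual-port RAM

    def cycle(self, weights, inputs, out_idx):
        """One clock cycle: W_w multiplications, an adder-tree reduction, one accumulation."""
        assert len(weights) == len(inputs) == self.w_w
        products = [w * x for w, x in zip(weights, inputs)]
        self.psum[out_idx] += sum(products)    # adder tree modeled as a flat sum
        return self.psum[out_idx]

# Hypothetical 4-input PE accumulating a dot product of length 8 over two cycles.
pe = MacTreePE(w_w=4, num_outputs=1)
pe.cycle([1, 2, 3, 4], [1, 1, 1, 1], out_idx=0)
print(pe.cycle([5, 6, 7, 8], [1, 1, 1, 1], out_idx=0))   # 36
```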

Further, the number of data buffers is #RowDataBuffer × Hin_i, where #RowDataBuffer is given by the expressions in the original figures (not reproduced here), and

DataIn_0' = K_i + S_i(r_{i+1} - 1) + GCD(r_{i+1}·S_i, r_i)·(r_i' - 1)

where GCD(r_{i+1}·S_i, r_i) is the greatest common divisor of r_i and r_{i+1}·S_i, and r_i' and (r_{i+1}·S_i)' are two coprime positive integers.

Further, the computing paradigm of the PE array is based on Toeplitz matrix multiplication, i.e., the input feature map data must be converted into a Toeplitz matrix. The input feature map data processed by each PE array correspond to one column block of the Toeplitz matrix (see FIG. 6, a convolution operation based on the Toeplitz matrix). Each PE array performs small matrix multiplications W_rb × I_rpb in sequence and thereby realizes the large matrix multiplication W_r × I_rp; the final result is a single row of output feature map data.

The order in which a PE array processes the weight data and the Toeplitz matrix data when computing a single row of output feature map data is shown in FIG. 7. In the hybrid multi-row data flow strategy, all parallel PE arrays share the same weight parameters and each processes Ifmap data from a different column block of the Toeplitz matrix, thereby optimizing bandwidth usage. FIG. 7 shows an example of the operating mode of two parallel PE arrays.
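The following sketch (not from the patent; block sizes and names are assumptions) shows how one PE array realizes the large product W_r × I_rp as a sequence of small W_rb × I_rpb multiplications, accumulating partial results block by block; two calls with the same W_r model two parallel PE arrays sharing the weights.

```python
import numpy as np

def blocked_matmul(w_r, i_rp, wb_rows, wb_cols, ib_cols):
    """Realize W_r @ I_rp as a sequence of small W_rb x I_rpb multiplications.

    w_r:  (N, CKK) weight matrix.
    i_rp: (CKK, W_out) Toeplitz column block for one output row.
    The three block sizes play the role of the PE array dimensions.
    """
    n, ckk = w_r.shape
    _, w_out = i_rp.shape
    out = np.zeros((n, w_out), dtype=w_r.dtype)
    for r0 in range(0, n, wb_rows):                 # output-channel blocks
        for c0 in range(0, w_out, ib_cols):         # output-column blocks
            for k0 in range(0, ckk, wb_cols):       # accumulate along the reduction axis
                w_rb  = w_r[r0:r0 + wb_rows, k0:k0 + wb_cols]
                i_rpb = i_rp[k0:k0 + wb_cols, c0:c0 + ib_cols]
                out[r0:r0 + wb_rows, c0:c0 + ib_cols] += w_rb @ i_rpb
    return out

# Two "parallel PE arrays": the same W_r applied to two different Toeplitz column blocks.
w_r = np.random.rand(8, 27).astype(np.float32)
cols_row0 = np.random.rand(27, 8).astype(np.float32)
cols_row1 = np.random.rand(27, 8).astype(np.float32)
out0 = blocked_matmul(w_r, cols_row0, wb_rows=4, wb_cols=9, ib_cols=4)
out1 = blocked_matmul(w_r, cols_row1, wb_rows=4, wb_cols=9, ib_cols=4)
print(np.allclose(out0, w_r @ cols_row0), np.allclose(out1, w_r @ cols_row1))  # True True
```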

Further, the computing resource usage of each PE array is

#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, and I_{w,i} must satisfy the constraint given in the original figure (not reproduced here), and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data into the accelerator, satisfying a further condition (also given as a figure in the original) that involves TRP_obj, the target throughput of the accelerator design, IOP, the total amount of computation of the DCNN model executed by the accelerator, and freq, the operating frequency of the accelerator.

Further, when the pooling module responsible for the pooling operation is used to process the output data of a convolution processing module, its number of input rows equals the number of output rows of the convolution processing module, and the number of data buffers it contains satisfies

Hout_i × #RowPoolingBuffer

where #RowPoolingBuffer is given by the expression in the original figure (not reproduced here).
Figure BDA0003628468480000072

Further, the fully connected processing unit consists of a PE array and a data access subsystem. Specifically, the PE array consists of batch PEs. All PEs share the same weight data but process different Ifmap data, that is, each PE is responsible for processing the Ifmap data of a different inference task. The hardware structure of each PE is a MACC_f-input multiply-accumulate tree, so each PE can perform MACC_f MACC operations per clock cycle.
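A minimal sketch (not from the patent; names and sizes are assumptions) of this weight sharing in the fully connected unit: one copy of a layer's weights is streamed once, and batch PEs each apply it to the Ifmap vector of a different inference task, consuming MACC_f products per PE per cycle.

```python
import numpy as np

def fc_batched(weights, ifmaps, macc_f):
    """Fully connected layer where `batch` PEs share one weight stream.

    weights: (N_out, N_in) weight matrix, read from DRAM once.
    ifmaps:  (batch, N_in) one input vector per inference task / PE.
    macc_f:  number of multiply-accumulate inputs each PE consumes per cycle.
    """
    batch, n_in = ifmaps.shape
    n_out = weights.shape[0]
    out = np.zeros((batch, n_out), dtype=weights.dtype)
    for o in range(n_out):
        for k0 in range(0, n_in, macc_f):               # one MACC-tree pass per weight chunk
            w_chunk = weights[o, k0:k0 + macc_f]         # shared by all PEs this cycle
            for b in range(batch):                       # each PE handles one task
                out[b, o] += w_chunk @ ifmaps[b, k0:k0 + macc_f]
    return out

# Hypothetical sizes: 3 tasks in the batch, 64 inputs, 10 outputs, 8-input MACC trees.
w = np.random.rand(10, 64).astype(np.float32)
x = np.random.rand(3, 64).astype(np.float32)
print(np.allclose(fc_batched(w, x, macc_f=8), x @ w.T))   # True
```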

The same or similar reference numerals correspond to the same or similar parts.

The terms describing positional relationships in the drawings are for illustrative purposes only and should not be construed as limiting this patent.

Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention and are not intended to limit its embodiments. For those of ordinary skill in the art, changes or modifications of other forms can be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A DCNN accelerator based on a hybrid multi-row data flow strategy, comprising a convolution processing unit and a fully connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution calculation part of the DCNN model and comprises a plurality of convolution processing modules, a bypass convolution processing module, and a branch processing module, wherein the convolution processing modules are connected in sequence, the number of the convolution processing modules is equal to the number L of convolution layers of the DCNN model, the number of input row data of each convolution processing module is r_i, the number of output row data is r_{i+1}, and the number of input rows of a convolution processing module equals the number of output rows of the previous convolution processing module; the number of input rows of the bypass convolution processing module equals the number of output rows of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully connected processing unit, which is used to process the fully connected layer part of the deep convolutional neural network.
2. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 1, wherein a pooling processing module is connected between adjacent convolution processing modules, and the pooling processing module processes the pooling layer part of the deep convolutional neural network.
3. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 2, wherein the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, wherein ΔT_i is given by the expression in the original figure (not reproduced here), in which Ps_j is the stride of the pooling processing module after the j-th convolution processing module, S_j is the stride corresponding to the j-th convolution processing module, the convolution kernel corresponding to the input feature map has size N_i × C_i × K_i × K_i, and the padding is pad_i.
4. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 3, wherein each convolution processing module comprises an input data buffer, a plurality of parallel computing buffers, a plurality of parallel computing unit arrays, and an output data buffer, wherein:
the input data buffer reads and stores the data in the output data buffer of the previous convolution processing module, the parallel computing buffers read the data in the input data buffer, the input of the parallel computing unit arrays is the data in the computing buffers, and the output of the parallel computing unit arrays is stored in the output data buffer.
5. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 4, wherein each of the plurality of parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units, each computing unit is a multiply-accumulate tree with W_{w,i} inputs, intermediate calculation data are buffered in a dual-port RAM, and the final calculation result is buffered in a RAM, wherein the RAM serves as the data source of the input data buffer of the next convolution processing module.
6. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 5, wherein each computing unit performs small matrix multiplications W_rb × I_rpb in sequence, thereby finally realizing the large matrix multiplication W_r × I_rp, and the final calculation result is a single row of output feature map data.
7. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 6, wherein the computing paradigm of each computing unit is based on Toeplitz matrix multiplication, the input feature map data are converted into a Toeplitz matrix, the input feature map data processed by each PE array are located in a column matrix of the Toeplitz matrix, and all the parallel computing unit arrays in the hybrid multi-row data flow strategy share the same weight parameters to respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage.
8. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 7, wherein the computing resource usage #PE Mult of each computing unit is:
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}
wherein W_{h,i}, W_{w,i}, and I_{w,i} must satisfy the constraint given in the original figure (not reproduced here), in which Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, ΔT_1 satisfying a further condition (also given as a figure) that involves the target throughput of the accelerator design.
9. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 8, wherein the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffer by channel, and the number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where #RowDataBuffer is given by the expressions in the original figures (not reproduced here), and
DataIn_0' = K_i + S_i(r_{i+1} - 1) + GCD(r_{i+1}·S_i, r_i)·(r_i' - 1)
wherein GCD(r_{i+1}·S_i, r_i) is the greatest common divisor of r_i and r_{i+1}·S_i, r_i' and (r_{i+1}·S_i)' are two coprime positive integers, and pad_i is the padding corresponding to the output feature map.
10. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 9, wherein when the pooling processing module responsible for the pooling operation is used to process the output data of a convolution processing module, the number of its input row data is equal to the number of output row data of the convolution processing module, and the number of data buffers it contains satisfies:
Hout_i × #RowPoolingBuffer
wherein #RowPoolingBuffer is given by the expression in the original figure (not reproduced here).
CN202210482658.5A | Priority 2022-05-05 | Filed 2022-05-05 | A DCNN accelerator based on hybrid multi-row data flow strategy | Active | Granted as CN114723029B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210482658.5A | 2022-05-05 | 2022-05-05 | A DCNN accelerator based on hybrid multi-row data flow strategy

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210482658.5A | 2022-05-05 | 2022-05-05 | A DCNN accelerator based on hybrid multi-row data flow strategy

Publications (2)

Publication Number | Publication Date
CN114723029A | 2022-07-08
CN114723029B (en) | 2025-04-25

Family

ID=82231586

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210482658.5A (Active, granted as CN114723029B (en)) | A DCNN accelerator based on hybrid multi-row data flow strategy | 2022-05-05 | 2022-05-05

Country Status (1)

CountryLink
CN (1)CN114723029B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115292662A (en)* | 2022-08-18 | 2022-11-04 | 上海燧原科技有限公司 | Convolution acceleration operation method and device, electronic equipment and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109948784A (en)* | 2019-01-03 | 2019-06-28 | 重庆邮电大学 | A Convolutional Neural Network Accelerator Circuit Based on Fast Filtering Algorithm
CN109993297A (en)* | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A load-balanced sparse convolutional neural network accelerator and its acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ARASH AZIZIMAZREAH et al.: "Flexible On-chip Memory Architecture for DCNN Accelerators", 27 June 2018 (2018-06-27), pages 1-6 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115292662A (en)* | 2022-08-18 | 2022-11-04 | 上海燧原科技有限公司 | Convolution acceleration operation method and device, electronic equipment and storage medium
CN115292662B (en)* | 2022-08-18 | 2023-09-22 | 上海燧原科技有限公司 | Convolution acceleration operation method and device, electronic equipment and storage medium

Also Published As

Publication number | Publication date
CN114723029B (en) | 2025-04-25

Similar Documents

PublicationPublication DateTitle
CN110263925B (en) A hardware acceleration implementation device for forward prediction of convolutional neural network based on FPGA
CN109886400B (en) Convolutional Neural Network Hardware Accelerator System Based on Convolution Kernel Splitting and Its Computing Method
CN109409511B (en) A Data Stream Scheduling Method for Convolution Operation for Dynamic Reconfigurable Arrays
CN111626414B (en)Dynamic multi-precision neural network acceleration unit
Ardakani et al.An architecture to accelerate convolution in deep neural networks
CN107153873B (en)A kind of two-value convolutional neural networks processor and its application method
CN110458279A (en) An FPGA-based binary neural network acceleration method and system
CN207731321U (en)Hardware accelerator engine
CN110543939B (en)Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN108108809B (en)Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN111291858A (en)Tensor computation data flow accelerator semiconductor circuit
CN108805266A (en)A kind of restructural CNN high concurrents convolution accelerator
JP2021510219A (en) Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior
CN108510064A (en)The processing system and method for artificial neural network including multiple cores processing module
CN107203808B (en)A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor
CN111767994B (en) A neuron computing device
CN117634568B (en)Attention mechanism accelerator based on data flow
CN108170640A (en)The method of its progress operation of neural network computing device and application
US20230376733A1 (en)Convolutional neural network accelerator hardware
CN114723029B (en) A DCNN accelerator based on hybrid multi-line data flow strategy
CN114265801A (en) A Universal, Configurable, and Energy-Efficient Pooled Computing Multi-Line Output Method
CN110232441A (en)A kind of stacking-type based on unidirectional systolic arrays is from encoding system and method
Rezaei et al.Smart memory: Deep learning acceleration in 3d-stacked memories
CN118504632A (en)Binary deep convolutional neural network accelerator, binary deep convolutional neural network accelerator method and electronic chip
CN118747512A (en) A flexible convolutional network accelerator based on RISC-V processor

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
