Technical Field
The invention belongs to the technical field of computers and electronic information, and in particular relates to a bidirectional data-level parallel processing convolution acceleration method for a BNN (binary neural network) hardware accelerator.
Background Art
Deep convolutional neural networks (CNNs) have become an important part of machine-learning algorithms and are widely used in computer vision. In the extensive use of CNNs to solve practical problems, an unavoidable challenge is how to meet their demands for computing power and storage. For example, a VGG-19 network contains about 140 million floating-point parameters and requires on the order of 15 billion floating-point operations to classify a single image. Consequently, the bulk of CNN training and inference is currently carried out on CPU and GPU clusters.
Compared with general-purpose platforms such as CPUs and GPUs, customized hardware platforms such as FPGAs can save power and improve efficiency, which makes them well suited to terminal application scenarios such as drones and embedded devices, where power efficiency and real-time performance matter most. In recent years, both academia and industry have put considerable effort into implementing CNN accelerators on FPGAs.
As the number of layers in deep convolutional neural networks increases and the number of parameters grows explosively, their computational complexity and computing-power requirements rise accordingly. Artificial-intelligence chips serve two stages, training and inference. Online inference means using a trained model to respond to user requests online, for example in autonomous driving and smart-home applications; for reasons of real-time response and privacy, the computing platform must be deployed on embedded intelligent terminals and latency must be kept as low as possible, which places demands on computing speed. In addition, embedded devices such as wearables impose very strict requirements on power consumption and efficiency.
Summary of the Invention
To solve the above problems, the present invention proposes a bidirectional data-level parallel processing convolution acceleration method for a BNN hardware accelerator, implemented by the following technical solution:
The bidirectional parallel processing convolution acceleration system for a BNN hardware accelerator comprises:
a storage unit arranged on each convolution layer, used to store the input activation data, the convolution kernel parameters, and the result of that layer's convolution operation;
an operation controller, which controls the transfer of data between convolution layers, the input of activations, the reading of convolution kernel parameters, the parameter operations, and the storage of computation results;
a convolution operation module, which, according to the controller's instructions, reads the data and parameters in the buffer units and completes the convolution operations;
a data transfer module, which, according to the configuration information of the operation controller, transfers all parameters and activation data from off-chip DDR to on-chip memory in a single pass, so as to reduce the number of off-chip memory accesses.
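A minimal sketch of this top-level dataflow, under the assumption that the roles are exactly as listed above (one bulk DDR-to-BRAM transfer followed by controller-driven, layer-by-layer computation), is given below; all function and variable names are illustrative, not taken from the invention.

```python
# Hypothetical top-level dataflow: the controller triggers a single bulk
# transfer from off-chip DDR to on-chip BRAM, then drives the convolution
# module layer by layer.  Names are illustrative only.
import numpy as np

def run_accelerator(ddr_activations, ddr_weights, layers):
    # Data transfer module: one-shot copy DDR -> on-chip BRAM (modelled as dicts).
    bram_act = {0: np.array(ddr_activations)}            # layer-0 input activations
    bram_w = {i: np.array(w) for i, w in enumerate(ddr_weights)}

    # Operation controller: each layer reads its BRAM inputs and writes its
    # result back to BRAM for the next layer.
    for i, layer_fn in enumerate(layers):
        bram_act[i + 1] = layer_fn(bram_act[i], bram_w[i])
    return bram_act[len(layers)]
```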
A further design of the bidirectional parallel processing convolution acceleration for the BNN hardware accelerator is that the convolution operation module comprises:
a floating-point convolution operation module, forming the first layer of the convolution operation module, whose input data are floating-point values;
a binary convolution unit, forming the second to fifth layers of the convolution operation module; after the convolution operation, the feature map is reduced by a factor of two in each dimension, which completes the pooling step, and the pooled data are substituted into a set formula for quantization;
a binary fully connected layer operation unit, forming the sixth to ninth layers of the convolution operation module, which uses fully connected layers to map the learned distributed features to the sample label space and converts the convolution of weights and activations into XOR operations between 1-bit data.
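The claim above states that the 1-bit convolutions reduce to XOR operations; a common way to realize this, and the assumption behind the following sketch, is to encode +1 as bit 1 and -1 as bit 0, in which case a dot product of length n equals n minus twice the number of bit positions where activation and weight differ.

```python
# Sketch of a 1-bit dot product via XOR and a bit count, assuming the encoding
# +1 -> 1 and -1 -> 0; the XOR marks the positions where activation and weight
# differ, i.e. where the +/-1 product is -1.
def binary_dot(acts: int, weights: int, n: int) -> int:
    diff = acts ^ weights                  # 1 where the two bits differ
    return n - 2 * bin(diff).count("1")    # dot product in the range [-n, n]

# Example: a flattened 3*3 window against a 3*3 kernel (9 bits each).
print(binary_dot(0b101101110, 0b100101010, 9))
```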
A further design of the bidirectional parallel processing convolution acceleration for the BNN hardware accelerator is that each unit of the convolution operation module comprises:
a data buffer: when a convolution operation is performed, pixel data are read out of the data memory and stored in the data buffer; after each convolution operation finishes, new data are read in from the data memory, overwriting the previous data;
a convolution kernel parameter buffer, which reads convolution kernels from the parameter memory and performs bitwise convolution with the data in the corresponding data buffer;
multipliers and adders, which are instantiated to perform the multiply and add operations during convolution;
a logical XOR circuit, which performs the special processing operations for the intermediate convolution layers.
A further design of the bidirectional parallel processing convolution acceleration for the BNN hardware accelerator is that the capacity of the data buffer equals twice the number of parameters of the convolution kernel with which it performs the convolution operation.
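A one-line illustration of this sizing rule, with hypothetical names, is given below; doubling the window size is what lets the buffer hold one window for each of the two processing directions.

```python
# Buffer sizing rule from the claim above: the data buffer holds two windows'
# worth of activations, one per processing direction (names are illustrative).
def data_buffer_capacity(kernel_h: int, kernel_w: int) -> int:
    return 2 * kernel_h * kernel_w

print(data_buffer_capacity(3, 3))   # 18 entries for a 3*3 kernel
```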
Advantages of the Invention
1. The activations and weights in the present invention are binarized, which lowers the chip's storage requirements; binary arithmetic is better suited to FPGA logic cells, improving throughput per unit chip area and power efficiency. Compared with one-way data processing, the scheme uses only about 0.3% more storage resources while roughly doubling the computational throughput.
2. Since both activations and weights are binarized, the multiplications between multi-bit activations and weights in a conventional CNN are replaced by binary logic operations, saving computational resources such as multipliers.
3. The data-level bidirectional parallel pipeline adopted by the present invention greatly increases the data throughput of the convolution operation: more data can be convolved in the same amount of time, effectively improving computational efficiency.
4. Before the convolution operations begin, the present invention reads all input activation data and parameters from off-chip DDR into the on-chip BRAM of each layer in a single pass, so as to reduce the number of off-chip memory accesses during computation; parameter data that must be accessed frequently are kept as far as possible in low-power on-chip memory, thereby reducing the power requirement.
Description of the Drawings
Figure 1 shows the overall architecture of the BNN processor.
Figure 2(a) is a schematic diagram of the floating-point convolution operation module.
Figure 2(b) is a schematic diagram of the binary convolution operation module.
Figure 2(c) is a schematic diagram of the fully connected layer operation unit.
Figure 3 is a schematic diagram of the internal structure of the operation unit.
Figure 4 is a schematic diagram of the data buffer mapping.
Detailed Description of the Embodiments
The solution of the present invention is described in detail below with reference to the accompanying drawings.
As shown in Figure 1, the bidirectional data-level parallel BNN processor of this embodiment consists of an operation controller, a convolution operation module, a data transfer module, and storage units (on-chip BRAM). The input activation data and configuration parameters enter through the operation controller, which controls the operation of the data transfer module and the convolution operation module according to its configuration information. The data transfer module moves the input activation data and parameter information from off-chip DDR to the storage units; driven by the control signals, the operation units then fetch data and parameters from memory in turn and perform the convolution operations.
The convolution operation module comprises a floating-point convolution operation module, a binary convolution operation module, and a binary fully connected layer operation unit. As shown in Figure 2(a), the first layer is the floating-point convolution operation module, whose initial input data are 20-bit floating-point values (the result of quantization). As shown in Figure 2(b), the second to fifth layers are binary convolution units: after the convolution, quantization, and binarization of the previous layer, every input activation is a 1-bit value of 0 or 1. After the convolution operation, the filter output is reduced by a factor of two in each dimension, which completes the pooling step, and the data are substituted into a specific formula for quantization. As shown in Figure 2(c), fully connected layers map the learned distributed features to the sample label space; since the results of the preceding layers are 1-bit, the convolution of weights and activations can be converted into XOR operations between 1-bit data.
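A rough, software-level model of one such binary convolution layer is sketched below. Since the quantization formula itself is not given in the text, a simple sign threshold stands in for it, and a 2x2 max pooling models the halving of each spatial dimension.

```python
# Rough model of one binary convolution layer: bitwise convolution, 2x2
# pooling that halves each spatial dimension, then binarization.  The
# quantization formula is a placeholder (sign threshold).
import numpy as np

def binary_conv_layer(act_bits, w_bits, k=3):
    h, w = act_bits.shape
    out = np.zeros((h - k + 1, w - k + 1), dtype=np.int32)
    n = k * k
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = act_bits[y:y + k, x:x + k]
            diff = np.count_nonzero(window != w_bits)   # same as popcount of XOR
            out[y, x] = n - 2 * diff                    # +/-1 dot product
    # 2x2 max pooling: each spatial dimension shrinks by a factor of two.
    ph, pw = out.shape[0] // 2, out.shape[1] // 2
    pooled = out[:2 * ph, :2 * pw].reshape(ph, 2, pw, 2).max(axis=(1, 3))
    # Binarize for the next layer (stand-in for the patent's formula).
    return (pooled >= 0).astype(np.uint8)
```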
As shown in Figure 3, the convolution operation module internally contains a parameter buffer, a data buffer, multipliers, and adders. During the pipelined processing of each layer, driven by the control signals, the parameters and data required by that layer are fetched from memory in turn and sent to the data buffer for multiply-accumulate operations; the results are written back to on-chip BRAM and then used in the computation of the subsequent layers.
Further, for the data buffer: when a convolution operation is performed, pixel data are read out of that layer's data memory and stored in the data buffer, whose capacity equals twice the number of parameters of the convolution kernel with which it performs the convolution; for example, if the kernel size is 3*3, the layer's data buffer unit can store 18 activation values. After each convolution operation finishes, new data are read in from the data memory, overwriting the previous data.
For the convolution kernel parameter buffer: if two sets of 3*3 convolution kernels are read from the parameter memory and stored in the kernel parameter buffer, its bit width is 18 bits (each weight occupies 1 bit), and a bitwise convolution is performed with the data in the corresponding data buffer units.
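Purely as an illustration of this dual-kernel buffer (the packing order is an assumption, not stated in the text), the sketch below packs two 1-bit 3*3 kernels into one 18-bit word and convolves each bitwise with its corresponding 9-bit data window.

```python
# Illustrative 18-bit kernel parameter buffer: two 3*3 binary kernels (9 bits
# each) packed into one word and convolved bitwise with the two windows held
# in the 18-entry data buffer.  The packing order is an assumption.
def pack_two_kernels(k_left: int, k_right: int) -> int:
    return (k_left << 9) | (k_right & 0x1FF)      # left kernel in the high 9 bits

def dual_bitwise_conv(win_left: int, win_right: int, packed_kernels: int):
    k_left, k_right = packed_kernels >> 9, packed_kernels & 0x1FF
    dot = lambda a, w: 9 - 2 * bin((a ^ w) & 0x1FF).count("1")   # +/-1 dot product
    return dot(win_left, k_left), dot(win_right, k_right)

# Example: convolve the left and right 3*3 windows with their packed kernels.
print(dual_bitwise_conv(0b111000111, 0b000111000, pack_two_kernels(0b101010101, 0b010101010)))
```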
For the multipliers and adders: when convolving the raw image data, the quantization bit width of each value is generally greater than 1 bit, so multiply and add operations are required and multipliers and adders must be instantiated.
The logical XOR circuit handles the special processing of the intermediate convolution layers: since the input activations of the intermediate layers are binary and the weight parameters are also binary, a logical XOR circuit can replace the multiplier circuit, which simplifies the circuitry and reduces system latency.
As shown in Figure 4, the data and parameters to be processed in the 3*3 region at the upper-left corner and the 3*3 region at the upper-right corner of the image are stored in the data buffers, the convolutions are performed, and the results are kept in registers; then, with a stride of 1, the regions adjacent to each of these two regions are taken as the next operands.
The bidirectional parallel data processing in the present invention comprises the following steps (a code sketch of the complete sweep follows the step list):
Step 1: from the memory of the convolution layer to be processed, read the 9 activation values located in the 3*3 region at the upper-left corner of the image (i.e., the first three pixels of each of the first three rows) and one set of 3*3 convolution kernel parameters, and store them in the buffer units in turn, keeping each activation value aligned with its corresponding parameter.
Step 2: from the memory of the convolution layer to be processed, read the 9 activation values located in the 3*3 region at the upper-right corner of the image (i.e., the last three pixels of each of the first three rows) and one set of 3*3 convolution kernel parameters, and store them in the buffer units in turn, again keeping each activation value aligned with its corresponding parameter.
Step 3: driven by the synchronous clock, perform the element-wise multiply-accumulate of the 18 activation values and 18 weight parameters in the buffers (if the activations are 1-bit, the multiplication can be replaced by a logical XOR). The result of convolving the 3*3 array of the upper-left region with the weights is stored in the result memory and recorded as the first value of the first row of the new feature map; likewise, the result of convolving the 3*3 array of the upper-right region with the weights is stored in the result memory and recorded as the last value of the first row of the new feature map (the 32nd value in this experiment).
Step 4: with a stride of 1, read the 2nd to 4th values of each of the first three rows from the activation data memory (equivalent to shifting the 3*3 patch at the upper-left corner of the image in Step 1 one position to the right), store them in the data buffer units, overwriting the previous activation data; similarly, read the weight parameters from the parameter memory into the buffer units, overwriting the previous parameters. At the same time, read values 29 to 31 of each of the first three rows of the activation memory into the buffer (equivalent to shifting the 3*3 patch at the upper-right corner of the image in Step 2 one position to the left), together with the corresponding parameters. Then repeat Step 3 to obtain the second and second-to-last values of the first row of the new feature map.
Step 5: shift each of the two 3*3 regions of Step 4 one further stride toward the centre (the left-hand region to the right, the right-hand region to the left), read the data and parameters into the buffers, and repeat Step 3 to obtain the third and third-to-last values of the first row of the new feature map. This continues until the two regions meet during their continued shifting, at which point the result memory holds 32 convolution results, which form the first row of the input feature map of the next layer.
Step 6: read the leftmost 3*3 region and the rightmost 3*3 region of rows two to four, and proceed as in Steps 1 to 5 to obtain the second row of the input feature map of the next layer.
Step 7: repeat Steps 1 to 6 in this way until a 32*32 feature map is obtained as the input activations of the next layer.
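A compact software sketch of the whole bidirectional sweep described in Steps 1 to 7 is given below. It assumes a generic input whose width already accounts for any padding (which the steps do not spell out), and uses a plain multiply-accumulate in place of the XOR path for readability.

```python
# Sketch of the bidirectional row sweep in Steps 1 to 7: two 3*3 windows start
# at the left and right ends of a three-row band and slide toward the centre
# with stride 1, each producing one output per step from opposite ends of the
# output row.
import numpy as np

def bidirectional_conv(image, kernel, k=3):
    h, w = image.shape
    out_w, out_h = w - k + 1, h - k + 1
    out = np.zeros((out_h, out_w), dtype=np.int32)
    conv = lambda win: int(np.sum(win * kernel))     # replace with XOR/popcount for 1-bit data
    for row in range(out_h):                         # Steps 6-7: move the band down one row at a time
        band = image[row:row + k, :]
        left, right = 0, out_w - 1
        while left <= right:                         # Steps 1-5: windows converge from both ends
            out[row, left] = conv(band[:, left:left + k])
            out[row, right] = conv(band[:, right:right + k])
            left, right = left + 1, right - 1
    return out

# Example: a 34*34 input (e.g. a 32*32 image with one pixel of padding on each
# side) yields the 32*32 feature map described above.
img = np.random.randint(0, 2, (34, 34))
ker = np.random.randint(0, 2, (3, 3))
print(bidirectional_conv(img, ker).shape)            # (32, 32)
```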
The resources used are listed below (taking only a single image as an example); as the number of feature maps and filters increases, the extra resources account for an ever smaller percentage of the original storage resources.
The above are merely preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be defined by the appended claims.