CN107085562B - Neural network processor based on efficient multiplexing data stream and design method - Google Patents

Neural network processor based on efficient multiplexing data stream and design method

Info

Publication number
CN107085562B
Authority
CN
China
Prior art keywords
data
neural network
unit
calculation
computing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710179097.0A
Other languages
Chinese (zh)
Other versions
CN107085562A (en)
Inventor
韩银和
许浩博
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201710179097.0A
Publication of CN107085562A
Application granted
Publication of CN107085562B
Status: Active
Anticipated expiration


Abstract

The present invention provides a neural network processor based on an efficient multiplexing data stream, and a design method therefor, relating to the technical field of hardware acceleration for neural network model computation. The processor includes at least one storage unit for storing operation instructions and operand data; at least one computing unit for performing neural network computation; and a control unit, connected to the at least one storage unit and the at least one computing unit, for obtaining the operation instructions stored in the at least one storage unit via the at least one storage unit and parsing them to control the at least one computing unit, wherein the operand data takes the form of an efficient multiplexing data stream. In neural network processing, the invention adopts an efficient multiplexing data stream: weights and data need to be loaded into only one column of computing units in the computing unit array at a time, which reduces on-chip data bandwidth, increases the data sharing rate, and improves energy efficiency.

Description

A neural network processor based on an efficient multiplexing data stream, and a design method

Technical Field

The present invention relates to the technical field of hardware acceleration for neural network model computation, and in particular to a neural network processor based on an efficient multiplexing data stream and a design method therefor.

Background

With the continuous development of machine learning technology, deep neural networks have become the best solution for cognition and recognition tasks and have attracted wide attention in recognition, detection, and computer vision; in image recognition in particular, deep neural networks have reached or even surpassed human recognition accuracy.

The deep network structure obtained by deep learning is a computational model containing a large number of data nodes; each data node is connected to other data nodes, and the connections between nodes are represented by weights. Mainstream neural network processing hardware includes general-purpose graphics processors, dedicated processor chips, and field-programmable gate arrays (FPGAs). However, as the complexity of neural networks keeps increasing, neural network technology occupies many resources, runs slowly, and consumes much energy in practical applications, so its applicability on mobile or embedded platforms is limited; the technology therefore faces serious energy-efficiency problems and computation-speed bottlenecks when applied to embedded devices, low-cost data centers, and similar settings.

As the scale of deep neural networks continues to grow, the demands of data transfer and data computation increase. Since in many cases data transfer consumes more energy than data computation, the present invention provides a neural network processor based on an efficient multiplexing data stream: by optimizing the processor's data scheduling, it reduces the data transmission bandwidth, increases the reuse rate of weights and data, lowers the amount of on-chip storage, and thereby reduces operating energy consumption.

Summary of the Invention

In view of the deficiencies of the prior art, the present invention proposes a neural network processor based on an efficient multiplexing data stream and a design method therefor.

The present invention proposes a neural network processor based on an efficient multiplexing data stream, comprising:

at least one storage unit for storing operation instructions and operand data;

at least one computing unit for performing neural network computation; and

a control unit, connected to the at least one storage unit and the at least one computing unit, configured to obtain the operation instructions stored in the at least one storage unit via the at least one storage unit and to parse the operation instructions so as to control the at least one computing unit;

wherein the operand data takes the form of an efficient multiplexing data stream.

The neural network processor comprises a storage structure, a control structure, and a computing structure.

In the computing unit array, computing units in the same column share one identical set of data; computing units in the same row load the same set of weights, and in each computing cycle each computing unit loads only one element of a weight set; computing units in different rows load different weights.
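As a rough illustration of this sharing rule (not part of the patent; the array shapes and values below are hypothetical), the operand selection for a PE at a given row and column can be sketched as:

```python
# Illustrative sketch of the sharing rule described above, assuming
# `data[col]` is the data set shared down column `col` and
# `weights[row]` is the weight set assigned to row `row`.
def operands(row, col, cycle, data, weights):
    shared_data = data[col]                        # same column -> same data set
    weight_set = weights[row]                      # same row -> same weight set
    element = weight_set[cycle % len(weight_set)]  # one weight element per cycle
    return shared_data, element

data = [[1, 2], [3, 4], [5, 6]]          # one (hypothetical) data set per column
weights = [[10, 20, 30], [40, 50, 60]]   # one (hypothetical) weight set per row

# PEs in the same column see identical data regardless of row:
assert operands(0, 1, 0, data, weights)[0] == operands(1, 1, 0, data, weights)[0]
# A PE loads exactly one weight element per cycle:
assert operands(0, 0, 2, data, weights)[1] == 30
```

This is only a reference model of which operands each PE receives; the actual hardware realizes the same rule with shared wiring rather than indexing.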

Each set of data in the computing units is arranged along the depth direction of the neural network layer, and the operations across the different rows of the array embody the parallelism of the computing units.

The efficient multiplexing data stream loads only one column of data and weights into the computing unit array at a time, and the loaded data and weights propagate only between adjacent columns.

The present invention further proposes a design method for a neural network processor based on an efficient multiplexing data stream, comprising:

providing at least one storage unit that stores operation instructions and operand data;

providing at least one computing unit that performs neural network computation; and

providing a control unit, connected to the at least one storage unit and the at least one computing unit, which obtains the operation instructions stored in the at least one storage unit via the at least one storage unit and parses the operation instructions so as to control the at least one computing unit;

wherein the operand data takes the form of an efficient multiplexing data stream.

The neural network processor comprises a storage structure, a control structure, and a computing structure.

In the computing unit array, computing units in the same column share one identical set of data; computing units in the same row load the same set of weights, and in each computing cycle each computing unit loads only one element of a weight set; computing units in different rows load different weights.

Each set of data in the computing units is arranged along the depth direction of the neural network layer, and the operations across the different rows of the array embody the parallelism of the computing units.

The efficient multiplexing data stream loads only one column of data and weights into the computing unit array at a time, and the loaded data and weights propagate only between adjacent columns.

As can be seen from the above scheme, the advantages of the present invention are as follows:

In neural network processing, the present invention adopts an efficient multiplexing data stream: weights and data need to be loaded into only one column of computing units in the array at a time, which reduces on-chip data bandwidth, increases the data sharing rate, and improves energy efficiency.

Brief Description of the Drawings

Fig. 1 is a structural block diagram of the neural network processor provided by the present invention;

Fig. 2 is a schematic diagram of a computing unit array with a data sharing function provided by the present invention;

Fig. 3 is a schematic diagram of the efficient multiplexing data stream provided by the present invention.

Detailed Description

The purpose of the present invention is to provide a neural network processor based on an efficient multiplexing data stream and a design method therefor. Within an existing neural network processor system, the processor adopts a time-dimension/space-dimension data stream together with weight compression, which reduces on-chip data bandwidth, increases the data sharing rate, and eliminates invalid computation, thereby improving the computation speed and operating energy efficiency of the neural network processor.

To achieve the above object, the neural network processor based on an efficient multiplexing data stream provided by the present invention comprises:

at least one storage unit for storing operation instructions and operand data;

at least one computing unit for performing neural network computation; and a control unit, connected to the at least one storage unit and the at least one computing unit, for obtaining the instructions stored in the at least one storage unit via the at least one storage unit and parsing the instructions so as to control the at least one computing unit.

Data transmission and computation are performed with an efficient multiplexing data stream based on the time and space dimensions: the stream loads only one column of data and weights into the computing unit array at a time, and data and weights propagate only between adjacent columns, giving low data bandwidth and a high data sharing rate.

In order to make the objectives, technical solutions, design method, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.

The present invention aims to provide a neural network processor based on an efficient multiplexing data stream. In neural network processing, it adopts an efficient multiplexing data stream in which weights and data need to be loaded into only one column of computing units in the array at a time, reducing on-chip data bandwidth, increasing the data sharing rate, and improving energy efficiency.

The neural network processing provided by the present invention is based on a storage-control-computation structure:

The storage structure stores the data involved in computation and the processor's operation instructions;

The control structure includes a decoding circuit that parses the operation instructions and generates control signals to control the scheduling and storage of on-chip data and the neural network computation process;
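The patent does not specify an instruction encoding; purely as an illustration of what a decoding circuit does, one might sketch a decoder for a hypothetical 16-bit format (the opcode/src/dst field layout below is an assumption, not the patented design):

```python
# Hypothetical 16-bit instruction word: [15:12] opcode, [11:6] source
# address, [5:0] destination address. A decoder extracts these fields
# as the "control signals" that steer data scheduling and computation.
def decode(instruction):
    opcode = (instruction >> 12) & 0xF   # 4-bit operation code
    src = (instruction >> 6) & 0x3F      # 6-bit source storage address
    dst = instruction & 0x3F             # 6-bit destination storage address
    return {"opcode": opcode, "src": src, "dst": dst}

assert decode(0x1042) == {"opcode": 1, "src": 1, "dst": 2}
```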

The computing structure includes arithmetic logic units that participate in the neural network computation operations of the processor; compressed data is computed within the computing structure.

Fig. 1 shows a neural network processor system 101 provided by the present invention. The system architecture consists of six parts: an input data storage unit 102, a control unit 103, an output data storage unit 104, a weight storage unit 105, an instruction storage unit 106, and a computing unit array 107.

The input data storage unit 102 stores the data participating in computation, including the original feature map data and the data involved in intermediate-layer computation; the output data storage unit 104 holds the computed neuron response values; the weight storage unit 105 stores the already trained neural network weights; the instruction storage unit 106 stores the instruction information involved in computation, and the instructions are parsed to carry out the neural network computation.

The control unit 103 is connected to the output data storage unit 104, the weight storage unit 105, the instruction storage unit 106, and the computing unit 107. The control unit 103 obtains the instructions stored in the instruction storage unit 106 and parses them, and controls the computing unit to perform neural network computation according to the control signals obtained by parsing the instructions.

The computing unit 107 performs the corresponding neural network computation according to the control signals generated by the control unit 103. The computing unit 107 is associated with one or more storage units; it can obtain data for computation from the data storage components in its associated input data storage unit 102 and can write data to the associated output data storage unit 104. The computing unit 107 completes most of the operations in the neural network algorithm, i.e., vector multiply-accumulate operations and the like.
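The vector multiply-accumulate operation mentioned here can be sketched minimally as follows (an illustrative reference model, not the hardware implementation):

```python
# Reference model of a computing unit's core operation: a vector
# multiply-accumulate over a data vector and a weight vector, with an
# optional running accumulator (the partial sum carried across cycles).
def mac(data_vec, weight_vec, acc=0):
    for d, w in zip(data_vec, weight_vec):
        acc += d * w
    return acc

assert mac([1, 2, 3], [4, 5, 6]) == 32   # 1*4 + 2*5 + 3*6
```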

Fig. 2 is a schematic diagram of a computing unit array designed in the present invention and suited to the efficient multiplexing data stream. The array consists of m*n computing units, each of which completes the convolution of data with neural network weights. In the array, computing units in the same column share one identical set of data; computing units in the same row load the same set of weight values, and in each computing cycle each computing unit loads only one element of a weight set; computing units in different rows load different weight values.

The present invention provides an efficient multiplexing data stream applied to neural network processing. The features of the efficient multiplexing data stream include:

(1) The data participating in the neural network operation is loaded into the computing units along the depth direction of the neural network layer.

(2) The computing unit array contains n rows, and the row direction represents the parallelism of the computing units.

(3) Within the computing unit array, data and weights propagate and move along the column direction, which increases the reuse rate of data and weights.

Fig. 3 takes a 3*2 computing unit (PE) array as an example to describe in detail how the computing unit array provided by the present invention performs neural network computation through the efficient multiplexing data stream. As shown in Fig. 3, the two sets of weights, weight 0 and weight 1, both have size 2*2*4, and the data has size 4*2*4. The weights are divided into four groups according to their spatial positions, where weight elements at the same x and y coordinates form one group: the four groups of weight 0 are Ax, Bx, Cx, and Dx (x=0,1,2,3), and the four groups of weight 1 are ax, bx, cx, and dx (x=0,1,2,3). The specific working process of the PEs is as follows:
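The grouping rule can be sketched in code (an illustrative model only; the tensor layout `w[y][x][z]` and the string-valued elements are assumptions for demonstration):

```python
# A 2*2*4 weight tensor (spatial 2x2, depth 4) is split into four groups
# A, B, C, D, one per spatial (x, y) position; each group holds the 4
# elements along the depth axis, e.g. group A = Ax for x = 0..3.
def group_by_position(w):  # w[y][x][z]
    names = ["A", "B", "C", "D"]
    groups = {}
    i = 0
    for y in range(2):
        for x in range(2):
            groups[names[i]] = [w[y][x][z] for z in range(4)]
            i += 1
    return groups

# Label each element by its coordinates so the grouping is visible.
w = [[[f"{y}{x}{z}" for z in range(4)] for x in range(2)] for y in range(2)]
g = group_by_position(w)
assert g["A"] == ["000", "001", "002", "003"]   # depth elements at (0, 0)
assert len(g) == 4 and all(len(v) == 4 for v in g.values())
```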

In cycle 0, data ①, ②, and ③ are loaded into computing units PE00, PE01, and PE02 respectively, weight element A0 of weight 0 is loaded into PE00, and data ① and weight element A0 undergo multiplication and related operations in PE00. At the same time, data ①, ②, and ③ are shared to computing units PE10, PE11, and PE12, weight element a0 of weight 1 is loaded into PE10, and data ① and weight element a0 undergo multiplication and related operations in PE10;

In cycle 1, weight element A0 moves right into PE01 and weight element A1 is loaded into PE00; at the same time, weight element a0 moves right into PE11 and weight element a1 is loaded into PE10;

In cycle 2, weight elements A0 and A1 shift right simultaneously into PE02 and PE01 respectively, and A2 is loaded into PE00; at the same time, a0 and a1 shift right into PE12 and PE11 respectively, and a2 is loaded into PE10;

In cycle 3, weight elements A1 and A2 shift right simultaneously into PE02 and PE01 respectively, and A3 is loaded into PE00; at this point the first group of weight 0, Ax (x=0,1,2,3), has been fully loaded into the array. At the same time, a1 and a2 shift right into PE12 and PE11, and a3 is loaded into PE10; the first group of weight 1, ax (x=0,1,2,3), has likewise been fully loaded into the array;

In cycle 4, weight elements A2 and A3 shift right into PE02 and PE01 respectively, the first element B0 of the next group Bx (x=0,1,2,3) is loaded into PE00, and data ② is loaded into PE00; at the same time, a2 and a3 shift right into PE12 and PE11, the first element b0 of group bx (x=0,1,2,3) is loaded into PE10, and data ② is shared to PE10;

In cycle 5, weight elements A3 and B0 shift right into PE02 and PE01 respectively, B1 is loaded into PE00, and data ③ is loaded into PE01; at the same time, a3 and b0 shift right into PE12 and PE11, b1 is loaded into PE10, and data ③ is shared to PE11;

In cycle 6, weight elements B0 and B1 shift right into PE02 and PE01 respectively, B2 is loaded into PE00, and data ④ is loaded into PE02; at the same time, b0 and b1 shift right into PE12 and PE11, b2 is loaded into PE10, and data ④ is shared to PE12;

In cycle 7, weight elements B1 and B2 shift right into PE02 and PE01 respectively, and B3 is loaded into PE00; at this point the second group of weight 0, Bx (x=0,1,2,3), has been fully loaded into the array. At the same time, b1 and b2 shift right into PE12 and PE11, and b3 is loaded into PE10; the second group of weight 1, bx (x=0,1,2,3), has likewise been fully loaded into the array;

In cycle 8, weight elements B2 and B3 shift right into PE02 and PE01 respectively, C0 is loaded into PE00, and data ⑤ is loaded into PE00; at the same time, b2 and b3 shift right into PE12 and PE11, c0 is loaded into PE10, and data ⑤ is shared to PE10;

In cycle 9, data ⑥ is loaded into PE01, weight element C0 moves right into PE01, and C1 is loaded into PE00; at the same time, data ⑥ is shared to PE11, c0 moves right into PE11, and c1 is loaded into PE10;

In cycle 10, weight elements C0 and C1 shift right into PE02 and PE01 respectively, C2 is loaded into PE00, and data ⑦ is loaded into PE02; at the same time, c0 and c1 shift right into PE12 and PE11, c2 is loaded into PE10, and data ⑦ is shared to PE12;

In cycle 11, weight elements C1 and C2 shift right into PE02 and PE01 respectively, and C3 is loaded into PE00; at this point the third group of weight 0, Cx (x=0,1,2,3), has been fully loaded into the array. At the same time, c1 and c2 shift right into PE12 and PE11, and c3 is loaded into PE10; the third group of weight 1, cx (x=0,1,2,3), has likewise been fully loaded into the array;

In cycle 12, weight elements C2 and C3 shift right into PE02 and PE01 respectively, the first element D0 of weight 0's next group Dx (x=0,1,2,3) is loaded into PE00, and data ⑥ is loaded into PE00; at the same time, c2 and c3 shift right into PE12 and PE11, the first element d0 of weight 1's next group dx (x=0,1,2,3) is loaded into PE10, and data ⑥ is shared to PE10;

In cycle 13, weight elements C3 and D0 shift right into PE02 and PE01 respectively, D1 is loaded into PE00, and data ⑦ is loaded into PE01; at the same time, c3 and d0 shift right into PE12 and PE11, d1 is loaded into PE10, and data ⑦ is shared to PE11;

In cycle 14, weight elements D0 and D1 shift right into PE02 and PE01 respectively, D2 is loaded into PE00, and data ⑧ is loaded into PE02; at the same time, d0 and d1 shift right into PE12 and PE11, d2 is loaded into PE10, and data ⑧ is shared to PE12;

In cycle 15, weight elements D1 and D2 shift right into PE02 and PE01 respectively, and D3 is loaded into PE00; at the same time, d1 and d2 shift right into PE12 and PE11, and d3 is loaded into PE10;

In cycle 16, weight elements D2 and D3 shift right into PE02 and PE01 respectively; at the same time, d2 and d3 shift right into PE12 and PE11;

In the 17th cycle, weight element D3 shifts right and is loaded into computing unit PE02; at this point the convolution of the 2*2*4 weights with the 4*2*4 data is complete. At the same time, weight element d3 shifts right and is loaded into computing unit PE12, also completing the convolution of the 2*2*4 weights with the 4*2*4 data.
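The drain phase of the walkthrough above (a four-element weight group marching across three PE columns, with propagation only between adjacent columns) can be sketched as a short simulation. This is an illustrative sketch only: `weight_schedule` is a hypothetical helper, and its cycle numbering is relative rather than aligned to cycles 13-17 of the example.

```python
# Illustrative simulation of the weight movement described above: each cycle,
# one new weight element enters the row at PE00 and every element already in
# the row shifts one column to the right (adjacent-column propagation only).

def weight_schedule(weights, num_cols, num_cycles):
    """Return, for each cycle, the weight element held by each PE column."""
    held = [None] * num_cols  # held[c] = element currently in column c
    trace = []
    idx = 0
    for _ in range(num_cycles):
        nxt = weights[idx] if idx < len(weights) else None
        held = [nxt] + held[:-1]  # shift right by one column
        idx += 1
        trace.append(list(held))
    return trace

sched = weight_schedule(["D0", "D1", "D2", "D3"], num_cols=3, num_cycles=6)
for t, row in enumerate(sched, 1):
    print(f"cycle {t}: PE00={row[0]} PE01={row[1]} PE02={row[2]}")
```

Running it shows D0..D3 entering at PE00 and draining out through PE02, mirroring how, in cycles 15-17 above, D3 is the last element to reach PE02.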

The present invention further proposes a design method for a neural network processor based on an efficient multiplexing data stream, comprising:

providing at least one storage unit that stores operation instructions and operand data;

providing at least one computing unit that performs neural network computation;

providing a control unit connected to the at least one storage unit and the at least one computing unit, which obtains the operation instructions stored in the at least one storage unit via that storage unit, and parses the operation instructions to control the at least one computing unit;

wherein the operand data takes the form of an efficient multiplexing data stream.

The neural network processor comprises a storage structure, a control structure, and a computation structure.

In the computing-unit array, computing units in the same column share one identical set of data; computing units in the same row load the same set of weights, and in each computation cycle each computing unit loads only one element of that weight set; computing units in different rows load different weights.
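The sharing rules above admit a small functional check. The sketch below is a hypothetical model written for illustration, not code from the patent: the function names, the random test tensors, and the 2*2*4 / 4*2*4 sizes echoing the earlier example are all assumptions. One PE column per output position shares its input patch, one PE row per kernel holds that kernel's weights, one weight element is fed per cycle, and the accumulated results are compared against a naive convolution.

```python
import random

def naive_conv(data, kernel):
    # data: H x W x C nested lists, kernel: KH x KW x C, stride 1, no padding
    H, W, C = len(data), len(data[0]), len(data[0][0])
    KH, KW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - KH + 1):
        row = []
        for j in range(W - KW + 1):
            s = 0
            for di in range(KH):
                for dj in range(KW):
                    for c in range(C):
                        s += data[i + di][j + dj][c] * kernel[di][dj][c]
            row.append(s)
        out.append(row)
    return out

def reuse_dataflow_conv(data, kernels):
    # one PE column per output position (data shared down the column),
    # one PE row per kernel (that kernel's weights shared along the row),
    # one weight-element index (di, dj, c) fed to the array per cycle
    H, W, C = len(data), len(data[0]), len(data[0][0])
    KH, KW = len(kernels[0]), len(kernels[0][0])
    positions = [(i, j) for i in range(H - KH + 1) for j in range(W - KW + 1)]
    acc = [[0] * len(positions) for _ in kernels]  # per-PE accumulators
    for di in range(KH):
        for dj in range(KW):
            for c in range(C):  # depth direction, as in the text above
                for col, (i, j) in enumerate(positions):
                    shared = data[i + di][j + dj][c]  # loaded once per column
                    for row, k in enumerate(kernels):
                        acc[row][col] += k[di][dj][c] * shared
    return acc

random.seed(0)
rnd = lambda: random.randint(-2, 2)
data = [[[rnd() for _ in range(4)] for _ in range(2)] for _ in range(4)]  # 4*2*4
kD = [[[rnd() for _ in range(4)] for _ in range(2)] for _ in range(2)]    # 2*2*4
kd = [[[rnd() for _ in range(4)] for _ in range(2)] for _ in range(2)]    # 2*2*4

flat_ref = [[v for r in naive_conv(data, k) for v in r] for k in (kD, kd)]
assert reuse_dataflow_conv(data, [kD, kd]) == flat_ref
print("functional model matches naive convolution")
```

Note that with these sizes each kernel produces three output positions, matching the three PE columns (PE00-PE02) used in the cycle walkthrough.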

Within a computing unit, each set of data is arranged along the depth direction of the neural network layer, and the operations carried out across different rows embody the parallelism of the computing units.

The efficient multiplexing data stream loads only one column of data and weights into the computing-unit array at a time, and the loaded data and weights propagate only between adjacent columns.

In summary, to address the problem of low energy efficiency, the present invention provides an efficient multiplexing data stream that reduces data bandwidth, increases the data reuse rate, and effectively improves the energy efficiency of the processor.
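The bandwidth argument can be made concrete with a toy count. The sketch below is a simplified cost model introduced here for illustration; the functions and the counts R (kernel rows), P (output-position columns), and E (elements per kernel) are assumptions, not figures from the patent.

```python
# Back-of-envelope count of off-array element loads, with and without reuse.

def loads_without_reuse(R, P, E):
    # every PE fetches its own copy of each weight and each data element
    return 2 * R * P * E

def loads_with_reuse(R, P, E):
    # each data element enters once per column and is shared down R rows;
    # each weight element enters once per row and shifts across P columns
    return P * E + R * E

R, P, E = 2, 3, 16  # 2 kernels, 3 output positions, 2*2*4 = 16 elements
print(loads_without_reuse(R, P, E))  # 192
print(loads_with_reuse(R, P, E))     # 80
```

Even at this small scale the sharing scheme cuts element loads by more than half; the saving grows with the array dimensions.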

It should be understood that although this specification is described in terms of individual embodiments, not every embodiment contains only an independent technical solution; this manner of presentation is adopted solely for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be suitably combined to form other implementations understandable to those skilled in the art.

The above are merely illustrative specific embodiments of the present invention and are not intended to limit its scope. Any equivalent changes, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A neural network processor based on an efficient multiplexing data stream, characterized by comprising:
at least one storage unit for storing operation instructions and operand data;
at least one computing unit for performing neural network computation; and
a control unit connected to the at least one storage unit and the at least one computing unit, configured to obtain the operation instructions stored in the at least one storage unit via that storage unit and to parse the operation instructions to control the at least one computing unit;
wherein the operand data takes the form of an efficient multiplexing data stream: in the computing-unit array formed by the computing units, computing units in the same column share one identical set of data; computing units in the same row load the same set of weights, and in each computation cycle each computing unit loads only one element of that weight set; computing units in different rows load different weights.

2. The neural network processor based on an efficient multiplexing data stream according to claim 1, characterized in that the neural network processor comprises a storage structure, a control structure, and a computation structure.

3. The neural network processor based on an efficient multiplexing data stream according to claim 1, characterized in that, within a computing unit, each set of data is arranged along the depth direction of the neural network layer, and the operations across different rows embody the parallelism of the computing units.

4. The neural network processor based on an efficient multiplexing data stream according to claim 1, characterized in that the efficient multiplexing data stream loads only one column of data and weights into the computing-unit array at a time, and the loaded data and weights propagate only between adjacent columns.

5. A design method for a neural network processor based on an efficient multiplexing data stream, characterized by comprising:
providing at least one storage unit that stores operation instructions and operand data;
providing at least one computing unit that performs neural network computation; and
providing a control unit connected to the at least one storage unit and the at least one computing unit, which obtains the operation instructions stored in the at least one storage unit via that storage unit and parses the operation instructions to control the at least one computing unit;
wherein the operand data takes the form of an efficient multiplexing data stream: in the computing-unit array formed by the computing units, computing units in the same column share one identical set of data; computing units in the same row load the same set of weights, and in each computation cycle each computing unit loads only one element of that weight set; computing units in different rows load different weights.

6. The design method according to claim 5, characterized in that the neural network processor comprises a storage structure, a control structure, and a computation structure.

7. The design method according to claim 5, characterized in that, within a computing unit, each set of data is arranged along the depth direction of the neural network layer, and the operations across different rows embody the parallelism of the computing units.

8. The design method according to claim 5, characterized in that the efficient multiplexing data stream loads only one column of data and weights into the computing-unit array at a time, and the loaded data and weights propagate only between adjacent columns.
CN201710179097.0A · filed 2017-03-23 · priority 2017-03-23 · Neural network processor based on efficient multiplexing data stream and design method · Active · granted as CN107085562B

Priority Applications (1)

Application Number · Priority Date · Filing Date · Title
- CN201710179097.0A · 2017-03-23 · 2017-03-23 · Neural network processor based on efficient multiplexing data stream (granted as CN107085562B)


Publications (2)

Publication Number · Publication Date
- CN107085562A · 2017-08-22
- CN107085562B · 2020-11-03

Family

ID=59615185

Family Applications (1)

Application Number · Status · Publication
- CN201710179097.0A · Active · CN107085562B

Country Status (1)

Country · Publication
- CN · CN107085562B

Families Citing this family (9)

* Cited by examiner, † Cited by third party

Publication number · Priority date · Publication date · Assignee · Title
- CN107578098B* · 2017-09-01 · 2020-10-30 · 中国科学院计算技术研究所 · Neural network processor based on systolic array
- CN109697135B* · 2017-10-20 · 2021-03-26 · 上海寒武纪信息科技有限公司 · Storage device and method, data processing device and method, and electronic device
- CN107918794A* · 2017-11-15 · 2018-04-17 · 中国科学院计算技术研究所 · Neural network processor based on computing array
- CN108628799B* · 2018-04-17 · 2021-09-14 · 上海交通大学 · Reconfigurable single-instruction multiple-data systolic array structure, processor and electronic terminal
- CN109272112B* · 2018-07-03 · 2021-08-27 · 北京中科睿芯科技集团有限公司 · Data reuse instruction mapping method, system and device for neural networks
- CN110716751B* · 2018-07-12 · 2022-10-18 · 赛灵思公司 · High-parallelism computing platform, system and computing implementation method
- CN113261015A* · 2018-12-29 · 2021-08-13 · 华为技术有限公司 · Neural network system and data processing technology
- CN111401543B* · 2020-06-08 · 2020-11-10 · 深圳市九天睿芯科技有限公司 · Neural network accelerator with full on-chip storage and its implementation method
- CN113435570B* · 2021-05-07 · 2024-05-31 · 西安电子科技大学 · Programmable convolutional neural network processor, method, device, medium and terminal

Citations (4)

* Cited by examiner, † Cited by third party

Publication number · Priority date · Publication date · Assignee · Title
- CN105184366A* · 2015-09-15 · 2015-12-23 · 中国科学院计算技术研究所 · Time-division-multiplexed general neural network processor
- US9477925B2* · 2012-11-20 · 2016-10-25 · Microsoft Technology Licensing, LLC · Deep neural networks training for speech and pattern recognition
- WO2016186823A1* · 2015-05-21 · 2016-11-24 · Google Inc. · Batch processing in a neural network processor
- CN106203621A* · 2016-07-11 · 2016-12-07 · 姚颂 · Processor for convolutional neural network computation


Also Published As

Publication number · Publication date
- CN107085562A · 2017-08-22


Legal Events

Code · Description
- PB01 · Publication
- SE01 · Entry into force of request for substantive examination
- GR01 · Patent grant

