CN115374399B


Info

Publication number: CN115374399B
Application number: CN202210924135.1A
Authority: CN
Inventors: 任鹏举; 林晓云; 霍志旺; 楼薇; 张先娆; 赵文哲; 夏天
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-01-30
Filing date: 2022-08-02
Publication date: 2025-09-02
Anticipated expiration: 2042-08-02
Also published as: CN115374399A

Abstract

The present disclosure proposes a PE array structure compatible with multi-dimensional matrix multiplication, an arithmetic unit, and an MPU thereof. The PE array is subjected to function expansion design, and can support multiplication operation of various dimensional matrixes by transmitting control signals in different modes, so that the utilization rate of the PE array is improved, the operation time is shortened, and the energy consumption caused by data movement is saved. In addition, the method designs two modes of low power consumption and high performance for the same vector multiplication matrix operation to meet the requirements of different application scenes.

Description

Operation unit compatible with multi-dimensional matrix multiplication

Technical Field

The disclosure belongs to the technical field of processors and computation, and in particular relates to a PE array structure compatible with multi-dimensional matrix multiplication, an arithmetic unit and an MPU thereof.

Background

In the arithmetic units of modern processors, matrix multiplication and vector-matrix multiplication are common operation types. Square-matrix multiplication, vector-matrix multiplication, and general matrix-matrix multiplication appear extensively in the convolution layers and fully connected layers of neural networks, so traditional scalar arithmetic units can no longer meet current computing-power demands.

Heterogeneous processors have become a trend in recent years, and the MPU (Matrix Processing Unit) of such processors is dedicated to matrix multiplication and convolution operations. When matrix multiplication is performed on a modern processor, hardware utilization can be improved through explicit software programming (subword parallelism, instruction-level parallelism, and loop unrolling) and hardware cache blocking, which shortens matrix operation time. However, asymmetric matrix multiplications such as vector-matrix multiplication still waste a large amount of hardware resources.

Disclosure of Invention

In view of this, the present disclosure provides a PE array structure compatible with multi-dimensional matrix multiplication, comprising:

64 PE units, each PE unit having an address in the PE array designated (i, j), i representing a row and j representing a column;

The PE array comprises two kinds of inputs, namely 8 A-direction inputs A0-A7 and 64 W-direction inputs W00-W77, wherein the A-direction inputs are orthogonal to the W-direction inputs; W00-W07 count as 8 inputs, W10-W17 count as 8 inputs, and so on through W70-W77, giving a total of 64 W-direction inputs;

each of A0-A7 comprises 8 numbers forming a vector of dimension [1,8], called an A-direction input, wherein the 8 A-direction inputs A0-A7 can be the same or different;

each of W00-W77 comprises 8 numbers forming a vector of dimension [1,8], called a W-direction input; the 64 W-direction inputs W00-W77 are respectively sent to the PE units at the corresponding positions, and can be the same or different;

for each PE unit (Processing Element), wherein:

as the basic processing unit in the PE array, it has two inputs (an A-direction input and a W-direction input) and one output;

taking the PE unit at location (i, j) as an example, the inputs are Ai and Wij, and the output is denoted Psum(i, j);

one PE unit can complete one [1,8] × [1,8]^T vector multiplication in 1 cycle;

for the combination of 8 PE units in the first row of the PE array, vector A0 is input to the 8 PE units laterally and simultaneously;

The vectors W00-W07 of the first row of the PE array are regarded as 8 column vectors of a matrix of [8,8], and then are respectively input into PE units at corresponding positions;

One row of the PE array can complete vector multiplication matrix operation of [1,8] × [8,8]^T in 1 cycle;

and the 8 row-wise vector-matrix multiplications in the PE array are combined in different ways, so as to realize vector-matrix and matrix-matrix multiplication of different dimensions.

In addition, the disclosure also discloses an operation unit, which comprises the PE array structure.

In addition, the present disclosure also discloses an MPU, which includes the PE array structure described above, or the arithmetic unit described above.

Preferably,

In addition to PE_array, i.e. the PE array structure, the MPU further comprises control, ACC, Buf, lm_A (local memory A), and lm_W (local memory W);

control is used for generating the various control signals and controlling the other modules;

lm_A and lm_W are used for storing the A-direction and W-direction inputs;

PE_array is used for realizing matrix operations in the various modes;

ACC is used for accumulating the outputs of the PE array in different time domains;

and Buf is used for storing the accumulated results of ACC: for an operation that is not yet complete, the result is returned to ACC, and for a completed operation, the result is output from the MPU.

When the control signal is the first control signal, the PE array structure works in a low power consumption mode.

Preferably,

When the control signal is the second control signal, the PE array structure operates in a high-performance mode.

Preferably,

The default mode of the PE array structure is a high performance mode.

Preferably,

By transmitting control signals in different modes, multiple dimension matrix multiplication operations are supported.

Preferably,

For the same vector multiplication matrix operation, the PE array structure can work in different modes with low power consumption or high performance, and defaults to a high-performance mode, and the modes are switchable.

Thus, the present disclosure proposes a PE array structure compatible with multi-dimensional matrix multiplication, an arithmetic unit, and an MPU thereof. The PE array is subjected to function expansion design, and can support multiplication operation of various dimensional matrixes by transmitting control signals in different modes, so that the utilization rate of the PE array is improved, the operation time is shortened, and the energy consumption caused by data movement is saved. In addition, the method designs two modes of low power consumption and high performance for the same vector multiplication matrix operation, and can be switched according to the needs to meet the requirements of different application scenes, and the default mode is the high performance mode.

Compared with explicit software programming, which requires the programmer to have some low-level hardware knowledge, the present method designs and improves the PE array structure at the hardware level.

Drawings

FIG. 1 is a schematic diagram of a PE array structure in accordance with one embodiment of the disclosure;

FIG. 1A is a schematic diagram of a prior art conventional PE array in computing a matrix multiplication that is not square;

FIG. 1B is a schematic diagram of time-domain accumulation of a row of PE array outputs;

FIG. 1C is a schematic diagram of time-domain accumulation of a column of PE array outputs;

FIG. 2 is a schematic diagram of a vector multiplication matrix operation of [1,8×s] × [8×s,8×s]^T in one embodiment of the disclosure;

FIG. 3 is a schematic diagram of replication method one (low power mode) of A[m] and W[m][n] in one embodiment of the present disclosure;

FIG. 4 is a schematic diagram of replication method two (high performance mode) of A[m] and W[m][n] in one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the combination of A[m] and W[m][n] in low power mode and the PE array inputs per cycle when s=8 in one embodiment of the disclosure;

FIG. 6 is a schematic diagram of the combination of A[m] and W[m][n] in high performance mode and the PE array inputs per cycle when s=8 in one embodiment of the disclosure;

FIG. 7 is a schematic diagram of a standard PE array matrix multiplication in one embodiment of the disclosure;

FIG. 8 is a schematic diagram of a [1,64] × [64,64]^T vector multiplication matrix operation in one embodiment of the present disclosure;

FIG. 9 is a schematic diagram of the division by row 8 for a [1,64] × [64,64]^T vector multiplication matrix operation in one embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a [1,64] × [64,64]^T vector multiplication matrix low-power operation in one embodiment of the present disclosure;

FIG. 11 is a schematic diagram of inputting, in the (k+1)th cycle, the 8 [8,8]^T matrices in W[k] to the corresponding rows of the PE array, and obtaining a vector C[k] with dimensions [1,8] after 1 cycle, in an embodiment of the present disclosure;

FIG. 12 is a schematic diagram of a [1,64] × [64,64]^T vector multiplication matrix high-performance operation in one embodiment of the present disclosure;

FIG. 13 is a schematic diagram of broadcasting a vector A[k] of dimension [1,8] to A0-A7 of the PE array in the (k+1)th cycle, inputting the 8 [8,8] matrices W[i][k] (i=0, 1, ..., 7) in W to the ith row of the PE array respectively, outputting 1 number per PE unit after 1 cycle, and splicing the outputs of the 8 PEs of the ith row into a vector to be accumulated of dimension [1,8];

FIG. 14 is a schematic diagram of a [4,16] × [16,16]^T matrix multiplication in one embodiment of the present disclosure;

FIG. 15 is a split combination of A and W in a [4,16] × [16,16]^T matrix-multiplied low-power mode in one embodiment of the present disclosure;

FIG. 16 is a schematic diagram of the PE array during cycle 1 of a [4,16] × [16,16]^T vector multiplication matrix low-power operation, in accordance with one embodiment of the present disclosure;

FIG. 17 is a schematic diagram of a PE array during cycle 2 of a [4,16] × [16,16]^T vector multiplication matrix low-power operation, with the W-direction input remaining unchanged, in one embodiment of the present disclosure;

FIG. 18 is a schematic diagram of a split combination of A and W in a [4,16] × [16,16]^T matrix-multiplication high-performance mode, in one embodiment of the present disclosure;

FIG. 19 is a schematic diagram of the PE array during cycle 1 of a [4,16] × [16,16]^T vector multiplication matrix high-performance operation, according to one embodiment of the present disclosure;

FIG. 20 is a schematic diagram of a PE array during cycle 2 of a [4,16] × [16,16]^T vector multiplication matrix high performance operation, with the A-way input remaining unchanged and the W-way input being changed to the 8 matrix inputs of cycle 2 in W, in one embodiment of the present disclosure;

FIG. 21 is a schematic diagram of a [2,32] × [32,32]^T matrix multiplication in one embodiment of the present disclosure;

FIG. 22 is a schematic diagram of a split combination of A and W in a [2,32] × [32,32]^T vector-multiplied matrix low-power mode, according to one embodiment of the present disclosure;

FIG. 23 is a schematic diagram of the PE array during cycle 1 of a [2,32] × [32,32]^T vector multiplication matrix high-performance operation, in accordance with one embodiment of the present disclosure;

FIG. 23A is a schematic diagram of a split combination of A and W in a [2,32] × [32,32]^T vector-multiplied matrix high-performance mode, in one embodiment of the present disclosure;

FIG. 23B is a schematic diagram of a PE array in a high performance mode, according to one embodiment of the disclosure, when performing a [4,16] × [16,16]^T vector multiplication matrix operation;

FIG. 23C is a schematic diagram of a PE array in a high performance mode when performing a [4,16] × [16,16]^T vector multiplication matrix operation, where k=0, according to one embodiment of the present disclosure;

FIG. 24 is a schematic diagram of an MPU in one embodiment of the present disclosure.

Detailed Description

The present invention is further described below with reference to FIGS. 1 to 24.

Various embodiments of the present disclosure will be described in detail below.

As shown in FIG. 1, in one embodiment, the present disclosure discloses an operation unit compatible with multi-dimensional matrix multiplication, which includes a PE array including at least 64 PE units (preferably exactly 64, which is a power of two and also a multiple of 8), each PE unit having an address (i, j) in the array, i representing the row and j representing the column;

The PE array has two kinds of inputs: A0-A7, each a vector of dimension [1,8], are called A-direction inputs; W00-W77, each a vector of dimension [1,8], are called W-direction inputs, and these 64 inputs are respectively sent to the PE units at the corresponding positions;

The PE array comprises 8 A-direction inputs (A0-A7) and 64 W-direction inputs (W00-W77), wherein each of A0-A7 comprises 8 numbers (forming a vector of dimension [1,8]) and the 8 A-direction inputs can be the same or different; each of W00-W77 likewise comprises 8 numbers (forming a vector of dimension [1,8]) and the 64 W-direction inputs can be the same or different;

The PE unit (Processing Element) is the basic processing unit of the PE array and has two inputs and one output; taking the PE unit at position (i, j) as an example, the inputs are Ai and Wij and the output is Psum(i, j), and one PE unit can complete one [1,8] × [1,8]^T vector multiplication in 1 cycle;

taking the combination of the 8 PE units in the first row of the PE array as an example, the vector A0 is input laterally and simultaneously to the 8 PE units, and the vectors W00-W07 of the first row can be regarded as the 8 column vectors of an [8,8] matrix, each input to the PE unit at the corresponding position, so that one row of the PE array can complete a [1,8] × [8,8]^T vector-matrix multiplication in 1 cycle.

The disclosure realizes vector-matrix and matrix-matrix multiplication of different dimensions by combining the 8 row-wise vector-matrix multiplications of the PE array in different ways.
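The behaviour described above can be illustrated with a minimal software sketch. It is only a functional model under stated assumptions: the names `pe_unit` and `pe_array_cycle` are hypothetical, values are plain Python integers, and nothing here reflects the actual circuit or its timing.

```python
from typing import List

Vector = List[int]


def pe_unit(a: Vector, w: Vector) -> int:
    """Functional model of one PE in one cycle: a [1,8] x [1,8]^T dot product."""
    assert len(a) == 8 and len(w) == 8
    return sum(x * y for x, y in zip(a, w))


def pe_array_cycle(a_inputs: List[Vector],
                   w_inputs: List[List[Vector]]) -> List[List[int]]:
    """Functional model of one cycle of the 8x8 array.

    a_inputs[i] plays the role of Ai (shared by all PEs in row i).
    w_inputs[i][j] plays the role of Wij (private to PE (i, j)).
    Returns Psum(i, j) for every PE.
    """
    assert len(a_inputs) == 8 and len(w_inputs) == 8
    return [[pe_unit(a_inputs[i], w_inputs[i][j]) for j in range(8)]
            for i in range(8)]


# Illustrative data (not from the patent): Ai = [i+1]*8 and Wij = [j]*8.
a_inputs = [[i + 1] * 8 for i in range(8)]
w_inputs = [[[j] * 8 for j in range(8)] for _ in range(8)]
psum = pe_array_cycle(a_inputs, w_inputs)
```

Row i of `psum` is the [1,8] result of Ai multiplied by the transposed [8,8] matrix formed from Wi0-Wi7, which is the per-row vector-matrix product the text refers to.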

Those skilled in the art will appreciate that a conventional PE array has only 8 W-direction inputs, and the 8 PE units in each column share the same W-direction input. In the present disclosure, by contrast, each PE unit has its own W-direction input; when the W-direction inputs of the 8 PE units in each column are made identical, the PE array can perform all functions of a conventional PE array. When computing a non-square matrix multiplication, a conventional PE array leaves a large amount of hardware idle: as shown in FIG. 1A, during a vector-matrix operation it can compute only 1 group of [1,8] × [8,8]^T per cycle, so only one row of PE units is in a working state and the delivered computing power is only 1/8 of the peak computing power of the PE array.

The following details the splitting and combining method of the PE array and the input data of the disclosure:

The PE array of the disclosure adds 56 units of storage resources (W-direction input data). Each input value is 1 byte and each W-direction input holds 8 values, so the storage space increases by 56 × 8 = 448 bytes. The 8 PE units in each row of the PE array can complete a [1,8] × [8,8]^T vector-matrix multiplication, and matrix multiplication and vector-matrix multiplication of different dimensions can be completed by controlling different input combinations of the 8 rows. When performing vector-matrix multiplication, the PE array can compute 8 different groups of [1,8] × [8,8]^T operations per cycle, and multi-dimensional matrix multiplication can be completed by accumulating the PE array outputs in the time domain.
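The overhead and utilization figures quoted above follow from simple arithmetic, sketched below. The constant names are hypothetical, and the sketch assumes that a conventional array shares one W-direction vector per column (8 in total), which is what the 56 additional inputs imply.

```python
PE_ROWS = 8          # rows in the PE array
PE_COLS = 8          # columns in the PE array
VECTOR_LENGTH = 8    # numbers per [1,8] input vector
BYTES_PER_VALUE = 1  # each input value is 1 byte

# Assumption: a conventional array has one shared W-direction input per column.
conventional_w_inputs = PE_COLS
extended_w_inputs = PE_ROWS * PE_COLS

extra_w_inputs = extended_w_inputs - conventional_w_inputs
extra_bytes = extra_w_inputs * VECTOR_LENGTH * BYTES_PER_VALUE

# [1,8] x [8,8]^T blocks finished per cycle during a vector-matrix operation.
conventional_blocks_per_cycle = 1
extended_blocks_per_cycle = PE_ROWS
conventional_utilization = conventional_blocks_per_cycle / extended_blocks_per_cycle

print(extra_w_inputs, extra_bytes, conventional_utilization)
```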

The time-domain accumulation of the PE array output can be realized through the ACC (Accumulation) unit. As shown in FIG. 1B, accumulating the output of one row of the PE array in the time domain requires 1 vector adder, and the ACC unit comprises 8 vector adders that accumulate the PE array outputs in the time domain respectively. The 8 PE units in each column of the PE array can complete a vector multiplication matrix operation of [1,64] × [8,64]^T; by controlling different input combinations of the 8 columns, the PE array can complete matrix multiplication and vector multiplication matrix operations of different dimensions; when performing vector multiplication matrix operations, the PE array can compute 8 different groups of [1,64] × [8,64]^T operations per cycle, and multi-dimensional matrix multiplication can be completed by spatial summation of the PE array outputs;

The spatial summation of the PE array output can be realized through adders inside the PE array. As shown in FIG. 1C, summing the outputs of the PE units in one column requires 7 vector adders, and the outputs of specified PE units can be summed by enabling the specified adders;
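One way to picture the enable-controlled spatial summation is the sketch below. It is an illustration only: the text does not specify how the 7 adders of a column are wired or how their enables are encoded, so the chain topology and the `adder_enable` list are assumptions, and `column_group_sums` is a hypothetical name.

```python
from typing import List


def column_group_sums(column_psums: List[int],
                      adder_enable: List[bool]) -> List[int]:
    """Sum the 8 Psum values of one PE column with 7 gated two-input adders.

    Assumed wiring: adder k sits between row k and row k+1. When enabled,
    it merges row k+1 into the running sum of the current group; when
    disabled, row k+1 starts a new group.
    """
    assert len(column_psums) == 8 and len(adder_enable) == 7
    groups = [column_psums[0]]
    for row in range(1, 8):
        if adder_enable[row - 1]:
            groups[-1] += column_psums[row]
        else:
            groups.append(column_psums[row])
    return groups


column = [1, 2, 3, 4, 5, 6, 7, 8]
all_rows = column_group_sums(column, [True] * 7)
pairs = column_group_sums(column, [True, False] * 3 + [True])
no_sum = column_group_sums(column, [False] * 7)
```

Under this assumed encoding, enabling all adders reduces a column to a single value (useful when all 8 rows contribute to one output), while the alternating pattern sums adjacent row pairs.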

When matrix multiplication operation in any dimension is performed, input data is split and combined based on a control signal, and the split and combined input data is input to a designated PE unit in a time-sharing mode, so that 64 PE units in a PE array are all in a working state, and in theory, peak computing power can be kept for the matrix multiplication operation in any dimension by the PE array.

The splitting and combining method of the input data is described in more detail below:

Any vector multiplication matrix operation can be zero-padded and expanded into the form [1,8×s] × [8×s,8×s]^T (s is a positive integer); taking [1,30] × [28,30]^T as an example, after zero padding it can be computed as [1,32] × [32,32]^T;

FIG. 2 shows a vector multiplication matrix operation of [1,8×s] × [8×s,8×s]^T;
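The zero-padding step can be sketched as follows, using the [1,30] × [28,30]^T example from the text. The helper names are hypothetical and the data is arbitrary; the sketch only shows that padding with zeros leaves the first 28 outputs unchanged.

```python
from typing import List, Tuple


def pad_to_multiple_of_8(a: List[int],
                         w: List[List[int]]) -> Tuple[List[int], List[List[int]], int]:
    """Zero-pad a ([1,K]) and w ([N,K]) to [1,8s] and [8s,8s]."""
    k = len(a)
    n = len(w)
    s = max(1, -(-max(k, n) // 8))  # ceiling division by 8
    size = 8 * s
    a_pad = list(a) + [0] * (size - k)
    w_pad = [list(row) + [0] * (size - k) for row in w]
    w_pad += [[0] * size for _ in range(size - n)]
    return a_pad, w_pad, s


def vec_mat_t(a: List[int], w: List[List[int]]) -> List[int]:
    """Reference a x w^T: one dot product per row of w."""
    return [sum(x * y for x, y in zip(a, row)) for row in w]


# [1,30] x [28,30]^T, padded to [1,32] x [32,32]^T (s = 4).
a = [c % 7 for c in range(30)]
w = [[(r + 2 * c) % 5 for c in range(30)] for r in range(28)]
a_pad, w_pad, s = pad_to_multiple_of_8(a, w)
reference = vec_mat_t(a, w)
padded = vec_mat_t(a_pad, w_pad)
```

The last four entries of `padded` come from all-zero rows of the padded matrix and can simply be discarded.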

Vector A is split into s vectors of dimension [1,8], one of which is denoted A[m]; matrix W is split into s×s matrices of dimension [8,8], one of which is denoted W[m][n], where m represents the row and n the column. One row of the PE array can complete exactly one A[m] × W[m][n]^T vector-matrix multiplication in 1 cycle, and the whole PE array can complete exactly 8 A[m] × W[m][n]^T vector-matrix multiplications in 1 cycle;

Therefore, to perform a [1,8×s] × [8×s,8×s]^T vector-matrix multiplication, the multiple A[m] × W[m][n]^T operations are distributed evenly over the PE array across different time slots and spatial positions;
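The block split can be sketched in a few lines. `split_vector` and `split_matrix` are hypothetical helper names; the indexing follows the convention in the text that the first index of a W block is its row and the second its column.

```python
from typing import List


def split_vector(a: List[int]) -> List[List[int]]:
    """Split a [1,8s] vector into s vectors of dimension [1,8]."""
    assert len(a) % 8 == 0
    return [a[8 * m:8 * m + 8] for m in range(len(a) // 8)]


def split_matrix(w: List[List[int]]) -> List[List[List[List[int]]]]:
    """Split an [8s,8s] matrix into s x s blocks of dimension [8,8].

    blocks[m][n] is the block in block-row m and block-column n.
    """
    s = len(w) // 8
    assert len(w) == 8 * s and all(len(row) == 8 * s for row in w)
    return [[[row[8 * n:8 * n + 8] for row in w[8 * m:8 * m + 8]]
             for n in range(s)]
            for m in range(s)]


# s = 2: a [1,16] vector and a [16,16] matrix with easily recognised entries.
a = list(range(16))
w = [[16 * r + c for c in range(16)] for r in range(16)]
a_blocks = split_vector(a)
w_blocks = split_matrix(w)
```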

in matrix multiplication, the critical path affecting the clock frequency of the PE array is the time to complete the multiply and add operation once every 1 cycle, i.e., the time to complete the vector multiplication and vector addition operations in 1 cycle.

In addition, there are mainly two ways of combining A[m] and W[m][n]. The first focuses on low power consumption: after data is read once, it is reused as much as possible, reducing the energy consumed by data movement. The second focuses on high performance: it shortens the time data spends in the PE array as much as possible and moves operations unrelated to vector multiplication (such as accumulation) to the external ACC unit, thereby raising the clock frequency. A matrix multiplication whose dimensions do not exceed [8,8] can be completed within 1 cycle, so neither reducing data movement nor accumulating PE array results from different time slots in the ACC unit is involved, and the high-performance mode and the low-power mode are therefore identical for such operations;

(1) When s=1, 2,4, the original input data is too small to fully utilize the bandwidth of the PE array, so that the PE array of the present disclosure improves the utilization rate of the PE array by broadcasting the A-direction input or the W-direction input, and shortens the operation time;

As shown in FIG. 3, in replication method one (low power) for A[m] and W[m][n], the original A-direction input (A[0]-A[s-1]) is replicated 8/s times, each copy forming 1 group of 8×s numbers; the original W-direction input is divided into groups of 8 columns each, every group containing 64×s numbers. Thus s rows of the PE array can complete, in 1 cycle, the vector-matrix multiplication of 1 group of A-direction input with 1 group of W-direction input; the whole PE array can complete 8/s groups, and all operations can be completed in s cycles.

As shown in FIG. 4, in replication method two (high performance), the A-direction input of replication method one is transposed to obtain s vectors of dimension [1,8×8/s], each row forming 1 group of 8×8/s numbers; the original W-direction input is divided into groups of 8 rows each, every group containing 64×s numbers. Thus 8/s rows of the PE array can complete the vector-matrix multiplication of 1 group of A-direction input with 1 group of W-direction input, the whole PE array can complete s groups, and all operations can be completed in s cycles in total.

(2) As shown in FIG. 5, when s=8, the combination of A[m] and W[m][n] in the low-power mode and the PE array inputs in each cycle are as follows:

The 1 st period, 8 vectors (A0-A7) are sequentially taken from A and are respectively used as the input of PE arrays A0-A7, and the input of the A-direction is kept unchanged for 8 periods;

after 8 cycles, all operations related to A0-A7 are completed;

As shown in FIG. 6, when s=8, the combination of A[m] and W[m][n] in the high-performance mode and the PE array inputs in each cycle are as follows:

In the (k+1)th cycle, the kth vector A[k] of A is input and broadcast to A0-A7 of the PE array, and 8 matrices (W[0][k]-W[7][k]) are taken from W in row order as the W-direction inputs of the 8 rows of the PE array respectively; all operations on A are completed after 8 cycles;

(3) When s=3, the A-direction and W-direction inputs can be expanded to the form of s=4, and when s=5, 6, 7, to the form of s=8, and then operated on according to the rules in (1) and (2);

(4) When s >8, dividing A into s/8 segments, dividing W into s/8 blocks, and then calculating the split A and W according to the operation rule when s=8.
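The case analysis in (1) to (3) amounts to rounding s up to the next supported block count. A minimal sketch, with the hypothetical name `expand_s`, is given below; it deliberately leaves s > 8 out, since rule (4) handles that case by tiling into s = 8 pieces rather than by expansion.

```python
def expand_s(s: int) -> int:
    """Round s (1..8) up to the nearest supported value among 1, 2, 4, 8.

    s = 1, 2, 4, 8 are used directly; s = 3 is expanded to 4; s = 5, 6, 7
    are expanded to 8. Values above 8 are tiled per rule (4) and are not
    covered by this sketch.
    """
    if not 1 <= s <= 8:
        raise ValueError("this sketch only covers 1 <= s <= 8")
    for supported in (1, 2, 4, 8):
        if s <= supported:
            return supported
    raise AssertionError("unreachable")


expanded = [expand_s(s) for s in range(1, 9)]
```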

The following description is made in connection with other embodiments:

Example 1 [8,8] × [8,8]^T matrix multiplication

The matrix multiplication [8,8] × [8,8]^T is common in neural network convolution operations. FIG. 7 shows a standard PE array matrix multiplication. The matrix A is the input data of A0-A7; its dimension is [8,8], and it is divided by row into eight equal parts A[0]-A[7], each a vector of [1,8], one of which is denoted A[i];

the matrix W is the input data of W00-W77; its dimension is [8,8], and it is divided by row into eight equal parts W[0]-W[7], each a vector of [1,8], one of which is denoted W[j];

The matrix C is the output data, the C dimension is [8,8], and the value at position (i, j) C is denoted as C [ i ] [ j ];

When the PE array of the disclosure performs an [8,8] × [8,8]^T matrix multiplication, the data input mode and calculation steps are as follows:

(1) A[0]-A[7] are input to A0-A7 of the PE array respectively, i.e., the A-direction inputs of the 8 PE units in each row are the same; W[0]-W[7] are input to Wi0-Wi7 of the PE array respectively, i.e., the W-direction inputs of the 8 PE units in each column are the same;

(2) The output of each PE unit is:

Psum(i,j)=A[i]×W[j]^T

it follows directly that C[i][j] equals Psum(i,j);

(3) C[0]-C[7] are obtained after 1 cycle, completing the matrix multiplication [8,8] × [8,8]^T = [8,8].

For matrix multiplication operation (matrix multiplication with matrix dimension not exceeding [8,8 ]) which can be completed only by 1 period, data input in the A direction and data input in the W direction are used for operation of 1 period, so that operation of the high-performance mode and the low-power mode of the operation is consistent, and distinction is not made.
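As a sanity check on Example 1, the sketch below models the single cycle in software and can be compared against an ordinary A × W^T product. The function name `matmul_8x8_one_cycle` and the test data are illustrative assumptions, not part of the disclosure.

```python
from typing import List


def matmul_8x8_one_cycle(a: List[List[int]],
                         w: List[List[int]]) -> List[List[int]]:
    """Model of Example 1: PE (i, j) receives A[i] and W[j].

    Every PE in row i shares A[i]; every PE in column j shares W[j],
    so the array behaves like a conventional PE array.
    Returns C with C[i][j] = Psum(i, j) = A[i] . W[j].
    """
    assert len(a) == 8 and len(w) == 8
    return [[sum(x * y for x, y in zip(a[i], w[j])) for j in range(8)]
            for i in range(8)]


# Arbitrary illustrative data.
a = [[i + t for t in range(8)] for i in range(8)]
w = [[(j * t) % 3 for t in range(8)] for j in range(8)]
c = matmul_8x8_one_cycle(a, w)

# Reference: explicit multiplication of A by the transpose of W.
w_t = [list(col) for col in zip(*w)]
reference = [[sum(a[i][t] * w_t[t][j] for t in range(8)) for j in range(8)]
             for i in range(8)]
```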

Example 2 [1,64] × [64,64]^T vector multiplication matrix operation

Aiming at different application scene requirements, the hardware structure designed by the disclosure supports two modes of low power consumption and high performance to perform vector multiplication matrix operation;

In the low-power mode of the [1,64] × [64,64]^T vector-matrix multiplication, the PE array reads 8 different A-direction inputs once and reuses them for 8 cycles, which greatly reduces the energy consumed by data movement during the operation; in this mode, 8×7 additional adders (2 inputs, 1 output) are provided inside the PE array to sum the operation results of the PE array in each cycle;

In the high-performance mode, the A-direction and W-direction inputs of the PE array differ in every cycle, but the per-cycle operation results need not be summed inside the PE array and are output directly to the external ACC unit.

As shown in FIG. 8, vector A is the input data of A0-A7, with dimension [1,64]; it is divided into eight equal parts A[0]-A[7], each a vector of [1,8], one of which is denoted A[i];

The matrix W is the input data of W00-W77, with dimension [64,64]; it is divided equally into 64 [8,8] matrices, one of which is denoted W[k][i], where k represents the row and i the column;

Vector C is the output, with dimension [1,64]; it is divided into eight equal parts, one of which is denoted C[k] with dimension [1,8], and each number in C[k] is denoted C[k][j];

as shown in FIG. 9, W[k][i] is divided by row into eight equal parts W[k][i][0]-W[k][i][7], each a vector of [1,8], one of which is denoted W[k][i][j];

low power consumption mode:

the vector-matrix multiplication to be computed in the (k+1)th cycle of the PE array is A[i] × W[k][i]^T, i = 0, 1, ..., 7; the vector-matrix multiplications computed by the PE array in each cycle are shown in FIG. 10;

In the PE array of the disclosure, when performing [1,64] × [64,64]^T vector multiplication matrix operation in a low power consumption mode, the data input mode and the calculation steps are as follows:

(1) In the 1st cycle, A[0]-A[7] are input to A0-A7 of the PE array respectively, and these 8 [1,8] vectors are then held unchanged (saving 7 cycles of A-direction data reads); as shown in FIG. 11, in the (k+1)th cycle the 8 [8,8]^T matrices in W[k] are input to the corresponding rows of the PE array respectively, and the vector C[k] with dimension [1,8] is obtained after 1 cycle;

(2) Taking k=0 as an example, A[0]-A[7] are input to A0-A7 of the PE array respectively, and the 8 matrices W[0][i] are input to the ith row of the PE array respectively;

(3) The output of each PE unit is:

Psum(i,j)=A[i]×W[0][i][j]^T

(4) Summing psums with the same j in 8 rows of the PE array to obtain 8 values;

(5) Splicing the 8 values to obtain a vector C[0] with dimension [1,8];

C[0]={C[0][0],C[0][1],...,C[0][j],...,C[0][7]}

(6) With the A0-A7 inputs unchanged, steps (2)-(5) are repeated for W[1]-W[7]; C[0]-C[7] are obtained within 8 cycles, completing the vector-matrix multiplication [1,64] × [64,64]^T = [1,64].
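Steps (1) to (6) of the low-power mode can be modelled in software as follows. This is a sketch under assumptions: `low_power_vecmat` is a hypothetical name, the test data is arbitrary, and the in-array column adders are modelled as a plain sum over the 8 rows.

```python
from typing import List


def low_power_vecmat(a: List[int], w: List[List[int]]) -> List[int]:
    """Model of [1,64] x [64,64]^T in low-power mode (8 cycles).

    The 8 vectors A[0]..A[7] are loaded once and held for all cycles.
    In cycle k, PE (i, j) receives W[k][i][j], i.e. row j of block W[k][i],
    and the 8 Psum values with the same j are summed across the rows.
    """
    assert len(a) == 64 and len(w) == 64
    a_blocks = [a[8 * i:8 * i + 8] for i in range(8)]  # held for 8 cycles
    c: List[int] = []
    for k in range(8):  # one loop iteration models one cycle
        psum = [[sum(x * y for x, y in
                     zip(a_blocks[i], w[8 * k + j][8 * i:8 * i + 8]))
                 for j in range(8)]
                for i in range(8)]
        # Spatial summation down each column yields C[k][0..7].
        c.extend(sum(psum[i][j] for i in range(8)) for j in range(8))
    return c


# Arbitrary illustrative data.
a = [(3 * col + 1) % 7 for col in range(64)]
w = [[(5 * r + 2 * col) % 11 - 5 for col in range(64)] for r in range(64)]
c_low = low_power_vecmat(a, w)
reference = [sum(a[col] * w[r][col] for col in range(64)) for r in range(64)]
```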

High performance mode:

The vector-matrix multiplication to be computed in the (k+1)th cycle of the PE array is A[k] × W[i][k]^T, i = 0, 1, ..., 7; the vector-matrix multiplications computed by the PE array in each cycle are shown in FIG. 12;

in the high performance mode of the PE array, when performing [1,64] × [64,64]^T vector multiplication matrix operation, the data input mode and the calculation steps are as follows:

(1) As shown in FIG. 13, in the (k+1)th cycle a vector A[k] with dimension [1,8] is broadcast to A0-A7 of the PE array, and the 8 [8,8] matrices W[i][k] (i = 0, 1, ..., 7) in W are input to the ith row of the PE array respectively; after 1 cycle each PE unit outputs 1 number, and the outputs of the 8 PEs of the ith row are spliced into a vector to be accumulated with dimension [1,8], denoted C^k[i];

(2) Taking k=0 as an example, A[0] is broadcast to A0-A7 of the PE array, and the 8 matrices W[i][0] are input to the ith row of the PE array respectively;

(3) The output of each PE unit is:

Psum(i,j)=A[0]×W[i][0][j]^T

(4) The outputs Psum of the 8 PE units of the ith row are spliced into a vector with dimension [1,8] and input to the ACC unit for accumulation, i.e., added to the vector stored in the Buf unit in the previous cycle; the accumulated result of C^k[i] is output to the external Buf unit, where it waits to be accumulated in the time domain with the PE array result of the next cycle;

(5) Steps (2)-(4) are repeated for A[1]-A[7], and in each cycle the ACC unit accumulates the newly obtained C^k[i] with the accumulated result of the previous cycles:

C[i]=C^0[i]+C^1[i]+...+C^7[i]

(6) After 8 cycles, the 8 vectors C[i] are spliced to obtain the vector C with dimension [1,64], completing the vector-matrix multiplication [1,64] × [64,64]^T = [1,64] in 8 cycles.
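The high-performance mode can be modelled in the same style. In this sketch (hypothetical name `high_performance_vecmat`, arbitrary data) the PE array performs no summation at all; a list named `buf` stands in for the Buf unit and the element-wise addition stands in for the ACC unit.

```python
from typing import List


def high_performance_vecmat(a: List[int], w: List[List[int]]) -> List[int]:
    """Model of [1,64] x [64,64]^T in high-performance mode (8 cycles).

    In cycle k, A[k] is broadcast to all 8 rows and row i receives block
    W[i][k]; PE (i, j) uses row j of that block. Row i emits the partial
    vector C^k[i], which is accumulated outside the array.
    """
    assert len(a) == 64 and len(w) == 64
    buf = [[0] * 8 for _ in range(8)]  # stands in for Buf: running C[i]
    for k in range(8):  # one loop iteration models one cycle
        a_k = a[8 * k:8 * k + 8]  # broadcast to A0-A7
        for i in range(8):
            partial = [sum(x * y for x, y in
                           zip(a_k, w[8 * i + j][8 * k:8 * k + 8]))
                       for j in range(8)]
            # Stands in for ACC: add this cycle's C^k[i] to the stored value.
            buf[i] = [acc + p for acc, p in zip(buf[i], partial)]
    return [value for row in buf for value in row]


# Arbitrary illustrative data.
a = [(3 * col + 1) % 7 for col in range(64)]
w = [[(5 * r + 2 * col) % 11 - 5 for col in range(64)] for r in range(64)]
c_high = high_performance_vecmat(a, w)
reference = [sum(a[col] * w[r][col] for col in range(64)) for r in range(64)]
```

Compared with the low-power sketch, both the A-direction and W-direction data change every cycle, but the only additions happen outside the array.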

Example 3 [4,16] × [16,16]^T matrix multiplication

As shown in FIG. 14, the matrix multiplication is [4,16] × [16,16]^T, where matrix A is the input data of A0-A7 with dimension [4,16]; it is first divided by row into four equal parts A[0]-A[3], one of which is denoted A[m], and A[m] is then divided by column into two equal parts A[m][0] and A[m][1], each a vector of [1,8], one of which is denoted A[m][n];

the matrix W is the input data of W00-W77 with dimension [16,16]; it is split into 4 matrices of [8,8], one of which is denoted W[i][j], where i represents the row and j the column;

The matrix C is the output of the matrix multiplication; its dimensions and block count are the same as those of matrix A; each small block in the figure is a vector of [1,8], one of which is denoted C[m][n];

low power consumption mode:

The PE array can complete the vector multiplication matrix operation of 8 [1,8] × [8,8]^T in each period, as shown in FIG. 15, which is a split combination mode of A and W in the low power consumption mode;

when the PE array disclosed by the disclosure performs low-power operation of the [4,16] × [16,16]^T vector multiplication matrix, a data input mode and calculation steps are as follows:

(1) In the 1st cycle, as shown in FIG. 16, according to the split combination of A and W shown in FIG. 15, the 4 [1,8] vectors of the 1st cycle in A are input to A0-A7 of the PE array (each [1,8] vector is broadcast to 2 rows of the PE array), the 8 [8,8]^T matrices in W are input to W00-W77 of the PE array, and C[0][0]-C[1][1] are obtained after 1 cycle;

(2) In the 2nd cycle, as shown in FIG. 17, the W-direction input remains unchanged (saving one cycle's read of 8 [8,8] matrices), the A-direction input is changed to the 64 input values of the 2nd cycle in A, and C[2][0]-C[3][1] are obtained after 1 cycle;

(3) The C[m][n] obtained in each cycle is given by C[m][n] = A[m][0] × W[n][0]^T + A[m][1] × W[n][1]^T;

(4) After 2 cycles, a complete output matrix C is obtained, and matrix multiplication calculation of [4,16] × [16,16]^T = [4,16] is completed.
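The two-cycle low power schedule can be illustrated with a small software model. This is a minimal sketch rather than the disclosed hardware: the function names (`vec_mat_t`, `matmul_t`, `low_power_4x16`) and the exact assignment of W blocks to PE rows are assumptions chosen to be consistent with the steps above, since FIG. 15 to FIG. 17 are not reproduced here. The model holds the W blocks for both cycles, changes only the A-direction input, and adds the outputs of two adjacent PE rows in the spatial domain.

```python
import random

def vec_mat_t(a, w):
    """One PE row: [1,s] x [n,s]^T, i.e. dot `a` with every row of `w`."""
    return [sum(x * y for x, y in zip(a, row)) for row in w]

def matmul_t(A, W):
    """Reference result A x W^T computed directly."""
    return [vec_mat_t(a, W) for a in A]

def low_power_4x16(A, W):
    """A is 4x16, W is 16x16; returns (C, a_loads, w_loads)."""
    # Wb[n][k]: the [8,8] block at block row n, block column k of W.
    Wb = [[[row[8 * k:8 * k + 8] for row in W[8 * n:8 * n + 8]]
           for k in range(2)] for n in range(2)]
    # Assumed row assignment, loaded once and then held:
    # PE rows 0-3 and 4-7 each hold W[0][0], W[0][1], W[1][0], W[1][1].
    w_rows = [Wb[n][k] for _ in range(2) for n in range(2) for k in range(2)]
    w_loads, a_loads = 1, 0
    C = [[0] * 16 for _ in range(4)]
    for cycle in range(2):
        a_loads += 1                      # only the A-direction input changes
        for half in range(2):             # upper / lower 4 PE rows
            m = 2 * cycle + half          # row of A handled by this half
            for n in range(2):
                base = 4 * half + 2 * n   # first PE row of the adjacent pair
                p0 = vec_mat_t(A[m][0:8], w_rows[base])
                p1 = vec_mat_t(A[m][8:16], w_rows[base + 1])
                # spatial-domain addition inside the PE array
                C[m][8 * n:8 * n + 8] = [x + y for x, y in zip(p0, p1)]
    return C, a_loads, w_loads

random.seed(0)
A = [[random.randint(-9, 9) for _ in range(16)] for _ in range(4)]
W = [[random.randint(-9, 9) for _ in range(16)] for _ in range(16)]
C, a_loads, w_loads = low_power_4x16(A, W)
print(C == matmul_t(A, W), a_loads, w_loads)
```

In this model each A[m][k] is used on 2 PE rows per cycle, matching the broadcast described in step (1), and the W-direction input is loaded only once.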

High performance mode:

FIG. 18 shows how A and W are split and combined in the high performance mode;

When the PE array of the present disclosure performs the [4,16] × [16,16]^T matrix multiplication in the high performance mode, the data input scheme and calculation steps are as follows:

(1) In the 1st cycle, as shown in FIG. 19, the 8 [1,8] vectors of A assigned to the 1st cycle are input to A0-A7 of the PE array according to FIG. 18, and the 2 [8,8] matrices of W are input to W00-W77 of the PE array (each [8,8] matrix is broadcast to 4 rows of the PE array); after 1 cycle, 8 [1,8] vectors to be accumulated are obtained, denoted C'[m][n];

(2) In the 2nd cycle, as shown in FIG. 20, the A input is kept unchanged and the W input is changed to the 8 matrix inputs of W assigned to the 2nd cycle; 8 [1,8] vectors are obtained after 1 cycle, and once the ACC unit has accumulated them with the C'[m][n] of the previous cycle, all C[m][n] are obtained, completing the matrix multiplication [4,16] × [16,16]^T = [4,16].

Example 4 [2,32] × [32,32]^T matrix multiplication

As shown in FIG. 21, the matrix multiplication is [2,32] × [32,32]^T. Matrix A, of dimension [2,32], is the input data for A0-A7. It is divided by rows into A[0] and A[1], one of which is denoted A[m]; each A[m] is then divided by columns into A[m][0]-A[m][3], each a [1,8] vector, one of which is denoted A[m][n];

Matrix W, of dimension [32,32], is the input data for W00-W77. It is divided into four equal parts by rows and four equal parts by columns, giving 16 matrices of [8,8], one of which is denoted W[i][j], where i is the block row and j is the block column;

Low power consumption mode:

The PE array can complete 8 vector-matrix multiplications of [1,8] × [8,8]^T per cycle. FIG. 22 shows how A and W are split and combined in the low power consumption mode;

When the PE array of the present disclosure performs the [2,32] × [32,32]^T matrix multiplication in the low power consumption mode, the data input scheme and calculation steps are as follows:

(1) In the 1st cycle, as shown in FIG. 23, following the split and combination of A and W shown in FIG. 22, the 4 [1,8] vectors of A assigned to the 1st cycle are input to A0-A7 of the PE array (each [1,8] vector is broadcast to 2 rows of the PE array), 8 [8,8] matrices of W are input to W00-W77 of the PE array, and C[0][0] and C[0][1] are obtained after 1 cycle;

(2) In the 2nd cycle, the W input is kept unchanged (saving one cycle's movement of 8 [8,8] matrices of data), the A input is changed to the 8 [1,8] vectors of A assigned to the 2nd cycle, and C[1][0] and C[1][1] are obtained after 1 cycle;

(4) After 4 cycles, a complete output matrix C is obtained, and matrix multiplication calculation of [2,32] × [32,32]^T = [2,32] is completed.

High performance mode:

In the (k+1)th cycle, the vector-matrix multiplications to be computed by the PE array are A[m][k] × W[i][k]^T (m=0,1; i=0,1,2,3); the multiplications computed in each of the 4 cycles are shown in FIG. 23A;

In the high performance mode, when the PE array of the present disclosure performs the [2,32] × [32,32]^T matrix multiplication, the data input scheme and calculation steps are as follows:

(1) As shown in FIG. 23B, in the (k+1)th cycle the [1,8] vector A[0][k] is broadcast to A0-A3 of the PE array and, likewise, the vector A[1][k] is broadcast to A4-A7; the matrix W[i][k] (i=0,1,2,3) is input to the ith row and the (i+4)th row of the PE array. After 1 cycle each PE unit outputs 1 number, and the outputs of the 8 PEs in the same row are concatenated into a [1,8] vector to be accumulated, denoted Ck[m][i];

(2) As shown in FIG. 23C, taking k=0 as an example, A[0][0] is broadcast to A0-A3 of the PE array, A[1][0] is broadcast to A4-A7, and the 4 matrices W[i][0] (i=0,1,2,3) are input both to the first 4 rows and to the last 4 rows of the PE array;

(3) The outputs Psum of the 8 PE units in the (m×4+i)th row are concatenated into a [1,8] vector, which is fed to the ACC unit and added to the previous cycle's vector stored in the Buf unit; the accumulated result is Ck[m][i];

(4) Ck[m][i] is output to the external Buf unit, where it waits to be accumulated in the time domain with the PE array output of the next cycle;

(5) Steps (2)-(4) are repeated for A[m][n] (m=0,1; n=1,2,3), and in each cycle the output Ck[m][i] of the (m×4+i)th row of the PE array is accumulated in the ACC unit. The accumulated result after 4 cycles is C[m][i] = A[m][0] × W[i][0]^T + A[m][1] × W[i][1]^T + A[m][2] × W[i][2]^T + A[m][3] × W[i][3]^T;

(6) After 4 cycles, the 4 vectors C[0][i] (i=0,1,2,3) are concatenated to obtain the vector C[0] of dimension [1,32], and the vector C[1] of dimension [1,32] is obtained in the same way, completing the matrix multiplication [2,32] × [32,32]^T = [2,32] in 4 cycles.
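The high performance steps (1)-(6) can be checked with a short software model. It is only a sketch under stated assumptions: `high_performance_2x32` and its helpers are illustrative names, the Buf unit is modelled as 8 vectors initialised to 0, and the ACC unit as an element-wise addition. The row mapping (row m×4+i computes A[m][k] × W[i][k]^T in cycle k+1) is taken from the steps above.

```python
import random

def vec_mat_t(a, w):
    """One PE row: [1,s] x [n,s]^T, i.e. dot `a` with every row of `w`."""
    return [sum(x * y for x, y in zip(a, row)) for row in w]

def matmul_t(A, W):
    """Reference result A x W^T computed directly."""
    return [vec_mat_t(a, W) for a in A]

def high_performance_2x32(A, W):
    """A is 2x32, W is 32x32; 4 cycles of time-domain accumulation."""
    buf = [[0] * 8 for _ in range(8)]       # Buf: one [1,8] vector per PE row
    for k in range(4):                      # cycle k+1
        for m in range(2):
            a_mk = A[m][8 * k:8 * k + 8]    # A[m][k], broadcast to rows 4m..4m+3
            for i in range(4):
                # W[i][k]: block row i, block column k of W
                w_ik = [row[8 * k:8 * k + 8] for row in W[8 * i:8 * i + 8]]
                psum = vec_mat_t(a_mk, w_ik)            # output of PE row m*4+i
                r = 4 * m + i
                buf[r] = [b + p for b, p in zip(buf[r], psum)]  # ACC + Buf
    # Concatenate C[m][0..3] into one [1,32] row per m.
    return [[x for i in range(4) for x in buf[4 * m + i]] for m in range(2)]

random.seed(1)
A2 = [[random.randint(-9, 9) for _ in range(32)] for _ in range(2)]
W2 = [[random.randint(-9, 9) for _ in range(32)] for _ in range(32)]
C2 = high_performance_2x32(A2, W2)
print(C2 == matmul_t(A2, W2))
```

No addition takes place between PE rows in this model; all summation over k happens in the accumulator, which is what separates this mode from the spatial-domain addition of the low power consumption mode.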

Matrix multiplication operations of different dimensions that can be implemented by the PE array of the present disclosure are shown in table 1 below:

TABLE 1

In addition to the matrix multiplication types shown in the table, the PE array can handle any W-direction input that is not one of the listed matrices by zero-padding it to the nearest matrix in the table, while the A-direction input is zero-padded to the same column width. Taking [1,5] × [5,6]^T as an example, after zero padding and expansion it can be computed in the same way as [1,8] × [8,8]^T;

Any [N,s] × [s,s]^T matrix multiplication can be split into N vector-matrix multiplications of [1,s] × [s,s]^T, and is therefore completed in at most N × cycle cycles (where cycle is the number of cycles required by one such vector-matrix multiplication).
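Both observations, zero padding and row-wise splitting, can be illustrated in a few lines. The sketch below is an assumption-laden example: the helper names are hypothetical, and the shapes (a [1,5] vector and a weight matrix with 6 rows of length 5) are chosen only so that the product is defined. It pads both operands to the [1,8] × [8,8]^T form and shows that the padding leaves the meaningful outputs unchanged.

```python
def pad_to(mat, rows, cols):
    """Zero-pad a list-of-lists matrix to rows x cols."""
    padded = [row + [0] * (cols - len(row)) for row in mat]
    padded += [[0] * cols for _ in range(rows - len(mat))]
    return padded

def vec_mat_t(a, w):
    """[1,s] x [n,s]^T: dot `a` with every row of `w`."""
    return [sum(x * y for x, y in zip(a, row)) for row in w]

def matmul_by_rows(A, W):
    """[N,s] x [s,s]^T computed as N independent vector-matrix products."""
    return [vec_mat_t(a, W) for a in A]

a = [[1, 2, 3, 4, 5]]                                # [1,5]
w = [[r + c for c in range(5)] for r in range(6)]    # 6 rows of length 5
a8 = pad_to(a, 1, 8)                                 # [1,8]
w8 = pad_to(w, 8, 8)                                 # [8,8]
c8 = matmul_by_rows(a8, w8)[0]
print(c8[:6] == vec_mat_t(a[0], w), c8[6:])
```

The padded columns contribute only zero products and the padded rows of W yield zero outputs, so the first 6 entries of the padded result equal the unpadded product.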

In addition, the present disclosure also discloses the application of the arithmetic unit, especially the PE array thereof, in a typical MPU. Furthermore, the present disclosure also discloses an MPU including the arithmetic unit.

The layout of the PE array of the present disclosure in a typical MPU is shown in fig. 24, wherein:

(1) The control module mainly generates the various control signals that govern each module;

In the present disclosure, the operating modes are a high performance mode and a low power consumption mode, with the high performance mode as the default. By way of example, the control's operating mode signal is encoded as shown in Table 2 below:

TABLE 2

Assuming that the two matrices to be multiplied have the form [m,s] × [n,s]^T, in this patent each of the 3 dimensions m, s and n may take the values 1, 2, 4 or 8 (a dimension of 3 can be extended to 4 by zero padding; a dimension of 5, 6 or 7 can be extended to 8 by zero padding). There are therefore 4 × 4 × 4 = 64 matrix multiplication forms in total, and the control's matrix dimension signal needs 6 bits to distinguish them, see Table 3 below;

TABLE 3

The control's operating mode signal and matrix dimension signal must be sent to lm_A and lm_W, where together they determine how the data of matrices A and W are accessed in each cycle;

The operating mode signal must also be sent to PE_array, ACC and Buf, where it selects whether the outputs Psum of the PE units undergo time-domain accumulation or spatial-domain addition;

The matrix dimension signal is sent to PE_array, ACC and Buf, where it determines the number of cycles the PE array needs for matrix multiplications of different dimensions, and hence whether the multiplication has completed and whether Buf may output the result;

(2) lm_A (local memory A) and lm_W (local memory W) are storage resources of 64 bytes and 512 bytes respectively (assuming each input datum is 1 byte). They temporarily store the A-direction and W-direction inputs, that is, the data of the two matrices A and W to be multiplied;

Based on the control's matrix dimension signal and operating mode signal, lm_A and lm_W read the specified data of matrix A and matrix W and input them to the specified PE units;

(3) PE_array is the PE array. It carries out the matrix operations in the various modes and enables its internal adders according to the control's operating mode signal. In the low power consumption mode, the adders in the PE array are enabled to perform spatial-domain addition, and the results of the PE array do not need to be accumulated in the ACC unit;

(4) The ACC unit accumulates the outputs of the PE array across time. In the high performance mode, the outputs of the 8 rows of PE units are the first inputs of the 8 adders of the ACC unit, the 8 vectors stored in the Buf unit are the second inputs of those 8 adders, and the ACC unit writes the results of the 8 adders back to the Buf unit;

(5) The Buf (buffer) unit temporarily stores the accumulated results of the ACC unit; its initial value is 0. Based on the control's operating mode signal, the Buf unit decides whether to feed its stored data to the ACC unit for accumulation: in the high performance mode the stored value is the second input of the ACC unit for time-domain accumulation, while in the low power consumption mode no data is sent to the ACC unit.

The control's matrix dimension signal determines the number of cycles required for matrix multiplications of different dimensions. By counting cycles against this number, the Buf unit can tell whether the multiplication has completed, and when it has, the Buf unit outputs the stored data from the MPU.
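The matrix dimension signal described under item (1) can be illustrated as follows. The encoding here is purely hypothetical, since the actual bit assignment is given in Table 3, which is not reproduced in this text; the sketch only demonstrates that rounding each of m, s and n up to one of {1, 2, 4, 8} yields 64 forms, which fit in 6 bits. The names `padded_dim` and `dimension_signal` are assumptions.

```python
SUPPORTED = (1, 2, 4, 8)

def padded_dim(d):
    """Round a dimension up to the nearest supported size (3 -> 4, 5..7 -> 8)."""
    for s in SUPPORTED:
        if d <= s:
            return s
    raise ValueError("dimension larger than 8 is not covered by this sketch")

def dimension_signal(m, s, n):
    """Hypothetical 6-bit code: 2 bits each for m, s, n (not the patent's Table 3)."""
    code = 0
    for d in (m, s, n):
        code = (code << 2) | SUPPORTED.index(padded_dim(d))
    return code

codes = {dimension_signal(m, s, n)
         for m in SUPPORTED for s in SUPPORTED for n in SUPPORTED}
print(len(codes), max(codes).bit_length())
```

Any other one-to-one assignment of the 64 forms to 6-bit codes would serve equally well; the 2-bits-per-dimension layout is used only because it is easy to read.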

It will be appreciated that, following the usual abbreviations in the processor field, control denotes the control module, PE_array the PE array structure, ACC the accumulator, and Buf the buffer.

In addition, it should be noted that:

In the high performance mode, the PE array can compute 8 different [1,8] × [8,8]^T operations every cycle, and time-domain accumulation of the results is performed by the ACC unit. The additions of the matrix multiplication are thus moved into the ACC unit, which performs one addition per cycle, shortening the critical path of the PE array and raising the clock frequency;

Assume that the delay of one vector multiplication (performed by a PE unit) is 1 ns and the delay of one vector addition is 0.3 ns. For the matrix operations of Examples 2, 3 and 4, the critical path delays and clock frequencies in the two modes are shown in Table 4 below:

TABLE 4

The addition operation in the high performance mode is performed in the ACC unit, but not in the PE array, so that the addition delay is not accumulated in the PE array critical path delay.
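The relation between addition delay and clock frequency can be sketched numerically using the delays assumed above (1 ns per multiplication, 0.3 ns per addition). The number of adder stages on the PE array critical path is left as a parameter, because the adder structure used for each example is given only in Table 4, which is not reproduced here; the function names are illustrative.

```python
def critical_path_ns(adder_stages, t_mul=1.0, t_add=0.3):
    """PE array critical path: one multiplication plus any in-array adder stages."""
    return t_mul + adder_stages * t_add

def clock_mhz(delay_ns):
    """Maximum clock frequency for a given critical path delay."""
    return 1000.0 / delay_ns

# High performance mode: additions happen in the ACC unit, so 0 in-array stages.
hp_delay = critical_path_ns(0)
# Illustration only: a path with a single in-array adder stage.
one_stage_delay = critical_path_ns(1)
print(hp_delay, clock_mhz(hp_delay))
print(one_stage_delay, round(clock_mhz(one_stage_delay), 1))
```

Each additional adder stage on the critical path lengthens it by 0.3 ns under these assumptions, which is why moving the additions into the ACC unit permits a higher clock frequency.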

Most of a processor's power consumption during operation comes from data movement, so exploiting spatial locality to get more use out of each data read effectively reduces power consumption.

In the low power consumption mode, 8 groups of different [1,64] × [8,64]^T operations can be calculated by the PE array every cycle, and the specified adders in the PE array are enabled based on the control signals for matrix multiplications of different dimensions, so that spatial-domain addition of the results is achieved;

Spatial-domain addition maximizes the reuse of the W-direction input data, so for the same matrix multiplication and the same number of cycles, the low power consumption mode effectively reduces the amount of data moved;

Assuming that each input datum is 1 byte, one A-direction input moves 64 bytes and one W-direction input moves 512 bytes. Again for the matrix operations of Examples 2, 3 and 4, the amounts of data moved in the two modes are shown in Table 5 below:

TABLE 5

Taking the low power consumption mode of Example 4 as an example, 4×64 means that the A-direction input data is moved in 4 cycles at 64 bytes per cycle, and 2×512 means that the W-direction input data is moved in 2 cycles at 512 bytes per cycle.
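The byte counts in this accounting follow directly from the number of times each input is reloaded. The sketch below reproduces the Example 4 low power figure stated above; the second call is only an illustrative comparison that assumes both inputs are reloaded in each of 4 cycles, and is not taken from Table 5. The constant and function names are assumptions.

```python
A_LOAD_BYTES = 64     # one A-direction input: 8 vectors x 8 numbers x 1 byte
W_LOAD_BYTES = 512    # one W-direction input: 64 vectors x 8 numbers x 1 byte

def data_moved(a_loads, w_loads):
    """Total bytes moved for a given number of A and W input loads."""
    return a_loads * A_LOAD_BYTES + w_loads * W_LOAD_BYTES

# Example 4, low power consumption mode: 4 x 64 + 2 x 512, as stated in the text.
low_power_bytes = data_moved(4, 2)
# Assumed comparison: both inputs reloaded in each of 4 cycles.
reload_every_cycle_bytes = data_moved(4, 4)
print(low_power_bytes, reload_every_cycle_bytes)
```

Because a W-direction load is 8 times larger than an A-direction load, holding W fixed across cycles is where most of the saving comes from.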

It should be emphasized that a matrix multiplication whose dimensions do not exceed [8,8] completes in 1 cycle. It involves neither reducing data movement nor accumulating PE array results across time in the ACC unit, so the high performance mode and the low power consumption mode are identical for such multiplications.

In summary, the present disclosure has the following features:

The disclosure can be applied to accelerator chips for edge-side neural networks in mobile phones, surveillance equipment, automotive electronics and the like, and also to accelerator cards in servers;

The PE operation unit in the disclosure can be realized as fixed-point multiply-accumulate operation or floating-point multiply-accumulate operation;

The PE array of the present disclosure may support operations of similar structures such as 4×4, 16×16, and rectangular matrix operations such as 4×8, 8×4, 8×16, and 16×8, in addition to operations of matrix multiplication of the upper 64 elements.

Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, the present disclosure is not limited to the specific embodiments and fields of application described above, which are merely illustrative and instructive rather than restrictive. Those skilled in the art, having the benefit of this disclosure, may devise many further forms without departing from the scope protected by the appended claims, all of which fall within the protection of the present disclosure.

Claims

1. An MPU comprising:

an arithmetic unit comprising a PE array structure compatible with multi-dimensional matrix multiplication,

The PE array structure comprises:

each of A0-A7 comprises 8 numbers forming a [1,8] vector, called an A-direction input, and the 8 A-direction inputs A0-A7 may be the same or different;

each of W00-W77 comprises 8 numbers forming a [1,8] vector, called a W-direction input; the 64 W-direction inputs W00-W77 are sent to the PE units at the corresponding positions and may be the same or different;

for each PE unit (Processing Element), wherein:

As a basic processing unit in the PE array, there are two inputs, one input in the A direction, one input in the W direction and one output;

Taking vectors W00-W07 of the first row of the PE array as 8 column vectors of a matrix of [8, 8], and then respectively inputting the vectors into PE units at corresponding positions;

One row of the PE array can complete vector multiplication matrix operation of [1, 8] × [8, 8]^T in 1 period;

Different combinations are carried out on 8 rows of vector multiplication matrixes in the PE array, so that vector multiplication matrixes and matrix multiplication matrix operation with different dimensions are realized;

the MPU also comprises control, ACC, Buf, lm_A (local memory A) and lm_W (local memory W);

lm_A and lm_W for storing input in A and W directions;

PE_array is used for realizing matrix operation in various modes;

Buf, which is used for storing the accumulated results of ACC, and for the incomplete operation, the results are returned to ACC, and for the completed operation, the results are output from MPU;

the default mode of the PE array structure is a high-performance mode;

when matrix multiplication operation in any dimension is performed, input data are split and combined based on control signals, and the split and combined input data are input to specified PE units in a time sharing mode, so that 64 PE units in a PE array are all in a working state.

2. The MPU of claim 1, wherein,

3. The MPU of claim 1, wherein,

4. The MPU of claim 1, wherein,

For the same vector multiplication matrix operation, the PE array structure can work in different modes with low power consumption or high performance, and the modes are switchable.