Detailed Description
To further describe the present invention, it is described below with reference to FIGS. 1 to 24.
Various embodiments of the present disclosure are described in detail below.
As shown in FIG. 1, in one embodiment, the present disclosure discloses an operation unit compatible with multi-dimensional matrix multiplication. The operation unit includes a PE array comprising at least 64 PE units (preferably exactly 64, a power of two and also a multiple of 8). Each PE unit has an address (i, j) in the array, where i denotes the row and j denotes the column;
The PE array has two kinds of input. A0-A7 are each a [1,8] vector and are called the A-direction inputs; W00-W77 are each a [1,8] vector and are called the W-direction inputs, and the 64 W-direction inputs are sent to the PE units at the corresponding positions;
The PE array thus has 8 A-direction inputs (A0-A7) and 64 W-direction inputs (W00-W77). Each of A0-A7 comprises 8 values (forming a [1,8] vector), and the 8 A-direction inputs may be the same or different; each of W00-W77 likewise comprises 8 values (forming a [1,8] vector), and the 64 W-direction inputs may be the same or different;
The PE unit (Processing Element) is the basic computing unit of the PE array and has two inputs and one output. Taking the PE unit at position (i, j) as an example, the inputs are Ai and Wij and the output is Psum(i, j); one PE unit can complete one [1,8] × [1,8]T vector multiplication in 1 cycle;
Taking the 8 PE units in the first row of the PE array as an example, the vector A0 is input to all 8 PE units simultaneously, and the vectors W00-W07 of the first row, which can be regarded as the 8 column vectors of an [8,8] matrix, are input to the PE units at the corresponding positions. One row of the PE array can therefore complete one [1,8] × [8,8]T vector-matrix multiplication in 1 cycle.
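The behavior of a single PE unit and of one PE-array row described above can be sketched in a few lines of Python. This is only an illustrative software model, not the disclosed hardware; the function names `pe_unit` and `pe_row` and the sample values are assumptions introduced here.

```python
def pe_unit(a, w):
    """One PE unit: dot product of two length-8 vectors, i.e. [1,8] x [1,8]^T."""
    assert len(a) == 8 and len(w) == 8
    return sum(x * y for x, y in zip(a, w))


def pe_row(a, w_row):
    """One PE-array row: the same A vector meets 8 different W vectors,
    giving one [1,8] x [8,8]^T vector-matrix product per cycle."""
    return [pe_unit(a, w) for w in w_row]


# Sample data (hypothetical): A0 and one-hot W00-W07, so Psum(0, j) picks A0[j].
a0 = [1, 2, 3, 4, 5, 6, 7, 8]
w_row0 = [[1 if k == j else 0 for k in range(8)] for j in range(8)]
psum_row0 = pe_row(a0, w_row0)
```

With the one-hot W vectors chosen here, `psum_row0` reproduces `a0`, which makes it easy to check that PE (0, j) really combines A0 with W0j.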
By combining the 8 row-level vector-matrix multiplications of the PE array in different ways, the present disclosure realizes vector-matrix and matrix-matrix multiplications of different dimensions.
Those skilled in the art will appreciate that the W-direction input of a conventional PE array comprises only 64 values (8 vectors of [1,8]), and that the 8 PE units in each column share the same W-direction input. In the present disclosure, by contrast, each PE unit can receive a different W-direction input, and when the W-direction inputs of the 8 PE units in each column are identical, the PE array can perform all functions of the conventional PE array. When a conventional PE array computes a non-square matrix multiplication, a large amount of hardware sits idle. As shown in FIG. 1A, during a vector-matrix multiplication the conventional PE array can compute only 1 group of [1,8] × [8,8]T operations per cycle; only one row of PE units is working, and the delivered computing power is only 1/8 of the peak computing power of the PE array.
The splitting and combining of the PE array and its input data in the present disclosure are detailed below:
The PE array of the present disclosure adds 56 units of storage for W-direction input data. With each input value being 1 Byte (so that each unit, a [1,8] vector, is 8 Bytes), this amounts to 448 additional Bytes of storage. The 8 PE units in each row of the PE array can complete a [1,8] × [8,8]T vector-matrix multiplication, and matrix-matrix and vector-matrix multiplications of different dimensions can be completed by controlling the input combinations of the 8 rows. During vector-matrix multiplication, the PE array can compute 8 different groups of [1,8] × [8,8]T operations per cycle, and a multi-dimensional matrix multiplication is completed by accumulating the PE array outputs in the time domain.
Time-domain accumulation of the PE array output can be realized by an ACC (Accumulation) unit. As shown in FIG. 1B, accumulating one row of PE array output in the time domain requires 1 vector adder, and the ACC unit comprises 8 vector adders, one for each row. The 8 PE units in each column of the PE array can complete a [1,64] × [8,64]T vector-matrix multiplication, and matrix-matrix and vector-matrix multiplications of different dimensions can be completed by controlling the input combinations of the 8 columns. During vector-matrix multiplication, the PE array can compute 8 different groups of [1,64] × [8,64]T operations per cycle, and a multi-dimensional matrix multiplication is completed by summing the PE array outputs in the spatial domain;
Spatial-domain summation of the PE array output can be realized by adders inside the PE array. As shown in FIG. 1C, the spatial-domain summation of the PE unit outputs in a row requires 7 vector adders, and the outputs of specified PE units can be summed by enabling the specified adders;
When a matrix multiplication of any dimension is performed, the input data is split and combined based on a control signal, and the split and combined data is input to the designated PE units in a time-shared manner, so that all 64 PE units of the PE array are working. In theory, the PE array can therefore maintain its peak computing power for matrix multiplications of any dimension.
The splitting and combining method of the input data is described in more detail below:
Any vector-matrix multiplication can be expanded by zero-padding into the form [1,8×s] × [8×s,8×s]T (s being a positive integer). Taking [1,30] × [28,30]T as an example, after zero-padding it can be computed as [1,32] × [32,32]T;
FIG. 2 shows a vector-matrix multiplication of [1,8×s] × [8×s,8×s]T;
The vector A is split into s vectors of [1,8], one of which is denoted A[m]; the matrix W is split into s×s matrices of [8,8], one of which is denoted W[m][n], where m denotes the row and n denotes the column. One row of the PE array can complete exactly one A[m] × W[m][n]T vector-matrix multiplication in 1 cycle, and the whole PE array can complete exactly 8 A[m] × W[m][n]T vector-matrix multiplications in 1 cycle;
Therefore, to perform a [1,8×s] × [8×s,8×s]T vector-matrix multiplication, the multiple A[m] × W[m][n]T vector-matrix multiplications are distributed evenly over the PE array across different time slots and spatial positions;
In matrix multiplication, the critical path that limits the clock frequency of the PE array is the time needed to complete one multiply-and-add in each cycle, i.e., the time to complete the vector multiplication and the vector addition within 1 cycle.
In addition, A[m] and W[m][n] can be combined in two main ways. The first focuses on low power consumption: once data has been read, it is reused as much as possible, which reduces the energy spent on data movement. The second focuses on high performance: the time data spends in the PE array is kept as short as possible, and operations unrelated to vector multiplication (such as accumulation) are moved to an external ACC unit, which raises the clock frequency. A matrix multiplication whose dimensions do not exceed [8,8] completes within 1 cycle, so neither reducing data movement nor accumulating PE array results from different time slots in the ACC unit arises, and its high-performance mode and low-power mode are identical;
(1) When s = 1, 2, 4, the original input data is too small to fully use the bandwidth of the PE array, so the PE array of the present disclosure raises its utilization and shortens the operation time by broadcasting the A-direction input or the W-direction input;
As shown in FIG. 3, in the first method of copying A[m] and W[m][n] (low power consumption), the original A-direction input (A[0]-A[s-1]) is copied 8/s times; each copy is 1 group, and each group has 8×s values. The original W-direction input is divided into groups of 8 columns, each group having 64×s values. In this way, s rows of the PE array can complete the vector-matrix multiplication of 1 group of A-direction input and 1 group of W-direction input in 1 cycle, the whole PE array can complete 8/s groups, and all operations can be completed in s cycles in total.
As shown in FIG. 4, in the second method of copying A[m] and W[m][n] (high performance), the A-direction input of the first method is transposed to obtain s vectors of [1,8×8/s]; each row is 1 group, and each group has 8×8/s values. The original W-direction input is divided into groups of 8 rows, each group having 64×s values. In this way, 8/s rows of the PE array can complete the vector-matrix multiplication of 1 group of A-direction input and 1 group of W-direction input, the whole PE array can complete s groups, and all operations can be completed in s cycles in total.
(2) FIG. 5 shows, for s = 8, the combination of A[m] and W[m][n] in the low-power mode and the PE array input in each cycle:
In the 1st cycle, 8 vectors (A[0]-A[7]) are taken from A in order as the A0-A7 inputs of the PE array, and the A-direction input is held unchanged for 8 cycles;
After 8 cycles, all operations involving A[0]-A[7] are complete;
FIG. 6 shows, for s = 8, the combination of A[m] and W[m][n] in the high-performance mode and the PE array input in each cycle:
In the (k+1)th cycle, the k-th vector A[k] of A is input and broadcast to A0-A7 of the PE array, and 8 matrices (W[0][k]-W[7][k]) are taken from W in row order as the W-direction inputs of the 8 rows of the PE array, respectively; after 8 cycles, all operations involving A are complete;
(3) When s = 3, 5, 6, 7, the A-direction input and the W-direction input can be expanded to the form of s = 4 (for s = 3) or s = 8 (for s = 5, 6, 7) and then computed according to the rules in (1) and (2);
(4) When s > 8, A is divided into s/8 segments and W is divided into s/8 blocks, and the split A and W are then computed according to the rule for s = 8.
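The zero-padding expansion and the handling of the different values of s described above can be sketched as follows. This is a minimal illustration only; the helper names `round_up_to_8`, `expand_s` and `pad_operands` are hypothetical, and the treatment of s > 8 that is not a multiple of 8 (rounding up to the next multiple of 8) is an assumption not stated in the text.

```python
def round_up_to_8(n):
    """Smallest multiple of 8 that is not less than n."""
    return -(-n // 8) * 8


def expand_s(s):
    """Block count after expansion: 3 -> 4 and 5, 6, 7 -> 8 as in rule (3).
    For s above 8 the value is rounded up to a multiple of 8 (assumption)."""
    if s < 1:
        raise ValueError("s must be a positive integer")
    for supported in (1, 2, 4, 8):
        if s <= supported:
            return supported
    return round_up_to_8(s)


def pad_operands(a, w):
    """Zero-pad a ([1,k], flat list) and w ([n,k]) to [1,8*s] and [8*s,8*s]."""
    k, n = len(a), len(w)
    s = expand_s(max(round_up_to_8(k), round_up_to_8(n)) // 8)
    size = 8 * s
    a_pad = list(a) + [0] * (size - k)
    w_pad = [list(row) + [0] * (size - len(row)) for row in w]
    w_pad += [[0] * size for _ in range(size - n)]
    return a_pad, w_pad


# The [1,30] x [28,30]^T example from the text becomes [1,32] x [32,32]^T.
a = [1] * 30
w = [[r + c for c in range(30)] for r in range(28)]
a_pad, w_pad = pad_operands(a, w)
```

Because only zeros are appended, the first 28 entries of the padded product equal the original result and the remaining entries are zero.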
Further embodiments are described below:
Example 1 [8,8] × [8,8]T matrix multiplication
Matrix multiplication of [8,8] × [8,8]T is common in neural network convolution operations. FIG. 7 shows a standard PE array matrix multiplication. The matrix A is the input data of A0-A7 and has dimension [8,8]; it is divided by row into eight equal parts A[0]-A[7], each a [1,8] vector, one of which is denoted A[i];
The matrix W is the input data of W00-W77 and has dimension [8,8]; it is divided by row into eight equal parts W[0]-W[7], each a [1,8] vector, one of which is denoted W[j];
The matrix C is the output data and has dimension [8,8]; the value of C at position (i, j) is denoted C[i][j];
When the PE array of the present disclosure performs the [8,8] × [8,8]T matrix multiplication, the data input mode and the calculation steps are as follows:
(1) A[0]-A[7] are input to A0-A7 of the PE array, i.e., the 8 PE units in each row of the PE array have the same A-direction input; W[0]-W[7] are input to Wi0-Wi7 of the PE array, i.e., the 8 PE units in each column of the PE array have the same W-direction input;
(2) The output of each PE unit is:
Psum(i,j)=A[i]×W[j]
It follows directly that the value of C[i][j] is Psum(i, j);
(3) C[0]-C[7] are obtained after 1 cycle, completing the matrix multiplication [8,8] × [8,8]T = [8,8].
For a matrix multiplication that needs only 1 cycle (matrix dimensions not exceeding [8,8]), the A-direction and W-direction input data are used for that single cycle only, so the high-performance mode and the low-power mode operate identically and are not distinguished.
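Example 1 can be modeled in software as a single cycle in which PE (i, j) outputs Psum(i, j) = A[i] × W[j]. The sketch below is an illustrative model under that reading; `dot` and `pe_array_cycle` are names assumed here, and the test data is arbitrary.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def pe_array_cycle(a_rows, w_rows):
    """One cycle of the 8x8 PE array: every PE in row i receives A[i],
    every PE in column j receives W[j], and PE (i, j) outputs Psum(i, j)."""
    return [[dot(a_rows[i], w_rows[j]) for j in range(8)] for i in range(8)]


a = [[i + k for k in range(8)] for i in range(8)]
identity = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
c = pe_array_cycle(a, identity)  # A x I^T leaves A unchanged
```

Using the identity matrix for W is a convenient check, because C[i][j] = Psum(i, j) must then reproduce A element by element.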
Example 2 [1,64] × [64,64]T vector-matrix multiplication
To meet the requirements of different application scenarios, the hardware structure of the present disclosure supports two modes for vector-matrix multiplication: low power consumption and high performance;
In the low power consumption mode of the [1,64] × [64,64]T vector-matrix multiplication, the PE array reads 8 different A-direction inputs once and reuses them for 8 cycles, which greatly reduces the energy spent on data movement during the operation. In this mode, 8×7 additional adders (2 inputs, 1 output) are provided inside the PE array to add up the operation results of each cycle;
In the high-performance mode, the A-direction input and the W-direction input of the PE array differ in every cycle, but the operation results of each cycle do not need to be added inside the PE array and are output directly to an external ACC unit.
As shown in FIG. 8, the vector A is the input data of A0-A7 and has dimension [1,64]; it is divided into eight equal parts A[0]-A[7], each a [1,8] vector, one of which is denoted A[i];
The matrix W is the input data of W00-W77 and has dimension [64,64]; it is divided equally into 64 matrices of [8,8], one of which is denoted W[k][i], where k denotes the row and i denotes the column;
The vector C is the output and has dimension [1,64]; it is divided into eight equal parts, one of which is denoted C[k] with dimension [1,8], and each value in C[k] is denoted C[k][j];
As shown in FIG. 9, W[k][i] is divided by row into 8 equal parts W[k][i][0]-W[k][i][7], each a [1,8] vector, one of which is denoted W[k][i][j];
Low power consumption mode:
In the (k+1)th cycle, the PE array computes the vector-matrix multiplications A[i] × W[k][i]T, i = 0, 1, ..., 7; the vector-matrix multiplications computed by the PE array in each cycle are shown in FIG. 10;
When the PE array of the present disclosure performs the [1,64] × [64,64]T vector-matrix multiplication in the low power consumption mode, the data input mode and the calculation steps are as follows:
(1) In the 1st cycle, A[0]-A[7] are input to A0-A7 of the PE array, after which the data of these 8 [1,8] vectors is held unchanged (saving 7 cycles of data movement). As shown in FIG. 11, in the (k+1)th cycle the 8 [8,8] matrices of W[k] are input to the corresponding rows of the PE array, and the vector C[k] of dimension [1,8] is obtained after 1 cycle;
(2) Taking k = 0 as an example, A[0]-A[7] are input to A0-A7 of the PE array, and the 8 matrices W[0][i] are input to the i-th row of the PE array, respectively;
(3) The output of each PE unit is:
Psum(i,j)=A[i]×W[0][i][j]T
(4) The Psums with the same j in the 8 rows of the PE array are summed to obtain 8 values;
(5) The 8 values are concatenated to obtain the vector C[0] of dimension [1,8];
C[0]={C[0][0],C[0][1],...,C[0][j],...,C[0][7]}
(6) With the A0-A7 inputs unchanged, steps (2)-(5) are repeated for W[1]-W[7]; C[0]-C[7] are obtained in 8 cycles, completing the vector-matrix multiplication [1,64] × [64,64]T = [1,64].
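Steps (1)-(6) of the low power consumption mode can be sketched as a cycle-by-cycle software model. This is an illustration under the indexing defined above (block W[k][i][j] is taken to be row 8k+j, columns 8i to 8i+7 of W); the function name `low_power_1x64` and the test data are assumptions, and the model ignores timing and adder structure.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def low_power_1x64(a, w):
    """[1,64] x [64,64]^T with A held in the array and spatial summation."""
    # A[0]-A[7] are loaded once and kept for all 8 cycles.
    a_blocks = [a[8 * i:8 * i + 8] for i in range(8)]
    c = []
    for k in range(8):  # cycle k+1 processes block row W[k]
        # PE (i, j) receives A[i] and W[k][i][j] and outputs Psum(i, j).
        psum = [[dot(a_blocks[i], w[8 * k + j][8 * i:8 * i + 8])
                 for j in range(8)] for i in range(8)]
        # Spatial summation: add the Psums with the same j over the 8 rows.
        c.extend(sum(psum[i][j] for i in range(8)) for j in range(8))
    return c


a = list(range(64))
w = [[(3 * r + 5 * col) % 11 - 5 for col in range(64)] for r in range(64)]
c = low_power_1x64(a, w)
reference = [dot(a, row) for row in w]  # direct A x W^T for comparison
```

The comparison against `reference` confirms that summing the 8 row results per column j yields C[k][j] for each cycle k.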
High performance mode:
In the (k+1)th cycle, the PE array computes the vector-matrix multiplications A[k] × W[i][k]T, i = 0, 1, ..., 7; the vector-matrix multiplications computed by the PE array in each cycle are shown in FIG. 12;
When the PE array performs the [1,64] × [64,64]T vector-matrix multiplication in the high-performance mode, the data input mode and the calculation steps are as follows:
(1) As shown in FIG. 13, in the (k+1)th cycle the vector A[k] of dimension [1,8] is broadcast to A0-A7 of the PE array, and the 8 [8,8] matrices W[i][k] (i = 0, 1, ..., 7) of W are input to the i-th row of the PE array, respectively. After 1 cycle each PE unit outputs 1 value, and the outputs of the 8 PE units in the i-th row are concatenated into a vector to be accumulated, of dimension [1,8], denoted Ck[i];
(2) Taking k = 0 as an example, A[0] is broadcast to A0-A7 of the PE array, and the 8 matrices W[i][0] are input to the i-th row of the PE array, respectively;
(3) The output of each PE unit is:
Psum(i,j)=A[0]×W[i][0][j]T
(4) The Psum outputs of the 8 PE units in the i-th row are concatenated into a vector of dimension [1,8], which is input to the ACC unit for accumulation, i.e., added to the vector stored in the Buf unit in the previous cycle. The accumulation result Ck[i] is output to the external Buf unit, where it waits to be accumulated in the time domain with the PE array result of the next cycle;
(5) Steps (2)-(4) are repeated for A[1]-A[7]; in each cycle the ACC unit accumulates the newly obtained PE array output with the result Ck[i] of the previous cycle, so that after the last cycle:
C[i]=A[0]×W[i][0]T+A[1]×W[i][1]T+...+A[7]×W[i][7]T
(6) After 8 cycles, the 8 vectors C[i] are concatenated to obtain the vector C of dimension [1,64]; the vector-matrix multiplication [1,64] × [64,64]T = [1,64] is completed in 8 cycles.
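The high-performance mode of Example 2 can likewise be modeled with a broadcast A vector and time-domain accumulation. The sketch below is illustrative only; `high_perf_1x64` and the list `buf` (standing in for the Buf unit, initialized to 0) are names assumed here, and each inner update plays the role of one ACC vector adder.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def high_perf_1x64(a, w):
    """[1,64] x [64,64]^T with A[k] broadcast and accumulation over cycles."""
    buf = [[0] * 8 for _ in range(8)]  # Buf unit: one [1,8] vector per PE row
    for k in range(8):  # cycle k+1
        a_k = a[8 * k:8 * k + 8]  # A[k] is broadcast to all 8 rows
        for i in range(8):
            # Row i receives W[i][k]; its 8 PE outputs form one [1,8] vector.
            psum_row = [dot(a_k, w[8 * i + j][8 * k:8 * k + 8])
                        for j in range(8)]
            # ACC adder i: add the new row output to the stored partial sum.
            buf[i] = [b + p for b, p in zip(buf[i], psum_row)]
    # After 8 cycles the 8 accumulated vectors C[i] are concatenated.
    return [value for row in buf for value in row]


a = list(range(64))
w = [[(3 * r + 5 * col) % 11 - 5 for col in range(64)] for r in range(64)]
c = high_perf_1x64(a, w)
reference = [dot(a, row) for row in w]  # direct A x W^T for comparison
```

Compared with the low-power model, no addition happens between PE rows within a cycle; all additions are deferred to the accumulation step.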
Example 3 [4,16] × [16,16]T matrix multiplication
FIG. 14 shows the [4,16] × [16,16]T matrix multiplication. The matrix A is the input data of A0-A7 and has dimension [4,16]; it is first divided by row into four equal parts A[0]-A[3], one of which is denoted A[m], and each A[m] is then divided by column into two equal parts A[m][0] and A[m][1], each a [1,8] vector, one of which is denoted A[m][n];
The matrix W is the input data of W00-W77 and has dimension [16,16]; it is split into 4 matrices of [8,8], one of which is denoted W[i][j], where i denotes the row and j denotes the column;
The matrix C is the output of the matrix multiplication; its dimension and indexing are the same as those of matrix A, each small block in the figure is a [1,8] vector, and one of them is denoted C[m][n];
Low power consumption mode:
The PE array can complete 8 vector-matrix multiplications of [1,8] × [8,8]T in each cycle; FIG. 15 shows how A and W are split and combined in the low power consumption mode;
When the PE array of the present disclosure performs the [4,16] × [16,16]T matrix multiplication in the low power consumption mode, the data input mode and the calculation steps are as follows:
(1) In the 1st cycle, as shown in FIG. 16 and according to the splitting and combining of A and W shown in FIG. 15, the 4 [1,8] vectors of A belonging to the 1st cycle are input to A0-A7 of the PE array (each [1,8] vector is broadcast to 2 rows of the PE array), and the 8 [8,8] matrices of W are input to W00-W77 of the PE array; C[0][0]-C[1][1] are obtained after 1 cycle;
(2) In the 2nd cycle, as shown in FIG. 17, the W-direction input is held unchanged (saving 1 cycle of data movement for 8 [8,8] matrices), and the A-direction input is changed to the 64 input values of A belonging to the 2nd cycle; C[2][0]-C[3][1] are obtained after 1 cycle;
(3) The calculation formula of the C[m][n] obtained in each cycle is as follows:
C[m][n]=A[m][0]×W[n][0]T+A[m][1]×W[n][1]T
(4) After 2 cycles, the complete output matrix C is obtained, completing the matrix multiplication [4,16] × [16,16]T = [4,16].
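The low power consumption steps of Example 3 can be sketched as follows. Since FIG. 15 is not reproduced here, the assignment of operand pairs to PE rows is an assumption: in each cycle two rows of A are processed, and for every output block C[m][n] two PE rows (one per column block k) produce partial results that are summed spatially. The function name `low_power_4x16` and the test data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def low_power_4x16(a, w):
    """[4,16] x [16,16]^T in 2 cycles with the W-direction input held."""
    c = [[0] * 16 for _ in range(4)]
    for cycle in range(2):  # cycle 1: A rows 0-1, cycle 2: A rows 2-3
        for m in (2 * cycle, 2 * cycle + 1):
            for n in range(2):  # output block C[m][n]
                # Two PE rows (k = 0, 1) compute A[m][k] x W[n][k]^T.
                partial = [[dot(a[m][8 * k:8 * k + 8],
                                w[8 * n + j][8 * k:8 * k + 8])
                            for j in range(8)] for k in range(2)]
                # Spatial summation of the two partial [1,8] vectors.
                c[m][8 * n:8 * n + 8] = [partial[0][j] + partial[1][j]
                                         for j in range(8)]
    return c


a = [[(m * 16 + col) % 9 for col in range(16)] for m in range(4)]
w = [[(3 * r + 2 * col) % 7 - 3 for col in range(16)] for r in range(16)]
c = low_power_4x16(a, w)
reference = [[dot(a[m], w[r]) for r in range(16)] for m in range(4)]
```

Each cycle uses 2 × 2 × 2 = 8 PE rows, which matches the statement that 4 [1,8] vectors of A are each broadcast to 2 rows while 8 [8,8] matrices occupy the W-direction input.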
High performance mode:
FIG. 18 shows a split combination of A and W in high performance mode;
When the PE array of the present disclosure performs the [4,16] × [16,16]T matrix multiplication in the high-performance mode, the data input mode and the calculation steps are as follows:
(1) In the 1st cycle, as shown in FIG. 19 and according to FIG. 18, the 8 [1,8] vectors of A belonging to the 1st cycle are input to A0-A7 of the PE array, and the 2 [8,8] matrices of W are input to W00-W77 of the PE array (each [8,8] matrix is broadcast to 4 rows of the PE array); after 1 cycle, 8 [1,8] vectors to be accumulated are obtained, denoted C'[m][n];
(2) In the 2nd cycle, as shown in FIG. 20, the A-direction input is held unchanged and the W-direction input is changed to the 8 matrix inputs of W belonging to the 2nd cycle; 8 [1,8] vectors are obtained after 1 cycle, and after the ACC unit accumulates them with the C'[m][n] of the previous cycle, all C[m][n] are obtained, completing the matrix multiplication [4,16] × [16,16]T = [4,16].
Example 4 [2,32] × [32,32]T matrix multiplication
FIG. 21 shows the [2,32] × [32,32]T matrix multiplication. The matrix A is the input data of A0-A7 and has dimension [2,32]; it is first divided by row into A[0] and A[1], one of which is denoted A[m], and each A[m] is then divided by column into four equal parts A[m][0]-A[m][3], each a [1,8] vector, one of which is denoted A[m][n];
The matrix W is the input data of W00-W77 and has dimension [32,32]; it is divided into four equal parts by row and four equal parts by column, giving 16 matrices of [8,8], one of which is denoted W[i][j], where i denotes the row and j denotes the column;
The matrix C is the output of the matrix multiplication; its dimension and indexing are the same as those of matrix A, each small block in the figure is a [1,8] vector, and one of them is denoted C[m][n];
Low power consumption mode:
The PE array can complete 8 vector-matrix multiplications of [1,8] × [8,8]T in each cycle; FIG. 22 shows how A and W are split and combined in the low power consumption mode;
When the PE array of the present disclosure performs the [2,32] × [32,32]T matrix multiplication in the low power consumption mode, the data input mode and the calculation steps are as follows:
(1) In the 1st cycle, as shown in FIG. 23 and according to the splitting and combining of A and W shown in FIG. 22, the 4 [1,8] vectors of A belonging to the 1st cycle are input to A0-A7 of the PE array (each [1,8] vector is broadcast to 2 rows of the PE array), and 8 [8,8] matrices of W are input to W00-W77 of the PE array; C[0][0] and C[0][1] are obtained after 1 cycle;
(2) In the 2nd cycle, the W-direction input is held unchanged (saving 1 cycle of data movement for 8 [8,8] matrices), and the A-direction input is changed to the [1,8] vectors of A belonging to the 2nd cycle; C[1][0] and C[1][1] are obtained after 1 cycle;
(3) Similarly, according to the splitting and combining of A and W shown in FIG. 22, the calculation formula of the C[m][n] obtained in each cycle is:
C[m][n]=A[m][0]×W[n][0]T+A[m][1]×W[n][1]T+A[m][2]×W[n][2]T+A[m][3]×W[n][3]T
(4) After 4 cycles, a complete output matrix C is obtained, and matrix multiplication calculation of [2,32] × [32,32]T = [2,32] is completed.
High performance mode:
In the (k+1)th cycle, the PE array computes the vector-matrix multiplications A[m][k] × W[i][k]T (m = 0, 1; i = 0, 1, 2, 3); the vector-matrix multiplications computed by the PE array in each of the 4 cycles are shown in FIG. 23A;
When the PE array of the present disclosure performs the [2,32] × [32,32]T matrix multiplication in the high-performance mode, the data input mode and the calculation steps are as follows:
(1) As shown in FIG. 23B, in the (k+1)th cycle the vector A[0][k] of dimension [1,8] is broadcast to A0-A3 of the PE array, and the vector A[1][k] is likewise broadcast to A4-A7 of the PE array; the matrix W[i][k] (i = 0, 1, 2, 3) is input to the i-th row and the (i+4)-th row of the PE array. After 1 cycle each PE unit outputs 1 value, and the outputs of the 8 PE units in the same row are concatenated into a vector to be accumulated, of dimension [1,8], denoted Ck[m][i];
(2) As shown in FIG. 23C, taking k = 0 as an example, A[0][0] is broadcast to A0-A3 of the PE array, A[1][0] is broadcast to A4-A7 of the PE array, and the 4 matrices W[i][0] (i = 0, 1, 2, 3) are input to the first 4 rows and to the last 4 rows of the PE array, respectively;
(3) The Psum outputs of the 8 PE units in the (m×4+i)-th row are concatenated into a vector of dimension [1,8], which is input to the ACC unit for accumulation, i.e., added to the vector stored in the Buf unit in the previous cycle; the accumulation result is Ck[m][i];
(4) Ck[m][i] is output to the external Buf unit, where it waits to be accumulated in the time domain with the PE array output of the next cycle;
(5) Steps (2)-(4) are repeated for A[m][n] (m = 0, 1; n = 1, 2, 3); in each cycle the ACC unit accumulates the output Ck[m][i] of the (m×4+i)-th row of the PE array, and the accumulated result of the 4 cycles is as follows:
C[m][i]=A[m][0]×W[i][0]T+A[m][1]×W[i][1]T+A[m][2]×W[i][2]T+A[m][3]×W[i][3]T
(6) After 4 cycles, the 4 vectors C[0][i] (i = 0, 1, 2, 3) are concatenated to obtain the vector C[0] of dimension [1,32], and the vector C[1] of dimension [1,32] is obtained in the same way; the matrix multiplication [2,32] × [32,32]T = [2,32] is completed in 4 cycles.
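The high-performance steps of Example 4 can be sketched as a software model in which PE row m×4+i handles the pair (A[m][k], W[i][k]) in cycle k+1 and the ACC unit accumulates over the 4 cycles. The function name `high_perf_2x32`, the list `buf` standing in for the Buf unit, and the test data are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def high_perf_2x32(a, w):
    """[2,32] x [32,32]^T in 4 cycles with time-domain accumulation."""
    buf = [[0] * 8 for _ in range(8)]  # Buf unit, initial value 0
    for k in range(4):  # cycle k+1
        for m in range(2):
            a_mk = a[m][8 * k:8 * k + 8]  # A[0][k] -> rows 0-3, A[1][k] -> rows 4-7
            for i in range(4):
                row = m * 4 + i  # W[i][k] goes to rows i and i+4
                psum = [dot(a_mk, w[8 * i + j][8 * k:8 * k + 8])
                        for j in range(8)]
                # ACC: add this row's output to the stored partial sum.
                buf[row] = [b + p for b, p in zip(buf[row], psum)]
    # C[m] is the concatenation of C[m][0]-C[m][3].
    return [[value for i in range(4) for value in buf[m * 4 + i]]
            for m in range(2)]


a = [[(m * 32 + col) % 13 for col in range(32)] for m in range(2)]
w = [[(7 * r + col) % 5 - 2 for col in range(32)] for r in range(32)]
c = high_perf_2x32(a, w)
reference = [[dot(a[m], w[r]) for r in range(32)] for m in range(2)]
```

The row index `m * 4 + i` mirrors the (m×4+i)-th row used in step (5).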
Matrix multiplication operations of different dimensions that can be implemented by the PE array of the present disclosure are shown in table 1 below:
TABLE 1
In addition to the matrix multiplications of the types shown in the table, the PE array can handle any W-direction input whose dimensions are not listed, by zero-padding it to the closest matrix in the table and zero-padding the A-direction input to the same column width. Taking [1,5] × [5,6]T as an example, after zero-padding it can be computed as [1,8] × [8,8]T;
Any [N,s] × [s,s]T matrix multiplication can be split into N vector-matrix multiplications of [1,s] × [s,s]T and is completed in at most N × cycle cycles (where cycle is the number of cycles required by the original vector-matrix multiplication).
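The row-wise split described above can be sketched as follows: a matrix multiplication is carried out as N independent vector-matrix multiplications, and the cycle bound is N times the cycle count of one of them. The names `vec_mat_t`, `mat_mat_t` and `max_cycles` are hypothetical, and the cycle count of a single vector-matrix multiplication is passed in as a parameter rather than derived.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def vec_mat_t(a_row, w):
    """One [1,s] x [s,s]^T vector-matrix multiplication."""
    return [dot(a_row, w_row) for w_row in w]


def mat_mat_t(a, w):
    """[N,s] x [s,s]^T computed as N separate vector-matrix multiplications."""
    return [vec_mat_t(a_row, w) for a_row in a]


def max_cycles(n_rows, cycles_per_vector_op):
    """Upper bound on cycles: N times the cycles of one vector-matrix op."""
    return n_rows * cycles_per_vector_op


result = mat_mat_t([[1, 2], [3, 4]], [[1, 1], [2, 0]])
```

For instance, if a single [1,64] × [64,64]T operation takes 8 cycles as in Example 2, `max_cycles(2, 8)` gives the bound for a [2,64] × [64,64]T operation.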
In addition, the present disclosure also discloses the application of the operation unit, in particular its PE array, in a typical MPU. Furthermore, the present disclosure also discloses an MPU including the operation unit.
The layout of the PE array of the present disclosure in a typical MPU is shown in fig. 24, wherein:
(1) The control module mainly generates the various control signals that control each module;
In the present invention, the working modes are divided into a high-performance mode and a low power consumption mode, the default being the high-performance mode; by way of example, the working mode signal of the control module is distinguished as shown in Table 2 below:
TABLE 2
Assuming that the two matrices to be multiplied have the form [m,s] × [n,s]T, in the present disclosure the 3 dimensions m, s and n may each take the values 1, 2, 4 or 8 (a dimension of 3 can be extended to 4 by zero-padding; a dimension of 5, 6 or 7 can be extended to 8 by zero-padding), so there are 64 matrix multiplication forms in total, and the matrix dimension signal of the control module needs 6 bits to distinguish them, see Table 3 below;
TABLE 3
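As a way of seeing why 6 bits suffice for the 64 forms, one possible encoding can be sketched. This is a purely hypothetical encoding (2 bits per dimension, holding the index of the dimension in the list 1, 2, 4, 8); it is not the encoding of Table 3, and the names `pad_dim`, `encode_dims` and `decode_dims` are introduced here only for illustration.

```python
VALID_DIMS = (1, 2, 4, 8)


def pad_dim(d):
    """Zero-padding target for one dimension: 3 -> 4 and 5, 6, 7 -> 8."""
    for valid in VALID_DIMS:
        if d <= valid:
            return valid
    raise ValueError("dimensions above 8 are not covered by this sketch")


def encode_dims(m, s, n):
    """Hypothetical 6-bit code: 2 bits each for m, s and n (not Table 3)."""
    code = 0
    for d in (m, s, n):
        code = code * 4 + VALID_DIMS.index(pad_dim(d))
    return code


def decode_dims(code):
    """Inverse of encode_dims: recover the padded (m, s, n)."""
    idx_m, rest = divmod(code, 16)
    idx_s, idx_n = divmod(rest, 4)
    return VALID_DIMS[idx_m], VALID_DIMS[idx_s], VALID_DIMS[idx_n]


all_codes = {encode_dims(m, s, n)
             for m in VALID_DIMS for s in VALID_DIMS for n in VALID_DIMS}
```

The set `all_codes` contains exactly the values 0 to 63, one per combination of m, s and n, which is the counting argument behind the 6-bit signal.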
The working mode signal and the matrix dimension signal of the control module are transmitted to lm_A and lm_W, and together determine how the data of matrices A and W is accessed in each cycle;
The working mode signal is also transmitted to PE_array, ACC and Buf, and controls whether the PE unit outputs Psum are accumulated in the time domain or summed in the spatial domain;
The matrix dimension signal is also transmitted to PE_array, ACC and Buf, and determines the number of cycles the PE array needs for matrix multiplications of different dimensions, so that it can be judged whether the matrix multiplication is complete and whether Buf may output the operation result;
(2) lm_A (i.e., local memory A) and lm_W (i.e., local memory W) are storage resources of 64 Bytes and 512 Bytes, respectively (assuming each input value is 1 Byte), and mainly buffer the A-direction input and the W-direction input, i.e., the data of the two matrices A and W to be multiplied;
Based on the matrix dimension signal and the working mode signal of the control module, lm_A and lm_W read the specified data of matrix A and matrix W and input it to the specified PE units;
(3) PE_array is the PE array, which mainly performs matrix operations in the various modes and enables its internal adders based on the working mode signal of the control module. In the low power consumption mode, the adders inside the PE array are enabled to perform spatial-domain addition, and the operation results of the PE array do not need to be accumulated in the ACC unit;
(4) The ACC unit mainly accumulates the PE array outputs of different time slots. In the high-performance mode, the outputs of the 8 rows of PE units of the PE array are the first inputs of the 8 adders of the ACC unit, the 8 vectors stored in the Buf unit are the second inputs of the 8 adders, and the ACC unit finally writes the results of the 8 adders to the Buf unit;
(5) The Buf (Buffer) unit mainly buffers the accumulation results of the ACC unit, with an initial value of 0. Based on the working mode signal of the control module, the Buf unit determines whether to feed its stored data to the ACC unit for accumulation: in the high-performance mode the value stored in the Buf unit is the second input of the ACC unit for time-domain accumulation, and in the low power consumption mode no data is sent to the ACC unit.
Based on the matrix dimension signal of the control module, the number of cycles for matrix multiplications of different dimensions can be determined; by counting cycles, the Buf unit can judge whether the matrix multiplication is complete, and when it is complete, the Buf unit outputs the stored data from the MPU.
It will be appreciated that, as commonly abbreviated in the processor field, control stands for the control module, PE_array for the PE array structure, ACC for the accumulator, and Buf for the buffer.
In addition, it should be noted that:
In the high-performance mode, the PE array can compute 8 different groups of [1,8] × [8,8]T operations per cycle, and the time-domain accumulation of the results is performed by the ACC unit. The addition in the matrix multiplication is thus moved to the ACC unit, which performs one addition per cycle, shortening the critical path of the PE array and raising the clock frequency;
Assuming that one vector multiplication (by a PE unit) takes 1 ns and one vector addition takes 0.3 ns, the critical path delays and clock frequencies in the two modes for the matrix operations of Examples 2, 3 and 4 are shown in Table 4 below:
TABLE 4
In the high-performance mode the addition is performed in the ACC unit rather than in the PE array, so the addition delay does not add to the critical path delay of the PE array.
The main power consumption of a processor in operation comes from data movement, so exploiting spatial locality to make more use of each data read effectively reduces power consumption.
In the low power consumption mode, the PE array can compute 8 different groups of [1,64] × [8,64]T operations per cycle, and the specified adders inside the PE array are enabled based on the control signals for matrix multiplications of different dimensions, so that the results are added in the spatial domain;
Spatial-domain summation maximizes the reuse of the W-direction input data, so for the same matrix multiplication and the same number of cycles, the low power consumption mode effectively reduces the amount of data moved;
Assuming each input value is 1 Byte, one A-direction input moves 64 Bytes and one W-direction input moves 512 Bytes; for the same matrix operations of Examples 2, 3 and 4, the amounts of data moved in the two modes are shown in Table 5 below:
TABLE 5
Taking the low power consumption mode of Example 4 as an example, 4×64 means that A-direction input data is moved in 4 cycles, 64 Bytes per cycle, and 2×512 means that W-direction input data is moved in 2 cycles, 512 Bytes per cycle.
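The data-movement arithmetic can be written out as a small helper. This sketch only restates the figures given in the text (64 Bytes per A-direction load, 512 Bytes per W-direction load, and the 4×64 and 2×512 terms of the Example 4 low power consumption mode); the function name `data_moved` and the constant names are assumptions, and the other entries of Table 5 are not reproduced.

```python
A_LOAD_BYTES = 64    # one A-direction input: 8 vectors x 8 values x 1 Byte
W_LOAD_BYTES = 512   # one W-direction input: 64 vectors x 8 values x 1 Byte


def data_moved(a_loads, w_loads):
    """Total Bytes moved for the given number of A and W input loads."""
    return a_loads * A_LOAD_BYTES + w_loads * W_LOAD_BYTES


# Example 4, low power consumption mode: 4 x 64 + 2 x 512 Bytes.
example4_low_power = data_moved(4, 2)
```

Evaluating the expression gives 256 + 1024 = 1280 Bytes for this case.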
It should be emphasized that a matrix multiplication whose dimensions do not exceed [8,8] is completed in 1 cycle, so neither reducing data movement nor accumulating PE array results from different time slots in the ACC unit arises, and its high-performance mode and low-power mode are identical.
In summary, the present disclosure has the following features:
The present disclosure can be applied to neural network acceleration chips for edge devices such as mobile phones, surveillance equipment and automotive electronics, and also to acceleration boards in servers;
The PE operation unit of the present disclosure can be implemented with fixed-point multiply-accumulate operations or with floating-point multiply-accumulate operations;
In addition to the matrix multiplication with the 64 PE units described above, the PE array of the present disclosure may support operations with similar structures such as 4×4 and 16×16, as well as rectangular matrix operations such as 4×8, 8×4, 8×16 and 16×8.
Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, the present disclosure is not limited to the specific embodiments and fields of application described above, which are merely illustrative and instructive, not restrictive. Those skilled in the art, having the benefit of this disclosure, may devise many other forms without departing from the scope of the invention as defined in the appended claims, and all such forms fall within the protection of the present disclosure.