Detailed Description
Fig. 2a schematically shows a structural diagram of a data processing apparatus provided in the present application. As shown in fig. 2a, the apparatus includes a processing system 10, a hardware accelerator 20, a data bus 30 and a control bus 40. The processing system 10 and the hardware accelerator 20 may communicate data via the data bus 30 and control information via the control bus 40. The processing system 10 is the controlling component of the overall system and includes a processor 111; optionally, the processing system 10 also includes a memory 112. The processor 111 is used to run the software-side code and to load acceleration tasks (e.g., matrices or vectors) onto the hardware accelerator 20. Optionally, the hardware accelerator 20 may include one or more Intellectual Property (IP) cores, and the hardware accelerator 20 may control the operating state and data reading of the one or more IP cores. An IP core is, for example, a functional block that is commonly used in digital circuits but relatively complex, such as a Synchronous Dynamic Random Access Memory (SDRAM) controller or a Peripheral Component Interconnect (PCI) interface, designed as a circuit whose parameters can be modified. The processor 111 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP. The processor 111 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof. The memory 112 is used for storing data, and the processor 111 may control reading data from and writing data to the memory 112. The memory 112 may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a random-access memory (RAM), or a non-volatile memory, such as a flash memory, a Hard Disk Drive (HDD), or a solid-state drive (SSD); the memory 112 may further include a combination of the above memories.
The hardware accelerator 20 is the hardware acceleration component of the whole system. A dedicated acceleration IP core can be designed and implemented on it according to a specific computation task, and the acceleration of the algorithm is realized by that IP core on the hardware accelerator; for example, a hardware accelerator applied to a neural network may accelerate the neural network algorithm. The hardware accelerator 20 and the processing system 10 may execute tasks independently and in parallel once data interaction is complete. The hardware accelerator 20 may be a Graphics Processing Unit (GPU), an FPGA, or an Application Specific Integrated Circuit (ASIC).
The data bus 30 is used for data transfer between the processing system 10 and the hardware accelerator 20. The data bus 30 may adopt the advanced extensible interface-Stream (AXI-Stream) protocol or the peripheral component interconnect express (PCI-E) bus protocol, where AXI is a bus protocol and AXI-Stream is a high-performance transmission protocol allowing unlimited data burst transmission.
The control bus 40 is used for the transmission of control signals between the processing system 10 and the hardware accelerator 20. The control bus 40 may adopt the AXI-Lite protocol, a lightweight address-mapped single-transfer protocol suitable for the control signal transmission of a hardware operation circuit.
Based on the data processing device framework shown in fig. 2a, fig. 2b is a schematic diagram illustrating a software and hardware collaboration framework of a specific data processing device provided by the present application. As shown in fig. 2b, the framework includes a processing system 10a, a hardware accelerator 20a, a data bus 30a, and a control bus 40a, where the processing system 10a and the hardware accelerator 20a may transmit data via the data bus 30a and control information via the control bus 40a.
The processing system 10a may be the same as the processing system 10 described above in fig. 2a. That is, the processor 111a in the processing system 10a may be the same as the processor 111 in fig. 2a, and the memory 112a may be the same as the memory 112 in fig. 2a. The data bus 30a may be the same as the data bus 30 in fig. 2a, and the control bus 40a may be the same as the control bus 40 in fig. 2a; these are not further described here.
The hardware accelerator 20a may be the same as the hardware accelerator 20 in fig. 2a described above. The hardware accelerator 20a may include the following IP cores: a storage circuit 211, an arithmetic circuit 212, an input buffer circuit 213, an output buffer circuit 214, a read circuit 215 (fig. 2b illustrates the read circuit as a Direct Memory Access (DMA) circuit) and a control interconnect circuit 216.
The storage circuit 211 includes a first data buffer 2111, a second data buffer 2112, an index information buffer 2113, and a selection circuit 2114 (fig. 2b takes a selector as an example). The first data buffer 2111 is configured to buffer data loaded from the memory 112a into the hardware accelerator; for example, when the hardware accelerator 20a is applied to a neural network, the first data buffer 2111 may buffer the convolution kernel parameter matrix of each layer of the neural network model. The second data buffer 2112 is used for buffering data loaded from the memory 112a into the hardware accelerator; for example, when the hardware accelerator 20a is applied to a neural network, the second data buffer 2112 may buffer the feature map data matrix. The index information buffer 2113 may be used to buffer the position information of the non-zero elements in the parameter matrix or the data matrix, such as the column addresses and row addresses of the non-zero elements and the correspondence between them. The selector 2114 is configured to select data at the corresponding positions according to the index information buffered in the index information buffer 2113; for example, if the index information is obtained from the parameter matrix, the selector selects data at the corresponding positions from the data matrix according to the index information, and if the index information is obtained from the data matrix, the selector selects data at the corresponding positions from the parameter matrix.
The arithmetic circuit 212 is the core component of the hardware accelerator, and its various functional units are hardwired inside it. The arithmetic circuit 212 includes a multiplication unit 2121, an accumulation unit 2122, a vector calculation unit 2123, and an excitation function unit 2124. The multiplication unit 2121 is configured to multiply an input vector by the elements at the corresponding positions of a matrix to obtain intermediate values, or to multiply an input matrix by a matrix to obtain intermediate values; the multiplication unit may be a matrix-and-vector multiplication unit or a matrix-and-matrix multiplication unit. The accumulation unit 2122 is configured to accumulate the intermediate values calculated by the multiplication unit 2121 to obtain gate vector values. The excitation function unit 2124 is configured to perform excitation function processing on the gate vector values obtained by the accumulation unit, and the vector calculation unit 2123 is configured to calculate each gate vector to obtain the final result. The input buffer circuit 213 and the output buffer circuit 214 temporarily store input and output data in the hardware accelerator; the arithmetic circuit 212 may place a result into the output buffer after completing a calculation, and the contents of the output buffer 214 may be written back to the memory 112a in one pass once the buffer is full.
The direct memory access (DMA) circuit 215 is used for data transfer between the hardware accelerator 20a and the memory 112a. Optionally, the physical addresses at which data are stored in the memory 112a may be sequential, which facilitates data transfer by the DMA 215 in the hardware accelerator 20a. Each circuit of the hardware accelerator may be equipped with its own DMA to enable parallel data reads. The control interconnect circuit 216 is used for interconnecting the control signal lines.
Either of the above-mentioned fig. 2a and fig. 2b can be applied to a neural network acceleration system. An optional implementation in such a system is as follows: under the control of the processor in the processing system, the DMA is started via the control bus and the data bus; the DMA transfers convolution kernel parameter data in the memory to the first data buffer 2111 in the storage circuit 211 of the hardware accelerator, and transfers input feature map data in the memory to the second data buffer 2112 of the storage circuit; the arithmetic circuit 212 then performs multiplication, accumulation, excitation function and other processing to obtain a target result, and the target result is transferred back to the memory through the DMA.
The data processing apparatus shown in fig. 2a or fig. 2b can be applied to devices such as a mobile phone, a camera, or a cloud server. The data processing device in fig. 2a or fig. 2b may also be part of a System on Chip (SoC).
It is to be understood that, in the present application, "at least one" means one or more, "a plurality" means two or more.
Based on the architecture shown in fig. 2a or fig. 2b, the present application provides a data processing method. Fig. 3 is a schematic flow chart of a data processing method provided in the present application. In this embodiment, the processor may be the processor 111 in fig. 2a or the processor 111a in fig. 2b, and the hardware accelerator may be the hardware accelerator 20 in fig. 2a or the hardware accelerator 20a in fig. 2b. As shown in fig. 3, the method comprises the following steps:
In step 301, a processor obtains a first matrix and a second matrix.
Wherein the first matrix and the second matrix both comprise non-zero elements, the first matrix is a matrix of L × M, the second matrix is a matrix of M × N, and L, M and N are all positive integers.
In one implementation, L and M of the first matrix are equal and M and N of the second matrix are equal.
In yet another implementation, L and M of the first matrix are equal and M and N of the second matrix are not equal.
In yet another implementation, L and M of the first matrix are not equal, and M and N of the second matrix are equal.
In yet another implementation, L and M of the first matrix are not equal, and M and N of the second matrix are not equal.
Step 302, the processor obtains at least one third matrix and corresponding at least one index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator.
The specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l × m matrix and an m × n matrix. The third matrix is a matrix of (t × m) × n; each third matrix in the at least one third matrix respectively comprises each non-zero element in different n columns in the second matrix and does not comprise part or all of the zero elements in those n columns of the second matrix. The index information is used for indicating the position information of the non-zero elements in the n columns of the second matrix, and one third matrix corresponds to one piece of index information. l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer.
For example, if 4 third matrices are obtained, each third matrix includes n columns of the second matrix, and the n columns of the second matrix included in any two of the 4 third matrices are different. For example, if the second matrix includes 32 columns and each third matrix includes 8 columns of the second matrix, the columns covered by the 4 third matrices may be: the first third matrix includes columns 1 to 8 of the second matrix, the second third matrix includes columns 9 to 16, the third third matrix includes columns 17 to 24, and the fourth third matrix includes columns 25 to 32. Here first, second, third and fourth do not indicate a sequential order, but merely distinguish the third matrices.
Optionally, the sizes of l, m and n in the specification of the hardware accelerator are all the same, or any two of them are the same, or all three are different.
In one possible implementation, the processor determines the value t of the third matrix according to the condition (t − 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the most non-zero elements among the n columns of the second matrix. It can also be understood that the processor determines the number of rows of the third matrix according to t, and then forms the third matrix from the non-zero elements in the n columns of the second matrix block and the determined numbers of rows and columns of the third matrix. Because the number of rows of the third matrix is determined by the number p of non-zero elements in the densest of the n columns, all non-zero elements in those n columns of the second matrix can be included in the third matrix; and since the non-zero elements dominate the operation result, the operation result can be obtained accurately from the third matrix.
For example, if the determined third matrix has 8 rows and a certain column among the n columns of the second matrix has 6 non-zero elements, that column is filled with two zero elements; the two zero elements may be in the first two rows of the column, in the last two rows, or in any two positions in between, which is not limited in this application.
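For illustration, the following minimal Python sketch models how a processor might build a third matrix and its index information from one n-column block of the second matrix. It is a software model under stated assumptions, not the claimed apparatus: the function name compact_block is ours, padding zeros are placed in the trailing rows (this application leaves their positions open), and padded index entries reuse row address 0.

    import numpy as np

    def compact_block(block: np.ndarray, m: int):
        """block: an M x n slice of the second matrix.
        Returns (third, index): the (t*m) x n third matrix and the row
        addresses of its entries in the second matrix (the index information)."""
        M, n = block.shape
        # p: non-zero count of the densest column; t from (t-1)*m < p <= t*m
        p = max(int(np.count_nonzero(block[:, c])) for c in range(n))
        t = max(1, -(-p // m))                      # ceiling division
        third = np.zeros((t * m, n), dtype=block.dtype)
        index = np.zeros((t * m, n), dtype=np.int64)
        for c in range(n):
            rows = np.flatnonzero(block[:, c])      # row addresses of non-zeros
            third[:rows.size, c] = block[rows, c]   # non-zeros first, zeros after
            index[:rows.size, c] = rows             # padded entries keep address 0
        return third, index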
Alternatively, the first matrix and the second matrix may both be sparse matrices, both may be non-sparse matrices, or one of them may be a sparse matrix. A sparse matrix is a matrix in which the number of elements with a value of 0 is much greater than the number of non-zero elements (for example, the ratio of the number of zero-valued elements to the total number of elements in the matrix is greater than a preset threshold) and the non-zero elements are distributed irregularly; sparsity helps to reduce the requirement on hardware Input/Output (I/O) bandwidth when data is loaded from the memory to the hardware accelerator. Take the second matrix as the sparse matrix as an example. The zero elements in the sparse matrix have no influence on the operation result, yet when they participate in the operation they add hardware-acceleration overhead and reduce the operation performance of the hardware accelerator. Therefore, when the second matrix is sparsified, in order to improve the operation performance of the hardware accelerator, the number p of non-zero elements in the column with the most non-zero elements among the columns of the second matrix can satisfy the condition p ≤ M − m. In that case, when the processor converts the second matrix into a plurality of third matrices, rows containing only zero elements in the n columns of the second matrix can be deleted to obtain the third matrix, which reduces the number of zero elements participating in the operation and helps improve the computational efficiency of the hardware accelerator. Further, according to the condition that t satisfies, any irregular sparse second matrix can be converted into regular third matrices, so the product operation of a second matrix with any sparsity can be realized by the hardware accelerator.
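The two numeric conditions in the preceding paragraph can be stated compactly, as in the sketch below. This is illustrative only; the helper names and the default threshold are our assumptions, not values from this application.

    import numpy as np

    def is_sparse(mat: np.ndarray, threshold: float = 0.5) -> bool:
        """Share of zero elements exceeds a preset threshold."""
        return (mat == 0).mean() > threshold

    def zero_rows_removable(block: np.ndarray, m: int) -> bool:
        """The p <= M - m condition: even the densest column leaves room
        for at least one all-zero row of the M-row block to be deleted."""
        p = max(int(np.count_nonzero(block[:, c])) for c in range(block.shape[1]))
        return p <= block.shape[0] - m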
As described with reference to fig. 2b, after the processor 111a obtains at least one third matrix and corresponding at least one index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, the obtained at least one third matrix and corresponding at least one index information may be stored in the memory 112 a.
Step 303, for each third matrix in the at least one third matrix, the hardware accelerator acquires a fourth matrix from the corresponding l rows in the first matrix according to the index information corresponding to the third matrix, where the fourth matrix is a matrix of l × m.
Optionally, the index information includes row addresses and column addresses, and one column address corresponds to m row addresses. The index information may be in a form of a table, or may be in other forms, for example, a text in an extensible markup language (XML) format, and the format of the index information is not limited in the present application.
Specifically, the hardware accelerator selects m column elements from corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information to obtain a fourth matrix, where the m row addresses are in one-to-one correspondence with the m column addresses of the m column elements. Here, at least one fourth matrix may be obtained according to one index information.
In a possible implementation, n columns of the second matrix correspond to l rows of the first matrix. For example, if the first matrix comprises 32 rows, the second matrix comprises 32 columns, and each third matrix includes 8 columns, then the first third matrix includes columns 1 to 8 of the second matrix, which may correspond to rows 1 to 8 of the first matrix; the second third matrix includes columns 9 to 16, which may correspond to rows 9 to 16 of the first matrix; the third third matrix includes columns 17 to 24, which may correspond to rows 17 to 24 of the first matrix; and the fourth third matrix includes columns 25 to 32, which may correspond to rows 25 to 32 of the first matrix. First, second, third and fourth do not indicate a sequential order, but merely distinguish the third matrices.
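Continuing the illustrative compact_block sketch above, step 303 can be modeled as a column gather: the m row addresses recorded for one column address of the index information select m columns from the l corresponding rows of the first matrix. The helper name gather_fourth is ours. Note that a padded index entry selects an arbitrary column (address 0 here), which is harmless because that column is later multiplied by a padded zero of the third matrix.

    import numpy as np

    def gather_fourth(a_rows: np.ndarray, index: np.ndarray, c: int, m: int, k: int = 0):
        """a_rows: the corresponding l x M rows of the first matrix.
        Returns the l x m fourth matrix for column address c, using the
        k-th group of m row addresses (k < t)."""
        addrs = index[k * m:(k + 1) * m, c]   # m row addresses for one column address
        return a_rows[:, addrs]               # a row address in B selects a column of A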
Taking fig. 2b as an example, in one possible implementation, the hardware accelerator 20a, under the control of the processor 111a, loads the first matrix in the memory 112a into the second data buffer 2112, loads the index information into the index information buffer 2113, and loads the third matrix into the first data buffer 2111 of the hardware accelerator 20a; the processor 111a then controls the selector 2114 to obtain the fourth matrix from the first matrix buffered in the hardware accelerator 20a according to the index information. In another possible implementation, the processor 111a divides the first matrix into one or more first matrix blocks (a first matrix block includes l rows), and the hardware accelerator 20a, under the control of the processor 111a, loads the first matrix blocks in the memory 112a into the second data buffer 2112 over multiple passes, where one first matrix block or multiple first matrix blocks may be loaded at a time; this can improve the loading bandwidth. In this case, the fourth matrix is obtained from the first matrix block loaded into the second data buffer 2112.
In one possible implementation, the first matrix may include at least one first matrix block Ai. The first matrix block Ai comprises l rows, which may be adjacent l rows in the first matrix or l rows at intervals in the first matrix. The second matrix comprises at least one second matrix block Bj. The second matrix block Bj comprises n columns of the second matrix, which may be adjacent n columns or n columns at intervals; i and j are integers. The present application provides three ways for the processor to determine the sizes of the first matrix block Ai and the second matrix block Bj, where the specification of the hardware accelerator instructs the hardware accelerator to process the multiplication of an l × m matrix and an m × n matrix.
In the first implementation, the sparsity of the second matrix is determined, and the reciprocal of the sparsity is taken as the highest gear T1. According to the specification of the hardware accelerator and the highest gear T1, the second matrix block Bj of the second matrix can be determined as (m × T1) rows and n columns. In this way the size of the second matrix block can be determined quickly; especially when the non-zero elements of the second matrix are distributed uniformly, the size of the second matrix block can be determined quickly and accurately.
Accordingly, since the number of columns of a first matrix block Ai of the first matrix equals the number of rows of the second matrix block, the first matrix block Ai can be determined as l rows and (m × T1) columns.
In the second implementation, the second matrix is divided into a plurality of small matrix blocks according to the distribution of its non-zero elements (for example, into 9 small matrix blocks in a nine-square grid), and the sparsity of each small matrix block is determined. If the sparsity of most of the small matrix blocks is smaller than a threshold and the sparsity of a few of them is larger than the threshold, the reciprocal of the sparsity of the majority can be taken as the highest gear T2, and, according to the specification of the hardware accelerator and the highest gear T2, the second matrix block Bj of the second matrix is determined as (m × T2) rows and n columns. For a second matrix whose non-zero elements are distributed non-uniformly, this second implementation can determine the size of the second matrix block accurately.
For example, if the sparsity of most of the second matrix is less than 25% and the sparsity is only locally greater than 25%, the highest gear T2 may be set to 4.
Accordingly, the size of the first matrix block of the first matrix may be determined as l rows and (m × T2) columns.
In the third implementation, the highest gear T3 of the second matrix is determined according to historical empirical values, and, according to the specification of the hardware accelerator and the highest gear T3, the second matrix block Bj of the second matrix can be determined as (m × T3) rows and n columns.
Accordingly, the size of the first matrix block of the first matrix may be determined as l rows and (m × T3) columns.
Illustratively, suppose the first matrix is a 32 × 32 matrix, the second matrix is also a 32 × 32 matrix, the specification of the hardware accelerator supports the product operation of an 8 × 8 matrix and an 8 × 8 matrix, and the highest gear T of the second matrix is 4. Fig. 4a provides a structural schematic diagram of the second matrix for the present application. As shown in fig. 4a, the second matrix comprises four second matrix blocks Bj, namely B0, B1, B2 and B3; each second matrix block includes n = 8 columns, and its number of rows is m × T = 8 × 4 = 32. Correspondingly, fig. 4b provides a schematic structural diagram of the first matrix. As shown in fig. 4b, the first matrix comprises four first matrix blocks Ai, namely A0, A1, A2 and A3; each first matrix block has l = 8 rows and m × T = 8 × 4 = 32 columns.
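For the fig. 4a / 4b example, the partitioning can be written down directly. This is an illustrative software view only, with l = m = n = 8 and T = 4 assumed from the text above, and the variable names A_blocks and B_blocks are ours.

    import numpy as np

    l = m = n = 8
    T = 4
    A = np.random.randn(32, 32)   # first matrix (fig. 4b)
    B = np.random.randn(32, 32)   # second matrix (fig. 4a)

    A_blocks = [A[i * l:(i + 1) * l, :] for i in range(4)]   # A0..A3: 8 rows x (m*T)=32 cols
    B_blocks = [B[:, j * n:(j + 1) * n] for j in range(4)]   # B0..B3: (m*T)=32 rows x 8 cols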
Step 304, for each third matrix in the at least one third matrix, the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix.
With reference to fig. 2b, the processor 111a controls the hardware accelerator 20a to input the fourth matrix and the third matrix to the arithmetic circuit 212; the multiplication unit 2121 of the arithmetic circuit 212 multiplies the third matrix by the fourth matrix to obtain intermediate values, and the intermediate values are input to the accumulation unit 2122 of the arithmetic circuit 212 for accumulation to obtain a fifth matrix.
Optionally, there may be a plurality of fourth matrices; each fourth matrix is multiplied by one column of the third matrix, and the multiplication results of the fourth matrices with the corresponding columns of the third matrix are accumulated to obtain a fifth matrix.
In a possible implementation, if t is an integer greater than or equal to 2, the hardware accelerator divides the third matrix into t matrices of m × n, and performs multiply-add operations on the fourth matrix and the t matrices of m × n respectively to obtain a fifth matrix.
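Steps 303 and 304 together can be modeled as follows, reusing the compact_block output sketched earlier: when t ≥ 2, the (t × m) × n third matrix is treated as t stacked m × n matrices, and the accelerator's l × m by m × n multiply is applied t times, accumulating into one l × n fifth matrix. The function name and loop structure are illustrative, not the hardware datapath.

    import numpy as np

    def fifth_matrix(a_rows: np.ndarray, third: np.ndarray, index: np.ndarray, m: int):
        """a_rows: the l x M rows of the first matrix; third/index: the
        output of compact_block for an M x n block of the second matrix."""
        t = third.shape[0] // m
        n = third.shape[1]
        fifth = np.zeros((a_rows.shape[0], n), dtype=third.dtype)
        for k in range(t):                                    # one pass per m x n slice
            for c in range(n):
                fourth = a_rows[:, index[k * m:(k + 1) * m, c]]      # l x m fourth matrix
                fifth[:, c] += fourth @ third[k * m:(k + 1) * m, c]  # multiply-accumulate
        return fifth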
In step 305, the hardware accelerator obtains a target result according to the at least one fifth matrix, where the target result is an L × N matrix.
In one possible implementation manner, in combination with fig. 2b, the accumulation unit 2122 of the hardware accelerator 20a accumulates at least one fifth matrix to obtain a target result, where the target result is a result of a product operation of the first matrix and the second matrix.
Through the above steps 301 to 305: the third matrix is obtained by compacting the non-zero elements of n columns of the second matrix; the l × m fourth matrix is reduced in specification compared with the L × M first matrix and meets the specification of the hardware accelerator; and the third matrix is generated according to the specification of the hardware accelerator. The hardware accelerator can therefore multiply the l × m fourth matrix with the third matrix to obtain the fifth matrix, and obtain the target result, an L × N matrix, from the at least one fifth matrix; that is, the target result is the operation result of the first matrix and the second matrix. The product of the first matrix and the second matrix is thus realized through the products of the third and fourth matrices and the calculation over the fifth matrices. In this scheme, the processor removes part or all of the zero elements in the n columns of the second matrix to obtain the third matrix, so the number of zero elements participating in operations in the hardware accelerator is reduced compared with the number of zero elements in those n columns of the second matrix. Rows containing only zero elements in the n columns of the second matrix are deleted to obtain the third matrix, and the fourth matrix is obtained according to the index information; since zero elements have no influence on the operation result, the result remains accurate while the computation is reduced.
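As a sanity check on the scheme of steps 301 to 305, the illustrative helpers above (compact_block and fifth_matrix) can be combined into an end-to-end software model and compared against the ordinary dense product. Here the fifth matrices are assembled into the L × N target block by block; the matrices and sparsity level are fabricated for the test only.

    import numpy as np

    np.random.seed(0)
    l = m = n = 8
    A = np.random.randn(32, 32)                                # first matrix, L x M
    B = np.where(np.random.rand(32, 32) < 0.2,                 # sparse second matrix
                 np.random.randn(32, 32), 0.0)

    target = np.zeros((32, 32))
    for i in range(0, 32, l):                  # first matrix blocks Ai (l rows each)
        for j in range(0, 32, n):              # second matrix blocks Bj (n columns each)
            third, index = compact_block(B[:, j:j + n], m)
            target[i:i + l, j:j + n] = fifth_matrix(A[i:i + l, :], third, index, m)

    assert np.allclose(target, A @ B)          # matches the dense product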
The data processing method flow shown in fig. 3 can be applied to a data processing scenario of a neural network model. In such a scenario, the first matrix can be a data matrix or a parameter matrix: if the first matrix is the data matrix, the second matrix is the parameter matrix; if the first matrix is the parameter matrix, the second matrix is the data matrix. In neural network models, the parameter matrix is typically a sparse matrix.
The data processing method provided by the present application will be described below by taking as an example that the first matrix is the matrix shown in fig. 4b and the second matrix is the matrix shown in fig. 4a. In this method, the specification of the hardware accelerator is taken to support the product operation of an 8 × 8 matrix and an 8 × 8 matrix.
As shown in fig. 4c, a schematic flow chart of another data processing method is provided for the present application. In this method, the processor may be the processor in fig. 2a or fig. 2b, and the hardware accelerator may be the hardware accelerator in fig. 2a or fig. 2b. The method comprises the following steps:
In step 401, a processor obtains a first matrix and a second matrix.
This step is the same as step 301 in the embodiment shown in fig. 3; reference may be made to the foregoing description, which is not repeated here.
In step 402, the processor divides the first matrix into four first matrix blocks and the second matrix into four second matrix blocks.
The first matrix is divided into four first matrix blocks A0, A1, A2 and A3, and the second matrix is divided into four second matrix blocks B0, B1, B2 and B3. For the process of partitioning the first matrix and the second matrix, reference may be made to the implementation of step 303 in the embodiment shown in fig. 3, which is not repeated here.
In this application, the product of the first matrix and the second matrix requires each first matrix block of the first matrix to be multiplied with each second matrix block of the second matrix and the results to be accumulated. The product operation of the first matrix and the second matrix is: A0*B0 + A0*B1 + A0*B2 + A0*B3 + A1*B0 + A1*B1 + A1*B2 + A1*B3 + A2*B0 + A2*B1 + A2*B2 + A2*B3 + A3*B0 + A3*B1 + A3*B2 + A3*B3. The multiplication process of a first matrix block Ai with a second matrix block Bj is the same in every case, while the number of non-zero elements in Bj affects the operation process of Ai*Bj. This embodiment is therefore illustrated separately in two cases.
The first case is illustrated with the second matrix block Bj shown in fig. 5a. As shown in the exemplary diagram of fig. 5a, the second matrix block Bj includes 32 rows and 8 columns. In this first case, after the above step 402, steps 403 to 410 follow.
In step 403, the processor determines the number p of non-zero elements in the column of the second matrix block Bj with the most non-zero elements.
As shown in fig. 5a, the numbers of non-zero elements in the columns of the second matrix block Bj are: p0 = 6, p1 = 4, p2 = 7, p3 = 6, p4 = 0, p5 = 4, p6 = 5, p7 = 3; that is, the number p of non-zero elements in the column of the second matrix block Bj with the most non-zero elements equals 7.
In step 404, the processor determines a third matrix according to the determined p and the specification of the hardware accelerator.
In one possible implementation, t of the third matrix satisfies the condition (t − 1) × m < p ≤ t × m. Combining fig. 5a (the number of non-zero elements in the column of the second matrix block Bj with the most non-zero elements is 7) with the specification of the hardware accelerator, t satisfies (t − 1) × 8 < 7 ≤ t × 8, so t = 1 may be determined; that is, the third matrix is an 8 × 8 matrix, as shown in fig. 5b. The third matrix shown in fig. 5b comprises all the non-zero elements of the second matrix block Bj shown in fig. 5a, and each column is supplemented with the appropriate number of zero elements, identified by x. The non-zero elements of the third matrix need not be arranged contiguously; for example, the 6 non-zero elements of column 0 may occupy the first 6 rows or the last 6 rows, and the complementary zero elements may be in the last two rows of column 0, in the first two rows, or in any two rows of column 0. This application does not limit the positions of the complementary zero elements in any column.
In step 405, the processor determines the index information of the second matrix block Bj.
It can also be understood that the processor determines the position information of the non-zero elements of the second matrix block Bj, and the position information of the zero padding required to form a matrix satisfying the specification of the hardware accelerator.
For the second matrix block Bj shown in fig. 5a, the position information of the non-zero elements is as shown in fig. 5c: the row addresses of the non-zero elements in column 0 are 1/4/8/12/17/28; in column 1, 3/6/14/24; in column 2, 0/1/21/25/28/29/30; in column 3, 0/5/12/13/21/27; column 4 contains only zero elements; in column 5, 5/12/18/28; in column 6, 3/8/21/23/25; and in column 7, 11/16/30. A column whose number pi of non-zero elements is smaller than 8 (8 being the number of rows of the right matrix of the hardware accelerator) is filled with zero elements, which, as shown in fig. 5c, may also be identified by x; for example, column 0 has 6 non-zero elements and is filled with 2 zero elements, while column 2 has 7 non-zero elements and is filled with 1 zero element.
As explained in conjunction with fig. 2b above, the processor 111a may control the storing of both the determined third matrix and the index information in the memory 112a. Since the third matrix has fewer zero elements than the second matrix block Bj, the amount of data stored in the memory is reduced, saving memory space.
There is no required order between the above steps 404 and 405: step 404 may be executed first and then step 405, step 405 may be executed first and then step 404, or steps 404 and 405 may be performed simultaneously.
At step 406, the hardware accelerator loads the third matrix, the index information, and the corresponding first matrix block.
In one possible implementation, the first matrix block Ai is as shown in fig. 6a, comprising 8 rows and 32 columns. The hardware accelerator loads the first matrix block Ai, the third matrix and the index information corresponding to the third matrix, without needing to load the entire first matrix and second matrix at a time, which improves the bandwidth of loading data from the memory to the hardware accelerator. It can also be understood that, to improve the bandwidth of data loading from memory to the hardware accelerator and the computational performance of the hardware accelerator, in an alternative implementation the hardware accelerator may load one first matrix block Ai, one third matrix and the corresponding index information at a time. As explained in conjunction with fig. 2b above, in one possible implementation the processor 111a stores the third matrix and the index information in the memory 112a, and the first matrix is also stored in the memory 112a in units of first matrix blocks Ai. Under the control of the processor 111a, the hardware accelerator 20a loads the third matrix in the memory 112a into the first data buffer 2111, loads the first matrix block corresponding to the third matrix into the second data buffer 2112, and loads the index information into the index information buffer 2113.
In step 407, the hardware accelerator selects m columns of elements from the first matrix block Ai according to the index information to obtain a fourth matrix.
The m row addresses corresponding to one column address of the index information are used here because each row of the first matrix block Ai is to be multiplied with a column of the second matrix block Bj, so a fourth matrix is obtained for each column address of the index information. In combination with the index information in fig. 5c, 8 fourth matrices of 8 × 8 are obtained, which may be named ai_1, ai_2, ..., ai_8. Taking the 8 row addresses corresponding to column address 0 in fig. 5c as an example, elements can be selected from the first matrix block Ai to obtain an 8 × 8 fourth matrix, as shown in fig. 6b; the positions corresponding to the zero elements x in the index information may be selected arbitrarily from the first matrix block Ai or filled directly with zero elements, and the fourth matrix shown in fig. 6b takes, for each row of the first matrix block Ai, the elements in the last two positions. Obtaining the fourth matrix from the first matrix block Ai by means of the index information can also be understood as generating 8 fourth matrices of 8 × 8 from a matrix block of 8 rows and 32 columns according to the index information, so that the specification of the matrix the hardware accelerator must operate on is reduced, and the larger matrix block can be processed in a hardware accelerator of this specification.
Taking fig. 2b as an example, the selector 2114 of the hardware accelerator 20a selects m columns of elements from the first matrix block Ai buffered in the second data buffer 2112, according to the index information buffered in the index information buffer 2113, to obtain a fourth matrix.
In step 409, the hardware accelerator performs a multiply-add operation on the fourth matrix and the third matrix to obtain a fifth matrix.
One possible implementation is shown in fig. 6c: ai_1 is multiplied with the first column of the third matrix, ai_2 is multiplied with the second column of the third matrix, and so on, until ai_8 is multiplied with the 8th column of the third matrix; the intermediate values obtained by the multiplications are accumulated to obtain a fifth matrix of 8 × 8. The multiplication of a fourth matrix with a column of the third matrix can also be understood as the multiplication of a matrix and a vector. Through this scheme, the hardware accelerator can finish the product operation of the third matrix and the fourth matrices in one pass.
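The one-shot product of case one (t = 1) can be modeled as below: each of the 8 fourth matrices multiplies the matching column of the third matrix, the accumulation summing the element-wise products inside each matrix-vector multiply, and the 8 resulting l-element vectors form the columns of the 8 × 8 fifth matrix. The function name is ours, and the code is a software illustration only.

    import numpy as np

    def one_shot_product(fourths: list, third: np.ndarray) -> np.ndarray:
        """fourths: n matrices of l x m (ai_1 .. ai_n); third: m x n.
        Column c of the fifth matrix is fourths[c] @ third[:, c]."""
        n = third.shape[1]
        return np.column_stack([fourths[c] @ third[:, c] for c in range(n)])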
With reference to fig. 2b, in a possible implementation, the third matrix and the 8 fourth matrices may be input at one time to the multiplication unit 2121 of the arithmetic circuit 212 for the product operation to obtain intermediate values, and the intermediate values are input to the accumulation unit 2122 for accumulation to obtain a fifth matrix.
In step 410, the hardware accelerator accumulates the obtained at least one fifth matrix to obtain a target result.
As described with reference to fig. 4a, fig. 4b and fig. 2b, the first matrix shown in fig. 4b includes 4 first matrix blocks and the second matrix shown in fig. 4a includes 4 second matrix blocks, so the hardware accelerator 20a may obtain 4 × 4 = 16 fifth matrices, and the accumulation unit 2122 of the hardware accelerator 20a may accumulate the 16 fifth matrices to obtain the target result.
Through the above steps 403 to 410, a hardware accelerator whose specification supports the multiplication of an 8 × 8 matrix with an 8 × 8 matrix completes the multiplication of an 8 × 32 matrix with a 32 × 8 matrix in one pass.
The second case is illustrated with the second matrix block Bj shown in fig. 7a. As shown in the exemplary diagram of fig. 7a, the second matrix block Bj includes 32 rows and 8 columns. The following describes in detail how steps 403 to 410 continue after step 402 in this second case.
In step 403, the processor determines that the numbers of non-zero elements in the columns of the second matrix block Bj in fig. 7a are: p0 = 8, p1 = 8, p2 = 11, p3 = 6, p4 = 1, p5 = 6, p6 = 5, p7 = 3; the number p of non-zero elements in the column of the second matrix block Bj with the most non-zero elements is therefore 11.
In step 404, according to the condition (t − 1) × m < 11 ≤ t × m that t of the third matrix satisfies, t may be determined to be 2; that is, the third matrix is a (2 × 8) × 8 matrix, as shown in fig. 7b. The third matrix shown in fig. 7b comprises all the non-zero elements of the second matrix block Bj shown in fig. 7a, and each column is supplemented with the appropriate number of zero elements, identified by x; for example, column 0 is supplemented with 8 zero elements, placed in the last 8 rows of column 0 as an instance. Optionally, the non-zero elements of the third matrix need not be arranged contiguously: the complementary zero elements may be located in the last 8 rows of column 0, in the first 8 rows, or in any 8 rows, and this application does not limit the positions of the complementary zero elements in any column.
In step 405 above, the processor determines the index information of the second matrix block Bj shown in fig. 7a. As shown in fig. 7c, the index information is divided into two parts. The first part is the position information of the non-zero elements for the first 8 columns: the row addresses of the non-zero elements in column 0 are 1/4/8/9/12/13/17/28; in column 1, 3/4/6/11/14/20/24/26; in column 2, 0/1/3/16/18/20/21/25; in column 3, 0/5/12/13/21/27; in column 4, 15; in column 5, 5/11/12/18/24/28; in column 6, 3/8/21/23/25; and in column 7, 11/16/30. A column whose number pi of non-zero elements is smaller than m = 8 is filled with zero elements, which may also be identified by x. For example, column 0 has 8 non-zero elements and needs no zero padding, while column 3 has 6 non-zero elements and is filled with 2 zero elements; the positions of these 2 zero elements in column 3 may be arbitrary, and fig. 7c places them in the last two rows as an example. The second part is the position information of the non-zero elements for the last 8 columns: the row address of the non-zero element in column 0 is 29, that is, column 0 has one non-zero element and the remaining 7 entries may be identified by zero elements; column 1 has no non-zero element and all its entries may be identified by zero elements; the row addresses of the non-zero elements in column 2 are 28/29/30, and the remaining 5 entries may be identified by zero elements; columns 3/4/5/6/7 likewise have no non-zero elements and may be identified entirely by zero elements.
In step 406, the hardware accelerator loads the third matrix shown in fig. 7b, the index information shown in fig. 7c, and the corresponding first matrix block. In this second case, the first matrix block may be the same as the first matrix block shown in fig. 6a, and the manner in which the hardware accelerator loads the third matrix, the index information and the corresponding first matrix block may be the same as described for step 406 in the first case, which is not repeated here.
In step 407, the hardware accelerator may select data from the first matrix block Ai in two passes according to the index information shown in fig. 7c. In the first pass, according to the row addresses corresponding to each of the first 8 columns (the first columns 0 to 7) of the index information shown in fig. 7c, 8 fourth matrices are obtained from the first matrix block, which may be named ai_1_1, ai_1_2, ..., ai_1_8. In the second pass, according to the row addresses corresponding to the last 8 columns (the last columns 0 to 7) of the index information shown in fig. 7c, 8 further fourth matrices are obtained from the first matrix block, which may be named ai_2_1, ai_2_2, ..., ai_2_8. The two passes obtain the fourth matrices in the same manner as the process of fig. 6b, which is not repeated here.
Taking the above fig. 2b as an example, the selector 2114 in the hardware accelerator 20a may, in two passes according to the index information buffered in the index information buffer 2113, select m columns of elements from the first matrix block Ai buffered in the second data buffer 2112 to obtain the fourth matrices.
After step 407 and before step 409, the second case further includes step 408, in which the hardware accelerator divides the third matrix into 2 matrix blocks of 8 × 8.
In the third matrix shown in fig. 7b, one 8 × 8 matrix block is located above the dotted line and another 8 × 8 matrix block below it. The index information corresponding to the first 8 columns in fig. 7c is the position information of the non-zero elements in the matrix block above the dotted line in fig. 7b, and the index information corresponding to the last 8 columns in fig. 7c is the position information of the non-zero elements in the matrix block below the dotted line in fig. 7b.
In the above step 409, the hardware accelerator performs multiply-add operations on the 16 fourth matrices, obtained in the manner shown in fig. 6b, and the third matrix shown in fig. 7b to obtain a fifth matrix.
In conjunction with fig. 2b, the 16 fourth matrices may be input to the arithmetic circuit 212 of the hardware accelerator 20a in two passes. In the first pass, the first (upper) 8 × 8 matrix block of the third matrix and the fourth matrices ai_1_1, ai_1_2, ..., ai_1_8 are input together to the multiplication unit 2121 of the arithmetic circuit 212; in the multiplication unit 2121 each fourth matrix is multiplied by one column, that is, ai_1_1 is multiplied with the first column, ai_1_2 with the second column, and so on, until ai_1_8 is multiplied with the 8th column. In the second pass, the second (lower) 8 × 8 matrix block and the fourth matrices ai_2_1, ai_2_2, ..., ai_2_8 are likewise input to the multiplication unit 2121 for the product operation, that is, ai_2_1 is multiplied with the first column, ai_2_2 with the second column, and so on, until ai_2_8 is multiplied with the 8th column. The two multiplication passes can refer to the process of fig. 6c and are not described again here. The intermediate values obtained from the two passes are input to the accumulation unit 2122 for accumulation to obtain the fifth matrix. This process in case two can also be understood as the process of case one computed in a loop t times (t = 2 in case two).
Further, the hardware accelerator accumulates at least one fifth matrix obtained in case two to obtain a target result.
In case two, the hardware accelerator completes the multiplication of an 8 × 32 matrix with a 32 × 8 matrix through two multiplications of an 8 × 8 matrix with an 8 × 8 matrix.
In this application, the processor obtains at least one third matrix and corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, covering the two cases above. For convenience of explanation, the second matrix shown in fig. 4a, the first matrix shown in fig. 4b, and a hardware accelerator whose specification supports the product of an 8 × 8 matrix and an 8 × 8 matrix were taken as the example.
In this application, if the second matrix is the left matrix, the third matrix is a matrix of m × t × n, and the data processing method can refer to the contents of fig. 3 to fig. 7c, which are not described again here.
Based on the above and the same idea, the present application provides an apparatus, as shown in fig. 2a, for performing the above data processing method. The apparatus in this example may execute the scheme correspondingly executed by the data processing apparatus in fig. 3, or the scheme correspondingly executed by the data processing apparatus in fig. 4c.
Optionally, the memory 112 may be further configured to store program instructions, and the processor 111 may call the program instructions stored in the memory 112 to perform one or more steps in the embodiments shown in the above schemes, or in alternative embodiments thereof, so that the apparatus implements the functions of the data processing apparatus in the above methods.
The processor 111 is configured to execute the instructions stored in the memory 112 and to control the hardware accelerator 20 to execute acceleration tasks. When the processor 111 executes the instructions stored in the memory 112, the processor 111 in the apparatus is configured to: obtain a first matrix and a second matrix, where the first matrix and the second matrix both include non-zero elements, the first matrix is an L × M matrix, the second matrix is an M × N matrix, and L, M and N are all positive integers; and obtain at least one third matrix and corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, where the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l × m matrix and an m × n matrix, the third matrix is a matrix of (t × m) × n, each third matrix in the at least one third matrix respectively comprises each non-zero element in different n columns in the second matrix and does not comprise part or all of the zero elements in those n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the different n columns in the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
the hardware accelerator 20 is configured to, for each third matrix in the at least one third matrix, obtain a fourth matrix from corresponding l rows in the first matrix according to index information corresponding to the third matrix, where the fourth matrix is a matrix of l × m; obtaining a fifth matrix according to the fourth matrix and the third matrix; and obtaining a target result according to the at least one fifth matrix, wherein the target result is a matrix of L x N.
In an alternative embodiment, t satisfies the following condition: (t − 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the most non-zero elements in the different n columns in the second matrix.
In an alternative embodiment, p satisfies the following condition: p ≤ M − m.
In an optional embodiment, the index information includes row addresses and column addresses, and one column address corresponds to m row addresses. The hardware accelerator 20 is configured to select m columns of elements from the corresponding l rows in the first matrix according to the m row addresses corresponding to one column address of the index information, to obtain a fourth matrix, where the m row addresses are in one-to-one correspondence with the column addresses of the m columns of elements.
In an alternative embodiment, t is an integer greater than or equal to 2, and the hardware accelerator 20 is configured to divide the third matrix into t matrices of m × n, and to perform multiply-add operations on the fourth matrix and the t matrices of m × n respectively to obtain a fifth matrix.
Based on the foregoing and similar considerations, the present application provides an apparatus 800 for performing the above data processing method. Fig. 8 schematically illustrates a structural diagram of an apparatus provided in the present application; as shown in fig. 8, the apparatus 800 includes a processing module 801, a selecting module 802, and an operation module 803. The selecting module 802 may be the selector 2114 in fig. 2b, and the operation module 803 may be the arithmetic circuit 212 in fig. 2b. The apparatus 800 in this example may be the data processing apparatus described above, and may execute the scheme correspondingly executed by the data processing apparatus in fig. 3, or the scheme correspondingly executed by the data processing apparatus in fig. 4c.
A processing module 801, configured to obtain a first matrix and a second matrix, where the first matrix and the second matrix both include non-zero elements, the first matrix is an L × M matrix, the second matrix is an M × N matrix, and L, M and N are all positive integers; and to obtain at least one third matrix and corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, where the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l × m matrix and an m × n matrix, the third matrix is a matrix of (t × m) × n, each third matrix in the at least one third matrix respectively comprises each non-zero element in different n columns in the second matrix and does not comprise part or all of the zero elements in those n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the different n columns in the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
A selecting module 802, configured to, for each third matrix in the at least one third matrix, obtain, according to the index information corresponding to the third matrix, a fourth matrix from the corresponding l rows in the first matrix, where the fourth matrix is a matrix of l × m;
An operation module 803, configured to obtain a fifth matrix according to the fourth matrix and the third matrix; and to obtain a target result according to the fifth matrix, where the target result is an L × N matrix.
It should be understood that the above division of the units of the apparatus 800 is only a logical division; in actual implementation they may be wholly or partially integrated into one physical entity or physically separated. In this application, the processing module 801 shown in fig. 8 may be implemented by the processor 111 shown in fig. 2a, and the selecting module 802 and the operation module 803 may be implemented by the hardware accelerator 20 shown in fig. 2a. That is to say, in this embodiment of the application, the processing module 801 may execute the scheme executed by the processor 111 in fig. 2a, and the selecting module 802 and the operation module 803 may execute the scheme executed by the hardware accelerator 20 in fig. 2a; the remaining contents may be referred to above and are not described again here.
In the above embodiments, the implementation may be wholly or partly implemented by software, hardware or a combination thereof, and when implemented using a software program, may be wholly or partly implemented in the form of a computer program product. The computer program product includes one or more instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The instructions may be stored on or transmitted from one computer storage medium to another computer storage medium, e.g., the instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optics, twisted pair) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer storage medium may be any medium that can be accessed by a computer or a data storage device comprising one or more integrated media, servers, data centers, and the like. The medium may be a magnetic medium (e.g., a flexible disk, a hard disk, a magnetic tape, a magneto-optical disk (MO), etc.), an optical medium (e.g., an optical disk), or a semiconductor medium (e.g., a ROM, an EPROM, an EEPROM, a Solid State Disk (SSD)), etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by instructions. These instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.