Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a convolution calculation accelerator based on CSR coding and an acceleration method. The technical problems to be solved by the invention are realized by the following technical scheme:
in a first aspect, the present invention provides a convolution calculation accelerator based on CSR coding, comprising:
the device comprises a data preprocessing module, a CSR encoding module, a multiplication systolic calculation array, a data distribution module, a data accumulation module, a data delay module, a data arrangement module, a re-quantization module and a total control module; wherein:
the data preprocessing module is used for reading data from the outside through DMA and performing blocking processing to obtain blocked data;
the CSR encoding module is used for carrying out CSR encoding on the blocked data to obtain characteristic data and their corresponding addresses;
the multiplication systolic calculation array is used for carrying out natural-flow multiplexing calculation on the corresponding characteristic data and weights according to the addresses, and transmitting the calculation results to the data distribution module;
the data distribution module is used for dividing the input calculation results into present-window data and cross-window data, and inputting them respectively into the corresponding accumulation units in the data accumulation module for data accumulation;
the data delay module is used for feeding back a backpressure signal to the multiplication systolic calculation array when it judges that an add-write conflict has occurred in the data accumulation module, so as to pause the current work, and for restarting the current work after the delayed data have been added;
the data arrangement module is used for integrating the accumulated data output by the data accumulation module; after the re-quantization module remaps the bit width, the data are written into off-chip storage by DMA, completing the multiplexing convolution calculation of discontinuous input data;
and the total control module is used for controlling the calling and data interaction of all other modules.
In a second aspect, the present invention provides a convolutional calculation acceleration method based on CSR coding, which is applied to the convolutional calculation accelerator based on CSR coding provided in the foregoing embodiment, and includes the following steps:
external data are obtained, block preprocessing and CSR coding are carried out, and characteristic data and corresponding addresses thereof are obtained;
performing natural-flow multiplexing calculation on the corresponding characteristic data and weights through the addresses based on the systolic array, and dividing the calculation results into present-window data and cross-window data;
accumulating the present-window data and the cross-window data through adder multiplexing, suspending the current work through backpressure control when an add-write conflict is judged to have occurred, and restarting the current work after the delayed data have been added;
and integrating the accumulated data and remapping their bit width, then writing them into off-chip storage to complete the multiplexing convolution calculation of discontinuous input data.
The invention has the beneficial effects that:
1. the convolution calculation accelerator based on CSR coding provided by the invention saves the additional overhead of encoding by performing block preprocessing on the input data; the CSR encoding effectively avoids the redundant parts of neural-network convolution calculation; the systolic calculation array after encoding, together with the subsequent series of processing, completes the multiplexing of discontinuous input data; and the data delay module realizes backpressure control, solving the add-write conflicts caused by the discontinuity of the input data. Meanwhile, dividing the data into present-window and cross-window data saves the storage size required for matching the data, reduces the pressure of on-chip storage, lowers the power consumption, and makes high-parallelism convolution calculation possible;
2. the convolution calculation accelerator based on CSR coding provided by the invention packages the preprocessing, CSR encoding and caching together, and saves the time overhead of encoding by utilizing ping-pong operation;
3. the invention saves extra area overhead through data classification and module multiplexing, which not only realizes characteristic-value multiplexing and adder multiplexing in the systolic array, but also enjoys the convenience of removing redundant zero calculations brought by the encoding.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Example 1
Referring to fig. 1, fig. 1 is a block diagram of a convolutional calculation accelerator based on CSR coding according to an embodiment of the present invention, where the accelerator includes:
the device comprises a data preprocessing module, a CSR encoding module, a multiplication systolic calculation array, a data distribution module, a data accumulation module, a data delay module, a data arrangement module, a re-quantization module and a total control module; wherein:
the data preprocessing module is used for reading data from the outside through DMA and performing blocking processing to obtain blocked data;
the CSR encoding module is used for carrying out CSR encoding on the blocked data to obtain characteristic data and their corresponding addresses;
the multiplication systolic calculation array is used for carrying out natural-flow multiplexing calculation on the corresponding characteristic data and weights according to the addresses, and transmitting the calculation results to the data distribution module;
the data distribution module is used for dividing the input calculation results into present-window data and cross-window data, and inputting them respectively into the corresponding accumulation units in the data accumulation module for data accumulation;
the data delay module is used for feeding back a backpressure signal to the multiplication systolic calculation array when it judges that an add-write conflict has occurred in the data accumulation module, so as to pause the current work, and for restarting the current work after the delayed data have been added;
the data arrangement module is used for integrating the accumulated data output by the data accumulation module; after the re-quantization module remaps the bit width, the data are written into off-chip storage by DMA, completing the multiplexing convolution calculation of discontinuous input data;
the total control module is used for controlling the calling and data interaction of all other modules.
Each module and its data processing procedure are described in detail in turn.
Data preprocessing module:
specifically, the data preprocessing module divides an input characteristic data graph into different matrixes according to a specific size after the Padding operation, and the default part of the matrixes is filled with zeros. The blocking processing can reduce the additional overhead caused by data coding and save the consumption of computing resources. The last actually implemented method may also be a block processing on the software side before the data is input in advance.
CSR coding module:
the CSR encoding module is used for carrying out CSR encoding on the blocked input data, filtering zero data in the blocked input data and inputting new data into the multiplication ripple calculation array. Specifically, the CSR coding rule is shown in fig. 2, for example, the example illustrated in the figure is a coding window with a size of 10×4, and an english letter indicates that the data is non-zero, and a no letter indicates that the data is zero. The new DATA obtained after encoding can be divided into characteristic DATA, denoted DATA in fig. 2, and addresses, denoted COUNT and ADDR in fig. 2, which together represent that the non-zero DATA is at a specific position in the original window. The characteristic data is input into a subsequent ripple calculation array to finish calculation, and the address is used as the input of the address calculation module.
In this embodiment, the data address calculation module is spatially coincident with the multiplication systolic calculation array.
Optionally, as an implementation manner, the data preprocessing module and the CSR encoding module may be packaged together with the storage space of the data input channel and duplicated once, so as to implement ping-pong operation, thereby saving the time overhead caused by data encoding.
Multiplication systolic calculation array:
since convolution calculation can be decomposed into multiplications and additions, the multiplication systolic calculation array is used for computing the products of the encoded data and the weights; the results are then input into the data distribution module to await distribution. The multiplication systolic calculation array in the present embodiment therefore adopts a mode in which the encoded data flow transversely and the weights, after being broadcast, flow longitudinally, so as to complete data multiplexing.
It should be noted that, because the encoded data are not continuous in the original feature-map distribution, the weights need to be input by broadcast according to a specific arrangement, that is, the weights are arranged in columns according to the number of convolution kernels, so as to minimize the possibility of conflicts caused by data of the same convolution window being calculated in the same clock cycle.
Referring to fig. 3, fig. 3 is a detailed structural schematic diagram of a convolution calculation accelerator based on CSR coding according to an embodiment of the present invention. The multiplication systolic calculation array comprises a plurality of PE units, whose brief structure is shown in fig. 4; each PE unit comprises an address calculation module, a pipeline register and a multiplier, wherein:
the address calculation module calculates the position information of the corresponding coded data in the original window according to the currently received address;
the pipeline register is used for transmitting the current address and the corresponding coded data to the next PE unit;
the multiplier is used for multiplying the current non-zero data with the weight to obtain a corresponding product, and the product and the position information obtained by the address calculation module are synchronously kept to be fed into the data distribution module.
Specifically, the address calculation module f_addr calculates the position of each datum in the original window according to the COUNT and ADDR values obtained by the preceding encoding. The datum and its COUNT and ADDR values are passed to the next PE unit through the pipeline register, so that the encoded data flow transversely through the systolic array; each non-zero datum is also multiplied by the weight in the multiplier to obtain the corresponding product, and the product is sent, in synchronization with the position information obtained by the address calculation module, to the data distribution module.
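The position recovery performed by f_addr can be sketched as follows, assuming COUNT holds per-row non-zero counts and ADDR holds column indices (an assumption about the exact field semantics, which fig. 2 does not pin down):

```python
def f_addr(count, addr):
    """Recover (row, col) positions in the original window from CSR fields.

    A running index over COUNT tells which row each DATA element belongs
    to; ADDR supplies the column directly.
    """
    positions = []
    idx = 0
    for row, n in enumerate(count):
        for _ in range(n):
            positions.append((row, addr[idx]))
            idx += 1
    return positions
```

In hardware this would be a small counter per PE rather than a loop, but the mapping from (COUNT, ADDR) to window coordinates is the same.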
Data distribution module:
After the multiplication systolic calculation array finishes its calculation, the data distribution module divides the data into two data streams according to the specific situation, namely a present-window data stream and a cross-window data stream.
Referring to fig. 5-6, fig. 5 is a schematic diagram of data classification provided by an embodiment of the present invention, and fig. 6 is a schematic diagram of cross-window data interaction provided by an embodiment of the present invention, where F is the size of the convolution kernel, ROW is the number of columns of the coding window, COL is the number of rows, and the blue portion represents a convolution window of size 3×3. As shown in fig. 5, if the third row of the window holds non-zero data, multiplying that non-zero data with any value of the third row of the convolution kernel generates cross-window data; that is, the products generated by the non-zero data of the white window need to be matched with the products generated by the non-zero data of the gray window. Present-window data, by contrast, have no relation to the data of other windows, so their calculation can be completed independently. The data distribution module is responsible for dividing the data stream from the systolic array into present-window data and cross-window data: it divides the finally calculated data into two parts according to the address and the control signal, and then sends them respectively to the different parts of the data accumulation module.
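The routing rule illustrated by fig. 5 — products originating in the last F-1 rows of a coding window must be matched with the next window — can be sketched as follows. A stride of 1 and zero-based row indices are assumptions made for illustration:

```python
def classify(row, window_rows, f):
    """Route one product by the row of its source datum.

    Data in the last F-1 rows of a coding window contribute to
    convolution windows that extend into the next block, so their
    products go to the cross-window stream; all other products stay
    in the present-window stream.
    """
    if row >= window_rows - (f - 1):
        return "cross-window"
    return "present-window"
```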
Optionally, in this embodiment, the data distribution module sets the window size as a trade-off: the space needed to store cross-window data is proportional to the window size, whereas the overhead bits introduced by encoding are inversely proportional to it. A smaller window thus reduces the circuit's actual operation overhead, storage overhead and logic decisions, but increases the encoding overhead. In practical application, an optimal result is obtained by weighing these two influences; for example, the window size selected by the present invention is 10×10.
Data accumulation module:
With continued reference to fig. 3, the data accumulation module includes a first-stage adder and a second-stage adder, and the first-stage adder may be multiplexed; wherein:
the first-stage adder is used for adding the convolution window data of the same input channel and outputting the convolution result of that channel;
the second-stage adder is used for accumulating the data of the same convolution window across different convolution channels, and sending the present-window data and cross-window data together to the data arrangement module.
In this embodiment, since the data distribution module divides the data into present-window data and cross-window data, the data accumulation module is correspondingly provided with two accumulation units for accumulating the two types of data respectively. As shown in fig. 3, the first-stage adder includes a present-window accumulating unit and a cross-window accumulating unit;
the data delay module is packaged with the present-window accumulating unit and used for accumulating present-window data;
the cross-window accumulating unit is used for accumulating cross-window data.
Specifically, the first-stage adder is divided into two parts corresponding to the two different data streams, and the data delay module is packaged together with part of the adders of the first-stage adder in the accumulation module to serve as the input module for present-window data. Within the present-window data, owing to the discontinuity and randomness of the encoded data mentioned above, two or more data may need to be written into one adder in the same clock cycle; this situation is hereinafter collectively referred to as an add-write conflict. The data delay module judges whether an add-write conflict has occurred; if so, it sends a backpressure signal to the preceding pipeline stages to halt their operation, and waits until the delayed data have been added before restarting the pipeline.
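The stall-and-delay behaviour can be modelled in software as follows. This is a behavioural sketch, not the RTL; the retry policy (delayed writes get first claim on the next cycle) is an assumption:

```python
def accumulate_with_backpressure(writes):
    """Model add-write conflicts with delay and backpressure.

    `writes` is a list of per-cycle lists of (adder_id, value) pairs.
    When two values target the same adder in one cycle, the extras are
    delayed to the next cycle and a backpressure flag is raised.
    """
    acc = {}
    pending = []       # delayed writes awaiting a free cycle
    backpressure = []  # one stall flag per cycle
    for cycle_writes in writes:
        queue = pending + list(cycle_writes)
        pending = []
        seen = set()
        stall = False
        for adder_id, value in queue:
            if adder_id in seen:              # add-write conflict
                pending.append((adder_id, value))
                stall = True
            else:
                seen.add(adder_id)
                acc[adder_id] = acc.get(adder_id, 0) + value
        backpressure.append(stall)
    # drain any writes still delayed after the last cycle
    for adder_id, value in pending:
        acc[adder_id] = acc.get(adder_id, 0) + value
    return acc, backpressure
```

A raised flag corresponds to the backpressure signal that halts the preceding pipeline stages.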
It will be appreciated that, to support this backpressure control, the present embodiment also places a backpressure memory bank formed by synchronous FIFOs between the distribution module and the present-window accumulating unit. The depth requirement of this memory bank is small: it only stores the in-flight pipeline data between the first pipeline stage and the delay-module stage, so as to prevent data loss.
Further, the present embodiment adopts the conventional accumulation unit structure shown in fig. 7 to implement the two-stage accumulator, which includes two MUX data selectors, a DFF flip-flop group and an adder; for the detailed working principle, reference may be made to the prior art.
The first-stage adder of the data accumulation module provided in this embodiment can be multiplexed to save the overhead of extra flip-flops. The multiplexing function of the first-stage adder is briefly described below.
The adder stage groups the adders by addition ID according to requirements, where the number of adder groups corresponds to the number of times the convolution window slides within the coding window; for example, when the coding window is 10 and the convolution kernel size is 3, the adders are divided into eight groups. The groups are then scheduled through their IDs to complete the multiplexing. When the data flow reaches the (F+1)-th row of the original window, all data required by the first convolution window, namely all data of the first F rows, have been calculated, so the adder group that computed the first convolution window can be reused to compute the (F+1)-th convolution window. Each convolution window needs F rows of data to finish its calculation; the first window therefore needs the time of F rows to obtain its result, while each subsequent window needs only the time of one additional row, which greatly improves multiplexing efficiency and reduces the extra cost of adders. The cross-window data would also generate a backpressure requirement; however, since only F-1 rows of data are distributed into the cross-window stream, the total amount of data affected by add-write conflicts is small, so an adder tree is adopted in that module to resolve write conflicts at the cost of area, rather than generating two different backpressure signals whose interaction would disturb the control.
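The group count in the example above follows directly from the sliding-window count; a one-line sketch (stride 1 assumed, as in the example):

```python
def adder_groups(window_rows, f, stride=1):
    """Number of adder groups: one per vertical slide of the FxF kernel
    within the coding window, e.g. (10 - 3) // 1 + 1 = 8 groups."""
    return (window_rows - f) // stride + 1
```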
Furthermore, the embodiment also designs a data temporary storage module, which is connected between the cross-window accumulating unit in the first-stage adder and the second-stage adder;
the data temporary storage module is used for synchronizing the cross-window data flow so as to cooperate with the second-stage adder to accumulate data;
and the data temporary storage module is packaged with part of adders in the second-stage adders.
In particular, the data temporary storage module corresponds to the synchronous FIFO in fig. 3 connected between the cross-window accumulating unit and the second-stage adder and, like the data delay module, is packaged together with part of the adders in the second-stage adder as the input module for cross-window data. This module completes the matching of cross-window data streams: under the control of the total control module, data streams that need to wait for a subsequent window are written into the synchronous FIFO, and data streams read out of it, representing matched data, are sent together into the second-stage adder to complete the next accumulation. Because of the reasonable window partitioning, the cross-window data account for only a small fraction of the total, so the required storage space is not large. In practice, the additional storage overhead of the present invention is embodied in this data temporary storage module, which internally contains the multiplexed adders and the cross-window data.
The second-stage adder in the accumulation module is used for accumulating the data of the same convolution window across different convolution channels; the present-window data and the cross-window data are then fed together into the data arrangement module.
Data arrangement module:
The data arrangement module is responsible for integrating the two different data streams; the final output data stream is integrated into a form similar to the original input window, and is then input to the re-quantization module.
Re-quantization module:
the re-quantization module is used for remapping the data with the expanded bit width after convolution to the bit width required by the next input so as to facilitate the next convolution calculation. And finally writing the quantized data into off-chip storage through DMA (direct memory access) to complete convolution calculation.
It can be appreciated that the convolution computing accelerator designed in this embodiment further includes a total control module for controlling the data flow direction and data interaction of the whole system. The detailed control method can be realized with reference to the prior art.
The convolution calculation accelerator based on CSR coding provided by the invention saves the additional overhead of encoding by performing block preprocessing on the input data; the CSR encoding effectively avoids the redundant parts of neural-network convolution calculation; the systolic calculation array after encoding, together with the subsequent series of processing, completes the multiplexing of discontinuous input data; and the data delay module realizes backpressure control, solving the add-write conflicts caused by the discontinuity of the input data. Meanwhile, dividing the data into present-window and cross-window data saves the storage size required for matching the data, reduces the pressure of on-chip storage, lowers the power consumption, and makes high-parallelism convolution calculation possible.
In addition, the invention saves extra area overhead through data classification and module multiplexing, which not only realizes characteristic-value multiplexing and adder multiplexing in the systolic array, but also enjoys the convenience of removing redundant zero calculations brought by the encoding.
Example two
On the basis of the first embodiment, the present embodiment provides a convolution calculation acceleration method based on CSR coding. Referring to fig. 8, fig. 8 is a flow chart of a convolutional calculation acceleration method based on CSR coding according to an embodiment of the present invention, where the method includes:
step 1: external data are obtained, block preprocessing and CSR coding are carried out, and characteristic data and corresponding addresses thereof are obtained;
step 2: performing natural-flow multiplexing calculation on the corresponding characteristic data and weights through the addresses based on the systolic array, and dividing the calculation results into present-window data and cross-window data;
step 3: accumulating the present-window data and the cross-window data through adder multiplexing, suspending the current work through backpressure control when an add-write conflict is judged to have occurred, and restarting the current work after the delayed data have been added;
step 4: integrating the accumulated data and remapping their bit width, then writing them into off-chip storage to complete the multiplexing convolution calculation of discontinuous input data.
The method provided in this embodiment may be applied to the convolution calculation accelerator provided in the first embodiment, and for the detailed process reference may be made to the description of the first embodiment. Accordingly, the method can likewise reduce the pressure of on-chip storage, reduce the power consumption, and is suitable for highly parallel convolution calculation.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.