Layered sparse coding method of pruning deep neural network with extremely high compression ratio
Technical Field
The invention relates to a hierarchical sparse coding method for a pruned deep neural network with an extremely high compression ratio, and belongs to the field of machine learning.
Background
Deep neural networks (DNNs) have become the state of the art in many areas, particularly computer vision, natural language processing, and audio processing. However, the rapid growth in the number of hidden layers and neurons consumes considerable hardware storage and memory bandwidth, which poses a serious challenge for many resource-constrained scenarios in real-world applications, especially since the advent of the post-Moore era has slowed hardware replacement cycles. Specifically, current DNNs face two major bottlenecks: 1. a conflict with resource-constrained application platforms, such as autonomous driving systems, mobile phones, mobile robots, and augmented reality (AR) devices, which are very sensitive to the energy consumption and computational cost of the DNN model, so there is an urgent need for DNN models with low energy consumption but good performance; 2. a conflict with new accelerators such as FPGAs, custom ASICs, and AI-specific chips, which are powerful computational accelerators for DNNs but are also sensitive to hardware storage, memory bandwidth, and parallelism, so DNNs are clearly expected to reduce hardware storage and memory usage while exploiting parallelism.
In order to solve the above-mentioned bottlenecks, many compression methods, such as pruning and quantization, have been proposed, and the number of weights of a trained DNN model can easily be reduced by using these effective compression methods. For example, the weight matrix of the target network can be made very sparse by a classical magnitude-based pruning method. To store and migrate these sparse weights, they are usually decomposed into two types of data, namely the non-zero weights and the metadata, where the metadata expresses the index information used to encode and decode the non-zero weights. Thus, while an acceptable final performance is ensured, a high compression ratio can be achieved by reducing the number of non-zero weights as much as possible; however, encoding the non-zero weights requires a large amount of metadata, often several times more than the number of actual non-zero weights.
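As a non-limiting illustration of this weight/metadata split, the following minimal NumPy sketch applies a magnitude threshold (the 90% sparsity level is an illustrative assumption, not part of the invention) and counts the index metadata that a coordinate-style encoding would need:

```python
import numpy as np

# A minimal sketch of magnitude-based pruning (the 90% sparsity threshold is an
# illustrative assumption, not the invention's own pruning schedule): weights
# below the chosen magnitude threshold are set to zero, leaving a sparse matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
threshold = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Storing the sparse matrix as non-zero weights plus metadata (here, COO-style
# row/column indices): the metadata count is a multiple of the non-zero count.
rows, cols = np.nonzero(W_pruned)
nonzero_weights = W_pruned[rows, cols]
metadata_entries = rows.size + cols.size       # two index entries per non-zero weight
print(nonzero_weights.size, metadata_entries)  # roughly 410 weights vs. 820 indices
```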
In summary, the large amount of metadata is the main obstacle to compressing, storing, and migrating the weights of a pruned DNN. The weights of a learned, compressed DNN model become sparse after pruning, so how to store and migrate these sparse weights has become a focus of attention in recent years. Generally, these studies can be divided into two categories according to their objective: compression ratio and parallel computation.
Most compression methods adopt a classical sparse coding method based only on the programming framework they use: for example, MATLAB, TensorFlow, and PyTorch integrate the coordinate (COO) format as their default sparse coding method, while SciPy encodes sparse matrices using compressed sparse rows/columns (CSR/CSC for short). New methods such as bitmasks and relative indices have recently been proposed that are capable of encoding sparse models, but they are sequential methods that are difficult to parallelize. In order to fully utilize the resources of deep learning accelerators, a series of novel sparse coding methods, including block compressed sparse column (BCSC) and nested bitmask (NB), have been proposed in recent years; these methods are suitable for parallel environments, but the compressed model still consumes a large amount of storage and memory bandwidth.
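For reference, the following sketch contrasts two of the classical formats mentioned above on the same pruned matrix using SciPy; the matrix size and sparsity level are illustrative assumptions, and the point is only that the metadata volume in both formats scales with the number of non-zeros:

```python
import numpy as np
from scipy import sparse

# A small sketch (not the invention's method) contrasting two classical sparse
# codings of the same pruned weight matrix; the matrix size and ~90% sparsity
# are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * (rng.random((64, 64)) < 0.1)  # ~90% zeros

coo = sparse.coo_matrix(W)  # COO: one row index and one column index per non-zero
csr = sparse.csr_matrix(W)  # CSR: one column index per non-zero plus row pointers

coo_metadata = coo.row.size + coo.col.size
csr_metadata = csr.indices.size + csr.indptr.size
print(coo.nnz, coo_metadata, csr_metadata)  # in both formats the metadata scales with the non-zeros
```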
A significant challenge for both categories of methods is that it is difficult to achieve a high compression ratio and efficient computation at the same time. Unlike these sparse coding methods, the present invention not only allows parallel model inference but also requires very little metadata, thus providing a higher compression ratio for pruned deep models.
Disclosure of Invention
The purpose of the invention is as follows: a hierarchical sparse coding method for a pruned deep neural network with a very high compression ratio is provided to solve the problems in the prior art.
The technical scheme is as follows: a hierarchical sparse coding method of a pruning deep neural network with a very high compression ratio specifically comprises the following steps:
step 1, initializing and training an over-parameterized DNN model;
step 2, pruning the model to make the model as sparse as possible;
step 3, adopting a layered sparse coding (LSC) method to further compress and encode the sparse weights: for each sparse weight matrix, applying a block bitmask mechanism (Block Bitmap Mechanism) in the block layer to divide the sparse matrix into a plurality of small blocks, flattening and concatenating all non-zero blocks into a vector, and feeding the flattened vector into the subsequent coding layer;
step 4, at the coding layer, encoding the flattened vector with extremely little metadata through the marked relative index (SRI) method, and performing compression coding;
and step 5, in the inference stage, starting the matrix multiplication process as early as possible by using intermediate decoding results to realize high-performance computation.
The general procedure of learning the compressed DNN model of step 1 and step 2 is as follows:
step 11, initializing and training an over-parameterized DNN model;
step 12, eliminating the weights that contribute little to prediction through pruning, and retraining the model;
and step 13, repeating the pruning and retraining processes several times, so that the finally obtained model maintains performance similar to the original model but with far fewer effective weights, as sketched below.
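A minimal sketch of this prune/retrain loop is given below; the functions train and evaluate are hypothetical placeholders for the user's own mask-respecting training and evaluation routines, and the sparsity schedule is purely illustrative:

```python
import numpy as np

# A minimal sketch of the prune/retrain loop of steps 11-13, under stated
# assumptions: `train` and `evaluate` are hypothetical placeholders for the
# user's own mask-respecting training and evaluation routines, and the
# sparsity schedule is purely illustrative.
def iterative_prune(weights, train, evaluate, rounds=5, sparsity_step=0.15):
    weights = train(weights, masks=None)              # step 11: train the dense, over-parameterized model
    for r in range(1, rounds + 1):
        target = min(r * sparsity_step, 0.95)
        masks = {}
        for name, W in weights.items():               # step 12: magnitude pruning of low-contribution weights
            threshold = np.quantile(np.abs(W), target)
            masks[name] = np.abs(W) >= threshold
            weights[name] = W * masks[name]
        weights = train(weights, masks=masks)         # retrain; the masks keep pruned weights at zero
        print(f"round {r}: sparsity={target:.2f}, accuracy={evaluate(weights):.4f}")
    return weights                                    # step 13: similar accuracy, far fewer effective weights
```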
The hierarchical sparse coding (LSC) method of step 3 further comprises the following steps:
step 31, a block bitmask mechanism is adopted to divide each sparse weight matrix in the block layer into a plurality of small blocks; for each block, if any non-zero weight exists, a 1-bit flag marks the block as true, otherwise the block is marked as false, so that a mask consisting of a plurality of 1-bit flags is obtained and all blocks are marked;
step 32, flattening all the non-zero blocks into a vector, and inputting the flattened vector into the coding layer, as sketched below.
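A minimal sketch of the block layer of steps 31-32 follows; the 4×4 block size is an illustrative assumption, and the matrix dimensions are assumed to be divisible by the block size for brevity:

```python
import numpy as np

# A minimal sketch of the block layer of steps 31-32; the 4x4 block size is an
# illustrative assumption, and the matrix dimensions are assumed to be divisible
# by the block size for brevity.
def block_layer(W, block=4):
    rows, cols = W.shape
    bitmask = []         # one 1-bit flag per block (True = the block holds a non-zero weight)
    nonzero_blocks = []  # non-zero blocks, each flattened, to be concatenated into one vector
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = W[i:i + block, j:j + block]
            flag = bool(np.any(tile != 0))
            bitmask.append(flag)
            if flag:
                nonzero_blocks.append(tile.ravel())
    flattened = np.concatenate(nonzero_blocks) if nonzero_blocks else np.empty(0)
    return np.array(bitmask, dtype=bool), flattened   # the mask and the vector fed to the coding layer
```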
The coding layer of the LSC method in step 4 adopts the SRI method to strongly compress the flattened non-zero blocks, compressing and encoding all the non-zero weights.
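For context only, the sketch below implements the conventional relative indexing baseline (cf. fig. 2), not the invention's marked SRI variant, whose exact format is given by fig. 2; the 3-bit index width is an illustrative assumption. It shows the padding-zero overhead that a conventional relative index incurs whenever a gap exceeds the index width:

```python
import numpy as np

# For context only: a minimal sketch of the *conventional* relative indexing
# baseline (cf. fig. 2), not the invention's marked SRI variant. Each non-zero
# value is stored with the gap to the previous non-zero, and a gap wider than
# the index width forces an explicit padding zero; the 3-bit width is an
# illustrative assumption.
def relative_index_encode(flat, bits=3):
    max_gap = (1 << bits) - 1
    codes = []                       # list of (gap, value) pairs
    last = -1
    for pos in np.flatnonzero(flat):
        gap = pos - last - 1
        while gap > max_gap:         # gap too wide: emit a padding zero entry
            codes.append((max_gap, 0.0))
            gap -= max_gap + 1
        codes.append((gap, float(flat[pos])))
        last = pos
    return codes
```

The padding entries are one source of metadata overhead in this conventional scheme; the marked SRI variant of the invention (fig. 2) is designed to require less metadata than this baseline.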
The inference process of step 5 further comprises:
step 51, computation tree construction: the computation tree determines the computation process of the matrix multiplication; first, the multiplication of the two matrices W × X is decomposed into computations over a plurality of sub-matrices; according to the principle of block matrix multiplication, W is divided into a plurality of sub-matrices, W × X is then converted into computations among these sub-matrices, and the computation process of each row is further organized into a computation tree;
step 52, pruning the computation tree: the high sparsity of the pruned DNN produces a large number of zero-value blocks, whose multiplications can be skipped, so the computation tree is pruned according to the zero-block flags of the block bitmask mechanism;
step 53, SRI decoding and sub-matrix multiplication: the SRI decoding process restores the SRI codes into non-zero blocks, and the sub-matrix multiplication process uses these non-zero blocks to perform the sub-matrix multiplications; the two processes are relatively independent, and once a sub-matrix multiplication is completed, the decoded non-zero block is discarded, saving memory bandwidth;
step 54, intermediate result integration: all intermediate results of the sub-matrix multiplications are accumulated to obtain the final result, and this can be implemented in parallel (a sketch of steps 51-54 follows this list).
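A minimal sketch of this blocked inference path (steps 51-54) is given below; it assumes the weights are stored as the block bitmask plus the flattened non-zero vector produced by the block layer sketched earlier, abstracts SRI decoding into simply reading the next stored block, and uses an illustrative 4×4 block size:

```python
import numpy as np

# A minimal sketch of the blocked inference path of steps 51-54, under stated
# assumptions: the weights are stored as the block bitmask plus the flattened
# non-zero vector produced by the block layer sketched earlier, SRI decoding is
# abstracted into simply reading the next stored block, and the 4x4 block size
# is illustrative.
def lsc_matmul(bitmask, flattened, shape, X, block=4):
    rows, cols = shape
    Y = np.zeros((rows, X.shape[1]))
    blocks_per_row = cols // block
    offset = 0                                   # read position in the flattened non-zero vector
    for idx, flag in enumerate(bitmask):         # steps 51-52: walk the computation tree, pruned by the bitmask
        if not flag:
            continue                             # zero block: its sub-matrix multiplication is skipped
        i = (idx // blocks_per_row) * block
        j = (idx % blocks_per_row) * block
        tile = flattened[offset:offset + block * block].reshape(block, block)  # step 53: "decode" one block
        offset += block * block
        Y[i:i + block] += tile @ X[j:j + block]  # step 54: accumulate the partial product
    return Y
```

Because each surviving block multiplication depends only on its own flag and decoded block, the sub-matrix products and the final accumulation can be distributed across parallel workers, which matches the parallel integration described in step 54.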
Advantageous effects: the method has a novel and simple hierarchical sparse coding structure; it provides a novel SRI method that can encode the non-zero weights with minimal space, and designs an effective decoding mechanism for the proposed LSC method so as to accelerate the multiplication of the encoded matrix in the inference stage. Analysis of a large number of comparative experimental results shows that the proposed LSC method obtains considerable gains in the compression and inference computation of pruned DNNs, exceeding the leading level in the field.
Drawings
Fig. 1 is a structural diagram of a hierarchical sparse coding LSC method.
Fig. 2 is a schematic diagram comparing a relative indexing method with marks and a conventional relative indexing method.
Fig. 3 is a diagram of a computation tree structure.
Fig. 4 is a matrix multiplication diagram under the LSC method.
Detailed Description
In an alternative embodiment, as shown in fig. 1, a hierarchical sparse coding method for a pruned deep neural network with an extremely high compression ratio maximizes the compression ratio by greatly reducing the amount of metadata, and optimizes the coding and inference process of the sparse matrix using a novel and effective hierarchical sparse coding framework. A sparse coding method with an extremely high compression ratio, the LSC method, is provided; the LSC method comprises two key layers, a block layer and a coding layer: in the block layer, the sparse matrix is divided into a plurality of small blocks and the zero blocks are then deleted, and in the coding layer, a novel SRI method is provided to further encode the non-zero blocks. In addition, the invention designs an effective decoding mechanism for the LSC method to accelerate the multiplication of the encoded matrix in the inference stage.
The method specifically comprises the following steps:
step 1, initializing and training an over-parameterized DNN model;
step 2, pruning the model to make the model as sparse as possible;
step 3, adopting a layered sparse coding (LSC) method to further compress and encode the sparse weights: for each sparse weight matrix, applying a block bitmask mechanism (Block Bitmap Mechanism) in the block layer to divide the sparse matrix into a plurality of small blocks, flattening and concatenating all non-zero blocks into a vector, and feeding the flattened vector into the subsequent coding layer;
step 4, at the coding layer, encoding the flattened vector with extremely little metadata through the marked relative index (SRI) method, and performing compression coding;
and step 5, in the inference stage, starting the matrix multiplication process as early as possible by using intermediate decoding results to realize high-performance computation.
In a further embodiment, the general procedure of learning the compressed DNN model of step 1 and step 2 is as follows:
step 11, initializing and training an over-parameterized DNN model;
step 12, eliminating, through pruning, the weights whose contribution to prediction is below an expected value, and retraining the model;
and step 13, repeating the pruning and retraining processes several times, so that the finally obtained model maintains performance similar to the original model but with far fewer effective weights.
In a further embodiment, the hierarchical sparse coding (LSC) method of step 3 further includes:
step 31, a block bitmask mechanism is adopted to divide each sparse weight matrix in the block layer into a plurality of small blocks; for each block, if any non-zero weight exists, a 1-bit flag marks the block as true, otherwise the block is marked as false, so that a mask consisting of a plurality of 1-bit flags is obtained and all blocks are marked;
step 32, flattening all the non-zero blocks into a vector, and inputting the flattened vector into the coding layer.
In a further embodiment, the coding layer of the LSC method in step 4 adopts the SRI method to strongly compress the flattened non-zero blocks, compressing and encoding all the non-zero weights.
In a further embodiment, the inference process of step 5 further comprises:
step 51, computation tree construction: the computation tree determines the computation process of the matrix multiplication; first, the multiplication of the two matrices W × X is decomposed into computations over a plurality of sub-matrices; according to the principle of block matrix multiplication, W is divided into a plurality of sub-matrices, W × X is then converted into computations among these sub-matrices, and the computation process of each row is further organized into a computation tree;
step 52, pruning the computation tree: the high sparsity of the pruned DNN produces a large number of zero-value blocks, whose multiplications can be skipped, so the computation tree is pruned according to the zero-block flags of the block bitmask mechanism;
step 53, SRI decoding and sub-matrix multiplication: the SRI decoding process restores the SRI codes into non-zero blocks, and the sub-matrix multiplication process uses these non-zero blocks to perform the sub-matrix multiplications; the two processes are relatively independent, and once a sub-matrix multiplication is completed, the decoded non-zero block is discarded, saving memory bandwidth;
step 54, intermediate result integration: all intermediate results of the sub-matrix multiplications are accumulated to obtain the final result, and this can be implemented in parallel.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of these embodiments; various equivalent modifications can be made within the technical spirit of the present invention, and such equivalent modifications also fall within the scope of the present invention.