Summary of the invention
In view of the above-mentioned problems of the prior art, the present invention is intended to provide a high-efficiency, load-balanced sparse convolutional neural network accelerator, with the purpose of achieving high reusability of weight and excitation data, a small volume of transmitted data, good scalability, a high degree of parallelism, and low requirements for hardware storage resources and DSP resources. It is a further object of the present invention to provide an acceleration method using the accelerator.
The technical solution adopted by the accelerator of the present invention is as follows:
A load-balanced sparse convolutional neural network accelerator, comprising: a master controller, for generating the control signal stream and data stream of the convolution operation and for processing and saving data; a data distribution module, which distributes weight data to the computing array according to the block partition scheme of the convolution operation; a computing array for the convolution operation, which completes the multiply-accumulate operations of sparse convolution and outputs partial-sum results; an output result cache module, which accumulates and caches the partial-sum results of the computing array, organizes them into a unified format, and outputs the feature map results to be activated and pooled; a linear activation function unit, which applies the bias and the activation function to the accumulated partial-sum results; a pooling unit, which performs the pooling operation on the results processed by the activation function; an online coding unit, which performs online coding of the excitation values still needed by subsequent convolutional layer operations; and an off-chip dynamic memory, for storing the raw image data, the intermediate results of the computing array, and the feature maps of the final output.
The acceleration method of the load-balanced sparse convolutional neural network accelerator of the present invention comprises the following steps:
1) Pruning is performed on the weight data of the convolutional neural network model: the weight data are grouped according to their scale parameters, and then, on the basis of guaranteeing the overall accuracy of the model, the same pruning pattern is applied to each group of weight data to sparsify it;
2) A load-balanced sparse convolution mapping scheme is formulated, and the sparsified convolutional neural network is mapped onto the convolution computing array of the accelerator;
3) The accelerator reconfigures the computing array and the storage array according to the configuration information of the mapping scheme, guaranteeing pipelined execution of the convolution operation;
4) The master controller controls the data distribution module to complete the distribution of weight data and excitation data, and the computing array performs the operation and outputs the convolution partial-sum results;
5) The convolution partial-sum results are accumulated and linearly rectified, i.e., the bias and the activation function are applied;
6) A pooling operation with the corresponding kernel size and stride is carried out according to the pooling requirement of the current convolutional layer;
7) Whether the current convolutional layer is the last layer is judged: if not, online coding is performed and the coded excitation results are sent to the next convolutional layer; if so, the results are output to the off-chip dynamic memory, completing the acceleration of the convolutional neural network.
Compared with the prior art, the present invention has the following advantages:
The load-balanced sparse convolutional neural network accelerator and its acceleration method provided by the invention make maximum use of the sparse characteristics of the convolution data and achieve high-efficiency operation of the convolution computing array under the condition of very limited storage resources, guaranteeing high reusability of the input excitations and weight data as well as load balance and high utilization of the computing array. Meanwhile, through static configuration, the computing array supports convolution operations of different sizes and scales and realizes parallel scheduling at two levels, namely between rows and columns of a feature map and between different feature maps, giving it good applicability and scalability. The design of the invention can well meet the current demand for low power consumption and a high energy efficiency ratio when running convolutional neural networks on embedded systems.
Specific embodiment
The scheme of the present invention is described in detail below with reference to the accompanying drawings.
Figure 1 is a schematic flow diagram of the load-balanced sparse convolutional network operation method. First, pruning is performed on the weight data of the convolutional neural network model: the weight data are grouped according to their scale parameters, and then, on the basis of guaranteeing the overall accuracy of the model, the same pruning pattern is applied to each group of weight data to sparsify it. Next, a load-balanced sparse convolution mapping scheme is formulated according to the sizes of the input feature maps and the convolution kernels, and the sparsified convolutional neural network is mapped onto the PE (Processing Element) array of the hardware accelerator. The hardware accelerator then reconfigures the PE array and the storage array according to the configuration information of the mapping scheme, guaranteeing pipelined execution of the convolution operation. The master controller of the accelerator controls the distribution of weight data and excitation data, and the PE array performs the operation and outputs the convolution partial-sum results. A linear rectification unit accumulates the partial-sum results and linearly rectifies them, i.e., applies the bias and the activation function. The pooling unit performs a pooling operation with the corresponding kernel size and stride according to the pooling requirement of the current convolutional layer, choosing either max pooling or average pooling. Finally, whether the current convolutional layer is the last layer is judged: if not, online coding is performed and the coded excitation results are sent to the next convolutional layer; if so, the results are output to off-chip storage, completing the entire convolution acceleration.
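As a purely functional reference for this flow (a software sketch, not a description of the hardware), the following Python code reproduces the per-layer sequence of convolution, bias, activation, and pooling; the data layout, the dense reference convolution, and the helper names are assumptions made for illustration, and the online coding and off-chip storage steps are only noted in comments.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2d(fmap, k, stride):
    # fmap: (C, H, W) feature map; windows assumed to fit exactly for brevity
    C, H, W = fmap.shape
    Ho, Wo = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.zeros((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[:, i, j] = fmap[:, i*stride:i*stride+k, j*stride:j*stride+k].max(axis=(1, 2))
    return out

def conv_layer(fmap, weights, bias):
    # Dense reference convolution (stride 1, no padding); on the accelerator this
    # step is carried out as grouped sparse matrix-vector products on the PE array.
    N, C, R, _ = weights.shape
    _, H, W = fmap.shape
    F = H - R + 1
    out = np.zeros((N, F, F))
    for n in range(N):
        for i in range(F):
            for j in range(F):
                out[n, i, j] = np.sum(fmap[:, i:i+R, j:j+R] * weights[n]) + bias[n]
    return out

def run_layers(image, layers):
    # `layers`: list of dicts with pruned 'weights', 'bias', 'pool_k', 'pool_s'
    fmap = image
    for layer in layers:
        fmap = relu(conv_layer(fmap, layer["weights"], layer["bias"]))
        fmap = max_pool2d(fmap, layer["pool_k"], layer["pool_s"])
        # Between layers the accelerator re-encodes fmap online (CSR) before sending
        # it to the next layer; the final result is written to the off-chip DDR4.
    return fmap
```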
The load-balanced sparse convolution mapping scheme includes the convolution mapping mode, the PE array grouping scheme, the distribution and reuse mode of input feature maps and weight data, and the parallel scheduling mechanism of the PE array.
Convolution mapping mode: the input feature map is unfolded into a matrix along the row (column) dimension, and the weight data are unfolded into a vector along the output channel dimension, so that the convolution operation is converted into a matrix-vector multiplication. A suitably designed sparse matrix-vector multiplication unit can then skip the zero values in both the input feature map and the weight data, guaranteeing the efficiency of the overall operation.
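This mapping can be illustrated with the following Python sketch, assuming stride 1 and no padding: the matrix rows are the flattened receptive fields (the shared excitation data), and each output channel is obtained from one matrix-vector product with the flattened kernel. On the accelerator itself, zero entries in both operands are additionally skipped.

```python
import numpy as np

def unfold_feature_map(fmap, R):
    """Unfold a (C, H, W) input feature map into a matrix whose rows are the
    flattened R*R*C receptive fields of successive output positions (stride 1,
    no padding). Each row times a flattened kernel vector gives one output
    pixel, so the whole convolution becomes a matrix-vector product."""
    C, H, W = fmap.shape
    F = H - R + 1                      # output feature map height/width
    rows = []
    for i in range(F):
        for j in range(F):
            rows.append(fmap[:, i:i+R, j:j+R].ravel())
    return np.stack(rows)              # shape (F*F, R*R*C)

def conv_as_matvec(fmap, kernels):
    """kernels: (N, C, R, R). Returns an (N, F, F) output computed channel by
    channel as matrix-vector products, matching the mapping described above."""
    N, C, R, _ = kernels.shape
    F = fmap.shape[1] - R + 1
    A = unfold_feature_map(fmap, R)            # excitation matrix shared by all kernels
    out = np.zeros((N, F, F))
    for n in range(N):
        v = kernels[n].ravel()                 # weight vector for output channel n
        out[n] = (A @ v).reshape(F, F)         # one matrix-vector product per channel
    return out
```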
PE array grouping scheme: the grouping is completed by static configuration from the master controller according to the size parameters of each convolutional layer. When the number of PEs is greater than the total number of three-dimensional convolution kernels, one group can compute all channels of the output feature map, and on this basis the remaining PEs are organized into groups of the same size, responsible for computing different rows of the output feature map. When the number of PEs is less than the total number of three-dimensional convolution kernels, one group computes the largest divisor of the output feature map channel number. The principle of grouping in this way is to guarantee that the operation speeds of the PEs match and that the idle rate of the PE array is low.
Distribution and reuse mode of input feature maps and weight data: the entire PE array is fed by one shared on-chip memory, which simultaneously distributes identical excitation data as the matrix required by the operation, while the data distribution module, according to the control information of the block operation, distributes the weight data required by each PE as the vector of the operation. The reuse of the input feature map lies mainly in its simultaneous use by different PEs; the reuse of the weight data lies mainly in its sharing between different groups and in the same PE reusing its weight data, without redistribution, after the matrix has been updated.
Parallel scheduling mechanism of the PE array: during operation, the PE array determines, according to the size information of the output feature map of the convolutional layer, whether the different groups complete different rows (columns) of the same output feature map or complete the operations of different output feature maps. This ensures that the PE array can be scheduled in parallel at two levels: first, intra-layer parallelism within a single feature map; second, concurrent parallelism across different feature maps.
The load-balanced sparse convolutional neural network acceleration scheme of this embodiment includes a software part and a hardware part. Fig. 2 is a schematic diagram of the pruning strategy of the software part. The pruning strategy is described as follows: for the initially dense neural network connections, the connections are grouped according to the number of connections and the number of neurons of the network, and each group is pruned with the same pattern and at the same positions; that is, the neurons of each convolution kernel group share the same connection pattern, and only the weight values of the connections differ. Taking an input feature map of size W*W*C as an example (W is the width and height of the feature map and C is the number of input channels), with convolution kernels of size R*R*C*N (R is the width and height of the kernel, C is the number of kernel channels, and N is the number of kernels, i.e., the number of output channels), during pruning the N convolution kernels of size R*R*C can first be placed into one convolution kernel group, so that the positions of the zero elements in each kernel are identical; if the accuracy after pruning does not reach the model requirement, the convolution kernel group size can be adjusted and pruning carried out with groups of R*R*C*N1 (N1 being a divisor of N).
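A minimal sketch of such grouped pruning is shown below. The magnitude-based selection criterion is an illustrative assumption; the invention only requires that all kernels in a group share one zero-position pattern. If accuracy is insufficient, `group_size` is reduced from N to a divisor N1 of N.

```python
import numpy as np

def prune_kernel_group(group, sparsity):
    """Prune one convolution kernel group of shape (N1, C, R, R) so that every
    kernel in the group shares the same zero positions. The kept positions are
    chosen here by the group's mean absolute weight (an assumed criterion)."""
    importance = np.abs(group).mean(axis=0)          # (C, R, R) shared importance map
    k = max(1, int(round((1.0 - sparsity) * importance.size)))
    flat = importance.ravel()
    keep_idx = np.argsort(flat)[-k:]                 # positions to keep
    mask = np.zeros_like(flat)
    mask[keep_idx] = 1.0
    return group * mask.reshape(importance.shape)    # same zero pattern for all kernels

def prune_layer(weights, sparsity, group_size):
    """weights: (N, C, R, R); group_size N1 is assumed to divide N. Groups of
    N1 kernels are pruned with identical zero positions, as required for
    load balancing on the PE array."""
    pruned = weights.copy()
    for g in range(0, weights.shape[0], group_size):
        pruned[g:g+group_size] = prune_kernel_group(weights[g:g+group_size], sparsity)
    return pruned
```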
Fig. 3 is a schematic structural diagram of the sparse convolutional neural network accelerator of the hardware part. The overall structure mainly comprises: a master controller, which receives instructions from the host CPU and generates the control signal stream and data stream that control the convolution operation; a data distribution module, which distributes weight data to the PEs according to the block partition scheme of the convolution operation; a PE (Processing Element) array for the convolution operation, which is grouped according to the configuration information of the master controller, completes the multiply-accumulate operations of sparse convolution, and outputs convolution results or partial-sum results; an output result cache module, which accumulates and caches the partial-sum results of the PEs, organizes them into a unified format, and sends them to subsequent units for processing; a linear activation function unit, which applies the bias and the activation function to the convolution results; a pooling unit, which completes the max-pooling operation on the results; an online coding unit, which performs online CSR (compressed sparse row) coding of the intermediate results to guarantee that the output meets the data format requirements of the subsequent convolutional layer; and an off-chip dynamic memory DDR4, for storing the raw image data, the inter-layer intermediate results, and the final output results of the convolutional layers.
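The CSR format produced by the online coding unit can be sketched as follows in Python; the hardware performs this encoding incrementally as results stream out, and the fixed-point widths and exact index layout are not specified here and are assumptions.

```python
import numpy as np

def csr_encode(mat):
    """Encode a 2-D excitation matrix in CSR form (values, column indices, row
    pointers), as the online coding unit does for results sent to the next layer."""
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        nz = np.nonzero(row)[0]
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return (np.array(values),
            np.array(col_idx, dtype=np.int32),
            np.array(row_ptr, dtype=np.int32))

# Example: a post-ReLU excitation matrix with many zeros
mat = np.array([[0.0, 3.0, 0.0, 1.5],
                [2.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 4.0, 0.0]])
vals, cols, ptrs = csr_encode(mat)
# vals -> [3.0, 1.5, 2.0, 4.0]; cols -> [1, 3, 0, 2]; ptrs -> [0, 2, 3, 4]
```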
The data distribution module includes a fetch address calculation unit, a configurable on-chip weight memory unit, and a FIFO group for data format caching and conversion. According to the configuration information sent by the master controller, the data distribution module completes the access pattern to the off-chip dynamic memory DDR4 through fetch address calculation; the fetched data are cached via the AXI4 interface into the on-chip weight memory unit, undergo a further step of format conversion, and are then cached and distributed into the corresponding FIFOs, waiting to be sent for operation.
The PE array for the convolution operation includes multiple matrix-vector multiplication computing units, which, according to the requirements of the static configuration information, complete intra-layer or inter-layer parallel convolution operations on the feature maps and output the partial-sum results of the convolution. The multiple PE units share a common on-chip memory; thanks to the co-design of the pruning strategy and the hardware architecture, the multiple PEs can, with very limited storage resources, achieve zero-skipping acceleration and matched operation speeds between different PEs during sparse convolution.
The matrix-vector multiplication computing unit includes a pipeline controller module, a weight non-zero detection module, a pointer control module, an excitation decompression module, an MLA (multiply-accumulate) operation unit module, and the common on-chip memory. The weight non-zero detection module performs non-zero detection on the weight data sent by the data distribution module and transmits to the PE unit only the non-zero values and their corresponding position information; the pointer control module and the excitation decompression module, according to the positions of the non-zero weight values, fetch from the common on-chip memory the excitation values needed for the corresponding operation and send them to each PE unit for operation; the MLA operation unit is mainly responsible for the multiplications and additions in the matrix-vector multiplication.
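The behaviour of one matrix-vector multiplication unit can be sketched functionally as follows; this is a software model that assumes dense storage of the excitation matrix for clarity, whereas on the accelerator the excitations are held in CSR form and fetched through the pointer control and decompression modules.

```python
import numpy as np

def sparse_matvec(excitation_matrix, weight_vector):
    """Multiply an excitation matrix by a weight vector while skipping zero
    weights, mirroring the PE unit: non-zero detection yields (value, position)
    pairs, the pointer/decompression logic fetches the matching excitation
    column, and the MLA unit accumulates the products."""
    acc = np.zeros(excitation_matrix.shape[0])
    nz_pos = np.nonzero(weight_vector)[0]      # only non-zero weights reach the PE
    for p in nz_pos:
        w = weight_vector[p]
        col = excitation_matrix[:, p]          # excitations fetched for this position
        acc += w * col                         # multiply-accumulate in the MLA unit
    return acc
```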
Fig. 4 is a schematic diagram of the convolution mapping mode. Taking an input feature map of size W*W*C as an example (W is the width and height of the feature map and C is the number of input channels), with convolution kernels of size R*R*C*N (R is the width and height of the kernel, C is the number of kernel channels, and N is the number of kernels, i.e., the number of output channels) and F being the output feature map size: the number Num_PE of PE units in each PE group is first determined by N. If the total number of PEs is greater than N, Num_PE can be set equal to N, so that each group obtains the results of all channels of the output feature map in a single batch of operations; otherwise, Num_PE is set to a divisor M of N, and an integer number of batches of operations produce partial channels of the output feature map, guaranteeing that no PE is left idle. The number of groups Group_PE is determined by the total number of PEs and Num_PE; if one group can complete the operations of all output channels, different groups are responsible for different rows of the output feature map, as shown by the division of labor of PE group 2 in the figure.
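The determination of Num_PE and Group_PE described above can be expressed by the following short sketch; choosing the largest divisor of N that does not exceed the available PEs is the assumption made here, and the master controller performs this selection as part of its static configuration.

```python
def pe_grouping(total_pe, N):
    """Determine the PE group size (Num_PE) and number of groups (Group_PE)
    from the total number of PEs and the number of output channels N,
    following the grouping principle described above (illustrative sketch)."""
    if total_pe >= N:
        num_pe = N                     # one batch covers all output channels
    else:
        # largest divisor of N not exceeding the available PEs, so an integer
        # number of batches covers the channels with no idle PE
        num_pe = max(d for d in range(1, total_pe + 1) if N % d == 0)
    group_pe = total_pe // num_pe      # remaining PEs form further identical groups
    return num_pe, group_pe

# Example: 16 PEs, 8 output channels -> (8, 2); 6 PEs, 8 output channels -> (4, 1)
print(pe_grouping(16, 8), pe_grouping(6, 8))
```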
For a complete layer of convolution operation, a PE group is composed of Num_PE PE units (i.e., matrix-vector multiplication units), each matrix-vector multiplication unit being responsible for outputting several rows of one channel of the output feature map; the first operation outputs the first column of these rows, the specific number of rows being determined by the matrix size of the matrix-vector multiplication. The matrix in the matrix-vector multiplication corresponds to the shared excitation data stored in the local on-chip memory, and the corresponding vector is the weight data sent by the data distribution module. For the other PE groups, the operation content can be the subsequent rows of the same output feature map, i.e., as shown in Figure 3, or the convolution operations of other input feature maps, thereby satisfying two different parallel operation modes: intra-layer row-column parallelism and parallelism across different feature maps.
Fig. 5 is a schematic diagram of the convolution operation within a PE group. Different numerical values indicate the values at different positions of the input feature map and of the different convolution kernels. In the example, the matrix-vector multiplication scale is a 2*12 matrix and a 12*1 vector, so each PE operation outputs a 2*1 result. In the first operation, the vector of PE1 corresponds to the three channels of convolution kernel 1, a 12*1 vector, and the matrix corresponds to the three channels of positions (1,2,4,5) and (2,3,5,6) in the activation image; after the multiply-accumulate operation, the output is the first two rows of the first column of the first channel of the output feature map. The matrix is then updated first, i.e., the excitation values at positions (4,5,7,8) and (5,6,8,9) are taken, and the output is the first two rows of the second column of the first channel of the output feature map. After all column data of the corresponding rows have been output, the weight data corresponding to the vector are updated, and the output results of the third channel are subsequently produced. Correspondingly, after the weight data update, PE2 changes from calculating the second channel of the output feature map to calculating the fourth output channel.
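The operation pattern of this example can be reproduced with the following small sketch; the concrete numerical values of Fig. 5 are not reproduced, and the activation values and all-ones kernel below are placeholders used only to show the 2*12 by 12*1 structure and the matrix update.

```python
import numpy as np

# Placeholder 3-channel 3x3 activation region and 3-channel 2x2 kernel (12 weights)
act = np.arange(1, 28, dtype=float).reshape(3, 3, 3)   # (C=3, H=3, W=3)
kernel1 = np.ones((3, 2, 2))                            # stands in for convolution kernel 1

def receptive_field(fmap, i, j, R=2):
    return fmap[:, i:i+R, j:j+R].ravel()                # flattened R*R*C = 12 values

# 2*12 matrix: receptive fields at positions (1,2,4,5) and (2,3,5,6) of the activation image
A = np.stack([receptive_field(act, 0, 0), receptive_field(act, 0, 1)])
v = kernel1.ravel()                                     # 12*1 weight vector
print(A @ v)   # 2*1 result: first two rows of column 1 of output channel 1

# Matrix update: take positions (4,5,7,8) and (5,6,8,9), keep the same weight vector
A2 = np.stack([receptive_field(act, 1, 0), receptive_field(act, 1, 1)])
print(A2 @ v)  # first two rows of column 2 of output channel 1
```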
Fig. 6 is a schematic diagram of the implementation of load balance and storage sharing in the PE array. The shared on-chip memory of the PE array stores the non-zero values of the input excitations in CSR (compressed sparse row) format together with their indices and pointers, and the corresponding excitations are fetched for the multiply-accumulate operation according to the positions of the non-zero values of the weight vectors sent by the data distribution module. Since, under the software pruning strategy, all weight vectors within a PE group have identical positions of non-zero elements, the corresponding excitation values required by each PE are also identical; only a very small memory is needed to store one copy of the excitation values, which are decoded and sent to the PEs simultaneously to meet the matrix requirements of the PE array. Moreover, for all PEs, the non-zero positions of the matrix and the vector in the matrix-vector multiplication are identical, so the computation speeds of the PEs in the array match, achieving the design goal of a low-storage, load-balanced computing array. At the same time, different PE groups can also share the distributed weight data, realizing high reusability of excitations and weights.
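The effect of the shared non-zero pattern can be sketched as follows: since every PE in the group needs the same excitation values, they are decoded once from the CSR storage and broadcast, and each PE performs the same number of multiply-accumulates per matrix row. The software model below makes this pattern explicit; the CSR row representation and the dictionary lookup are illustrative assumptions.

```python
import numpy as np

def group_matvec_balanced(excitation_csr_rows, weight_vectors):
    """All weight vectors of one PE group share the same non-zero positions
    (guaranteed by grouped pruning), so one set of excitation values is fetched
    per matrix row and broadcast to every PE, and each PE performs exactly the
    same number of multiply-accumulates. Illustrative software model."""
    nz_pos = np.nonzero(weight_vectors[0])[0]        # identical for the whole group
    results = []
    for row_vals, row_cols in excitation_csr_rows:   # each matrix row in CSR form
        lookup = dict(zip(row_cols.tolist(), row_vals.tolist()))
        exc = np.array([lookup.get(p, 0.0) for p in nz_pos])   # fetched once, shared
        results.append([float(exc @ w[nz_pos]) for w in weight_vectors])
    return np.array(results)                         # one output column per PE
```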
In summary, the acceleration method for sparse convolutional neural networks proposed by the embodiment of the present invention can effectively save storage hardware resources, improve the reusability of input feature maps and weights, and realize load balance of the PE array; static configuration of the PE array can satisfy different parallel operation requirements and guarantee high utilization of the PE array, thereby improving the data throughput of the whole system and achieving a very high energy efficiency ratio, making it suitable for low-power embedded systems.