A vectorized implementation method of a fused ReLU activation function and max pooling

Technical Field

The present invention relates generally to the technical field of convolutional neural networks, and in particular to a vectorized implementation method of a fused ReLU activation function and max pooling.
Background Art
In the 1960s, while studying neurons responsible for local sensitivity and orientation selectivity in the cat visual cortex, Hubel and Wiesel found that a unique network structure could effectively reduce the complexity of feedback neural networks, which led to the proposal of the convolutional neural network (Convolutional Neural Network, CNN). Today, convolutional neural networks are one of the research hotspots in many fields, particularly in pattern classification: because the network avoids complex early-stage preprocessing of images and can take the original image directly as input, it has found wide application.
A typical convolutional neural network computation model usually comprises convolutional layers, pooling layers, fully connected layers, and a subsequent classifier such as a support vector machine (Support Vector Machine, SVM). The main types of computation involved in a convolutional neural network model are: matrix convolution; activation function processing, e.g. the linear activation function f(x) = x or nonlinear activation functions such as the sigmoid function f(x) = 1/(1 + e^(-x)); and matrix pooling operations, including max pooling and average pooling. Finally, the output of the convolutional neural network model is predicted through matrix operations and some transcendental-function processing, completing the object recognition process. Because a convolutional neural network model alternates and iterates between different convolutional layers and pooling layers, its computational load is very large. How to accelerate the computational efficiency of such models is therefore an important research topic in both academia and industry.
The activation functions used in current convolutional neural network models fall into two broad classes, linear and nonlinear, with roughly a dozen in common use. The rectified linear unit, ReLU (Rectified Linear Units, ReLU), is among the most widely used; its mathematical expression is f(x) = max(0, x): when the input signal x is less than 0 the output is 0, and when it is greater than 0 the output equals the input. The outstanding advantages of the ReLU function are one-sided suppression and, relative to other activation functions, a broad excitation boundary together with sparse activity. From the standpoint of neuroscience, researchers have likewise found neuronal activity to be sparse: in 2001, Attwell et al., based on observations of cerebral energy consumption, inferred that neuronal coding is sparse and distributed, and in 2003 Lennie et al. estimated that only 1-4% of the neurons in the brain are activated at the same time, further confirming the sparsity of neuronal activity. In signal terms, a neuron responds selectively to only a small part of the input signal; deliberately shielding a large number of signals can improve the precision of learning and extract sparse features better and faster. From this sparsity perspective, the ReLU function has become the model that best approximates human neurons.
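As a concrete illustration of the ReLU expression f(x) = max(0, x) above, here is a minimal scalar sketch in C; the function name and data are purely illustrative.

```c
#include <stdio.h>

/* Scalar reference for the ReLU activation f(x) = max(0, x). */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }

int main(void) {
    const float in[4] = {-2.5f, 0.0f, 1.5f, 3.0f};
    for (int i = 0; i < 4; ++i)
        printf("relu(%+.1f) = %.1f\n", in[i], relu(in[i]));
    return 0;
}
```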
In a convolutional neural network model, after the image data has been processed by the activation function, the next stage of computation, the pooling operation, must be carried out. Pooling mainly includes max pooling and average pooling: max pooling takes the maximum value within a pooling window as the output of that window, while average pooling takes the average of all elements within the window as its output. Whether average pooling or max pooling, the purpose is to reduce the dimensions of the image matrix as far as possible without significantly affecting model recognition accuracy, reducing the computational load and also avoiding model overfitting.
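To make the max pooling operation concrete, the following C sketch reduces each non-overlapping K × K window of an input matrix to its maximum; the matrix size, window size, and data are assumptions chosen for illustration.

```c
#include <stdio.h>

#define K 2   /* pooling window size (square, non-overlapping) */
#define M 4   /* input rows */
#define N 4   /* input columns */

/* Max pooling over non-overlapping K x K windows: each window of the
   M x N input produces one output element, shrinking each dimension by K. */
static void max_pool(const float in[M][N], float out[M / K][N / K]) {
    for (int r = 0; r < M / K; ++r)
        for (int c = 0; c < N / K; ++c) {
            float m = in[r * K][c * K];
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    if (in[r * K + i][c * K + j] > m)
                        m = in[r * K + i][c * K + j];
            out[r][c] = m;
        }
}

int main(void) {
    const float a[M][N] = {
        {1, 3, 2, 4},
        {5, 7, 6, 8},
        {9, 2, 1, 0},
        {3, 4, 5, 6},
    };
    float o[M / K][N / K];
    max_pool(a, o);
    for (int r = 0; r < M / K; ++r) {      /* expected: 7 8 / 9 6 */
        for (int c = 0; c < N / K; ++c) printf("%4.1f", o[r][c]);
        printf("\n");
    }
    return 0;
}
```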
Convolutional neural networks are one of the computational modules commonly used in current high-performance computing. They are typical memory-access-intensive and compute-intensive applications, placing very high demands on the processor's function units and memory bandwidth, with very high computational complexity. Current mainstream acceleration platforms include convolutional neural network computing platforms based on GPUs, platforms based on FPGAs, platforms based on dedicated neural network accelerators, and acceleration of convolutional neural network computation on general-purpose CPUs or vector processors. A vector processor is a processor architecture that exploits data-level parallelism and generally comprises a Vector Processing Unit (Vector Processing Unit, VPU) and a Scalar Processing Unit (Scalar Processing Unit, SPU). The Vector Processing Unit mainly consists of a computing array of several Vector Processing Elements (Vector Processing Element, VPE) and is chiefly responsible for computation; each VPE contains multiple homogeneous calculation function units such as MAC0 and MAC1, as well as an ALU, bit-processing (BP), and other function units. The Scalar Processing Unit is mainly responsible for computation tasks and flow control, and the VPU and SPU can transfer and exchange data through data channels. A large-capacity dedicated vector memory is provided, with vector data accessed through the Load and Store operations of the vector data access unit.
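As a rough, purely illustrative model of this organization (not the actual hardware or its programming interface), the C sketch below shows a scalar unit driving control flow while every VPE lane in the vector unit performs the same operation in lockstep; all type and field names are invented.

```c
#include <stdio.h>

#define NUM_VPE 16

typedef struct { float mac0_acc, mac1_acc; } Vpe; /* per-lane function units */
typedef struct { Vpe lanes[NUM_VPE]; } Vpu;       /* vector processing unit   */
typedef struct { int pc; } Spu;                   /* scalar unit: flow control */

int main(void) {
    Vpu vpu = {0};
    Spu spu = {0};
    /* the SPU drives the loop; every VPE lane executes the same operation */
    for (spu.pc = 0; spu.pc < 4; ++spu.pc)
        for (int i = 0; i < NUM_VPE; ++i)
            vpu.lanes[i].mac0_acc += 1.0f * (float)spu.pc;
    printf("lane 0 accumulator: %.1f\n", vpu.lanes[0].mac0_acc); /* 6.0 */
    return 0;
}
```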
Summary of the Invention
The technical problem to be solved by the present invention is: in view of the problems in the prior art, the present invention provides a vectorized implementation method of a fused ReLU activation function and max pooling that is simple in principle, convenient to implement, and able to fully exploit the parallel computing capability of a vector processor and the parallelism of the algorithm. By fusing the ReLU activation function with the max pooling operation, the method reduces the amount of data memory access, thereby shortening the computation time of the convolutional neural network and improving the computational efficiency of the convolutional neural network model.
In order to solve the above technical problems, the present invention adopts the following technical scheme:
A vectorized implementation method of a fused ReLU activation function and max pooling, the steps of which are:
S1: calculate the ReLU activation function values of matrix A;
S2: calculate the max pooling of the matrix processed by the ReLU activation function in step S1;
S3: repeat step S1 and step S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation function processing and max pooling operation of the whole matrix A.
As a further improvement of the present invention, step S1 specifically comprises the following steps:
S1.1: let the matrix requiring activation function processing after the convolution operation be A(M, N), let the ReLU activation function be f(x) = max(0, x), and let the number of vector processing elements (VPEs) be p; N is taken to be an integral multiple of p, kx, and ky, and the max pooling window is kx × ky;
S1.2: use the vector VLOAD instruction to load the first row of elements of matrix A;
S1.3: use the vector compare instruction VFCMPGD to compare the sizes of the vector registers, placing the logical values of the comparison result into the condition register;
S1.4: use the conditional vector move instruction VMOV to take out the values greater than 0 from step S1.3 and place them into a vector register;
S1.5: obtain the result after ReLU activation function processing;
S1.6: according to the max pooling window size k, repeat steps S1.2 to S1.5 k times to obtain the ReLU activation function results of k rows of elements of matrix A; the results are kept in vector registers and serve directly as the input values of the max pooling in step S2 (see the sketch after this list).
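To make the register-level flow of steps S1.2 to S1.4 concrete, the C sketch below emulates the effect of the VFCMPGD compare (a per-lane greater-than test producing a condition mask) and the predicated VMOV (a per-lane conditional move) across p lanes; the lane count, data, and variable names are assumptions for illustration and do not reflect the processor's actual instruction encoding.

```c
#include <stdio.h>

#define P 8  /* number of VPE lanes (illustrative) */

int main(void) {
    float vr10[P] = {-3, 1, -2, 4, 0, 7, -5, 2}; /* one row of A (VLOAD)     */
    float vr20[P] = {0};                         /* zero-initialized (VMOVI) */
    int   vr0[P];                                /* condition register        */

    /* VFCMPGD VR10, VR20, VR0: per-lane greater-than compare */
    for (int i = 0; i < P; ++i)
        vr0[i] = vr10[i] > vr20[i];

    /* [VR0] VMOV VR10, VR20: lanes with vr0[i] == 1 copy vr10[i] into
       vr20[i]; the other lanes keep 0, giving the ReLU of p elements at once */
    for (int i = 0; i < P; ++i)
        if (vr0[i]) vr20[i] = vr10[i];

    for (int i = 0; i < P; ++i) printf("%5.1f", vr20[i]);
    printf("\n"); /* expected: 0 1 0 4 0 7 0 2 */
    return 0;
}
```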
As a further improvement of the present invention, step S2 specifically comprises the following steps:
S2.1: take the k rows of elements calculated in step S1.6 directly as the input of this calculation;
S2.2: compare the 1st row of elements with the 2nd row of elements, placing the logical values of the comparison result into the condition register;
S2.3: use the conditional vector move instruction VMOV to retain the larger value in each lane;
S2.4: after k-1 comparisons, obtain the row-direction maxima of the k rows of elements;
S2.5: configure the shuffle mode and compare, obtaining the maxima of the corresponding k columns of elements from step S2.4;
S2.6: finally obtain, simultaneously, p/k max pooling results for pooling windows of size kx × ky (a C sketch of steps S2.2 to S2.6 follows this list).
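The sketch referred to above is a minimal C emulation of steps S2.2 to S2.6 for k = 2: a per-lane compare-and-move first collapses the k rows to their row-direction maxima, and an adjacent-lane reduction (standing in for the shuffle-and-compare of step S2.5) then collapses each group of k columns to one window maximum, yielding p/k results. The shuffle is emulated with index arithmetic; the lane count and data are hypothetical.

```c
#include <stdio.h>

#define P 8  /* lanes */
#define K 2  /* pooling window size */

int main(void) {
    /* two rows of ReLU output held in "registers" (illustrative values) */
    float vr20[P] = {1, 3, 0, 2, 5, 0, 4, 6};
    float vr21[P] = {2, 0, 7, 1, 0, 8, 3, 5};

    /* steps S2.2-S2.4: per-lane compare + conditional move leaves the
       row-direction maximum of the k rows in vr21 */
    for (int i = 0; i < P; ++i)
        if (vr20[i] > vr21[i]) vr21[i] = vr20[i];

    /* steps S2.5-S2.6: reduce each group of K adjacent lanes to one window
       maximum, producing P/K pooling results at once */
    float pooled[P / K];
    for (int w = 0; w < P / K; ++w) {
        float m = vr21[w * K];
        for (int j = 1; j < K; ++j)
            if (vr21[w * K + j] > m) m = vr21[w * K + j];
        pooled[w] = m;
    }

    for (int w = 0; w < P / K; ++w) printf("%5.1f", pooled[w]);
    printf("\n"); /* expected: 3 7 8 6 */
    return 0;
}
```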
As a further improvement of the present invention, the calculation formula for a single max pooling result c0,0 in step S2.5 is:

c0,0 = max{ ai,j | 0 ≤ i < kx, 0 ≤ j < ky }

where c0,0 is the first element of the max pooling result matrix, kx and ky are the dimensions of the pooling window (in convolutional neural networks the pooling window is square, i.e. kx = ky = k), and ai,j are the elements of the matrix A on which max pooling is to be performed. For example, a 2 × 2 window containing the elements {1, 5, 3, 2} yields c0,0 = 5.
As a further improvement of the present invention, in the above steps the pooling window size is defined as sizeX × sizeY, and the horizontal or vertical displacement between two adjacent pooling windows is the stride. In the max pooling operation the pooling windows do not overlap, i.e. sizeX = sizeY = stride; for example, with a 2 × 2 window the windows start at columns 0, 2, 4, and so on.
Compared with the prior art, the advantage of the present invention is that the vectorized implementation method of a fused ReLU activation function and max pooling fuses the ReLU activation function operation and the max pooling calculation into a single computation flow, avoiding the time-consuming STORE and LOAD of intermediate calculation data. At the same time it makes full use of the fact that the multiple parallel processing elements of the vector unit in a vector processor can carry out the same operation simultaneously, performing a large number of operations of the same type at once, thereby greatly improving the computational efficiency of the convolutional neural network model. The steps are simple and easy to implement.
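The saving claimed above can be illustrated with a scalar C sketch: because ReLU output is never negative, seeding each window maximum with 0 performs the ReLU and the max pooling in a single pass, so no intermediate activation matrix is ever stored or reloaded. The matrix size, window size, and data are assumptions for illustration.

```c
#include <stdio.h>

#define M 4
#define N 4
#define K 2

/* Fused ReLU + max pooling in one pass: each input element is clamped at
   zero and folded into its window maximum immediately. */
static void fused_relu_maxpool(const float in[M][N], float out[M / K][N / K]) {
    for (int r = 0; r < M / K; ++r)
        for (int c = 0; c < N / K; ++c) {
            float m = 0.0f; /* ReLU output is >= 0, so 0 is a safe seed */
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j) {
                    float v = in[r * K + i][c * K + j];
                    /* max(relu(v), m) == max(v, m) whenever m >= 0 */
                    if (v > m) m = v;
                }
            out[r][c] = m;
        }
}

int main(void) {
    const float a[M][N] = {
        {-1,  3, -2,  4},
        { 5, -7,  6, -8},
        {-9,  2,  1,  0},
        { 3, -4, -5,  6},
    };
    float o[M / K][N / K];
    fused_relu_maxpool(a, o);
    for (int r = 0; r < M / K; ++r) {  /* expected: 5 6 / 3 6 */
        for (int c = 0; c < N / K; ++c) printf("%5.1f", o[r][c]);
        printf("\n");
    }
    return 0;
}
```

On a vector processor the same identity is what allows the ReLU results to stay in registers and serve directly as the pooling input.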
Brief Description of the Drawings
Fig. 1 is a flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the general structure of a vector processor.
Fig. 3 is a schematic diagram of 2 × 2 max pooling in a concrete application example of the present invention.
Fig. 4 is a schematic image of the ReLU activation function used in a concrete application example of the present invention.
Fig. 5 is a schematic diagram of the vectorized implementation flow of the ReLU activation function in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the vectorized implementation flow of 2 × 2 max pooling in a concrete application example of the present invention.
Fig. 7 is a schematic diagram of the non-overlapping pooling windows in the max pooling operation in a concrete application example of the present invention.
Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1 and Fig. 4, the steps of the vectorized implementation method of a fused ReLU activation function and max pooling of the present invention are:
S1: calculate the ReLU activation function values of matrix A;
S1.1: let the matrix requiring activation function processing after the convolution operation be A(M, N), let the ReLU activation function be f(x) = max(0, x), and let the number of vector processing elements (VPEs) be p; N is typically taken to be an integral multiple of p, kx, and ky, and the max pooling window is kx × ky;
S1.2: use the vector VLOAD instruction to load the first row of elements of matrix A, for example into vector register VR10, and use the VMOVI instruction to initialize a vector register VR20 to 0, i.e. VMOVI 0, VR20;
S1.3: use the vector compare instruction VFCMPGD to compare the sizes of vector registers VR10 and VR20, placing the logical values of the comparison result into a condition register, e.g. VR0: VFCMPGD VR10, VR20, VR0; if VR10[i] > VR20[i], 1 ≤ i ≤ p, then VR0[i] = 1, otherwise VR0[i] = 0;
S1.4: use the conditional vector move instruction VMOV to take out the values greater than 0 from step S1.3 and place them into a vector register. The computation instruction is: [VR0] VMOV VR10, VR20. Through this conditional move instruction the ReLU activation function values of p elements are calculated simultaneously: the values in VR10 greater than 0 are placed into VR20, while the remaining lanes keep the initial value 0;
S1.5: obtain the result VR20 after ReLU activation function processing;
S1.6: according to the max pooling window size k, repeat steps S1.2 to S1.5 k times to obtain the ReLU activation function results of k rows of elements of matrix A; the results are kept in vector registers and need not be stored, serving directly as the input values of the max pooling in step S2.
S2: calculate the max pooling of the matrix processed by the ReLU activation function in step S1;
S2.1: take the k rows of elements calculated in step S1.6; because the results of step S1.6 are held directly in registers, they serve directly as the input of this calculation, which avoids both the data store time in step S1.6 and the data LOAD time in step S2.2, so the calculation time is reduced accordingly;
S2.2: compare the 1st row of elements with the 2nd row of elements, placing the logical values of the comparison result into a condition register, e.g. VR1: VFCMPGD VR20, VR21, VR1; if VR20[i] > VR21[i], 1 ≤ i ≤ p, then VR1[i] = 1, otherwise VR1[i] = 0;
S2.3: use the conditional vector move instruction VMOV: for the VPEs whose condition register satisfies VR1[i] = 1, the value VR20[i] is assigned to the corresponding VR21[i], while the values in VR21[i] that are larger than VR20[i] remain unchanged;
S2.4: after k-1 comparisons, obtain the row-direction maxima of the k rows of elements;
S2.5: configure the shuffle mode and compare, obtaining the maxima of the corresponding k columns of elements from step S2.4;
S2.6: finally obtain, simultaneously, p/k max pooling results for pooling windows of size kx × ky;
S3: repeat step S1 and step S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation function processing and max pooling operation of the whole matrix A.
The present invention is mainly applicable to vector processors; Fig. 2 is a schematic diagram of the general structure of a vector processor. In a concrete application example, the calculation formula for a single max pooling result c0,0 in step S2.5 is:

c0,0 = max{ ai,j | 0 ≤ i < kx, 0 ≤ j < ky }

where c0,0 is the first element of the max pooling result matrix, kx and ky are the dimensions of the pooling window (in convolutional neural networks the pooling window is generally square, i.e. kx = ky = k), and ai,j are the elements of the matrix A on which max pooling is to be performed. The max pooling flow is shown schematically in Fig. 3.
In a concrete application example, the pooling window size defined in the above steps is sizeX × sizeY, the horizontal or vertical displacement between two adjacent pooling windows is the stride, and the pooling windows in the max pooling operation do not overlap, i.e. sizeX = sizeY = stride, as shown in Fig. 7.
As shown in Fig. 5 and Fig. 6, in a concrete application example of the present invention the detailed steps are:
S100: calculate the ReLU activation function values of matrix A;
S1.1: let the matrix requiring activation function processing after the convolution operation be A(16, 16), let the ReLU activation function be f(x) = max(0, x), let the number of vector processing elements (VPEs) p be 16, and let the max pooling window be kx = ky = 2;
S1.2: use the vector VLOAD instruction to load the 16 elements of the 1st row of matrix A into vector register VR10 and the 16 elements of the 2nd row into VR11, and use the vector move instruction VMOVI to initialize two vector registers VR20 and VR21 to 0, i.e. VMOVI 0, VR20 and VMOVI 0, VR21;
S1.3: use the vector compare instruction VFCMPGD to compare the sizes of vector registers VR10 and VR20, and of VR11 and VR21, placing the logical values of the comparison results into condition registers VR0 and VR1 respectively: VFCMPGD VR10, VR20, VR0 and VFCMPGD VR11, VR21, VR1; if VR10[i] > VR20[i] (1 ≤ i ≤ 16), then VR0[i] = 1, otherwise VR0[i] = 0, and similarly VR1[i] = 1 if VR11[i] > VR21[i], otherwise VR1[i] = 0;
S1.4: use the conditional vector move instruction VMOV to take out the values greater than 0 from step S1.3 and place them into vector registers. The computation instructions are: [VR0] VMOV VR10, VR20 and [VR1] VMOV VR11, VR21; through these conditional move instructions the ReLU activation function values of the first two rows of matrix A, 32 elements in total, are calculated simultaneously;
S1.5: the ReLU activation function values of the first two rows of matrix A are placed into vector registers VR20 and VR21;
S200: calculate the max pooling of the matrix processed by the ReLU activation function in step S100;
S2.1: according to the max pooling window size kx = ky = 2, take out the first two rows of elements calculated in step S1.5, i.e. VR20 and VR21, as the input of the max pooling layer;
S2.2: compare the 1st row of elements VR20 with the 2nd row of elements VR21, placing the logical values of the comparison result into condition register VR2. The computation instruction is: VFCMPGD VR20, VR21, VR2; if VR20[i] > VR21[i], 1 ≤ i ≤ p, then VR2[i] = 1, otherwise VR2[i] = 0;
S2.3: use the conditional vector move instruction VMOV: for the VPEs whose condition register from step S2.2 satisfies VR2[i] = 1, the value VR20[i] is assigned to the corresponding VR21[i], while the values in VR21[i] that are larger than VR20[i] remain unchanged;
S2.4: after 1 comparison, obtain the row-direction maxima of the 2 rows of elements;
S2.5: configure the corresponding shuffle mode and compare, obtaining the maxima of corresponding pairs of adjacent columns from step S2.4;
S2.6: finally obtain, simultaneously, 16/2 = 8 max pooling results for pooling windows of size 2 × 2;
S300: repeat step S100 and step S200 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation function processing and max pooling operation of the whole matrix A.
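Putting the concrete example together, the following C sketch simulates the S100 to S200 flow for one pair of 16-element rows (p = 16, k = 2), producing 8 pooled results as in step S2.6; the register-named arrays mirror the description above, but the data are invented.

```c
#include <stdio.h>

#define P 16  /* VPE count: one matrix row per vector load */
#define K 2   /* pooling window 2 x 2 */

int main(void) {
    /* two rows of a 16-column matrix A after convolution (illustrative data) */
    float vr10[P], vr11[P];
    for (int i = 0; i < P; ++i) { vr10[i] = (float)(i - 8); vr11[i] = (float)(7 - i); }

    /* S1.3-S1.4: ReLU of both rows via compare + conditional move */
    float vr20[P] = {0}, vr21[P] = {0};
    for (int i = 0; i < P; ++i) {
        if (vr10[i] > 0.0f) vr20[i] = vr10[i];
        if (vr11[i] > 0.0f) vr21[i] = vr11[i];
    }

    /* S2.2-S2.3: row-direction maxima of the two rows, left in vr21 */
    for (int i = 0; i < P; ++i)
        if (vr20[i] > vr21[i]) vr21[i] = vr20[i];

    /* S2.5-S2.6: adjacent-lane reduction gives 16/2 = 8 window results */
    float pooled[P / K];
    for (int w = 0; w < P / K; ++w)
        pooled[w] = vr21[w * K] > vr21[w * K + 1] ? vr21[w * K] : vr21[w * K + 1];

    for (int w = 0; w < P / K; ++w) printf("%5.1f", pooled[w]);
    printf("\n");
    return 0;
}
```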
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical schemes that fall under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.