A YOLO network forward-inference accelerator design method based on FPGA

Technical field

The present invention relates to the technical fields of deep learning and hardware architecture design, and more particularly to a method for designing forward-inference acceleration of a target detection network on an FPGA.
Background technique
In recent years, machine learning algorithms based on convolutional neural networks (Convolutional Neural Network, CNN) have been widely applied in computer vision tasks. However, large-scale CNN networks are computation-intensive, storage-intensive and resource-hungry, which poses a huge challenge to the above tasks. Traditional general-purpose processors struggle to meet practical requirements under such high computational pressure and large data throughput, so hardware accelerators based on GPU, FPGA and ASIC have been proposed and widely put into use.
The FPGA (Field Programmable Gate Array) is a field-programmable gate array, a product developed further on the basis of programmable devices such as PAL, GAL and EPLD. It appeared as a semi-custom circuit in the ASIC field, which both solves the deficiencies of custom circuits and overcomes the disadvantage that earlier programmable devices had a limited number of gate circuits. The FPGA adopts the new concept of a logic cell array (LCA, Logic Cell Array), which internally comprises three parts: configurable logic blocks (CLB), input/output blocks (IOB) and interconnect; one PROM can program multiple FPGAs. Owing to their flexible reconfigurability and outstanding performance-per-watt, FPGAs have become an important class of deep-learning processor today.
The mainstream target detection network currently suitable for hardware implementation is YOLO (You Only Look Once). This network is fast and structurally simple; the algorithm casts the object detection problem as a regression problem and, with a single convolutional neural network structure, can directly predict the positions and class probabilities of target boxes from the input image, achieving end-to-end object detection. Such a structure is comparatively well suited to hardware realization on an FPGA. Prior invention CN107392309A discloses a general fixed-point neural network convolution accelerator hardware structure based on FPGA, comprising: a general AXI4 high-speed bus interface, a highly parallel convolution kernel and feature-map data buffer, a segmented convolution result buffer, a convolution calculation device, a buffer controller, a state controller and a direct memory access controller. That invention uses on-chip storage as a buffer and off-chip memory as the main storage of data, and completes the calculation of the entire convolutional network through memory management by an off-chip general-purpose processor; with this structure, a single FPGA alone cannot complete the forward inference of a target detection network. Prior invention CN107463990A proposes an FPGA parallel acceleration method for a convolutional neural network, comprising the following steps: (1) establishing a CNN model; (2) configuring the hardware architecture; (3) configuring the convolution arithmetic unit. That invention loads the intermediate calculation results of the whole network into on-chip storage, so the scale of the network that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store the intermediate calculation results of the network layers in on-chip static memory, while the weights required by the network are stored in off-chip dynamic memory. With such a design, the on-chip storage space limits the network scale that can be accelerated. At this stage, as the demands on task complexity and accuracy rise, the scale of convolutional neural networks keeps growing and the total number of parameters keeps increasing, yet the process technology of FPGA chips and the storage resources they can accommodate on chip have not grown as rapidly; if the previous design methods are still followed, the FPGA cannot fully accommodate a network of this scale.
If the on-chip static memory BRAM is used as the data buffer and the off-chip dynamic memory DRAM is used as the main storage of the network's key data, then, since the storage space of dynamic memory is huge, networks with very large numbers of parameters can be accommodated, and the parallel computation of the convolution modules can be realized by allocating the memory bandwidth reasonably. The performance of this design method depends on the memory bandwidth, but compared with stacking on-chip storage resources, increasing the communication bandwidth is easier to achieve. The network referred to by the present invention is the YOLO-tiny version; its input size is 416*416*3, the network has 9 convolutional layers in total, and the final output is a set of candidate boxes with class, position and confidence information, which are mapped back onto the original image by a region mapping (region operation) algorithm.
Summary of the invention
To solve the technical problem in the prior art that the growth rate of storage resources in FPGA chips cannot keep pace with the growth of neural network scale, so that a general target detection network is difficult to port onto an FPGA chip according to the traditional design approach, the present invention proposes a YOLO network forward-inference accelerator based on FPGA for the YOLO-tiny network and the KC705 development platform. The specific technical solution is as follows:
A YOLO network forward-inference accelerator design method based on FPGA, wherein the accelerator comprises an FPGA chip and a DRAM, the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; characterized in that the accelerator design method comprises the following steps:
(1) 8-bit fixed-point quantization is performed on the original network data, the decimal-point position with the smallest impact on detection accuracy is obtained, and a quantization scheme is formed; this step is carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolutional layers of YOLO;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: first establish the 256 decimal values corresponding to a certain decimal-point position of the 8-bit fixed-point number, including positive zero and negative zero, and quantize the original data using the nearest-value principle; the quantized values are still represented as 32-bit floating-point numbers so that the detection accuracy under such a quantization scheme can be calculated; the 8 possible decimal-point positions are then traversed to obtain the decimal-point position with the smallest impact on detection accuracy, eventually forming the weight quantization scheme of this layer;
b) the input feature map is first normalized to a 0-1 distribution, and the quantization of this layer's input feature map is then carried out using the method described in step a);
c) taking the quantized feature map of step b) as input, forward propagation of all pictures is performed for this convolutional layer only, with the parameters loaded as the quantized 32-bit values; the obtained output serves as the input of the next layer of the network;
d) the weights and input feature maps of every layer are quantized alternately according to the method of steps a)-c), finally obtaining the quantization scheme of all layers.
Specifically, the calculation process of each convolutional layer in step (2) is as follows:
a) the weight data required for the current round of calculation is read from the DRAM and placed into the BRAM;
b) the feature map data (FM) to be convolved of this layer is read, completing the preparation of all input data;
c) the convolution calculation is carried out; after one round of convolution calculation is completed, the data in the BRAM is uploaded back into the DRAM, the temporary result data is cleared, and the calculation of the next round begins.
Specifically, when the first convolutional layer in step (2) is computed, one of the three channels of the input feature map is first loaded from the DRAM for convolution calculation, and the obtained convolution results are accumulated into the convolution calculation performed after the input channel is switched; the input feature map loaded each time must be computed once with all convolution kernels before switching to the next input feature region.
Specifically, step (2) further includes a pooling operation and an activation operation that need to be carried out when the final result of a certain output channel is computed. The detailed process is as follows: when the convolution results of a certain row are computed one by one, the results of this row are divided into pairs, and the maximum of each pair is recorded and held using the logic resources on the FPGA chip; when the next row is computed, its output results are likewise divided into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2*2 region; it is then compared with the threshold of the ReLU activation function, and the result is saved into the BRAM. In this way, once the convolution producing the final result of a certain output channel is finished, the pooling operation and activation operation of that channel are completed at the same time.
In steps a) and b) of step (2), the BRAM is set to a 512-bit data width with a depth of 512 entries; one block of BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. In c), the BRAM is set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature map part and the weight part, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one block of BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one block of BRAM, each consuming 14.5 RAMB36E1; the 9th layer uses one block of BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature map data storage scheme in step (2) is as follows: for a) and b), the 1st convolutional layer uses one block of BRAM, convolutional layers 2-6 each use two blocks of BRAM, the 7th layer uses eight blocks of BRAM, the 8th layer uses ten blocks of BRAM, and the 9th layer uses nine blocks of BRAM; for c), each layer uses one block of BRAM; each block of BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13*13*5 candidate boxes; the position information of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa of the candidate box center, the relative ordinate, the relative width and the relative height. The relative abscissa and ordinate are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
The output candidate boxes of the convolutional network, together with the confidence information, are used to perform the NMS operation; the specific calculation steps are as follows:
a) the center coordinates of each candidate box are extracted in order, and a flag bit is set for each candidate box to indicate whether the whole candidate box is retained;
b) the first candidate box is selected as the comparison object, and the center-point distances between the comparison object and the candidate boxes behind it are computed; when a distance exceeds a threshold, the flag bit of the compared candidate box is set to the valid state, indicating that this candidate box needs to be retained; otherwise, the flag bit of that box is set to invalid and it no longer takes part in subsequent distance comparisons; when the comparison object has traversed to the last element of the queue, the comparison object is replaced, i.e. the next candidate box after the previous comparison object whose flag bit is valid becomes the new comparison object;
c) all candidate boxes whose flag bits are valid are extracted from the result memory, and a marking box is generated and printed into the original image as the final detection result.
The invention has the following advantages:
One, the present invention uses the memory on the FPGA chip as the data buffer for convolution calculation and the memory outside the FPGA chip as the main storage device, and the convolutional layers are coupled to each other through the off-chip memory; this design method is not only suitable for the YOLO network but is equally applicable to other neural networks.
Two, the resource allocation method of the present invention for the convolution calculation of each layer can exploit the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution calculation structure, the design of this scheme uses fewer on-chip resources and achieves faster forward-inference speed.
Three, on the FPGA chip there is no direct data interaction between the layers; the connection between the layers is a loosely coupled relation, which guarantees the stability of the system.
Four, to accelerate the calculation of the whole network, the present invention uses a simplified version of NMS: the calculation is performed not with the overlap area of two boxes but with the distance between their center points, which can greatly improve the speed of the NMS step.
Description of the drawings
Fig. 1 is a schematic diagram of the calculation structure and storage structure of each layer of the present invention;
Fig. 2 is a flow chart of the single-layer network calculation of the present invention.
Specific embodiment
Embodiment 1
A YOLO network forward-inference accelerator design method based on FPGA, wherein the accelerator comprises an FPGA chip and a DRAM, the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; characterized in that the accelerator design method comprises the following steps:
(1) 8-bit fixed-point quantization is performed on the original network data, the decimal-point position with the smallest impact on detection accuracy is obtained, and a quantization scheme is formed; this step is carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolutional layers of YOLO;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: when quantizing according to a certain decimal-point position of the 8-bit fixed-point number, the decimal-value lookup table for that position is first built, i.e. the 256 representable decimal numbers, including positive zero and negative zero, are found; the original data is then quantized using the nearest-value principle; although the values change after quantization, the data are still kept as 32-bit floating-point numbers so that they can subsequently be calculated on the GPU to obtain the detection accuracy of such a quantization scheme; the 8 decimal-point positions are then traversed to obtain the decimal-point position with the smallest impact on detection accuracy, eventually forming the weight quantization scheme of this layer;
b) the input feature maps of all test pictures are first normalized to a 0-1 distribution, and the quantization of this layer's input feature map is then carried out using the method described in step a);
c) taking the quantized feature map of step b) as input, forward propagation of all pictures is performed for this convolutional layer only, with the parameters loaded as the quantized 32-bit values; the obtained output serves as the input of the next layer of the network;
d) the weights and input feature maps of every layer are quantized alternately according to the method of steps a)-c), finally obtaining the quantization scheme of all layers (a software sketch of this procedure is given below).
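For readers implementing the quantization step in software, the following is a minimal Python/NumPy sketch of steps a)-d): weights are rounded to the 256 values of a signed 8-bit fixed-point grid (kept in 32-bit floating point, as described above), and the 8 candidate decimal-point positions are traversed to keep the one that degrades accuracy the least. The `evaluate` callback stands for the forward-propagation accuracy measurement of step c) and is an assumption of this sketch, not part of the original description.

```python
import numpy as np

def quantize_to_grid(x, frac_bits):
    """Round x to the nearest value representable as a signed 8-bit fixed-point
    number with `frac_bits` fractional bits; the result stays in float32 so the
    quantized network can still be run on a GPU to measure detection accuracy."""
    scale = float(1 << frac_bits)
    codes = np.clip(np.round(x * scale), -128, 127)   # 256 codes in total
    return (codes / scale).astype(np.float32)

def best_frac_bits(weights, evaluate):
    """Traverse the 8 candidate decimal-point positions and keep the one whose
    quantized weights give the highest score from `evaluate` (a hypothetical
    callback returning, e.g., detection accuracy on a validation set)."""
    best_pos, best_score = None, None
    for frac_bits in range(8):
        score = evaluate(quantize_to_grid(weights, frac_bits))
        if best_score is None or score > best_score:
            best_pos, best_score = frac_bits, score
    return best_pos, best_score
```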
Specifically, the calculation process of each convolutional layer in step (2) is as follows: first, the weight data required for the current round of calculation is read from the DRAM and placed into the weight buffer BRAM; then the feature map data (FM) to be convolved of this layer is read, and the convolution calculation starts after all input data preparation is completed; after one round of convolution calculation is completed, the data in the result buffer BRAM is uploaded back into the DRAM, the temporary result data is cleared, and the calculation of the next round begins. Since the calculation of the next layer depends on the calculated result of the previous layer, a ping-pong structure is used in the DRAM so that every layer can compute simultaneously instead of waiting for one another, thereby exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between the layers; the connection between the layers is a loosely coupled relation, which guarantees the stability of the system.
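The per-layer control flow and the DRAM ping-pong arrangement can be pictured with the following array-level Python stand-in (a sketch only; the dimensions, the toy single-channel convolution and the buffer sizes are illustrative assumptions, not the actual hardware):

```python
import numpy as np

def conv2d_same(fm, kernel):
    """Toy single-channel 'same' convolution, used only to make the control
    flow executable; the real design uses the parallel PEs of Fig. 1."""
    kh, kw = kernel.shape
    padded = np.pad(fm, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(fm, dtype=np.float32)
    for i in range(fm.shape[0]):
        for j in range(fm.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Ping-pong: two DRAM regions sit between adjacent layers, so the producer
# layer can fill one region with frame t+1 while the consumer layer is still
# reading frame t from the other region.
dram = [np.random.rand(16, 16).astype(np.float32),
        np.zeros((16, 16), dtype=np.float32)]
kernel = np.random.rand(3, 3).astype(np.float32)
for frame in range(4):
    src, dst = dram[frame % 2], dram[(frame + 1) % 2]
    dst[:] = conv2d_same(src, kernel)   # one round: read src, write dst, swap
```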
Specifically, when the first convolutional layer in step (2) is computed, one of the three channels of the input feature map is first loaded from the DRAM for convolution calculation. Since BRAM resources on the FPGA chip are limited and the picture of this layer is relatively large, only several consecutive rows of the picture are loaded first; according to the principle of convolution calculation, the convolution results of these rows are, at this point, the temporary results of the corresponding region (the rows at the same position) of a certain channel of the final output. When the convolution at the same position is reached after the input channel is switched, it has to be accumulated with the earlier temporary results; therefore, before this layer's module performs the convolution calculation, it first needs to fetch from the DDR the temporary convolution results previously stored at the same position of the corresponding output channel, so that, after each convolution module produces its result, the result can be added to the value in the result memory BRAM and stored back into the result memory BRAM. The input feature map loaded each time must be computed once with all convolution kernels before switching to the next input feature region.
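The channel-by-channel accumulation of the first layer can be summarised by the following Python sketch (illustrative only; `conv` is any single-channel 2-D convolution, such as the toy `conv2d_same` above, and the band/row bookkeeping is an assumption about how the controller would be written):

```python
def first_layer_channel_round(fm_band, kernels, partial_dram, rows, conv):
    """One input-channel round of the first layer: the loaded band of rows is
    convolved with every kernel, the partial sums previously written for the
    same output rows are fetched from DRAM, accumulated, and written back."""
    for oc, kernel in enumerate(kernels):            # this band meets all kernels
        prev = partial_dram[oc][rows]                # read back earlier partial sum
        partial_dram[oc][rows] = prev + conv(fm_band, kernel)
    # only after every kernel has consumed this band does the controller move on
    # to the next band of rows / the next input channel
```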
Specifically, step (2) further includes a pooling operation and an activation operation that need to be carried out when the final result of a certain output channel is computed. The detailed process is as follows: when the convolution results of a certain row are computed one by one, the results of this row are divided into pairs, and the maximum of each pair is recorded and held using the logic resources on the FPGA chip; when the next row is computed, its output results are likewise divided into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2*2 region; it is then compared with the threshold of the ReLU activation function, and the result is saved into the BRAM. In this way, once the convolution producing the final result of a certain output channel is finished, the pooling operation and activation operation of that channel are completed at the same time.
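A software picture of the fused row-wise pooling and activation described above might look as follows (a sketch assuming a 2*2, stride-2 pooling window and a zero ReLU threshold; the line buffer plays the role of the on-chip registers that hold the upper row's pairwise maxima):

```python
import numpy as np

def fused_pool_relu(rows, width):
    """Streaming 2*2 max-pool fused with ReLU: pairwise maxima of one row are
    held in a line buffer, combined with the next row's pairwise maxima,
    clipped at the ReLU threshold, and only then handed to the result BRAM."""
    line_buf = None
    for row in rows:                                 # conv results arrive row by row
        pair_max = np.maximum(row[0:width:2], row[1:width:2])
        if line_buf is None:
            line_buf = pair_max                      # remember the upper row's maxima
        else:
            pooled = np.maximum(line_buf, pair_max)  # maximum of the 2*2 region
            yield np.maximum(pooled, 0.0)            # ReLU, then store
            line_buf = None

fm = np.arange(16, dtype=np.float32).reshape(4, 4) - 8.0
print(list(fused_pool_relu(fm, 4)))   # two pooled-and-activated rows of length 2
```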
In step (2), the BRAM serving as the data buffer needs to receive the data read out of the DRAM. In order to exploit the maximum bandwidth of the DRAM, the write port of the BRAM is set to a 512-bit data width with a depth of 512 entries; one block of BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, which serves as the input width of the convolution operation. The BRAM serving as the result buffer must simultaneously read data from the DRAM and write data back into the DRAM, so it is set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature map part and the weight part, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one block of BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one block of BRAM, each consuming 14.5 RAMB36E1; the 9th layer uses one block of BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature map data storage scheme in step (2) is as follows: for the input data buffers, the 1st convolutional layer uses one block of BRAM, convolutional layers 2-6 each use two blocks of BRAM, the 7th layer uses eight blocks of BRAM, the 8th layer uses ten blocks of BRAM, and the 9th layer uses nine blocks of BRAM; for the output data buffers, each layer uses one block of BRAM. Each block of BRAM consumes 7.5 RAMB36E1, and the feature map data buffers require 337.5 RAMB36E1 in total (a short arithmetic check of these totals follows Table 1). Since BRAM resources are limited, the ping-pong operation is performed only at the output buffer, and no layer starts its convolution calculation before the data in its input buffer is ready. The number of parallel computation channels of each layer is allocated in proportion to the multiply-accumulate workload of that layer; the resulting parallel channel counts are shown in Table 1. For the parallel input channels, each channel requires an individual block of BRAM for storage, whereas the result buffer only needs a single block of BRAM of the same size.
Table 1: Per-layer calculation ratio for parallel computation and per-layer parallel channel count
| Layer | One | Two | Three | Four | Five | Six | Seven | Eight | Nine |
| Ratio | 1 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 10 | 20 | 1 |
| PE quantity | 1 | 2 | 2 | 2 | 2 | 2 | 8 | 16 | 1 |
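As a quick consistency check on the storage figures quoted above (a sketch of the arithmetic only, using the per-block costs stated in the preceding paragraphs):

```python
# Weight buffers: layers 1-3 share one 7.5-RAMB36E1 block, layers 4-8 each use
# one 14.5-RAMB36E1 block, and layer 9 uses one 7.5-RAMB36E1 block.
weight_ram = 7.5 + 5 * 14.5 + 7.5        # 87.5 RAMB36E1 for all weight buffers
feature_map_ram = 337.5                  # total quoted for the feature map buffers
print(weight_ram, weight_ram + feature_map_ram)   # 87.5 and 425.0, matching the text
```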
Specifically, the convolution part is followed by the region layer operation for position mapping. The output of the convolutional network contains the position information of 13*13*5 candidate boxes; the position information of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa of the candidate box center, the relative ordinate, the relative width and the relative height. These four values need some processing before they can be mapped to actual picture positions: the relative abscissa and ordinate are mapped into absolute coordinates through the sigmoid function, and since the output is an 8-bit fixed-point representation, the corresponding output results can be quantized into a lookup table to accelerate this mapping process; the relative width and height are mapped into absolute values through the exponential function, and the results can likewise be obtained in the form of a lookup table.
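Since the network output is in 8-bit fixed-point form, the two mappings can indeed be folded into 256-entry lookup tables; the following Python sketch illustrates the idea. The fraction-bit position, grid size and anchor terms are assumptions made for the illustration (a YOLOv2-style decode), not values taken from this description.

```python
import numpy as np

FRAC_BITS = 4                                 # assumed decimal-point position of the output
codes = np.arange(256, dtype=np.uint8).view(np.int8).astype(np.float32)
vals = codes / (1 << FRAC_BITS)               # the 256 representable fixed-point values

SIGMOID_LUT = 1.0 / (1.0 + np.exp(-vals))     # lookup table for the x, y mapping
EXP_LUT = np.exp(vals)                        # lookup table for the w, h mapping

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, grid=13):
    """Map one candidate box's raw 8-bit codes (indices into the tables) to
    image-relative coordinates; cx, cy are the grid-cell offsets and
    anchor_w, anchor_h the prior box sizes (assumed, YOLOv2-style)."""
    bx = (SIGMOID_LUT[tx] + cx) / grid
    by = (SIGMOID_LUT[ty] + cy) / grid
    bw = anchor_w * EXP_LUT[tw] / grid
    bh = anchor_h * EXP_LUT[th] / grid
    return bx, by, bw, bh
```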
The output candidate boxes of the convolutional network, together with the confidence information, are used to perform the NMS operation; the specific calculation steps are as follows: first, the center coordinates of each candidate box are extracted in order, and a flag bit is set for each candidate box to indicate whether the whole candidate box is retained. Since the center-point distance is used as the index for the judgement, it is known from prior information that, among the candidate boxes output by the network, boxes that are close in the output order are the objects to be compared, while boxes far apart in the order can be ignored in the comparison. The first candidate box is then selected as the comparison object, and the center-point distances between the comparison object and the candidate boxes behind it are computed; when a distance exceeds the threshold, the flag bit of the compared candidate box is set to the valid state, indicating that this candidate box needs to be retained; otherwise the flag bit of that box is set to invalid and it no longer takes part in subsequent distance comparisons. When the comparison object has traversed to the last element of the queue, the comparison object is replaced, i.e. the next candidate box after the previous comparison object whose flag bit is valid becomes the new comparison object. Finally, all candidate boxes whose flag bits are valid are extracted from the result memory, and a marking box is generated and printed into the original image as the final detection result.
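A minimal software sketch of this flag-bit, center-distance NMS is given below (Euclidean center distance and a caller-supplied threshold are assumed; the input order is the network output order relied on above):

```python
def center_distance_nms(centers, threshold):
    """Simplified NMS: candidate boxes are suppressed when their center point
    lies within `threshold` of the current comparison object, instead of by
    computing overlap areas.  `centers` is the ordered list of (x, y) centers;
    the returned list is the flag bit of every candidate box."""
    n = len(centers)
    keep = [True] * n                      # flag bit per candidate box
    ref = 0                                # index of the current comparison object
    while ref < n:
        rx, ry = centers[ref]
        for j in range(ref + 1, n):
            if not keep[j]:
                continue                   # already invalid: skip further comparisons
            dx, dy = centers[j][0] - rx, centers[j][1] - ry
            if (dx * dx + dy * dy) ** 0.5 <= threshold:
                keep[j] = False            # too close to the reference box: suppress
        # the next comparison object is the next box whose flag bit is still valid
        ref = next((j for j in range(ref + 1, n) if keep[j]), n)
    return keep
```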
The above embodiments and description only illustrate the principle of the present invention. Various changes and improvements may be made to the invention without departing from the spirit and scope of the invention, and such changes and improvements all fall within the scope of the claimed invention. The scope of protection of the present invention is defined by the appended claims and their equivalents.