CN109214504A - FPGA-based YOLO network forward inference accelerator design method - Google Patents

FPGA-based YOLO network forward inference accelerator design method

Info

Publication number
CN109214504A
CN109214504A
Authority
CN
China
Prior art keywords
bram
layer
network
value
design method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810970836.2A
Other languages
Chinese (zh)
Other versions
CN109214504B (en)
Inventor
张轶凡
陈昊
应山川
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Original Assignee
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority to CN201810970836.2A
Publication of CN109214504A
Application granted
Publication of CN109214504B
Legal status: Active
Anticipated expiration


Abstract

The invention proposes an FPGA-based YOLO network forward inference accelerator design method. The accelerator comprises an FPGA chip and a DRAM: the BRAM memory in the FPGA chip serves as the data buffer, and the DRAM serves as the main storage. The accelerator design method comprises the following steps: (1) apply 8-bit fixed-point quantization to the original network data, obtaining the decimal point position with the smallest impact on detection accuracy and forming a quantization scheme; the quantization proceeds layer by layer; (2) the FPGA chip computes the nine convolutional layers of YOLO in parallel; (3) position mapping. The method solves the technical problems that, in the prior art, the storage resources on FPGA chips have grown more slowly than neural network scale and that general object detection networks are difficult to port onto an FPGA chip following traditional design thinking, achieving higher speed with fewer on-chip resources.

Description

FPGA-based YOLO network forward inference accelerator design method
Technical field
The present invention relates to the technical fields of deep learning and hardware architecture design, and more particularly to a design method for accelerating the forward inference of an object detection network on an FPGA.
Background art
In recent years, machine learning algorithms based on convolutional neural networks (Convolutional Neural Network, CNN) have been widely applied to computer vision tasks. For large-scale CNNs, however, their computation-intensive, storage-intensive, and resource-hungry nature poses a huge challenge to those tasks. Traditional general-purpose processors can hardly reach practical performance under such computational pressure and data throughput, so hardware accelerators based on GPUs, FPGAs, and ASICs have been proposed and widely put into use.
An FPGA (Field Programmable Gate Array) is a further development of programmable devices such as PAL, GAL, and EPLD. It appeared as a semi-custom circuit in the ASIC field, resolving the inflexibility of fully custom circuits while overcoming the limited gate count of earlier programmable devices. An FPGA adopts the new concept of a Logic Cell Array (LCA), internally comprising three parts: configurable logic blocks (CLB), input/output blocks (IOB), and interconnect; one PROM can program multiple FPGAs. Its flexible reconfigurability and outstanding performance per watt have made the FPGA an important deep learning processor today.
The mainstream object detection network currently suited to hardware implementation is YOLO (You Only Look Once). This network is fast and structurally simple: it treats the object detection problem as a regression problem, and with a single convolutional neural network it directly predicts bounding-box positions and class probabilities from the input image, achieving end-to-end object detection; such a structure is comparatively well suited to hardware implementation on an FPGA. Prior invention CN107392309A discloses a general FPGA-based fixed-point neural network convolution accelerator hardware structure, comprising: a general AXI4 high-speed bus interface, a highly parallel convolution core, a feature-map data buffer, a segmented convolution-result buffer, a convolution calculator, a buffer controller, a state controller, and a direct memory access controller. That invention uses on-chip storage as a buffer and off-chip memory as the main data storage, and completes the computation of the entire convolutional network through memory management by an off-chip general-purpose processor; with this structural design, a single FPGA alone cannot complete the forward inference of an object detection network. Prior invention CN107463990A proposes an FPGA parallel acceleration method for convolutional neural networks, comprising the following steps: (1) build the CNN model; (2) configure the hardware architecture; (3) configure the convolution arithmetic units. That invention loads the temporary computation results of the whole network into on-chip storage, so the size of the network that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store the intermediate computation results of the network layers in on-chip static memory and the weights required by the network in off-chip dynamic memory. With such a design, the on-chip storage space limits the size of the network that can be accelerated. At this stage, as the complexity and accuracy demands of tasks rise, convolutional neural networks keep growing in scale and total parameter count, while the process technology of FPGA chips and the storage resources they can accommodate on chip have not grown nearly as fast; following the previous design methods, an FPGA cannot fully accommodate a network of this scale.
If the on-chip static memory BRAM is used as the data buffer while the off-chip dynamic memory DRAM serves as the main storage of the network's key data, then, since the storage space of dynamic memory is huge, networks with very large parameter counts can be accommodated, and the parallel computation of the convolution modules can be realized by allocating the memory bandwidth appropriately. The performance of this design method depends on the memory bandwidth, but compared with stacking on-chip storage resources, raising the communication bandwidth is easier to realize. The network referenced by the present invention is the YOLO-tiny version: its input size is 416*416*3, the network has 9 convolutional layers in total, the final output is candidate boxes carrying class, position, and confidence information, and the computed results are mapped back into the original image by a region-mapping (region operation) algorithm.
Summary of the invention
To solve the technical problem in the prior art that the storage resources on FPGA chips grow more slowly than neural network scale, so that general object detection networks are difficult to port onto an FPGA chip following traditional design thinking, the present invention proposes an FPGA-based YOLO network forward inference accelerator for the YOLO-tiny network and the KC705 development platform. The specific technical solution is as follows:
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM, the BRAM memory in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; the accelerator design method comprises the following steps:
(1) apply 8-bit fixed-point quantization to the original network data, obtaining the decimal point position with the smallest impact on detection accuracy and forming a quantization scheme; the quantization proceeds layer by layer;
(2) the FPGA chip computes the nine convolutional layers of YOLO in parallel;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: first establish the 256 decimal values corresponding to a given decimal point position of an 8-bit fixed-point number, including positive zero and negative zero, then quantize the original data by the nearest-value principle. For the sake of computation, the quantized values are still represented as 32-bit floating point, yielding the detection accuracy of this quantization scheme; then traverse the 8 decimal point positions to obtain the one with the smallest impact on detection accuracy, finally forming the weight quantization scheme for this layer;
b) First normalize the input feature maps to a 0-1 distribution, then quantize this layer's input feature maps with the method described in step a);
c) With the quantized feature maps from step b) as input, run the forward propagation of all pictures through this convolutional layer only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer of the network;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer, finally obtaining the quantization scheme for all layers.
Specifically, the computation flow of each convolutional layer in step (2) is as follows:
a) Read the weight data required by the current round of computation from the DRAM and place it into the BRAM;
b) Read the feature map (FM) data of this layer that is to be convolved, completing all input data preparation;
c) Perform the convolution; after one round of convolution is completed, upload the data in the BRAM back into the DRAM, clear the temporary result data, and then start the next round of computation.
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution, and the obtained convolution results are accumulated into the convolution performed after switching to the next input channel; each loaded input feature map region is switched to the next region only after it has been convolved once with all convolution kernels.
Specifically, step (2) further includes pooling and activation operations that must be carried out when computing the final result of a given output channel. The detailed process is as follows: while computing the convolution results of a given row one by one, divide this row's results into pairs and record the maximum of each pair, saving it with the logic resources on the FPGA chip; when the computation reaches the next row, likewise divide the output results into pairs, take the larger value of each pair, and compare it with the maximum elected in the previous row, taking the larger of these two maxima as the maximum of the corresponding 2*2 region; then compare it with the threshold of the ReLU activation function and save the result into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
In steps (2) a) and b), the BRAM is set to a 512-bit data width with a depth of 512 entries; one BRAM block consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. In c), the BRAM is set to true dual-port mode with a 16-bit port width. The data storage overhead of the entire convolutional network consists of the feature maps and the weights, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM block, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM block, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM block, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: for a) and b), convolutional layer 1 uses one BRAM block, layers 2-6 each use two BRAM blocks, layer 7 uses eight BRAM blocks, layer 8 uses ten, and layer 9 uses nine; for c), each layer uses one BRAM block; each BRAM block consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network includes the location information of 13*13*5 candidate boxes; the location information of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa of the box center, the relative ordinate, the relative width, and the relative height. The relative abscissa and ordinate are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the e exponential.
The output candidate boxes of the convolutional network together with their confidence information are used to perform the NMS operation; the specific computation steps are as follows:
a) Extract the center coordinates of each candidate box in order, and assign each candidate box a flag bit indicating whether the whole candidate box is retained.
b) Select the first candidate box as the comparison object and compute the center-point distance between it and each candidate box behind it. When the distance exceeds a threshold, the flag bit of the compared candidate box is set to the valid state, indicating that this box is to be retained; otherwise the flag bit of that box is set invalid and the box no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next candidate box after it whose flag bit is valid.
c) Extract all candidate boxes with valid flag bits from the result memory, and generate marker boxes printed onto the original image as the final detection result.
The invention has the following advantages:
One, the present invention uses the memory on the FPGA chip as the data buffer for convolution and the memory outside the FPGA chip as the main storage device, with the convolutional layers coupled through the off-chip memory; this design method applies not only to the YOLO network but equally to other neural networks.
Two, the resource allocation method for each layer's convolution can exploit the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution structure, this design uses fewer on-chip resources and delivers faster forward inference.
Three, there is no direct data interaction between layers on the FPGA chip; the connection between layers is loosely coupled, which guarantees the stability of the system.
Four, to accelerate the computation of the whole network, the present invention uses a simplified version of NMS that replaces the overlap-area computation with the center-point distance of two boxes, which can greatly improve the speed of the NMS step.
Description of the drawings
Fig. 1 is a schematic diagram of the computation structure and storage structure of each layer of the present invention.
Fig. 2 is the computation flow chart of a single network layer of the present invention.
Specific embodiment
Embodiment 1
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM, the BRAM memory in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; the accelerator design method comprises the following steps:
(1) apply 8-bit fixed-point quantization to the original network data, obtaining the decimal point position with the smallest impact on detection accuracy and forming a quantization scheme; the quantization proceeds layer by layer;
(2) the FPGA chip computes the nine convolutional layers of YOLO in parallel;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: when quantizing according to a given decimal point position of an 8-bit fixed-point number, first build the decimal-value lookup table for that position, i.e. find the 256 decimal numbers it can represent, including positive zero and negative zero, then quantize the original data by the nearest-value principle. Although the values change after quantization, the data are still 32-bit floating point, which is convenient for subsequent computation on a GPU and yields the detection accuracy of this quantization scheme; then traverse the 8 decimal point positions to obtain the one with the smallest impact on detection accuracy, finally forming the weight quantization scheme for this layer;
b) First normalize all the test-set input feature maps to a 0-1 distribution, then quantize this layer's input feature maps with the method described in step a);
c) With the quantized feature maps from step b) as input, run the forward propagation of all pictures through this convolutional layer only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer of the network;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer, finally obtaining the quantization scheme for all layers.
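As an illustration only, the following is a minimal Python sketch of this layer-wise quantization search, assuming a sign-magnitude 8-bit code (which is what yields the positive zero and negative zero mentioned above) and a caller-supplied callback evaluate_accuracy, a hypothetical stand-in for running detection on the test set:

    import numpy as np

    def quantize_nearest(data, frac_bits):
        # Nearest representable value of an 8-bit sign-magnitude fixed-point
        # code with frac_bits fractional bits (256 codes, 255 distinct values,
        # since +0 and -0 coincide); the result stays in float32 so the
        # quantized network can still be evaluated on a GPU.
        step = 2.0 ** -frac_bits
        q = np.round(data / step) * step
        return np.clip(q, -127 * step, 127 * step).astype(np.float32)

    def best_frac_bits(data, evaluate_accuracy):
        # Traverse the 8 candidate decimal point positions and keep the one
        # with the smallest impact on detection accuracy.
        return max(range(8),
                   key=lambda fb: evaluate_accuracy(quantize_nearest(data, fb)))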
Specifically, the computation flow of each convolutional layer in step (2) is as follows: first read the weight data required by the current round of computation from the DRAM and place it into the weight buffer BRAM; then read this layer's feature map (FM) data to be convolved, and start the convolution once all input data preparation is complete; after one round of convolution is finished, upload the data in the result buffer BRAM back into the DRAM, clear the temporary result data, and then start the next round of computation. Since the computation of each layer depends on the results of the previous layer, a ping-pong structure is used in the DRAM so that every layer can compute simultaneously without mutual waiting, thereby exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between layers; the connection between layers is loosely coupled, which guarantees the stability of the system. A behavioral model of this flow is sketched below.
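For illustration, a behavioral Python model of this per-round flow (the dram dictionary, its address keys, and the conv2d_valid helper are assumptions of the sketch, not the hardware interface):

    import numpy as np

    def conv2d_valid(fm, k):
        # Naive single-channel valid convolution, standing in for the PE array.
        kh, kw = k.shape
        out = np.zeros((fm.shape[0] - kh + 1, fm.shape[1] - kw + 1), np.float32)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(fm[i:i + kh, j:j + kw] * k)
        return out

    def run_layer(dram, layer, rounds):
        # One layer's round-based flow: weights into the weight BRAM, an FM
        # tile into the input BRAM, convolve, then upload the result buffer.
        for r in range(rounds):
            w_bram = dram[('weights', layer, r)]      # step a)
            fm_bram = dram[('fm', layer, r)]          # step b)
            result = conv2d_valid(fm_bram, w_bram)    # step c)
            # Ping-pong in DRAM: write into the half the next layer is not
            # currently reading, so all layers can compute simultaneously.
            dram[('out', layer, r % 2)] = result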
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution. Since BRAM resources on the FPGA chip are limited and this layer's picture is relatively large, only several consecutive rows of the picture are loaded at a time; by the principle of convolution, the convolution results of these rows are temporary results for the corresponding region (the same rows) of some final output channel. When, after switching the input channel, the convolution reaches the same position, it must be accumulated with the previous temporary results; therefore, before this layer's module executes the convolution, the previous temporary convolution results of the corresponding position of that output channel must first be fetched from the DDR, so that each time the convolution module produces a result, it can be added to the value in the result BRAM and stored back into the result BRAM. Each loaded input feature map region is switched to the next region only after it has been convolved once with all convolution kernels.
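Continuing the same behavioral model (and reusing conv2d_valid from the previous sketch), the row-tiled accumulation across input channels could look as follows; the kernels[o][c] layout, with o the output channel and c the input channel, is an assumption of the sketch:

    def first_layer(dram_partials, load_tile, kernels, n_regions, n_in=3):
        # dram_partials models the temporary results kept in DDR, keyed by
        # (output_channel, row_region); missing keys mean no partials yet.
        for c in range(n_in):                      # input channel switches last
            for region in range(n_regions):        # several consecutive rows
                tile = load_tile(c, region)        # limited BRAM: rows only
                for o in range(len(kernels)):      # all kernels on this tile
                    prev = dram_partials.get((o, region), 0.0)  # fetch partials
                    # accumulate this channel's contribution, store back to DDR
                    dram_partials[(o, region)] = prev + conv2d_valid(tile, kernels[o][c])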
Specifically, step (2) further includes pooling and activation operations that must be carried out when computing the final result of a given output channel. The detailed process is as follows: while computing the convolution results of a given row one by one, divide this row's results into pairs and record the maximum of each pair, saving it with the logic resources on the FPGA chip; when the computation reaches the next row, likewise divide the output results into pairs, take the larger value of each pair, and compare it with the maximum elected in the previous row, taking the larger of these two maxima as the maximum of the corresponding 2*2 region; then compare it with the threshold of the ReLU activation function and save the result into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
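A sketch of the fused 2*2 max-pooling and ReLU, again as a software model only: rows of final convolution results arrive one at a time, the even row's pairwise maxima are held (the held array stands in for the on-chip logic resources), and the odd row closes each 2*2 region:

    import numpy as np

    def fused_pool_relu(conv_rows, bram, relu_threshold=0.0):
        # conv_rows: iterable of 1-D float32 rows of final convolution results;
        # bram: writable 2-D buffer standing in for the result BRAM.
        held, out_row = None, 0
        for r, row in enumerate(conv_rows):
            pair_max = np.maximum(row[0::2], row[1::2])  # divide the row in pairs
            if r % 2 == 0:
                held = pair_max                          # keep until the next row
            else:
                region_max = np.maximum(held, pair_max)  # max of each 2*2 region
                # compare with the ReLU threshold and save into BRAM
                bram[out_row] = np.maximum(region_max, relu_threshold)
                out_row += 1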
In step (2), the BRAM serving as the data buffer must receive the data read out of the DRAM. To exploit the maximum bandwidth of the DRAM, the write port of the BRAM is set to a 512-bit data width with a depth of 512 entries; one BRAM block consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, serving as the input width of the convolution operation. The result-buffer BRAM must both read data from the DRAM and write data into the DRAM at the same time, so it is set to true dual-port mode with a 16-bit port width. The data storage overhead of the entire convolutional network consists of the feature maps and the weights, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM block, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM block, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM block, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: for the input data buffers, convolutional layer 1 uses one BRAM block, layers 2-6 each use two BRAM blocks, layer 7 uses eight, layer 8 uses ten, and layer 9 uses nine; for the output data buffers, each layer uses one BRAM block. Each BRAM block consumes 7.5 RAMB36E1, and the feature-map data buffers need 337.5 RAMB36E1 in total. Since BRAM resources are limited, ping-pong operation is done only at the output buffers, and a layer performs no convolution until the data in its input buffer are ready. The number of parallel computation channels of each layer is allocated in proportion to each layer's multiply-accumulate workload; the corresponding per-layer parallel channel counts are shown in Table 1. Each parallel input channel requires an individual BRAM block for storage, whereas the result buffer needs only a single equally sized BRAM block.
Table 1. Per-layer share of the computation workload and per-layer parallel channel count

Layer    | 1 | 2   | 3   | 4   | 5   | 6   | 7  | 8  | 9
Ratio    | 1 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 10 | 20 | 1
PE count | 1 | 2   | 2   | 2   | 2   | 2   | 8  | 16 | 1
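As a quick arithmetic cross-check of the storage totals quoted above, a sketch using only figures stated in the text:

    # Weight buffers: layers 1-3 share one 7.5-RAMB36E1 block, layers 4-8
    # each use one 14.5-RAMB36E1 block, layer 9 uses one 7.5-RAMB36E1 block.
    weight_cost = 7.5 + 5 * 14.5 + 7.5   # 87.5 RAMB36E1
    fm_cost = 337.5                      # feature-map buffers, as stated above
    print(weight_cost + fm_cost)         # 425.0 RAMB36E1, the overall total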
Specifically, the convolutional part is followed by the region-layer operation for position mapping. The output of the convolutional network includes the location information of 13*13*5 candidate boxes; the location information of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa of the box center, the relative ordinate, the relative width, and the relative height. These four values must undergo some processing before they can be mapped onto actual picture positions. The relative abscissa and ordinate are mapped into absolute coordinates through the sigmoid function; since the output is an 8-bit fixed-point representation, the corresponding outputs can be quantized into a lookup table to accelerate this mapping process. The relative width and height are mapped into absolute values through the e exponential, and the lookup-table form is equally applicable for obtaining the result.
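For illustration, a Python sketch of the lookup-table mapping; the 13*13 grid normalization and the anchor sizes follow the usual YOLO convention and, like frac_bits, are assumptions of the sketch rather than values stated here:

    import numpy as np

    def build_luts(frac_bits):
        # One precomputed value per possible 8-bit output code: two 256-entry
        # tables replace run-time sigmoid and exp evaluations.
        codes = np.arange(-128, 128) * 2.0 ** -frac_bits
        return 1.0 / (1.0 + np.exp(-codes)), np.exp(codes)

    def map_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, luts):
        # tx..th are signed 8-bit codes; index the tables by code + 128.
        sig_lut, exp_lut = luts
        x = (cell_x + sig_lut[tx + 128]) / 13.0   # absolute center (0-1 range)
        y = (cell_y + sig_lut[ty + 128]) / 13.0
        w = anchor_w * exp_lut[tw + 128] / 13.0   # absolute size (0-1 range)
        h = anchor_h * exp_lut[th + 128] / 13.0
        return x, y, w, h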
The output candidate boxes of the convolutional network together with their confidence information are used to perform the NMS operation; the specific computation steps are as follows. First extract the center coordinates of each candidate box in order, and assign each candidate box a flag bit indicating whether the whole candidate box is retained. Since the center-point distance is used as the criterion, prior information tells us that among the candidate boxes output by the network, boxes close to each other in the output order are the ones to compare, while boxes far apart in the order need not be compared. Then select the first candidate box as the comparison object and compute the center-point distance between it and each candidate box behind it; when the distance exceeds a threshold, the flag bit of the compared candidate box is set to the valid state, indicating that this box is to be retained; otherwise the flag bit of that box is set invalid and the box no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next candidate box after it whose flag bit is valid. Finally, all candidate boxes with valid flag bits are extracted from the result memory, and marker boxes are generated and printed onto the original image as the final detection result.
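A sketch of this simplified NMS in Python, assuming boxes holds the box centers in the order the network emits them and threshold is the center-distance threshold from the text:

    def center_distance_nms(boxes, threshold):
        # boxes: list of (x, y) centers in network output order.
        # Returns one flag bit per box: True means the box is retained.
        n = len(boxes)
        keep = [True] * n
        obj = 0                                  # first box is the comparison object
        while obj is not None:
            ox, oy = boxes[obj]
            for i in range(obj + 1, n):
                if not keep[i]:
                    continue                     # invalid boxes skip later comparisons
                dx, dy = boxes[i][0] - ox, boxes[i][1] - oy
                if (dx * dx + dy * dy) ** 0.5 <= threshold:
                    keep[i] = False              # too close: clear the flag bit
            # replace the comparison object with the next flag-valid box
            obj = next((i for i in range(obj + 1, n) if keep[i]), None)
        return keep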
The above embodiments and description merely illustrate the principle of the present invention. Various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of the claimed invention. The scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

5. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that step (2) further includes pooling and activation operations that must be carried out when computing the final result of a given output channel, the detailed process being as follows: while computing the convolution results of a given row one by one, divide this row's results into pairs and record the maximum of each pair, saving it with the logic resources on the FPGA chip; when the computation reaches the next row, likewise divide the output results into pairs, take the larger value of each pair, and compare it with the maximum elected in the previous row, taking the larger of these two maxima as the maximum of the corresponding 2*2 region; then compare it with the threshold of the ReLU activation function and save the result into the BRAM; in this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
CN201810970836.2A (priority 2018-08-24, filed 2018-08-24): FPGA-based YOLO network forward reasoning accelerator design method, granted as CN109214504B (en), Active

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810970836.2A | 2018-08-24 | 2018-08-24 | FPGA-based YOLO network forward reasoning accelerator design method (CN109214504B (en))


Publications (2)

Publication Number | Publication Date
CN109214504A (en) | 2019-01-15
CN109214504B (en) | 2020-09-04

Family

ID: 64989693

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810970836.2A | FPGA-based YOLO network forward reasoning accelerator design method (CN109214504B (en), Active) | 2018-08-24 | 2018-08-24

Country Status (1)

Country | Link
CN | CN109214504B (en)


Also Published As

Publication Number | Publication Date
CN109214504B (en) | 2020-09-04


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
