Summary of the Invention
In view of the above problems, a technical object of the present invention is to provide a data mapping system for parallel convolution computation that eliminates computing resources that are invalid or do not participate in the computation, and improves the utilization of computing resources.
A further technical object of the present invention is to provide a data mapping method for parallel convolution computation.
To achieve the above objects, the present invention provides the following technical solutions:
A data mapping system for parallel convolution computation, comprising an input feature map cache module, a mapping logic module, an output feature map cache module, a weight cache module, a convolution computation array and a control logic module. The input feature map cache module is connected to the control logic module and the mapping logic module respectively; the weight cache module is connected to the control logic module and the mapping logic module respectively; the convolution computation array is connected to the control logic module, the mapping logic module and the output feature map cache module; and the output feature map cache module is connected to the control logic module.
The data mapping system for parallel convolution computation increases the parallelism of convolution computation by recombining the input feature map, eliminating computing resources that would otherwise be invalid or not participate in the computation. Specifically, the input feature map is partitioned into regular blocks and recombined by effective mapping means, so that invalid or non-participating portions of the computation are replaced with valid portions. This increases the parallelism of the overall convolution computation, raises the utilization of computing resources, and improves system performance.
Preferably, the input feature map cache module serves as a cache for externally input data. According to the commands issued by the control logic module, the mapping logic module obtains data from the input feature map cache module and the weight cache module, and sends the obtained data to the convolution computation array; the convolution computation array sends the completed results to the output feature map cache module.
Preferably, the convolution computation array consists of N rows by N columns of convolution computation units, with adjacent convolution computation units interconnected.
Each convolution computation unit includes 2x2 PEs (Processing Elements). During convolution computation, each PE corresponds to the computation of one pixel of one output feature map.
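The array organization described above can be sketched as a nested structure. This is only an illustrative model with names of our own choosing (`N`, `array`, `pe_count` do not appear in the patent); the patent leaves the value of N unspecified, so N = 4 here is an assumption:

```python
# Hypothetical sketch of the convolution computation array: an N x N grid of
# units, each unit holding a 2x2 block of processing elements (PEs).
# Each PE slot will eventually hold the operand for one output pixel.

N = 4  # array dimension; chosen arbitrarily for illustration

# Each unit is a 2x2 list of PE slots initialized to zero.
array = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(N)] for _ in range(N)]

def pe_count(array):
    """Total processing elements: N * N units * 4 PEs per unit."""
    return sum(4 for row in array for _unit in row)

print(pe_count(array))  # 4 * 4 units * 4 PEs = 64
```

The adjacency interconnect mentioned in the text is not modeled here; it would allow each unit to copy operands to its right and lower neighbours, as the embodiments below describe.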
A data mapping method for parallel convolution computation: the method partitions the input feature map into regular blocks and recombines the input feature map by mapping means, thereby increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computation array, and the convolution computation array sends the completed results to the output feature map cache module.
Preferably, when the convolution kernel sliding stride is greater than 1, the portions of the input feature map where the kernel slide would perform invalid computation are filled with portions that perform valid computation, and the recombined input feature map is used as the input of the convolution units.
Preferably, the portions of invalid computation during kernel sliding are filled with valid computation portions of the input feature map: the invalid computation positions of the array are filled using the data of the valid computation position at the upper right of the matrix, and the data in the input feature map that participate in valid computation are translated rightward and downward and copied into adjacent convolution computation units.
Preferably, the data copied into the adjacent convolution computation units undergo convolution computation with the convolution kernel weight values read in from the weight cache module, so that the recombined feature map traverses all the weight values, and the computation results are sent to the output feature map cache module.
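The fill-and-copy idea can be illustrated with a minimal sketch. All names here are ours, not the patent's, and the sketch assumes the simplest case in which each valid value is replicated into the stride x stride block of array positions to its right and below:

```python
# Minimal sketch: with kernel sliding stride > 1, only stride-aligned array
# positions would perform valid computation.  Replicate each valid value
# rightward and downward into the neighbouring invalid positions, so that
# every PE holds useful data (each copy can then meet a different weight).

def remap(valid, stride):
    """valid: 2-D list of values at the stride-aligned (valid) positions.
    Returns the full array with each value filling a stride x stride block."""
    rows = len(valid) * stride
    cols = len(valid[0]) * stride
    full = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            full[r][c] = valid[r // stride][c // stride]
    return full

# Example: 2x2 valid points (labelled 11, 12, 21, 22) expanded onto a 4x4
# array with stride 2.
print(remap([[11, 12], [21, 22]], 2))
# -> [[11, 11, 12, 12], [11, 11, 12, 12], [21, 21, 22, 22], [21, 21, 22, 22]]
```

In the hardware described by the patent this replication happens incrementally over several cycles via the unit interconnect, rather than all at once as above.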
Preferably, when the output feature map and the computation array size do not match, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map, which serves as the input of the convolution computation array.
Preferably, the division ratio of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the convolution computation array size and the output feature map size.
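The utilization figures quoted in the embodiments below follow from a simple ratio. This helper is our own back-of-the-envelope check, not part of the patent:

```python
# Utilization of the computation array when only a single output feature map
# occupies it: valid output points divided by total array positions.

def utilization(out_h, out_w, array_h, array_w):
    return (out_h * out_w) / (array_h * array_w)

print(utilization(2, 2, 4, 4))  # 0.25: the stride-2 example (1/4)
print(utilization(2, 2, 3, 3))  # ~0.4444: the channel-tiling example (4/9)
```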
Compared with the prior art, the data mapping method for parallel convolution computation of the present invention has the following prominent beneficial effects: the method recombines the input feature map by effective mapping means and increases the parallelism of convolution computation. In particular, the input feature map is partitioned into regular blocks, and invalid or non-participating computation portions are replaced with valid computation portions, eliminating computing resources that would be invalid or idle, increasing the parallelism of the overall convolution computation, raising the utilization of computing resources, and improving system performance. The method therefore has good application value.
Embodiment
As shown in Figure 1, the data mapping system for parallel convolution computation of the present invention includes an input feature map cache module, a mapping logic module, an output feature map cache module, a weight cache module, a convolution computation array and a control logic module.
The input feature map cache module serves as a cache for externally input data and is connected to the control logic module and the mapping logic module respectively.
The convolution computation array consists of N rows by N columns of convolution computation units, with adjacent convolution computation units interconnected. As shown in Figure 2, each convolution computation unit includes 2x2 PEs; during convolution computation, each PE corresponds to the computation of one pixel of one output feature map.
According to the commands issued by the control logic module, the mapping logic module obtains data from the input feature map cache module and the weight cache module, and sends the obtained data to the convolution computation array; the convolution computation array sends the completed results to the output feature map cache module.
The weight cache module is connected to the control logic module and the mapping logic module respectively. The convolution computation array is connected to the control logic module, the mapping logic module and the output feature map cache module. The output feature map cache module is connected to the control logic module.
In the data mapping method for parallel convolution computation of the present invention, the input feature map is partitioned into regular blocks and recombined by mapping means, increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computation array, and the convolution computation array sends the completed results to the output feature map cache module.
When the convolution kernel sliding stride is greater than 1, the portions of the input feature map where the kernel slide would perform invalid computation are filled with portions of valid computation: the invalid computation positions of the array are filled using the data of the valid computation position at the upper right of the matrix, and the data participating in valid computation in the input feature map are translated rightward and downward and copied into adjacent computation units. The data copied into the adjacent computation units undergo convolution computation with the convolution kernel weight values read in from the weight cache module, so that the recombined feature map traverses all the weight values; the recombined input feature map serves as the input of the convolution units, and the computation results are sent to the output feature map cache module. The specific implementation process is shown in Figure 3, illustrated with an example in which the convolution computation array size is 4x4, the output feature map is 2x2, the convolution kernel weight matrix is 1x1, and the kernel sliding stride is 2. Since each kernel slide covers a stride of 2, every valid output point computed is accompanied by one invalid computation, and the effective utilization of the whole computation array is (2x2)/(4x4) = 1/4, so computing resources are wasted. To make full use of the computing resources, the valid data are replicated in parallel, each copy is convolved with a different kernel weight, and the intermediate results are cached.
1. At time T0 of the first cycle, the control logic commands the mapping logic to obtain the value of point 11 of the input feature map from the input feature map cache and input it into the computation array, and to obtain the corresponding weight k1 from the weight cache and input it into the computation array.
2. At time T1, the value of point 11 and the weight k1 are computed in the computation array, and the result out0 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 12 of the computation array.
3. At time T2, the value of point 12 and the weight k2 are computed in the computation array, and the result out1 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 21 of the computation array.
4. At time T3, the value of point 21 and the weight k3 are computed in the computation array, and the result out2 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 22 of the computation array.
5. At time T4, the value at position 22 and the weight k4 are computed, and the result out3 is sent to the output feature map cache.
The other computation units are processed in the same way, until the first feature value of the input feature map has been computed with all the weights and the intermediate results have been preserved; the next feature value is then computed, and so on. After the whole input feature map of the first channel has been computed, the input feature map of the next channel enters computation, and the intermediate computation results corresponding to different channels are summed.
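The per-point schedule described above can be sketched functionally. For a 1x1 kernel each "convolution" reduces to one multiply, so the five cycles T0-T4 amount to one value meeting four weights in turn; the function name and example numbers below are ours, not the patent's:

```python
# Sketch of the Figure 3 schedule for one input point (1x1 kernel, stride 2):
# the value at input point 11 is copied through array positions 12, 21, 22,
# and each copy is multiplied by a different 1x1 kernel weight, yielding
# out0..out3 over cycles T1..T4.

def schedule_point(value, weights):
    """weights: [k1, k2, k3, k4]; returns [out0, out1, out2, out3]."""
    outputs = []
    for k in weights:               # one multiply per cycle T1..T4
        outputs.append(value * k)   # value has been copied to the next PE
    return outputs

print(schedule_point(3.0, [1.0, 2.0, 0.5, -1.0]))  # [3.0, 6.0, 1.5, -3.0]
```

In this way the array that would have sat 3/4 idle produces one partial result per kernel per cycle, which the output cache accumulates across channels.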
When the output feature map and the computation array size do not match, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map as the input of the convolution computation array. The division ratio of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the computation array size and the output feature map size. The specific implementation process is shown in Figure 4, illustrated with an example in which the convolution computation array size is 3x3, the output feature map size is 2x2, the convolution kernel size is 1x1, and the sliding stride is 1. In this case the computation array size is larger than the output feature map size, but not by an integer multiple, so the computing resource utilization is (2x2)/(3x3) = 4/9 and the resources not participating in the computation are wasted. By dicing the input feature map and combining the input feature map blocks of different channels, all the resources of the computation array are exploited, so that the computing resources can be fully utilized. The detailed process is as follows:
1. At time T0 of the first cycle, the control logic commands the mapping logic to input the values of point 11 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-11 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
2. At time T1, the control logic commands the mapping logic to input the values of point 12 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-12 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
3. At time T2, the control logic commands the mapping logic to input the values of point 21 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-21 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
4. At time T3, the control logic commands the mapping logic to input the values of point 22 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-22 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
At the end of time T3, all the point values of the output feature maps of the four channels have been computed.
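The four-cycle channel-tiling schedule can be sketched as follows. This is our reconstruction of the Figure 4 example under its stated assumptions (1x1 kernel, stride 1, one shared weight per cycle); the function name and sample data are illustrative only:

```python
# Sketch of the Figure 4 mapping: at each cycle T0..T3, the same-position
# point of channels 1..4 is placed at array positions 11, 12, 21, 22 and
# multiplied by the 1x1 kernel weight, producing that point of four output
# feature maps in parallel.

def tile_channels(channels, weight):
    """channels: four 2x2 input blocks; returns four 2x2 output blocks."""
    outputs = [[[0, 0], [0, 0]] for _ in range(4)]
    for r in range(2):               # the four point positions map to
        for c in range(2):           # cycles T0..T3
            for ch in range(4):      # four-way parallel multiply
                outputs[ch][r][c] = channels[ch][r][c] * weight
    return outputs

chans = [[[1, 2], [3, 4]], [[5, 6], [7, 8]],
         [[9, 10], [11, 12]], [[13, 14], [15, 16]]]
print(tile_channels(chans, 2))
```

After these four cycles every point of all four channels' output blocks has been produced, matching the statement that computation finishes at the end of T3.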
The embodiments described above are only preferred specific implementations of the present invention. Usual variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should all be included within the protection scope of the present invention.