Summary of the Invention
In view of the above problems, a technical object of the present invention is to provide a data mapping system for parallel convolution computation that eliminates computing resources that are invalid or do not participate in the computation, and improves the utilization of computing resources.
A further technical object of the present invention is to provide a data mapping method for parallel convolution computation.
To achieve the above objects, the present invention provides the following technical solutions:
A data mapping system for parallel convolution computation, comprising an input feature map cache module, a mapping logic module, an output feature map cache module, a weight cache module, a convolution computation array and a control logic module. The input feature map cache module is connected to the control logic module and the mapping logic module respectively; the weight cache module is connected to the control logic module and the mapping logic module respectively; the convolution computation array is connected to the control logic module, the mapping logic module and the output feature map cache module; and the output feature map cache module is connected to the control logic module.
The data mapping system for parallel convolution computation increases the parallelism of convolution computation by recombining the input feature map, eliminating computing resources that would otherwise be invalid or not participate in the computation. Specifically, the input feature map is partitioned into regular blocks and recombined by effective mapping means, so that invalid or non-participating portions of the computation are replaced with valid portions. This increases the parallelism of the overall convolution computation, raises the utilization of computing resources, and improves system performance.
Preferably, the input feature map cache module serves as a cache for externally input data. According to the commands issued by the control logic module, the mapping logic module obtains data from the input feature map cache module and the weight cache module, and sends the obtained data to the convolution computation array; the convolution computation array sends the completed results to the output feature map cache module.
Preferably, the convolution computation array consists of N rows by N columns of convolution computation units, with adjacent convolution computation units interconnected.
Each convolution computation unit includes 2x2 PEs (Processing Elements). During convolution computation, each PE corresponds to the computation of one pixel of one output feature map.
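The array organization described above can be sketched as a nested structure. This is only an illustrative model with names of our own choosing (`N`, `array`, `pe_count` do not appear in the patent); the patent leaves the value of N unspecified, so N = 4 here is an assumption:

```python
# Hypothetical sketch of the convolution computation array: an N x N grid of
# units, each unit holding a 2x2 block of processing elements (PEs).
# Each PE slot will eventually hold the operand for one output pixel.

N = 4  # array dimension; chosen arbitrarily for illustration

# Each unit is a 2x2 list of PE slots initialized to zero.
array = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(N)] for _ in range(N)]

def pe_count(array):
    """Total processing elements: N * N units * 4 PEs per unit."""
    return sum(4 for row in array for _unit in row)

print(pe_count(array))  # 4 * 4 units * 4 PEs = 64
```

The adjacency interconnect mentioned in the text is not modeled here; it would allow each unit to copy operands to its right and lower neighbours, as the embodiments below describe.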
A data mapping method for parallel convolution computation: the method partitions the input feature map into regular blocks and recombines the input feature map by mapping means, thereby increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computation array, and the convolution computation array sends the completed results to the output feature map cache module.
Preferably, when the convolution kernel sliding stride is greater than 1, the portions of the input feature map where the kernel slide would perform invalid computation are filled with portions that perform valid computation, and the recombined input feature map is used as the input of the convolution units.
Preferably, the portions of invalid computation during kernel sliding are filled with valid computation portions of the input feature map: the invalid computation positions of the array are filled using the data of the valid computation position at the upper right of the matrix, and the data in the input feature map that participate in valid computation are translated rightward and downward and copied into adjacent convolution computation units.
Preferably, the data copied into the adjacent convolution computation units undergo convolution computation with the convolution kernel weight values read in from the weight cache module, so that the recombined feature map traverses all the weight values, and the computation results are sent to the output feature map cache module.
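The fill-and-copy idea can be illustrated with a minimal sketch. All names here are ours, not the patent's, and the sketch assumes the simplest case in which each valid value is replicated into the stride x stride block of array positions to its right and below:

```python
# Minimal sketch: with kernel sliding stride > 1, only stride-aligned array
# positions would perform valid computation.  Replicate each valid value
# rightward and downward into the neighbouring invalid positions, so that
# every PE holds useful data (each copy can then meet a different weight).

def remap(valid, stride):
    """valid: 2-D list of values at the stride-aligned (valid) positions.
    Returns the full array with each value filling a stride x stride block."""
    rows = len(valid) * stride
    cols = len(valid[0]) * stride
    full = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            full[r][c] = valid[r // stride][c // stride]
    return full

# Example: 2x2 valid points (labelled 11, 12, 21, 22) expanded onto a 4x4
# array with stride 2.
print(remap([[11, 12], [21, 22]], 2))
# -> [[11, 11, 12, 12], [11, 11, 12, 12], [21, 21, 22, 22], [21, 21, 22, 22]]
```

In the hardware described by the patent this replication happens incrementally over several cycles via the unit interconnect, rather than all at once as above.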
Preferably, when the output feature map and the computation array size do not match, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map, which serves as the input of the convolution computation array.
Preferably, the division ratio of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the convolution computation array size and the output feature map size.
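The utilization figures quoted in the embodiments below follow from a simple ratio. This helper is our own back-of-the-envelope check, not part of the patent:

```python
# Utilization of the computation array when only a single output feature map
# occupies it: valid output points divided by total array positions.

def utilization(out_h, out_w, array_h, array_w):
    return (out_h * out_w) / (array_h * array_w)

print(utilization(2, 2, 4, 4))  # 0.25: the stride-2 example (1/4)
print(utilization(2, 2, 3, 3))  # ~0.4444: the channel-tiling example (4/9)
```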
Compared with the prior art, the data mapping method for parallel convolution computation of the present invention has the following prominent beneficial effects: the method recombines the input feature map by effective mapping means and increases the parallelism of convolution computation. In particular, the input feature map is partitioned into regular blocks, and invalid or non-participating computation portions are replaced with valid computation portions, eliminating computing resources that would be invalid or idle, increasing the parallelism of the overall convolution computation, raising the utilization of computing resources, and improving system performance. The method therefore has good application value.
Embodiment
As shown in Figure 1, the data mapping system for parallel convolution computation of the present invention includes an input feature map cache module, a mapping logic module, an output feature map cache module, a weight cache module, a convolution computation array and a control logic module.
The input feature map cache module serves as a cache for externally input data and is connected to the control logic module and the mapping logic module respectively.
The convolution computation array consists of N rows by N columns of convolution computation units, with adjacent convolution computation units interconnected. As shown in Figure 2, each convolution computation unit includes 2x2 PEs; during convolution computation, each PE corresponds to the computation of one pixel of one output feature map.
According to the commands issued by the control logic module, the mapping logic module obtains data from the input feature map cache module and the weight cache module, and sends the obtained data to the convolution computation array; the convolution computation array sends the completed results to the output feature map cache module.
The weight cache module is connected to the control logic module and the mapping logic module respectively. The convolution computation array is connected to the control logic module, the mapping logic module and the output feature map cache module. The output feature map cache module is connected to the control logic module.
In the data mapping method for parallel convolution computation of the present invention, the input feature map is partitioned into regular blocks and recombined by mapping means, increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computation array, and the convolution computation array sends the completed results to the output feature map cache module.
When the convolution kernel sliding stride is greater than 1, the portions of the input feature map where the kernel slide would perform invalid computation are filled with portions of valid computation: the invalid computation positions of the array are filled using the data of the valid computation position at the upper right of the matrix, and the data participating in valid computation in the input feature map are translated rightward and downward and copied into adjacent computation units. The data copied into the adjacent computation units undergo convolution computation with the convolution kernel weight values read in from the weight cache module, so that the recombined feature map traverses all the weight values; the recombined input feature map serves as the input of the convolution units, and the computation results are sent to the output feature map cache module. The specific implementation process is shown in Figure 3, illustrated with an example in which the convolution computation array size is 4x4, the output feature map is 2x2, the convolution kernel weight matrix is 1x1, and the kernel sliding stride is 2. Since each kernel slide covers a stride of 2, every valid output point computed is accompanied by one invalid computation, and the effective utilization of the whole computation array is (2x2)/(4x4) = 1/4, so computing resources are wasted. To make full use of the computing resources, the valid data are replicated in parallel, each copy is convolved with a different kernel weight, and the intermediate results are cached.
1. At time T0 of the first cycle, the control logic commands the mapping logic to obtain the value of point 11 of the input feature map from the input feature map cache and input it into the computation array, and to obtain the corresponding weight k1 from the weight cache and input it into the computation array.
2. At time T1, the value of point 11 and the weight k1 are computed in the computation array, and the result out0 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 12 of the computation array.
3. At time T2, the value of point 12 and the weight k2 are computed in the computation array, and the result out1 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 21 of the computation array.
4. At time T3, the value of point 21 and the weight k3 are computed in the computation array, and the result out2 is sent to the output feature map cache; at the same time, the value of point 11 is copied to position 22 of the computation array.
5. At time T4, the value at position 22 and the weight k4 are computed, and the result out3 is sent to the output feature map cache.
The other computation units are processed in the same way, until the first feature value of the input feature map has been computed with all the weights and the intermediate results have been preserved; the next feature value is then computed, and so on. After the whole input feature map of the first channel has been computed, the input feature map of the next channel enters computation, and the intermediate computation results corresponding to different channels are summed.
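The per-point schedule described above can be sketched functionally. For a 1x1 kernel each "convolution" reduces to one multiply, so the five cycles T0-T4 amount to one value meeting four weights in turn; the function name and example numbers below are ours, not the patent's:

```python
# Sketch of the Figure 3 schedule for one input point (1x1 kernel, stride 2):
# the value at input point 11 is copied through array positions 12, 21, 22,
# and each copy is multiplied by a different 1x1 kernel weight, yielding
# out0..out3 over cycles T1..T4.

def schedule_point(value, weights):
    """weights: [k1, k2, k3, k4]; returns [out0, out1, out2, out3]."""
    outputs = []
    for k in weights:               # one multiply per cycle T1..T4
        outputs.append(value * k)   # value has been copied to the next PE
    return outputs

print(schedule_point(3.0, [1.0, 2.0, 0.5, -1.0]))  # [3.0, 6.0, 1.5, -3.0]
```

In this way the array that would have sat 3/4 idle produces one partial result per kernel per cycle, which the output cache accumulates across channels.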
When the output feature map and the computation array size do not match, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map as the input of the convolution computation array. The division ratio of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the computation array size and the output feature map size. The specific implementation process is shown in Figure 4, illustrated with an example in which the convolution computation array size is 3x3, the output feature map size is 2x2, the convolution kernel size is 1x1, and the sliding stride is 1. In this case the computation array size is larger than the output feature map size, but not by an integer multiple, so the computing resource utilization is (2x2)/(3x3) = 4/9 and the resources not participating in the computation are wasted. By dicing the input feature map and combining the input feature map blocks of different channels, all the resources of the computation array are exploited, so that the computing resources can be fully utilized. The detailed process is as follows:
1. At time T0 of the first cycle, the control logic commands the mapping logic to input the values of point 11 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-11 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
2. At time T1, the control logic commands the mapping logic to input the values of point 12 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-12 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
3. At time T2, the control logic commands the mapping logic to input the values of point 21 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-21 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
4. At time T3, the control logic commands the mapping logic to input the values of point 22 at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computation array respectively; the four-way parallel computation simultaneously obtains the point-22 values of 4 output feature maps, which are temporarily stored in the output feature map cache.
At the end of time T3, all the point values of the output feature maps of the four channels have been computed.
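The four-cycle channel-tiling schedule can be sketched as follows. This is our reconstruction of the Figure 4 example under its stated assumptions (1x1 kernel, stride 1, one shared weight per cycle); the function name and sample data are illustrative only:

```python
# Sketch of the Figure 4 mapping: at each cycle T0..T3, the same-position
# point of channels 1..4 is placed at array positions 11, 12, 21, 22 and
# multiplied by the 1x1 kernel weight, producing that point of four output
# feature maps in parallel.

def tile_channels(channels, weight):
    """channels: four 2x2 input blocks; returns four 2x2 output blocks."""
    outputs = [[[0, 0], [0, 0]] for _ in range(4)]
    for r in range(2):               # the four point positions map to
        for c in range(2):           # cycles T0..T3
            for ch in range(4):      # four-way parallel multiply
                outputs[ch][r][c] = channels[ch][r][c] * weight
    return outputs

chans = [[[1, 2], [3, 4]], [[5, 6], [7, 8]],
         [[9, 10], [11, 12]], [[13, 14], [15, 16]]]
print(tile_channels(chans, 2))
```

After these four cycles every point of all four channels' output blocks has been produced, matching the statement that computation finishes at the end of T3.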
The embodiments described above are only preferred specific implementations of the present invention. Usual variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should all be included within the protection scope of the present invention.