Specific Embodiments
To enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
In the method provided in the first aspect, determining, according to the n-th backward computational complexity, the n-th backward data type corresponding to the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data includes:
comparing the n-th backward computational complexity with a preset threshold; if the n-th backward computational complexity is higher than the preset threshold, determining that the n-th backward data type is the fixed-point type; if the n-th backward computational complexity is less than or equal to the preset threshold, the computing device determines that the n-th backward data type is the floating-point type.
In the method provided in the first aspect, after determining, according to the n-th backward computational complexity, the n-th backward data type corresponding to the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data, the method further includes:
determining the (n+1)-th backward data type to which the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belong; if the (n+1)-th backward data type is different from the n-th backward data type, converting the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the (n+1)-th backward data type into the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the n-th backward data type.
In the method provided in the first aspect, if the n-th-layer backward operation is a convolution operation, the convolution input data is the n-th-layer input data, and the convolution kernel is the n-th output result gradient:
n-th backward computational complexity = α*C*kH*kW*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;
if the complexity is greater than the preset threshold, it is determined that the n-th backward data type is the floating-point type, and it is determined whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data and the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
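As an illustration of this rule, the following is a minimal sketch, assuming the (C, kH, kW, M) kernel dimensions and (N, W, C, H) input dimensions defined above; the function names, example dimensions, and threshold value are illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch (illustrative, not the device's circuit logic): compute the
# n-th backward computational complexity of a convolution and pick a data type.
def conv_backward_complexity(alpha, kernel_dims, input_dims):
    C, kH, kW, M = kernel_dims          # four dimensions of the convolution kernel
    N, W, C_in, H = input_dims          # four dimensions of the convolution input
    return alpha * C * kH * kW * M * N * W * C_in * H

def choose_backward_type(complexity, threshold):
    # rule of the preceding paragraph: above the threshold, operate in floating point
    return "float" if complexity > threshold else "fixed"

complexity = conv_backward_complexity(alpha=2,
                                      kernel_dims=(3, 3, 3, 64),
                                      input_dims=(1, 224, 3, 224))
print(choose_backward_type(complexity, threshold=10**9))
```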
In the method provided in the first aspect, if the n-th backward operation is a matrix-multiply-matrix operation, the input data is the n-th-layer input data and the weight is the n-th output result gradient;
complexity = β*F*G*E*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and E and F are the row and column values of the weight;
if the complexity is greater than the preset threshold, it is determined that the n-th backward data type is the floating-point type, and it is determined whether the n-th-layer input data and the weight are floating-point data; if the n-th-layer input data and the weight are not floating-point data, the n-th-layer input data is converted into floating-point data and the weight is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the n-th-layer input data and the weight in the floating-point data type.
In the method provided in the first aspect, if the n-th backward operation is a matrix-multiply-vector operation, the input data is the n-th-layer input data and the weight is the n-th output result gradient;
complexity = β*F*G*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and F is the column value of the n-th output result gradient;
if the complexity is greater than the preset threshold, it is determined that the n-th backward data type is the floating-point type, and it is determined whether the n-th-layer input data and the weight are floating-point data; if the n-th-layer input data and the weight are not floating-point data, the n-th-layer input data is converted into floating-point data and the weight is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the n-th-layer input data and the weight in the floating-point data type.
In the method provided in the first aspect, the n-th backward operation may further include one of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation, or any combination thereof.
In the device provided in the second aspect, the processing circuit specifically compares the n-th backward computational complexity with a preset threshold: if the n-th backward computational complexity is higher than the preset threshold, it determines that the n-th backward data type is the fixed-point type; if the n-th backward computational complexity is less than or equal to the preset threshold, it determines that the n-th backward data type is the floating-point type.
In the device provided in the second aspect, the integrated circuit chip device further includes a data type conversion circuit;
the processing circuit is further configured to determine the (n+1)-th backward data type to which the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belong, and, if the (n+1)-th backward data type is different from the n-th backward data type, to send a conversion command to the data type conversion circuit;
the data type conversion circuit is configured to convert the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the (n+1)-th backward data type into the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the n-th backward data type.
In the device provided in the second aspect, if the n-th-layer backward operation is a convolution operation, the convolution input data is the n-th-layer input data, and the convolution kernel is the n-th output result gradient:
the processing circuit is configured to calculate the n-th backward computational complexity,
n-th backward computational complexity = α*C*kH*kW*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;
the processing circuit is further configured to, if the complexity is greater than the preset threshold, determine that the n-th backward data type is the floating-point data type and determine whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data and the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
In the device provided in the second aspect, if the n-th backward operation is a matrix-multiply-matrix operation, the input data is the n-th-layer input data and the weight is the n-th output result gradient;
the processing circuit is configured to calculate the n-th backward computational complexity,
n-th backward computational complexity = β*F*G*E*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and E and F are the row and column values of the weight;
the processing circuit is further configured to, if the complexity is greater than the preset threshold, determine that the n-th backward data type is the floating-point data type and determine whether the n-th-layer input data and the weight are floating-point data; if the n-th-layer input data and the weight are not floating-point data, the n-th-layer input data is converted into floating-point data and the weight is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the n-th-layer input data and the weight in the floating-point data type.
In the device provided in the second aspect, if the n-th backward operation is a matrix-multiply-vector operation, the input data is the n-th-layer input data and the weight is the n-th output result gradient;
the processing circuit is configured to calculate the n-th backward computational complexity,
n-th backward computational complexity = β*F*G*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and F is the column value of the n-th output result gradient;
the processing circuit is further configured to, if the complexity is greater than the preset threshold, determine that the n-th backward data type is the floating-point data type and determine whether the n-th-layer input data and the weight are floating-point data; if the n-th-layer input data and the weight are not floating-point data, the n-th-layer input data is converted into floating-point data and the weight is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the n-th-layer input data and the weight in the floating-point data type.
As shown in Fig. 1, the steps of neural network training include:
each layer of a (multi-layer) neural network performs the forward operation in turn;
the backward operation is performed in the reverse layer order to obtain the weight gradients;
the computed weight gradients are used to update the weights of the forward operation;
this is one iteration of neural network training, and the whole training process needs to repeat this procedure many times (i.e., many iterations of computation; a training-loop sketch follows);
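The iteration just described can be summarized as follows; the layer interface, the mean-squared-error loss, and the learning rate are illustrative assumptions of this sketch, not part of the disclosed device.

```python
import numpy as np

class DenseLayer:
    """Illustrative fully connected layer y = x @ W (no bias, no activation)."""
    def __init__(self, n_in, n_out):
        self.weights = np.random.randn(n_in, n_out) * 0.1
        self.weight_grad = np.zeros_like(self.weights)

    def forward(self, x):
        return x @ self.weights

    def backward(self, x, dy):
        self.weight_grad = x.T @ dy      # gradient of the weights
        return dy @ self.weights.T       # input data gradient for the previous layer

def train(layers, x, target, lr, num_iterations):
    for _ in range(num_iterations):
        # step 1: each layer performs the forward operation in turn
        activations = [x]
        for layer in layers:
            activations.append(layer.forward(activations[-1]))
        # step 2: backward operations in reverse layer order yield weight gradients
        grad = 2.0 * (activations[-1] - target) / target.size   # MSE loss gradient
        for layer, inp in zip(reversed(layers), reversed(activations[:-1])):
            grad = layer.backward(inp, grad)
        # step 3: update the forward-operation weights with the computed gradients
        for layer in layers:
            layer.weights -= lr * layer.weight_grad

train([DenseLayer(4, 8), DenseLayer(8, 2)],
      x=np.random.randn(16, 4), target=np.random.randn(16, 2),
      lr=0.01, num_iterations=10)
```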
As shown in Fig. 1a, which illustrates a forward operation of a neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer;
the forward operation process of a neural network (also called inference) is the process of processing the input data of each layer in turn and obtaining output data through certain computations, and it has the following features:
The input of a layer:
the input of a layer may be the input data of the neural network;
the input of a layer may be the output of another layer;
the input of a layer may be the output of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer may obtain input from multiple of the above input sources simultaneously;
The output of a layer:
the output of a layer may serve as the output result of the neural network;
the output of a layer may be the input of another layer;
the output of a layer may be the input of this layer at the next time step (the case of a recurrent neural network);
the output of a layer may output results to multiple of the above output directions;
Specifically, the types of operations of the layers in the neural network include but are not limited to the following:
convolutional layers (i.e., performing convolution operations);
fully connected layers (performing fully connected operations);
normalization (regularization) layers, including types such as LRN (Local Response Normalization) layers and BN (Batch Normalization) layers;
pooling layers;
activation layers, including but not limited to the following types: Sigmoid layers, ReLU layers, PReLU layers, LeakyReLU layers, and Tanh layers;
The backward operation of a layer needs to perform two parts of computation: one part uses the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented, to compute the gradient of the weights (used in the "weight update" step to update the weights of this layer); the other part uses the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented, to compute the input data gradient (used as the output data gradient of the next layer in the backward operation so that it can perform its own backward operation);
the backward operation propagates the gradients back starting from the last layer, in the order opposite to the forward operation (a sketch of the two parts follows).
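The two parts of a layer's backward operation can be made concrete with a fully connected layer y = x·W as a stand-in; numpy and the dense (non-sparse) representation are assumptions of this sketch.

```python
import numpy as np

def fc_backward(x, W, dy):
    """Backward operation of a fully connected layer y = x @ W (illustrative).
    x:  input data,           shape (batch, in_features)
    W:  weights,              shape (in_features, out_features)
    dy: output data gradient, shape (batch, out_features)
    """
    dW = x.T @ dy     # part 1: weight gradient, consumed by the weight-update step
    dx = dy @ W.T     # part 2: input data gradient, passed on as the output data
                      # gradient of the next layer in the backward operation
    return dW, dx
```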
In an optional scheme, the output data gradient obtained by the backward computation of a layer may come from:
the gradient returned by the loss function (also called the cost function) at the end of the neural network;
the input data gradients of other layers;
the input data gradient of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer may obtain output data gradients from multiple of the above sources simultaneously;
After the backward operation of the neural network has been performed, the gradients of the weights of each layer have been computed. In this step, the first input buffer and the second input buffer of the device are used to store the weights of a layer and the gradients of those weights, respectively, and the arithmetic unit then uses the weight gradients to update the weights;
The operations mentioned above are all operations of a single layer in a neural network. For a multi-layer neural network, the implementation process is as follows: in the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the arithmetic unit as the input data of the next layer and performs its operation (or performs certain operations on that output data before using it as the input data of the next layer), and the weights are likewise replaced with the weights of the next layer; in the backward operation, after the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the arithmetic unit as the output data gradient of the next layer and performs its operation (or performs certain operations on that input data gradient before using it as the output data gradient of the next layer), and the weights are likewise replaced with the weights of the next layer (this is illustrated in the figure below, where dashed arrows indicate the backward operation, solid arrows indicate the forward operation, and the labels under each figure indicate its meaning).
Representation method of fixed-point data
The fixed-point method refers to converting the representation of the data of a data block in the network into a data encoding with a specific fixed decimal point position (the arrangement of the 0/1 bits of the data as mapped onto a circuit device);
in an optional scheme, multiple pieces of data are grouped into a data block as a whole and represented in fixed point with the same fixed-point representation method;
Fig. 1b shows a specific representation method of the short-bit fixed-point data structure used to store data according to an embodiment of the present invention. Here, 1 bit is used to represent the sign, M bits are used to represent the integer part, and N bits are used to represent the fractional part. Compared with the 32-bit floating-point data representation, the short-bit fixed-point data representation used by the present invention not only occupies fewer bits but also, for data of the same layer and the same type in the neural network, such as all the weight data of the first convolutional layer, additionally sets a flag bit, Point Location, that records the position of the decimal point, so that the precision of the data representation and the representable data range can be adjusted according to the distribution of the actual data.
A floating-point number is represented with 32 bits; for this technical solution, using fixed-point numbers reduces the number of bits of a numerical value, thereby reducing the amount of data transmitted and the amount of data operated on (a sketch of this encoding follows).
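A minimal sketch of this short-bit representation, assuming a 16-bit word (1 sign bit, with the remaining bits split between integer and fractional parts by the Point Location); the rounding and saturation choices are assumptions, not specified by the disclosure.

```python
def to_fixed(x, point_location, bit_width=16):
    """Encode a float as fixed point: the stored integer is round(x * 2**point_location),
    saturated to bit_width bits including the sign bit (cf. Fig. 1b)."""
    q = int(round(x * (1 << point_location)))
    lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    return max(lo, min(hi, q))

def from_fixed(q, point_location):
    return q / (1 << point_location)   # decode back to a float

q = to_fixed(3.1416, point_location=8)   # 8 fractional bits
print(q, from_fixed(q, 8))               # 804 3.140625
```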
The input data is represented as in Fig. 2a (N samples, each sample having C channels, and the feature map of each channel having height H and width W), and the weights, namely the convolution kernels, are represented as in Fig. 2b (M convolution kernels, each having C channels, with height and width KH and KW, respectively). The rule of the convolution operation is the same for all N samples of the input data; the following explains the process of the convolution operation on one sample. On one sample, each of the M convolution kernels performs the same operation: each kernel operation produces one plane feature map, so the M convolution kernels finally compute M plane feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each plane position of the sample, and the kernel then slides along the H and W directions; for example, Fig. 2c shows a convolution kernel performing the inner product operation at the lower-right position of one sample of the input data, Fig. 2d shows the convolution position sliding one cell to the left, and Fig. 2e shows the convolution position sliding one cell upward.
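The sliding inner product just described can be written directly as the following (unoptimized) sketch; stride 1, no padding, and numpy are assumptions.

```python
import numpy as np

def conv_forward(inputs, kernels):
    """inputs: (N, C, H, W); kernels: (M, C, KH, KW). Each of the M kernels slides
    along H and W and performs an inner product at every plane position, so each
    sample yields M plane feature maps (illustrative sketch)."""
    N, C, H, W = inputs.shape
    M, _, KH, KW = kernels.shape
    out = np.empty((N, M, H - KH + 1, W - KW + 1))
    for n in range(N):
        for m in range(M):
            for i in range(H - KH + 1):
                for j in range(W - KW + 1):
                    window = inputs[n, :, i:i + KH, j:j + KW]
                    out[n, m, i, j] = np.sum(window * kernels[m])
    return out
```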
When the first operation is a convolution operation, the input data is the convolution input data, and the weight data is the convolution kernel:
first complexity = α*C*kH*kW*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;
if the first complexity is greater than the preset threshold, it is determined whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data and the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
Specifically, the convolution processing may be handled using a chip structure as shown in Fig. 3a: when the first complexity is greater than the preset threshold, the data conversion computing circuit of the main processing circuit (which may also be called a master unit) may convert the data in some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data in some or all of the convolution kernels of the weights through the lateral data input interfaces to the basic processing circuits (which may also be called base units) directly connected to the main processing circuit (for example, the gray-filled vertical data paths at the top of Fig. 3b);
In an optional scheme, the control circuit of the main processing circuit sends one number, or a part of the numbers, of the data of a certain convolution kernel of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row data, the 3rd transmission sends the 3rd number of the 3rd row, ...; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, and the 3rd transmission sends the 5th and 6th numbers of the 3rd row, ...);
in another case of an optional scheme, the control circuit of the main processing circuit sends one number, or a part of the numbers, of the data of each of several convolution kernels of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, and the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, ...);
The control circuit of the main processing circuit partitions the input data according to the convolution positions, and sends the data at some or all of the convolution positions of the input data through the vertical data input interfaces to the basic processing circuits directly connected to the main processing circuit (for example, the gray-filled lateral data paths on the left side of the basic processing circuit array in Fig. 3b);
in an optional scheme, the control circuit of the main processing circuit sends one number, or a part of the numbers, of the data at a certain convolution position of the input data to a certain basic processing circuit each time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd column, the 2nd transmission sends the 2nd number of the 3rd column data, the 3rd transmission sends the 3rd number of the 3rd column, ...; or the 1st transmission sends the first two numbers of the 3rd column, the 2nd transmission sends the 3rd and 4th numbers of the 3rd column, and the 3rd transmission sends the 5th and 6th numbers of the 3rd column, ...);
in another case of an optional scheme, the control circuit of the main processing circuit sends one number, or a part of the numbers, of the data at each of several convolution positions of the input data to a certain basic processing circuit each time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of columns 3, 4, and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4, and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of columns 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4, and 5, and the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4, and 5, ...);
After a basic processing circuit receives data of the weights, it transmits the data through its lateral data output interface to the next basic processing circuit connected to it (for example, the white-filled lateral data paths in the middle of the basic processing circuit array in Fig. 3b); after a basic processing circuit receives data of the input data, it transmits the data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Fig. 3b);
Each basic processing circuit performs operations on the data it receives;
in an optional scheme, a basic processing circuit computes the multiplication of one or more groups of two pieces of data each time, and then accumulates the result into its register and/or on-chip cache;
in an optional scheme, a basic processing circuit computes the inner product of one or more groups of two vectors each time, and then accumulates the result into its register and/or on-chip cache;
after a basic processing circuit has computed a result, it can transmit the result out through its data output interface;
in an optional scheme, the result may be the final result or an intermediate result of the inner product operation;
Specifically, if a basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in Fig. 3b, the basic processing circuits in the bottom row output their results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through their vertical output interfaces).
After a basic processing circuit receives a computed result from another basic processing circuit, it transmits the data to the other basic processing circuit or to the main processing circuit connected to it;
the results are output in the direction of the circuits that can output directly to the main processing circuit (for example, the basic processing circuits in the bottom row output their results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through their vertical output interfaces);
the main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
Referring to Fig. 4a, Fig. 4a shows a matrix-multiply-matrix operation. If the first operation is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation and the weight is the second matrix of the matrix-multiply-matrix operation;
first complexity = β*F*G*E*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix;
if the first complexity is greater than the preset threshold, it is determined whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, the first matrix is converted into floating-point data and the second matrix is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the first matrix and the second matrix in the floating-point data type.
Referring to Fig. 4b, the matrix-multiply-matrix operation is completed using the device shown in Fig. 3b;
the following describes the operation of computing the multiplication of a matrix S of size M rows and L columns by a matrix P of size L rows and N columns (each row of matrix S is the same length as each column of matrix P, as shown in Fig. 2d), where the neural network computing device has K basic processing circuits:
Step S401b: when the first complexity is greater than the preset threshold, the main processing circuit converts matrix S and matrix P into fixed-point data; the control circuit of the main processing circuit distributes each row of data of matrix S to one of the K basic processing circuits, and the basic processing circuits store the received data in their on-chip caches and/or registers; specifically, the data may be sent to the basic processing circuits among the K basic processing circuits that are connected to the main processing circuit.
In an optional scheme, if the number of rows M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the S matrix to each of M basic processing circuits;
in an optional scheme, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
Let the Mi rows of S distributed to the i-th basic processing circuit be collectively denoted Ai; Fig. 2e shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit:
the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the amount of data transmitted afterwards is reduced, the computational efficiency is improved, and the power consumption is reduced.
Step S402b: the control circuit of the main processing circuit transmits the parts of matrix P to the basic processing circuits in a broadcast manner;
in an optional scheme, the parts of matrix P may be broadcast only once to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of matrix P obtained this time to complete the inner product operation corresponding to each row of matrix Ai; reuse in this embodiment specifically means that a basic processing circuit uses the data repeatedly in its computation; for example, reuse of the data of matrix P may mean that the data of matrix P is used multiple times;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of matrix P obtained each time, but completes the inner product operations corresponding to the rows of matrix Ai in batches;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of matrix P obtained each time to complete the inner product operations corresponding to the rows of matrix Ai;
in an optional scheme, each basic processing circuit, for example the i-th basic processing circuit, computes the inner products of the data of matrix Ai and the data of matrix P;
Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
In an optional scheme, a basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is completed;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation is completed (a simulation sketch of steps S401b to S403b follows).
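Steps S401b to S403b can be simulated in software as follows; the round-robin row assignment, the full-reuse broadcast of P, and the omission of the fixed-point conversion are assumptions of this sketch, which only mirrors the data movement.

```python
import numpy as np

def distributed_matmul(S, P, K):
    """Simulation of S401b-S403b: distribute the rows of S over K basic processing
    circuits, broadcast P, and gather the accumulated inner products."""
    M = S.shape[0]
    row_groups = [list(range(i, M, K)) for i in range(K)]  # rows Ai of circuit i
    result = np.zeros((M, P.shape[1]))
    for rows in row_groups:                # one pass per basic processing circuit
        Ai = S[rows]                       # S401b: rows kept in register/on-chip cache
        result[rows] = Ai @ P              # S402b broadcast of P, then the inner
                                           # products of every row of Ai with P
    return result                          # S403b: partial sums returned and merged

S, P = np.random.rand(8, 5), np.random.rand(5, 6)
assert np.allclose(distributed_matmul(S, P, K=3), S @ P)
```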
Referring to Fig. 4c, which is a schematic diagram of a matrix-multiply-vector operation: if the first operation is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation and the weight is the vector of the matrix-multiply-vector operation;
first complexity = β*F*G*F, where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector;
if the first complexity is greater than the preset threshold, it is determined whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, the first matrix is converted into floating-point data and the vector is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the first matrix and the vector in the floating-point data type.
Referring to Fig. 4d, Fig. 4d provides an implementation method of matrix-multiply-vector, which may specifically include:
Step S401: the data conversion computing circuit of the main processing circuit converts each row of data of matrix S into fixed-point data, and the control circuit of the main processing circuit distributes the rows to some of the K basic processing circuits; the basic processing circuits store the received distributed data in their on-chip caches and/or registers;
In an optional scheme, if the number of rows M of matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the S matrix to each of the K basic processing circuits;
in an optional scheme, if the number of rows M of matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
Let the set of rows of S distributed to the i-th basic processing circuit be Ai, with Mi rows in total; Fig. 2c shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example the i-th basic processing circuit, the received distributed data, such as matrix Ai, may be stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the amount of distributed data transmitted afterwards is reduced, the computational efficiency is improved, and the power consumption is reduced.
Step S402: the data type computing circuit of the main processing circuit converts vector P into fixed-point data, and the control circuit of the main processing circuit broadcasts the parts of the fixed-point vector P to the K basic processing circuits;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P only once to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of vector P obtained this time to complete the inner product operation corresponding to each row of matrix Ai; the advantage is that the amount of data of vector P transmitted repeatedly from the main processing circuit to the basic processing circuits is reduced, the execution efficiency is improved, and the transmission power consumption is reduced;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of vector P obtained each time, but completes the inner product operations corresponding to the rows of matrix Ai in batches; the advantage is that the amount of data of vector P transmitted in a single transmission inside a basic processing circuit is reduced, the capacity of the basic processing circuit's cache and/or register can be reduced, the execution efficiency is improved, the transmission power consumption is reduced, and the cost is reduced;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times to the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of vector P obtained each time to complete the inner product operations corresponding to the rows of matrix Ai; the advantage is that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data transmitted inside the basic processing circuits is also reduced, the execution efficiency is improved, and the transmission power consumption is reduced.
Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of matrix S and vector P; for example, the i-th basic processing circuit computes the inner products of the data of matrix Ai and the data of vector P;
Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain the accumulation results, and transmit the accumulation results back to the main processing circuit in fixed-point form.
In an optional scheme, each basic processing circuit may transmit the partial sums obtained from the inner product operations (a partial sum is a part of the accumulation result; for example, if the accumulation result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) back to the main processing circuit for accumulation; the advantage is that the amount of computation inside the basic processing circuit is reduced and the operational efficiency of the basic processing circuit is improved;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is completed; the advantage is that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, the operational efficiency is improved, and the data transmission power consumption is reduced;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation is completed; the advantage is that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, the operational efficiency is improved, the data transmission power consumption is reduced, the amount of computation inside the basic processing circuit is reduced, and the operational efficiency of the basic processing circuit is improved (a sketch contrasting the first two of these options follows).
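The partial-sum options above differ only in where the accumulation happens; the following sketch contrasts the first two for a single row's inner product, with the chunked computation being an assumption of the sketch.

```python
import numpy as np

def inner_product_send_each_partial(a, p, chunk):
    """First option: every partial sum is transmitted back, and the main
    processing circuit performs the accumulation."""
    total_at_main_circuit = 0.0
    for k in range(0, len(a), chunk):
        partial = float(a[k:k + chunk] @ p[k:k + chunk])   # e.g. F1*G1+F2*G2+F3*G3
        total_at_main_circuit += partial                   # one transmission per chunk
    return total_at_main_circuit

def inner_product_accumulate_locally(a, p, chunk):
    """Second option: partial sums stay in the basic processing circuit's
    register/on-chip cache; only the finished accumulation is transmitted back."""
    acc = 0.0
    for k in range(0, len(a), chunk):
        acc += float(a[k:k + chunk] @ p[k:k + chunk])
    return acc                                             # single transmission

a, p = np.arange(6.0), np.ones(6)
assert inner_product_send_each_partial(a, p, 2) == inner_product_accumulate_locally(a, p, 2)
```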
Neural network training method
All of the data involved in the neural network training process may use different data representation methods;
specifically, the data representation methods include but are not limited to the following cases:
floating-point numbers of different bit widths;
fixed-point numbers of different bit widths, and fixed-point numbers with different point positions;
at different moments of the training process (specifically, at different iteration numbers or at initialization), in different phases (i.e., the forward or backward operation), in different layers, for different data blocks within the same layer (i.e., the multiple input data blocks and output data blocks), or for the sub-blocks into which the same data block is partitioned, it is possible:
to use fixed point or floating point, respectively;
For fixed point:
to use different fixed-point bit widths;
to use different fixed-point offset values (namely point positions).
The following uses a practical example to illustrate the specific implementation method of neural network training. Fig. 1a is a specific computation schematic diagram of single-layer neural network training; as shown in Fig. 1a, the input data and the weights (or parameters) perform the operation of this layer. The technical solution provided by the embodiments of the present application determines, according to the input data, the weights, and the forward operation amount of this layer, whether to convert the types of the input data and the weights. A specific way may be: if the register or storage space occupied by storing the input data and the weights is greater than a preset threshold and the forward operation amount of this layer is greater than a preset operation amount, then, when the input data and the weight data are floating-point data, the input data and the weight data are converted into fixed-point data; if the register or storage space occupied by storing the input data and the weights is less than the preset threshold, then, when the input data and the weight data are fixed-point data, the input data and the weight data are converted into floating-point data before the operation of this layer is performed.
The principle of the above data type conversion in the present application is explained as follows. Fig. 1b shows a representation of fixed-point data. For a computing system, the storage bit count of one floating-point datum is 32 bits, whereas for fixed-point data, in particular data represented using the format shown in Fig. 1b, the storage bit count of one datum can be brought down to 16 bits or fewer. This conversion therefore greatly reduces the transmission overhead between calculators; in addition, for the calculator, the storage space of data with fewer bits is smaller, i.e., the storage overhead is smaller, and the amount of computation is also reduced, i.e., the computation overhead is reduced; thus both the computation overhead and the storage overhead can be reduced. The data type conversion itself, however, requires some overhead, referred to below as the conversion overhead. For data involving a large amount of computation and a large amount of data storage, the conversion overhead is almost negligible compared with the subsequent computation overhead, storage overhead, and transmission overhead, so for such data the present application adopts the technical solution of converting the data type into fixed-point data. Conversely, for data involving a small amount of computation and a small amount of data storage, the computation overhead, storage overhead, and transmission overhead are themselves already small; in this case, since the precision of fixed-point data is slightly lower than that of floating-point data, and since the required computation precision must still be guaranteed when the computation amount is small, the fixed-point data is converted into floating-point data, i.e., the precision of the computation is improved at the cost of a small added overhead.
The following illustrates with a practical example. As shown in Fig. 4e, the operation of this layer is matrix multiplication, and both the input data and the weights are matrices. For convenience of explanation, the input data here is taken as matrix I and the weights as matrix W; as shown in Fig. 4e, output data = matrix I * matrix W. If the sum of the numbers of columns and rows of matrix I and matrix W is large, it can be considered that matrix I and matrix W occupy too much space in memory and/or registers and that the computation amount is also large; in this case, if matrix I and matrix W are floating-point data, matrix I and matrix W are converted into fixed-point data before the matrix multiplication operation is performed.
For example, if matrix I is a 1000*1000 matrix and matrix W is also a 1000*1000 matrix, then the sum of the numbers of columns and rows is 2000, which is very large, and the corresponding computation amount is larger still: the multiplication operations of the inner products of the matrix-multiply-matrix operation amount to 10^9 operations. For this technical solution, since matrix I and matrix W are so large, it is impossible to transmit all the data at once, so the same data may be transmitted several times; if the transmission uses fixed-point data, the amount of data transmitted can be significantly reduced, thereby reducing the transmission overhead, and computing and storing with fewer bits likewise reduces the computation overhead and the storage overhead (a sketch of this decision rule follows).
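The decision rule of this example can be sketched as follows; the threshold values and bit counts are illustrative assumptions, not values given by the disclosure.

```python
def multiply_count(rows_i, cols_i, cols_w):
    # inner-product multiplications of matrix I (rows_i x cols_i) by W (cols_i x cols_w)
    return rows_i * cols_i * cols_w

def pick_type(storage_bits, op_count, storage_threshold, op_threshold):
    """Convert to fixed point only when both the storage footprint and the
    computation amount are large; otherwise keep (or restore) floating point."""
    if storage_bits > storage_threshold and op_count > op_threshold:
        return "fixed"   # conversion overhead is negligible at this scale
    return "float"       # small workload: keep the extra precision

ops = multiply_count(1000, 1000, 1000)       # 10**9 multiplications
bits = 2 * 1000 * 1000 * 32                  # two 32-bit float matrices
print(pick_type(bits, ops, storage_threshold=10**6, op_threshold=10**6))  # "fixed"
```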
For the technical solution of converting fixed-point data into floating-point data, take the backward operation as an example: as shown in Fig. 4g, the upward arrow direction in the computation structure represents a kind of backward operation. For the backward operation, the input is the output data gradient, and the output data gradient may specifically be obtained as follows: if the output data gradient belongs to the last layer computed in the current iteration, the output data gradient is obtained by performing a preset operation on the output data of the last layer computed in that iteration (the preset operation can be set by the manufacturer according to its needs, and the specific operation steps of the preset operation are not limited here); if the output data gradient does not belong to the last layer computed in the current iteration, for example if the output data gradient belongs to the n-th layer computed in that iteration, then the output data gradient is the input data gradient computed by the backward operation of the (n+1)-th layer.
The following illustrates with a practical example. As shown in Fig. 4g, the operation of this layer is matrix multiplication: the input data is a matrix and the weight is a scalar. For convenience of explanation, the input data here is taken as matrix I and the weight as scalar C; as shown in Fig. 4g, output data = matrix I * C. Since the weight is scalar data, the computation amount is small; in this case, if matrix I is fixed-point data, matrix I is converted into floating-point data before the matrix-multiply-scalar operation is performed.
For example, if matrix I is a 10*10 matrix, then the sum of the numbers of columns and rows is 20, which is small (here a value greater than 100 is assumed to be considered large and a value less than 100 small; the value 100 can be set arbitrarily by those skilled in the art), and the corresponding computation amount is also very small: the multiplication operations of the inner products of the matrix-multiply-scalar operation amount to 10^2 operations. Since the computation amount is small, continuing to compute with fixed-point data would affect the precision; in order to achieve higher computation precision under the premise of a small computation amount, the computation is performed with floating-point data to improve the precision.
In an optional scheme, each data block of each layer in the network may use a fixed fixed-point bit width, but with a point position that changes with the training iteration cycle;
specifically, during training, the data representation method of a data block can be set as follows:
specifically, when training starts, an arbitrary data representation method may be selected for a data block;
in an optional scheme, a floating-point representation method of a specific bit width may be selected;
in an optional scheme, a fixed-point representation method of a specific form may be selected:
a specific fixed-point bit width may be selected;
a specific point position may be selected;
in an optional scheme, the point position may be set according to the maximum absolute value of all the data in the data block;
in an optional scheme, the point position may be set according to the minimum absolute value of all the data in the data block;
in an optional scheme, at initialization, the point position of this data block may be determined according to the point positions of other data blocks;
in an optional scheme, the point position of this data block may be set according to empirical values (a sketch of the maximum-absolute-value option follows this list).
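For the maximum-absolute-value option, a sketch under the assumption of a 16-bit word with 1 sign bit; the conservative rounding of the integer-bit count is an implementation choice of this sketch, not taken from the disclosure.

```python
import math

def init_point_location(block, bit_width=16):
    """Choose the number of fractional bits so that the maximum absolute value
    in the data block fits into the bit_width-1 magnitude bits (1 bit is sign)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return bit_width - 1                  # all-zero block: any position fits
    integer_bits = max(0, math.ceil(math.log2(max_abs + 1)))
    return (bit_width - 1) - integer_bits     # remaining bits hold the fraction

print(init_point_location([0.02, -1.7, 3.9]))  # 3 integer bits cover |3.9| -> 12
```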
Specifically, during training, the data representation method of a data block can be changed in any iteration cycle:
in an optional scheme, no adjustment may be made for a certain data block;
in an optional scheme, an adjustment may be made every certain number of iterations;
in an optional scheme, an adjustment may be made every certain number of training epochs;
in an optional scheme, an adjustment may be made at non-fixed iteration-number intervals;
in an optional scheme, an adjustment may be made at non-fixed training-epoch intervals;
specifically, during training, when the representation method of a data block is adjusted, it can be adjusted to an arbitrary data representation method:
in an optional scheme, if a data block is represented with fixed-point numbers of a fixed fixed-point bit width, the point position of the data representation may be adjusted as follows (a sketch follows this list):
in an optional scheme, the point position is set each time according to the setting method used to initialize the point position;
in an optional scheme, if the point position of a data block computed according to the initial point-position setting method increased in some iteration cycle compared with the previous iteration cycle, the point position in this cycle is changed in the increasing direction; conversely, it is changed in the decreasing direction.
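A sketch of the last adjustment rule: the point position computed by the initial setting method is tracked across iteration cycles, and the stored position is nudged in the direction of the change; the unit step size is an assumption of this sketch.

```python
def adjust_point_location(current, computed_prev_cycle, computed_this_cycle):
    """If the point position computed by the initial setting method rose relative
    to the previous iteration cycle, move the stored position in the increasing
    direction; if it fell, move in the decreasing direction (illustrative)."""
    if computed_this_cycle > computed_prev_cycle:
        return current + 1
    if computed_this_cycle < computed_prev_cycle:
        return current - 1
    return current
```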
The present disclosure also provides an integrated circuit chip device for performing the training of a neural network, the neural network including multiple layers; the integrated circuit chip device includes a processing circuit and an external interface;
the external interface is configured to receive a training instruction;
the processing circuit is configured to determine first-layer input data and first-layer weight data according to the training instruction, and to perform the n-layer forward operation of the neural network with the first-layer input data and the first-layer weight data to obtain the n-th output result;
the processing circuit is further configured to obtain the n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, the n-th backward operation of the n-th-layer backward operation; to obtain the n-th backward computational complexity according to the n-th output result gradient, the n-th-layer input data, the n-th-layer weight group data, and the n-th backward operation; to determine, according to the n-th backward computational complexity, the n-th backward data type corresponding to the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data; and to perform the n-layer backward operation of the neural network on the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data in the n-th backward data type to obtain the n weight gradients of the n-layer operation; the n-th backward data type includes the fixed-point type or the floating-point type;
the processing circuit is further configured to update the n weights of the n-layer operation using the n weight gradients.
The present disclosure also discloses a neural network computing device, which includes one or more of the chips shown in Fig. 3a or Fig. 3b, for acquiring data to be operated on and control information from other processing devices, performing the specified neural network operations, and passing the execution results to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one chip as shown in Fig. 3a or Fig. 3b is included, the chips can be linked through a specific structure and transmit data among themselves, for example interconnected through a PCIE bus and transmitting data, to support operations of larger-scale neural networks. In this case, the chips may share the same control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection manner can be any interconnection topology.
The neural network computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
The present disclosure also discloses a combined processing device, which includes the above neural network computing device, a universal interconnection interface, and other processing devices (i.e., general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operations specified by the user. Fig. 5a is a schematic diagram of the combined processing device.
The other processing devices include one or more of general-purpose/special-purpose processor types such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, including data transfer, and perform basic controls of the neural network computing device such as starting and stopping; the other processing devices can also cooperate with the neural network computing device to jointly complete computational tasks.
The universal interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device acquires the required input data from the other processing devices and writes it into the on-chip storage device of the neural network computing device; it can acquire control instructions from the other processing devices and write them into the on-chip control cache of the neural network computing device; it can also read the data in the storage module of the neural network computing device and transmit it to the other processing devices.
As shown in Fig. 5b, the structure optionally further includes a storage device for storing the data required by this arithmetic unit/arithmetic device or by other arithmetic units, and it is particularly suitable for data whose required operations cannot be fully stored in the internal storage of this neural network computing device or of the other processing devices.
The combined processing device can serve as the SoC (system on chip) of devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected with certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
Referring to Fig. 5c, Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5c, the above neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11; optionally, as shown in Fig. 5d, the above neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 involved in the present disclosure is not limited; the above neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the above chip may be made of silicon material, germanium material, quantum material, molecular material, or the like. According to the actual situation (for example, a harsher environment) and different application demands, the above neural network chip may be packaged such that most of the neural network chip is enclosed and the pins on the neural network chip are connected to the outside of the package structure through conductors such as gold wires, for circuit connection with outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the devices shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; they may be printed circuit boards (PCB) or printed wiring boards (PWB), or possibly other circuit boards. The material used to make the PCB is also not limited.
The second substrate 113 involved in the present disclosure is used to carry the above neural network chip 111, and the neural network chip package structure 11 obtained by connecting the above neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used to protect the neural network chip 111, facilitating the further packaging of the neural network chip package structure 11 with the first substrate 13.
Electrical for above-mentioned specific second and non-electrical attachment device 112 the corresponding knot of packaged type and packaged typeStructure is not construed as limiting, and can be selected suitable packaged type with different application demands according to the actual situation and simply be improved,Such as: flip chip ball grid array encapsulates (Flip Chip Ball Grid Array Package, FCBGAP), slim four directionsFlat type packaged (Low-profile Quad Flat Package, LQFP), the quad flat package (Quad with radiatorFlat Package with Heat sink, HQFP), without pin quad flat package (Quad Flat Non-leadPackage, QFN) or the encapsulation side small spacing quad flat formula encapsulation (Fine-pitch Ball Grid Package, FBGA) etc.Formula.
Flip chip (Flip Chip) is suitable for cases where the area requirement after packaging is high, or where there is sensitivity to wire inductance or signal transmission time. In addition, the packaging method of wire bonding (Wire Bonding) may be used, reducing cost and improving the flexibility of the packaging structure.
Ball grid array (Ball Grid Array) can provide more pins, and the average wire length of the pins is short, giving it the capability of high-speed signal transmission; the packaging may alternatively be replaced by pin grid array (Pin Grid Array, PGA), zero insertion force (Zero Insertion Force, ZIF), single edge contact connection (Single Edge Contact Connection, SECC), land grid array (Land Grid Array, LGA), or the like.
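As an aide-mémoire, the packaging options named in the two preceding paragraphs can be collected into a single illustrative C enumeration; the enumerator names below are hypothetical shorthand for the acronyms above and carry no meaning beyond this sketch.

```c
/* Illustrative enumeration of the packaging options named above.
 * Enumerator names are hypothetical shorthand, not disclosed identifiers. */
typedef enum {
    PKG_FCBGA,  /* flip chip ball grid array (FCBGAP) */
    PKG_LQFP,   /* low-profile quad flat package */
    PKG_HQFP,   /* quad flat package with heat sink */
    PKG_QFN,    /* quad flat no-lead package */
    PKG_FBGA,   /* fine-pitch ball grid array */
    PKG_PGA,    /* pin grid array */
    PKG_ZIF,    /* zero insertion force socket */
    PKG_SECC,   /* single edge contact connection */
    PKG_LGA     /* land grid array */
} PackageType;
```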
Optionally, the neural network chip 111 and the second substrate 113 are packaged using the flip chip ball grid array (Flip Chip Ball Grid Array) packaging method; a schematic diagram of the specific neural network chip packaging structure may refer to Fig. 6. As shown in Fig. 6, the above neural network chip packaging structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, thereby connecting the neural network chip 21 and the second substrate 24, that is, realizing the packaging of the neural network chip 21.
The pins 26 are used to connect to an external circuit of the packaging structure (for example, the first substrate 13 on the neural network processor board 10), enabling the transmission of external data and internal data, and facilitating data processing by the neural network chip 21 or the neural network processor corresponding to the neural network chip 21. The present disclosure also does not limit the type and number of the pins; different pin forms may be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the above neural network chip packaging structure further includes an insulating filler, placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls.
The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the above neural network chip packaging structure further includes a heat dissipation device for dissipating heat generated when the neural network chip 21 operates. The heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a radiator, for example, a fan.
For example, as shown in Fig. 6a, the neural network chip packaging structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate heat generated when the neural network chip 21 operates.
Optionally, the above neural network chip packaging structure 11 further includes a reinforcing structure, connected to the pads 22 and embedded in the solder balls 23, to enhance the bonding strength between the solder balls 23 and the pads 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.
The present disclosure also does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip packaging structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by means of connecting wires or plugging, facilitating subsequent replacement of the first substrate 13 or the neural network chip packaging structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit to extend storage capacity, for example: synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM), double data rate synchronous dynamic random access memory (Double Date Rate SDRAM, DDR), etc.; extending the memory improves the processing capability of the neural network processor.
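To illustrate why extended DDR memory can raise processing capability, the following is a small worked example in C computing peak memory bandwidth; the clock rate and bus width are assumed values chosen for illustration only, not parameters from the present disclosure.

```c
#include <stdio.h>

/* Peak DDR bandwidth = clock rate x 2 transfers per cycle x bus width in bytes.
 * The figures below (800 MHz clock, 64-bit bus) are assumptions for illustration. */
int main(void) {
    double clock_hz  = 800e6;                  /* assumed DDR clock rate */
    double bus_bytes = 64 / 8;                 /* assumed 64-bit bus width, in bytes */
    double peak = clock_hz * 2.0 * bus_bytes;  /* double data rate: 2 transfers/cycle */
    printf("peak bandwidth: %.1f GB/s\n", peak / 1e9);  /* prints 12.8 GB/s here */
    return 0;
}
```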
The first substrate 13 may also include a peripheral component interconnect express (Peripheral Component Interconnect-Express, PCI-E or PCIe) interface, a small form-factor pluggable (Small Form-factor Pluggable, SFP) interface, an Ethernet interface, a controller area network (Controller Area Network, CAN) bus interface, etc., for data transmission between the packaging structure and an external circuit, which can improve the operation speed and the convenience of operation.
The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip packaging structure 11, and the neural network chip packaging structure 11 is packaged as the neural network processor board 10. Data interaction with an external circuit (for example, a computer motherboard) is performed through an interface on the board (a slot or a locking pin); that is, the functions of the neural network processor are realized directly by using the neural network processor board 10, and the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, improving the application range and operational efficiency of the neural network processor.
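A minimal host-side sketch of such data interaction through the PCIe slot follows, assuming a Linux host where the board's driver exposes a hypothetical character device /dev/nnp0 whose first memory region can be mapped into user space; the device path, window size, and layout are all invented for illustration and do not describe the disclosed board.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hedged sketch: copy an input buffer to a board mapped over PCIe.
 * /dev/nnp0 and the 4 KiB window are hypothetical; a real driver
 * defines its own device node, region layout, and doorbell protocol. */
int main(void) {
    int fd = open("/dev/nnp0", O_RDWR);          /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    size_t win = 4096;                           /* assumed mappable window size */
    void *bar = mmap(NULL, win, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    float input[16] = {0};                       /* example input data */
    memcpy(bar, input, sizeof input);            /* host -> board transfer */

    munmap(bar, win);
    close(fd);
    return 0;
}
```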
In one embodiment, the present disclosure discloses an electronic device, including the above neural network processor board 10 or the neural network chip packaging structure 11.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.
The specific embodiments described above further describe the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the foregoing are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.