Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a PSoC-based convolutional neural network accelerator in which the programmable-logic part of the accelerator hardware can be reduced to a multiply-add calculation module, an activation function module, a maximum pooling module and an average pooling module. The multiply-add operations in all of the multiply-add calculation modules are computed in parallel, and convolution calculations with different convolution kernel sizes are supported, which addresses the large computation amount and large bandwidth requirement of a convolutional neural network. The software part implements the softmax classifier, the non-maximum suppression algorithm and the image processing algorithms that cannot be realized in hardware logic, and handles the configuration of convolutional neural networks with different network structures.
The purpose of the invention is achieved by the following technical scheme: a PSoC-based convolutional neural network accelerator, comprising: an off-chip memory, a CPU, a feature map input memory, a feature map output memory, an offset memory, a weight memory, a direct memory access DMA, and calculation units equal in number to the neurons,
wherein the direct memory access DMA, under the control of the CPU, reads data from the off-chip memory and transfers it to the feature map input memory, the offset memory and the weight memory, or writes data of the feature map output memory back to the off-chip memory, and the CPU controls the storage locations of the input feature map, the offset, the weight and the output feature map in the off-chip memory, as well as the parameter transfer of the multilayer convolutional neural network, so as to adapt to neural networks with various architectures.
Further, the calculation unit comprises a first-in first-out queue, a state machine, a first data selector, a second data selector, an average pooling module, a maximum pooling module, a multiply-add calculation module and an activation function module,
wherein the first data selector communicates with the feature map input memory, and input feature map data is fed to the average pooling module, the maximum pooling module, the multiply-add calculation module and the activation function module via the first data selector,
and the second data selector communicates with the feature map output memory, the output results of the average pooling module, the maximum pooling module and the multiply-add calculation module being selectively output to the feature map output memory through the second data selector.
Further, the multiply-add calculation module is based on a structure combining a multiply-add tree with multiply-add registers, and operates on an input feature map matrix, a weight input matrix and a bias matrix.
Further, the activation function module includes a first configuration register, a first selector, a first multiplier and a first adder, and is configured to implement a hyperbolic tangent (tanh) function, a sigmoid function and a ReLU function; the CPU writes the first configuration register of the activation function module so that the selected activation function is realized in hardware logic.
Further, the average pooling module comprises a second configuration register, a second multiplier and a second adder; the CPU configures the average pooling module to perform average pooling over a matrix and obtain the matrix average.
Further, the maximum pooling module comprises a third configuration register, a comparator and a second selector; the CPU configures the maximum pooling module to perform maximum pooling over a matrix, in which each datum in the matrix is compared to obtain the maximum value.
Compared with the prior art, the invention has the following advantages and effects: the CPU controls data storage allocation and data transfer for the whole convolutional neural network; the data selectors, under the control of the state machine, distribute data to the multiply-add calculation module, the activation function module, the maximum pooling module and the average pooling module; meanwhile, the CPU executes algorithms such as image processing, the softmax classifier and the non-maximum suppression algorithm.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example one
To cope with the large computation amount of a convolutional neural network, increase parallel processing efficiency and reduce the bandwidth requirement, the present invention provides the PSoC-based convolutional neural network accelerator 100 shown in fig. 1, which includes: an off-chip memory 101, a CPU 102, a feature map input memory 103, a feature map output memory 104, an offset memory 105, a weight memory 106, a direct memory access DMA 107, and calculation units 108 equal in number to the neurons.
The direct memory access DMA 107, under the control of the CPU 102, reads data from the off-chip memory 101 and transfers it to the feature map input memory 103, the offset memory 105 and the weight memory 106, or writes data of the feature map output memory 104 back to the off-chip memory 101. The CPU 102 controls the storage locations of the input feature map, the offset, the weight and the output feature map in the off-chip memory, as well as the parameter transfer of the multilayer convolutional neural network, so as to adapt to neural networks of various architectures.
The calculation unit 108 includes a first-in first-out queue, a state machine 109, a first data selector 110, a second data selector 111, an average pooling module 112, a maximum pooling module 113, a multiply-add calculation module 114, and an activation function module 115. The first data selector 110 communicates with the feature map input memory 103, and input feature map data is fed to the average pooling module 112, the maximum pooling module 113, the multiply-add calculation module 114 and the activation function module 115 through the first data selector 110. The second data selector 111 communicates with the feature map output memory 104, and the output results of the average pooling module 112, the maximum pooling module 113 and the multiply-add calculation module 114 are selected and output to the feature map output memory 104 through the second data selector 111.
As shown in fig. 2, the multiply-add calculation module is based on a structure combining a multiply-add tree with multiply-add registers, and operates on an input feature map matrix, a weight input matrix and a bias matrix. This structure completes the convolution operation in parallel and efficiently, and does not reduce multiplier utilization when convolution kernels of different sizes are implemented.
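By way of illustration, the following is a simplified C model of the multiply-add-tree computation for a single k × k convolution window; in hardware the k × k products are formed in parallel and reduced by the adder tree, which the sequential loop below only approximates, and the scheduling that keeps every multiplier busy across kernel sizes is not modelled.

```c
#include <stdio.h>

/* Simplified software model of the multiply-add tree for one k x k
 * convolution window: in hardware, the k*k products are formed in
 * parallel and reduced by an adder tree before the bias is added. */
int mac_tree(const int *feature, const int *weight, int bias, int k)
{
    int sum = bias;                       /* bias enters at the tree root  */
    for (int i = 0; i < k * k; i++)       /* one multiplier per kernel tap */
        sum += feature[i] * weight[i];    /* product feeds the adder tree  */
    return sum;
}

int main(void)
{
    /* 3x3 window of ones, kernel of twos, bias 1: 9*2 + 1 = 19 */
    int f[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    int w[9] = {2, 2, 2, 2, 2, 2, 2, 2, 2};
    printf("%d\n", mac_tree(f, w, 1, 3));
    return 0;
}
```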
As shown in fig. 3, the activation function module includes a first configuration register, a first selector, a first multiplier and a first adder, and is configured to implement a hyperbolic tangent (tanh) function, a sigmoid function and a ReLU function; the CPU writes the first configuration register of the activation function module so that the selected activation function is realized in hardware logic.
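A minimal software model of the function selection is given below. The encoding of the register field (0/1/2) is an assumption, as the patent states only that the CPU writes the first configuration register to choose the function; the hardware realizes tanh and sigmoid with the first multiplier and adder (for example as a piecewise approximation), whereas the model uses library math functions.

```c
#include <math.h>
#include <stdio.h>

/* Software model of the activation function module. The field values
 * below are hypothetical; only the register-selects-function behaviour
 * is taken from the description. Link with -lm. */
enum { ACT_RELU = 0, ACT_SIGMOID = 1, ACT_TANH = 2 };

static unsigned first_config_register;   /* stands in for the hardware register */

float activate(float x)
{
    switch (first_config_register) {
    case ACT_SIGMOID: return 1.0f / (1.0f + expf(-x));
    case ACT_TANH:    return tanhf(x);
    default:          return x > 0.0f ? x : 0.0f;   /* ReLU */
    }
}

int main(void)
{
    first_config_register = ACT_RELU;    /* CPU configures the module */
    printf("ReLU(-2.5) = %f\n", activate(-2.5f));
    return 0;
}
```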
As shown in fig. 4, the average pooling module includes a second configuration register, a second multiplier and a second adder. The CPU configures the average pooling module, with the value m being configurable, so that m × m average pooling is realized and the average of an m × m matrix is obtained.
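The window computation can be modelled as below: the second adder accumulates the m × m window and the second multiplier scales the sum by 1/m², which is one plausible reading of the multiplier-plus-adder structure in fig. 4.

```c
#include <stdio.h>

/* Software model of one m x m average pooling window: the CPU writes m
 * into the second configuration register; here m is a plain parameter. */
float avg_pool_window(const float *window, int m)
{
    float sum = 0.0f;
    for (int i = 0; i < m * m; i++)      /* second adder accumulates          */
        sum += window[i];
    return sum * (1.0f / (m * m));       /* second multiplier scales by 1/m^2 */
}

int main(void)
{
    float w[4] = {1.0f, 2.0f, 3.0f, 4.0f};        /* one 2x2 window */
    printf("avg = %f\n", avg_pool_window(w, 2));  /* prints 2.5     */
    return 0;
}
```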
As shown in fig. 5, the maximum pooling module includes a third configuration register, a comparator and a second selector. The CPU configures the maximum pooling module, with the value k being configurable, so that k × k maximum pooling is realized: each datum in the k × k matrix is compared to obtain the maximum value.
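The comparator/selector pair described for fig. 5 keeps the larger of the running maximum and each incoming value, which a software model captures directly:

```c
#include <stdio.h>

/* Software model of one k x k maximum pooling window: k comes from the
 * third configuration register; here k is a plain parameter. */
int max_pool_window(const int *window, int k)
{
    int max = window[0];
    for (int i = 1; i < k * k; i++)                 /* comparator compares...     */
        max = (window[i] > max) ? window[i] : max;  /* ...second selector chooses */
    return max;
}

int main(void)
{
    int w[4] = {3, 7, 1, 5};                        /* one 2x2 window */
    printf("max = %d\n", max_pool_window(w, 2));    /* prints 7       */
    return 0;
}
```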
Example two
Correspondingly, with reference to fig. 6, the invention further describes the flow of a convolutional neural network calculation performed by the PSoC-based convolutional neural network accelerator.
The CPU is programmed in embedded software; the construction of the deep convolutional neural network is realized in this software, and the network parameters are loaded into the relevant processor, which transmits command values to the control registers through bus configuration.
An example configuration command is as follows:
For the first layer, the inputs are x1 input feature map data and x3 weight data, and the calculation results are passed through the maximum pooling module and the activation function module to obtain x2 output feature map data.
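A minimal sketch of how the embedded software might issue such configuration commands is given below. The base address, register offsets, names and field values are all hypothetical, since the patent states only that command values are written to control registers over the bus.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register writes for one layer's
 * configuration; addresses and layout are assumptions for illustration. */
#define ACCEL_BASE       0x40000000u            /* hypothetical base address */
#define REG_KERNEL_SIZE  (ACCEL_BASE + 0x00)
#define REG_STRIDE       (ACCEL_BASE + 0x04)
#define REG_ACT_FUNC     (ACCEL_BASE + 0x08)
#define REG_POOL_SIZE    (ACCEL_BASE + 0x0C)

static inline void reg_write(uintptr_t addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;         /* bus write to control register */
}

void configure_first_layer(void)
{
    reg_write(REG_KERNEL_SIZE, 3);              /* 3x3 convolution kernel  */
    reg_write(REG_STRIDE,      1);              /* convolution step size 1 */
    reg_write(REG_ACT_FUNC,    0);              /* e.g. select ReLU        */
    reg_write(REG_POOL_SIZE,   2);              /* 2x2 maximum pooling     */
}
```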
The output feature maps of the convolutional layers are stored in the off-chip memory in alternating address spaces. For layer M, where M takes the values 1, 3, 5, 7, ..., the output feature map of layer M is the input feature map of layer M+1; the output feature map of layer M is stored in the address space starting at address A1, and the output feature map of layer M+1 is stored in the address space starting at address A2.
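This alternating (ping-pong) scheme can be modelled as below; the concrete values of A1 and A2 are placeholders, as the patent does not fix them.

```c
#include <stdint.h>

/* Ping-pong storage model: odd-numbered layers write their output
 * feature maps starting at A1, even-numbered layers at A2, so the
 * output of layer M is read back as the input of layer M+1. */
#define ADDR_A1 0x10000000u               /* hypothetical start address A1 */
#define ADDR_A2 0x18000000u               /* hypothetical start address A2 */

uint32_t output_base(int layer)           /* layer = M, counted from 1 */
{
    return (layer % 2 == 1) ? ADDR_A1 : ADDR_A2;
}

uint32_t input_base(int layer)            /* valid for layer >= 2; layer 1 */
{                                         /* reads the stored sample data  */
    return output_base(layer - 1);        /* layer M+1 reads where M wrote */
}
```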
In a particular application, the computations within the convolutional neural network layers are performed in parallel. The whole network implementation process is as follows:
(1) The software of the processor 102 controls the image processing, and the sample data is stored in the off-chip memory 101;
(2) The processor 102 controls the DMA 107 to read off-chip memory data into the first data selector 110, while the multiply-add calculation module 114, the average pooling module 112, the maximum pooling module 113, the activation function module 115 and the state machine 109 are configured by the processor 102. The configuration information includes, but is not limited to, the convolution step size, the convolution kernel size, the activation function type, the average pooling size and the maximum pooling block size.
(3) Data is transferred from the DMA to the feature map input memory 103, the offset memory 105 and the weight memory 106 under the control of the state machine 109.
(4) The data is input into the multiply-add calculation module 114, the activation function module 115, the average pooling module 112 or the maximum pooling module 113 to obtain the calculation result.
(5) Under the control of the state machine, data is transferred from the multiply-add calculation module 114, the activation function module 115, the average pooling module 112 or the maximum pooling module 113 to the second data selector 111 and written back to the off-chip memory 101.
At this point, one layer of the network is complete; the network completes all of its layers by repeating this process, as sketched below.
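The per-layer loop implied by steps (1) to (5) might look as follows in the embedded software; every function below is a placeholder stub for illustration, since the patent specifies only the ordering of the operations, not these names.

```c
#include <stdio.h>

/* Hypothetical outline of the control flow in steps (1)-(5); all
 * function names are placeholder stubs for illustration. */
typedef struct {
    int kernel_size, stride, act_func, pool_size;  /* per-layer configuration */
} layer_cfg_t;

static void store_samples_to_offchip(void)  { puts("(1) samples stored"); }
static void configure_accelerator(const layer_cfg_t *c)
                                            { printf("(2) k=%d configured\n", c->kernel_size); }
static void dma_load_layer_inputs(int m)    { printf("(3) layer %d loaded\n", m); }
static void wait_for_computation_done(void) { puts("(4) computation done"); }
static void dma_store_layer_outputs(int m)  { printf("(5) layer %d written back\n", m); }

void run_network(const layer_cfg_t *layers, int num_layers)
{
    store_samples_to_offchip();              /* (1) CPU image processing            */
    for (int m = 0; m < num_layers; m++) {   /* repeat once per network layer       */
        configure_accelerator(&layers[m]);   /* (2) program modules + state machine */
        dma_load_layer_inputs(m);            /* (3) fill feature/offset/weight memories */
        wait_for_computation_done();         /* (4) multiply-add/activation/pooling */
        dma_store_layer_outputs(m);          /* (5) write results back off chip     */
    }
}

int main(void)
{
    layer_cfg_t layers[2] = {{3, 1, 0, 2}, {5, 1, 0, 2}};
    run_network(layers, 2);
    return 0;
}
```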
In summary, the programmable-logic part of the whole convolutional neural network accelerator hardware can be reduced to a multiply-add calculation module, an activation function module, a maximum pooling module and an average pooling module. The multiply-add operations in all of the multiply-add calculation modules are computed in parallel, and convolution calculations with different convolution kernel sizes as well as pooling calculations of different sizes are supported. The softmax classifier and the non-maximum suppression algorithm, which cannot be realized in hardware logic, are implemented in the CPU software of the accelerator, and the convolutional neural network calculation is completed by configuring convolutional neural networks with different network structures. This addresses the large computation amount and large bandwidth requirement of a convolutional neural network, and allows convolutional neural network algorithms of different structures to be configured.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is intended to be included within the scope of the present invention.