Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a PSoC-based convolutional neural network accelerator in which the programmable-logic part of the accelerator hardware can be reduced to a multiply-add calculation module, an activation function module, a maximum pooling module and an average pooling module. The multiply-add operations in all of the multiply-add calculation modules are computed in parallel, and convolution calculations with different convolution kernel sizes are supported, which addresses the large computation amount and large bandwidth requirement of a convolutional neural network. The software part implements the softmax classifier, the non-maximum suppression algorithm and the image processing algorithms that cannot be realized in hardware logic, and handles the configuration of convolutional neural networks with different network structures.
The purpose of the invention is achieved by the following technical scheme: a PSoC-based convolutional neural network accelerator, comprising: an off-chip memory, a CPU, a feature map input memory, a feature map output memory, an offset memory, a weight memory, a direct memory access DMA, and calculation units equal in number to the neurons,
wherein the direct memory access DMA, under the control of the CPU, reads data from the off-chip memory and transfers it to the feature map input memory, the offset memory and the weight memory, or writes data of the feature map output memory back to the off-chip memory, and the CPU controls the storage locations of the input feature map, the offset, the weight and the output feature map in the off-chip memory, as well as the parameter transfer of the multilayer convolutional neural network, so as to adapt to neural networks with various architectures.
Further, the calculation unit comprises a first-in first-out queue, a state machine, a first data selector, a second data selector, an average pooling module, a maximum pooling module, a multiply-add calculation module and an activation function module,
wherein the first data selector communicates with the feature map input memory, and input feature map data is fed to the average pooling module, the maximum pooling module, the multiply-add calculation module and the activation function module via the first data selector,
and the second data selector communicates with the feature map output memory, the output results of the average pooling module, the maximum pooling module and the multiply-add calculation module being selectively output to the feature map output memory through the second data selector.
Further, the multiply-add calculation module is based on a structure combining a multiply-add tree with multiply-add registers, and operates on an input feature map matrix, a weight input matrix and a bias matrix.
Further, the activation function module includes a first configuration register, a first selector, a first multiplier and a first adder, and is configured to implement a hyperbolic tangent (tanh) function, a sigmoid function and a ReLU function; the CPU writes the first configuration register of the activation function module so that the selected activation function is realized in hardware logic.
Further, the average pooling module comprises a second configuration register, a second multiplier and a second adder; the CPU configures the average pooling module to perform average pooling over a matrix and obtain the matrix average.
Further, the maximum pooling module comprises a third configuration register, a comparator and a second selector; the CPU configures the maximum pooling module to perform maximum pooling over a matrix, in which each datum in the matrix is compared to obtain the maximum value.
Compared with the prior art, the invention has the following advantages and effects: the CPU controls data storage allocation and data transfer for the whole convolutional neural network; the data selectors, under the control of the state machine, distribute data to the multiply-add calculation module, the activation function module, the maximum pooling module and the average pooling module; meanwhile, the CPU executes algorithms such as image processing, the softmax classifier and the non-maximum suppression algorithm.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example one
To cope with the large computation amount of a convolutional neural network, increase parallel processing efficiency and reduce the bandwidth requirement, the present invention provides the PSoC-based convolutional neural network accelerator 100 shown in fig. 1, which includes: an off-chip memory 101, a CPU 102, a feature map input memory 103, a feature map output memory 104, an offset memory 105, a weight memory 106, a direct memory access DMA 107, and calculation units 108 equal in number to the neurons.
The direct memory access DMA 107, under the control of the CPU 102, reads data from the off-chip memory 101 and transfers it to the feature map input memory 103, the offset memory 105 and the weight memory 106, or writes data of the feature map output memory 104 back to the off-chip memory 101. The CPU 102 controls the storage locations of the input feature map, the offset, the weight and the output feature map in the off-chip memory, as well as the parameter transfer of the multilayer convolutional neural network, so as to adapt to neural networks of various architectures.
The calculation unit 108 includes a first-in first-out queue, a state machine 109, a first data selector 110, a second data selector 111, an average pooling module 112, a maximum pooling module 113, a multiply-add calculation module 114, and an activation function module 115. The first data selector 110 communicates with the feature map input memory 103, and input feature map data is fed to the average pooling module 112, the maximum pooling module 113, the multiply-add calculation module 114 and the activation function module 115 through the first data selector 110. The second data selector 111 communicates with the feature map output memory 104, and the output results of the average pooling module 112, the maximum pooling module 113 and the multiply-add calculation module 114 are selected and output to the feature map output memory 104 through the second data selector 111.
As shown in fig. 2, the multiply-add calculation module is based on a structure combining a multiply-add tree with multiply-add registers, and operates on an input feature map matrix, a weight input matrix and a bias matrix. This structure completes the convolution operation in parallel and efficiently, and does not reduce multiplier utilization when convolution kernels of different sizes are implemented.
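By way of illustration, the following is a simplified C model of the multiply-add-tree computation for a single k × k convolution window; in hardware the k × k products are formed in parallel and reduced by the adder tree, which the sequential loop below only approximates, and the scheduling that keeps every multiplier busy across kernel sizes is not modelled.

```c
#include <stdio.h>

/* Simplified software model of the multiply-add tree for one k x k
 * convolution window: in hardware, the k*k products are formed in
 * parallel and reduced by an adder tree before the bias is added. */
int mac_tree(const int *feature, const int *weight, int bias, int k)
{
    int sum = bias;                       /* bias enters at the tree root  */
    for (int i = 0; i < k * k; i++)       /* one multiplier per kernel tap */
        sum += feature[i] * weight[i];    /* product feeds the adder tree  */
    return sum;
}

int main(void)
{
    /* 3x3 window of ones, kernel of twos, bias 1: 9*2 + 1 = 19 */
    int f[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    int w[9] = {2, 2, 2, 2, 2, 2, 2, 2, 2};
    printf("%d\n", mac_tree(f, w, 1, 3));
    return 0;
}
```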
As shown in fig. 3, the activation function module includes a first configuration register, a first selector, a first multiplier and a first adder, and is configured to implement a hyperbolic tangent (tanh) function, a sigmoid function and a ReLU function; the CPU writes the first configuration register of the activation function module so that the selected activation function is realized in hardware logic.
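A minimal software model of the function selection is given below. The encoding of the register field (0/1/2) is an assumption, as the patent states only that the CPU writes the first configuration register to choose the function; the hardware realizes tanh and sigmoid with the first multiplier and adder (for example as a piecewise approximation), whereas the model uses library math functions.

```c
#include <math.h>
#include <stdio.h>

/* Software model of the activation function module. The field values
 * below are hypothetical; only the register-selects-function behaviour
 * is taken from the description. Link with -lm. */
enum { ACT_RELU = 0, ACT_SIGMOID = 1, ACT_TANH = 2 };

static unsigned first_config_register;   /* stands in for the hardware register */

float activate(float x)
{
    switch (first_config_register) {
    case ACT_SIGMOID: return 1.0f / (1.0f + expf(-x));
    case ACT_TANH:    return tanhf(x);
    default:          return x > 0.0f ? x : 0.0f;   /* ReLU */
    }
}

int main(void)
{
    first_config_register = ACT_RELU;    /* CPU configures the module */
    printf("ReLU(-2.5) = %f\n", activate(-2.5f));
    return 0;
}
```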
As shown in fig. 4, the average pooling module includes a second configuration register, a second multiplier and a second adder. The CPU configures the average pooling module, with the value m being configurable, so that m × m average pooling is realized and the average of an m × m matrix is obtained.
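The window computation can be modelled as below: the second adder accumulates the m × m window and the second multiplier scales the sum by 1/m², which is one plausible reading of the multiplier-plus-adder structure in fig. 4.

```c
#include <stdio.h>

/* Software model of one m x m average pooling window: the CPU writes m
 * into the second configuration register; here m is a plain parameter. */
float avg_pool_window(const float *window, int m)
{
    float sum = 0.0f;
    for (int i = 0; i < m * m; i++)      /* second adder accumulates          */
        sum += window[i];
    return sum * (1.0f / (m * m));       /* second multiplier scales by 1/m^2 */
}

int main(void)
{
    float w[4] = {1.0f, 2.0f, 3.0f, 4.0f};        /* one 2x2 window */
    printf("avg = %f\n", avg_pool_window(w, 2));  /* prints 2.5     */
    return 0;
}
```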
As shown in fig. 5, the maximum pooling module includes a third configuration register, a comparator and a second selector. The CPU configures the maximum pooling module, with the value k being configurable, so that k × k maximum pooling is realized: each datum in the k × k matrix is compared to obtain the maximum value.
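The comparator/selector pair described for fig. 5 keeps the larger of the running maximum and each incoming value, which a software model captures directly:

```c
#include <stdio.h>

/* Software model of one k x k maximum pooling window: k comes from the
 * third configuration register; here k is a plain parameter. */
int max_pool_window(const int *window, int k)
{
    int max = window[0];
    for (int i = 1; i < k * k; i++)                 /* comparator compares...     */
        max = (window[i] > max) ? window[i] : max;  /* ...second selector chooses */
    return max;
}

int main(void)
{
    int w[4] = {3, 7, 1, 5};                        /* one 2x2 window */
    printf("max = %d\n", max_pool_window(w, 2));    /* prints 7       */
    return 0;
}
```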
Example two
Correspondingly, with reference to fig. 6, the invention further describes the flow of a convolutional neural network calculation performed by the PSoC-based convolutional neural network accelerator.
The CPU is programmed in embedded software; the construction of the deep convolutional neural network is realized in this software, and the network parameters are loaded into the relevant processor, which transmits command values to the control registers through bus configuration.
An example configuration command is as follows:
For the first layer, the inputs are x1 input feature map data and x3 weight data, and the calculation results are passed through the maximum pooling module and the activation function module to obtain x2 output feature map data.
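A minimal sketch of how the embedded software might issue such configuration commands is given below. The base address, register offsets, names and field values are all hypothetical, since the patent states only that command values are written to control registers over the bus.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register writes for one layer's
 * configuration; addresses and layout are assumptions for illustration. */
#define ACCEL_BASE       0x40000000u            /* hypothetical base address */
#define REG_KERNEL_SIZE  (ACCEL_BASE + 0x00)
#define REG_STRIDE       (ACCEL_BASE + 0x04)
#define REG_ACT_FUNC     (ACCEL_BASE + 0x08)
#define REG_POOL_SIZE    (ACCEL_BASE + 0x0C)

static inline void reg_write(uintptr_t addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;         /* bus write to control register */
}

void configure_first_layer(void)
{
    reg_write(REG_KERNEL_SIZE, 3);              /* 3x3 convolution kernel  */
    reg_write(REG_STRIDE,      1);              /* convolution step size 1 */
    reg_write(REG_ACT_FUNC,    0);              /* e.g. select ReLU        */
    reg_write(REG_POOL_SIZE,   2);              /* 2x2 maximum pooling     */
}
```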
The output feature maps of the convolutional layers are stored in the off-chip memory in alternating address spaces. For layer M, where M takes the values 1, 3, 5, 7, ..., the output feature map of layer M is the input feature map of layer M+1; the output feature map of layer M is stored in the address space starting at address A1, and the output feature map of layer M+1 is stored in the address space starting at address A2.
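This alternating (ping-pong) scheme can be modelled as below; the concrete values of A1 and A2 are placeholders, as the patent does not fix them.

```c
#include <stdint.h>

/* Ping-pong storage model: odd-numbered layers write their output
 * feature maps starting at A1, even-numbered layers at A2, so the
 * output of layer M is read back as the input of layer M+1. */
#define ADDR_A1 0x10000000u               /* hypothetical start address A1 */
#define ADDR_A2 0x18000000u               /* hypothetical start address A2 */

uint32_t output_base(int layer)           /* layer = M, counted from 1 */
{
    return (layer % 2 == 1) ? ADDR_A1 : ADDR_A2;
}

uint32_t input_base(int layer)            /* valid for layer >= 2; layer 1 */
{                                         /* reads the stored sample data  */
    return output_base(layer - 1);        /* layer M+1 reads where M wrote */
}
```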
In a particular application, the computations within the convolutional neural network layers are performed in parallel. The whole network implementation process is as follows:
(1) The software of the processor 102 controls the image processing, and the sample data is stored in the off-chip memory 101;
(2) The processor 102 controls the DMA 107 to read off-chip memory data into the first data selector 110, while the multiply-add calculation module 114, the average pooling module 112, the maximum pooling module 113, the activation function module 115 and the state machine 109 are configured by the processor 102. The configuration information includes, but is not limited to, the convolution step size, the convolution kernel size, the activation function type, the average pooling size and the maximum pooling block size.
(3) Data is transferred from the DMA to the feature map input memory 103, the offset memory 105 and the weight memory 106 under the control of the state machine 109.
(4) The data is input into the multiply-add calculation module 114, the activation function module 115, the average pooling module 112 or the maximum pooling module 113 to obtain the calculation result.
(5) Under the control of the state machine, data is transferred from the multiply-add calculation module 114, the activation function module 115, the average pooling module 112 or the maximum pooling module 113 to the second data selector 111 and written back to the off-chip memory 101.
At this point, one layer of the network is complete; the network completes all of its layers by repeating this process, as sketched below.
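The per-layer loop implied by steps (1) to (5) might look as follows in the embedded software; every function below is a placeholder stub for illustration, since the patent specifies only the ordering of the operations, not these names.

```c
#include <stdio.h>

/* Hypothetical outline of the control flow in steps (1)-(5); all
 * function names are placeholder stubs for illustration. */
typedef struct {
    int kernel_size, stride, act_func, pool_size;  /* per-layer configuration */
} layer_cfg_t;

static void store_samples_to_offchip(void)  { puts("(1) samples stored"); }
static void configure_accelerator(const layer_cfg_t *c)
                                            { printf("(2) k=%d configured\n", c->kernel_size); }
static void dma_load_layer_inputs(int m)    { printf("(3) layer %d loaded\n", m); }
static void wait_for_computation_done(void) { puts("(4) computation done"); }
static void dma_store_layer_outputs(int m)  { printf("(5) layer %d written back\n", m); }

void run_network(const layer_cfg_t *layers, int num_layers)
{
    store_samples_to_offchip();              /* (1) CPU image processing            */
    for (int m = 0; m < num_layers; m++) {   /* repeat once per network layer       */
        configure_accelerator(&layers[m]);   /* (2) program modules + state machine */
        dma_load_layer_inputs(m);            /* (3) fill feature/offset/weight memories */
        wait_for_computation_done();         /* (4) multiply-add/activation/pooling */
        dma_store_layer_outputs(m);          /* (5) write results back off chip     */
    }
}

int main(void)
{
    layer_cfg_t layers[2] = {{3, 1, 0, 2}, {5, 1, 0, 2}};
    run_network(layers, 2);
    return 0;
}
```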
In summary, the programmable-logic part of the whole convolutional neural network accelerator hardware can be reduced to a multiply-add calculation module, an activation function module, a maximum pooling module and an average pooling module. The multiply-add operations in all of the multiply-add calculation modules are computed in parallel, and convolution calculations with different convolution kernel sizes as well as pooling calculations of different sizes are supported. The softmax classifier and the non-maximum suppression algorithm, which cannot be realized in hardware logic, are implemented in the CPU software of the accelerator, and the convolutional neural network calculation is completed by configuring convolutional neural networks with different network structures. This addresses the large computation amount and large bandwidth requirement of a convolutional neural network, and allows convolutional neural network algorithms of different structures to be configured.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is intended to be included within the scope of the present invention.