CN117973302A - Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method - Google Patents

Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method

Info

Publication number
CN117973302A
Authority
CN
China
Prior art keywords
calculation
power consumption
data
heterogeneous
fpga
Prior art date
Legal status
Pending
Application number
CN202311676722.4A
Other languages
Chinese (zh)
Inventor
刘杰
严心仪
纪植
张雨佳
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-05-03
Application filed by Hunan University
Priority to CN202311676722.4A
Publication of CN117973302A
Status: Pending


Abstract

(Translated from Chinese)

The present invention relates to the fields of artificial intelligence and integrated-circuit design and discloses a method for designing a target-detection computing chip based on heterogeneous parallelism. The method comprises five steps: quantization training of a shift neural network; deploying the quantized neural network to hardware; designing the data-reuse mode and the inter-layer interaction mode; building the heterogeneous-system working environment; and loading the COCO dataset for functional testing. Hardware deployment is based on a non-von Neumann architecture, and the method's defining feature is high-speed detection with preserved inference accuracy under a low-power performance constraint. The acceleration method determines an accuracy-feasible quantization scheme through software quantization, compresses the model, and then loads the compressed model onto a compute-in-memory platform, avoiding frequent data movement between compute and storage units and speeding up target detection, giving it real-time, low-power, and high-efficiency characteristics.

Description

(Translated from Chinese)
A low-power edge intelligent chip core design method based on heterogeneous parallelism

Technical Field

The present invention is a low-power edge intelligent chip core design method for target detection. The method deploys a convolutional neural network on a hardware platform and, compared with conventional computing platforms such as GPUs and CPUs, reduces detection power consumption and increases detection speed while ensuring accuracy.

Background Art

In today's era of the Internet of Everything, the number of network-connected terminal devices is growing explosively, and large volumes of data must be transmitted from the edge to cloud data centers. This processing mode not only yields poor responsiveness for target detection but also incurs high communication costs and serious energy losses. Traditional centralized processing can no longer keep pace with the intelligent Internet of Everything, which has given rise to target detection on edge intelligent terminals. Target detection on low-power edge intelligent terminals completes data storage, computation, and processing at the edge, meeting practical industry needs for low latency, security, intelligence, energy saving, and distribution. Edge intelligence combines edge computing with artificial intelligence and is required to be real-time, low-power, and efficient, yet existing research still faces problems such as difficult AI-algorithm deployment and weak computing power on edge intelligent terminals.

Therefore, to improve the computing performance of edge intelligent terminals, the present invention adopts a low-power edge intelligent chip core design method based on heterogeneous parallelism. Compared with traditional target-detection techniques, the core deeply matches the deep-learning algorithm to the hardware architecture to accelerate CNN feature extraction. Its key features are the use of shift multiplication for convolution and an architecture design that combines layer fusion with mixed reuse of input and output feature maps, so that during parallel convolution the computation latency densely covers the communication latency, improving chip performance.

Summary of the Invention

The present invention provides a neural-network-based target-detection computing chip design method whose purpose is to deploy a neural network onto hardware of non-von Neumann architecture, including but not limited to FPGAs (Field Programmable Gate Arrays) and ASICs (Application Specific Integrated Circuits), so that it performs image classification correctly in edge scenarios while reducing whole-system power consumption and increasing detection speed without sacrificing accuracy. The method comprises the following steps:

Step S1: quantization training of the shift neural network;

Convolutional neural networks involve a large number of multiplications. To save hardware resources, the usual practice is to non-uniformly quantize the weight data W, approximating it with a low-precision weight tensor; the specific calculation process is given by formulas (1) and (2).

Step S2: deploying the quantized neural network to hardware;

The trained network model is deployed to an FPGA chip. The design adopts an ARM+FPGA heterogeneous scheme: data pre- and post-processing are assigned to the ARM, while the neural-network computation is deployed on the FPGA. The FPGA side is designed in Verilog on a non-von Neumann hardware architecture. All network model parameters (weights w and biases b) are pre-stored in the off-chip mounted DDR memory; final results are continuously exchanged with the DDR during computation, the intermediate results of each network-calculation step are held in registers, and the whole computation is pipelined.

Step S3: designing the data-reuse mode and the inter-layer interaction mode;

After the quantized neural network is deployed on the FPGA and its functions are initially realized, chip performance is optimized by designing the data-reuse mode and the inter-layer interaction mode. Since operating frequency, power consumption, and computing speed mutually influence and constrain one another during chip design, power-consumption optimization and frame-rate optimization are not treated separately. Chip performance is optimized and functionally verified through three-level ping-pong buffering, layer fusion, multi-channel data transmission, parameter reordering, and mixed reuse of input and output feature maps.

Step S4: building the heterogeneous-system working environment;

Exploiting the hardware feature that the FPGA development board carries two ARM cores, the computing tasks of the neural network are divided in a dual-ARM+FPGA heterogeneous parallel manner: the main computing network is deployed on the FPGA side and post-processing is handled on the ARM side, with Xilinx Vivado mapping the hardware description language to a digital circuit. For the network's data preparation and post-processing and for ARM-FPGA communication, two application projects running on the ARM cores are built with the Vitis IDE: one ARM core runs a bare-metal system and the other runs a Linux system.

Step S5: loading the dataset for functional testing;

Data from the COCO dataset is loaded to test chip functional correctness; the FPGA used in the test is an XC7Z035-2FFG676. During the test the FPGA handles the network inference process, while the ARM cores handle data pre- and post-processing; the ARM core running the Linux system starts the accelerator computation via the SD card, prints the detection results after computation completes, and power consumption is measured with a DC power supply.

Compared with the prior art, the significant advantage of the present invention is that it determines an accuracy-feasible quantization scheme through software quantization, compresses the model, and then loads the compressed model onto a compute-in-memory platform, avoiding frequent data movement between compute and storage units and achieving real-time detection at low power.

Brief Description of the Drawings

Figure 1 is the operation flow chart of the present invention;

Figure 2 is the heterogeneous-system task division diagram;

Figure 3 is the overall accelerator architecture diagram;

Figure 4 is the overall accelerator data-flow diagram;

Figure 5 is the accelerator data-reuse architecture diagram;

Figure 6 is the Vitis embedded acceleration flow chart;

Figure 7 is a comparison chart of GPU and FPGA detection results;

Figure 8 is a chart of the power consumption measured with a DC power supply.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The present invention relates to a heterogeneous-parallelism-based target-detection computing chip design method whose purpose is to deploy a neural network onto hardware of non-von Neumann architecture, so that it performs image classification correctly in edge scenarios while reducing whole-system power consumption and increasing detection speed without sacrificing accuracy. As shown in Figure 1, the method comprises five steps: quantization training of the shift neural network; deploying the quantized neural network to hardware; designing the data-reuse mode and the inter-layer interaction mode; building the heterogeneous-system working environment; and loading the dataset for functional testing. The implementation steps are described in detail below using the COCO dataset as an embodiment.

Step S1: quantization training of the shift neural network;

Convolutional neural networks involve a large number of multiplications. To save hardware resources, the usual practice is to non-uniformly quantize the weight data W, approximating it with a low-precision weight tensor; the specific calculation process is given by formulas (1) and (2).
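
Formulas (1) and (2) are not reproduced in this text, so the following is only a minimal illustrative sketch, assuming the non-uniform quantization approximates each weight by a signed sum of a few powers of two (so that every multiplication reduces to shifts and adds, matching the shift network described above). The exponent range derived from the bit width is also an assumption:

```python
import numpy as np

def quantize_to_shifts(w, num_shifts=3, bit_width=5):
    """Greedy power-of-two quantization: approximate each weight by a signed
    sum of `num_shifts` powers of two. A sketch only; the exact scheme of
    formulas (1) and (2) (e.g. the 'Shift3(5bit)' encoding) is assumed."""
    # Assumed exponent range for a `bit_width`-bit signed exponent field.
    max_exp, min_exp = 0, -(2 ** (bit_width - 1)) + 1
    residual = w.astype(np.float64).copy()
    approx = np.zeros_like(residual)
    for _ in range(num_shifts):
        sign = np.sign(residual)
        mag = np.abs(residual)
        # Nearest representable power of two for the remaining residual.
        exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** min_exp))),
                      min_exp, max_exp)
        term = sign * (2.0 ** exp)  # sign is 0 where the residual is already 0
        approx += term
        residual -= term
    return approx

w = np.random.uniform(-1.0, 1.0, size=(16, 16))
w_q = quantize_to_shifts(w, num_shifts=3, bit_width=5)
print("mean |w - w_q|:", np.abs(w - w_q).mean())
```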

Based on the COCO2017 dataset, the network is quantized with different shift schemes and the post-quantization mAP is measured. Table 1 compares the accuracy of the different shift schemes against 32-bit floating point under the quantization method above: at a given bit width, increasing the number of shifts improves detection accuracy, and at a given number of shifts, increasing the bit width also improves detection accuracy. The three-shift, 5-bit scheme (mAP = 33.5) matches 32-bit floating point (mAP = 33.6) well, with a drop of only 0.1, so this scheme is selected.

表1Table 1

COCO2017 dataset
Scheme          Input resolution   mAP
FP32            416*416            33.6
Shift2(5bit)    416*416            31.9
Shift3(4bit)    416*416            28.9
Shift3(5bit)    416*416            33.5
Shift4(4bit)    416*416            33.0
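
In hardware, a weight quantized this way turns each convolution multiplication into a few shift-and-add operations on the fixed-point activation, which is where the resource saving comes from. A small sketch of that behaviour (the exponent/sign encoding is an assumption, not taken from the patent):

```python
def shift_mul(activation: int, exps, signs) -> int:
    """Multiply an integer activation by a weight w = sum(s_i * 2**e_i)
    using only shifts and adds, as a shift-multiply convolution unit would."""
    acc = 0
    for s, e in zip(signs, exps):
        acc += s * (activation << e if e >= 0 else activation >> -e)
    return acc

# 0.40625 = 2**-2 + 2**-3 + 2**-5; on a fixed-point activation of 96:
print(shift_mul(96, exps=[-2, -3, -5], signs=[1, 1, 1]))  # 39 == 96 * 0.40625
```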

Step S2: deploying the quantized neural network to hardware;

The trained network model is deployed to an FPGA chip. As shown in Figure 2, the design adopts an ARM+FPGA heterogeneous scheme: data pre- and post-processing are assigned to the ARM, while the neural-network computation is deployed on the FPGA. The FPGA side is designed in Verilog on a non-von Neumann hardware architecture. All network model parameters (weights w and biases b) are pre-stored in the off-chip mounted DDR memory; final results are continuously exchanged with the DDR during computation, the intermediate results of each network-calculation step are held in registers, and the whole computation is pipelined.

The overall accelerator architecture is shown in Figure 3. The PS side comprises two ARM cores and the external DDR memory; the PL side consists of direct memory access (DMA), a master-control module, data-cache modules, and the computing module. The overall data flow is shown in Figure 4: the accelerator exposes multiple external AXI (Advanced eXtensible Interface) Master interfaces, which can concurrently read input feature maps and write back output feature maps, while one AXI Master interface is dedicated to reading each layer's weight parameters. The data-distribution module generates the corresponding DDR addresses and distributes the input-feature-map pixel blocks read from off-chip into the on-chip caches; the data-receiving module generates the DDR write-back addresses and writes the output-feature-map pixel blocks in the output cache back off-chip.
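
As an illustration of what the data-distribution and data-receiving modules compute, the sketch below generates DDR burst descriptors for one pixel block of a feature map. The row-major [channel][row][column] layout and 16-bit pixels are assumptions; the patent does not disclose its actual memory map:

```python
def block_bursts(base, ch, fmap_h, fmap_w, r0, c0, Tr, Tc, bytes_per_px=2):
    """DDR burst descriptors (start address, length in bytes) for a Tr x Tc
    pixel block of channel `ch`: one contiguous burst per block row,
    assuming a row-major [channel][row][col] layout."""
    return [(base + ((ch * fmap_h + r) * fmap_w + c0) * bytes_per_px,
             Tc * bytes_per_px)
            for r in range(r0, r0 + Tr)]

# Example: the first 4x4 block of channel 0 in a 52x52 map at 0x10000000.
for addr, nbytes in block_bursts(0x10000000, 0, 52, 52, 0, 0, Tr=4, Tc=4):
    print(hex(addr), nbytes)
```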

Step S3: designing the data-reuse mode and the inter-layer interaction mode;

After the quantized neural network is deployed on the FPGA and its functions are initially realized, chip performance is optimized by designing the data-reuse mode and the inter-layer interaction mode. Since operating frequency, power consumption, and computing speed mutually influence and constrain one another during chip design, power-consumption optimization and frame-rate optimization are not treated separately. Chip performance is optimized and functionally verified through three-level ping-pong buffering, layer fusion, multi-channel data transmission, parameter reordering, and mixed reuse of input and output feature maps. To let computation latency cover communication latency at higher density, the design uses an architecture with mixed reuse of input and output feature maps, as shown in Figure 5, where Tif is the input-feature-map depth, Tof the output-feature-map depth, Tix the input-feature-map width, Tiy the input-feature-map height, Tox the output-feature-map width, Toy the output-feature-map height, Tr the pixel-block height per packet, Tc the pixel-block width per packet, Tn the input depth per packet, and N the number of internal input-reuse iterations. The inter-layer interaction of each network layer is also optimized: layer fusion solves the communication problem of the pooling and upsampling layers by converting them into part of the convolution layer. There are three fusion modes in total: two for the pooling layer, according to the pooling type, and one for upsampling.
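
The reuse pattern can be made concrete as a schedule over the tile parameters just defined. The loop order below is an assumption inferred from the description (output tiles stay resident on chip while Tn-channel input packets are streamed in and each packet is reused N times), not the patent's actual RTL:

```python
def mixed_reuse_schedule(Tif, Tof, Toy, Tox, Tr, Tc, Tn, N):
    """Yield the DMA/compute events of one layer under mixed input/output
    feature-map reuse: each loaded input packet is reused N times on chip
    (one pass per output-channel group), so the next load's communication
    time is covered by computation."""
    for oy in range(0, Toy, Tr):              # output tile rows
        for ox in range(0, Tox, Tc):          # output tile columns
            for ic in range(0, Tif, Tn):      # stream Tn-channel input packets
                yield ("load", ic, oy, ox)    # DMA read: Tn x Tr x Tc block
                for r in range(N):            # on-chip reuse of the packet
                    yield ("mac", r * (Tof // N), ic, oy, ox)
            yield ("writeback", oy, ox)       # DMA write: finished output tile

events = list(mixed_reuse_schedule(Tif=16, Tof=32, Toy=8, Tox=8,
                                   Tr=4, Tc=4, Tn=4, N=4))
loads = sum(e[0] == "load" for e in events)
macs = sum(e[0] == "mac" for e in events)
print(f"{loads} loads, {macs} compute packets -> reuse factor {macs // loads}")
```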

Step S4: building the heterogeneous-system working environment;

Exploiting the hardware feature that the FPGA development board carries two ARM cores, the computing tasks of the neural network are divided in a dual-ARM+FPGA heterogeneous parallel manner: the main computing network is deployed on the FPGA side and post-processing is handled on the ARM side, with Xilinx Vivado mapping the hardware description language to a digital circuit. For the network's data preparation and post-processing and for ARM-FPGA communication, two application projects running on the ARM cores are built with the Vitis IDE: one ARM core runs a bare-metal system and the other runs a Linux system.
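
Functionally, the task split behaves like a three-stage pipeline: data preparation on one ARM core, inference on the FPGA fabric, post-processing on the other ARM core. The sketch below models that behaviour with Python threads as stand-ins; which core runs which stage, and the queue-based handoff, are assumptions, since the real system uses a bare-metal core, a Linux core, and AXI DMA:

```python
import queue
import threading

def run_pipeline(frames, preprocess, fpga_infer, postprocess):
    """Three concurrent stages model the dual-ARM + FPGA task division."""
    to_fpga, to_post, results = queue.Queue(2), queue.Queue(2), []

    def arm_pre():                      # ARM core A: data preparation
        for f in frames:
            to_fpga.put(preprocess(f))
        to_fpga.put(None)

    def fpga():                         # FPGA fabric: network inference
        while (x := to_fpga.get()) is not None:
            to_post.put(fpga_infer(x))
        to_post.put(None)

    def arm_post():                     # ARM core B: post-processing
        while (y := to_post.get()) is not None:
            results.append(postprocess(y))

    threads = [threading.Thread(target=t) for t in (arm_pre, fpga, arm_post)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(range(3), lambda f: f, lambda x: x * 2, lambda y: y + 1))
```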

Figure 6 shows the embedded-application acceleration development flow using Vitis. The design first exports an xsa (Xilinx Shell Archive) file from Vivado, then uses PetaLinux to build a Linux kernel for the ARM core, producing boot files such as the rootfs, fsbl.elf, and u-boot.elf. The Vitis platform is then used to build the hardware platform and carry out the bare-metal-domain and Linux-domain designs, on top of which programs written in C++ or Python build the corresponding software application platforms.

Step S5: loading the dataset for functional testing;

Data from the COCO dataset is loaded to test chip functional correctness; the FPGA used in the test is an XC7Z035-2FFG676. During the test the FPGA handles the network inference process, while the ARM cores handle data pre- and post-processing; the ARM core running the Linux system boots from the SD card and starts the accelerator computation, and after computation completes the detection results and detection time are printed, with power consumption measured using a DC power supply. Figure 7 compares the test results against GPU results, and Figure 8 shows the peak power measured while continuously detecting 30 frames. At 200 MHz the accelerator achieves a peak power consumption of 3.275 W and a detection speed of 33 FPS.
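
A quick back-of-envelope from these reported figures (using peak power, so the energy number is an upper bound):

```python
peak_power_w = 3.275          # measured peak power at 200 MHz
fps = 33                      # measured detection speed
print(f"latency: {1000 / fps:.1f} ms/frame")          # ~30.3 ms
print(f"energy:  {peak_power_w / fps:.3f} J/frame")   # ~0.099 J upper bound
```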

Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the present invention, not to limit its scope of protection. With reference to the description of this embodiment, those of ordinary skill in the art will understand and may make relevant modifications or substitutions to the technical solution without departing from the essence and scope of the present invention.

Claims (6)

3. The heterogeneous-parallelism-based low-power-consumption edge intelligent chip kernel design method as set forth in claim 1, wherein in step S2 a network model meeting the precision requirement is deployed into a digital chip. The design adopts an ARM+FPGA heterogeneous scheme: data preprocessing is assigned to the ARM, and the neural-network computation is deployed on the FPGA. The FPGA-side hardware architecture is a non-von Neumann architecture; all network model parameters (weights w and biases b) are pre-stored in the off-chip mounted DDR memory, final results are continuously exchanged with the DDR during computation, the intermediate results of each network-calculation step are held in registers, and the whole computation is pipelined. Compared with a CPU or GPU of von Neumann architecture, the chip of the invention completes high-speed target-detection computation at lower power consumption.
4. The heterogeneous-parallelism-based low-power-consumption edge intelligent chip kernel design method as set forth in claim 1, wherein in step S3, after the quantized neural network is deployed on the FPGA and its functions are initially realized, chip performance is optimized through the data-reuse mode and inter-layer interaction design according to the measured frame rate and actual power consumption. Since operating frequency, power consumption, and computing speed mutually influence and constrain one another during chip design, power-consumption optimization and frame-rate optimization are not specifically distinguished. The overall data-reuse design mixes input- and output-feature-map reuse: it exploits the property of output-feature-map reuse that part of the continual data exchange with the off-chip buffer is cancelled, and it combines input-feature-map reuse with the on-chip buffer so that the communication time of the next calculation's input is covered, densely overlapping the communication latency of convolution calculation. The inter-layer interaction of each network layer is optimized: layer fusion solves the communication problem of the pooling and upsampling layers by converting them into part of the convolution layer, removing redundant inter-layer data exchange and greatly reducing the communication latency of pooling and upsampling calculations. The computation adopts a three-level ping-pong buffer whose three levels are input communication, computation, and output communication; under the control of the master-control module each module performs its own function without affecting the normal computation of the others, so tasks execute concurrently and computation proceeds efficiently.
5. The heterogeneous-parallelism-based low-power-consumption edge intelligent chip kernel design method as set forth in claim 1, wherein in step S4 the computing tasks of the neural network are divided in a dual-ARM+FPGA heterogeneous parallel manner: the main computing network is deployed on the FPGA side, post-processing is handled on the ARM side, and the hardware description language is mapped to a digital circuit by means of Xilinx Vivado; for the network's data preparation and post-processing and for ARM-FPGA communication, two application projects running on the ARM cores are built by means of the Vitis IDE, one ARM core running a bare-metal system and the other running a Linux system.
6. The heterogeneous-parallelism-based low-power-consumption edge intelligent chip kernel design method as set forth in claim 1, wherein in step S5 a heterogeneous parallel target-detection computing system is built for functional testing. Chip functional correctness is tested by loading data from the COCO dataset; the FPGA used in the test is an XC7Z035-2FFG676. During the test the FPGA is responsible for the network inference process and the ARM cores for data preprocessing; the ARM core running the Linux system starts the accelerator computation via the SD card, the detection results are printed after computation completes, and power consumption is tested with a DC power supply.
CN202311676722.4A, priority date 2023-12-07, filing date 2023-12-07: Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method (Pending, published as CN117973302A)

Priority Applications (1)

Application Number / Priority Date / Filing Date / Title

CN202311676722.4A, 2023-12-07, 2023-12-07: Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method

Applications Claiming Priority (1)

Application Number / Priority Date / Filing Date / Title

CN202311676722.4A, 2023-12-07, 2023-12-07: Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method

Publications (1)

Publication Number / Publication Date

CN117973302A, 2024-05-03

Family

Family ID: 90850541

Family Applications (1)

Application Number / Status / Priority Date / Filing Date / Title

CN202311676722.4A (published as CN117973302A), Pending, 2023-12-07, 2023-12-07: Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method

Country Status (1)

Country / Link

CN: CN117973302A

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number / Priority date / Publication date / Assignee / Title

CN118363900A *, 2024-06-18, 2024-07-19, 中电科申泰信息科技有限公司: Data stream acceleration equipment and method with expansibility and flexibility
CN118762295A *, 2024-06-19, 2024-10-11, 福州大学: A high-energy-efficiency embedded drone intrusion detection system based on FPGA


Similar Documents

Publication / Title

CN117973302A: Heterogeneous parallelism-based low-power consumption edge intelligent chip kernel design method
CN110097174B: Method, system and device for realizing convolutional neural network based on FPGA and row output priority
EP3757901A1: Schedule-aware tensor distribution module
CN113792621B: FPGA-based target detection accelerator design method
CN113361695B: Convolutional neural network accelerator
CN111178518A: Software and hardware cooperative acceleration method based on FPGA
US11704535B1: Hardware architecture for a neural network accelerator
Kim et al.: FPGA-based CNN inference accelerator synthesized from multi-threaded C software
CN114742225A: A Neural Network Inference Acceleration Method Based on Heterogeneous Platform
JP2021072103A: Method of quantizing artificial neural network, and system and artificial neural network device therefor
US20210398013A1: Method and system for performance tuning and performance tuning device
CN111831355B: Weight precision configuration method, device, equipment and storage medium
CN118014022A: Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment
CN112183015B: Chip layout planning method for deep neural network
CN118535340A: Deep learning model optimization method for large-scale distributed training
CN117632844A: Reconfigurable AI algorithm hardware accelerator
CN118446265A: Neural network accelerator design method and device
Zong-ling et al.: The design of lightweight and multi parallel CNN accelerator based on FPGA
CN112001492A: Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN115577747A: A Highly Parallel Heterogeneous Convolutional Neural Network Accelerator and Acceleration Method
Zhou et al.: Design and implementation of YOLOv3-Tiny accelerator based on PYNQ-Z2 heterogeneous platform
CN118279640B: FPGA-based large target key feature recognition method and device
CN118468938A: A Winograd-based operator accelerator for heterogeneous object detection networks
Wu et al.: Efficient inference of large-scale and lightweight convolutional neural networks on FPGA
Zhang et al.: Automated feature map padding and transfer circuit for CNN inference

Legal Events

Code / Title

PB01: Publication
SE01: Entry into force of request for substantive examination
