CN117876661A

Info

Publication number: CN117876661A
Application number: CN202311788491.6A
Authority: CN
Inventors: 葛梦柯; 李元
Original assignee: Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Current assignee: Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date: 2023-12-22
Filing date: 2023-12-22
Publication date: 2024-04-12
Anticipated expiration: 2043-12-22
Also published as: CN117876661B

Abstract

The present invention discloses a target detection method and system for multi-scale feature parallel processing, in which a target image is input into a trained target detection model and a detection result for the target image is output. The target detection model is trained as follows. S1: construct a training set and feed it to each subnet. S2: each subnet uses its sampling module to downsample the RGB features of the input image and obtain a scale feature map. S3: each subnet uses its summary fusion module to obtain the scale feature maps of all other subnets, aligns them, and fuses all the scale feature maps, so that each subnet obtains its own fused feature map. S4: each subnet passes its fused feature map to its prediction head, which outputs prediction information; the prediction information of all subnets is aggregated and output as the prediction result. The method and system reduce the inter-layer synchronization overhead of the algorithm on hardware, improve the utilization of hardware computing units through the subnet architecture, and effectively reduce the latency of real-time target detection.

Description

A target detection method and system using multi-scale feature parallel processing

Technical Field

The present invention relates to the field of image processing, and in particular to a target detection method and system for parallel processing of multi-scale features.

Background

Real-time target detection is an important task in computer vision. It aims to detect and locate objects in video streams or live images. Unlike conventional target detection, real-time target detection must run in real time while maintaining high accuracy, which usually means processing multiple frames per second.

A typical real-time target detection pipeline is as follows: acquire images from a camera, video stream, or image sequence; preprocess the images, which may include resizing, normalization, and other image enhancement operations; use a target detection algorithm to locate and identify target objects in the image; post-process the detection results, for example by applying non-maximum suppression (NMS) to remove redundant boxes and by filtering out low-confidence detections; and finally visualize the results by marking the detected targets on the image.

In recent years, deep learning methods have dominated real-time target detection and have produced a series of strong models, such as the R-CNN series and the YOLO series. However, these algorithms are deep networks with many layers. Each layer needs a dedicated partitioning method to improve the utilization of hardware computing units, and the computation of each layer must wait for the previous layer to finish. Network depth therefore limits real-time inference: inference speed depends heavily on hardware performance and on the partitioning method, and inference latency is relatively high in fields such as autonomous driving and military reconnaissance. How to effectively reduce the latency of real-time target detection algorithms is thus an urgent research direction.

Summary of the Invention

To address the technical problems described in the background, the present invention proposes a target detection method and system for parallel processing of multi-scale features. The high parallelism of the algorithm itself improves the utilization of computing units on the inference platform, and the smaller number of network layers reduces inter-layer synchronization time, which effectively reduces the latency of real-time target detection.

The present invention proposes a target detection method for multi-scale feature parallel processing, which inputs a target image into a trained target detection model and outputs a detection result for the target image, the detection result including the category and position of each target.

The target detection model includes at least two subnets that communicate with each other. Each subnet includes a sampling module, a summary fusion module, and a prediction head.

The training process of the target detection model is as follows:

S1: Construct a training set and feed it to each subnet.

S2: Each subnet uses its sampling module to downsample the RGB features of the input image and obtain a scale feature map.

S3: Through communication between the subnets, each subnet uses its summary fusion module to obtain the scale feature maps of all other subnets, aligns them, and fuses all the scale feature maps, so that each subnet obtains its own fused feature map.

S4: Each subnet passes its fused feature map to its prediction head, which outputs prediction information. The prediction information of all subnets is aggregated and output as the prediction result, which includes the category and position of each target.

Furthermore, during training of the target detection model, a separate auxiliary training head is introduced before the prediction head of each subnet. The auxiliary training head consists of convolutional layers built from convolution kernels of different sizes and supervises the intermediate-layer feature representation of the target detection model.

Furthermore, the auxiliary training head supervises the intermediate-layer feature representation as follows:

For large-scale features, convolutional layers of stacked 1×1 kernels are used to relate contextual features across channels.

For small- and medium-scale features, convolutional layers of stacked 3×3 kernels are used to capture the spatial position features of small targets.

Furthermore, the loss function of the target detection model is as follows:

$$L = \alpha L_{conf} + \beta L_{cls} + \gamma L_{box} + \delta L_{aux}$$

where $L_{conf}$ denotes the confidence loss, $L_{cls}$ the classification loss, $L_{box}$ the target bounding box loss, and $L_{aux}$ the auxiliary training head loss; $\alpha$, $\beta$, $\gamma$, and $\delta$ are the weights of the confidence loss, classification loss, target bounding box loss, and auxiliary training head loss, respectively.

Furthermore, in step S2, each subnet obtains a scale feature map of a different scale by controlling the number of downsampling operations in its sampling module. Each subnet is also designed with a different number of convolution operations so that the computation times of all subnets are close to one another.

Furthermore, in step S3, the aggregation and fusion of all scale feature maps proceeds as follows:

Each subnet uses its summary fusion module to obtain the scale feature maps of all other subnets.

Taking the scale of the current subnet's scale feature map as the target, the scale feature maps of all other subnets are downsampled or upsampled to obtain feature-aligned scale feature maps.

Each subnet uses a pointwise convolution to fuse its own scale feature map with the feature-aligned scale feature maps, yielding the fused feature map.
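The alignment and fusion steps above can be sketched in plain Python. This is a minimal illustration rather than the patented implementation: the nearest-neighbour resampling, the channel concatenation before the pointwise convolution, and the names `resize_nearest`, `pointwise_conv`, and `fuse` are all assumptions made for the example.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a [C][H][W] nested-list feature map.

    Works for both downsampling and upsampling (assumed alignment method).
    """
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [[ch[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
         for y in range(out_h)]
        for ch in fmap
    ]


def pointwise_conv(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wo[i] * fmap[i][y][x] for i in range(len(fmap))) for x in range(w)]
         for y in range(h)]
        for wo in weights
    ]


def fuse(own, others, weights):
    """Align the other subnets' maps to `own`, concatenate channels, then fuse."""
    h, w = len(own[0]), len(own[0][0])
    stacked = list(own)
    for other in others:
        stacked.extend(resize_nearest(other, h, w))
    return pointwise_conv(stacked, weights)


# Toy data: the current subnet holds a 1-channel 4x4 map; the other two
# subnets hold a finer 8x8 map and a coarser 2x2 map.
own = [[[1.0] * 4 for _ in range(4)]]
fine = [[[2.0] * 8 for _ in range(8)]]
coarse = [[[3.0] * 2 for _ in range(2)]]
weights = [[1.0, 1.0, 1.0]]  # one output channel that sums the three inputs
fused = fuse(own, [fine, coarse], weights)
```

In this toy case every element of `fused` is the sum of the three aligned inputs. A trained model would learn the pointwise weights instead of fixing them to 1.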

A target detection system for multi-scale feature parallel processing inputs a target image into a trained target detection model and outputs a detection result for the target image, the detection result including the category and position of each target.

A computer-readable storage medium stores a plurality of classification programs, which are called by a processor to execute the target detection method described above.

The advantages of the target detection method and system provided by the present invention are as follows. The target detection model has a parallel computing structure and uses different hardware computing nodes to process the input image simultaneously. Each subnet has a small amount of computation and few layers. Parallel inference, together with ring communication that shortens the communication time between subnets, speeds up network inference and greatly reduces inference latency. The model therefore reduces dependence on the partitioning algorithm, makes full use of the hardware platform, and effectively reduces the latency of real-time target detection. The model can also output the category, position, and distance of each target at the same time, which is of great significance for application scenarios such as autonomous driving.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the structure of the target detection model of the present invention;

FIG. 2 is a structural flow chart of the aggregation and fusion of features from feature maps of different scales in step S3;

FIG. 3 is a training flowchart of the target detection model.

Detailed Description

The technical solution of the present invention is described in detail below through specific embodiments. Many specific details are set forth in the following description to facilitate a full understanding of the invention. However, the invention can be implemented in many ways other than those described here, and those skilled in the art can make similar improvements without departing from its substance. The invention is therefore not limited to the specific implementations disclosed below.

As shown in FIGS. 1 to 3, the present invention proposes a target detection method with multi-scale feature parallel processing, which inputs a target image into a trained target detection model and outputs a detection result for the target image, the detection result including the category and position of each target.

Through a parallel design of the model structure, the target detection model is divided into multiple subnets, which reduces network depth. Different subnets use different hardware computing units to process the input image simultaneously, and ring communication between the subnets shortens communication time and speeds up network inference. The target detection model is described in detail below.

The target detection model includes at least two subnets that communicate with each other. Each subnet includes a sampling module, a summary fusion module, and a prediction head.

Each subnet is a convolutional neural network (CNN) that extracts features from the input image independently. Each subnet obtains a scale feature map of a different scale by controlling the number of downsampling operations in its sampling module, and each subnet is designed with a different number of convolution operations so that the computation times of the three subnets are close to one another.

In a convolutional neural network, low-level features generally have high resolution and rich detail, while high-level features generally have low resolution and rich semantic information. As shown in FIG. 2, the features of the scale feature maps from the other subnets are downsampled or upsampled to the same scale as the current subnet's own features. A pointwise convolution (the Fusion layer) is then applied in each subnet to fuse all the features, achieving information fusion and extraction.

The scale feature maps of different scales are aggregated and fused as follows:

S4: Each subnet passes its fused feature map to its prediction head (Predict), which outputs prediction information. The prediction information of all subnets is aggregated and output as the prediction result, which includes the category and position of each target.

The prediction head performs category classification and bounding box regression for the targets in the image. Post-processing includes decoding, non-maximum suppression, and detection box drawing. The final output of the target detection model is a prediction result containing both the category and the position of each target.
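The non-maximum suppression step of the post-processing can be illustrated with a short sketch. The text does not specify the NMS variant or thresholds, so the greedy per-class procedure, the `(x1, y1, x2, y2)` box format, the threshold values, and the function names `iou` and `nms` below are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_thresh=0.25, iou_thresh=0.5):
    """Greedy per-class NMS over (box, score, class_id) tuples.

    Low-confidence detections are dropped first; a detection is then kept
    only if it does not overlap an already kept box of the same class.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


# Predictions pooled from all subnet heads (made-up numbers).
pooled = [
    ((10, 10, 50, 50), 0.90, 0),
    ((12, 12, 52, 52), 0.80, 0),      # heavily overlaps the first box
    ((100, 100, 140, 140), 0.70, 1),
    ((0, 0, 5, 5), 0.10, 0),          # below the score threshold
]
result = nms(pooled)
```

Here the second box is suppressed by the first and the fourth is filtered out by its score, so `result` holds two detections.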

Through steps S1 to S4, the target detection model has a distributed parallel structure and uses different hardware computing units to process the input image simultaneously. Each subnet has a small amount of computation and few layers. Parallel inference, together with ring communication that shortens the communication time between subnets, speeds up network inference and greatly reduces inference latency. The model therefore reduces dependence on hardware performance and effectively reduces the latency of real-time target detection. The model can also output the category, position, and distance of each target at the same time, which is of great significance for application scenarios such as autonomous driving.

The training process is described below using a target detection model with three subnets (subnet 1, subnet 2, and subnet 3) as an example:

(1) Construct a training set and feed it to each subnet.

(2) Each subnet uses its sampling module to downsample the RGB features of the input image and obtain a scale feature map.

In subnet 1, the input image is downsampled three times and then passed through several convolutions to extract features. In subnet 2, it is downsampled four times and then passed through several convolutions. In subnet 3, it is downsampled five times and then passed through several convolutions. The number of convolutions differs between subnets, which keeps the processing times of the three subnets for a given input image approximately equal.
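To make the three scales concrete, the sketch below computes the spatial size of each subnet's scale feature map. The downsampling counts (3, 4, and 5) come from the embodiment; the 640-pixel input size and the factor-of-2 reduction per downsampling step are assumptions made only for this example, as is the helper name `feature_map_size`.

```python
def feature_map_size(input_size, num_downsamples, stride=2):
    """Spatial size after repeated downsampling.

    Assumes every downsampling step divides the resolution by `stride`.
    """
    size = input_size
    for _ in range(num_downsamples):
        size //= stride
    return size


# Downsampling counts from the embodiment: subnet 1 -> 3, subnet 2 -> 4, subnet 3 -> 5.
downsamples = {"subnet_1": 3, "subnet_2": 4, "subnet_3": 5}
sizes = {name: feature_map_size(640, n) for name, n in downsamples.items()}
```

Under these assumptions subnet 1 works on the largest map and subnet 3 on the smallest. A convolution on a larger map costs more per layer, which is consistent with giving the subnets different numbers of convolutions to balance their run times.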

(3) Through communication between the subnets, each subnet uses its summary fusion module to obtain the scale feature maps of all other subnets, aligns them, and fuses all the scale feature maps, so that each subnet obtains its own fused feature map.

As shown in FIG. 2, the subnets exchange their scale feature maps using ring communication, for example subnet 1 to subnet 2, subnet 2 to subnet 3, and subnet 3 to subnet 1. Features of other scales are aligned with the current subnet's own features by upsampling or downsampling to obtain aligned scale feature maps. A pointwise convolution (the Fusion layer) is then applied in each subnet to fuse all the features into a fused feature map, achieving information fusion and extraction.
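One way to read the ring exchange is as a ring all-gather, in which each subnet forwards to its successor the map it most recently received, so that after n−1 steps every subnet holds every map. The text only states the ring order, so this schedule and the function name `ring_all_gather` are assumptions. The sketch simulates the exchange in a single process.

```python
def ring_all_gather(local_maps):
    """Simulate a ring exchange among n nodes.

    Node i sends to node (i + 1) % n. After n - 1 steps every node holds
    the maps of all nodes. Returns one {source_index: map} dict per node.
    """
    n = len(local_maps)
    gathered = [{i: local_maps[i]} for i in range(n)]
    in_flight = list(range(n))  # in_flight[i]: index of the map node i sends next
    for _ in range(n - 1):
        # All nodes send at the same time; node i receives from node i - 1.
        received = [in_flight[(i - 1) % n] for i in range(n)]
        for i, src in enumerate(received):
            gathered[i][src] = local_maps[src]
        in_flight = received
    return gathered


# Placeholder strings stand in for the three scale feature maps.
maps = ["F1 (subnet 1)", "F2 (subnet 2)", "F3 (subnet 3)"]
views = ring_all_gather(maps)
```

With three subnets the exchange finishes in two steps, and each node only ever talks to its neighbour on the ring.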

(4) Each subnet passes its fused feature map to its prediction head, which outputs prediction information. The prediction information of all subnets is aggregated and output as the prediction result, which includes the category and position of each target.

The three prediction heads each perform category classification and bounding box regression for the targets in the image. Post-processing includes decoding, non-maximum suppression, and detection box drawing. The final output of the target detection model is a prediction result containing both the category and the position of each target.

During the training process described above, to speed up and improve convergence of the target detection model, a separate auxiliary training head is introduced before the prediction head of each subnet during the training phase. The auxiliary heads are tailored to the features of the differently scaled feature maps: they use convolutional layers built from kernels of different sizes to supervise the intermediate-layer feature representation of the model. Specifically, for large-scale features, convolutional layers of stacked 1×1 kernels are used to better relate contextual features across channels; for small- and medium-scale features, convolutional layers of stacked 3×3 kernels are used to better capture the spatial position features of small targets.
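The difference between the two kernel sizes can be seen in a small sketch: a 1×1 kernel combines channels at a single pixel, while a 3×3 kernel also aggregates spatial neighbours. The zero-padded `conv2d` helper, the toy input, and the kernel values are all hypothetical and are not the auxiliary heads of the embodiment.

```python
def conv2d(fmap, kernels):
    """Zero-padded 'same' convolution on a [C][H][W] nested list.

    kernels[o][i] is a k x k list (k odd) applied to input channel i
    when computing output channel o.
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    r = len(kernels[0][0]) // 2
    out = []
    for per_out in kernels:
        plane = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for i in range(c_in):
                    for dy in range(-r, r + 1):
                        for dx in range(-r, r + 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < h and 0 <= xx < w:  # zero padding
                                acc += per_out[i][dy + r][dx + r] * fmap[i][yy][xx]
                plane[y][x] = acc
        out.append(plane)
    return out


# Two-channel 3x3 input: one bright pixel in channel 0, a constant channel 1.
x = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],
]
# 1x1 kernel: mixes the two channels at each pixel, no spatial context.
k1 = [[[[1.0]], [[0.5]]]]
# 3x3 kernel: sums channel 0 over each pixel's neighbourhood, ignores channel 1.
k3 = [[[[1.0] * 3 for _ in range(3)], [[0.0] * 3 for _ in range(3)]]]

mixed = conv2d(x, k1)
spread = conv2d(x, k3)
```

In `mixed`, the bright pixel stays confined to the centre because the 1×1 kernel never looks at neighbours. In `spread`, every position sees the bright pixel through its 3×3 window, which is the kind of spatial context the text associates with locating small targets.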

The prediction head in each subnet is responsible for category classification, bounding box regression, and distance regression. At inference time, the model's predictions are output directly as the detection result after post-processing, which includes decoding, non-maximum suppression (NMS), and detection box drawing. During training, the predictions are compared with the labels, the loss function is computed, and the model weights are adjusted through backpropagation. In this embodiment, the loss function of the target detection model is expressed as follows:

$$L = \alpha L_{conf} + \beta L_{cls} + \gamma L_{box} + \delta L_{aux}$$

where $L_{conf}$ denotes the confidence loss, reflecting whether what the model predicts is a true target; $L_{cls}$ denotes the classification loss, reflecting the gap between the predicted category and the true category; $L_{box}$ denotes the target bounding box loss, reflecting the gap between the predicted bounding box and the true bounding box; $L_{aux}$ denotes the auxiliary training head loss, reflecting the gap between the self-distillation teacher model and the student model; and $\alpha$, $\beta$, $\gamma$, and $\delta$ are the weights of the confidence loss, classification loss, target bounding box loss, and auxiliary training head loss, respectively.
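Reading the four weights as coefficients of a weighted sum, the combination of the loss terms can be sketched as below. The function name `total_loss`, the default weights of 1.0, and the numeric values are placeholders chosen for illustration; the text does not give concrete weights or specify how each individual term is computed.

```python
def total_loss(l_conf, l_cls, l_box, l_aux,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weighted sum of confidence, classification, box, and auxiliary-head losses."""
    return alpha * l_conf + beta * l_cls + gamma * l_box + delta * l_aux


# Made-up per-term losses and weights, purely to show the arithmetic.
loss = total_loss(0.5, 0.25, 1.0, 2.0, alpha=1.0, beta=2.0, gamma=0.5, delta=0.25)
```

Each weighted term in this example contributes 0.5. Because the auxiliary head is described as a training-phase addition, its loss term only plays a role during training.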

The above is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the invention and in accordance with its technical solution and inventive concept, shall fall within the scope of protection of the invention.