


Technical Field

The present invention relates to object detection methods in intelligent systems (such as intelligent driving, intelligent surveillance, and intelligent interaction), and in particular to object detection methods based on deep learning.
Background

Object detection refers to localizing the objects present in an image or video and assigning each object a specific category. In recent years, object detection based on deep convolutional neural networks has achieved great success and is widely used in intelligent driving, intelligent transportation, intelligent search, intelligent authentication, and other fields. For example, an intelligent vehicle must detect obstacles ahead before making control decisions, and an intelligent interaction system must detect the people it should interact with before recognizing their gestures and commands.

Owing to their powerful feature representation ability, deep convolutional neural networks have achieved great success in tasks such as image classification, object detection, and semantic segmentation. Deep-learning-based object detection methods fall into two main categories: two-stage methods and single-stage methods. Compared with single-stage methods, two-stage methods achieve higher detection accuracy. This patent focuses on two-stage methods. A two-stage method consists of two parts: a candidate detection window extraction network and a candidate window classification and regression network. To roughly localize the objects that may be present in an image, the candidate window extraction network generates a certain number of candidate detection windows. The classification and regression network then further classifies and regresses these candidates to obtain the final positions and classification scores of the detection windows.

Among two-stage methods, a representative work is Faster R-CNN [1], which shares a backbone network between candidate window extraction and classification. Faster R-CNN uses an RoI pooling layer to extract global features of each candidate window's region of interest for classification and regression; it therefore ignores the local features of the object. In fact, for occluded objects, local features are more helpful for improving detection performance. Moreover, the RoI pooling layer rescales the feature map of the original detection box region to a fixed size and is thus not robust to object deformation. To encode local information into the features, Dai et al. [2] proposed the position-sensitive RoI pooling layer (PSRoI). Specifically, PSRoI divides each region of interest into k×k sub-regions, and the response of each sub-region is the average response of the corresponding region on the corresponding channel group of a position-sensitive feature map. Compared with Faster R-CNN, the PSRoI-based detector R-FCN achieves similar detection accuracy at a higher detection speed. Zhu et al. [3] integrated the RoI layer and the PSRoI layer to exploit both global and local features; however, that method neither mines multi-scale features nor adaptively fuses the local and global features. To encode multi-scale features, He et al. [4] proposed a spatial pyramid pooling structure to fuse multi-scale features, which is more robust to object deformation. Zhao et al. [5] adopted a similar structure to improve semantic segmentation, Liu et al. [6] applied the spatial pyramid idea to single-stage object detection, and Wang et al. [7] used 3D convolutions to fuse multi-scale features.
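To make the distinction concrete, the following is a minimal sketch contrasting RoI pooling (global region features) with position-sensitive RoI pooling (per-bin channel groups) using torchvision's operators; the tensor shapes and the single region of interest are illustrative assumptions, not values from the cited papers.

```python
# Minimal sketch: RoI pooling vs. position-sensitive RoI pooling (PSRoI).
# Assumes PyTorch + torchvision; all shapes are illustrative.
import torch
from torchvision.ops import roi_pool, ps_roi_pool

feat = torch.randn(1, 490, 64, 64)  # backbone feature map, C = 490 = 10 * 7 * 7
rois = torch.tensor([[0., 8., 8., 40., 40.]])  # (batch_index, x1, y1, x2, y2)

# RoI pooling keeps all C channels in every bin -> global features of the region.
g = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0)     # (1, 490, 7, 7)

# PSRoI pooling averages each of the 7x7 bins from its own channel group
# (C must equal out_channels * 7 * 7) -> position-sensitive local features.
l = ps_roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0)  # (1, 10, 7, 7)
print(g.shape, l.shape)
```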
References:

[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.

[2] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object Detection via Region-based Fully Convolutional Networks," Proc. Advances in Neural Information Processing Systems, 2016.

[3] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, and H. Lu, "CoupleNet: Coupling Global Structure with Local Parts for Object Detection," Proc. IEEE International Conference on Computer Vision, 2017.

[4] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, 2015.

[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," Proc. IEEE Computer Vision and Pattern Recognition, 2017.

[6] S. Liu, D. Huang, and Y. Wang, "Receptive Field Block Net for Accurate and Fast Object Detection," Proc. European Conference on Computer Vision, 2018.

[7] X. Wang, S. Zhang, Z. Yu, L. Feng, and W. Zhang, "Scale-Equalizing Pyramid Convolution for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2020.
Summary of the Invention

The present invention proposes an object detection method based on adaptive fusion of multi-level local and global features, which makes full use of multi-scale features and of local and global features to improve detection performance. The proposed method uses multi-level PSRoI layers to extract multi-level local features and multi-level RoI layers to extract multi-level global features. On top of these multi-level features, it predicts a weight coefficient for each branch to generate the final feature map. The method can therefore learn multi-scale local and global features, which helps improve object detection performance. The technical solution is as follows:

An object detection method based on adaptive fusion of multi-level local and global features, comprising the following steps:

Step 1: Prepare a training dataset for object detection, containing training images and the corresponding object annotations; each annotation consists of the coordinates of the bounding box enclosing an object and the object's category.

Step 2: Select a backbone network for object detection, build a candidate detection window extraction network, construct the multi-level local and global feature adaptive fusion network MLGNet on top of the backbone, and define the training loss functions of the candidate window extraction network and MLGNet.

MLGNet is constructed on top of the backbone as follows:

For a given input image, the backbone network generates a deep feature map. For each candidate detection window, six branches extract multi-level local and global features. Three branches pass the feature map through three different convolutional layers to generate three position-sensitive feature maps; based on these, three different PSRoI layers extract position-sensitive feature maps of sizes 3×3, 5×5, and 7×7 for the candidate window's region of interest, and the three maps are then upsampled to a common 7×7 size by bilinear interpolation. The other three branches use a convolutional layer to generate a feature map; based on it, three different RoI layers extract global feature maps of sizes 3×3, 5×5, and 7×7 for the region of interest, which are likewise upsampled to a common 7×7 size by bilinear interpolation.

A feature adaptive fusion unit is then constructed as follows: the six 7×7 feature maps produced by the six branches are concatenated along the channel dimension to obtain the original concatenated feature map, which is processed as follows. First, a global average pooling operation produces a feature vector whose length equals the number of channels; a fully connected layer then produces a feature vector of length 6; a Sigmoid layer normalizes this length-6 vector; finally, the six normalized values are multiplied with the features of the corresponding six branches, and the results are concatenated along the channel dimension to obtain the enhanced concatenated feature map. The enhanced concatenated feature map and the original concatenated feature map are added to obtain the final output feature map.

Together, these components form the multi-level local and global feature adaptive fusion network MLGNet.

Step 3: Initialize the network parameters of each part of the detector and the hyperparameters required for training.

Step 4: Update the detector's weights with the back-propagation algorithm; after the set number of training iterations, the final detector is obtained.
Description of Drawings

Figure 1: The multi-level local and global feature adaptive fusion network (MLGNet)

Figure 2: The feature adaptive fusion unit

Figure 3: Implementation flow of the proposed method

Detailed Description
The main technical problem addressed by the present invention is how to make full use of multi-scale features and of local and global features to improve object detection performance. To this end, this patent proposes an object detection method based on adaptive fusion of multi-level local and global features. Specifically, the proposed method uses multi-level PSRoI layers to extract multi-level local features and multi-level RoI layers to extract multi-level global features. On top of these multi-level features, it predicts a weight coefficient for each branch to generate the final feature map. The method can therefore learn multi-scale local and global features, which helps improve detection performance.

We first introduce the proposed multi-level local and global feature adaptive fusion network, and then describe how to apply it to object detection.

(1) Multi-level local and global feature adaptive fusion network

Figure 1 shows the basic structure of the multi-level local and global feature adaptive fusion network (MLGNet for short). A given input image first passes through a backbone network (such as VGG16 or ResNet) to generate a deep feature map. For each candidate detection window, MLGNet uses six branches to extract multi-level local and global features. The upper three branches first pass the feature map through three different convolutional layers to generate three position-sensitive feature maps; based on these, three different PSRoI layers extract position-sensitive feature maps of sizes 3×3, 5×5, and 7×7 for the candidate window's region of interest, and the three maps are then upsampled to a common 7×7 size by bilinear interpolation. The lower three branches first use a convolutional layer to generate a feature map; based on it, three different RoI layers extract global feature maps of sizes 3×3, 5×5, and 7×7 for the region of interest, which are likewise upsampled to a common 7×7 size by bilinear interpolation. A code sketch of the six branches follows.
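The following is a minimal sketch of the six branches, assuming a PyTorch/torchvision environment; the channel sizes (a 1024-channel backbone map, 10 position-sensitive channels per PSRoI branch, 490 channels for the shared global-branch map) and the 1×1 convolutions are illustrative assumptions rather than the patented configuration.

```python
# Minimal sketch of MLGNet's six pooling branches (assumed channel sizes).
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool, ps_roi_pool

class MultiLevelBranches(nn.Module):
    def __init__(self, in_ch=1024, ps_out_ch=10, glob_ch=490):
        super().__init__()
        self.bin_sizes = (3, 5, 7)
        # One conv per PSRoI branch: output channels = ps_out_ch * k * k for bin size k.
        self.ps_convs = nn.ModuleList(
            nn.Conv2d(in_ch, ps_out_ch * k * k, 1) for k in self.bin_sizes)
        # A single conv feeds the three RoI (global) branches.
        self.glob_conv = nn.Conv2d(in_ch, glob_ch, 1)

    def forward(self, feat, rois, spatial_scale=1.0 / 16):
        # rois: (K, 5) tensor of (batch_index, x1, y1, x2, y2).
        outs = []
        for conv, k in zip(self.ps_convs, self.bin_sizes):
            x = ps_roi_pool(conv(feat), rois, (k, k), spatial_scale)
            # Bilinear upsampling to the common 7x7 grid.
            outs.append(F.interpolate(x, size=(7, 7), mode='bilinear',
                                      align_corners=False))
        g = self.glob_conv(feat)
        for k in self.bin_sizes:
            x = roi_pool(g, rois, (k, k), spatial_scale)
            outs.append(F.interpolate(x, size=(7, 7), mode='bilinear',
                                      align_corners=False))
        return outs  # six feature maps, each of shape (K, C_i, 7, 7)
```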
Finally, we concatenate the six 7×7 feature maps produced by the upper and lower branches and use the feature adaptive fusion unit to generate a new output feature map, which passes through a convolutional layer before the subsequent classification and regression tasks. Figure 2 shows the unit. Given the six concatenated feature maps, we first apply a global average pooling operation to obtain a feature vector whose length equals the number of channels, then use a fully connected layer to produce a feature vector of length 6, and normalize it with a Sigmoid layer. Finally, the six normalized values are multiplied with the corresponding branches to obtain the enhanced feature map, which is added to the input feature map of the fusion unit to obtain the new output feature. A sketch of the unit follows.
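The following is a minimal sketch of the fusion unit as described above (concatenation, global average pooling, a fully connected layer producing six gates, Sigmoid normalization, per-branch re-weighting, and the residual addition); the branch channel counts are assumptions carried over from the previous sketch.

```python
# Minimal sketch of the feature adaptive fusion unit (Figure 2).
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, branch_channels=(10, 10, 10, 490, 490, 490)):
        super().__init__()
        # One scalar gate per branch, predicted from the pooled concatenation.
        self.fc = nn.Linear(sum(branch_channels), len(branch_channels))

    def forward(self, branch_feats):              # six tensors of shape (N, C_i, 7, 7)
        x = torch.cat(branch_feats, dim=1)        # original concatenated feature map
        v = x.mean(dim=(2, 3))                    # global average pooling -> (N, sum C_i)
        w = torch.sigmoid(self.fc(v))             # normalized gates, shape (N, 6)
        enhanced = torch.cat([f * w[:, i].view(-1, 1, 1, 1)
                              for i, f in enumerate(branch_feats)], dim=1)
        return x + enhanced                       # residual addition
```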
The proposed MLGNet extracts multi-level local and global features and fuses them adaptively, which improves the expressiveness and robustness of the features used for subsequent classification and regression.
(2) Object detection based on the multi-level local and global feature adaptive fusion network

In this section, we describe how to apply the proposed MLGNet to object detection. Object detection involves two distinct phases: a training phase and a testing phase. The training phase learns the network parameters of MLGNet; the testing phase applies the trained MLGNet to an image to determine whether objects are present. We first describe the training process:

Step 1: Prepare a training dataset for object detection (such as the general-purpose MS COCO dataset). The data must contain a sufficient number of training images with the corresponding object annotations, i.e., the coordinates of each object's bounding box and its category.

Step 2: Select the backbone network of the detector, and build the candidate detection window extraction network and the proposed MLGNet. Define the training loss functions of the candidate window extraction network and MLGNet.

Step 3: Initialize the network parameters of each part of the detector and the hyperparameters required for training. Network parameters may be initialized randomly; the training hyperparameters include the number of iterations, the learning rate, the batch size, and so on.

Step 4: Update the detector's weights with the back-propagation algorithm. After the set number of training iterations, the final detector is obtained; a skeleton of this loop is sketched below.
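The following is a hedged skeleton of steps 3 and 4, with a stand-in module and dummy data in place of the real detector and dataset; the patent does not fix a specific optimizer or loss, so the SGD settings and smooth-L1 loss here are merely common choices.

```python
# Hypothetical training skeleton; the tiny Sequential model and random data
# stand in for the proposal network + MLGNet and a real detection dataset.
import torch
import torch.nn as nn

detector = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)  # step 3: hyperparameters
criterion = nn.SmoothL1Loss()            # box-regression-style loss (assumed)

detector.train()
for step in range(100):                  # the set number of training iterations
    images = torch.randn(2, 3, 64, 64)   # dummy batch
    targets = torch.randn(2, 4)          # dummy box targets
    loss = criterion(detector(images), targets)
    optimizer.zero_grad()
    loss.backward()                      # step 4: back-propagation
    optimizer.step()
```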
We then describe the testing process:

Step 1: Prepare the image to be tested. Use the candidate detection window network to extract a number of candidate detection windows, then use MLGNet to accurately classify and regress these candidates.

Step 2: Post-process the output of MLGNet with the non-maximum suppression (NMS) algorithm to generate the final detection results; a sketch of this step follows.
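NMS keeps the highest-scoring detection among heavily overlapping boxes and discards the rest. The following is a minimal sketch using torchvision's implementation, with made-up boxes and scores:

```python
# Minimal NMS post-processing sketch (illustrative values).
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],       # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.90, 0.80, 0.75])
keep = nms(boxes, scores, iou_threshold=0.5)      # indices of the kept boxes
print(keep)                                       # tensor([0, 2])
```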
Figure 3 shows the implementation flow of the proposed method; the main steps are as follows:

Step 1: Select an object detection dataset according to the target application, containing a number of images and the corresponding annotations, such as each object's location and category.

Step 2: Using the selected backbone network of the detector, build the candidate detection window extraction network and the proposed MLGNet.

Step 3: Initialize the network parameters of each part of the detector and the hyperparameters required for training.

Step 4: Update the detector's weights with the back-propagation algorithm. After the set number of training iterations, the final detector is obtained.

Step 5: Prepare the image to be tested. Use the candidate detection window network to extract a number of candidate windows, then use MLGNet to accurately classify and regress these windows.

Step 6: Post-process the network output with the non-maximum suppression algorithm to generate the final detection results.