CN114663880A - Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism - Google Patents

Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Info

Publication number
CN114663880A
CN114663880A
Authority
CN
China
Prior art keywords
rgb
dimensional
depth
target detection
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210253116.0A
Other languages
Chinese (zh)
Other versions
CN114663880B (en)
Inventor
曹原周汉
李浥东
张慧
郎丛妍
陈乃月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202210253116.0A
Publication of CN114663880A
Application granted
Publication of CN114663880B
Status: Active
Anticipated expiration


Abstract

Translated from Chinese

The invention provides a three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism. The method comprises: constructing a training set and a test set from RGB image data; constructing a three-dimensional object detection model comprising an RGB backbone network, a depth backbone network, a classifier and a regressor; training the three-dimensional object detection model with the training-set and test-set data, verifying the training effect with the test set, and obtaining a trained three-dimensional object detection model; and detecting three-dimensional objects in RGB images with the trained model. The method extracts depth structure information over the global scene from depth feature maps and organically combines it with appearance information to improve the accuracy of the three-dimensional object detection algorithm, thereby effectively detecting the category, position, size and pose of three-dimensional objects in two-dimensional RGB images.

Description

Translated from Chinese
3D Object Detection Method Based on a Multi-Level Cross-Modal Self-Attention Mechanism

Technical Field

The invention relates to the technical field of object detection, and in particular to a three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism.

Background Art

Three-dimensional object detection is an important branch of computer vision, with strong application value in many scenarios such as intelligent transportation, robot vision, 3D reconstruction, virtual reality and augmented reality. Its purpose is to recover information such as the category, position, depth, size and pose of objects in three-dimensional space. According to the type of data processed, 3D object detection techniques fall into two main classes: detection based on 2D images and detection based on point-cloud data.

Imaging a three-dimensional object is the process of mapping points in 3D space onto a 2D plane, losing depth information along the way. Detecting objects in 3D space necessarily requires the lost depth information; this is one of the main differences between 3D and 2D object detection, and also where the difficulty of 3D object detection lies. Image-based 3D object detection methods obtain depth information directly from 2D images and then detect 3D objects. The depth information is obtained mainly through geometric constraints of the 3D scene together with shape and semantic constraints of the 3D objects, among many other constraints. Because the depth information contained in a 2D image is limited, and the constraints depend heavily on the scene and the objects, the accuracy achievable by this class of methods is low.

A point cloud is the set of points in 3D space corresponding to the pixels of a 2D image. Point-cloud-based 3D object detection obtains depth information by processing the point-cloud data and can be further divided into two categories. The first processes the point cloud in 3D space directly, lifting the pixel-level operations of 2D object detection to three dimensions. Because of the added dimension, the computational complexity of such methods is high, and noise in the point cloud directly degrades their detection accuracy. The second first trains a depth prediction model on point-cloud data, uses the model to obtain 2D depth images, and then extracts depth information from the depth images for 3D object detection. Such algorithms do not operate on the point cloud directly; they reduce it to a 2D depth map, lowering the computational complexity, while the depth prediction model removes part of the point-cloud noise. They are therefore more widely used in practice.

One 3D object detection method in the prior art proceeds as follows: after the 2D depth map is obtained, since the depth prediction model trained on point-cloud data already has the ability to extract depth information, the method further trains a 3D object detection model on 2D RGB images on top of the depth prediction model. Its disadvantage is that, for a 3D detection task that obtains the category and position of targets from 2D images or video frames, directly processing 3D point-cloud data is unnecessary, and point-cloud data usually contain a large amount of noise.

Another prior-art 3D object detection method takes the 2D depth image as an independent model input, extracts depth information from the depth image with an additional model, and combines it with the 2D RGB (red, green, blue) image input for 3D object detection. Its disadvantage is that the depth information obtainable from a 2D image is very limited, and geometric constraints cannot be avoided when extracting it, so the detection accuracy of this class of algorithms is poor.

Summary of the Invention

Embodiments of the present invention provide a three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism, so as to effectively detect the category, position and pose of three-dimensional objects in two-dimensional RGB images.

To achieve the above purpose, the present invention adopts the following technical solutions.

A three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism, comprising:

constructing training-set and test-set data from RGB image data;

constructing a three-dimensional object detection model comprising an RGB backbone network, a depth backbone network, a classifier and a regressor;

training the three-dimensional object detection model with the training-set and test-set data, and verifying its training effect with the test set, wherein the RGB backbone network and the depth backbone network extract RGB features and depth features respectively, the RGB features and depth features are input into a cross-modal self-attention learning module, the RGB features are updated, and the updated RGB features are used to learn the classifier and the regressor, yielding a trained three-dimensional object detection model;

using the classifier and the regressor of the trained three-dimensional object detection model to detect the category, position and pose of three-dimensional objects in a two-dimensional RGB image to be recognized.

Preferably, constructing the training-set and test-set data from RGB image data comprises:

collecting RGB images, dividing them into a training set and a test set at a ratio of about 1:1, normalizing the image data of both sets, obtaining two-dimensional depth images of the training-set images with a depth estimation algorithm, and annotating the categories of the objects in the training-set images together with the coordinates of their two-dimensional detection boxes and the center position, size and orientation angle of their three-dimensional detection boxes.

Preferably, the RGB backbone network, depth backbone network, classifier and regressor of the three-dimensional object detection model all comprise convolutional, fully connected and normalization layers; the RGB backbone network and the depth backbone network have identical structures, each containing 4 convolutional modules.

Preferably, training the three-dimensional object detection model with the training-set and test-set data, wherein the RGB backbone network and depth backbone network extract RGB features and depth features respectively, the RGB features and depth features are input into the cross-modal self-attention learning module, the RGB features are updated, and the updated RGB features are used to learn the classifier and regressor to obtain a trained three-dimensional object detection model, comprises:

Step S3-1: initializing the parameters of the convolutional, fully connected and normalization layers of the RGB backbone network, depth backbone network, classifier and regressor of the three-dimensional object detection model;

Step S3-2: setting the training parameters of the stochastic gradient descent algorithm, including learning rate, momentum, batch size and number of iterations;

Step S3-3: for each iteration batch, feeding all RGB images and depth maps into the RGB backbone network and depth backbone network respectively to obtain multi-level RGB features and depth features; constructing the cross-modal self-attention learning module; inputting the RGB and depth features into the module; learning a self-attention matrix based on depth information; updating the RGB features with the self-attention matrix; learning the classifier and regressor with the updated RGB features; and using the classifier and regressor for object detection of three-dimensional objects in two-dimensional RGB images;

obtaining the objective function values by computing the error between the network estimates and the ground-truth annotations, and computing three objective function values with formulas (1), (2) and (3) respectively:

[Formula (1): classification objective computed from the category labels si and the estimated probabilities pi]

[Formula (2): two-dimensional box regression objective computed from the estimated two-dimensional boxes and their ground-truth values]

[Formula (3): three-dimensional box regression objective computed from the estimated three-dimensional boxes and their ground-truth values]

where si and pi in formula (1) are the category label and estimated probability of the i-th target, the boxes in formulas (2) and (3) are the two-dimensional and three-dimensional estimated boxes of the i-th target, gt denotes the ground-truth annotation, and N is the total number of targets;

Step S3-4: adding the three objective function values to obtain the total objective function value, taking partial derivatives with respect to all parameters of the three-dimensional object detection model, and updating the parameters by stochastic gradient descent;

Step S3-5: repeating steps S3-3 and S3-4, continuously updating the parameters of the three-dimensional object detection model until convergence, and outputting the parameters of the trained model.

Preferably, inputting the RGB features and depth features into the cross-modal self-attention learning module, updating the RGB features, and learning the classifier and regressor with the updated RGB features to obtain a trained three-dimensional object detection model, comprises:

for any two-dimensional RGB feature map R and two-dimensional depth feature map D, assuming their dimension is C×H×W, where C, H and W are the channel dimension, height and width respectively, representing R and D as sets of N C-dimensional features, R = [r1, r2, ..., rN]^T and D = [d1, d2, ..., dN]^T, where N = H×W;

constructing a fully connected graph for the input feature map R, in which each feature ri is a node and each edge (ri, rj) represents the relation between nodes ri and rj, learning the edges from the two-dimensional depth feature map D, and updating the current two-dimensional RGB feature map R, specifically expressed as:

    r̃_i = (1/C(x)) · Σ_{∀j} δ(d_θ(i)^T d_φ(j)) · r_g(j)        (4)

where C(x) is a normalization parameter, δ is the softmax function, j ranges over all positions related to i, and r̃_i is the updated RGB feature; writing the above formula as a matrix product:

    R̃ = δ(D_θ D_φ^T) R_g = A(X) R_g        (5)

where A(X) = δ(D_θ D_φ^T) is the self-attention matrix, and D_θ, D_φ and R_g all have dimension N×C′;

treating the feature vector ri of each spatial position as a node and searching the entire spatial region for the nodes associated with ri; for any node i in the depth feature map, sampling S representative features from all nodes related to i:

    [s(1), s(2), ..., s(S)] = F_s(d(i))        (6)

where s(n) is a sampled feature vector of dimension C′ and F_s is the sampling function; the cross-modal self-attention learning module is then expressed as:

    r̃_i = Σ_{n=1}^{S} δ(d_θ(i)^T s_φ(n)) · s_g(n)        (7)

where n indexes the sampled nodes related to i, δ is the softmax function, d_θ(i) = W_θ d(i), s_φ(n) = W_φ s(n), s_g(n) = W_g s(n), and W_θ, W_φ and W_g are the transformation matrices of three linear transformations.

The technical solutions provided by the above embodiments show that the embodiments of the present invention provide a multi-level cross-modal self-attention learning mechanism for 3D object detection, which extracts depth structure information over the global scene from depth feature maps and organically combines it with appearance information to improve the accuracy of the 3D object detection algorithm. Several strategies are also adopted to reduce the computational complexity, so as to meet the processing-speed requirements of scenarios such as autonomous driving.

Additional aspects and advantages of the present invention will be set forth in part in the description below; they will become apparent from that description, or may be learned by practice of the present invention.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a three-dimensional object detection method based on a cross-modal self-attention mechanism according to an embodiment of the present invention.

Fig. 2 is a structural diagram of a three-dimensional object detection model according to an embodiment of the present invention.

Fig. 3 is a training flowchart of a three-dimensional object detection model according to an embodiment of the present invention.

Fig. 4 is a structural diagram of a cross-modal self-attention module according to an embodiment of the present invention.

Detailed Description of the Embodiments

Embodiments of the present invention are described in detail below; examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary and are only intended to explain the present invention; they are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Furthermore, "connected" or "coupled" as used herein may include wireless connection or coupling. The expression "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.

To facilitate understanding of the embodiments of the present invention, several specific embodiments are further explained below with reference to the drawings; the individual embodiments do not constitute limitations of the embodiments of the present invention.

In view of the main shortcomings of current 3D object detection algorithms, the present invention obtains depth information from a 2D depth map and formalizes the use of the depth information as a cross-modal self-attention learning problem. Depth information is combined with appearance information through the cross-modal self-attention mechanism, while the self-attention learning mechanism extracts depth information over the global scene in a non-iterative manner, improving detection accuracy. When acquiring depth information, the invention also adopts many measures to further reduce the computational complexity, ensuring that it can be used in scenarios with real-time processing requirements such as autonomous driving.

The present invention proposes a three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism that takes 2D RGB images and depth images as input and, through the self-attention mechanism, combines the appearance information obtained from the 2D RGB image with the structure information obtained from the depth image, achieving accurate detection results while avoiding the high computational cost of point-cloud processing. In addition, since the self-attention mechanism acquires a large amount of redundant information along with the global structure information, the method adopts an improved self-attention mechanism: for a given region, structure information is computed only for the regions of the whole scene most correlated with it, further reducing the computation while preserving detection accuracy.

The three-dimensional object detection method based on the multi-level cross-modal self-attention mechanism according to an embodiment of the present invention comprises the following processing steps:

Dataset construction: build the training set and test set of the 3D object detection model; specifically, collect the RGB images used for training and testing, extract the depth information corresponding to the training-set images with a depth model, annotate the category, 2D coordinates, 3D coordinates, depth and size of the objects in the training images, and preprocess the image data.

3D object detection model construction: build a 3D object detection model based on convolutional neural networks, comprising the RGB image feature extraction network, the depth image feature extraction network, and the cross-modal self-attention learning network.

3D object detection model training: update the parameters of the 3D object detection model until convergence by computing the loss functions of 2D object detection and of the classification and regression of 3D object detection, using the stochastic gradient descent algorithm.

3D object detection: detect the 3D objects in the provided color images or video frames.

The processing flow of the method is shown in Fig. 1 and includes the following steps:

Step S1: construct the training set and test set. Collect RGB images and divide them into a training set and a test set at a ratio of about 1:1. Since the 3D object detection method provided by this embodiment obtains depth information from 2D depth images rather than from the point-cloud data used by traditional methods, 2D depth images are obtained for the color images of the training set with a depth estimation algorithm. In addition, for the objects in the training-set images, first annotate their categories, together with the coordinates of their 2D detection boxes and the center position, size and orientation angle of their 3D detection boxes. Finally, normalize the image data of the training set and the test set.
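The following sketch illustrates the kind of preprocessing step S1 calls for; the roughly 1:1 split comes from the text, while the function names and normalization scheme are illustrative assumptions.

```python
# Illustrative data preparation for step S1 (function names and the
# per-image normalization scheme are assumptions, not from the patent).
import random
import numpy as np

def build_splits(image_paths, ratio=0.5, seed=0):
    """Split the collected RGB images into training and test sets at ~1:1."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    k = int(len(paths) * ratio)
    return paths[:k], paths[k:]

def normalize(image):
    """Normalize an HxWx3 uint8 RGB image to zero mean and unit variance."""
    x = image.astype(np.float32) / 255.0
    return (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-6)
```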

Step S2: after the training and test sets are obtained, construct the 3D object detection model, which contains the RGB backbone network, the depth backbone network, the classifier and the regressor. The structure of the model is shown in Fig. 2. Since features must be extracted from the RGB image and the depth image separately during training, two feature extraction backbone networks are needed. In this embodiment, the RGB backbone network and the depth backbone network have identical structures, each containing 4 convolutional modules for extracting multi-level features.
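A minimal sketch of two structurally identical backbones with 4 convolutional modules each, returning multi-level features; the channel widths, strides and layer choices are assumptions, since the patent does not specify them.

```python
# Two structurally identical backbones (RGB and depth), 4 conv modules each,
# returning one feature map per module. Widths and strides are assumptions.
import torch.nn as nn

def conv_module(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    def __init__(self, c_in, widths=(64, 128, 256, 512)):
        super().__init__()
        chans = (c_in,) + widths
        self.stages = nn.ModuleList(
            conv_module(chans[i], chans[i + 1]) for i in range(4)
        )

    def forward(self, x):
        feats = []  # multi-level features, one per convolutional module
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

rgb_backbone = Backbone(3)    # three-channel RGB input
depth_backbone = Backbone(1)  # single-channel depth input
```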

Step S3: train the 3D object detection model. After the model is constructed, it can be trained on the training set obtained in step S1, and its training effect is verified with the test set. The training flow is shown in Fig. 3 and comprises the following steps:

Step S3-1: initialize the model parameters, specifically the parameters of the convolutional, fully connected and normalization layers of the RGB backbone network, depth backbone network, classifier and regressor.

Step S3-2: set the training parameters. The 3D object detection model of this embodiment is trained with SGD (stochastic gradient descent); before training, the relevant training parameters must be set, including learning rate, momentum, batch size and number of iterations.

Step S3-3: compute the objective function values. For each iteration batch, feed all RGB images and depth maps into the RGB backbone network and depth backbone network respectively to obtain multi-level features, obtain the updated RGB features through the cross-modal self-attention learning module, and then obtain the estimated category, position, pose and depth of the target objects through the classifier and regressor. Finally, the objective function values are obtained by computing the error between the network estimates and the ground-truth annotations. Three objective function values are computed when training this model:

[Formula (1): classification objective computed from the category labels si and the estimated probabilities pi]

[Formula (2): two-dimensional box regression objective computed from the estimated two-dimensional boxes and their ground-truth values]

[Formula (3): three-dimensional box regression objective computed from the estimated three-dimensional boxes and their ground-truth values]

where si and pi in formula (1) are the category label and estimated probability of the i-th target, the boxes in formulas (2) and (3) are the two-dimensional and three-dimensional estimated boxes of the i-th target, gt denotes the ground-truth annotation, and N is the total number of targets.
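Because the formula images are not reproduced above, the following is only one plausible instantiation of objectives (1)-(3) consistent with the stated variable definitions, assuming cross-entropy for the classification term and smooth-L1 for the two box regression terms.

```python
# A plausible instantiation of objectives (1)-(3): cross-entropy for
# classification, smooth-L1 for 2D/3D box regression. The patent's exact
# formulas are given only as images, so these forms are assumptions.
import torch.nn.functional as F

def detection_losses(logits, labels, box2d, box2d_gt, box3d, box3d_gt):
    l_cls = F.cross_entropy(logits, labels)   # formula (1): s_i vs. p_i
    l_2d = F.smooth_l1_loss(box2d, box2d_gt)  # formula (2): 2D boxes vs. gt
    l_3d = F.smooth_l1_loss(box3d, box3d_gt)  # formula (3): 3D boxes vs. gt
    return l_cls, l_2d, l_3d
```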

Step S3-4: add the objective function values to obtain the total objective function value, take partial derivatives with respect to all parameters of the model, and update the parameters by stochastic gradient descent.

Step S3-5: repeat steps S3-3 and S3-4, continuously updating the model parameters until convergence, and finally output the model parameters.
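A sketch of the loop of steps S3-2 to S3-5 under an assumed model and data-loader interface; it reuses detection_losses from the sketch above, and the hyperparameter values are placeholders.

```python
# Steps S3-2 to S3-5: SGD training with the three objective values summed.
# The model/loader interfaces and hyperparameters are assumptions.
import torch

def train(model, train_loader, num_epochs=30):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for epoch in range(num_epochs):
        for rgb, depth, labels, box2d_gt, box3d_gt in train_loader:
            logits, box2d, box3d = model(rgb, depth)
            l_cls, l_2d, l_3d = detection_losses(
                logits, labels, box2d, box2d_gt, box3d, box3d_gt)
            loss = l_cls + l_2d + l_3d  # total objective value (step S3-4)
            optimizer.zero_grad()
            loss.backward()   # partial derivatives w.r.t. all parameters
            optimizer.step()  # stochastic gradient descent update (step S3-4)
    return model              # parameters output after convergence (step S3-5)
```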

At this point, all parameters of the 3D object detection model of this embodiment have been obtained; all that remains is to detect the objects in the 2D images provided by the user.

Step S4: after the multi-level RGB features and depth features are obtained, construct the cross-modal self-attention learning module, which takes the RGB features and depth features as simultaneous inputs, learns a self-attention matrix based on the depth information, and updates the RGB features with this self-attention matrix, enriching the structure information in the RGB features. Finally, the updated RGB features are used to learn the classifier and regressor, which are used for object detection of 3D objects in 2D RGB images: the classifier identifies the category of a 3D object, and the regressor identifies its position and pose.

The 3D object detection model of this embodiment contains the RGB backbone network, depth backbone network, classifier and regressor. After training, the RGB backbone network has already retained the depth structure information through the cross-modal self-attention learning module. At test time, only a 2D RGB image needs to be provided; the depth backbone network is not needed to extract depth features.
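A sketch of the test-time path under an assumed interface: only the RGB backbone, classifier and regressor are invoked, and no depth input is required.

```python
# Test-time use: the depth backbone is not needed once training is done.
# The classifier/regressor heads and their inputs are assumed interfaces.
import torch

@torch.no_grad()
def detect(rgb_image, rgb_backbone, classifier, regressor):
    feats = rgb_backbone(rgb_image)  # RGB features already carry depth structure
    logits = classifier(feats[-1])   # object categories
    boxes = regressor(feats[-1])     # position and pose parameters
    return logits, boxes
```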

The cross-modal self-attention learning module of this embodiment learns depth structure information from the depth map and embeds it into the RGB image features, thereby improving the accuracy of 3D object detection. It is described in detail below.

The structure of the cross-modal self-attention learning module is shown in Fig. 4. It mainly comprises four sub-modules: a sampling-point generation module, a multi-level attention learning module, an information update module and an information fusion module. The core idea of its construction is to learn, from multi-level depth feature maps, a self-attention matrix based on depth information that reflects the structural similarity between different positions over the whole image; the RGB feature maps are updated with this self-attention matrix to obtain structure features over the whole image, ultimately improving the accuracy of 3D object detection. Fig. 4 shows two levels of depth feature maps as an example; in practice the module can be extended to more levels.

For any two-dimensional RGB feature map R and two-dimensional depth feature map D, assume their dimensions are both C×H×W, where C, H and W are the channel dimension, height and width respectively. Both R and D can be represented as sets of N C-dimensional features: R = [r1, r2, ..., rN]^T and D = [d1, d2, ..., dN]^T, where N = H×W. For the input feature map R, construct a fully connected graph in which each feature ri is a node and each edge (ri, rj) represents the relation between nodes ri and rj. In the 2D RGB feature map R, appearance features such as color and texture are prominent, while structure information such as depth is insufficient. The cross-modal self-attention learning module of this embodiment learns the edges from the 2D depth feature map D and then updates the current 2D RGB feature map R to enrich its structure features, specifically expressed as:

    r̃_i = (1/C(x)) · Σ_{∀j} δ(d_θ(i)^T d_φ(j)) · r_g(j)        (4)

where C(x) is a normalization parameter, δ is the softmax function, j ranges over all positions related to i, and r̃_i is the updated RGB feature. The above formula can further be written as a matrix product:

    R̃ = δ(D_θ D_φ^T) R_g = A(X) R_g        (5)

where A(X) = δ(D_θ D_φ^T) is the self-attention matrix, and D_θ, D_φ and R_g all have dimension N×C′.
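A minimal sketch of the single-level update of formulas (4)-(5); the linear transforms are realized as 1×1 convolutions, as described later in the text, and the class layout is an assumption.

```python
# Single-level cross-modal self-attention, formulas (4)-(5): the attention
# matrix A(X) = softmax(D_theta D_phi^T) is learned from depth features and
# applied to the (transformed) RGB features. Class layout is an assumption.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, c, c_prime):
        super().__init__()
        # 1x1 convolutions realize the linear transforms W_theta, W_phi, W_g
        self.theta = nn.Conv2d(c, c_prime, 1)
        self.phi = nn.Conv2d(c, c_prime, 1)
        self.g = nn.Conv2d(c, c_prime, 1)

    def forward(self, rgb, depth):
        b, _, h, w = rgb.shape
        d_theta = self.theta(depth).flatten(2).transpose(1, 2)  # B x N x C'
        d_phi = self.phi(depth).flatten(2).transpose(1, 2)      # B x N x C'
        r_g = self.g(rgb).flatten(2).transpose(1, 2)            # B x N x C'
        attn = torch.softmax(d_theta @ d_phi.transpose(1, 2), dim=-1)  # B x N x N
        out = attn @ r_g                                        # updated features
        return out.transpose(1, 2).reshape(b, -1, h, w)         # B x C' x H x W
```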

At this point the single-level cross-modal self-attention learning module has been constructed; it learns a self-attention matrix containing structure information from a single level of depth feature maps and updates the RGB feature map of the corresponding level. However, as the matrix product above shows, the complexity of updating the RGB feature map is O(C′×N²). For 3D object detection, especially in scenarios such as autonomous driving, the input images or video frames usually have high resolution, so computing the self-attention matrix A(X) is too time-consuming for applications with real-time processing requirements. In constructing the fully connected graph above, the feature vector ri of each spatial position is treated as a node, the nodes associated with ri are searched for over the entire spatial region, and the self-attention matrix is computed. Since the nodes associated with ri across the whole spatial region are highly redundant, the cross-modal self-attention learning module of this embodiment uses a sampling mechanism to select only the most strongly associated of the nodes related to ri and computes the self-attention matrix after removing a large number of redundant nodes. This greatly improves computational efficiency while preserving correlations over the whole spatial region. The cross-modal self-attention learning module with the sampling mechanism is described in detail below.

For any node i in the depth feature map, sample S representative features from all nodes related to i:

    [s(1), s(2), ..., s(S)] = F_s(d(i))        (6)

where s(n) is a sampled feature vector of dimension C′ and F_s is the sampling function. The cross-modal self-attention learning module of this embodiment can then be expressed as:

    r̃_i = Σ_{n=1}^{S} δ(d_θ(i)^T s_φ(n)) · s_g(n)        (7)

where n indexes the sampled nodes related to i, δ is the softmax function, d_θ(i) = W_θ d(i), s_φ(n) = W_φ s(n), s_g(n) = W_g s(n), and W_θ, W_φ and W_g are the transformation matrices of three linear transformations. By adding the sampling module, the number of nodes used in computing the self-attention matrix is reduced from N to S:

[Formula (8): the update complexity drops from O(C′×N²) to O(C′×N×S)]

With S << N, the computational complexity is greatly reduced. For example, for a feature map of spatial size 80×80, N is 6400, whereas in this embodiment the number of selected sampling points is 9.
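A sketch of the sampled attention of formula (7); producing the S sampled features per position is left to the sampler (see the deformable sampling sketch below).

```python
# Sampled attention, formula (7): each position i attends only to its S
# sampled depth features, shrinking the attention from N x N to N x S.
import torch

def sampled_attention(d_theta, s_phi, s_g):
    """d_theta: N x C'; s_phi, s_g: N x S x C' (S sampled nodes per position)."""
    # attention weights over the S samples of each position: N x S
    attn = torch.softmax((s_phi @ d_theta.unsqueeze(-1)).squeeze(-1), dim=-1)
    # weighted sum of the sampled value features: N x C'
    return (attn.unsqueeze(-1) * s_g).sum(dim=1)
```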

Borrowing the idea of deformable convolution, the present invention selects the sampling points dynamically by estimating offsets. Specifically, for a position p in the feature map, the sampling function F_s can be expressed as:

    F_s(p) = { p + Δp_n | n = 1, ..., S }        (9)

where Δp_n is the offset obtained by regression. Since the result of the convolution operation usually contains a fractional part while sampling-point coordinates must be integers, integer-coordinate values are obtained by bilinear interpolation:

    s(p_s) = Σ_t K(p_s, t) · d(t)        (10)

where p_s = p + Δp_n, t enumerates the four neighboring points of the sampling point that have integer coordinates, and K is the bilinear interpolation kernel.

In practical applications, for each node of the RGB feature map, its offsets are obtained by a linear transformation with transformation matrix W_s; the output offset dimension is 2S, i.e. the coordinate offsets along the horizontal and vertical axes. After bilinear interpolation, the S most representative nodes for each node are obtained.
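A sketch of the deformable sampling of formulas (9)-(10), with torch.nn.functional.grid_sample performing the bilinear interpolation; offset scaling and initialization details are assumptions.

```python
# Deformable sampling, formulas (9)-(10): a 1x1 convolution (W_s) regresses
# 2S offsets per position; bilinear interpolation reads the features at the
# fractional coordinates p + delta_p_n. Details beyond this are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampler(nn.Module):
    def __init__(self, c, s_points=9):
        super().__init__()
        self.s = s_points
        self.offset = nn.Conv2d(c, 2 * s_points, 1)  # W_s: 2S offsets (x, y)

    def forward(self, feat):
        b, c, h, w = feat.shape
        off = self.offset(feat).view(b, self.s, 2, h, w)  # delta_p_n per position
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).to(feat).view(1, 1, 2, h, w)
        pos = base + off                                  # p_s = p + delta_p_n
        # normalize to [-1, 1] and bilinearly interpolate (formula (10))
        gx = pos[:, :, 0] / max(w - 1, 1) * 2 - 1
        gy = pos[:, :, 1] / max(h - 1, 1) * 2 - 1
        grid = torch.stack((gx, gy), dim=-1).view(b, self.s * h, w, 2)
        samples = F.grid_sample(feat, grid, align_corners=True)  # B x C x S*H x W
        return samples.view(b, c, self.s, h, w)           # S samples per position
```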

After the most representative sampling nodes are obtained from the depth feature map and the self-attention matrix is computed, the RGB feature map can be updated. In the cross-modal self-attention learning module of this embodiment, the RGB feature map is updated with a residual-network structure, specifically expressed as:

    y_i = W_y r̃_i + r_i        (11)

where r̃_i is the RGB feature of formula (7) above, W_y is a linear transformation matrix, W_y r̃_i is the learned residual, r_i is the original input RGB feature, and y_i is the final updated RGB feature. This cross-modal self-attention learning module built on the residual-network structure can be embedded into any neural network model.
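A sketch of the residual update of formula (11), with W_y as a 1×1 convolution; the batch normalization mirrors the 1×1-convolution-plus-normalization construction described below.

```python
# Residual update, formula (11): y_i = W_y r~_i + r_i, with W_y as a 1x1
# convolution followed by batch normalization (per the text). A sketch only.
import torch.nn as nn

class ResidualUpdate(nn.Module):
    def __init__(self, c_prime, c):
        super().__init__()
        self.w_y = nn.Sequential(nn.Conv2d(c_prime, c, 1), nn.BatchNorm2d(c))

    def forward(self, attended, rgb):
        return self.w_y(attended) + rgb  # learned residual plus original feature
```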

As can be seen from the above description, constructing a single-level cross-modal self-attention learning module requires 5 linear transformation matrices in total: W_θ, W_φ and W_g of formula (7), W_y of formula (11), and the linear transformation matrix W_s used to generate the sampling points. To further reduce the number of parameters, the cross-modal self-attention learning module is constructed as a bottleneck structure, i.e. W_θ, W_φ and W_g of formula (7) are fused into a single linear transformation matrix W used to obtain d_θ, s_φ and s_g. In this way only 3 linear transformation matrices are needed to construct a single-level cross-modal self-attention learning module. All linear transformations are implemented by 1×1 convolutions, with batch normalization operations added.

As shown in Fig. 4, the cross-modal self-attention learning module of this embodiment learns self-attention matrices containing structure information from multi-level depth feature maps and updates the RGB feature maps, so the multi-level information finally needs to be fused. The fusion operation is specifically expressed as:

    y_i = Σ_j W_y^j r̃_i^j + r_i        (12)

where j enumerates all the depth levels, W_y^j is the linear transformation matrix of the corresponding level, and r̃_i^j is the updated RGB feature of the corresponding level, computed by formula (7).
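A sketch of the multi-level fusion of formula (12); resizing the per-level features to a common resolution is an assumed detail.

```python
# Multi-level fusion, formula (12): per-level updated RGB features are
# linearly transformed (1x1 conv) and summed with the original feature.
# The bilinear resize to a common resolution is an assumption.
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusion(nn.Module):
    def __init__(self, c_prime, c, num_levels):
        super().__init__()
        self.w_y = nn.ModuleList(nn.Conv2d(c_prime, c, 1) for _ in range(num_levels))

    def forward(self, updated_levels, rgb):
        out = rgb
        for w, r in zip(self.w_y, updated_levels):
            r = F.interpolate(r, size=rgb.shape[-2:], mode="bilinear",
                              align_corners=False)
            out = out + w(r)  # y_i = sum_j W_y^j r~_i^j + r_i
        return out
```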

It is worth noting that, to further reduce the computational complexity of this embodiment, the feature maps can also be grouped at the spatial and channel levels when computing the self-attention matrix. At the spatial level, a feature map of dimension C×H×W can be divided into multiple regions, each containing multiple feature vectors of dimension C×1; pooling each region lets one region act as a single node, so that matrix operations are carried out over all features of a region, greatly reducing the computational complexity. Similarly, at the channel level, all feature channels can be divided evenly into G groups, each group having feature-map dimension C′×H×W with C′ = C/G. The features of each group are computed first, and then the features of all groups are concatenated to obtain the final features. A sketch of this channel-level grouping follows; the per-group attention function stands in for whichever single-level module is in use.
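```python
# Channel grouping: split C channels into G groups of C' = C / G, apply the
# attention computation per group, and concatenate the group outputs.
import torch

def grouped_apply(feat, attention_fn, groups):
    """feat: B x C x H x W; attention_fn maps a B x C/G x H x W chunk."""
    chunks = feat.chunk(groups, dim=1)
    return torch.cat([attention_fn(c) for c in chunks], dim=1)
```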

In summary, through the cross-modal self-attention mechanism the present invention innovatively and organically combines the depth structure information obtained from the depth map with the appearance information obtained from the RGB map to achieve accurate detection results, rather than simply fusing the two kinds of information. When acquiring depth structure information, the invention considers the correlations between different positions over the global scene rather than being limited to a neighborhood, mainly thanks to the characteristics of the self-attention learning mechanism and to learning from multi-level features. Furthermore, when obtaining the correlations between different positions over the global scene, the invention performs only a single pass without iteration, so that the category, position and pose of 3D objects in 2D RGB images can be detected effectively.

When obtaining the correlations between different positions, the cross-modal self-attention mechanism of the present invention computes the self-attention matrix only for highly correlated positions, avoiding the computation of self-attention entries between a large number of redundant positions and reducing the computational complexity while preserving effectiveness. In addition, when computing the self-attention matrix, the depth features can be grouped at the channel and spatial levels to further reduce the computational complexity.

Those of ordinary skill in the art will understand that the drawings are only schematic diagrams of one embodiment, and the modules or processes in the drawings are not necessarily required for implementing the present invention.

From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, magnetic disk or optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of the present invention.

The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are basically similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that can readily occur to a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

Translated from Chinese
1. A three-dimensional object detection method based on a multi-level cross-modal self-attention mechanism, characterized by comprising:

constructing training-set and test-set data from RGB image data;

constructing a three-dimensional object detection model comprising an RGB backbone network, a depth backbone network, a classifier and a regressor;

training the three-dimensional object detection model with the training-set and test-set data, and verifying its training effect with the test set, wherein the RGB backbone network and the depth backbone network extract RGB features and depth features respectively, the RGB features and depth features are input into a cross-modal self-attention learning module, the RGB features are updated, and the updated RGB features are used to learn the classifier and the regressor, yielding a trained three-dimensional object detection model;

using the classifier and the regressor of the trained three-dimensional object detection model to detect the category, position and pose of three-dimensional objects in a two-dimensional RGB image to be recognized.

2. The method according to claim 1, characterized in that constructing the training-set and test-set data from RGB image data comprises:

collecting RGB images, dividing them into a training set and a test set at a ratio of about 1:1, normalizing the image data of both sets, obtaining two-dimensional depth images of the training-set images with a depth estimation algorithm, and annotating the categories of the objects in the training-set images together with the coordinates of their two-dimensional detection boxes and the center position, size and orientation angle of their three-dimensional detection boxes.

3. The method according to claim 2, characterized in that the RGB backbone network, depth backbone network, classifier and regressor of the three-dimensional object detection model all comprise convolutional, fully connected and normalization layers, and the RGB backbone network and depth backbone network have identical structures, each containing 4 convolutional modules.

4. The method according to claims 2 and 3, characterized in that training the three-dimensional object detection model with the training-set and test-set data, wherein the RGB backbone network and depth backbone network extract RGB features and depth features respectively, the RGB features and depth features are input into the cross-modal self-attention learning module, the RGB features are updated, and the updated RGB features are used to learn the classifier and regressor to obtain a trained three-dimensional object detection model, comprises:

Step S3-1: initializing the parameters of the convolutional, fully connected and normalization layers of the RGB backbone network, depth backbone network, classifier and regressor of the three-dimensional object detection model;

Step S3-2: setting the training parameters of the stochastic gradient descent algorithm, including learning rate, momentum, batch size and number of iterations;

Step S3-3: for each iteration batch, feeding all RGB images and depth maps into the RGB backbone network and depth backbone network respectively to obtain multi-level RGB features and depth features; constructing the cross-modal self-attention learning module; inputting the RGB and depth features into the module; learning a self-attention matrix based on depth information; updating the RGB features with the self-attention matrix; learning the classifier and regressor with the updated RGB features; and using the classifier and regressor for object detection of three-dimensional objects in two-dimensional RGB images;

obtaining the objective function values by computing the error between the network estimates and the ground-truth annotations, and computing three objective function values with formulas (1), (2) and (3) respectively:
$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N} s_i \log(p_i) \qquad (1)$$

$$L_{2d} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{smooth}_{L_1}\!\left(\hat{B}_i^{2d} - B_i^{2d,gt}\right) \qquad (2)$$

$$L_{3d} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{smooth}_{L_1}\!\left(\hat{B}_i^{3d} - B_i^{3d,gt}\right) \qquad (3)$$
where $s_i$ and $p_i$ in formula (1) are the category annotation and the estimated probability of the i-th target respectively, $\hat{B}_i^{2d}$ in formula (2) and $\hat{B}_i^{3d}$ in formula (3) denote the estimated two-dimensional box and three-dimensional box of the i-th target respectively, $gt$ denotes the ground-truth annotation, and $N$ denotes the total number of targets;
Step S3-4: summing the three objective function values to obtain the total objective function value, taking partial derivatives with respect to all parameters of the three-dimensional target detection model, and updating the parameters by stochastic gradient descent;

Step S3-5: repeating steps S3-3 and S3-4 to keep updating the parameters of the three-dimensional target detection model until convergence, and outputting the parameters of the trained three-dimensional target detection model.

5. The method according to claim 4, characterized in that feeding the RGB features and depth features into the cross-modal self-attention learning module, updating the RGB features, and learning the classifier and the regressor from the updated RGB features to obtain the trained three-dimensional target detection model comprises (a sketch of this module appears after the claims):

for any two-dimensional RGB feature map R and two-dimensional depth feature map D, assuming their dimension is C×H×W, where C, H and W are the channel dimension, height and width respectively, representing R and D as sets of N C-dimensional features, $R = [r_1, r_2, \ldots, r_N]^T$ and $D = [d_1, d_2, \ldots, d_N]^T$, where $N = H \times W$;

constructing a fully connected graph over the input feature map R, in which each feature $r_i$ is a node and the edge $(r_i, r_j)$ represents the relationship between nodes $r_i$ and $r_j$; the edges are learned from the two-dimensional depth feature map D and used to update the current two-dimensional RGB feature map R, specifically expressed as:
$$\hat{r}_i = \frac{1}{\mathcal{C}(d)} \sum_{\forall j} \delta\!\left(d_\theta(i)^T d_\phi(j)\right) r_g(j)$$

where $\mathcal{C}(d)$ is the normalization parameter, $\delta$ is the softmax function, $j$ ranges over all positions related to $i$, and $\hat{r}_i$ is the updated RGB feature; writing the above formula in matrix-multiplication form:
$$\hat{R} = \delta\!\left(D_\theta D_\phi^T\right) R_g$$
where $\delta\!\left(D_\theta D_\phi^T\right)$ is the self-attention matrix, and the dimensions of $D_\theta$, $D_\phi$ and $R_g$ are all $N \times C'$;
treating the feature matrix $r_i$ of each spatial position as a node and searching the whole spatial region for the nodes associated with $r_i$; for any node $i$ in the depth feature map, sampling $S$ representative features from all nodes related to $i$:
$$\left[s(1), s(2), \ldots, s(S)\right] = \Psi(i)$$
where $s(n)$ is a sampled feature vector of dimension $C'$ and $\Psi$ is the sampling function; the cross-modal self-attention learning module is then expressed as:
$$\hat{r}_i = \sum_{\forall n} \delta\!\left(d_\theta(i)^T s_\phi(n)\right) s_g(n)$$
where $n$ indexes the sampled nodes related to $i$, $\delta$ is the softmax function, $d_\theta(i) = W_\theta d(i)$, $s_\phi(i) = W_\phi s(i)$ and $s_g(i) = W_g s(i)$; $W_\theta$, $W_\phi$ and $W_g$ are the transformation matrices of the three linear transformations, respectively.
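A minimal data-preparation sketch for claim 2, in Python. The roughly 1:1 split, the normalization, the depth maps and the annotation targets come from the claim; the shuffling seed, the [0, 1] normalization scheme and the omitted depth-estimation model are assumptions:

```python
# Sketch of the data construction in claim 2. Only the ~1:1 split,
# normalization, training-set depth maps and 2D/3D annotations are fixed
# by the claim; everything else here is an illustrative assumption.
import random

def split_dataset(image_paths, ratio=0.5, seed=0):
    # Split the collected RGB images into training and test sets (about 1:1).
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * ratio)
    return paths[:cut], paths[cut:]

def normalize(image):
    # Normalize pixel values to [0, 1]; per-channel statistics could be used instead.
    return image / 255.0

# Depth maps for the training images would come from any monocular
# depth-estimation network; each object is annotated with its category,
# 2D box coordinates, and 3D box center, size and rotation angle.
```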
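A PyTorch skeleton of the two-branch model of claims 1 and 3. The layer widths, the number of classes, and the 11-dimensional regression layout (4 values for the 2D box, 3 for the 3D center, 3 for the size, 1 for the rotation angle) are assumptions; the claims fix only the identical four-module backbones and the layer types of the heads:

```python
# Hypothetical skeleton of the detector in claims 1 and 3; sizes are illustrative.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # One of the four convolution modules shared by both backbones:
    # convolution + normalization layers, as listed in claim 3.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    """Four-stage backbone; identical structure for the RGB and depth branches."""
    def __init__(self, c_in=3, widths=(64, 128, 256, 512)):
        super().__init__()
        stages, c = [], c_in
        for w in widths:
            stages.append(conv_block(c, w))
            c = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []                    # multi-level features, one per stage
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

class Detector3D(nn.Module):
    def __init__(self, num_classes=3, feat_dim=512):
        super().__init__()
        self.rgb_backbone = Backbone(c_in=3)
        self.depth_backbone = Backbone(c_in=1)   # depth map has one channel
        # Classifier and regressor: fully connected + normalization layers.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, num_classes))
        # 2D box (4) + 3D center (3) + 3D size (3) + rotation angle (1) = 11
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 11))

    def forward(self, rgb, depth):
        rgb_feats = self.rgb_backbone(rgb)        # multi-level RGB features
        depth_feats = self.depth_backbone(depth)  # multi-level depth features
        # The cross-modal self-attention update of claim 5 would be applied
        # at each level here; omitted in this skeleton (see the last sketch).
        f = rgb_feats[-1].mean(dim=(2, 3))        # global average pool -> B x C
        out = self.regressor(f)
        return self.classifier(f), out[:, :4], out[:, 4:]  # logits, 2D box, 3D box
```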
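A sketch of the training procedure of claim 4 (steps S3-1 to S3-5). The equation images for formulas (1)-(3) are not reproduced in the source, so a cross-entropy classification loss and smooth-L1 box losses are assumed as their standard forms; all hyperparameter values are placeholders:

```python
import torch
import torch.nn.functional as F

def total_objective(logits, labels, b2d_pred, b2d_gt, b3d_pred, b3d_gt):
    # Formula (1): classification loss over the N targets (assumed cross-entropy).
    l_cls = F.cross_entropy(logits, labels)
    # Formulas (2) and (3): 2D and 3D box losses (assumed smooth L1 vs. gt boxes).
    l_2d = F.smooth_l1_loss(b2d_pred, b2d_gt)
    l_3d = F.smooth_l1_loss(b3d_pred, b3d_gt)
    # Step S3-4: sum the three objective function values into the total objective.
    return l_cls + l_2d + l_3d

def train(model, loader, epochs=10, lr=0.01, momentum=0.9):
    # Step S3-2: stochastic gradient descent with learning rate, momentum,
    # batch size (fixed by the loader) and number of iterations.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(epochs):                      # step S3-5: repeat until convergence
        for rgb, depth, labels, b2d_gt, b3d_gt in loader:   # step S3-3: one batch
            logits, b2d, b3d = model(rgb, depth)
            loss = total_objective(logits, labels, b2d, b2d_gt, b3d, b3d_gt)
            opt.zero_grad()
            loss.backward()                      # partial derivatives w.r.t. all parameters
            opt.step()                           # SGD parameter update
```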
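A sketch of the cross-modal self-attention learning module of claim 5. Here 1×1 convolutions play the role of the linear maps $W_\theta$, $W_\phi$ and $W_g$; the sampling function $\Psi$ is unspecified in the claim, so uniform random sampling of S positions stands in for it, and drawing attention keys from the depth features and values from the RGB features is one plausible reading of the claim:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSelfAttention(nn.Module):
    # Depth-guided update of RGB features (claim 5): the attention weights are
    # computed from the depth features D and applied to values from the RGB map R.
    def __init__(self, c, c_prime=None, s_samples=64):
        super().__init__()
        c_prime = c_prime or c // 2
        self.theta = nn.Conv2d(c, c_prime, 1)  # d_theta(i) = W_theta d(i)
        self.phi = nn.Conv2d(c, c_prime, 1)    # s_phi(n)   = W_phi  s(n)
        self.g = nn.Conv2d(c, c_prime, 1)      # s_g(n)     = W_g    s(n)
        self.out = nn.Conv2d(c_prime, c, 1)    # map C' back to C channels
        self.s = s_samples

    def forward(self, rgb, depth):
        b, c, h, w = rgb.shape
        n = h * w
        d_theta = self.theta(depth).flatten(2).transpose(1, 2)  # B x N x C'
        d_phi = self.phi(depth).flatten(2).transpose(1, 2)      # B x N x C'
        r_g = self.g(rgb).flatten(2).transpose(1, 2)            # B x N x C'
        # Stand-in for the sampling function Psi: S random positions per map.
        idx = torch.randperm(n, device=rgb.device)[: min(self.s, n)]
        s_phi, s_g = d_phi[:, idx], r_g[:, idx]                 # B x S x C'
        attn = F.softmax(d_theta @ s_phi.transpose(1, 2), -1)   # B x N x S
        r_hat = (attn @ s_g).transpose(1, 2).reshape(b, -1, h, w)
        # Residual update of the RGB feature map (an assumption; the claim
        # only states that the RGB features are updated).
        return rgb + self.out(r_hat)
```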
CN202210253116.0A — Priority date: 2022-03-15 — Filing date: 2022-03-15 — Title: Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism — Status: Active — Granted publication: CN114663880B (en)

Priority Applications (1)

CN202210253116.0A — Priority date: 2022-03-15 — Filing date: 2022-03-15 — Title: Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism


Publications (2)

CN114663880A (en) — Publication date: 2022-06-24
CN114663880B (en) — Publication date: 2025-03-25

Family

ID=82029592

Family Applications (1)

CN202210253116.0A (Active; granted as CN114663880B (en)) — Priority date: 2022-03-15 — Filing date: 2022-03-15 — Title: Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism

Country Status (1)

CN — CN114663880B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party

US20160275686A1 (en)* — Priority date: 2015-03-20 — Publication date: 2016-09-22 — Kabushiki Kaisha Toshiba — Object pose recognition
CN108898630A (en)* — Priority date: 2018-06-27 — Publication date: 2018-11-27 — Tsinghua-Berkeley Shenzhen Institute (preparatory office) — Three-dimensional reconstruction method, apparatus, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party

JIAMING ZHANG et al.: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv:2203.04838, 9 March 2022, pages 1-17 *
YUANZHOUHAN CAO et al.: "CMAN: Leaning Global Structure Correlation for Monocular 3D Object Detection", IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, 31 December 2022, pages 24727-24737, XP011929604, DOI: 10.1109/TITS.2022.3205446 *
WANG Yadong et al.: "A Survey of Three-Dimensional Target Detection Based on Convolutional Neural Networks" (基于卷积神经网络的三维目标检测研究综述), Pattern Recognition and Artificial Intelligence (模式识别与人工智能), vol. 34, no. 12, 31 December 2021, pages 1103-1119 *

Cited By (4)

* Cited by examiner, † Cited by third party

CN114972958A (en)* — Priority date: 2022-07-27 — Publication date: 2022-08-30 — Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司) — Key point detection method, neural network training method, device and equipment
CN116503418A (en)* — Priority date: 2023-06-30 — Publication date: 2023-07-28 — Guizhou University (贵州大学) — Crop three-dimensional target detection method under complex scene
CN116503418B (en)* — Priority date: 2023-06-30 — Publication date: 2023-09-01 — Guizhou University (贵州大学) — Crop three-dimensional target detection method under complex scene
CN119444599A (en)* — Priority date: 2025-01-10 — Publication date: 2025-02-14 — Beijing Zhongsheng Jinyu Diagnostic Technology Co., Ltd. (北京中生金域诊断技术股份有限公司) — Color conversion method and system for reagent detection images based on attention mechanism

Also Published As

CN114663880B (en) — Publication date: 2025-03-25

Similar Documents

Publication — Title

Melekhov et al. — Dgc-net: Dense geometric correspondence network
CN110009674B (en) — A real-time calculation method of monocular image depth of field based on unsupervised deep learning
CN111627065A (en) — Visual positioning method and device and storage medium
CN107481279B (en) — Monocular video depth map calculation method
CN112084849B (en) — Image recognition method and device
CN114429555B (en) — Coarse-to-fine image dense matching method, system, device and storage medium
CN110473196A (en) — Abdominal CT image target organ registration method based on deep learning
CN114663880A (en) — Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN109086777B (en) — Saliency map refining method based on global pixel characteristics
CN113177592B (en) — Image segmentation method and device, computer equipment and storage medium
CN118521929B (en) — Small target detection method for UAV aerial photography based on improved RT-DETR network
CN113095371B (en) — A feature point matching method and system for 3D reconstruction
CN117115359A (en) — Multi-view power grid three-dimensional space data reconstruction method based on depth map fusion
CN114140623A (en) — Image feature point extraction method and system
CN113763474B (en) — Indoor monocular depth estimation method based on scene geometric constraints
CN109376641A (en) — Moving vehicle detection method based on UAV aerial video
CN114495170A (en) — Pedestrian re-identification method and system based on local suppression of self-attention
CN118230180A (en) — Remote sensing image target detection method based on multi-scale feature extraction
CN112465021A (en) — Pose track estimation method based on image frame interpolation
CN112053441A (en) — Fully automatic layout recovery method for indoor fisheye images
CN115063890A (en) — Human body posture estimation method based on two-stage weighted mean square loss function
CN114998630A (en) — Coarse-to-fine ground-to-air image registration method
CN114049541A (en) — Visual scene recognition method based on structural information feature decoupling and knowledge transfer
CN104463962B (en) — Three-dimensional scene reconstruction method based on GPS information video
CN113159158A (en) — License plate correction and reconstruction method and system based on generative adversarial network

Legal Events

Code — Title

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
