CN114663880B - Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism - Google Patents

Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism

Info

Publication number
CN114663880B
Authority
CN
China
Prior art keywords
dimensional
rgb
depth
features
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210253116.0A
Other languages
Chinese (zh)
Other versions
CN114663880A (en)
Inventor
曹原周汉
李浥东
张慧
郎丛妍
陈乃月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202210253116.0A
Publication of CN114663880A
Application granted
Publication of CN114663880B
Legal status: Active (current)
Anticipated expiration


Abstract


The present invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism. The method comprises constructing a training set and a test set using RGB image data; constructing a three-dimensional target detection model comprising an RGB backbone network, a depth backbone network, a classifier and a regressor; training the three-dimensional target detection model using the training set and test set data and verifying the training effect with the test set to obtain a trained model; and detecting three-dimensional targets in RGB images using the trained model. The method obtains depth structure information over the global scene from the depth feature map and organically combines it with the appearance information to improve the accuracy of the three-dimensional target detection algorithm, thereby effectively detecting the category, position, size and pose of the three-dimensional objects in a two-dimensional RGB image.

Description

Three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism
Technical Field
The invention relates to the technical field of target detection, and in particular to a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism.
Background
Three-dimensional target detection is an important branch of computer vision, with strong application value in scenarios such as intelligent transportation, robot vision, three-dimensional reconstruction, virtual reality and augmented reality. The purpose of three-dimensional object detection is to recover information such as the category, position, depth, size and pose of objects in three-dimensional space. According to the type of data processed, three-dimensional target detection techniques are mainly divided into two categories: detection based on two-dimensional images and detection based on point cloud data.
Imaging a three-dimensional object maps points in three-dimensional space onto a two-dimensional plane, losing the depth information. Detecting objects in three-dimensional space therefore inevitably requires recovering this missing depth information, which is one of the main differences between three-dimensional and two-dimensional object detection and also one of its main difficulties. Three-dimensional target detection methods based on two-dimensional images acquire the depth information directly from the two-dimensional image and use it to detect the three-dimensional targets; this acquisition mainly relies on geometric constraints of the three-dimensional scene and on the shape and semantic constraints of the three-dimensional objects. Because the depth information contained in a two-dimensional image is limited and these constraints depend heavily on the scene and the objects, such methods have relatively low accuracy.
A point cloud is the set of points in three-dimensional space corresponding to the pixels of a two-dimensional image, and three-dimensional target detection based on point cloud data can be further divided into two types according to how the point cloud is processed to acquire depth information. The first type processes the point cloud directly in three-dimensional space, extending the pixel-level operations of two-dimensional target detection to three dimensions. Because of the added dimension, this approach has high computational complexity, and noise in the point cloud directly degrades the detection accuracy. The second type trains a depth prediction model on the point cloud data, uses the model to produce a two-dimensional depth image, and then obtains depth information from that image for three-dimensional target detection. This approach does not operate on the point cloud directly but reduces it to a two-dimensional depth map, lowering the computational complexity; at the same time the depth prediction model can remove part of the point cloud noise, so it is widely used in practice.
Once the two-dimensional depth map is available, the depth prediction model trained on point cloud data has the ability to provide depth information, and the three-dimensional target detection model is further trained on two-dimensional RGB images on top of this depth prediction model. The drawback of this approach is that, for a three-dimensional target detection task in which the category and position of targets are obtained from two-dimensional images or video frames, directly processing the three-dimensional point cloud data is unnecessary, and the point cloud data usually contains a large amount of noise.
Another prior-art three-dimensional object detection method takes a two-dimensional depth image as a separate model input, acquires depth information from the depth image through an additional model, and performs three-dimensional object detection in combination with the two-dimensional RGB image input. Its drawback is that the depth information obtainable from the two-dimensional image is very limited and geometric constraints are inevitably used when obtaining it, so the detection accuracy of the algorithm is poor.
Disclosure of Invention
The embodiments of the invention provide a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism, which effectively detects the categories, positions and poses of three-dimensional objects in a two-dimensional RGB image.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism comprises the following steps:
constructing training set and testing set data by utilizing RGB image data;
constructing a three-dimensional target detection model, wherein the three-dimensional target detection model comprises an RGB (red, green and blue) backbone network, a depth backbone network, a classifier and a regressor;
training the three-dimensional target detection model using the training set and test set data, and verifying the training effect with the test set; the RGB backbone network and the depth backbone network respectively extract RGB features and depth features, which are input into a cross-modal self-attention learning module to update the RGB features; a classifier and a regressor are learned with the updated RGB features, yielding the trained three-dimensional target detection model;
and detecting the category, position and pose of the three-dimensional objects in the two-dimensional RGB image to be recognized by using the classifier and the regressor of the trained three-dimensional object detection model.
Preferably, the constructing training set and test set data using RGB image data includes:
collecting RGB images and dividing them into a training set and a test set at a ratio of about 1:1; normalizing the image data in the training set and the test set; obtaining a two-dimensional depth image for each training image through a depth estimation algorithm; and labeling, for the objects in the training images, the category, the coordinates of the two-dimensional detection box, and the center position, size and rotation angle of the three-dimensional detection box.
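As an illustration only, the data preparation above can be sketched in Python as follows; the normalization recipe, the label field names and the helper functions are assumptions, and the depth estimation step is assumed to be supplied by an external algorithm.

import random
import numpy as np

def normalize(image):
    # Per-channel normalization of an HxWx3 RGB image to zero mean and unit variance.
    image = image.astype(np.float32) / 255.0
    return (image - image.mean(axis=(0, 1))) / (image.std(axis=(0, 1)) + 1e-6)

def split_dataset(image_paths, seed=0):
    # Split the collected RGB images into training and test sets at a ratio of about 1:1.
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    half = len(paths) // 2
    return paths[:half], paths[half:]

# Each training image is assumed to carry annotations of this form (field names are illustrative):
example_label = {
    "category": "car",
    "box_2d": [100.0, 120.0, 220.0, 260.0],   # 2D detection-box coordinates
    "center_3d": [1.2, 1.5, 14.8],            # center position of the 3D detection box
    "size_3d": [1.6, 1.5, 3.9],               # size of the 3D detection box
    "rotation_y": 0.3,                        # rotation angle
}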
Preferably, the RGB backbone network, the depth backbone network, the classifier and the regressor in the three-dimensional target detection model each comprise convolution layers, fully connected layers and normalization layers, and the RGB backbone network and the depth backbone network have identical structures, each comprising 4 convolution modules.
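A minimal PyTorch sketch of this layout is given below; the channel widths, kernel sizes and class names are illustrative assumptions, the only property taken from the text being that the RGB and depth backbones share an identical structure of 4 convolution modules.

import torch.nn as nn

def conv_module(in_ch, out_ch):
    # One of the 4 convolution modules; the width and stride choices are assumptions.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    # Shared structure for the RGB backbone and the depth backbone: 4 convolution modules
    # whose outputs serve as the multi-level features.
    def __init__(self, in_channels):
        super().__init__()
        widths = [64, 128, 256, 512]
        chans = [in_channels] + widths
        self.stages = nn.ModuleList(conv_module(chans[i], chans[i + 1]) for i in range(4))

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)
        return features  # one feature map per level

rgb_backbone = Backbone(in_channels=3)    # RGB image input
depth_backbone = Backbone(in_channels=1)  # depth-map input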
Preferably, the training of the three-dimensional target detection model with the training set and test set data, in which the RGB backbone network and the depth backbone network respectively extract RGB features and depth features, the RGB features and depth features are input into a cross-modal self-attention learning module to update the RGB features, and a classifier and a regressor are learned with the updated RGB features to obtain the trained three-dimensional target detection model, includes:
Step S3-1: initializing the parameters of the convolution layers, fully connected layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor of the three-dimensional target detection model;
Step S3-2: setting the training parameters of the stochastic gradient descent algorithm, including the learning rate, momentum, batch size and number of iterations;
Step S3-3: for each iteration batch, inputting all RGB images and depth images into the RGB backbone network and the depth backbone network respectively to obtain multi-level RGB features and depth features; constructing a cross-modal self-attention learning module, inputting the RGB features and the depth features into it, learning a self-attention matrix based on the depth information, and updating the RGB features through this self-attention matrix; learning a classifier and a regressor with the updated RGB features, and using them for detection of the three-dimensional objects in the two-dimensional RGB image.
The objective function values are obtained by computing the errors between the network estimates and the ground-truth annotations; three objective function values are computed with formulas (1), (2) and (3):

$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N} s_i \log p_i \qquad (1)$

$L_{2d} = \frac{1}{N}\sum_{i=1}^{N} \big\| b_i^{2d} - b_i^{2d,gt} \big\| \qquad (2)$

$L_{3d} = \frac{1}{N}\sum_{i=1}^{N} \big\| b_i^{3d} - b_i^{3d,gt} \big\| \qquad (3)$

where $s_i$ and $p_i$ in formula (1) are the class annotation and the estimated probability of the i-th target, $b_i^{2d}$ and $b_i^{3d}$ in formulas (2) and (3) are the two-dimensional and three-dimensional estimated boxes of the i-th target, the superscript gt denotes the ground-truth annotation, and N is the total number of targets;
Step S3-4: adding the three objective function values to obtain the overall objective function value, computing the partial derivatives with respect to all parameters of the three-dimensional target detection model, and updating the parameters by stochastic gradient descent;
Step S3-5: repeating steps S3-3 and S3-4, continuously updating the parameters of the three-dimensional target detection model until convergence, and outputting the trained model parameters.
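A small PyTorch sketch of the three objective terms and the overall objective of step S3-4 is given below; the cross-entropy and L1 forms follow the reconstruction of formulas (1) to (3) above but remain assumptions, as do the tensor shapes.

import torch
import torch.nn.functional as F

def detection_losses(class_logits, class_targets, box2d_pred, box2d_gt, box3d_pred, box3d_gt):
    # class_logits: (N, num_classes); class_targets: (N,) integer labels;
    # box2d_*: (N, 4) 2D box coordinates; box3d_*: (N, 7) center, size and rotation (assumed layout).
    loss_cls = F.cross_entropy(class_logits, class_targets)  # formula (1)
    loss_2d = F.l1_loss(box2d_pred, box2d_gt)                # formula (2)
    loss_3d = F.l1_loss(box3d_pred, box3d_gt)                # formula (3)
    return loss_cls + loss_2d + loss_3d                      # overall objective of step S3-4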
Preferably, the inputting of the RGB features and the depth features into the cross-modal self-attention learning module to update the RGB features, and the learning of a classifier and a regressor with the updated RGB features to obtain a trained three-dimensional target detection model, include:
for any two-dimensional RGB feature map R and two-dimensional depth feature map D of size C×H×W, where C, H and W are the number of channels, the height and the width respectively, representing the two feature maps as sets of N C-dimensional features, $R = [r_1, r_2, \dots, r_N]^T$ and $D = [d_1, d_2, \dots, d_N]^T$, where N = H×W;
for the input feature map R, constructing a fully connected graph in which each feature $r_i$ is a node and the edge $(r_i, r_j)$ represents the relation between nodes $r_i$ and $r_j$; the edges are learned from the two-dimensional depth feature map D, and the current two-dimensional RGB feature map R is updated, specifically expressed as:

$\hat{r}_i = \frac{1}{\mathcal{C}(d)} \sum_{j} \delta\big(d_\theta(i)^{T} d_\phi(j)\big)\, r_g(j)$

where $\mathcal{C}(d)$ is the normalization factor, δ is the softmax function, j ranges over all positions associated with i, and $\hat{r}_i$ is the updated RGB feature; the above formula is written in the form of matrix multiplication:

$\hat{R} = \delta\big(D_\theta D_\phi^{T}\big) R_g = A(D)\, R_g$

where $A(D)$ is the self-attention matrix, and the dimensions of $D_\theta$, $D_\phi$ and $R_g$ are all N×C';
taking the feature vector $r_i$ of each spatial position as a node and searching for the nodes associated with $r_i$ over the whole spatial region; for any node i in the depth feature map, sampling S representative features from all nodes associated with i:

$s(n) = F_s\big(d(j)\big), \quad n = 1, \dots, S$

where $s(n)$ is a sampled feature vector of dimension C' and $F_s$ is the sampling function; the cross-modal self-attention learning module is then expressed as:

$\hat{r}_i = \sum_{n=1}^{S} \delta\big(d_\theta(i)^{T} s_\phi(n)\big)\, s_g(n)$

where n indexes the sampled nodes associated with i, δ is the softmax function, $d_\theta(i) = W_\theta d(i)$, $s_\phi(n) = W_\phi s(n)$, $s_g(n) = W_g s(n)$, and $W_\theta$, $W_\phi$ and $W_g$ are three linear transformation matrices.
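The sampled cross-modal update can be sketched in PyTorch as follows; the module name, tensor shapes and index-based sampling interface are assumptions, and the value path is taken here from the RGB features at the sampled positions so as to match the dense matrix form A(D)·R_g above.

import torch
import torch.nn as nn

class SampledCrossModalAttention(nn.Module):
    # Illustrative sketch: attention weights are learned from the depth features (d_theta, s_phi),
    # values are gathered from the RGB features at the S sampled positions.
    def __init__(self, channels, reduced):
        super().__init__()
        self.theta = nn.Conv2d(channels, reduced, 1)  # W_theta applied to the depth features
        self.phi = nn.Conv2d(channels, reduced, 1)    # W_phi applied to the sampled depth features
        self.g = nn.Conv2d(channels, reduced, 1)      # W_g applied to the RGB features

    def forward(self, r, d, sample_idx):
        # r, d: (B, C, H, W); sample_idx: (B, N, S) flat indices of the sampled positions per node.
        B, _, H, W = d.shape
        d_theta = self.theta(d).flatten(2).transpose(1, 2)   # (B, N, C')
        d_phi = self.phi(d).flatten(2).transpose(1, 2)       # (B, N, C')
        r_g = self.g(r).flatten(2).transpose(1, 2)           # (B, N, C')
        b = torch.arange(B, device=d.device).view(B, 1, 1)
        s_phi = d_phi[b, sample_idx]                          # (B, N, S, C') sampled keys
        s_g = r_g[b, sample_idx]                              # (B, N, S, C') sampled values
        attn = torch.softmax((d_theta.unsqueeze(2) * s_phi).sum(-1), dim=-1)  # (B, N, S)
        out = (attn.unsqueeze(-1) * s_g).sum(dim=2)           # (B, N, C') updated RGB features
        return out.transpose(1, 2).reshape(B, -1, H, W)       # back to (B, C', H, W)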
According to the technical scheme provided by the embodiments of the invention, a multi-level cross-modal self-attention learning mechanism is provided for three-dimensional target detection: depth structure information over the global scene is obtained from the depth feature maps and organically combined with appearance information to improve the accuracy of the three-dimensional target detection algorithm. Meanwhile, various strategies are adopted to reduce the computational complexity so as to meet the processing-speed requirements of autonomous driving and similar scenarios.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a three-dimensional object detection method based on a cross-modal self-attention mechanism according to an embodiment of the present invention.
Fig. 2 is a block diagram of a three-dimensional object detection model according to an embodiment of the present invention.
Fig. 3 is a training flowchart of a three-dimensional object detection model according to an embodiment of the present invention.
Fig. 4 is a block diagram of a cross-modal self-attention module according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings, which should in no way be taken to limit the embodiments of the invention.
Addressing the main shortcomings of current three-dimensional target detection algorithms, the method acquires depth information from a two-dimensional depth map and formulates the use of this depth information as a cross-modal self-attention learning problem. The depth information and the appearance information are combined through a cross-modal self-attention mechanism, and the depth information is extracted over the global range in a non-iterative manner, improving detection accuracy. When acquiring the depth information, the method also applies several measures to further reduce the computational complexity, so that it can be used in scenarios with real-time processing requirements such as autonomous driving.
The invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism. It takes a two-dimensional RGB image and a depth image as input and combines the appearance information from the RGB image with the structural information from the depth image through the self-attention mechanism, achieving accurate detection while avoiding the heavy computation of point cloud processing. In addition, because a self-attention mechanism gathers a large amount of redundant information along with the global structural information, the method adopts an improved self-attention mechanism: for a given region, structural information is computed only for the most strongly correlated regions in the whole image, further reducing the computation while preserving detection accuracy.
The three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism comprises the following processing steps:
Data set construction: building the training set and test set of the three-dimensional target detection model, specifically collecting the RGB images used for training and testing, extracting the depth information corresponding to the training images through a depth model, labeling the class, two-dimensional coordinates, three-dimensional coordinates, depth and size of the objects in the training images, and pre-processing the image data.
Three-dimensional target detection model construction: building a three-dimensional target detection model based on convolutional neural networks, specifically constructing an RGB image feature extraction network, a depth image feature extraction network and a cross-modal self-attention learning network.
Three-dimensional target detection model training: updating the parameters of the model until convergence by computing the two-dimensional detection, three-dimensional detection, classification and regression loss functions and applying a stochastic gradient descent algorithm.
Three-dimensional object detection: given a color image or video frame, detecting the three-dimensional objects in it.
The processing flow of the three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism provided by the embodiments of the invention is shown in fig. 1 and comprises the following steps:
Step S1: constructing the training set and test set. RGB images are collected and split into training and test sets at a ratio of about 1:1. Because the three-dimensional target detection method of this embodiment acquires depth information from two-dimensional depth images rather than from the point cloud data used by traditional methods, a two-dimensional depth image is obtained for each training color image through a depth estimation algorithm. In addition, the objects in the training images are annotated with the category, the coordinates of the two-dimensional detection box, and the center position, size and rotation angle of the three-dimensional detection box. Finally, the image data in the training set and the test set are normalized.
Step S2: after the training set and test set are obtained, constructing the three-dimensional target detection model, which comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. The structure of the model is shown in fig. 2. Since features must be extracted separately from the RGB images and the depth images during training, two feature extraction backbone networks are constructed. In this embodiment the RGB backbone network and the depth backbone network have the same structure, each comprising 4 convolution modules for extracting multi-level features.
Step S3: training the three-dimensional target detection model. After the model is built, it is trained on the training set obtained in step S1, and its training effect is verified on the test set. The training flow is shown in fig. 3 and comprises the following steps:
Step S3-1: initializing the model parameters, namely the parameters of the convolution layers, fully connected layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor.
Step S3-2: setting the training parameters. The three-dimensional target detection model of this embodiment is trained with SGD (stochastic gradient descent); the training parameters, including the learning rate, momentum, batch size and number of iterations, are set before training.
Step S3-3: calculating the objective function values. For each iteration batch, all RGB images and depth images are input into the RGB backbone network and the depth backbone network respectively to obtain multi-level features; the updated RGB features are obtained through the cross-modal self-attention learning module; and the estimated category, position, pose and depth of each target object are then obtained through the classifier and the regressor. Finally, the objective function values are obtained by computing the error between the network estimates and the ground-truth annotations. Three objective function values, given by formulas (1), (2) and (3) above, are calculated during training, where $s_i$ and $p_i$ in formula (1) are the class annotation and the estimated probability of the i-th target, $b_i^{2d}$ and $b_i^{3d}$ in formulas (2) and (3) are the two-dimensional and three-dimensional estimated boxes of the i-th target, gt denotes the ground-truth annotation, and N is the total number of targets.
Step S3-4: adding the objective function values to obtain the total objective function value, computing the partial derivatives with respect to all model parameters, and updating the parameters by stochastic gradient descent.
Step S3-5: repeating steps S3-3 and S3-4, continuously updating the model parameters until convergence, and finally outputting the trained model parameters.
At this point, all parameters of the three-dimensional target detection model of this embodiment have been obtained; at inference time, only the two-dimensional image provided by the user is needed to detect the objects in it.
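The training procedure of steps S3-1 to S3-5 can be summarised by the following sketch, which reuses the detection_losses helper sketched earlier; the optimizer settings, model interface and data-loader fields are illustrative assumptions.

import torch

def train(model, train_loader, epochs=50, lr=0.01, momentum=0.9):
    # Step S3-2: stochastic gradient descent with learning rate, momentum and batch size (set in the loader).
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for epoch in range(epochs):                      # Step S3-5: repeat until convergence
        for rgb, depth, targets in train_loader:     # Step S3-3: one iteration batch
            outputs = model(rgb, depth)              # backbones + cross-modal attention + classifier/regressor
            loss = detection_losses(outputs["cls"], targets["cls"],
                                    outputs["box2d"], targets["box2d"],
                                    outputs["box3d"], targets["box3d"])
            optimizer.zero_grad()
            loss.backward()                          # Step S3-4: gradients of the overall objective
            optimizer.step()                         # parameter update
    return model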
Step S4: after the multi-level RGB features and depth features are acquired, a cross-modal self-attention learning module is constructed. Taking the RGB features and the depth features as inputs simultaneously, the module learns a self-attention matrix based on the depth information and updates the RGB features through this matrix, enriching the structural information in the RGB features. Finally, the updated RGB features are used to learn a classifier and a regressor for detecting three-dimensional objects in the two-dimensional RGB image: the classifier identifies the category of a three-dimensional object, and the regressor identifies its position and pose.
The three-dimensional target detection model of this embodiment comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. After training, the RGB backbone network has already absorbed depth structure information through the cross-modal self-attention learning module. At test time, only the two-dimensional RGB image needs to be provided; depth features do not need to be extracted by the depth backbone network.
The cross-modal self-attention learning module provided by this embodiment learns depth structure information from the depth map and embeds it into the RGB image features, thereby improving the accuracy of three-dimensional target detection. A detailed description follows.
The structure of the cross-modal self-attention learning module provided by this embodiment is shown in fig. 4 and mainly comprises four sub-modules: a sampling-point generation module, a multi-level attention learning module, an information updating module and an information fusion module. The core idea is that a self-attention matrix based on depth information is learned from the multi-level depth feature maps; this matrix reflects the structural similarity between different positions over the whole image, and the RGB feature map is updated through it so as to obtain structural features over the global image range, ultimately improving the accuracy of three-dimensional target detection. Fig. 4 shows two depth-feature levels as an example; in practice this can be extended to more levels.
For any two-dimensional RGB feature map R and two-dimensional depth feature map D, assume the size is C×H×W, where C, H and W are the number of channels, the height and the width respectively. Both feature maps can be represented as sets of N C-dimensional features, $R = [r_1, r_2, \dots, r_N]^T$ and $D = [d_1, d_2, \dots, d_N]^T$, where N = H×W. For the input feature map R, a fully connected graph is constructed in which each feature $r_i$ is a node and the edge $(r_i, r_j)$ represents the relationship between nodes $r_i$ and $r_j$. In the two-dimensional RGB feature map R, appearance features such as color and texture are prominent, while structural information such as depth is insufficient. The cross-modal self-attention learning module of this embodiment learns the edges from the two-dimensional depth feature map D and then updates the current two-dimensional RGB feature map R to enrich its structural features, which can be expressed as:

$\hat{r}_i = \frac{1}{\mathcal{C}(d)} \sum_{j} \delta\big(d_\theta(i)^{T} d_\phi(j)\big)\, r_g(j) \qquad (4)$

where $\mathcal{C}(d)$ is the normalization factor, δ is the softmax function, j ranges over all positions associated with i, and $\hat{r}_i$ is the updated RGB feature. The formula can be further written in the form of matrix multiplication:

$\hat{R} = \delta\big(D_\theta D_\phi^{T}\big) R_g = A(D)\, R_g \qquad (5)$

where $A(D)$ is the self-attention matrix and the dimensions of $D_\theta$, $D_\phi$ and $R_g$ are all N×C'.
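For reference, the dense (un-sampled) single-level update of formula (5) can be sketched as follows; D_theta, D_phi and R_g are assumed to be N×C' matrices already produced by the linear transformations, and the full N×N attention matrix A(D) is formed explicitly.

import torch

def dense_cross_modal_attention(d_theta, d_phi, r_g):
    # d_theta, d_phi, r_g: (N, Cp) matrices derived from the depth and RGB feature maps.
    attn = torch.softmax(d_theta @ d_phi.t(), dim=-1)  # A(D), shape (N, N)
    return attn @ r_g                                   # updated RGB features, shape (N, Cp)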
Thus, a single-level cross-modal self-attention learning module is constructed, which learns a self-attention matrix containing structural information from a single-level depth feature map and updates the RGB feature map of the corresponding level. However, as the matrix multiplication formula above shows, the complexity of updating the RGB feature map is O(C'×N²); for three-dimensional object detection, and particularly for autonomous-driving scenes, the resolution of the input image or video frame is usually large, so computing the self-attention matrix A(D) becomes too time-consuming for applications that require real-time processing. When constructing the fully connected graph, the feature vector $r_i$ of each spatial position is regarded as a node, the nodes associated with $r_i$ are found over the whole spatial region, and the self-attention matrix is computed. Because the nodes associated with $r_i$ across the spatial region are highly redundant, the cross-modal self-attention learning module of this embodiment selects only the most strongly associated nodes through a sampling mechanism and computes the self-attention matrix after removing the large number of redundant nodes, which greatly improves efficiency while still capturing correlations over the whole spatial region. The cross-modal self-attention learning module with the sampling mechanism is described in detail below.
For any node i in the depth feature map, S representative features are sampled from all nodes associated with i:

$s(n) = F_s\big(d(j)\big), \quad n = 1, \dots, S \qquad (6)$

where $s(n)$ is a sampled feature vector of dimension C' and $F_s$ is the sampling function. The cross-modal self-attention learning module of this embodiment can then be expressed as:

$\hat{r}_i = \sum_{n=1}^{S} \delta\big(d_\theta(i)^{T} s_\phi(n)\big)\, s_g(n) \qquad (7)$

where n indexes the sampled nodes associated with i, δ is the softmax function, $d_\theta(i) = W_\theta d(i)$, $s_\phi(n) = W_\phi s(n)$, $s_g(n) = W_g s(n)$, and $W_\theta$, $W_\phi$ and $W_g$ are three linear transformation matrices. By adding the sampling module, the number of nodes involved in computing the self-attention matrix is reduced from N to S:

$O(C' \times N^2) \;\rightarrow\; O(C' \times N \times S) \qquad (8)$
with S ≪ N, so the computational complexity is greatly reduced. For example, for a feature map with spatial dimensions 80×80, N is 6400, while the number of sampling points in this embodiment is set to 9.
The invention dynamically selects the sampling points by estimating offsets, borrowing the idea of deformable convolution. Specifically, for a position p in the feature map, the sampling function $F_s$ can be expressed as:

$F_s(p) = d\big(p + \Delta p_n\big), \quad n = 1, \dots, S \qquad (9)$

where $\Delta p_n$ is the offset obtained by regression. Since the result of the convolution operation usually contains a fractional part while the coordinates of the sampling points must be integers, the feature value at the fractional location is obtained by bilinear interpolation:

$d(p_s) = \sum_{t} K(p_s, t)\, d(t) \qquad (10)$

where $p_s = p + \Delta p_n$, t runs over the four integer-coordinate points adjacent to the computed sampling point, and K is the bilinear interpolation kernel.
In practice, the offsets for each node of the RGB feature map are obtained by a linear transformation with matrix $W_s$; the output offset dimension is 2S, giving the coordinate offsets along the horizontal and vertical axes. The S most representative nodes for each node are then obtained through bilinear interpolation.
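This offset-based sampling can be sketched as follows, using a 1×1 convolution for the linear transformation W_s and grid_sample for the bilinear interpolation kernel K of formula (10); the offset scaling and tensor layout are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampler(nn.Module):
    # Illustrative sketch: a 1x1 convolution (standing in for W_s) regresses 2*S offsets per
    # position, and bilinear interpolation reads the feature map at the fractional locations.
    def __init__(self, channels, num_samples=9):
        super().__init__()
        self.num_samples = num_samples
        self.offsets = nn.Conv2d(channels, 2 * num_samples, kernel_size=1)

    def forward(self, d):
        B, C, H, W = d.shape
        S = self.num_samples
        # Reference grid p for every position, normalized to [-1, 1] as grid_sample expects.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).to(d)                          # (H, W, 2)
        # Regressed offsets Delta p_n, rescaled from pixels to normalized coordinates.
        off = self.offsets(d).view(B, S, 2, H, W).permute(0, 1, 3, 4, 2)    # (B, S, H, W, 2)
        off = off / torch.tensor([W / 2.0, H / 2.0]).to(d)
        grid = (base + off).reshape(B, S * H, W, 2)                         # p + Delta p_n
        sampled = F.grid_sample(d, grid, mode="bilinear", align_corners=True)
        return sampled.view(B, C, S, H, W)                                  # S sampled features per position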
After the most representative sampling nodes are obtained from the depth feature map and the self-attention matrix is computed, the RGB feature map can be updated. The cross-modal self-attention learning module of this embodiment adopts a residual structure for this update, which can be expressed as:

$y_i = W_y \hat{r}_i + r_i \qquad (11)$

where $\hat{r}_i$ is the RGB feature from equation (7) above, $W_y$ is a linear transformation matrix, $W_y \hat{r}_i$ is the learned residual, $r_i$ is the original input RGB feature and $y_i$ is the final updated RGB feature. A cross-modal self-attention learning module built on this residual structure can be embedded into any neural network model.
As the above description shows, constructing a single-level cross-modal self-attention learning module requires 5 linear transformation matrices: $W_\theta$, $W_\phi$ and $W_g$ in equation (7), $W_y$ in equation (11), and the matrix $W_s$ that generates the sampling points. To further reduce the number of parameters, the cross-modal self-attention learning module is built as a bottleneck structure, i.e., $W_\theta$, $W_\phi$ and $W_g$ in equation (7) are fused into a single linear transformation matrix W that produces $d_\theta$, $s_\phi$ and $s_g$. Thus only 3 linear transformation matrices are needed for a single-level module. All linear transformations are realized by 1×1 convolutions, followed by batch normalization.
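A sketch of this bottleneck variant follows, under the assumption that a single shared 1×1 convolution with batch normalization produces the C'-dimensional embedding from which d_theta, s_phi and s_g are all read.

import torch.nn as nn

class BottleneckProjection(nn.Module):
    # One shared 1x1 convolution W replaces the three matrices W_theta, W_phi and W_g:
    # the same embedding of the depth map serves as the query path d_theta and, read out
    # at the sampled positions, as s_phi and s_g.
    def __init__(self, channels, reduced):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),  # shared linear transformation W
            nn.BatchNorm2d(reduced),                      # batch normalization, as described
        )

    def forward(self, d):
        return self.proj(d)  # project once; sampling is then applied to this shared embedding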
As shown in fig. 4, the cross-modal self-attention learning module of this embodiment learns a self-attention matrix containing structural information from each level of the multi-level depth feature maps and updates the RGB feature map accordingly, so the multi-level information must finally be fused. The fusion operation can be expressed as:

$y_i = r_i + \sum_{j} W_y^{j}\, \hat{r}_i^{\,j} \qquad (12)$

where j enumerates the depth levels, $W_y^{j}$ is the linear transformation matrix of the corresponding level, and $\hat{r}_i^{\,j}$ is the updated RGB feature of that level, computed by equation (7).
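The residual update of equation (11) and the multi-level fusion of equation (12) can be sketched together as follows; the number of levels and the per-level 1×1 convolutions standing in for W_y^j are assumptions.

import torch.nn as nn

class MultiLevelFusion(nn.Module):
    # y_i = r_i + sum_j W_y^j r_hat_i^j : the updated features of each depth level are mapped
    # back to C channels and added to the original RGB feature (residual structure).
    def __init__(self, channels, reduced, num_levels=2):
        super().__init__()
        self.w_y = nn.ModuleList(nn.Conv2d(reduced, channels, kernel_size=1)
                                 for _ in range(num_levels))

    def forward(self, r, r_hat_levels):
        # r: (B, C, H, W) original RGB features; r_hat_levels: per-level updated features,
        # assumed here to share the spatial size of r.
        y = r
        for w, r_hat in zip(self.w_y, r_hat_levels):
            y = y + w(r_hat)  # learned residual of one level added to the original feature
        return y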
It should be noted that, to further reduce the computational complexity, the feature maps may additionally be grouped spatially and along the channel dimension when the self-attention matrix is computed. Spatially, a feature map of size C×H×W can be divided into several regions, each containing a number of C×1 feature vectors; by pooling each region, one region serves as a single node, so that the matrix operations act on whole regions rather than individual positions, greatly reducing the complexity. Similarly, along the channel dimension, all feature channels may be evenly divided into G groups, each group forming a feature map of size C'×H×W with C' = C/G; each group is computed separately and the grouped results are concatenated to obtain the final features.
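The spatial and channel grouping can be sketched as follows; the pooling window and the number of groups G are illustrative choices.

import torch
import torch.nn.functional as F

def spatial_group_nodes(feature_map, window=4):
    # Spatial grouping: average-pool each window x window region so that one region acts as a node.
    pooled = F.avg_pool2d(feature_map, kernel_size=window)  # (B, C, H/window, W/window)
    return pooled.flatten(2).transpose(1, 2)                # (B, num_regions, C) node features

def grouped_attention(feature_map, attention_fn, groups=4):
    # Channel grouping: split the C channels into G groups of C' = C/G, run the attention
    # computation on each group, then concatenate the grouped results.
    chunks = torch.chunk(feature_map, groups, dim=1)
    return torch.cat([attention_fn(c) for c in chunks], dim=1)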
In summary, the present invention organically combines the depth structure information obtained from the depth map with the appearance information obtained from the RGB image through the cross-modal self-attention mechanism, rather than simply fusing the two kinds of information, and thereby achieves accurate detection results. When acquiring the depth structure information, correlations between different positions are considered over the whole scene and are not restricted to a local neighbourhood, owing to the nature of the self-attention mechanism and the multi-level feature learning. In addition, the global correlations are obtained in a single pass without iteration, so the category, position and pose of the three-dimensional objects in a two-dimensional RGB image can be detected effectively.
When the cross-modal self-attention mechanism of the invention acquires the correlations between different positions, the self-attention matrix is computed only for the most relevant positions, avoiding computation over a large number of redundant positions and reducing the complexity while preserving accuracy. In addition, the depth features can be grouped along the channel and spatial dimensions when computing the self-attention matrix, further reducing the complexity.
Those of ordinary skill in the art will appreciate that the drawing is merely a schematic illustration of one embodiment and that modules or flow in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (3)

CN202210253116.0A | Priority date: 2022-03-15 | Filing date: 2022-03-15 | Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism | Active | CN114663880B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210253116.0A (CN114663880B (en)) | 2022-03-15 | 2022-03-15 | Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210253116.0A (CN114663880B (en)) | 2022-03-15 | 2022-03-15 | Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism

Publications (2)

Publication Number | Publication Date
CN114663880A (en) | 2022-06-24
CN114663880B (en) | 2025-03-25

Family

ID=82029592

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210253116.0A (Active, CN114663880B (en)) | Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism | 2022-03-15 | 2022-03-15

Country Status (1)

Country | Link
CN | CN114663880B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114972958B (en)* | 2022-07-27 | 2022-10-04 | 北京百度网讯科技有限公司 | Key point detection method, neural network training method, device and equipment
CN116503418B (en)* | 2023-06-30 | 2023-09-01 | 贵州大学 | Crop three-dimensional target detection method under complex scene
CN119444599B (en)* | 2025-01-10 | 2025-04-22 | 北京中生金域诊断技术股份有限公司 | Reagent detection picture color conversion method and system based on attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
GB2536493B (en)* | 2015-03-20 | 2020-11-18 | Toshiba Europe Ltd | Object pose recognition
CN108898630B (en)* | 2018-06-27 | 2020-12-15 | 清华-伯克利深圳学院筹备办公室 | A three-dimensional reconstruction method, apparatus, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yuanzhouhan Cao et al., "CMAN: Leaning Global Structure Correlation for Monocular 3D Object Detection," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24727-24737, 2022.*

Also Published As

Publication number | Publication date
CN114663880A (en) | 2022-06-24

Similar Documents

Publication | Publication Date | Title
CN110458939B (en)Indoor scene modeling method based on visual angle generation
CN114663880B (en) Three-dimensional object detection method based on multi-level cross-modal self-attention mechanism
CN107481279B (en)Monocular video depth map calculation method
CN111627065A (en)Visual positioning method and device and storage medium
CN117115359B (en)Multi-view power grid three-dimensional space data reconstruction method based on depth map fusion
CN110009674A (en) A real-time calculation method of monocular image depth of field based on unsupervised deep learning
CN110246181A (en)Attitude estimation model training method, Attitude estimation method and system based on anchor point
CN114663509B (en)Self-supervision monocular vision odometer method guided by key point thermodynamic diagram
CN113610905A (en) Deep learning remote sensing image registration method and application based on sub-image matching
CN113643366B (en)Multi-view three-dimensional object attitude estimation method and device
Liu et al.Rotation-invariant siamese network for low-altitude remote-sensing image registration
CN116958420A (en) A high-precision modeling method for the three-dimensional face of a digital human teacher
CN113516693A (en)Rapid and universal image registration method
CN117095033B (en)Multi-mode point cloud registration method based on image and geometric information guidance
CN115147709A (en) A 3D reconstruction method of underwater target based on deep learning
CN117523100A (en)Three-dimensional scene reconstruction method and device based on neural network and multi-view consistency
CN113159158A (en)License plate correction and reconstruction method and system based on generation countermeasure network
CN104463962B (en)Three-dimensional scene reconstruction method based on GPS information video
Su et al.Omnidirectional depth estimation with hierarchical deep network for multi-fisheye navigation systems
CN113763539B (en)Implicit function three-dimensional reconstruction method based on image and three-dimensional input
CN120088514A (en) Image feature matching model, estimation method and system based on spatial geometric constraints
CN116188550A (en)Self-supervision depth vision odometer based on geometric constraint
CN118967913A (en) A neural radiance field rendering method based on straight line constraints
CN118552615A (en) A few-view neural radiation field optimization method and system based on object pose probe
CN117455972A (en) UAV ground target positioning method based on monocular depth estimation

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
