Disclosure of Invention
An embodiment of the invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism, which is used to effectively detect the categories, positions and poses of three-dimensional objects in a two-dimensional RGB image.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism comprises the following steps:
constructing training set and testing set data by utilizing RGB image data;
constructing a three-dimensional target detection model, wherein the three-dimensional target detection model comprises an RGB backbone network, a depth backbone network, a classifier and a regressor;
training the three-dimensional target detection model by using the training set and the testing set data, verifying the training effect of the three-dimensional target detection model by using the testing set, respectively acquiring RGB features and depth features through the RGB backbone network and the depth backbone network, inputting the RGB features and the depth features into a cross-modal self-attention learning module to update the RGB features, and learning a classifier and a regressor with the updated RGB features to obtain a trained three-dimensional target detection model;
and detecting the category, the position and the pose of the three-dimensional object in the two-dimensional RGB image to be identified by using the classifier and the regressor in the trained three-dimensional target detection model.
Preferably, the constructing training set and test set data using RGB image data includes:
The method comprises the steps of collecting RGB images, dividing the RGB images into a training set and a testing set at a ratio of about 1:1, carrying out normalization processing on the image data in the training set and the testing set, obtaining two-dimensional depth images for the training set images through a depth estimation algorithm, labeling the categories of the objects in the training set images, and labeling the coordinates of the two-dimensional detection frames as well as the center position, size and rotation angle of the three-dimensional detection frames.
Preferably, the RGB backbone network, the depth backbone network, the classifier and the regressor in the three-dimensional target detection model each comprise convolution layers, full connection layers and normalization layers; the structures of the RGB backbone network and the depth backbone network are identical, each comprising 4 convolution modules.
Preferably, the training of the three-dimensional target detection model by using the training set and the testing set data, in which the RGB backbone network and the depth backbone network respectively obtain RGB features and depth features, the RGB features and the depth features are input into a cross-modal self-attention learning module to update the RGB features, and a classifier and a regressor are learned with the updated RGB features to obtain a trained three-dimensional target detection model, includes:
Step S3-1, initializing the parameters in the convolution layers, full connection layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor of the three-dimensional target detection model;
Step S3-2, setting the relevant training parameters of the stochastic gradient descent algorithm, wherein the relevant training parameters comprise the learning rate, momentum, batch size and number of iterations;
Step S3-3, for any iteration batch, respectively inputting all RGB images and depth images into the RGB backbone network and the depth backbone network to obtain multi-level RGB features and depth features, constructing a cross-modal self-attention learning module, inputting the RGB features and the depth features into the cross-modal self-attention learning module, learning a self-attention matrix based on depth information, updating the RGB features through the self-attention matrix, learning a classifier and a regressor with the updated RGB features, and using the classifier and the regressor for target detection of three-dimensional objects in a two-dimensional RGB image,
obtaining objective function values by calculating the errors between the network estimates and the actual labeled values, where three objective function values are calculated by formulas (1), (2) and (3) respectively;
wherein s_i and p_i in formula (1) are the class label and the estimated class probability of the i-th target, the quantities in formula (2) and formula (3) represent the two-dimensional and three-dimensional estimated boxes of the i-th target respectively, gt denotes the actual labeled value, and N denotes the total number of targets;
Step S3-4, adding the three objective function values to obtain an overall objective function value, calculating the partial derivatives with respect to all parameters in the three-dimensional target detection model, and updating the parameters by the stochastic gradient descent method;
Step S3-5, repeating step S3-3 and step S3-4 to continuously update the parameters of the three-dimensional target detection model until convergence, and outputting the trained parameters of the three-dimensional target detection model.
Preferably, the inputting of the RGB features and the depth features into the cross-modal self-attention learning module to update the RGB features, and obtaining a trained three-dimensional target detection model by learning a classifier and a regressor with the updated RGB features, includes:
for any two-dimensional RGB feature map R and two-dimensional depth feature map D, assuming the dimensions are C×H×W, where C, H and W are the channel dimension, height and width respectively, the two-dimensional RGB feature map R and the two-dimensional depth feature map D are represented as sets of N C-dimensional features, R = [r_1, r_2, ..., r_N]^T and D = [d_1, d_2, ..., d_N]^T, where N = H×W;
for the input feature map R, a fully connected graph is constructed, wherein each feature r_i is taken as a node and the edge (r_i, r_j) represents the relation between nodes r_i and r_j; the edges are learned from the two-dimensional depth feature map D, and the current two-dimensional RGB feature map R is updated, specifically expressed as:

$$\hat{r}_i = \frac{1}{\mathcal{C}(r)} \sum_{j} \delta\big(d_\theta(i)^{T} d_\phi(j)\big)\, r_g(j)$$

where 1/C(r) is the normalization parameter, δ is the softmax function, j enumerates all positions associated with i, and r̂_i is the updated RGB feature; the above formula is written in the form of matrix multiplication:

$$\hat{R} = A(D)\, R_g, \qquad A(D) = \delta\big(D_\theta D_\phi^{T}\big)$$

where A(D) is the self-attention matrix, and the dimensions of D_θ, D_φ and R_g are all N×C';
taking the feature vector r_i of each spatial position as a node, nodes associated with r_i are searched in the whole spatial region, and for any node i in the depth feature map, S representative features are sampled from all nodes associated with i:

$$s(n) = \mathcal{F}_s(d;\, i, n), \quad n = 1, \dots, S$$

where s(n) is a sampled feature vector of dimension C' and F_s is the sampling function; the cross-modal self-attention learning module is expressed as:

$$\hat{r}_i = \sum_{n=1}^{S} \delta\big(d_\theta(i)^{T} s_\phi(n)\big)\, s_g(n)$$

where n indexes the sampled nodes associated with i, δ is the softmax function, d_θ(i) = W_θ d(i), s_φ(n) = W_φ s(n), s_g(n) = W_g s(n), and W_θ, W_φ and W_g are three linear transformation matrices.
According to the technical scheme provided by the embodiment of the invention, a multi-level cross-modal self-attention learning mechanism for three-dimensional target detection is provided: depth structure information over the global scene range is obtained from the depth feature map and organically combined with appearance information to improve the accuracy of the three-dimensional target detection algorithm. Meanwhile, various strategies are adopted to reduce the operation complexity so as to meet the processing-speed requirements of unmanned-driving scenarios and the like.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings; these embodiments should in no way be taken to limit the embodiments of the invention.
Based on the main defects of current three-dimensional target detection algorithms, the method acquires depth information through a two-dimensional depth map and formulates the use of the depth information as a cross-modal self-attention learning problem. The depth information and the appearance information are combined through the cross-modal self-attention mechanism, and the depth information is extracted over the global range in a non-iterative manner through the self-attention mechanism, so that the detection precision is improved. When acquiring the depth information, the method also uses several measures to further reduce the operation complexity, ensuring that it can be used in scenarios with real-time processing requirements such as automatic driving.
The invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism, which takes a two-dimensional RGB image and a depth image as input and combines the appearance information acquired from the two-dimensional RGB image with the structural information acquired from the depth image through the self-attention mechanism, so as to achieve accurate detection results while avoiding the high computational cost caused by point cloud processing. In addition, since the self-attention mechanism acquires a large amount of redundant information while acquiring global structural information, the method adopts an improved self-attention mechanism: for a given region, structural information is calculated only for the globally most correlated regions, which further reduces the computation on the premise of ensuring detection precision.
The three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism comprises the following processing procedures:
Data set construction, which comprises constructing the training set and testing set of the three-dimensional target detection model, specifically including collecting RGB images used for training and testing, extracting the depth information corresponding to the training set images through a depth model, labeling the category, two-dimensional coordinates, three-dimensional coordinates, depth, size and the like of the objects in the training images, and pre-processing the image data.
Three-dimensional target detection model construction, which comprises constructing a three-dimensional target detection model based on a convolutional neural network, specifically including the construction of the RGB image feature extraction network, the depth image feature extraction network and the cross-modal self-attention learning network.
Three-dimensional target detection model training, which comprises updating the parameters of the three-dimensional target detection model until convergence by calculating the two-dimensional target detection, three-dimensional target detection, classification and regression loss functions and applying the stochastic gradient descent algorithm.
Three-dimensional object detection, which comprises providing a color image or video frame and detecting the three-dimensional objects therein.
The processing flow chart of the three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism provided by the embodiment of the invention is shown in fig. 1, and comprises the following steps:
Step S1, constructing a training set and a testing set. RGB images are collected and split into a training set and a testing set at a ratio of about 1:1. Because the three-dimensional target detection method provided by the embodiment of the invention acquires depth information through two-dimensional depth images instead of the point cloud data adopted by traditional methods, a two-dimensional depth image is obtained through a depth estimation algorithm for each color image in the training set. In addition, the objects in the training set images are labeled with their categories, the coordinates of the two-dimensional detection frame, and the center position, size and rotation angle of the three-dimensional detection frame. Finally, normalization processing is carried out on the image data in the training set and the testing set.
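For illustration, a minimal sketch of the preprocessing in step S1 is given below. The normalization statistics, split seed and placeholder depth estimator are assumptions, not values fixed by the method; the text only specifies normalization, an approximately 1:1 split, and a depth estimation algorithm.

```python
# Illustrative sketch of step S1; mean/std values, the random seed and the
# placeholder depth estimator are assumptions.
import random
import numpy as np

def normalize_image(img, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Scale an HxWx3 uint8 RGB image to [0, 1] and normalize per channel."""
    img = img.astype(np.float32) / 255.0
    return (img - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)

def split_dataset(samples, ratio=0.5, seed=0):
    """Split the collected RGB images into training and testing sets (about 1:1)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def estimate_depth(rgb_image, depth_model):
    """Placeholder for the depth estimation algorithm applied to training images."""
    return depth_model(rgb_image)  # any monocular depth estimator can be plugged in here
```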
Step S2, after the training set and the testing set are obtained, a three-dimensional target detection model is constructed, which comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. The structure of the three-dimensional target detection model provided by the embodiment of the invention is shown in fig. 2. Since features need to be extracted separately for the RGB images and the depth images during training, two feature extraction backbone networks are constructed. In the embodiment of the invention, the RGB backbone network and the depth backbone network have the same structure and each comprise 4 convolution modules for extracting multi-level features.
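A minimal PyTorch sketch of the model structure in step S2 follows for illustration. The channel widths, strides and head output sizes are assumptions; the text only fixes two structurally identical backbones of 4 convolution modules plus a classifier and a regressor.

```python
import torch.nn as nn

def conv_module(c_in, c_out):
    # one "convolution module": convolution + normalization + activation
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    """Four convolution modules; returns the multi-level feature maps."""
    def __init__(self, c_in, widths=(64, 128, 256, 512)):
        super().__init__()
        chans = [c_in] + list(widths)
        self.stages = nn.ModuleList([conv_module(chans[i], chans[i + 1]) for i in range(4)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

class Detector(nn.Module):
    def __init__(self, num_classes=3, box_dim=4 + 7):  # 2D box + 3D center/size/angle (assumed layout)
        super().__init__()
        self.rgb_backbone = Backbone(3)    # RGB backbone network
        self.depth_backbone = Backbone(1)  # depth backbone network, same structure
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        self.regressor = nn.Conv2d(512, box_dim, kernel_size=1)
```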
And step S3, training a three-dimensional target detection model. After the three-dimensional target detection model is built, the model can be trained through the training set obtained in the step S1, and the training effect of the three-dimensional target detection model is verified through the testing set. The training flow chart of the three-dimensional target detection model provided by the embodiment of the invention is shown in fig. 3, and specifically comprises the following steps:
Step S3-1, initializing the model parameters, which comprise the parameters in the convolution layers, full connection layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor.
Step S3-2, setting the training parameters. The three-dimensional target detection model of the embodiment of the invention is trained with SGD (stochastic gradient descent); the relevant training parameters, including the learning rate, momentum, batch size and number of iterations, need to be set before training.
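As a sketch of step S3-2, the SGD hyper-parameters might be set as below; the concrete numbers are placeholders, not values specified in the text, and Detector refers to the model sketch from step S2 above.

```python
import torch

model = Detector()                                                       # sketch model from step S2
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # learning rate, momentum
batch_size = 8                                                           # batch size per iteration
num_iterations = 40000                                                   # number of training iterations
```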
Step S3-3, calculating the objective function values. For any iteration batch, all RGB images and depth images are respectively input into the RGB backbone network and the depth backbone network to obtain multi-level features; updated RGB features are obtained through the cross-modal self-attention learning module, and the estimated category, position, pose and depth value of the target object are then obtained through the classifier and the regressor. Finally, the objective function values are obtained by calculating the errors between the network estimates and the actual labeled values. Three objective function values, given by formulas (1), (2) and (3), are calculated in training the model, where s_i and p_i in formula (1) are the class label and the estimated class probability of the i-th target, the quantities in formula (2) and formula (3) represent the two-dimensional and three-dimensional estimated boxes of the i-th target respectively, gt denotes the actual labeled value, and N denotes the total number of targets.
Step S3-4, adding the objective function values to obtain the overall objective function value, calculating the partial derivatives with respect to all parameters in the model, and updating the parameters by the stochastic gradient descent method.
Step S3-5, repeating step S3-3 and step S3-4, continuously updating the model parameters until convergence, and finally outputting the model parameters.
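A minimal sketch of one training iteration (steps S3-3 and S3-4) is shown below. The cross-entropy and L1 terms merely stand in for formulas (1)-(3), whose exact forms are those given by the patent's formulas, and the cross_modal_attention attribute is an assumed interface of the detector sketched above.

```python
import torch.nn.functional as F

def train_step(model, optimizer, rgb, depth, targets):
    rgb_feat = model.rgb_backbone(rgb)[-1]          # multi-level RGB features (last level shown)
    depth_feat = model.depth_backbone(depth)[-1]    # multi-level depth features
    fused = model.cross_modal_attention(rgb_feat, depth_feat)  # updated RGB features (assumed attribute)
    cls_logits = model.classifier(fused)
    boxes = model.regressor(fused)

    loss_cls = F.cross_entropy(cls_logits, targets["cls"])   # stands in for formula (1)
    loss_2d = F.l1_loss(boxes[:, :4], targets["box2d"])      # stands in for formula (2)
    loss_3d = F.l1_loss(boxes[:, 4:], targets["box3d"])      # stands in for formula (3)
    loss = loss_cls + loss_2d + loss_3d                      # overall objective of step S3-4

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # stochastic gradient descent update
    return loss.item()
```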
At this point, all parameters of the three-dimensional target detection model in the embodiment of the invention have been obtained, and what remains is to detect the objects in the two-dimensional images provided by the user.
Step S4, after the multi-level RGB features and depth features are acquired, a cross-modal self-attention learning module is constructed. Taking the RGB features and the depth features as inputs simultaneously, the module learns a self-attention matrix based on the depth information and updates the RGB features through the self-attention matrix to increase the structural information in the RGB features. Finally, the updated RGB features are used to learn a classifier and a regressor for target detection of three-dimensional objects in the two-dimensional RGB image: the classifier identifies the category of a three-dimensional object, and the regressor identifies its position and pose.
The three-dimensional target detection model in the embodiment of the invention comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. After training is finished, the RGB backbone network has retained the depth structure information learned through the cross-modal self-attention learning module. At test time, only the two-dimensional RGB image needs to be provided, and depth features do not need to be extracted by the depth backbone network.
The cross-modal self-attention learning module provided by the embodiment of the invention can acquire depth structure information by learning from the depth map and embed it into the RGB image features, thereby improving the accuracy of three-dimensional target detection. A detailed description follows.
The structural flow chart of the cross-modal self-attention learning module provided by the embodiment of the invention is shown in fig. 4; it mainly comprises four sub-modules, namely a sampling point generation module, a multi-level attention learning module, an information updating module and an information fusion module. The core idea of the construction is that a self-attention matrix based on depth information is learned from the multi-level depth feature maps; this self-attention matrix reflects the structural similarity between different positions over the global image range, and the RGB feature map is updated through the self-attention matrix so as to obtain structural features over the global image range, finally improving the accuracy of three-dimensional target detection. Two levels of depth feature maps are shown in fig. 4 as an example; in actual operation this may be extended to multi-level depth features.
For any two-dimensional RGB feature map R and two-dimensional depth feature map D, assume the dimensions are C×H×W, where C, H and W are the channel dimension, height and width, respectively. Both the two-dimensional RGB feature map R and the two-dimensional depth feature map D can be represented as a set of N C-dimensional features, R = [r_1, r_2, ..., r_N]^T and D = [d_1, d_2, ..., d_N]^T, where N = H×W. For the input feature map R, a fully connected graph is constructed in which each feature r_i serves as a node and the edge (r_i, r_j) represents the relation between nodes r_i and r_j. In the two-dimensional RGB feature map R, appearance features such as color and texture are rich, while structural information such as depth is lacking. The cross-modal self-attention learning module in the embodiment of the invention therefore learns the edges from the two-dimensional depth feature map D and then updates the current two-dimensional RGB feature map R to add structural features, which can be expressed as follows:

$$\hat{r}_i = \frac{1}{\mathcal{C}(r)} \sum_{j} \delta\big(d_\theta(i)^{T} d_\phi(j)\big)\, r_g(j) \qquad (4)$$

where 1/C(r) is the normalization parameter, δ is the softmax function, j enumerates all positions associated with i, and r̂_i is the updated RGB feature. We can further write the above formula in the form of matrix multiplication:

$$\hat{R} = A(D)\, R_g, \qquad A(D) = \delta\big(D_\theta D_\phi^{T}\big) \qquad (5)$$

where A(D) is the self-attention matrix, and the dimensions of D_θ, D_φ and R_g are all N×C'.
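The matrix form of formula (5) maps directly onto a few tensor operations. The sketch below is illustrative only, assuming the feature maps have already been flattened to N×C matrices and that W_theta, W_phi and W_g are the C×C' projection matrices defined above.

```python
import torch
import torch.nn.functional as F

def dense_cross_modal_attention(R, D, W_theta, W_phi, W_g):
    """R, D: (N, C) flattened RGB / depth features; W_*: (C, C') projections."""
    D_theta = D @ W_theta                       # (N, C')
    D_phi = D @ W_phi                           # (N, C')
    R_g = R @ W_g                               # (N, C')
    A = F.softmax(D_theta @ D_phi.t(), dim=-1)  # (N, N) self-attention matrix A(D)
    return A @ R_g                              # updated RGB features, (N, C')
```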
Thus, a single-level cross-modal self-attention learning module is constructed, which can learn a self-attention matrix containing structural information from a single-level depth feature map and update the RGB feature map of the corresponding level. However, as can be seen from the matrix multiplication formula above, the complexity of updating the RGB feature map is O(C'×N²). For three-dimensional target detection, especially in unmanned-driving scenarios, the resolution of the input image or video frame is usually large, so computing the self-attention matrix A(D) becomes too time-consuming, which is unfavorable for application scenarios that require real-time processing. In constructing the fully connected graph, the feature vector r_i of each spatial position is regarded as a node, the nodes associated with r_i are found over the whole spatial region, and the self-attention matrix is calculated. Because the nodes associated with r_i over the whole spatial region are highly overlapping, the cross-modal self-attention learning module in the embodiment of the invention selects, through a sampling mechanism, only the nodes with the highest association degree among the nodes associated with r_i, and calculates the self-attention matrix after removing a large number of redundant nodes, thereby greatly improving the operation efficiency while still preserving the correlations over the whole spatial region. The cross-modal self-attention learning module including the sampling mechanism is described in detail below.
For any node i in the depth feature map, S representative features are sampled from all nodes associated with i:

$$s(n) = \mathcal{F}_s(d;\, i, n), \quad n = 1, \dots, S \qquad (6)$$

where s(n) is a sampled feature vector of dimension C' and F_s is the sampling function. Thus, the cross-modal self-attention learning module in the embodiment of the present invention can be expressed as:

$$\hat{r}_i = \sum_{n=1}^{S} \delta\big(d_\theta(i)^{T} s_\phi(n)\big)\, s_g(n) \qquad (7)$$

where n indexes the sampled nodes associated with i, δ is the softmax function, d_θ(i) = W_θ d(i), s_φ(n) = W_φ s(n), s_g(n) = W_g s(n), and W_θ, W_φ and W_g are three linear transformation matrices. By adding the sampling module, the number of nodes involved in computing the self-attention matrix is reduced from N to S, and the complexity of updating the RGB feature map drops accordingly:

$$O(C' \times N^{2}) \;\rightarrow\; O(C' \times N \times S), \qquad S \ll N \qquad (8)$$

so the operation complexity is greatly reduced. For example, for a feature map with a spatial dimension of 80×80, N is 6400, whereas in the embodiment of the present invention the number of sampling points S is chosen to be 9.
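A sketch of the sampled attention of formula (7) follows: attention at each position i is computed over its S sampled features only, so the cost scales with N×S rather than N². The (N, S, C) layout of the sampled features is an assumption, and both keys and values are taken from the same sampled tensor, following the notation of the text; producing those samples is sketched separately below.

```python
import torch
import torch.nn.functional as F

def sampled_cross_modal_attention(D, S_feats, W_theta, W_phi, W_g):
    """D: (N, C) depth features; S_feats: (N, S, C) sampled features per position;
    W_*: (C, C') projection matrices."""
    d_theta = D @ W_theta                                # (N, C')
    s_phi = S_feats @ W_phi                              # (N, S, C')
    s_g = S_feats @ W_g                                  # (N, S, C')
    logits = torch.einsum("nc,nsc->ns", d_theta, s_phi)  # similarity of position i to its S samples
    attn = F.softmax(logits, dim=-1)                     # (N, S)
    return torch.einsum("ns,nsc->nc", attn, s_g)         # aggregated features, (N, C')
```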
The invention dynamically selects sampling points by estimating offsets, borrowing the idea of deformable convolution. Specifically, for a certain position p in the feature map, the sampling function can be expressed as:

$$\mathcal{F}_s(d;\, p, n) = d(p + \Delta p_n) \qquad (9)$$

where Δp_n is the offset obtained by regression. Since the result of the convolution operation usually contains a fractional part while the coordinates of sampling points must be integers, the sampled value is obtained by bilinear interpolation over integer coordinates:

$$d(p_s) = \sum_{t} K(p_s, t)\, d(t), \qquad p_s = p + \Delta p_n \qquad (10)$$

where t enumerates the four integer-coordinate neighbors of the computed sampling point p_s and K is the bilinear interpolation kernel.
In practical application, for each node in the RGB feature map, its offsets are obtained by a linear transformation with transformation matrix W_s; the output offset dimension is 2S, corresponding to the offsets of the coordinates along the horizontal and vertical axes. Through bilinear interpolation, the S most representative nodes are obtained for each node.
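The sampling-point generation might be sketched as below in the spirit of deformable convolution: a 1×1 convolution plays the role of W_s and predicts 2S offsets per position, while torch.nn.functional.grid_sample performs the bilinear interpolation of formula (10). The coordinate normalization and the choice of which map drives the offsets versus which map is sampled are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplePoints(nn.Module):
    """Predict 2S offsets per position (via W_s) and bilinearly sample S features."""
    def __init__(self, channels, num_samples=9):
        super().__init__()
        self.num_samples = num_samples
        self.offset_conv = nn.Conv2d(channels, 2 * num_samples, kernel_size=1)  # plays the role of W_s

    def forward(self, guide_feat, value_feat):
        # guide_feat: (B, C, H, W) features from which the offsets are regressed
        # value_feat: (B, C, H, W) features that are read at p + delta_p_n
        B, C, H, W = value_feat.shape
        offsets = self.offset_conv(guide_feat).view(B, self.num_samples, 2, H, W)
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys)).float().to(value_feat.device)   # (2, H, W) pixel coordinates
        coords = base.unsqueeze(0).unsqueeze(1) + offsets            # (B, S, 2, H, W)
        # normalize absolute coordinates to [-1, 1] for grid_sample
        gx = coords[:, :, 0] / max(W - 1, 1) * 2 - 1
        gy = coords[:, :, 1] / max(H - 1, 1) * 2 - 1
        grid = torch.stack((gx, gy), dim=-1).flatten(1, 2)           # (B, S*H, W, 2)
        sampled = F.grid_sample(value_feat, grid, align_corners=True)  # bilinear kernel K
        return sampled.view(B, C, self.num_samples, H, W)            # S samples per position
```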
After the most representative sampling nodes are obtained through the depth feature map and the self-attention matrix is calculated, the RGB feature map can be updated. In the cross-modal self-attention learning module of the embodiment of the invention, the residual network structure is adopted to update the RGB feature map, which can be expressed as:

$$y_i = W_y\, \hat{r}_i + r_i \qquad (11)$$

where r̂_i is the updated RGB feature of equation (7), W_y is a linear transformation matrix, W_y r̂_i is the learned residual, r_i is the original input RGB feature, and y_i is the final updated RGB feature. The cross-modal self-attention learning module constructed on this residual network structure can be embedded into any neural network model.
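A sketch of the residual update of formula (11): the attended feature is passed through a 1×1 convolution standing in for W_y and added back to the original RGB feature, so the module can be dropped into an existing network. The zero initialization is a common convention assumed here, not something the text prescribes.

```python
import torch.nn as nn

class ResidualFuse(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.W_y = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.W_y.weight)  # start as an identity mapping (assumption)
        nn.init.zeros_(self.W_y.bias)

    def forward(self, r, r_hat):
        # y = W_y * r_hat + r, as in formula (11)
        return r + self.W_y(r_hat)
```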
As can be seen from the above description, constructing a single-level cross-modal self-attention learning module requires 5 linear transformation matrices: W_θ, W_φ and W_g in equation (7), W_y in equation (11), and the linear transformation matrix W_s used to generate the sampling points. To further reduce the number of parameters, the cross-modal self-attention learning module is constructed as a bottleneck structure, i.e., W_θ, W_φ and W_g in equation (7) are fused into a single linear transformation matrix W used to obtain d_θ, s_φ and s_g. Thus, only 3 linear transformation matrices are needed to construct a single-level cross-modal self-attention learning module. All linear transformations are realized by 1×1 convolutions, followed by batch normalization operations.
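A sketch of the bottleneck variant: one 1×1 convolution with batch normalization plays the role of the fused matrix W, and its output is reused as d_θ, s_φ and s_g. The reduced channel width is an assumption.

```python
import torch.nn as nn

class SharedProjection(nn.Module):
    """Single fused projection W realized as a 1x1 convolution + batch normalization."""
    def __init__(self, c_in, c_reduced):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(c_in, c_reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_reduced),
        )

    def forward(self, x):
        # the same projected tensor is later read as d_theta (queries) and,
        # after sampling, as s_phi (keys) and s_g (values)
        return self.proj(x)
```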
As shown in fig. 4, the cross-modal self-attention learning module in the embodiment of the present invention can learn self-attention matrices containing structural information from the multi-level depth feature maps and update the RGB feature map accordingly, so the multi-level information finally needs to be fused. The fusion operation can be expressed as:

$$y_i = \sum_{j} W_y^{j}\, \hat{r}_i^{j} + r_i \qquad (12)$$

where j enumerates all of the depth levels, W_y^j is the linear transformation matrix of the corresponding level, and r̂_i^j is the RGB feature updated at the corresponding level, which can be calculated by equation (7).
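A sketch of the multi-level fusion of formula (12): each level's updated feature is transformed by its own W_y^j and summed onto the original RGB feature. It assumes the per-level updated features have already been brought to a common spatial resolution; the number of levels is illustrative.

```python
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    def __init__(self, channels, num_levels=2):
        super().__init__()
        self.W_y = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, r, r_hats):
        # r: original RGB feature map; r_hats: list of per-level updated features
        out = r
        for W_j, r_hat_j in zip(self.W_y, r_hats):
            out = out + W_j(r_hat_j)   # y_i = sum_j W_y^j r_hat_i^j + r_i
        return out
```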
It should be noted that, to further reduce the operation complexity, the embodiments of the present invention may additionally group the feature maps at the spatial and channel levels when calculating the self-attention matrix. At the spatial level, a feature map of dimension C×H×W can be divided into several regions, each containing several feature vectors of dimension C×1; by applying a pooling operation to each region, one region can be treated as a single node, so that the matrix operation is carried out over regions rather than over all individual features, greatly reducing the operation complexity. Similarly, at the channel level, all feature channels can be equally divided into G groups, each group having a feature map of dimension C'×H×W, where C' = C/G. The attention is first computed for each group of features, and all grouped features are then concatenated together to obtain the final features, as sketched below.
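A sketch of the channel-grouping trick: the C channels are split into G groups of C' = C/G channels, attention is computed per group, and the grouped results are concatenated. The attention_fn argument and the group count are assumptions used only for illustration.

```python
import torch

def grouped_attention(R, D, attention_fn, groups=4):
    """R, D: (N, C) RGB / depth features with C divisible by `groups`;
    attention_fn maps an (N, C') RGB group and an (N, C') depth group to (N, C')."""
    R_groups = R.chunk(groups, dim=1)
    D_groups = D.chunk(groups, dim=1)
    outs = [attention_fn(r_g, d_g) for r_g, d_g in zip(R_groups, D_groups)]
    return torch.cat(outs, dim=1)  # concatenate the grouped features
```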
In summary, the present invention organically combines the depth structure information obtained from the depth map with the appearance information obtained from the RGB image through the cross-modal self-attention mechanism, rather than simply fusing the two kinds of information, so as to achieve accurate detection results. When acquiring the depth structure information, the correlation between different positions is considered over the global scene range rather than being limited to a neighborhood. This is mainly due to the nature of the self-attention learning mechanism and the way learning is performed with multi-level features. In addition, when obtaining the correlations between different positions over the global scene range, only a single pass is performed without iteration, so that the category, position and pose of three-dimensional objects in a two-dimensional RGB image can be detected effectively.
When the cross-modal self-attention mechanism in the invention acquires the correlations between different positions, the self-attention matrix is calculated only for the positions with high correlation, so that computing the self-attention matrix over a large number of redundant positions is avoided and the operation complexity is reduced while the effect is preserved. In addition, the depth features can be grouped along the channel and spatial dimensions when calculating the self-attention matrix to further reduce the operation complexity.
Those of ordinary skill in the art will appreciate that the drawing is merely a schematic illustration of one embodiment and that modules or flow in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.