Disclosure of Invention
The invention aims to overcome the defects and shortcomings of conventional SLAM visual odometry schemes and provides a mobile robot visual odometry design method for dynamic scenes that combines semantic information and geometric constraints; the method offers high mapping accuracy, strong robustness, and wide adaptability.
The technical solution that realizes the purpose of the invention is a visual odometry design method based on ORB-SLAM2 for dynamic scenes, comprising the following steps:
Step S1, an INTEL REALSENSE depth camera acquires real-time image information, which is preprocessed (grayscale conversion and similar operations);
Step S2, extracting ORB feature points from the image with an adaptive-threshold method based on grid division;
Step S3, constructing a YOLACT network, training it on the MS_COCO dataset, and performing instance segmentation on the image to obtain image semantic information;
Step S4, detecting motion consistency by combining the image semantic information with the L-K optical flow method, and coarsely filtering dynamic feature points;
Step S5, matching the feature points of two image frames with the PROSAC method and estimating a fundamental matrix F;
Step S6, calculating the epipolar distance according to the fundamental matrix F to finely filter the remaining dynamic feature points, estimating the initial pose of the robot from the selected keyframes, and optimizing the result with Bundle Adjustment.
Further, step S1 comprises the steps of:
Step S11, selecting an INTEL REALSENSE D415 depth camera, which can simultaneously acquire image depth information and color information, as the vision sensor, and preprocessing the acquired real-time image before feature point extraction to speed up system operation; the preprocessing includes removing image noise, grayscale conversion, and similar operations.
Further, step S2 includes the steps of:
Step S21, to ensure scale invariance of the feature points, so that the same image can still be matched to its corresponding feature points after scaling, a scale pyramid is first constructed from the input image and ORB feature points are computed at each scale. The pyramid takes the original camera image as layer 0 and reduces it layer by layer according to the scale factor until the top layer of the pyramid is reached.
Step S22, to make the distribution of extracted feature points more uniform and reasonable, and thereby capture more comprehensive image information, each pyramid layer is divided into a grid with a given number of rows and columns, and the number of feature points to pre-extract in each cell and an initial FAST corner extraction threshold are set;
Step S23, after the first pre-extraction with the initial threshold, if the number of feature points actually extracted in a grid cell is smaller than the set pre-extraction number, the threshold is adjusted and extraction continues; this process is repeated until the adaptive extraction of feature points in every cell is complete.
Further, step S3 includes the steps of:
Step S31, constructing the YOLACT backbone network from ResNet-101 convolution modules; the backbone is mainly responsible for image feature extraction;
Step S32, constructing a Feature Pyramid Network (FPN) to generate multi-scale feature maps, so that target objects of different sizes can be detected;
Step S33, constructing the Protonet branch, which generates prototype masks used to extract the important parts of the image to be processed;
Step S34, constructing the Prediction Head branch, which generates mask coefficients; a shared convolution network is used to improve real-time segmentation performance, and this step runs in parallel with step S33;
Step S35, training the YOLACT network using 18 common classes of indoor household objects from the MS_COCO (Microsoft Common Objects in Context) dataset as training samples.
Further, step S4 includes the steps of:
Step S41, the pixel velocity of a potentially dynamic object is tracked and computed with the L-K optical flow method, and the dynamics of the object are analyzed by combining the optical flow result with the semantic information: if the object moves relative to the background, the feature points it contains in the current frame are removed; if the change in the object's velocity vector relative to the previous frame is below a threshold, the related feature points are retained. This completes the coarse filtering of dynamic feature points.
Further, step S5 includes the steps of:
Step S51, calculating the minimum Euclidean distance and the evaluation function value of each matched feature point pair of the two image frames, and sorting the pairs in descending order of evaluation function value. The pairs are taken eight at a time, the quality of each group is computed and the groups are ranked; the group of eight matched pairs with the highest matching quality is selected and used to calculate a fundamental matrix F;
Step S52, after these eight matched pairs are removed from the subset, the corresponding projection points of the remaining feature points in the subset are calculated according to the fundamental matrix;
Step S53, calculating the error between each remaining feature point and its projection point; points whose error is smaller than the set value are counted as inliers. After the inlier count is updated, the fundamental matrix F is recalculated and a new inlier set is obtained. If the number of iterations does not exceed the maximum, F and the inlier set are returned; otherwise model estimation fails;
Step S54, calculating the epipolar distance; if the epipolar distance exceeds a set threshold, the error is considered large and the corresponding point belongs to a dynamic object that should be removed.
Further, step S6 includes the steps of:
Step S61, using four non-coplanar virtual control points to express any point on the target object in the world coordinate system as a weighted linear combination, its coordinates in the camera coordinate system are obtained, converting the problem into a 3D-3D registration problem;
Step S62, the camera pose parameters, namely the rotation R and the translation t, are obtained using the ICP (Iterative Closest Point) algorithm.
Step S63, because the observations are affected by noise, the estimation result may contain errors, so the pose calculation result is further optimized with Bundle Adjustment (BA). BA is a common nonlinear optimization method that jointly optimizes the camera pose and the positions of the space points; its main idea is to sum the errors into a least-squares problem and find the camera pose that minimizes the error term.
Compared with the prior art, the invention has the beneficial effects that:
(1) The visual odometry scheme provided by the invention is more robust. When the robot operates in a complex environment, the scheme largely avoids the influence of dynamic objects, retains static feature points, and effectively improves the localization and mapping accuracy of the robot.
(2) To address the tendency of the ORB feature extraction algorithm to produce densely clustered feature points, the invention provides a grid-based adaptive feature extraction method, which yields more uniformly distributed feature points and captures more comprehensive image information.
(3) To address the original system's inability to acquire environmental semantic information, the invention develops an instance segmentation method based on YOLACT. An instance segmentation thread is added to the original framework to segment the RGB image. YOLACT provides semantic labels for common household objects and, beyond the visual odometry part, can also be used to build a subsequent semantic map.
Detailed Description
The present invention will be described in further detail with reference to the drawings and the technical scheme, in order to make the objects, technical schemes and advantages of the present invention more apparent.
The invention aims to overcome the defects and shortcomings of conventional SLAM visual odometry schemes and provides an ORB-SLAM2-based mobile robot visual odometry method for dynamic scenes. The method acquires real-time image information with an INTEL REALSENSE depth camera and applies preprocessing such as grayscale conversion; extracts image feature points more comprehensively with an adaptive-threshold method built on the ORB (Oriented FAST and Rotated BRIEF) algorithm; trains a YOLACT network on the MS_COCO dataset and performs instance segmentation on the image to obtain image semantic information; coarsely filters dynamic feature points by combining the image semantic information with the L-K optical flow method; estimates a fundamental matrix F with the progressive sample consensus (PROSAC) algorithm, computes epipolar distances, and finely filters the dynamic feature points; and finally selects the filtered keyframes and passes them as input to the local mapping thread. The key point of the invention is to filter the dynamic feature points of keyframes by combining environmental semantic information with geometric constraints, thereby effectively avoiding the influence of dynamic objects in the surrounding environment and improving the accuracy of robot localization and mapping in dynamic environments.
Specific steps of the method are described in detail below.
Referring to FIG. 1 and FIG. 2, a mobile robot visual odometry method based on ORB-SLAM2 in a dynamic scene includes the following steps:
Step S1, image acquisition and preprocessing, which specifically includes the following.
Considering the requirements of the SLAM system, an INTEL REALSENSE D415, which can simultaneously acquire image depth information and color information, is selected as the vision sensor. Image information is captured with the INTEL REALSENSE D415 depth camera, and the acquired real-time image is preprocessed before feature point extraction, including noise removal, grayscale conversion, and similar operations, to simplify subsequent processing.
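As an illustration of the preprocessing in step S1, the following is a minimal sketch using OpenCV, assuming the color frame has already been read from the camera as a NumPy array; the denoising kernel size is an illustrative assumption, not a parameter prescribed by the invention.

```python
import cv2
import numpy as np

def preprocess(color_frame: np.ndarray) -> np.ndarray:
    """Convert a color frame to a denoised grayscale image before feature extraction."""
    gray = cv2.cvtColor(color_frame, cv2.COLOR_BGR2GRAY)  # grayscale conversion
    # Light Gaussian smoothing to suppress sensor noise; the 3x3 kernel is an assumption.
    denoised = cv2.GaussianBlur(gray, (3, 3), 0)
    return denoised
```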
Step S2, image feature point extraction, which specifically includes the following.
Step S21, to ensure scale invariance of the feature points, so that the same image can still be matched to its corresponding feature points after scaling, a scale pyramid is first constructed from the input image and ORB feature points are computed at each scale. The original camera image is taken as layer 0 and is scaled down layer by layer according to the scale factor until the top layer of the pyramid; the total number of pyramid layers nlevels is set to 8, and the scaled image is:
I_k = I / scaleFactor^k
wherein I_k is the image of the k-th pyramid layer, I is the original image acquired by the camera, scaleFactor is the scale factor, set to 1.2, and k is the pyramid layer index with value range [1, nlevels-1];
The number of feature points to extract on each layer is:
desiredC_i = N · (1 - InvSF) / (1 - InvSF^nlevels) · InvSF^i
wherein N is the total number of ORB feature points to extract, desiredC_i is the expected number of feature points on the i-th pyramid layer, the value range of i is [0, nlevels-1], nlevels is the total number of pyramid layers (generally set to 8), and InvSF is the reciprocal of the scale factor scaleFactor;
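A minimal sketch of the pyramid construction and per-level feature budget described in step S21, in Python with OpenCV; the geometric distribution of desiredC_i follows the common ORB-SLAM2 convention and is stated here as an assumption consistent with the definitions above.

```python
import cv2

def build_pyramid(image, nlevels=8, scale_factor=1.2):
    """Return the scaled images I_k, k = 0..nlevels-1, with I_0 the original image."""
    pyramid = [image]
    h, w = image.shape[:2]
    for k in range(1, nlevels):
        scale = 1.0 / (scale_factor ** k)
        pyramid.append(cv2.resize(image, (round(w * scale), round(h * scale))))
    return pyramid

def features_per_level(total_features, nlevels=8, scale_factor=1.2):
    """Distribute the total ORB feature budget geometrically over the pyramid levels."""
    inv_sf = 1.0 / scale_factor
    first = total_features * (1 - inv_sf) / (1 - inv_sf ** nlevels)
    counts = [round(first * inv_sf ** i) for i in range(nlevels - 1)]
    counts.append(max(total_features - sum(counts), 0))  # remainder goes to the top level
    return counts
```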
Step S22, to make the distribution of extracted feature points more uniform and reasonable, and thereby capture more comprehensive image information, an adaptive-threshold feature point extraction algorithm based on grid division is provided; a schematic diagram is shown in FIG. 3. Each pyramid layer image is divided into a grid with Cols_i columns and Rows_i rows, computed from a grid division coefficient t, wherein reducing t increases the total number of grid cells, and IRat_i denotes the row-to-column ratio of the grid on the i-th layer image, i.e., IRat_i = Rows_i / Cols_i.
The number cDesC_i of feature points to pre-extract in each grid cell and an initial FAST corner extraction threshold iniTh are then set; the initial threshold is derived from the gray-level statistics of the image, wherein I(x) is the gray value of a pixel x in the image, κ is the average gray value of all pixels, and sp is the total number of pixels in the image.
Step S23, after the first pre-extraction with the initial threshold, if the number of feature points actually extracted in a grid cell is smaller than the set pre-extraction number, the threshold is adjusted and extraction continues; this process is repeated until the adaptive extraction of feature points in every cell is complete.
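The grid-based adaptive extraction of steps S22 and S23 can be sketched as follows with OpenCV's FAST detector; the grid size, threshold decrement, and minimum threshold are illustrative assumptions rather than values fixed by the invention.

```python
import cv2

def adaptive_grid_fast(gray, rows=8, cols=8, per_cell=10, ini_th=20, min_th=5):
    """Detect FAST corners cell by cell, relaxing the threshold when a cell yields too few."""
    h, w = gray.shape
    keypoints = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = gray[y0:y1, x0:x1]
            th, kps = ini_th, []
            while th >= min_th:
                detector = cv2.FastFeatureDetector_create(threshold=th)
                kps = detector.detect(cell, None)
                if len(kps) >= per_cell:
                    break
                th -= 5  # relax the threshold and extract again
            for kp in kps:
                kp.pt = (kp.pt[0] + x0, kp.pt[1] + y0)  # map back to full-image coordinates
                keypoints.append(kp)
    return keypoints
```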
Step S3, acquiring image semantic information, which specifically includes the following.
Step S31, the overall structure of the YOLACT network is shown in FIG. 4. The backbone uses ResNet-101 with 5 convolution modules in total, comprising conv1 and conv2_x through conv5_x, corresponding to C1 to C5 in FIG. 4; the backbone is mainly responsible for image feature extraction;
Step S32, P3 to P7 in FIG. 4 form the FPN (Feature Pyramid Networks) part, which generates multi-scale feature maps so that target objects of different sizes can be detected. First, P5 is obtained from C5 through a convolution layer; P5 is then upsampled by a factor of two with bilinear interpolation and added to the convolved C4 to obtain P4, and P3 is obtained by the same operation. In addition to this top-down path, P6 is generated by convolving and downsampling P5, and P7 is obtained by applying the same operation to P6. The complete FPN thus produces feature maps of different scales, making the features richer and facilitating segmentation of target objects of different sizes;
Step S33, constructing the Protonet branch, which generates prototype masks. The input of this branch is P3, its middle part consists of several convolution layers, and its final output has k channels, each channel representing a 138 × 138 prototype mask; the important parts of the image to be processed can be extracted through these masks;
Step S34, constructing the Prediction Head branch, which generates the mask coefficients. To improve real-time segmentation performance, a shared convolution network is used; this step runs in parallel with step S33;
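To illustrate how the Protonet prototypes (step S33) and the Prediction Head coefficients (step S34) interact, the sketch below reproduces the linear-combination mask assembly used by YOLACT; the prototype count k = 32 and the instance count are illustrative assumptions.

```python
import numpy as np

def assemble_masks(prototypes: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
    """Combine k prototype masks (H x W x k) with per-instance coefficients (n x k).

    Each instance mask is the sigmoid of a coefficient-weighted sum of the prototypes,
    which is the core of YOLACT's mask generation.
    """
    h, w, k = prototypes.shape
    lin = prototypes.reshape(-1, k) @ coefficients.T   # (H*W) x n
    masks = 1.0 / (1.0 + np.exp(-lin))                 # sigmoid
    return masks.reshape(h, w, -1)                     # H x W x n

# Example: 138 x 138 prototypes with k = 32, two detected instances.
proto = np.random.rand(138, 138, 32).astype(np.float32)
coef = np.random.rand(2, 32).astype(np.float32)
instance_masks = assemble_masks(proto, coef)
```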
Step S35, training the YOLACT network using 18 common classes of indoor household objects from the MS_COCO (Microsoft Common Objects in Context) dataset as training samples.
The MS_COCO dataset is an open-source dataset released by Microsoft in 2014. It contains 91 categories of objects and scenes common in daily life and stores three annotation types, object instances, object keypoints, and image captions, as JSON files. Eighteen common classes of indoor household objects are selected from it as customized training samples.
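Selecting a training subset as described in step S35 could look like the sketch below using pycocotools; the class names and annotation file path are hypothetical examples, not the exact 18 classes used by the invention.

```python
from pycocotools.coco import COCO

# Hypothetical indoor household classes; the invention uses 18 such classes.
indoor_classes = ["chair", "couch", "bed", "dining table", "tv",
                  "laptop", "keyboard", "cup", "bottle", "book"]

coco = COCO("annotations/instances_train2014.json")  # path is an assumption
cat_ids = coco.getCatIds(catNms=indoor_classes)
img_ids = set()
for cat_id in cat_ids:
    img_ids.update(coco.getImgIds(catIds=[cat_id]))
print(f"{len(img_ids)} images contain at least one of the selected classes")
```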
Step S4, motion consistency detection, which specifically includes the following.
Step S41, the pixel velocity of a potentially dynamic object is tracked and computed with the L-K optical flow method, and the dynamics of the object are analyzed by combining the optical flow result with the semantic information: if the object moves relative to the background, the feature points it contains in the current frame are removed; if the change in the object's velocity vector relative to the previous frame is small, the related feature points are retained. This completes the coarse filtering of dynamic feature points.
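A simplified sketch of the coarse filtering in step S41, assuming that the feature points belonging to one segmented object are already known from the instance masks; the velocity-change threshold is an illustrative parameter.

```python
import cv2
import numpy as np

def coarse_filter(prev_gray, cur_gray, object_points, prev_mean_vel=None, vel_change_th=2.0):
    """Track an object's points with L-K optical flow and keep or drop them based on
    how much the object's mean pixel velocity changed since the previous frame."""
    pts = object_points.astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good = status.ravel() == 1
    flow = (nxt - pts).reshape(-1, 2)[good]
    mean_vel = flow.mean(axis=0) if len(flow) else np.zeros(2)

    if prev_mean_vel is None or np.linalg.norm(mean_vel - prev_mean_vel) < vel_change_th:
        kept = object_points[good]        # motion consistent with the previous frame: keep
    else:
        kept = np.empty((0, 2))           # object treated as dynamic: drop its points
    return kept, mean_vel
```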
Step S5, feature point matching, which specifically includes the following.
Step S51, for each matched feature point pair of the two image frames, the minimum Euclidean distance dmin, the second-minimum Euclidean distance dmin2, and an evaluation function value q(u_x) derived from them are calculated. If a set U_N contains N data points ordered by this evaluation function, then for any u_i, u_j in U_N with i < j the ordering satisfies q(u_i) ≥ q(u_j); qualitatively, data points with a larger evaluation function value q(u_x) are ranked higher in the set. The feature point pairs are sorted in descending order of the evaluation function value, the pairs are grouped eight at a time, the quality of each group is computed and the groups are ranked, the group of eight matched pairs with the highest matching quality is selected, and a fundamental matrix F is calculated from it;
Step S52, after these eight matched pairs are removed from the subset, the corresponding projection points of the remaining feature points in the subset are calculated according to the fundamental matrix F;
Step S53, calculating the error e between each remaining feature point and its projection point; if e is smaller than the maximum allowed error ε, the point is counted as an inlier. After the inlier count is updated, the fundamental matrix F is recalculated and a new inlier set is obtained. If the number of iterations does not exceed the maximum, the fundamental matrix F and the inlier set are returned; otherwise model estimation fails;
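A simplified Python sketch of the PROSAC-style estimation in steps S51 through S53; the match-quality measure and error threshold below are assumptions standing in for the evaluation function of the invention, and cv2.findFundamentalMat with the 8-point method replaces a hand-written fundamental matrix solver.

```python
import cv2
import numpy as np

def prosac_like_fundamental(pts1, pts2, quality, max_iters=50, eps=1.0):
    """Estimate F by sampling 8 correspondences from a progressively enlarged,
    quality-ordered pool, keeping the hypothesis with the most inliers."""
    order = np.argsort(-quality)                 # best-quality matches first
    p1, p2 = pts1[order], pts2[order]
    ones = np.ones((len(p1), 1))
    x1, x2 = np.hstack([p1, ones]), np.hstack([p2, ones])
    best_F, best_inliers = None, np.zeros(len(p1), dtype=bool)

    for it in range(max_iters):
        pool = min(8 + it, len(p1))              # progressively enlarge the sampling pool
        idx = np.random.choice(pool, 8, replace=False)
        F, _ = cv2.findFundamentalMat(p1[idx], p2[idx], cv2.FM_8POINT)
        if F is None:
            continue
        lines = x1 @ F.T                         # epipolar lines l = F x1 in the second image
        err = np.abs(np.sum(x2 * lines, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)
        inliers = err < eps
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers
```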
Step S54, calculating the epipolar distance according to the fundamental matrix F. Let p1 be the projection point on the imaging plane of the reference camera and p2 the matching point on the imaging plane of the current frame; the distance from the normalized coordinate x2 of p2 to the epipolar line l is:
d = |x2^T · F · x1| / sqrt(A^2 + B^2)
where x1 is the normalized coordinate of p1 and [A B C]^T = F · x1 is the coefficient vector of the epipolar line l.
If the calculated distance exceeds a set threshold θ, the error is considered large and the corresponding three-dimensional world point belongs to a dynamic object that should be removed.
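A compact sketch of the fine filtering in step S54, computing the epipolar distance d = |x2^T F x1| / sqrt(A^2 + B^2) for each match; θ is the preset threshold and the points are assumed to already be normalized homogeneous coordinates.

```python
import numpy as np

def dynamic_point_mask(F, x1, x2, theta):
    """Return True for matches whose epipolar distance exceeds theta (treated as dynamic).

    x1, x2: normalized homogeneous coordinates of matched points, shape (N, 3).
    """
    lines = x1 @ F.T                             # l = F x1 -> [A, B, C] per match
    num = np.abs(np.sum(x2 * lines, axis=1))     # |x2^T F x1|
    den = np.linalg.norm(lines[:, :2], axis=1)   # sqrt(A^2 + B^2)
    return (num / den) > theta
```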
Step S6, initial robot pose estimation, which specifically includes the following.
Step S61, using four non-coplanar virtual control points to express any point on the target object in the world coordinate system as a weighted linear combination, a set of matched 3D points in the camera coordinate system is obtained, converting the problem into a 3D-3D registration problem;
Step S62, solving the camera pose parameters with the ICP (Iterative Closest Point) algorithm, i.e., finding the rotation R and translation t that minimize the sum of squared errors.
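The closed-form 3D-3D alignment that underlies one ICP update (step S62) can be sketched with the standard SVD-based solution; this assumes the point correspondences are already known, which is the situation after step S61.

```python
import numpy as np

def align_3d_3d(P, Q):
    """Find R, t minimizing sum ||Q_i - (R P_i + t)||^2 for matched 3D point sets P, Q."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)          # cross-covariance H = Pc^T Qc
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = q_mean - R @ p_mean
    return R, t
```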
Step S63, because the observations are affected by noise, the estimation result may contain errors, so the pose calculation result is further optimized with Bundle Adjustment (BA). BA is a common nonlinear optimization method that jointly optimizes the camera pose and the positions of the space points; its main idea is to sum the errors into a least-squares problem and find the camera pose that minimizes the error term.
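Step S63's idea of summing reprojection errors into a least-squares problem can be illustrated with a small pose-only refinement; the use of scipy.optimize.least_squares, the Rodrigues pose parametrization, and the intrinsic matrix K are assumptions of this sketch, and a full BA would also optimize the 3D point positions.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(pose, points_3d, points_2d, K):
    """Residuals of projecting world points with pose = [rvec(3), tvec(3)]."""
    rvec, tvec = pose[:3], pose[3:]
    proj, _ = cv2.projectPoints(points_3d, rvec, tvec, K, None)
    return (proj.reshape(-1, 2) - points_2d).ravel()

def refine_pose(rvec0, tvec0, points_3d, points_2d, K):
    """Pose-only bundle adjustment: minimize the summed squared reprojection error."""
    x0 = np.hstack([rvec0.ravel(), tvec0.ravel()])
    result = least_squares(reprojection_residuals, x0, args=(points_3d, points_2d, K))
    return result.x[:3], result.x[3:]
```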
The foregoing embodiments are merely illustrative of the present invention and are not restrictive; the scope of the invention is not limited to these embodiments. It should be understood by those skilled in the art that modifications, variations or substitutions of some technical features described in the foregoing embodiments may be made without departing from the spirit and scope of the technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.