CN118314162B - Dynamic visual SLAM method and device for time sequence sparse reconstruction - Google Patents

Dynamic visual SLAM method and device for time sequence sparse reconstruction

Info

Publication number
CN118314162B
Authority
CN
China
Prior art keywords
frame
feature
image
moving object
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410725976.9A
Other languages
Chinese (zh)
Other versions
CN118314162A (en)
Inventor
王晨捷
张燕咏
张昱
吉建民
曹泓
张露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202410725976.9A
Publication of CN118314162A
Application granted
Publication of CN118314162B
Status: Active
Anticipated expiration

Abstract

The disclosure belongs to the technical field of visual localization and mapping, and particularly relates to a dynamic visual SLAM method and device for time-sequential sparse reconstruction. The method comprises: acquiring a continuous binocular image sequence of a dynamic scene, acquiring a motion instance and mapping parameters for each frame of left-view image, and assembling them into one frame of mapping data; in adjacent frames, determining the correspondence of structural features based on the mapping parameters, determining key points within the structural features, associating moving objects across adjacent frames based on the motion instances and key points, and determining the poses of the camera and of each moving object using the motion instances and key points; and, within a subframe sequence, extracting a preset number of key frames based on the optical flow estimation result and structural features of each frame, generating a local map from the key frames, fusing the local maps into a global map, and jointly optimizing the local and global maps based on a preset set of error factors.

Description

Dynamic visual SLAM method and device for time sequence sparse reconstruction
Technical Field
The disclosure belongs to the technical field of visual positioning and mapping, and particularly relates to a dynamic visual SLAM method and device for time sequence sparse reconstruction.
Background
Visual SLAM, i.e. visual simultaneous localization and mapping, can localize a robot and build a three-dimensional map of the environment using only a camera; it is easy to deploy, low in cost, and has long been a research focus in computer vision and robotics. Common visual SLAM methods are based on the static-scene assumption, and when significant dynamic objects exist in the scene, the errors of camera pose estimation and environment mapping become large. In a dynamic environment, the current common solution is to first segment the moving objects and remove them entirely as outliers, so as to localize the robot and map the static environment. However, the information carried by moving objects in a dynamic scene is not useless. In order to acquire the motion and geometric information of moving objects at the same time, dynamic visual SLAM research aims to use only a camera to simultaneously obtain a three-dimensional map of the static background, a three-dimensional model of each dynamic object, and the absolute motion trajectories of the camera and of each dynamic object.
Depending on the motion segmentation method used, existing dynamic visual SLAM methods mainly fall into methods based on semantic information and methods based on geometric motion segmentation. Methods based on semantic information rely too heavily on semantic priors and struggle to reach the expected effect for objects without specific semantic labels. Methods based on geometric motion segmentation are not limited to moving objects with specific semantic labels, but geometric motion segmentation often performs poorly on low-texture scenes and on small objects with few salient points, has difficulty coping with degenerate motion, adapts poorly to complex scenes, and runs inefficiently. Meanwhile, most dynamic visual SLAM methods do not jointly optimize the motion trajectories of the camera and the moving objects together with the static background and the structural point representations of the moving objects (joint optimization over the four tasks, for short), so the accuracy of the output moving-object information is low.
Disclosure of Invention
The embodiments of the disclosure provide a dynamic visual SLAM scheme based on motion instance segmentation, aiming to solve the problems that existing dynamic visual SLAM schemes segment objects without semantic information, or small objects with few salient points, poorly, and that the accuracy of moving-object information is low because joint optimization over the four tasks is not achieved.
A first aspect of an embodiment of the present disclosure provides a dynamic visual SLAM method for time-sequential sparse reconstruction, including:
Acquiring a continuous binocular image sequence of a dynamic scene, acquiring a mapping parameter of each frame of left-view image, acquiring a motion instance of the image based on motion instance segmentation, and assembling the mapping parameter and the motion instance at corresponding moments into one frame of mapping data, wherein the mapping parameter at least comprises a depth map of the left-view image, an optical flow estimation result and structural features, and the structural features comprise point features and edge features;
In adjacent frames in a frame sequence formed by continuous multiframes of the mapping data, determining the corresponding relation of the structural feature based on the mapping parameter, determining key points in the structural feature, associating a moving object in the adjacent frames based on the moving instance and the key points, and respectively determining the pose of a camera and the moving object by utilizing the moving instance and the key points;
And in a subframe sequence containing a preset frame number, extracting a preset number of key frames based on an optical flow estimation result and structural features of each frame, generating a local map corresponding to the subframe sequence based on the pose, static background and key points of a camera and a moving object in the key frames, and fusing each local map to generate a global map corresponding to the frame sequence.
In some embodiments, after the generating the global map corresponding to the sequence of frames, further comprising:
Performing joint optimization on the global map based on a preset error factor set, wherein the error factor set at least comprises a reprojection error of a static background point, a reprojection error of a moving object point, a camera pose estimation observation error, a moving object uniform motion error and a moving object point uniform motion error; and/or
After the generating of the local map corresponding to the subframe sequence, further comprising:
and carrying out joint optimization on the local map based on the preset error factor set.
In some embodiments, the obtaining motion instances of each frame of left-view image based on motion instance segmentation comprises:
Acquiring a first left-view image of a moving object at a target moment and a second left-view image at an adjacent moment, and extracting first feature maps at multiple scales for the first and second left-view images based on an image feature extraction neural network, wherein each lower-resolution first feature map is obtained by downsampling the higher-resolution first feature map of the adjacent scale with a stride of 2;
Fusing the first feature map of each current scale of the image with the first feature map of the adjacent scale based on the feature pyramid model to obtain a second feature map of the current scale of the image, wherein the second feature maps of all scales of the image form the second feature map of the image;
and generating a third feature map of the first left-view image after the time sequence information is introduced based on the second feature map of the image, and performing motion instance segmentation on the third feature map of the first left-view image based on a target detection and instance segmentation algorithm to obtain a motion instance of the first left-view image as a motion object instance segmentation result.
In some embodiments, the acquiring mapping parameters of the left-view image of each frame includes:
Taking the binocular image at the target moment as input, performing binocular depth estimation with a CUDA implementation of the semi-global matching (SGM) algorithm to obtain the depth map of the left-view image at the target moment;
Taking two adjacent left-view frames as input, obtaining the optical flow estimation result of the left-view image at the target moment based on an optical flow estimation network, wherein the optical flow estimation network adopts the LiteFlowNet method;
Taking the left-view image at the target moment as input, detecting a sparse corner feature set of the image with a FAST feature extractor to output the point features of the image;
And taking the left-view image at the target moment as input, detecting and outputting the edge features of the image with a Canny edge detector.
In some embodiments, the determining the correspondence of the structural feature based on the mapping parameters includes:
for a feature point in the current frame, obtaining its matching feature point in the adjacent frame using the optical flow estimate at the corresponding pixel position;
And for a first edge feature in the current frame, looking up the effective depth of the edge feature in the depth map of the current frame, re-projecting the set of edge pixels with effective depth from the current frame into the adjacent frame, obtaining the second edge feature corresponding to the first edge feature in the adjacent frame, computing, at each pixel position of the first edge feature, the Euclidean distance to the pixel positions of the second edge feature, and aligning the edge features of the current frame and the adjacent frame based on the minimum Euclidean distance.
In some embodiments, the associating moving objects in adjacent frames based on the motion instance and the keypoint comprises:
Respectively calculating the intersection-over-union (IoU) between each of a plurality of first motion instances in the current frame and a plurality of second motion instances in the adjacent frame, wherein the first and second motion instances whose IoU is the largest and not smaller than a preset threshold are matched;
The keypoints of the moving object are associated with the keypoints of the moving object from a local map, wherein the local map is generated based on a sub-sequence of the sequence of consecutive binocular images of the dynamic scene.
In some embodiments, the associating the keypoints of the moving object with the keypoints of the moving object from the local map comprises:
if the speed of the moving object in the local map is known, predicting the motion of the moving object, and re-projecting the predicted three-dimensional key points of the moving object to the current frame and matching the three-dimensional key points with the nearest landmark points in the current frame;
if the speed of the moving object is unknown in the local map or the landmark point of the current frame is not matched with the predicted three-dimensional key point of the moving object, the three-dimensional key point is distributed to the moving instance with the largest overlapping degree in the adjacent frames.
In some embodiments, the determining the pose of the camera and the moving object using the motion instance and the keypoints, respectively, comprises:
Acquiring an edge distance error of a static background edge feature and a reprojection error from a 3-dimensional static point to a 2-dimensional plane point, and determining the pose of a camera from the current moment to a reference moment by jointly minimizing the edge distance error and the reprojection error;
determining edge characteristics and point characteristics of each moving object by utilizing a moving example of each moving object, determining an edge distance error of the moving object from the current moment to the reference moment based on the edge characteristics, determining a re-projection error based on the point characteristics, and determining the pose of the moving object from the current moment to the reference moment by jointly minimizing the edge distance error and the re-projection error.
In some embodiments, in the subframe sequence including the preset frame number, extracting the preset number of key frames based on the optical flow estimation result and the structural feature of each frame includes:
Taking the initial frame as a key frame, calculating the mean square optical flow produced between the current frame and the last key frame by viewing-angle change or occlusion, and taking the current frame as a first key frame associated with camera motion when the mean square optical flow meets a first preset condition or the structural features of the current frame meet a second preset condition;
determining tracking quality of a current frame based on an inter-frame overlapping histogram, and taking the current frame as a second key frame associated with a moving object when the tracking quality meets a preset condition;
And forming an optimized frame set by the key frames, adding the first key frames or the second key frames into the optimized frame set when the number of the key frames in the optimized frame set is not higher than a first preset threshold value, and removing one key frame from the optimized frame set based on a preset strategy before each key frame is added when the number of the key frames in the optimized frame set is higher than a second preset threshold value.
A second aspect of an embodiment of the present disclosure provides a dynamic visual SLAM apparatus for sequential sparse reconstruction, including:
the data acquisition module is used for acquiring a continuous binocular image sequence of a dynamic scene, acquiring a mapping parameter of each frame of left-view image, acquiring a motion instance of the image based on motion instance segmentation, and assembling the mapping parameter and the motion instance at corresponding moments into one frame of mapping data, wherein the mapping parameter at least comprises a depth map, an optical flow estimation result and structural features of the left-view image, and the structural features comprise point features and edge features;
The tracking module is used for determining the corresponding relation of the structural feature based on the mapping parameter in the adjacent frames in the frame sequence formed by continuous multiframes of the mapping data, determining the key point in the structural feature, associating the moving object in the adjacent frames based on the moving instance and the key point, and respectively determining the pose of the camera and the moving object by utilizing the moving instance and the key point;
And the mapping module is used for extracting a preset number of key frames from a subframe sequence containing a preset frame number in the frame sequence based on an optical flow estimation result and structural characteristics of each frame, generating a local map corresponding to the subframe sequence based on the pose, static background and key points of a camera and a moving object in the key frames, and fusing each local map to generate a global map corresponding to the frame sequence.
In summary, the dynamic visual SLAM method and device for time-sequential sparse reconstruction provided by the embodiments of the present disclosure segment the motion instances of moving objects with a class-agnostic motion instance segmentation network, which segments objects without semantic information well even in complex dynamic scenes; the multi-motion estimation method combining point features, edge features and optical flow information alleviates, to a certain extent, the scarcity of key points on moving objects that occupy a small image area, and improves the accuracy of absolute trajectory estimation for moving objects; finally, by constructing an error equation covering the four tasks of camera trajectory estimation, moving-object trajectory estimation, static background mapping and moving-object reconstruction, the low accuracy of moving-object information caused by the absence of joint optimization over the four tasks is resolved.
Drawings
The features and advantages of the present disclosure will be more clearly understood by reference to the accompanying drawings, which are schematic and should not be construed as limiting the disclosure in any way, in which:
FIG. 1 is a schematic diagram of a computer system to which the present disclosure is applicable;
FIG. 2 is a schematic diagram of generating a global map based on a binocular image sequence shown in accordance with some embodiments of the present disclosure;
FIG. 3 is a flow chart of a dynamic visual SLAM method for time-sequential sparse reconstruction shown in accordance with some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of some embodiments of the present disclosure utilizing a class-agnostic motion instance segmentation method to obtain a motion instance of a current frame image;
FIG. 5 is a schematic diagram of a multi-motion factor graph shown in accordance with some embodiments of the present disclosure;
FIG. 6 is a schematic diagram of a dynamic visual SLAM device for sequential sparse reconstruction shown in accordance with some embodiments of the present disclosure.
Detailed Description
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant disclosure. However, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. It should be appreciated that the use of "system," "apparatus," "unit," and/or "module" terms in this disclosure is one method for distinguishing between different parts, elements, portions, or components at different levels in a sequential arrangement. However, these terms may be replaced with other expressions if the other expressions can achieve the same purpose.
It will be understood that when a device, unit, or module is referred to as being "on," "connected to," or "coupled to" another device, unit, or module, it can be directly on, connected to, or coupled to, or in communication with the other device, unit, or module, or intervening devices, units, or modules may be present unless the context clearly indicates an exception. For example, the term "and/or" as used in this disclosure includes any and all combinations of one or more of the associated listed items.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present disclosure. As used in the specification and the claims, the terms "a," "an," and "the" do not denote the singular only and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" indicate only that the explicitly identified features, integers, steps, operations, elements, and/or components are included, but do not constitute an exclusive list, as other features, integers, steps, operations, elements, and/or components may also be included.
These and other features and characteristics of the present disclosure, as well as the methods of operation, functions of the related elements of structure, combinations of parts and economies of manufacture, may be better understood with reference to the following description and the accompanying drawings, all of which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the disclosure. It will be understood that the figures are not drawn to scale.
Various block diagrams are used in the present disclosure to illustrate various modifications of the embodiments according to the present disclosure. It should be understood that the foregoing or following structures are not intended to limit the present disclosure. The protection scope of the present disclosure is subject to the claims.
FIG. 1 is a schematic diagram of a computer system to which the present disclosure is applicable. As in the system shown in fig. 1, a SLAM server is in data connection with the image acquisition device, the SLAM server generating a global map of the motion scene based on a sequence of successive binocular images of the motion scene acquired by the image acquisition device. Wherein:
SLAM (Simultaneous Localization And Mapping) refers to simultaneous localization and mapping. The SLAM server may be a stand-alone, clustered, or distributed server. Fig. 2 is a schematic diagram illustrating the generation of a global map based on a binocular image sequence, according to some embodiments of the present disclosure. As shown in fig. 2, the SLAM server includes a data processing module that acquires mapping data from the binocular image sequence, a tracking thread that determines the poses of the camera and the moving objects from the mapping data, and a mapping thread that generates the global map from the poses and key points. The tracking thread and the mapping thread run in parallel.
The image acquisition device is binocular: it can simultaneously acquire a left-view image and a right-view image of one frame and determine the depth information of the frame from the pair. The image acquisition device may be a video camera, a still camera, a monitoring device, or a camera built into another device. The continuous binocular image sequence may also be a video clip of the motion scene.
FIG. 3 is a flow chart of a dynamic vision SLAM method for time-sequential sparse reconstruction, shown in accordance with some embodiments of the present disclosure, performed by a SLAM server in the system shown in FIG. 1, comprising the steps of:
S310, a continuous binocular image sequence of a dynamic scene is acquired, a mapping parameter of each frame of left-view image is acquired, a motion instance of the image is acquired based on motion instance segmentation, the mapping parameter and the motion instance at corresponding moments are assembled into one frame of mapping data, the mapping parameter at least comprises a depth map of the left-view image, an optical flow estimation result and structural features, and the structural features comprise point features and edge features.
Specifically, as shown in fig. 2, the data acquisition and processing stage is performed first. The RGB binocular image sequence is taken as input and, after rectification and de-distortion, is fed into the data processing module. The data processing module runs the GPU and the CPU in parallel, placing the more complex tasks on the GPU; since all tasks on the GPU and CPU run concurrently, the time to process one frame of data equals the time of the longest task, and an overly long data processing time would greatly affect the operating efficiency of the system. The tasks run on the GPU include: obtaining the depth map through binocular depth estimation, where binocular depth estimation uses a CUDA implementation of the SGM (Semi-Global Matching) method; obtaining the motion instances of the current frame image with a class-agnostic motion instance segmentation method; and obtaining the optical flow result with an optical flow estimation network, where optical flow estimation uses the LiteFlowNet method, which runs fast, uses little memory, and can quickly provide a dense optical flow estimate from two adjacent frames. The tasks run on the CPU include extracting the point features and edge features of the current frame, where the point features are a sparse corner feature set detected by a FAST feature extractor and the edge features are detected by a Canny edge detector. Binocular depth estimation takes the binocular image at the current moment as input and outputs the depth map of the left-view image at the current moment; the moving-object instance segmentation and optical flow estimation networks take two adjacent left-view frames as input and output the segmentation and optical flow estimation results of the left-view image at the current moment; point and edge feature extraction takes the left-view image at the current moment as input and outputs its feature extraction result. When all five tasks have completed for an image, one frame of data-processing results is output together. Because of this parallel design, data processing can run concurrently with the subsequent tracking and mapping stages, greatly reducing the running time of the whole system in practice.
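The following is a minimal sketch of this per-frame data-processing step, not the disclosure's implementation: OpenCV's CPU StereoSGBM stands in for the CUDA SGM module, and the LiteFlowNet and class-agnostic segmentation networks are left as injected callables (flow_net, seg_net), since their exact interfaces are not specified here; focal and baseline are assumed calibration inputs.

```python
# Sketch of assembling one frame of mapping data (depth, flow, instances,
# point features, edge features) from a rectified stereo pair.
import cv2
import numpy as np

def process_frame(left, right, prev_left, flow_net, seg_net, focal, baseline):
    gray_l = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

    # Binocular depth estimation (semi-global matching); disparity -> depth.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = sgbm.compute(gray_l, gray_r).astype(np.float32) / 16.0
    depth = np.where(disparity > 0, focal * baseline / disparity, 0.0)

    # Dense optical flow between adjacent left-view frames (LiteFlowNet in the
    # disclosure; any dense-flow callable works for this sketch).
    flow = flow_net(prev_left, left)

    # Class-agnostic motion instance segmentation (placeholder callable).
    instances = seg_net(prev_left, left)

    # Sparse corner features (FAST) and edge features (Canny).
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(gray_l, None)
    edges = cv2.Canny(gray_l, 100, 200)

    return {"depth": depth, "flow": flow, "instances": instances,
            "points": keypoints, "edges": edges}
```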
Fig. 4 is a schematic diagram of a motion instance of a current frame image obtained using a class-agnostic motion instance segmentation method in some embodiments of the present disclosure. As shown in fig. 4, the method includes:
Firstly, extracting image features of two frames of left-view images at adjacent moments of a motion scene based on an image feature extraction neural network, and generating a first feature map with a multi-level scale. And then adopting a feature pyramid model to identify multi-scale features in the picture, and fusing the first feature map of each current scale with the first feature map of the adjacent scale to obtain a second feature map of the current scale of the image.
Let the fused second feature maps output at time t and time t+1 be $F_t$ and $F_{t+1}$, respectively. The five-level feature map $F_{t+1}$ at time t+1 is first fed into a convolutional long short-term memory structure (ConvLSTM) as the input feature map, and its output introduces the temporal information. Then $F_t$ at time t is taken as the input feature map and, together with the hidden-layer feature maps from the previous step, each level is fed into its own ConvLSTM structure, producing the five-level feature map at time t with temporal information introduced. Finally, the feature map at time t with temporal information is input into the object detection and instance segmentation algorithm for motion instance segmentation, and the moving-object instance segmentation result of the image at time t is output.
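A minimal sketch of this temporal-fusion step follows, under the reconstruction above. PyTorch has no built-in ConvLSTM, so a small cell is written out; channel sizes and the two-step state handoff (F_{t+1} first, then F_t) are illustrative assumptions rather than the disclosure's exact architecture.

```python
# Sketch: per-level ConvLSTM fusion of F_t with temporal context from F_{t+1}.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, k=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def fuse_time(feats_t, feats_t1, cells):
    """feats_t / feats_t1: lists of per-level feature maps F_t, F_{t+1};
    cells: one ConvLSTMCell per pyramid level."""
    fused = []
    for f_t, f_t1, cell in zip(feats_t, feats_t1, cells):
        # Step 1: run F_{t+1} from a zero state to build the hidden memory.
        zeros = torch.zeros(f_t1.shape[0], cell.hidden_ch, *f_t1.shape[2:],
                            device=f_t1.device)
        h, c = cell(f_t1, (zeros, zeros))
        # Step 2: feed F_t with that state to inject temporal context.
        h, _ = cell(f_t, (h, c))
        fused.append(h)
    return fused
```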
The method and device segment the motion instances of moving objects with a class-agnostic motion instance segmentation network, which still segments objects without semantic information well in complex dynamic scenes.
S320, in the adjacent frames in the frame sequence formed by continuous multiframes of the mapping data, the corresponding relation of the structural feature is determined based on the mapping parameters, key points in the structural feature are determined, the moving object in the adjacent frames is associated based on the moving instance and the key points, and the pose of the camera and the pose of the moving object are respectively determined by utilizing the moving instance and the key points.
As shown in fig. 2, this step is processed by the tracking thread, and its results are fed into the mapping thread. In the present disclosure, feature point matching is implemented with the optical flow estimation result: for a feature point in the current frame, the optical flow at its pixel position gives the matching feature point in the adjacent frame. Edge feature alignment re-projects the set of edge pixels with effective depth from the current frame into the reference frame, computes the Euclidean distance to the nearest edge at each re-projected pixel position using a distance transform (Distance Transform, abbreviated DT), and aligns the edge features of the two frames by the minimum Euclidean distance. The effective depth is obtained by looking up the pixel values of the edge features in the depth map.
And when feature point matching and edge feature alignment are carried out, abnormal pixel points which cannot be matched or aligned are removed, and the left pixel points are key points of a static background and a moving object.
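A hedged sketch of this correspondence step is given below: feature points are propagated with the dense optical flow, and edge pixels with valid depth are scored against the distance transform of the reference frame's edge map. The reproject callable and the pixel layouts are assumptions for illustration.

```python
# Sketch: flow-based point matching and DT-based edge-distance residuals.
import cv2
import numpy as np

def match_points_by_flow(points, flow, h, w):
    """points: (N,2) pixel coords in the current frame; flow: HxWx2 dense flow."""
    matches = []
    for u, v in points.astype(int):
        du, dv = flow[v, u]
        u2, v2 = u + du, v + dv
        if 0 <= u2 < w and 0 <= v2 < h:            # keep only in-bounds matches
            matches.append(((u, v), (u2, v2)))
    return matches

def edge_distance_residuals(edge_pixels, depth, reproject, ref_edges):
    """reproject(p, d) -> pixel in the reference frame under the current pose
    estimate; ref_edges: binary Canny edge map of the reference frame."""
    # Distance transform: distance of every pixel to the nearest edge pixel.
    dt = cv2.distanceTransform((ref_edges == 0).astype(np.uint8), cv2.DIST_L2, 5)
    residuals = []
    for p in edge_pixels:
        d = depth[p[1], p[0]]
        if d <= 0:                                  # skip invalid depth
            continue
        q = reproject(p, d)
        if 0 <= q[0] < dt.shape[1] and 0 <= q[1] < dt.shape[0]:
            residuals.append(dt[int(q[1]), int(q[0])])
    return np.asarray(residuals)
```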
When tracking a fast-moving object point in image space, the combination of camera motion and object motion can produce a large displacement in the image; at the same time, large motion causes large appearance changes of objects in the image, increasing the matching difficulty. The present disclosure achieves a more robust result by associating at the instance level and at the keypoint level respectively.
At the instance level, the present disclosure matches motion instances between adjacent frames by computing the intersection-over-union (IoU, the ratio of the intersection area to the union area of two regions). The two motion instances in the adjacent frames whose IoU is the largest and not smaller than a preset threshold are matched.
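A minimal sketch of this instance-level association follows; the IoU threshold value is an assumption standing in for the disclosure's preset threshold.

```python
# Sketch: match each motion instance to the adjacent-frame instance with the
# largest IoU, rejecting matches below a preset threshold.
import numpy as np

def iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def associate_instances(masks_curr, masks_prev, iou_thresh=0.3):
    """masks_*: lists of boolean instance masks; returns (curr, prev) index pairs."""
    pairs = []
    for i, m_c in enumerate(masks_curr):
        scores = [iou(m_c, m_p) for m_p in masks_prev]
        if not scores:
            continue
        j = int(np.argmax(scores))
        if scores[j] >= iou_thresh:     # keep only sufficiently overlapping pairs
            pairs.append((i, j))
    return pairs
```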
In the keypoint level, the moving object's keypoints are associated with moving object keypoints from the local map in two different ways:
1. If the speed of the object in the map is known, predicting the motion of the object by utilizing the assumption of the uniform motion between frames, re-projecting the predicted 3D key points into the current frame, and matching the key points of the object with the nearest landmark points, so that the key points of the object are distributed to the key points of the object in the map;
2. If the velocity of the object has not been initialized, or a sufficient match is not found in 1, the keypoints are assigned to the most overlapping motion instances in consecutive frames by brute-force matching.
Because the current keypoints are associated with objects in the map, rather than with objects in previous frames, mutual occlusion of objects can be handled to some extent.
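A sketch of the keypoint-level association under the stated assumptions follows: when the object's velocity is known, its 3D keypoints are motion-predicted, re-projected and gated against the nearest current-frame landmark; otherwise the caller falls back to the max-overlap instance assignment above. The map_obj structure, project callable and pixel gate are hypothetical.

```python
# Sketch: associate map keypoints of a moving object with current-frame landmarks.
import numpy as np

def associate_keypoints(map_obj, landmarks_2d, project, max_px_dist=8.0):
    """map_obj.points_3d: (N,3); map_obj.velocity: 4x4 motion or None;
    landmarks_2d: (M,2) detections; project(X) -> 2D pixel of a 3D point."""
    assoc = {}
    if map_obj.velocity is None or len(landmarks_2d) == 0:
        return assoc                      # fall back to instance-overlap assignment
    for idx, X in enumerate(map_obj.points_3d):
        Xh = np.append(X, 1.0)
        X_pred = (map_obj.velocity @ Xh)[:3]      # constant-velocity prediction
        uv = project(X_pred)
        dists = np.linalg.norm(landmarks_2d - uv, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_px_dist:                # nearest landmark within gate
            assoc[idx] = j
    return assoc
```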
For camera motion estimation, the present disclosure uses only the edge features and feature points of the static background, and estimates the camera relative motion $T_{cr}$ from the current time $c$ to the reference time $r$ by jointly minimizing the edge distance error $E_{edge}$ and the reprojection error $E_{point}$ from 3D static points to 2D image points.
The edge distance error aligns each re-projected edge pixel with the nearest edge: the edge distance residual is obtained by evaluating the distance transform (DT) at the re-projected pixel location, giving
$$E_{edge} = \sum_{\mathbf{p}\in\mathcal{E}_c} w_h\!\big(D_r\big(\tau(T_{cr},\mathbf{p},d_{\mathbf{p}})\big)\big)$$
where $D_r$ denotes the DT of the reference frame, $\mathbf{p}$ is a pixel location of an edge in the current frame, $d_{\mathbf{p}}$ is the depth value at that location, $\tau(\cdot)$ is the transformation function that re-projects $\mathbf{p}$ from the current frame into the reference frame using the relative transform $T_{cr}$, $\mathcal{E}_c$ is the set of edge pixels in the current frame with effective depth, and $w_h$ is a Huber weight function used to reduce the impact of large residuals. Since edge detection often varies from frame to frame, if the residual $D_r(\tau(T_{cr},\mathbf{p},d_{\mathbf{p}}))$ is greater than a set threshold, the potential outlier is removed.
For point features, the reprojection error $E_{point}$ from 3D static points to 2D image points is computed as
$$E_{point} = \sum_{\mathbf{x}_i\in\mathcal{P}_c} w\!\big(\big\|\pi(T_{cr},\mathbf{X}_i)-\mathbf{x}_i^{obs}\big\|\big)$$
where $\pi(T_{cr},\mathbf{X}_i)-\mathbf{x}_i^{obs}$ is the reprojection distance residual, $\pi(\cdot)$ is the projection function that projects the 3D static background point $\mathbf{X}_i$ to a 2D pixel location in the reference frame, $\mathbf{x}_i^{obs}$ is the observed pixel coordinate of $\mathbf{X}_i$ in the reference frame, $\mathcal{P}_c$ is the set of static 3D points in the current frame, and $w$ is a weight function used to reduce the impact of large residuals.
The optimization combining edge features and point features then proceeds as
$$T_{cr}^{*} = \arg\min_{T_{cr}}\; E_{edge} + \lambda\,E_{point} \qquad (3.1)$$
where $T_{cr}^{*}$ is the relative motion to be optimized and $\lambda$ is a balance factor. The present disclosure uses an iteratively re-weighted Levenberg-Marquardt optimization to minimize equation (3.1) in a coarse-to-fine manner; that is, at the start of the optimization, the estimate is initialized with the constant-motion assumption so that it starts near the minimum.
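The sketch below illustrates this joint minimization of equation (3.1). The disclosure uses an iteratively re-weighted Levenberg-Marquardt solver; here SciPy's robust trust-region solver stands in, and the 6-vector rotation-vector/translation parameterization and the edge_terms/point_terms callables are illustrative assumptions.

```python
# Sketch: joint edge-distance + reprojection minimization for a relative pose.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pose_from_params(xi):
    R = Rotation.from_rotvec(xi[:3]).as_matrix()
    return R, xi[3:]

def residuals(xi, edge_terms, point_terms, lam):
    """edge_terms(R, t) and point_terms(R, t) return 1-D residual arrays
    (DT lookups and reprojection distances respectively)."""
    R, t = pose_from_params(xi)
    r_edge = edge_terms(R, t)
    r_point = np.sqrt(lam) * point_terms(R, t)    # lambda balances the two terms
    return np.concatenate([r_edge, r_point])

def estimate_relative_pose(edge_terms, point_terms, xi0, lam=0.5):
    # xi0: constant-motion initialization, so the solver starts near the minimum.
    sol = least_squares(residuals, xi0, args=(edge_terms, point_terms, lam),
                        loss='huber', f_scale=1.0)
    return pose_from_params(sol.x)
```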
For moving objects, the edge features and point features of each moving object are selected using its motion instance, and the motion of each object is estimated in the same way, by jointly minimizing the edge distance error and the reprojection error restricted to the object's own features. For moving object $j$, the edge distance error $E_{edge}^{j}$ from the current time $c$ to the reference time $r$ is
$$E_{edge}^{j} = \sum_{\mathbf{p}\in\mathcal{E}_c^{j}} w_h\!\big(D_r^{j}\big(\tau(T_{cr}^{j},\mathbf{p},d_{\mathbf{p}})\big)\big)$$
where $D_r^{j}$ denotes the DT of moving object $j$ in the reference frame, $T_{cr}^{j}$ is the object-centric relative transform of moving object $j$, $\mathbf{p}$ is a pixel location in the current frame belonging to an edge of the object's motion instance, $d_{\mathbf{p}}$ is the depth value at that location, $\tau(\cdot)$ is the transformation function that re-projects $\mathbf{p}$ from the current frame into the reference frame using $T_{cr}^{j}$, $\mathcal{E}_c^{j}$ is the set of edge pixels with effective depth inside the object's motion instance in the current frame, and $w_h$ is a Huber weight function.
For moving object $j$, the point-feature reprojection error $E_{point}^{j}$ from the current time $c$ to the reference time $r$ is
$$E_{point}^{j} = \sum_{\mathbf{x}_i\in\mathcal{P}_c^{j}} w_h\!\big(\big\|\pi(T_{cr}^{j},\mathbf{X}_i^{j})-\mathbf{x}_i^{obs}\big\|\big)$$
where $\pi(T_{cr}^{j},\mathbf{X}_i^{j})-\mathbf{x}_i^{obs}$ is the reprojection distance residual of moving object $j$, $\pi(\cdot)$ is the projection function that projects the 3D point $\mathbf{X}_i^{j}$ belonging to the moving object to a 2D pixel location in the reference frame, $\mathbf{x}_i^{obs}$ is its observed pixel coordinate in the reference frame, $\mathcal{P}_c^{j}$ is the set of 3D points of the moving object in the current frame, and $w_h$ is a Huber weight function.
The optimization combining edge features and point features in the pose estimation of moving object $j$ then proceeds as
$$T_{cr}^{j\,*} = \arg\min_{T_{cr}^{j}}\; E_{edge}^{j} + \lambda\,E_{point}^{j}$$
where $T_{cr}^{j\,*}$ is the relative motion to be optimized and $\lambda$ is a balance factor.
Through this multi-motion estimation method combining point features and edge features, and with the help of optical flow information, the number of usable target points on moving objects that occupy only a small image area is increased, enabling robust estimation of the absolute motion trajectories of the camera and of each moving object.
S330, extracting a preset number of key frames from a subframe sequence containing a preset frame number in the frame sequence based on an optical flow estimation result and structural features of each frame, generating a local map corresponding to the subframe sequence based on the pose, static background and key points of a camera and a moving object in the key frames, and fusing each local map to generate a global map corresponding to the frame sequence.
As shown in fig. 2, this step is performed by the mapping thread. The method only processes the selected key frames, so as to keep a key frame set in the global map that is as uniformly distributed as possible and controllable in size; at the same time, a key frame management strategy that considers both camera motion and moving-object tracking quality is designed, which is better suited to key frame maintenance for a dynamic visual SLAM method.
More key frames are acquired at first, about 5-15 key frames per second, and the key frame set is then thinned by marginalizing redundant key frames. For camera motion tracking, the present patent decides whether to create a key frame by computing the following three measures:
To create a new key frame that accounts for the change in viewing angle, the mean square optical flow $f$ from the last key frame to the latest frame is computed, using the re-projections $\mathbf{p}'$ and $\mathbf{x}'$ of the edge feature points $\mathbf{p}$ and the feature points $\mathbf{x}$.
Camera translation causes occluded and disoccluded regions to appear; even if $f$ is small, handling such occlusions also requires more key frames. This is measured by the mean square optical flow without rotation, $f_t$, computed from re-projections $\mathbf{p}_t'$ and $\mathbf{x}_t'$ that take only the translation into account and ignore the rotation.
The number of points whose edge-feature or feature-point re-projection error is larger than a distance threshold, $n_{out}$, and the number of points whose re-projection error is smaller than the threshold, $n_{in}$, are also counted.
Using these measures, a new key frame is created when one of the corresponding preset conditions is met (for example, when $f$ or $f_t$ exceeds its threshold, or when $n_{out}$ becomes large relative to $n_{in}$).
For key frame creation for moving-object tracking, consider the case where an object has many features but only a few points are tracked in the current frame; at the same time, the relative speed and magnitude of an object's motion are sometimes large and tracking easily fails, so a sufficient number of reasonably distributed key frames is needed to ensure the pose accuracy of the moving object. Therefore, the present patent also decides whether to insert a new key frame by evaluating the tracking quality of the moving object, and the tracking-quality measure is obtained from an inter-frame overlap histogram. The present disclosure re-projects the edge features and feature points tracked in the previous $N$ frames into the current frame and counts, at each pixel, how many re-projections land there, producing a count map $C$. The value at each pixel position of the count map ranges from 0 to $N$: a value of 0 indicates that no pixel is re-projected to that location, and a value of $N$ indicates that every one of the selected previous frames has a pixel re-projected to that location. By evaluating $C$ at the pixel positions of the edge feature points and feature points in the current frame, an overlap histogram $H$ of size $N+1$ is generated; for the pixels $\mathbf{u}$ belonging to edge feature points or feature points,
$$H[k] = \sum_{\mathbf{u}} \mathbb{1}\big(C(\mathbf{u}) = k\big), \qquad k = 0, 1, \dots, N.$$
Thus, large values of $H$ in the high-overlap bins together with a small non-overlap count $H[0]$ indicate good tracking quality. The present disclosure inserts a new key frame when the weighted sum of the overlapping re-projection bins of $H$ falls below the non-overlap count $H[0]$; key frames are thereby added in response to the motion of the moving object.
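A sketch of this overlap-histogram tracking-quality test follows, under the reconstruction given above; the specific weighting of the overlap bins is an assumption.

```python
# Sketch: count map and overlap histogram for moving-object keyframe insertion.
import numpy as np

def overlap_histogram(reproj_pixel_sets, curr_feature_pixels, shape, N):
    count_map = np.zeros(shape, dtype=np.int32)
    for pixels in reproj_pixel_sets[-N:]:           # one pixel set per previous frame
        for u, v in pixels:
            if 0 <= v < shape[0] and 0 <= u < shape[1]:
                count_map[v, u] += 1
    H = np.zeros(N + 1, dtype=np.int64)
    for u, v in curr_feature_pixels:                # feature/edge pixels of current frame
        H[count_map[v, u]] += 1
    return H

def need_object_keyframe(H, N):
    weighted_overlap = sum(k * H[k] for k in range(1, N + 1)) / N
    return weighted_overlap < H[0]                  # poor overlap -> insert key frame
```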
Since edge features usually comprise many edge pixels, each key frame in the optimized frame set carries a large amount of feature information, and optimizing all edges of every key frame in the set makes real-time operation on a CPU difficult. The present patent reduces the computation by maintaining active edges and active feature points and by limiting the number of key frames in the optimized frame set. Before adding a new key frame, the present patent removes an old key frame by key frame marginalization, using the following strategy:
1. If fewer than 5% of a key frame's points are visible in the most recent key frame, that key frame is marginalized.
2. If no key frame satisfies condition 1, one key frame is marginalized by maximizing a distance score over the key frame set excluding the last two key frames. The distance score is used to keep the key frame set uniformly distributed in 3D space, favoring the removal of key frames that lie close to other key frames; it is computed as
$$s(i) = \sum_{j\neq i} \frac{1}{d(i,j) + \epsilon}$$
where $d(i,j)$ is the Euclidean distance between key frames $i$ and $j$, and $\epsilon$ is a small constant.
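The sketch below illustrates this marginalization strategy, assuming the reciprocal-distance score reconstructed above: the key frame with the largest score (i.e. the one closest to the others) is removed, keeping the set spread out, while the two newest key frames are protected.

```python
# Sketch: choose which key frame to marginalize from the optimized frame set.
import numpy as np

def distance_score(i, positions, eps=1e-3):
    """positions: (K,3) camera centers of the keyframes in the optimized set."""
    s = 0.0
    for j, p in enumerate(positions):
        if j != i:
            s += 1.0 / (np.linalg.norm(positions[i] - p) + eps)
    return s

def select_keyframe_to_marginalize(positions, protect_last=2):
    candidates = range(len(positions) - protect_last)   # never drop the two newest
    return max(candidates, key=lambda i: distance_score(i, positions))
```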
When the set of active variables becomes too large to handle in practice, the present disclosure uses the Schur complement to eliminate old variables, removing any terms that would affect the sparsity pattern of the system matrix. Whenever a key frame is marginalized, the marginalization first marginalizes all of its active edges and active feature points, and also marginalizes all edge features and feature points that were not observed in the last two key frames. Finally, all of these observed edge features and feature points are completely deleted from the system.
The present disclosure constructs the local map from a fixed-size local optimization frame set containing the information of the most recent $N_w$ frames, where $N_w$ is the window size. The multi-motion factor graph optimization is performed within this local optimization frame set; it optimizes the camera poses, object poses, static background key points and moving-object key points, which are then updated into the global map.
The result of the local map optimization, consisting of camera poses, moving-object poses, static background key points and object key points, is stored in the global map, which is built from all previously processed frames and is continuously updated as each new frame is processed. After all input frames have been processed, a globally consistent optimization is performed with the designed multi-motion factor graph, and the optimization result is taken as the output of the whole system.
The multi-motion factor graph optimization method used by the method establishes an error term model considering four tasks, and performs joint optimization on motions of cameras and various moving objects and static background maps and moving object structural representations. In consideration of the operation efficiency requirement, a dense model mode is not adopted in the aspects of static background mapping and moving object structure modeling, but a sparse map of a static background and sparse structure point representation of a moving object are constructed by using a sparse representation method of key points, so that the operation efficiency of an overall system is improved, and especially in an outdoor large scene, the sparse representation of the key points has more obvious efficiency advantage compared with the dense representation. Considering the multi-motion factor graph of a moving object as shown in fig. 5, this problem of joint optimization contains five types of error terms: the re-projection error of the static background point, the re-projection error of the moving object point, the camera pose estimation observation error, the moving object uniform motion error and the moving object point uniform motion error, the calculation process of the five error terms is specifically as follows:
For the camera pose $X_k$ at time $k$, the re-projection error of a static background point is
$$e_{k,i}^{stat} = \pi\big(X_k^{-1}\,\mathbf{m}_i\big) - \mathbf{z}_{k,i}$$
where $\mathbf{m}_i$ is the coordinate of the 3D static map point, $\pi(\cdot)$ is the re-projection function that projects a 3D point in the camera coordinate system to pixel coordinates, and $\mathbf{z}_{k,i}$ is the observed pixel coordinate of $\mathbf{m}_i$ at time $k$.
For moving object $j$ at time $k$, the re-projection error of a moving-object point is expressed as
$$e_{k,i}^{dyn,j} = \pi\big(X_k^{-1}\,L_k^{j}\,\mathbf{m}_i^{j}\big) - \mathbf{z}_{k,i}^{j}$$
where $L_k^{j}$ is the object-centric pose of moving object $j$ at time $k$, $\mathbf{m}_i^{j}$ is the coordinate of moving-object point $i$ of object $j$, and $\mathbf{z}_{k,i}^{j}$ is its observed pixel coordinate at time $k$. These two re-projection errors jointly optimize the poses of the camera and of the different moving objects, together with the positions of the corresponding 3D points.
The camera pose estimation observation error is used to constrain the relative transform between frames and minimize the variation between camera motions in consecutive time steps; it penalizes the deviation of the optimized relative pose $X_{k-1}^{-1}X_k$ from the relative camera motion observed by the tracking thread.
In order to smooth the motion trajectory of a moving object, the present disclosure assumes that the object moves at constant velocity between consecutive observations. Denoting the linear and angular velocities of moving object $j$ at time $k$ as $\mathbf{v}_k^{j}$ and $\boldsymbol{\omega}_k^{j}$, the uniform-motion error of the moving object penalizes the change of these velocities between consecutive observation times:
$$e_{k}^{vel,j} = \begin{bmatrix}\mathbf{v}_k^{j}\\ \boldsymbol{\omega}_k^{j}\end{bmatrix} - \begin{bmatrix}\mathbf{v}_{k-1}^{j}\\ \boldsymbol{\omega}_{k-1}^{j}\end{bmatrix}.$$
The uniform-motion error of a moving-object point couples the moving object's velocity with the moving object's pose and the corresponding 3D points, establishing the error term jointly with the observations. It is defined as
$$e_{k,i}^{pt,j} = \mathbf{m}_{k,i}^{j} - H_{k-1,k}^{j}\,\mathbf{m}_{k-1,i}^{j}, \qquad H_{k-1,k}^{j} = \exp\!\big(\Delta t\,\boldsymbol{\xi}_k^{j}\big), \qquad \boldsymbol{\xi}_k^{j} = \begin{bmatrix}\boldsymbol{\omega}_k^{j}\\ \mathbf{v}_k^{j}\end{bmatrix}$$
where $H_{k-1,k}^{j}$ is the pose transformation of moving object $j$ over the time interval $\Delta t$ between the consecutive observations $k-1$ and $k$, $\boldsymbol{\xi}_k^{j}$ is composed of the angular and linear velocities of moving object $j$ at time $k$, and $\exp(\cdot)$ is the exponential map of $SE(3)$.
Therefore, the bundle adjustment problem of the designed multi-motion factor graph optimization is defined as in equation (3.3). It comprises the five error terms and, within the optimized frame set, jointly optimizes the camera pose at each moment, the set of map points it observes, and the corresponding moving-object points:
$$\Theta^{*} = \arg\min_{\Theta} \sum_{k\in\mathcal{K}}\Bigg(\sum_{i\in\mathcal{M}_k}\rho\big(\|e_{k,i}^{stat}\|_{\Sigma}^{2}\big) + \rho\big(\|e_{k}^{cam}\|_{\Sigma}^{2}\big) + \sum_{j\in\mathcal{O}_k}\Big(\sum_{i\in\mathcal{M}_k^{j}}\rho\big(\|e_{k,i}^{dyn,j}\|_{\Sigma}^{2}\big) + \rho\big(\|e_{k}^{vel,j}\|_{\Sigma}^{2}\big) + \sum_{i\in\mathcal{M}_k^{j}}\rho\big(\|e_{k,i}^{pt,j}\|_{\Sigma}^{2}\big)\Big)\Bigg) \qquad (3.3)$$
where $\mathcal{K}$ is the set of frames to be optimized, $\mathcal{M}_k$ is the set of map points observed by frame $k$, $\mathcal{O}_k$ is the set of moving objects at time $k$, $\mathcal{M}_k^{j}$ is the set of object points of moving object $j$, $\Theta$ denotes the parameters to be optimized, $\rho(\cdot)$ is a robust Huber kernel function used to reduce the weight of outlier correspondences, and $\Sigma$ denotes the covariance matrix. For the re-projection errors, $\Sigma$ is related to the scale at which the map point $\mathbf{m}_i$ or object point $\mathbf{m}_i^{j}$ is observed by the camera at time $k$; for the other three error terms, $\Sigma$ is related to the time interval between two consecutive observations of the moving object: the longer the time interval, the greater the uncertainty of the uniform-motion assumption.
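The compact sketch below spells out the five residual types entering this multi-motion bundle adjustment, using the notation reconstructed above; the 4x4 homogeneous-pose convention, the project and log_se3 callables, and the data layout are illustrative placeholders rather than the disclosure's exact implementation.

```python
# Sketch: the five error terms of the multi-motion factor graph (3.3).
import numpy as np

def reproj_static(X_k, m_i, z_ki, project):
    """Static background point: pi(X_k^-1 m_i) - z_ki."""
    return project(np.linalg.inv(X_k) @ np.append(m_i, 1.0)) - z_ki

def reproj_dynamic(X_k, L_kj, m_ij, z_kij, project):
    """Moving-object point: pi(X_k^-1 L_k^j m_i^j) - z_ki^j."""
    return project(np.linalg.inv(X_k) @ L_kj @ np.append(m_ij, 1.0)) - z_kij

def camera_observation_error(X_prev, X_k, T_obs, log_se3):
    """Pull the optimized relative motion toward the tracked relative motion."""
    return log_se3(np.linalg.inv(T_obs) @ np.linalg.inv(X_prev) @ X_k)

def object_uniform_motion(v_k, w_k, v_prev, w_prev):
    """Object velocity should change slowly between consecutive observations."""
    return np.concatenate([v_k - v_prev, w_k - w_prev])

def point_uniform_motion(m_k, m_prev, H_kj):
    """Object point should follow the object's constant-velocity transform H."""
    return m_k - (H_kj @ np.append(m_prev, 1.0))[:3]
```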
Through this multi-motion factor graph optimization framework, the method and device can jointly optimize the motions of the camera and of each moving object together with the static background map and the structural representations of the moving objects, obtaining a globally consistent, high-accuracy result.
FIG. 6 is a schematic diagram of a dynamic visual SLAM device for sequential sparse reconstruction shown in accordance with some embodiments of the present disclosure. As shown in fig. 6, the dynamic visual SLAM device 600 for time-series sparse reconstruction includes a data acquisition module 610, a tracking module 620, and a mapping module 630. The dynamic visual SLAM function for the time-series sparse reconstruction may be performed by a SLAM server in the system shown in fig. 1. Wherein:
The data acquisition module 610 is configured to acquire a continuous binocular image sequence of a dynamic scene, acquire a mapping parameter of each frame of left-view image, acquire a motion instance of the image based on motion instance segmentation, and assemble the mapping parameter and the motion instance at corresponding moments into one frame of mapping data, where the mapping parameter at least includes a depth map, an optical flow estimation result, and a structural feature of the left-view image, and the structural feature includes a point feature and an edge feature;
a tracking module 620, configured to determine, in adjacent frames in a frame sequence consisting of consecutive frames of the mapping data, a correspondence of the structural feature based on the mapping parameter, determine a key point in the structural feature, associate a moving object in the adjacent frames based on the motion instance and the key point, and determine pose of the camera and the moving object respectively using the motion instance and the key point;
The mapping module 630 is configured to extract a preset number of key frames from a subframe sequence containing a preset number of frames in the frame sequence based on an optical flow estimation result and structural features of each frame, generate a local map corresponding to the subframe sequence based on pose, static background and key points of a camera and a moving object in the key frames, and fuse each local map to generate a global map corresponding to the frame sequence.
In summary, the dynamic visual SLAM method and device for time-sequential sparse reconstruction provided by the embodiments of the present disclosure segment the motion instances of moving objects with a class-agnostic motion instance segmentation network, which segments objects without semantic information well even in complex dynamic scenes; the multi-motion estimation method combining point features, edge features and optical flow information alleviates, to a certain extent, the scarcity of key points on moving objects that occupy a small image area, and improves the accuracy of absolute trajectory estimation for moving objects; finally, by constructing an error equation covering the four tasks of camera trajectory estimation, moving-object trajectory estimation, static background mapping and moving-object reconstruction, the low accuracy of moving-object information caused by the absence of joint optimization over the four tasks is resolved.
It is to be understood that the above-described embodiments of the present disclosure are merely illustrative or explanatory of the principles of the disclosure and are not restrictive of the disclosure. Accordingly, any modifications, equivalent substitutions, improvements, or the like, which do not depart from the spirit and scope of the present disclosure, are intended to be included within the scope of the present disclosure. Furthermore, the appended claims of this disclosure are intended to cover all such changes and modifications that fall within the scope and boundary of the appended claims, or the equivalents of such scope and boundary.

Claims (8)

CN202410725976.9A | Priority/Filing date: 2024-06-06 | Dynamic visual SLAM method and device for time sequence sparse reconstruction | Active | CN118314162B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410725976.9A (CN118314162B, en) | 2024-06-06 | 2024-06-06 | Dynamic visual SLAM method and device for time sequence sparse reconstruction

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410725976.9A (CN118314162B, en) | 2024-06-06 | 2024-06-06 | Dynamic visual SLAM method and device for time sequence sparse reconstruction

Publications (2)

Publication Number | Publication Date
CN118314162A (en) | 2024-07-09
CN118314162B | 2024-08-30

Family

ID=91725954

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410725976.9A (CN118314162B, Active) | Dynamic visual SLAM method and device for time sequence sparse reconstruction | 2024-06-06 | 2024-06-06

Country Status (1)

Country | Link
CN (1) | CN118314162B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109166149A (en)* | 2018-08-13 | 2019-01-08 | 武汉大学 | Positioning and three-dimensional wire-frame reconstruction method and system fusing a binocular camera and an IMU
CN113808169A (en)* | 2021-09-18 | 2021-12-17 | 南京航空航天大学 | ORB-SLAM-based large-scale equipment structure surface detection path planning method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP3551967A2 (en)* | 2016-12-09 | 2019-10-16 | TomTom Global Content B.V. | Method and system for video-based positioning and mapping
US10269147B2 (en)* | 2017-05-01 | 2019-04-23 | Lockheed Martin Corporation | Real-time camera position estimation with drift mitigation in incremental structure from motion
CN111260661B (en)* | 2020-01-15 | 2021-04-20 | 江苏大学 | A visual semantic SLAM system and method based on neural network technology
CN114612494B (en)* | 2022-03-11 | 2025-07-04 | 南京理工大学 | A design method for visual odometry of mobile robots in dynamic scenes

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109166149A (en)* | 2018-08-13 | 2019-01-08 | 武汉大学 | Positioning and three-dimensional wire-frame reconstruction method and system fusing a binocular camera and an IMU
CN113808169A (en)* | 2021-09-18 | 2021-12-17 | 南京航空航天大学 | ORB-SLAM-based large-scale equipment structure surface detection path planning method

Also Published As

Publication number | Publication date
CN118314162A | 2024-07-09

Similar Documents

Publication | Title
CN112304307B | Positioning method and device based on multi-sensor fusion and storage medium
CN113985445B | 3D target detection algorithm based on camera and laser radar data fusion
CN108564616B | Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN114782499B | A method and device for extracting static areas of images based on optical flow and view geometry constraints
Kang et al. | Detection and tracking of moving objects from a moving platform in presence of strong parallax
Kamencay et al. | Improved Depth Map Estimation from Stereo Images Based on Hybrid Method
CN107657644B | Sparse scene flow detection method and device in a mobile environment
CN108898676A | Method and system for detecting collision and shielding between virtual and real objects
CN115619826B | A dynamic SLAM method based on reprojection error and depth estimation
CN110599545B | Feature-based dense map construction system
CN112446882A | Robust visual SLAM method based on deep learning in dynamic scenes
CN112419497A | Monocular vision-based SLAM method combining feature method and direct method
CN111998862A | Dense binocular SLAM method based on BNN
CN113362441A | Three-dimensional reconstruction method and device, computer equipment and storage medium
CN113298871B | Map generation method, positioning method, system thereof, and computer-readable storage medium
CN119068042B | Cargo volume calculation method and system based on panoramic video
CN115222884A | Space object analysis and modeling optimization method based on artificial intelligence
Shalaby et al. | Algorithms and applications of structure from motion (SFM): A survey
CN116045965A | Multi-sensor-integrated environment map construction method
CN113850293A | Localization method based on joint optimization of multi-source data and direction priors
CN116894876A | 6-DOF positioning method based on real-time image
Jisen | A study on target recognition algorithm based on 3D point cloud and feature fusion
CN114972539A | On-line calibration method, system, computer equipment and medium for camera plane in computer room
CN118887353A | A SLAM mapping method integrating points, lines and visual labels
CN118314162B | Dynamic visual SLAM method and device for time sequence sparse reconstruction

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
