Background
Unmanned aerial vehicle (UAV) autonomous navigation and positioning technology determines the position and attitude of the UAV from sensor observations in a completely unknown environment, providing the data basis for subsequent tasks such as map construction and navigation. Autonomous positioning is not only a core component of an unmanned system, but also the key to robust operation in complex environments.
Traditional unmanned systems usually combine satellite navigation with inertial navigation. However, satellite positioning is strongly affected by the environment: signals are easily blocked or suffer multipath effects when operating in valleys, built-up areas and indoors, and are easily blocked or jammed in wartime. When satellite positioning fails, inertial navigation alone is affected by sensor errors, system noise and accumulated integration error; after long-term operation, position and velocity estimates drift and additional time is needed to correct the accumulated error, making the navigation system unstable and inaccurate.
Purely visual localization was first proposed at Stanford University in the early 1980s, but, limited by sensor accuracy and processor performance, it remained at the experimental stage for a long time. With the progress of electronic information technology in the twenty-first century, vision-based positioning has developed rapidly. Visual navigation currently falls into two categories: relative navigation and absolute navigation. Relative navigation, i.e. SLAM technology, does not depend on a reference image library and can autonomously compute the relative pose from the correlation of two image frames, but suffers from error accumulation. Absolute navigation realizes UAV positioning through image matching: the absolute pose is computed against a reference image library, so there is no accumulated error, but the reference library must be stored on board, which is difficult for long-endurance UAV deployment and requires large storage capacity.
For relative navigation (SLAM), a visual/inertial sensor fusion approach is currently the common choice. The earliest visual/inertial system was proposed in 2007 by Anastasios Mourikis et al. at the University of California, Riverside, USA, after which a series of classical algorithms such as VI-ORB, VINS-Mono and ORB-SLAM3 emerged. VI-ORB (Visual-Inertial ORB-SLAM), proposed in 2017, is a visual-inertial fusion system based on the ORB-SLAM framework that tightly couples IMU data with ORB visual features. Its core improvement is that trajectory accuracy is raised by jointly optimizing the visual re-projection error and the IMU measurement error. The approach still has certain limitations: 1) it is sensitive to time-synchronization and calibration errors between the IMU and the camera, and minor deviations may produce accumulated errors; 2) it has no explicit handling of dynamic objects and depends on a static-environment assumption, so dynamic objects may interfere with the positioning result. VINS-Mono, proposed in the same year, is a representative framework of monocular visual-inertial SLAM that adopts a tightly coupled optimization strategy to fuse IMU and monocular image data. Its innovation is to balance accuracy and efficiency by combining sliding-window optimization with global pose-graph adjustment. Its main drawbacks are that 1) no dynamic-object detection module is integrated, so positioning may fail in dynamic scenes, and 2) loop-closure detection relies on a visual bag-of-words model and is prone to mismatches under repeated textures or viewpoint changes. ORB-SLAM3, proposed in 2020, is the third generation of the ORB-SLAM series and supports multi-modal input (monocular, stereo, RGB-D, visual-inertial). Its core improvements are that 1) a multi-map system (Atlas) is introduced to support cross-scene map fusion and long-term positioning, and 2) the visual-inertial initialization process is optimized to improve convergence speed and robustness. Yet problems remain: 1) dynamic-environment handling capacity is limited, since a static assumption is used and dynamic features are not explicitly removed; 2) divergence caused by long-term accumulation of visual-inertial errors is not resolved; 3) map consistency is insufficient, and constructing a practical map relies on extension modules. From the above analysis, for long-endurance flights across multiple scenes, the visual-inertial navigation scheme faces unavoidable error accumulation.
The image matching method of absolute navigation originated in cruise-missile terminal guidance and gradually developed into a visual navigation technology. Scene matching navigation systems (SMNS) feature simple equipment, passive operation and comparatively high positioning accuracy: an image sensor acquires an image of the area near the flight path or target area and matches it against a stored reference image to obtain the position of the aircraft. The UAV image matching navigation system is shown in fig. 1: after the destination is acquired, preliminary path planning is performed using satellite and aerial imagery, matching-suitable areas are then selected along the planned route, reference images are obtained through correction, and the real-time image obtained by the vision sensor is matched against the reference image to realize navigation and positioning. The core of the method is the image matching algorithm, whose performance directly determines the performance of the navigation system. According to the way visual features are extracted, image matching algorithms can be divided into methods based on template matching, methods based on local invariant features, and methods based on scene semantic learning. The UAV scene-matching positioning algorithm based on template matching is one of the earliest representative UAV visual positioning methods; its core idea is to build a matching model from image information such as pixel intensity to complete scene matching. Image matching algorithms based on local invariant features (also known as hand-crafted features) show stronger environmental adaptability than template-matching algorithms. Local invariant features, i.e. hand-crafted feature descriptors, typically rely on expert prior knowledge in their design, and in many applications they present advantages over template-matching methods. With the development of computer vision, deep neural networks, by virtue of their ability to capture high-dimensional semantic features in an image, have given rise to scene-semantic-learning methods. Compared with template matching and local invariant features, the image matching method based on scene semantic learning benefits from the semantic-information extraction capability of deep neural networks, so the image matching relationship can be established better. At the same time, by means of semantic information extraction, the image matching algorithm can cope with more complex viewpoint changes and overcome various possible disturbances such as illumination changes and noise. As a representative of multi-view scene matching algorithms based on metric learning, Zheng et al. released the first multi-source multi-view scene-matching UAV visual positioning dataset, University-1652, which provides support for multi-view scene matching research. On this basis, the authors built a three-branch twin (Siamese) neural network model that completes scene matching by establishing correspondences among multi-view UAV images, street-view images and satellite images. The feature extraction model adopts a classical ResNet structure with no additional improvement.
To address the viewpoint difference faced by UAV visual positioning algorithms when matching images, Yang et al. fully exploit the positional encoding of features. The positional-encoding information guides a dual-branch twin (Siamese) network to establish region correspondences between multi-view image pairs. By marking the viewpoint relationship of the images through positional encoding, the network can learn the association of regions after viewpoint change, improving the multi-view robustness of the algorithm. However, because a Transformer is chosen as the feature extraction backbone and the spatial-location correspondence module gives the network large complexity, there are certain limitations for training and deployment on mobile platforms. Wang et al. propose a local pattern network that employs a square-ring feature partitioning strategy to obtain features describing the central target and the surrounding information. Similarity results based on the central and edge square rings are obtained by measuring the corresponding feature pairs multiple times, and the final similarity measurement is obtained by fusion. However, the square-ring partitioning pattern of the similarity measure is completely fixed during region division of the image, which makes it difficult for the algorithm to adjust itself and make optimal decisions in different situations. As can be seen from the above description, the image matching method based on scene semantic learning handles complex scenes well and, as a feature matching approach centered on deep learning, has excellent generalization and broad prospects. Its drawbacks are that it places high requirements on hardware and on an offline map dataset of the scene, and that existing models do not reconcile light weight with matching performance.
In short, in UAV autonomous navigation and positioning, remote long-endurance missions face two core problems. First, visual-inertial navigation error accumulates continuously in GNSS-denied environments, so positioning accuracy degrades markedly as the range increases. Second, the traditional positioning method based on image matching can provide absolute position correction, but its dependence on a full-route, high-quality reference map makes it difficult to deploy and poor in real-time performance, and it struggles to work stably when part of the regional reference map is missing or has changed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a remote long-endurance integrated navigation method and system combining visual-inertial joint optimization and image matching.
In view of the above, the invention provides a remote long-endurance integrated navigation method combining visual-inertial joint optimization and image matching, which is realized based on a monocular camera and an IMU and comprises the following steps:
step 1, acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in a task path according to map conditions;
step 2, performing high-frequency relative navigation between navigation route points by adopting visual inertial navigation to realize continuous pose estimation;
Step 3, based on the constructed multi-mode navigation autonomous decision scheme, triggering the switching of navigation strategies when approaching a waypoint, and converting to using image matching navigation;
step 4, according to the real-time photo shot by the camera, realizing image matching navigation by using a Latent Diffusion model and an XFeat model, and correcting the accumulated relative error between the waypoints;
and 5, repeating the steps 2 to 4 among a plurality of waypoints until the task path is traversed.
Preferably, the step 2 includes:
step 2-1, pre-integrating IMU data acquired by the IMU to construct a pre-integration error term;
Step 2-2, selecting a key frame for a real-time image acquired by a monocular camera, and processing the key frame by adopting a minimum luminosity error direct method to obtain the luminosity error of the image of the adjacent key frame;
and 2-3, carrying out joint error optimization on the IMU pre-integration error term and the image photometric error to obtain more accurate camera pose information.
Preferably, the pre-integration error term constructed in step 2-1 is defined, over the time interval $[i, j]$, as the difference between the predicted value and the actual observed value, satisfying:

$$r_{\Delta R_{ij}} = \mathrm{Log}\!\left(\Delta R_{ij}^{T}\, R_i^{T} R_j\right)^{\vee}$$
$$r_{\Delta v_{ij}} = R_i^{T}\left(v_j - v_i - g\,\Delta t_{ij}\right) - \Delta v_{ij}$$
$$r_{\Delta p_{ij}} = R_i^{T}\left(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\, g\,\Delta t_{ij}^{2}\right) - \Delta p_{ij}$$

Wherein $\mathrm{Log}(\cdot)$ is the logarithmic map of $SO(3)$, $(\cdot)^{\vee}$ is the Lie-algebra vector extraction operation, $\Delta R_{ij}$, $\Delta p_{ij}$, $\Delta v_{ij}$ are respectively the pre-integrated rotation, position and velocity increments from time $i$ to time $j$, $R_i, p_i, v_i$ and $R_j, p_j, v_j$ respectively denote the rotation, position and velocity of the IMU in the world coordinate system at time $i$ and time $j$, $g$ is the gravitational acceleration vector in the world coordinate system, $\Delta t_{ij}$ is the time interval, and the superscript $T$ denotes the transpose.
Preferably, the step 2-2 includes:
Calculating the gradient amplitude matrix difference of any two adjacent frames of images;
Dividing each frame of image into 32 × 32 grids, each grid retaining the point with the largest gradient-magnitude difference between adjacent frames, to obtain the initially selected image key frames;
Dynamically removing the image key frames with insufficient stability, thereby obtaining a plurality of key frames;
For adjacent key frames $i$ and $j$ after dynamic elimination, the corresponding image photometric error is obtained according to the following formula:

$$E_{photo} = \sum_{p}\left\| I_j(p') - I_i(p) \right\|^{2}$$

Wherein $I(\cdot)$ denotes the image gray value, $p$ is a pixel in key frame $i$, and $p'$ is the coordinate of pixel $p$ projected into the current key frame $j$ through the pose transformation.
Preferably, the step 2-3 includes:
And unifying the IMU pre-integration error term and the image photometric error into a nonlinear least-squares problem, marginalizing old poses and map points by a sliding-window optimization method, and updating the positions of nearby map points through new observations, to obtain more accurate camera pose information.
Preferably, the multi-modal navigation autonomous decision in step 3 includes:
Extracting semantic feature vectors and geometric feature vectors of the environment in the current map by adopting a hierarchical feature pyramid, normalizing the extracted feature vectors, and fusing;
and calculating the cosine similarity between the fused feature vector and the pre-stored target-area features by adopting a similarity evaluation algorithm, triggering the switching of the navigation strategy when the similarity exceeds a threshold for a set number of consecutive frames, and converting to the image matching navigation method.
Preferably, the hierarchical feature pyramid comprises a bottom layer and a high layer, wherein,
The bottom layer adopts ORB to extract geometric features including edges and corner points, and the upper layer adopts SIFT to extract semantic objects including building outlines and texture areas.
Preferably, the step 4 includes:
Converting the real-time photo shot by the camera into a picture in a gallery style through a trained Latent Diffusion model;
Performing image feature matching between the style-converted real-time image captured around the waypoint and the offline gallery photo by using a trained XFeat model, predicting the offsets, and realizing pixel-level matching by classification;
Solving the matching result to calculate the current pose of the unmanned aerial vehicle;
And marginalizing old poses and map points by a sliding-window optimization method, and updating the positions of nearby map points through new observations, thereby correcting the relative error accumulated between the waypoints.
On the other hand, the invention provides a remote long-endurance integrated navigation system combining visual inertia joint optimization and image matching, which is realized based on a monocular camera and an IMU, and comprises a data acquisition module, a visual inertia navigation module, a multi-mode navigation autonomous decision module and an image matching navigation module, wherein,
The data acquisition module is used for acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in the mission path according to the map condition;
The visual inertial navigation module is used for performing high-frequency relative navigation between navigation points by adopting visual inertial navigation to realize continuous pose estimation;
the multi-modal navigation autonomous decision-making module is used for triggering the switching of the navigation strategy when approaching a waypoint, based on the constructed multi-modal navigation autonomous decision scheme, and switching to the image matching navigation module;
The image matching navigation module is used for realizing image matching navigation by using the Latent Diffusion model and the XFeat model according to the real-time picture shot by the camera, and correcting the relative error accumulated between route points;
and repeatedly using the visual inertial navigation module, the multi-mode navigation autonomous decision-making module and the image matching navigation module among the plurality of navigation route points until the task path is traversed.
Compared with the prior art, the invention has the advantages that:
1. Image matching navigation method based on the Latent Diffusion and XFeat models
The traditional image matching method directly matches the acquired real-time image with pre-stored target image features and then solves the pose from the matches. Such positioning can provide absolute position correction, but its dependence on a full-route, high-quality reference image makes deployment difficult and real-time performance poor, and it places high accuracy requirements on the image matching model. In the present method, the picture shot by the camera is first converted into a gallery-style picture through the Latent Diffusion model, and feature extraction and matching are then performed with the recently proposed XFeat model. The Latent Diffusion model migrates the diffusion process to a low-dimensional latent space, preserving generated image quality while remaining computationally efficient. The lightweight XFeat model maintains good feature matching accuracy by adopting the brand-new strategy of "doubling the convolution depth of the network whenever the resolution of the input image is halved".
2. Multi-modal navigation autonomous handoff strategy
A strategy for autonomously switching from visual inertial navigation to image matching navigation is constructed. Firstly, extracting semantic features and geometric features of the current environment by using a hierarchical feature pyramid (ORB+SIFT combination). And secondly, calculating cosine similarity between the environmental features acquired by the current camera in real time and the pre-stored target region features according to a similarity evaluation algorithm, triggering navigation strategy switching when the similarity exceeds a threshold value, and converting into an image matching navigation method.
3. Combined navigation scheme based on visual inertial navigation and image matching method
The invention provides a combined navigation strategy that adopts visual-inertial navigation between aircraft waypoints and uses image matching navigation near the waypoints. The advantage of this approach is that it effectively compensates for the relative error accumulated by long-term operation of visual-inertial navigation. For the image matching navigation algorithm, restricting its application to the vicinity of the waypoints greatly reduces the required coverage of the data set, and the distinctiveness of the features around the waypoints allows the feature matching task to be completed with a lightweight model. These advantages are exactly complementary to the highly mature visual-inertial navigation, enabling further improvement of positioning accuracy.
Detailed Description
The invention designs a combined navigation scheme that couples visual-inertial navigation with image matching: visual-inertial navigation performs high-frequency relative navigation between waypoints to realize continuous, low-drift pose estimation, and when approaching a waypoint an image matching method is introduced to correct the current navigation state and the relative error accumulated between waypoints. In this way the advantages of the two technologies are effectively combined, positioning continuity and high precision are guaranteed over the middle of the flight, and dependence on a complete reference map of the whole route is significantly reduced. The scheme improves the adaptability and reliability of the system in complex, non-cooperative environments; in particular, even when the reference map is partially missing or has changed, the continuity and stability of the navigation accuracy can still be maintained, significantly improving positioning accuracy over the whole mission course.
The invention is realized by using a monocular camera and an IMU as an external hardware equipment supporting algorithm framework, and comprises the following steps:
step1, constructing a data acquisition link. Map acquisition and navigation route point presetting are carried out, an offline map of the surrounding environment of the unmanned aerial vehicle flight path is obtained through inquiring satellite information, and a plurality of navigation route points are preset in a task path according to map conditions.
Step2, constructing a visual inertial navigation link. A visual-inertial navigation method is adopted between waypoints: corresponding weights are assigned to the IMU pre-integration error term and the image photometric error to form a combined error term, and all state parameters are then jointly optimized by a nonlinear optimization method.
Step3, constructing a multi-mode navigation autonomous decision-making link. Extracting semantic features and geometric features of an environment in a pre-stored offline map by adopting a hierarchical feature pyramid (ORB+SIFT combination), calculating cosine similarity between real-time environmental features acquired by a current camera and pre-stored target area features according to a similarity evaluation algorithm, triggering navigation strategy switching when the similarity exceeds a threshold value, and converting into an image matching navigation method.
Step4, constructing an image matching navigation link. The real-time photo shot by the camera is converted into a gallery-style picture through the Latent Diffusion model, and XFeat then matches the style-converted real-time image captured around the waypoint against the offline gallery photo to solve the current UAV pose. By these technical means, the relative error produced by visual-inertial navigation is corrected, and the positioning accuracy of the UAV in multi-scene, long-endurance navigation tasks is improved. The overall structural framework of the technical scheme is shown in fig. 2, and a visual description of the scheme is given in fig. 3.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and examples.
Example 1
The embodiment of the invention provides a remote long-endurance integrated navigation method combining visual inertia joint optimization and image matching. Comprises the following parts:
Step1, constructing a data acquisition link;
step2, constructing a visual inertial navigation link;
the visual inertial navigation scheme specifically comprises the following steps:
① Inertial navigation branch. The IMU data are pre-integrated and a pre-integration error term is constructed. First, the gyroscope angular velocity and acceleration data are integrated in a local coordinate system over a continuous time window to construct the relative motion increments. Second, the attitude change is modeled on the Lie group SO(3) to compensate zero-bias drift. The integration result is then converted into discrete pose constraints to form the pre-integration observation model, and the pre-integration error term is constructed. Finally, the Jacobian matrix and the covariance are passed to the back-end optimizer to realize efficient state updating.
② Visual navigation branch. The real-time picture shot by the camera is processed with the direct method of minimizing the photometric error. First, the camera must be calibrated and its intrinsic parameters (focal length, principal point coordinates, distortion coefficients, etc.) determined, to guarantee the validity of the photometric error in subsequent calculations and the performance of the algorithm. Second, the image key frames are extracted: the captured image is convolved with the Sobel/Prewitt operator and the gradient magnitude of each pixel is calculated; the image is divided into 32 × 32 grids, each grid retaining the point with the largest gradient magnitude, giving the initially selected key frames; after the initial selection, key frames with insufficient stability are dynamically eliminated. Finally, the photometric error between adjacent key frames is calculated.
③ And constructing and optimizing a joint error term. And respectively constructing a luminosity error and a pre-integral error term through the two branches, and performing joint error optimization at the rear end, as shown in fig. 4. The core idea of joint optimization is to dynamically balance the constraint weights of the two sensors and adjust all state parameters in nonlinear optimization. And after the joint optimization, more accurate camera pose information is obtained. Specifically, the old pose and map points are marginalized by adopting a sliding window optimization method, and the positions of the map points nearby are updated through new observation, so that more accurate characteristic constraint is provided for the subsequent frames, and the positioning accuracy of the system is further improved. And finally, recalculating the coordinates and covariance matrix of the map points by using the optimized pose.
1.1 Construction of IMU Pre-integration error term
Over the time interval $[i, j]$, the IMU pre-integration quantities $\Delta R_{ij}$, $\Delta p_{ij}$, $\Delta v_{ij}$ respectively denote the pre-integrated rotation, position and velocity increments from time $i$ to time $j$, defined as:

$$\Delta R_{ij} = \prod_{k=i}^{j-1} \mathrm{Exp}\!\left((\tilde{\omega}_k - b_g)\,\Delta t\right)$$
$$\Delta v_{ij} = \sum_{k=i}^{j-1} \Delta R_{ik}\,(\tilde{a}_k - b_a)\,\Delta t$$
$$\Delta p_{ij} = \sum_{k=i}^{j-1} \left[\Delta v_{ik}\,\Delta t + \tfrac{1}{2}\,\Delta R_{ik}\,(\tilde{a}_k - b_a)\,\Delta t^{2}\right]$$

Wherein $\tilde{\omega}_k$ and $\tilde{a}_k$ are the measured angular velocity and acceleration, $b_g$ and $b_a$ are the gyroscope and accelerometer zero biases, $\Delta t$ is the IMU sampling interval, and $\mathrm{Exp}(\cdot)$ is the exponential map of the Lie group $SO(3)$.
Zero offset compensation and error propagation:
When the zero-bias estimate changes by a small amount $\delta b_g$, $\delta b_a$, the pre-integration quantities are corrected to first order as:

$$\Delta R_{ij}(b_g) \approx \Delta \bar{R}_{ij}\,\mathrm{Exp}\!\left(\frac{\partial \Delta \bar{R}_{ij}}{\partial b_g}\,\delta b_g\right)$$
$$\Delta v_{ij}(b_g, b_a) \approx \Delta \bar{v}_{ij} + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_g}\,\delta b_g + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_a}\,\delta b_a$$
$$\Delta p_{ij}(b_g, b_a) \approx \Delta \bar{p}_{ij} + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_g}\,\delta b_g + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_a}\,\delta b_a$$
The Jacobian matrices with respect to the zero biases are computed by recursive updating as each new IMU sample is integrated, and are propagated together with the covariance to the back-end optimizer.
Constructing the pre-integration error term:
In IMU back-end optimization, the pre-integration error is defined as the difference between the predicted value and the actual observed value:

$$r_{\Delta R_{ij}} = \mathrm{Log}\!\left(\Delta R_{ij}^{T}\, R_i^{T} R_j\right)^{\vee}$$
$$r_{\Delta v_{ij}} = R_i^{T}\left(v_j - v_i - g\,\Delta t_{ij}\right) - \Delta v_{ij}$$
$$r_{\Delta p_{ij}} = R_i^{T}\left(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\, g\,\Delta t_{ij}^{2}\right) - \Delta p_{ij}$$

Wherein $\mathrm{Log}(\cdot)$ is the logarithmic map of $SO(3)$, $(\cdot)^{\vee}$ is the Lie-algebra vector extraction operation, $R_i, p_i, v_i$ and $R_j, p_j, v_j$ respectively denote the rotation, position and velocity of the IMU in the world coordinate system at time $i$ and time $j$, and $g$ is the gravitational acceleration vector in the world coordinate system.
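As a numerical illustration of these residual terms, the following Python sketch assumes the pre-integrated increments and the state estimates are available as NumPy arrays; the function and variable names are illustrative, not part of the method itself.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R


def log_so3(Rm):
    # Logarithm map SO(3) -> R^3 (rotation vector), i.e. Log(R)^vee
    return R.from_matrix(Rm).as_rotvec()


def preintegration_residual(Ri, pi, vi, Rj, pj, vj, dR, dv, dp, g, dt):
    """Residual between the predicted motion and the pre-integrated IMU
    increments over the interval [i, j] (a sketch of the standard form)."""
    r_R = log_so3(dR.T @ Ri.T @ Rj)
    r_v = Ri.T @ (vj - vi - g * dt) - dv
    r_p = Ri.T @ (pj - pi - vi * dt - 0.5 * g * dt**2) - dp
    return np.concatenate([r_R, r_v, r_p])
```

Stacking the three residuals gives the 9-dimensional error vector that the back-end optimizer weights by the pre-integration covariance.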
1.2 Construction of image photometric error term
Before calculating the photometric error, key frame selection is performed: the most representative frames are selected from the video sequence to summarize the video content or mark significant scene changes. Key frame judgment depends on the gradient change intensity between adjacent frames and is realized by the following steps:
First, the gradient-magnitude matrix difference between adjacent frames $k$ and $k+1$ is calculated:

$$D_{k,k+1} = \frac{1}{W \times H} \sum_{u=1}^{W}\sum_{v=1}^{H} \left| G_{k+1}(u, v) - G_{k}(u, v) \right|$$

Wherein $W \times H$ is the image resolution and $G(u, v)$ is the gradient magnitude of each pixel; the larger the difference value, the more significant the scene change. Next, the image is divided into 32 × 32 grids, and each grid retains the point with the largest gradient magnitude, giving the initially selected image key frames. Finally, image key frames with insufficient stability are dynamically eliminated.
Assuming that for adjacent key frames $i$ and $j$ the pixel intensity of the same scene point is unchanged, the photometric error is defined as:

$$E_{photo} = \sum_{p} \left\| I_j(p') - I_i(p) \right\|^{2}$$

Wherein $I(\cdot)$ denotes the image gray value, $p$ is the pixel coordinate at time $i$, and $p'$ is the coordinate of pixel $p$ projected into the current frame $j$ through the pose transformation.
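A minimal sketch of the gradient-based key-frame scoring and the photometric error, assuming grayscale images as NumPy arrays and pixel coordinates that have already been warped into the second key frame; the helper names are illustrative only.

```python
import cv2
import numpy as np


def gradient_magnitude(img):
    # Sobel gradients; the magnitude is used for key-frame scoring
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
    return np.sqrt(gx**2 + gy**2)


def frame_difference(img_k, img_k1):
    # Mean absolute difference of the gradient-magnitude maps of adjacent frames
    return np.mean(np.abs(gradient_magnitude(img_k1) - gradient_magnitude(img_k)))


def photometric_error(img_i, img_j, pts_i, pts_j_proj):
    """Sum of squared intensity differences between pixels in key frame i and
    their projections into key frame j (both given as N x 2 integer arrays)."""
    Ii = img_i[pts_i[:, 1], pts_i[:, 0]].astype(np.float32)
    Ij = img_j[pts_j_proj[:, 1], pts_j_proj[:, 0]].astype(np.float32)
    return float(np.sum((Ij - Ii) ** 2))
```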
1.3 Joint optimization
Unifying the IMU pre-integration error term and the photometric error term into a nonlinear least-squares problem:

$$\min_{X}\ \sum \rho\!\left(\left\| r_{IMU} \right\|^{2}_{\Sigma_{IMU}}\right) + \sum \rho\!\left(\left\| r_{photo} \right\|^{2}_{\Sigma_{photo}}\right)$$

Wherein $\rho(\cdot)$ is the robust kernel function, $\Sigma$ is the covariance matrix, and $X$ is the set of optimization variables (pose, velocity, IMU zero bias, etc.). A sliding-window optimization method marginalizes old poses and map points, and the positions of nearby map points are updated through new observations, providing more accurate feature constraints for subsequent frames.
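A highly simplified sketch of how the two residual types can be stacked into one robust least-squares problem; marginalization and the actual sliding-window bookkeeping are omitted, and the block layout (pairs of residual callables and square-root information matrices) is an assumption made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def stacked_residuals(x, imu_blocks, photo_blocks):
    """Stack IMU pre-integration and photometric residuals, each whitened by
    the square root of its information matrix; x packs the window's poses,
    velocities and biases in whatever parameterization the caller chooses."""
    res = []
    for r_fn, info_sqrt in imu_blocks:
        res.append(info_sqrt @ r_fn(x))
    for r_fn, info_sqrt in photo_blocks:
        res.append(info_sqrt @ r_fn(x))
    return np.concatenate(res)


# The Huber loss plays the role of the robust kernel rho(.):
# result = least_squares(stacked_residuals, x0, loss='huber',
#                        args=(imu_blocks, photo_blocks))
```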
Step3, constructing a multi-mode navigation autonomous decision-making link;
the multi-mode navigation autonomous decision scheme specifically comprises the following steps:
① And extracting multi-mode environmental characteristics. And respectively extracting low-level geometric features (edges, corner points) and high-level semantic features (scenery and building outlines) of the environment in the pre-stored offline map by adopting a hierarchical feature pyramid (ORB+SIFT combination).
② Feature matching and trigger criterion generation. Real-time feature similarity evaluation is carried out: the cosine similarity between the current environmental features and the pre-stored target-area features is computed, and strategy switching is triggered when the matching degree exceeds 80% for 10 consecutive frames.
③ Multi-modal decision fusion. A double-layer decision mechanism is employed. The primary decision generates a candidate trigger instruction based on the feature matching result; the secondary verification fuses multi-modal data such as the IMU trajectory fitting degree and the visual matching confidence, and computes the trigger confidence through a Bayesian network, taking as inputs the confidence of the preceding visual match and the trajectory fitting error of the inertial measurement unit (whose complement represents the reliability of the IMU trajectory fit). The fusion weights are dynamically adjusted according to circumstances: for example, under good illumination the visual weight can be increased appropriately, and when the IMU moves vigorously the inertial weight can be reduced appropriately (a weighted-fusion sketch is given after this list).
④ Dynamic environment adaptive optimization. During UAV flight, the pre-stored feature library is updated by incremental learning: the feature library is augmented as new landmarks appear in the sparse semantic map constructed by SLAM, and old features are automatically cleaned up based on access frequency and recency, so that the feature library is updated in real time during the decision process.
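The secondary verification above is described as a Bayesian network; the minimal Python sketch below only illustrates the weighted-fusion intuition of combining visual-matching confidence with IMU trajectory-fit reliability, and the default weight values are assumptions for illustration.

```python
def trigger_confidence(c_visual, e_imu, w_v=0.6, w_i=0.4):
    """Fuse visual-matching confidence and IMU track-fit reliability into one
    trigger confidence; w_v and w_i would be adjusted online, e.g. raising
    w_v under good illumination or lowering w_i under aggressive motion."""
    c_imu = 1.0 - min(e_imu, 1.0)   # larger fitting error -> lower reliability
    return w_v * c_visual + w_i * c_imu
```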
2.1 Hierarchical feature pyramid (ORB+SIFT Combined)
ORB is an improvement based on FAST corner detection and BRIEF descriptors; it is fast and suitable for real-time applications, but may perform only moderately under scale and rotation changes. SIFT detects key points through the DoG pyramid, has scale and rotation invariance, and its descriptor is more robust, but the computational cost is larger. Combining the two by feature fusion therefore overcomes the limitation of a single feature descriptor in complex scenes.
The feature pyramid is divided into a bottom layer (L0-L2) and an upper layer (L3-L5); the bottom layer uses ORB to extract edges/corner points (geometric primitives), and the upper layer uses SIFT to extract building outlines/texture areas (semantic objects). First, a Gaussian pyramid is generated:

$$I_{l+1}(x, y) = \sum_{m}\sum_{n} w(m, n)\, I_{l}(2x + m,\ 2y + n)$$

Wherein $I_l$ is the image of the $l$-th layer and $w(m, n)$ are the Gaussian kernel weights; a 5 × 5 kernel (k = 2) is typically used.
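A minimal sketch of the pyramid construction using OpenCV; cv2.pyrDown applies the 5 × 5 Gaussian kernel and halves the resolution at each level, matching the formula above. The level count is an assumption following the L0-L5 split.

```python
import cv2


def gaussian_pyramid(img, levels=6):
    """Build the L0-L5 Gaussian pyramid by repeated blur-and-downsample."""
    pyr = [img]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr
```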
ORB (Oriented FAST and Rotated BRIEF) combines FAST corner detection with the BRIEF descriptor and is modified to add rotation and scale invariance. The FAST corner detection part declares a candidate pixel $p$ a corner if, among the 16 pixels on the circle around it, there are $N$ consecutive points satisfying $|I(x) - I(p)| > t$ (the threshold $t$ is usually taken as 10-30). On the basis of the BRIEF descriptor, the rotation-invariant rBRIEF descriptor is further provided:

$$f(p) = \sum_{i=1}^{256} 2^{\,i-1}\,\tau(p;\, x_i, y_i), \qquad \tau(p;\, x_i, y_i) = \begin{cases}1, & I(x_i) < I(y_i)\\ 0, & \text{otherwise}\end{cases}$$

Wherein $\tau$ is the indicator function, taking 1 when the condition is satisfied and 0 otherwise, and $f(p)$ is formed by sampling 256 BRIEF point pairs $(x_i, y_i)$. Rotating the sampling pattern around the principal direction $\theta$ of the key point gives the steered descriptor.
In addition, ORB inter-layer association is added: feature correspondences between adjacent pyramid layers $l$ and $l+1$ are established by matching their binary descriptors, with the Hamming distance used for binary descriptor matching.
SIFT directly detects key points in the DoG (Difference of Gaussians) pyramid and assigns a direction to each key point. The first step is to compute the gradient magnitude and direction:

$$m(x, y) = \sqrt{\big(L(x{+}1, y) - L(x{-}1, y)\big)^{2} + \big(L(x, y{+}1) - L(x, y{-}1)\big)^{2}}$$
$$\theta(x, y) = \arctan\!\frac{L(x, y{+}1) - L(x, y{-}1)}{L(x{+}1, y) - L(x{-}1, y)}$$

The gradient directions in the key-point neighborhood (a 16 × 16 window) are divided into 36 bins, and the peak is taken as the principal direction. Then the key-point neighborhood is divided into 16 sub-blocks, an 8-direction gradient histogram is computed for each sub-block, and finally the 128-dimensional SIFT descriptor is generated:

$$d = [h_1, h_2, \ldots, h_{16}] \in \mathbb{R}^{128}, \qquad h_k \in \mathbb{R}^{8}$$
Semantic categories such as building outlines are labeled by clustering the SIFT features of the upper pyramid layers (e.g. with K-means):

$$c(d) = \arg\min_{k}\ \left\| d - \mu_k \right\|_2$$

Wherein $\mu_k$ are the pre-trained cluster centers (e.g. "window", "roof", etc.).
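The sketch below shows one way the two detectors could be split across pyramid levels, using OpenCV (a build that includes SIFT is assumed); the L0-L2 / L3-L5 split follows the text, and the semantic clustering step is omitted.

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)    # bottom layers L0-L2: edges / corners
sift = cv2.SIFT_create(nfeatures=200)  # upper layers L3-L5: contours / textures


def hierarchical_features(pyramid):
    """Run ORB on the lower pyramid levels and SIFT on the upper ones,
    returning the two stacked descriptor matrices."""
    orb_desc, sift_desc = [], []
    for level, img in enumerate(pyramid):
        detector = orb if level <= 2 else sift
        _, desc = detector.detectAndCompute(img, None)
        if desc is None:
            continue
        (orb_desc if level <= 2 else sift_desc).append(desc)
    orb_out = np.vstack(orb_desc) if orb_desc else np.empty((0, 32), np.uint8)
    sift_out = np.vstack(sift_desc) if sift_desc else np.empty((0, 128), np.float32)
    return orb_out, sift_out
```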
2.2 Trigger criteria and policy switching
The foregoing has described how the hierarchical feature pyramid is constructed and the principles and combination of ORB and SIFT. The following describes how the multi-modal navigation trigger criterion is constructed and the navigation strategy switched using the previously extracted features.
First, a fused feature vector is constructed from the descriptors extracted by ORB and SIFT. The ORB algorithm on the bottom pyramid layers extracts binary descriptors $f_{ORB}$ characterizing the local geometry; SIFT on the upper layers generates 128-dimensional gradient histograms $f_{SIFT}$ describing texture and semantic outlines. The ORB and SIFT vectors are L2-normalized:

$$\hat{f}_{ORB} = \frac{f_{ORB}}{\left\| f_{ORB} \right\|_2}, \qquad \hat{f}_{SIFT} = \frac{f_{SIFT}}{\left\| f_{SIFT} \right\|_2}$$

and the normalized feature vectors are fused (e.g. by concatenation) into a single feature vector $F$.
Second, cosine similarity is computed and evaluated in real time. The pre-stored target-area feature $F_{target}$ is generated from the offline map and stored as a normalized vector. The similarity between the current feature $F_{cur}$ and the target feature $F_{target}$ is defined as:

$$\mathrm{sim}(F_{cur}, F_{target}) = \frac{F_{cur} \cdot F_{target}}{\left\| F_{cur} \right\|_2 \left\| F_{target} \right\|_2}$$

The closer the output value is to 1, the more closely the current environment matches the target area.
Finally, the trigger criterion is generated. A sliding window of length $N$ is maintained, storing the similarities of the $N$ most recent consecutive frames. The trigger condition for strategy switching requires all frames in the window to satisfy:

$$\mathrm{sim}_t > 0.8, \qquad t = 1, \ldots, N$$

That is, when the cosine similarity of $N$ consecutive frames exceeds 80%, the navigation strategy is switched and the UAV autonomous navigation system changes from visual-inertial navigation to image matching navigation.
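A compact sketch of the fusion, similarity and windowed trigger test; it assumes the fused feature has a fixed dimension matching the stored target vector (in practice the per-image descriptors would first be aggregated to fixed length), and the class and function names are illustrative.

```python
import numpy as np
from collections import deque


def fuse_and_normalize(f_orb, f_sift):
    # L2-normalise each descriptor vector, then concatenate them
    f_orb = f_orb / (np.linalg.norm(f_orb) + 1e-12)
    f_sift = f_sift / (np.linalg.norm(f_sift) + 1e-12)
    return np.concatenate([f_orb, f_sift])


class SwitchTrigger:
    """Switch to image-matching navigation once the cosine similarity to the
    stored target-area feature exceeds the threshold for N consecutive frames."""

    def __init__(self, target_feature, n_frames=10, threshold=0.8):
        self.target = target_feature / (np.linalg.norm(target_feature) + 1e-12)
        self.window = deque(maxlen=n_frames)
        self.threshold = threshold

    def update(self, current_feature):
        f = current_feature / (np.linalg.norm(current_feature) + 1e-12)
        self.window.append(float(np.dot(f, self.target)))
        return (len(self.window) == self.window.maxlen
                and min(self.window) > self.threshold)
```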
Step4, constructing an image matching navigation link;
The image matching navigation scheme specifically comprises the following steps:
① Image generation and style conversion. The camera-captured pictures are converted into gallery-style pictures (removing season, weather, viewing-angle differences, etc.) using the Latent Diffusion model. Specifically, a VQ-VAE compresses the image into the latent space in the encoding stage, and the diffusion stage performs the denoising process in the latent space, reducing computational complexity.
② Real-time feature extraction. The XFeat model is adopted; through the brand-new strategy of "doubling the convolution depth of the network when the resolution of the input image is halved", the model trades off network accuracy against acceleration gain. Specifically, the backbone consists of two parts, a Keypoint Head and a Descriptor Head: the former locates feature points with significant distinctiveness in the image, while the latter generates high-dimensional feature vectors for the key points.
③ Image feature matching. Given the dense local feature maps, the input of the feature matching stage is a subset at 1/8 spatial resolution. Coarse correspondences between the image pair are obtained by nearest-neighbor matching of the two feature sets. For each coarse match, an offset is predicted from the network output, which is the logarithm of the probability distribution over the possible offsets, and classifying the offset realizes correct pixel-level matching.
④ Pose optimization and map-point updating. An optimization problem is built with the matched images (feature-point correspondences) as input, constructing the objective function:

$$E_{reproj} = \sum_{i} \left\| u_i - \pi\!\left(T,\ P_i\right) \right\|^{2}$$

Wherein $T$ is the camera pose, $P_i$ are the three-dimensional map-point coordinates, $u_i$ are the matched pixel observations, and $\pi(\cdot)$ denotes the camera projection model. The re-projection error $E_{reproj}$ is then gradually reduced by a nonlinear optimization method, realizing pose optimization (a solver sketch is given after this list). Finally, a sliding-window optimization method marginalizes old poses and map points, and the positions of nearby map points are updated through new observations.
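The following sketch shows one common way to recover the camera pose from the 2D-3D correspondences produced by feature matching and to measure the re-projection error; cv2.solvePnPRansac is used here as a stand-in solver under the assumption of a calibrated pinhole camera, not the exact optimizer of the method.

```python
import cv2
import numpy as np


def estimate_pose(points_3d, points_2d, K):
    """Solve the camera pose from matched 3D map points and 2D image points,
    then report the mean re-projection error over the RANSAC inliers."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32), K, None)
    if not ok:
        return None
    proj, _ = cv2.projectPoints(points_3d[inliers[:, 0]], rvec, tvec, K, None)
    err = np.linalg.norm(proj.reshape(-1, 2) - points_2d[inliers[:, 0]], axis=1).mean()
    return rvec, tvec, err
```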
3.1 Image generation and style migration
Before the real-time image is matched against the pre-stored image library, the captured real-time photo needs to be converted into a photo in the style of the pre-stored library, so as to improve matching accuracy. The Latent Diffusion model performs this operation by migrating the diffusion process to a low-dimensional latent space, preserving the quality of the generated image while remaining efficient. The training process of the model is shown in fig. 5 and is divided into two steps:
First, the VQ-VAE is trained. The input image $x$ is compressed by a deep convolutional encoder $E$ into a low-dimensional continuous latent representation $z_e = E(x)$, with spatial size reduced by the downsampling rate $f$ (e.g. 16) and latent vector dimension $d$ (typically 256 or 512). A learnable codebook $\{e_k\}_{k=1}^{K}$ is introduced; after the encoder output $z_e$ is obtained, it is quantized to the nearest codebook vector by nearest-neighbor search:

$$z_q = e_{k^*}, \qquad k^* = \arg\min_{k}\ \left\| z_e - e_k \right\|_2$$

The quantized latent representation $z_q$ is then fed to the decoder $D$, which outputs the reconstructed image $\hat{x} = D(z_q)$.
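A minimal PyTorch-style sketch of the nearest-neighbor quantization step; the straight-through gradient trick is the standard VQ-VAE device and is included only for completeness of the sketch.

```python
import torch


def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry
    (z_e: [N, D] latent vectors, codebook: [K, D] learnable embeddings)."""
    # squared Euclidean distance between every latent vector and codebook entry
    d = (z_e.pow(2).sum(1, keepdim=True)
         - 2 * z_e @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)
    z_q = codebook[idx]
    # straight-through estimator so gradients flow back to the encoder
    return z_e + (z_q - z_e).detach(), idx
```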
Second, the diffusion model (LDM) is trained to learn the noise $\epsilon$. The upper half of the figure is the noising process, which adds noise to the latent features; the lower half is the denoising process, whose core structure is a U-Net with cross attention that denoises $z_T$ back to $z_0$. The diffusion model can be understood as a sequence of temporal denoising autoencoders whose purpose is to predict, from the noised version $x_t$ of the input image $x$ at step $t$, the noise that was added to it. The objective function of the DM (Diffusion Model) can be expressed as:

$$L_{DM} = \mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\left\| \epsilon - \epsilon_{\theta}(x_t, t) \right\|_2^{2}\right]$$

Wherein $t$ is sampled uniformly from $\{1, \ldots, T\}$. Notably, the LDM learns in latent space, i.e. it predicts the noise added to the latent representation $z_t$, and the corresponding loss function is expressed as:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\left\| \epsilon - \epsilon_{\theta}(z_t, t) \right\|_2^{2}\right]$$
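A sketch of the latent-space noise-prediction objective $L_{LDM}$ under the usual DDPM-style parameterization; eps_model, encoder and alphas_cumprod are assumed to be supplied by the surrounding training code, so the interfaces shown are illustrative.

```python
import torch
import torch.nn.functional as F


def ldm_loss(eps_model, encoder, x0, alphas_cumprod):
    """Encode the image, add noise at a random timestep in latent space,
    and regress the added noise (the L_LDM objective above)."""
    z0 = encoder(x0)                                    # latent representation
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps          # forward (noising) process
    return F.mse_loss(eps_model(z_t, t), eps)           # predict the noise
```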
Example 2
Embodiment 2 of the present invention provides a remote long-endurance integrated navigation system that combines visual inertia joint optimization and image matching, and is implemented based on a monocular camera and an IMU, as shown in fig. 2, the system includes:
The data acquisition module is used for acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in the mission path according to the map condition;
the visual inertial navigation module is used for carrying out high-frequency relative navigation between navigation points by adopting visual inertial navigation to realize continuous pose estimation, and the specific processing process is the same as Step2 of the embodiment 1;
the multi-modal navigation autonomous decision-making module is used for triggering the switching of the navigation strategy when approaching the waypoint based on the constructed multi-modal navigation autonomous decision-making scheme and converting the switching into the use of the image matching navigation module;
The image matching navigation module is used for realizing image matching navigation by using the Latent Diffusion model and the XFeat model according to the real-time picture shot by the camera and correcting the relative error accumulated among route points, and the specific processing process is the same as Step4 of embodiment 1;
and repeatedly using the visual inertial navigation module, the multi-mode navigation autonomous decision-making module and the image matching navigation module among the plurality of navigation route points until the task path is traversed.
It should be noted that, in the embodiment of the system, the included modules are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented, and the specific names of the functional modules are only for convenience of distinguishing each other, and are not used to limit the protection scope of the present invention.
Overview:
According to the invention, by establishing the combined navigation mechanism of "visual-inertial navigation between waypoints, autonomous multi-modal decision switching to image matching navigation near the waypoints", the relative error accumulated by visual-inertial navigation between waypoints is corrected, the problems of high deployment cost, poor environmental adaptability and difficulty in meeting the rapid-response requirements of dynamic complex scenes are avoided, accurate tracking and positioning of the UAV during flight are realized, and the navigation and positioning accuracy of the UAV in long-distance, long-endurance scenarios is improved.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.