Background
Unmanned aerial vehicle (UAV) autonomous navigation and positioning technology determines the position and attitude of the UAV from sensor observations in a completely unknown environment, providing the data basis for subsequent tasks such as map construction and navigation. Autonomous positioning is not only a core component of an unmanned system, but also the key to robust operation in complex environments.
Traditional unmanned systems usually combine satellite navigation with inertial navigation. However, satellite positioning is strongly affected by the environment: signals are easily blocked or suffer multipath effects when operating in valleys, built-up areas and indoors, and are easily blocked or jammed in wartime. When satellite positioning fails, inertial navigation alone is affected by sensor errors, system noise and accumulated integration error; after long-term operation, position and velocity estimates drift and additional time is needed to correct the accumulated error, making the navigation system unstable and inaccurate.
Purely visual localization was first proposed at Stanford University in the early 1980s, but, limited by sensor accuracy and processor performance, it remained at the experimental stage for a long time. With the progress of electronic information technology in the twenty-first century, vision-based positioning has developed rapidly. Visual navigation currently falls into two categories: relative navigation and absolute navigation. Relative navigation, i.e. SLAM technology, does not depend on a reference image library and can autonomously compute the relative pose from the correlation of two image frames, but suffers from error accumulation. Absolute navigation realizes UAV positioning through image matching: the absolute pose is computed against a reference image library, so there is no accumulated error, but the reference library must be stored on board, which is difficult for long-endurance UAV deployment and requires large storage capacity.
For relative navigation (SLAM), a visual/inertial sensor fusion approach is currently the common choice. The earliest visual/inertial system was proposed in 2007 by Anastasios Mourikis et al. at the University of California, Riverside, USA, after which a series of classical algorithms such as VI-ORB, VINS-Mono and ORB-SLAM3 emerged. VI-ORB (Visual-Inertial ORB-SLAM), proposed in 2017, is a visual-inertial fusion system based on the ORB-SLAM framework that tightly couples IMU data with ORB visual features. Its core improvement is that trajectory accuracy is raised by jointly optimizing the visual re-projection error and the IMU measurement error. The approach still has certain limitations: 1) it is sensitive to time-synchronization and calibration errors between the IMU and the camera, and minor deviations may produce accumulated errors; 2) it has no explicit handling of dynamic objects and depends on a static-environment assumption, so dynamic objects may interfere with the positioning result. VINS-Mono, proposed in the same year, is a representative framework of monocular visual-inertial SLAM that adopts a tightly coupled optimization strategy to fuse IMU and monocular image data. Its innovation is to balance accuracy and efficiency by combining sliding-window optimization with global pose-graph adjustment. Its main drawbacks are that 1) no dynamic-object detection module is integrated, so positioning may fail in dynamic scenes, and 2) loop-closure detection relies on a visual bag-of-words model and is prone to mismatches under repeated textures or viewpoint changes. ORB-SLAM3, proposed in 2020, is the third generation of the ORB-SLAM series and supports multi-modal input (monocular, stereo, RGB-D, visual-inertial). Its core improvements are that 1) a multi-map system (Atlas) is introduced to support cross-scene map fusion and long-term positioning, and 2) the visual-inertial initialization process is optimized to improve convergence speed and robustness. Yet problems remain: 1) dynamic-environment handling capacity is limited, since a static assumption is used and dynamic features are not explicitly removed; 2) divergence caused by long-term accumulation of visual-inertial errors is not resolved; 3) map consistency is insufficient, and constructing a practical map relies on extension modules. From the above analysis, for long-endurance flights across multiple scenes, the visual-inertial navigation scheme faces unavoidable error accumulation.
The image matching method of absolute navigation originated in cruise-missile terminal guidance and gradually developed into a visual navigation technology. Scene matching navigation systems (SMNS) feature simple equipment, passive operation and comparatively high positioning accuracy: an image sensor acquires an image of the area near the flight path or target area and matches it against a stored reference image to obtain the position of the aircraft. The UAV image matching navigation system is shown in fig. 1: after the destination is acquired, preliminary path planning is performed using satellite and aerial imagery, matching-suitable areas are then selected along the planned route, reference images are obtained through correction, and the real-time image obtained by the vision sensor is matched against the reference image to realize navigation and positioning. The core of the method is the image matching algorithm, whose performance directly determines the performance of the navigation system. According to the way visual features are extracted, image matching algorithms can be divided into methods based on template matching, methods based on local invariant features, and methods based on scene semantic learning. The UAV scene-matching positioning algorithm based on template matching is one of the earliest representative UAV visual positioning methods; its core idea is to build a matching model from image information such as pixel intensity to complete scene matching. Image matching algorithms based on local invariant features (also known as hand-crafted features) show stronger environmental adaptability than template-matching algorithms. Local invariant features, i.e. hand-crafted feature descriptors, typically rely on expert prior knowledge in their design, and in many applications they present advantages over template-matching methods. With the development of computer vision, deep neural networks, by virtue of their ability to capture high-dimensional semantic features in an image, have given rise to scene-semantic-learning methods. Compared with template matching and local invariant features, the image matching method based on scene semantic learning benefits from the semantic-information extraction capability of deep neural networks, so the image matching relationship can be established better. At the same time, by means of semantic information extraction, the image matching algorithm can cope with more complex viewpoint changes and overcome various possible disturbances such as illumination changes and noise. As a representative of multi-view scene matching algorithms based on metric learning, Zheng et al. released the first multi-source multi-view scene-matching UAV visual positioning dataset, University-1652, which provides support for multi-view scene matching research. On this basis, the authors built a three-branch twin (Siamese) neural network model that completes scene matching by establishing correspondences among multi-view UAV images, street-view images and satellite images. The feature extraction model adopts a classical ResNet structure with no additional improvement.
To address the viewpoint difference faced by UAV visual positioning algorithms when matching images, Yang et al. fully exploit the positional encoding of features. The positional-encoding information guides a dual-branch twin (Siamese) network to establish region correspondences between multi-view image pairs. By marking the viewpoint relationship of the images through positional encoding, the network can learn the association of regions after viewpoint change, improving the multi-view robustness of the algorithm. However, because a Transformer is chosen as the feature extraction backbone and the spatial-location correspondence module gives the network large complexity, there are certain limitations for training and deployment on mobile platforms. Wang et al. propose a local pattern network that employs a square-ring feature partitioning strategy to obtain features describing the central target and the surrounding information. Similarity results based on the central and edge square rings are obtained by measuring the corresponding feature pairs multiple times, and the final similarity measurement is obtained by fusion. However, the square-ring partitioning pattern of the similarity measure is completely fixed during region division of the image, which makes it difficult for the algorithm to adjust itself and make optimal decisions in different situations. As can be seen from the above description, the image matching method based on scene semantic learning handles complex scenes well and, as a feature matching approach centered on deep learning, has excellent generalization and broad prospects. Its drawbacks are that it places high requirements on hardware and on an offline map dataset of the scene, and that existing models do not reconcile light weight with matching performance.
In short, in UAV autonomous navigation and positioning, remote long-endurance missions face two core problems. First, visual-inertial navigation error accumulates continuously in GNSS-denied environments, so positioning accuracy degrades markedly as the range increases. Second, the traditional positioning method based on image matching can provide absolute position correction, but its dependence on a full-route, high-quality reference map makes it difficult to deploy and poor in real-time performance, and it struggles to work stably when part of the regional reference map is missing or has changed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a remote long-endurance integrated navigation method and system combining visual-inertial joint optimization and image matching.
In view of the above, the invention provides a remote long-endurance integrated navigation method combining visual-inertial joint optimization and image matching, which is realized based on a monocular camera and an IMU and comprises the following steps:
step 1, acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in a task path according to map conditions;
step 2, performing high-frequency relative navigation between navigation route points by adopting visual inertial navigation to realize continuous pose estimation;
Step 3, based on the constructed multi-mode navigation autonomous decision scheme, triggering the switching of navigation strategies when approaching a waypoint, and converting to using image matching navigation;
step 4, according to the real-time photo shot by the camera, realizing image matching navigation by using a Latent Diffusion model and an XFeat model, and correcting the accumulated relative error between the waypoints;
and 5, repeating the steps 2 to 4 among a plurality of waypoints until the task path is traversed.
Preferably, the step 2 includes:
step 2-1, pre-integrating IMU data acquired by the IMU to construct a pre-integration error term;
Step 2-2, selecting a key frame for a real-time image acquired by a monocular camera, and processing the key frame by adopting a minimum luminosity error direct method to obtain the luminosity error of the image of the adjacent key frame;
and 2-3, carrying out joint error optimization on the IMU pre-integration error term and the image photometric error to obtain more accurate camera pose information.
Preferably, the pre-integration error term constructed in step 2-1 is defined, over the time interval $[i, j]$, as the difference between the predicted value and the actual observed value, satisfying:

$$r_{\Delta R_{ij}} = \mathrm{Log}\!\left(\Delta R_{ij}^{T}\, R_i^{T} R_j\right)^{\vee}$$
$$r_{\Delta v_{ij}} = R_i^{T}\left(v_j - v_i - g\,\Delta t_{ij}\right) - \Delta v_{ij}$$
$$r_{\Delta p_{ij}} = R_i^{T}\left(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\, g\,\Delta t_{ij}^{2}\right) - \Delta p_{ij}$$

Wherein $\mathrm{Log}(\cdot)$ is the logarithmic map of $SO(3)$, $(\cdot)^{\vee}$ is the Lie-algebra vector extraction operation, $\Delta R_{ij}$, $\Delta p_{ij}$, $\Delta v_{ij}$ are respectively the pre-integrated rotation, position and velocity increments from time $i$ to time $j$, $R_i, p_i, v_i$ and $R_j, p_j, v_j$ respectively denote the rotation, position and velocity of the IMU in the world coordinate system at time $i$ and time $j$, $g$ is the gravitational acceleration vector in the world coordinate system, $\Delta t_{ij}$ is the time interval, and the superscript $T$ denotes the transpose.
Preferably, the step 2-2 includes:
Calculating the gradient amplitude matrix difference of any two adjacent frames of images;
Dividing each frame of image into 32 × 32 grids, each grid retaining the point with the largest gradient-magnitude difference between adjacent frames, to obtain the initially selected image key frames;
Dynamically removing the image key frames with insufficient stability, thereby obtaining a plurality of key frames;
For adjacent key frames $i$ and $j$ after dynamic elimination, the corresponding image photometric error is obtained according to the following formula:

$$E_{photo} = \sum_{p}\left\| I_j(p') - I_i(p) \right\|^{2}$$

Wherein $I(\cdot)$ denotes the image gray value, $p$ is a pixel in key frame $i$, and $p'$ is the coordinate of pixel $p$ projected into the current key frame $j$ through the pose transformation.
Preferably, the step 2-3 includes:
And unifying the IMU pre-integration error term and the image photometric error into a nonlinear least-squares problem, marginalizing old poses and map points by a sliding-window optimization method, and updating the positions of nearby map points through new observations, to obtain more accurate camera pose information.
Preferably, the multi-modal navigation autonomous decision in step 3 includes:
Extracting semantic feature vectors and geometric feature vectors of the environment in the current map by adopting a hierarchical feature pyramid, normalizing the extracted feature vectors, and fusing;
and calculating the cosine similarity between the fused feature vector and the pre-stored target-area features by adopting a similarity evaluation algorithm, triggering the switching of the navigation strategy when the similarity exceeds a threshold for a set number of consecutive frames, and converting to the image matching navigation method.
Preferably, the hierarchical feature pyramid comprises a bottom layer and a high layer, wherein,
The bottom layer adopts ORB to extract geometric features including edges and corner points, and the upper layer adopts SIFT to extract semantic objects including building outlines and texture areas.
Preferably, the step 4 includes:
Converting the real-time photo shot by the camera into a picture in a gallery style through a trained Latent Diffusion model;
Performing image feature matching between the style-converted real-time image captured around the waypoint and the offline gallery photo by using a trained XFeat model, predicting the offsets, and realizing pixel-level matching by classification;
Solving the matching result to calculate the current pose of the unmanned aerial vehicle;
And marginalizing old poses and map points by a sliding-window optimization method, and updating the positions of nearby map points through new observations, thereby correcting the relative error accumulated between the waypoints.
On the other hand, the invention provides a remote long-endurance integrated navigation system combining visual inertia joint optimization and image matching, which is realized based on a monocular camera and an IMU, and comprises a data acquisition module, a visual inertia navigation module, a multi-mode navigation autonomous decision module and an image matching navigation module, wherein,
The data acquisition module is used for acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in the mission path according to the map condition;
The visual inertial navigation module is used for performing high-frequency relative navigation between navigation points by adopting visual inertial navigation to realize continuous pose estimation;
the multi-modal navigation autonomous decision-making module is used for triggering the switching of the navigation strategy when approaching a waypoint, based on the constructed multi-modal navigation autonomous decision scheme, and switching to the image matching navigation module;
The image matching navigation module is used for realizing image matching navigation by using the Latent Diffusion model and the XFeat model according to the real-time picture shot by the camera, and correcting the relative error accumulated between route points;
and repeatedly using the visual inertial navigation module, the multi-mode navigation autonomous decision-making module and the image matching navigation module among the plurality of navigation route points until the task path is traversed.
Compared with the prior art, the invention has the advantages that:
1. Image matching navigation method based on the Latent Diffusion and XFeat models
The traditional image matching method directly matches the acquired real-time image with pre-stored target image features and then solves the pose from the matches. Such positioning can provide absolute position correction, but its dependence on a full-route, high-quality reference image makes deployment difficult and real-time performance poor, and it places high accuracy requirements on the image matching model. In the present method, the picture shot by the camera is first converted into a gallery-style picture through the Latent Diffusion model, and feature extraction and matching are then performed with the recently proposed XFeat model. The Latent Diffusion model migrates the diffusion process to a low-dimensional latent space, preserving generated image quality while remaining computationally efficient. The lightweight XFeat model maintains good feature matching accuracy by adopting the brand-new strategy of "doubling the convolution depth of the network whenever the resolution of the input image is halved".
2. Multi-modal navigation autonomous handoff strategy
A strategy for autonomously switching from visual inertial navigation to image matching navigation is constructed. Firstly, extracting semantic features and geometric features of the current environment by using a hierarchical feature pyramid (ORB+SIFT combination). And secondly, calculating cosine similarity between the environmental features acquired by the current camera in real time and the pre-stored target region features according to a similarity evaluation algorithm, triggering navigation strategy switching when the similarity exceeds a threshold value, and converting into an image matching navigation method.
3. Combined navigation scheme based on visual inertial navigation and image matching method
The invention provides a combined navigation strategy that adopts visual-inertial navigation between aircraft waypoints and uses image matching navigation near the waypoints. The advantage of this approach is that it effectively compensates for the relative error accumulated by long-term operation of visual-inertial navigation. For the image matching navigation algorithm, restricting its application to the vicinity of the waypoints greatly reduces the required coverage of the data set, and the distinctiveness of the features around the waypoints allows the feature matching task to be completed with a lightweight model. These advantages are exactly complementary to the highly mature visual-inertial navigation, enabling further improvement of positioning accuracy.
Detailed Description
The invention designs a combined navigation scheme that couples visual-inertial navigation with image matching: visual-inertial navigation performs high-frequency relative navigation between waypoints to realize continuous, low-drift pose estimation, and when approaching a waypoint an image matching method is introduced to correct the current navigation state and the relative error accumulated between waypoints. In this way the advantages of the two technologies are effectively combined, positioning continuity and high precision are guaranteed over the middle of the flight, and dependence on a complete reference map of the whole route is significantly reduced. The scheme improves the adaptability and reliability of the system in complex, non-cooperative environments; in particular, even when the reference map is partially missing or has changed, the continuity and stability of the navigation accuracy can still be maintained, significantly improving positioning accuracy over the whole mission course.
The invention is realized by using a monocular camera and an IMU as an external hardware equipment supporting algorithm framework, and comprises the following steps:
step1, constructing a data acquisition link. Map acquisition and navigation route point presetting are carried out, an offline map of the surrounding environment of the unmanned aerial vehicle flight path is obtained through inquiring satellite information, and a plurality of navigation route points are preset in a task path according to map conditions.
Step2, constructing a visual inertial navigation link. A visual-inertial navigation method is adopted between waypoints: corresponding weights are assigned to the IMU pre-integration error term and the image photometric error to form a combined error term, and all state parameters are then jointly optimized by a nonlinear optimization method.
Step3, constructing a multi-mode navigation autonomous decision-making link. Extracting semantic features and geometric features of an environment in a pre-stored offline map by adopting a hierarchical feature pyramid (ORB+SIFT combination), calculating cosine similarity between real-time environmental features acquired by a current camera and pre-stored target area features according to a similarity evaluation algorithm, triggering navigation strategy switching when the similarity exceeds a threshold value, and converting into an image matching navigation method.
Step4, constructing an image matching navigation link. The real-time photo shot by the camera is converted into a gallery-style picture through the Latent Diffusion model, and XFeat then matches the style-converted real-time image captured around the waypoint against the offline gallery photo to solve the current UAV pose. By these technical means, the relative error produced by visual-inertial navigation is corrected, and the positioning accuracy of the UAV in multi-scene, long-endurance navigation tasks is improved. The overall structural framework of the technical scheme is shown in fig. 2, and a visual description of the scheme is given in fig. 3.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and examples.
Example 1
The embodiment of the invention provides a remote long-endurance integrated navigation method combining visual inertia joint optimization and image matching. Comprises the following parts:
Step1, constructing a data acquisition link;
step2, constructing a visual inertial navigation link;
the visual inertial navigation scheme specifically comprises the following steps:
① Inertial navigation branch. The IMU data are pre-integrated and a pre-integration error term is constructed. First, the gyroscope angular velocity and acceleration data are integrated in a local coordinate system over a continuous time window to construct the relative motion increments. Second, the attitude change is modeled on the Lie group SO(3) to compensate zero-bias drift. The integration result is then converted into discrete pose constraints to form the pre-integration observation model, and the pre-integration error term is constructed. Finally, the Jacobian matrix and the covariance are passed to the back-end optimizer to realize efficient state updating.
② Visual navigation branch. The real-time picture shot by the camera is processed with the direct method of minimizing the photometric error. First, the camera must be calibrated and its intrinsic parameters (focal length, principal point coordinates, distortion coefficients, etc.) determined, to guarantee the validity of the photometric error in subsequent calculations and the performance of the algorithm. Second, the image key frames are extracted: the captured image is convolved with the Sobel/Prewitt operator and the gradient magnitude of each pixel is calculated; the image is divided into 32 × 32 grids, each grid retaining the point with the largest gradient magnitude, giving the initially selected key frames; after the initial selection, key frames with insufficient stability are dynamically eliminated. Finally, the photometric error between adjacent key frames is calculated.
③ And constructing and optimizing a joint error term. And respectively constructing a luminosity error and a pre-integral error term through the two branches, and performing joint error optimization at the rear end, as shown in fig. 4. The core idea of joint optimization is to dynamically balance the constraint weights of the two sensors and adjust all state parameters in nonlinear optimization. And after the joint optimization, more accurate camera pose information is obtained. Specifically, the old pose and map points are marginalized by adopting a sliding window optimization method, and the positions of the map points nearby are updated through new observation, so that more accurate characteristic constraint is provided for the subsequent frames, and the positioning accuracy of the system is further improved. And finally, recalculating the coordinates and covariance matrix of the map points by using the optimized pose.
1.1 Construction of IMU Pre-integration error term
Over the time interval $[i, j]$, the IMU pre-integration quantities $\Delta R_{ij}$, $\Delta p_{ij}$, $\Delta v_{ij}$ respectively denote the pre-integrated rotation, position and velocity increments from time $i$ to time $j$, defined as:

$$\Delta R_{ij} = \prod_{k=i}^{j-1} \mathrm{Exp}\!\left((\tilde{\omega}_k - b_g)\,\Delta t\right)$$
$$\Delta v_{ij} = \sum_{k=i}^{j-1} \Delta R_{ik}\,(\tilde{a}_k - b_a)\,\Delta t$$
$$\Delta p_{ij} = \sum_{k=i}^{j-1} \left[\Delta v_{ik}\,\Delta t + \tfrac{1}{2}\,\Delta R_{ik}\,(\tilde{a}_k - b_a)\,\Delta t^{2}\right]$$

Wherein $\tilde{\omega}_k$ and $\tilde{a}_k$ are the measured angular velocity and acceleration, $b_g$ and $b_a$ are the gyroscope and accelerometer zero biases, $\Delta t$ is the IMU sampling interval, and $\mathrm{Exp}(\cdot)$ is the exponential map of the Lie group $SO(3)$.
Zero offset compensation and error propagation:
When the zero-bias estimate changes by a small amount $\delta b_g$, $\delta b_a$, the pre-integration quantities are corrected to first order as:

$$\Delta R_{ij}(b_g) \approx \Delta \bar{R}_{ij}\,\mathrm{Exp}\!\left(\frac{\partial \Delta \bar{R}_{ij}}{\partial b_g}\,\delta b_g\right)$$
$$\Delta v_{ij}(b_g, b_a) \approx \Delta \bar{v}_{ij} + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_g}\,\delta b_g + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_a}\,\delta b_a$$
$$\Delta p_{ij}(b_g, b_a) \approx \Delta \bar{p}_{ij} + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_g}\,\delta b_g + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_a}\,\delta b_a$$
The Jacobian matrices with respect to the zero biases are computed by recursive updating as each new IMU sample is integrated, and are propagated together with the covariance to the back-end optimizer.
Constructing the pre-integration error term:
In IMU back-end optimization, the pre-integration error is defined as the difference between the predicted value and the actual observed value:

$$r_{\Delta R_{ij}} = \mathrm{Log}\!\left(\Delta R_{ij}^{T}\, R_i^{T} R_j\right)^{\vee}$$
$$r_{\Delta v_{ij}} = R_i^{T}\left(v_j - v_i - g\,\Delta t_{ij}\right) - \Delta v_{ij}$$
$$r_{\Delta p_{ij}} = R_i^{T}\left(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\, g\,\Delta t_{ij}^{2}\right) - \Delta p_{ij}$$

Wherein $\mathrm{Log}(\cdot)$ is the logarithmic map of $SO(3)$, $(\cdot)^{\vee}$ is the Lie-algebra vector extraction operation, $R_i, p_i, v_i$ and $R_j, p_j, v_j$ respectively denote the rotation, position and velocity of the IMU in the world coordinate system at time $i$ and time $j$, and $g$ is the gravitational acceleration vector in the world coordinate system.
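As a numerical illustration of these residual terms, the following Python sketch assumes the pre-integrated increments and the state estimates are available as NumPy arrays; the function and variable names are illustrative, not part of the method itself.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R


def log_so3(Rm):
    # Logarithm map SO(3) -> R^3 (rotation vector), i.e. Log(R)^vee
    return R.from_matrix(Rm).as_rotvec()


def preintegration_residual(Ri, pi, vi, Rj, pj, vj, dR, dv, dp, g, dt):
    """Residual between the predicted motion and the pre-integrated IMU
    increments over the interval [i, j] (a sketch of the standard form)."""
    r_R = log_so3(dR.T @ Ri.T @ Rj)
    r_v = Ri.T @ (vj - vi - g * dt) - dv
    r_p = Ri.T @ (pj - pi - vi * dt - 0.5 * g * dt**2) - dp
    return np.concatenate([r_R, r_v, r_p])
```

Stacking the three residuals gives the 9-dimensional error vector that the back-end optimizer weights by the pre-integration covariance.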
1.2 Construction of image photometric error term
Before calculating the photometric error, key frame selection is performed: the most representative frames are selected from the video sequence to summarize the video content or mark significant scene changes. Key frame judgment depends on the gradient change intensity between adjacent frames and is realized by the following steps:
First, the gradient-magnitude matrix difference between adjacent frames $k$ and $k+1$ is calculated:

$$D_{k,k+1} = \frac{1}{W \times H} \sum_{u=1}^{W}\sum_{v=1}^{H} \left| G_{k+1}(u, v) - G_{k}(u, v) \right|$$

Wherein $W \times H$ is the image resolution and $G(u, v)$ is the gradient magnitude of each pixel; the larger the difference value, the more significant the scene change. Next, the image is divided into 32 × 32 grids, and each grid retains the point with the largest gradient magnitude, giving the initially selected image key frames. Finally, image key frames with insufficient stability are dynamically eliminated.
Assuming that for adjacent key frames $i$ and $j$ the pixel intensity of the same scene point is unchanged, the photometric error is defined as:

$$E_{photo} = \sum_{p} \left\| I_j(p') - I_i(p) \right\|^{2}$$

Wherein $I(\cdot)$ denotes the image gray value, $p$ is the pixel coordinate at time $i$, and $p'$ is the coordinate of pixel $p$ projected into the current frame $j$ through the pose transformation.
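A minimal sketch of the gradient-based key-frame scoring and the photometric error, assuming grayscale images as NumPy arrays and pixel coordinates that have already been warped into the second key frame; the helper names are illustrative only.

```python
import cv2
import numpy as np


def gradient_magnitude(img):
    # Sobel gradients; the magnitude is used for key-frame scoring
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
    return np.sqrt(gx**2 + gy**2)


def frame_difference(img_k, img_k1):
    # Mean absolute difference of the gradient-magnitude maps of adjacent frames
    return np.mean(np.abs(gradient_magnitude(img_k1) - gradient_magnitude(img_k)))


def photometric_error(img_i, img_j, pts_i, pts_j_proj):
    """Sum of squared intensity differences between pixels in key frame i and
    their projections into key frame j (both given as N x 2 integer arrays)."""
    Ii = img_i[pts_i[:, 1], pts_i[:, 0]].astype(np.float32)
    Ij = img_j[pts_j_proj[:, 1], pts_j_proj[:, 0]].astype(np.float32)
    return float(np.sum((Ij - Ii) ** 2))
```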
1.3 Joint optimization
Unifying the IMU pre-integration error term and the photometric error term into a nonlinear least-squares problem:

$$\min_{X}\ \sum \rho\!\left(\left\| r_{IMU} \right\|^{2}_{\Sigma_{IMU}}\right) + \sum \rho\!\left(\left\| r_{photo} \right\|^{2}_{\Sigma_{photo}}\right)$$

Wherein $\rho(\cdot)$ is the robust kernel function, $\Sigma$ is the covariance matrix, and $X$ is the set of optimization variables (pose, velocity, IMU zero bias, etc.). A sliding-window optimization method marginalizes old poses and map points, and the positions of nearby map points are updated through new observations, providing more accurate feature constraints for subsequent frames.
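A highly simplified sketch of how the two residual types can be stacked into one robust least-squares problem; marginalization and the actual sliding-window bookkeeping are omitted, and the block layout (pairs of residual callables and square-root information matrices) is an assumption made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def stacked_residuals(x, imu_blocks, photo_blocks):
    """Stack IMU pre-integration and photometric residuals, each whitened by
    the square root of its information matrix; x packs the window's poses,
    velocities and biases in whatever parameterization the caller chooses."""
    res = []
    for r_fn, info_sqrt in imu_blocks:
        res.append(info_sqrt @ r_fn(x))
    for r_fn, info_sqrt in photo_blocks:
        res.append(info_sqrt @ r_fn(x))
    return np.concatenate(res)


# The Huber loss plays the role of the robust kernel rho(.):
# result = least_squares(stacked_residuals, x0, loss='huber',
#                        args=(imu_blocks, photo_blocks))
```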
Step3, constructing a multi-mode navigation autonomous decision-making link;
the multi-mode navigation autonomous decision scheme specifically comprises the following steps:
① And extracting multi-mode environmental characteristics. And respectively extracting low-level geometric features (edges, corner points) and high-level semantic features (scenery and building outlines) of the environment in the pre-stored offline map by adopting a hierarchical feature pyramid (ORB+SIFT combination).
② Feature matching and trigger criterion generation. Real-time feature similarity evaluation is carried out: the cosine similarity between the current environmental features and the pre-stored target-area features is computed, and strategy switching is triggered when the matching degree exceeds 80% for 10 consecutive frames.
③ Multi-modal decision fusion. A double-layer decision mechanism is employed. The primary decision generates a candidate trigger instruction based on the feature matching result; the secondary verification fuses multi-modal data such as the IMU trajectory fitting degree and the visual matching confidence, and computes the trigger confidence through a Bayesian network, taking as inputs the confidence of the preceding visual match and the trajectory fitting error of the inertial measurement unit (whose complement represents the reliability of the IMU trajectory fit). The fusion weights are dynamically adjusted according to circumstances: for example, under good illumination the visual weight can be increased appropriately, and when the IMU moves vigorously the inertial weight can be reduced appropriately (a weighted-fusion sketch is given after this list).
④ Dynamic environment adaptive optimization. During UAV flight, the pre-stored feature library is updated by incremental learning: the feature library is augmented as new landmarks appear in the sparse semantic map constructed by SLAM, and old features are automatically cleaned up based on access frequency and recency, so that the feature library is updated in real time during the decision process.
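The secondary verification above is described as a Bayesian network; the minimal Python sketch below only illustrates the weighted-fusion intuition of combining visual-matching confidence with IMU trajectory-fit reliability, and the default weight values are assumptions for illustration.

```python
def trigger_confidence(c_visual, e_imu, w_v=0.6, w_i=0.4):
    """Fuse visual-matching confidence and IMU track-fit reliability into one
    trigger confidence; w_v and w_i would be adjusted online, e.g. raising
    w_v under good illumination or lowering w_i under aggressive motion."""
    c_imu = 1.0 - min(e_imu, 1.0)   # larger fitting error -> lower reliability
    return w_v * c_visual + w_i * c_imu
```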
2.1 Hierarchical feature pyramid (ORB+SIFT Combined)
ORB is an improvement based on FAST corner detection and BRIEF descriptors; it is fast and suitable for real-time applications, but may perform only moderately under scale and rotation changes. SIFT detects key points through the DoG pyramid, has scale and rotation invariance, and its descriptor is more robust, but the computational cost is larger. Combining the two by feature fusion therefore overcomes the limitation of a single feature descriptor in complex scenes.
The feature pyramid is divided into a bottom layer (L0-L2) and an upper layer (L3-L5); the bottom layer uses ORB to extract edges/corner points (geometric primitives), and the upper layer uses SIFT to extract building outlines/texture areas (semantic objects). First, a Gaussian pyramid is generated:

$$I_{l+1}(x, y) = \sum_{m}\sum_{n} w(m, n)\, I_{l}(2x + m,\ 2y + n)$$

Wherein $I_l$ is the image of the $l$-th layer and $w(m, n)$ are the Gaussian kernel weights; a 5 × 5 kernel (k = 2) is typically used.
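A minimal sketch of the pyramid construction using OpenCV; cv2.pyrDown applies the 5 × 5 Gaussian kernel and halves the resolution at each level, matching the formula above. The level count is an assumption following the L0-L5 split.

```python
import cv2


def gaussian_pyramid(img, levels=6):
    """Build the L0-L5 Gaussian pyramid by repeated blur-and-downsample."""
    pyr = [img]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr
```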
ORB (Oriented FAST and Rotated BRIEF) combines FAST corner detection with the BRIEF descriptor and is modified to add rotation and scale invariance. The FAST corner detection part declares a candidate pixel $p$ a corner if, among the 16 pixels on the circle around it, there are $N$ consecutive points satisfying $|I(x) - I(p)| > t$ (the threshold $t$ is usually taken as 10-30). On the basis of the BRIEF descriptor, the rotation-invariant rBRIEF descriptor is further provided:

$$f(p) = \sum_{i=1}^{256} 2^{\,i-1}\,\tau(p;\, x_i, y_i), \qquad \tau(p;\, x_i, y_i) = \begin{cases}1, & I(x_i) < I(y_i)\\ 0, & \text{otherwise}\end{cases}$$

Wherein $\tau$ is the indicator function, taking 1 when the condition is satisfied and 0 otherwise, and $f(p)$ is formed by sampling 256 BRIEF point pairs $(x_i, y_i)$. Rotating the sampling pattern around the principal direction $\theta$ of the key point gives the steered descriptor.
In addition, ORB inter-layer association is added: feature correspondences between adjacent pyramid layers $l$ and $l+1$ are established by matching their binary descriptors, with the Hamming distance used for binary descriptor matching.
SIFT directly detects key points in the DoG (Difference of Gaussians) pyramid and assigns a direction to each key point. The first step is to compute the gradient magnitude and direction:

$$m(x, y) = \sqrt{\big(L(x{+}1, y) - L(x{-}1, y)\big)^{2} + \big(L(x, y{+}1) - L(x, y{-}1)\big)^{2}}$$
$$\theta(x, y) = \arctan\!\frac{L(x, y{+}1) - L(x, y{-}1)}{L(x{+}1, y) - L(x{-}1, y)}$$

The gradient directions in the key-point neighborhood (a 16 × 16 window) are divided into 36 bins, and the peak is taken as the principal direction. Then the key-point neighborhood is divided into 16 sub-blocks, an 8-direction gradient histogram is computed for each sub-block, and finally the 128-dimensional SIFT descriptor is generated:

$$d = [h_1, h_2, \ldots, h_{16}] \in \mathbb{R}^{128}, \qquad h_k \in \mathbb{R}^{8}$$
Semantic categories such as building outlines are labeled by clustering the SIFT features of the upper pyramid layers (e.g. with K-means):

$$c(d) = \arg\min_{k}\ \left\| d - \mu_k \right\|_2$$

Wherein $\mu_k$ are the pre-trained cluster centers (e.g. "window", "roof", etc.).
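The sketch below shows one way the two detectors could be split across pyramid levels, using OpenCV (a build that includes SIFT is assumed); the L0-L2 / L3-L5 split follows the text, and the semantic clustering step is omitted.

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)    # bottom layers L0-L2: edges / corners
sift = cv2.SIFT_create(nfeatures=200)  # upper layers L3-L5: contours / textures


def hierarchical_features(pyramid):
    """Run ORB on the lower pyramid levels and SIFT on the upper ones,
    returning the two stacked descriptor matrices."""
    orb_desc, sift_desc = [], []
    for level, img in enumerate(pyramid):
        detector = orb if level <= 2 else sift
        _, desc = detector.detectAndCompute(img, None)
        if desc is None:
            continue
        (orb_desc if level <= 2 else sift_desc).append(desc)
    orb_out = np.vstack(orb_desc) if orb_desc else np.empty((0, 32), np.uint8)
    sift_out = np.vstack(sift_desc) if sift_desc else np.empty((0, 128), np.float32)
    return orb_out, sift_out
```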
2.2 Trigger criteria and policy switching
The foregoing has described how the hierarchical feature pyramid is constructed and the principles and combination of ORB and SIFT. The following describes how the multi-modal navigation trigger criterion is constructed and the navigation strategy switched using the previously extracted features.
First, a fused feature vector is constructed from the descriptors extracted by ORB and SIFT. The ORB algorithm on the bottom pyramid layers extracts binary descriptors $f_{ORB}$ characterizing the local geometry; SIFT on the upper layers generates 128-dimensional gradient histograms $f_{SIFT}$ describing texture and semantic outlines. The ORB and SIFT vectors are L2-normalized:

$$\hat{f}_{ORB} = \frac{f_{ORB}}{\left\| f_{ORB} \right\|_2}, \qquad \hat{f}_{SIFT} = \frac{f_{SIFT}}{\left\| f_{SIFT} \right\|_2}$$

and the normalized feature vectors are fused (e.g. by concatenation) into a single feature vector $F$.
Second, cosine similarity is computed and evaluated in real time. The pre-stored target-area feature $F_{target}$ is generated from the offline map and stored as a normalized vector. The similarity between the current feature $F_{cur}$ and the target feature $F_{target}$ is defined as:

$$\mathrm{sim}(F_{cur}, F_{target}) = \frac{F_{cur} \cdot F_{target}}{\left\| F_{cur} \right\|_2 \left\| F_{target} \right\|_2}$$

The closer the output value is to 1, the more closely the current environment matches the target area.
Finally, the trigger criterion is generated. A sliding window of length $N$ is maintained, storing the similarities of the $N$ most recent consecutive frames. The trigger condition for strategy switching requires all frames in the window to satisfy:

$$\mathrm{sim}_t > 0.8, \qquad t = 1, \ldots, N$$

That is, when the cosine similarity of $N$ consecutive frames exceeds 80%, the navigation strategy is switched and the UAV autonomous navigation system changes from visual-inertial navigation to image matching navigation.
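A compact sketch of the fusion, similarity and windowed trigger test; it assumes the fused feature has a fixed dimension matching the stored target vector (in practice the per-image descriptors would first be aggregated to fixed length), and the class and function names are illustrative.

```python
import numpy as np
from collections import deque


def fuse_and_normalize(f_orb, f_sift):
    # L2-normalise each descriptor vector, then concatenate them
    f_orb = f_orb / (np.linalg.norm(f_orb) + 1e-12)
    f_sift = f_sift / (np.linalg.norm(f_sift) + 1e-12)
    return np.concatenate([f_orb, f_sift])


class SwitchTrigger:
    """Switch to image-matching navigation once the cosine similarity to the
    stored target-area feature exceeds the threshold for N consecutive frames."""

    def __init__(self, target_feature, n_frames=10, threshold=0.8):
        self.target = target_feature / (np.linalg.norm(target_feature) + 1e-12)
        self.window = deque(maxlen=n_frames)
        self.threshold = threshold

    def update(self, current_feature):
        f = current_feature / (np.linalg.norm(current_feature) + 1e-12)
        self.window.append(float(np.dot(f, self.target)))
        return (len(self.window) == self.window.maxlen
                and min(self.window) > self.threshold)
```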
Step4, constructing an image matching navigation link;
The image matching navigation scheme specifically comprises the following steps:
① Image generation and style conversion. The camera-captured pictures are converted into gallery-style pictures (removing season, weather, viewing-angle differences, etc.) using the Latent Diffusion model. Specifically, a VQ-VAE compresses the image into the latent space in the encoding stage, and the diffusion stage performs the denoising process in the latent space, reducing computational complexity.
② Real-time feature extraction. The XFeat model is adopted; through the brand-new strategy of "doubling the convolution depth of the network when the resolution of the input image is halved", the model trades off network accuracy against acceleration gain. Specifically, the backbone consists of two parts, a Keypoint Head and a Descriptor Head: the former locates feature points with significant distinctiveness in the image, while the latter generates high-dimensional feature vectors for the key points.
③ Image feature matching. Given the dense local feature maps, the input of the feature matching stage is a subset at 1/8 spatial resolution. Coarse correspondences between the image pair are obtained by nearest-neighbor matching of the two feature sets. For each coarse match, an offset is predicted from the network output, which is the logarithm of the probability distribution over the possible offsets, and classifying the offset realizes correct pixel-level matching.
④ Pose optimization and map-point updating. An optimization problem is built with the matched images (feature-point correspondences) as input, constructing the objective function:

$$E_{reproj} = \sum_{i} \left\| u_i - \pi\!\left(T,\ P_i\right) \right\|^{2}$$

Wherein $T$ is the camera pose, $P_i$ are the three-dimensional map-point coordinates, $u_i$ are the matched pixel observations, and $\pi(\cdot)$ denotes the camera projection model. The re-projection error $E_{reproj}$ is then gradually reduced by a nonlinear optimization method, realizing pose optimization (a solver sketch is given after this list). Finally, a sliding-window optimization method marginalizes old poses and map points, and the positions of nearby map points are updated through new observations.
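The following sketch shows one common way to recover the camera pose from the 2D-3D correspondences produced by feature matching and to measure the re-projection error; cv2.solvePnPRansac is used here as a stand-in solver under the assumption of a calibrated pinhole camera, not the exact optimizer of the method.

```python
import cv2
import numpy as np


def estimate_pose(points_3d, points_2d, K):
    """Solve the camera pose from matched 3D map points and 2D image points,
    then report the mean re-projection error over the RANSAC inliers."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32), K, None)
    if not ok:
        return None
    proj, _ = cv2.projectPoints(points_3d[inliers[:, 0]], rvec, tvec, K, None)
    err = np.linalg.norm(proj.reshape(-1, 2) - points_2d[inliers[:, 0]], axis=1).mean()
    return rvec, tvec, err
```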
3.1 Image generation and style migration
Before the real-time image is matched against the pre-stored image library, the captured real-time photo needs to be converted into a photo in the style of the pre-stored library, so as to improve matching accuracy. The Latent Diffusion model performs this operation by migrating the diffusion process to a low-dimensional latent space, preserving the quality of the generated image while remaining efficient. The training process of the model is shown in fig. 5 and is divided into two steps:
First, the VQ-VAE is trained. The input image $x$ is compressed by a deep convolutional encoder $E$ into a low-dimensional continuous latent representation $z_e = E(x)$, with spatial size reduced by the downsampling rate $f$ (e.g. 16) and latent vector dimension $d$ (typically 256 or 512). A learnable codebook $\{e_k\}_{k=1}^{K}$ is introduced; after the encoder output $z_e$ is obtained, it is quantized to the nearest codebook vector by nearest-neighbor search:

$$z_q = e_{k^*}, \qquad k^* = \arg\min_{k}\ \left\| z_e - e_k \right\|_2$$

The quantized latent representation $z_q$ is then fed to the decoder $D$, which outputs the reconstructed image $\hat{x} = D(z_q)$.
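A minimal PyTorch-style sketch of the nearest-neighbor quantization step; the straight-through gradient trick is the standard VQ-VAE device and is included only for completeness of the sketch.

```python
import torch


def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry
    (z_e: [N, D] latent vectors, codebook: [K, D] learnable embeddings)."""
    # squared Euclidean distance between every latent vector and codebook entry
    d = (z_e.pow(2).sum(1, keepdim=True)
         - 2 * z_e @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)
    z_q = codebook[idx]
    # straight-through estimator so gradients flow back to the encoder
    return z_e + (z_q - z_e).detach(), idx
```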
Second, the diffusion model (LDM) is trained to learn the noise $\epsilon$. The upper half of the figure is the noising process, which adds noise to the latent features; the lower half is the denoising process, whose core structure is a U-Net with cross attention that denoises $z_T$ back to $z_0$. The diffusion model can be understood as a sequence of temporal denoising autoencoders whose purpose is to predict, from the noised version $x_t$ of the input image $x$ at step $t$, the noise that was added to it. The objective function of the DM (Diffusion Model) can be expressed as:

$$L_{DM} = \mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\left\| \epsilon - \epsilon_{\theta}(x_t, t) \right\|_2^{2}\right]$$

Wherein $t$ is sampled uniformly from $\{1, \ldots, T\}$. Notably, the LDM learns in latent space, i.e. it predicts the noise added to the latent representation $z_t$, and the corresponding loss function is expressed as:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\left\| \epsilon - \epsilon_{\theta}(z_t, t) \right\|_2^{2}\right]$$
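A sketch of the latent-space noise-prediction objective $L_{LDM}$ under the usual DDPM-style parameterization; eps_model, encoder and alphas_cumprod are assumed to be supplied by the surrounding training code, so the interfaces shown are illustrative.

```python
import torch
import torch.nn.functional as F


def ldm_loss(eps_model, encoder, x0, alphas_cumprod):
    """Encode the image, add noise at a random timestep in latent space,
    and regress the added noise (the L_LDM objective above)."""
    z0 = encoder(x0)                                    # latent representation
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps          # forward (noising) process
    return F.mse_loss(eps_model(z_t, t), eps)           # predict the noise
```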
Example 2
Embodiment 2 of the present invention provides a remote long-endurance integrated navigation system that combines visual inertia joint optimization and image matching, and is implemented based on a monocular camera and an IMU, as shown in fig. 2, the system includes:
The data acquisition module is used for acquiring an offline map of the surrounding environment of the unmanned aerial vehicle flight path by inquiring satellite information, and presetting a plurality of route points in the mission path according to the map condition;
the visual inertial navigation module is used for carrying out high-frequency relative navigation between navigation points by adopting visual inertial navigation to realize continuous pose estimation, and the specific processing process is the same as Step2 of the embodiment 1;
the multi-modal navigation autonomous decision-making module is used for triggering the switching of the navigation strategy when approaching the waypoint based on the constructed multi-modal navigation autonomous decision-making scheme and converting the switching into the use of the image matching navigation module;
The image matching navigation module is used for realizing image matching navigation by using the Latent Diffusion model and the XFeat model according to the real-time picture shot by the camera and correcting the relative error accumulated among route points, and the specific processing process is the same as Step4 of embodiment 1;
and repeatedly using the visual inertial navigation module, the multi-mode navigation autonomous decision-making module and the image matching navigation module among the plurality of navigation route points until the task path is traversed.
It should be noted that, in the embodiment of the system, the included modules are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented, and the specific names of the functional modules are only for convenience of distinguishing each other, and are not used to limit the protection scope of the present invention.
Overview:
According to the invention, by establishing the combined navigation mechanism of "visual-inertial navigation between waypoints, autonomous multi-modal decision switching to image matching navigation near the waypoints", the relative error accumulated by visual-inertial navigation between waypoints is corrected, the problems of high deployment cost, poor environmental adaptability and difficulty in meeting the rapid-response requirements of dynamic complex scenes are avoided, accurate tracking and positioning of the UAV during flight are realized, and the navigation and positioning accuracy of the UAV in long-distance, long-endurance scenarios is improved.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.