CN113570535A

Movatterモバイル変換

Info

Publication number: CN113570535A
Application number: CN202110874103.0A
Authority: CN
Inventors: 章国锋; 鲍虎军; 黄昭阳; 周晗; 周晓巍; 李鸿升
Original assignee: Shenzhen TetrasAI Technology Co Ltd
Current assignee: Shenzhen TetrasAI Technology Co Ltd
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2021-10-29
Anticipated expiration: 2041-07-30
Also published as: CN113570535B

Abstract

Description

Visual positioning method and related device and equipment

Technical Field

The present application relates to the field of computer vision technologies, and in particular, to a visual positioning method and related apparatus and devices.

Background

With the development of electronic information technology, applications such as augmented reality, virtual reality, mixed reality and the like are increasingly widely applied. Applications such as this generally require better accuracy and robustness of visual positioning to achieve better visual effects and enhance user experience.

At present, local extreme values are detected in a mode similar to thermodynamic diagrams so as to detect landmark points in an image, however, under the conditions that weak textures, repeated textures and the like exist in the image, the difficulty of detecting the landmark points is increased steeply, and the pose of a camera cannot be estimated. In view of this, how to improve the accuracy and robustness of visual positioning becomes an urgent problem to be solved.

Disclosure of Invention

The application provides a visual positioning method and a related device and equipment.

The visual positioning method further comprises the following steps: and processing the first feature image based on at least one of an attention mechanism and a multi-scale feature extraction network to obtain a second feature image.

Therefore, the first feature image is processed based on the attention mechanism, each position in the image can acquire the importance degree of other positions to the first feature image, so that the pixel points in the second feature image can contain the feature information of the corresponding image positions of the pixel points, the feature information of other image positions can be referred to according to the importance degree, namely, the global feature information can be acquired from the importance degree angle, the first feature image is processed based on the multi-scale feature extraction network, the global feature information can be acquired from different scale angles, therefore, the global feature information is acquired from different angles, and the accuracy of the second feature image is favorably improved.

Wherein, processing the first feature image based on at least one of an attention mechanism and a multi-scale feature extraction network to obtain a second feature image comprises: processing the first characteristic image based on an attention mechanism to obtain a first global image, and fusing the multi-scale characteristic images extracted by the multi-scale extraction network to obtain a second global image; and fusing the first global image and the second global image to obtain a second characteristic image.

Therefore, the first feature image is processed based on the attention mechanism to obtain a first global image, the multi-scale feature images extracted by the multi-scale extraction network are fused to obtain a second global image, and the first global image and the second global image are fused to obtain the second feature image on the basis, so that the global feature information can be obtained from two angles of importance and different scales, and the accuracy of the second feature image can be further improved.

The first global image and the second global image are both multi-channel images; fusing the first global image and the second global image to obtain a second characteristic image, wherein the method comprises the following steps: performing channel shuffling on the first global image and the second global image to obtain a third global image; and carrying out channel fusion on the third global image to obtain a second characteristic image.

Therefore, under the condition that the first global image and the second global image are both multi-channel images, the channels of the first global image and the second global image are shuffled to obtain a third global image, and the channels of the third global image are fused to obtain a second feature image, so that the first global image and the second global image can be fully fused, and the accuracy of the second feature image can be further improved.

The method for detecting and obtaining the target landmark point in the image to be positioned based on the fusion characteristic image comprises the following steps: processing the fusion characteristic image by using a landmark detection model to obtain a first landmark prediction image and a first direction prediction image; analyzing the first landmark predicted image and the first direction predicted image to obtain a target landmark point; the target landmark point is at least one of a plurality of landmark points of a preset scene, the landmark points are selected from a scene map of the preset scene, the first landmark prediction image comprises a prediction landmark attribute of a pixel point in an image to be positioned, the first direction prediction image comprises a first direction attribute of the pixel point in the image to be positioned, the prediction landmark attribute is used for identifying the landmark point corresponding to the pixel point, the first direction attribute comprises first direction information pointing to landmark projection, and the landmark projection represents the projection position of the landmark point corresponding to the pixel point in the image to be positioned.

Therefore, a first landmark prediction image and a first direction prediction image are obtained by processing the fused feature image by using a landmark detection model, the first landmark prediction image comprises a predicted landmark attribute of a pixel point in an image to be positioned, the first direction prediction image comprises a first direction attribute of the pixel point in the image to be positioned, the predicted landmark attribute is used for identifying a landmark point corresponding to the pixel point, the first direction attribute comprises first direction information pointing to a landmark projection, the landmark projection represents the projection position of the landmark point corresponding to the pixel point in the image to be positioned, on the basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain a target landmark point, and the first landmark prediction image comprises the landmark point corresponding to each pixel point, and the first direction prediction image comprises direction information pointing to the landmark projection of each pixel point, therefore, the influence of weak texture, repeated texture, dynamic environment and other factors on the visual positioning can be greatly reduced, and the positioning robustness is improved.

The method for obtaining the first landmark prediction image comprises the following steps of: decoding the fused feature image by using a landmark prediction network to obtain a first feature prediction image; the first characteristic prediction image comprises a first characteristic representation of a pixel point in an image to be positioned; for each pixel point, processing the first characteristic representation of the pixel point based on the local sensitive Hash to obtain the predicted landmark attribute of the pixel point; and obtaining a first landmark prediction image based on the prediction landmark attribute of each pixel point in the image to be positioned.

Therefore, the landmark detection model comprises a landmark prediction network, the landmark prediction network is used for decoding the fused feature image to obtain a first feature prediction image, the first feature prediction image comprises first feature representations of pixel points in the image to be positioned, and for each pixel point, the first feature representations of the pixel points are processed based on local sensitive hashing to obtain the predicted landmark attributes of the pixel points, so that the first landmark prediction image is obtained based on the predicted landmark attributes of each pixel point in the image to be positioned.

The method for obtaining the predicted landmark attribute of the pixel point based on the first characteristic representation of the locality sensitive hash processing pixel point comprises the following steps: determining a first target partition where a pixel point is located based on a first feature representation of the locality sensitive Hash mapping pixel point; the first target partition belongs to a plurality of first Hash partitions, the first Hash partitions are obtained by locally sensitive Hash processing of landmark feature representations of a plurality of landmark points, and the landmark feature representations are obtained after landmark detection model training convergence; selecting landmark points in the first target partition as first candidate landmark points; and obtaining the predicted landmark attributes of the pixel points based on the similarity between the first feature representation of the pixel points and the landmark feature representation of each first candidate landmark point.

Thus, based on the first characteristic representation of the locality-sensitive hashmap pixel points, a first target partition is determined in which the pixel points are located, the first target partition belongs to a plurality of first Hash partitions, the first Hash partitions are obtained by performing local Hash sensitive processing on landmark feature representations of a plurality of landmark points, the landmark feature representations are obtained after training of a landmark detection model, on the basis, selecting landmark points of the first target partition as first candidate landmark points, obtaining the predicted landmark attributes of the pixel points based on the similarity between the first feature representation of the pixel points and the landmark feature representation of each first candidate landmark point, because the similarity between the landmark feature representation of each landmark point and the first feature representation of the pixel point does not need to be calculated, the method is favorable for greatly reducing the calculated amount and improving the response speed of visual positioning.

The target landmark point is obtained by detecting through a landmark detection model, the target landmark point is at least one of a plurality of landmark points of a preset scene, the landmark points are selected from a scene map of the preset scene, the landmark points are respectively located at preset positions of sub-areas of the scene map, and the training step of the landmark detection model comprises the following steps: respectively determining the projection area and the projection position of the sub-area and the landmark point in the sample image; determining sample landmark attributes and sample direction attributes of sample pixel points in the sample image based on the projection area and the projection position; the sample landmark attribute is used for identifying sample landmark points corresponding to the sample pixel points, the sample landmark points are landmark points contained in a sub-region of a projection region covering the sample pixel points, and the sample direction attribute comprises sample direction information pointing to the projection positions of the sample landmark points corresponding to the sample pixel points; obtaining a sample landmark image and a sample direction image of the sample image based on the sample landmark attribute and the sample direction attribute respectively; the first pixel point in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel point, and the second pixel point in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel point; predicting the sample image by using a landmark detection model to obtain a second characteristic predicted image and a second direction predicted image of the sample image; the second characteristic prediction image comprises a second characteristic representation of the sample pixel point, the second direction prediction image comprises a second direction attribute of the sample pixel point, the second direction attribute comprises second direction information pointing to a sample landmark projection, and the sample landmark projection represents a projection position of the sample local landmark in the sample image; obtaining a first loss based on the sample landmark image and the second characteristic prediction image, and obtaining a second loss based on the sample direction image and the second direction prediction image; network parameters of the landmark detection model are optimized based on the first loss and the second loss.

Based on the processing result of the second feature representation of the sample pixel point by the locality sensitive hash, selecting a reference feature representation as the negative case feature representation of the sample pixel point, including: determining a second target partition where the sample pixel points are located based on the second characteristic representation of the local sensitive Hash mapping sample pixel points; the second target partition belongs to a plurality of second Hash partitions, and the second Hash partitions are obtained by locally sensitive Hash processing of the feature representation to be optimized of a plurality of landmark points; selecting landmark points in the second target partition as second candidate landmark points; the second candidate landmark point does not contain a sample landmark point corresponding to the sample pixel point, and the processing result comprises the second candidate landmark point; and obtaining the negative case characteristic representation of the sample pixel points based on the similarity between the second characteristic representation of the sample pixel points and the characteristic representation to be optimized of each second candidate landmark point.

Therefore, a second target partition where the sample pixel points are located is determined based on second feature representations of the sample pixel points of the locality sensitive Hash mapping, the second target partition belongs to a plurality of second Hash partitions, the plurality of second Hash partitions are obtained by performing locality sensitive Hash processing on feature representations to be optimized of a plurality of landmark points, on the basis, landmark points in the second target partition are selected as second candidate landmark points, the second candidate landmark points do not contain sample local landmark points corresponding to the sample pixel points, the processing result comprises the second candidate landmark points, and based on the similarity between the second feature representations of the sample pixel points and the feature representations to be optimized of the second candidate landmark points respectively, negative example feature representations of the sample pixel points are obtained, because the similarity between the feature representations to be optimized of the sample pixel points and the second feature representations of the sample pixel points does not need to be calculated for each landmark point, therefore, the method is beneficial to greatly reducing the calculated amount and improving the training speed of the landmark detection model.

The sub-regions are obtained by dividing the surface of the scene map; and/or the preset position comprises the central position of the sub-area; and/or the area difference between the sub-regions is below a preset threshold.

The method comprises the following steps that a plurality of sub-regions are obtained by dividing the surface of a scene map, and because an image to be positioned generally images the surface of a preset scene, the accuracy of a target landmark point detected in the image to be positioned can be improved; the preset position is set to include the central position of the sub-region, so that the characteristic of uniform distribution of landmark points can be further improved, and the point pair quality can be improved; in addition, the area difference between the sub-areas is set to be lower than the first threshold, so that the characteristic of uniform distribution of landmark points can be further improved, and the point pair quality can be improved.

A second aspect of the present application provides a visual positioning apparatus comprising: the system comprises a feature extraction module, a feature fusion module, a landmark detection module and a pose determination module, wherein the feature extraction module is used for extracting a first feature image and a second feature image of an image to be positioned; the first characteristic image comprises local characteristic information, and the second characteristic image comprises global characteristic information; the feature fusion module is used for fusing the first feature image and the second feature image to obtain a fused feature image; the landmark detection module is used for detecting and obtaining a target landmark point in the image to be positioned based on the fusion characteristic image; the position and pose determining module is used for obtaining position and pose parameters of the image to be positioned based on first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in a scene map, wherein the image to be positioned is obtained by shooting a preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene.

A third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the visual positioning method in the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the visual positioning method of the first aspect.

According to the scheme, a first characteristic image and a second characteristic image of an image to be positioned are extracted, the first characteristic image comprises local characteristic information, the second characteristic image comprises global characteristic information, on the basis, the first characteristic image and the second characteristic image are fused to obtain a fused characteristic image, a target landmark point in the image to be positioned is detected and obtained on the basis of the fused characteristic image, so that the position and orientation parameter of the image to be positioned is obtained on the basis of first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in a scene map, the image to be positioned is shot for a preset scene, the scene map is obtained by three-dimensional modeling of the preset scene, the first characteristic image comprises the local characteristic information, the second characteristic image comprises the global characteristic information, and the fused characteristic image is obtained by fusing the first characteristic image and the second characteristic image, therefore, the receptive field of the pixel points in the fusion characteristic image can be greatly expanded, so that the accuracy of characteristic representation of the pixel points in the weak texture and repeated texture areas can be greatly improved, the accuracy of the target landmark point can be improved, on the basis, the pose parameter is obtained based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map, and the accuracy and the robustness of visual positioning can be improved.

Drawings

FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a visual positioning method of the present application;

FIG. 2 is a schematic diagram of an embodiment of a scene map;

FIG. 3 is a diagram of one embodiment of detecting a target landmark point using a landmark detection model;

FIG. 4 is a schematic diagram of one embodiment of locating a target landmark point;

FIG. 5 is a flowchart illustrating an embodiment of step S13 in FIG. 1;

FIG. 6 is a diagram of an embodiment of visual localization using SIFT features;

FIG. 7 is a schematic diagram of one embodiment of visual positioning using landmark points;

FIG. 8 is a schematic diagram of one embodiment of a first landmark predicted image;

FIG. 9 is a diagram of one embodiment of a first direction predicted image;

FIG. 10 is a schematic flow chart diagram of an embodiment of training a landmark detection model;

FIG. 11 is a schematic diagram of one embodiment of calculating a first loss;

FIG. 12 is a schematic diagram of a frame of an embodiment of the visual positioning apparatus of the present application;

FIG. 13 is a block diagram of an embodiment of an electronic device of the present application;

FIG. 14 is a block diagram of an embodiment of a computer-readable storage medium of the present application.

Detailed Description

The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.

The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a visual positioning method according to an embodiment of the present application. Specifically, the method may include the steps of:

step S11: and extracting a first characteristic image and a second characteristic image of the image to be positioned.

In the embodiment of the present disclosure, the first feature image includes local feature information, and the second feature image includes global feature information. Specifically, the pixel points in the first characteristic image contain local characteristic information of a certain area in the image to be positioned, and the pixel points in the second characteristic image contain global characteristic information of the whole image to be positioned.

In one implementation scenario, the image to be positioned is captured in a preset scenario. The preset scene can be set according to the actual application requirement. For example, in a case where visual positioning is required to be achieved in a scenic spot, the preset scene may include the scenic spot; or, in the case that visual positioning is required to be realized at a commercial street, the preset scene may include the commercial street; alternatively, the preset scene may include the industrial park in case that the visual positioning needs to be implemented in the industrial park. Other cases may be analogized, and no one example is given here.

In another implementation scenario, the image to be positioned may be obtained by shooting a preset scene at any angle of view. For example, the image to be positioned may be obtained by looking up at a preset scene; or the image to be positioned can be obtained by shooting a preset scene in a downward mode; or, the image to be positioned may be obtained by flatly shooting a preset scene.

In another implementation scenario, in order to improve the accuracy of visual positioning, an included angle between an optical axis of a camera and a horizontal plane when a preset scene is shot should be lower than a preset angle threshold, that is, an image to be positioned should include the preset scene as much as possible, and include invalid regions such as the ground and the sky as little as possible.

In one implementation scenario, in order to improve the visual positioning efficiency, a landmark detection model may be trained in advance, where the landmark detection model includes an original feature extraction network for extracting a first feature image of an image to be positioned. Specifically, the raw feature extraction network may include, but is not limited to: a convolutional layer, a pooling layer, etc., and is not limited herein. On the basis, the first feature image can be processed to obtain a second feature image based on at least one of an attention mechanism and a multi-scale feature extraction network. According to the mode, the first feature image is processed based on the attention mechanism, the importance degree of other positions to the first feature image can be acquired from each position in the image, so that the pixel points in the second feature image can contain the feature information of the pixel points corresponding to the image positions, the feature information of other image positions can be referred to according to the importance degree, the global feature information can be acquired from the importance degree angle, the first feature image is processed based on the multi-scale feature extraction network, the global feature information can be acquired from different scale angles, the global feature information is acquired from different angles, and the accuracy of the second feature image is improved.

In a specific implementation scenario, on the basis of extracting the first feature image, the first feature image may be processed based on an attention mechanism to obtain a second feature image. It should be noted that the attention mechanism may include, but is not limited to, a self-attention mechanism, and the like, and is not limited herein. Taking the self-attention mechanism as an example, for each pixel point in the first characteristic image, the importance degree of each pixel point in the first characteristic image can be obtained, then the importance degree of each pixel point is utilized to respectively perform weighting processing on the characteristic information of each pixel point, so as to obtain the weighting characteristic of the pixel point, and based on the weighting characteristic of each pixel point, the second characteristic image can be obtained. In the above manner, feature information at other image positions can be referred to according to the importance degree, that is, global feature information can be obtained from the viewpoint of the importance degree.

In another specific implementation scenario, on the basis of extracting the first feature image, the first feature image may be processed based on a multi-scale feature extraction network to obtain a second feature image. It should be noted that the multi-scale feature network may include, but is not limited to: ASPP (aperture Spatial Pyramid), etc., without limitation. Taking ASPP as an example, for a first feature image, feature extraction may be performed on the first feature image by using void convolutions with different sampling rates to obtain a multi-scale feature image, and the receptive fields of pixels in each scale feature image are different, for example, for a void convolution with a sampling rate of 1, the receptive field of a pixel in the extracted feature image is 3 × 3, for a void convolution with a sampling rate of 2, the receptive field of a pixel in the extracted feature image is 7 × 7, and for a void convolution with a sampling rate of 4, the receptive field of a pixel in the extracted feature image is 15 × 15. The sampling rate of the hole convolution may be set according to actual conditions, and the above example does not limit the sampling rate actually adopted by the hole convolution. Based on the method, the multi-scale characteristic images can be fused, and the second characteristic image can be obtained. Specifically, the multi-scale feature images can be spliced, and then the spliced multi-scale feature images are subjected to channel fusion to obtain a second feature image. For example, a plurality of convolution layers may be used to perform channel fusion on the spliced multi-scale feature image to obtain a second feature image. In addition, the convolution layers may specifically comprise 1 × 1 convolution kernels. According to the mode, the global feature information can be acquired from different scale angles, so that the global feature information is acquired from different angles, and the accuracy of the second feature image is improved.

In yet another specific implementation scenario, as mentioned above, in order to improve the visual positioning efficiency, a landmark detection model may be trained in advance, and the landmark detection model includes an original feature extraction network for extracting a first feature image of an image to be positioned. On this basis, in order to extract the deep features, the original feature extraction network may include a plurality of network layers connected in sequence, for example, may include a plurality of convolutional layers connected in sequence, which is not limited herein. Based on this, the feature image extracted by one of the network layers may be used as the first feature image, for example, the feature image extracted by the first network layer may be used as the first feature image, or the feature image extracted by the second network layer may also be used as the first feature image, which is not limited herein. Further, in order to reduce the amount of computation, the feature image extracted by the last network layer may be used as the first feature image.

In another specific implementation scenario, in order to improve the visual positioning efficiency, a landmark detection model may be trained in advance, and the landmark detection model includes an original feature extraction network, an attention mechanism network, and a multi-scale feature extraction network, where the original feature extraction network is configured to extract a first feature image, the attention mechanism network is configured to extract a first global image based on the first feature image, and the multi-scale feature extraction network is configured to extract a second global image based on the first feature image, and on this basis, the first global image and the second global image may be fused to obtain a second feature image, and the first feature image and the second feature image may be fused to obtain a fused feature image.

Step S12: and fusing the first characteristic image and the second characteristic image to obtain a fused characteristic image.

Specifically, the first feature image and the second feature image may be spliced, and then the spliced first feature image and second feature image may be channel-fused by using a plurality of convolution layers (e.g., 2 convolution layers), so as to obtain a fused feature image. In addition, the plurality of convolution layers may include 1 × 1 convolution kernels, and specifically, the first feature image and the second feature image after the stitching may be subjected to channel fusion in the channel dimension by using the 1 × 1 convolution kernels to obtain a fusion feature image, and details about the fusion process may be referred to in detail in relation to the 1 × 1 convolution kernels, which is not described herein again.

In a real scene, there is a possibility that a weak texture, a repetitive texture, or the like may exist in a preset scene. For example, when the building surface is a solid color (e.g., white, red, etc.), it is expressed as a weak texture, and when the building surface is combined with tiles having the same design and color, it is expressed as a repeated texture, which is not illustrated herein. Under the condition, the first characteristic image containing the local characteristic information and the second characteristic image containing the global characteristic information are fused, so that the pixel points in the fused characteristic image not only contain the image characteristics of the corresponding positions of the pixel points, but also contain the image characteristics of other positions, the resistance capability to weak textures and repeated textures can be remarkably improved, and the accuracy and the robustness of visual positioning are improved. Tests show that the landmark point detection is carried out by fusing the fused characteristic image obtained by fusing the first characteristic image and the second characteristic image under the condition of weak texture and repeated texture, the robustness and the accuracy of landmark point detection can be greatly improved,

step S13: and detecting to obtain a target landmark point in the image to be positioned based on the fusion characteristic image.

In an implementation scenario, as described above, the image to be located is obtained by shooting a preset scene, the target landmark point may be at least one of a plurality of landmark points of the preset scene, the plurality of landmark points are selected from a scene map of the preset scene, the scene map is obtained by performing three-dimensional modeling on the preset scene, and the plurality of landmark points are located at preset positions of each sub-region of the scene map respectively. For ease of description, several landmark points of a preset scene may be denoted as { q }₁,q₂,...,q_nThe target landmark points may be the plurality of landmark points { q }₁,q₂,...,q_nAt least one of. According to the mode, the surface of the scene map is uniformly divided into the plurality of sub-areas, the landmark points are selected from the central positions of the plurality of sub-areas, and the landmark points are uniformly distributed on the surface of the scene map, so that the image to be positioned is shot in the preset scene at any view angle, the image to be positioned contains enough landmark points, and the robustness of visual positioning can be improved.

In a specific implementation scene, a shot video of a preset scene can be collected in advance, and the shot video is processed by using a three-dimensional reconstruction algorithm to obtain a scene map of the preset scene. The three-dimensional reconstruction algorithm may include, but is not limited to: MultiView stereo, Kinect fusion, etc., without limitation thereto. The specific process of the three-dimensional reconstruction algorithm may refer to specific technical details thereof, which are not described herein again.

In a specific implementation scenario, several sub-regions are obtained by dividing the surface of the scene map. Specifically, the surface of the scene map may be divided into several sub-regions by a three-dimensional over-segmentation algorithm (e.g., supervolume). Referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of a scene map. As shown in fig. 2, the different grayscale regions represent different sub-regions of the scene map surface.

In a specific implementation scenario, the preset position may include a central position of the sub-region. With continuing reference to fig. 2, as shown in fig. 2, the black dots in the sub-area represent the landmark points determined in the sub-area.

In a specific implementation scenario, the area difference between the sub-regions may be lower than a first threshold, and the first region may be set according to an actual situation, for example, the area difference may be set as: 10 pixels, 15 pixels, 20 pixels, etc., without limitation. That is, the various sub-regions are of similar size.

In an implementation scenario, as described above, in order to improve the efficiency and accuracy of landmark detection, a landmark detection model may be trained in advance, so that the landmark detection model may be utilized to process the fused feature image to obtain a first landmark prediction image and a first direction prediction image, where the first landmark prediction image includes a predicted landmark attribute of a pixel point in an image to be located, the first direction prediction image includes a first direction attribute of a pixel point in an image to be located, the predicted landmark attribute is used to identify a landmark point corresponding to the pixel point, the first direction attribute includes first direction information pointing to landmark projection, and the landmark projection indicates a projection position of the landmark point corresponding to the pixel point in the image to be located. On the basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain a target landmark point. Specifically, the following related disclosure embodiments may be referred to in the training process of the landmark detection model, which is not repeated herein. Different from the mode, the first landmark prediction image comprises the landmark points corresponding to the pixel points, and the first direction prediction image comprises the direction information of the projection of the pixel points to the landmark points, so that the influence of a dynamic environment can be greatly reduced, and the positioning robustness can be improved.

In a specific implementation scenario, please refer to fig. 3 in combination, fig. 3 is a schematic diagram of an embodiment of detecting a target landmark point by using a landmark detection model. As shown in fig. 3, the landmark detection model may include an original feature extraction network, an attention mechanism network, a multi-scale feature extraction network, a landmark prediction network, and a direction prediction network, and then feature extraction may be performed on the image to be located by using the original feature extraction network to obtain a first feature image, and the first feature image may be input to the attention mechanism network and the multi-scale feature extraction network respectively to obtain a first global image and a second global image respectively, and the first global image and the second global image may be fused to obtain a second feature image, and the first feature image and the second feature image may be fused to obtain a fused feature image. For a specific process, reference may be made to the foregoing related description, which is not repeated herein. On the basis, the landmark prediction network is used for carrying out landmark prediction on the fusion characteristic image to obtain a first landmark prediction image, and the direction prediction network is used for carrying out direction prediction on the fusion characteristic image to obtain a first direction prediction image, namely the landmark prediction network and the direction prediction network are respectively responsible for predicting the landmark and the direction, and the landmark prediction network and the direction prediction network share the fusion characteristic image, so that the prediction efficiency can be improved.

In another specific implementation scenario, please continue to refer to fig. 3, for convenience of description, the pixels with the same predicted landmark attribute are displayed with the same gray scale, that is, the pixels displayed with the same gray scale in the first landmark predicted image shown in fig. 3 correspond to the same landmark points (e.g., the aforementioned landmark points { q ″₁,q₂,...,q_nA landmark point in). Further, for convenience of description, the direction prediction attribute of a pixel point can be represented by different gray scales in the first direction prediction image. As shown in the example in fig. 3, the 0-degree direction, the 45-degree direction, the 90-degree direction, the 135-degree direction, the 180-degree direction, the 225-degree direction, the 270-degree direction, and the 315-degree direction are represented in different gradations, respectively. It should be noted that the first landmark prediction image and the first direction prediction image shown in fig. 3 are only one possible expression form of an actual application process, and the landmark prediction attribute and the direction prediction attribute are represented by different gray levels, so that the prediction visualization of the landmark detection model can be realized. In the practical application process, the output results of the landmark prediction network and the direction prediction network may also be directly expressed by numbers, which is not limited herein.

In yet another specific implementation scenario, please refer to fig. 4 in combination, and fig. 4 is a schematic diagram of an embodiment of locating a target landmark point. As shown in fig. 4, a hollow circle in the drawing indicates a target landmark point located in an image to be located, a lower right rectangular frame region is an enlarged schematic view of the upper left rectangular frame region, as shown in the lower right rectangular frame region, pixel points with the same gray level indicate that the pixel points have the same predicted landmark attribute, and a directional arrow indicates a predicted direction attribute of the pixel points, so that the target landmark point identified by the predicted landmark attribute (e.g., { q) can be determined based on the same predicted landmark attribute₁,q₂,...,q_nA certain landmark point in the image to be located), and based on the prediction direction attributes of the pixel points having the same prediction landmark attribute, determine the position information (e.g., the position indicated by a solid circle in the drawing) of the target landmark point in the image to be located, for example, determine the position information of the target landmark point in the image to be located by determining the intersection point of the directional arrow shown in fig. 4. The specific process may refer to the related description in the following disclosed embodiments, and is not repeated herein.

In yet another specific implementation scenario, both the first landmark predicted image and the first direction predicted image may be the same size as the image to be located; alternatively, at least one of the first landmark prediction image and the first direction prediction image may be different in size from the image to be positioned.

In yet another specific implementation scenario, specifically, the deplab v3 may be used as a backbone network of landmark detection models, which can significantly expand the receptive field through spatial pyramid pooling.

Step S14: and obtaining the pose parameters of the image to be positioned based on the first position information of the target landmark point in the image to be positioned and the second position information of the target landmark point in the scene map.

In the embodiment of the present disclosure, it should be noted that the image to be positioned is obtained by shooting a preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene. Reference may be made to the foregoing description for details, which are not repeated herein.

It should be noted that the first position information of the target landmark point in the image to be positioned may be a two-dimensional coordinate, and the second position information of the target landmark point in the scene map may be a three-dimensional coordinate. In addition, as described above, the landmark point is selected from a scene map of the preset scene, and the scene map is obtained by three-dimensionally modeling the preset scene, so that the second position information of the landmark point in the scene map can be determined directly based on the scene map. On the basis, the landmark point with the mark corresponding to the target landmark point in the plurality of landmark points can be determined based on the mark of the target landmark point and the marks of the plurality of landmark points in the scene map, and the second position information of the corresponding landmark point is used as the second position information of the target landmark point. Referring to fig. 4, on the basis of obtaining a plurality of target landmark points (i.e. hollow circles in the drawing) through detection, a plurality of 2D-3D point pairs may be established based on first position information of the target landmark points in the image to be positioned and second position information of the target landmark points in the scene map, and based on the plurality of 2D-3D point pairs, the pose parameter (e.g. 6-degree-of-freedom parameter) of the image to be positioned may be recovered. Specifically, the pose parameters may be obtained by using a RANSAC (Random Sample Consensus) PnP algorithm. The specific algorithm steps can refer to the technical details of RANSAC PnP, and are not described herein again.

Referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of step S13 in fig. 1. As shown in fig. 5, the method may specifically include the following steps:

step S51: and processing the feature image to be fused by using a landmark detection model, and predicting to obtain a first landmark predicted image and a first direction predicted image.

In the embodiment of the disclosure, the first landmark prediction image includes a predicted landmark attribute of a pixel point in an image to be positioned, the first direction prediction image includes a first direction attribute of a pixel point in an image to be positioned, the predicted landmark attribute is used for identifying a landmark point corresponding to the pixel point, the first direction attribute includes first direction information pointing to landmark projection, and the landmark projection represents a projection position of the landmark point corresponding to the pixel point in the image to be positioned. In addition, both the first landmark prediction image and the first direction prediction image may have the same size as the image to be positioned, or at least one of the first landmark prediction image and the first direction prediction image may have a different size from the image to be positioned.

In one implementation scenario, as described in the foregoing disclosed embodiments, several landmark points may be denoted as { q }₁,q₂,...,q_nThe predicted landmark attribute may include the label of the landmark point corresponding to the pixel point, that is, in the case that the predicted landmark attribute includes i, the landmark point corresponding to the pixel point is q_i。

In one implementation scenario, the first direction information may specifically include a first direction vector, and the first direction vector points to a landmark projection. It should be noted that, in the case that the detection performance of the landmark detection model is excellent, the first direction vector predicted by the landmark detection model may accurately point to the landmark projection, in the practical application process, the detection performance of the landmark detection model is limited by various factors which may not be excellent, in this case, the first direction vector predicted by the landmark detection model may not be accurately directed to the landmark projection, if there may be an angular deviation (e.g., 1 degree, 2 degrees, 3 degrees, etc.) between the location pointed to by the first direction vector and the landmark projection, since each pixel point in the image to be positioned can predict to obtain a first direction vector, the direction deviation possibly existing in a single first direction vector can be corrected through the first direction vectors of a plurality of pixel points, and the specific process can refer to the following related description, which is not repeated herein.

In an implementation scenario, as described in the foregoing disclosed embodiment, the landmark detection model may include a landmark prediction network and a direction prediction network, and the landmark prediction network may be used to perform landmark prediction on the fused feature image to obtain a first landmark prediction image, and the direction prediction network may be used to perform direction prediction on the fused feature image to obtain a first direction prediction image. That is to say, the landmark prediction network and the direction prediction network may share the fused feature image, and specific reference may be made to the related description of the foregoing disclosed embodiment, which is not described herein again.

In a specific implementation scenario, as described above, the first direction information may include a first direction vector, and the first direction vector may specifically be a unit vector with a modulus value of 1.

In addition, for each pixel, a first feature representation of the pixel can be computed for each respective landmark point (e.g., q, as described above)₁,q₂,...,q_nAnd) representing the similarity between landmark features, and selecting a landmark point corresponding to the highest similarity as a landmark point corresponding to the pixel point, so that the pixel point can be identified by adopting the landmark point, and the predicted landmark attribute of the pixel point is obtained. For example, inner products between the first feature representations of the pixel points and the landmark feature representations of the landmark points may be calculated, and labels (e.g., 1, 2, … …, n, etc.) of the landmark points corresponding to the minimum inner products among a plurality of landmark points in a preset scene are selected to identify the landmark points, so as to obtain the predicted landmark attributes. Is obtained byAnd obtaining a first landmark prediction image after the predicted landmark attribute of each pixel point in the image to be positioned.

It should be noted that, if the similarity between the first feature representation of the pixel and the landmark feature representation of each landmark point is low (e.g., both are lower than a similarity threshold), the pixel may be considered as an invalid pixel (e.g., sky, ground, etc.) unrelated to the preset scene, and in this case, a special mark (e.g., 0) may be used for identification.

Step S52: and analyzing the first landmark predicted image and the first direction predicted image to obtain a target landmark point.

In an implementation scenario, a candidate region formed by pixel points having the same predicted landmark attribute may be obtained, that is, an image region formed by pixel points corresponding to the same landmark point may be used as a candidate region according to the predicted landmark attribute of the pixel points. On this basis, the consistency condition of the first direction attributes of the pixels in the candidate regions can be counted, that is, for each candidate region, the consistency condition of the first direction attributes of the pixels in the candidate region can be counted, so that the consistency condition of each candidate region can be obtained. Therefore, under the condition that the consistency condition meets the preset condition, the landmark point identified by the predicted landmark attribute of the pixel point in the candidate region is used as the target landmark point, and the first position information of the target landmark point in the image to be positioned is obtained based on the first direction attribute of the pixel point in the candidate region. According to the method, before the target landmark point is determined based on the predicted landmark attribute of the pixel point in the candidate region, the consistency of the first direction attribute of the pixel point in the candidate region is detected, so that the consistency of the first direction attribute of the pixel point in the candidate region can be ensured, the quality of subsequently constructed point pairs can be improved, and the accuracy and the robustness of visual positioning can be improved.

In a specific implementation scenario, in order to further improve accuracy and robustness of visual positioning, before statistics of a consistency of first direction attributes of pixel points in a candidate region, it may be further detected whether a region area of the candidate region is smaller than a second threshold, and if the region area of the candidate region is smaller than the second threshold, the candidate region may be filtered. By the aid of the method, unstable regions (such as regions which are easy to form changes along with natural conditions, such as grasses, trees and the like) can be filtered in advance, the quality of subsequently constructed point pairs can be further improved, and accuracy and robustness of visual positioning can be further improved.

In another specific implementation scenario, as described above, the first direction information may specifically include a first direction vector, and for each candidate region, an intersection point of the first direction vectors between the pixel points in the candidate region may be obtained first, and then an outlier rate of the intersection point is counted to obtain a consistency condition of the candidate region. In this case, the preset condition may be set that the external point rate is lower than the external point rate threshold, that is, as described above, there may be a direction deviation in the first direction vector predicted by the landmark detection model, in this case, the first direction vectors of the respective pixel points in the candidate region may not exactly intersect at one point (i.e., landmark projection), an external point rate threshold may be preset, and an RANSAC algorithm based on a straight line intersection model (i.e., RANSAC with a volume intersection model, which may refer to its relevant technical details and is not described herein again) is used to calculate the external point rate, if the external point rate of the candidate region is lower than the external point rate threshold, it may be considered that the direction consistency predicted by the landmark detection model for the candidate region is better, otherwise, if the external point rate of the candidate region is not lower than the external point rate threshold, it may be considered that the learning effect of the landmark detection model for the candidate region is poor or that the candidate region itself has a relatively large noise, to prevent subsequent impact on the accuracy and robustness of the visual localization, the candidate region may be directly filtered.

In yet another specific implementation scenario, toTaking the example that the selected area corresponds to the landmark point j, the initial position information of the landmark point j in the image to be positioned

The initial position information can be further optimized through an iterative algorithm similar to the EM iterative algorithm to obtain the first position information of the landmark point j in the image to be positioned, and the specific optimization process can refer to the technical details of the iterative algorithm of the EM, and is not described herein again. It should be noted that, as described above, in the iterative optimization process, if the consistency of the candidate region is not good enough, the candidate region may be directly discarded.

Referring to fig. 6, 7, 8 and 9 in combination, fig. 6 is a schematic diagram of an embodiment of performing visual localization by using SIFT (Scale-Invariant Feature Transform) features, fig. 7 is a schematic diagram of an embodiment of performing visual localization by using landmark points, fig. 8 is a schematic diagram of an embodiment of a first landmark predicted image, and fig. 9 is a schematic diagram of an embodiment of a first direction predicted image. Based on the first landmark prediction image shown in fig. 8, it can be counted that the area of the candidate region indicated by the arrow on the right side of fig. 7 in fig. 8 is too small, and therefore the unstable candidate region can be filtered (it can be seen from fig. 7 that the candidate region corresponds to a tree), and based on the first direction prediction image shown in fig. 9, it can be counted that the candidate region indicated by the arrow on the left side of fig. 7 in fig. 9 is not good in consistency, and therefore the candidate region can be filtered. On this basis, a target landmark point (indicated by an X mark in fig. 7) may be obtained based on the candidate regions remaining after the filtering. In addition, as for the meanings of the pixels with different gray levels in the first landmark prediction image shown in fig. 8 and the meanings of the pixels with different gray levels in the first direction prediction image shown in fig. 9, reference may be made to the foregoing related description, and details are not repeated here. Different from the above, as shown in fig. 6, by using the SIFT features to perform visual positioning, a huge number of feature points (as shown by hollow circles in fig. 6) can be obtained, and interference points corresponding to unstable regions such as trees exist in the feature points, so that on one hand, the subsequent visual positioning calculation amount is increased sharply due to the too large number of the feature points, and on the other hand, the accuracy and robustness of the subsequent visual positioning of the image are improved due to the extremely easy existence of the interference points in the feature points.

The scheme is that a fused feature image is processed by utilizing a landmark detection model to obtain a first landmark prediction image and a first direction prediction image, the first landmark prediction image comprises a predicted landmark attribute of a pixel point in an image to be positioned, the first direction prediction image comprises a first direction attribute of the pixel point in the image to be positioned, the predicted landmark attribute is used for identifying a landmark point corresponding to the pixel point, the first direction attribute comprises first direction information pointing to landmark projection, the landmark projection represents the projection position of the landmark point corresponding to the pixel point in the image to be positioned, on the basis, the first landmark prediction image and the first direction prediction image are analyzed to obtain a target landmark point, and the first landmark prediction image comprises the landmark point corresponding to each pixel point, and the first direction prediction image comprises direction information pointing to the landmark projection of each pixel point, therefore, the dynamic environment influence can be greatly reduced, and the positioning robustness is improved.

Referring to fig. 10, fig. 10 is a flowchart illustrating an embodiment of training a landmark detection model. Specifically, the method may include the steps of:

step S101: and respectively determining the projection area and the projection position of the sub-area and the landmark point in the sample image.

In the embodiment of the present disclosure, the meanings of the sub-region and the landmark point may refer to the related descriptions in the foregoing embodiments, and are not described herein again.

In one implementation scenario, the sample image is obtained by shooting a preset scene with the sample pose C. For each sub-region of the scene map, the sub-region can be projected to a sample image through the sample pose C and the camera internal parameter K to obtain a projection region of the sub-region in the sample image; similarly, for each landmark point, the aforementioned sample pose C and camera internal parameter K may also be projected to the sample image to obtain the projection position of the landmark point in the sample image. Taking landmark point projection as an example, for a number of landmark points { q }₁,q₂,...,q_nLandmark point q in_jIn other words, the projection position l in the sample image can be obtained by the following formula_j：

l_j＝f(q_j,K,C)……(1)

In the above formula (1), f represents a projection function, which may refer to a conversion process among a world coordinate system, a camera coordinate system, an image coordinate system, and a pixel coordinate system, and is not described herein again.

Step S102: and determining the sample landmark attribute and the sample direction attribute of the sample pixel point in the sample image based on the projection area and the projection position.

In the embodiment of the present disclosure, the sample landmark attribute is used to identify a sample landmark point corresponding to a sample pixel point, the sample landmark point is a landmark point included in a sub-region where a projection region covers the sample pixel point, and the sample direction attribute includes sample direction information pointing to a projection position of the sample landmark point corresponding to the sample pixel point.

For the landmark attribute of the sample, for convenience of description, taking a pixel point i in the sample image as an example, the position coordinate of the pixel point i in the sample image can be denoted as p_i＝(u_i,v_i) The pixel point i is covered by a projection area j, the projection area j is a projection area of a sub-area j in the scene map in the sample image, and the sub-area j comprises a landmark point q_jThen the sample landmark attribute of the pixel point i identifies the landmark point q_jSample landmark attributes, such as pixel point i, may include landmark point q_jAt several landmark points q₁,q₂,...,q_nLandmark point label j in. Other cases may be analogized, and no one example is given here. In addition, if a certain pixel point in the sample image is not covered by the projection area, the pixel point can be considered to correspond to the sky or some distant objects, and in this case, the sample landmark attribute of the pixel point is identified by using a special mark, for example, a plurality of landmark points { q } can be used₁,q₂,...,q_nThe landmark point is marked by a special mark (e.g. 0) which is irrelevant to the marking, so that the pixel point can be represented to have no effect on visual positioning.

For sample squareThe directional attribute may specifically include sample direction information that is a sample direction vector pointing to a projection position of the sample landmark. Further, the sample direction vector may be specifically a unit vector. For convenience of description, still taking the pixel point i in the sample image as an example, as described above, the sample landmark point corresponding to the pixel point i is the landmark point q_jAnd a landmark point q_jThe projection position in the sample image can be calculated by the above equation (1) (i.e. /)_j) Then the above-mentioned unit vector d_iCan be expressed as:

d_i＝(l_j-p_i)/||l_j-p_i||₂……(2)

step S103: and obtaining a sample landmark image and a sample direction image of the sample image respectively based on the sample landmark attribute and the sample direction attribute.

In one implementation scenario, both the sample landmark image and the sample orientation image may be the same size as the sample image, i.e., a first pixel in the sample landmark image is labeled with the sample landmark attribute of the corresponding sample pixel, and a second pixel in the sample orientation image is labeled with the sample orientation attribute of the corresponding sample pixel. That is to say, the first pixel point in the ith row and the jth column in the sample landmark image is marked with the sample landmark attribute of the sample pixel point in the ith row and the jth column in the sample image, and the second pixel point in the ith row and the jth column in the sample direction image is marked with the sample direction attribute of the sample pixel point in the ith row and the jth column in the sample image. Further, where the sample landmark attributes include landmark point labels, the sample landmark image may be annotated as

The resolution of the sample landmark image S is H × W, and each pixel value is an integer; similarly, where the sample direction attribute is represented as a sample direction vector, the sample direction image may be written as

That is, the resolution of the sample-direction image d is H × W, the number of channels is 2, and each pixel value in the channel image isAnd a real number, wherein a pixel value in one channel image represents one element of the sample direction vector and a pixel value in the other channel image represents the other element of the sample direction vector.

Step S104: and predicting the sample image by using the landmark detection model to obtain a second characteristic predicted image and a second direction predicted image of the sample image.

In this embodiment of the present disclosure, the second direction prediction image includes a second direction attribute of the sample pixel, where the second direction attribute includes second direction information pointing to a sample landmark projection, and the sample landmark projection represents a projection position of the sample landmark in the sample image.

In one implementation scenario, similar to the first direction information, the second direction information may specifically include a second direction vector pointing to the sample landmark projection. It should be noted that, under the condition that the detection performance of the landmark detection model is excellent, the second direction vector predicted by the landmark detection model may accurately point to the sample landmark projection, and during the training process, the performance of the landmark detection model gradually becomes excellent and is limited by various factors, and the detection performance of the landmark detection model may not reach an ideal state (i.e., 100% accuracy), under this condition, the second direction vector predicted by the landmark detection model may not accurately point to the sample landmark projection, for example, a certain angle deviation (e.g., 1 degree, 2 degrees, 3 degrees, etc.) may exist between the position pointed by the second direction vector and the sample landmark projection.

In an implementation scenario, as described above, the landmark detection model may include an original feature extraction network, an attention mechanism network, a multi-scale feature extraction network, a landmark prediction network, and a direction prediction network, on this basis, feature extraction may be performed on the sample image by using the original feature extraction network to obtain a first sample feature image, and the first sample feature image is input into the attention mechanism network and the multi-scale feature extraction network, respectively to obtain a first sample global image and a second sample global image, and the first sample global image and the second sample global image are fused to obtain a second sample feature image, and the first sample feature image and the second sample feature image are fused to obtain a fused sample feature image. On the basis, landmark prediction is carried out on the fusion sample characteristic image by utilizing a landmark prediction network to obtain a second characteristic prediction image, and direction prediction is carried out on the fusion sample characteristic image by utilizing a direction prediction network to obtain a second direction prediction image. For a specific process, reference may be made to the foregoing related description, which is not repeated herein.

Step S105: and obtaining a first loss based on the sample landmark image and the second characteristic prediction image, and obtaining a second loss based on the sample direction image and the second direction prediction image.

In one implementation scenario, as previously described, a landmark feature representation set P may be maintained and updated during training of the landmark detection model, the landmark feature representation set P including landmark points (e.g., the aforementioned { q } may be maintained and updated₁,q₂,...,q_n}) of the feature to be optimized. It should be noted that, during the first training, the feature representation to be optimized of each landmark point in the landmark feature representation set P may be obtained through random initialization. In addition, for convenience of description, the second feature prediction image may be denoted as E, and then the second feature representation of the pixel point i in the sample image may be denoted as E_i. In order to reduce the calculation load and resource consumption for calculating the first loss, an image region formed by sample pixel points with the same sample landmark attribute may be obtained, and then, for a sample pixel point i in the image region, the feature to be optimized of the sample landmark point identified by the sample landmark attribute may be represented as a positive case feature representation P of the sample pixel point i_i+And selecting a reference feature representation as a negative case feature representation P of the sample pixel point i_i-And the reference feature representation comprises a feature representation to be optimized other than the normative feature representation, that is, the feature representation to be optimized other than the normative feature representation can be selected from the landmark feature representation set P as the reference feature representation. On the basis, the second feature expression E of the sample pixel point i can be based on_iAnd positive case feature representation P_i+First similarity between and second feature representation E_iNegative example ofFeature representation P_i-And obtaining the sub-loss based on the sub-loss of the sample pixel point in the sample image. For example, the sub-losses of each pixel point in the sample image may be summed to obtain the first loss. In the above manner, on one hand, the second feature representation can be made to approach the positive example feature representation and to be separated from the negative example feature representation as much as possible by minimizing the first loss, so as to improve the prediction performance of the landmark prediction network, and on the other hand, the loss of calculating the second feature representation and all negative example classes is avoided by selecting one reference feature representation as the negative example feature representation, so that the calculation amount and the hardware consumption can be greatly reduced.

In a specific implementation scenario, the first similarity and the second similarity may be processed based on a triplet loss function to obtain a sub-loss, and the sub-losses of each sample pixel point in the sample image are summed to obtain a first loss

In the above formula (3), m represents the metric distance of triplet loss, sim represents the cosine similarity function, and specifically,

in another specific implementation scenario, before calculating the first similarity and the second similarity, the second feature representation of each sample pixel point may be normalized by L2, and on the basis, the first similarity between the normalized second feature representation and the positive case feature representation and the second similarity between the normalized second feature representation and the negative case feature representation may be calculated.

In yet another specific implementation scenario, please refer to fig. 11 in combination, and fig. 11 is a schematic diagram of an embodiment of calculating the first loss.As shown by the dotted line division in fig. 11, the sample image includes an image region formed by 4 sample pixel points having the same sample landmark attribute, and taking the lower-right image region as an example, the sample landmark points corresponding to the sample pixel points in the image region are all landmark points i⁺Then, the average feature representation of the second feature representation of the sample pixel points in the image region may be counted, specifically, the average value of the second feature representations of the sample pixel points in the image region may be taken to obtain the average feature representation M_i+M may then be represented based on the average features_i+And selecting a plurality of reference feature representations as candidate feature representations of the image area according to the similarity between the reference feature representations and the reference feature representations respectively. For example, a reference feature representation with similarity at a top preset order (e.g., top k) in order from high to low may be selected as a candidate feature representation of the image region (e.g., three to-be-optimized feature representations indicated by curved arrows in fig. 11). On the basis, when the negative case characteristic representation of each sample pixel point in the image area is obtained, uniform sampling can be carried out in the candidate characteristic representation, and the negative case characteristic representation of the sample pixel point is obtained. That is, since sample pixel points in the same image region are close to each other in space and should have similar feature representations, similar negative example feature representations can also be shared, and therefore, for each image region, only representative negative example feature representations need to be mined respectively, and thus each sample pixel point in the image region only needs to be sampled from the representative negative example feature representations. For example, for the sample pixel point 1, the sample pixel point 2, thesample pixel point 3, and the sample pixel point 4 in the image region, uniform sampling can be performed from the three to-be-optimized feature representations, to obtain corresponding negative case feature representations, for example, the to-be-optimized feature representations indicated by the bold arrows can be used as the respective negative case feature representations. For other image regions, the same can be said, and no further examples are given here. By the method, on one hand, the reference significance represented by the reference feature can be improved, and on the other hand, the complexity of selecting the negative example feature representation by each sample pixel point in the image area can be reduced.

At another placeIn one implementation scenario, as described above, in order to deal with the problem of heavy calculation load caused by a large number of landmark points (e.g., hundreds of landmark points, even tens of thousands of landmark points) in a real scenario, for a sample pixel point i in a sample image, a feature to be optimized of the sample landmark point identified by a sample landmark attribute may be used as a sample feature representation P of the sample pixel point i_i+And based on the processing result of the second feature representation of the sample pixel point by the locality sensitive hash, selecting a reference feature representation as the negative case feature representation P of the sample pixel point i_i-And obtaining the sub-loss of the sample pixel point i based on the first similarity between the second feature representation and the positive example feature representation and the second similarity between the second feature representation and the negative example feature representation, wherein the reference feature representation comprises the feature representation to be optimized except the positive example feature representation, and after the sub-loss of each sample pixel point is obtained through calculation, the first loss can be obtained based on the sub-loss of the sample pixel point in the sample image. It should be noted that, the feature representation to be optimized of the landmark point may refer to the foregoing related description; in addition, the related process for obtaining the first loss based on the sub-losses can also refer to the related description, and is not described herein again. In the above manner, the negative case feature representation is selected based on the processing result of the second feature representation of the sample pixel point by the locality sensitive hash, so that the calculation amount can be greatly reduced, and the response speed of the visual positioning can be improved.

In a specific implementation scenario, a second target partition where a sample pixel is located may be determined based on second feature representations of locality-sensitive hash mapping sample pixels, where the second target partition belongs to multiple second hash partitions, and the multiple second hash partitions are obtained by performing locality-sensitive hash processing on feature representations to be optimized of multiple landmark points, which may be specifically referred to the determination process of the first target partition and the mapping process of the first hash partition, and is not described herein again. On the basis, landmark points in the second target partition can be selected as second candidate points, the second candidate landmark points do not contain sample landmark points corresponding to the sample pixel points, the processing result comprises the second candidate landmark points, and then negative case feature representation of the sample pixel points is obtained based on the similarity between the second feature representation of the sample pixel points and the feature representation to be optimized of each second candidate landmark point. For example, the feature to be optimized of the second candidate landmark point corresponding to the highest similarity may be represented as the negative example feature representation of the sample pixel point. In the mode, the similarity between the feature representation to be optimized and the second feature representation of the sample pixel point does not need to be calculated for each landmark point, so that the calculation amount can be greatly reduced, and the training speed of the landmark detection model is improved.

In a specific implementation scenario, the first similarity and the second similarity may be processed based on a triplet loss function, so as to obtain a sub-loss of a sample pixel point. Reference may be made to the foregoing description for details, which are not repeated herein.

In an implementation scenario, as described above, the second direction attribute includes second direction information pointing to the sample landmark projection, for example, the second direction information may specifically include a second direction vector pointing to the sample landmark projection, and for convenience of description, the second direction vector marked by the sample pixel point i may be denoted as "sample pixel point" i

In addition, the sample direction vector marked by the sample pixel point i can be recorded as d_iFirst loss

In the above formula (4), l represents an indication function, S_iNot equal to 0 indicates a sample pixel point i in the sample landmark image S that identifies the corresponding sample local landmark point (i.e., a sample pixel point labeled as a special label such as 0 excluding the sample pixel point that represents a sky or a distant object).

Step S106: network parameters of the landmark detection model are optimized based on the first loss and the second loss.

In one implementation scenario, the user may be provided with a display,after obtaining the first loss and the second loss, the first loss and the second loss may be weighted and summed to obtain a total loss

In the above formula (5), λ represents a weighting factor. On the basis, the network parameters of the landmark detection model and the feature representation to be optimized can be optimized based on the total loss.

The project area and the project position of the sub-area and the earth surface point in the sample image are respectively determined, based on the project area and the project position, the sample landmark attribute and the sample direction attribute of the sample pixel point in the sample image are determined, the sample landmark attribute is used for identifying the sample landmark point corresponding to the sample pixel point, the sample landmark point is the landmark point contained in the sub-area of the project area covering the sample pixel point, the sample direction attribute comprises the sample direction information pointing to the project position of the sample landmark point corresponding to the sample pixel point, and based on the sample landmark attribute and the sample direction attribute, the sample landmark image and the sample direction image of the sample image are obtained, the sample landmark attribute of the corresponding sample pixel point is marked on the first pixel point in the sample landmark image, and the sample direction attribute of the corresponding sample pixel point is marked on the second pixel point in the sample directional diagram, on the basis, the landmark detection network is used for predicting the sample image to obtain a second characteristic predicted image and a second direction predicted image of the sample image, and the second feature prediction image comprises a second feature representation of the sample pixel, the second direction image comprises second direction information pointing to the sample landmark projection, and the sample landmark projection represents a projection position of the sample landmark in the sample image, thereby obtaining a first loss based on the sample landmark image and the second feature prediction image, and obtaining a second loss based on the sample direction image and the second direction prediction image, and optimizing network parameters of the landmark detection model based on the first loss and the second loss, so that a training sample can be accurately constructed, and the training of the landmark detection model is supervised by taking the pre-constructed sample landmark image and the sample direction image as a priori, so that the detection performance of the landmark detection model is favorably improved.

Referring to fig. 12, fig. 12 is a schematic diagram of a frame of avisual positioning apparatus 1200 according to an embodiment of the present application. Thevisual positioning apparatus 1200 includes: the system comprises afeature extraction module 1210, a feature fusion module 1220, alandmark detection module 1230, apose determination module 1240 and afeature extraction module 1210, wherein thefeature extraction module 1210 is used for extracting a first feature image and a second feature image of an image to be positioned; the first characteristic image comprises local characteristic information, and the second characteristic image comprises global characteristic information; a feature fusion module 1220, configured to fuse the first feature image and the second feature image to obtain a fused feature image; thelandmark detection module 1230 is used for detecting and obtaining a target landmark point in the image to be positioned based on the fusion feature image; thepose determining module 1240 is configured to obtain a pose parameter of the image to be positioned based on first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in a scene map, where the image to be positioned is obtained by shooting a preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene.

In some disclosed embodiments, thefeature extraction module 1210 is specifically configured to process the first feature image based on at least one of an attention mechanism and a multi-scale feature extraction network to obtain a second feature map.

In some disclosed embodiments, thefeature extraction module 1210 includes a first global extraction sub-module configured to process the first feature image based on an attention mechanism to obtain a first global image; thefeature extraction module 1210 comprises a second global extraction submodule and is used for fusing multi-scale feature images extracted by the multi-scale extraction network to obtain a second global image; thefeature extraction module 1210 includes a global fusion sub-module configured to fuse the first global image and the second global image to obtain a second feature image.

In some disclosed embodiments, the first global image and the second global image are both multi-channel images; the global fusion submodule comprises a channel shuffling unit, and is used for carrying out channel shuffling on the first global image and the second global image to obtain a third global image; the global fusion submodule comprises a channel fusion unit used for carrying out channel fusion on the third global image to obtain a second characteristic image.

In some disclosed embodiments, thelandmark detection module 1230 includes an image processing sub-module, configured to process the fused feature image using a landmark detection model to obtain a first landmark predicted image and a first direction predicted image; thelandmark detection module 1230 comprises an image analysis submodule, which is used for analyzing the first landmark prediction image and the first direction prediction image to obtain a target landmark point; the target landmark point is at least one of a plurality of landmark points of a preset scene, the landmark points are selected from a scene map of the preset scene, the first landmark prediction image comprises a prediction landmark attribute of a pixel point in an image to be positioned, the first direction prediction image comprises a first direction attribute of the pixel point in the image to be positioned, the prediction landmark attribute is used for identifying the landmark point corresponding to the pixel point, the first direction attribute comprises first direction information pointing to landmark projection, and the landmark projection represents the projection position of the landmark point corresponding to the pixel point in the image to be positioned.

In some disclosed embodiments, the landmark detection model includes a landmark prediction network, and the image processing sub-module includes a fused feature decoding unit, configured to decode the fused feature image by using the landmark prediction network to obtain a first feature prediction image; the first characteristic prediction image comprises a first characteristic representation of a pixel point in an image to be positioned; the image processing submodule comprises a landmark attribute determining unit, a landmark attribute predicting unit and a landmark attribute predicting unit, wherein the landmark attribute determining unit is used for processing the first characteristic representation of the pixel point based on the local sensitive Hash for each pixel point to obtain the predicted landmark attribute of the pixel point; the image processing submodule comprises a landmark image prediction unit used for obtaining a first landmark prediction image based on the predicted landmark attribute of each pixel point in the image to be positioned.

In some disclosed embodiments, the landmark attribute determination unit includes a first hash mapping subunit configured to determine, based on the first feature representation of the locality sensitive hash mapping pixel, a first target partition in which the pixel is located; the first target partition belongs to a plurality of first Hash partitions, the first Hash partitions are obtained by locally sensitive Hash processing of landmark feature representations of a plurality of landmark points, and the landmark feature representations are obtained after landmark detection model training convergence; the landmark attribute determining unit comprises a first candidate landmark screening subunit, and is used for selecting landmark points in the first target partition as first candidate landmark points; the landmark attribute determining unit comprises a predicted landmark attribute determining subunit and is used for obtaining the predicted landmark attributes of the pixel points based on the similarity between the first feature representation of the pixel points and the landmark feature representation of each first candidate landmark point.

In some disclosed embodiments, the target landmark point is detected by using a landmark detection model, and the target landmark point is at least one of a plurality of landmark points of a preset scene, the plurality of landmark points are selected from a scene map of the preset scene, the plurality of landmark points are respectively located at preset positions of each sub-area of the scene map, and the visual positioning apparatus 1200 further includes a projection obtaining module for respectively determining projection areas and projection positions of the sub-areas and the landmark points in the sample image; the visual positioning apparatus 1200 further includes an attribute determining module configured to determine a sample landmark attribute and a sample direction attribute of a sample pixel point in the sample image based on the projection region and the projection position; the sample landmark attribute is used for identifying sample landmark points corresponding to the sample pixel points, the sample landmark points are landmark points contained in a sub-region of a projection region covering the sample pixel points, and the sample direction attribute comprises sample direction information pointing to the projection positions of the sample landmark points corresponding to the sample pixel points; the visual positioning apparatus 1200 further includes a sample obtaining module, configured to obtain a sample landmark image and a sample direction image of the sample image based on the sample landmark attribute and the sample direction attribute, respectively; the first pixel point in the sample landmark image is marked with the sample landmark attribute of the corresponding sample pixel point, and the second pixel point in the sample direction image is marked with the sample direction attribute of the corresponding sample pixel point; the visual positioning apparatus 1200 further includes an image prediction module, configured to predict the sample image by using the landmark detection model, so as to obtain a second feature prediction image and a second direction prediction image of the sample image; the second characteristic prediction image comprises a second characteristic representation of the sample pixel point, the second direction prediction image comprises a second direction attribute of the sample pixel point, the second direction attribute comprises second direction information pointing to a sample landmark projection, and the sample landmark projection represents a projection position of the sample local landmark in the sample image; the visual positioning apparatus 1200 further includes a loss calculation module configured to obtain a first loss based on the sample landmark image and the second feature prediction image, and obtain a second loss based on the sample direction image and the second direction prediction image; the visual positioning apparatus 1200 further comprises a parameter optimization module for optimizing network parameters of the landmark detection model based on the first loss and the second loss.

In some disclosed embodiments, the loss calculation module includes a feature representation obtaining sub-module for obtaining feature representations to be optimized for respective landmark points; the loss calculation module comprises a positive example representation acquisition submodule and is used for representing the to-be-optimized features of the sample landmark points identified by the sample landmark attributes as the positive example feature representation of the sample pixel points; the loss calculation module comprises a negative example representation acquisition submodule and a negative example representation acquisition submodule, wherein the negative example representation acquisition submodule is used for selecting one reference feature representation as the negative example feature representation of the sample pixel point based on the processing result of the local sensitive hash on the second feature representation of the sample pixel point; the loss calculation module comprises a sub-loss calculation submodule and is used for obtaining sub-losses based on first similarity between the second feature representation and the positive example feature representation and second similarity between the second feature representation and the negative example feature representation; wherein the reference feature representation comprises a feature representation to be optimized in addition to the normative feature representation; the loss calculation module comprises a loss statistics submodule and is used for obtaining a first loss based on the sub-loss of the sample pixel points in the sample image.

In some disclosed embodiments, the negative case representation obtaining submodule includes a second hash mapping unit, configured to determine, based on a second feature representation of the locality sensitive hash mapping sample pixel points, a second target partition in which the sample pixel points are located; the second target partition belongs to a plurality of second Hash partitions, and the second Hash partitions are obtained by locally sensitive Hash processing of the feature representation to be optimized of a plurality of landmark points; the negative example representation acquisition sub-module comprises a second candidate landmark screening unit, and is used for selecting landmark points in a second target partition as second candidate landmark points; the second candidate landmark point does not contain a sample landmark point corresponding to the sample pixel point, and the processing result comprises the second candidate landmark point; the negative example representation obtaining submodule comprises a negative example representation determining unit which is used for obtaining the negative example feature representation of the sample pixel point based on the similarity between the second feature representation of the sample pixel point and the feature representation to be optimized of each second candidate landmark point.

In some disclosed embodiments, the sub-regions are obtained by dividing the surface of the scene map; and/or the preset position comprises the central position of the sub-area; and/or the area difference between the sub-regions is below a preset threshold.

Referring to fig. 13, fig. 13 is a block diagram of anelectronic device 1300 according to an embodiment of the present application. Theelectronic device 1300 comprises amemory 1301 and aprocessor 1302 coupled to each other, wherein theprocessor 1302 is configured to execute program instructions stored in thememory 1301 to implement the steps of any of the embodiments of the visual positioning method described above. In one particular implementation scenario, theelectronic device 1300 may include, but is not limited to: a microcomputer, a server, and theelectronic device 1300 may further include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.

In particular, theprocessor 1302 is configured to control itself and thememory 1301 to implement the steps of any of the visual positioning method embodiments described above.Processor 1302 may also be referred to as a CPU (Central Processing Unit). Theprocessor 1302 may be an integrated circuit chip having signal processing capabilities. TheProcessor 1302 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, theprocessor 1302 may be commonly implemented by integrated circuit chips.

According to the scheme, the accuracy and robustness of visual positioning can be improved.

Referring to fig. 14, fig. 14 is a block diagram illustrating an embodiment of a computer-readable storage medium 140 according to the present application. The computer readable storage medium 140 stores program instructions 141 capable of being executed by a processor, the program instructions 141 for implementing the steps of any of the embodiments of the visual positioning method described above.

The disclosure relates to the field of augmented reality, and aims to detect or identify relevant features, states and attributes of a target object by means of various visual correlation algorithms by acquiring image information of the target object in a real environment, so as to obtain an AR effect combining virtual and reality matched with specific applications. For example, the target object may relate to a face, a limb, a gesture, an action, etc. associated with a human body, or a marker, a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application can not only relate to interactive scenes such as navigation, explanation, reconstruction, virtual effect superposition display and the like related to real scenes or articles, but also relate to special effect treatment related to people, such as interactive scenes such as makeup beautification, limb beautification, special effect display, virtual model display and the like.

The detection or identification processing of the relevant characteristics, states and attributes of the target object can be realized through the convolutional neural network. The convolutional neural network is a network model obtained by performing model training based on a deep learning framework.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims

1. A visual positioning method, comprising:

extracting a first characteristic image and a second characteristic image of an image to be positioned; the first characteristic image comprises local characteristic information, and the second characteristic image comprises global characteristic information;

fusing the first characteristic image and the second characteristic image to obtain a fused characteristic image;

detecting and obtaining a target landmark point in the image to be positioned based on the fusion characteristic image;

obtaining a pose parameter of the image to be positioned based on first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in a scene map; the image to be positioned is obtained by shooting a preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene.

2. The method of claim 1, further comprising: and processing the first feature image based on at least one of an attention mechanism and a multi-scale feature extraction network to obtain the second feature image.

3. The method of claim 2, wherein the processing the first feature image to obtain the second feature image based on at least one of an attention mechanism and a multi-scale feature extraction network comprises:

processing the first feature image based on the attention mechanism to obtain a first global image, and fusing the multi-scale feature images extracted by the multi-scale extraction network to obtain a second global image;

and fusing the first global image and the second global image to obtain the second characteristic image.

4. The method of claim 3, wherein the first global image and the second global image are both multi-channel images; the fusing the first global image and the second global image to obtain the second feature image includes:

performing channel shuffling on the first global image and the second global image to obtain a third global image;

and carrying out channel fusion on the third global image to obtain the second characteristic image.

5. The method according to any one of claims 1 to 4, wherein the detecting a target landmark point in the image to be positioned based on the fused feature image comprises:

processing the fusion characteristic image by using a landmark detection model to obtain a first landmark prediction image and a first direction prediction image;

analyzing the first landmark prediction image and the first direction prediction image to obtain the target landmark point;

the target landmark point is at least one of a plurality of landmark points of the preset scene, the plurality of landmark points are selected from a scene map of the preset scene, the first landmark prediction image comprises a predicted landmark attribute of a pixel point in the image to be positioned, the first direction prediction image comprises a first direction attribute of the pixel point in the image to be positioned, the predicted landmark attribute is used for identifying a landmark point corresponding to the pixel point, the first direction attribute comprises first direction information pointing to a landmark projection, and the landmark projection represents a projection position of the landmark point corresponding to the pixel point in the image to be positioned.

6. The method of claim 5, wherein the landmark detection model comprises a landmark prediction network, and wherein processing the fused feature image using the landmark detection model to obtain a first landmark prediction image comprises:

decoding the fused feature image by using the landmark prediction network to obtain a first feature prediction image; the first characteristic prediction image comprises a first characteristic representation of a pixel point in the image to be positioned;

for each pixel point, processing the first characteristic representation of the pixel point based on local sensitive hash to obtain the predicted landmark attribute of the pixel point;

and obtaining the first landmark prediction image based on the predicted landmark attribute of each pixel point in the image to be positioned.

7. The method of claim 6, wherein the processing the first feature representation of the pixel based on locality sensitive hashing to obtain the predicted landmark attribute of the pixel comprises:

mapping a first characteristic representation of the pixel points based on the locality sensitive hash, and determining a first target partition where the pixel points are located; wherein the first target partition belongs to a plurality of first hash partitions, the plurality of first hash partitions are obtained by performing the locality sensitive hash processing on landmark feature representations of the plurality of landmark points, and the landmark feature representations are obtained after the landmark detection model training converges;

selecting the landmark points in the first target partition as first candidate landmark points;

and obtaining the predicted landmark attribute of the pixel point based on the similarity between the first feature representation of the pixel point and the landmark feature representation of each first candidate landmark point.

8. The method according to any one of claims 1 to 7, wherein the target landmark point is detected by using a landmark detection model, and the target landmark point is at least one of landmark points of the preset scene, the landmark points being selected from a scene map of the preset scene, the landmark points being respectively located at preset positions of sub-regions of the scene map, the training of the landmark detection model includes:

respectively determining a projection region and a projection position of the sub-region and the landmark point in a sample image;

determining sample landmark attributes and sample direction attributes of sample pixel points in the sample image based on the projection area and the projection position; the sample landmark attribute is used for identifying a sample landmark point corresponding to the sample pixel point, the sample landmark point is a landmark point contained in a sub-region of the projection region covering the sample pixel point, and the sample direction attribute comprises sample direction information pointing to the projection position of the sample landmark point corresponding to the sample pixel point;

obtaining a sample landmark image and a sample direction image of the sample image based on the sample landmark attribute and the sample direction attribute respectively; a first pixel point in the sample landmark image is marked with a sample landmark attribute of a corresponding sample pixel point, and a second pixel point in the sample direction image is marked with a sample direction attribute of a corresponding sample pixel point;

predicting the sample image by using the landmark detection model to obtain a second characteristic prediction image and a second direction prediction image of the sample image; wherein the second feature prediction image comprises a second feature representation of the sample pixel, the second direction prediction image comprises a second direction attribute of the sample pixel, the second direction attribute comprises second direction information pointing to a sample landmark projection, and the sample landmark projection represents a projection position of the sample landmark in the sample image;

obtaining a first loss based on the sample landmark image and the second characteristic prediction image, and obtaining a second loss based on the sample direction image and the second direction prediction image;

optimizing network parameters of the landmark detection model based on the first loss and the second loss.

9. The method according to claim 8, wherein said deriving a first loss based on the sample landmark image and the second feature prediction image comprises:

acquiring a feature representation to be optimized of each landmark point;

for the sample pixel point in the sample image, taking the feature representation to be optimized of the sample landmark point identified by the sample landmark attribute as positive case feature representation of the sample pixel point, selecting a reference feature representation as negative case feature representation of the sample pixel point based on a processing result of local sensitive hashing on second feature representation of the sample pixel point, and obtaining a sub-loss based on a first similarity between the second feature representation and the positive case feature representation and a second similarity between the second feature representation and the negative case feature representation; wherein the reference feature representation comprises a feature representation to be optimized in addition to the normative feature representation;

and obtaining the first loss based on the sub-losses of the sample pixel points in the sample image.

10. The method of claim 9, wherein selecting a reference feature representation as a negative case feature representation of the sample pixel based on the result of the processing of the second feature representation of the sample pixel by the locality sensitive hash comprises:

based on the second characteristic representation of the sample pixel points mapped by the local sensitive Hash, determining a second target partition where the sample pixel points are located; the second target partition belongs to a plurality of second Hash partitions, and the second Hash partitions are obtained by performing the locality sensitive Hash processing on the feature representations to be optimized of the landmark points;

selecting the landmark points in the second target partition as second candidate landmark points; wherein the second candidate landmark point does not include a sample landmark point corresponding to the sample pixel point, and the processing result includes the second candidate landmark point;

and obtaining the negative case characteristic representation of the sample pixel point based on the similarity between the second characteristic representation of the sample pixel point and the characteristic representation to be optimized of each second candidate landmark point.

11. The method of claim 8, wherein the sub-regions are partitioned from the surface of the scene map;

and/or the preset position comprises the central position of the sub-area;

and/or the area difference between the sub-regions is lower than a preset threshold value.

12. A visual positioning device, comprising:

the characteristic extraction module is used for extracting a first characteristic image and a second characteristic image of the image to be positioned; the first characteristic image comprises local characteristic information, and the second characteristic image comprises global characteristic information;

a feature fusion module, configured to fuse the first feature image and the second feature image to obtain a fused feature image;

the landmark detection module is used for detecting and obtaining a target landmark point in the image to be positioned based on the fusion characteristic image;

the pose determining module is used for obtaining pose parameters of the image to be positioned based on first position information of the target landmark point in the image to be positioned and second position information of the target landmark point in a scene map, wherein the image to be positioned is obtained by shooting a preset scene, and the scene map is obtained by performing three-dimensional modeling on the preset scene.

13. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the visual positioning method of any one of claims 1 to 11.

14. A computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the visual positioning method of any of claims 1 to 11.