Disclosure of Invention
Embodiments of the present application provide a monocular remote sensing image height data estimation method, apparatus, device, and storage medium, which improve local information extraction capability while accurately estimating the height of a target.
The application is realized by the following technical scheme:
In a first aspect, an embodiment of the present application provides a monocular remote sensing image height data estimation method, including:
And acquiring a monocular remote sensing image of the target object.
And inputting the monocular remote sensing image into a first convolution layer of a hybrid pooling patch embedding module to obtain a first feature map, and inputting the first feature map into a maximum pooling layer, an average pooling layer and a first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map.
And inputting the second feature map into a Transformer module to obtain a global feature map.
And inputting the global feature map, the second feature map and the first feature map into a decoder to sequentially perform image fusion to obtain a first output image.
And obtaining a height estimation value of the target object based on the first output image.
With reference to the first aspect, in some possible implementations, inputting the first feature map into the maximum pooling layer, the average pooling layer and the first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map includes:
And inputting the first feature map into the maximum pooling layer and the average pooling layer respectively, to obtain a maximum pooling map and an average pooling map.
And calculating the sum of the maximum pooling map and the average pooling map to obtain a pooling result map.
And inputting the pooling result map into the first depth-separable convolution layer to obtain the second feature map.
With reference to the first aspect, in some possible implementations, the Transformer module includes a Swin-LIE Block unit and a Patch Merging unit.
The Swin-LIE Block unit is a Swin Block unit in which the MLP module is replaced by a local information enhancement module, wherein the local information enhancement module comprises a dimension increasing layer, a second convolution layer, a second depth separable convolution layer, a third convolution layer and a dimension reducing layer.
Inputting the second feature map into the Transformer module to obtain a global feature map includes the following steps:
And inputting the second feature map into the Swin-LIE Block unit and the Patch Merging unit to obtain a global feature map.
With reference to the first aspect, in some possible implementations, the Transformer module includes a first Transformer module and a second Transformer module.
The first Transformer module comprises a first Swin-LIE Block unit and a first Patch Merging unit, and the second Transformer module comprises a second Swin-LIE Block unit and a second Patch Merging unit.
Inputting the second feature map into the Transformer module to obtain a global feature map comprises the following steps:
And inputting the second feature map into the first Swin-LIE Block unit and the first Patch Merging unit to obtain a first target image.
And inputting the first target image into the second Swin-LIE Block unit and the second Patch Merging unit to obtain a second target image.
And inputting the first target image and the second target image into a decoder for image fusion to obtain a global feature map.
With reference to the first aspect, in some possible implementations, inputting the second feature map into the first Swin-LIE Block unit and the first Patch Merging unit to obtain a first target image includes:
And inputting the second feature map into a dimension increasing layer to obtain a dimension-increased map.
And inputting the dimension-increased map into a second convolution layer to obtain a first convolution map.
And inputting the first convolution map into a second depth-separable convolution layer to obtain a depth-separable convolution map.
And adding the first convolution map and the depth-separable convolution map, and inputting the sum into a third convolution layer to obtain a second convolution map.
And inputting the second convolution map into a dimension reducing layer to obtain a dimension-reduced map.
And adding the dimension-reduced map and the second feature map, and inputting the sum into the first Patch Merging unit to obtain the first target image.
With reference to the first aspect, in some possible implementations, the method further includes:
And calculating the sum of the mean square error between the height estimation value and the true height value of the target object, the mean square error between the horizontal gradients of the height estimation value and the true height value, and the mean square error between the vertical gradients of the height estimation value and the true height value, to obtain an estimated error value.
And when the estimated error value is greater than or equal to a preset threshold value, adjusting the hyperparameters of the hybrid pooling patch embedding module and the Transformer module until the estimated error value is less than the preset threshold value.
With reference to the first aspect, in some possible implementations, before inputting the monocular remote sensing image into the hybrid pooling patch embedding module, the method further includes:
And preprocessing the monocular remote sensing image, and inputting the preprocessed monocular remote sensing image into the hybrid pooling patch embedding module.
In a second aspect, an embodiment of the present application provides a monocular remote sensing image height data estimation apparatus, including:
And the data acquisition module is used for acquiring the monocular remote sensing image of the target object.
The first processing module is used for inputting the monocular remote sensing image into the first convolution layer of the hybrid pooling patch embedding module to obtain a first feature map, and inputting the first feature map into the maximum pooling layer, the average pooling layer and the first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map.
And the second processing module is used for inputting the second feature map into the Transformer module to obtain a global feature map.
And the first fusion module is used for inputting the global feature map, the second feature map and the first feature map into the decoder to sequentially perform image fusion so as to obtain a first output image.
And the result output module is used for obtaining the height estimation value of the target object based on the first output image.
In a third aspect, an embodiment of the present application provides a terminal device, including a processor and a memory, where the memory is configured to store a computer program, and the processor implements the monocular remote sensing image height data estimation method according to any one of the first aspects when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the monocular remote sensing image height data estimation method according to any one of the first aspects.
It will be appreciated that the advantages of the second to fourth aspects may be found in the relevant description of the first aspect and are not repeated here.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
The present application processes the image with a hybrid pooling patch embedding module, which ensures better initial feature extraction of the image and thus reduces misjudgment in the height estimation result; the LIE module added to the subsequent Transformer module enhances attention to local information. By combining the advantages of the two existing estimation approaches (convolution-based and Transformer-based monocular remote sensing image height estimation), the scheme improves local information extraction capability while accurately estimating the height of the target.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when", "upon", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination" or "in response to a determination" or "upon detection of the [described condition or event]" or "in response to detection of the [described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiment of the application provides a monocular remote sensing image height data estimation method. Fig. 1 and fig. 2 are schematic flow charts of the monocular remote sensing image height data estimation method provided by an embodiment of the application. The method is described in detail below with reference to fig. 1 and fig. 2:
Step 101, obtaining a monocular remote sensing image of a target object.
Compared with other natural images, remote sensing images are characterized by large coverage, rich ground object information, and complex scenes and structures. Ground features differ widely in size and scale, such as automobiles, pedestrians, building roofs, and overpasses.
Step 102, inputting the monocular remote sensing image into the first convolution layer of the hybrid pooling patch embedding module to obtain a first feature map, and inputting the first feature map into the maximum pooling layer, the average pooling layer and the first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map.
In this embodiment, the hybrid pooling patch embedding module performs the primary extraction of the feature information in the remote sensing image by applying convolution to the image instead of partitioning it into patches, so that the resolution is reduced and the number of channels is increased while the spatial information among the parts of the image is preserved. A two-dimensional feature map with further reduced resolution is then obtained through maximum pooling and average pooling; this feature map still contains rich spatial information. The smooth output provided by average pooling improves the stability of the model, the peaks retained by maximum pooling provide stronger responses, and adding the two feature maps fuses this complementary information. (The convolution increases the number of channels and halves the image size, and the pooling halves the image size again while keeping the number of channels unchanged; through these two transformations, the module achieves the same partitioning effect as the original Patch Embedding of the Transformer while preserving good local information.) Owing to the pooling operations, the module also has a certain resistance to noise: maximum pooling ignores some insignificant details and noise, and average pooling reduces the influence of noise through smoothing. This matters because remote sensing images used for height estimation are easily affected by the long shooting distance and by weather, and therefore tend to contain a certain degree of noise. The module thus provides better initial feature extraction, reducing misjudgment in the height estimation result and improving overall accuracy. The output feature dimension and size of the module are the same as those of the Patch Embedding module, so it is well compatible with the subsequent Transformer module.
Illustratively, before inputting the monocular remote sensing image into the hybrid pooling patch embedding module, the method further comprises:
And preprocessing the monocular remote sensing image, and inputting the preprocessed monocular remote sensing image into the hybrid pooling patch embedding module.
Illustratively, as shown in fig. 3, inputting the first feature map into the maximum pooling layer, the average pooling layer and the first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map includes:
And inputting the first feature map into the maximum pooling layer and the average pooling layer respectively, to obtain a maximum pooling map and an average pooling map.
And calculating the sum of the maximum pooling map and the average pooling map to obtain a pooling result map.
And inputting the pooling result map into the first depth-separable convolution layer to obtain the second feature map.
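For reference, the following is a minimal PyTorch-style sketch of such a hybrid pooling patch embedding module. The layer sizes (a 7 × 7 stride-2 convolution with 64 output channels, 2 × 2 pooling, a 1 × 1 depthwise convolution) follow the example network described later; the class and argument names are illustrative assumptions, not part of the original disclosure.

import torch
import torch.nn as nn

class HybridPoolingPatchEmbed(nn.Module):
    # Sketch of the hybrid pooling patch embedding (HPPE) module:
    # convolution -> (max pooling + average pooling) -> depth-separable 1x1 convolution.
    def __init__(self, in_channels=3, embed_dim=64):
        super().__init__()
        # First convolution layer: halves the resolution and raises the channel count.
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=7, stride=2, padding=3)
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # First depth-separable convolution layer (1x1 depthwise, one filter per channel).
        self.dw_conv = nn.Conv2d(embed_dim, embed_dim, kernel_size=1, groups=embed_dim)

    def forward(self, x):
        first_feat = self.conv(x)                                        # first feature map
        pooled = self.max_pool(first_feat) + self.avg_pool(first_feat)   # sum of max and average pooling maps
        second_feat = self.dw_conv(pooled)                               # second feature map
        return first_feat, second_feat

# Shape check under the example sizes used later in the description:
# x = torch.randn(1, 3, 1024, 1024)
# f1, f2 = HybridPoolingPatchEmbed()(x)   # f1: (1, 64, 512, 512), f2: (1, 64, 256, 256)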
Step 103, inputting the second feature map into the Transformer module to obtain a global feature map.
The Transformer module illustratively includes a Swin-LIE Block unit and a Patch Merging unit.
As shown in FIG. 4, the Swin-LIE Block unit is a Swin Block unit in which the MLP module is replaced by a local information enhancement module, wherein the local information enhancement module (Local Information Enhancement, LIE) comprises a dimension increasing layer, a second convolution layer, a second depth-separable convolution layer, a third convolution layer, and a dimension decreasing layer.
By way of example, the LIE module enhances local features by reshaping the input one-dimensional image sequence into a two-dimensional image format, increasing the number of feature channels with a convolution operation, learning local information such as two-dimensional neighborhood relations among pixels with a depth-separable convolution, and finally restoring the original number of channels with a dimension-reduction operation. In this process, the LIE module adds the features before and after the depth-separable convolution and combines the enhanced one-dimensional image sequence with the original input sequence, so that the information is enhanced while the original information is retained and the original Transformer module pays more attention to local information. With this module, the height estimation result is clearer, with sharper textures and edges, alleviating the problems of lost detail information, blurred height maps, and missing details and textures in the estimation result.
Illustratively, the LIE module processes the data as follows:
① Sequence to image: the input one-dimensional image sequence data (of shape [B, L, C]) is unfolded into a two-dimensional form (of shape [B, H, W, C]) and then permuted to [B, C, H, W], where L = H × W, B is the number of images, L is the length of the image sequence in one-dimensional form, C is the number of image channels, and H and W are the height and width of the image in two-dimensional form. For example, an input of shape (1, 65536, 3) is unfolded into (1, 3, 256, 256).
② A 1 × 1 convolution is applied to the B × C × H × W two-dimensional image data obtained in the previous step to obtain a feature map of size B × 4C × H × W.
③ The B × 4C × H × W feature map obtained in the previous step is added to the feature map of the same size obtained by applying a 3 × 3 depth-separable convolution to it, yielding a B × 4C × H × W feature map containing local position information.
④ A 1 × 1 convolution is applied to the B × 4C × H × W data obtained in the previous step to reduce the number of channels, obtaining a feature map of size B × C × H × W.
⑤ Image to sequence: the B × C × H × W feature map in two-dimensional form is reshaped into B × C × (HW), and the dimensions are then permuted to obtain B × L × C sequence data.
⑥ The B × L × C data is added to the input sequence data of step ① to combine the information.
The module consists of 1 × 1 convolutions, a 3 × 3 depth-separable convolution, an image-to-sequence conversion part and a sequence-to-image conversion part.
Both the input and the output of the module are sequence data.
The effect is that the image sequence data processed by the Transformer is converted into two-dimensional data and the local information of the image is learned through convolution, so that while the Transformer allows the network to learn more global information, this module enhances the two-dimensional local information in the feature map.
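For illustration, a minimal PyTorch-style sketch of steps ① to ⑥ is given below. The channel expansion ratio of 4 follows steps ② to ④; the class name LocalInformationEnhancement and the exact mapping of the named sublayers onto these operations are assumptions made for this sketch.

import torch
import torch.nn as nn

class LocalInformationEnhancement(nn.Module):
    # Sketch of the LIE module: sequence -> image, 1x1 convolution (raise channels),
    # 3x3 depthwise convolution with a residual add, 1x1 convolution (restore channels),
    # image -> sequence, residual add with the input sequence.
    def __init__(self, dim, expand_ratio=4):
        super().__init__()
        hidden = dim * expand_ratio
        self.up = nn.Conv2d(dim, hidden, kernel_size=1)                                # step 2: C -> 4C
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)   # step 3: 3x3 depthwise
        self.down = nn.Conv2d(hidden, dim, kernel_size=1)                              # step 4: 4C -> C

    def forward(self, x, H, W):
        # x: [B, L, C] sequence data with L = H * W.
        B, L, C = x.shape
        img = x.transpose(1, 2).reshape(B, C, H, W)    # step 1: sequence -> image
        feat = self.up(img)
        feat = feat + self.dw(feat)                    # step 3: add features before/after the depthwise convolution
        feat = self.down(feat)
        seq = feat.reshape(B, C, L).transpose(1, 2)    # step 5: image -> sequence
        return seq + x                                 # step 6: combine with the original input sequence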
Inputting the second feature map into a transducer module to obtain a global feature map, wherein the method comprises the following steps:
And inputting the second feature map into a Swin-LIE Block unit and a PATCH MERGING unit to obtain a global feature map.
The transducer module comprises a first transducer module and a second transducer module.
The first transducer module comprises a first Swin-LIE Block unit and a first PATCH MERGING unit, and the second transducer module comprises a second Swin-LIE Block unit and a second PATCH MERGING unit.
The second feature map is input into a transducer module to obtain a global feature map, which comprises the following steps:
And inputting the second feature map into a first Swin-LIE Block unit and a first PATCH MERGING unit to obtain a first target image.
And inputting the first target image into a second Swin-LIE Block unit and a second PATCH MERGING unit to obtain a second target image.
And inputting the first target image and the second target image into a decoder for image fusion to obtain a global feature map.
Illustratively, as shown in fig. 5, inputting the second feature map into the first Swin-LIE Block unit and the first Patch Merging unit to obtain a first target image includes:
And inputting the second feature map into the dimension increasing layer to obtain a dimension-increased map.
And inputting the dimension-increased map into the second convolution layer to obtain a first convolution map.
And inputting the first convolution map into the second depth-separable convolution layer to obtain a depth-separable convolution map.
And adding the first convolution map and the depth-separable convolution map, and inputting the sum into the third convolution layer to obtain a second convolution map.
And inputting the second convolution map into the dimension reducing layer to obtain a dimension-reduced map.
And adding the dimension-reduced map and the second feature map, and inputting the sum into the first Patch Merging unit to obtain the first target image.
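As a rough sketch of how the MLP of a standard Swin Transformer block could be replaced by the LIE module sketched above (the window attention is abstracted behind an attn argument, and the placement of the layer normalization relative to the LIE sub-layer is an assumption here, not fixed by the original description):

import torch.nn as nn

class SwinLIEBlock(nn.Module):
    # Sketch: a Swin-style block whose MLP sub-layer is replaced by the LIE module above.
    # attn is any (shifted-)window self-attention module mapping [B, L, C] -> [B, L, C].
    def __init__(self, dim, attn):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn
        self.norm2 = nn.LayerNorm(dim)
        self.lie = LocalInformationEnhancement(dim)   # replaces the MLP of the original Swin block

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x))    # window attention sub-layer with residual connection
        x = self.lie(self.norm2(x), H, W)   # LIE sub-layer; the LIE adds its own residual internally (step 6)
        return x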
For example, as shown in fig. 6, the number of Transformer modules may be three; in view of the accuracy requirements of the solution in urban areas, the configuration with three Transformer modules may be regarded as a preferred solution. The number of Transformer modules is not limited here.
Step 104, inputting the global feature map, the second feature map and the first feature map into a decoder to sequentially perform image fusion to obtain a first output image.
As an example, according to the sequential fusion process of fig. 6, the process of sequentially fusing the global feature map, the second feature map and the first feature map in the decoder may be as follows:
The global feature map is first fused with the second feature map, and the result is then fused with the first feature map.
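As a rough sketch of one such fusion step (assuming, for illustration, that fusion is implemented as upsampling followed by concatenation with the skip feature map and a convolution; the exact fusion operation is not fixed by this description):

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Sketch of one decoder fusion step: upsample the deeper feature map,
    # concatenate it with the skip feature map, and fuse the result with a 3x3 convolution.
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        return self.fuse(torch.cat([x, skip], dim=1))

# Illustrative use: the global feature map is fused with the second feature map first,
# and the result is then fused with the first feature map, upsampling as needed at each step.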
Step 105, obtaining a height estimation value of the target object based on the first output image.
The method further comprises:
And calculating the sum of the mean square error between the height estimation value and the true height value of the target object, the mean square error between the horizontal gradients of the height estimation value and the true height value, and the mean square error between the vertical gradients of the height estimation value and the true height value, to obtain an estimated error value.
And when the estimated error value is greater than or equal to a preset threshold value, adjusting the hyperparameters of the hybrid pooling patch embedding module and the Transformer module until the estimated error value is less than the preset threshold value.
For example, when remote sensing image height estimation is performed, the estimated result often shows some unevenness: even if the overall height of a region is accurate, its surface may be uneven (with frequent fluctuations in the height values), which can be observed in a height map specially processed for visualization. The local information enhancement module (LIE) enhances the network's extraction of details, but it can also exacerbate such surface irregularities.
In the prior art, the loss is computed only between the predicted value and the true value (for example, using MAE or MSE). This fits the overall height values well, but it imposes no constraint on image details and edge information, so unevenness can appear in the predicted result. The calculation of the estimated error value is therefore designed so that the gradient constraint is taken into account: introducing the gradient loss improves the accuracy of object edges while avoiding the uneven appearance of regions of the same height.
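A minimal sketch of such an estimated error value (the sum of the mean square error of the heights and the mean square errors of their horizontal and vertical gradients), assuming PyTorch tensors of shape [B, 1, H, W]; equal weighting of the three terms is an assumption here:

import torch
import torch.nn.functional as F

def estimated_error(pred, target):
    # pred, target: predicted and ground-truth height maps of shape [B, 1, H, W].
    # Mean square error between the estimated and true height values.
    height_mse = F.mse_loss(pred, target)
    # Mean square error between the horizontal gradients (differences along the width axis).
    grad_x_mse = F.mse_loss(pred[..., :, 1:] - pred[..., :, :-1],
                            target[..., :, 1:] - target[..., :, :-1])
    # Mean square error between the vertical gradients (differences along the height axis).
    grad_y_mse = F.mse_loss(pred[..., 1:, :] - pred[..., :-1, :],
                            target[..., 1:, :] - target[..., :-1, :])
    # Estimated error value: the sum of the three mean square errors.
    return height_mse + grad_x_mse + grad_y_mse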
According to the monocular remote sensing image height data estimation method, the image is processed by the hybrid pooling patch embedding module, which ensures better initial feature extraction and reduces misjudgment in the height estimation result; the LIE module added to the subsequent Transformer module enhances attention to local information. By combining the advantages of the two estimation approaches (convolution-based and Transformer-based monocular remote sensing image height estimation), the scheme improves local information extraction capability while accurately estimating the height of the target.
To facilitate understanding of the present embodiment, the above embodiment is explained below by way of specific example 1.
Example 1:
(1) Data preprocessing
The dataset used in the method is the public US3D dataset, which contains satellite images and LiDAR-derived reference labels. US3D images are 2048 × 2048 pixels with a spatial resolution of 30-50 cm. The method trains and predicts on 1024 × 1024 pixel images obtained by 2× downsampling.
The Potsdam dataset is a remote sensing image dataset for urban semantic segmentation, containing 38 high-definition orthographic images with 5 cm resolution, each 6000 × 6000 pixels in size. Of these, 34 images are randomly selected as the training set and 4 as the validation set. Each image is divided into 36 overlapping remote sensing images of size 1024 × 1024 by a sliding window with a 1024 × 1024 window size and a stride of 995.
For other data, images of size 1024 × 1024 or smaller are directly resampled to 1024 × 1024, and images larger than 1024 × 1024 are divided into multiple 1024 × 1024 images with a sliding window.
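A simple sketch of this sliding-window tiling (a window of 1024 and a stride of 995 give a 6 × 6 grid of 36 overlapping tiles on a 6000 × 6000 Potsdam image); the function name and the NumPy array representation are illustrative:

import numpy as np

def sliding_window_tiles(image, window=1024, stride=995):
    # image: H x W x C array; returns a list of overlapping window x window tiles.
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            tiles.append(image[top:top + window, left:left + window])
    return tiles

# For a 6000 x 6000 image: window positions 0, 995, ..., 4975 along each axis, i.e. 6 x 6 = 36 tiles.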
(2) Network construction
A monocular remote sensing image height estimation network based on the hybrid pooling patch embedding module and the local information enhancement module is constructed.
The network forward propagation flow is as shown in fig. 6:
① Input image: a 1024 × 1024 × 3 monocular remote sensing image.
② Stage 1, HPPE module (hybrid pooling patch embedding module):
The input image first passes through a 7 × 7 convolution layer, which outputs a feature map of size 512 × 512 × 64.
Next, the feature map is subjected to max pooling and average pooling operations to obtain two 256 × 256 × 64 feature maps, respectively.
Finally, the two feature maps are added and fused by a 1 × 1 depth-separable convolution (DWConv) to obtain a 256 × 256 × 64 feature map.
③ Stage 2:
Two serially connected Swin-LIE Blocks receive the 256 × 256 × 64 feature map as input and output a 256 × 256 × 64 feature map.
The above 256 × 256 × 64 feature map is input into a Patch Merging module, which halves the feature map size and doubles the number of channels, outputting a 128 × 128 × 128 feature map.
④ Stage 3:
Two serially connected Swin-LIE Transformer Blocks receive the 128 × 128 × 128 feature map as input and output a 128 × 128 × 128 feature map.
The above 128 × 128 × 128 feature map is input into a Patch Merging module, which halves the feature map size and doubles the number of channels, outputting a 64 × 64 × 256 feature map.
⑤ Stage 4:
Ten serially connected Swin-LIE Transformer Blocks receive the preceding 64 × 64 × 256 feature map as input and output a 64 × 64 × 256 feature map.
The above 64 × 64 × 256 feature map is input into a Patch Merging module, which halves the feature map size and doubles the number of channels, outputting a 32 × 32 × 512 feature map.
⑥ Decoder Blocks:
The final output of the encoder is a 32 × 32 × 512 feature map, which is up-sampled layer by layer by four Decoder Blocks and fused with the skip-connection feature maps from the encoder, finally outputting a 1024 × 1024 × 16 feature map.
⑦ Height Head:
The height head receives the final 1024 × 1024 × 16 feature map from the decoder and outputs a 1024 × 1024 × 1 height map, which has the same size as the original image; each pixel corresponds to the height of the object at that position in the original image relative to the ground.
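The Patch Merging modules above behave like the standard Swin Transformer Patch Merging, halving the spatial size and doubling the channel count; a minimal sketch under that assumption:

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Sketch of Swin-style Patch Merging: group 2x2 neighbouring positions into the
    # channel dimension (C -> 4C), then reduce to 2C with a linear layer.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: [B, H, W, C] with even H and W.
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # [B, H/2, W/2, 4C]
        return self.reduction(self.norm(x))       # [B, H/2, W/2, 2C]

# e.g. a 256 x 256 x 64 feature map becomes 128 x 128 x 128, matching Stage 2 above.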
(3) Training network weights
① Initialize all parameters of the network using random initialization.
② Prepare the dataset: divide the preprocessed Potsdam or US3D dataset into a training set and a validation set.
③ Define the loss function (the calculation formula of the estimated error value).
④ Define the optimizer: Adam, with the initial learning rate, momentum and other hyperparameters set.
⑤ At the beginning of each epoch, divide the dataset into multiple batches.
For each batch of data, the following steps are performed:
1. Feed the input image into the network to obtain a predicted height map.
2. Compute the height loss and the height gradient loss between the predicted height map and the ground-truth height map.
3. Perform back propagation and update the network parameters.
⑥ Validate and save the model:
At the end of each epoch, the model performance is evaluated on the validation set. If the performance of the model on the validation set improves, the model weights are saved. Step ⑤ is repeated until the model performance stabilizes and no longer improves, and finally the model weights with the best performance are selected.
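Sketched below is a training loop corresponding to steps ① to ⑥ above, assuming PyTorch, the estimated_error loss sketched earlier, and a model object for the network; the function signature and hyperparameter values are illustrative only:

import copy
import torch
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs=100, lr=1e-4, batch_size=4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)                    # step 4: Adam optimizer
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)  # step 5: divide into batches
    val_loader = DataLoader(val_set, batch_size=batch_size)
    best_val, best_weights = float("inf"), None
    for epoch in range(epochs):
        model.train()
        for image, gt_height in train_loader:
            image, gt_height = image.to(device), gt_height.to(device)
            pred = model(image)                        # 1. predicted height map
            loss = estimated_error(pred, gt_height)    # 2. height loss plus gradient losses
            optimizer.zero_grad()
            loss.backward()                            # 3. back propagation
            optimizer.step()                           #    update the network parameters
        model.eval()
        with torch.no_grad():                          # step 6: evaluate on the validation set
            val_loss = sum(estimated_error(model(img.to(device)), h.to(device)).item()
                           for img, h in val_loader) / len(val_loader)
        if val_loss < best_val:                        # save weights when validation performance improves
            best_val, best_weights = val_loss, copy.deepcopy(model.state_dict())
    return best_weights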
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application.
Corresponding to the monocular remote sensing image height data estimation method described in the above embodiments, fig. 7 shows a block diagram of the monocular remote sensing image height data estimation apparatus according to the embodiment of the present application, and for convenience of explanation, only the portion related to the embodiment of the present application is shown.
Referring to fig. 7, the monocular remote sensing image height data estimation apparatus in an embodiment of the present application may include:
The data acquisition module 201 is configured to acquire a monocular remote sensing image of a target object.
The first processing module 202 is configured to input the monocular remote sensing image into the first convolution layer of the hybrid pooling patch embedding module to obtain a first feature map, and to input the first feature map into the maximum pooling layer, the average pooling layer and the first depth-separable convolution layer of the hybrid pooling patch embedding module to obtain a second feature map.
The second processing module 203 is configured to input the second feature map into the Transformer module to obtain a global feature map.
The first fusion module 204 is configured to input the global feature map, the second feature map and the first feature map into the decoder for sequential image fusion to obtain a first output image.
The result output module 205 is configured to obtain a height estimation value of the target object based on the first output image.
Illustratively, the first processing module 202 may be configured to:
And inputting the first feature map into the maximum pooling layer and the average pooling layer respectively, to obtain a maximum pooling map and an average pooling map.
And calculating the sum of the maximum pooling map and the average pooling map to obtain a pooling result map.
And inputting the pooling result map into the first depth-separable convolution layer to obtain the second feature map.
The Transformer module illustratively includes a Swin-LIE Block unit and a Patch Merging unit.
The Swin-LIE Block unit is a Swin Block unit in which the MLP module is replaced by a local information enhancement module, wherein the local information enhancement module comprises a dimension increasing layer, a second convolution layer, a second depth separable convolution layer, a third convolution layer and a dimension reducing layer.
The second processing module 203 may be configured to:
And inputting the second feature map into the Swin-LIE Block unit and the Patch Merging unit to obtain a global feature map.
The Transformer module comprises a first Transformer module and a second Transformer module.
The first Transformer module comprises a first Swin-LIE Block unit and a first Patch Merging unit, and the second Transformer module comprises a second Swin-LIE Block unit and a second Patch Merging unit.
The second processing module 203 may be configured to:
And inputting the second feature map into the first Swin-LIE Block unit and the first Patch Merging unit to obtain a first target image.
And inputting the first target image into the second Swin-LIE Block unit and the second Patch Merging unit to obtain a second target image.
And inputting the first target image and the second target image into the decoder for image fusion to obtain a global feature map.
Illustratively, the second processing module 203 may be configured to:
And inputting the second feature map into the dimension increasing layer to obtain a dimension-increased map.
And inputting the dimension-increased map into the second convolution layer to obtain a first convolution map.
And inputting the first convolution map into the second depth-separable convolution layer to obtain a depth-separable convolution map.
And adding the first convolution map and the depth-separable convolution map, and inputting the sum into the third convolution layer to obtain a second convolution map.
And inputting the second convolution map into the dimension reducing layer to obtain a dimension-reduced map.
And adding the dimension-reduced map and the second feature map, and inputting the sum into the first Patch Merging unit to obtain the first target image.
Illustratively, the result output module 205 is further configured to:
And calculating the sum of the mean square error between the height estimation value and the true height value of the target object, the mean square error between the horizontal gradients of the height estimation value and the true height value, and the mean square error between the vertical gradients of the height estimation value and the true height value, to obtain an estimated error value.
And when the estimated error value is greater than or equal to a preset threshold value, adjusting the hyperparameters of the hybrid pooling patch embedding module and the Transformer module until the estimated error value is less than the preset threshold value.
Illustratively, before inputting the monocular remote sensing image into the hybrid pooling patch embedding module, the first processing module 202 may be further configured to:
preprocess the monocular remote sensing image, and input the preprocessed monocular remote sensing image into the hybrid pooling patch embedding module.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The embodiment of the present application further provides a terminal device, referring to fig. 8, the terminal device 300 may include at least one processor 310, a memory 320, where the memory 320 is configured to store a computer program 321, and the processor 310 is configured to invoke and execute the computer program 321 stored in the memory 320 to implement the steps in any of the foregoing method embodiments, for example, steps 101 to 105 in the embodiment shown in fig. 1. Or the processor 310, when executing the computer program, performs the functions of the modules/units in the above-described apparatus embodiments, for example, the functions of the modules shown in fig. 7.
By way of example, the computer program 321 may be partitioned into one or more modules/units that are stored in the memory 320 and executed by the processor 310 to complete the present application. The one or more modules/units may be a series of computer program segments capable of performing specific functions for describing the execution of the computer program in the terminal device 300.
It will be appreciated by those skilled in the art that fig. 8 is merely an example of a terminal device and is not limiting of the terminal device and may include more or fewer components than shown, or may combine certain components, or different components, such as input-output devices, network access devices, buses, etc.
The processor 310 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 320 may be an internal storage unit of the terminal device, or may be an external storage device of the terminal device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), or the like. The memory 320 is used for storing the computer program and other programs and data required by the terminal device. The memory 320 may also be used to temporarily store data that has been output or is to be output.
The bus may be an industry standard architecture (Industry Standard Architecture, ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
The monocular remote sensing image height data estimation method provided by the embodiment of the application can be applied to terminal equipment such as computers, wearable equipment, vehicle-mounted equipment, tablet computers, notebook computers, netbooks and the like, and the embodiment of the application does not limit the specific type of the terminal equipment.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps in each embodiment of the monocular remote sensing image height data estimation method when being executed by a processor.
Embodiments of the present application provide a computer program product that, when run on a mobile terminal, enables the mobile terminal to perform the steps of the embodiments of the monocular remote sensing image height data estimation method described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when the computer program is executed by a processor, the steps of each of the method embodiments described above may be implemented. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least any entity or device capable of carrying the computer program code to the camera device/terminal equipment, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing embodiments are merely illustrative of the technical solutions of the present application, and not restrictive, and although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that modifications may still be made to the technical solutions described in the foregoing embodiments or equivalent substitutions of some technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.