CN115661779A - Monocular 3D detection frame prediction method and device - Google Patents

Monocular 3D detection frame prediction method and device

Info

Publication number
CN115661779A
CN115661779A (application CN202211379410.2A)
Authority
CN
China
Prior art keywords
dimensional
detection frame
monocular
depth
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211379410.2A
Other languages
Chinese (zh)
Inventor
陆强 (Lu Qiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd
Priority to CN202211379410.2A (CN115661779A)
Publication of CN115661779A
Legal status: Pending

Abstract

The invention relates to the technical field of automatic driving, and provides a monocular 3D detection frame prediction method and a monocular 3D detection frame prediction device, wherein the method comprises the following steps: acquiring a two-dimensional image shot by a monocular camera and a depth image corresponding to the two-dimensional image; and inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model, wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, a depth sample image and the corresponding 3D detection frame label, and the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image. In the invention, the two-dimensional image and the depth image are combined as the input of the monocular 3D detection frame prediction model, so that the depth prediction capability of the model is enhanced and the accuracy of the depth information predicted for the 3D detection frame is improved.

Description

Monocular 3D detection frame prediction method and device
Technical Field
The invention relates to the technical field of target detection, in particular to a monocular 3D detection frame prediction method and device.
Background
With the development of automatic driving technology, a monocular camera is installed at the front end of an automatic driving vehicle to identify objects in front of the vehicle and output a monocular 3D detection frame for each detected object. Monocular 3D detection frame prediction refers to detecting 3D object information from a 2D image captured by a single camera. Existing monocular 3D detection frame prediction generally detects targets in the 2D view; when the 3D detection frame is predicted in the 2D view, the depth information of the 3D detection frame is not predicted accurately enough, so the final depth prediction for the 3D detection frame is poor.
Disclosure of Invention
The invention provides a monocular 3D detection frame prediction method and device, which are used for solving the problem that the depth information prediction of a 3D detection frame in the prior art is not accurate enough.
The invention provides a monocular 3D detection frame prediction method, which comprises the following steps:
acquiring a two-dimensional image shot by a monocular camera and a depth image corresponding to the two-dimensional image;
inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, a depth sample image and a 3D detection frame label corresponding to the two images,
the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
According to the monocular 3D detection frame prediction method provided by the invention, the monocular 3D detection frame prediction model comprises: a two-dimensional feature extraction layer, a multi-scale view conversion layer and a 3D network output layer,
the two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image;
the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features;
and the 3D network output layer is used for outputting the 3D detection frame according to the three-dimensional characteristics.
According to the monocular 3D detection frame prediction method provided by the invention, the two-dimensional image and the depth image are input into a monocular 3D detection frame prediction model, and a 3D detection frame output by the monocular 3D detection frame prediction model is obtained, and the method comprises the following steps:
inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature;
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features;
and inputting the three-dimensional features into a 3D network output layer to output the 3D detection frame.
According to the monocular 3D detection frame prediction method provided by the present invention, before inputting the two-dimensional image and the depth image into the monocular 3D detection frame prediction model, the method further comprises: training the monocular 3D detection frame prediction model specifically comprises:
inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer respectively;
the two-dimensional characteristic extraction layer extracts two-dimensional sample characteristics of the two-dimensional sample image and depth sample characteristics of the depth sample image;
inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain three-dimensional sample characteristics;
the 3D network output layer outputs a 3D frame detection result according to the three-dimensional sample characteristics;
and substituting the 3D detection frame label and the 3D frame detection result into a first loss function, and finishing training when the first loss function is converged.
According to the monocular 3D detection frame prediction method provided by the present invention, the monocular 3D detection frame prediction model includes: a two-dimensional feature extraction layer, a multi-scale view conversion layer, a bird's-eye view feature extraction layer and a 3D network output layer,
the two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image;
the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features;
the aerial view feature extraction layer is used for extracting features of the three-dimensional features to obtain aerial view visual angle features;
the 3D network output layer is used for outputting the 3D detection frame according to the aerial view angle characteristics.
According to the monocular 3D detection frame prediction method provided by the invention, the two-dimensional image and the depth image are input into a monocular 3D detection frame prediction model, and a 3D detection frame output by the monocular 3D detection frame prediction model is obtained, and the method comprises the following steps:
inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature;
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features;
inputting the three-dimensional features into a bird-eye view feature extraction layer to obtain the bird-eye view angle features;
inputting the aerial view angle characteristics into a 3D network output layer to obtain the 3D detection frame.
According to the monocular 3D detection frame prediction method provided by the present invention, before inputting the two-dimensional image and the depth image into the monocular 3D detection frame prediction model, the method further comprises: training the monocular 3D detection frame prediction model specifically comprises:
inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer respectively;
the two-dimensional characteristic extraction layer extracts two-dimensional sample characteristics of the two-dimensional sample image and depth sample characteristics of the depth sample image;
inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain three-dimensional sample characteristics;
the aerial view characteristic extraction layer performs characteristic extraction on the three-dimensional sample characteristics to obtain aerial view angle sample characteristics;
the 3D network output layer outputs a 3D frame detection result according to the aerial view sample characteristics;
and substituting the 3D frame detection result and the 3D detection frame label under the aerial view angle into a first loss function, and finishing training when the first loss function is converged.
According to the monocular 3D detection frame prediction method provided by the invention, inputting the two-dimensional features and the depth features into a multi-scale view conversion layer and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features comprises the following steps:
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer;
the multi-scale view conversion layer expands the two-dimensional features according to different resolutions to obtain two-dimensional expansion features corresponding to the multiple resolutions, and the multiple two-dimensional expansion features are respectively subjected to coordinate fusion with the depth features to obtain fused three-dimensional expansion features;
and upsampling the three-dimensional extended features to the same resolution and combining the upsampled three-dimensional extended features into the three-dimensional features.
According to the monocular 3D detection frame prediction method provided by the present invention, the monocular 3D detection frame prediction model further includes: the 2D network output layer is used for outputting a 2D frame detection result according to the two-dimensional sample characteristics during model training, and the training of the monocular 3D detection frame prediction model further comprises:
inputting two-dimensional sample characteristics into the 2D network output layer to obtain a 2D frame detection result;
and substituting the 2D frame detection result and a preset 2D detection frame label into a second loss function, and finishing training when a third loss function compounded by the first loss function and the second loss function is converged, wherein the 2D detection frame label corresponds to the two-dimensional image.
The invention also provides a monocular 3D detection frame prediction device, which comprises:
the image acquisition module is used for acquiring a two-dimensional image shot by the monocular camera and a depth image corresponding to the two-dimensional image;
a model operation module for inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, a depth sample image and a 3D detection frame label corresponding to the two-dimensional sample image and the depth sample image,
and the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the monocular 3D detection frame prediction method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a monocular 3D detection frame prediction method as described in any one of the above.
According to the monocular 3D detection frame prediction method and device provided by the invention, the two-dimensional image shot by the monocular camera and the corresponding depth image are obtained, and the two-dimensional image and the depth image are input into the monocular 3D detection frame prediction model, so that the 3D detection frame output by the monocular 3D detection frame prediction model is obtained. Because the two-dimensional image and the depth image are combined to be used as the input of the monocular 3D detection frame prediction model, the depth prediction capability of the model is enhanced, and the accuracy of the depth information prediction of the 3D detection frame is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a monocular 3D inspection box prediction method according to the present invention;
FIG. 2 is a schematic structural diagram of a monocular 3D detection frame prediction model in the monocular 3D detection frame prediction method provided by the present invention;
FIG. 3 is a schematic structural diagram of another monocular 3D detection frame prediction model in the monocular 3D detection frame prediction method provided by the present invention;
FIG. 4 is a schematic structural diagram of a monocular 3D detection frame prediction device according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the monocular 3D detection frame prediction method according to the embodiment of the present invention includes:
step S110, acquiring a two-dimensional image and a depth image corresponding to the two-dimensional image captured by a monocular camera, wherein the monocular camera is usually installed at the front end of the vehicle and is used for capturing the vehicle, the obstacle, the guideboard and the like in front of the vehicle, and the captured image is a two-dimensional image. Depth images (depth images) are also called range images (range images) and are images in which the distance (depth) from an image capture to each point in a scene is defined as a pixel value. The depth image corresponding to the two-dimensional image is a depth image predicted by a pre-trained depth estimation model, or the two-dimensional image is shot and the corresponding depth image is obtained by a sensor such as a radar.
Step S120, inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model, where the 3D detection frame mainly covers the 3D outlines of vehicles ahead, obstacles, guideboards and the like. The monocular 3D detection frame prediction model is obtained by training based on two-dimensional sample images, depth sample images and the corresponding 3D detection frame labels, and the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
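As a minimal sketch of this two-step flow (an illustration only: the function name, tensor shapes and output layout are assumptions, not taken from the patent), the inference call could look as follows in PyTorch:

```python
import torch
import torch.nn as nn

def predict_3d_boxes(model: nn.Module, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    """Sketch of steps S110/S120: `rgb` is the two-dimensional image from the
    monocular camera (B, 3, H, W); `depth` is its corresponding depth image
    (B, 1, H, W), either predicted by a pre-trained depth-estimation model or
    obtained from a sensor such as a radar.  The model returns the 3D
    detection frames (e.g. one row per object: x, y, z, w, h, l, yaw)."""
    model.eval()
    with torch.no_grad():
        return model(rgb, depth)

# Illustrative call with dummy tensors standing in for a real camera frame
# and depth map (mono3d_model is assumed to be a trained model instance):
# boxes_3d = predict_3d_boxes(mono3d_model, torch.rand(1, 3, 384, 1280),
#                             torch.rand(1, 1, 384, 1280))
```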
In the monocular 3D detection frame prediction method provided in this embodiment, a two-dimensional image shot by a monocular camera and a depth image corresponding to the two-dimensional image are obtained, and the two-dimensional image and the depth image are input into a monocular 3D detection frame prediction model, so as to obtain a 3D detection frame output by the monocular 3D detection frame prediction model. Because the two-dimensional image and the depth image are combined to be used as the input of the monocular 3D detection frame prediction model, the depth prediction capability of the model is enhanced, and the accuracy of the depth information prediction of the 3D detection frame is improved.
As shown in fig. 2, a monocular 3D detection frame prediction model according to another embodiment of the present invention includes: the system comprises a two-dimensional feature extraction layer, a multi-scale view conversion layer and a 3D network output layer.
The two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image and outputting both to the multi-scale view conversion layer. Specifically, the two-dimensional feature extraction layer includes a network backbone layer and an encoding layer: the backbone layer performs convolution operations on the input two-dimensional image and depth image respectively to extract their features, and the extracted features are fed into the encoding layer. The encoding layer mainly up-samples and down-samples the input features, so that the semantic information of the two-dimensional features and the depth features becomes richer and the final prediction result more accurate.
And the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features. The two-dimensional feature contains the coordinate information of a pixel and the depth feature is the depth value of that pixel; coordinate fusion adds the depth value to the coordinate information to form a three-dimensional feature. For example, adding the depth information d to the two-dimensional feature [x, y] gives the three-dimensional feature [x, y, d], i.e. a three-dimensional feature vector.
And the 3D network output layer is used for outputting the 3D detection frame according to the three-dimensional characteristics.
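Put together, a compact PyTorch-style sketch of this three-layer structure could look as follows; the channel counts, layer sizes and the use of separate branches for the two-dimensional image and the depth image are assumptions for illustration (the patent does not specify them), and the view-conversion layer and 3D head are passed in as generic modules:

```python
import torch.nn as nn

class TwoDFeatureExtractor(nn.Module):
    """Two-dimensional feature extraction layer: a convolutional network
    skeleton followed by an encoding stage that re-samples the features."""
    def __init__(self, in_ch: int, out_ch: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.encoder = nn.Sequential(  # up/down-sampling to enrich semantics
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
    def forward(self, x):
        return self.encoder(self.backbone(x))

class MonoDet3D(nn.Module):
    """2D feature extraction -> multi-scale view conversion -> 3D output head."""
    def __init__(self, view_transform: nn.Module, head_3d: nn.Module):
        super().__init__()
        self.rgb_branch = TwoDFeatureExtractor(in_ch=3)    # two-dimensional image
        self.depth_branch = TwoDFeatureExtractor(in_ch=1)  # depth image
        self.view_transform = view_transform  # fuses 2D and depth features into 3D features
        self.head_3d = head_3d                # outputs the 3D detection frame
    def forward(self, rgb, depth):
        feat_2d = self.rgb_branch(rgb)
        feat_depth = self.depth_branch(depth)
        feat_3d = self.view_transform(feat_2d, feat_depth)
        return self.head_3d(feat_3d)
```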
Based on the monocular 3D detection frame prediction model of fig. 2, the step S120 includes:
and inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature.
And inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features.
And inputting the three-dimensional features into a 3D network output layer to output the 3D detection frame.
Specifically, the way in which the fusion yields three-dimensional features is as follows:
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer.
And the multi-scale view conversion layer expands the two-dimensional features according to different resolutions (such as 1/8, 1/16 and 1/32 of the resolution of the original image) to obtain two-dimensional expansion features corresponding to a plurality of resolutions, and performs coordinate fusion on the two-dimensional expansion features and the depth features respectively to obtain fused three-dimensional expansion features.
The three-dimensional extended features are upsampled to the same resolution (e.g., 1/8) and combined into the three-dimensional features. The three-dimensional features with richer semantics can be obtained through the processes of expanding, fusing and combining.
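As a sketch of this expand / fuse / up-sample process, the multi-scale view conversion layer could be written as below, assuming dense feature tensors and realizing coordinate fusion by attaching the depth feature as an extra channel; the scale set and all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleViewTransform(nn.Module):
    """Expand the 2D features to several resolutions, fuse each level with the
    depth feature (coordinate fusion), then up-sample all levels back to a
    common resolution and combine them into the three-dimensional features."""
    def __init__(self, scales=(1.0, 0.5, 0.25)):  # e.g. 1/8, 1/16, 1/32 of the image
        super().__init__()
        self.scales = scales

    def forward(self, feat_2d: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
        h, w = feat_2d.shape[-2:]
        fused_levels = []
        for s in self.scales:
            size = (max(1, int(h * s)), max(1, int(w * s)))
            f = F.interpolate(feat_2d, size=size, mode="bilinear", align_corners=False)
            d = F.interpolate(feat_depth, size=size, mode="bilinear", align_corners=False)
            # coordinate fusion: append the depth value to each pixel's
            # coordinate features, analogous to turning [x, y] into [x, y, d]
            fused_levels.append(torch.cat([f, d], dim=1))
        # up-sample every fused level back to the finest resolution and merge
        fused_levels = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                        for f in fused_levels]
        return torch.cat(fused_levels, dim=1)   # combined three-dimensional features
```

In this sketch the levels are simply concatenated after up-sampling; a learned or weighted merge would serve the same purpose.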
In this embodiment, before step S120, the method further includes: training the monocular 3D detection frame prediction model specifically comprises:
and respectively inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer.
The two-dimensional feature extraction layer extracts two-dimensional sample features of the two-dimensional sample image and depth sample features of the depth sample image.
And inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain the three-dimensional sample characteristics.
And the 3D network output layer outputs a 3D frame detection result according to the three-dimensional sample characteristics.
And substituting the 3D detection frame label and the 3D frame detection result into a first loss function, and finishing training when the first loss function is converged. Wherein the first loss function comprises the following three sub-loss functions:
the center point loss function is calculated using the focal point loss focal length.
And an offset loss function, wherein the offset refers to the target central point deviation caused by the coordinate rounding in the network down-sampling, and is calculated by adopting a regression loss function smooth L1 loss.
And 3D size loss function is calculated by adopting smooth L1 loss.
The three sub-loss functions finally obtain the first loss function loss _3d and loss_3d, which may be weighted sums of the three sub-loss functions, and the respective weights of the three sub-loss functions may be 1.
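A sketch of this first loss function is given below; the CenterNet-style form of the focal loss and the prediction/target tensor keys are assumptions (the patent only names focal loss and smooth L1 loss):

```python
import torch
import torch.nn.functional as F

def center_focal_loss(pred_hm, gt_hm, alpha=2.0, beta=4.0, eps=1e-6):
    """Center-point loss computed with focal loss (CenterNet-style variant,
    an assumption); pred_hm and gt_hm are center heatmaps with values in [0, 1]."""
    pred = pred_hm.clamp(eps, 1.0 - eps)
    pos = gt_hm.eq(1.0).float()
    pos_loss = -((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1.0 - gt_hm) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * (1.0 - pos)
    return (pos_loss.sum() + neg_loss.sum()) / pos.sum().clamp(min=1.0)

def loss_3d(pred, target, w_center=1.0, w_offset=1.0, w_size=1.0):
    """First loss function loss_3d: weighted sum of the three sub-losses,
    with all weights 1 by default as in the text; dict keys are illustrative."""
    l_center = center_focal_loss(pred["heatmap"], target["heatmap"])
    # offset: sub-pixel deviation of the center caused by coordinate rounding
    l_offset = F.smooth_l1_loss(pred["offset"], target["offset"])
    # 3D size of the detection frame
    l_size = F.smooth_l1_loss(pred["size_3d"], target["size_3d"])
    return w_center * l_center + w_offset * l_offset + w_size * l_size
```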
Further, the monocular 3D detection frame prediction model further includes: and the 2D network output layer is used for outputting a 2D frame detection result according to the two-dimensional sample characteristics when a model is trained. Based on the model structure, training the monocular 3D detection box prediction model further includes:
and inputting two-dimensional sample characteristics into the 2D network output layer to obtain the 2D frame detection result.
And substituting the 2D frame detection result and a preset 2D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing training when the third loss function loss, compounded from the first loss function loss_3d and the second loss function loss_2d, is converged.
Specifically, the second loss function includes the following three sub-loss functions:
the center point loss function is calculated using the focal point loss focal length.
And (3) an offset loss function, wherein the offset refers to the target central point deviation caused by the coordinate rounding in the network down-sampling, and the offset is calculated by adopting a regression loss function smooth L1 loss.
And (3) projecting a 3D frame to a 2D size loss function on the image, and calculating by using smooth L1 loss. The second loss function loss _2d may be a weighted sum of three sub-loss functions, each of which may have a weight of 1.
The third loss function is: loss =0.5 × loss _2d + loss _3d, where loss _2D is the first loss function, loss _3D is the second loss function, and the weight 0.5 is the result of training tuning, since the weight of the first loss function loss _3D is greater for the predictive 3D detection box.
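Expressed as code (a one-line sketch with illustrative names), the composite objective whose convergence ends training is:

```python
import torch

def third_loss(loss_3d: torch.Tensor, loss_2d: torch.Tensor) -> torch.Tensor:
    """Composite objective: the 2D branch only assists training, so the second
    loss function loss_2d is down-weighted by the tuned factor 0.5."""
    return 0.5 * loss_2d + loss_3d
```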
It should be noted that the 2D network output layer is used only during model training, where it plays an auxiliary role; after training is finished, only the 3D network output layer outputs the 3D detection frame. During training, the 2D information provides additional supervision that assists the main task, so that feature extraction is more robust, the semantic information is stronger, and the detection accuracy of the model is improved.
Since the 3D detection frame predicted by the monocular 3D detection frame prediction model shown in fig. 2 is not a detection frame under the bird's eye view (BEV), it cannot be used directly by the downstream planning control module; that module would need to perform further processing such as coordinate conversion on the detection result, which increases its processing burden. Therefore, a bird's-eye view feature extraction layer is added on top of the model of fig. 2 to perform feature extraction on the three-dimensional features and obtain bird's-eye view angle features.
Specifically, as shown in fig. 3, a monocular 3D detection frame prediction model according to another embodiment of the present invention includes: the system comprises a two-dimensional feature extraction layer, a multi-scale view conversion layer, a bird's-eye view feature extraction layer and a 3D network output layer.
The two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image and outputting both to the multi-scale view conversion layer. Specifically, the two-dimensional feature extraction layer includes a network backbone layer and an encoding layer: the backbone layer performs convolution operations on the input two-dimensional image and depth image respectively to extract their features, and the extracted features are fed into the encoding layer. The encoding layer mainly up-samples and down-samples the input features, so that the semantic information of the two-dimensional features and the depth features becomes richer and the final prediction result more accurate.
And the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features. The two-dimensional feature contains the coordinate information of a pixel and the depth feature is the depth value of that pixel; coordinate fusion adds the depth value to the coordinate information to form a three-dimensional feature. For example, adding the depth information d to the two-dimensional feature [x, y] gives the three-dimensional feature [x, y, d], i.e. a three-dimensional feature vector.
The bird's-eye view feature extraction layer is used for carrying out feature extraction on the three-dimensional features so as to obtain bird's-eye view angle features. Specifically, the bird's-eye view feature extraction layer converts the three-dimensional features into bird's-eye view angle features by convolution.
The 3D network output layer is used for outputting the 3D detection frame according to the aerial view angle characteristics.
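A sketch of such a bird's-eye view feature extraction layer is shown below, under the assumption that the fused three-dimensional features carry an explicit depth-bin axis that can be folded onto the ground plane; the tensor layout and channel sizes are assumptions, since the patent only states that the conversion is performed by convolution:

```python
import torch
import torch.nn as nn

class BEVFeatureExtractor(nn.Module):
    """Convert three-dimensional features into bird's-eye-view features with
    convolutions: the depth-bin axis is folded into the channel axis so every
    BEV cell aggregates its whole depth column, then 2D convolutions refine
    the result on the ground plane."""
    def __init__(self, in_ch: int, depth_bins: int, out_ch: int = 64):
        super().__init__()
        self.bev_conv = nn.Sequential(
            nn.Conv2d(in_ch * depth_bins, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, feat_3d: torch.Tensor) -> torch.Tensor:
        # feat_3d: (B, C, D, X, Y) voxel-like features; collapse C and D
        b, c, d, x, y = feat_3d.shape
        return self.bev_conv(feat_3d.reshape(b, c * d, x, y))
```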
Based on the monocular 3D detection frame prediction model in fig. 3, the step S120 includes:
and inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature.
And inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features.
And inputting the three-dimensional features into a bird-eye view feature extraction layer to obtain bird-eye view angle features.
Inputting the aerial view angle characteristics into a 3D network output layer to obtain the 3D detection frame.
Specifically, the way in which the fusion yields three-dimensional features is as follows:
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer.
And the multi-scale view conversion layer expands the two-dimensional features according to different resolutions (such as 1/8, 1/16 and 1/32 of the resolution of the original image) to obtain two-dimensional expansion features corresponding to a plurality of resolutions, and performs coordinate fusion on the two-dimensional expansion features and the depth features respectively to obtain fused three-dimensional expansion features.
The three-dimensional extended features are upsampled to the same resolution (e.g., 1/8) and combined into the three-dimensional features. Through the expanding, fusing and combining processes, three-dimensional features with richer semantics can be obtained.
Based on the monocular 3D detection frame prediction model in fig. 3, this embodiment obtains bird's-eye view angle features and predicts the 3D detection frame from those features, so that the predicted 3D detection frame can be used directly by the downstream planning control module.
Before the two-dimensional image and the depth image are input into the monocular 3D detection frame prediction model, the method further comprises the following steps: training the monocular 3D detection frame prediction model specifically comprises:
and respectively inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer.
The two-dimensional feature extraction layer extracts two-dimensional sample features of the two-dimensional sample image and depth sample features of the depth sample image.
And inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain the three-dimensional sample characteristics.
And the aerial view characteristic extraction layer performs characteristic extraction on the three-dimensional sample characteristics to obtain the aerial view angle sample characteristics.
And the 3D network output layer outputs a 3D frame detection result according to the bird's-eye view angle sample characteristics.
And substituting the 3D frame detection result and the 3D detection frame label under the aerial view angle into a first loss function, and finishing training when the first loss function is converged. Wherein the first loss function comprises the following three loss functions:
the center point loss function is calculated using the focal point loss focal length.
And an offset loss function, wherein the offset refers to the target central point deviation caused by the coordinate rounding in the network down-sampling, and is calculated by adopting a regression loss function smooth L1 loss.
And 3D size loss function is calculated by adopting smooth L1 loss.
The three sub-loss functions finally obtain the first loss function loss _3d and loss_3d can be weighted summation of the three sub-loss functions, and the respective weights of the three sub-loss functions can be 1
Further, the monocular 3D detection frame prediction model further includes: and the 2D network output layer is used for outputting a 2D frame detection result according to the two-dimensional sample characteristics during model training. Based on the model structure, training the monocular 3D detection box prediction model further includes:
and inputting two-dimensional sample characteristics into the 2D network output layer to obtain the 2D frame detection result.
And substituting the 2D frame detection result and a preset 2D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing training when the third loss function loss, compounded from the first loss function loss_3d and the second loss function loss_2d, is converged.
Specifically, the second loss function includes the following three sub-loss functions:
the center point loss function is calculated using the focal point loss focal length.
And (3) an offset loss function, wherein the offset refers to the target central point deviation caused by the coordinate rounding in the network down-sampling, and the offset is calculated by adopting a regression loss function smooth L1 loss.
And calculating a 2D size loss function of the 3D frame projected on the image by adopting smooth L1 loss. The second loss function loss _2d may be a weighted sum of three sub-loss functions, each of which may have a weight of 1.
The third loss function is: loss =0.5 × loss _2d + loss _3d, where loss _2D is the first loss function, loss _3D is the second loss function, and the weight 0.5 is the result of training tuning, since the first loss function loss _3D is weighted more heavily for the predictive 3D detection frame.
It should be noted that the 2D network output layer is used only during model training, where it plays an auxiliary role; after training is finished, only the 3D network output layer outputs the 3D detection frame. During training, the 2D information provides additional supervision that assists the main task, so that feature extraction is more robust, the semantic information is stronger, and the detection accuracy of the model is improved.
The following describes a monocular 3D detection frame prediction apparatus provided in the present invention, and the monocular 3D detection frame prediction apparatus described below and the monocular 3D detection frame prediction method described above may be referred to in correspondence with each other.
As shown in fig. 4, the monocular 3D detection frame prediction device provided by the present invention includes:
The image acquisition module 410 is configured to obtain a two-dimensional image captured by the monocular camera and a depth image corresponding to the two-dimensional image.
And the model operation module 420 is used for inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training on the basis of the two-dimensional sample image, the depth sample image and the 3D detection frame label corresponding to the depth sample image.
And the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
According to the monocular 3D detection frame prediction device provided by the invention, the two-dimensional image shot by the monocular camera and the depth image corresponding to the two-dimensional image are obtained, and the two-dimensional image and the depth image are input into the monocular 3D detection frame prediction model, so that the 3D detection frame output by the monocular 3D detection frame prediction model is obtained. Because the two-dimensional image and the depth image are combined to be used as the input of the monocular 3D detection frame prediction model, the depth prediction capability of the model is enhanced, and the accuracy of the depth information prediction of the 3D detection frame is improved.
Optionally, the monocular 3D detection frame prediction model includes: the system comprises a two-dimensional feature extraction layer, a multi-scale view conversion layer and a 3D network output layer.
The two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image.
And the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features.
And the 3D network output layer is used for outputting the 3D detection frame according to the three-dimensional characteristics.
Optionally, the model operation module 420 is specifically configured to:
and inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature.
And inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features.
And inputting the three-dimensional features into a 3D network output layer to output the 3D detection frame.
Optionally, the monocular 3D detection frame predicting device of the present invention further includes: the model training module is specifically configured to:
and respectively inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer.
The two-dimensional feature extraction layer extracts two-dimensional sample features of the two-dimensional sample image and depth sample features of the depth sample image.
And inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain the three-dimensional sample characteristics.
And the 3D network output layer outputs a 3D frame detection result according to the three-dimensional sample characteristics.
And substituting the 3D detection frame label and the 3D frame detection result into a first loss function, and finishing training when the first loss function is converged.
Optionally, the monocular 3D detection frame prediction model includes: the system comprises a two-dimensional feature extraction layer, a multi-scale view conversion layer, a bird's-eye view feature extraction layer and a 3D network output layer.
The two-dimensional feature extraction layer is used for extracting two-dimensional features of the two-dimensional image and depth features of the depth image.
And the multi-scale view conversion layer is used for carrying out coordinate fusion on the two-dimensional features and the depth features to obtain three-dimensional features.
The bird's-eye view feature extraction layer is used for carrying out feature extraction on the three-dimensional features so as to obtain bird's-eye view angle features.
The 3D network output layer is used for outputting the 3D detection frame according to the aerial view angle characteristics.
Optionally, the model operation module 420 is specifically configured to:
and inputting the two-dimensional image into a two-dimensional feature extraction layer to obtain the two-dimensional feature and the depth feature.
And inputting the two-dimensional features and the depth features into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional features and the depth features to obtain the three-dimensional features.
And inputting the three-dimensional features into a bird-eye view feature extraction layer to obtain the bird-eye view angle features.
And inputting the aerial view angle characteristics into a 3D network output layer to obtain the 3D detection frame.
Optionally, the monocular 3D detection frame predicting device of the present invention further includes: the model training module is specifically configured to:
and respectively inputting the two-dimensional sample image and the depth sample image into a two-dimensional feature extraction layer.
The two-dimensional feature extraction layer extracts two-dimensional sample features of the two-dimensional sample image and depth sample features of the depth sample image.
And inputting the two-dimensional sample characteristics and the depth sample characteristics into a multi-scale view conversion layer, and carrying out coordinate fusion on the two-dimensional sample characteristics and the depth sample characteristics by the multi-scale view conversion layer to obtain the three-dimensional sample characteristics.
And the aerial view characteristic extraction layer performs characteristic extraction on the three-dimensional sample characteristics to obtain the aerial view angle sample characteristics.
And the 3D network output layer outputs a 3D frame detection result according to the aerial view sample characteristics.
And substituting the 3D frame detection result and the 3D detection frame label under the aerial view angle into a first loss function, and finishing training when the first loss function is converged.
Optionally, the model operation module 420 is further specifically configured to:
inputting the two-dimensional features and the depth features into a multi-scale view conversion layer.
And the multi-scale view conversion layer expands the two-dimensional features according to different resolutions to obtain two-dimensional expansion features corresponding to a plurality of resolutions, and performs coordinate fusion on the two-dimensional expansion features and the depth features respectively to obtain fused three-dimensional expansion features.
And upsampling the three-dimensional extended features to the same resolution and combining the upsampled three-dimensional extended features into the three-dimensional features.
Optionally, the monocular 3D detection frame prediction model further includes: the 2D network output layer is used for outputting a 2D frame detection result according to the two-dimensional sample characteristics during model training, and the model training module is further used for:
and inputting two-dimensional sample characteristics into the 2D network output layer to obtain the 2D frame detection result.
And substituting the 2D frame detection result and a preset 2D detection frame label into a second loss function, and finishing training when a third loss function formed by compounding the first loss function and the second loss function is converged, wherein the 2D detection frame label corresponds to the two-dimensional sample image.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform a monocular 3D detection frame prediction method comprising:
and acquiring a two-dimensional image shot by the monocular camera and a depth image corresponding to the two-dimensional image.
And inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on two-dimensional sample images, depth sample images and the corresponding 3D detection frame labels.
And the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
In addition, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the monocular 3D detection frame prediction method provided by the above methods, the method including:
and acquiring a two-dimensional image shot by the monocular camera and a depth image corresponding to the two-dimensional image.
And inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on two-dimensional sample images, depth sample images and the corresponding 3D detection frame labels.
And the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a monocular 3D detection frame prediction method provided by performing the above methods, the method including:
and acquiring a two-dimensional image shot by the monocular camera and a depth image corresponding to the two-dimensional image.
And inputting the two-dimensional image and the depth image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training on the basis of the two-dimensional sample image, the depth sample image and the 3D detection frame label corresponding to the depth sample image.
And the monocular 3D detection frame prediction model is used for predicting the 3D detection frame according to the two-dimensional image and the depth image.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

CN202211379410.2A · Priority 2022-11-04 · Filed 2022-11-04 · Monocular 3D detection frame prediction method and device · Pending · CN115661779A (en)

Priority Applications (1)

Application Number: CN202211379410.2A (CN115661779A (en)) · Priority Date: 2022-11-04 · Filing Date: 2022-11-04 · Title: Monocular 3D detection frame prediction method and device

Applications Claiming Priority (1)

Application Number: CN202211379410.2A (CN115661779A (en)) · Priority Date: 2022-11-04 · Filing Date: 2022-11-04 · Title: Monocular 3D detection frame prediction method and device

Publications (1)

Publication Number: CN115661779A (en) · Publication Date: 2023-01-31

Family

ID=85015285

Family Applications (1)

Application Number: CN202211379410.2A (Pending, CN115661779A (en)) · Priority Date: 2022-11-04 · Filing Date: 2022-11-04 · Title: Monocular 3D detection frame prediction method and device

Country Status (1)

Country: CN — CN115661779A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN116524324A (en)* · 2023-04-19 · 2023-08-01 · 重庆长安汽车股份有限公司 · BEV model training method, device, system, vehicle and readable storage medium
CN116721418A (en)* · 2023-06-20 · 2023-09-08 · 驭势(上海)汽车科技有限公司 · Methods, devices, equipment and media for marking vehicle 3D detection frames


Similar Documents

Publication · Publication Date · Title
CN111209770B (en) Lane line recognition method and device
EP3489898B1 (en)Method and apparatus for estimating disparity
US11100401B2 (en)Predicting depth from image data using a statistical model
CN112381837B (en)Image processing method and electronic equipment
CN112348921A (en)Mapping method and system based on visual semantic point cloud
CN104834887B (en)Move pedestrian's representation method, recognition methods and its device
KR20200132468A (en)Advanced driver assist device and method of detecting object in the same
CN112528771A (en)Obstacle detection method, obstacle detection device, electronic device, and storage medium
CN115661779A (en)Monocular 3D detection frame prediction method and device
CN112651359A (en)Obstacle detection method, obstacle detection device, electronic apparatus, and storage medium
CN116543143A (en)Training method of target detection model, target detection method and device
CN115909268A (en)Dynamic obstacle detection method and device
CN114445648B (en) Obstacle identification method, device and storage medium
CN112529011A (en)Target detection method and related device
CN113597616A (en)Pupil position determination method, device and system
CN112184700A (en)Monocular camera-based agricultural unmanned vehicle obstacle sensing method and device
CN115496925A (en)Image processing method, apparatus, storage medium, and program product
CN119229108A (en) Image fusion segmentation method, device, equipment and storage medium
CN115527074B (en)Vehicle detection frame generation method and device and computer equipment
CN116597213B (en)Target detection method, training device, electronic equipment and storage medium
CN115760974A (en)Detection method and device for cut-off target
CN115661778A (en)Monocular 3D detection frame prediction method and device
CN113887294B (en) Wheel contact point detection method, device, electronic device and storage medium
CN115953446A (en) Depth estimation method, device, and electronic equipment based on error correction
CN115908486A (en)Vehicle speed estimation method and device

Legal Events

Date · Code · Title · Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
