The present invention relates to the field of image processing, and in particular to a depth recognition model training method, an image depth recognition method, and related devices.
In current schemes for performing depth recognition on vehicle-mounted images, an image segmentation network can be used to segment the image accurately, and depth recognition is then performed on the segmented image. However, because the parameters of the image segmentation network are numerous and complex, the segmentation step takes a long time, which makes the depth recognition of images inefficient. How to improve the efficiency of image depth recognition has therefore become a technical problem to be solved.
In view of the above, it is necessary to provide a depth recognition model training method, an image depth recognition method, and related devices that solve the technical problem of low image depth recognition efficiency.
A depth recognition model training method, the method comprising: acquiring a first image and a second image captured by a shooting device; constructing a ground plane segmentation network; segmenting the first image with the ground plane segmentation network to determine the ground plane area in the first image; generating a projection image of the first image based on the first image, the second image, the corresponding initial depth image, and a pose matrix corresponding to both the first image and the second image; generating a target height loss of a pre-acquired depth recognition network based on the depth recognition network, the shooting device, the initial depth image, and the ground plane area in the first image; calculating a depth loss of the depth recognition network from the gradient loss between the initial depth image and the first image and the photometric loss between the projection image and the first image; and adjusting the depth recognition network based on the depth loss and the target height loss to obtain a depth recognition model.
According to an optional embodiment of the present application, constructing the ground plane segmentation network comprises: obtaining a high-resolution network, the high-resolution network comprising a first-stage network, a second-stage network, a third-stage network, a fourth-stage network, and an output layer; removing the fourth-stage network and adjusting the number of branch networks in the output layer to the number of branch networks in the third-stage network, to obtain an output layer corresponding to each branch network of the third-stage network; counting the first channel number of the last convolutional layer in the backbone network of the first-stage network and adjusting the first channel number to a first preset value, to obtain a first resolution network; adjusting the channel number of the last convolutional layer in the second-stage network to obtain a second resolution network, and adjusting the channel number of the last convolutional layer in the third-stage network to obtain a third resolution network; splicing the first resolution network, the second resolution network, and the third resolution network to obtain an image resolution network; and splicing the image resolution network with the output layer corresponding to each branch network of the third-stage network to obtain the ground plane segmentation network.
According to an optional embodiment of the present application, the ground plane segmentation network further comprises a classification layer, and segmenting the first image with the ground plane segmentation network to obtain the ground plane area in the first image comprises: inputting the first image into the image resolution network of the ground plane segmentation network for processing, to obtain the third branch feature map output by each branch network of the third resolution network in the image resolution network; inputting the plurality of third branch feature maps into the corresponding output layers for convolution, to obtain the target feature map output by each output layer; fusing the plurality of target feature maps to obtain a fused feature map; inputting the fused feature map into the classification layer for classification, to obtain the initial object corresponding to each preset category in the first image; and selecting the area occupied by the initial object whose preset category is the ground plane category as the ground plane area.
According to an optional embodiment of the present application, inputting the first image into the image resolution network of the ground plane segmentation network for processing, to obtain the third branch feature map output by each branch network of the third resolution network, comprises: inputting the first image into the backbone network of the first resolution network for feature extraction, to obtain the first backbone feature map output by the backbone network of the first resolution network; inputting the first backbone feature map into each branch network of the first resolution network for convolution, to obtain the first branch feature map output by each branch network of the first resolution network; inputting the plurality of first branch feature maps into the second resolution network for feature extraction, to obtain the second branch feature maps output by the second resolution network; and inputting the plurality of second branch feature maps into the third resolution network for feature extraction, to obtain the third branch feature maps.
According to an optional embodiment of the present application, the first image and the second image contain the same initial object, and generating the projection image of the first image based on the first image, the second image, the corresponding initial depth image, and the pose matrix corresponding to both the first image and the second image comprises: determining the pixel points corresponding to the same initial object in the first image as first pixel points, and determining the pixel points corresponding to the same initial object in the second image as second pixel points; obtaining a first homogeneous coordinate matrix of the first pixel points and a second homogeneous coordinate matrix of the second pixel points; obtaining the inverse of the intrinsic matrix of the shooting device; calculating first camera coordinates of the first pixel points from the first homogeneous coordinate matrix and the inverse of the intrinsic matrix, and calculating second camera coordinates of the second pixel points from the second homogeneous coordinate matrix and the inverse of the intrinsic matrix; computing a rotation matrix and a translation matrix from the first camera coordinates and the second camera coordinates based on a preset epipolar constraint relation; splicing the rotation matrix and the translation matrix to obtain the pose matrix; obtaining the homogeneous coordinate matrix of each pixel point in the first image, and obtaining the depth value of each pixel point in the first image from the initial depth image; calculating the projection coordinates of each pixel point in the first image based on the pose matrix, the homogeneous coordinate matrix of each pixel point, and the depth value of each pixel point; and arranging the pixel points according to their projection coordinates to obtain the projection image.
According to an optional embodiment of the present application, generating the target height loss of the depth recognition network based on the shooting device, the initial depth image, and the ground plane area in the first image comprises: obtaining the real-world height from the optical center of the shooting device to the ground plane area; constructing a camera coordinate system based on the first image and the shooting device; calculating a projection height from the coordinates of each ground pixel point of the ground plane area in the camera coordinate system; and calculating the target height loss from the pixel coordinates of the pixel points in the initial depth image, the projection height, and the real-world height.
According to an optional embodiment of the present application, calculating the projection height from the coordinates of each ground pixel point of the ground plane area in the camera coordinate system comprises: obtaining the coordinates of any ground pixel point of the ground plane area in the camera coordinate system; calculating a unit normal vector from the coordinates of said ground pixel point; determining the vector starting at the optical center of the shooting device and ending at each ground pixel point as the target vector of that ground pixel point; calculating the projection distance corresponding to each ground pixel point from its target vector and the unit normal vector; and performing a weighted average over the projection distances of all ground pixel points to obtain the projection height.
According to an optional embodiment of the present application, generating the target height loss of the depth recognition network based on the pre-acquired depth recognition network, the shooting device, the initial depth image, and the ground plane area in the first image comprises: calculating the height ratio of the real-world height to the projection height; multiplying the height ratio by the pixel coordinates of each pixel point in the initial depth image to obtain the depth coordinate corresponding to each pixel point; generating a first height loss from the pixel coordinates and the corresponding depth coordinates of each pixel point in the initial depth image; multiplying the translation matrix by the height ratio to obtain a multiplied matrix; generating a second height loss from the multiplied matrix and the translation matrix; and generating the target height loss from the first height loss and the second height loss.
The present application provides an image depth recognition method, comprising: acquiring an image to be recognized, and inputting the image to be recognized into a depth recognition model to obtain a target depth image of the image to be recognized and the depth information of the image to be recognized, the depth recognition model being obtained by performing the depth recognition model training method described above.
The present application provides a computer device, comprising: a storage storing at least one instruction; and a processor executing the at least one instruction to implement the depth recognition model training method or the image depth recognition method described above.
The present application provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in a computer device to implement the depth recognition model training method and the image depth recognition method described above.
In summary, the present application constructs the ground plane segmentation network. Because the ground plane segmentation network is generated by trimming some parameters of a high-resolution network while preserving segmentation accuracy, the execution time of the ground plane segmentation network can be reduced, so that the ground plane area can be segmented accurately and quickly. The target height loss of the depth recognition network is generated based on the shooting device, the initial depth image, and the ground plane area in the first image. Because the target height loss is generated from the predicted height between the pixel points of the ground plane area and the shooting device and the real height between those pixel points and the shooting device, using the target height loss to adjust the depth recognition network makes the height the network predicts for each pixel point of the image to be recognized more accurate. The depth recognition network is adjusted based on the depth loss and the target height loss to obtain the depth recognition model, so that the depth recognition model can perform depth recognition on the image to be recognized comprehensively, from the photometry and gradients of the image to the predicted height between each pixel point and the shooting device. The depth information in the image to be recognized can thus be identified quickly and accurately, improving the efficiency of image depth recognition.
1: Computer device
2: Shooting device
12: Storage
13: Processor
101-107: Steps
108-109: Steps
Ouv: Pixel point
OXY: Optical point
Figure 1 is a diagram of the application environment provided by an embodiment of the present application.
Figure 2 is a flowchart of the depth recognition model training method provided by an embodiment of the present application.
Figure 3 is a schematic structural diagram of the ground plane segmentation network provided by an embodiment of the present application.
Figure 4 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided by an embodiment of the present application.
Figure 5 is a flowchart of the image depth recognition method provided by an embodiment of the present application.
Figure 6 is a schematic structural diagram of the computer device provided by an embodiment of the present application.
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
Figure 1 shows the application environment provided by an embodiment of the present application. The depth recognition model training method and image depth recognition method provided by the present application can be applied to one or more computer devices 1. The computer device 1 communicates with a shooting device 2, which may be a monocular camera or another device capable of capturing images.
The computer device 1 is a device capable of automatically performing parameter value calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to: a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and so on.
The computer device 1 may be any computer product capable of human-computer interaction with a user, such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, interactive Internet Protocol television (IPTV), or wearable smart device. The computer device 1 may also include network equipment and/or user equipment. The network equipment includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing. The network where the computer device 1 is located includes, but is not limited to: the Internet, wide area networks, metropolitan area networks, local area networks, virtual private networks (VPN), and so on.
Figure 2 is a flowchart of the depth recognition model training method provided by an embodiment of the present application. According to different needs, the order of the steps in the flowchart can be adjusted to the actual detection requirements, and some steps can be omitted. The method is executed by a computer device, such as the computer device 1 shown in Figure 1.
Step 101: Acquire the first image and the second image captured by the shooting device.
In at least one embodiment of the present application, the shooting device may be a monocular camera, and the first image and the second image are RGB (Red Green Blue) images of adjacent frames, with the second image generated later than the first image. The first image and the second image may contain initial objects such as vehicles, ground, pedestrians, sky, and trees, and the two images contain the same initial object.
Step 102: Construct a ground plane segmentation network.
In at least one embodiment of the present application, the ground plane segmentation network refers to a network that segments the ground plane out of an image.
In at least one embodiment of the present application, the computer device constructing the ground plane segmentation network includes: the computer device obtains a high-resolution network, the high-resolution network comprising a first-stage network, a second-stage network, a third-stage network, a fourth-stage network, and an output layer; the computer device removes the fourth-stage network and adjusts the number of branch networks in the output layer to the number of branch networks in the third-stage network, obtaining an output layer corresponding to each branch network of the third-stage network; the computer device counts the first channel number of the last convolutional layer in the backbone network of the first-stage network and adjusts the first channel number to a first preset value, obtaining a first resolution network; the computer device adjusts the channel number of the last convolutional layer in the second-stage network to obtain a second resolution network, and adjusts the channel number of the last convolutional layer in the third-stage network to obtain a third resolution network; further, the computer device splices the first resolution network, the second resolution network, and the third resolution network to obtain an image resolution network, and splices the image resolution network with the output layer corresponding to each branch network of the third-stage network to obtain the ground plane segmentation network.
The high-resolution network (High-Resolution Net v2, HRNet v2) can be obtained from the Internet, and the first preset value is smaller than the first channel number. For example, the first preset value may be half the first channel number. In at least one embodiment of the present application, the number of output layers is the same as the number of branch networks in the third-stage network, and each output layer is connected to the corresponding branch network of the third-stage network; for example, if the third-stage network has three branch networks, the number of output layers is adjusted to three. In at least one embodiment of the present application, the number of branch networks of a resolution network equals the number of backbone networks of the next resolution network; when there are multiple backbone networks, the backbone networks are connected to each other in parallel.
Figure 3 is a schematic structural diagram of the ground plane segmentation network provided by an embodiment of the present application. The first resolution network includes one backbone network and two branch networks, with the backbone network connected to each branch network; the second resolution network includes two backbone networks connected in parallel and three branch networks, with each backbone network connected to each branch network; the third resolution network includes three backbone networks connected in parallel and three branch networks; and there are three output layers. Each backbone network in the first, second, and third resolution networks contains four convolutional layers. When the last convolutional layer of the backbone network of the first-stage network stage1 has 64 channels, the last convolutional layer of the backbone network in the first resolution network branch1 may have 32 channels; when the last convolutional layer of the first backbone network of the second-stage network stage2 has 48 channels, the last convolutional layer of the first backbone network in the second resolution network branch2 may have 24 channels; when the last convolutional layer of the second backbone network of stage2 has 96 channels, the last convolutional layer of the second backbone network in branch2 may have 48 channels; when the last convolutional layer of the first backbone network of the third-stage network stage3 has 48 channels, the last convolutional layer of the first backbone network in the third resolution network branch3 may have 24 channels; when the last convolutional layer of the second backbone network of stage3 has 96 channels, the last convolutional layer of the second backbone network in branch3 may have 48 channels; and when the last convolutional layer of the third backbone network of stage3 has 192 channels, the last convolutional layer of the third backbone network in branch3 may have 96 channels.
In this embodiment, reducing the channel number of the last convolutional layer in the backbone network of each resolution network removes a large number of parameters from the ground plane segmentation network.
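To make the channel trimming concrete, the following is a minimal PyTorch sketch, not the actual network: it builds four-layer backbone blocks whose last convolution is trimmed to the halved channel counts given above (64 to 32, 48 to 24, 96 to 48). The module structure and names are illustrative assumptions, not HRNet v2 itself.

```python
import torch
import torch.nn as nn

def make_backbone(in_ch: int, last_ch: int) -> nn.Sequential:
    """Four convolutional layers; only the last layer's channel count is trimmed."""
    layers = []
    for _ in range(3):
        layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(in_ch, last_ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Halved channel counts from the example: stage1 64 -> branch1 32, etc.
branch1 = make_backbone(64, 32)
branch2_first = make_backbone(48, 24)
branch2_second = make_backbone(96, 48)

x = torch.randn(1, 64, 128, 416)   # dummy stage-1 feature map
print(branch1(x).shape)            # torch.Size([1, 32, 128, 416])
```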
Step 103: Segment the first image with the ground plane segmentation network to obtain the ground plane area in the first image.
In at least one embodiment of the present application, the ground plane segmentation network further includes a classification layer, and the computer device segmenting the first image with the ground plane segmentation network to obtain the ground plane area in the first image includes: the computer device inputs the first image into the image resolution network of the ground plane segmentation network for processing, obtaining the third branch feature map output by each branch network of the third resolution network in the image resolution network; further, the computer device inputs the plurality of third branch feature maps into the corresponding output layers for convolution, obtaining the target feature map output by each output layer; the computer device then fuses the plurality of target feature maps into a fused feature map; finally, the computer device inputs the fused feature map into the classification layer for classification, obtaining the initial object corresponding to each preset category in the first image, and selects the area occupied by the initial object whose preset category is the ground plane category as the ground plane area.
The preset categories may include, but are not limited to: ground plane, road, pedestrian, sky, and so on. The computer device inputs the target feature maps into a 1x1 convolutional layer for convolution to obtain the fused feature map. The classification layer may be a softmax layer. Specifically, the computer device inputting the first image into the image resolution network of the ground plane segmentation network for processing, to obtain the third branch feature map output by each branch network of the third resolution network, includes: the computer device inputs the first image into the backbone network of the first resolution network for feature extraction, obtaining the first backbone feature map output by that backbone network, and inputs the first backbone feature map into each branch network of the first resolution network for convolution, obtaining the first branch feature map output by each branch network of the first resolution network; further, the computer device inputs the plurality of first branch feature maps into the second resolution network for feature extraction, obtaining the second branch feature maps output by the second resolution network, and inputs the plurality of second branch feature maps into the third resolution network for feature extraction, obtaining the third branch feature maps.
In this embodiment, the feature extraction process of the second and third resolution networks is essentially the same as that of the first resolution network, so it is not repeated here. The backbone networks within the second and third resolution networks operate at different resolutions. Through the above implementation, because the parameters of the ground plane segmentation network are reduced, the running speed of the ground plane segmentation network can be increased, so that the ground plane area can be segmented quickly.
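As a hedged illustration of the fusion and classification step, the sketch below upsamples three target feature maps to a common size, concatenates them, fuses them with a 1x1 convolution, and classifies each pixel with softmax. The channel counts, class list, and ground-plane index are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4       # e.g. ground plane, road, pedestrian, sky (assumed)
GROUND_PLANE = 0      # assumed index of the ground plane category

fuse = nn.Conv2d(24 + 48 + 96, NUM_CLASSES, kernel_size=1)  # 1x1 fusion conv

def ground_plane_mask(f1, f2, f3):
    """f1/f2/f3: target feature maps output by the three output layers."""
    size = f1.shape[-2:]
    f2 = F.interpolate(f2, size=size, mode="bilinear", align_corners=False)
    f3 = F.interpolate(f3, size=size, mode="bilinear", align_corners=False)
    logits = fuse(torch.cat([f1, f2, f3], dim=1))   # fused feature map
    probs = F.softmax(logits, dim=1)                # classification layer
    return probs.argmax(dim=1) == GROUND_PLANE      # ground plane area

mask = ground_plane_mask(torch.randn(1, 24, 64, 208),
                         torch.randn(1, 48, 32, 104),
                         torch.randn(1, 96, 16, 52))
print(mask.shape)   # torch.Size([1, 64, 208])
```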
Step 104: Generate a projection image of the first image based on the first image, the second image, the corresponding initial depth image, and a pose matrix corresponding to both the first image and the second image.
In at least one embodiment of the present application, the initial depth image refers to an image containing depth information, where the depth information is the distance between the initial object corresponding to each pixel point in the first image and the shooting device; the shooting device may be a monocular camera. In at least one embodiment of the present application, the pose matrix refers to the transformation between the camera coordinates of a pixel point in the first image and the camera coordinates of the corresponding pixel point in the second image, where camera coordinates are the coordinates of each pixel point in the camera coordinate system.
Figure 4 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided by an embodiment of the present application. The computer device constructs the pixel coordinate system with the pixel point Ouv in the first row and first column of the first image as the origin, the line through the first row of pixel points as the u-axis, and the line through the first column of pixel points as the v-axis. In addition, the computer device constructs the camera coordinate system with the optical point OXY of the monocular camera as the origin, the optical axis of the monocular camera as the X-axis, the line parallel to the u-axis of the pixel coordinate system as the Y-axis, and the line parallel to the v-axis of the pixel coordinate system as the Z-axis.
In at least one embodiment of the present application, the projection image represents the image resulting from a transformation process, where the transformation process refers to the mapping between the pixel coordinates of the pixel points in the first image and the corresponding pixel coordinates in the second image.
In at least one embodiment of the present application, the computer device generating the projection image of the first image based on the first image, the second image, the corresponding initial depth image, and the pose matrix corresponding to both images includes: the computer device determines the pixel points corresponding to the same initial object in the first image as first pixel points and the pixel points corresponding to the same initial object in the second image as second pixel points; further, the computer device obtains the first homogeneous coordinate matrix of the first pixel points and the second homogeneous coordinate matrix of the second pixel points, and obtains the inverse of the intrinsic matrix of the shooting device; the computer device then calculates the first camera coordinates of the first pixel points from the first homogeneous coordinate matrix and the inverse of the intrinsic matrix, and the second camera coordinates of the second pixel points from the second homogeneous coordinate matrix and the inverse of the intrinsic matrix; furthermore, the computer device computes a rotation matrix and a translation matrix from the first camera coordinates and the second camera coordinates based on a preset epipolar constraint relation, and splices the rotation matrix and the translation matrix to obtain the pose matrix; the computer device obtains the homogeneous coordinate matrix of each pixel point in the first image and the depth value of each pixel point in the first image from the initial depth image; further, the computer device calculates the projection coordinates of each pixel point in the first image based on the pose matrix, the homogeneous coordinate matrix of each pixel point, and the depth value of each pixel point; finally, the computer device arranges the pixel points according to their projection coordinates to obtain the projection image.
Here, the depth value refers to the pixel value of each pixel point in the initial depth image. In at least one embodiment of the present application, generating the initial depth image by inputting the first image into the depth recognition network is prior art and is not described further here. The first homogeneous coordinate matrix of a first pixel point is a matrix with one more dimension than the pixel coordinate matrix, the extra dimension having the element value 1; the pixel coordinate matrix is generated from the first pixel coordinates of the first pixel point, i.e., the coordinates of the first pixel point in the pixel coordinate system. For example, if the first pixel coordinates of the first pixel point in the pixel coordinate system are $(u, v)$, its pixel coordinate matrix is $\begin{bmatrix} u \\ v \end{bmatrix}$, and its homogeneous coordinate matrix is $\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$. Multiplying the first homogeneous coordinate matrix by the inverse of the intrinsic matrix gives the first camera coordinates of the first pixel point, and multiplying the second homogeneous coordinate matrix by the inverse of the intrinsic matrix gives the second camera coordinates of the second pixel point. The second homogeneous coordinate matrix is generated in essentially the same way as the first homogeneous coordinate matrix, so it is not described further here.
The pose matrix can be expressed in terms of the rotation and translation matrices as: $pose = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$, where $pose$ is the 4x4 pose matrix, $R$ is the 3x3 rotation matrix, and $t$ is the 3x1 translation matrix.
The translation matrix and the rotation matrix are computed from the epipolar constraint: $K^{-1}p_1(t_{\times}R)(K^{-1}p_2)^T = 0$, where $K^{-1}p_1$ is the first camera coordinates, $K^{-1}p_2$ is the second camera coordinates, $p_1$ is the first homogeneous coordinate matrix, $p_2$ is the second homogeneous coordinate matrix, and $K^{-1}$ is the inverse of the intrinsic matrix.
In this embodiment, the translation matrix and the rotation matrix are computed; the rotation matrix gives the orientation of each pixel point in the first image, and the translation matrix gives the position of each pixel point in the first image.
Specifically, the projection coordinates of each pixel point in the projection image are computed as: $P = K \cdot pose \cdot Z \cdot K^{-1} \cdot H$, where $P$ is the projection coordinates of the pixel point, $K$ is the intrinsic matrix of the shooting device, $pose$ is the pose matrix, $K^{-1}$ is the inverse of $K$, $H$ is the target homogeneous coordinate matrix of each pixel point in the first image, and $Z$ is the depth value of the corresponding pixel point in the initial depth image.
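The following NumPy sketch walks one pixel through the projection formula $P = K \cdot pose \cdot Z \cdot K^{-1} \cdot H$: back-project with the inverse intrinsics and the depth value, transform with the 4x4 pose matrix, and re-project with the intrinsics. The intrinsics and pose values are dummies, not calibration data from the text.

```python
import numpy as np

K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])              # assumed intrinsic matrix K
pose = np.eye(4)
pose[0, 3] = 0.5                             # assumed pose: 0.5 m translation in X

def project(u, v, z):
    h = np.array([u, v, 1.0])                # homogeneous coordinate matrix H
    cam = z * (np.linalg.inv(K) @ h)         # back-projection: Z * K^-1 * H
    cam2 = (pose @ np.append(cam, 1.0))[:3]  # apply the 4x4 pose matrix
    p = K @ cam2                             # re-project with the intrinsics
    return p[:2] / p[2]                      # normalized projection coordinates

print(project(100.0, 200.0, z=10.0))         # projected pixel coordinates
```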
Step 105: Generate the target height loss of the depth recognition network based on the pre-acquired depth recognition network, the shooting device, the initial depth image, and the ground plane area in the first image.
In at least one embodiment of the present application, the depth recognition network refers to a network capable of identifying the depth information in an image. In at least one embodiment of the present application, the target height loss refers to the difference between a predicted height and a real-world height: the predicted height is the distance between each pixel point in the first image and the shooting device as predicted by the depth recognition network, and the real-world height is the actual distance between the initial object corresponding to a pixel point in the first image and the shooting device.
In at least one embodiment of the present application, the computer device generating the target height loss of the depth recognition network based on the pre-acquired depth recognition network, the shooting device, the initial depth image, and the ground plane area in the first image includes: the computer device obtains the real-world height from the optical center of the shooting device to the ground plane area; the computer device constructs a camera coordinate system based on the first image and the shooting device; further, the computer device calculates a projection height from the coordinates of each ground pixel point of the ground plane area in the camera coordinate system; furthermore, the computer device calculates the target height loss from the pixel coordinates of the pixel points in the initial depth image, the projection height, and the real-world height.
The camera coordinate system is constructed as shown in Figure 4. Specifically, the computer device calculating the projection height from the coordinates of each ground pixel point of the ground plane area in the camera coordinate system includes: the computer device obtains the coordinates of any ground pixel point of the ground plane area in the camera coordinate system; further, the computer device calculates a unit normal vector from the coordinates of said ground pixel point; the computer device determines the vector starting at the optical center of the shooting device and ending at each ground pixel point as the target vector of that ground pixel point; further, the computer device calculates the projection distance corresponding to each ground pixel point from its target vector and the unit normal vector; finally, the computer device performs a weighted average over the projection distances of all ground pixel points to obtain the projection height.
The unit normal vector is computed as: $N_t = (P_tP_t^T)^{-1}P_t$, where $N_t$ is the unit normal vector, $P_t$ is the coordinates of any ground pixel point of the ground plane area in the camera coordinate system, and $P_t^T$ refers to the target vector.
In this embodiment, the projection height is the weighted average of the projection distances between the pixel points in the first image and the shooting device. Because the coordinates of all pixel points in the ground plane area take part in the computation, the projection height is more accurate.
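A hedged NumPy sketch of the projection-height computation follows: it fits a plane normal to the ground pixels' camera coordinates by least squares (one reading of $N_t = (P_tP_t^T)^{-1}P_t$), projects each optical-center-to-ground-pixel vector onto the unit normal, and averages the projection distances. The ground coordinates are synthetic, and uniform weights stand in for the weighted average.

```python
import numpy as np

def projection_height(ground_pts: np.ndarray) -> float:
    """ground_pts: (n, 3) camera coordinates of ground pixel points."""
    # Least-squares plane normal (solve ground_pts @ n ~= 1), then normalize.
    n, *_ = np.linalg.lstsq(ground_pts, np.ones(len(ground_pts)), rcond=None)
    n /= np.linalg.norm(n)                  # unit normal vector
    # The optical center is the camera-frame origin, so each target vector
    # is simply the ground pixel's camera coordinate.
    dists = np.abs(ground_pts @ n)          # projection distance per pixel
    return float(dists.mean())              # uniform-weight average -> height

pts = np.random.rand(200, 3) * [10.0, 0.02, 30.0] + [0.0, 1.6, 0.0]
print(projection_height(pts))               # close to 1.6 for this dummy plane
```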
Specifically, the computer device calculating the target height loss from the pixel coordinates of the pixel points in the initial depth image, the projection height, and the real-world height includes: the computer device calculates the height ratio of the real-world height to the projection height; further, the computer device multiplies the height ratio by the pixel coordinates of each pixel point in the initial depth image, obtaining the depth coordinate corresponding to each pixel point; furthermore, the computer device generates the first height loss from the pixel coordinates and the corresponding depth coordinates of each pixel point in the initial depth image; the computer device multiplies the translation matrix by the height ratio to obtain a multiplied matrix; further, the computer device generates the second height loss from the multiplied matrix and the translation matrix; finally, the computer device generates the target height loss from the first height loss and the second height loss.
The first height loss is computed as: $L_d = \frac{1}{n}\sum_{i=1}^{n}\left|D_i^t(u,v) - D_i(u,v)\right|$, where $L_d$ is the first height loss, $n$ is the number of pixel points in the initial depth image, $i$ indexes the $i$-th pixel point in the initial depth image, $D_i^t(u,v)$ is the depth coordinate corresponding to the $i$-th pixel point in the initial depth image, and $D_i(u,v)$ is the pixel coordinates of the $i$-th pixel point in the initial depth image.
The second height loss is computed as: $L_{ts} = |t_s - t|$, where $L_{ts}$ is the second height loss, $t_s$ is the multiplied matrix, and $t$ is the translation matrix. The computer device performs a weighted average of the first height loss and the second height loss to obtain the target height loss.
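The sketch below combines the two height losses as just described, assuming the reconstructed forms above: an L1 difference between the ratio-scaled depths and the original depths for $L_d$, and $L_{ts} = |t_s - t|$ for the translation term, averaged with equal weights. All inputs are dummies.

```python
import torch

def target_height_loss(depth, t, real_h, proj_h, w1=0.5, w2=0.5):
    ratio = real_h / proj_h                     # height ratio
    depth_scaled = ratio * depth                # depth coordinate per pixel
    L_d = (depth_scaled - depth).abs().mean()   # first height loss
    L_ts = (ratio * t - t).abs().sum()          # second height loss |ts - t|
    return w1 * L_d + w2 * L_ts                 # weighted average

depth = torch.rand(1, 1, 128, 416) * 80.0       # dummy initial depth image
t = torch.tensor([0.5, 0.0, 0.1])               # dummy translation matrix
print(target_height_loss(depth, t, real_h=1.65, proj_h=1.5))
```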
Through the above implementation, the target height loss is computed from the pixel coordinates of the pixel points in the initial depth image, the projection height, and the real-world height; because the projection height is more accurate, the target height loss decreases faster.
Step 106: Calculate the depth loss of the depth recognition network from the gradient loss between the initial depth image and the first image and the photometric loss between the projection image and the first image.
In at least one embodiment of the present application, the depth loss includes a photometric loss and a gradient loss. Specifically, the depth loss is computed as: $L_c = L_t + L_s$, where $L_c$ is the depth loss, $L_t$ is the photometric loss, and $L_s$ is the gradient loss. The photometric loss is computed as: $L_t = \alpha\, SSIM(x,y) + (1-\alpha)\lVert x_i - y_i \rVert$, where $L_t$ is the photometric loss, $\alpha$ is a preset balance parameter typically set to 0.85, $SSIM(x,y)$ is the structural similarity index between the projection image and the first image, $\lVert x_i - y_i \rVert$ is the grayscale difference between the projection image and the first image, $x_i$ is the pixel value of the $i$-th pixel point of the projection image, and $y_i$ is the pixel value of the corresponding pixel point in the first image.
The structural similarity index is computed as: $SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$; $c_1 = (K_1L)^2$; $c_2 = (K_2L)^2$; where $SSIM(x,y)$ is the structural similarity index, $x$ is the projection image, $y$ is the first image, $\mu_x$ is the grayscale mean of the projection image, $\mu_y$ is the grayscale mean of the first image, $\sigma_x$ is the grayscale standard deviation of the projection image, $\sigma_y$ is the grayscale standard deviation of the first image, $\sigma_{xy}$ is the grayscale covariance between the projection image and the first image, $c_1$ and $c_2$ are preset parameters, $L$ is the maximum pixel value in the first image, and $K_1$ and $K_2$ are preset constants with $K_1 \ll 1$ and $K_2 \ll 1$.
Specifically, the computer device computes the gradient loss between the initial depth image and the first image as: $L_s = \frac{1}{n}\sum_{i=1}^{n}\left(\left|\partial_x D(u,v)\right|e^{-\left|\partial_x I(u,v)\right|} + \left|\partial_y D(u,v)\right|e^{-\left|\partial_y I(u,v)\right|}\right)$, where $L_s$ is the gradient loss, $x$ is the initial depth image, $y$ is the first image, $D(u,v)$ is the pixel coordinates of the $i$-th pixel point in the initial depth image, and $I(u,v)$ is the pixel coordinates of the $i$-th pixel point in the first image. In this embodiment, because the depth loss captures the photometric and gradient changes from each pixel point in the first image to the corresponding pixel point in the second image, it reflects the difference between the first image and the second image more accurately.
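A hedged PyTorch sketch of the depth loss $L_c = L_t + L_s$ follows, using a 3x3 average-pooled SSIM and the edge-aware gradient form reconstructed above. The photometric formula is kept verbatim from the text; note that many public implementations use $\alpha(1 - SSIM)/2$ instead, so treat the sign convention here as an assumption.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))).mean()

def depth_loss(proj, first, depth, alpha=0.85):
    # Photometric loss between the projection image and the first image.
    L_t = alpha * ssim(proj, first) + (1 - alpha) * (proj - first).abs().mean()
    # Edge-aware gradient loss between the initial depth image and the first image.
    dDx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dDy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dIx = (first[..., :, 1:] - first[..., :, :-1]).abs().mean(1, keepdim=True)
    dIy = (first[..., 1:, :] - first[..., :-1, :]).abs().mean(1, keepdim=True)
    L_s = (dDx * torch.exp(-dIx)).mean() + (dDy * torch.exp(-dIy)).mean()
    return L_t + L_s

proj, first = torch.rand(1, 3, 128, 416), torch.rand(1, 3, 128, 416)
depth = torch.rand(1, 1, 128, 416)
print(depth_loss(proj, first, depth))
```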
Step 107: Adjust the depth recognition network based on the depth loss and the target height loss to obtain the depth recognition model.
In at least one embodiment of the present application, the depth recognition model refers to the model generated after adjusting the depth recognition network. In at least one embodiment of the present application, the computer device adjusting the depth recognition network based on the depth loss and the target height loss to obtain the depth recognition model includes: the computer device calculates the overall loss of the depth recognition network based on the depth loss and the target height loss; further, the computer device adjusts the depth recognition network based on the overall loss until the overall loss falls to a minimum, obtaining the depth recognition model. In at least one embodiment of the present application, the depth loss and the target height loss are combined by weighted average to obtain the overall loss. In this embodiment, the overall loss includes the depth loss and the target height loss; because the depth loss reflects the difference between the first image and the second image more accurately, adjusting the depth recognition network based on the overall loss improves the network's learning ability and makes the recognition accuracy of the depth recognition model higher.
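As a minimal sketch of the adjustment step only, with a stand-in convolution in place of the real depth recognition network and placeholder scalars in place of the two losses, the overall loss is formed as a weighted average and back-propagated:

```python
import torch
import torch.nn as nn

depth_net = nn.Conv2d(3, 1, 3, padding=1)        # stand-in for the real network
opt = torch.optim.Adam(depth_net.parameters(), lr=1e-4)
first = torch.rand(1, 3, 128, 416)               # dummy first image

for step in range(100):
    pred_depth = depth_net(first)
    L_c = pred_depth.abs().mean()                # placeholder depth loss
    L_h = pred_depth.var()                       # placeholder target height loss
    total = 0.5 * L_c + 0.5 * L_h                # weighted average -> overall loss
    opt.zero_grad()
    total.backward()
    opt.step()                                   # adjust the depth recognition network
```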
Figure 5 is a flowchart of the image depth recognition method provided by an embodiment of the present application. According to different needs, the order of the steps in the flowchart can be adjusted to the actual detection requirements, and some steps can be omitted. The method is executed by a computer device, such as the computer device 1 shown in Figure 1.
Step 108: Acquire an image to be recognized.
In at least one embodiment of the present application, the image to be recognized refers to an image whose depth information needs to be recognized. In at least one embodiment of the present application, the computer device acquiring the image to be recognized includes: the computer device obtains the image to be recognized from a preset database. The preset database may be the KITTI database, the Cityscapes database, the vKITTI database, and so on. The depth recognition network may be a deep neural network, and it may be obtained from a database on the Internet.
Step 109: Input the image to be recognized into the depth recognition model to obtain the target depth image of the image to be recognized and the depth information of the image to be recognized, the depth recognition model being obtained by performing the depth recognition model training method described above.
In at least one embodiment of the present application, the target depth image refers to an image containing the depth information of each pixel point in the image to be recognized, where the depth information of each pixel point is the distance between the object to be recognized corresponding to that pixel point and the shooting device that captured the image to be recognized. In at least one embodiment of the present application, the target depth image is generated in essentially the same way as the initial depth image, so it is not described further here.
In at least one embodiment of the present application, the computer device takes the pixel value of each pixel in the target depth image as the depth information of the corresponding pixel in the image to be recognized. Through the above embodiments, since the accuracy of the depth recognition model is improved, the accuracy of the depth recognition of the image to be recognized can also be improved.
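A minimal inference sketch corresponding to steps 108 and 109 follows. The model interface, a (1, 3, H, W) input mapped to a (1, 1, H, W) depth map, is an assumption about the trained depth recognition model rather than an interface disclosed by the application.

```python
import torch

@torch.no_grad()
def recognize_depth(model: torch.nn.Module, image: torch.Tensor):
    """Run the trained depth recognition model on a preprocessed image.

    Assumes `model` maps a (1, 3, H, W) tensor to a (1, 1, H, W) target
    depth image; the pixel value at (y, x) is then taken as the depth
    information of the corresponding pixel of the input image.
    """
    model.eval()
    target_depth = model(image)       # target depth image of the input
    depth_info = target_depth[0, 0]   # (H, W): per-pixel depth values
    return target_depth, depth_info
```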
In summary, the present application constructs the ground plane segmentation network. Since the ground plane segmentation network is generated by pruning some of the parameters of a high-resolution network while preserving segmentation accuracy, the execution time of the ground plane segmentation network can be reduced, so that the ground plane area can be segmented accurately and quickly. The target height loss of the depth recognition network is generated based on the photographing device, the initial depth image, and the ground plane area in the first image; since the target height loss is generated from both the predicted height and the true height between the pixels of the ground plane area and the photographing device, adjusting the depth recognition network with the target height loss makes the network's predicted height for each pixel in the image to be recognized more accurate. The depth recognition network is then adjusted based on the depth loss and the target height loss to obtain the depth recognition model, so that the depth recognition model can perform depth recognition of the image to be recognized comprehensively from multiple aspects, including the photometry, the gradients, and the predicted height between each pixel and the photographing device. The depth information in the image to be recognized can therefore be recognized quickly and accurately, which improves the depth recognition efficiency of images.
Figure 6 is a schematic structural diagram of the computer device provided by an embodiment of the present application.
In one embodiment of the present application, the computer device 1 includes, but is not limited to, a storage 12, a processor 13, and a computer program stored in the storage 12 and executable on the processor 13, such as an image depth recognition program and a depth recognition model training program.
Those skilled in the art will understand that the schematic diagram is merely an example of the computer device 1 and does not limit the computer device 1; the device may include more or fewer components than shown, a combination of certain components, or different components. For example, the computer device 1 may also include input/output devices, network access devices, buses, and so on.
The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate circuit, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the computer device 1; it connects all parts of the computer device 1 through various interfaces and lines, and accesses the operating system of the computer device 1 as well as the various installed applications and program code. For example, the processor 13 may obtain, through an interface, the first image captured by the photographing device 2. The processor 13 executes the applications to implement the steps in each of the above embodiments of the depth recognition model training method and the image depth recognition method, such as the steps shown in Figures 2 and 5.
Exemplarily, the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the storage 12 and executed by the processor 13 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program in the computer device 1.
The storage 12 may be used to store the computer programs and/or modules; the processor 13 implements the various functions of the computer device 1 by running or accessing the computer programs and/or modules stored in the storage 12 and by invoking the data stored in the storage 12. The storage 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function, an image playback function, and so on), and the data storage area may store data created according to the use of the computer device. In addition, the storage 12 may include non-volatile storage, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The storage 12 may be an external storage and/or an internal storage of the computer device 1. Further, the storage 12 may be a storage having a physical form, such as a memory stick or a TF card (Trans-flash Card).
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, and the computer program code may be in source code form, object code form, executable file form, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
With reference to Figure 2, the storage 12 in the computer device 1 stores a plurality of instructions to implement a depth recognition model training method, and the processor 13 can execute the plurality of instructions to implement: obtaining a first image and a second image captured by the photographing device; constructing a ground plane segmentation network; using the ground plane segmentation network to segment the first image and determine the ground plane area in the first image; generating a projection image of the first image based on the first image, the initial depth image corresponding to the second image, and the pose matrix corresponding to both the first image and the second image; generating a target height loss of the depth recognition network based on the pre-acquired depth recognition network, the photographing device, the initial depth image, and the ground plane area in the first image; calculating a depth loss of the depth recognition network according to the gradient loss between the initial depth image and the first image and the photometric loss between the projection image and the first image; and adjusting the depth recognition network based on the depth loss and the target height loss to obtain the depth recognition model.
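Under assumptions, the training procedure recapped above can be organized as a single-step sketch like the one below; the warp function, the loss callables, and the loss weights are hypothetical stand-ins for the application's concrete components rather than its actual implementation.

```python
def train_step(depth_net, optimizer, first_img, second_img, pose, camera,
               ground_mask, warp_fn, photometric_loss, gradient_loss,
               height_loss_fn, w_depth=0.5, w_height=0.5):
    """One illustrative training step for the depth recognition network.

    `warp_fn`, the loss callables, and the weights are assumptions
    standing in for the components described in the application.
    """
    init_depth = depth_net(first_img)                           # initial depth image
    projection = warp_fn(second_img, init_depth, pose, camera)  # projection image of the first image
    depth_loss = (gradient_loss(init_depth, first_img)
                  + photometric_loss(projection, first_img))    # gradient + photometric terms
    height_loss = height_loss_fn(init_depth, ground_mask, camera)  # target height loss
    total = w_depth * depth_loss + w_height * height_loss       # overall loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```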
With reference to Figure 5, the storage 12 in the computer device 1 stores a plurality of instructions to implement an image depth recognition method, and the processor 13 can execute the plurality of instructions to implement: obtaining an image to be recognized, and inputting the image to be recognized into a depth recognition model to obtain a target depth image of the image to be recognized and the depth information of the image to be recognized.
Specifically, for the way in which the processor 13 implements the above instructions, reference can be made to the description of the relevant steps in the embodiments corresponding to Figures 2 and 5, which is not repeated here. In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions may be used in actual implementations.
The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of the present application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present application. Any reference signs in the claims shall not be construed as limiting the claims concerned.
Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in the present application may also be implemented by one unit or device through software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order. Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and are not limiting. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.
101~107: Steps