CN116386003A - Three-dimensional target detection method based on knowledge distillation - Google Patents

Three-dimensional target detection method based on knowledge distillation

Info

Publication number
CN116386003A
CN116386003A
Authority
CN
China
Prior art keywords
network
point cloud
laser point
module
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310319429.6A
Other languages
Chinese (zh)
Inventor
王智慧
李豪杰
崇智禹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202310319429.6A
Publication of CN116386003A
Legal status: Pending (current)

Abstract

The invention provides a three-dimensional object detection method based on knowledge distillation, a deep learning method for visual three-dimensional object detection in the field of autonomous driving. First, a teacher network taking a laser point cloud as input and a student network taking an RGB image as input are constructed, yielding multi-scale feature maps of the laser point cloud and the RGB image as well as 3D detection results. Three types of knowledge distillation are then used to transfer the depth information that the teacher network extracts from the laser point cloud data to the RGB image features. According to the invention, LiDARNet is constructed as the teacher network to guide the detection results of the RGBNet student network; the intermediate guidance module comprises three sub-modules that transfer depth information, providing guidance at both the feature level and the result level. The invention can be applied to any three-dimensional object detection network, requires no additional computation, maintains real-time inference speed, and is easy to implement.

Description

Translated from Chinese
Three-Dimensional Object Detection Method Based on Knowledge Distillation

Technical Field

The invention belongs to the technical field of computer vision and relates to a three-dimensional object detection method based on knowledge distillation, in particular to a deep neural network method for three-dimensional object detection in autonomous driving scenes.

Background Art

3D object detection is an indispensable part of 3D scene perception and has a wide range of applications in the real world, especially in the field of autonomous driving. Autonomous driving, which aims to enable vehicles to perceive their surroundings intelligently and drive safely with little or no human intervention, has developed rapidly in recent years. It has been widely applied in scenarios such as self-driving trucks, driverless taxis, and delivery robots, reducing human error and improving road safety. As a core component of an autonomous driving system, vehicle perception helps the self-driving car understand its surroundings through various sensor inputs. The input to the perception system is generally multi-modal data (image data from cameras, point clouds from LiDAR, high-definition maps, etc.), and the system predicts the geometric and semantic information of key elements on the road. High-quality perception results serve as a reliable basis for subsequent steps such as trajectory prediction and path planning.

To gain a comprehensive understanding of the driving environment, a perception system involves many vision tasks, such as object detection and tracking, lane line detection, and semantic and instance segmentation. Among these, 3D object detection is one of the most indispensable tasks in a vehicle perception system. It aims to predict the position, size, and category of key objects in 3D space, such as motor vehicles, pedestrians, and cyclists. Compared with 2D object detection, which only produces 2D bounding boxes on the image and ignores the actual distance between the object and the ego vehicle, 3D object detection focuses on localizing and recognizing objects in the real-world 3D coordinate system. The geometric information it predicts in real-world coordinates can be used directly to measure the distance between the ego vehicle and key objects, and further helps to plan driving routes and avoid collisions.

The research status of 3D object detection can be summarized in the following aspects:

From the sensor perspective, many types of sensors can provide raw data for 3D object detection; cameras and LiDAR are the two most commonly used. Cameras are cheap and easy to use and capture scene information from a given viewpoint. A camera produces a W×H×3 image as input for 3D object detection, where W and H are the width and height of the image and each pixel has three RGB channels. Despite being cheap, cameras have certain limitations for 3D object detection. First, a camera only captures appearance information and cannot directly obtain the 3D structure of the scene, whereas 3D object detection usually requires accurate localization in 3D space, and recovering 3D information from image data is extremely difficult. Furthermore, image-based 3D detection is easily affected by extreme weather and time-of-day conditions: detecting objects from images at night or in fog is much harder than on a sunny day, so such a system cannot guarantee robustness.

As an alternative to cameras, LiDAR sensors obtain 3D information about a scene by emitting laser beams and measuring the reflected signal. A LiDAR emits beams and performs multiple measurements in one scan cycle to generate a depth image, in which each pixel has three channels: depth r, azimuth α, and inclination φ in the spherical coordinate system.

The depth image is the raw data format acquired by the LiDAR sensor and can be further converted into a point cloud by transforming spherical coordinates into Cartesian coordinates. A point cloud can be represented as N×3, where N is the number of points in a scene and each point has three xyz coordinate channels. Both depth images and point clouds contain accurate 3D information acquired directly by the LiDAR sensor.

Therefore, compared with cameras, LiDAR is better suited for detecting objects in 3D space and is less susceptible to changes in time of day and weather. However, LiDAR is much more expensive than cameras, which limits its large-scale application in driving scenarios. Monocular-image-based 3D detection therefore remains of great research value.

Summary of the Invention

The purpose of the present invention is to provide a deep neural network that performs three-dimensional object detection from a single RGB image. The present invention first trains a teacher network with projected laser point cloud data, and then trains the monocular three-dimensional object detector of the present invention under the guidance of the teacher network. Compared with previous work, the proposed method has the following two advantages. First, the method learns 3D spatial cues directly from a teacher network that takes LiDAR data as input, rather than from depth maps predicted from RGB images; this design performs better by avoiding the information loss of the proxy task. Second, the method does not change the network architecture of the baseline model and therefore introduces no additional computational cost.

In order to achieve the above technical purpose, the technical scheme adopted by the present invention is as follows:

A three-dimensional object detection method based on knowledge distillation, applied to autonomous driving scenarios. The method comprises the following steps.

Step 1: construct a teacher network that takes the laser point cloud as input.

The point cloud collected by the LiDAR is projected onto the image plane through the camera's intrinsic and extrinsic parameters to obtain a sparse depth map, and depth completion is then performed to obtain a dense depth map, which serves as the input to the teacher network. Multi-scale features of this depth map are extracted by a deep convolutional network and fused. The fused features are fed into the detection head module to obtain the individual sub-results of the 3D detection.

Step 2: construct a student network that takes an RGB image as input. The only difference from the teacher network is the input data: the teacher network takes a laser point cloud, while the student network takes an RGB image.

Step 3: with the teacher and student networks constructed, multi-scale feature maps of the laser point cloud and the RGB image, as well as the 3D detection results, are obtained. The present invention then proposes three types of knowledge distillation: scene-level distillation in the feature space with the SF module, object-level distillation in the feature space with the OF module, and object-level distillation in the result space with the OR module. These three distillation methods transfer the depth information that the teacher network extracts from the laser point cloud data to the RGB image features.

For the SF module, only the feature maps at the 1/16, 1/32, and 1/64 scales are used. For each scale, an affinity map is computed to represent the similarity between pixels; for an H×W feature map, the affinity map is HW×HW. One affinity map is obtained per scale for the RGB image, and an affinity map of the same shape is obtained for the laser point cloud data. Between these two modalities, a smooth loss function pulls the two affinity maps together, making the relationship between each pixel and its neighborhood more consistent across modalities.

For the OF module, the same three smaller-scale feature maps are used, and a smooth loss function is applied directly to the foreground regions of the RGB feature maps and the laser point cloud feature maps. The foreground regions are obtained from the ground-truth 2D boxes. This module brings the two modalities closer together in the feature space.

For the OR module, the detection head outputs of the teacher and student networks are constrained: the detection results of the laser point cloud serve as soft labels to guide the 3D detection results of the RGB network. Besides the center point of each object, the supervised region is expanded according to the size of each object. This module makes the detection results of the RGB student network closer to those of the laser point cloud.

Beneficial effects of the present invention:

(1) The present invention is a deep learning method for visual three-dimensional object detection in the field of autonomous driving. By constructing a teacher network that takes the laser point cloud as input to guide a student network that takes the visual RGB image as input, it improves the capability of pure-vision three-dimensional object detection.

(2) The intermediate guidance module of the present invention consists of three sub-modules, SF, OF, and OR, which transfer depth information. These three modules provide guidance at the feature level and the result level respectively, improving the model's ability to extract 3D information from images.

(3) The present invention can be applied to any three-dimensional object detection network without increasing the amount of computation; it significantly improves the 3D detection accuracy of the model, guarantees real-time inference speed, and is easy to deploy.

Description of the Drawings

Figure 1 is a schematic diagram of the teacher network;

Figure 2 is a schematic diagram of the student network;

Figure 3 is a schematic diagram of the label expansion operation; Figure 3(a) shows the labels before label diffusion; Figure 3(b) shows the labels after label diffusion;

Figure 4 shows the results of the embodiment; Figure 4(a) is the 3D detection result in the image space; Figure 4(b) is the 3D detection result in the point cloud space.

Detailed Description of the Embodiments

Step 1: construct a teacher network that takes the laser point cloud as input; its structure is shown in Figure 1. The specific steps are as follows.

1.1) First, the point cloud collected by the LiDAR is projected onto the image plane through the camera's intrinsic and extrinsic parameters to obtain a sparse depth map, and depth completion is then performed to obtain a dense depth map. The depth map has shape H×W×1, with the same height and width as the corresponding RGB image.
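As an illustration of step 1.1, the following is a minimal sketch (not the patent's exact implementation) of projecting LiDAR points to a sparse depth map. The intrinsic matrix K, the LiDAR-to-camera transform T, and the nearest-point rule for pixels hit by several points are assumptions; dense depth completion would follow as a separate step.

```python
import numpy as np

def lidar_to_sparse_depth(points, K, T, H, W):
    """points: (N, 3) LiDAR xyz; K: (3, 3) intrinsics; T: (4, 4) LiDAR-to-camera extrinsics."""
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous coordinates (N, 4)
    cam = (T @ pts_h.T).T[:, :3]                                # transform into the camera frame
    cam = cam[cam[:, 2] > 0]                                    # keep points in front of the camera
    uv = (K @ cam.T).T                                          # pinhole projection
    u = (uv[:, 0] / uv[:, 2]).astype(np.int64)
    v = (uv[:, 1] / uv[:, 2]).astype(np.int64)
    depth = cam[:, 2]
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, depth = u[valid], v[valid], depth[valid]
    order = np.argsort(-depth)                                  # write far points first ...
    depth_map = np.zeros((H, W), dtype=np.float32)
    depth_map[v[order], u[order]] = depth[order]                # ... so the nearest depth wins
    return depth_map
```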

1.2) Multi-scale features of the depth map are extracted by a deep convolutional network. For the backbone, the present invention uses a deep layer aggregation network: the input is processed by multiple tree-shaped hierarchical structures whose basic unit is convolution + normalization + activation. The feature extraction network produces multi-scale feature representations at five scales: 1/64, 1/32, 1/16, 1/8, and 1/4.
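To make step 1.2 concrete, here is a minimal sketch of the basic unit (convolution + normalization + activation) stacked into a strided backbone that yields the five feature scales. The tree-shaped aggregation of the actual deep layer aggregation network is omitted, and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    """Basic unit: convolution + normalization + activation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class MultiScaleBackbone(nn.Module):
    """Produces feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the input resolution."""
    def __init__(self, c_in=1, base=32):
        super().__init__()
        self.stem = conv_block(c_in, base, stride=2)            # 1/2 resolution
        widths = [base, base * 2, base * 4, base * 8, base * 16]
        stages, prev = [], base
        for w in widths:
            stages.append(conv_block(prev, w, stride=2))        # halve resolution at each stage
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                                            # [1/4, 1/8, 1/16, 1/32, 1/64]

# usage: feats = MultiScaleBackbone(c_in=1)(torch.randn(1, 1, 384, 1280))
```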

1.3) A multi-scale feature fusion network fuses the obtained multi-scale features to a 1/n scale; the present invention fuses them to the 1/8 scale. The fused feature map is then fed into the detection head module to obtain the 3D detection results. The detection head consists of a heatmap head, an offset head, a scale head, and a rotation angle head. The heatmap output locates the rough position of an object in the image, and combining it with the offset output yields the exact position. Combined with the camera's intrinsic and extrinsic parameters, the object's 2D position is mapped into 3D space to obtain the 3D result. The scale head outputs the object's actual length, width, and height, and the rotation angle head outputs the orientation angle in the bird's-eye view. Together these parts form the final 3D detection box.
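The decoding performed in step 1.3 can be sketched as follows. The head layout and the presence of a per-pixel depth output are assumptions made for illustration (a pixel position alone cannot be back-projected to 3D without a depth estimate); the patent does not give the exact head formulas.

```python
import torch

def decode_one_object(heatmap, offset, depth, size3d, rot, K, down=8):
    """heatmap: (C, H, W); offset: (2, H, W); depth: (1, H, W); size3d: (3, H, W);
    rot: (1, H, W); K: (3, 3) camera intrinsics; down: feature-map downsampling factor."""
    C, H, W = heatmap.shape
    idx = heatmap.max(dim=0).values.flatten().argmax()          # strongest peak over all classes
    vi, ui = idx // W, idx % W
    u = (ui.float() + offset[0, vi, ui]) * down                 # refined centre in image pixels
    v = (vi.float() + offset[1, vi, ui]) * down
    z = depth[0, vi, ui]                                        # assumed depth head output
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    centre = torch.stack([(u - cx) * z / fx,                    # pinhole back-projection to 3D
                          (v - cy) * z / fy,
                          z])
    length, width, height = size3d[:, vi, ui]                   # object dimensions from the scale head
    yaw = rot[0, vi, ui]                                        # orientation angle in bird's-eye view
    return centre, (length, width, height), yaw
```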

Step 2: construct a student network that takes the RGB image as input; its structure is shown in Figure 2. The specific steps are as follows.

The student network likewise consists of three parts: a feature extraction network, a feature fusion network, and a detection head. For simplicity, the feature extraction network, feature fusion network, and detection head module of the student network adopt the same structure as those of the teacher network, so the two networks produce outputs of the same form. The only difference is the input data: the teacher network takes a laser point cloud, while the student network takes an RGB image.

Step 3: with the teacher and student networks constructed, multi-scale feature maps of the RGB image and the laser point cloud, as well as the 3D detection results, are obtained. The present invention then uses knowledge distillation to transfer the depth information contained in the multi-scale laser point cloud features extracted by the teacher network to the RGB image features. Three types of distillation are designed: the SF module for scene-level distillation in the feature space, the OF module for object-level distillation in the feature space, and the OR module for object-level distillation in the result space. The details are as follows.

For the SF module, only the feature maps at the 1/16, 1/32, and 1/64 scales are used. For each scale, an affinity map is computed to represent the similarity between pixels; for an H×W feature map, the affinity map is HW×HW. One affinity map is thus obtained per scale for the RGB image, and an affinity map of the same shape is obtained for the laser point cloud data. Between these two modalities, a smooth L1 loss pulls the two affinity maps together, making the relationship between each pixel and its neighborhood more consistent across modalities.
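A minimal sketch of the SF distillation loss follows. The use of cosine similarity for the affinity map and the equal weighting across scales are assumptions, since the patent only specifies an HW×HW affinity map and a smooth L1 loss.

```python
import torch
import torch.nn.functional as F

def affinity(feat):
    """feat: (B, C, H, W) -> (B, HW, HW) pairwise pixel-similarity (affinity) map."""
    B, C, H, W = feat.shape
    f = F.normalize(feat.flatten(2), dim=1)       # (B, C, HW), unit norm per pixel
    return torch.bmm(f.transpose(1, 2), f)        # cosine similarity between all pixel pairs

def sf_loss(student_feats, teacher_feats):
    """Both arguments: lists of the 1/16, 1/32 and 1/64 feature maps; the teacher is frozen."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + F.smooth_l1_loss(affinity(fs), affinity(ft.detach()))
    return loss
```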

For the OF module, the same three smaller-scale feature maps are used, and a smooth L1 loss is applied directly to the foreground regions of the RGB feature maps and the laser point cloud feature maps. The foreground regions are obtained from the ground-truth 2D boxes. This module brings the two modalities closer together in the feature space.
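A minimal sketch of the OF loss, assuming ground-truth 2D boxes given as [x1, y1, x2, y2] in image pixels and a single-image mask for brevity; the exact masking and normalization scheme is not specified in the patent.

```python
import torch
import torch.nn.functional as F

def foreground_mask(boxes_2d, feat_hw, img_hw):
    """boxes_2d: (M, 4) [x1, y1, x2, y2] in image pixels -> (H, W) binary foreground mask."""
    H, W = feat_hw
    sy, sx = H / img_hw[0], W / img_hw[1]                       # image-to-feature scaling
    mask = torch.zeros(H, W)
    for x1, y1, x2, y2 in boxes_2d:
        mask[int(y1 * sy):int(y2 * sy) + 1, int(x1 * sx):int(x2 * sx) + 1] = 1.0
    return mask

def of_loss(fs, ft, boxes_2d, img_hw):
    """fs, ft: (B, C, H, W) student / frozen teacher features at one scale."""
    mask = foreground_mask(boxes_2d, fs.shape[-2:], img_hw).to(fs.device)
    mask = mask.view(1, 1, *fs.shape[-2:])
    diff = F.smooth_l1_loss(fs, ft.detach(), reduction='none') * mask
    return diff.sum() / mask.sum().clamp(min=1.0)               # average over foreground pixels only
```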

For the OR module, the detection head outputs of the teacher and student networks are constrained: the detection results of the laser point cloud serve as soft labels to guide the 3D detection results of the RGB network. Besides the center point of each object, the supervised region is expanded according to the size of each object, as illustrated in Figure 3. This module makes the detection results of the RGB student network closer to those of the laser point cloud.
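A minimal sketch of the OR loss follows. The square, size-dependent expansion window around each object centre and the per-head smooth L1 distillation are assumptions used to illustrate the idea of expanding the supervised region beyond the centre point.

```python
import torch
import torch.nn.functional as F

def expanded_mask(centres, half_widths, feat_hw):
    """centres: (M, 2) object centres (u, v) on the feature map; half_widths: (M,) expansion radii."""
    H, W = feat_hw
    mask = torch.zeros(H, W)
    for (u, v), r in zip(centres, half_widths):
        r = max(int(r), 1)                                       # expand with the object size
        mask[max(int(v) - r, 0):int(v) + r + 1, max(int(u) - r, 0):int(u) + r + 1] = 1.0
    return mask

def or_loss(student_heads, teacher_heads, centres, half_widths):
    """student_heads / teacher_heads: dicts of (B, C, H, W) head outputs with matching keys."""
    feat_hw = next(iter(student_heads.values())).shape[-2:]
    mask = expanded_mask(centres, half_widths, feat_hw).view(1, 1, *feat_hw)
    loss = 0.0
    for key, s in student_heads.items():
        t = teacher_heads[key].detach()                          # teacher results act as soft labels
        per_pixel = F.smooth_l1_loss(s, t, reduction='none') * mask.to(s.device)
        loss = loss + per_pixel.sum() / mask.sum().clamp(min=1.0)
    return loss
```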

The above are specific embodiments of the present invention and the technical principles applied. Changes made according to the conception of the present invention, whose resulting functions do not go beyond the spirit covered by the description and the drawings, shall still fall within the protection scope of the present invention.

Claims (2)

CN202310319429.6A | 2023-03-29 | 2023-03-29 | Three-dimensional target detection method based on knowledge distillation | Pending | CN116386003A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310319429.6A (published as CN116386003A) | 2023-03-29 | 2023-03-29 | Three-dimensional target detection method based on knowledge distillation

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310319429.6A (published as CN116386003A) | 2023-03-29 | 2023-03-29 | Three-dimensional target detection method based on knowledge distillation

Publications (1)

Publication Number | Publication Date
CN116386003A | 2023-07-04

Family

ID=86966891

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310319429.6A | Three-dimensional target detection method based on knowledge distillation (CN116386003A, Pending) | 2023-03-29 | 2023-03-29

Country Status (1)

Country | Link
CN (1) | CN116386003A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN118378693A (en) * | 2024-06-20 | 2024-07-23 | 之江实验室 | A model training and spacecraft target detection method, device and electronic equipment

Similar Documents

Publication | Title
CN114384920B (en) | Dynamic obstacle avoidance method based on real-time construction of local grid map
Chen et al. | 3D point cloud processing and learning for autonomous driving: Impacting map creation, localization, and perception
US12094226B2 (en) | Simultaneous localization and mapping method, device, system and storage medium
CN111238494B (en) | Carrier, carrier positioning system and carrier positioning method
Zhe et al. | Inter-vehicle distance estimation method based on monocular vision using 3D detection
CN111563415B (en) | A three-dimensional target detection system and method based on binocular vision
US11120280B2 (en) | Geometry-aware instance segmentation in stereo image capture processes
Paz et al. | Probabilistic semantic mapping for urban autonomous driving applications
CN112861653A (en) | Detection method, system, equipment and storage medium for fusing image and point cloud information
CN113378647B (en) | Real-time track obstacle detection method based on 3D point cloud
JP7224682B1 (en) | 3D multiple object detection device and method for autonomous driving
CN110135485A (en) | Object recognition and positioning method and system based on fusion of monocular camera and millimeter wave radar
CN115205391A (en) | Target prediction method based on three-dimensional laser radar and vision fusion
CN111880191A (en) | Map generation method based on multi-agent laser radar and visual information fusion
CN114639115B (en) | Human body key point and laser radar fused 3D pedestrian detection method
CN115876198A (en) | Target detection and early warning method, device, system and medium based on data fusion
CN114118247A (en) | An anchor-free 3D object detection method based on multi-sensor fusion
CN116403186B (en) | FPN Swin Transformer and Pointnet++ based automatic driving three-dimensional target detection method
CN117372991A (en) | Automatic driving method and system based on multi-view multi-mode fusion
Huang et al. | Measuring the absolute distance of a front vehicle from an in-car camera based on monocular vision and instance segmentation
CN116664851A (en) | Automatic driving data extraction method based on artificial intelligence
Dai et al. | Enhanced Object Detection in Autonomous Vehicles through LiDAR—Camera Sensor Fusion
Carneiro et al. | Mapping road lanes using laser remission and deep neural networks
CN116246033B (en) | Rapid semantic map construction method for unstructured road
Gao et al. | Toward effective 3D object detection via multimodal fusion to automatic driving for industrial cyber-physical systems

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
