Aerial view semantic segmentation label generation method based on multi-frame semantic point cloud splicing

Technical Field
The invention relates to the technical field of surround perception for automobile automatic driving, and in particular to a method for generating bird's-eye view semantic segmentation labels based on multi-frame semantic point cloud splicing.
Background
Autonomous driving is valued by more and more manufacturers as a key technology for the new generation of intelligent cars. Generally, the entire autonomous driving system consists of three major modules: a perception fusion module, a decision planning module and a control module. Perception fusion is the front-end module of the other two, and its perception precision directly determines the performance of the whole autonomous driving system.
The current perception module is no longer limited to the traditional single front-view camera (Forward Camera) configuration; manufacturers have started to perform 360-degree, blind-spot-free surround perception with multiple cameras around the vehicle body. The most common arrangement is shown in fig. 1: six cameras are arranged at the front view, rear view, left front, left rear, right front and right rear positions, collect images from different viewing angles, and feed them into a surround perception model, which directly outputs the semantic information of a Bird's Eye View (BEV). The bird's-eye view discussed herein refers specifically to the view from directly above the host vehicle. Bird's-eye view semantic information refers to the semantic segmentation of the bird's-eye view; its segmentation elements are defined according to requirements and include static objects such as lane lines and drivable areas, as well as moving objects such as vehicles and pedestrians.
In order to train such a bird's-eye view semantic segmentation model, it is naturally necessary to acquire corresponding bird's-eye view semantic segmentation labels (hereinafter referred to as BEV labels). The currently possible label acquisition modes are as follows:
The first mode: a high-precision map is generated off-line (such as the high-precision map generation method, device, equipment and readable storage medium disclosed in CN 202010597488.6), and the corresponding BEV labels are then generated from the semantic information elements of the high-precision map. This method requires a pre-built high-precision map to be extracted, after which the BEV labels are obtained directly from its semantic information.
The second mode: a drone (unmanned aerial vehicle) performs synchronized aerial photography directly above the data acquisition vehicle, and the resulting bird's-eye views are then labeled manually. The biggest disadvantage of this approach is that drone flights are usually subject to regional restrictions, so the data collection scenarios are limited. Furthermore, this acquisition approach cannot be triggered by the shadow mode, which makes subsequent closed-loop iterations of the model difficult.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a bird's-eye view semantic segmentation label generation method based on multi-frame semantic point cloud splicing, so as to obtain a low-cost automatic BEV label generation algorithm that avoids the high cost and inconvenience of drones and high-precision maps and obtains BEV labels directly from semantic point clouds and multi-frame splicing.
The technical scheme of the invention is realized as follows:
a bird's-eye view semantic segmentation label generation method based on multi-frame semantic point cloud splicing is characterized by comprising the following steps:
1) configuring sensors for data acquisition on the vehicle: arranging cameras in two or more directions of the vehicle so that the viewing angles of adjacent cameras partially overlap and 360 degrees around the vehicle body are covered; arranging the laser radar on the top of the vehicle body;
2) sensor calibration: calibrating the internal parameters of each camera and the external parameters relative to the vehicle body by using a calibration plate, and calibrating the external parameters of the laser radar relative to the vehicle body by using a camera and laser radar combined calibration method;
3) data acquisition: synchronizing the data collected by the cameras and the laser radar at the same moment, and ensuring that the timestamp difference of all data at each moment does not exceed a set value; when the scanning edge of the laser radar coincides with the optical axis of a camera, triggering that camera to expose, thereby acquiring an original image of the scene outside the vehicle body;
4) data annotation: jointly labeling the original images acquired by the cameras at the same moment and the corresponding point cloud acquired by the laser radar; on each frame of image acquired by the cameras, labeling road surface information comprising at least two static targets, namely lane lines and the drivable area; on the synchronized point cloud acquired by the laser radar, labeling 3D bounding boxes of at least two types of moving targets, namely pedestrians and vehicles;
5) generating a single-frame semantic point cloud: projecting the labeled point cloud onto each camera plane, and staining (dyeing) the point cloud with the semantic information of the images to generate the semantic point cloud;
6) splicing continuous multi-frame semantic point clouds into a unified vehicle-body coordinate system that takes a certain frame as reference, and projecting them onto a BEV canvas to obtain a dense BEV label.
In this way, semantic point clouds are generated from the image semantic information and the point cloud information, continuous multi-frame semantic point clouds are then spliced, and the result is finally projected onto the bird's-eye-view canvas and post-processed, so that the bird's-eye view semantic segmentation map is generated automatically. High-cost acquisition of bird's-eye view semantic labels by means of drones or high-precision maps is avoided, and the cost of data labeling is greatly reduced.
Further: a BEV label post-processing step follows step 6, in which hole regions left by the point cloud projection are repaired and the projected map is refined manually or morphologically to obtain the BEV road surface area label. A more accurate BEV road surface area label can thus be obtained through this post-processing repair.
Further: the external parameters of each camera are described by a yaw angle yaw, a pitch angle pitch, a roll angle roll, an X-direction translation distance tx, a Y-direction translation distance ty and a Z-direction translation distance tz; the internal parameters are the camera's x-direction and y-direction pixel focal lengths f_x, f_y and the pixel center p_x, p_y;
The projection matrix from the vehicle body coordinate system to the camera pixel coordinate system is obtained from the extrinsics and intrinsics; the specific transformation derivation is given by formulas (1) to (8):
R = R_yaw · R_pitch · R_roll    (4)
The rigid-body rotation matrix is denoted R and the translation vector T; K is the intrinsic matrix formed by the camera intrinsics. Formula (9) is derived from (6) and (8): R, T and K form a 3x4 projection matrix P,
Z_c · [u, v, 1]^T = P · [X, Y, Z, 1]^T,  P = K · [R | T]    (9)
which projects the homogeneous coordinates [X, Y, Z, 1]^T of a point in the vehicle body coordinate system to the pixel coordinates (u, v) on the camera pixel plane, where Z_c is the depth of the point in the camera coordinate system. The extrinsics and intrinsics of each camera can thus be accurately calibrated, providing the reference needed to obtain a high-precision BEV road surface area label.
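By way of illustration only, the assembly of the projection matrix P from the calibrated extrinsics and intrinsics can be sketched as follows. This is a minimal numpy sketch assuming a Z-Y-X (yaw-pitch-roll) rotation convention and the body-to-camera direction described above; all numeric values are placeholders rather than real calibration results.

```python
import numpy as np

def rotation_from_ypr(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """R = R_yaw @ R_pitch @ R_roll (formula (4)), assuming rotations about Z, Y and X."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_yaw = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    R_pitch = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    R_roll = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return R_yaw @ R_pitch @ R_roll

def projection_matrix(yaw, pitch, roll, tx, ty, tz, fx, fy, px, py) -> np.ndarray:
    """3x4 projection matrix P = K [R | T] from the body frame to the pixel frame (formula (9))."""
    R = rotation_from_ypr(yaw, pitch, roll)
    T = np.array([[tx], [ty], [tz]])
    K = np.array([[fx, 0.0, px], [0.0, fy, py], [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, T])

# Usage: project a homogeneous body-frame point to pixel coordinates (u, v).
P = projection_matrix(0.0, 0.0, 0.0, 0.0, 0.0, -1.5, 1000.0, 1000.0, 960.0, 540.0)
X_hom = np.array([2.0, 0.0, 10.0, 1.0])   # placeholder point in the body frame
uvw = P @ X_hom                           # equals Z_c * [u, v, 1]^T
u, v, Z_c = uvw[0] / uvw[2], uvw[1] / uvw[2], uvw[2]
print(u, v, Z_c)
```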
Further: the laser radar coordinate system is converted into the vehicle body coordinate system, with the following conversion relation:
[X_ego, Y_ego, Z_ego]^T = R_lidar · [X_lidar, Y_lidar, Z_lidar]^T + T_lidar    (10)
The laser radar coordinate system can thus be transformed into the vehicle body coordinate system, making the coordinate transformation convenient and accurate.
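As a concrete illustration of this rigid transform, a minimal numpy sketch follows; R_lidar and T_lidar below are placeholder values, not calibrated extrinsics.

```python
import numpy as np

# Placeholder lidar-to-vehicle-body extrinsics (rotation R_lidar, translation T_lidar).
R_lidar = np.eye(3)                    # 3x3 rotation, identity as a stand-in
T_lidar = np.array([0.0, 0.0, 1.8])    # lidar assumed mounted about 1.8 m above the body origin

def lidar_to_body(points_lidar: np.ndarray) -> np.ndarray:
    """Rigidly transform an (N, 3) lidar point cloud into the vehicle-body frame (formula (10))."""
    return points_lidar @ R_lidar.T + T_lidar

demo = np.array([[10.0, 0.0, -1.5]])   # one lidar point, 10 m ahead of the sensor
print(lidar_to_body(demo))             # -> [[10.   0.   0.3]]
```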
Further: the single-frame semantic point cloud in step 5 is generated as follows:
the point cloud is first converted into the vehicle body coordinate system using formula (10), then into the camera coordinate system using formula (6); point cloud points whose Z-direction coordinate is smaller than 0 are filtered out, and the remaining points are finally converted into pixel coordinate points using formula (8);
the process is formulated as follows:
[X_i_lidar, Y_i_lidar, Z_i_lidar]^T = R_i · (R_lidar · [X_lidar, Y_lidar, Z_lidar]^T + T_lidar) + T_i, with Z_i_lidar > 0    (11)
Z_i_lidar · [u_i_lidar, v_i_lidar, 1]^T = K_i · [X_i_lidar, Y_i_lidar, Z_i_lidar]^T    (12)
wherein (X_i_lidar, Y_i_lidar, Z_i_lidar) represents the laser radar point expressed in the i-th camera coordinate system, Z_i_lidar > 0, and (u_i_lidar, v_i_lidar) represents the coordinates of the laser radar point on the i-th camera pixel plane;
the original point cloud is perspective-transformed onto the pixel plane of camera i by formulas (11) and (12), and the resulting perspective projection map is recorded as Mask_i_lidar, where Mask_i_lidar(u, v) = 1 if a point cloud projection falls on pixel coordinate (u, v) and Mask_i_lidar(u, v) = 0 otherwise; the semantic label Mask_i_gt of the i-th camera's image is then used to stain Mask_i_lidar; the point cloud thereby acquires category attributes and becomes a semantic point cloud.
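The projection of the labeled point cloud onto camera i's pixel plane and the construction of Mask_i_lidar can be sketched as follows. This is a minimal numpy illustration; R_i, T_i, K_i and the image size are assumed to come from the calibration above, and the returned index array idx is an implementation convenience for the later staining step, not part of the patent's notation.

```python
import numpy as np

def project_points_to_camera(points_body, R_i, T_i, K_i, img_h, img_w):
    """Project body-frame points into camera i (formulas (6) and (8)), keep Z_i_lidar > 0,
    and build Mask_i_lidar together with the index of the source point of every projection."""
    cam = points_body @ R_i.T + T_i                 # body -> camera i
    keep = cam[:, 2] > 0                            # discard points behind the camera
    cam, idx = cam[keep], np.flatnonzero(keep)
    uvw = cam @ K_i.T                               # homogeneous pixel coordinates, = Z_c * [u, v, 1]
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    u, v, idx = u[inside], v[inside], idx[inside]
    mask_i_lidar = np.zeros((img_h, img_w), dtype=np.uint8)
    mask_i_lidar[v, u] = 1                          # Mask_i_lidar(u, v) = 1 where a point projects
    return mask_i_lidar, u, v, idx                  # idx maps each (u, v) back to its source point
```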
Further: the staining process is as follows:
(1) selecting the target category j to be stained;
(2) assuming the label value of this category is f_j, the set of pixel points corresponding to this category is:
Mask_j_i_gt = (Mask_i_gt == f_j)    (13)
(3) solving the intersection M_ij of the mask of category j and the non-zero pixel points in the point cloud projection mask:
M_ij = {(u, v) | Mask_j_i_gt(u, v) == 1, Mask_i_lidar(u, v) == 1}    (14)
(4) for this set of non-zero projection points, reversely solving the corresponding point set in the original point cloud:
P_ij = {(X_lidar, Y_lidar, Z_lidar) | the projection point of (X_lidar, Y_lidar, Z_lidar) on the i-th camera ∈ M_ij}    (15)
The point set P_ij is given the label of category j, i.e. the point cloud is stained. In this way each target category can be stained, yielding a semantic point cloud.
Further: P_ij can be found for each camera, and the final point cloud with category j is:
P_j = ∪_i P_ij    (16)
For each category of the road surface area, the point cloud can be stained according to steps (1) to (4), so that semantic point clouds P_j with the different category labels are finally obtained.
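Formulas (13)-(16) can be illustrated with the following minimal sketch, which assumes the per-camera pixel coordinates u, v, the source-point indices idx and the semantic mask come from a projection step such as the project_points_to_camera sketch above; the function names are illustrative only.

```python
import numpy as np

def stain_category(points_lidar, u, v, idx, mask_i_gt, f_j):
    """Return P_ij: the original lidar points whose projections fall on pixels of
    semantic class j (label value f_j) in camera i's semantic mask (formulas (13)-(15))."""
    mask_j_i_gt = (mask_i_gt == f_j)            # formula (13)
    hit = mask_j_i_gt[v, u]                     # membership of each projected pixel in M_ij (14)
    return points_lidar[idx[hit]]               # formula (15): back to the original 3D points

def stain_all_cameras(points_lidar, per_camera, f_j):
    """P_j as the union over all cameras of P_ij (formula (16)).
    per_camera is a list of (u, v, idx, mask_i_gt) tuples, one entry per camera."""
    parts = [stain_category(points_lidar, u, v, idx, m, f_j) for (u, v, idx, m) in per_camera]
    merged = np.concatenate(parts, axis=0) if parts else np.empty((0, 3))
    # Drop duplicate points seen by more than one camera.
    return merged if merged.size == 0 else np.unique(merged, axis=0)
```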
Further: the dense BEV label in step 6 is obtained through the following steps:
a total of 2N+1 frames, namely the N frames before and the N frames after the current frame plus the current frame itself, are selected as the raw information for generating the BEV label of the current frame; the reference frame is the current frame, labeled 0 and denoted by the subscript ref. An objectively invariant world coordinate system, denoted by the subscript w, is introduced, and points in the world coordinate system are converted into the reference coordinate system as follows:
wherein R_w and T_w are still defined according to formulas (4) and (5), with the yaw angle yaw, pitch angle pitch, roll angle roll, X-direction translation distance tx, Y-direction translation distance ty and Z-direction translation distance tz taken from the pose information of the vehicle body itself; for the m-th frame (m ∈ {-N, -(N-1), ..., -1, 0, 1, ..., N-1, N}), the semantic point cloud set P_mj of category j is first obtained through step 5, and these point clouds are then converted into the world coordinate system with the following conversion formula:
wherein the converted coordinates denote a point of the point cloud set of the j-th label category of the m-th frame; by means of formulas (17) and (18), the point cloud set of the j-th label of the m-th frame can be converted into a point cloud in the unified reference coordinate system:
converting the semantic point clouds of all frames in [-N, N] into the reference coordinate system through formula (19) yields a dense point cloud, which is then projected onto the BEV canvas to obtain a dense BEV label.
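A minimal sketch of this multi-frame splicing follows. It assumes one possible pose convention, namely that each frame's ego pose (R_w, T_w) maps body coordinates into the world frame; the patent's formulas (17) to (19) define the exact convention used.

```python
import numpy as np

def body_to_world(points_body, R_w, T_w):
    """Assumed pose convention: a frame's ego pose (R_w, T_w) maps body coordinates
    into the world frame, X_w = R_w @ X_body + T_w."""
    return points_body @ R_w.T + T_w

def world_to_reference(points_world, R_ref, T_ref):
    """Inverse of the reference frame's pose: X_ref = R_ref^T @ (X_w - T_ref)."""
    return (points_world - T_ref) @ R_ref

def splice_frames(frames, ref_pose):
    """frames: list of (P_mj, R_w, T_w) for m in [-N, N], where P_mj is the (M, 3)
    semantic point cloud of label j in frame m's body coordinates.
    Returns the dense point cloud of label j expressed in the reference body frame."""
    R_ref, T_ref = ref_pose
    merged = [world_to_reference(body_to_world(P_mj, R_w, T_w), R_ref, T_ref)
              for (P_mj, R_w, T_w) in frames]
    return np.concatenate(merged, axis=0)
```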
Further: the BEV label post-processing repairs hole regions of the point cloud projection and manually refines the label again; for moving targets, the four ground points of each 3D bounding box are projected directly onto the BEV canvas of the reference coordinate system, and the BEV road surface label and the BEV moving targets are finally fused to obtain an accurate BEV road surface area label.
Further: the set value for the timestamp difference of all data at each moment is 45 ms.
Further: one camera is arranged at each of the front view, rear view, left front, left rear, right front and right rear of the vehicle body, six cameras in total, covering 360 degrees around the vehicle body. The viewing angles of adjacent cameras thus partially overlap, and the vehicle body surroundings are covered through 360 degrees.
In summary, the invention has the following beneficial effects:
1. The method is a low-cost automatic BEV label generation algorithm: the high cost and inconvenience of drones and high-precision maps are avoided, and BEV labels are obtained directly from semantic point clouds and multi-frame splicing.
2. Semantic point clouds are generated from the image semantic information and the point cloud information, continuous multi-frame semantic point clouds are then spliced, and the result is finally projected onto the bird's-eye-view canvas and post-processed, so that the bird's-eye view semantic segmentation map is generated automatically; high-cost acquisition of bird's-eye view semantic labels by means of drones or high-precision maps is avoided, and the cost of data labeling is greatly reduced.
Drawings
FIG. 1 is a schematic view of a data acquisition sensor configuration of the vehicle;
FIG. 2-1 is a schematic view of a raw picture taken by a sensor;
FIG. 2-2 is a mask generated from the label of the original picture of FIG. 2-1;
FIGS. 2-3 are point cloud views with labeled bounding boxes;
FIG. 3-1 is a projection view of a single-frame travelable area BEV; FIG. 3-2 is a single frame lane line BEV projection;
FIG. 4-1 is a multi-frame drivable area BEV projection; fig. 4-2 are multi-frame lane line BEV projection views;
FIG. 5 shows the moving target BEV label, the refined road surface BEV label and the final fusion map;
fig. 6 is a flow chart of the overall BEV label auto-generation algorithm.
Detailed Description
The following detailed description of specific embodiments of the invention refers to the accompanying drawings.
Referring to fig. 1 to 6, the invention relates to a bird's-eye view semantic segmentation label generation method based on multi-frame semantic point cloud splicing, which comprises the following steps:
the specific implementation steps are as follows:
1. Sensors for data acquisition are configured on a vehicle, mainly comprising cameras and a radar (laser radar): the acquisition devices are arranged at the vehicle body positions shown in fig. 1 to form a data acquisition vehicle. Six cameras are arranged in the directions shown, namely front, rear, left front, left rear, right front and right rear, so that the viewing angles of adjacent cameras partially overlap and the surroundings of the vehicle body are covered through 360 degrees. For the cameras: the front-view and rear-view cameras have an HFOV of 50 degrees, a maximum range of 200 m and a focal length of 6 mm; the side-front cameras have an HFOV of 120 degrees, a maximum range of 40 m and a focal length of 2.33 mm; the side-rear cameras have an HFOV of 80 degrees, a maximum range of 70 m and a focal length of 4.14 mm. All cameras have a resolution of 2 megapixels. The laser radar is mounted on the top of the vehicle body, with a horizontal FOV of 360 degrees, a vertical FOV of approximately -20 to 20 degrees, and a scanning frequency of 20 Hz.
2. Sensor calibration: the internal parameters (intrinsics) of each camera and its external parameters (extrinsics) relative to the vehicle body (ego vehicle) are calibrated with a calibration board; the extrinsics of the laser radar relative to the vehicle body are calibrated with a joint camera and laser radar (hereinafter referred to as lidar) calibration method.
The extrinsics and intrinsics of each camera relative to the vehicle body (ego vehicle) are calibrated with a camera calibration board. The extrinsics of each camera are described by a yaw angle yaw, a pitch angle pitch, a roll angle roll, an X-direction translation distance tx, a Y-direction translation distance ty and a Z-direction translation distance tz. The intrinsics are the camera's x-direction and y-direction pixel focal lengths f_x, f_y and pixel center p_x, p_y. The projection matrix from the vehicle body coordinate system to the camera pixel coordinate system is obtained from the extrinsics and intrinsics; the specific transformation derivation is given by formulas (1)-(8).
R = R_yaw · R_pitch · R_roll    (4)
The rigid-body rotation matrix is denoted R and the translation vector T. K is the intrinsic matrix formed by the camera intrinsics. Formula (9) is derived from (6) and (8): R, T and K form a 3x4 projection matrix P,
Z_c · [u, v, 1]^T = P · [X, Y, Z, 1]^T,  P = K · [R | T]    (9)
which projects the homogeneous coordinates [X, Y, Z, 1]^T of a point in the vehicle body coordinate system to the pixel coordinates (u, v) on the camera pixel plane, where Z_c is the depth of the point in the camera coordinate system.
For the lidar, this patent only involves the transformation from the lidar coordinate system to the ego vehicle coordinate system, whose conversion relation is:
[X_ego, Y_ego, Z_ego]^T = R_lidar · [X_lidar, Y_lidar, Z_lidar]^T + T_lidar    (10)
3. Data acquisition: the data collected by the cameras and the lidar at the same moment are synchronized, and the timestamp difference of all data at each moment is guaranteed not to exceed a set value, e.g. 45 ms or lower. During data acquisition, the six camera streams must be synchronized with the lidar. In this patent, synchronization is achieved by triggering a camera's exposure when the scanning edge of the lidar coincides with that camera's optical axis, so that the original image of the scene outside the vehicle body is acquired. Thus, every time the lidar sweeps through 360 degrees, every camera is exposed once. The lidar's scanning frequency is 20 Hz, i.e. one rotation takes 50 ms, so the maximum synchronization difference between cameras is (5/6) × 50 ms ≈ 41.7 ms, which satisfies the requirement of less than 45 ms.
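To make the synchronization bound concrete, the following sketch computes the per-camera trigger times within one lidar sweep. The optical-axis azimuths are illustrative assumptions, not the calibration values of this patent.

```python
# Lidar spins at 20 Hz, i.e. one 360-degree sweep takes 50 ms; each camera is
# triggered when the scanning edge crosses its optical-axis azimuth.
SWEEP_MS = 1000.0 / 20.0                      # 50 ms per revolution

# Illustrative optical-axis azimuths (degrees) for the six cameras, assumed evenly spread.
camera_azimuths = {"front": 0, "right_front": 60, "right_rear": 120,
                   "rear": 180, "left_rear": 240, "left_front": 300}

trigger_ms = {name: az / 360.0 * SWEEP_MS for name, az in camera_azimuths.items()}
spread = max(trigger_ms.values()) - min(trigger_ms.values())
print(trigger_ms)
print(f"max inter-camera offset: {spread:.1f} ms")   # 300/360 * 50 = 41.7 ms < 45 ms
```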
4. Data annotation: the original images acquired by the cameras at the same moment (e.g. 6 original images from the six cameras) and the corresponding point cloud acquired by the lidar are labeled jointly: static road surface areas, comprising at least the drivable area and lane lines, are labeled on the images acquired by the cameras, and 3D bounding boxes of at least two types of moving targets, such as vehicles and pedestrians, are labeled on the point cloud acquired by the lidar.
Referring to figs. 2-1, 2-2 and 2-3, each frame of image collected by the cameras is labeled with road surface information, which for this patent consists of two static targets, namely lane lines and the drivable area. Moving target information is labeled on the synchronized point cloud acquired by the lidar; for this patent, two types of moving targets, namely pedestrians and vehicles, are labeled with the existing 3D-bounding-box labeling approach.
5. Generating a single-frame semantic point cloud: the labeled point cloud is projected onto each camera plane and stained with the semantic information of the images to generate the semantic point cloud.
The point cloud is first converted into the ego vehicle coordinate system using formula (10), then into the camera coordinate system using formula (6); point cloud points whose Z-direction coordinate is smaller than 0 are filtered out, and the remaining points are finally converted into pixel coordinate points using formula (8). This process can be formulated as follows:
[X_i_lidar, Y_i_lidar, Z_i_lidar]^T = R_i · (R_lidar · [X_lidar, Y_lidar, Z_lidar]^T + T_lidar) + T_i, with Z_i_lidar > 0    (11)
Z_i_lidar · [u_i_lidar, v_i_lidar, 1]^T = K_i · [X_i_lidar, Y_i_lidar, Z_i_lidar]^T    (12)
wherein (X_i_lidar, Y_i_lidar, Z_i_lidar) represents the lidar point expressed in the i-th camera coordinate system, Z_i_lidar > 0, and (u_i_lidar, v_i_lidar) represents the coordinates of the lidar point on the i-th camera pixel plane. The original point cloud is perspective-transformed onto the pixel plane of camera i by formulas (11) and (12), and the resulting perspective projection map is recorded as Mask_i_lidar, where Mask_i_lidar(u, v) = 1 if a point cloud projection falls on pixel coordinate (u, v) and Mask_i_lidar(u, v) = 0 otherwise. The semantic label Mask_i_gt of the i-th camera's image is then used to stain Mask_i_lidar; the specific staining procedure is as follows:
(1) selecting the target category j to be stained, e.g. the travelable area in this patent;
(2) assuming the label value of this category is f_j, the set of pixel points corresponding to this category is:
Mask_j_i_gt = (Mask_i_gt == f_j)    (13)
(3) solving the intersection M_ij of the mask of category j and the non-zero pixel points in the point cloud projection mask:
M_ij = {(u, v) | Mask_j_i_gt(u, v) == 1, Mask_i_lidar(u, v) == 1}    (14)
(4) for this set of non-zero projection points, reversely solving the corresponding point set in the original point cloud:
P_ij = {(X_lidar, Y_lidar, Z_lidar) | the projection point of (X_lidar, Y_lidar, Z_lidar) on the i-th camera ∈ M_ij}    (15)
The point set P_ij is given the label of category j, i.e. the point cloud is stained; the point cloud thereby acquires category attributes and becomes a semantic point cloud.
Further, P_ij can be found for each camera, and the final point cloud with category j is:
P_j = ∪_i P_ij    (16)
For each category of the road surface area, the point cloud can be stained according to steps (1) to (4), so that semantic point clouds P_j with the different category labels are finally obtained. Figs. 3-1 and 3-2 show the result of projecting a single-frame semantic point cloud of the travelable area onto the BEV canvas.
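Projecting a semantic point cloud of category j onto the BEV canvas amounts to rasterizing its X-Y coordinates into a top-down grid. The following minimal sketch assumes a 100 m x 100 m canvas at 0.1 m resolution centred on the ego vehicle; the grid extent, resolution and label value are illustrative choices, not values specified by the patent.

```python
import numpy as np

def rasterize_to_bev(points_body, label_value, bev=None,
                     x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), resolution=0.1):
    """Rasterize body-frame points onto a BEV canvas centred on the ego vehicle;
    every occupied cell receives the category's label value."""
    h = int(round((x_range[1] - x_range[0]) / resolution))
    w = int(round((y_range[1] - y_range[0]) / resolution))
    if bev is None:
        bev = np.zeros((h, w), dtype=np.uint8)
    rows = ((x_range[1] - points_body[:, 0]) / resolution).astype(int)   # forward (+x) = top of canvas
    cols = ((points_body[:, 1] - y_range[0]) / resolution).astype(int)
    keep = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    bev[rows[keep], cols[keep]] = label_value
    return bev

# Usage: draw a (single-frame or spliced) travelable-area point cloud with assumed label value 1.
demo_points = np.array([[5.0, -1.2, 0.0], [5.1, -1.1, 0.0]])
canvas = rasterize_to_bev(demo_points, label_value=1)
print(int(canvas.sum()))
```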
6. Splicing and projecting multi-frame semantic point clouds: continuous multi-frame semantic point clouds are spliced into a unified vehicle-body coordinate system that takes a certain frame as reference and are then projected onto the BEV canvas.
Step 5 deals with the generation of single-frame semantic point clouds, but point clouds are sparse (see figs. 3-1 and 3-2); to obtain dense semantic point clouds, continuous multi-frame semantic point clouds are therefore spliced into a unified vehicle-body coordinate system that takes a certain frame as reference. Specifically, this patent selects a total of 2N+1 frames (the plus 1 being the current frame), namely the N frames before and the N frames after the current frame, as the raw information for generating the BEV label of the current frame; the reference frame is the current frame (labeled 0) and is denoted by the subscript ref. An objectively invariant world coordinate system, denoted by the subscript w, is introduced here and, similarly to the principle of formula (6), points in the world coordinate system are transformed into the reference coordinate system as follows:
wherein R_w and T_w are still defined according to formulas (4) and (5), with the yaw angle yaw, pitch angle pitch, roll angle roll, X-direction translation distance tx, Y-direction translation distance ty and Z-direction translation distance tz taken from the pose information of the ego vehicle body itself (this information can be obtained from wheel encoders/IMU or VIO with its own set of algorithms, which is outside the scope of this patent). For the m-th frame (m ∈ {-N, -(N-1), ..., -1, 0, 1, ..., N-1, N}), the semantic point cloud set P_mj of category j is first obtained through step 5, and these point clouds are then converted into the world coordinate system with the following conversion formula:
wherein the converted coordinates denote a point of the point cloud set of the j-th label category of the m-th frame. By means of formulas (17) and (18), the point cloud set of the j-th label of the m-th frame can be converted into a point cloud in the unified reference coordinate system:
Converting the semantic point clouds of all frames in [-N, N] into the reference coordinate system through formula (19) yields a dense point cloud, which is then projected onto the BEV canvas to obtain a dense BEV label. Figs. 4-1 and 4-2 show the dense point cloud projection of the drivable road surface area (with N = 5).
7. BEV label post-processing: the map after point cloud projection is refined manually or morphologically to obtain the BEV road surface area label.
The BEV label generated in step 6 contains more or fewer holes, as shown in figs. 4-1 and 4-2, so this patent further applies morphological transformations to the map after point cloud projection to repair the hole regions, and then refines the label again manually, thereby obtaining an accurate road surface BEV label (fig. 5). For moving targets, the four ground points of each 3D bounding box (see figs. 2-3) are projected directly onto the BEV canvas of the reference coordinate system (fig. 5). Finally, the BEV road surface label and the BEV moving targets are fused to obtain the accurate BEV road surface area label (fig. 5). The whole algorithm flow of the present application is summarized in the flow chart shown in fig. 6.
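The hole repair and fusion described above can be approximated with standard image morphology; the following minimal OpenCV sketch illustrates the idea. The kernel size, label values and the bounding-box ground-point format are assumptions and do not reproduce the patent's exact manual refine procedure.

```python
import cv2
import numpy as np

def fill_holes(bev_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Morphological closing to repair small hole regions left by the sparse point projection."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.morphologyEx(bev_label, cv2.MORPH_CLOSE, kernel)

def draw_moving_target(bev_label, ground_cells, label_value):
    """Fill the quadrilateral formed by the four ground points of a 3D bounding box,
    already converted to BEV cell coordinates (col, row), on the BEV canvas."""
    poly = np.asarray(ground_cells, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(bev_label, [poly], int(label_value))
    return bev_label

# Usage: close holes in the road-surface label, then draw a vehicle box on top (assumed label 3).
road_bev = np.zeros((1000, 1000), dtype=np.uint8)
road_bev[400:600, 450:470] = 1
fused = fill_holes(road_bev)
fused = draw_moving_target(fused, [(480, 500), (500, 500), (500, 540), (480, 540)], 3)
print(int(fused.max()))
```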
Finally, it should be noted that the above-mentioned examples of the present invention are only illustrative and are not intended to limit the embodiments of the present invention. While the invention has been described in detail with reference to preferred embodiments, it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention; it is impossible to exhaustively enumerate all embodiments here. All obvious changes and modifications of the present invention fall within its scope of protection.