Cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation

Technical Field
The invention belongs to the field of deep reinforcement learning for autonomous driving, relates to vehicle safety decision technology, and in particular relates to a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation.
Background
In the field of autonomous driving, it is critical to ensure that vehicles can make safe decisions in a wide variety of driving scenarios, as this directly relates to the life and property safety of drivers and passengers. Conventional autonomous driving systems mostly adopt a modular approach, in which each function, such as perception, prediction, and decision making, is developed separately and then integrated into the system. The most common decision method among modular approaches is the rule-based method, which is often inadequate for the large number of situations that occur while driving. Existing approaches therefore increasingly turn to data-driven learning strategies to realize safety decisions, such as imitation learning and deep reinforcement learning.
The deep-reinforcement-learning-based autonomous driving safety decision method formulates the long-horizon driving task as a Markov decision process: the intelligent vehicle automatically learns a driving strategy under the guidance of a reward function through continuous interaction with the environment, and thus gives an adaptive optimal decision action according to the current state observation. It allows intelligent vehicles to optimize their decision making by trial and error without relying on manually designed rules or human driving data. Current deep reinforcement learning methods for autonomous driving fall mainly into two types, end-to-end methods and decoupled methods. The end-to-end approach directly learns the mapping from raw sensor data to control commands. Since sensor data are typically complex, high-dimensional, and contain interference and redundant information, a deeper network is required to learn a good driving strategy; however, the gradients produced by deep reinforcement learning are often insufficient to effectively train deep neural networks, making the training process difficult. Decoupled methods, in contrast, generally divide the autonomous driving system into two parts, perception and decision making. The perception part uses supervised learning to train a deep network that understands the environment and generates an intermediate representation, and the decision part uses reinforcement learning to train a shallower network that learns the driving strategy from this intermediate representation. There are two main variants. One trains a perception model to map the original observations directly to custom perception results that serve as the reinforcement learning state, such as driving-related perception indicators (the angle of the car relative to the road, the distance to lane markers, etc.) or semantic segmentation masks describing the whole scene. The other uses auxiliary task heads to train a perception encoder that extracts driving-related latent features from the original observations, which are then fed to a reinforcement learning network to decode the optimal driving strategy. In the former, perception and decision communicate through hand-defined results, so error accumulation and information loss easily arise as the sequential process proceeds, limiting decision performance. The latter alleviates this problem by passing a feature representation and has therefore received a great deal of attention.
Currently, methods that use latent features as the reinforcement learning state still lack safety in high-traffic-density scenarios involving large numbers of dynamic objects, especially in rare emergencies. Many factors contribute to this safety problem, two of which are: 1) the lack of overall scene perception. A single sensor typically cannot provide enough information to perceive a driving scene; an image-only approach cannot provide accurate 3D information about the scene, and a lidar-only approach cannot provide semantic information. 2) The lack of temporal information about the traffic scene. For a dynamic driving scenario, not only should spatial information be captured, but the dynamic variation between consecutive inputs should also be captured. In addition, predicting the future behavior of surrounding traffic participants is also critical for an autonomous car to make safe and reliable decisions.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation, in which the perception and prediction encoders consider not only spatial information but also temporal information, so that a comprehensive understanding of the dynamic scene is achieved, driving safety is improved, and the method better meets the requirements of practical applications.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation first constructs a multi-modal spatio-temporal perception encoder that jointly models spatial and motion information from multi-modal continuous inputs to obtain the current perception representation of the dynamic driving scene; it then introduces a future prediction encoder that captures the interactions among different traffic participants from the current perception representation to obtain a future prediction representation; the current perception representation and the future prediction representation are then concatenated to form the multi-modal spatio-temporal representation, which serves as the state input of reinforcement learning so as to grasp the scene comprehensively; finally, combined with a distributed PPO algorithm, the safety decision task is realized under the guidance of a reward function designed for safety decisions. The method specifically comprises the following steps:
S1, acquiring and preprocessing raw sensor data: first, the lidar point clouds of the past five frames are re-aligned to the vehicle coordinate system at the current moment; then, the point clouds of the six consecutive frames are voxelized into 2D BEV grids with fixed resolution; finally, the grids are concatenated to obtain a six-channel lidar BEV projection pseudo-image;
S2, a multi-modal spatio-temporal perception encoder is trained under the supervision of multi-task heads combining spatial perception and motion perception, and the current perception representation is extracted from the single-frame forward-looking RGB color image and the six consecutive frames of lidar BEV projection pseudo-images;
S3, a future prediction encoder is trained to learn the correlations among the traffic participants captured in the lidar BEV feature f_lidar_fusion output by the multi-modal spatio-temporal perception encoder, and the future prediction representation of the dynamic driving scene is obtained, wherein the network of the future prediction encoder consists of a position attention network, a channel attention network, an attention fusion network and a future prediction task head network;
S4, after the supervised training of steps S2 and S3 is completed, a reward function is designed and a deep reinforcement learning decision model is trained with a distributed PPO reinforcement learning algorithm; the optimal safety decision strategy is learned from the multi-modal spatio-temporal representation (consisting of the lidar BEV feature and the future prediction feature) together with the speed data and the deviation distance and deviation angle data relative to the lane center.
Further, the image feature extraction backbone network performs feature extraction through the four residual convolution blocks of a ResNet-34 network pre-trained on ImageNet to obtain four image features carrying information at different levels, one for each feature extraction stage s_i.
Further, the lidar BEV feature extraction backbone network takes the six-channel lidar BEV projection pseudo-image of six consecutive frames as input, and performs feature extraction through the four spatio-temporal convolution blocks of a VideoResNet-18 network with a spatio-temporal convolution structure to obtain four lidar BEV features carrying information at different levels, one for each feature extraction stage s_i.
Further, the multi-modal feature fusion network fuses the four image features of different scales with the four lidar BEV features of different scales. During fusion, the image features and lidar BEV features extracted by the two backbone branches are first reshaped and concatenated to obtain a sequence vector; this sequence is then passed through a multi-modal fusion Transformer module to realize full information interaction between the features of the different modalities, yielding global spatio-temporal context features of the 3D scene; finally, the output is sliced, restored to the same dimensions as the corresponding input features, and added element-wise to them, giving the fused image features and the fused lidar BEV features.
Further, in order to capture the interrelations between the different branch tasks and enhance the feature expression in each branch, the image branch and the lidar BEV branch are supervised with different multi-task heads. The image branch consists of two task heads, H_dep and H_sem, for depth estimation and semantic segmentation of the forward-looking image, respectively; cross-entropy loss supervises the forward-looking semantic segmentation and L1 loss supervises the forward-looking depth estimation task. The BEV point cloud branch consists of task heads H_bev, H_v and H_bb for BEV semantic segmentation, surrounding-vehicle speed prediction and 2D object detection, respectively; cross-entropy loss supervises the BEV semantic segmentation, L2 loss supervises the surrounding-vehicle speed prediction, and the 2D object detection uses a CenterNet decoder to locate the other traffic participants in the scene.
Further, in step S3, the position attention network feeds the multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion into three convolution layers to obtain three feature maps with the same dimensions as the original, and then reshapes them into three two-dimensional features of equal size; matrix multiplication is performed between the transpose of the first reshaped feature and the second reshaped feature, and a SoftMax layer computes the spatial attention map s_lo; matrix multiplication is then performed between s_lo and the transpose of the third reshaped feature to capture the spatial dependency between any two positions of the feature map, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion; finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the position attention network.
Further, in step S3, the channel attention network reshapes the multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion into a two-dimensional feature; matrix multiplication is performed between this feature and its transpose and a SoftMax layer is applied to obtain the channel attention map s_ch; matrix multiplication is then performed between s_ch and the reshaped f_lidar_fusion to capture the dependency between any two channels, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion; finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the channel attention network.
Further, in step S3, the attention fusion network combines the outputs of the position attention network and the channel attention network by element-wise addition and a convolution operation to obtain the future prediction feature f_future, and the future prediction task head network decodes from f_future the scene state 0.5 seconds into the future, namely the predicted positions and speeds of the other traffic participants in the current scene after 0.5 seconds.
Further, in step S4, the reward function includes sparse rewards triggered by predefined events and dense rewards obtained at each time step. The sparse rewards cover six cases: collision, stopping without cause, overspeed, deviation distance greater than a threshold, deviation angle greater than a threshold, and deviation angular velocity greater than a threshold; a penalty is given when any of these conditions is met. The dense rewards include four terms: the deviation distance reward, the deviation angle reward, the angular velocity reward and the speed reward.
Further, in step S4, a multi-branch network structure with separate lateral and longitudinal control is adopted, and an independent action prediction branch is used for each navigation command in a set of high-level navigation commands (turn left at an intersection, turn right at an intersection, go straight at an intersection, and follow the current road); the navigation command acts as a switch selecting which branch is used at each time step, and each branch learns a sub-strategy specific to that navigation command;
The policy network performs channel concatenation on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is then flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the d-dimensional discrete action prediction probability P(s_t) is output;
The value network performs channel concatenation on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is then flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the one-dimensional state value prediction Q(s_t) is output.
The invention has the following characteristics and beneficial effects:
The invention adopts a cascade deep reinforcement learning active safety decision method based on multi-modal spatio-temporal representation, guiding the vehicle to learn active safety decision skills through a reward function. The designed multi-modal spatio-temporal perception encoder jointly models spatial and motion information and provides the current perception representation of the scene for subsequent decisions, ensuring the richness of the information needed for decision making; the future prediction encoder captures the interactions among different traffic participants and provides the future prediction representation for subsequent decisions, improving the decision success rate. Together, the current perception representation and the future prediction representation form a comprehensive understanding of the scene, ensure the safety of the decision process, and provide high environmental adaptability and a high decision success rate. After the decision strategy is learned, the method can accomplish the safety decision task of the intelligent vehicle in dense traffic scenes and emergencies.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is an algorithm flow chart of the cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation in an embodiment of the invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
As shown in FIG. 1, a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation includes the following steps:
Step (1): collect and preprocess the raw sensor data. The forward-looking RGB color image at the current moment, the lidar point cloud, the speed data, and the deviation distance and deviation angle relative to the lane center are collected; the lidar point clouds of the previous five frames are aligned to the vehicle coordinate system at the current moment, the point clouds of the six consecutive frames are voxelized into 2D BEV grids with fixed resolution, and the grids are finally concatenated to obtain the six-channel lidar BEV projection pseudo-image. For the 2D BEV grid, an area of 32 meters in front of and behind the vehicle and 32 meters on each side is considered, and this 64 m × 64 m range is divided into cells of 0.25 m × 0.25 m, giving a grid with a resolution of 256 × 256. Point cloud data with a height of less than 0.2 meters are removed to eliminate the ground plane. The speed data and the deviation distance and deviation angle relative to the lane center are normalized.
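As a concrete illustration of this preprocessing step, the following is a minimal sketch assuming the point clouds are given as NumPy arrays together with the 4×4 transforms that re-align each past frame into the current vehicle frame; the function name and the binary occupancy encoding are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def preprocess_lidar_frames(point_clouds, transforms_to_current,
                            grid_range=32.0, cell_size=0.25, ground_z=0.2):
    """Align six consecutive lidar frames to the current ego frame and
    voxelize each into a 256x256 BEV occupancy channel (hypothetical helper,
    not from the patent text; shapes and conventions are assumptions)."""
    channels = []
    for points, T in zip(point_clouds, transforms_to_current):
        # points: (N, 3) xyz in that frame's coordinates; T: (4, 4) pose
        # that re-aligns the past frame into the current vehicle frame.
        homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
        aligned = (homo @ T.T)[:, :3]
        # Remove near-ground returns (below 0.2 m) to drop the road plane.
        aligned = aligned[aligned[:, 2] >= ground_z]
        # Keep points within +/-32 m and rasterize into 0.25 m cells.
        mask = (np.abs(aligned[:, 0]) < grid_range) & (np.abs(aligned[:, 1]) < grid_range)
        xy = aligned[mask, :2]
        idx = ((xy + grid_range) / cell_size).astype(int)
        n_cells = int(2 * grid_range / cell_size)          # 256
        bev = np.zeros((n_cells, n_cells), dtype=np.float32)
        bev[idx[:, 1], idx[:, 0]] = 1.0                    # occupancy channel
        channels.append(bev)
    # Concatenate the six frames into a six-channel BEV pseudo-image.
    return np.stack(channels, axis=0)                      # (6, 256, 256)
```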
Step (2): train the multi-modal spatio-temporal perception encoder under the supervision of multi-task heads combining spatial perception and motion perception, and extract the current perception representation from the single-frame forward-looking RGB image and the six consecutive frames of lidar BEV projection pseudo-images. The multi-modal spatio-temporal perception encoder network consists of an image feature extraction backbone, a lidar BEV feature extraction backbone, a multi-modal feature fusion network and a multi-task head network. During extraction, the forward-looking RGB image of the current frame is fed to the image feature extraction backbone, and features are extracted through the four residual convolution blocks of a ResNet-34 network pre-trained on ImageNet, yielding four image features carrying information at different levels, one for each feature extraction stage s_i. The six-channel lidar BEV projection pseudo-image of the six consecutive frames is fed to the lidar BEV feature extraction backbone, and features are extracted through the four spatio-temporal convolution blocks of a VideoResNet-18 network with a spatio-temporal convolution structure, yielding four lidar BEV features carrying information at different levels, one for each feature extraction stage s_i. The four image features of different scales and the four lidar BEV features of different scales are then fused through the multi-modal feature fusion network, and after four rounds of fusion the final image feature f_rgb_fusion and the final lidar BEV feature f_lidar_fusion are obtained. Taking the s_i-th stage as an example, the image features and lidar BEV features extracted by the two backbone branches are first reshaped and concatenated to obtain a sequence vector; this sequence is then passed through a multi-modal fusion Transformer module to realize full information interaction between the features of the different modalities, yielding the global spatio-temporal context features of the 3D scene; finally, the output is sliced, restored to the same dimensions as the corresponding input features, and added element-wise to them, giving the fused image features and the fused lidar BEV features.
The multi-modal fusion Transformer operates as follows:

f_p = f_seq + e_pos
f_m = f_p + MHA(LN1(f_p))
f_out = f_m + FFN(LN2(f_m))

where f_seq is the input sequence vector, e_pos is a learnable position embedding of the same dimension as f_seq, LN1(·) and LN2(·) denote Layer Normalization operations, MHA(·) denotes the multi-head self-attention layer, and FFN(·) denotes the feed-forward neural network layer.

In the multi-modal fusion Transformer module, the sequence vector f_seq is first added element-wise to the learnable position embedding e_pos of the same dimension to obtain the position-encoded sequence vector f_p; f_p is passed through the normalization layer LN1 and the multi-head self-attention layer MHA and added element-wise to f_p, realizing attention interaction between the features of the two modalities and giving f_m; f_m is then passed through the normalization layer LN2 and the feed-forward network layer FFN and added element-wise to f_m to obtain the final global spatio-temporal context feature output f_out.
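The block described above can be sketched in PyTorch as follows; this is a minimal illustration assuming a fixed token sequence length and embedding dimension, and the class and symbol names are assumptions rather than the patent's own notation.

```python
import torch
import torch.nn as nn

class MultiModalFusionBlock(nn.Module):
    """Sketch of the multi-modal fusion Transformer step described above:
    learnable position embedding, pre-norm multi-head self-attention and a
    feed-forward layer, each with a residual connection. Dimensions are
    illustrative assumptions, not values from the patent."""
    def __init__(self, seq_len, dim, num_heads=4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, dim))  # learnable position embedding
        self.ln1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, f_seq):
        # f_seq: concatenated image + lidar BEV token sequence, (B, seq_len, dim)
        f_p = f_seq + self.pos_embed                 # position-encoded sequence
        h = self.ln1(f_p)
        attn_out, _ = self.mha(h, h, h)              # attention interaction across both modalities
        f_m = f_p + attn_out                         # residual connection
        f_out = f_m + self.ffn(self.ln2(f_m))        # feed-forward + residual
        return f_out                                 # global spatio-temporal context features
```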
In this embodiment, the image branch and the lidar BEV branch use different multi-task heads for supervised training, capturing the correlations between the different branch tasks and enhancing the feature expression in each branch. The image branch consists of two task heads, H_dep and H_sem, for depth estimation and semantic segmentation of the forward-looking image, respectively; cross-entropy loss is used for the forward-looking semantic segmentation and L1 loss supervises the forward-looking depth estimation task. The BEV point cloud branch consists of three task heads, H_bev, H_v and H_bb, for BEV semantic segmentation, surrounding-vehicle speed prediction and 2D object detection, respectively; cross-entropy loss is used for the BEV semantic segmentation, L2 loss supervises the surrounding-vehicle speed prediction, and a CenterNet decoder is used in the 2D object detection to locate the other traffic participants in the scene.
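A hedged sketch of how the supervision terms named above could be combined into one training loss; the weighting scheme, the dictionary keys and the omission of the CenterNet detection term are assumptions, not specified by the patent.

```python
import torch
import torch.nn.functional as F

def perception_multitask_loss(pred, target, weights=None):
    """Illustrative combination of the supervision terms named above
    (cross-entropy for the two semantic segmentations, L1 for depth,
    L2 for surrounding-vehicle speed). The loss of the CenterNet-style
    detection head is omitted; all weights are assumptions."""
    weights = weights or {"sem": 1.0, "dep": 1.0, "bev": 1.0, "vel": 1.0}
    loss = (
        weights["sem"] * F.cross_entropy(pred["sem_logits"], target["sem_labels"])
        + weights["dep"] * F.l1_loss(pred["depth"], target["depth"])
        + weights["bev"] * F.cross_entropy(pred["bev_logits"], target["bev_labels"])
        + weights["vel"] * F.mse_loss(pred["speed"], target["speed"])
    )
    return loss
```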
Step (3): train the future prediction encoder to learn the correlations among the traffic participants captured in the lidar BEV feature f_lidar_fusion output by the multi-modal spatio-temporal perception encoder, and obtain the future prediction representation of the dynamic driving scene. The future prediction encoder network consists of a position attention network, a channel attention network, an attention fusion network and a future prediction task head network, wherein:
(I) Position attention network
The multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion is fed into three convolution layers to obtain three feature maps with the same dimensions as the original, which are then reshaped into three two-dimensional features of equal size. Matrix multiplication is performed between the transpose of the first reshaped feature and the second reshaped feature, and a SoftMax layer computes the spatial attention map s_lo. Matrix multiplication is then performed between s_lo and the transpose of the third reshaped feature to capture the spatial dependency between any two positions of the feature map, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion. Finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the position attention network.
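A minimal PyTorch sketch of the position attention computation described above, in the spirit of DANet-style spatial attention; the channel counts and the absence of a learnable scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the position attention step described above; the three
    1x1 convolutions keep the original channel count, as the text states."""
    def __init__(self, channels):
        super().__init__()
        self.query_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.key_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.value_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, f):                           # f: fused lidar BEV feature (B, C, H, W)
        b, c, h, w = f.shape
        q = self.query_conv(f).view(b, c, h * w)    # three same-size maps, reshaped to 2D
        k = self.key_conv(f).view(b, c, h * w)
        v = self.value_conv(f).view(b, c, h * w)
        s_lo = self.softmax(torch.bmm(q.transpose(1, 2), k))     # (B, HW, HW) spatial attention map
        out = torch.bmm(v, s_lo.transpose(1, 2)).view(b, c, h, w)
        return out + f                               # residual addition with the input feature
```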
(II) Channel attention network
The multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion is reshaped into a two-dimensional feature. Matrix multiplication is performed between this feature and its transpose and a SoftMax layer is applied to obtain the channel attention map s_ch. Matrix multiplication is then performed between s_ch and the reshaped f_lidar_fusion to capture the dependency between any two channels, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion. Finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the channel attention network.
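Likewise, a minimal sketch of the channel attention computation; the exact transpose convention is an assumption chosen so that the matrix shapes are consistent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention step described above; no extra
    convolutions are used, and the result is added back to the input."""
    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, f):                           # f: fused lidar BEV feature (B, C, H, W)
        b, c, h, w = f.shape
        flat = f.view(b, c, h * w)                  # reshape to a 2D (channel x position) feature
        s_ch = self.softmax(torch.bmm(flat, flat.transpose(1, 2)))   # (B, C, C) channel attention map
        out = torch.bmm(s_ch, flat).view(b, c, h, w)                 # channel-wise dependencies
        return out + f                              # residual addition with the input feature
```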
(III) Attention fusion network
To further enhance the feature representation, the outputs of the position attention network and the channel attention network are combined by element-wise addition followed by a convolution operation to obtain the future prediction feature f_future.
(IV) Future prediction task head network
The future prediction task head decodes from the future prediction feature f_future the scene state 0.5 seconds into the future, namely the predicted positions and speeds of the other traffic participants in the current scene after 0.5 seconds.
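A sketch tying these pieces together, reusing the PositionAttention and ChannelAttention classes from the sketches above; the pooling-based prediction head and its output format are assumptions, since the patent does not specify the head architecture.

```python
import torch
import torch.nn as nn

class FuturePredictionEncoder(nn.Module):
    """Sketch of the attention fusion and future prediction head: the position
    and channel attention outputs are added element-wise and passed through a
    convolution to give f_future, from which a small head predicts the scene
    state 0.5 s ahead. Layer sizes and output format are assumptions."""
    def __init__(self, channels, num_outputs):
        super().__init__()
        self.position_attention = PositionAttention(channels)
        self.channel_attention = ChannelAttention()
        self.fuse_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.prediction_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_outputs))

    def forward(self, f_lidar_fusion):
        f_lo = self.position_attention(f_lidar_fusion)
        f_ch = self.channel_attention(f_lidar_fusion)
        f_future = self.fuse_conv(f_lo + f_ch)          # element-wise addition + convolution
        future_state = self.prediction_head(f_future)   # positions/speeds of other agents after 0.5 s
        return f_future, future_state
```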
Step (4): after the supervised training of steps (2) and (3) is completed, design the reward function and train the deep reinforcement learning decision model with a distributed PPO reinforcement learning algorithm, learning the optimal safety decision strategy from the multi-modal spatio-temporal representation (consisting of the lidar BEV feature and the future prediction feature) together with the speed data and the deviation distance and deviation angle data relative to the lane center.
In this embodiment, the reward function includes sparse rewards triggered by predefined events and dense rewards obtained at each time step. The sparse rewards cover six cases, namely collision, stopping without cause, overspeed, deviation distance greater than a threshold, deviation angle greater than a threshold, and deviation angular velocity greater than a threshold; a penalty is given when any of these conditions is met. The dense rewards include four terms, namely the deviation distance reward, the deviation angle reward, the angular velocity reward and the speed reward. Wherein:
The deviation distance reward is calculated as follows:
r_d = max(k_d × (1 − d), k_d × (1 − d_max))
where d is the deviation distance from the road center, d_max is the maximum threshold, set to 2, and k_d is the scaling coefficient.
The deviation angle reward is calculated as follows:
r_θ = max(k_θ × (30 − θ), k_θ × (30 − θ_max))
where θ is the route deviation angle, θ_max is the maximum threshold, set to 60, and k_θ is the scaling coefficient.
The angular velocity reward is calculated as follows:
r_ω = max(k_ω × (10 − ω), k_ω × (10 − ω_max))
where ω is the ego-vehicle angular velocity, ω_max is the maximum threshold, set to 20, and k_ω is the scaling coefficient.
The speed reward is calculated as follows:
where v is the speed of the ego vehicle, k_v is the scaling coefficient, v_target is the target speed, and range_v is the allowed speed range; v_target and range_v are calculated from the distance l to the obstacle ahead.
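A hedged sketch assembling the dense reward terms from the formulas above; the k_* scale coefficients are left as free parameters in the patent, and the speed term is only an assumed placeholder because its formula is not reproduced in the text.

```python
def dense_reward(d, theta, omega, v, v_target, range_v,
                 k_d=1.0, k_theta=0.01, k_omega=0.05, k_v=1.0,
                 d_max=2.0, theta_max=60.0, omega_max=20.0):
    """Dense per-timestep reward assembled from the formulas above.
    The scale coefficients k_* are assumptions, and the speed term is a
    placeholder: its formula is not given in the text."""
    r_d = max(k_d * (1 - d), k_d * (1 - d_max))                        # deviation distance reward
    r_theta = max(k_theta * (30 - theta), k_theta * (30 - theta_max))  # deviation angle reward
    r_omega = max(k_omega * (10 - omega), k_omega * (10 - omega_max))  # angular velocity reward
    # Speed reward (assumed shape only): reward staying close to v_target
    # within the allowed range; the patent derives v_target and range_v
    # from the distance to the obstacle ahead.
    r_v = k_v * max(0.0, 1.0 - abs(v - v_target) / max(range_v, 1e-6))
    return r_d + r_theta + r_omega + r_v
```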
In this embodiment, the reinforcement learning network adopts a multi-branch structure with separate lateral and longitudinal control, and an independent action prediction branch is used for each navigation command in a set of high-level navigation commands (turn left at an intersection, turn right at an intersection, go straight at an intersection, and follow the current road); the navigation command acts as a switch selecting which branch is used at each time step, and each branch learns a sub-strategy specific to that navigation command. Within each branch, considering that motion control of the vehicle involves two relatively independent operations, namely lateral control and longitudinal control, two identical sub-branches are designed to handle lateral and longitudinal control respectively, and each lateral and longitudinal control network consists of a policy network and a value network.
(I) Policy network
Channel concatenation is performed on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the d-dimensional discrete action prediction probability P(s_t) is output. For the lateral action space, the steering space is uniformly discretized into 33 bins, so d is 33, and the action a_st with the highest probability is executed; for the longitudinal action space, the throttle and brake spaces are combined and discretized into 3 different actions corresponding to acceleration, idling and deceleration, so d is 3, and the action a_lo with the highest probability is executed.
(II) Value network
Channel concatenation is performed on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the one-dimensional state value prediction Q(s_t) is output. The state value output by the lateral control network is denoted v_st, and the state value output by the longitudinal control network is denoted v_lo.
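A minimal sketch of one lateral or longitudinal sub-branch with its policy and value heads; the channel counts, hidden sizes and strides are assumptions. In the full method there are four such command-specific branches, each with a 33-way lateral head and a 3-way longitudinal head, and the navigation command selects which branch is active at each time step.

```python
import torch
import torch.nn as nn

class BranchHead(nn.Module):
    """Sketch of one lateral/longitudinal branch: four convolutions over the
    concatenated [f_lidar_fusion, f_future] map, flatten, concatenate the
    measurement vector (speed, deviation distance, deviation angle), then
    three fully connected layers for P(s_t) and Q(s_t). Sizes are assumptions."""
    def __init__(self, in_channels, feat_hw, meas_dim=3, num_actions=33, hidden=256):
        super().__init__()
        convs, c = [], in_channels
        for out_c in (64, 64, 128, 128):                      # four convolution layers
            convs += [nn.Conv2d(c, out_c, 3, stride=2, padding=1), nn.ReLU()]
            c = out_c
        self.convs = nn.Sequential(*convs)
        flat_dim = 128 * (feat_hw // 16) ** 2
        self.policy = nn.Sequential(                          # -> P(s_t): discrete action probabilities
            nn.Linear(flat_dim + meas_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1))
        self.value = nn.Sequential(                           # -> Q(s_t): one-dimensional state value
            nn.Linear(flat_dim + meas_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, f_lidar_fusion, f_future, measurements):
        f = torch.cat([f_lidar_fusion, f_future], dim=1)      # channel concatenation
        x = self.convs(f).flatten(1)
        x = torch.cat([x, measurements], dim=1)               # append the vehicle measurement vector
        return self.policy(x), self.value(x)
```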
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.