Cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation

Technical Field
The invention belongs to the field of deep reinforcement learning for autonomous driving, relates to vehicle safety decision technology, and in particular relates to a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation.
Background
In the field of autonomous driving, it is critical to ensure that vehicles can make safe decisions in a wide variety of driving scenarios, as this directly relates to the life and property safety of drivers and passengers. Conventional autonomous driving systems mostly adopt a modular approach, in which each function, such as perception, prediction, and decision making, is developed separately and then integrated into the system. The most common decision method among modular approaches is the rule-based method, which is often inadequate for the large number of situations that occur while driving. Existing approaches therefore increasingly turn to data-driven learning strategies to realize safety decisions, such as imitation learning and deep reinforcement learning.
The deep-reinforcement-learning-based autonomous driving safety decision method formulates the long-horizon driving task as a Markov decision process: the intelligent vehicle automatically learns a driving strategy under the guidance of a reward function through continuous interaction with the environment, and thus gives an adaptive optimal decision action according to the current state observation. It allows intelligent vehicles to optimize their decision making by trial and error without relying on manually designed rules or human driving data. Current deep reinforcement learning methods for autonomous driving fall mainly into two types, end-to-end methods and decoupled methods. The end-to-end approach directly learns the mapping from raw sensor data to control commands. Since sensor data are typically complex, high-dimensional, and contain interference and redundant information, a deeper network is required to learn a good driving strategy; however, the gradients produced by deep reinforcement learning are often insufficient to effectively train deep neural networks, making the training process difficult. Decoupled methods, in contrast, generally divide the autonomous driving system into two parts, perception and decision making. The perception part uses supervised learning to train a deep network that understands the environment and generates an intermediate representation, and the decision part uses reinforcement learning to train a shallower network that learns the driving strategy from this intermediate representation. There are two main variants. One trains a perception model to map the original observations directly to custom perception results that serve as the reinforcement learning state, such as driving-related perception indicators (the angle of the car relative to the road, the distance to lane markers, etc.) or semantic segmentation masks describing the whole scene. The other uses auxiliary task heads to train a perception encoder that extracts driving-related latent features from the original observations, which are then fed to a reinforcement learning network to decode the optimal driving strategy. In the former, perception and decision communicate through hand-defined results, so error accumulation and information loss easily arise as the sequential process proceeds, limiting decision performance. The latter alleviates this problem by passing a feature representation and has therefore received a great deal of attention.
Currently, methods that use latent features as the reinforcement learning state still lack safety in high-traffic-density scenarios involving large numbers of dynamic objects, especially in rare emergencies. Many factors contribute to this safety problem, two of which are: 1) the lack of overall scene perception. A single sensor typically cannot provide enough information to perceive a driving scene; an image-only approach cannot provide accurate 3D information about the scene, and a lidar-only approach cannot provide semantic information. 2) The lack of temporal information about the traffic scene. For a dynamic driving scenario, not only should spatial information be captured, but the dynamic variation between consecutive inputs should also be captured. In addition, predicting the future behavior of surrounding traffic participants is also critical for an autonomous car to make safe and reliable decisions.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation, in which the perception and prediction encoders consider not only spatial information but also temporal information, so that a comprehensive understanding of the dynamic scene is achieved, driving safety is improved, and the method better meets the requirements of practical applications.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation first constructs a multi-modal spatio-temporal perception encoder that jointly models spatial and motion information from multi-modal continuous inputs to obtain the current perception representation of the dynamic driving scene; it then introduces a future prediction encoder that captures the interactions among different traffic participants from the current perception representation to obtain a future prediction representation; the current perception representation and the future prediction representation are then concatenated to form the multi-modal spatio-temporal representation, which serves as the state input of reinforcement learning so as to grasp the scene comprehensively; finally, combined with a distributed PPO algorithm, the safety decision task is realized under the guidance of a reward function designed for safety decisions. The method specifically comprises the following steps:
S1, acquiring and preprocessing raw sensor data: first, the lidar point clouds of the past five frames are re-aligned to the vehicle coordinate system at the current moment; then, the point clouds of the six consecutive frames are voxelized into 2D BEV grids with fixed resolution; finally, the grids are concatenated to obtain a six-channel lidar BEV projection pseudo-image;
S2, a multi-modal spatio-temporal perception encoder is trained under the supervision of multi-task heads combining spatial perception and motion perception, and the current perception representation is extracted from the single-frame forward-looking RGB color image and the six consecutive frames of lidar BEV projection pseudo-images;
S3, a future prediction encoder is trained to learn the correlations among the traffic participants captured in the lidar BEV feature f_lidar_fusion output by the multi-modal spatio-temporal perception encoder, and the future prediction representation of the dynamic driving scene is obtained, wherein the network of the future prediction encoder consists of a position attention network, a channel attention network, an attention fusion network and a future prediction task head network;
S4, after the supervised training of steps S2 and S3 is completed, a reward function is designed and a deep reinforcement learning decision model is trained with a distributed PPO reinforcement learning algorithm; the optimal safety decision strategy is learned from the multi-modal spatio-temporal representation (consisting of the lidar BEV feature and the future prediction feature) together with the speed data and the deviation distance and deviation angle data relative to the lane center.
Further, the image feature extraction backbone network performs feature extraction through the four residual convolution blocks of a ResNet-34 network pre-trained on ImageNet to obtain four image features carrying information at different levels, one for each feature extraction stage s_i.
Further, the lidar BEV feature extraction backbone network takes the six-channel lidar BEV projection pseudo-image of six consecutive frames as input, and performs feature extraction through the four spatio-temporal convolution blocks of a VideoResNet-18 network with a spatio-temporal convolution structure to obtain four lidar BEV features carrying information at different levels, one for each feature extraction stage s_i.
Further, the multi-modal feature fusion network fuses the four image features of different scales with the four lidar BEV features of different scales. During fusion, the image features and lidar BEV features extracted by the two backbone branches are first reshaped and concatenated to obtain a sequence vector; this sequence is then passed through a multi-modal fusion Transformer module to realize full information interaction between the features of the different modalities, yielding global spatio-temporal context features of the 3D scene; finally, the output is sliced, restored to the same dimensions as the corresponding input features, and added element-wise to them, giving the fused image features and the fused lidar BEV features.
Further, in order to capture the interrelations between the different branch tasks and enhance the feature expression in each branch, the image branch and the lidar BEV branch are supervised with different multi-task heads. The image branch consists of two task heads, H_dep and H_sem, for depth estimation and semantic segmentation of the forward-looking image, respectively; cross-entropy loss supervises the forward-looking semantic segmentation and L1 loss supervises the forward-looking depth estimation task. The BEV point cloud branch consists of task heads H_bev, H_v and H_bb for BEV semantic segmentation, surrounding-vehicle speed prediction and 2D object detection, respectively; cross-entropy loss supervises the BEV semantic segmentation, L2 loss supervises the surrounding-vehicle speed prediction, and the 2D object detection uses a CenterNet decoder to locate the other traffic participants in the scene.
Further, in step S3, the position attention network feeds the multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion into three convolution layers to obtain three feature maps with the same dimensions as the original, and then reshapes them into three two-dimensional features of equal size; matrix multiplication is performed between the transpose of the first reshaped feature and the second reshaped feature, and a SoftMax layer computes the spatial attention map s_lo; matrix multiplication is then performed between s_lo and the transpose of the third reshaped feature to capture the spatial dependency between any two positions of the feature map, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion; finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the position attention network.
Further, in step S3, the channel attention network reshapes the multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion into a two-dimensional feature; matrix multiplication is performed between this feature and its transpose and a SoftMax layer is applied to obtain the channel attention map s_ch; matrix multiplication is then performed between s_ch and the reshaped f_lidar_fusion to capture the dependency between any two channels, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion; finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the channel attention network.
Further, in step S3, the attention fusion network combines the outputs of the position attention network and the channel attention network by element-wise addition and a convolution operation to obtain the future prediction feature f_future, and the future prediction task head network decodes from f_future the scene state 0.5 seconds into the future, namely the predicted positions and speeds of the other traffic participants in the current scene after 0.5 seconds.
Further, in step S4, the reward function includes sparse rewards triggered by predefined events and dense rewards obtained at each time step. The sparse rewards cover six cases: collision, stopping without cause, overspeed, deviation distance greater than a threshold, deviation angle greater than a threshold, and deviation angular velocity greater than a threshold; a penalty is given when any of these conditions is met. The dense rewards include four terms: the deviation distance reward, the deviation angle reward, the angular velocity reward and the speed reward.
Further, in step S4, a multi-branch network structure with separate lateral and longitudinal control is adopted, and an independent action prediction branch is used for each navigation command in a set of high-level navigation commands (turn left at an intersection, turn right at an intersection, go straight at an intersection, and follow the current road); the navigation command acts as a switch selecting which branch is used at each time step, and each branch learns a sub-strategy specific to that navigation command;
The policy network performs channel concatenation on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is then flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the d-dimensional discrete action prediction probability P(s_t) is output;
The value network performs channel concatenation on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is then flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the one-dimensional state value prediction Q(s_t) is output.
The invention has the following characteristics and beneficial effects:
The invention adopts a cascade deep reinforcement learning active safety decision method based on multi-modal spatio-temporal representation, guiding the vehicle to learn active safety decision skills through a reward function. The designed multi-modal spatio-temporal perception encoder jointly models spatial and motion information and provides the current perception representation of the scene for subsequent decisions, ensuring the richness of the information needed for decision making; the future prediction encoder captures the interactions among different traffic participants and provides the future prediction representation for subsequent decisions, improving the decision success rate. Together, the current perception representation and the future prediction representation form a comprehensive understanding of the scene, ensure the safety of the decision process, and provide high environmental adaptability and a high decision success rate. After the decision strategy is learned, the method can accomplish the safety decision task of the intelligent vehicle in dense traffic scenes and emergencies.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is an algorithm flow chart of the cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation in an embodiment of the invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
As shown in FIG. 1, a cascade deep reinforcement learning safety decision method based on multi-modal spatio-temporal representation includes the following steps:
Step (1): collect and preprocess the raw sensor data. The forward-looking RGB color image at the current moment, the lidar point cloud, the speed data, and the deviation distance and deviation angle relative to the lane center are collected; the lidar point clouds of the previous five frames are aligned to the vehicle coordinate system at the current moment, the point clouds of the six consecutive frames are voxelized into 2D BEV grids with fixed resolution, and the grids are finally concatenated to obtain the six-channel lidar BEV projection pseudo-image. For the 2D BEV grid, an area of 32 meters in front of and behind the vehicle and 32 meters on each side is considered, and this 64 m × 64 m range is divided into cells of 0.25 m × 0.25 m, giving a grid with a resolution of 256 × 256. Point cloud data with a height of less than 0.2 meters are removed to eliminate the ground plane. The speed data and the deviation distance and deviation angle relative to the lane center are normalized.
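As a concrete illustration of this preprocessing step, the following is a minimal sketch assuming the point clouds are given as NumPy arrays together with the 4×4 transforms that re-align each past frame into the current vehicle frame; the function name and the binary occupancy encoding are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def preprocess_lidar_frames(point_clouds, transforms_to_current,
                            grid_range=32.0, cell_size=0.25, ground_z=0.2):
    """Align six consecutive lidar frames to the current ego frame and
    voxelize each into a 256x256 BEV occupancy channel (hypothetical helper,
    not from the patent text; shapes and conventions are assumptions)."""
    channels = []
    for points, T in zip(point_clouds, transforms_to_current):
        # points: (N, 3) xyz in that frame's coordinates; T: (4, 4) pose
        # that re-aligns the past frame into the current vehicle frame.
        homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
        aligned = (homo @ T.T)[:, :3]
        # Remove near-ground returns (below 0.2 m) to drop the road plane.
        aligned = aligned[aligned[:, 2] >= ground_z]
        # Keep points within +/-32 m and rasterize into 0.25 m cells.
        mask = (np.abs(aligned[:, 0]) < grid_range) & (np.abs(aligned[:, 1]) < grid_range)
        xy = aligned[mask, :2]
        idx = ((xy + grid_range) / cell_size).astype(int)
        n_cells = int(2 * grid_range / cell_size)          # 256
        bev = np.zeros((n_cells, n_cells), dtype=np.float32)
        bev[idx[:, 1], idx[:, 0]] = 1.0                    # occupancy channel
        channels.append(bev)
    # Concatenate the six frames into a six-channel BEV pseudo-image.
    return np.stack(channels, axis=0)                      # (6, 256, 256)
```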
Step (2): train the multi-modal spatio-temporal perception encoder under the supervision of multi-task heads combining spatial perception and motion perception, and extract the current perception representation from the single-frame forward-looking RGB image and the six consecutive frames of lidar BEV projection pseudo-images. The multi-modal spatio-temporal perception encoder network consists of an image feature extraction backbone, a lidar BEV feature extraction backbone, a multi-modal feature fusion network and a multi-task head network. During extraction, the forward-looking RGB image of the current frame is fed to the image feature extraction backbone, and features are extracted through the four residual convolution blocks of a ResNet-34 network pre-trained on ImageNet, yielding four image features carrying information at different levels, one for each feature extraction stage s_i. The six-channel lidar BEV projection pseudo-image of the six consecutive frames is fed to the lidar BEV feature extraction backbone, and features are extracted through the four spatio-temporal convolution blocks of a VideoResNet-18 network with a spatio-temporal convolution structure, yielding four lidar BEV features carrying information at different levels, one for each feature extraction stage s_i. The four image features of different scales and the four lidar BEV features of different scales are then fused through the multi-modal feature fusion network, and after four rounds of fusion the final image feature f_rgb_fusion and the final lidar BEV feature f_lidar_fusion are obtained. Taking the s_i-th stage as an example, the image features and lidar BEV features extracted by the two backbone branches are first reshaped and concatenated to obtain a sequence vector; this sequence is then passed through a multi-modal fusion Transformer module to realize full information interaction between the features of the different modalities, yielding the global spatio-temporal context features of the 3D scene; finally, the output is sliced, restored to the same dimensions as the corresponding input features, and added element-wise to them, giving the fused image features and the fused lidar BEV features.
The multi-modal fusion Transformer operates as follows:

f_p = f_seq + e_pos
f_m = f_p + MHA(LN1(f_p))
f_out = f_m + FFN(LN2(f_m))

where f_seq is the input sequence vector, e_pos is a learnable position embedding of the same dimension as f_seq, LN1(·) and LN2(·) denote Layer Normalization operations, MHA(·) denotes the multi-head self-attention layer, and FFN(·) denotes the feed-forward neural network layer.

In the multi-modal fusion Transformer module, the sequence vector f_seq is first added element-wise to the learnable position embedding e_pos of the same dimension to obtain the position-encoded sequence vector f_p; f_p is passed through the normalization layer LN1 and the multi-head self-attention layer MHA and added element-wise to f_p, realizing attention interaction between the features of the two modalities and giving f_m; f_m is then passed through the normalization layer LN2 and the feed-forward network layer FFN and added element-wise to f_m to obtain the final global spatio-temporal context feature output f_out.
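The block described above can be sketched in PyTorch as follows; this is a minimal illustration assuming a fixed token sequence length and embedding dimension, and the class and symbol names are assumptions rather than the patent's own notation.

```python
import torch
import torch.nn as nn

class MultiModalFusionBlock(nn.Module):
    """Sketch of the multi-modal fusion Transformer step described above:
    learnable position embedding, pre-norm multi-head self-attention and a
    feed-forward layer, each with a residual connection. Dimensions are
    illustrative assumptions, not values from the patent."""
    def __init__(self, seq_len, dim, num_heads=4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, dim))  # learnable position embedding
        self.ln1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, f_seq):
        # f_seq: concatenated image + lidar BEV token sequence, (B, seq_len, dim)
        f_p = f_seq + self.pos_embed                 # position-encoded sequence
        h = self.ln1(f_p)
        attn_out, _ = self.mha(h, h, h)              # attention interaction across both modalities
        f_m = f_p + attn_out                         # residual connection
        f_out = f_m + self.ffn(self.ln2(f_m))        # feed-forward + residual
        return f_out                                 # global spatio-temporal context features
```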
In this embodiment, the image branch and the lidar BEV branch use different multi-task heads for supervised training, capturing the correlations between the different branch tasks and enhancing the feature expression in each branch. The image branch consists of two task heads, H_dep and H_sem, for depth estimation and semantic segmentation of the forward-looking image, respectively; cross-entropy loss is used for the forward-looking semantic segmentation and L1 loss supervises the forward-looking depth estimation task. The BEV point cloud branch consists of three task heads, H_bev, H_v and H_bb, for BEV semantic segmentation, surrounding-vehicle speed prediction and 2D object detection, respectively; cross-entropy loss is used for the BEV semantic segmentation, L2 loss supervises the surrounding-vehicle speed prediction, and a CenterNet decoder is used in the 2D object detection to locate the other traffic participants in the scene.
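A hedged sketch of how the supervision terms named above could be combined into one training loss; the weighting scheme, the dictionary keys and the omission of the CenterNet detection term are assumptions, not specified by the patent.

```python
import torch
import torch.nn.functional as F

def perception_multitask_loss(pred, target, weights=None):
    """Illustrative combination of the supervision terms named above
    (cross-entropy for the two semantic segmentations, L1 for depth,
    L2 for surrounding-vehicle speed). The loss of the CenterNet-style
    detection head is omitted; all weights are assumptions."""
    weights = weights or {"sem": 1.0, "dep": 1.0, "bev": 1.0, "vel": 1.0}
    loss = (
        weights["sem"] * F.cross_entropy(pred["sem_logits"], target["sem_labels"])
        + weights["dep"] * F.l1_loss(pred["depth"], target["depth"])
        + weights["bev"] * F.cross_entropy(pred["bev_logits"], target["bev_labels"])
        + weights["vel"] * F.mse_loss(pred["speed"], target["speed"])
    )
    return loss
```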
Step (3): train the future prediction encoder to learn the correlations among the traffic participants captured in the lidar BEV feature f_lidar_fusion output by the multi-modal spatio-temporal perception encoder, and obtain the future prediction representation of the dynamic driving scene. The future prediction encoder network consists of a position attention network, a channel attention network, an attention fusion network and a future prediction task head network, wherein:
(I) Position attention network
The multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion is fed into three convolution layers to obtain three feature maps with the same dimensions as the original, which are then reshaped into three two-dimensional features of equal size. Matrix multiplication is performed between the transpose of the first reshaped feature and the second reshaped feature, and a SoftMax layer computes the spatial attention map s_lo. Matrix multiplication is then performed between s_lo and the transpose of the third reshaped feature to capture the spatial dependency between any two positions of the feature map, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion. Finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the position attention network.
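A minimal PyTorch sketch of the position attention computation described above, in the spirit of DANet-style spatial attention; the channel counts and the absence of a learnable scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the position attention step described above; the three
    1x1 convolutions keep the original channel count, as the text states."""
    def __init__(self, channels):
        super().__init__()
        self.query_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.key_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.value_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, f):                           # f: fused lidar BEV feature (B, C, H, W)
        b, c, h, w = f.shape
        q = self.query_conv(f).view(b, c, h * w)    # three same-size maps, reshaped to 2D
        k = self.key_conv(f).view(b, c, h * w)
        v = self.value_conv(f).view(b, c, h * w)
        s_lo = self.softmax(torch.bmm(q.transpose(1, 2), k))     # (B, HW, HW) spatial attention map
        out = torch.bmm(v, s_lo.transpose(1, 2)).view(b, c, h, w)
        return out + f                               # residual addition with the input feature
```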
(II) Channel attention network
The multi-modal spatio-temporally fused lidar BEV feature f_lidar_fusion is reshaped into a two-dimensional feature. Matrix multiplication is performed between this feature and its transpose and a SoftMax layer is applied to obtain the channel attention map s_ch. Matrix multiplication is then performed between s_ch and the reshaped f_lidar_fusion to capture the dependency between any two channels, and the result is reshaped back to a feature map with the same dimensions as f_lidar_fusion. Finally, this map is added element-wise to f_lidar_fusion to obtain the final output of the channel attention network.
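Likewise, a minimal sketch of the channel attention computation; the exact transpose convention is an assumption chosen so that the matrix shapes are consistent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention step described above; no extra
    convolutions are used, and the result is added back to the input."""
    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, f):                           # f: fused lidar BEV feature (B, C, H, W)
        b, c, h, w = f.shape
        flat = f.view(b, c, h * w)                  # reshape to a 2D (channel x position) feature
        s_ch = self.softmax(torch.bmm(flat, flat.transpose(1, 2)))   # (B, C, C) channel attention map
        out = torch.bmm(s_ch, flat).view(b, c, h, w)                 # channel-wise dependencies
        return out + f                              # residual addition with the input feature
```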
(III) Attention fusion network
To further enhance the feature representation, the outputs of the position attention network and the channel attention network are combined by element-wise addition followed by a convolution operation to obtain the future prediction feature f_future.
(IV) Future prediction task head network
The future prediction task head decodes from the future prediction feature f_future the scene state 0.5 seconds into the future, namely the predicted positions and speeds of the other traffic participants in the current scene after 0.5 seconds.
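A sketch tying these pieces together, reusing the PositionAttention and ChannelAttention classes from the sketches above; the pooling-based prediction head and its output format are assumptions, since the patent does not specify the head architecture.

```python
import torch
import torch.nn as nn

class FuturePredictionEncoder(nn.Module):
    """Sketch of the attention fusion and future prediction head: the position
    and channel attention outputs are added element-wise and passed through a
    convolution to give f_future, from which a small head predicts the scene
    state 0.5 s ahead. Layer sizes and output format are assumptions."""
    def __init__(self, channels, num_outputs):
        super().__init__()
        self.position_attention = PositionAttention(channels)
        self.channel_attention = ChannelAttention()
        self.fuse_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.prediction_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_outputs))

    def forward(self, f_lidar_fusion):
        f_lo = self.position_attention(f_lidar_fusion)
        f_ch = self.channel_attention(f_lidar_fusion)
        f_future = self.fuse_conv(f_lo + f_ch)          # element-wise addition + convolution
        future_state = self.prediction_head(f_future)   # positions/speeds of other agents after 0.5 s
        return f_future, future_state
```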
Step (4): after the supervised training of steps (2) and (3) is completed, design the reward function and train the deep reinforcement learning decision model with a distributed PPO reinforcement learning algorithm, learning the optimal safety decision strategy from the multi-modal spatio-temporal representation (consisting of the lidar BEV feature and the future prediction feature) together with the speed data and the deviation distance and deviation angle data relative to the lane center.
In this embodiment, the reward function includes sparse rewards triggered by predefined events and dense rewards obtained at each time step. The sparse rewards cover six cases, namely collision, stopping without cause, overspeed, deviation distance greater than a threshold, deviation angle greater than a threshold, and deviation angular velocity greater than a threshold; a penalty is given when any of these conditions is met. The dense rewards include four terms, namely the deviation distance reward, the deviation angle reward, the angular velocity reward and the speed reward. Wherein:
The deviation distance reward is calculated as follows:
r_d = max(k_d × (1 − d), k_d × (1 − d_max))
where d is the deviation distance from the road center, d_max is the maximum threshold, set to 2, and k_d is the scaling coefficient.
The deviation angle reward is calculated as follows:
r_θ = max(k_θ × (30 − θ), k_θ × (30 − θ_max))
where θ is the route deviation angle, θ_max is the maximum threshold, set to 60, and k_θ is the scaling coefficient.
The angular velocity reward is calculated as follows:
r_ω = max(k_ω × (10 − ω), k_ω × (10 − ω_max))
where ω is the ego-vehicle angular velocity, ω_max is the maximum threshold, set to 20, and k_ω is the scaling coefficient.
The speed reward is calculated as follows:
where v is the speed of the ego vehicle, k_v is the scaling coefficient, v_target is the target speed, and range_v is the allowed speed range; v_target and range_v are calculated from the distance l to the obstacle ahead.
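A hedged sketch assembling the dense reward terms from the formulas above; the k_* scale coefficients are left as free parameters in the patent, and the speed term is only an assumed placeholder because its formula is not reproduced in the text.

```python
def dense_reward(d, theta, omega, v, v_target, range_v,
                 k_d=1.0, k_theta=0.01, k_omega=0.05, k_v=1.0,
                 d_max=2.0, theta_max=60.0, omega_max=20.0):
    """Dense per-timestep reward assembled from the formulas above.
    The scale coefficients k_* are assumptions, and the speed term is a
    placeholder: its formula is not given in the text."""
    r_d = max(k_d * (1 - d), k_d * (1 - d_max))                        # deviation distance reward
    r_theta = max(k_theta * (30 - theta), k_theta * (30 - theta_max))  # deviation angle reward
    r_omega = max(k_omega * (10 - omega), k_omega * (10 - omega_max))  # angular velocity reward
    # Speed reward (assumed shape only): reward staying close to v_target
    # within the allowed range; the patent derives v_target and range_v
    # from the distance to the obstacle ahead.
    r_v = k_v * max(0.0, 1.0 - abs(v - v_target) / max(range_v, 1e-6))
    return r_d + r_theta + r_omega + r_v
```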
In this embodiment, the reinforcement learning network adopts a multi-branch structure with separate lateral and longitudinal control, and an independent action prediction branch is used for each navigation command in a set of high-level navigation commands (turn left at an intersection, turn right at an intersection, go straight at an intersection, and follow the current road); the navigation command acts as a switch selecting which branch is used at each time step, and each branch learns a sub-strategy specific to that navigation command. Within each branch, considering that motion control of the vehicle involves two relatively independent operations, namely lateral control and longitudinal control, two identical sub-branches are designed to handle lateral and longitudinal control respectively, and each lateral and longitudinal control network consists of a policy network and a value network.
(I) Policy network
Channel concatenation is performed on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the d-dimensional discrete action prediction probability P(s_t) is output. For the lateral action space, the steering space is uniformly discretized into 33 bins, so d is 33, and the action a_st with the highest probability is executed; for the longitudinal action space, the throttle and brake spaces are combined and discretized into 3 different actions corresponding to acceleration, idling and deceleration, so d is 3, and the action a_lo with the highest probability is executed.
(II) Value network
Channel concatenation is performed on the multi-modal spatio-temporally fused lidar BEV feature map f_lidar_fusion and the future prediction feature map f_future to obtain a feature map f; the feature map f is flattened into a one-dimensional feature vector after four convolution layers; finally, this feature vector is concatenated with the vehicle measurement vector containing the speed data and the deviation distance and deviation angle relative to the lane center, and after three fully connected layers the one-dimensional state value prediction Q(s_t) is output. The state value output by the lateral control network is denoted v_st, and the state value output by the longitudinal control network is denoted v_lo.
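A minimal sketch of one lateral or longitudinal sub-branch with its policy and value heads; the channel counts, hidden sizes and strides are assumptions. In the full method there are four such command-specific branches, each with a 33-way lateral head and a 3-way longitudinal head, and the navigation command selects which branch is active at each time step.

```python
import torch
import torch.nn as nn

class BranchHead(nn.Module):
    """Sketch of one lateral/longitudinal branch: four convolutions over the
    concatenated [f_lidar_fusion, f_future] map, flatten, concatenate the
    measurement vector (speed, deviation distance, deviation angle), then
    three fully connected layers for P(s_t) and Q(s_t). Sizes are assumptions."""
    def __init__(self, in_channels, feat_hw, meas_dim=3, num_actions=33, hidden=256):
        super().__init__()
        convs, c = [], in_channels
        for out_c in (64, 64, 128, 128):                      # four convolution layers
            convs += [nn.Conv2d(c, out_c, 3, stride=2, padding=1), nn.ReLU()]
            c = out_c
        self.convs = nn.Sequential(*convs)
        flat_dim = 128 * (feat_hw // 16) ** 2
        self.policy = nn.Sequential(                          # -> P(s_t): discrete action probabilities
            nn.Linear(flat_dim + meas_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1))
        self.value = nn.Sequential(                           # -> Q(s_t): one-dimensional state value
            nn.Linear(flat_dim + meas_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, f_lidar_fusion, f_future, measurements):
        f = torch.cat([f_lidar_fusion, f_future], dim=1)      # channel concatenation
        x = self.convs(f).flatten(1)
        x = torch.cat([x, measurements], dim=1)               # append the vehicle measurement vector
        return self.policy(x), self.value(x)
```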
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.