CN117962926A - Autonomous driving decision system based on deep reinforcement learning - Google Patents

Autonomous driving decision system based on deep reinforcement learning

Info

Publication number
CN117962926A
Authority
CN
China
Prior art keywords
vehicle
reward
learning
discriminator
sac
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410177276.0A
Other languages
Chinese (zh)
Inventor
郑孝遥
姚庆贺
张健鹏
赵思齐
腾莉
罗永龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Normal University
Original Assignee
Anhui Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Normal University
Priority to CN202410177276.0A
Publication of CN117962926A
Legal status: Pending (current)


Abstract

Translated from Chinese

The present invention discloses an autonomous driving decision system based on deep reinforcement learning, comprising a vehicle perception module, a generator SAC learning module, and a discriminator judgment module. The vehicle perception module obtains the vehicle's visual data and driving state from the vehicle's sensors and derives the vehicle's perception features through a convolutional neural network. The generator SAC learning module learns from the perception features to produce the vehicle's actions. The discriminator judgment module scores the difference between the actions produced by the generator SAC learning module and the expert data, yielding the discriminator reward. Using the GAIL structure, the discriminator's score of the generator is taken as the discriminator reward, and the environment reward and the generative adversarial imitation learning score are linearly weighted to form the model's reward function. By having GAIL learn from expert data, expert knowledge guides the agent's training and improves the model's training efficiency.

Description

Translated fromChinese
Autonomous driving decision system based on deep reinforcement learning

Technical Field

The present invention belongs to the technical field of autonomous driving, and more specifically relates to an autonomous driving decision-making system based on deep reinforcement learning.

Background Art

In recent years, the rapid development of autonomous driving technology has attracted widespread attention from society. With continuous advances in sensor technology, artificial intelligence, and related technologies, autonomous vehicles have had a profound impact on the overall transportation system and on urban planning. Autonomous vehicles not only effectively improve road safety but also play an important role in alleviating traffic congestion. An autonomous driving system generally consists of an environmental perception layer, a decision-making and planning layer, and an action control layer. The decision-making and planning layer is the core of an autonomous vehicle. In autonomous driving systems with high requirements for safety, real-time performance, responsiveness, and predictability, the rationality of decisions directly affects the safety, comfort, and economy of the vehicle.

At present, a variety of strategies have been proposed to address the technical problem of behavior decision-making for autonomous vehicles. There are three main decision-making approaches for autonomous driving: rule-based methods, "end-to-end" methods based on deep learning, and methods based on reinforcement learning. Rule-based methods offer good real-time performance without increasing computational complexity, but because real scenarios are extremely complex, manually defined rules can hardly cover all situations. "End-to-end" deep-learning methods simplify the processing pipeline for complex tasks and adapt to high-dimensional data by learning features automatically, giving them strong potential for generalization; however, the large amount of data required for model training limits their application in some scenarios, and research on autonomous driving decision-making using deep learning alone remains relatively scarce. In reinforcement-learning-based schemes, the agent can learn and improve by itself, showing great application potential in the field of autonomous driving decision-making.

Deep reinforcement learning resembles the way humans learn new knowledge. Through continuous interaction with the environment, the agent uses the rewards or penalties it receives to keep optimizing its policy until the optimal policy is learned. Deep reinforcement learning has achieved good results in complex domains such as games, robot control, and healthcare, but challenges remain. One of them is learning efficiency: deep reinforcement learning models usually require a large number of samples and a long training time. This can lead to excessive exploration of the environment, especially in high-dimensional state spaces, making the learning algorithm less efficient.

Summary of the Invention

The present invention provides an autonomous driving decision-making system based on deep reinforcement learning, aiming to address the above problems.

The present invention is implemented as follows: an autonomous driving decision system based on deep reinforcement learning, the system comprising a vehicle perception module, a generator SAC learning module, and a discriminator judgment module;

the vehicle perception module obtains the vehicle's visual data and driving state from the vehicle's sensors and derives the vehicle's perception features through a convolutional neural network;

the generator SAC learning module learns from the vehicle's perception features to obtain the vehicle's actions;

the discriminator judgment module scores the difference between the actions obtained by the generator SAC learning module and the expert data, yielding the discriminator reward.

Further, the generator SAC learning module comprises one Actor network, two Critic networks, and two target networks; the Actor network is connected to the two Critic networks and to the two target networks, and the two Critic networks are connected to the two target networks. The generator SAC learning module is trained with the SAC-GAIL algorithm.

Further, the training process of the generator SAC learning module is as follows:

The SAC learning module is trained for N rounds, and each round proceeds as follows:

(1) Input the state S_t into the Actor network, select the action a_t output by the Actor network, and add a perturbation N_t to form the action a_t*.

(2) Execute the action a_t*, observe the state S_{t+1} at the next time step, and store the sample (s_t, a_t, r_t, s_{t+1}) in the buffer R_E;

(3) Let t = t + 1 and repeat steps (1) and (2) until M rounds have been executed;

(4) Randomly draw a small batch of samples (s_i, a_i, r_i, s_{i+1}) from the buffer R_E and the buffer R_D, where the buffer R_D stores the expert sample data τ_D;

(5) Train the discriminator judgment module on the samples and update the discriminator network weight parameters;

(6) After the discriminator judgment module has been trained C times, update the weight parameters of the two Critic networks and the Actor network;

(7) After the Critic networks and the Actor network have completed T training iterations, update the weight parameters of the two target networks.

Further, the weighted reward is composed of the environment reward R_e and the discriminator reward R_d, and is calculated as follows:

R* = αR_e + (1 - α)R_d

where α is a hyperparameter.

Further, the environment reward function includes an intention reward R_done, a speed reward R_speed, and a steering-angle reward R_stree.

Further, the intention reward R_done is defined as follows: done indicates whether the vehicle has collided or left its lane; when a collision or lane departure occurs, the intention reward R_done = -100, otherwise R_done = 0.

Further, the speed reward R_speed is defined as follows: speed denotes the vehicle's speed; when the vehicle's speed exceeds v_t, the vehicle is penalized with R_speed = -50.

Further, the steering-angle reward R_stree is defined as follows:

R_stree = -10 × stree²

where stree denotes the vehicle's steering angle.

Further, the discriminator reward R_d is defined as follows:

R_d = -log(1 - D_θ(s_t, a_t))

where D_θ(s_t, a_t) denotes the score assigned by the discriminator judgment module to the policy (s_t, a_t) generated by the generator SAC learning module.

Further, the weight parameters θ̄_i of the two target networks are updated by a soft update:

θ̄_i ← τθ_i + (1 - τ)θ̄_i, i = 1, 2

where θ̄_i on the right-hand side denotes the weight parameters of the two target networks before the update, θ̄_i on the left-hand side denotes the weight parameters of the two target networks after the update, θ_i denotes the current weight parameters of the two Critic networks, and τ is the soft update rate.

The present invention constructs the SAC-GAIL autonomous driving decision model by introducing a generative adversarial imitation learning (GAIL) algorithm, which improves the overall learning efficiency of the model. The method uses the GAIL structure, takes the discriminator's score of the generator as the discriminator reward, and linearly weights the environment reward and the generative adversarial imitation learning score to obtain the model's reward function. Then, by having GAIL learn from the expert data, expert knowledge is used to guide the agent's training, improving the model's training efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the structure of the autonomous driving decision system based on deep reinforcement learning provided by an embodiment of the present invention;

FIG. 2 is a speed curve diagram of each comparison model provided by an embodiment of the present invention;

FIG. 3 is a lateral acceleration curve diagram of the comparison models provided by an embodiment of the present invention.

Detailed Description

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings, to help those skilled in the art gain a more complete, accurate, and in-depth understanding of the inventive concept and technical solution of the present invention.

(1) State space

The state space (S) of the autonomous driving decision vehicle is the information received from the environment. At each time step, the agent's state S_t is modeled as a tuple S_t = (V_t, X_t), comprising the vehicle visual features V_t and the vehicle driving features X_t.

Vehicle visual features: V_t is a visual feature vector associated with a set of visual features extracted from images. The images are semantic segmentation images in three directions: directly in front of the vehicle, to its left, and to its right.

Vehicle driving features: the driving features of the vehicle at time t consist of the speed and a high-level navigation command, expressed as X_t = (P_t, C_t), where P_t is the vehicle's speed at time t and C_t is the high-level navigation command at time t. The present invention uses a one-hot code to describe the current high-level navigation command: a left turn is represented as [1, 0, 0], going straight as [0, 1, 0], and a right turn as [0, 0, 1].
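As an illustration of the state encoding described above, the following sketch (not part of the patent; all function and variable names are ours) assembles X_t = (P_t, C_t) with the one-hot navigation code and concatenates it with a visual feature vector V_t to form S_t:

```python
import numpy as np

# One-hot codes for the high-level navigation commands: left, straight, right.
NAV_COMMANDS = {"left": [1, 0, 0], "straight": [0, 1, 0], "right": [0, 0, 1]}

def build_driving_features(speed: float, command: str) -> np.ndarray:
    """X_t = (P_t, C_t): current speed followed by the one-hot navigation command."""
    return np.array([speed] + NAV_COMMANDS[command], dtype=np.float32)

def build_state(visual_features: np.ndarray, speed: float, command: str) -> np.ndarray:
    """S_t = (V_t, X_t), flattened into a single vector for the networks."""
    return np.concatenate([visual_features.ravel(), build_driving_features(speed, command)])

# Example: a 64-dimensional visual feature vector, speed 5.0, 'straight' command.
state = build_state(np.zeros(64, dtype=np.float32), 5.0, "straight")
```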

(2) Action space

At each time step, the model computes from the currently observed state S the action to be executed by the vehicle. To complete the task, the vehicle must provide throttle, steering, and brake commands in a continuous manner. The vehicle's action is defined as (acc, stree, break), where acc is the acceleration, stree is the steering angle, and break is the brake. The range of acc is [0, 1], the range of stree is [-1, 1], and the range of break is [0, 1].
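A minimal sketch of this action representation, assuming the raw network output must be projected into the ranges given above (names are illustrative, not from the patent):

```python
import numpy as np

# Lower and upper bounds for (acc, stree, break) as defined above.
ACTION_LOW = np.array([0.0, -1.0, 0.0], dtype=np.float32)
ACTION_HIGH = np.array([1.0, 1.0, 1.0], dtype=np.float32)

def clip_action(raw_action: np.ndarray) -> np.ndarray:
    """Project a raw 3-dimensional action onto the valid throttle/steering/brake box."""
    return np.clip(raw_action, ACTION_LOW, ACTION_HIGH)

print(clip_action(np.array([1.3, -0.2, -0.5])))  # -> [1.0, -0.2, 0.0]
```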

(3) Environment reward function

In reinforcement learning, the reward function provides immediate positive rewards or negative penalties that guide the agent to adjust its policy so as to maximize the future cumulative reward, playing a key role in guiding learning and defining the task. In the present invention, the environment reward function is designed from three perspectives: success rate, traffic efficiency, and comfort. The environment reward function used consists of three elements: an intention reward R_done, a speed reward R_speed, and a steering-angle reward R_stree.

The intention reward R_done is determined based on the vehicle's lane position and safety, and is used to encourage the vehicle to stay in its lane while penalizing collisions. At each time step, the agent's intention reward R_done is: when the vehicle collides or leaves its lane (done), R_done = -100; otherwise R_done = 0.

The speed reward R_speed is defined in terms of the vehicle's speed and is used to evaluate the vehicle's driving efficiency: speed denotes the vehicle's speed; the vehicle should drive as fast as possible without exceeding the speed limit, so when the speed is greater than 6 the vehicle is penalized with R_speed = -50.

The steering-angle reward R_stree is defined in terms of the vehicle's steering angle and is used to evaluate the comfort of the vehicle:

R_stree = -10 × stree²

where stree denotes the vehicle's steering angle. Good ride comfort is desired, so a larger penalty is applied when the vehicle's steering angle becomes too large.

The environment reward function R_e at each time step is defined as:

R_e = R_done + R_speed + R_stree
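The three reward terms can be combined as in the following sketch. The below-limit branch of R_speed (returning the speed itself) is our reading of "the speed reward is defined as the speed of the vehicle", and the threshold of 6 is the value quoted above:

```python
def intention_reward(done: bool) -> float:
    """R_done: -100 on a collision or lane departure, 0 otherwise."""
    return -100.0 if done else 0.0

def speed_reward(speed: float, v_limit: float = 6.0) -> float:
    """R_speed: the vehicle's speed while below the limit, -50 once the limit is exceeded."""
    return -50.0 if speed > v_limit else speed

def steering_reward(stree: float) -> float:
    """R_stree = -10 * stree^2: penalise large steering angles for comfort."""
    return -10.0 * stree ** 2

def environment_reward(done: bool, speed: float, stree: float) -> float:
    """R_e = R_done + R_speed + R_stree."""
    return intention_reward(done) + speed_reward(speed) + steering_reward(stree)
```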

FIG. 1 is a schematic structural diagram of the autonomous driving decision system based on deep reinforcement learning provided by an embodiment of the present invention; for ease of description, only the parts relevant to the embodiment of the present invention are shown. The system consists of three parts: a vehicle perception module, a generator SAC learning module, and a discriminator judgment module.

The vehicle perception module obtains the vehicle's visual data and driving state from the vehicle's sensors and derives the vehicle's perception features through a convolutional neural network.

The discriminator judgment module scores the difference between the actions produced by the generator SAC algorithm and the expert data, yielding the discriminator reward.

The generator SAC learning module learns from the vehicle's perception features to obtain the vehicle's behavior policy.

The vehicle perception module obtains the vehicle's visual data and driving state from the vehicle's sensors and derives the vehicle's perception features through a convolutional neural network. The perception features include the vehicle visual features V_t and the vehicle driving features X_t. The vehicle visual features V_t are a visual feature vector associated with a set of visual features extracted from images; the images are semantic segmentation images in three directions: in front of the vehicle, to its left, and to its right. The vehicle driving features at time t consist of the speed and a high-level navigation command, expressed as X_t = (P_t, C_t), where P_t is the vehicle's speed at time t and C_t is the high-level navigation command at time t. The present invention uses a one-hot code to describe the current high-level navigation command: a left turn is represented as [1, 0, 0], going straight as [0, 1, 0], and a right turn as [0, 0, 1].
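A sketch of how such a perception module could be realised, assuming PyTorch; the three semantic-segmentation views are stacked along the channel axis and fused with the 4-dimensional driving features (speed plus one-hot command). Layer sizes and image resolution are illustrative and not taken from the patent:

```python
import torch
import torch.nn as nn

class PerceptionModule(nn.Module):
    """Three-layer CNN over the stacked camera views, fused with the driving features X_t."""

    def __init__(self, in_channels: int = 9, driving_dim: int = 4, feature_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64 + driving_dim, feature_dim)

    def forward(self, images: torch.Tensor, driving: torch.Tensor) -> torch.Tensor:
        visual = self.cnn(images)                                   # (batch, 64)
        return torch.relu(self.head(torch.cat([visual, driving], dim=1)))

# Example: batch of 2, three 3-channel views stacked into 9 channels, 84x84 resolution.
features = PerceptionModule()(torch.zeros(2, 9, 84, 84), torch.zeros(2, 4))
```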

The discriminator judgment module mainly compares the differences between the expert data and the data generated by the vehicle. Through adversarial learning, its accuracy in distinguishing expert data from generated data improves. The discriminator takes data sampled from the expert experience and from the generated experience as input, and outputs the discriminator's score for the generated data. The optimization objective of the discriminator is to minimize:

where the first expectation denotes the expected value of the discriminator output D(s_i, a_i) when the generated sample data τ_E are fed into the discriminator, the second expectation denotes the expected value of the discriminator output D(s_i, a_i) when the expert data τ_D are fed into the discriminator, and J(D) denotes the loss function of the discriminator.

In the discriminator module, the output of the discriminator is used as the discriminator reward, and the reward value function of the discriminator output is set to:

R_d = r(s_t, a_t; θ) = -log(1 - D_θ(s_t, a_t))  (2)

where D_θ(s_t, a_t) denotes the score assigned by the discriminator judgment module to the policy (s_t, a_t) generated by the generator SAC learning module.

The discriminator takes as input the expert data and the state-action pairs (s_t, a_t) generated by the generator, processes them with a neural network model, and outputs a value between 0 and 1 representing the probability that the input state-action pair comes from the expert demonstrations. An output close to 1 means the discriminator believes the input state-action pair comes from the expert demonstrations; an output close to 0 means the discriminator believes it comes from a sample generated by the generator.
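A sketch of a GAIL-style discriminator consistent with the description above, written in PyTorch (hidden sizes and the numerical clamp are our choices), together with the reward R_d = -log(1 - D(s, a)) from equation (2):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Maps a state-action pair to a probability in (0, 1) that it comes from the expert data."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=1))

def discriminator_reward(d: Discriminator, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """R_d = -log(1 - D_theta(s_t, a_t)); the clamp avoids log(0)."""
    with torch.no_grad():
        prob = d(state, action).clamp(1e-6, 1.0 - 1e-6)
    return -torch.log(1.0 - prob)
```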

By updating the discriminator, demonstrated actions and generated actions can be distinguished more reliably, which provides better guidance to the generator and effectively drives the generator to learn expert driving behavior.

The present invention collects expert data using the autonomous driving module built into the simulator; the scene is a vehicle training scene, and semantic segmentation cameras collect images in the front, left, and right directions of the vehicle. The expert data are modeled as a tuple Data = (V, C, P), where V is a visual feature vector associated with a set of visual features extracted from the images by a three-layer convolutional neural network, C is the high-level navigation command (instructing the vehicle to "turn left", "turn right", or "go straight"), set manually according to the map, and P is the vehicle's current speed. Accurate and efficient expert samples are an important component of the SAC-GAIL algorithm.

The SAC algorithm is based on the actor-critic framework and introduces the concepts of maximum entropy and maximum soft value. While ensuring that the task is completed, the policy is expected to be as random as possible, so that the model can explore sufficiently during training.

The SAC algorithm adds the maximum-entropy principle to the Actor-Critic framework. Entropy represents the randomness of a random variable and is a measure of uncertainty: the greater the uncertainty, the greater the amount of information. The present invention uses maximum-entropy reinforcement learning to add an entropy component to the original objective function, maximizing the entropy of each output of the Actor network (the policy network). Adding the maximum-entropy objective randomizes the policy, i.e., spreads the probability over the output actions as much as possible, thereby increasing exploration without discarding any useful actions.

The Q-value loss function J_Q(θ) is set as:

where the expectation is taken over state-action pairs (s_t, a_t) drawn from the generated sample data set, Q_θ(s_t, a_t) denotes the Q values output by the two Critic networks for the state-action pair (s_t, a_t), and Q_θ̄(s_t, a_t) denotes the Q values output by the two target networks for the state-action pair (s_t, a_t).

The policy loss function J(π) is set as:

where the expectation of [β log π(a_t|s_t) + Q_θ(s_t, a_t)] is taken over states s_t from the sample data set and actions a_t from the policy; π(a_t|s_t) denotes the probability of selecting action a_t in state s_t under the policy, and β is a hyperparameter.
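For orientation, the following sketch shows the conventional soft actor-critic losses; it follows the standard SAC formulation and may differ in sign conventions and details from the patent's equations (3) and (4). The `policy.sample` interface, returning an action and its log-probability, is an assumption:

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_target, q2_target, batch, policy, beta: float, gamma: float = 0.99):
    """Both critics regress towards the entropy-regularised soft Bellman target."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)                  # assumed policy interface
        q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        target = r + gamma * (1.0 - done) * (q_next - beta * logp_next)
    return F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)

def actor_loss(q1, q2, states, policy, beta: float):
    """Minimise beta-weighted log-probability minus the soft Q value (maximum-entropy objective)."""
    actions, logp = policy.sample(states)
    q = torch.min(q1(states, actions), q2(states, actions))
    return (beta * logp - q).mean()
```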

In the discriminator judgment module, scoring the generator with learned expert knowledge and using this score as a model reward addresses the difficulty of designing the agent's reward value function in deep reinforcement learning. However, the output of a model trained in this way tends towards the expert data, which reduces exploration and places high demands on the expert data. This work designs the reward function by linearly weighting the environment reward and the discriminator reward, which reduces the complexity of the environment reward while improving exploration. The weighted reward is composed of the environment reward and the discriminator reward:

R* = αR_e + (1 - α)R_d  (5)
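Equation (5) as a one-line helper; the default value of α is illustrative only:

```python
def weighted_reward(env_reward: float, disc_reward: float, alpha: float = 0.5) -> float:
    """R* = alpha * R_e + (1 - alpha) * R_d, with hyperparameter alpha in [0, 1]."""
    return alpha * env_reward + (1.0 - alpha) * disc_reward
```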

To improve training stability, this work applies orthogonal weight initialization and zero-bias initialization to the neural network layers. Orthogonal initialization of the weights can improve training stability, generalization performance, and the representational capacity of the model; it initializes the weights as an orthogonal matrix, which means the weights have low correlation with one another. Zero bias initialization introduces a constant offset into the model, which reduces the number of model parameters and thus the model's complexity. The proposed algorithm uses Actor-Critic as its framework, combines it with the twin-network technique from double Q-learning, and adds an entropy component to the objective function of the optimal policy, trading off cumulative return against information entropy to the greatest extent, which increases the agent's exploration and prevents the policy from converging prematurely to a local optimum.
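A sketch of the orthogonal-weight, zero-bias initialization described above, assuming PyTorch; the gain and layer sizes are illustrative:

```python
import torch.nn as nn

def init_layer(layer: nn.Linear, gain: float = 1.0) -> nn.Linear:
    """Orthogonal weight initialization with a zero bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.zeros_(layer.bias)
    return layer

# Example: the 3-dimensional (acc, stree, break) output head of the Actor network.
actor_head = init_layer(nn.Linear(256, 3))
```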

In the early stage of the reinforcement learning process, the agent's search space is large, the number of trial-and-error attempts is high, and learning efficiency is low. GAIL can exploit the prior knowledge contained in expert data to accelerate the agent's learning. Therefore, the present invention combines the advantages of GAIL and SAC to construct the SAC-GAIL algorithm.

In the SAC-GAIL algorithm, the SAC method is used as the generator to output the operating policy, and the discriminator judges whether the input data are expert data or generated data. In the generator, the Actor network of the SAC algorithm produces the action policy, while the Critic networks evaluate the value of the output actions more accurately, so that the action policy produced by the Actor network approaches the demonstration policy. The discriminator distinguishes expert data from generated data by scoring the generator. During model training, the generator G and the discriminator D are trained alternately, and both update their training parameters by gradient descent to reduce the value of their loss functions. Over continued training, the generator G and the discriminator D reach a Nash equilibrium through a dynamic game.

By updating the discriminator, expert data and generated data can be distinguished more reliably, which provides better guidance to the generator. By updating the generator, the gap between expert data and generated data can be eliminated. Through adversarial training of the generator G and the discriminator D, a Nash equilibrium point is eventually reached, bringing the generated data close to the expert data.

The generator SAC learning module comprises one Actor network, two Critic networks, and two target networks; the Actor network is connected to the two Critic networks and to the two target networks, and the two Critic networks are connected to the two target networks. The generator SAC learning module is trained with the SAC-GAIL algorithm; the training process is as follows:

The SAC learning module is trained for N rounds, and each round proceeds as follows:

(1) Input the state S_t into the Actor network, select the action a_t output by the Actor network, and add a perturbation N_t to form the action a_t*.

(2) Execute the action a_t*, observe the state S_{t+1} at the next time step, and store the sample τ_E(s_t, a_t, r_t, s_{t+1}) in the buffer R_E;

(3) Let t = t + 1 and repeat steps (1) and (2) until M rounds have been executed.

(4) Randomly draw a small batch of samples τ_E(s_i, a_i, r_i, s_{i+1}) from the buffer R_E and the buffer R_D, where the buffer R_D stores the expert sample data τ_D;

(5) Train the discriminator on the samples and update the discriminator network weight parameters using formula (1);

(6) After the discriminator has been trained C times, update the SAC networks, i.e., update the two Critic networks using formula (3) and the Actor network using formula (4);

(7) After the SAC networks have completed T training iterations, update the weight parameters θ̄_i of the two target networks by a soft update:

θ̄_i ← τθ_i + (1 - τ)θ̄_i, i = 1, 2

where θ̄_i on the right-hand side denotes the weight parameters of the two target networks before the update, θ̄_i on the left-hand side denotes the weight parameters of the two target networks after the update, θ_i denotes the current weight parameters of the two Critic networks, and τ is the soft update rate.

Algorithm pseudocode: steps 1-3 of the algorithm define the model parameters; in steps 6-10 the model first collects a certain amount of experience for training; in steps 13-15 the discriminator is trained; in steps 16-18 the SAC algorithm is trained.
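A high-level skeleton of the SAC-GAIL training loop summarized by the pseudocode description above; every object and method name here (env, agent, buffers, update calls) is an assumed interface, not code from the patent:

```python
def train_sac_gail(env, agent, discriminator, expert_buffer, replay_buffer,
                   episodes: int = 800, max_steps: int = 1000,
                   disc_updates: int = 1, sac_updates: int = 1, batch_size: int = 256):
    """Collect experience, train the discriminator on expert vs. generated samples,
    update the SAC critics/actor on the weighted reward, then soft-update the targets."""
    for episode in range(episodes):
        state = env.reset()
        for _ in range(max_steps):                                  # steps (1)-(3): roll out
            action = agent.act(state, explore=True)                 # Actor output plus exploration noise
            next_state, env_reward, done, _ = env.step(action)
            replay_buffer.add(state, action, env_reward, next_state, done)
            state = next_state
            if done:
                break

        for _ in range(disc_updates):                               # steps (4)-(5): discriminator
            gen_batch = replay_buffer.sample(batch_size)
            exp_batch = expert_buffer.sample(batch_size)
            discriminator.update(exp_batch, gen_batch)              # minimise the loss in (1)

        for _ in range(sac_updates):                                # step (6): SAC update
            batch = replay_buffer.sample(batch_size)
            batch = agent.reweight(batch, discriminator)            # R* = alpha*Re + (1-alpha)*Rd
            agent.update_critics(batch)
            agent.update_actor(batch)

        agent.soft_update_targets(tau=1e-3)                         # step (7): target networks
```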

This work uses CARLA to construct the traffic scenes. In the simulated traffic scenes, the reinforcement learning agent is controlled through its actions to generate the data required for training and testing. The purpose is to test whether the algorithm is able to complete safe driving within the specified time and to compare the effects of different algorithms. The effectiveness of the proposed model was verified: the experimental results reveal the advantages and the robustness of the algorithm.

The proposed method was trained and evaluated on the CARLA simulator. CARLA is a high-definition open-source simulation platform for autonomous driving research. It simulates not only the driving environment and vehicle dynamics but also raw sensor data inputs such as camera RGB images. This work uses three semantic segmentation cameras, located at the front, left, and right of the vehicle, to collect image data.

The training parameters for deep reinforcement learning are set as follows: the total number of training episodes is 800, and the number of steps per episode is limited to 1000. The first 5000 steps are used as a random exploration phase, with Gaussian noise added to the actions. In the formal training phase, transitions with BatchSize = 256 are randomly drawn from the replay buffer and fed to the model for learning. The initial learning rate γ of the ADAM optimizer is 3×10^-4, and the soft update rate τ of the target Q networks is 1×10^-3.
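The reported training hyperparameters collected into a single configuration object (field names are ours):

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    episodes: int = 800          # total training episodes
    max_steps: int = 1000        # step limit per episode
    warmup_steps: int = 5000     # random exploration phase with Gaussian action noise
    batch_size: int = 256        # transitions drawn from the replay buffer per update
    learning_rate: float = 3e-4  # initial ADAM learning rate
    tau: float = 1e-3            # soft update rate of the target Q networks

config = TrainingConfig()
```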

The proposed SAC-GAIL method is compared with three baseline methods: DDPG, TD3, and SAC. DDPG: a method for reinforcement learning problems with continuous action spaces that combines deterministic policy gradients with an experience replay mechanism and uses neural networks to learn approximate value functions and policies, enabling efficient learning in complex environments. TD3: adopts techniques such as delayed double updates, target policy noise, and clipped double Q networks to improve learning stability and efficiency, and is particularly suitable for continuous action spaces. SAC: improves on DDPG by introducing the idea of entropy maximization; by accounting for the uncertainty of the action distribution when optimizing the policy, it improves exploration of the environment, and it is mainly used for learning problems in continuous action spaces.

Comparison of speed and lateral acceleration: the relationship between the vehicle's lateral acceleration and speed and the training episodes while completing the task was recorded; the outputs of the vehicle running in the scene are shown in FIG. 2 and FIG. 3, respectively.

FIG. 2 shows that the SAC-GAIL model maintains a more stable speed; because SAC-GAIL introduces expert data, the speed of the trained model is more stable. FIG. 3 shows that the lateral acceleration of the SAC-GAIL autonomous driving decision model always stays between -0.4g and 0.4g, indicating that the autonomous driving decision model based on the SAC-GAIL algorithm does not roll over when the vehicle turns. Moreover, under the same external environment, the agent's lateral stability is better when SAC-GAIL is used as the decision model.

Model robustness comparison: to compare the robustness of the algorithms, three levels of interference are set. Low-level interference: a disturbance is applied to the vehicle every 10 steps. Medium-level interference: a disturbance is applied at every step. High-level interference: the vehicle performs a random action every 10 steps.
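A sketch of the three interference levels used in the robustness test; the noise magnitude and the random-action sampler are our assumptions:

```python
import random

def random_action():
    """Uniform random (acc, stree, break) within the valid action ranges."""
    return [random.uniform(0, 1), random.uniform(-1, 1), random.uniform(0, 1)]

def perturb_action(action, step: int, level: str):
    """low: disturb every 10 steps; mid: disturb every step; high: random action every 10 steps."""
    noisy = [a + random.gauss(0.0, 0.1) for a in action]
    if level == "low":
        return noisy if step % 10 == 0 else action
    if level == "mid":
        return noisy
    if level == "high":
        return random_action() if step % 10 == 0 else action
    return action
```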

Table 1. Model task completion rate under interference

Table 1 shows that the SAC-GAIL autonomous driving decision model has the highest task completion rate under all three types of interference. Compared with the SAC algorithm, the task completion rate of SAC-GAIL is 2% higher without interference, 3% higher under low-level interference, 3% higher under medium-level interference, and 15% higher under high-level interference. Because the reward function of the SAC-GAIL algorithm includes the discriminator reward, the output of the algorithm is influenced by the expert data, which strengthens its resistance to interference.

Overall performance comparison: 100 episodes were tested under the default conditions, and the results are shown in Table 2:

Table 2. Test results of each algorithm

Table 2 shows that the FR (task completion rate) of the proposed method is 97% and its reward value of 4433 is the highest among all algorithms, indicating that the method achieves better results. Compared with the SAC algorithm, the reward value is 40% higher and the completion rate is 2% higher. Because the SAC-GAIL algorithm introduces expert data and the discriminator reward is included in the reward function, the output of the algorithm leans more towards the expert data, leading to higher reward values and task completion rates.

The performance of the method was analyzed and evaluated through simulation experiments. The experimental results show that using expert data and the GAIL structure effectively improves the training efficiency of the model and provides effective driving decisions, meeting the safety and comfort requirements of the driving process. In the future, we will consider incorporating driving style into the autonomous driving decision model to meet the characteristic needs of different drivers.

The present invention has been described by way of example. Obviously, the specific implementation of the present invention is not limited to the above. Any non-substantial improvement made using the method concept and technical solution of the present invention, or any direct application of the concept and technical solution of the present invention to other situations without improvement, falls within the protection scope of the present invention.

Claims (10)

Translated from Chinese

1. An autonomous driving decision system based on deep reinforcement learning, characterized in that the system comprises: a vehicle perception module, a generator SAC learning module, and a discriminator judgment module; the vehicle perception module obtains the vehicle's visual data and driving state from the vehicle's sensors and derives the vehicle's perception features through a convolutional neural network; the generator SAC learning module learns from the vehicle's perception features to obtain the vehicle's actions; the discriminator judgment module scores the difference between the actions obtained by the generator SAC learning module and the expert data, yielding the discriminator reward.

2. The autonomous driving decision system based on deep reinforcement learning according to claim 1, characterized in that the generator SAC learning module comprises: one Actor network, two Critic networks, and two target networks; the Actor network is connected to the two Critic networks and to the two target networks, the two Critic networks are connected to the two target networks, and the generator SAC learning module is trained with the SAC-GAIL algorithm.

3. The autonomous driving decision system based on deep reinforcement learning according to claim 2, characterized in that the training process of the generator SAC learning module is as follows: the SAC learning module is trained for N rounds, and each round proceeds as follows: (1) input the state S_t into the Actor network, select the action a_t output by the Actor network, and add a perturbation N_t to form the action a_t*; (2) execute the action a_t*, observe the state S_{t+1} at the next time step, and store the sample (s_t, a_t, r_t, s_{t+1}) in the buffer R_E; (3) let t = t + 1 and repeat steps (1) and (2) until M rounds have been executed; (4) randomly draw a small batch of samples (s_i, a_i, r_i, s_{i+1}) from the buffer R_E and the buffer R_D, where the buffer R_D stores the expert sample data τ_D; (5) train the discriminator judgment module on the samples and update the discriminator network weight parameters; (6) after the discriminator judgment module has been trained C times, update the weight parameters of the two Critic networks and the Actor network; (7) after the Critic networks and the Actor network have completed T training iterations, update the weight parameters of the two target networks.

4. The autonomous driving decision system based on deep reinforcement learning according to claim 1, characterized in that the weighted reward is composed of the environment reward R_e and the discriminator reward R_d, calculated as R* = αR_e + (1 - α)R_d, where α is a hyperparameter.

5. The autonomous driving decision system based on deep reinforcement learning according to claim 4, characterized in that the environment reward function includes an intention reward R_done, a speed reward R_speed, and a steering-angle reward R_stree.

6. The autonomous driving decision system based on deep reinforcement learning according to claim 5, characterized in that the intention reward R_done is defined as follows: done indicates whether the vehicle has collided or left its lane; when a collision or lane departure occurs, the intention reward R_done = -100, otherwise R_done = 0.

7. The autonomous driving decision system based on deep reinforcement learning according to claim 5, characterized in that the speed reward R_speed is defined as follows: speed denotes the vehicle's speed; when the vehicle's speed exceeds v_t, the vehicle is penalized with R_speed = -50.

8. The autonomous driving decision system based on deep reinforcement learning according to claim 5, characterized in that the steering-angle reward R_stree is defined as R_stree = -10 × stree², where stree denotes the vehicle's steering angle.

9. The autonomous driving decision system based on deep reinforcement learning according to claim 4, characterized in that the discriminator reward R_d is defined as R_d = -log(1 - D_θ(s_t, a_t)), where D_θ(s_t, a_t) denotes the score assigned by the discriminator judgment module to the policy (s_t, a_t) generated by the generator SAC learning module.

10. The autonomous driving decision system based on deep reinforcement learning according to claim 3, characterized in that the weight parameters θ̄_i of the target networks are updated according to the following formula, where θ̄_i on the right-hand side denotes the weight parameters of the two target networks before the update, θ̄_i on the left-hand side denotes the weight parameters of the two target networks after the update, and θ_i denotes the current weight parameters of the two Critic networks.
CN202410177276.0A | 2024-02-08 | 2024-02-08 | Autonomous driving decision system based on deep reinforcement learning | Pending | CN117962926A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410177276.0A (CN117962926A) | 2024-02-08 | 2024-02-08 | Autonomous driving decision system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410177276.0A (CN117962926A) | 2024-02-08 | 2024-02-08 | Autonomous driving decision system based on deep reinforcement learning

Publications (1)

Publication Number | Publication Date
CN117962926A | 2024-05-03

Family

ID=90859153

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410177276.0A (CN117962926A, pending) | Autonomous driving decision system based on deep reinforcement learning | 2024-02-08 | 2024-02-08

Country Status (1)

Country | Link
CN | CN117962926A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119329519A (en)* | 2024-12-04 | 2025-01-21 | 长春工业大学 | A SAC-based optimization method for vehicle adaptive cruise control
CN119329519B (en)* | 2024-12-04 | 2025-05-27 | 长春工业大学 | A SAC-based optimization method for vehicle adaptive cruise control
CN119795167A (en)* | 2024-12-31 | 2025-04-11 | 复旦大学 | A music-driven humanoid robot control method based on adversarial imitation learning
CN120123997A (en)* | 2025-05-14 | 2025-06-10 | 江苏智能无人装备产业创新中心有限公司 | Autonomous driving behavior decision system and method based on visual language model


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
