CN115469663B - End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving - Google Patents

End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving

Info

Publication number
CN115469663B
Authority
CN
China
Prior art keywords
agent
model
environment
obstacle avoidance
game
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211119904.7A
Other languages
Chinese (zh)
Other versions
CN115469663A (en)
Inventor
Guoliang You (尤国良)
Yanyong Zhang (张燕咏)
Jianmin Ji (吉建民)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202211119904.7A
Publication of CN115469663A
Application granted
Publication of CN115469663B
Legal status: Active (Current)
Anticipated expiration

Abstract

Translated from Chinese

The present invention relates to an end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving. A lidar is used to sample the lidar point cloud and generate a one-dimensional laser point cloud of the environment, which is converted by a cost-map conversion algorithm into an obstacle map representing the dynamic and static obstacles of the environment. Through closed-loop interaction with a simulation environment, environmental obstacle maps are collected to generate a navigation environment data set. A heterogeneous-agent self-play obstacle avoidance model is constructed; the heterogeneous multi-agent self-play model includes an observation space, an action space, a reward function and a neural network. Multi-stage parallel curriculum learning is performed so that the neural network reaches a local optimum faster and better while the learning process is accelerated, finally yielding the trained heterogeneous-agent self-play obstacle avoidance model. The trained model is then deployed on an actual vehicle to realize navigation and obstacle avoidance in the real world.

Description

Translated from Chinese
End-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving

Technical Field

The present invention relates to an end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving, and belongs to the technical field of autonomous driving.

Background Art

In recent years, with the widespread application of autonomous driving, navigation and obstacle avoidance methods have emerged one after another. Deep reinforcement learning (DRL) methods have been applied to agent navigation and have achieved good results, where the autonomous vehicle follows the Ackermann dynamics model. However, developing a DRL-based collision avoidance method for Ackermann-steered vehicles in unknown moving-obstacle scenarios remains challenging. The trained network must not only generate collision-free trajectories that satisfy the Ackermann kinematic constraints, but also handle multiple moving obstacles in various environments. This forces deep reinforcement learning to search a much larger space for feasible, collision-free trajectories that satisfy the dynamic constraints. For an Ackermann-steered vehicle, efficiently finding a collision-free and kinematically feasible path to the goal in environments containing static and moving obstacles is one of the main challenges of autonomous driving [1]. In autonomous driving scenarios, common static and dynamic obstacles mainly include static scene elements, dynamic vehicles, pedestrians and bicycles. The problem of achieving autonomous driving among static and dynamic obstacles can be described as a multi-agent navigation and obstacle avoidance problem under heterogeneous dynamics models.

Although many collision avoidance methods have been proposed, they suffer from common limitations in practice. For example, the assumptions of a given method may not hold in all environments [2].

In addition, collision avoidance for multiple agents with different shapes and different kinematic constraints in distributed, communication-free scenarios is an even more challenging task. Existing methods such as Optimal Reciprocal Collision Avoidance (ORCA) [3] and Bicycle Reciprocal Collision Avoidance (B-ORCA) [4] provide sufficient conditions for agents to avoid colliding with each other, but they require all agents to share the same kinematic constraints. Generalized reciprocal collision avoidance extends these methods to agents with different kinematic constraints, such as differential drive and Ackermann steering.

To overcome these limitations, deep reinforcement learning (DRL) methods have been applied to the collision avoidance problem and have achieved good results. However, common deep reinforcement learning methods rarely consider collision avoidance for multiple agents with different kinematic constraints, which limits their application. For example, in complex scenes containing both cars and pedestrians, traditional deep reinforcement learning methods cannot achieve good navigation.

[1] Dong Y, Zhang Y, Ai J. Experimental test of unmanned ground vehicle delivering goods using RRT path planning algorithm[J]. Unmanned Systems, 2017, 5(01): 45-57.

[2] Zhang W, Wei S, Teng Y, et al. Dynamic obstacle avoidance for unmanned underwater vehicles based on an improved velocity obstacle method[J]. Sensors, 2017, 17(12): 2742.

[3] Berg J, Guy S J, Lin M, et al. Reciprocal n-body collision avoidance[M]//Robotics Research. Springer, Berlin, Heidelberg, 2011: 3-19.

[4] Alonso-Mora J, Breitenmoser A, Beardsley P, et al. Reciprocal collision avoidance for multiple car-like robots[C]//2012 IEEE International Conference on Robotics and Automation. IEEE, 2012: 360-366.

Summary of the Invention

The technical problem solved by the present invention: to overcome the shortcomings of the prior art and provide an end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving, which reduces computing resource consumption, improves the inference speed of autonomous driving navigation, enhances the obstacle avoidance capability of autonomous vehicles in complex multi-agent scenarios, and achieves a high arrival rate and safety when dealing with unpredictable pedestrian scenarios. A task completion rate of 98% was achieved in real-world tests.

The technical solution of the present invention is an end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving, implemented in the following steps:

Step 1: a lidar is used to sample the lidar point cloud and generate a one-dimensional laser point cloud of the environment, which is converted by a cost-map conversion algorithm to finally obtain an obstacle map representing the dynamic and static obstacles of the environment. Then, through closed-loop interaction with the simulation environment, environmental obstacle maps are continuously collected to generate a navigation environment data set, which includes a training set and a test set.
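As a minimal illustration of this step, the sketch below rasterizes a one-dimensional laser scan into an egocentric obstacle grid; the grid size, resolution and occupancy values are assumptions chosen for illustration, not parameters taken from the patent.

```python
import numpy as np

def scan_to_obstacle_map(ranges, angles, map_size=60, resolution=0.1, max_range=6.0):
    """Rasterize a 1-D laser scan (ranges/angles in the robot frame) into an
    egocentric obstacle grid. Cells hit by a laser return are marked occupied."""
    grid = np.zeros((map_size, map_size), dtype=np.float32)
    center = map_size // 2
    for r, a in zip(ranges, angles):
        if not np.isfinite(r) or r >= max_range:
            continue  # no obstacle detected along this beam
        x, y = r * np.cos(a), r * np.sin(a)
        col = int(center + x / resolution)
        row = int(center - y / resolution)
        if 0 <= row < map_size and 0 <= col < map_size:
            grid[row, col] = 1.0  # occupied cell
    return grid

# Example: a synthetic 360-beam scan with an obstacle about 2 m ahead
angles = np.linspace(-np.pi, np.pi, 360)
ranges = np.full(360, 10.0)
ranges[175:185] = 2.0
obstacle_map = scan_to_obstacle_map(ranges, angles)
```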

Step 2: construct the heterogeneous-agent self-play obstacle avoidance model. The heterogeneous multi-agent self-play model includes an observation space, an action space, a reward function and a neural network.

The observation space is a complete description of the state of the world; the world carries no information beyond this state. A value in the observation space represents the observation data of the current agent.

The action space limits the range of behaviors the agent can take.

The reward function constrains and guides the behavior of the agent to achieve safe navigation. Different reward functions r_r and r_v are set for pedestrians and vehicles respectively, using a reward shaping method; in addition, r_w and r_d are introduced in the r_v formula as rewards and penalties for the warning and danger zones.

The neural network is used to learn the navigation policy.

The neural network, together with the reward function, plans a target behavior that is contained in the action space. The planned target behavior is input to the agent, which executes the behavior command and outputs the obstacle map of the current environment; this obstacle map is contained in the observation space. The behavior executed by the agent and the obtained obstacle map are then fed back into the neural network, forming closed-loop training.

The observation space, action space, reward function and neural network are combined to learn the obstacle avoidance policy in a simulation environment. The simulation environment is an abstract description of the real world, in which the obstacles include vehicle agents based on the Ackermann model and pedestrian agents based on the differential-drive model. Following the behavior patterns of dynamic obstacles in real environments, the action space of the Ackermann vehicle is described as the trajectory the vehicle intends to drive, and the action space of the pedestrian is described as the pedestrian's direction of motion and speed. Based on this simulation environment and the heterogeneous-agent self-play obstacle avoidance model, a single network structure and set of parameters finally realizes agent game learning between Ackermann vehicles, between Ackermann vehicles and differential pedestrians, and between differential pedestrians, so that multiple agents can avoid obstacles flexibly in a dynamic environment.
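To make the closed loop concrete, the sketch below shows one rollout in which a single policy drives both agent types. The `env` and `policy` interfaces and the observation fields are assumptions for illustration, not interfaces defined in the patent.

```python
# Minimal closed-loop rollout sketch: one shared policy drives both
# Ackermann vehicles and differential-drive pedestrians (assumed interfaces).
def rollout_episode(env, policy, max_steps=500):
    observations = env.reset()          # per-agent dict: obstacle map, ego state, goal
    trajectory = []
    for _ in range(max_steps):
        actions = {}
        for agent_id, obs in observations.items():
            # The same network outputs (curvature, acceleration) for vehicles
            # and (linear velocity, angular velocity) for pedestrians.
            actions[agent_id] = policy.act(obs, agent_type=env.agent_type(agent_id))
        next_observations, rewards, done = env.step(actions)
        trajectory.append((observations, actions, rewards))
        observations = next_observations
        if done:
            break
    return trajectory
```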

Step 3: based on the training set from Step 1, multi-stage parallel curriculum learning is performed on the heterogeneous-agent self-play obstacle avoidance model constructed in Step 2, so that the neural network reaches a local optimum faster and better while the learning process is accelerated, finally yielding the trained heterogeneous-agent self-play obstacle avoidance model.

Step 4: the trained heterogeneous-agent self-play obstacle avoidance model is deployed on an actual vehicle to realize navigation and obstacle avoidance in the real world.

In the heterogeneous multi-agent self-play model constructed in Step 2:

Observation space: it consists of three parts, namely the obstacle map of the agent's environment, the agent's current state, and the goal point. The obstacle map is specified by an egocentric local grid map and represents the environmental information around the agent, including the agent's shape and the appearance of observable obstacles.

Action space: first, a continuous action space is set for the differential-model pedestrian agent, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9], where v_t is the velocity taken at time t, with 0 and 0.6 being the agent's minimum and maximum speeds (0 m/s and 0.6 m/s, unit: m/s), and ω_t is the turning angle taken at time t (unit: rad), with -0.9 and 0.9 being the limits of the agent's turning range. Second, a continuous action space is set for the Ackermann-model vehicle agent, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25], where c_t is the curvature of the target trajectory the agent intends to drive at time t and δ_t is the acceleration corresponding to that trajectory.
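For illustration, the two continuous action spaces can be expressed as simple bounded boxes; the sketch below clips raw network outputs to the ranges stated above (the function and constant names are assumptions, not part of the patent).

```python
import numpy as np

# Bounds taken from the ranges stated above.
PEDESTRIAN_LOW, PEDESTRIAN_HIGH = np.array([0.0, -0.9]), np.array([0.6, 0.9])       # (v_t, w_t)
VEHICLE_LOW, VEHICLE_HIGH = np.array([-1.43, -11.25]), np.array([1.43, 11.25])      # (c_t, delta_t)

def clip_action(raw_action, agent_type):
    """Project a raw network output onto the agent's continuous action space."""
    if agent_type == "pedestrian":
        return np.clip(raw_action, PEDESTRIAN_LOW, PEDESTRIAN_HIGH)
    elif agent_type == "vehicle":
        return np.clip(raw_action, VEHICLE_LOW, VEHICLE_HIGH)
    raise ValueError(f"unknown agent type: {agent_type}")
```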

Reward function: different reward functions r_r and r_v are set for pedestrians and vehicles respectively, using a reward shaping method; r_w and r_d are introduced in the r_v formula as rewards and penalties for the warning and danger zones, as follows:

The pedestrian reward function r_r is:

r_r = r_g + r_c + r_s

The vehicle reward function r_v, where r_g, r_c and r_s are the same as in the pedestrian reward function, is:

r_v = r_g + r_c + r_s + r_w + r_d

In the formula for r_r, r_arr > 0, p_t denotes the position of the agent at the current time t, ε is a hyperparameter of the neural network, and r_g denotes the reward for reaching the goal and the penalty for moving away from it; r_col < 0, and r_c denotes the collision penalty; finally, a negative penalty r_s < 0 is applied over the whole learning episode to encourage the shortest path.

In the formula for r_v, r_warn, r_danger < 0; r_w denotes the penalty when there is an obstacle within the vehicle's warning zone, and r_d the penalty when there is an obstacle within the vehicle's danger zone. In the implementation, r_arr = 500, ε = 10, r_col = -500, r_s = -5, r_warn = -20 and r_danger = -10.
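A minimal sketch of how these terms could be combined is given below. Only the constants come from the text above; the piecewise definitions of r_g, r_c, r_w and r_d are not reproduced in the description, so the goal, collision and zone predicates here are illustrative assumptions.

```python
# Hedged sketch of the shaped rewards; only the constants come from the text,
# the goal/collision/zone predicates are illustrative assumptions.
R_ARR, EPS, R_COL, R_S, R_WARN, R_DANGER = 500.0, 10.0, -500.0, -5.0, -20.0, -10.0

def pedestrian_reward(reached_goal, progress, collided):
    r_g = R_ARR if reached_goal else EPS * progress   # progress toward / away from the goal
    r_c = R_COL if collided else 0.0                   # collision penalty
    r_s = R_S                                          # per-step penalty -> shortest path
    return r_g + r_c + r_s

def vehicle_reward(reached_goal, progress, collided, obstacle_in_warning, obstacle_in_danger):
    r = pedestrian_reward(reached_goal, progress, collided)
    r += R_WARN if obstacle_in_warning else 0.0        # warning-zone penalty r_w
    r += R_DANGER if obstacle_in_danger else 0.0       # danger-zone penalty r_d
    return r
```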

Neural network: the outputs of the policy network describe the behaviors of the Ackermann-model vehicle agent and the differential-model pedestrian agent respectively, thereby realizing self-play learning among heterogeneous agents. The input is the observation data of each heterogeneous agent, including the agent's obstacle map, current state and goal point. The output layers of the last few layers of the network are modified so that the network can output the behavior policies of the heterogeneous agents simultaneously; the modified output layer samples from a Gaussian distribution to obtain the linear and angular velocity as the pedestrian's direction and speed of motion, and the curvature and acceleration of the vehicle's intended trajectory.
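The sketch below shows one possible shape of such a network: a shared convolutional trunk over the obstacle map, fusion with the ego state and goal, and a Gaussian action head per agent type. The layer sizes, the 60x60 map resolution and the head layout are assumptions for illustration; only the idea of a single shared network with Gaussian-sampled outputs for both agent types comes from the description above.

```python
import torch
import torch.nn as nn

class SharedSelfPlayPolicy(nn.Module):
    """Shared policy: CNN over the egocentric obstacle map, fused with the
    ego state and goal, with one Gaussian action head per agent type."""
    def __init__(self, map_size=60, state_dim=4, goal_dim=2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat = self.trunk(torch.zeros(1, 1, map_size, map_size)).shape[1]
        self.fc = nn.Sequential(nn.Linear(feat + state_dim + goal_dim, 128), nn.ReLU())
        # Two 2-D heads: (v, w) for pedestrians, (curvature, acceleration) for vehicles.
        self.mean = nn.ModuleDict({"pedestrian": nn.Linear(128, 2), "vehicle": nn.Linear(128, 2)})
        self.log_std = nn.Parameter(torch.zeros(2))

    def forward(self, obstacle_map, state, goal, agent_type):
        h = self.trunk(obstacle_map)
        h = self.fc(torch.cat([h, state, goal], dim=-1))
        mean = self.mean[agent_type](h)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.sample()  # raw action, to be clipped to the agent's action space
```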

This representation allows a single network to predict the behaviors of different heterogeneous agents as well as of many agents at once. After training, the neural network achieves high navigation performance while simulating, as far as possible, the interaction between pedestrians and vehicles in real environments, which enhances the safety of the navigation policy.

The advantages of the present invention over the prior art are as follows:

(1) The present invention can accomplish fast, collision-free navigation tasks in scenes containing moving obstacles with multiple heterogeneous dynamics models and various types of static obstacles. Using an end-to-end deep reinforcement learning method, tens of thousands of data-collection and policy-iteration rounds are performed in the simulation environment, which can cover most scenarios encountered in practical applications. Compared with traditional navigation algorithms, the present invention trains faster and infers faster. A navigation and obstacle avoidance algorithm combined with deep reinforcement learning can exploit the advantages of autonomous learning and the high efficiency of learning in simulation, and can quickly learn the optimal navigation and obstacle avoidance policy.

(2) The present invention, combined with self-play cooperative training, can improve the adaptability of autonomous vehicles to the scenes they operate in. In traditional navigation strategies, the behavior policies of dynamic obstacles and the shape features of static obstacles in the environment are simple, which leads to learned policies with poor robustness and poor ability to adapt to new scenes. After training, the present invention can adapt to static scenes ranging from simple to complex, to various dynamic scenes, and to mixed scenes. To achieve this improvement in adaptability, the present invention uses a heterogeneous multi-agent self-play method so that agents with different dynamics models, in scenes with various static obstacles, mutually improve their obstacle avoidance and adaptability through self-play cooperative training.

(3) The present invention uses a deep reinforcement learning algorithm to output the trajectory of the autonomous vehicle and the motion of pedestrians, which improves deployment efficiency and obstacle avoidance capability during actual deployment of the autonomous vehicle. Compared with traditional methods that output low-level vehicle control commands, a high-level trajectory output better guarantees the vehicle's safety and navigation capability. The present invention introduces a trajectory policy into the heterogeneous multi-agent self-play, which reduces jitter; at the same time, the trajectory output contains temporal information, so possible future dangers can be anticipated and avoided in time.

(4) Because the multi-agent dynamic environment with complex heterogeneous dynamics models is too complex, learning in it directly would prevent the model from converging. Therefore, multiple agents are placed in several environments of progressively increasing difficulty for learning, realizing multi-environment parallel learning and simple-to-complex curriculum learning. Compared with serial training in a single environment, multi-environment parallelism accelerates learning, and progressive multi-environment curriculum learning reduces the training difficulty.

Brief Description of the Drawings

FIG. 1 is a flow chart of the implementation of the method of the present invention;

FIG. 2 is a structural diagram of the heterogeneous-agent self-play obstacle avoidance neural network of the present invention;

FIG. 3 shows the multi-stage parallel curriculum learning of the present invention: the left panel is parallel curriculum learning in a random static scene, and the right panel is parallel curriculum learning in a circular scene;

FIG. 4 shows the simulation training environments: the left panel is the random training environment and the right panel is the circular training environment;

FIG. 5 shows the deployment and testing on a physical vehicle.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

Autonomous driving scenarios are complex and changeable. For a vehicle to navigate and avoid obstacles efficiently and safely in such scenarios, the navigation policy must flexibly perceive dynamic obstacles with different dynamics models and various types of static obstacles, while planning reasonable collision-free trajectories. The present invention proposes an end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving, used for collision avoidance of Ackermann-steered vehicles, which can handle multiple moving obstacles with heterogeneous dynamics and multiple types of static obstacles. Specifically, a convolutional neural network is introduced that maps the vehicle's observations to kinematically feasible trajectories, and the network is trained in a simulation environment in which the moving obstacles and the Ackermann-steered vehicle share the same policy. A dedicated reward shaping strategy and a multi-stage parallel curriculum learning strategy are introduced to accelerate and stabilize the learning process. The trained model is then deployed on an actual vehicle to realize navigation and obstacle avoidance in the real world. The method is evaluated in multiple scenarios in simulation and in the real world, and enables the vehicle to avoid various moving obstacles with a high success rate.

Finding a collision-free trajectory that satisfies the constraints of the dynamics model in a complex autonomous driving scene is a challenging task for a vehicle. Existing methods usually perceive and model the environment with a perception module, generate the navigation map required for navigation, and combine it with navigation and obstacle avoidance algorithms to navigate complex scenes. Existing methods also usually assume that the dynamic obstacles in the surrounding environment move according to specific behavior policies, for example dynamic Ackermann vehicles following Bicycle Reciprocal Collision Avoidance (B-ORCA) and pedestrians following Optimal Reciprocal Collision Avoidance (ORCA). The behavior policies of dynamic obstacles in real autonomous driving scenes are complex and changeable, so the dynamic scenes simulated with traditional algorithms are highly repetitive, and further exploration is needed to improve performance.

As shown in FIG. 1, the specific implementation steps of the method of the present invention are as follows:

Step 1: build the heterogeneous-agent self-play obstacle avoidance model, interact with the simulation environment module to collect data, preprocess the data, and convert it into the input data required by the model; these data are used in the subsequent model training and evaluation process. The simulation environment is a policy-level simulator designed for autonomous driving; it is customized on top of OpenCV, extends a self-developed simulation environment, and adds pedestrians and vehicles that can be driven by a neural network. The simulation environment is shown in FIG. 4. The left panel is an example of a simulation environment with multiple static obstacles and multiple dynamic obstacles, where squares represent autonomous vehicles and circles represent pedestrians; properties such as the sizes of pedestrians and vehicles can be adjusted in the simulation settings. The static obstacles include squares, rectangles and circles, and their sizes can be adjusted. The right panel is a dynamic-obstacle scene, which is used in subsequent steps to compose simulation environments of progressively increasing difficulty; the elements are as described above.
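As a toy illustration of such a policy-level simulator view, the OpenCV snippet below draws a square for a vehicle, a circle for a pedestrian and a rectangle for a static obstacle on a blank canvas; all sizes, colors and coordinates are arbitrary assumptions, not the patent's actual simulator.

```python
import numpy as np
import cv2

# Toy rendering of the policy-level simulator view: squares = vehicles,
# circles = pedestrians, filled shapes = static obstacles (all values arbitrary).
canvas = np.full((400, 400, 3), 255, dtype=np.uint8)
cv2.rectangle(canvas, (60, 60), (90, 90), (0, 0, 255), -1)        # a vehicle agent
cv2.circle(canvas, (200, 150), 10, (255, 0, 0), -1)               # a pedestrian agent
cv2.rectangle(canvas, (250, 250), (320, 290), (0, 0, 0), -1)      # a static obstacle
cv2.imwrite("sim_frame.png", canvas)
```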

Step 2: model the dynamics of the different agents in the heterogeneous-agent self-play model, and use a single network to output the actions of agents with different dynamics models; when there are many agents in the environment, the same network can predict the behaviors of all agents simultaneously.

Heterogeneous-agent self-play module: FIG. 2 shows the structure of the heterogeneous-agent self-play obstacle avoidance model. The heterogeneous multi-agent self-play module mainly includes the observation space, the action space, the reward function and the neural network structure. The observation space is responsible for complete perception and preprocessing of the environment; the action space defines the valid control command types and the magnitudes and ranges of the control quantities for the heterogeneous multi-agent system. The reward function defines a quantitative formula for evaluating how good or bad the behavior taken by a heterogeneous agent is. The neural network structure is used to learn navigation and obstacle avoidance policies in complex and changeable scenes. The components of the heterogeneous-agent self-play model are introduced below:

Observation space: the observation consists of three parts, namely the obstacle map of the agent's environment, the agent's current state, and the goal point. The obstacle map is specified by an egocentric local grid map; as shown in FIG. 4, it represents the environmental information around the agent, including the agent's shape and the appearance of observable obstacles. The obstacle map is constructed from a cost map generated from various sensor data (for example, the output of a 2D laser, a 3D laser or a depth camera).

Action space: different behavior policies are set for the different agents, thereby realizing the self-play process of heterogeneous agents, as shown in FIG. 2. First, continuous actions are implemented for the differential-model pedestrian agent, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9]; second, continuous actions are implemented for the Ackermann-model vehicle agent, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25]. This design of the action space allows the heterogeneous agents to learn and train with the same network.

Reward function: by modeling the behaviors of vehicles and pedestrians in complex dynamic scenes, self-play between heterogeneous agents is realized. To achieve safe navigation, the goal of the navigation policy is to minimize the number of collisions between vehicles and pedestrians and to minimize the collision-free arrival time of the agents. In the present invention, by introducing special judgment zones combined with a reward shaping strategy, different reward functions r_r and r_v are set for pedestrians and vehicles respectively.

The pedestrian reward function r_r is:

r_r = r_g + r_c + r_s,

The vehicle reward function r_v, where r_g, r_c and r_s are the same as in the pedestrian reward function, is:

r_v = r_g + r_c + r_s + r_w + r_d

Neural network: the convolutional network structure used for the navigation policy is shown in FIG. 2. The input of the network is the agent's observation data. The network output is sampled from a Gaussian distribution, and a weighted average of the actions is computed and output as the agent's behavior policy, where the differential-model pedestrian agent is controlled by linear and angular velocity and the Ackermann-model vehicle agent is controlled by a trajectory.

Step 3: combine multi-stage curriculum learning with parallel learning to accelerate the training and convergence of the model in the simulation environment. Multi-stage curriculum learning and parallel learning accelerate the learning process; curriculum learning has been shown to be an effective strategy that helps find better local minima and speeds up training. The main idea of curriculum learning is to decompose a hard learning task into several simpler tasks, starting from the simplest and gradually increasing the difficulty. In the present invention, the static and dynamic scenes are difficult and learning in them is inefficient, so parallel learning is introduced to accelerate and guide the training process; at the same time, there is an increasing-difficulty relationship between the static and dynamic scenes, so a multi-stage curriculum learning strategy is introduced between them to guide the agents' learning, forming a complete multi-stage parallel curriculum learning strategy that trains the network more efficiently. After the model converges, it is evaluated and tested to verify its effectiveness. Finally, the target model is deployed in the real world to complete the target navigation task.

In this work, scenes with heterogeneous multi-agents and static obstacles are constructed in the simulation environment. By training the Ackermann-model vehicle agents together with the differential-model pedestrian agents and the obstacles, the ability of the Ackermann-model vehicle agent to cope with complex scenes is improved. FIG. 4 shows two examples of training scenes in the simulation; FIG. 3 shows the multi-stage parallel curriculum learning, with the left panel showing parallel curriculum learning in the random static scene and the right panel showing parallel curriculum learning in the circular scene. First, parallel curriculum learning is performed in the first example, the random-obstacle scene, which helps the Ackermann-model vehicle agent avoid random obstacles (the so-called random scene). The already trained model is then used for parallel curriculum learning in the second example, the circular scene, which helps the Ackermann-model vehicle agent improve its obstacle avoidance in complex dynamic-obstacle scenes. Self-play training is first performed in the random scene so that the policy network can cope with complex static-obstacle scenes and simple dynamic pedestrian scenes. In particular, the random-scene environment contains 3 Ackermann-model vehicle agents, 3 differential-model pedestrian agents and 5 static obstacles; the obstacle positions are chosen randomly, and the start and goal positions of the vehicle and pedestrian agents are also randomized within a certain range. Furthermore, both the Ackermann-model vehicle agents and the differential-model pedestrian agents in the environment are driven by the policy network. Multiple heterogeneous agents in the environment share the same navigation policy and also need to avoid colliding with one another, which helps the network learn multi-agent obstacle avoidance. Combined with the reward shaping strategy, the Ackermann-model vehicle agent can effectively and actively avoid other dynamic and static obstacles while keeping a certain safety distance. The circular-scene environment contains 3 Ackermann-model vehicle agents and 3 differential-model pedestrian agents, which are placed randomly on a circle of random radius. Training in the circular scene allows the Ackermann-model vehicle agents and the differential-model pedestrian agents to effectively learn navigation in complex dynamic scenes.
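For reference, the composition of the two training scenes described above can be captured in a small configuration; the field names in the sketch below are assumptions, while the agent and obstacle counts come from the text.

```python
# Scene compositions taken from the description above; field names are illustrative.
RANDOM_SCENE = {
    "ackermann_vehicles": 3,
    "differential_pedestrians": 3,
    "static_obstacles": 5,          # positions randomized each episode
    "randomize_start_and_goal": True,
}
CIRCULAR_SCENE = {
    "ackermann_vehicles": 3,
    "differential_pedestrians": 3,
    "static_obstacles": 0,          # agents placed on a circle of random radius
    "place_on_random_circle": True,
}
```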

During training, because the scene complexity increases, a multi-stage parallel curriculum learning framework is introduced. This accelerates the training of the heterogeneous-agent self-play obstacle avoidance model, and it also effectively guides the learning direction of the Ackermann-model vehicle agent in complex dynamic scenes and improves the training speed.
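A minimal sketch of the multi-stage parallel curriculum idea follows: several copies of the current stage run in parallel, and training advances from the random static scene to the circular dynamic scene once a success threshold is reached. The threshold, the number of parallel environments and the function names are assumptions for illustration.

```python
# Hedged sketch of multi-stage parallel curriculum learning (assumed interfaces).
STAGES = ["random_static_scene", "circular_dynamic_scene"]  # easy -> hard

def train_with_curriculum(make_env, policy, n_parallel=8, success_threshold=0.9,
                          iterations_per_stage=1000):
    for stage in STAGES:
        envs = [make_env(stage) for _ in range(n_parallel)]   # parallel copies of one stage
        for _ in range(iterations_per_stage):
            batches = [rollout_episode(env, policy) for env in envs]  # see rollout sketch above
            policy.update(batches)                             # e.g. a PPO-style update (assumed)
            if policy.recent_success_rate() >= success_threshold:
                break                                          # stage solved -> move to next stage
```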

As shown in FIG. 3, the simulation training environments are the random training environment (left) and the circular training environment (right). In the random training environment, the right side of the left panel shows the observation data of each agent in the environment, and the left side contains several Ackermann-model vehicle agents and differential-model pedestrian agents with random initial positions, together with static obstacles of arbitrary shape and several dynamic obstacles. Every agent in the simulation environment is driven by the neural network, and after a long self-play learning process each agent acquires strong obstacle avoidance capability in static scenes. In the circular training environment, the right side of the right panel shows the observation data of each agent, and the left side contains several Ackermann-model vehicle agents and differential-model pedestrian agents whose initial positions are randomized within a specified circular region; this simulation environment contains no static obstacles. Every agent is driven by the neural network, and after a long self-play learning process each agent acquires strong obstacle avoidance capability in dynamic scenes.

The present invention deploys the policy model trained in the simulation environment on a physical Ackermann-model vehicle platform, in order to test its obstacle avoidance capability in complex scenes in a real environment.

As shown in FIG. 5, physical vehicle deployment and testing are carried out. The Ackermann-model vehicle is built on an AgileX (Songling) mobile robot chassis and is equipped with a RoboSense 16-beam lidar for environment perception, together with an RTK-GPS inertial navigation system for localization. The vehicle carries an industrial PC with an RTX 2080 Ti as the core computing unit, which is mainly used to generate the environment perception data and to run the trained policy network. In the real environment we reproduced the random pedestrian scene and the circular scene from the training scenarios, and conducted physical tests in static and dynamic scenes.

In complex static scenes, combined static-dynamic scenes and dynamic pedestrian scenes, the Ackermann-model vehicle reached the target destination quickly and without collision, safely completing the navigation task. The map-based deep reinforcement learning algorithm combined with self-play can effectively improve the obstacle avoidance capability of the Ackermann-model vehicle, so that it performs well in both complex static and dynamic scenes. Using trajectories as the policy action also allows faster deployment on the physical platform, and the performance can be improved by fine-tuning a few simple parameters.

The present invention is used for navigation and obstacle avoidance in complex autonomous driving scenes. The observation space is used to express the information about the agent's surroundings. The dynamics and shape characteristics of the heterogeneous agents are modeled, the behavior policies of the heterogeneous agents are trained simultaneously by the neural network, and complex dynamic scenes are explored through the self-play process, thereby improving the obstacle avoidance capability of the Ackermann vehicle in complex dynamic scenes. The method of the present invention is evaluated in multiple scenarios in simulation and in real scenes. The results show that the method enables the Ackermann-model vehicle to better handle navigation in complex static and complex dynamic pedestrian scenes.

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that these are merely examples, and that various changes or modifications may be made to these embodiments without departing from the principles and implementation of the present invention. Therefore, the scope of protection of the present invention is defined by the appended claims.

Claims (2)

Translated from Chinese
1. An end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving, characterized in that the implementation steps are as follows:

Step 1: a lidar is used to sample the lidar point cloud and generate a one-dimensional laser point cloud of the environment, which is converted by a cost-map conversion algorithm to finally obtain an obstacle map representing the dynamic and static obstacles of the environment; then, through closed-loop interaction with the simulation environment, environmental obstacle maps are continuously collected and a navigation environment data set is generated, the data set including a training set and a test set;

Step 2: a heterogeneous-agent self-play obstacle avoidance model is constructed, the heterogeneous multi-agent self-play model including an observation space, an action space, a reward function and a neural network;

the observation space represents the observation data of the current agent;

the action space limits the range of behaviors the agent can take;

the reward function constrains and guides the behavior of the agent to achieve safe navigation; different reward functions r_r and r_v are set for pedestrians and vehicles respectively, using a reward shaping method, and r_w and r_d are introduced in the r_v formula as rewards and penalties for the warning and danger zones;

the neural network is used to learn the navigation policy;

the neural network, together with the reward function, plans a target behavior that is contained in the action space; the planned target behavior is input to the agent, which executes the behavior command and outputs the obstacle map of the current environment, the obstacle map being contained in the observation space; the behavior executed by the agent and the obtained obstacle map are then fed back into the neural network, forming closed-loop training;

the observation space, action space, reward function and neural network are combined to learn the obstacle avoidance policy in a simulation environment; the simulation environment is an abstract description of the real world, in which the obstacles include vehicle agents based on the Ackermann model and pedestrian agents based on the differential model; following the behavior patterns of dynamic obstacles in real environments, the action space of the Ackermann vehicle is described as the trajectory the Ackermann vehicle intends to drive, and the action space of the pedestrian is described as the pedestrian's direction of motion and speed; based on the above simulation environment and the heterogeneous-agent self-play obstacle avoidance model, the same network structure and set of parameters is finally used to realize agent game learning between Ackermann vehicles, between Ackermann vehicles and differential pedestrians, and between differential pedestrians, so as to realize flexible obstacle avoidance of multiple agents in a dynamic environment;

Step 3: based on the training set in Step 1, multi-stage parallel curriculum learning is performed on the heterogeneous-agent self-play obstacle avoidance model constructed in Step 2, so that the neural network reaches a local optimum faster and better while the learning process is accelerated, finally obtaining the trained heterogeneous-agent self-play obstacle avoidance model;

Step 4: the trained heterogeneous-agent self-play obstacle avoidance model is deployed on an actual vehicle to realize navigation and obstacle avoidance in the real world.

2. The end-to-end navigation and obstacle avoidance method based on deep reinforcement learning for autonomous driving according to claim 1, characterized in that, in the heterogeneous multi-agent self-play model constructed in Step 2:

observation space: it consists of three parts, namely the obstacle map of the agent's environment, the agent's current state and the goal point; the obstacle map is specified by an egocentric local grid map and represents the environmental information around the agent, including the agent's shape and the appearance of observable obstacles;

action space: first, a continuous action space is set for the differential-model pedestrian agent, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9], where v_t is the velocity taken at time t, with 0 and 0.6 being the agent's minimum and maximum speeds (0 m/s and 0.6 m/s, unit: m/s), and ω_t is the turning angle taken at time t (unit: rad), with -0.9 and 0.9 being the limits of the agent's turning range; second, a continuous action space is set for the Ackermann-model vehicle agent, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25], where c_t is the curvature of the target trajectory the agent intends to drive at time t, and δ_t is the acceleration corresponding to that trajectory;

reward function: different reward functions r_r and r_v are set for pedestrians and vehicles respectively, using a reward shaping method, and r_w and r_d are introduced in the r_v formula as rewards and penalties for the warning and danger zones, as follows:

the pedestrian reward function r_r is:

r_r = r_g + r_c + r_s,

the vehicle reward function r_v, where r_g, r_c and r_s are the same as in the pedestrian reward function, is:

r_v = r_g + r_c + r_s + r_w + r_d

in the formula for r_r, r_arr > 0, p_t denotes the position of the agent at the current time t, ε is a hyperparameter of the neural network, and r_g denotes the reward for reaching the goal and the penalty for moving away from it; r_col < 0, and r_c denotes the collision penalty; finally, a negative penalty r_s < 0 is applied over the whole learning episode to encourage the shortest path;

in the formula for r_v, r_warn, r_danger < 0; r_w denotes the penalty when there is an obstacle within the vehicle's warning zone, and r_d the penalty when there is an obstacle within the vehicle's danger zone; in the implementation, r_arr = 500, ε = 10, r_col = -500, r_s = -5, r_warn = -20 and r_danger = -10;

neural network: the outputs of the policy network describe the behaviors of the Ackermann-model vehicle agent and the differential-model pedestrian agent respectively, thereby realizing self-play learning among heterogeneous agents, where the input is the observation data of each heterogeneous agent, including the agent's obstacle map, current state and goal point; the output layers of the last few layers of the network structure are modified so that the network can output the behavior policies of the heterogeneous agents simultaneously, the modified output layer sampling from a Gaussian distribution to obtain the linear and angular velocity as the pedestrian's direction and speed of motion, and the curvature and acceleration of the vehicle's intended trajectory;

this representation allows a single network to predict the behaviors of different heterogeneous agents as well as of many agents; after training, the neural network achieves high navigation performance while simulating, as far as possible, the interaction between pedestrians and vehicles in real environments, which enhances the safety of the navigation policy.
CN202211119904.7A (priority date 2022-09-15, filing date 2022-09-15): End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving; status: Active; granted as CN115469663B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211119904.7A (CN115469663B (en)) | 2022-09-15 | 2022-09-15 | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211119904.7A | 2022-09-15 | 2022-09-15 | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving

Publications (2)

Publication Number | Publication Date
CN115469663A (en) | 2022-12-13
CN115469663B (en) | 2024-06-14

Family

ID=84333111

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211119904.7A (Active; granted as CN115469663B (en)) | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving | 2022-09-15 | 2022-09-15

Country Status (1)

Country | Link
CN (1) | CN115469663B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116243727B (en) * | 2023-03-17 | 2025-06-20 | Xiamen University | A progressive deep reinforcement learning method for unmanned vehicle confrontation and obstacle avoidance
CN116501086B (en) * | 2023-04-27 | 2024-03-26 | Tianjin University | A method for aircraft autonomous avoidance decision-making based on reinforcement learning
CN116540602B (en) * | 2023-04-28 | 2024-02-23 | Jinling Institute of Technology | Vehicle unmanned method based on road section safety level DQN
CN116755329B (en) * | 2023-05-12 | 2024-05-24 | Jiangnan University | Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
TWI876514B (en) * | 2023-09-05 | 2025-03-11 | National Yang Ming Chiao Tung University | Curriculum reinforcement learning unmanned vehicle autonomous navigation method and system and heterogeneous robot
CN117227763B (en) * | 2023-11-10 | 2024-02-20 | Neolix Huitong (Beijing) Technology Co., Ltd. | Automatic driving behavior decision method and device based on game theory and reinforcement learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113341958A (en) * | 2021-05-21 | 2021-09-03 | Northwestern Polytechnical University | Multi-agent reinforcement learning movement planning method with mixed experience
CN114626505A (en) * | 2022-03-04 | 2022-06-14 | Beijing Institute of Technology | Mobile robot deep reinforcement learning control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20200041276A1 (en) * | 2018-08-03 | 2020-02-06 | Ford Global Technologies, LLC | End-To-End Deep Generative Model For Simultaneous Localization And Mapping

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113341958A (en) * | 2021-05-21 | 2021-09-03 | Northwestern Polytechnical University | Multi-agent reinforcement learning movement planning method with mixed experience
CN114626505A (en) * | 2022-03-04 | 2022-06-14 | Beijing Institute of Technology | Mobile robot deep reinforcement learning control method

Also Published As

Publication number | Publication date
CN115469663A (en) | 2022-12-13


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
