CN116300909A - Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning - Google Patents

Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning

Info

Publication number
CN116300909A
Authority
CN
China
Prior art keywords
robot
information
environment
obstacle avoidance
penalty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310185208.4A
Other languages
Chinese (zh)
Inventor
孙长银
操菁瑜
蒋坤
董璐
穆朝絮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Peng Cheng Laboratory
Original Assignee
Southeast University
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2023-03-01
Publication date: 2023-06-23
Application filed by Southeast University, Peng Cheng Laboratory
Priority to CN202310185208.4A
Publication of CN116300909A
Status: Pending

Abstract

The invention discloses a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning. The method comprises the following steps: first, preprocessing the multimodal data acquired by the robot with information preprocessing modules built from different types of neural network layers; second, describing the robot's obstacle avoidance navigation process in a map-free environment as a Markov decision process, introducing a reinforcement learning framework to train the robot in a simulation environment, and designing a reward function over multi-dimensional objectives so as to realize obstacle avoidance navigation in the simulation environment; and finally, transferring the trained information preprocessing modules and action network to a real environment to complete the robot's obstacle avoidance navigation task. The invention achieves a more complete perception of the environment through the multimodal information preprocessing modules, and the end-to-end reinforcement learning method requires no prior knowledge of the environment, thereby improving the navigation performance of the robot in a map-free environment and the generalization of the algorithm to the real environment.

Description

Translated from Chinese
A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning

Technical Field

The present invention relates to the technical field of mobile robot navigation, and in particular to a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning.

Background Art

With the development of science and technology, robots have been widely used in tasks such as automated warehousing and autonomous exploration of hazardous environments. In complex dynamic environments without a map, however, the autonomy and intelligence of robots are limited. Therefore, when robots are applied to practical problems, autonomous obstacle avoidance navigation becomes a key step toward making robots truly intelligent.

To cope with the complexity and unpredictability of unknown environments, previous work has proposed a number of autonomous localization and navigation methods. These methods, however, are mainly divided into two parts, simultaneous localization and mapping on one side and path planning and motion control on the other, and they generally suffer from the following limitations in practice: 1. The robot's path planning and motion control depend heavily on the accuracy of map construction; in real complex dynamic environments, building a map places high demands on the precision of sensors such as lidar and depth cameras. 2. In complex dynamic environments, map construction consumes a great deal of time and computing resources, and the environment often contains dynamic obstacles with unpredictable trajectories, which makes it even more challenging for the robot to describe and understand its surroundings. Designing a model-free, end-to-end robot obstacle avoidance navigation method based on information preprocessing is therefore of great significance for improving the robot's level of intelligence and its practicality in complex dynamic environments.

Summary of the Invention

Aiming at the mobile robot control problems caused by the complexity and unpredictability of map-free environments, the present invention realizes a model-free, end-to-end robot obstacle avoidance navigation method based on information preprocessing. The method requires no prior knowledge of the environment; it fuses multimodal sensor information preprocessing within a deep reinforcement learning framework and outputs the robot's control actions end to end from the information obtained from the environment, which improves the obstacle avoidance navigation performance of the mobile robot in complex dynamic simulation environments and in real environments.

To achieve the above object, the technical solution of the present invention is as follows. A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning mainly comprises: preprocessing the RGB images and lidar data collected by the robot with different information preprocessing modules; determining the control action of the mobile robot with an action network; evaluating the robot's decisions with a critic network in combination with the reward information; introducing a deep reinforcement learning framework to optimize the neural networks, including the critic network, the action network and the information preprocessing modules; and finally obtaining the optimal robot control policy through training. The specific technical solution comprises the following steps:

Step S1: design different information preprocessing modules to preprocess the multimodal sensor information collected by the robot;

Step S2: describe the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introduce a reinforcement learning framework in the simulation environment; take the processed sensor information together with the robot's position relative to the target and the robot's own velocity as the robot's state, from which the robot's decision action is derived; train the robot control agent in the simulation environment according to the reward information to obtain the optimal policy that maximizes the cumulative reward;

Step S3: transfer the trained information preprocessing modules and the action network to the navigation process in the real environment, so that the robot reaches the target position in the shortest time while avoiding obstacles.

Further, the design of different information preprocessing modules for data preprocessing described in step S1 is specifically as follows:

Step S11: for the RGB images acquired by the camera, build an information preprocessing module from several convolutional neural network layers;

Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, build an information preprocessing module from several recurrent neural network layers.

Further, describing the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introducing a reinforcement learning framework to train the robot in the simulation environment, as described in step S2, specifically proceeds as follows:

Step S21: initialize the neural network parameters, including the information preprocessing modules, the action network and the critic network;

Step S22: in the simulation environment, collect the robot's multimodal data; after being processed by the corresponding information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state;

Step S23: feed the robot's state into the action network to output the robot's decision action, which consists of the robot's angular velocity and its forward and lateral velocities;

Step S24: after the decision action is executed, the robot's position and observed state change;

Step S25: design a multi-dimensional dense reward function. The reward function comprises five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty. The distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original). The angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original). The collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred (formula given as an image in the original). The completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete (formula given as an image in the original). The time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling. The total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;

Step S26: use the critic network to evaluate the robot's decision behavior, and update the neural networks, including the critic network, the action network and the information preprocessing modules, within the reinforcement learning framework;

Step S27: to improve the stability of the algorithm updates, introduce target networks and update the neural network parameters by soft update;

Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

Further, the navigation process of transferring the trained information preprocessing modules and the action network into the real environment described in step S3 is specifically as follows:

Step S31: in the real environment, collect the robot's multimodal data; after being processed by the trained information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state, which is then fed into the trained action network to output the robot's decision action;

Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;

Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.

Beneficial Effects

Compared with the prior art, the present invention has the following advantages. 1) Compared with a single-modality approach, the present invention adopts a multimodal mechanism: it collects multimodal data and uses information preprocessing modules composed of different kinds of neural networks to preprocess that data, which allows a more comprehensive perception of the environment and makes the obstacle avoidance navigation behavior based on the perceived information more accurate. 2) The present invention provides an end-to-end obstacle avoidance navigation method based on a reinforcement learning framework that requires no prior knowledge of the environment; by designing a reward function over multi-dimensional objectives, the algorithm can navigate to a specified target and avoid obstacles effectively without a map, and it transfers well from the simulation environment to the real environment. 3) The present invention trains continuously in the simulation environment until the obstacle avoidance navigation capability is obtained there, which reduces the irreversible damage that online training in the real environment would cause to the robot; the trained information preprocessing modules and action network are then transferred to the real environment, which yields good economic and social benefits.

Brief Description of the Drawings

FIG. 1 is an algorithm block diagram of the robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning in an embodiment of the present invention.

FIG. 2 is a schematic diagram of the simulation training environment of the present invention.

FIG. 3 is a schematic diagram of the real obstacle avoidance navigation environment in an embodiment of the present invention.

Detailed Description of Embodiments

To deepen the understanding of the present invention, this embodiment is described in detail below with reference to the accompanying drawings.

Embodiment 1: to realize autonomous navigation and obstacle avoidance in a complex, dynamic, map-free environment, the robot acquires information about its surroundings through multimodal sensors and decides on suitable actions based on that sensor information. The present invention proposes a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning, whose algorithm framework is shown in FIG. 1. The specific implementation is divided into three stages: the first stage is information preprocessing; the second stage describes the robot obstacle avoidance navigation task in the map-free environment as a Markov decision process and introduces a reinforcement learning framework to train the agent in the simulation environment; the third stage transfers the trained agent into the real environment to complete the obstacle avoidance navigation task. These stages are introduced below with reference to the drawings.

For the simulation environment shown in FIG. 2 or the real environment shown in FIG. 3, the present invention designs different information preprocessing modules to preprocess the multimodal sensor information the robot collects from the environment. This modular preprocessing allows the robot to selectively use different information preprocessing modules according to the kinds of information available in a given application. The specific steps are as follows:

Step S11: for the RGB images acquired by the camera, the pixel information is first processed by a fully connected input layer and then fed into convolutional layers for convolution; after convolution it is activated by a ReLU layer and passed to a pooling layer for pooling. The kernel size and the number of convolutional layers are chosen according to the pixel resolution of the image, and the processed information is finally output through a fully connected output layer. This constitutes the information preprocessing module for RGB images. Note that the ReLU and pooling layers are fixed function operations, whereas the parameters of the convolutional layers and the fully connected input/output layers are updated by the neural network gradient descent in the subsequent steps;
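As an illustration of step S11, a minimal PyTorch sketch of such an RGB preprocessing module is given below. The input resolution, channel counts, kernel sizes and output dimension are assumptions chosen for the example; the patent only states that they should be selected according to the image resolution, and its per-pixel fully connected input layer is approximated here by a 1x1 convolution.

```python
import torch
import torch.nn as nn

class RGBPreprocessor(nn.Module):
    """Sketch of the RGB information preprocessing module:
    input layer -> convolution -> ReLU -> pooling (repeated) -> fully connected output.
    All sizes are illustrative assumptions."""

    def __init__(self, out_dim: int = 128):
        super().__init__()
        # Per-pixel "fully connected input layer", approximated by a 1x1 convolution.
        self.fc_in = nn.Conv2d(3, 16, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc_out = nn.LazyLinear(out_dim)  # fully connected output layer

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (batch, 3, H, W), e.g. an 84x84 camera image
        x = self.fc_in(rgb)
        x = self.conv(x)
        return self.fc_out(torch.flatten(x, start_dim=1))
```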

Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, the distances obtained from the individual beams are aggregated, processed by a fully connected input layer and then fed into recurrent neural network layers; the dimensionality and depth of the recurrent network are chosen according to the beam density and the sampling frequency, and the processed information is finally output through a fully connected output layer. This constitutes the information preprocessing module for the lidar, in which the parameters of the recurrent layers and the fully connected input/output layers are updated by the neural network gradient descent in the subsequent steps.
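A matching sketch for step S12 is shown below; it treats a short window of consecutive lidar scans as a sequence and lets GRU layers stand in for the recurrent layers. The beam count, window length, hidden size and output dimension are assumptions, since the patent ties them to the beam density and sampling frequency.

```python
import torch
import torch.nn as nn

class LidarPreprocessor(nn.Module):
    """Sketch of the lidar information preprocessing module:
    fully connected input layer -> recurrent layers -> fully connected output.
    All sizes are illustrative assumptions."""

    def __init__(self, num_beams: int = 180, hidden: int = 64, out_dim: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(num_beams, hidden)   # fully connected input layer
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, out_dim)    # fully connected output layer

    def forward(self, scans: torch.Tensor) -> torch.Tensor:
        # scans: (batch, T, num_beams), the distance readings of T consecutive
        # time steps, so the recurrent layers can extract time-series information.
        x = torch.relu(self.fc_in(scans))
        _, h = self.rnn(x)           # h: (num_layers, batch, hidden)
        return self.fc_out(h[-1])    # final hidden state of the last layer
```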

Training the robot's obstacle avoidance navigation in a real environment carries a high safety cost, and the damage a collision causes to the robot is irreversible. To reduce the experimental cost and improve training efficiency, the present invention trains the robot's obstacle avoidance navigation policy in a simulation environment built on the Unity platform, as shown in FIG. 2. The simulation environment contains fixed obstacles and dynamic random obstacles: the fixed obstacles are objects of different sizes and heights placed at fixed positions in the arena, while the dynamic random obstacles are five target blocks at random positions and other robots moving according to certain policies. The robot must avoid both the fixed and the dynamic obstacles while completing its task. The present invention describes the robot obstacle avoidance navigation problem in the map-free environment as a Markov decision process and then introduces a reinforcement learning framework to train the agent in the simulation environment to obtain the optimal policy that maximizes the cumulative reward. The specific steps are as follows:

Step S21: initialize the neural network parameters, including the information preprocessing module θ_p, the action network θ_a and the critic network θ_c;

Step S22: in the simulation environment, the white blocks represent the targets the robot must visit in turn and are initialized at random positions at the beginning of each episode. At every time step the robot's multimodal data I_sensor is collected, and the information from each sensor is processed by the corresponding information preprocessing module (in practical applications only some of the preprocessing modules may be used, depending on the data the robot collects). The processed information I_pro is combined with the robot's position relative to the target (d_x, d_y) and the robot's own velocity (v_x, v_y, v_o) as the robot's state. The state variable at a given time is s_t = (I_pro, d_x, d_y, v_x, v_y, v_o), where I_pro is the set of preprocessed sensor information, d_x and d_y are the position differences between the robot and the target along the horizontal and vertical axes, v_x and v_y are the robot's velocities along the horizontal and vertical axes, and v_o is the robot's angular velocity;

Step S23: to account for both the lateral and the longitudinal motion of the robot, its velocities along the horizontal and vertical axes and its angular velocity are used as the control quantities. The robot's state is therefore fed into the action network, which outputs the decision action a_t = (v_x, v_y, v_o), where v_x and v_y are the robot's velocities along the horizontal and vertical axes and v_o is its angular velocity; the decision action determines the robot's trajectory over the next time step;
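Combining steps S22 and S23, the state s_t = (I_pro, d_x, d_y, v_x, v_y, v_o) and the action network that maps it to a_t = (v_x, v_y, v_o) could be sketched as follows. The hidden sizes, the tanh-bounded outputs and the velocity limits are assumptions; the patent only fixes the meaning of the state and action components.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Illustrative actor: state s_t -> decision action a_t = (v_x, v_y, v_o)."""

    def __init__(self, state_dim: int, max_speed: float = 1.0, max_turn: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Tanh(),
        )
        # Assumed velocity limits used to scale the tanh outputs.
        self.register_buffer("scale", torch.tensor([max_speed, max_speed, max_turn]))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state) * self.scale

def build_state(i_pro: torch.Tensor, dx, dy, vx, vy, vo) -> torch.Tensor:
    """Concatenate preprocessed sensor features with the relative target position
    and the robot's own velocities into s_t = (I_pro, d_x, d_y, v_x, v_y, v_o)."""
    extras = torch.tensor([[dx, dy, vx, vy, vo]], dtype=i_pro.dtype)
    return torch.cat([i_pro, extras.expand(i_pro.shape[0], -1)], dim=1)
```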

Step S24: after the decision action a_t is executed, the robot's position and its surroundings change, and the system transitions to the state s_{t+1} at the next time step;

Step S25: design a multi-dimensional dense reward function. The reward function comprises five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty. The distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original). The angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original). The collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred (formula given as an image in the original). The completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete (formula given as an image in the original). The time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling. The total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;
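The exact expressions for the five reward terms are given only as figures in the original document; the sketch below therefore uses plausible stand-in forms (negative Euclidean distance, negative absolute heading error, fixed collision and completion signals) purely to illustrate how they combine into r = r_d + r_o + r_c + r_f + r_t. All constants are assumptions.

```python
import math

def total_reward(dx, dy, heading_error, min_obstacle_dist, goal_visible, facing_goal,
                 safety_dist=0.3, goal_threshold=0.5,
                 r_collision_penalty=-10.0, r_finish=10.0, r_t=-0.05):
    """Illustrative composite reward r = r_d + r_o + r_c + r_f + r_t.
    The individual term formulas are stand-ins, not the patent's exact ones."""
    dist = math.hypot(dx, dy)
    r_d = -dist                         # distance penalty (assumed form)
    r_o = -abs(heading_error)           # angle penalty (assumed form)
    r_c = r_collision_penalty if min_obstacle_dist < safety_dist else 0.0
    r_f = r_finish if (dist < goal_threshold and goal_visible and facing_goal) else 0.0
    return r_d + r_o + r_c + r_f + r_t  # constant time penalty r_t every step
```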

Step S26: input the state s_t and the action a_t into the critic network to evaluate the robot's decision behavior; the critic network accordingly outputs the state-action value function Q(s_t, a_t; θ_c), which represents the value of the current decision. Based on the reward fed back by the environment, the loss function of the critic network is defined as L(θ_c) = E_π[(r + γ max Q(s_{t+1}, a_{t+1}; θ_c) - Q(s_t, a_t; θ_c))^2], where s_{t+1} is the state at the next time step after taking action a_t in the current state s_t, a_{t+1} is the action taken in state s_{t+1} under the current policy π, γ is the discount factor for future rewards, and E_π is the expected value of the bracketed expression under policy π. In addition, a loss function L(θ_a) of the action network is defined (its formula is given as an image in the original). Since s_t contains the sensor data processed by the information preprocessing module, I_pro = θ_p(I_sensor), the information preprocessing module and the action network use the same loss function, L(θ_p) = L(θ_a). The neural network parameters are then updated by gradient descent in the direction that minimizes the loss functions, covering the critic network, the action network and the information preprocessing module (the update formulas are given as images in the original), where l_c, l_a and l_p denote the learning rates of the respective networks and the gradient terms indicate that the corresponding parameters are updated by gradient descent;
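A compact sketch of one update step corresponding to step S26 is given below. The replay batch, the optimizers and the actor loss of the form -Q(s, π(s)) are assumptions in the style of a standard actor-critic implementation; the patent writes the critic loss as L(θ_c) = E_π[(r + γ max Q(s_{t+1}, a_{t+1}; θ_c) - Q(s_t, a_t; θ_c))^2] and only states that the action network and the information preprocessing module share the same loss, so the max over the next action is realized here through the target actor.

```python
import torch

def update_step(batch, preproc, actor, critic, preproc_opt, actor_opt, critic_opt,
                target_preproc, target_actor, target_critic, gamma=0.99):
    """One illustrative actor-critic update with a shared actor/preprocessor loss.
    Hyperparameters and the exact actor-loss form are assumptions."""
    raw_obs, extras, action, reward, raw_obs_next, extras_next = batch

    # State s_t = (I_pro, d_x, d_y, v_x, v_y, v_o): preprocessed sensor features
    # concatenated with the relative target position and the robot's own velocities.
    s = torch.cat([preproc(raw_obs), extras], dim=1)
    with torch.no_grad():
        s_next = torch.cat([target_preproc(raw_obs_next), extras_next], dim=1)
        target_q = reward + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss: squared TD error (the state is detached so only theta_c moves here).
    critic_loss = ((critic(s.detach(), action) - target_q) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss; the preprocessing module shares it (L(theta_p) = L(theta_a)),
    # so both optimizers step on the same objective.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    preproc_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    preproc_opt.step()
    return critic_loss.item(), actor_loss.item()
```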

Step S27: to improve the stability of the algorithm updates, define a target critic network, a target action network and a target information preprocessing network (their symbols are given as images in the original), introduce them into the parameter update process, and update the neural network parameters by soft update, in which each target parameter is gradually blended toward the corresponding online parameter at the target network update rate τ (the exact soft update formula is given as an image in the original);
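The soft update referred to in step S27 is written in the original only as a figure; the common form, which blends each target parameter toward the corresponding online parameter at rate τ, would look like this (the specific form and the value of τ are assumptions):

```python
import torch

@torch.no_grad()
def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """Blend the target network toward the online network at rate tau."""
    for p, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```

After each update step this would be applied to the critic, action and information preprocessing networks and their respective targets.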

Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

A real obstacle avoidance navigation environment is set up to test the performance of the proposed algorithm, as shown in FIG. 3. The transfer from the simulation environment to the real environment, however, raises several issues: for example, the robot may skid when it collides with an obstacle, and the acquisition frame rate of the lidar is limited. The present invention therefore already addresses common transfer problems during training in the simulation environment, for example by adding a time penalty to the designed reward function to prevent the robot from skidding into a standstill, and by collecting the lidar distance readings of several consecutive time steps and extracting the time-series information with the recurrent layers of the information preprocessing module. The information preprocessing modules and the action network trained in the simulation environment are then transferred to the obstacle avoidance navigation process in the real environment. The specific steps are as follows:

Step S31: in the real environment, collect the robot's multimodal data I_sensor and selectively feed the multimodal information into the corresponding information preprocessing modules; the processed information I_pro from the trained preprocessing modules is combined with the robot's position relative to the target (d_x, d_y) and the robot's own velocity (v_x, v_y, v_o) as the robot's state s_t, which is then fed into the trained action network to output the robot's decision action a_t;

Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;

Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.
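Steps S31-S33 amount to running the frozen preprocessing modules and action network in a sense-act loop on the real robot. A minimal sketch, assuming hypothetical robot and goal interfaces that stand in for the real sensor and drive APIs:

```python
import torch

def deploy(actor, rgb_preproc, lidar_preproc, robot, goal, max_steps=2000):
    """Run the trained modules in inference mode until the navigation task is done.
    The robot/goal interfaces are hypothetical stand-ins for the real hardware."""
    actor.eval(); rgb_preproc.eval(); lidar_preproc.eval()
    with torch.no_grad():
        for _ in range(max_steps):
            rgb, scans = robot.read_sensors()                  # multimodal data I_sensor
            i_pro = torch.cat([rgb_preproc(rgb), lidar_preproc(scans)], dim=1)
            dx, dy = robot.offset_to(goal)                     # (d_x, d_y)
            vx, vy, vo = robot.velocity()                      # (v_x, v_y, v_o)
            extras = torch.tensor([[dx, dy, vx, vy, vo]], dtype=i_pro.dtype)
            s = torch.cat([i_pro, extras], dim=1)              # state s_t
            vx_cmd, vy_cmd, vo_cmd = actor(s).squeeze(0).tolist()
            robot.apply_velocity(vx_cmd, vy_cmd, vo_cmd)       # decision action a_t
            if robot.reached(goal):
                break
```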

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (4)

Translated from Chinese
1. A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning, characterized in that the method comprises the following steps:
Step S1: design different information preprocessing modules to preprocess the multimodal sensor information collected by the robot;
Step S2: describe the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introduce a reinforcement learning framework in the simulation environment; take the processed sensor information together with the robot's position relative to the target and the robot's own velocity as the robot's state, from which the robot's decision action is derived; train the robot control agent in the simulation environment according to the reward information to obtain the optimal policy that maximizes the cumulative reward;
Step S3: transfer the trained information preprocessing modules and the action network to the navigation process in the real environment, so that the robot reaches the target position in the shortest time while avoiding obstacles.

2. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that designing different information preprocessing modules to preprocess the collected multimodal sensor information in step S1 specifically comprises:
Step S11: for the RGB images acquired by the camera, build an information preprocessing module from several convolutional neural network layers;
Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, build an information preprocessing module from several recurrent neural network layers.

3. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that describing the robot obstacle avoidance navigation task in the map-free environment as a Markov decision process and introducing a reinforcement learning framework to train the robot control agent in the simulation environment in step S2 specifically comprises:
Step S21: initialize the neural network parameters, including the information preprocessing modules, the action network and the critic network;
Step S22: in the simulation environment, collect the robot's multimodal data; after being processed by the corresponding information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state;
Step S23: feed the robot's state into the action network to output the robot's decision action, which consists of the robot's angular velocity and its forward and lateral velocities;
Step S24: after the decision action is executed, the robot's position and observed state change;
Step S25: design a multi-dimensional dense reward function comprising five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty;
the distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original), where d_x denotes the distance between the robot and the target along the horizontal axis and d_y the distance along the vertical axis;
the angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original);
the collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred; its signal (given as an image in the original) uses r_collision-penalty to denote the penalty after the robot collides with an obstacle;
the completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete; its signal (given as an image in the original) uses r_finish to denote the reward the environment feeds back to the robot when the task is completed;
the time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling;
the total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;
Step S26: use the critic network to evaluate the robot's decision behavior, and update the neural networks, including the critic network, the action network and the information preprocessing modules, within the reinforcement learning framework;
Step S27: to improve the stability of the algorithm updates, introduce target networks and update the neural network parameters by soft update;
Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

4. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that transferring the trained information preprocessing modules and the action network into the real environment to complete the mobile robot's obstacle avoidance navigation task in step S3 specifically comprises:
Step S31: in the real environment, collect the robot's multimodal data; after being processed by the trained information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state, which is then fed into the trained action network to output the robot's decision action;
Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;
Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310185208.4A | 2023-03-01 | 2023-03-01 | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning (published as CN116300909A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310185208.4A | 2023-03-01 | 2023-03-01 | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning (published as CN116300909A)

Publications (1)

Publication Number | Publication Date
CN116300909A | 2023-06-23

Family

ID=86777257

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310185208.4A (pending, published as CN116300909A) | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning | 2023-03-01 | 2023-03-01

Country Status (1)

Country | Link
CN | CN116300909A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117193378A (en)* | 2023-10-24 | 2023-12-08 | 安徽大学 | Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193378B (en)* | 2023-10-24 | 2024-04-12 | 安徽大学 | Multi-UAV path planning method based on improved PPO algorithm
CN117697769A (en)* | 2024-02-06 | 2024-03-15 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning
CN117697769B (en)* | 2024-02-06 | 2024-04-30 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning
CN118466557A (en)* | 2024-07-10 | 2024-08-09 | 北京理工大学 | UAV high-speed navigation and obstacle avoidance method, system, terminal and storage medium
CN119779312A (en)* | 2025-01-08 | 2025-04-08 | 四川大学 | Navigation method, device, robot and storage medium based on knowledge-guided map-free navigation model
CN119762540A (en)* | 2025-03-05 | 2025-04-04 | 大连理工大学 | Target tracking method without updating Euclidean symbol distance field

Similar Documents

Publication | Title
CN116300909A (en) | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning
CN101943916B (en) | Kalman filter prediction-based robot obstacle avoidance method
WO2021135554A1 (en) | Method and device for planning global path of unmanned vehicle
CN112097769B (en) | Homing pigeon brain-hippocampus-imitated unmanned aerial vehicle simultaneous positioning and mapping navigation system and method
CN114237235B (en) | Mobile robot obstacle avoidance method based on deep reinforcement learning
CN112629542A (en) | Map-free robot path navigation method and system based on DDPG and LSTM
CN116679711A (en) | Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning
Liu et al. | Reinforcement learning-based collision avoidance: impact of reward function and knowledge transfer
CN112114592A (en) | Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN117631660A (en) | Multi-scenario path planning method and system for robots based on cross-media continuous learning
CN115373383A (en) | Autonomous obstacle avoidance method, device and related equipment for garbage recycling unmanned boat
CN115265547A (en) | Robot active navigation method based on reinforcement learning in unknown environment
Li et al. | Vision-based obstacle avoidance algorithm for mobile robot
CN118394090A (en) | Unmanned vehicle decision and planning method and system based on deep reinforcement learning
CN118502457A (en) | Track planning method, device and autonomous system
Meftah et al. | Improving autonomous vehicles maneuverability and collision avoidance in adverse weather conditions using generative adversarial networks
Ejaz et al. | Autonomous visual navigation using deep reinforcement learning: An overview
CN119904766A (en) | An end-to-end UAV autonomous control method based on environmental complexity
Yu et al. | Road-following with continuous learning
CN119717842A (en) | Method and system for collaborative formation of multiple unmanned aerial vehicles in complex dynamic environment based on MASAC algorithm
CN118470061A (en) | A multi-target tracking method and system based on improved Pointpillars network
Oliveira et al. | Deep reinforcement learning for mapless robot navigation systems
GB2633776A (en) | A computer-implemented method for deep reinforcement learning using analogous mapping
CN115686076A (en) | Unmanned aerial vehicle path planning method based on incremental development depth reinforcement learning
CN113625718A (en) | Vehicle path planning method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
