CN116300909A - Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning - Google Patents

Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning

Info

Publication number
CN116300909A
Authority
CN
China
Prior art keywords
robot
information
environment
obstacle avoidance
penalty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310185208.4A
Other languages
Chinese (zh)
Inventor
孙长银
操菁瑜
蒋坤
董璐
穆朝絮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Peng Cheng Laboratory
Original Assignee
Southeast University
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2023-03-01
Publication date: 2023-06-23
Application filed by Southeast University, Peng Cheng Laboratory
Priority to CN202310185208.4A
Publication of CN116300909A
Status: Pending

Abstract

The invention discloses a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning. The method comprises the following steps: first, preprocessing the multimodal data acquired by the robot with information preprocessing modules built from different types of neural network layers; second, describing the robot's obstacle avoidance navigation process in a map-free environment as a Markov decision process, introducing a reinforcement learning framework to train the robot in a simulation environment, and designing a reward function over multi-dimensional objectives so as to realize obstacle avoidance navigation in the simulation environment; and finally, transferring the trained information preprocessing modules and action network to a real environment to complete the robot's obstacle avoidance navigation task. The invention achieves a more complete perception of the environment through the multimodal information preprocessing modules, and the end-to-end reinforcement learning method requires no prior knowledge of the environment, thereby improving the navigation performance of the robot in a map-free environment and the generalization of the algorithm to the real environment.

Description

Translated from Chinese
A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning

Technical Field

The present invention relates to the technical field of mobile robot navigation, and in particular to a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning.

Background Art

With the development of science and technology, robots have been widely used in tasks such as automated warehousing and autonomous exploration of hazardous environments. In complex dynamic environments without a map, however, the autonomy and intelligence of robots are limited. Therefore, when robots are applied to practical problems, autonomous obstacle avoidance navigation becomes a key step toward making robots truly intelligent.

To cope with the complexity and unpredictability of unknown environments, previous work has proposed a number of autonomous localization and navigation methods. These methods, however, are mainly divided into two parts, simultaneous localization and mapping on one side and path planning and motion control on the other, and they generally suffer from the following limitations in practice: 1. The robot's path planning and motion control depend heavily on the accuracy of map construction; in real complex dynamic environments, building a map places high demands on the precision of sensors such as lidar and depth cameras. 2. In complex dynamic environments, map construction consumes a great deal of time and computing resources, and the environment often contains dynamic obstacles with unpredictable trajectories, which makes it even more challenging for the robot to describe and understand its surroundings. Designing a model-free, end-to-end robot obstacle avoidance navigation method based on information preprocessing is therefore of great significance for improving the robot's level of intelligence and its practicality in complex dynamic environments.

Summary of the Invention

Aiming at the mobile robot control problems caused by the complexity and unpredictability of map-free environments, the present invention realizes a model-free, end-to-end robot obstacle avoidance navigation method based on information preprocessing. The method requires no prior knowledge of the environment; it fuses multimodal sensor information preprocessing within a deep reinforcement learning framework and outputs the robot's control actions end to end from the information obtained from the environment, which improves the obstacle avoidance navigation performance of the mobile robot in complex dynamic simulation environments and in real environments.

To achieve the above object, the technical solution of the present invention is as follows. A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning mainly comprises: preprocessing the RGB images and lidar data collected by the robot with different information preprocessing modules; determining the control action of the mobile robot with an action network; evaluating the robot's decisions with a critic network in combination with the reward information; introducing a deep reinforcement learning framework to optimize the neural networks, including the critic network, the action network and the information preprocessing modules; and finally obtaining the optimal robot control policy through training. The specific technical solution comprises the following steps:

Step S1: design different information preprocessing modules to preprocess the multimodal sensor information collected by the robot;

Step S2: describe the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introduce a reinforcement learning framework in the simulation environment; take the processed sensor information together with the robot's position relative to the target and the robot's own velocity as the robot's state, from which the robot's decision action is derived; train the robot control agent in the simulation environment according to the reward information to obtain the optimal policy that maximizes the cumulative reward;

Step S3: transfer the trained information preprocessing modules and the action network to the navigation process in the real environment, so that the robot reaches the target position in the shortest time while avoiding obstacles.

Further, the design of different information preprocessing modules for data preprocessing described in step S1 is specifically as follows:

Step S11: for the RGB images acquired by the camera, build an information preprocessing module from several convolutional neural network layers;

Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, build an information preprocessing module from several recurrent neural network layers.

Further, describing the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introducing a reinforcement learning framework to train the robot in the simulation environment, as described in step S2, specifically proceeds as follows:

Step S21: initialize the neural network parameters, including the information preprocessing modules, the action network and the critic network;

Step S22: in the simulation environment, collect the robot's multimodal data; after being processed by the corresponding information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state;

Step S23: feed the robot's state into the action network to output the robot's decision action, which consists of the robot's angular velocity and its forward and lateral velocities;

Step S24: after the decision action is executed, the robot's position and observed state change;

Step S25: design a multi-dimensional dense reward function. The reward function comprises five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty. The distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original). The angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original). The collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred (formula given as an image in the original). The completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete (formula given as an image in the original). The time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling. The total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;

Step S26: use the critic network to evaluate the robot's decision behavior, and update the neural networks, including the critic network, the action network and the information preprocessing modules, within the reinforcement learning framework;

Step S27: to improve the stability of the algorithm updates, introduce target networks and update the neural network parameters by soft update;

Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

Further, the navigation process of transferring the trained information preprocessing modules and the action network into the real environment described in step S3 is specifically as follows:

Step S31: in the real environment, collect the robot's multimodal data; after being processed by the trained information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state, which is then fed into the trained action network to output the robot's decision action;

Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;

Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.

Beneficial Effects

Compared with the prior art, the present invention has the following advantages. 1) Compared with a single-modality approach, the present invention adopts a multimodal mechanism: it collects multimodal data and uses information preprocessing modules composed of different kinds of neural networks to preprocess that data, which allows a more comprehensive perception of the environment and makes the obstacle avoidance navigation behavior based on the perceived information more accurate. 2) The present invention provides an end-to-end obstacle avoidance navigation method based on a reinforcement learning framework that requires no prior knowledge of the environment; by designing a reward function over multi-dimensional objectives, the algorithm can navigate to a specified target and avoid obstacles effectively without a map, and it transfers well from the simulation environment to the real environment. 3) The present invention trains continuously in the simulation environment until the obstacle avoidance navigation capability is obtained there, which reduces the irreversible damage that online training in the real environment would cause to the robot; the trained information preprocessing modules and action network are then transferred to the real environment, which yields good economic and social benefits.

Brief Description of the Drawings

FIG. 1 is an algorithm block diagram of the robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning in an embodiment of the present invention.

FIG. 2 is a schematic diagram of the simulation training environment of the present invention.

FIG. 3 is a schematic diagram of the real obstacle avoidance navigation environment in an embodiment of the present invention.

Detailed Description of Embodiments

To deepen the understanding of the present invention, this embodiment is described in detail below with reference to the accompanying drawings.

Embodiment 1: to realize autonomous navigation and obstacle avoidance in a complex, dynamic, map-free environment, the robot acquires information about its surroundings through multimodal sensors and decides on suitable actions based on that sensor information. The present invention proposes a robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning, whose algorithm framework is shown in FIG. 1. The specific implementation is divided into three stages: the first stage is information preprocessing; the second stage describes the robot obstacle avoidance navigation task in the map-free environment as a Markov decision process and introduces a reinforcement learning framework to train the agent in the simulation environment; the third stage transfers the trained agent into the real environment to complete the obstacle avoidance navigation task. These stages are introduced below with reference to the drawings.

For the simulation environment shown in FIG. 2 or the real environment shown in FIG. 3, the present invention designs different information preprocessing modules to preprocess the multimodal sensor information the robot collects from the environment. This modular preprocessing allows the robot to selectively use different information preprocessing modules according to the kinds of information available in a given application. The specific steps are as follows:

Step S11: for the RGB images acquired by the camera, the pixel information is first processed by a fully connected input layer and then fed into convolutional layers for convolution; after convolution it is activated by a ReLU layer and passed to a pooling layer for pooling. The kernel size and the number of convolutional layers are chosen according to the pixel resolution of the image, and the processed information is finally output through a fully connected output layer. This constitutes the information preprocessing module for RGB images. Note that the ReLU and pooling layers are fixed function operations, whereas the parameters of the convolutional layers and the fully connected input/output layers are updated by the neural network gradient descent in the subsequent steps;
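As an illustration of step S11, a minimal PyTorch sketch of such an RGB preprocessing module is given below. The input resolution, channel counts, kernel sizes and output dimension are assumptions chosen for the example; the patent only states that they should be selected according to the image resolution, and its per-pixel fully connected input layer is approximated here by a 1x1 convolution.

```python
import torch
import torch.nn as nn

class RGBPreprocessor(nn.Module):
    """Sketch of the RGB information preprocessing module:
    input layer -> convolution -> ReLU -> pooling (repeated) -> fully connected output.
    All sizes are illustrative assumptions."""

    def __init__(self, out_dim: int = 128):
        super().__init__()
        # Per-pixel "fully connected input layer", approximated by a 1x1 convolution.
        self.fc_in = nn.Conv2d(3, 16, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc_out = nn.LazyLinear(out_dim)  # fully connected output layer

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (batch, 3, H, W), e.g. an 84x84 camera image
        x = self.fc_in(rgb)
        x = self.conv(x)
        return self.fc_out(torch.flatten(x, start_dim=1))
```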

Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, the distances obtained from the individual beams are aggregated, processed by a fully connected input layer and then fed into recurrent neural network layers; the dimensionality and depth of the recurrent network are chosen according to the beam density and the sampling frequency, and the processed information is finally output through a fully connected output layer. This constitutes the information preprocessing module for the lidar, in which the parameters of the recurrent layers and the fully connected input/output layers are updated by the neural network gradient descent in the subsequent steps.
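A matching sketch for step S12 is shown below; it treats a short window of consecutive lidar scans as a sequence and lets GRU layers stand in for the recurrent layers. The beam count, window length, hidden size and output dimension are assumptions, since the patent ties them to the beam density and sampling frequency.

```python
import torch
import torch.nn as nn

class LidarPreprocessor(nn.Module):
    """Sketch of the lidar information preprocessing module:
    fully connected input layer -> recurrent layers -> fully connected output.
    All sizes are illustrative assumptions."""

    def __init__(self, num_beams: int = 180, hidden: int = 64, out_dim: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(num_beams, hidden)   # fully connected input layer
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, out_dim)    # fully connected output layer

    def forward(self, scans: torch.Tensor) -> torch.Tensor:
        # scans: (batch, T, num_beams), the distance readings of T consecutive
        # time steps, so the recurrent layers can extract time-series information.
        x = torch.relu(self.fc_in(scans))
        _, h = self.rnn(x)           # h: (num_layers, batch, hidden)
        return self.fc_out(h[-1])    # final hidden state of the last layer
```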

Training the robot's obstacle avoidance navigation in a real environment carries a high safety cost, and the damage a collision causes to the robot is irreversible. To reduce the experimental cost and improve training efficiency, the present invention trains the robot's obstacle avoidance navigation policy in a simulation environment built on the Unity platform, as shown in FIG. 2. The simulation environment contains fixed obstacles and dynamic random obstacles: the fixed obstacles are objects of different sizes and heights placed at fixed positions in the arena, while the dynamic random obstacles are five target blocks at random positions and other robots moving according to certain policies. The robot must avoid both the fixed and the dynamic obstacles while completing its task. The present invention describes the robot obstacle avoidance navigation problem in the map-free environment as a Markov decision process and then introduces a reinforcement learning framework to train the agent in the simulation environment to obtain the optimal policy that maximizes the cumulative reward. The specific steps are as follows:

Step S21: initialize the neural network parameters, including the information preprocessing module θ_p, the action network θ_a and the critic network θ_c;

Step S22: in the simulation environment, the white blocks represent the targets the robot must visit in turn and are initialized at random positions at the beginning of each episode. At every time step the robot's multimodal data I_sensor is collected, and the information from each sensor is processed by the corresponding information preprocessing module (in practical applications only some of the preprocessing modules may be used, depending on the data the robot collects). The processed information I_pro is combined with the robot's position relative to the target (d_x, d_y) and the robot's own velocity (v_x, v_y, v_o) as the robot's state. The state variable at a given time is s_t = (I_pro, d_x, d_y, v_x, v_y, v_o), where I_pro is the set of preprocessed sensor information, d_x and d_y are the position differences between the robot and the target along the horizontal and vertical axes, v_x and v_y are the robot's velocities along the horizontal and vertical axes, and v_o is the robot's angular velocity;

Step S23: to account for both the lateral and the longitudinal motion of the robot, its velocities along the horizontal and vertical axes and its angular velocity are used as the control quantities. The robot's state is therefore fed into the action network, which outputs the decision action a_t = (v_x, v_y, v_o), where v_x and v_y are the robot's velocities along the horizontal and vertical axes and v_o is its angular velocity; the decision action determines the robot's trajectory over the next time step;
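Combining steps S22 and S23, the state s_t = (I_pro, d_x, d_y, v_x, v_y, v_o) and the action network that maps it to a_t = (v_x, v_y, v_o) could be sketched as follows. The hidden sizes, the tanh-bounded outputs and the velocity limits are assumptions; the patent only fixes the meaning of the state and action components.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Illustrative actor: state s_t -> decision action a_t = (v_x, v_y, v_o)."""

    def __init__(self, state_dim: int, max_speed: float = 1.0, max_turn: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Tanh(),
        )
        # Assumed velocity limits used to scale the tanh outputs.
        self.register_buffer("scale", torch.tensor([max_speed, max_speed, max_turn]))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state) * self.scale

def build_state(i_pro: torch.Tensor, dx, dy, vx, vy, vo) -> torch.Tensor:
    """Concatenate preprocessed sensor features with the relative target position
    and the robot's own velocities into s_t = (I_pro, d_x, d_y, v_x, v_y, v_o)."""
    extras = torch.tensor([[dx, dy, vx, vy, vo]], dtype=i_pro.dtype)
    return torch.cat([i_pro, extras.expand(i_pro.shape[0], -1)], dim=1)
```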

Step S24: after the decision action a_t is executed, the robot's position and its surroundings change, and the system transitions to the state s_{t+1} at the next time step;

Step S25: design a multi-dimensional dense reward function. The reward function comprises five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty. The distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original). The angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original). The collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred (formula given as an image in the original). The completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete (formula given as an image in the original). The time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling. The total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;
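The exact expressions for the five reward terms are given only as figures in the original document; the sketch below therefore uses plausible stand-in forms (negative Euclidean distance, negative absolute heading error, fixed collision and completion signals) purely to illustrate how they combine into r = r_d + r_o + r_c + r_f + r_t. All constants are assumptions.

```python
import math

def total_reward(dx, dy, heading_error, min_obstacle_dist, goal_visible, facing_goal,
                 safety_dist=0.3, goal_threshold=0.5,
                 r_collision_penalty=-10.0, r_finish=10.0, r_t=-0.05):
    """Illustrative composite reward r = r_d + r_o + r_c + r_f + r_t.
    The individual term formulas are stand-ins, not the patent's exact ones."""
    dist = math.hypot(dx, dy)
    r_d = -dist                         # distance penalty (assumed form)
    r_o = -abs(heading_error)           # angle penalty (assumed form)
    r_c = r_collision_penalty if min_obstacle_dist < safety_dist else 0.0
    r_f = r_finish if (dist < goal_threshold and goal_visible and facing_goal) else 0.0
    return r_d + r_o + r_c + r_f + r_t  # constant time penalty r_t every step
```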

Step S26: input the state s_t and the action a_t into the critic network to evaluate the robot's decision behavior; the critic network accordingly outputs the state-action value function Q(s_t, a_t; θ_c), which represents the value of the current decision. Based on the reward fed back by the environment, the loss function of the critic network is defined as L(θ_c) = E_π[(r + γ max Q(s_{t+1}, a_{t+1}; θ_c) - Q(s_t, a_t; θ_c))^2], where s_{t+1} is the state at the next time step after taking action a_t in the current state s_t, a_{t+1} is the action taken in state s_{t+1} under the current policy π, γ is the discount factor for future rewards, and E_π is the expected value of the bracketed expression under policy π. In addition, a loss function L(θ_a) of the action network is defined (its formula is given as an image in the original). Since s_t contains the sensor data processed by the information preprocessing module, I_pro = θ_p(I_sensor), the information preprocessing module and the action network use the same loss function, L(θ_p) = L(θ_a). The neural network parameters are then updated by gradient descent in the direction that minimizes the loss functions, covering the critic network, the action network and the information preprocessing module (the update formulas are given as images in the original), where l_c, l_a and l_p denote the learning rates of the respective networks and the gradient terms indicate that the corresponding parameters are updated by gradient descent;
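A compact sketch of one update step corresponding to step S26 is given below. The replay batch, the optimizers and the actor loss of the form -Q(s, π(s)) are assumptions in the style of a standard actor-critic implementation; the patent writes the critic loss as L(θ_c) = E_π[(r + γ max Q(s_{t+1}, a_{t+1}; θ_c) - Q(s_t, a_t; θ_c))^2] and only states that the action network and the information preprocessing module share the same loss, so the max over the next action is realized here through the target actor.

```python
import torch

def update_step(batch, preproc, actor, critic, preproc_opt, actor_opt, critic_opt,
                target_preproc, target_actor, target_critic, gamma=0.99):
    """One illustrative actor-critic update with a shared actor/preprocessor loss.
    Hyperparameters and the exact actor-loss form are assumptions."""
    raw_obs, extras, action, reward, raw_obs_next, extras_next = batch

    # State s_t = (I_pro, d_x, d_y, v_x, v_y, v_o): preprocessed sensor features
    # concatenated with the relative target position and the robot's own velocities.
    s = torch.cat([preproc(raw_obs), extras], dim=1)
    with torch.no_grad():
        s_next = torch.cat([target_preproc(raw_obs_next), extras_next], dim=1)
        target_q = reward + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss: squared TD error (the state is detached so only theta_c moves here).
    critic_loss = ((critic(s.detach(), action) - target_q) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss; the preprocessing module shares it (L(theta_p) = L(theta_a)),
    # so both optimizers step on the same objective.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    preproc_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    preproc_opt.step()
    return critic_loss.item(), actor_loss.item()
```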

Step S27: to improve the stability of the algorithm updates, define a target critic network, a target action network and a target information preprocessing network (their symbols are given as images in the original), introduce them into the parameter update process, and update the neural network parameters by soft update, in which each target parameter is gradually blended toward the corresponding online parameter at the target network update rate τ (the exact soft update formula is given as an image in the original);
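The soft update referred to in step S27 is written in the original only as a figure; the common form, which blends each target parameter toward the corresponding online parameter at rate τ, would look like this (the specific form and the value of τ are assumptions):

```python
import torch

@torch.no_grad()
def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """Blend the target network toward the online network at rate tau."""
    for p, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```

After each update step this would be applied to the critic, action and information preprocessing networks and their respective targets.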

Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

A real obstacle avoidance navigation environment is set up to test the performance of the proposed algorithm, as shown in FIG. 3. The transfer from the simulation environment to the real environment, however, raises several issues: for example, the robot may skid when it collides with an obstacle, and the acquisition frame rate of the lidar is limited. The present invention therefore already addresses common transfer problems during training in the simulation environment, for example by adding a time penalty to the designed reward function to prevent the robot from skidding into a standstill, and by collecting the lidar distance readings of several consecutive time steps and extracting the time-series information with the recurrent layers of the information preprocessing module. The information preprocessing modules and the action network trained in the simulation environment are then transferred to the obstacle avoidance navigation process in the real environment. The specific steps are as follows:

Step S31: in the real environment, collect the robot's multimodal data I_sensor and selectively feed the multimodal information into the corresponding information preprocessing modules; the processed information I_pro from the trained preprocessing modules is combined with the robot's position relative to the target (d_x, d_y) and the robot's own velocity (v_x, v_y, v_o) as the robot's state s_t, which is then fed into the trained action network to output the robot's decision action a_t;

Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;

Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.
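Steps S31-S33 amount to running the frozen preprocessing modules and action network in a sense-act loop on the real robot. A minimal sketch, assuming hypothetical robot and goal interfaces that stand in for the real sensor and drive APIs:

```python
import torch

def deploy(actor, rgb_preproc, lidar_preproc, robot, goal, max_steps=2000):
    """Run the trained modules in inference mode until the navigation task is done.
    The robot/goal interfaces are hypothetical stand-ins for the real hardware."""
    actor.eval(); rgb_preproc.eval(); lidar_preproc.eval()
    with torch.no_grad():
        for _ in range(max_steps):
            rgb, scans = robot.read_sensors()                  # multimodal data I_sensor
            i_pro = torch.cat([rgb_preproc(rgb), lidar_preproc(scans)], dim=1)
            dx, dy = robot.offset_to(goal)                     # (d_x, d_y)
            vx, vy, vo = robot.velocity()                      # (v_x, v_y, v_o)
            extras = torch.tensor([[dx, dy, vx, vy, vo]], dtype=i_pro.dtype)
            s = torch.cat([i_pro, extras], dim=1)              # state s_t
            vx_cmd, vy_cmd, vo_cmd = actor(s).squeeze(0).tolist()
            robot.apply_velocity(vx_cmd, vy_cmd, vo_cmd)       # decision action a_t
            if robot.reached(goal):
                break
```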

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (4)

Translated from Chinese
1. A robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning, characterized in that the method comprises the following steps:
Step S1: design different information preprocessing modules to preprocess the multimodal sensor information collected by the robot;
Step S2: describe the robot obstacle avoidance navigation task in a map-free environment as a Markov decision process and introduce a reinforcement learning framework in the simulation environment; take the processed sensor information together with the robot's position relative to the target and the robot's own velocity as the robot's state, from which the robot's decision action is derived; train the robot control agent in the simulation environment according to the reward information to obtain the optimal policy that maximizes the cumulative reward;
Step S3: transfer the trained information preprocessing modules and the action network to the navigation process in the real environment, so that the robot reaches the target position in the shortest time while avoiding obstacles.

2. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that designing different information preprocessing modules to preprocess the collected multimodal sensor information in step S1 specifically comprises:
Step S11: for the RGB images acquired by the camera, build an information preprocessing module from several convolutional neural network layers;
Step S12: for the distance information between the robot and obstacles acquired by the lidar beams, build an information preprocessing module from several recurrent neural network layers.

3. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that describing the robot obstacle avoidance navigation task in the map-free environment as a Markov decision process and introducing a reinforcement learning framework to train the robot control agent in the simulation environment in step S2 specifically comprises:
Step S21: initialize the neural network parameters, including the information preprocessing modules, the action network and the critic network;
Step S22: in the simulation environment, collect the robot's multimodal data; after being processed by the corresponding information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state;
Step S23: feed the robot's state into the action network to output the robot's decision action, which consists of the robot's angular velocity and its forward and lateral velocities;
Step S24: after the decision action is executed, the robot's position and observed state change;
Step S25: design a multi-dimensional dense reward function comprising five parts: a distance penalty, an angle penalty, a collision penalty, a completion reward and a time penalty;
the distance penalty r_d uses the distance between the robot's position and the target position as a penalty to encourage the robot to approach the target point (its formula is given as an image in the original), where d_x denotes the distance between the robot and the target along the horizontal axis and d_y the distance along the vertical axis;
the angle penalty r_o uses the angular difference between the robot's front camera and the target as a penalty to encourage the robot to face the target (formula given as an image in the original);
the collision penalty r_c is applied when the distance between the robot and an obstacle is smaller than the safety distance, in which case a collision is judged to have occurred; its signal (given as an image in the original) uses r_collision-penalty to denote the penalty after the robot collides with an obstacle;
the completion reward r_f is granted when the distance between the robot and the target is below a certain threshold with no obstacle in between and the robot's front camera faces the target, in which case the navigation task is considered complete; its signal (given as an image in the original) uses r_finish to denote the reward the environment feeds back to the robot when the task is completed;
the time penalty applies a constant loss r_t at every decision time step to prevent the robot from stalling;
the total reward is therefore defined as the sum of the five parts: r = r_d + r_o + r_c + r_f + r_t;
Step S26: use the critic network to evaluate the robot's decision behavior, and update the neural networks, including the critic network, the action network and the information preprocessing modules, within the reinforcement learning framework;
Step S27: to improve the stability of the algorithm updates, introduce target networks and update the neural network parameters by soft update;
Step S28: repeat steps S22-S27 until the algorithm converges to the optimal policy, where the optimal policy refers to the optimal information preprocessing policy and the optimal navigation and obstacle avoidance policy that maximize the cumulative reward.

4. The robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning according to claim 1, characterized in that transferring the trained information preprocessing modules and the action network into the real environment to complete the mobile robot's obstacle avoidance navigation task in step S3 specifically comprises:
Step S31: in the real environment, collect the robot's multimodal data; after being processed by the trained information preprocessing modules, it is combined with the robot's position relative to the target and the robot's own velocity as the robot's state, which is then fed into the trained action network to output the robot's decision action;
Step S32: after the robot executes the decision action, its state changes and the multimodal data of the new state is received through the sensors;
Step S33: repeat steps S31-S32 until the obstacle avoidance navigation task is completed.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310185208.4A | 2023-03-01 | 2023-03-01 | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning (published as CN116300909A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310185208.4A | 2023-03-01 | 2023-03-01 | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning (published as CN116300909A)

Publications (1)

Publication Number | Publication Date
CN116300909A | 2023-06-23

Family

ID=86777257

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310185208.4A (pending, published as CN116300909A) | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning | 2023-03-01 | 2023-03-01

Country Status (1)

Country | Link
CN | CN116300909A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117193378A (en)* | 2023-10-24 | 2023-12-08 | 安徽大学 | Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193378B (en)* | 2023-10-24 | 2024-04-12 | 安徽大学 | Multi-UAV path planning method based on improved PPO algorithm
CN117697769A (en)* | 2024-02-06 | 2024-03-15 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning
CN117697769B (en)* | 2024-02-06 | 2024-04-30 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning
CN118466557A (en)* | 2024-07-10 | 2024-08-09 | 北京理工大学 | UAV high-speed navigation and obstacle avoidance method, system, terminal and storage medium
CN119779312A (en)* | 2025-01-08 | 2025-04-08 | 四川大学 | Navigation method, device, robot and storage medium based on knowledge-guided map-free navigation model
CN119762540A (en)* | 2025-03-05 | 2025-04-04 | 大连理工大学 | Target tracking method without updating Euclidean symbol distance field

Similar Documents

Publication | Title
CN116300909A (en) | Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning
CN101943916B (en) | Kalman filter prediction-based robot obstacle avoidance method
WO2021135554A1 (en) | Method and device for planning global path of unmanned vehicle
CN112097769B (en) | Homing pigeon brain-hippocampus-imitated unmanned aerial vehicle simultaneous positioning and mapping navigation system and method
CN114237235B (en) | Mobile robot obstacle avoidance method based on deep reinforcement learning
CN112629542A (en) | Map-free robot path navigation method and system based on DDPG and LSTM
CN116679711A (en) | Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning
Liu et al. | Reinforcement learning-based collision avoidance: impact of reward function and knowledge transfer
CN112114592A (en) | Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN117631660A (en) | Multi-scenario path planning method and system for robots based on cross-media continuous learning
CN115373383A (en) | Autonomous obstacle avoidance method, device and related equipment for garbage recycling unmanned boat
CN115265547A (en) | Robot active navigation method based on reinforcement learning in unknown environment
Li et al. | Vision-based obstacle avoidance algorithm for mobile robot
CN118394090A (en) | Unmanned vehicle decision and planning method and system based on deep reinforcement learning
CN118502457A (en) | Track planning method, device and autonomous system
Meftah et al. | Improving autonomous vehicles maneuverability and collision avoidance in adverse weather conditions using generative adversarial networks
Ejaz et al. | Autonomous visual navigation using deep reinforcement learning: An overview
CN119904766A (en) | An end-to-end UAV autonomous control method based on environmental complexity
Yu et al. | Road-following with continuous learning
CN119717842A (en) | Method and system for collaborative formation of multiple unmanned aerial vehicles in complex dynamic environment based on MASAC algorithm
CN118470061A (en) | A multi-target tracking method and system based on improved Pointpillars network
Oliveira et al. | Deep reinforcement learning for mapless robot navigation systems
GB2633776A (en) | A computer-implemented method for deep reinforcement learning using analogous mapping
CN115686076A (en) | Unmanned aerial vehicle path planning method based on incremental development depth reinforcement learning
CN113625718A (en) | Vehicle path planning method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
