
Obstacle avoidance method, device and medium for group robot

Info

Publication number
CN117666592A
CN117666592A (application CN202311707059.XA)
Authority
CN
China
Prior art keywords
robot
navigation
training
information
snn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311707059.XA
Other languages
Chinese (zh)
Inventor
章城骏
唐华锦
吴迅冬
袁孟雯
杨博
潘纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab
Priority to CN202311707059.XA
Publication of CN117666592A
Legal status: Pending (current)


Abstract

The invention relates to an obstacle avoidance method, device and medium for a group of robots. The method comprises the following steps: initializing a simulation environment, robot poses and navigation points, and designing a reward function for the robots; training a spiking neural network reinforcement learning model with a multi-robot training method, where the model comprises an SNN navigation model and an evaluation network, the SNN navigation model takes robot state information as input and outputs velocity commands for robot control, and the evaluation network guides the training of the SNN navigation model; and, based on the trained SNN navigation model, inferring each robot's velocity command at the current moment from the state information it acquires, so that the group of mobile robots cooperatively completes the navigation task according to the corresponding velocity commands. Compared with the prior art, the invention enables multiple robots to autonomously find time-efficient, collision-free navigation paths without the mutual positional relationships between robots being supplied in advance.

Description

A method, device and medium for group robot obstacle avoidance

Technical Field

The invention relates to the field of group robot navigation, and in particular to a group robot obstacle avoidance method, device and medium based on spiking neural network reinforcement learning.

Background Art

Group robot navigation is a research hotspot in robotics and artificial intelligence, with many real-world applications, including swarm search and rescue, navigation in complex dynamic environments, and cooperative exploration and mapping. The main challenge is to devise a safe and reliable collision-avoidance strategy for every robot in the group on its way from its starting position to its intended goal point.

The more mature group robot navigation methods at present are centralized: they require the initial states, navigation goals and workspaces (including image and radar information) of all robots to be obtained in advance and supplied to a central server, which prevents collisions between robots by synchronously planning optimal paths for all of them. Centralized methods provide good navigation solutions when the number of robots is small, but when scaled to larger groups their performance degrades, because the robots' local path plans must be updated frequently. With many robots, the data communication itself also becomes a challenge: a network failure on a single robot will bring down the entire multi-robot system. Compared with the centralized approach, some work is based on decentralized obstacle-avoidance strategies in which each agent makes decisions independently, but most of it still requires the observed states of the other agents as input, so the position and velocity of every robot must still be tracked globally and shared among all robots. Since real environments never offer perfect perception and localization, the applicability of such algorithms that rely on global information remains severely limited.

Summary of the Invention

The purpose of the present invention is to provide a group robot obstacle avoidance method, device and medium based on spiking neural network reinforcement learning, applying a perception module built from spiking neural network spikes to navigation control, so that the robots can perceive and avoid one another without any global knowledge of their mutual positions.

The object of the present invention can be achieved through the following technical solutions:

A group robot obstacle avoidance method based on spiking neural network reinforcement learning, comprising the following steps:

S1. Initialize the simulation environment, randomly distribute multiple robots in it, initialize the robot poses and navigation points, and design the robots' reward function according to the environment;

S2. Train a spiking neural network reinforcement learning model with a multi-robot training method, where the model comprises an SNN navigation model and an evaluation network; the SNN navigation model takes robot state information as input and outputs velocity commands for robot control, and the evaluation network outputs the value of each <state, action> pair from the SNN navigation model's output and the current state information, guiding the training of the SNN navigation model;

S3. Based on the SNN navigation model trained in step S2, infer each robot's velocity command at the current moment from the state information that robot acquires; the group of mobile robots then cooperatively completes the navigation task according to the corresponding velocity commands.

In step S1, the reward function R(s_t, a_t) of each robot in the group is expressed as:

R(s_t, a_t) = R_g + R_c + R_w

where R_g is the reward term for the mobile robot approaching the target, D_t is the Euclidean distance between the robot and the target point, T_g is the threshold for judging whether the target point has been reached, r_arrival is the reward when the robot reaches the target point, and w_g is the amplitude coefficient of the distance reward term; R_c is the reward term for collisions between the robot and other robots or obstacles, with r_collision the reward issued when a collision event occurs; R_w is a reward term that encourages smoother robot paths, where ω is the robot's current angular velocity, T_ω is the threshold beyond which the angular velocity is considered too large, and w_ω is the penalty parameter.
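For illustration, the following is a minimal Python sketch of this reward. The text names the parameters but not the exact piecewise forms of R_g, R_c and R_w, so the branch structure below (a distance-progress term with an arrival bonus, a collision penalty, and an angular-velocity penalty) is an assumption consistent with the descriptions above; the default values match those given later in Embodiment 1.

```python
import numpy as np

def reward(pos, goal, prev_dist, collided, omega,
           T_g=0.2, r_arrival=10.0, w_g=2.0,
           r_collision=-10.0, w_omega=-0.1, T_omega=1.03):
    """Sketch of R(s_t, a_t) = R_g + R_c + R_w; piecewise forms are assumed."""
    d_t = float(np.linalg.norm(np.asarray(goal) - np.asarray(pos)))  # D_t

    # R_g: arrival bonus inside the goal threshold T_g, otherwise a
    # distance-progress term scaled by the amplitude coefficient w_g
    if d_t < T_g:
        r_g = r_arrival
    else:
        r_g = w_g * (prev_dist - d_t)

    # R_c: issued only when the robot collides with a robot or an obstacle
    r_c = r_collision if collided else 0.0

    # R_w: penalize angular velocities above T_omega to smooth the path
    r_w = w_omega * abs(omega) if abs(omega) > T_omega else 0.0

    return r_g + r_c + r_w, d_t  # return D_t for reuse as prev_dist next step
```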

The state information includes the radar information, odometry information and navigation point position acquired by the mobile robot; the velocity command consists of a linear velocity and an angular velocity.

The radar information includes the radar scan at the current moment together with historical scans, expressed as:

S_B = [B_t, B_{t-1}, B_{t-2}, ..., B_{t-n}]

where S_B is the radar input to the SNN navigation model, B_t is the radar scan at the current time t, and B_{t-n} is the radar scan at time t-n.
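As one way to realize this stacking, the sketch below maintains the scans in a fixed-length rolling buffer; the 180-beam width follows the single-line radar described in Embodiment 1, and the class name is hypothetical.

```python
from collections import deque
import numpy as np

class RadarStack:
    """Keeps S_B = [B_t, B_{t-1}, ..., B_{t-n}] as a rolling window of scans."""
    def __init__(self, n=2, beams=180):
        # n historical frames plus the current one; oldest is dropped automatically
        self.frames = deque([np.zeros(beams, dtype=np.float32)] * (n + 1),
                            maxlen=n + 1)

    def push(self, scan):
        self.frames.appendleft(np.asarray(scan, dtype=np.float32))
        return np.stack(self.frames)  # shape (n+1, beams): row 0 is B_t
```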

The SNN navigation model comprises a perception module and a control module, where:

the perception module processes the input state information and has two branches: one branch uses a one-dimensional convolution followed by LIF neurons to convert the real-valued radar input into spike trains, and the other branch converts the input odometry information and navigation point position into spike codes through a network combining a fully connected layer with LIF neurons; the spike codes of the two branches are flattened into one-dimensional vectors and fed into the control module;

the control module consists of two fully connected layers with LIF neurons and outputs the accumulated membrane voltages of two LIF neurons, which are configured with no firing threshold and never emit spikes; one neuron controls the linear velocity and the other controls the angular velocity; the voltage of the linear-velocity neuron is passed through a sigmoid function to obtain a normalized linear velocity in [0, 1], and the voltage of the angular-velocity neuron is passed through a hyperbolic tangent function to obtain a normalized angular velocity in [-1, 1].
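A minimal PyTorch sketch of this two-branch structure follows. The text fixes the topology (one-dimensional convolution plus LIF neurons for radar, a fully connected layer plus LIF neurons for odometry and goal, two non-spiking output neurons read out by accumulated voltage) but not the layer sizes, LIF constants or simulation horizon, so those are assumptions here.

```python
import torch
import torch.nn as nn

class LIF(nn.Module):
    """Leaky integrate-and-fire neurons with a hard firing threshold."""
    def __init__(self, tau=2.0, v_th=1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x, v):
        v = v + (x - v) / self.tau          # leaky membrane integration
        spike = (v >= self.v_th).float()    # hard threshold; for training, STBP
                                            # swaps this for a surrogate gradient
        v = v * (1.0 - spike)               # reset fired neurons
        return spike, v

class SNNNavigationModel(nn.Module):
    def __init__(self, beams=180, frames=3, T=5):
        super().__init__()
        self.T = T
        # Perception branch 1: 1-D convolution over stacked radar scans
        self.conv = nn.Conv1d(frames, 16, kernel_size=5, stride=2)
        # Perception branch 2: fully connected encoding of odometry + goal
        # (here assumed to be 4 values: linear/angular velocity, goal x/y)
        self.fc_state = nn.Linear(4, 32)
        self.lif1, self.lif2, self.lif3 = LIF(), LIF(), LIF()
        conv_out = 16 * ((beams - 5) // 2 + 1)
        self.fc_hidden = nn.Linear(conv_out + 32, 128)
        # Two output neurons with no firing threshold: their accumulated
        # membrane voltage encodes linear and angular velocity
        self.fc_out = nn.Linear(128, 2)

    def forward(self, radar, state):
        v1 = v2 = v3 = 0.0
        v_out = torch.zeros(radar.shape[0], 2)
        for _ in range(self.T):              # repeat the input over T timesteps
            s1, v1 = self.lif1(self.conv(radar), v1)
            s2, v2 = self.lif2(self.fc_state(state), v2)
            feat = torch.cat([s1.flatten(1), s2], dim=1)
            s3, v3 = self.lif3(self.fc_hidden(feat), v3)
            v_out = v_out + self.fc_out(s3)  # accumulate output voltage
        lin = torch.sigmoid(v_out[:, 0])     # normalized linear velocity, [0, 1]
        ang = torch.tanh(v_out[:, 1])        # normalized angular velocity, [-1, 1]
        return lin, ang

# Example with hypothetical shapes: one robot, 3 stacked 180-beam scans
# model = SNNNavigationModel()
# lin, ang = model(torch.rand(1, 3, 180), torch.rand(1, 4))
```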

Training the spiking neural network reinforcement learning model includes the following steps:

S21. Set the training parameters;

S22. Feed the robot position obtained from odometry, the navigation point position and the radar information to the model as the state input; the SNN navigation model outputs a decision action for the current state;

S23. Execute the decision action in the simulation environment and obtain the reward information and the new state input;

S24. Build an experience pool, store in it the experience of the robots' interaction with the environment, and update the network weights periodically;

S25. Judge whether training has finished; if so, stop, otherwise return to step S21, reset the training parameters and run the next round of training.

The multi-robot training method uses parallel computation to accelerate the training of the model weights: the robots share one set of network weights during training, the state, value and action information of all robots is stored in the same experience pool, and the weights are updated whenever the update condition is met.
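A minimal sketch of such a shared pool is given below; the buffer threshold of 128 steps matches Embodiment 1, while the class and the ppo_update helper are hypothetical.

```python
class SharedExperiencePool:
    """One buffer shared by every robot; all write, one update consumes all."""
    def __init__(self, update_every=128):
        self.buffer, self.update_every = [], update_every

    def add(self, robot_id, state, action, reward, next_state, done, value):
        # Transitions from all robots land in the same pool
        self.buffer.append((robot_id, state, action, reward,
                            next_state, done, value))

    def ready(self):
        return len(self.buffer) >= self.update_every

    def drain(self):
        batch, self.buffer = self.buffer, []
        return batch

# Hypothetical use: one shared policy, many robots.
# for robot_id, obs in enumerate(observations):
#     action = shared_policy(obs)           # same weights for every robot
#     pool.add(robot_id, obs, action, r, obs2, done, value)
# if pool.ready():
#     ppo_update(shared_policy, critic, pool.drain())   # hypothetical helper
```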

The spiking neural network reinforcement learning model uses STBP (spatio-temporal backpropagation) for gradient backpropagation to train the weights of the SNN navigation model.
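STBP trains through the non-differentiable spike by substituting a smooth surrogate in the backward pass. The sketch below shows this in PyTorch with a rectangular surrogate window; the text states only that an approximate differentiable function replaces the spike output, so the particular surrogate shape is an assumption.

```python
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v, v_th=1.0, width=0.5):
        ctx.save_for_backward(v)
        ctx.v_th, ctx.width = v_th, width
        return (v >= v_th).float()          # exact Heaviside in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate: gradient 1/width in a window around v_th,
        # zero elsewhere, so errors can propagate through spiking layers
        surrogate = (torch.abs(v - ctx.v_th) < ctx.width / 2).float() / ctx.width
        return grad_out * surrogate, None, None

spike = SpikeFn.apply  # drop-in replacement for a hard threshold in an LIF cell
```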

A group robot obstacle avoidance device based on spiking neural network reinforcement learning comprises a memory, a processor, and a program stored in the memory; when the processor executes the program, the method described above is implemented.

A storage medium on which a program is stored; when the program is executed, the method described above is implemented.

Compared with the prior art, the present invention has the following beneficial effects:

1) The method of the present invention does not require global perception of the robots' mutual positions; instead, the relative positions between robots are perceived from the radar input through a spiking neural network, and the membrane voltages of two neurons represent the robot's linear and angular velocity, realizing autonomous navigation control.

2) The present invention uses the output of the SNN navigation model as the policy; when deployed on a robot system equipped with a neuromorphic chip, it attains the advantage of low-power computation for mobile robots.

Brief Description of the Drawings

Figure 1 is a flow chart of the method of the present invention;

Figure 2 is a schematic diagram of the training process of the spiking neural network reinforcement learning model of the present invention;

Figure 3 is a schematic diagram of a robot training scene in an embodiment;

Figure 4 is a schematic diagram of the network structure of the SNN navigation model and the evaluation network in an embodiment;

Figure 5 is the reward curve over the training process in an embodiment;

Figure 6 is a schematic diagram of robot paths in navigation tests with different numbers of robots in an embodiment.

Detailed Description of the Embodiments

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments. The embodiments are implemented on the basis of the technical solution of the present invention and give detailed implementations and concrete operating procedures, but the protection scope of the present invention is not limited to the following embodiments.

Embodiment 1

The spiking neural network model is the third generation of neural network models. It is closely tied to neuroscience and computes with models that best fit the mechanisms of biological neurons, bringing it closer to the working principles of the human brain. With the progress of research on neuromorphic chips, deploying a spiking neural network control model on such a chip exploits its computational advantages and can greatly reduce the power a robot spends on neural network inference. Spiking neural networks also have good noise robustness: they can effectively cope with the noise in the radar information acquired while the robot is running, and still produce valid outputs to control the robot's decisions.

Collision-avoidance policies based on deep reinforcement learning have broad application prospects for multi-robot systems: a single agent performs autonomous path planning from its own sensor information together with the globally acquired positions and velocities of the other agents to obtain an optimal collision-avoidance policy, and the reinforcement learning basis helps the algorithm adapt to different interaction scenarios and improves its robustness.

This embodiment therefore obtains control commands by training a reinforcement learning algorithm fused with a spiking neural network: it takes historical and current radar information, the robot state and the target point information as input, and uses the final accumulated voltages of the spiking neural network as the control signals for the linear and angular velocity of a wheeled robot, completing the mutual obstacle-avoidance task of a large multi-robot system. With the control commands output by the spiking neural network, the method proposed by the present invention lets multiple robots autonomously find time-efficient, collision-free navigation paths without the mutual positional relationships between robots being supplied in advance.

Specifically, this embodiment provides a group robot obstacle avoidance method based on spiking neural network reinforcement learning which, as shown in Figure 1, includes the following steps:

S1. Initialize the simulation environment, randomly distribute multiple robots in it, initialize the robot poses and navigation points, and design the robots' reward function according to the environment.

In this embodiment, the reward function R(s_t, a_t) of each robot in the group is expressed as:

R(s_t, a_t) = R_g + R_c + R_w

where R_g is the reward term for the mobile robot approaching the target, D_t is the Euclidean distance between the robot and the target point, T_g is the threshold for judging whether the target point has been reached, r_arrival is the reward when the robot reaches the target point, and w_g is the amplitude coefficient of the distance reward term; R_c is the reward term for collisions between the robot and other robots or obstacles, with r_collision the reward issued when a collision event occurs; R_w is a reward term that encourages smoother robot paths, where ω is the robot's current angular velocity, T_ω is the threshold beyond which the angular velocity is considered too large, and w_ω is the penalty parameter.

S2. Train the spiking neural network reinforcement learning model with a multi-robot training method, where the model comprises an SNN navigation model and an evaluation network; the SNN navigation model takes robot state information as input and outputs velocity commands for robot control, and the evaluation network outputs the value of each <state, action> pair from the SNN navigation model's output and the current state information, serving as the loss basis for training the SNN navigation model.

In this embodiment, the input state information includes the radar information, odometry information and navigation point position acquired by the mobile robot, and the output velocity command consists of a linear velocity and an angular velocity.

Preferably, the radar information includes the radar scan at the current moment together with historical scans, expressed as:

S_B = [B_t, B_{t-1}, B_{t-2}, ..., B_{t-n}]

where S_B is the radar input to the SNN navigation model, B_t is the radar scan at the current time t, and B_{t-n} is the radar scan at time t-n.

The SNN navigation model comprises a perception module and a control module, where:

the perception module processes the input state information and has two branches: one branch uses a one-dimensional convolution followed by LIF neurons to convert the real-valued radar input into spike trains, and the other branch converts the input odometry information and navigation point position into spike codes through a network combining a fully connected layer with LIF neurons; the spike codes of the two branches are flattened into one-dimensional vectors and fed into the control module;

the control module consists of two fully connected layers with LIF neurons and outputs the accumulated membrane voltages of two LIF neurons, which are configured with no firing threshold and never emit spikes; one neuron controls the linear velocity and the other controls the angular velocity; the voltage of the linear-velocity neuron is passed through a sigmoid function to obtain a normalized linear velocity in [0, 1], and the voltage of the angular-velocity neuron is passed through a hyperbolic tangent function to obtain a normalized angular velocity in [-1, 1].

As shown in Figure 2, training the spiking neural network reinforcement learning model includes the following steps:

S21. Set the training parameters;

S22. Feed the robot position obtained from odometry, the navigation point position and the radar information to the model as the state input; the SNN navigation model outputs a decision action for the current state;

S23. Execute the decision action in the simulation environment and obtain the reward information and the new state input;

S24. Build an experience pool, store in it the experience of the robots' interaction with the environment, and update the network weights periodically;

S25. Judge whether training has finished; if so, stop, otherwise return to step S21, reset the training parameters and run the next round of training.

In this embodiment, the multi-robot training method uses parallel computation to accelerate the training of the model weights: the robots share one set of network weights during training, the state, value and action information of all robots is stored in the same experience pool, and the weights are updated whenever the update condition is met.

During training, the spiking neural network reinforcement learning model uses STBP (spatio-temporal backpropagation) for gradient backpropagation to train the weights of the SNN navigation model.

The SNN navigation model obtained through spiking neural network reinforcement learning can detect the presence of other mobile robots from the radar input and issue velocity commands, and the robots can complete efficient cooperative navigation from the outputs of their respective networks.

S3. Based on the SNN navigation model trained in step S2, infer each robot's velocity command at the current moment from the state information that robot acquires; the group of mobile robots then cooperatively completes the navigation task according to the corresponding velocity commands.

This embodiment is based on the robot simulation training environment gym. The poses and target point coordinates of the mobile robots are given in the simulation environment's odometry coordinate frame, and each mobile robot uses a single-line radar to acquire scans over the 180 degrees facing the robot's heading. The number of robots is set to 24 and the training map is 20 m × 20 m, as shown in Figure 3. The initial poses and navigation points of all robots are random; a robot's current simulation episode is stopped and its pose and navigation point are reset when it reaches the target point, when a collision occurs, or when the number of simulation steps reaches 150.

Furthermore, the reward function in this embodiment is set with T_g = 0.2 m, r_arrival = 10, w_g = 2.0, r_collision = -10, w_ω = -0.1 and T_ω = 1.03.
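Plugging these values into the hypothetical reward sketch from the summary section gives, for example:

```python
# Hypothetical call of the reward sketch above with the embodiment's values;
# the robot here is within T_g = 0.2 m of its goal, so R_g = r_arrival.
r, d_t = reward(pos=(1.0, 2.0), goal=(1.05, 2.1), prev_dist=0.15,
                collided=False, omega=0.3,
                T_g=0.2, r_arrival=10.0, w_g=2.0,
                r_collision=-10.0, w_omega=-0.1, T_omega=1.03)
print(r)  # 10.0
```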

The spiking neural network reinforcement learning model constructed in this embodiment includes the SNN navigation model and a CNN-based evaluation network; the network structure is shown in Figure 4. In the perception module of the SNN navigation model, two one-dimensional convolution layers plus one fully connected layer with LIF neurons encode the radar information, while another fully connected layer with LIF neurons processes the robot velocity and target point information. The control module consists of two fully connected layers with LIF neurons. The two LIF neurons at the output do not emit spikes; their accumulated membrane voltages are read out, one neuron controlling the linear velocity and the other the angular velocity. The voltage of the linear-velocity neuron is passed through a sigmoid function to obtain a normalized linear velocity, and the voltage of the angular-velocity neuron is passed through a hyperbolic tangent function to obtain a normalized angular velocity; in this example the linear velocity is controlled within the range [0, 0.5] and the angular velocity within [-1, 1].

The radar information input to the model includes the current scan and historical scans, with n taken as 2, that is:

S_B = [B_t, B_{t-1}, B_{t-2}]

where S_B is the radar input to the network, B_t is the radar scan at the current time t, and B_{t-n} is the radar scan at time t-n.

The CNN-based evaluation network has essentially the same structure as the SNN navigation model; it uses the ReLU activation function, and its input state is identical to that of the SNN control model.

The learning algorithm is based on the PPO (Proximal Policy Optimization) reinforcement learning algorithm; whenever the total number of simulation steps across all robots reaches 128, the reinforcement learning experience pool performs experience replay and the weights are trained and updated. For the SNN navigation network, the STBP (spatio-temporal backpropagation) algorithm is used: the non-differentiable spike output is replaced with an approximate differentiable function, and the network parameters are optimized with stochastic gradient descent.

The training runs for 5000 episodes with the learning rate set to 0.0005, and the robots in gym are modeled as circular robots with a radius of 0.5 m. The reinforcement learning reward curve as training progresses is shown in Figure 5.

To verify the effectiveness of the trained SNN navigation model, after training this embodiment ran simulation tests on the gym platform with different numbers of robots, as shown in Figure 6, with 6, 20 and 100 robots respectively. In all three test environments, every robot navigated to its own target point without collisions.

Embodiment 2

This embodiment provides a group robot obstacle avoidance device based on spiking neural network reinforcement learning, comprising a memory, a processor, and a program stored in the memory; when the processor executes the program, the method described in Embodiment 1 above is implemented.

Embodiment 3

This embodiment provides a storage medium on which a program is stored; when the program is executed, the method described in Embodiment 1 above is implemented.

The systems, devices, modules or units set forth in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described with their functions divided into various units. Of course, when implementing the present invention, the functions of each unit may be realized in one or more pieces of software and/or hardware.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems or computer program products. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.

The memory may include non-persistent storage in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems or computer program products. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code.

The invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and so on that perform specific tasks or implement specific abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The embodiments of the present invention are described in a progressive manner; the same or similar parts of the embodiments can be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is basically similar to the method embodiment, its description is relatively brief, and the relevant parts of the description of the method embodiment may be consulted.

The preferred embodiments of the present invention are described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and changes based on the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain, on the basis of the prior art and through logical analysis, reasoning or limited experimentation in accordance with the concept of the present invention, shall fall within the scope of protection determined by the claims.

Claims (10)

1. A group robot obstacle avoidance method based on spiking neural network reinforcement learning, characterized by comprising the following steps:
S1. initializing a simulation environment, randomly distributing multiple robots in it, initializing the robot poses and navigation points, and designing the robots' reward function according to the environment;
S2. training a spiking neural network reinforcement learning model with a multi-robot training method, wherein the spiking neural network reinforcement learning model comprises an SNN navigation model and an evaluation network, the SNN navigation model takes robot state information as input and outputs velocity commands for robot control, and the evaluation network outputs the value of each <state, action> pair from the information output by the SNN navigation model and the current state information, guiding the training of the SNN navigation model;
S3. based on the SNN navigation model trained in step S2, inferring each robot's velocity command at the current moment from the state information that robot acquires, the group of mobile robots cooperatively completing the navigation task according to the corresponding velocity commands.

2. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 1, characterized in that, in step S1, the reward function R(s_t, a_t) of each robot in the group is expressed as:
R(s_t, a_t) = R_g + R_c + R_w
where R_g is the reward term for the mobile robot approaching the target, D_t is the Euclidean distance between the robot and the target point, T_g is the threshold for judging whether the target point has been reached, r_arrival is the reward when the robot reaches the target point, and w_g is the amplitude coefficient of the distance reward term; R_c is the reward term for collisions between the robot and other robots or obstacles, with r_collision the reward issued when a collision event occurs; R_w is a reward term that encourages smoother robot paths, where ω is the robot's current angular velocity, T_ω is the threshold beyond which the angular velocity is considered too large, and w_ω is the penalty parameter.

3. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 1, characterized in that the state information includes the radar information, odometry information and navigation point position acquired by the mobile robot, and the velocity command consists of a linear velocity and an angular velocity.

4. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 2, characterized in that the radar information includes the radar scan at the current moment together with historical scans, expressed as:
S_B = [B_t, B_{t-1}, B_{t-2}, ..., B_{t-n}]
where S_B is the radar input to the SNN navigation model, B_t is the radar scan at the current time t, and B_{t-n} is the radar scan at time t-n.

5. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 3, characterized in that the SNN navigation model comprises a perception module and a control module, wherein:
the perception module processes the input state information and has two branches: one branch uses a one-dimensional convolution followed by LIF neurons to convert the real-valued radar input into spike trains, and the other branch converts the input odometry information and navigation point position into spike codes through a network combining a fully connected layer with LIF neurons; the spike codes of the two branches are flattened into one-dimensional vectors and fed into the control module;
the control module consists of two fully connected layers with LIF neurons and outputs the accumulated membrane voltages of two LIF neurons, which are configured with no firing threshold and never emit spikes; one neuron controls the linear velocity and the other controls the angular velocity; the voltage of the linear-velocity neuron is passed through a sigmoid function to obtain a normalized linear velocity in [0, 1], and the voltage of the angular-velocity neuron is passed through a hyperbolic tangent function to obtain a normalized angular velocity in [-1, 1].

6. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 1, characterized in that training the spiking neural network reinforcement learning model includes the following steps:
S21. setting the training parameters;
S22. feeding the robot position obtained from odometry, the navigation point position and the radar information to the model as the state input, the SNN navigation model outputting a decision action for the current state;
S23. executing the decision action in the simulation environment and obtaining the reward information and the new state input;
S24. building an experience pool, storing in it the experience of the robots' interaction with the environment, and updating the network weights periodically;
S25. judging whether training has finished; if so, stopping, otherwise returning to step S21, resetting the training parameters and running the next round of training.

7. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 1, characterized in that the multi-robot training method uses parallel computation to accelerate the training of the model weights, the robots share one set of network weights during training, the state, value and action information of all robots is stored in the same experience pool, and the weights are updated whenever the update condition is met.

8. The group robot obstacle avoidance method based on spiking neural network reinforcement learning according to claim 1, characterized in that the spiking neural network reinforcement learning model uses STBP for gradient backpropagation to train the weights of the SNN navigation model.

9. A group robot obstacle avoidance device based on spiking neural network reinforcement learning, comprising a memory, a processor, and a program stored in the memory, characterized in that when the processor executes the program, the method according to any one of claims 1-8 is implemented.

10. A storage medium on which a program is stored, characterized in that when the program is executed, the method according to any one of claims 1-8 is implemented.
CN202311707059.XA, filed 2023-12-13 (priority date 2023-12-13): Obstacle avoidance method, device and medium for group robot, Pending, published as CN117666592A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311707059.XA | 2023-12-13 | 2023-12-13 | Obstacle avoidance method, device and medium for group robot

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311707059.XA | 2023-12-13 | 2023-12-13 | Obstacle avoidance method, device and medium for group robot

Publications (1)

Publication Number | Publication Date
CN117666592A | 2024-03-08

Family

Family ID: 90084414

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311707059.XA (Pending, CN117666592A (en)) | Obstacle avoidance method, device and medium for group robot | 2023-12-13 | 2023-12-13

Country Status (1)

Country | Link
CN (1) | CN117666592A (en)

Cited By (3)

* Cited by examiner, † Cited by third party

Publication | Priority date | Publication date | Assignee | Title
CN119126806A (en)* | 2024-09-11 | 2024-12-13 | 中国电子科技集团公司第十五研究所 | Collaborative navigation and obstacle avoidance method, device, equipment and medium
CN119126806B (en)* | 2024-09-11 | 2025-08-12 | 中国电子科技集团公司第十五研究所 | Collaborative navigation and obstacle avoidance method, device, equipment and medium
CN118857274A (en)* | 2024-09-25 | 2024-10-29 | 之江实验室 | A model training method and a track planning method and device for geomagnetic navigation

Similar Documents

Publication | Title
Cao et al. | Target search control of AUV in underwater environment with deep reinforcement learning
Bency et al. | Neural path planning: Fixed time, near-optimal path generation via oracle imitation
Niroui et al. | Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments
US11161241B2 | Apparatus and methods for online training of robots
US9008840B1 | Apparatus and methods for reinforcement-guided supervised learning
US10155310B2 | Adaptive predictor apparatus and methods
CN112629542B (en) | Map-free robot path navigation method and system based on DDPG and LSTM
CN111432015B (en) | A full-coverage task assignment method for dynamic noise environments
US8996177B2 | Robotic training apparatus and methods
CN117666592A (en) | Obstacle avoidance method, device and medium for group robot
Zhao et al. | A path planning method based on multi-objective cauchy mutation cat swarm optimization algorithm for navigation system of intelligent patrol car
Yan et al. | Reinforcement Learning-Based Autonomous Navigation and Obstacle Avoidance for USVs under Partially Observable Conditions
CN118153431A (en) | Underwater multi-agent cooperative trapping method and device based on deep reinforcement learning
CN116079747A (en) | Robot cross-body control method, system, computer equipment and storage medium
CN118238132B (en) | Track obstacle avoidance training method and device, electronic equipment and storage medium
CN117606490B (en) | A collaborative search path planning method for underwater autonomous vehicles
Toma et al. | Waypoint planning networks
Cheng et al. | A neural network based mobile robot navigation approach using reinforcement learning parameter tuning mechanism
CN116295415A (en) | A map-free maze navigation method and system based on spiking neural network reinforcement learning
Sun et al. | A Fuzzy-Based Bio-Inspired Neural Network Approach for Target Search by Multiple Autonomous Underwater Vehicles in Underwater Environments
CN116202526A (en) | Crowd Navigation Method Combining Double Convolutional Network and Recurrent Neural Network in Limited Field of View
Khlif et al. | Comparative Analysis of Modified Q-Learning and DQN for Autonomous Robot Navigation
Naik et al. | Basenet: a learning-based mobile manipulator base pose sequence planning for pickup tasks
CN117991775A (en) | Autonomous navigation robot
Golluccio et al. | Objects relocation in clutter with robot manipulators via tree-based q-learning algorithm: Analysis and experiments

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
