Technical Field
The present invention belongs to the field of deep reinforcement learning and intelligent control, and relates to a trajectory tracking control method for an autonomous underwater vehicle (AUV) based on deep reinforcement learning.
Background
The development of deep-sea science depends heavily on deep-sea exploration technology and equipment. Because the deep-sea environment is complex and its conditions are extreme, deep-sea work-class autonomous underwater vehicles are currently the main means of replacing or assisting humans in deep-sea detection, observation and sampling. For mission scenarios that humans cannot reach in person, such as marine resource exploration, seabed surveying and ocean mapping, guaranteeing the autonomy and controllability of the AUV's underwater motion is the most basic and important functional requirement and the prerequisite for accomplishing complex operational tasks. However, many offshore applications of AUVs (such as trajectory tracking control and target tracking control) are extremely challenging, and this difficulty stems mainly from three characteristics of the AUV system. First, as a multi-input multi-output system, the AUV's dynamic and kinematic model (hereinafter referred to as the model) is complex, highly nonlinear, strongly coupled, subject to input or state constraints, and time-varying. Second, uncertainty in the model parameters or in the hydrodynamic environment makes the AUV system difficult to model. Third, most current AUVs are underactuated systems, i.e. the number of degrees of freedom exceeds the number of independent actuators (each independent actuator corresponds to one degree of freedom). The model and its parameters are usually determined by combining mathematical and physical derivation, numerical simulation and physical experiments, with the uncertain parts of the model characterized appropriately. The complexity of the model makes the AUV control problem correspondingly complex. Moreover, as AUV application scenarios continue to expand, higher accuracy and stability are demanded of AUV motion control, and improving the control performance of AUVs in various motion scenarios has become an important research direction.
Over the past few decades, researchers have designed and validated a variety of AUV motion control methods for application scenarios such as trajectory tracking, waypoint tracking, path planning and formation control. A representative example is the model-based output feedback control method proposed by Refsnes et al., which uses two decoupled system models: a three-degree-of-freedom current-induced hull model that captures the current load and a five-degree-of-freedom model that describes the system dynamics. Healey et al. designed a tracking control method based on state feedback, which assumes a fixed forward speed, linearizes the system model and uses three decoupled models: a surge model, a horizontal steering model (sway and yaw) and a vertical model (heave and pitch). However, because these methods decouple or linearize the system model, they can hardly meet the high-precision control requirements of AUVs in specific application scenarios.
Owing to the limitations of the classical motion control methods above and the powerful self-learning ability of reinforcement learning, researchers have in recent years shown great interest in intelligent control methods represented by reinforcement learning. Intelligent control methods based on reinforcement learning techniques (e.g. Q-learning, direct policy search, actor-critic networks and adaptive reinforcement learning) have been continuously proposed and successfully applied to complex scenarios such as robot motion control, UAV flight control, hypersonic vehicle tracking control and traffic signal control. The core idea of reinforcement-learning-based control is to optimize the performance of the control system without prior knowledge. For AUV systems, many researchers have designed reinforcement-learning-based control methods and verified their feasibility. For autonomous underwater cable tracking control, El-Fakdi et al. used direct policy search to learn the state/action mapping, but that method only applies when both the state and action spaces are discrete; for continuous action spaces, Paula et al. used a radial basis function network to approximate the policy function, but because the function approximation capability of radial basis networks is weak, that control method cannot guarantee high tracking control accuracy.
In recent years, with the development of deep neural network (DNN) training techniques such as batch learning, experience replay and batch normalization, deep reinforcement learning has shown excellent performance in complex tasks such as robot motion control, autonomous ground vehicle motion control, quadrotor control and autonomous driving. In particular, the recently proposed deep Q-network (DQN) has achieved human-level control performance on many challenging tasks. However, DQN cannot handle problems with both a high-dimensional state space and a continuous action space. Building on DQN, the deep deterministic policy gradient (DDPG) algorithm was proposed to achieve continuous control. However, DDPG uses a target critic network to estimate the target value of the critic network, so the critic cannot effectively evaluate the policy learned by the policy network, and the learned action-value function has a large variance; consequently, when DDPG is applied to AUV trajectory tracking control, it cannot meet the requirements of high tracking accuracy and stable learning.
Summary of the Invention
The purpose of the present invention is to propose an AUV trajectory tracking control method based on deep reinforcement learning. The method adopts a hybrid policy-critic network structure and uses multiple quasi-Q learning and the deterministic policy gradient to train the critic networks and the policy networks, respectively, so as to overcome the problems of earlier reinforcement-learning-based methods, such as low control accuracy, inability to realize continuous control and unstable learning, and to achieve high-precision AUV trajectory tracking control with stable learning.
To achieve the above purpose, the present invention adopts the following technical solution:
A trajectory tracking control method for an autonomous underwater vehicle based on deep reinforcement learning, the method comprising the following steps:
1) Define the autonomous underwater vehicle (AUV) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determine the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step; ξ_k and δ_k are bounded by the maximum propeller thrust ξ_max and the maximum rudder angle δ_max, respectively;
1-2) Determine the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step;
1-3) Define the trajectory tracking control error
Select a reference trajectory d_k according to the desired path of the AUV, and define the AUV trajectory tracking control error of the k-th time step as e_k = η_k − d_k;
1-4) Establish the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) is selected in which γ is the discount factor and H is the weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) at the initial time, i.e. τ* = arg min_τ P_0(τ);
2) Establish the Markov decision process model of the AUV trajectory tracking problem
Model the AUV trajectory tracking problem of step 1) as a Markov decision process. The specific steps are as follows:
2-1) Define the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV about the heading direction at the k-th time step;
Based on the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), define the state vector s_k of the k-th time step from these quantities;
2-2) Define the action vector
Define the action vector of the k-th time step as the AUV system input vector at that time step, i.e. a_k = τ_k;
2-3) Define the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k; based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), define the AUV reward function r_{k+1} of the k-th time step;
2-4) Convert the AUV trajectory tracking control objective τ* established in step 1-4) into an AUV trajectory tracking control objective under the reinforcement learning framework
Define the policy π as the probability of selecting each possible action in a given state, and define the action-value function as Q^π(s_k, a_k) = E[ ∑_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k ],
where E[·] denotes the expectation over the reward function, states and actions, and K is the maximum time step;
The action-value function describes the expected cumulative discounted reward when policy π is followed in the current and all subsequent states. Under the reinforcement learning framework, the AUV trajectory tracking control objective is therefore to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value at the initial time, i.e. π* = arg max_π E_{s_0∼p(s_0)}[ Q^π(s_0, a_0) ],
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The solution of the AUV trajectory tracking control objective τ* established in step 1-4) is thereby converted into the solution of π*;
2-5) Simplify the AUV trajectory tracking control objective under the reinforcement learning framework
The action-value function in step 2-4) is solved through the iterative Bellman equation Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ];
Assume that the policy π is deterministic, i.e. it is a one-to-one mapping from the state vector space of the AUV to its action vector space, denoted μ; the iterative Bellman equation above then simplifies to Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ];
For the deterministic policy μ, the optimal target policy π* in step 2-4) simplifies to the deterministic optimal target policy μ* = arg max_μ E[ Q^μ(s_0, μ(s_0)) ];
3) Construct the hybrid policy-critic network
A hybrid policy-critic network is constructed to estimate the deterministic optimal target policy μ* and the corresponding optimal action-value function separately. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Construct the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks, where θ_p is the weight parameter of the p-th policy network, p = 1, …, n; each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k, and the output of each policy network is an action vector a_k;
3-2) Construct the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks, where w_q is the weight parameter of the q-th critic network, q = 1, …, m; each critic network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the inputs of each critic network are the state vector s_k and the action vector a_k, where the state vector s_k enters the network at the input layer and the action vector a_k enters at the first hidden layer; the output of each critic network is the action value of taking action a_k in state s_k;
3-3) Determine the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks;
4) Solve the target policy μ_f(s_k) of AUV trajectory tracking control. The specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps per iteration K, the training batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H in the reward function;
4-2) Initialize the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select the d-th policy network from the n policy networks, d = 1, …, n;
Construct an experience replay queue R with maximum capacity B and initialize it as empty;
4-3) Start the iterations, train the hybrid policy-critic network, and initialize the iteration counter episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate an exploration noise Noise_k;
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k;
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience queue R has reached the maximum capacity B, first delete the earliest-added sample and then store the experience sample e_k in R; otherwise store e_k in R directly;
Select A experience samples from the experience queue R as follows: if the number of samples in R does not exceed N, select all experience samples in R; if the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each critic network to characterize the performance of each critic network;
Select the worst-performing critic network; its index, denoted c, is obtained as c = arg max_q EBAE_q;
4-8) Use the c-th critic network to obtain, through a greedy strategy, the action vector of each experience sample at the next time step;
4-9) Compute the target value of the c-th critic network by multiple quasi-Q learning;
4-10) Compute the loss function L(w_c) of the c-th critic network;
4-11) Update the weight parameters of the c-th critic network using the derivative of the loss function L(w_c) with respect to the weight parameters w_c, i.e. w_c ← w_c − α_ω ∂L(w_c)/∂w_c;
The weight parameters of the other critic networks remain unchanged;
4-12) Randomly select one policy network from the n policy networks to reset the d-th policy network;
4-13) Based on the updated c-th critic network, compute the deterministic policy gradient of the d-th policy network and use it to update the weight parameters θ_d of the d-th policy network;
The weight parameters of the other policy networks remain unchanged;
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16);
4-16) The iterations end and the training process of the hybrid policy-critic network is terminated; the outputs of the n policy networks at termination are combined through the formula in step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized by this target policy.
Features and beneficial effects of the present invention:
The method proposed by the present invention employs multiple policy networks and multiple critic networks. For the critic networks, the performance of each critic is assessed through the defined expected Bellman absolute error, and only the worst-performing critic is updated at each time step. Unlike existing reinforcement-learning-based control methods, the present invention proposes multiple quasi-Q learning to compute more accurate critic target values; this resolves the overestimation of the action-value function and stabilizes the learning process without resorting to a target critic network. For the policy networks, one policy network is randomly selected at each time step and updated with the deterministic policy gradient. The finally learned policy is the mean of all policy networks.
1) The AUV trajectory tracking control method proposed by the present invention does not depend on a model: it autonomously learns the target policy that optimizes the control objective from data sampled while the AUV is travelling, without making any assumptions about the AUV model. It is therefore particularly suitable for AUVs working in complex deep-sea environments and has high practical value.
2) The method of the present invention uses multiple quasi-Q learning to obtain critic target values that are more accurate than those of existing methods, which both reduces the variance of the action-value function approximated by the critic networks and resolves its overestimation, thereby yielding a better target policy and achieving high-precision AUV trajectory tracking control.
3) The method of the present invention decides, based on the expected Bellman absolute error, which critic network to update at each time step. This update rule weakens the influence of poorly performing critics and thus ensures fast convergence of the learning process.
4) Because the method of the present invention employs multiple critic networks, its learning process is not easily affected by poor historical AUV tracking trajectories; it is robust and the learning process is stable.
5) The method of the present invention combines reinforcement learning with deep neural networks, has a strong self-learning ability, can realize high-precision adaptive control of an AUV in uncertain deep-sea environments, and has good application prospects in scenarios such as AUV trajectory tracking and underwater obstacle avoidance.
Description of the Drawings
Fig. 1 compares the performance of the method proposed by the present invention with the existing DDPG method: panel (a) compares the learning curves, and panel (b) compares the AUV trajectory tracking results.
Fig. 2 compares the performance of the method proposed by the present invention with the neural network PID method: panel (a) compares the coordinate trajectory tracking along the X and Y directions, and panel (b) compares the tracking errors in the X and Y directions.
Detailed Description
The trajectory tracking control method for an autonomous underwater vehicle based on deep reinforcement learning proposed by the present invention is further described in detail below with reference to the drawings and specific embodiments.
The present invention proposes an AUV tracking control algorithm based on deep reinforcement learning, which mainly comprises four parts: defining the AUV trajectory tracking control problem, establishing the Markov decision process model of the AUV trajectory tracking problem, constructing the hybrid policy-critic network structure, and solving the target policy of AUV trajectory tracking control.
1) Define the AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four components: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determine the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step, i.e. the time instant k·t, where t is the time step length (the same below); ξ_k and δ_k are bounded by the maximum propeller thrust ξ_max and the maximum rudder angle δ_max, respectively, which are determined by the propeller model adopted by the AUV.
1-2) Determine the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step.
1-3) Define the trajectory tracking control error
Select a reference trajectory d_k according to the desired path of the AUV, and define the AUV trajectory tracking control error of the k-th time step as e_k = η_k − d_k.
1-4) Establish the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) is selected in which γ is the discount factor and H is the weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) at the initial time, i.e. τ* = arg min_τ P_0(τ).
2) Establish the Markov decision process model of the AUV trajectory tracking problem
The Markov decision process (MDP) is the foundation of reinforcement learning theory, so the AUV trajectory tracking problem of step 1) needs to be modeled as an MDP. The main elements of reinforcement learning are the agent, the environment, the state, the action and the reward function; the goal of the agent is to learn, through interaction with the environment in which the AUV operates, an optimal sequence of actions (or control inputs) that maximizes the cumulative reward (or, equivalently, minimizes the cumulative tracking control error), and thereby to solve the AUV trajectory tracking objective. The specific steps are as follows:
2-1) Define the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV about the heading direction at the k-th time step.
Based on the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), define the state vector s_k of the k-th time step from these quantities.
2-2) Define the action vector
Define the action vector of the k-th time step as the AUV system input vector at that time step, i.e. a_k = τ_k.
2-3) Define the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k; based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), define the AUV reward function r_{k+1} of the k-th time step.
2-4) Convert the AUV trajectory tracking control objective τ* established in step 1-4) into an AUV trajectory tracking control objective under the reinforcement learning framework
Define the policy π as the probability of selecting each possible action in a given state, and define the action-value function as Q^π(s_k, a_k) = E[ ∑_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k ],
where E[·] denotes the expectation over the reward function, states and actions (the same below), and K is the maximum time step;
The action-value function describes the expected cumulative discounted reward when policy π is followed in the current and all subsequent states. Therefore, under the reinforcement learning framework, the AUV trajectory tracking control objective (i.e. the objective of the agent) is to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value at the initial time, i.e. π* = arg max_π E_{s_0∼p(s_0)}[ Q^π(s_0, a_0) ],
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector.
Therefore, the solution of the AUV trajectory tracking control objective τ* established in step 1-4) can be converted into the solution of π*.
2-5) Simplify the AUV trajectory tracking control objective under the reinforcement learning framework
Similar to dynamic programming, many reinforcement learning methods solve the action-value function in step 2-4) through the iterative Bellman equation Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ];
Assuming that the policy π is deterministic, i.e. a one-to-one mapping from the state vector space of the AUV to its action vector space, denoted μ, the iterative Bellman equation above can be simplified to Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ];
In addition, for the deterministic policy μ, the optimal target policy π* in step 2-4) simplifies to the deterministic optimal target policy μ* = arg max_μ E[ Q^μ(s_0, μ(s_0)) ].
3) Construct the hybrid policy-critic network
As can be seen from step 2-5), the core of solving the AUV trajectory tracking problem with reinforcement learning is how to solve the deterministic optimal target policy μ* and the corresponding optimal action-value function. The method of the present invention uses a hybrid policy-critic network to estimate the two separately. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Construct the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks (to balance the tracking control accuracy of the algorithm of the present invention against the network training speed, n should be neither too large nor too small), where θ_p is the weight parameter of the p-th policy network, p = 1, …, n. Each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and its output is the action vector a_k; the two hidden layers contain 400 and 300 units, respectively.
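A minimal sketch of one such policy network is given below. The framework (PyTorch), the activation functions and the output scaling to the actuator bounds are assumptions not specified by the embodiment; the layer sizes (400 and 300 hidden units) and the two-dimensional action output follow the description above, and the bound values are those of the REMUS example given later.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """One of the n fully connected policy networks: state vector in, action vector out.
    Hidden sizes (400, 300) follow the embodiment; the sigmoid/tanh output scaling
    to the actuator bounds is an assumption, not specified in the text."""
    def __init__(self, state_dim, thrust_max=86.0, rudder_max=0.24):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, 2)          # [propeller thrust, rudder angle]
        self.thrust_max = thrust_max
        self.rudder_max = rudder_max

    def forward(self, state):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(h))
        raw = self.out(h)
        thrust = torch.sigmoid(raw[..., :1]) * self.thrust_max   # xi_k in [0, xi_max]
        rudder = torch.tanh(raw[..., 1:]) * self.rudder_max      # delta_k in [-delta_max, delta_max]
        return torch.cat([thrust, rudder], dim=-1)
```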
3-2) Construct the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks (the number of critic networks is chosen on the same basis as the number of policy networks above), where w_q is the weight parameter of the q-th critic network, q = 1, …, m. Each critic network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the two hidden layers contain 400 and 300 units, respectively. The inputs of each critic network are the state vector s_k and the action vector a_k, where the state vector s_k enters the network at the input layer and the action vector a_k enters at the first hidden layer; the output of each critic network is the action value of taking action a_k in state s_k.
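A corresponding sketch of one critic network, under the same PyTorch assumption, is shown below; the state enters at the input layer and the action is injected at the first hidden layer, as described in step 3-2).

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """One of the m fully connected critic networks, producing Q(s_k, a_k).
    The state s_k enters at the input layer; the action a_k is concatenated
    into the first hidden layer, as described in step 3-2)."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        # the 400-unit features and the action are fed jointly into the second layer
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        return self.out(h)                    # scalar action value
```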
3-3) Determine the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks, i.e. μ_f(s_k) = (1/n) ∑_{p=1}^{n} μ(s_k | θ_p).
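Under the same assumed setup, the target policy of step 3-3) is simply the average of the n policy network outputs, for example:

```python
import torch

def target_policy(policy_nets, state):
    """mu_f(s_k): mean of the n policy network outputs (step 3-3))."""
    with torch.no_grad():
        actions = torch.stack([net(state) for net in policy_nets], dim=0)
    return actions.mean(dim=0)
```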
4) Solve the target policy μ_f(s_k) of AUV trajectory tracking control. The specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps per iteration K, the training batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H in the reward function. In this embodiment, M = 1500, K = 1000 (with a time step length of t = 0.2 s), N = 64, α_ω = 0.01 for each critic network, α_θ = 0.001 for each policy network, γ = 0.99 and H = [0.001, 0; 0, 0.001];
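For reference, the embodiment's settings can be collected in a small configuration object; all numerical values below are taken directly from the paragraph above (and B from step 4-2)).

```python
from dataclasses import dataclass, field

@dataclass
class MPQDPGConfig:
    # values of this embodiment (step 4-1))
    M: int = 1500               # maximum number of iterations (episodes)
    K: int = 1000               # maximum time steps per iteration
    t: float = 0.2              # time step length in seconds
    N: int = 64                 # batch size drawn by experience replay
    alpha_w: float = 0.01       # critic learning rate
    alpha_theta: float = 0.001  # policy learning rate
    gamma: float = 0.99         # discount factor
    H: list = field(default_factory=lambda: [[0.001, 0.0], [0.0, 0.001]])  # weight matrix
    B: int = 10000              # replay queue capacity (step 4-2))
```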
4-2) Initialize the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select the d-th (d = 1, …, n) policy network from the n policy networks;
Construct an experience replay queue R with maximum capacity B (B = 10000 in this embodiment) and initialize it as empty;
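A minimal experience replay queue matching steps 4-2) and 4-6) (drop the oldest sample once the capacity B is reached, then sample up to N transitions) might look as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience queue R with maximum capacity B (the oldest sample is dropped first)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # deque discards the oldest item automatically

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # experience sample e_k

    def sample(self, batch_size=64):
        # step 4-6): take everything while fewer than N samples, otherwise N random samples
        if len(self.buffer) <= batch_size:
            return list(self.buffer)
        return random.sample(self.buffer, batch_size)
```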
4-3) Start the iterations, train the hybrid policy-critic network, and initialize the iteration counter episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate an exploration noise Noise_k (this embodiment uses Ornstein-Uhlenbeck exploration noise);
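The Ornstein-Uhlenbeck exploration noise of this embodiment can be generated as sketched below; the noise parameters mu, theta and sigma are typical values and are assumptions, since the text does not specify them.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise Noise_k (parameter values are assumed)."""
    def __init__(self, action_dim=2, mu=0.0, theta=0.15, sigma=0.2, dt=0.2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(action_dim, mu)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state
```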
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k;
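Because the exact expression of step 4-5) is not reproduced here, the sketch below assumes that the behaviour action is the mean of the n current policy outputs plus the exploration noise, clipped to the actuator bounds of the embodiment.

```python
import numpy as np
import torch

def select_action(policy_nets, state, noise, thrust_max=86.0, rudder_max=0.24):
    """a_k from the n current policy networks plus Noise_k (assumed form of step 4-5))."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        a = torch.stack([net(state_t) for net in policy_nets]).mean(dim=0).numpy()
    a = a + noise.sample()
    a[0] = np.clip(a[0], 0.0, thrust_max)           # propeller thrust xi_k
    a[1] = np.clip(a[1], -rudder_max, rudder_max)   # rudder angle delta_k
    return a
```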
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience queue R has reached the maximum capacity B, first delete the earliest-added sample and then store the experience sample e_k in R; otherwise store e_k in R directly;
Select A experience samples (A ≤ N) from the experience queue R as follows: if the number of samples in R does not exceed N, select all experience samples in R; if the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R, where l denotes the time step of a selected experience sample;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each critic network to characterize the performance of each critic network;
Select the worst-performing critic network; its index, denoted c, is obtained as c = arg max_q EBAE_q;
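The exact EBAE formula of step 4-7) is likewise not reproduced here; the sketch below assumes it is the mean absolute one-step Bellman error of each critic over the sampled batch, with the next action supplied by the mean target policy, and picks the critic with the largest error as the worst-performing one.

```python
import torch

def worst_critic_index(critics, policy_nets, batch, gamma=0.99):
    """Step 4-7): estimate EBAE_q for every critic and return the index c of the worst one.
    The mean-absolute-Bellman-error form and the use of the mean policy for the next
    action are assumptions; the patent gives the exact formula only as an image."""
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    r = r.unsqueeze(-1)
    with torch.no_grad():
        a_next = torch.stack([net(s_next) for net in policy_nets]).mean(dim=0)
        ebae = []
        for q in critics:
            bellman_err = r + gamma * q(s_next, a_next) - q(s, a)
            ebae.append(bellman_err.abs().mean())
    return int(torch.argmax(torch.stack(ebae)))     # c = argmax_q EBAE_q
```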
4-8) Use the c-th critic network to obtain, through a greedy strategy, the action vector of each experience sample at the next time step;
4-9) Compute the target value of the c-th critic network by multiple quasi-Q learning;
4-10) Compute the loss function L(w_c) of the c-th critic network;
4-11) Update the weight parameters of the c-th critic network using the derivative of the loss function L(w_c) with respect to the weight parameters w_c, i.e. w_c ← w_c − α_ω ∂L(w_c)/∂w_c;
The weight parameters of the other critic networks remain unchanged;
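A sketch of the critic update of steps 4-8) to 4-11) follows. The greedy next-action selection and the multi quasi-Q-learning target are only given above in words, so the code assumes that critic c rates the n policy proposals and keeps the best one, and that the target value averages the remaining critics at that action; the mean-squared loss and the single gradient step on critic c follow the description directly.

```python
import torch

def update_worst_critic(critics, c, policy_nets, batch, critic_optimizers, gamma=0.99):
    """Steps 4-8) to 4-11): build a target value y, form the loss L(w_c) and take one
    gradient step on the worst-performing critic c only.
    Assumptions (the patent gives the formulas only as images): critic c greedily picks,
    among the n policy proposals, the highest-valued next action; the quasi-Q-learning
    target averages the values of the other critics at that action; the loss is the MSE."""
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    r = r.unsqueeze(-1)
    with torch.no_grad():
        proposals = torch.stack([net(s_next) for net in policy_nets])      # (n, B, 2)
        q_next = torch.stack([critics[c](s_next, p) for p in proposals])   # (n, B, 1)
        best = q_next.argmax(dim=0).squeeze(-1)                            # (B,)
        a_next = proposals[best, torch.arange(proposals.shape[1])]         # (B, 2)
        others = [q for i, q in enumerate(critics) if i != c] or [critics[c]]
        y = r + gamma * torch.stack([q(s_next, a_next) for q in others]).mean(dim=0)
    loss = torch.mean((y - critics[c](s, a)) ** 2)                         # L(w_c)
    critic_optimizers[c].zero_grad()
    loss.backward()
    critic_optimizers[c].step()               # w_c <- w_c - alpha_w * dL/dw_c
    return loss.item()
```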
4-12) Randomly select one policy network from the n policy networks to reset the d-th policy network;
4-13) Based on the updated c-th critic network, compute the deterministic policy gradient of the d-th policy network, ∇_{θ_d} J ≈ E[ ∇_a Q(s, a | w_c)|_{a=μ(s|θ_d)} · ∇_{θ_d} μ(s | θ_d) ], and use it to update the weight parameters θ_d of the d-th policy network as θ_d ← θ_d + α_θ ∇_{θ_d} J;
The weight parameters of the other policy networks remain unchanged.
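A sketch of the policy update of steps 4-12) and 4-13), assuming the standard deterministic policy gradient is applied through the freshly updated critic c to one randomly chosen policy network:

```python
import random
import torch

def update_one_policy(policy_nets, critics, c, batch, policy_optimizers):
    """Steps 4-12) and 4-13): re-draw the index d at random and update only that policy
    network with the deterministic policy gradient through the updated critic c."""
    d = random.randrange(len(policy_nets))
    s = torch.as_tensor([e[0] for e in batch], dtype=torch.float32)
    # maximizing Q_c(s, mu_d(s)); autograd yields grad_a Q * grad_theta mu automatically
    loss = -critics[c](s, policy_nets[d](s)).mean()
    policy_optimizers[d].zero_grad()
    loss.backward()
    policy_optimizers[d].step()               # theta_d <- theta_d + alpha_theta * DPG
    return d
```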
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15).
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16).
4-16) The iterations end and the training process of the hybrid policy-critic network is terminated; the outputs of the n policy networks at termination are combined through the formula in step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized by this target policy.
Validation of the embodiment of the present invention
The performance of the AUV trajectory tracking control method based on deep reinforcement learning proposed by the present invention (hereinafter MPQ-DPG) is analyzed as follows. All comparison experiments are based on the widely used REMUS autonomous underwater vehicle, whose maximum propeller thrust and maximum rudder angle are 86 N and 0.24 rad, respectively, and a fixed reference trajectory is adopted for all experiments.
In addition, in this embodiment of the present invention, the number of critic networks m is equal to the number of policy networks n, and both are denoted n below.
1) Comparison of MPQ-DPG with the existing DDPG method
Fig. 1 compares the proposed deep-reinforcement-learning AUV trajectory tracking control method (MPQ-DPG) with the existing DDPG method in terms of the learning curves during training and the trajectory tracking results; the learning curves in panel (a) are obtained from five independent experiments, and Ref in panel (b) denotes the reference trajectory.
Analyzing Fig. 1 leads to the following conclusions:
a) Compared with DDPG, MPQ-DPG has better learning stability, because MPQ-DPG uses multiple critic and policy networks, which reduces the influence of poor samples on learning stability.
b) The average cumulative reward to which MPQ-DPG finally converges is clearly higher than that of DDPG, which shows that the tracking control accuracy of MPQ-DPG is clearly higher than that of DDPG.
c) As can be observed from Fig. 1(b), the tracking trajectory obtained by MPQ-DPG almost coincides with the reference trajectory, showing that MPQ-DPG can achieve high-precision AUV tracking control.
d) As the numbers of policy and critic networks increase, the tracking control accuracy of MPQ-DPG gradually improves, but the improvement is no longer significant for n > 4.
2) Comparison of MPQ-DPG with the existing neural network PID method
Fig. 2 compares the MPQ-DPG method proposed by the present invention for trajectory tracking control of an underwater unmanned vehicle with the neural network PID method in terms of the coordinate trajectory tracking curves and the coordinate tracking errors; in the figure, Ref denotes the reference coordinate trajectory, PIDNN denotes the neural network PID algorithm, and n = 4.
Analysis of Fig. 2 shows that the tracking performance of the neural network PID control method is clearly inferior to that of the MPQ-DPG method proposed by the present invention. Moreover, the tracking errors in Fig. 2(b) show that MPQ-DPG achieves faster error convergence; in particular, in the initial phase MPQ-DPG still achieves fast, high-precision tracking, whereas the response time of the neural network PID method is clearly longer than that of MPQ-DPG and the convergence of its tracking error is poorer.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited to it; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.
| CN116449856A (en)* | 2022-01-06 | 2023-07-18 | 中国科学院声学研究所 | Underwater vehicle attitude control system and method based on reinforcement learning compensator |
| CN116578102A (en)* | 2023-07-13 | 2023-08-11 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116827685A (en)* | 2023-08-28 | 2023-09-29 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN117826860A (en)* | 2024-03-04 | 2024-04-05 | 北京航空航天大学 | A method for determining the control strategy of fixed-wing UAV based on reinforcement learning |
| CN119260750A (en)* | 2024-12-09 | 2025-01-07 | 北京配天技术有限公司 | Method and electronic device for realizing robot imitation learning trajectory |
| CN119558236A (en)* | 2025-02-05 | 2025-03-04 | 天津清润博智能科技有限公司 | A numerical simulation method and system for AUV oblique navigation based on turbine machinery |
| CN119927927A (en)* | 2025-04-07 | 2025-05-06 | 中移(杭州)信息技术有限公司 | Robot following method, device, equipment, storage medium and program product |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120188365A1 (en)* | 2009-07-20 | 2012-07-26 | Precitec Kg | Laser processing head and method for compensating for the change in focus position in a laser processing head |
| KR101545731B1 (en)* | 2014-04-30 | 2015-08-20 | 인하대학교 산학협력단 | System and method for video tracking |
| CN107065881A (en)* | 2017-05-17 | 2017-08-18 | 清华大学 | Robot global path planning method based on deep reinforcement learning |
| CN107102644A (en)* | 2017-06-22 | 2017-08-29 | 华南师范大学 | Underwater robot trajectory control method and control system based on deep reinforcement learning |
| CN107368076A (en)* | 2017-07-31 | 2017-11-21 | 中南大学 | Deep learning control and planning method for robot motion paths in an intelligent environment |
| CN107856035A (en)* | 2017-11-06 | 2018-03-30 | 深圳市唯特视科技有限公司 | Robust dynamic motion method based on reinforcement learning and a whole-body controller |
| Title |
|---|
| LI ZHOU et al.: "AUV Based Source Seeking with Estimated Gradients", Journal of Systems Science & Complexity* |
| RUNSHENG YU et al.: "Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle", Proceedings of the 36th Chinese Control Conference* |
| DUAN YONG et al.: "Evolutionary Reinforcement Learning and Its Application in Robot Path Tracking", Control and Decision* |
| MA QIONGXIONG et al.: "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition)* |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109361700A (en)* | 2018-12-06 | 2019-02-19 | 郑州航空工业管理学院 | A protocol framework for self-organizing network adaptive data transmission for unmanned aerial vehicles |
| US11669097B2 (en) | 2018-12-18 | 2023-06-06 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
| TWI706238B (en)* | 2018-12-18 | 2020-10-01 | 大陸商北京航跡科技有限公司 | Systems and methods for autonomous driving |
| US10955853B2 (en) | 2018-12-18 | 2021-03-23 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
| CN109719721A (en)* | 2018-12-26 | 2019-05-07 | 北京化工大学 | A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait |
| CN109719721B (en)* | 2018-12-26 | 2020-07-24 | 北京化工大学 | A method for autonomous emergence of adaptive gait of a snake-like search and rescue robot |
| CN109726866A (en)* | 2018-12-27 | 2019-05-07 | 浙江农林大学 | Path planning method for unmanned ship based on Q-learning neural network |
| CN109696830A (en)* | 2019-01-31 | 2019-04-30 | 天津大学 | The reinforcement learning adaptive control method of small-sized depopulated helicopter |
| CN109696830B (en)* | 2019-01-31 | 2021-12-03 | 天津大学 | Reinforced learning self-adaptive control method of small unmanned helicopter |
| CN109960259B (en)* | 2019-02-15 | 2021-09-24 | 青岛大学 | A Gradient Potential-Based Multi-Agent Reinforcement Learning Path Planning Method for Unmanned Guided Vehicles |
| CN109960259A (en)* | 2019-02-15 | 2019-07-02 | 青岛大学 | A Gradient Potential-Based Multi-Agent Reinforcement Learning Path Planning Method for Unmanned Guided Vehicles |
| CN109828463A (en)* | 2019-02-18 | 2019-05-31 | 哈尔滨工程大学 | A kind of adaptive wave glider bow of ocean current interference is to control method |
| CN109828467A (en)* | 2019-03-01 | 2019-05-31 | 大连海事大学 | Data-driven unmanned ship reinforcement learning controller structure and design method |
| CN109765916A (en)* | 2019-03-26 | 2019-05-17 | 武汉欣海远航科技研发有限公司 | A kind of unmanned surface vehicle path following control device design method |
| CN109870162A (en)* | 2019-04-04 | 2019-06-11 | 北京航空航天大学 | A UAV flight path planning method based on competitive deep learning network |
| US12287768B2 (en)* | 2019-04-11 | 2025-04-29 | Tencent Technology (Shenzhen) Company Limited | Database performance tuning method, apparatus, and system, device, and storage medium |
| US20210286786A1 (en)* | 2019-04-11 | 2021-09-16 | Tencent Technology (Shenzhen) Company Limited | Database performance tuning method, apparatus, and system, device, and storage medium |
| CN110083064B (en)* | 2019-04-29 | 2022-02-15 | 辽宁石油化工大学 | Network optimal tracking control method based on non-strategy Q-learning |
| CN110083064A (en)* | 2019-04-29 | 2019-08-02 | 辽宁石油化工大学 | A kind of network optimal track control method based on non-strategy Q- study |
| CN110045614A (en)* | 2019-05-16 | 2019-07-23 | 河海大学常州校区 | A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning |
| CN110428615B (en)* | 2019-07-12 | 2021-06-22 | 中国科学院自动化研究所 | Single intersection traffic signal control method, system and device based on deep reinforcement learning |
| CN110428615A (en)* | 2019-07-12 | 2019-11-08 | 中国科学院自动化研究所 | Learn isolated intersection traffic signal control method, system, device based on deeply |
| CN110362089A (en)* | 2019-08-02 | 2019-10-22 | 大连海事大学 | Unmanned ship autonomous navigation method based on deep reinforcement learning and genetic algorithm |
| CN110321666A (en)* | 2019-08-09 | 2019-10-11 | 重庆理工大学 | Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm |
| CN110321666B (en)* | 2019-08-09 | 2022-05-03 | 重庆理工大学 | Multi-robot path planning method based on prior knowledge and DQN algorithm |
| CN110333739A (en)* | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A Reinforcement Learning-Based AUV Behavior Planning and Action Control Method |
| CN110333739B (en)* | 2019-08-21 | 2020-07-31 | 哈尔滨工程大学 | AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning |
| CN110806756A (en)* | 2019-09-10 | 2020-02-18 | 西北工业大学 | Unmanned aerial vehicle autonomous guidance control method based on DDPG |
| CN110806756B (en)* | 2019-09-10 | 2022-08-02 | 西北工业大学 | Autonomous guidance and control method of UAV based on DDPG |
| CN110716574B (en)* | 2019-09-29 | 2023-05-02 | 哈尔滨工程大学 | A Real-time Collision Avoidance Planning Method for UUV Based on Deep Q-Network |
| CN110716574A (en)* | 2019-09-29 | 2020-01-21 | 哈尔滨工程大学 | A real-time collision avoidance planning method for UUV based on deep Q network |
| CN110673602B (en)* | 2019-10-24 | 2022-11-25 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
| CN110673602A (en)* | 2019-10-24 | 2020-01-10 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
| CN110806759A (en)* | 2019-11-12 | 2020-02-18 | 清华大学 | Aircraft route tracking method based on deep reinforcement learning |
| CN110989576B (en)* | 2019-11-14 | 2022-07-12 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
| CN110989576A (en)* | 2019-11-14 | 2020-04-10 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
| CN111027677A (en)* | 2019-12-02 | 2020-04-17 | 西安电子科技大学 | Multi-maneuvering-target tracking method based on depth certainty strategy gradient DDPG |
| CN111091710A (en)* | 2019-12-18 | 2020-05-01 | 上海天壤智能科技有限公司 | Traffic signal control method, system and medium |
| US11747155B2 (en) | 2019-12-31 | 2023-09-05 | Goertek Inc. | Global path planning method and device for an unmanned vehicle |
| CN111061277A (en)* | 2019-12-31 | 2020-04-24 | 歌尔股份有限公司 | Unmanned vehicle global path planning method and device |
| CN111310384A (en)* | 2020-01-16 | 2020-06-19 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
| CN111310384B (en)* | 2020-01-16 | 2024-05-21 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
| CN111240345A (en)* | 2020-02-11 | 2020-06-05 | 哈尔滨工程大学 | A Trajectory Tracking Method of Underwater Robot Based on Double BP Network Reinforcement Learning Framework |
| CN111240345B (en)* | 2020-02-11 | 2023-04-07 | 哈尔滨工程大学 | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework |
| CN111580544A (en)* | 2020-03-25 | 2020-08-25 | 北京航空航天大学 | Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm |
| CN111736617A (en)* | 2020-06-09 | 2020-10-02 | 哈尔滨工程大学 | A speed observer-based trajectory tracking control method for benthic underwater robot with preset performance |
| CN111813143A (en)* | 2020-06-09 | 2020-10-23 | 天津大学 | An intelligent control system and method for underwater glider based on reinforcement learning |
| CN111813143B (en)* | 2020-06-09 | 2022-04-19 | 天津大学 | Underwater glider intelligent control system and method based on reinforcement learning |
| CN111736617B (en)* | 2020-06-09 | 2022-11-04 | 哈尔滨工程大学 | Track tracking control method for preset performance of benthonic underwater robot based on speed observer |
| CN111856936B (en)* | 2020-07-21 | 2023-06-02 | 天津蓝鳍海洋工程有限公司 | Control method for cabled underwater high-flexibility operation platform |
| CN111856936A (en)* | 2020-07-21 | 2020-10-30 | 天津蓝鳍海洋工程有限公司 | Control method for underwater high-flexibility operation platform with cable |
| CN112100834A (en)* | 2020-09-06 | 2020-12-18 | 西北工业大学 | Underwater glider attitude control method based on deep reinforcement learning |
| CN112132263A (en)* | 2020-09-11 | 2020-12-25 | 大连理工大学 | A Multi-agent Autonomous Navigation Method Based on Reinforcement Learning |
| CN112162555B (en)* | 2020-09-23 | 2021-07-16 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in mixed fleet |
| CN112162555A (en)* | 2020-09-23 | 2021-01-01 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in mixed fleet |
| CN112148025A (en)* | 2020-09-24 | 2020-12-29 | 东南大学 | Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning |
| CN112179367A (en)* | 2020-09-25 | 2021-01-05 | 广东海洋大学 | An autonomous navigation method for agents based on deep reinforcement learning |
| CN112179367B (en)* | 2020-09-25 | 2023-07-04 | 广东海洋大学 | A method for autonomous navigation of agents based on deep reinforcement learning |
| CN112241176B (en)* | 2020-10-16 | 2022-10-28 | 哈尔滨工程大学 | Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment |
| CN112241176A (en)* | 2020-10-16 | 2021-01-19 | 哈尔滨工程大学 | A path planning and obstacle avoidance control method for an underwater autonomous vehicle in a large-scale continuous obstacle environment |
| CN112558465A (en)* | 2020-12-03 | 2021-03-26 | 大连海事大学 | Unknown unmanned ship finite time reinforcement learning control method with input limitation |
| CN112506210B (en)* | 2020-12-04 | 2022-12-27 | 东南大学 | Unmanned aerial vehicle control method for autonomous target tracking |
| CN112506210A (en)* | 2020-12-04 | 2021-03-16 | 东南大学 | Unmanned aerial vehicle control method for autonomous target tracking |
| CN112462792A (en)* | 2020-12-09 | 2021-03-09 | 哈尔滨工程大学 | Underwater robot motion control method based on Actor-Critic algorithm |
| CN112698572A (en)* | 2020-12-22 | 2021-04-23 | 西安交通大学 | Structural vibration control method, medium and equipment based on reinforcement learning |
| CN112929900B (en)* | 2021-01-21 | 2022-08-02 | 华侨大学 | MAC Protocol for Interference Alignment in Time Domain Based on Deep Reinforcement Learning in Underwater Acoustic Networks |
| CN112929900A (en)* | 2021-01-21 | 2021-06-08 | 华侨大学 | MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network |
| CN113029123A (en)* | 2021-03-02 | 2021-06-25 | 西北工业大学 | Multi-AUV collaborative navigation method based on reinforcement learning |
| CN113052372A (en)* | 2021-03-17 | 2021-06-29 | 哈尔滨工程大学 | Dynamic AUV tracking path planning method based on deep reinforcement learning |
| CN113052372B (en)* | 2021-03-17 | 2022-08-02 | 哈尔滨工程大学 | Dynamic AUV tracking path planning method based on deep reinforcement learning |
| CN113095463A (en)* | 2021-03-31 | 2021-07-09 | 南开大学 | Robot confrontation method based on evolution reinforcement learning |
| CN113095500B (en)* | 2021-03-31 | 2023-04-07 | 南开大学 | Robot tracking method based on multi-agent reinforcement learning |
| CN113095500A (en)* | 2021-03-31 | 2021-07-09 | 南开大学 | Robot tracking method based on multi-agent reinforcement learning |
| CN113128702B (en)* | 2021-04-15 | 2024-11-19 | 杭州电子科技大学 | A neural network adaptive distributed parallel training method based on reinforcement learning |
| CN113128702A (en)* | 2021-04-15 | 2021-07-16 | 杭州电子科技大学 | Neural network self-adaptive distributed parallel training method based on reinforcement learning |
| CN113370205A (en)* | 2021-05-08 | 2021-09-10 | 浙江工业大学 | Baxter mechanical arm track tracking control method based on machine learning |
| CN113370205B (en)* | 2021-05-08 | 2022-06-17 | 浙江工业大学 | Baxter mechanical arm track tracking control method based on machine learning |
| CN113359448A (en)* | 2021-06-03 | 2021-09-07 | 清华大学 | Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics |
| CN113595768A (en)* | 2021-07-07 | 2021-11-02 | 西安电子科技大学 | Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system |
| CN113467248A (en)* | 2021-07-22 | 2021-10-01 | 南京大学 | Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning |
| WO2023019536A1 (en)* | 2021-08-20 | 2023-02-23 | 上海电气电站设备有限公司 | Deep reinforcement learning-based photovoltaic module intelligent sun tracking method |
| CN113821035A (en)* | 2021-09-22 | 2021-12-21 | 北京邮电大学 | Unmanned ship trajectory tracking control method and device |
| CN113829351B (en)* | 2021-10-13 | 2023-08-01 | 广西大学 | A Cooperative Control Method of Mobile Manipulator Based on Reinforcement Learning |
| CN113829351A (en)* | 2021-10-13 | 2021-12-24 | 广西大学 | Collaborative control method of mobile mechanical arm based on reinforcement learning |
| CN113885330B (en)* | 2021-10-26 | 2022-06-17 | 哈尔滨工业大学 | Information physical system safety control method based on deep reinforcement learning |
| CN113885330A (en)* | 2021-10-26 | 2022-01-04 | 哈尔滨工业大学 | A security control method for cyber-physical systems based on deep reinforcement learning |
| CN114089633A (en)* | 2021-11-19 | 2022-02-25 | 江苏科技大学 | A multi-motor coupling drive control device and method for an underwater robot |
| CN114089633B (en)* | 2021-11-19 | 2024-04-26 | 江苏科技大学 | A multi-motor coupling drive control device and method for underwater robot |
| CN114020001A (en)* | 2021-12-17 | 2022-02-08 | 中国科学院国家空间科学中心 | Intelligent control method of Mars UAV based on deep deterministic policy gradient learning |
| CN114357884A (en)* | 2022-01-05 | 2022-04-15 | 厦门宇昊软件有限公司 | Reaction temperature control method and system based on deep reinforcement learning |
| CN116449856A (en)* | 2022-01-06 | 2023-07-18 | 中国科学院声学研究所 | Underwater vehicle attitude control system and method based on reinforcement learning compensator |
| CN114527642B (en)* | 2022-03-03 | 2024-04-02 | 东北大学 | A method for automatically adjusting PID parameters of AGV based on deep reinforcement learning |
| CN114527642A (en)* | 2022-03-03 | 2022-05-24 | 东北大学 | AGV automatic PID parameter adjusting method based on deep reinforcement learning |
| CN114721408A (en)* | 2022-04-18 | 2022-07-08 | 哈尔滨理工大学 | A Reinforcement Learning-Based Path Tracking Method for Underwater Robots |
| CN114954840B (en)* | 2022-05-30 | 2023-09-05 | 武汉理工大学 | Method, system and device for controlling stability of ship |
| CN114954840A (en)* | 2022-05-30 | 2022-08-30 | 武汉理工大学 | Stability changing control method, system and device for stability changing ship and storage medium |
| CN114995137B (en)* | 2022-06-01 | 2023-04-28 | 哈尔滨工业大学 | Control method of rope-driven parallel robot based on deep reinforcement learning |
| CN114995137A (en)* | 2022-06-01 | 2022-09-02 | 哈尔滨工业大学 | Rope-driven parallel robot control method based on deep reinforcement learning |
| CN114967472B (en)* | 2022-06-17 | 2025-04-18 | 南京太司德智能科技有限公司 | A deep deterministic policy gradient control method for UAV trajectory tracking and state compensation |
| CN114967472A (en)* | 2022-06-17 | 2022-08-30 | 南京航空航天大学 | A UAV Trajectory Tracking State Compensation Depth Deterministic Policy Gradient Control Method |
| CN115016496B (en)* | 2022-06-30 | 2024-11-22 | 重庆大学 | Path tracking method of unmanned surface vehicle based on deep reinforcement learning |
| CN115016496A (en)* | 2022-06-30 | 2022-09-06 | 重庆大学 | Path tracking method of surface unmanned vehicle based on deep reinforcement learning |
| CN114839884B (en)* | 2022-07-05 | 2022-09-30 | 山东大学 | Underwater vehicle bottom layer control method and system based on deep reinforcement learning |
| CN114839884A (en)* | 2022-07-05 | 2022-08-02 | 山东大学 | Underwater vehicle bottom layer control method and system based on deep reinforcement learning |
| CN114967713A (en)* | 2022-07-28 | 2022-08-30 | 山东大学 | Control method of underwater vehicle under discrete change of buoyancy based on reinforcement learning |
| CN114967713B (en)* | 2022-07-28 | 2022-11-29 | 山东大学 | Underwater vehicle buoyancy discrete change control method based on reinforcement learning |
| CN115366099A (en)* | 2022-08-18 | 2022-11-22 | 江苏科技大学 | Mechanical arm depth certainty strategy gradient training method based on forward kinematics |
| CN115366099B (en)* | 2022-08-18 | 2024-05-28 | 江苏科技大学 | Deep deterministic policy gradient training method for robotic arms based on forward kinematics |
| CN115330276B (en)* | 2022-10-13 | 2023-01-06 | 北京云迹科技股份有限公司 | Method and device for robot to automatically select elevator based on reinforcement learning |
| CN115657477A (en)* | 2022-10-13 | 2023-01-31 | 北京理工大学 | An Adaptive Control Method for Robots in Dynamic Environment Based on Offline Reinforcement Learning |
| CN115330276A (en)* | 2022-10-13 | 2022-11-11 | 北京云迹科技股份有限公司 | Method and device for robot to automatically select elevator based on reinforcement learning |
| CN115562345A (en)* | 2022-10-28 | 2023-01-03 | 北京理工大学 | Unmanned aerial vehicle detection track planning method based on deep reinforcement learning |
| CN115657683A (en)* | 2022-11-14 | 2023-01-31 | 中国电子科技集团公司第十研究所 | A real-time obstacle avoidance method for unmanned untethered submersibles that can be used for inspection tasks |
| CN115857556A (en)* | 2023-01-30 | 2023-03-28 | 中国人民解放军96901部队 | Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning |
| CN115826594A (en)* | 2023-02-23 | 2023-03-21 | 北京航空航天大学 | Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters |
| CN115855226B (en)* | 2023-02-24 | 2023-05-30 | 青岛科技大学 | Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion |
| CN115855226A (en)* | 2023-02-24 | 2023-03-28 | 青岛科技大学 | Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion |
| CN116295449B (en)* | 2023-05-25 | 2023-09-12 | 吉林大学 | Underwater autonomous vehicle path indication method and device |
| CN116295449A (en)* | 2023-05-25 | 2023-06-23 | 吉林大学 | Path indication method and device for underwater autonomous vehicle |
| CN116578102B (en)* | 2023-07-13 | 2023-09-19 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116578102A (en)* | 2023-07-13 | 2023-08-11 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116827685B (en)* | 2023-08-28 | 2023-11-14 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN116827685A (en)* | 2023-08-28 | 2023-09-29 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN117826860B (en)* | 2024-03-04 | 2024-06-21 | 北京航空航天大学 | Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning |
| CN117826860A (en)* | 2024-03-04 | 2024-04-05 | 北京航空航天大学 | A method for determining the control strategy of fixed-wing UAV based on reinforcement learning |
| CN119260750A (en)* | 2024-12-09 | 2025-01-07 | 北京配天技术有限公司 | Method and electronic device for realizing robot imitation learning trajectory |
| CN119260750B (en)* | 2024-12-09 | 2025-02-18 | 北京配天技术有限公司 | Method for realizing imitation of learning track by robot and electronic equipment |
| CN119558236A (en)* | 2025-02-05 | 2025-03-04 | 天津清润博智能科技有限公司 | A numerical simulation method and system for AUV oblique navigation based on turbine machinery |
| CN119927927A (en)* | 2025-04-07 | 2025-05-06 | 中移(杭州)信息技术有限公司 | Robot following method, device, equipment, storage medium and program product |
| CN119927927B (en)* | 2025-04-07 | 2025-07-15 | 中移(杭州)信息技术有限公司 | Method, apparatus, device, storage medium and program product for robot following |
| Publication number | Publication date |
|---|---|
| CN108803321B (en) | 2020-07-10 |
| Publication | Title | Publication Date |
|---|---|---|
| CN108803321B (en) | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning | |
| CN107748566B (en) | Underwater autonomous robot fixed depth control method based on reinforcement learning | |
| CN115016496B (en) | Path tracking method of unmanned surface vehicle based on deep reinforcement learning | |
| CN112198870B (en) | Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN | |
| Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
| CN114625151B (en) | Underwater robot obstacle avoidance path planning method based on reinforcement learning | |
| Song et al. | Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning | |
| CN113534668B (en) | AUV motion planning method based on maximum entropy actor-critic framework | |
| CN115793455B (en) | Trajectory tracking control method of unmanned boat based on Actor-Critic-Advantage network | |
| CN111240345A (en) | A Trajectory Tracking Method of Underwater Robot Based on Double BP Network Reinforcement Learning Framework | |
| CN111240344B (en) | Autonomous underwater robot model-free control method based on reinforcement learning technology | |
| CN113359448A (en) | Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics | |
| Mousavian et al. | Identification-based robust motion control of an AUV: optimized by particle swarm optimization algorithm | |
| CN114740873A (en) | Path planning method of autonomous underwater robot based on multi-target improved particle swarm algorithm | |
| CN119088044A (en) | Autonomous navigation control method and system for unmanned ships based on artificial intelligence | |
| CN117606490B (en) | A collaborative search path planning method for underwater autonomous vehicles | |
| CN117908565A (en) | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning | |
| CN116700327A (en) | Unmanned aerial vehicle track planning method based on continuous action dominant function learning | |
| CN115562345A (en) | Unmanned aerial vehicle detection track planning method based on deep reinforcement learning | |
| CN115903474A (en) | Unmanned ship automatic berthing control method based on reinforcement learning | |
| Song et al. | Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning | |
| CN118363379A (en) | Unmanned ship dynamic positioning control method based on deep reinforcement learning | |
| CN117215308A (en) | Novel underactuated small-sized water surface unmanned ship guidance control platform | |
| Huang et al. | The USV path planning of dueling DQN algorithm based on tree sampling mechanism | |
| CN113959446B (en) | Autonomous logistics transportation navigation method for robot based on neural network |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |