Technical Field
The present invention belongs to the field of deep reinforcement learning and intelligent control, and relates to a trajectory tracking control method for an autonomous underwater vehicle (AUV) based on deep reinforcement learning.
Background
The development of deep-sea science depends heavily on deep-sea exploration technology and equipment. Because the deep-sea environment is complex and its conditions are extreme, deep-sea work-class autonomous underwater vehicles are currently the main means of replacing or assisting humans in deep-sea detection, observation and sampling. For mission scenarios that humans cannot reach in person, such as marine resource exploration, seabed surveying and ocean mapping, guaranteeing the autonomy and controllability of the AUV's underwater motion is the most basic and important functional requirement and the prerequisite for accomplishing complex operational tasks. However, many offshore applications of AUVs (such as trajectory tracking control and target tracking control) are extremely challenging, and this difficulty stems mainly from three characteristics of the AUV system. First, as a multi-input multi-output system, the AUV's dynamic and kinematic model (hereinafter referred to as the model) is complex, highly nonlinear, strongly coupled, subject to input or state constraints, and time-varying. Second, uncertainty in the model parameters or in the hydrodynamic environment makes the AUV system difficult to model. Third, most current AUVs are underactuated systems, i.e. the number of degrees of freedom exceeds the number of independent actuators (each independent actuator corresponds to one degree of freedom). The model and its parameters are usually determined by combining mathematical and physical derivation, numerical simulation and physical experiments, with the uncertain parts of the model characterized appropriately. The complexity of the model makes the AUV control problem correspondingly complex. Moreover, as AUV application scenarios continue to expand, higher accuracy and stability are demanded of AUV motion control, and improving the control performance of AUVs in various motion scenarios has become an important research direction.
Over the past few decades, researchers have designed and validated a variety of AUV motion control methods for application scenarios such as trajectory tracking, waypoint tracking, path planning and formation control. A representative example is the model-based output feedback control method proposed by Refsnes et al., which uses two decoupled system models: a three-degree-of-freedom current-induced hull model that captures the current load and a five-degree-of-freedom model that describes the system dynamics. Healey et al. designed a tracking control method based on state feedback, which assumes a fixed forward speed, linearizes the system model and uses three decoupled models: a surge model, a horizontal steering model (sway and yaw) and a vertical model (heave and pitch). However, because these methods decouple or linearize the system model, they can hardly meet the high-precision control requirements of AUVs in specific application scenarios.
Owing to the limitations of the classical motion control methods above and the powerful self-learning ability of reinforcement learning, researchers have in recent years shown great interest in intelligent control methods represented by reinforcement learning. Intelligent control methods based on reinforcement learning techniques (e.g. Q-learning, direct policy search, actor-critic networks and adaptive reinforcement learning) have been continuously proposed and successfully applied to complex scenarios such as robot motion control, UAV flight control, hypersonic vehicle tracking control and traffic signal control. The core idea of reinforcement-learning-based control is to optimize the performance of the control system without prior knowledge. For AUV systems, many researchers have designed reinforcement-learning-based control methods and verified their feasibility. For autonomous underwater cable tracking control, El-Fakdi et al. used direct policy search to learn the state/action mapping, but that method only applies when both the state and action spaces are discrete; for continuous action spaces, Paula et al. used a radial basis function network to approximate the policy function, but because the function approximation capability of radial basis networks is weak, that control method cannot guarantee high tracking control accuracy.
In recent years, with the development of deep neural network (DNN) training techniques such as batch learning, experience replay and batch normalization, deep reinforcement learning has shown excellent performance in complex tasks such as robot motion control, autonomous ground vehicle motion control, quadrotor control and autonomous driving. In particular, the recently proposed deep Q-network (DQN) has achieved human-level control performance on many challenging tasks. However, DQN cannot handle problems with both a high-dimensional state space and a continuous action space. Building on DQN, the deep deterministic policy gradient (DDPG) algorithm was proposed to achieve continuous control. However, DDPG uses a target critic network to estimate the target value of the critic network, so the critic cannot effectively evaluate the policy learned by the policy network, and the learned action-value function has a large variance; consequently, when DDPG is applied to AUV trajectory tracking control, it cannot meet the requirements of high tracking accuracy and stable learning.
Summary of the Invention
The purpose of the present invention is to propose an AUV trajectory tracking control method based on deep reinforcement learning. The method adopts a hybrid policy-critic network structure and uses multiple quasi-Q learning and the deterministic policy gradient to train the critic networks and the policy networks, respectively, so as to overcome the problems of earlier reinforcement-learning-based methods, such as low control accuracy, inability to realize continuous control and unstable learning, and to achieve high-precision AUV trajectory tracking control with stable learning.
To achieve the above purpose, the present invention adopts the following technical solution:
A trajectory tracking control method for an autonomous underwater vehicle based on deep reinforcement learning, the method comprising the following steps:
1) Define the autonomous underwater vehicle (AUV) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determine the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step; ξ_k and δ_k are bounded by the maximum propeller thrust ξ_max and the maximum rudder angle δ_max, respectively;
1-2) Determine the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step;
1-3) Define the trajectory tracking control error
Select a reference trajectory d_k according to the desired path of the AUV, and define the AUV trajectory tracking control error of the k-th time step as e_k = η_k − d_k;
1-4) Establish the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) is selected in which γ is the discount factor and H is the weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) at the initial time, i.e. τ* = arg min_τ P_0(τ);
2) Establish the Markov decision process model of the AUV trajectory tracking problem
Model the AUV trajectory tracking problem of step 1) as a Markov decision process. The specific steps are as follows:
2-1) Define the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV about the heading direction at the k-th time step;
Based on the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), define the state vector s_k of the k-th time step from these quantities;
2-2) Define the action vector
Define the action vector of the k-th time step as the AUV system input vector at that time step, i.e. a_k = τ_k;
2-3) Define the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k; based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), define the AUV reward function r_{k+1} of the k-th time step;
2-4) Convert the AUV trajectory tracking control objective τ* established in step 1-4) into an AUV trajectory tracking control objective under the reinforcement learning framework
Define the policy π as the probability of selecting each possible action in a given state, and define the action-value function as Q^π(s_k, a_k) = E[ ∑_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k ],
where E[·] denotes the expectation over the reward function, states and actions, and K is the maximum time step;
The action-value function describes the expected cumulative discounted reward when policy π is followed in the current and all subsequent states. Under the reinforcement learning framework, the AUV trajectory tracking control objective is therefore to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value at the initial time, i.e. π* = arg max_π E_{s_0∼p(s_0)}[ Q^π(s_0, a_0) ],
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The solution of the AUV trajectory tracking control objective τ* established in step 1-4) is thereby converted into the solution of π*;
2-5) Simplify the AUV trajectory tracking control objective under the reinforcement learning framework
The action-value function in step 2-4) is solved through the iterative Bellman equation Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ];
Assume that the policy π is deterministic, i.e. it is a one-to-one mapping from the state vector space of the AUV to its action vector space, denoted μ; the iterative Bellman equation above then simplifies to Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ];
For the deterministic policy μ, the optimal target policy π* in step 2-4) simplifies to the deterministic optimal target policy μ* = arg max_μ E[ Q^μ(s_0, μ(s_0)) ];
3) Construct the hybrid policy-critic network
A hybrid policy-critic network is constructed to estimate the deterministic optimal target policy μ* and the corresponding optimal action-value function separately. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Construct the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks, where θ_p is the weight parameter of the p-th policy network, p = 1, …, n; each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k, and the output of each policy network is an action vector a_k;
3-2) Construct the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks, where w_q is the weight parameter of the q-th critic network, q = 1, …, m; each critic network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the inputs of each critic network are the state vector s_k and the action vector a_k, where the state vector s_k enters the network at the input layer and the action vector a_k enters at the first hidden layer; the output of each critic network is the action value of taking action a_k in state s_k;
3-3) Determine the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks;
4) Solve the target policy μ_f(s_k) of AUV trajectory tracking control. The specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps per iteration K, the training batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H in the reward function;
4-2) Initialize the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select the d-th policy network from the n policy networks, d = 1, …, n;
Construct an experience replay queue R with maximum capacity B and initialize it as empty;
4-3) Start the iterations, train the hybrid policy-critic network, and initialize the iteration counter episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate an exploration noise Noise_k;
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k;
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience queue R has reached the maximum capacity B, first delete the earliest-added sample and then store the experience sample e_k in R; otherwise store e_k in R directly;
Select A experience samples from the experience queue R as follows: if the number of samples in R does not exceed N, select all experience samples in R; if the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each critic network to characterize the performance of each critic network;
Select the worst-performing critic network; its index, denoted c, is obtained as c = arg max_q EBAE_q;
4-8) Use the c-th critic network to obtain, through a greedy strategy, the action vector of each experience sample at the next time step;
4-9) Compute the target value of the c-th critic network by multiple quasi-Q learning;
4-10) Compute the loss function L(w_c) of the c-th critic network;
4-11) Update the weight parameters of the c-th critic network using the derivative of the loss function L(w_c) with respect to the weight parameters w_c, i.e. w_c ← w_c − α_ω ∂L(w_c)/∂w_c;
The weight parameters of the other critic networks remain unchanged;
4-12) Randomly select one policy network from the n policy networks to reset the d-th policy network;
4-13) Based on the updated c-th critic network, compute the deterministic policy gradient of the d-th policy network and use it to update the weight parameters θ_d of the d-th policy network;
The weight parameters of the other policy networks remain unchanged;
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16);
4-16) The iterations end and the training process of the hybrid policy-critic network is terminated; the outputs of the n policy networks at termination are combined through the formula in step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized by this target policy.
Features and beneficial effects of the present invention:
The method proposed by the present invention employs multiple policy networks and multiple critic networks. For the critic networks, the performance of each critic is assessed through the defined expected Bellman absolute error, and only the worst-performing critic is updated at each time step. Unlike existing reinforcement-learning-based control methods, the present invention proposes multiple quasi-Q learning to compute more accurate critic target values; this resolves the overestimation of the action-value function and stabilizes the learning process without resorting to a target critic network. For the policy networks, one policy network is randomly selected at each time step and updated with the deterministic policy gradient. The finally learned policy is the mean of all policy networks.
1) The AUV trajectory tracking control method proposed by the present invention does not depend on a model: it autonomously learns the target policy that optimizes the control objective from data sampled while the AUV is travelling, without making any assumptions about the AUV model. It is therefore particularly suitable for AUVs working in complex deep-sea environments and has high practical value.
2) The method of the present invention uses multiple quasi-Q learning to obtain critic target values that are more accurate than those of existing methods, which both reduces the variance of the action-value function approximated by the critic networks and resolves its overestimation, thereby yielding a better target policy and achieving high-precision AUV trajectory tracking control.
3) The method of the present invention decides, based on the expected Bellman absolute error, which critic network to update at each time step. This update rule weakens the influence of poorly performing critics and thus ensures fast convergence of the learning process.
4) Because the method of the present invention employs multiple critic networks, its learning process is not easily affected by poor historical AUV tracking trajectories; it is robust and the learning process is stable.
5) The method of the present invention combines reinforcement learning with deep neural networks, has a strong self-learning ability, can realize high-precision adaptive control of an AUV in uncertain deep-sea environments, and has good application prospects in scenarios such as AUV trajectory tracking and underwater obstacle avoidance.
Description of the Drawings
Fig. 1 compares the performance of the method proposed by the present invention with the existing DDPG method: panel (a) compares the learning curves, and panel (b) compares the AUV trajectory tracking results.
Fig. 2 compares the performance of the method proposed by the present invention with the neural network PID method: panel (a) compares the coordinate trajectory tracking along the X and Y directions, and panel (b) compares the tracking errors in the X and Y directions.
Detailed Description
The trajectory tracking control method for an autonomous underwater vehicle based on deep reinforcement learning proposed by the present invention is further described in detail below with reference to the drawings and specific embodiments.
The present invention proposes an AUV tracking control algorithm based on deep reinforcement learning, which mainly comprises four parts: defining the AUV trajectory tracking control problem, establishing the Markov decision process model of the AUV trajectory tracking problem, constructing the hybrid policy-critic network structure, and solving the target policy of AUV trajectory tracking control.
1) Define the AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four components: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determine the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step, i.e. the time instant k·t, where t is the time step length (the same below); ξ_k and δ_k are bounded by the maximum propeller thrust ξ_max and the maximum rudder angle δ_max, respectively, which are determined by the propeller model adopted by the AUV.
1-2) Determine the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step.
1-3) Define the trajectory tracking control error
Select a reference trajectory d_k according to the desired path of the AUV, and define the AUV trajectory tracking control error of the k-th time step as e_k = η_k − d_k.
1-4) Establish the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) is selected in which γ is the discount factor and H is the weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) at the initial time, i.e. τ* = arg min_τ P_0(τ).
2) Establish the Markov decision process model of the AUV trajectory tracking problem
The Markov decision process (MDP) is the foundation of reinforcement learning theory, so the AUV trajectory tracking problem of step 1) needs to be modeled as an MDP. The main elements of reinforcement learning are the agent, the environment, the state, the action and the reward function; the goal of the agent is to learn, through interaction with the environment in which the AUV operates, an optimal sequence of actions (or control inputs) that maximizes the cumulative reward (or, equivalently, minimizes the cumulative tracking control error), and thereby to solve the AUV trajectory tracking objective. The specific steps are as follows:
2-1) Define the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV about the heading direction at the k-th time step.
Based on the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), define the state vector s_k of the k-th time step from these quantities.
2-2) Define the action vector
Define the action vector of the k-th time step as the AUV system input vector at that time step, i.e. a_k = τ_k.
2-3) Define the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k; based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), define the AUV reward function r_{k+1} of the k-th time step.
2-4) Convert the AUV trajectory tracking control objective τ* established in step 1-4) into an AUV trajectory tracking control objective under the reinforcement learning framework
Define the policy π as the probability of selecting each possible action in a given state, and define the action-value function as Q^π(s_k, a_k) = E[ ∑_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k ],
where E[·] denotes the expectation over the reward function, states and actions (the same below), and K is the maximum time step;
The action-value function describes the expected cumulative discounted reward when policy π is followed in the current and all subsequent states. Therefore, under the reinforcement learning framework, the AUV trajectory tracking control objective (i.e. the objective of the agent) is to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value at the initial time, i.e. π* = arg max_π E_{s_0∼p(s_0)}[ Q^π(s_0, a_0) ],
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector.
Therefore, the solution of the AUV trajectory tracking control objective τ* established in step 1-4) can be converted into the solution of π*.
2-5) Simplify the AUV trajectory tracking control objective under the reinforcement learning framework
Similar to dynamic programming, many reinforcement learning methods solve the action-value function in step 2-4) through the iterative Bellman equation Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ];
Assuming that the policy π is deterministic, i.e. a one-to-one mapping from the state vector space of the AUV to its action vector space, denoted μ, the iterative Bellman equation above can be simplified to Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ];
In addition, for the deterministic policy μ, the optimal target policy π* in step 2-4) simplifies to the deterministic optimal target policy μ* = arg max_μ E[ Q^μ(s_0, μ(s_0)) ].
3) Construct the hybrid policy-critic network
As can be seen from step 2-5), the core of solving the AUV trajectory tracking problem with reinforcement learning is how to solve the deterministic optimal target policy μ* and the corresponding optimal action-value function. The method of the present invention uses a hybrid policy-critic network to estimate the two separately. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Construct the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks (to balance the tracking control accuracy of the algorithm of the present invention against the network training speed, n should be neither too large nor too small), where θ_p is the weight parameter of the p-th policy network, p = 1, …, n. Each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and its output is the action vector a_k; the two hidden layers contain 400 and 300 units, respectively.
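A minimal sketch of one such policy network is given below. The framework (PyTorch), the activation functions and the output scaling to the actuator bounds are assumptions not specified by the embodiment; the layer sizes (400 and 300 hidden units) and the two-dimensional action output follow the description above, and the bound values are those of the REMUS example given later.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """One of the n fully connected policy networks: state vector in, action vector out.
    Hidden sizes (400, 300) follow the embodiment; the sigmoid/tanh output scaling
    to the actuator bounds is an assumption, not specified in the text."""
    def __init__(self, state_dim, thrust_max=86.0, rudder_max=0.24):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, 2)          # [propeller thrust, rudder angle]
        self.thrust_max = thrust_max
        self.rudder_max = rudder_max

    def forward(self, state):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(h))
        raw = self.out(h)
        thrust = torch.sigmoid(raw[..., :1]) * self.thrust_max   # xi_k in [0, xi_max]
        rudder = torch.tanh(raw[..., 1:]) * self.rudder_max      # delta_k in [-delta_max, delta_max]
        return torch.cat([thrust, rudder], dim=-1)
```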
3-2) Construct the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks (the number of critic networks is chosen on the same basis as the number of policy networks above), where w_q is the weight parameter of the q-th critic network, q = 1, …, m. Each critic network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers and an output layer; the two hidden layers contain 400 and 300 units, respectively. The inputs of each critic network are the state vector s_k and the action vector a_k, where the state vector s_k enters the network at the input layer and the action vector a_k enters at the first hidden layer; the output of each critic network is the action value of taking action a_k in state s_k.
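A corresponding sketch of one critic network, under the same PyTorch assumption, is shown below; the state enters at the input layer and the action is injected at the first hidden layer, as described in step 3-2).

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """One of the m fully connected critic networks, producing Q(s_k, a_k).
    The state s_k enters at the input layer; the action a_k is concatenated
    into the first hidden layer, as described in step 3-2)."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        # the 400-unit features and the action are fed jointly into the second layer
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        return self.out(h)                    # scalar action value
```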
3-3) Determine the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks, i.e. μ_f(s_k) = (1/n) ∑_{p=1}^{n} μ(s_k | θ_p).
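Under the same assumed setup, the target policy of step 3-3) is simply the average of the n policy network outputs, for example:

```python
import torch

def target_policy(policy_nets, state):
    """mu_f(s_k): mean of the n policy network outputs (step 3-3))."""
    with torch.no_grad():
        actions = torch.stack([net(state) for net in policy_nets], dim=0)
    return actions.mean(dim=0)
```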
4) Solve the target policy μ_f(s_k) of AUV trajectory tracking control. The specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps per iteration K, the training batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H in the reward function. In this embodiment, M = 1500, K = 1000 (with a time step length of t = 0.2 s), N = 64, α_ω = 0.01 for each critic network, α_θ = 0.001 for each policy network, γ = 0.99 and H = [0.001, 0; 0, 0.001];
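For reference, the embodiment's settings can be collected in a small configuration object; all numerical values below are taken directly from the paragraph above (and B from step 4-2)).

```python
from dataclasses import dataclass, field

@dataclass
class MPQDPGConfig:
    # values of this embodiment (step 4-1))
    M: int = 1500               # maximum number of iterations (episodes)
    K: int = 1000               # maximum time steps per iteration
    t: float = 0.2              # time step length in seconds
    N: int = 64                 # batch size drawn by experience replay
    alpha_w: float = 0.01       # critic learning rate
    alpha_theta: float = 0.001  # policy learning rate
    gamma: float = 0.99         # discount factor
    H: list = field(default_factory=lambda: [[0.001, 0.0], [0.0, 0.001]])  # weight matrix
    B: int = 10000              # replay queue capacity (step 4-2))
```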
4-2) Initialize the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select the d-th (d = 1, …, n) policy network from the n policy networks;
Construct an experience replay queue R with maximum capacity B (B = 10000 in this embodiment) and initialize it as empty;
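A minimal experience replay queue matching steps 4-2) and 4-6) (drop the oldest sample once the capacity B is reached, then sample up to N transitions) might look as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience queue R with maximum capacity B (the oldest sample is dropped first)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # deque discards the oldest item automatically

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # experience sample e_k

    def sample(self, batch_size=64):
        # step 4-6): take everything while fewer than N samples, otherwise N random samples
        if len(self.buffer) <= batch_size:
            return list(self.buffer)
        return random.sample(self.buffer, batch_size)
```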
4-3) Start the iterations, train the hybrid policy-critic network, and initialize the iteration counter episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate an exploration noise Noise_k (this embodiment uses Ornstein-Uhlenbeck exploration noise);
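The Ornstein-Uhlenbeck exploration noise of this embodiment can be generated as sketched below; the noise parameters mu, theta and sigma are typical values and are assumptions, since the text does not specify them.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise Noise_k (parameter values are assumed)."""
    def __init__(self, action_dim=2, mu=0.0, theta=0.15, sigma=0.2, dt=0.2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(action_dim, mu)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state
```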
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k;
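Because the exact expression of step 4-5) is not reproduced here, the sketch below assumes that the behaviour action is the mean of the n current policy outputs plus the exploration noise, clipped to the actuator bounds of the embodiment.

```python
import numpy as np
import torch

def select_action(policy_nets, state, noise, thrust_max=86.0, rudder_max=0.24):
    """a_k from the n current policy networks plus Noise_k (assumed form of step 4-5))."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        a = torch.stack([net(state_t) for net in policy_nets]).mean(dim=0).numpy()
    a = a + noise.sample()
    a[0] = np.clip(a[0], 0.0, thrust_max)           # propeller thrust xi_k
    a[1] = np.clip(a[1], -rudder_max, rudder_max)   # rudder angle delta_k
    return a
```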
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience queue R has reached the maximum capacity B, first delete the earliest-added sample and then store the experience sample e_k in R; otherwise store e_k in R directly;
Select A experience samples (A ≤ N) from the experience queue R as follows: if the number of samples in R does not exceed N, select all experience samples in R; if the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R, where l denotes the time step of a selected experience sample;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each critic network to characterize the performance of each critic network;
Select the worst-performing critic network; its index, denoted c, is obtained as c = arg max_q EBAE_q;
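The exact EBAE formula of step 4-7) is likewise not reproduced here; the sketch below assumes it is the mean absolute one-step Bellman error of each critic over the sampled batch, with the next action supplied by the mean target policy, and picks the critic with the largest error as the worst-performing one.

```python
import torch

def worst_critic_index(critics, policy_nets, batch, gamma=0.99):
    """Step 4-7): estimate EBAE_q for every critic and return the index c of the worst one.
    The mean-absolute-Bellman-error form and the use of the mean policy for the next
    action are assumptions; the patent gives the exact formula only as an image."""
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    r = r.unsqueeze(-1)
    with torch.no_grad():
        a_next = torch.stack([net(s_next) for net in policy_nets]).mean(dim=0)
        ebae = []
        for q in critics:
            bellman_err = r + gamma * q(s_next, a_next) - q(s, a)
            ebae.append(bellman_err.abs().mean())
    return int(torch.argmax(torch.stack(ebae)))     # c = argmax_q EBAE_q
```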
4-8) Use the c-th critic network to obtain, through a greedy strategy, the action vector of each experience sample at the next time step;
4-9) Compute the target value of the c-th critic network by multiple quasi-Q learning;
4-10) Compute the loss function L(w_c) of the c-th critic network;
4-11) Update the weight parameters of the c-th critic network using the derivative of the loss function L(w_c) with respect to the weight parameters w_c, i.e. w_c ← w_c − α_ω ∂L(w_c)/∂w_c;
The weight parameters of the other critic networks remain unchanged;
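A sketch of the critic update of steps 4-8) to 4-11) follows. The greedy next-action selection and the multi quasi-Q-learning target are only given above in words, so the code assumes that critic c rates the n policy proposals and keeps the best one, and that the target value averages the remaining critics at that action; the mean-squared loss and the single gradient step on critic c follow the description directly.

```python
import torch

def update_worst_critic(critics, c, policy_nets, batch, critic_optimizers, gamma=0.99):
    """Steps 4-8) to 4-11): build a target value y, form the loss L(w_c) and take one
    gradient step on the worst-performing critic c only.
    Assumptions (the patent gives the formulas only as images): critic c greedily picks,
    among the n policy proposals, the highest-valued next action; the quasi-Q-learning
    target averages the values of the other critics at that action; the loss is the MSE."""
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    r = r.unsqueeze(-1)
    with torch.no_grad():
        proposals = torch.stack([net(s_next) for net in policy_nets])      # (n, B, 2)
        q_next = torch.stack([critics[c](s_next, p) for p in proposals])   # (n, B, 1)
        best = q_next.argmax(dim=0).squeeze(-1)                            # (B,)
        a_next = proposals[best, torch.arange(proposals.shape[1])]         # (B, 2)
        others = [q for i, q in enumerate(critics) if i != c] or [critics[c]]
        y = r + gamma * torch.stack([q(s_next, a_next) for q in others]).mean(dim=0)
    loss = torch.mean((y - critics[c](s, a)) ** 2)                         # L(w_c)
    critic_optimizers[c].zero_grad()
    loss.backward()
    critic_optimizers[c].step()               # w_c <- w_c - alpha_w * dL/dw_c
    return loss.item()
```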
4-12) Randomly select one policy network from the n policy networks to reset the d-th policy network;
4-13) Based on the updated c-th critic network, compute the deterministic policy gradient of the d-th policy network, ∇_{θ_d} J ≈ E[ ∇_a Q(s, a | w_c)|_{a=μ(s|θ_d)} · ∇_{θ_d} μ(s | θ_d) ], and use it to update the weight parameters θ_d of the d-th policy network as θ_d ← θ_d + α_θ ∇_{θ_d} J;
The weight parameters of the other policy networks remain unchanged.
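A sketch of the policy update of steps 4-12) and 4-13), assuming the standard deterministic policy gradient is applied through the freshly updated critic c to one randomly chosen policy network:

```python
import random
import torch

def update_one_policy(policy_nets, critics, c, batch, policy_optimizers):
    """Steps 4-12) and 4-13): re-draw the index d at random and update only that policy
    network with the deterministic policy gradient through the updated critic c."""
    d = random.randrange(len(policy_nets))
    s = torch.as_tensor([e[0] for e in batch], dtype=torch.float32)
    # maximizing Q_c(s, mu_d(s)); autograd yields grad_a Q * grad_theta mu automatically
    loss = -critics[c](s, policy_nets[d](s)).mean()
    policy_optimizers[d].zero_grad()
    loss.backward()
    policy_optimizers[d].step()               # theta_d <- theta_d + alpha_theta * DPG
    return d
```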
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15).
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16).
4-16) The iterations end and the training process of the hybrid policy-critic network is terminated; the outputs of the n policy networks at termination are combined through the formula in step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized by this target policy.
Validation of the embodiment of the present invention
The performance of the AUV trajectory tracking control method based on deep reinforcement learning proposed by the present invention (hereinafter MPQ-DPG) is analyzed as follows. All comparison experiments are based on the widely used REMUS autonomous underwater vehicle, whose maximum propeller thrust and maximum rudder angle are 86 N and 0.24 rad, respectively, and a fixed reference trajectory is adopted for all experiments.
In addition, in this embodiment of the present invention, the number of critic networks m is equal to the number of policy networks n, and both are denoted n below.
1) Comparison of MPQ-DPG with the existing DDPG method
Fig. 1 compares the proposed deep-reinforcement-learning AUV trajectory tracking control method (MPQ-DPG) with the existing DDPG method in terms of the learning curves during training and the trajectory tracking results; the learning curves in panel (a) are obtained from five independent experiments, and Ref in panel (b) denotes the reference trajectory.
Analyzing Fig. 1 leads to the following conclusions:
a) Compared with DDPG, MPQ-DPG has better learning stability, because MPQ-DPG uses multiple critic and policy networks, which reduces the influence of poor samples on learning stability.
b) The average cumulative reward to which MPQ-DPG finally converges is clearly higher than that of DDPG, which shows that the tracking control accuracy of MPQ-DPG is clearly higher than that of DDPG.
c) As can be observed from Fig. 1(b), the tracking trajectory obtained by MPQ-DPG almost coincides with the reference trajectory, showing that MPQ-DPG can achieve high-precision AUV tracking control.
d) As the numbers of policy and critic networks increase, the tracking control accuracy of MPQ-DPG gradually improves, but the improvement is no longer significant for n > 4.
2) Comparison of MPQ-DPG with the existing neural network PID method
Fig. 2 compares the MPQ-DPG method proposed by the present invention for trajectory tracking control of an underwater unmanned vehicle with the neural network PID method in terms of the coordinate trajectory tracking curves and the coordinate tracking errors; in the figure, Ref denotes the reference coordinate trajectory, PIDNN denotes the neural network PID algorithm, and n = 4.
Analysis of Fig. 2 shows that the tracking performance of the neural network PID control method is clearly inferior to that of the MPQ-DPG method proposed by the present invention. Moreover, the tracking errors in Fig. 2(b) show that MPQ-DPG achieves faster error convergence; in particular, in the initial phase MPQ-DPG still achieves fast, high-precision tracking, whereas the response time of the neural network PID method is clearly longer than that of MPQ-DPG and the convergence of its tracking error is poorer.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited to it; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.
| CN116449856A (en)* | 2022-01-06 | 2023-07-18 | 中国科学院声学研究所 | Underwater vehicle attitude control system and method based on reinforcement learning compensator |
| CN116578102A (en)* | 2023-07-13 | 2023-08-11 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116827685A (en)* | 2023-08-28 | 2023-09-29 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN117826860A (en)* | 2024-03-04 | 2024-04-05 | 北京航空航天大学 | A method for determining the control strategy of fixed-wing UAV based on reinforcement learning |
| CN119260750A (en)* | 2024-12-09 | 2025-01-07 | 北京配天技术有限公司 | Method and electronic device for realizing robot imitation learning trajectory |
| CN119558236A (en)* | 2025-02-05 | 2025-03-04 | 天津清润博智能科技有限公司 | A numerical simulation method and system for AUV oblique navigation based on turbine machinery |
| CN119927927A (en)* | 2025-04-07 | 2025-05-06 | 中移(杭州)信息技术有限公司 | Robot following method, device, equipment, storage medium and program product |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120188365A1 (en)* | 2009-07-20 | 2012-07-26 | Precitec Kg | Laser processing head and method for compensating for the change in focus position in a laser processing head |
| KR101545731B1 (en)* | 2014-04-30 | 2015-08-20 | 인하대학교 산학협력단 | System and method for video tracking |
| CN107065881A (en)* | 2017-05-17 | 2017-08-18 | 清华大学 | Robot global path planning method based on deep reinforcement learning |
| CN107102644A (en)* | 2017-06-22 | 2017-08-29 | 华南师范大学 | Underwater robot trajectory control method and control system based on deep reinforcement learning |
| CN107368076A (en)* | 2017-07-31 | 2017-11-21 | 中南大学 | Deep learning control and planning method for robot motion paths in an intelligent environment |
| CN107856035A (en)* | 2017-11-06 | 2018-03-30 | 深圳市唯特视科技有限公司 | Robust dynamic motion method based on reinforcement learning and a whole-body controller |
| Title |
|---|
| LI ZHOU et al.: "AUV Based Source Seeking with Estimated Gradients", Journal of Systems Science & Complexity* |
| RUNSHENG YU et al.: "Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle", Proceedings of the 36th Chinese Control Conference* |
| DUAN YONG et al.: "Evolutionary Reinforcement Learning and Its Application in Robot Path Tracking", Control and Decision* |
| MA QIONGXIONG et al.: "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition)* |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109361700A (en)* | 2018-12-06 | 2019-02-19 | 郑州航空工业管理学院 | A protocol framework for self-organizing network adaptive data transmission for unmanned aerial vehicles |
| US11669097B2 (en) | 2018-12-18 | 2023-06-06 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
| TWI706238B (en)* | 2018-12-18 | 2020-10-01 | 大陸商北京航跡科技有限公司 | Systems and methods for autonomous driving |
| US10955853B2 (en) | 2018-12-18 | 2021-03-23 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
| CN109719721A (en)* | 2018-12-26 | 2019-05-07 | 北京化工大学 | A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait |
| CN109719721B (en)* | 2018-12-26 | 2020-07-24 | 北京化工大学 | A method for autonomous emergence of adaptive gait of a snake-like search and rescue robot |
| CN109726866A (en)* | 2018-12-27 | 2019-05-07 | 浙江农林大学 | Path planning method for unmanned ship based on Q-learning neural network |
| CN109696830A (en)* | 2019-01-31 | 2019-04-30 | 天津大学 | The reinforcement learning adaptive control method of small-sized depopulated helicopter |
| CN109696830B (en)* | 2019-01-31 | 2021-12-03 | 天津大学 | Reinforced learning self-adaptive control method of small unmanned helicopter |
| CN109960259B (en)* | 2019-02-15 | 2021-09-24 | 青岛大学 | A Gradient Potential-Based Multi-Agent Reinforcement Learning Path Planning Method for Unmanned Guided Vehicles |
| CN109960259A (en)* | 2019-02-15 | 2019-07-02 | 青岛大学 | A Gradient Potential-Based Multi-Agent Reinforcement Learning Path Planning Method for Unmanned Guided Vehicles |
| CN109828463A (en)* | 2019-02-18 | 2019-05-31 | 哈尔滨工程大学 | A kind of adaptive wave glider bow of ocean current interference is to control method |
| CN109828467A (en)* | 2019-03-01 | 2019-05-31 | 大连海事大学 | Data-driven unmanned ship reinforcement learning controller structure and design method |
| CN109765916A (en)* | 2019-03-26 | 2019-05-17 | 武汉欣海远航科技研发有限公司 | A kind of unmanned surface vehicle path following control device design method |
| CN109870162A (en)* | 2019-04-04 | 2019-06-11 | 北京航空航天大学 | A UAV flight path planning method based on competitive deep learning network |
| US12287768B2 (en)* | 2019-04-11 | 2025-04-29 | Tencent Technology (Shenzhen) Company Limited | Database performance tuning method, apparatus, and system, device, and storage medium |
| US20210286786A1 (en)* | 2019-04-11 | 2021-09-16 | Tencent Technology (Shenzhen) Company Limited | Database performance tuning method, apparatus, and system, device, and storage medium |
| CN110083064B (en)* | 2019-04-29 | 2022-02-15 | 辽宁石油化工大学 | Network optimal tracking control method based on non-strategy Q-learning |
| CN110083064A (en)* | 2019-04-29 | 2019-08-02 | 辽宁石油化工大学 | A kind of network optimal track control method based on non-strategy Q- study |
| CN110045614A (en)* | 2019-05-16 | 2019-07-23 | 河海大学常州校区 | A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning |
| CN110428615B (en)* | 2019-07-12 | 2021-06-22 | 中国科学院自动化研究所 | Single intersection traffic signal control method, system and device based on deep reinforcement learning |
| CN110428615A (en)* | 2019-07-12 | 2019-11-08 | 中国科学院自动化研究所 | Learn isolated intersection traffic signal control method, system, device based on deeply |
| CN110362089A (en)* | 2019-08-02 | 2019-10-22 | 大连海事大学 | Unmanned ship autonomous navigation method based on deep reinforcement learning and genetic algorithm |
| CN110321666A (en)* | 2019-08-09 | 2019-10-11 | 重庆理工大学 | Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm |
| CN110321666B (en)* | 2019-08-09 | 2022-05-03 | 重庆理工大学 | Multi-robot path planning method based on prior knowledge and DQN algorithm |
| CN110333739A (en)* | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A Reinforcement Learning-Based AUV Behavior Planning and Action Control Method |
| CN110333739B (en)* | 2019-08-21 | 2020-07-31 | 哈尔滨工程大学 | AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning |
| CN110806756A (en)* | 2019-09-10 | 2020-02-18 | 西北工业大学 | Unmanned aerial vehicle autonomous guidance control method based on DDPG |
| CN110806756B (en)* | 2019-09-10 | 2022-08-02 | 西北工业大学 | Autonomous guidance and control method of UAV based on DDPG |
| CN110716574B (en)* | 2019-09-29 | 2023-05-02 | 哈尔滨工程大学 | A Real-time Collision Avoidance Planning Method for UUV Based on Deep Q-Network |
| CN110716574A (en)* | 2019-09-29 | 2020-01-21 | 哈尔滨工程大学 | A real-time collision avoidance planning method for UUV based on deep Q network |
| CN110673602B (en)* | 2019-10-24 | 2022-11-25 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
| CN110673602A (en)* | 2019-10-24 | 2020-01-10 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
| CN110806759A (en)* | 2019-11-12 | 2020-02-18 | 清华大学 | Aircraft route tracking method based on deep reinforcement learning |
| CN110989576B (en)* | 2019-11-14 | 2022-07-12 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
| CN110989576A (en)* | 2019-11-14 | 2020-04-10 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
| CN111027677A (en)* | 2019-12-02 | 2020-04-17 | 西安电子科技大学 | Multi-maneuvering-target tracking method based on depth certainty strategy gradient DDPG |
| CN111091710A (en)* | 2019-12-18 | 2020-05-01 | 上海天壤智能科技有限公司 | Traffic signal control method, system and medium |
| US11747155B2 (en) | 2019-12-31 | 2023-09-05 | Goertek Inc. | Global path planning method and device for an unmanned vehicle |
| CN111061277A (en)* | 2019-12-31 | 2020-04-24 | 歌尔股份有限公司 | Unmanned vehicle global path planning method and device |
| CN111310384A (en)* | 2020-01-16 | 2020-06-19 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
| CN111310384B (en)* | 2020-01-16 | 2024-05-21 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
| CN111240345A (en)* | 2020-02-11 | 2020-06-05 | 哈尔滨工程大学 | A Trajectory Tracking Method of Underwater Robot Based on Double BP Network Reinforcement Learning Framework |
| CN111240345B (en)* | 2020-02-11 | 2023-04-07 | 哈尔滨工程大学 | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework |
| CN111580544A (en)* | 2020-03-25 | 2020-08-25 | 北京航空航天大学 | Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm |
| CN111736617A (en)* | 2020-06-09 | 2020-10-02 | 哈尔滨工程大学 | A speed observer-based trajectory tracking control method for benthic underwater robot with preset performance |
| CN111813143A (en)* | 2020-06-09 | 2020-10-23 | 天津大学 | An intelligent control system and method for underwater glider based on reinforcement learning |
| CN111813143B (en)* | 2020-06-09 | 2022-04-19 | 天津大学 | Underwater glider intelligent control system and method based on reinforcement learning |
| CN111736617B (en)* | 2020-06-09 | 2022-11-04 | 哈尔滨工程大学 | Track tracking control method for preset performance of benthonic underwater robot based on speed observer |
| CN111856936B (en)* | 2020-07-21 | 2023-06-02 | 天津蓝鳍海洋工程有限公司 | Control method for cabled underwater high-flexibility operation platform |
| CN111856936A (en)* | 2020-07-21 | 2020-10-30 | 天津蓝鳍海洋工程有限公司 | Control method for underwater high-flexibility operation platform with cable |
| CN112100834A (en)* | 2020-09-06 | 2020-12-18 | 西北工业大学 | Underwater glider attitude control method based on deep reinforcement learning |
| CN112132263A (en)* | 2020-09-11 | 2020-12-25 | 大连理工大学 | A Multi-agent Autonomous Navigation Method Based on Reinforcement Learning |
| CN112162555B (en)* | 2020-09-23 | 2021-07-16 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in mixed fleet |
| CN112162555A (en)* | 2020-09-23 | 2021-01-01 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in mixed fleet |
| CN112148025A (en)* | 2020-09-24 | 2020-12-29 | 东南大学 | Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning |
| CN112179367A (en)* | 2020-09-25 | 2021-01-05 | 广东海洋大学 | An autonomous navigation method for agents based on deep reinforcement learning |
| CN112179367B (en)* | 2020-09-25 | 2023-07-04 | 广东海洋大学 | A method for autonomous navigation of agents based on deep reinforcement learning |
| CN112241176B (en)* | 2020-10-16 | 2022-10-28 | 哈尔滨工程大学 | Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment |
| CN112241176A (en)* | 2020-10-16 | 2021-01-19 | 哈尔滨工程大学 | A path planning and obstacle avoidance control method for an underwater autonomous vehicle in a large-scale continuous obstacle environment |
| CN112558465A (en)* | 2020-12-03 | 2021-03-26 | 大连海事大学 | Unknown unmanned ship finite time reinforcement learning control method with input limitation |
| CN112506210B (en)* | 2020-12-04 | 2022-12-27 | 东南大学 | Unmanned aerial vehicle control method for autonomous target tracking |
| CN112506210A (en)* | 2020-12-04 | 2021-03-16 | 东南大学 | Unmanned aerial vehicle control method for autonomous target tracking |
| CN112462792A (en)* | 2020-12-09 | 2021-03-09 | 哈尔滨工程大学 | Underwater robot motion control method based on Actor-Critic algorithm |
| CN112698572A (en)* | 2020-12-22 | 2021-04-23 | 西安交通大学 | Structural vibration control method, medium and equipment based on reinforcement learning |
| CN112929900B (en)* | 2021-01-21 | 2022-08-02 | 华侨大学 | MAC Protocol for Interference Alignment in Time Domain Based on Deep Reinforcement Learning in Underwater Acoustic Networks |
| CN112929900A (en)* | 2021-01-21 | 2021-06-08 | 华侨大学 | MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network |
| CN113029123A (en)* | 2021-03-02 | 2021-06-25 | 西北工业大学 | Multi-AUV collaborative navigation method based on reinforcement learning |
| CN113052372A (en)* | 2021-03-17 | 2021-06-29 | 哈尔滨工程大学 | Dynamic AUV tracking path planning method based on deep reinforcement learning |
| CN113052372B (en)* | 2021-03-17 | 2022-08-02 | 哈尔滨工程大学 | Dynamic AUV tracking path planning method based on deep reinforcement learning |
| CN113095463A (en)* | 2021-03-31 | 2021-07-09 | 南开大学 | Robot confrontation method based on evolution reinforcement learning |
| CN113095500B (en)* | 2021-03-31 | 2023-04-07 | 南开大学 | Robot tracking method based on multi-agent reinforcement learning |
| CN113095500A (en)* | 2021-03-31 | 2021-07-09 | 南开大学 | Robot tracking method based on multi-agent reinforcement learning |
| CN113128702B (en)* | 2021-04-15 | 2024-11-19 | 杭州电子科技大学 | A neural network adaptive distributed parallel training method based on reinforcement learning |
| CN113128702A (en)* | 2021-04-15 | 2021-07-16 | 杭州电子科技大学 | Neural network self-adaptive distributed parallel training method based on reinforcement learning |
| CN113370205A (en)* | 2021-05-08 | 2021-09-10 | 浙江工业大学 | Baxter mechanical arm track tracking control method based on machine learning |
| CN113370205B (en)* | 2021-05-08 | 2022-06-17 | 浙江工业大学 | Baxter mechanical arm track tracking control method based on machine learning |
| CN113359448A (en)* | 2021-06-03 | 2021-09-07 | 清华大学 | Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics |
| CN113595768A (en)* | 2021-07-07 | 2021-11-02 | 西安电子科技大学 | Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system |
| CN113467248A (en)* | 2021-07-22 | 2021-10-01 | 南京大学 | Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning |
| WO2023019536A1 (en)* | 2021-08-20 | 2023-02-23 | 上海电气电站设备有限公司 | Deep reinforcement learning-based photovoltaic module intelligent sun tracking method |
| CN113821035A (en)* | 2021-09-22 | 2021-12-21 | 北京邮电大学 | Unmanned ship trajectory tracking control method and device |
| CN113829351B (en)* | 2021-10-13 | 2023-08-01 | 广西大学 | A Cooperative Control Method of Mobile Manipulator Based on Reinforcement Learning |
| CN113829351A (en)* | 2021-10-13 | 2021-12-24 | 广西大学 | Collaborative control method of mobile mechanical arm based on reinforcement learning |
| CN113885330B (en)* | 2021-10-26 | 2022-06-17 | 哈尔滨工业大学 | Information physical system safety control method based on deep reinforcement learning |
| CN113885330A (en)* | 2021-10-26 | 2022-01-04 | 哈尔滨工业大学 | A security control method for cyber-physical systems based on deep reinforcement learning |
| CN114089633A (en)* | 2021-11-19 | 2022-02-25 | 江苏科技大学 | A multi-motor coupling drive control device and method for an underwater robot |
| CN114089633B (en)* | 2021-11-19 | 2024-04-26 | 江苏科技大学 | A multi-motor coupling drive control device and method for underwater robot |
| CN114020001A (en)* | 2021-12-17 | 2022-02-08 | 中国科学院国家空间科学中心 | Intelligent control method of Mars UAV based on deep deterministic policy gradient learning |
| CN114357884A (en)* | 2022-01-05 | 2022-04-15 | 厦门宇昊软件有限公司 | Reaction temperature control method and system based on deep reinforcement learning |
| CN116449856A (en)* | 2022-01-06 | 2023-07-18 | 中国科学院声学研究所 | Underwater vehicle attitude control system and method based on reinforcement learning compensator |
| CN114527642B (en)* | 2022-03-03 | 2024-04-02 | 东北大学 | A method for automatically adjusting PID parameters of AGV based on deep reinforcement learning |
| CN114527642A (en)* | 2022-03-03 | 2022-05-24 | 东北大学 | AGV automatic PID parameter adjusting method based on deep reinforcement learning |
| CN114721408A (en)* | 2022-04-18 | 2022-07-08 | 哈尔滨理工大学 | A Reinforcement Learning-Based Path Tracking Method for Underwater Robots |
| CN114954840B (en)* | 2022-05-30 | 2023-09-05 | 武汉理工大学 | Method, system and device for controlling stability of ship |
| CN114954840A (en)* | 2022-05-30 | 2022-08-30 | 武汉理工大学 | Stability changing control method, system and device for stability changing ship and storage medium |
| CN114995137B (en)* | 2022-06-01 | 2023-04-28 | 哈尔滨工业大学 | Control method of rope-driven parallel robot based on deep reinforcement learning |
| CN114995137A (en)* | 2022-06-01 | 2022-09-02 | 哈尔滨工业大学 | Rope-driven parallel robot control method based on deep reinforcement learning |
| CN114967472B (en)* | 2022-06-17 | 2025-04-18 | 南京太司德智能科技有限公司 | A deep deterministic policy gradient control method for UAV trajectory tracking and state compensation |
| CN114967472A (en)* | 2022-06-17 | 2022-08-30 | 南京航空航天大学 | A UAV Trajectory Tracking State Compensation Depth Deterministic Policy Gradient Control Method |
| CN115016496B (en)* | 2022-06-30 | 2024-11-22 | 重庆大学 | Path tracking method of unmanned surface vehicle based on deep reinforcement learning |
| CN115016496A (en)* | 2022-06-30 | 2022-09-06 | 重庆大学 | Path tracking method of surface unmanned vehicle based on deep reinforcement learning |
| CN114839884B (en)* | 2022-07-05 | 2022-09-30 | 山东大学 | Underwater vehicle bottom layer control method and system based on deep reinforcement learning |
| CN114839884A (en)* | 2022-07-05 | 2022-08-02 | 山东大学 | Underwater vehicle bottom layer control method and system based on deep reinforcement learning |
| CN114967713A (en)* | 2022-07-28 | 2022-08-30 | 山东大学 | Control method of underwater vehicle under discrete change of buoyancy based on reinforcement learning |
| CN114967713B (en)* | 2022-07-28 | 2022-11-29 | 山东大学 | Underwater vehicle buoyancy discrete change control method based on reinforcement learning |
| CN115366099A (en)* | 2022-08-18 | 2022-11-22 | 江苏科技大学 | Mechanical arm depth certainty strategy gradient training method based on forward kinematics |
| CN115366099B (en)* | 2022-08-18 | 2024-05-28 | 江苏科技大学 | Deep deterministic policy gradient training method for robotic arms based on forward kinematics |
| CN115330276B (en)* | 2022-10-13 | 2023-01-06 | 北京云迹科技股份有限公司 | Method and device for robot to automatically select elevator based on reinforcement learning |
| CN115657477A (en)* | 2022-10-13 | 2023-01-31 | 北京理工大学 | An Adaptive Control Method for Robots in Dynamic Environment Based on Offline Reinforcement Learning |
| CN115330276A (en)* | 2022-10-13 | 2022-11-11 | 北京云迹科技股份有限公司 | Method and device for robot to automatically select elevator based on reinforcement learning |
| CN115562345A (en)* | 2022-10-28 | 2023-01-03 | 北京理工大学 | Unmanned aerial vehicle detection track planning method based on deep reinforcement learning |
| CN115657683A (en)* | 2022-11-14 | 2023-01-31 | 中国电子科技集团公司第十研究所 | A real-time obstacle avoidance method for unmanned untethered submersibles that can be used for inspection tasks |
| CN115857556A (en)* | 2023-01-30 | 2023-03-28 | 中国人民解放军96901部队 | Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning |
| CN115826594A (en)* | 2023-02-23 | 2023-03-21 | 北京航空航天大学 | Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters |
| CN115855226B (en)* | 2023-02-24 | 2023-05-30 | 青岛科技大学 | Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion |
| CN115855226A (en)* | 2023-02-24 | 2023-03-28 | 青岛科技大学 | Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion |
| CN116295449B (en)* | 2023-05-25 | 2023-09-12 | 吉林大学 | Underwater autonomous vehicle path indication method and device |
| CN116295449A (en)* | 2023-05-25 | 2023-06-23 | 吉林大学 | Path indication method and device for underwater autonomous vehicle |
| CN116578102B (en)* | 2023-07-13 | 2023-09-19 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116578102A (en)* | 2023-07-13 | 2023-08-11 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
| CN116827685B (en)* | 2023-08-28 | 2023-11-14 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN116827685A (en)* | 2023-08-28 | 2023-09-29 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
| CN117826860B (en)* | 2024-03-04 | 2024-06-21 | 北京航空航天大学 | Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning |
| CN117826860A (en)* | 2024-03-04 | 2024-04-05 | 北京航空航天大学 | A method for determining the control strategy of fixed-wing UAV based on reinforcement learning |
| CN119260750A (en)* | 2024-12-09 | 2025-01-07 | 北京配天技术有限公司 | Method and electronic device for realizing robot imitation learning trajectory |
| CN119260750B (en)* | 2024-12-09 | 2025-02-18 | 北京配天技术有限公司 | Method for realizing imitation of learning track by robot and electronic equipment |
| CN119558236A (en)* | 2025-02-05 | 2025-03-04 | 天津清润博智能科技有限公司 | A numerical simulation method and system for AUV oblique navigation based on turbine machinery |
| CN119927927A (en)* | 2025-04-07 | 2025-05-06 | 中移(杭州)信息技术有限公司 | Robot following method, device, equipment, storage medium and program product |
| CN119927927B (en)* | 2025-04-07 | 2025-07-15 | 中移(杭州)信息技术有限公司 | Method, apparatus, device, storage medium and program product for robot following |
| Publication number | Publication date |
|---|---|
| CN108803321B (en) | 2020-07-10 |
| Publication | Title | Publication Date |
|---|---|---|
| CN108803321B (en) | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning | |
| CN107748566B (en) | Underwater autonomous robot fixed depth control method based on reinforcement learning | |
| CN115016496B (en) | Path tracking method of unmanned surface vehicle based on deep reinforcement learning | |
| CN112198870B (en) | Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN | |
| Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
| CN114625151B (en) | Underwater robot obstacle avoidance path planning method based on reinforcement learning | |
| Song et al. | Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning | |
| CN113534668B (en) | AUV motion planning method based on maximum entropy actor-critic framework | |
| CN115793455B (en) | Trajectory tracking control method of unmanned boat based on Actor-Critic-Advantage network | |
| CN111240345A (en) | A Trajectory Tracking Method of Underwater Robot Based on Double BP Network Reinforcement Learning Framework | |
| CN111240344B (en) | Autonomous underwater robot model-free control method based on reinforcement learning technology | |
| CN113359448A (en) | Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics | |
| Mousavian et al. | Identification-based robust motion control of an AUV: optimized by particle swarm optimization algorithm | |
| CN114740873A (en) | Path planning method of autonomous underwater robot based on multi-target improved particle swarm algorithm | |
| CN119088044A (en) | Autonomous navigation control method and system for unmanned ships based on artificial intelligence | |
| CN117606490B (en) | A collaborative search path planning method for underwater autonomous vehicles | |
| CN117908565A (en) | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning | |
| CN116700327A (en) | Unmanned aerial vehicle track planning method based on continuous action dominant function learning | |
| CN115562345A (en) | Unmanned aerial vehicle detection track planning method based on deep reinforcement learning | |
| CN115903474A (en) | Unmanned ship automatic berthing control method based on reinforcement learning | |
| Song et al. | Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning | |
| CN118363379A (en) | Unmanned ship dynamic positioning control method based on deep reinforcement learning | |
| CN117215308A (en) | Novel underactuated small-sized water surface unmanned ship guidance control platform | |
| Huang et al. | The USV path planning of dueling DQN algorithm based on tree sampling mechanism | |
| CN113959446B (en) | Autonomous logistics transportation navigation method for robot based on neural network |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |