Disclosure of Invention
(I) Objects of the invention
Aiming at the problem that marine environment elements are not considered in many of the unmanned ship path planning methods proposed at present, the invention provides an unmanned ship path planning method based on deep reinforcement learning that takes marine environment elements into account; the method fully considers real marine environment elements and marine obstacles and combines them with a deep reinforcement learning method to plan a safe and efficient driving path for the unmanned ship.
(II) technical scheme
In order to achieve the purpose, the technical scheme of the invention is as follows: an unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements comprises the following specific steps:
(1) The wind speed, flow speed and wave height data of the target sea area at time t are interpolated onto a 200 m × 200 m grid, and s_t is used to describe the characteristic state vector of the unmanned ship at time t, namely s_t = [wind speed, wave height, flow speed, obstacle distance], where the first three components respectively represent the wind speed, wave height and flow speed at the position of the unmanned ship at time t, the fourth component is the distance between the unmanned ship and the obstacle at time t, and a value of NAN for this distance indicates that the unmanned ship has not detected an obstacle;
(2) The capability of the unmanned ship to resist wind, wave and current is evaluated with a Bayesian network; the material, displacement, length, width and height of the unmanned ship are input, and the outputs are three parameters representing the maximum wind speed, wave height and flow speed the unmanned ship can bear, which are used to calculate the reward function;
(3) Initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks (as target Q network and actual Q network, respectively), a reward function model, a model experience pool, and an action output set.
(4) The three attributes of coordinates, course and speed of the real AIS data of the target sea area are retained; the three marine environment element values and the obstacle information are superimposed onto the AIS data according to time and point position; the new AIS data are used as training samples and put into the deep reinforcement learning model for training, obtaining an optimized experience pool and preliminary network parameters;
(5) The start coordinate and end coordinate of the unmanned ship's voyage are set, and the state feature vector s_t of the unmanned ship at time t is obtained and input into the actual Q network and the reward function model respectively;
wherein: the actual Q network calculates Q_actual and selects the action corresponding to Q_actual according to an ε-greedy strategy as the output; the reward function model calculates the reward value R_t of the current iteration; the target Q network randomly extracts n records from the experience pool and, together with R_t, calculates Q_target; Q_actual and Q_target are used together to calculate the loss function, the network parameters of the actual Q network are updated by gradient descent, and when the number of iterations reaches a threshold α, all parameters of the actual Q network are copied to the target Q network;
(6) Each motion of the unmanned ship lasts 15 seconds; when the accumulated motion time reaches 1 h, the wind speed, ocean current, wave height and obstacle information of the sea area are updated to the current time;
(7) When the unmanned ship reaches the target point, the iteration ends and the safe path is output.
Specifically, the Bayesian network construction method in step (2) includes the following steps:
(2.1) The nodes of the unmanned ship evaluation Bayesian network include: the material, displacement, length, width and height as bottom-layer nodes, and the wind-resistance level, wave-resistance level and flow-resistance level as upper-layer nodes; the bottom-layer nodes are fully connected with the upper-layer nodes;
(2.2) training a Bayesian network by taking the unmanned ship structure data as a sample to obtain a conditional probability table of each node;
(2.3) inputting the unmanned ship information to be evaluated, including: material, displacement, length, width and height, calculating the probability of each grade of the three upper-layer nodes according to the conditional probability table, and outputting the maximum-probability grade as the final value;
(2.4) The wind speed, wave height and flow speed grades of the unmanned ship obtained through the Bayesian network are mapped, according to the corresponding sea-state grades, to specific numerical values that are used as the maximum wind speed, wave height and flow speed the unmanned ship can bear.
Specifically, in the reward function model described in step (3), the reward value R_t is calculated as:
R_t = softmax(|(θ_safe − s_t)·w_1|) · ((θ_safe − s_t)·w_2)^T
where s_t is the characteristic state vector of the unmanned ship at time t; θ_safe is the safety threshold vector containing four parameters: the maximum wind speed, wave height and flow speed the unmanned ship can bear, obtained in step (2), and the obstacle sensing range of the unmanned ship; w_1 is the attention matrix of the reward function, a 4 × 4 upper-triangular constant matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree to which wind speed, wave height, ocean current and obstacles affect the path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; the matrix serves to bring marine environment element values of different orders of magnitude to the same order of magnitude for comparison and can highlight the key elements; w_2 is a 4 × 4 diagonal matrix that, combined with the (θ_safe − s_t) term, gives the final reward value R_t its positive or negative sign and enlarges the reward value to facilitate decision making;
The softmax(|(θ_safe − s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for assigning a weight to each element value; these weights highlight the elements most important to the decision in each iteration, and the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. When the unmanned ship senses no obstacle, the reward function guides the model to avoid areas of high wind and waves, and as soon as an obstacle is sensed it prompts a collision-avoidance action.
(III) advantageous effects
The advantages of the invention are as follows:
1. The wind speed, wave height, ocean current speed and obstacle information are jointly used as the main references for planning the unmanned ship path, which makes the planned path more feasible; during the calculation, these data are updated according to the running time of the unmanned ship, which ensures the reliability of the path planning result.
2. The designed reward function highlights the elements most important to the decision in each iteration while taking into account the ship's detection capability and its capacity to withstand wind and wave impact; it gives rewards in safe areas, gives appropriate penalties in dangerous areas, and makes an avoidance decision as soon as an obstacle is detected, which improves the path planning efficiency of the method and optimizes the path planning result.
3. A method of evaluating the wind and wave resistance of an unmanned ship with a Bayesian network is provided, replacing the conventional practice of assigning the wind and wave resistance grade from expert experience; it is more scientific and efficient.
Detailed Description
The invention will now be described more fully and clearly with reference to the accompanying drawings and examples:
FIG. 1 is a flow chart of a method for planning a path of an unmanned ship based on deep reinforcement learning and considering elements of a marine environment, wherein the method gives a reasonable solution for safely completing a navigation task of the unmanned ship by fully considering the material and structure of the unmanned ship and possible strong wind, strong waves, ocean currents and obstacles in a sea area; the method mainly comprises two modules, wherein the first module is a Bayesian network evaluation module used for evaluating the wind wave resistance of the unmanned ship, and the second module is a deep reinforcement learning route planning module considering marine environment elements; the method utilizes a reward function of deep reinforcement learning to couple the two modules, so that the unmanned ship can make a proper risk avoiding decision according to the self material and structure. The unmanned ship planning method is suitable for planning the path of the unmanned ship for executing the long-range mission.
Specifically, the method comprises the following steps:
(1) The wind speed, flow speed and wave height data of the target sea area at time t, together with the forecast wind speed, ocean current and wave height data for the period after time t, are jointly interpolated onto a 200 m × 200 m grid using the kriging interpolation method; the 200 m × 200 m grid size is chosen so that the unmanned ship acquires a new element value after at most three actions. The data are stored in a three-dimensional array whose three dimensions are longitude, latitude and time, with a data time interval of 1 h. At time t, s_t is used to describe the characteristic state vector of the unmanned ship, namely s_t = [wind speed, wave height, flow speed, obstacle distance], where the first three components respectively represent the wind speed, wave height and flow speed at the position of the unmanned ship at time t, the fourth component is the distance between the unmanned ship and the obstacle at time t, and a value of NAN for this distance indicates that the unmanned ship has not detected an obstacle;
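For illustration only, the following minimal Python sketch shows how one environment field could be kriged onto such a grid and how a state vector of the above form could be assembled; the use of the pykrige library, the sample coordinates, the degree-to-metre conversion and all numerical values are assumptions made for the example and are not part of the invention.

# Minimal sketch: kriging one field (e.g. wave height) onto a fine grid and
# assembling the state vector s_t. Library choice (pykrige) and all sample
# values are assumptions for illustration.
import numpy as np
from pykrige.ok import OrdinaryKriging

# scattered observations: longitude, latitude, wave height (illustrative)
lon = np.array([120.01, 120.05, 120.09, 120.12])
lat = np.array([30.02, 30.06, 30.04, 30.10])
wave = np.array([0.8, 1.1, 0.9, 1.4])

# 200 m is roughly 0.002 degrees of longitude at this latitude (assumption)
grid_lon = np.arange(120.0, 120.15, 0.002)
grid_lat = np.arange(30.0, 30.12, 0.002)

ok = OrdinaryKriging(lon, lat, wave, variogram_model="linear")
wave_grid, _ = ok.execute("grid", grid_lon, grid_lat)   # (n_lat, n_lon) field

# state vector at time t: [wind speed, wave height, flow speed, obstacle distance]
# NAN in the last slot means "no obstacle detected"
s_t = np.array([6.2, float(wave_grid[10, 20]), 0.4, np.nan])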
(2) The capability of the unmanned ship to resist wind, wave and current is evaluated with the Bayesian network; the material, displacement, length, width and height of the unmanned ship are input, and the outputs are three parameters representing the maximum wind speed, wave height and flow speed the unmanned ship can bear, which are used to calculate the reward function. The specific steps for constructing the Bayesian network are as follows:
(2.1) As shown in FIG. 2, a schematic diagram of the Bayesian network for evaluating the capability of the unmanned ship to resist wind, wave and current, the nodes comprise: the material, displacement, length, width and height as bottom-layer nodes, and the wind-resistance level, wave-resistance level and flow-resistance level as upper-layer nodes; the bottom-layer nodes are fully connected with the upper-layer nodes;
(2.2) The Bayesian network is trained with the unmanned ship structure information as samples; the data must first be discretized. The unmanned ship structure data are shown in the following table:
TABLE 1
The data are put into the Bayesian network for training to obtain the conditional probability table of each node.
(2.3) The unmanned ship information to be evaluated, including material, displacement, length, width and height, is input; the probability of each grade of the three upper-layer nodes is calculated according to the conditional probability table, and the maximum-probability grade is output as the final value.
(2.4) The wind speed, wave height and flow speed grades of the unmanned ship obtained through the Bayesian network are mapped, according to the sea-state grade table, to specific numerical values that are used as the maximum wind speed, wave height and flow speed the unmanned ship can bear.
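As a simplified, illustrative stand-in for steps (2.1)-(2.3), the following Python sketch approximates the conditional probability table of one upper-layer node by frequency counts over discretized structure attributes and returns the maximum-probability grade; the attribute bins, training rows and grade labels are assumptions for the example, and a full Bayesian network implementation would handle all three upper-layer nodes as well as attribute configurations not seen in training.

# Simplified stand-in for the Bayesian-network evaluation of steps (2.1)-(2.3):
# the conditional probability table of one upper-layer node (e.g. wind-resistance
# level) is approximated by frequency counts over discretized structure attributes,
# and the maximum-probability grade is returned. All rows/labels are illustrative.
from collections import Counter, defaultdict

# (material, displacement bin, length bin, width bin, height bin) -> resistance grade
train = [
    (("steel", "large", "long", "wide", "high"), 6),
    (("steel", "large", "long", "wide", "mid"), 5),
    (("aluminium", "small", "short", "narrow", "low"), 3),
    (("aluminium", "mid", "mid", "narrow", "low"), 4),
]

def fit_cpt(samples):
    """Count grade frequencies for every observed attribute configuration."""
    cpt = defaultdict(Counter)
    for attrs, grade in samples:
        cpt[attrs][grade] += 1
    return cpt

def most_probable_grade(cpt, attrs):
    """Return (grade, conditional probability) with the highest probability for attrs."""
    counts = cpt.get(attrs)
    if not counts:
        return None                      # configuration never seen in training
    total = sum(counts.values())
    grade, n = counts.most_common(1)[0]
    return grade, n / total

cpt = fit_cpt(train)
print(most_probable_grade(cpt, ("steel", "large", "long", "wide", "high")))
# -> (6, 1.0): the ship is assessed at wind-resistance grade 6 (illustrative)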
(3) Initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks (respectively serving as a target Q network and an actual Q network), a reward function model, a model experience pool and an action output set;
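A minimal sketch of such an initialization in Python/PyTorch might look as follows; the network size, hidden dimension and experience pool capacity are illustrative assumptions, and the action set is the one listed in step (5) below.

# Minimal sketch (PyTorch): two identical LSTM Q-networks, an experience pool
# and the discrete action set. All sizes and hyper-parameters are assumptions.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

class LSTMQNet(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64, n_actions=8):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len, state_dim) -> Q-values: (batch, n_actions)
        out, hidden = self.lstm(x, hidden)
        return self.head(out[:, -1, :]), hidden

ACTIONS = [35, 25, 15, 5, -5, -15, -25, -35]      # heading changes (degrees)
actual_q = LSTMQNet(n_actions=len(ACTIONS))        # "actual" Q network
target_q = copy.deepcopy(actual_q)                 # identical target Q network
replay_pool = deque(maxlen=100_000)                # model experience pool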
Specifically, in the reward function model, the reward value R_t is calculated as:
R_t = softmax(|(θ_safe − s_t)·w_1|) · ((θ_safe − s_t)·w_2)^T
where s_t is the characteristic state vector of the unmanned ship at time t; θ_safe is the safety threshold vector containing four parameters: the maximum wind speed, wave height and flow speed the unmanned ship can bear, obtained from the Bayesian network evaluation, and the collision-avoidance sensing range of the unmanned ship, to which a negative sign is attached for convenience of calculation. The weight matrix w_1 is a 4 × 4 symmetric constant matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree to which wind speed, wave height, ocean current and obstacles affect the path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; the matrix serves to bring marine environment element values of different orders of magnitude to the same order of magnitude for comparison and can highlight the key elements; the values of w_1 are given empirically. w_2 is a 4 × 4 diagonal matrix that, combined with the (θ_safe − s_t) term, gives the final reward value R_t its positive or negative sign, enlarges the reward value and accelerates decision making.
The softmax(|(θ_safe − s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for assigning a weight to each characteristic state element; these weights highlight the elements most important to the decision in each iteration, and the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. The (θ_safe − s_t)·w_2 part attaches a positive or negative sign to the calculation result, indicating a reward or a penalty. The calculation of the reward function is illustrated below for two cases: no obstacle encountered and obstacle encountered.
When no obstacle is encountered:
Suppose that the characteristic state vector of the unmanned ship at a certain time t_n contains no detected obstacle, so its obstacle-distance component is NAN (NAN means the component does not participate in the calculation), and the unmanned ship safety threshold vector is θ_safe = [3, 1.5, 0.2, 500]. The calculation result of softmax(|(θ_safe − s_t)·w_1|) is [0.867, 0.117, 0.016, 0], which means that in this calculation the marine element "wind speed" needs attention; the (θ_safe − s_t)·w_2 part is multiplied with these weights, and the sign attached to the result indicates whether a reward or a penalty is given; the final calculation result of −19.95 represents a penalty.
When an obstacle is encountered:
Suppose that the characteristic state vector of the unmanned ship at a certain time t_n indicates that an obstacle is detected 50 m from the unmanned ship, i.e. the unmanned ship has just sensed the obstacle, with the corresponding unmanned ship safety threshold vector. The calculation result of softmax(|(θ_safe − s_t)·w_1|) is [0, 0, 0, 1], which means that in this calculation avoiding the obstacle is the most important; the (θ_safe − s_t)·w_2 part is dot-multiplied with these weights and the sign is attached to the result; the final calculation result of −200 represents a penalty.
Through the above calculation, the algorithm drives the unmanned ship via the reward function: it focuses on the marine environment elements when no obstacle is detected and reacts immediately once an obstacle is detected;
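For illustration, the following Python sketch reproduces the structure of the reward calculation R_t = softmax(|(θ_safe − s_t)·w_1|)·((θ_safe − s_t)·w_2)^T for the two cases above; since the empirically chosen w_1 and w_2 matrices are not reproduced in this description, the matrices and the state values used here are placeholders, so the printed numbers only show the qualitative behaviour (a penalty dominated by the wind-speed term when no obstacle is sensed, and a penalty dominated by the obstacle term once one is sensed), not the −19.95 and −200 of the worked examples.

# Sketch of the reward R_t = softmax(|(theta_safe - s_t) @ w1|) . ((theta_safe - s_t) @ w2)^T.
# The concrete w1/w2 matrices are given empirically in the invention and are not
# reproduced here; the values below are placeholders for illustration only.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reward(s_t, theta_safe, w1, w2):
    # obstacle slot may be NAN (= no obstacle detected, does not participate)
    diff = np.nan_to_num(theta_safe - s_t, nan=0.0)
    weights = softmax(np.abs(diff @ w1))        # which element matters most now
    signed = diff @ w2                          # carries the reward/penalty sign
    return float(weights @ signed)

theta_safe = np.array([3.0, 1.5, 0.2, -500.0])  # Bayesian thresholds + negated sensing range (example values)
w1 = np.diag([1.0, 2.0, 10.0, 0.05])            # placeholder attention/scaling matrix
w2 = np.diag([5.0, 10.0, 50.0, 1.0])            # placeholder sign/magnitude matrix

s_no_obstacle = np.array([8.0, 1.0, 0.1, np.nan])   # strong wind, no obstacle
s_obstacle    = np.array([2.0, 0.5, 0.1, 50.0])     # obstacle sensed at 50 m
print(reward(s_no_obstacle, theta_safe, w1, w2))    # negative -> penalty, driven by wind
print(reward(s_obstacle, theta_safe, w1, w2))       # strongly negative -> penalty, driven by obstacle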
(4) The three attributes of coordinates, course and speed of the real AIS data of the target sea area are retained, and the three marine environment element values and the obstacle information are superimposed onto the AIS data according to time and point position; a sample of the new AIS data is shown in the following table:
TABLE 2
Putting the newly-sorted AIS data serving as training samples into a deep reinforcement learning model for training to obtain an optimized experience pool and preliminary network parameters;
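A minimal sketch of how the environment values and obstacle information might be superimposed onto AIS records by time and position is given below; the record layout, the lookup function and all numerical values are assumptions for illustration.

# Sketch of step (4): superimpose gridded environment values and obstacle
# information onto real AIS records (coordinates, course, speed) by time and
# position, producing training samples. Field names and values are assumptions.
import numpy as np

ais_records = [
    # (timestamp, lon, lat, course_deg, speed_mps)
    ("2021-06-01T00:00:00", 120.031, 30.045, 92.0, 9.8),
    ("2021-06-01T00:00:15", 120.032, 30.045, 95.0, 9.9),
]

def lookup_env(lon, lat, ts):
    """Look up wind speed, wave height, flow speed and obstacle distance in the
    gridded fields for the given position/time (placeholder implementation)."""
    return {"wind": 6.2, "wave": 1.1, "current": 0.4, "obstacle": np.nan}

samples = []
for ts, lon, lat, course, speed in ais_records:
    env = lookup_env(lon, lat, ts)
    samples.append({"time": ts, "lon": lon, "lat": lat,
                    "course": course, "speed": speed, **env})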
(5) With the running speed of the unmanned ship fixed at v = 10 m/s, the discretized heading change angle is selected as the action output of the deep reinforcement learning. Considering the steering capacity of the ship, the heading change is limited to between 35° and −35° and discretized at equal intervals, giving the action set output by the model:
A={35,25,15,5,-5,-15,-25,-35}
(6) Referring to FIG. 3, a flowchart of the deep reinforcement learning algorithm, two identical LSTM networks are used as the actual Q network and the target Q network in the deep reinforcement learning framework. The state characteristic vector s_t of the unmanned ship at time t is obtained and input into the actual Q network and the reward function model respectively. The LSTM input layer of the actual Q network at time t is the feature state vector s_t together with the output Q(s_{t-1})_actual of the actual Q network at the previous moment, and the output layer is the Q(s_t)_actual value; the action a_t (a_t ∈ A) corresponding to this Q value is then selected using an ε-greedy strategy;
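Continuing the initialization sketch above, the ε-greedy selection of a_t could be sketched as follows; the value of ε and the replacement of the NAN obstacle slot by 0 are assumptions, and the previous-moment Q output that the invention also feeds to the LSTM is omitted here for brevity.

# Sketch of the epsilon-greedy choice of step (6), reusing actual_q and ACTIONS
# from the initialization sketch. Epsilon and tensor shapes are assumptions.
import random
import torch

def select_action(actual_q, s_t, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))          # explore
    with torch.no_grad():
        q_values, _ = actual_q(s_t.view(1, 1, -1))     # (1, n_actions)
    return int(q_values.argmax(dim=1).item())          # exploit

s_t = torch.tensor([6.2, 1.1, 0.4, 0.0])               # NAN replaced by 0 for the network
a_t = ACTIONS[select_action(actual_q, s_t)]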
(7) The reward value R_t at time t is calculated; the characteristic state vector s_t at time t, the action a_t, the feature state vector s_t' obtained after executing a_t, and the Boolean value is_end that indicates whether the iteration has terminated are stored together as a record rec_t = {s_t, a_t, R_t, s_t', is_end} in experience pool D;
(8) Randomly extract n records {s_i, a_i, R_i, s_i', is_end_i}, i = 1, 2, …, n, from experience pool D and calculate the target Q value Q_target:
Q_target,i = R_i + γ · Q(s_i', a_max(s_i', ω); ω')   (Q_target,i = R_i when is_end_i is true)
where R_i is the reward value of the i-th record, γ is the discount factor (γ = 0.9 in this example), ω is a parameter of the actual Q network, ω' is a parameter of the target Q network, and a_max(s_i', ω) is the action obtained by feeding record i back into the actual Q network:
a_max(s_i', ω) = argmax_a Q(s_i', a; ω)
where s_i', a_i and ω are respectively the state characteristic vector, the action and the network parameter of record i;
(9) Calculate the accumulated loss over the n records and update the parameter ω of the actual Q network by gradient descent; the loss function used is:
L(ω) = (1/n) · Σ_{i=1..n} (Q_target,i − Q(s_i, a_i; ω))²
(10) when the iteration number of the actual Q network reaches the threshold value alpha, the parameter omega of the actual Q network is wholly copied to the target Q network.
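Continuing the same sketch, steps (8)-(10) could be implemented as follows; the interpretation of a_max(s_i', ω) as a double-DQN-style target, the optimizer, the batch record keys and the value of the threshold α are assumptions.

# Sketch of steps (8)-(10): sample n records, form the target Q value with the
# target network, take one gradient-descent step on the squared error, and
# periodically copy the parameters. Reuses actual_q / target_q from above.
import torch
import torch.nn.functional as F

gamma = 0.9
optimizer = torch.optim.SGD(actual_q.parameters(), lr=1e-3)   # plain gradient descent

def train_step(batch):
    # each record is assumed to be {"s": tensor, "a": int, "R": float, "s2": tensor, "is_end": bool}
    s    = torch.stack([torch.nan_to_num(r["s"]) for r in batch]).unsqueeze(1)
    a    = torch.tensor([r["a"] for r in batch])
    r_t  = torch.tensor([r["R"] for r in batch])
    s2   = torch.stack([torch.nan_to_num(r["s2"]) for r in batch]).unsqueeze(1)
    done = torch.tensor([r["is_end"] for r in batch], dtype=torch.float32)

    with torch.no_grad():
        a_max = actual_q(s2)[0].argmax(dim=1)                   # a_max(s', omega) from the actual net
        q_next = target_q(s2)[0].gather(1, a_max.unsqueeze(1)).squeeze(1)
        q_target = r_t + gamma * (1.0 - done) * q_next          # Q_target

    q_actual = actual_q(s)[0].gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_actual, q_target)                       # accumulated loss over the batch
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def maybe_sync(step, alpha=200):
    if step % alpha == 0:                                       # iteration threshold alpha
        target_q.load_state_dict(actual_q.state_dict())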
(11) The motion time of the unmanned ship is 10 seconds each time, and when the accumulated motion time reaches 1h, the wind speed, ocean current, wave height and obstacle information data of the sea area are updated to the current time;
(12) The iteration ends when the unmanned ship reaches the end point, and the safe path is output.
Fig. 4 is a schematic diagram of path planning under the influence of marine environmental elements and obstacles, and the method can avoid high marine environmental risk areas and obstacles when planning a path.
The above is an example of the present invention; all changes made according to the technical scheme of the present invention that produce the same functional effects without going beyond the technical scheme of the present invention fall within the protection scope of the present invention.