Disclosure of Invention
(I) Objects of the invention
Aiming at the problem that marine environment elements are not considered in many of the unmanned ship path planning methods proposed at present, the invention provides an unmanned ship path planning method based on deep reinforcement learning that takes marine environment elements into account; the method fully considers real marine environment elements and marine obstacles and combines them with a deep reinforcement learning method to plan a safe and efficient driving path for the unmanned ship.
(II) technical scheme
In order to achieve the purpose, the technical scheme of the invention is as follows: an unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements comprises the following specific steps:
(1) The wind speed, flow speed and wave height data of the target sea area at time t are interpolated onto a 200 m × 200 m grid, and s_t is used to describe the characteristic state vector of the unmanned ship at time t, namely s_t = [wind speed, wave height, flow speed, obstacle distance], where the first three components respectively represent the wind speed, wave height and flow speed at the position of the unmanned ship at time t, the fourth component is the distance between the unmanned ship and the obstacle at time t, and a value of NAN for this distance indicates that the unmanned ship has not detected an obstacle;
(2) The capability of the unmanned ship to resist wind, wave and current is evaluated with a Bayesian network; the material, displacement, length, width and height of the unmanned ship are input, and the outputs are three parameters representing the maximum wind speed, wave height and flow speed the unmanned ship can bear, which are used to calculate the reward function;
(3) Initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks (as target Q network and actual Q network, respectively), a reward function model, a model experience pool, and an action output set.
(4) The three attributes of coordinates, course and speed of the real AIS data of the target sea area are retained; the three marine environment element values and the obstacle information are superimposed onto the AIS data according to time and point position; the new AIS data are used as training samples and put into the deep reinforcement learning model for training, obtaining an optimized experience pool and preliminary network parameters;
(5) The start coordinate and end coordinate of the unmanned ship's voyage are set, and the state feature vector s_t of the unmanned ship at time t is obtained and input into the actual Q network and the reward function model respectively;
wherein: the actual Q network calculates Q_actual and selects the action corresponding to Q_actual according to an ε-greedy strategy as the output; the reward function model calculates the reward value R_t of the current iteration; the target Q network randomly extracts n records from the experience pool and, together with R_t, calculates Q_target; Q_actual and Q_target are used together to calculate the loss function, the network parameters of the actual Q network are updated by gradient descent, and when the number of iterations reaches a threshold α, all parameters of the actual Q network are copied to the target Q network;
(6) Each motion of the unmanned ship lasts 15 seconds; when the accumulated motion time reaches 1 h, the wind speed, ocean current, wave height and obstacle information of the sea area are updated to the current time;
(7) When the unmanned ship reaches the target point, the iteration ends and the safe path is output.
Specifically, the Bayesian network construction method in step (2) includes the following steps:
(2.1) The nodes of the unmanned ship evaluation Bayesian network include: the material, displacement, length, width and height as bottom-layer nodes, and the wind-resistance level, wave-resistance level and flow-resistance level as upper-layer nodes; the bottom-layer nodes are fully connected with the upper-layer nodes;
(2.2) training a Bayesian network by taking the unmanned ship structure data as a sample to obtain a conditional probability table of each node;
(2.3) inputting the unmanned ship information to be evaluated, including: material, displacement, length, width and height, calculating the probability of each grade of the three upper-layer nodes according to the conditional probability table, and outputting the maximum-probability grade as the final value;
(2.4) The wind speed, wave height and flow speed grades of the unmanned ship obtained through the Bayesian network are mapped, according to the corresponding sea-state grades, to specific numerical values that are used as the maximum wind speed, wave height and flow speed the unmanned ship can bear.
Specifically, in the reward function model described in step (3), the reward value R_t is calculated as:
R_t = softmax(|(θ_safe − s_t)·w_1|) · ((θ_safe − s_t)·w_2)^T
where s_t is the characteristic state vector of the unmanned ship at time t; θ_safe is the safety threshold vector containing four parameters: the maximum wind speed, wave height and flow speed the unmanned ship can bear, obtained in step (2), and the obstacle sensing range of the unmanned ship; w_1 is the attention matrix of the reward function, a 4 × 4 upper-triangular constant matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree to which wind speed, wave height, ocean current and obstacles affect the path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; the matrix serves to bring marine environment element values of different orders of magnitude to the same order of magnitude for comparison and can highlight the key elements; w_2 is a 4 × 4 diagonal matrix that, combined with the (θ_safe − s_t) term, gives the final reward value R_t its positive or negative sign and enlarges the reward value to facilitate decision making;
The softmax(|(θ_safe − s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for assigning a weight to each element value; these weights highlight the elements most important to the decision in each iteration, and the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. When the unmanned ship senses no obstacle, the reward function guides the model to avoid areas of high wind and waves, and as soon as an obstacle is sensed it prompts a collision-avoidance action.
(III) advantageous effects
The advantages of the invention are as follows:
1. The wind speed, wave height, ocean current speed and obstacle information are jointly used as the main references for planning the unmanned ship path, which makes the planned path more feasible; during the calculation, these data are updated according to the running time of the unmanned ship, which ensures the reliability of the path planning result.
2. The designed reward function highlights the elements most important to the decision in each iteration while taking into account the ship's detection capability and its capacity to withstand wind and wave impact; it gives rewards in safe areas, gives appropriate penalties in dangerous areas, and makes an avoidance decision as soon as an obstacle is detected, which improves the path planning efficiency of the method and optimizes the path planning result.
3. A method of evaluating the wind and wave resistance of an unmanned ship with a Bayesian network is provided, replacing the conventional practice of assigning the wind and wave resistance grade from expert experience; it is more scientific and efficient.
Detailed Description
The invention will now be described more fully and clearly with reference to the accompanying drawings and examples:
FIG. 1 is a flow chart of a method for planning a path of an unmanned ship based on deep reinforcement learning and considering elements of a marine environment, wherein the method gives a reasonable solution for safely completing a navigation task of the unmanned ship by fully considering the material and structure of the unmanned ship and possible strong wind, strong waves, ocean currents and obstacles in a sea area; the method mainly comprises two modules, wherein the first module is a Bayesian network evaluation module used for evaluating the wind wave resistance of the unmanned ship, and the second module is a deep reinforcement learning route planning module considering marine environment elements; the method utilizes a reward function of deep reinforcement learning to couple the two modules, so that the unmanned ship can make a proper risk avoiding decision according to the self material and structure. The unmanned ship planning method is suitable for planning the path of the unmanned ship for executing the long-range mission.
Specifically, the method comprises the following steps:
(1) The wind speed, flow speed and wave height data of the target sea area at time t, together with the forecast wind speed, ocean current and wave height data for the period after time t, are jointly interpolated onto a 200 m × 200 m grid using the kriging interpolation method; the 200 m × 200 m grid size is chosen so that the unmanned ship acquires a new element value after at most three actions. The data are stored in a three-dimensional array whose three dimensions are longitude, latitude and time, with a data time interval of 1 h. At time t, s_t is used to describe the characteristic state vector of the unmanned ship, namely s_t = [wind speed, wave height, flow speed, obstacle distance], where the first three components respectively represent the wind speed, wave height and flow speed at the position of the unmanned ship at time t, the fourth component is the distance between the unmanned ship and the obstacle at time t, and a value of NAN for this distance indicates that the unmanned ship has not detected an obstacle;
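For illustration only, the following minimal Python sketch shows how one environment field could be kriged onto such a grid and how a state vector of the above form could be assembled; the use of the pykrige library, the sample coordinates, the degree-to-metre conversion and all numerical values are assumptions made for the example and are not part of the invention.

# Minimal sketch: kriging one field (e.g. wave height) onto a fine grid and
# assembling the state vector s_t. Library choice (pykrige) and all sample
# values are assumptions for illustration.
import numpy as np
from pykrige.ok import OrdinaryKriging

# scattered observations: longitude, latitude, wave height (illustrative)
lon = np.array([120.01, 120.05, 120.09, 120.12])
lat = np.array([30.02, 30.06, 30.04, 30.10])
wave = np.array([0.8, 1.1, 0.9, 1.4])

# 200 m is roughly 0.002 degrees of longitude at this latitude (assumption)
grid_lon = np.arange(120.0, 120.15, 0.002)
grid_lat = np.arange(30.0, 30.12, 0.002)

ok = OrdinaryKriging(lon, lat, wave, variogram_model="linear")
wave_grid, _ = ok.execute("grid", grid_lon, grid_lat)   # (n_lat, n_lon) field

# state vector at time t: [wind speed, wave height, flow speed, obstacle distance]
# NAN in the last slot means "no obstacle detected"
s_t = np.array([6.2, float(wave_grid[10, 20]), 0.4, np.nan])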
(2) The capability of the unmanned ship to resist wind, wave and current is evaluated with the Bayesian network; the material, displacement, length, width and height of the unmanned ship are input, and the outputs are three parameters representing the maximum wind speed, wave height and flow speed the unmanned ship can bear, which are used to calculate the reward function. The specific steps for constructing the Bayesian network are as follows:
(2.1) As shown in FIG. 2, a schematic diagram of the Bayesian network for evaluating the capability of the unmanned ship to resist wind, wave and current, the nodes comprise: the material, displacement, length, width and height as bottom-layer nodes, and the wind-resistance level, wave-resistance level and flow-resistance level as upper-layer nodes; the bottom-layer nodes are fully connected with the upper-layer nodes;
(2.2) The Bayesian network is trained with the unmanned ship structure information as samples; the data must first be discretized. The unmanned ship structure data are shown in the following table:
TABLE 1
The data are put into the Bayesian network for training to obtain the conditional probability table of each node.
(2.3) The unmanned ship information to be evaluated, including material, displacement, length, width and height, is input; the probability of each grade of the three upper-layer nodes is calculated according to the conditional probability table, and the maximum-probability grade is output as the final value.
(2.4) The wind speed, wave height and flow speed grades of the unmanned ship obtained through the Bayesian network are mapped, according to the sea-state grade table, to specific numerical values that are used as the maximum wind speed, wave height and flow speed the unmanned ship can bear.
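As a simplified, illustrative stand-in for steps (2.1)-(2.3), the following Python sketch approximates the conditional probability table of one upper-layer node by frequency counts over discretized structure attributes and returns the maximum-probability grade; the attribute bins, training rows and grade labels are assumptions for the example, and a full Bayesian network implementation would handle all three upper-layer nodes as well as attribute configurations not seen in training.

# Simplified stand-in for the Bayesian-network evaluation of steps (2.1)-(2.3):
# the conditional probability table of one upper-layer node (e.g. wind-resistance
# level) is approximated by frequency counts over discretized structure attributes,
# and the maximum-probability grade is returned. All rows/labels are illustrative.
from collections import Counter, defaultdict

# (material, displacement bin, length bin, width bin, height bin) -> resistance grade
train = [
    (("steel", "large", "long", "wide", "high"), 6),
    (("steel", "large", "long", "wide", "mid"), 5),
    (("aluminium", "small", "short", "narrow", "low"), 3),
    (("aluminium", "mid", "mid", "narrow", "low"), 4),
]

def fit_cpt(samples):
    """Count grade frequencies for every observed attribute configuration."""
    cpt = defaultdict(Counter)
    for attrs, grade in samples:
        cpt[attrs][grade] += 1
    return cpt

def most_probable_grade(cpt, attrs):
    """Return (grade, conditional probability) with the highest probability for attrs."""
    counts = cpt.get(attrs)
    if not counts:
        return None                      # configuration never seen in training
    total = sum(counts.values())
    grade, n = counts.most_common(1)[0]
    return grade, n / total

cpt = fit_cpt(train)
print(most_probable_grade(cpt, ("steel", "large", "long", "wide", "high")))
# -> (6, 1.0): the ship is assessed at wind-resistance grade 6 (illustrative)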
(3) Initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks (respectively serving as a target Q network and an actual Q network), a reward function model, a model experience pool and an action output set;
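A minimal sketch of such an initialization in Python/PyTorch might look as follows; the network size, hidden dimension and experience pool capacity are illustrative assumptions, and the action set is the one listed in step (5) below.

# Minimal sketch (PyTorch): two identical LSTM Q-networks, an experience pool
# and the discrete action set. All sizes and hyper-parameters are assumptions.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

class LSTMQNet(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64, n_actions=8):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len, state_dim) -> Q-values: (batch, n_actions)
        out, hidden = self.lstm(x, hidden)
        return self.head(out[:, -1, :]), hidden

ACTIONS = [35, 25, 15, 5, -5, -15, -25, -35]      # heading changes (degrees)
actual_q = LSTMQNet(n_actions=len(ACTIONS))        # "actual" Q network
target_q = copy.deepcopy(actual_q)                 # identical target Q network
replay_pool = deque(maxlen=100_000)                # model experience pool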
Specifically, in the reward function model, the reward value R_t is calculated as:
R_t = softmax(|(θ_safe − s_t)·w_1|) · ((θ_safe − s_t)·w_2)^T
where s_t is the characteristic state vector of the unmanned ship at time t; θ_safe is the safety threshold vector containing four parameters: the maximum wind speed, wave height and flow speed the unmanned ship can bear, obtained from the Bayesian network evaluation, and the collision-avoidance sensing range of the unmanned ship, to which a negative sign is attached for convenience of calculation. The weight matrix w_1 is a 4 × 4 symmetric constant matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree to which wind speed, wave height, ocean current and obstacles affect the path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; the matrix serves to bring marine environment element values of different orders of magnitude to the same order of magnitude for comparison and can highlight the key elements; the values of w_1 are given empirically. w_2 is a 4 × 4 diagonal matrix that, combined with the (θ_safe − s_t) term, gives the final reward value R_t its positive or negative sign, enlarges the reward value and accelerates decision making.
The softmax(|(θ_safe − s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for assigning a weight to each characteristic state element; these weights highlight the elements most important to the decision in each iteration, and the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. The (θ_safe − s_t)·w_2 part attaches a positive or negative sign to the calculation result, indicating a reward or a penalty. The calculation of the reward function is illustrated below for two cases: no obstacle encountered and obstacle encountered.
When no obstacle is encountered:
Suppose that the characteristic state vector of the unmanned ship at a certain time t_n contains no detected obstacle, so its obstacle-distance component is NAN (NAN means the component does not participate in the calculation), and the unmanned ship safety threshold vector is θ_safe = [3, 1.5, 0.2, 500]. The calculation result of softmax(|(θ_safe − s_t)·w_1|) is [0.867, 0.117, 0.016, 0], which means that in this calculation the marine element "wind speed" needs attention; the (θ_safe − s_t)·w_2 part is multiplied with these weights, and the sign attached to the result indicates whether a reward or a penalty is given; the final calculation result of −19.95 represents a penalty.
When an obstacle is encountered:
Suppose that the characteristic state vector of the unmanned ship at a certain time t_n indicates that an obstacle is detected 50 m from the unmanned ship, i.e. the unmanned ship has just sensed the obstacle, with the corresponding unmanned ship safety threshold vector. The calculation result of softmax(|(θ_safe − s_t)·w_1|) is [0, 0, 0, 1], which means that in this calculation avoiding the obstacle is the most important; the (θ_safe − s_t)·w_2 part is dot-multiplied with these weights and the sign is attached to the result; the final calculation result of −200 represents a penalty.
Through the above calculation, the algorithm drives the unmanned ship via the reward function: it focuses on the marine environment elements when no obstacle is detected and reacts immediately once an obstacle is detected;
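For illustration, the following Python sketch reproduces the structure of the reward calculation R_t = softmax(|(θ_safe − s_t)·w_1|)·((θ_safe − s_t)·w_2)^T for the two cases above; since the empirically chosen w_1 and w_2 matrices are not reproduced in this description, the matrices and the state values used here are placeholders, so the printed numbers only show the qualitative behaviour (a penalty dominated by the wind-speed term when no obstacle is sensed, and a penalty dominated by the obstacle term once one is sensed), not the −19.95 and −200 of the worked examples.

# Sketch of the reward R_t = softmax(|(theta_safe - s_t) @ w1|) . ((theta_safe - s_t) @ w2)^T.
# The concrete w1/w2 matrices are given empirically in the invention and are not
# reproduced here; the values below are placeholders for illustration only.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reward(s_t, theta_safe, w1, w2):
    # obstacle slot may be NAN (= no obstacle detected, does not participate)
    diff = np.nan_to_num(theta_safe - s_t, nan=0.0)
    weights = softmax(np.abs(diff @ w1))        # which element matters most now
    signed = diff @ w2                          # carries the reward/penalty sign
    return float(weights @ signed)

theta_safe = np.array([3.0, 1.5, 0.2, -500.0])  # Bayesian thresholds + negated sensing range (example values)
w1 = np.diag([1.0, 2.0, 10.0, 0.05])            # placeholder attention/scaling matrix
w2 = np.diag([5.0, 10.0, 50.0, 1.0])            # placeholder sign/magnitude matrix

s_no_obstacle = np.array([8.0, 1.0, 0.1, np.nan])   # strong wind, no obstacle
s_obstacle    = np.array([2.0, 0.5, 0.1, 50.0])     # obstacle sensed at 50 m
print(reward(s_no_obstacle, theta_safe, w1, w2))    # negative -> penalty, driven by wind
print(reward(s_obstacle, theta_safe, w1, w2))       # strongly negative -> penalty, driven by obstacle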
(4) The three attributes of coordinates, course and speed of the real AIS data of the target sea area are retained, and the three marine environment element values and the obstacle information are superimposed onto the AIS data according to time and point position; a sample of the new AIS data is shown in the following table:
TABLE 2
Putting the newly-sorted AIS data serving as training samples into a deep reinforcement learning model for training to obtain an optimized experience pool and preliminary network parameters;
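A minimal sketch of how the environment values and obstacle information might be superimposed onto AIS records by time and position is given below; the record layout, the lookup function and all numerical values are assumptions for illustration.

# Sketch of step (4): superimpose gridded environment values and obstacle
# information onto real AIS records (coordinates, course, speed) by time and
# position, producing training samples. Field names and values are assumptions.
import numpy as np

ais_records = [
    # (timestamp, lon, lat, course_deg, speed_mps)
    ("2021-06-01T00:00:00", 120.031, 30.045, 92.0, 9.8),
    ("2021-06-01T00:00:15", 120.032, 30.045, 95.0, 9.9),
]

def lookup_env(lon, lat, ts):
    """Look up wind speed, wave height, flow speed and obstacle distance in the
    gridded fields for the given position/time (placeholder implementation)."""
    return {"wind": 6.2, "wave": 1.1, "current": 0.4, "obstacle": np.nan}

samples = []
for ts, lon, lat, course, speed in ais_records:
    env = lookup_env(lon, lat, ts)
    samples.append({"time": ts, "lon": lon, "lat": lat,
                    "course": course, "speed": speed, **env})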
(5) With the running speed of the unmanned ship fixed at v = 10 m/s, the discretized heading change angle is selected as the action output of the deep reinforcement learning. Considering the steering capacity of the ship, the heading change is limited to between 35° and −35° and discretized at equal intervals, giving the action set output by the model:
A={35,25,15,5,-5,-15,-25,-35}
(6) Referring to FIG. 3, a flowchart of the deep reinforcement learning algorithm, two identical LSTM networks are used as the actual Q network and the target Q network in the deep reinforcement learning framework. The state characteristic vector s_t of the unmanned ship at time t is obtained and input into the actual Q network and the reward function model respectively. The LSTM input layer of the actual Q network at time t is the feature state vector s_t together with the output Q(s_{t-1})_actual of the actual Q network at the previous moment, and the output layer is the Q(s_t)_actual value; the action a_t (a_t ∈ A) corresponding to this Q value is then selected using an ε-greedy strategy;
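Continuing the initialization sketch above, the ε-greedy selection of a_t could be sketched as follows; the value of ε and the replacement of the NAN obstacle slot by 0 are assumptions, and the previous-moment Q output that the invention also feeds to the LSTM is omitted here for brevity.

# Sketch of the epsilon-greedy choice of step (6), reusing actual_q and ACTIONS
# from the initialization sketch. Epsilon and tensor shapes are assumptions.
import random
import torch

def select_action(actual_q, s_t, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))          # explore
    with torch.no_grad():
        q_values, _ = actual_q(s_t.view(1, 1, -1))     # (1, n_actions)
    return int(q_values.argmax(dim=1).item())          # exploit

s_t = torch.tensor([6.2, 1.1, 0.4, 0.0])               # NAN replaced by 0 for the network
a_t = ACTIONS[select_action(actual_q, s_t)]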
(7) The reward value R_t at time t is calculated; the characteristic state vector s_t at time t, the action a_t, the feature state vector s_t' obtained after executing a_t, and the Boolean value is_end that indicates whether the iteration has terminated are stored together as a record rec_t = {s_t, a_t, R_t, s_t', is_end} in experience pool D;
(8) Randomly extract n records {s_i, a_i, R_i, s_i', is_end_i}, i = 1, 2, …, n, from experience pool D and calculate the target Q value Q_target:
Q_target,i = R_i + γ · Q(s_i', a_max(s_i', ω); ω')   (Q_target,i = R_i when is_end_i is true)
where R_i is the reward value of the i-th record, γ is the discount factor (γ = 0.9 in this example), ω is a parameter of the actual Q network, ω' is a parameter of the target Q network, and a_max(s_i', ω) is the action obtained by feeding record i back into the actual Q network:
a_max(s_i', ω) = argmax_a Q(s_i', a; ω)
where s_i', a_i and ω are respectively the state characteristic vector, the action and the network parameter of record i;
(9) Calculate the accumulated loss over the n records and update the parameter ω of the actual Q network by gradient descent; the loss function used is:
L(ω) = (1/n) · Σ_{i=1..n} (Q_target,i − Q(s_i, a_i; ω))²
(10) when the iteration number of the actual Q network reaches the threshold value alpha, the parameter omega of the actual Q network is wholly copied to the target Q network.
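Continuing the same sketch, steps (8)-(10) could be implemented as follows; the interpretation of a_max(s_i', ω) as a double-DQN-style target, the optimizer, the batch record keys and the value of the threshold α are assumptions.

# Sketch of steps (8)-(10): sample n records, form the target Q value with the
# target network, take one gradient-descent step on the squared error, and
# periodically copy the parameters. Reuses actual_q / target_q from above.
import torch
import torch.nn.functional as F

gamma = 0.9
optimizer = torch.optim.SGD(actual_q.parameters(), lr=1e-3)   # plain gradient descent

def train_step(batch):
    # each record is assumed to be {"s": tensor, "a": int, "R": float, "s2": tensor, "is_end": bool}
    s    = torch.stack([torch.nan_to_num(r["s"]) for r in batch]).unsqueeze(1)
    a    = torch.tensor([r["a"] for r in batch])
    r_t  = torch.tensor([r["R"] for r in batch])
    s2   = torch.stack([torch.nan_to_num(r["s2"]) for r in batch]).unsqueeze(1)
    done = torch.tensor([r["is_end"] for r in batch], dtype=torch.float32)

    with torch.no_grad():
        a_max = actual_q(s2)[0].argmax(dim=1)                   # a_max(s', omega) from the actual net
        q_next = target_q(s2)[0].gather(1, a_max.unsqueeze(1)).squeeze(1)
        q_target = r_t + gamma * (1.0 - done) * q_next          # Q_target

    q_actual = actual_q(s)[0].gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_actual, q_target)                       # accumulated loss over the batch
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def maybe_sync(step, alpha=200):
    if step % alpha == 0:                                       # iteration threshold alpha
        target_q.load_state_dict(actual_q.state_dict())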
(11) The motion time of the unmanned ship is 10 seconds each time, and when the accumulated motion time reaches 1h, the wind speed, ocean current, wave height and obstacle information data of the sea area are updated to the current time;
(12) The iteration ends when the unmanned ship reaches the end point, and the safe path is output.
Fig. 4 is a schematic diagram of path planning under the influence of marine environmental elements and obstacles, and the method can avoid high marine environmental risk areas and obstacles when planning a path.
The above is an example of the present invention; all changes made according to the technical scheme of the present invention that produce the same functional effects without going beyond the technical scheme of the present invention fall within the protection scope of the present invention.