Unmanned boat path planning method based on Q-learning neural network
Technical field
The invention belongs to the field of intelligent control of unmanned ships, and specifically relates to an unmanned boat path planning method based on a Q-learning neural network.
Background technique
Water quality monitoring is the main means of assessing water quality and preventing water pollution. With the increase of industrial wastewater, water pollution is getting worse and the demand for dynamic monitoring of water pollution is urgent. However, traditional water quality monitoring methods involve numerous steps and are time-consuming, and the diversity and accuracy of the data obtained fall far short of decision-making needs. In response to these problems, a variety of water quality monitoring methods have been proposed: Cao Lijie et al. proposed establishing a sensor network to obtain a more accurate water quality inversion model, and Field et al. proposed performing inversion on satellite data through a water quality model to obtain the distribution map of water quality parameters in the monitored waters. However, the above methods cannot flexibly change the monitored waters, and their engineering workload is large and their steps are numerous. In contrast, an unmanned water quality monitoring boat is small and easy to carry, is not affected by the terrain of the monitoring site, and can continuously carry out in-situ monitoring of multiple water quality parameters, so the monitoring results are more diverse and accurate.
An unmanned ship (Unmanned Surface Vehicle, USV) is a water surface motion platform that can navigate autonomously in an unknown water environment and complete various tasks. Because of its wide range of applications, its research content covers automatic piloting, automatic obstacle avoidance, navigation planning, pattern recognition and other aspects. It can be used not only for mine clearance, reconnaissance and anti-submarine operations in the military field, but also for hydrometeorological detection, environmental monitoring and water rescue in the civil field. However, because water is mobile and can flow through various complex terrains that staff cannot reach, for example when water flows through a cave, or because the weather is changeable, for example when the waters are foggy for a long time so that staff cannot see clearly and cannot operate the USV accurately in real time, the autonomous navigation of the USV can be used to reach the target waters for detection, and the autonomous navigation function is realized through path planning technology.
USV path planning technology means that, in the operating waters, the USV searches for a collision-free path from the starting point to the target point according to certain performance indicators (such as shortest distance or shortest time). It is a core component of USV navigation technology and represents the standard of USV intelligence. Currently used planning methods mainly include the particle swarm algorithm, the A* algorithm, the visual graph method, the artificial potential field method and the ant colony algorithm, but these methods are mostly used under known environmental conditions.
The path planning problem in a known environment has already been solved fairly well, but a USV in unknown waters cannot obtain the environmental information of the waters to be monitored before executing a task, so path planning methods based on known environmental information cannot be used to plan the USV navigation path. Moreover, because the monitored water environment is complex and there is a large amount of sensor information, the computational burden of the system is heavy, which causes the USV to suffer from poor real-time performance and oscillation in front of obstacles. USV path planning therefore urgently needs an algorithm that is simple, has strong real-time performance and can handle the uncertainty in the system, so it is necessary to introduce methods with independent learning ability, among which path planning based on the Q-learning algorithm is suitable for path planning in unknown environments. In existing research, Guo Na et al. used simulated annealing for action selection on the basis of the traditional Q-learning algorithm to solve the balance between exploration and exploitation. Chen Zili et al. proposed using a genetic algorithm to establish a new Q-value table for static global path planning. Dong Peifang et al. added the artificial potential field method to the Q-learning algorithm, using the gravitational potential field as the initial environmental prior information and then searching the environment step by step to accelerate the Q-value iteration.
The Chinese patent with publication number CN108106623A discloses an unmanned vehicle path planning method based on a flow field, comprising the following steps: establishing a flow field calculation model according to the starting point and end point of the vehicle and the obstacles in the environment; taking the front wheel angle as the input quantity and the coordinates and heading angle as the state quantities, establishing a vehicle kinematics model; taking the vehicle kinematics model as the rolling equation, solving the rolling-horizon optimization problem of the flow field, and using the flow field velocity vector distribution as the guidance information for path planning to obtain the planned path, wherein the optimized quantity is the front wheel angle, the optimization objectives include making the vehicle motion consistent with the flow field motion and avoiding collision with obstacles during vehicle motion, and the constraint condition is that the front wheel angle does not exceed the maximum steering wheel angle. This method can find a smooth, obstacle-avoiding path connecting the start and end points in complex terrain, and achieves good smoothness and completeness under the premise of obstacle avoidance. However, this method needs to know the terrain of the environment and the positions of the obstacles, and cannot carry out path planning for an unknown field.
Summary of the invention
Goal of the invention: in order to overcome the deficiencies in the prior art, the invention proposes a Q-learning reinforcement learning path planning algorithm based on a BP neural network, in which the Q function in the Q-learning method is fitted with a neural network so that it can take continuous system states as input, and the convergence speed of the network during training is significantly improved by experience replay and by setting a target network. Experimental simulation verifies the feasibility of the improved planning method presented here.
Technical solution: to achieve the above object, the invention provides an unmanned boat path planning method based on a Q-learning neural network, characterized in that it comprises the following steps:
a) Initialize the memory block D;
b) Initialize the Q network and the initial values of the state and action; the Q network involves the following elements: S, A, P_{s,a}, R, where S denotes the set of system states the USV can be in, A denotes the set of actions the USV can take, P_{s,a} denotes the system state transition probability, and R denotes the reward function;
c) Set a training target at random;
d) Randomly select an action a_t, obtain the current reward r_t and the next-moment state s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in the memory block D;
e) Randomly sample a batch of data (s_t, a_t, r_t, s_{t+1}) from the memory block D for training; when the USV reaches the target position, or the maximum time of a round is exceeded, the state is regarded as the terminal state;
f) If s_{t+1} is not the terminal state, return to step d); if s_{t+1} is the terminal state, update the Q network parameters and return to step d); the algorithm ends after repeating n rounds;
g) Set the target and carry out path planning with the trained Q network until the USV reaches the target position.
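The following Python outline is only an illustrative sketch of steps a)–g), not the patented implementation; the environment interface (env.reset, env.step), the network object qnet and its select_action/update methods are assumed placeholder names.

```python
import random
from collections import deque

def train(env, qnet, n_rounds=3000, batch_size=32, max_steps=200):
    """Illustrative outline of steps a)-g): collect transitions into a
    replay memory D and update the Q network from random mini-batches."""
    D = deque(maxlen=40000)                        # step a): memory block D
    for episode in range(n_rounds):                # step f): repeat n rounds
        s = env.reset(random_target=True)          # step c): random training target
        for t in range(max_steps):
            a = qnet.select_action(s)              # step d): choose an action (e.g. epsilon-greedy)
            s_next, r, done = env.step(a)          # current reward and next state
            D.append((s, a, r, s_next, done))      # store the transition in D
            if len(D) >= batch_size:
                batch = random.sample(D, batch_size)   # step e): random mini-batch
                qnet.update(batch)                 # step f): update the Q network parameters
            s = s_next
            if done:                               # target reached or time limit exceeded
                break
    return qnet
```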
Preferably, in step a) the memory block D is an experience replay memory block used to store the training samples acquired during USV navigation; because of experience replay, the multiple samples used in each training step are not consecutive in time.
Preferably, the update rule of the Q network is as follows:
Q(s_t, a_t) = Q(s_t, a_t) + α·δ'_t
where the function Q(s_t, a_t) is the value of executing action a_t in state s_t, α is the learning rate, and δ'_t is the TD(0) deviation; the 0 in TD(0) indicates looking 1 step ahead from the current state, and the deviation is:
δ'_t = R(s_t) + γ·V(s_{t+1}) − Q(s_t, a_t)
where γ is the discount factor, R(s) is the reward function, and V(s) is the value function, V(s) = max_a Q(s, a). Alternatively, the TD(0) deviation can also be defined as
δ_{t+1} = R(s_{t+1}) + γ·V(s_{t+2}) − V(s_{t+1})
where δ_{t+1} is the TD(0) deviation, R(s) is the reward function, and V(s) is the value function.
Another discount factor λ ∈ [0, 1] is used to discount the TD deviations of future steps, giving the update Q(s_t, a_t) = Q(s_t, a_t) + α·δ_t^λ, where the function Q(s_t, a_t) is the value of executing action a_t in state s_t, α is the learning rate, and δ_t^λ is the TD(λ) deviation; TD(λ) looks several steps ahead, weighted by λ, from the current state.
Here the TD(λ) deviation δ_t^λ is defined in terms of δ'_t, the deviation obtained from past learning, and δ_{t+i}, the deviations of the current and future learning steps, discounted by the discount factor γ and the factor λ, with λ ∈ [0, 1].
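The explicit formula for δ_t^λ is not reproduced legibly in the source; a standard forward-view definition consistent with the variables listed above (an assumption, not necessarily the patent's exact expression) would be:

```latex
\delta^{\lambda}_{t} = \sum_{i=0}^{\infty} (\gamma\lambda)^{i}\, \delta_{t+i}
```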
Preferably, η_t(s, a) is defined as an indicator function: it returns 1 if (s, a) occurs at time t and 0 otherwise. For simplicity, ignoring the learning efficiency, an eligibility trace e_t(s, a) is defined for each (s, a).
Then the online update at time t is
Q(s, a) = Q(s, a) + α[δ'_t·η_t(s, a) + δ_t·e_t(s, a)]
where Q(s, a) is the value of executing action a in state s, α is the learning rate, η_t(s, a) is the indicator function, e_t(s, a) is the eligibility trace, δ'_t is the deviation from past learning, and δ_t is the deviation learned now; the update is obtained from the deviation between the accumulated return R(s) and the current estimate V(s), multiplied by the learning rate.
Preferably, reinforcement learning aims to maximize the expected overall return obtained while the system runs, i.e. to maximize E(R(s_0) + γR(s_1) + γ²R(s_2) + ...); for this purpose an optimal policy π must be found such that, when the USV makes decisions and acts according to π, the total return obtained is maximal.
The objective function of reinforcement learning is one of the following:
V^π(s) = E(R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π)
Q^π(s, a) = E(R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, a_0 = a, π)
where V^π(s) denotes the expected return that can be obtained by acting according to policy π starting from the current initial state s, Q^π(s, a) denotes the expected return that can be obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states, E(R(s_0) + γR(s_1) + γ²R(s_2) + ...) is the expected overall return obtained while the system runs, R(s_t) denotes the reward function at time t, and γ is the discount factor;
The purpose of Q-learning is to find the optimal policy π* that maximizes this expected return.
Preferably, define Q*(s, a) = Q^{π*}(s, a): it refers to the expected return that can be harvested by executing action a in state s and thereafter making every decision according to the optimal policy in all subsequent states. Assuming Q*(s, a) is known, π* can easily be generated from Q*(s, a): it suffices that π*(s) = argmax_a Q*(s, a) holds for each s. In this way, the problem of finding the optimal policy is transformed into finding Q*(s, a). Since:
Q*(s, a) = R(s_0) + γ·E(R(s_1) + γR(s_2) + ... | s_1, a_1)
where Q^π(s, a) denotes the expected return obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states, E(R(s_0) + γR(s_1) + γ²R(s_2) + ...) is the expected overall return obtained while the system runs, R(s_t) denotes the reward function at time t, and γ is the discount factor;
and a_1 is determined by π*, i.e. a_1 = π*(s_1), then:
a_1 denotes the action taken under the optimal policy, π*(s_1) denotes the optimal policy applied at state s_1, and Q^π(s, a) denotes the expected return obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states.
Then, according to the Bellman equation, the Q function can be found iteratively.
Preferably, the Bellman equation defines Q*(s, a) in recursive form, so that the Q function can be found iteratively; the Bellman equation is Q*(s, a) = R(s) + γ·E(max_{a'} Q*(s', a') | s, a), where R(s) is the reward function, γ is the discount factor, and the expectation is taken over the next state s' reached after executing action a in state s.
Preferably, the reward function is divided into three kinds: the first rewards the USV according to its distance from the target position; the second rewards the USV for reaching the target position; the third punishes the USV for colliding with an obstacle. An illustrative form is sketched below.
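The patent does not reproduce the explicit reward formula here; the following Python sketch only illustrates the three-part structure described above, and the specific constants (±1, ±10) and the distance measure are assumptions for illustration, not the patent's actual values.

```python
import math

def reward(usv_pos, target_pos, prev_dist, collided, reach_radius=0.5):
    """Illustrative three-part reward: distance shaping, goal bonus, collision penalty.
    Returns (reward, new_distance) so the caller can track prev_dist."""
    dist = math.dist(usv_pos, target_pos)
    if collided:                         # third kind: collision with an obstacle is punished
        return -10.0, dist
    if dist < reach_radius:              # second kind: reaching the target is rewarded
        return 10.0, dist
    # first kind: small reward related to the distance from the target
    return (1.0 if dist < prev_dist else -1.0), dist
```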
Preferably, in step f), the number of repeated rounds n ranges from 3000 to 5000.
Beneficial effects:
Compared with the prior art, the present invention has the following advantages:
1, the method for intensified learning of the invention solve water quality monitoring unmanned boat when unknown waters carries out water quality monitoring oneselfLeading bit path planning problem is fitted Q function by BP neural network, so that trained strategy can be according to currentThe real time information of barrier makes a policy in environment.
2. The method of the invention enables the water quality monitoring unmanned boat to plan a feasible path in an unknown environment according to different states, with a short decision time and a more optimized route, which satisfies the requirement of real-time online planning, overcomes the disadvantages of heavy computation and slow convergence of traditional Q-learning path planning methods, and allows problem waters to be monitored at the earliest possible time.
3. The invention fits the Q function in the Q-learning method with a neural network, enabling it to take continuous system states as input, and significantly improves the convergence speed of the network during training through experience replay and by setting a target network.
4. The invention improves traditional Q-learning and realizes Q-value iteration using a BP neural network; the output of the network corresponds to the Q value of each action, and the input of the network corresponds to the description of the environment state.
5. Through the design of the reward function, the invention returns different reward values for different situations, so that the USV learns and explores more efficiently.
Description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the simulation diagram of the complex-waters terrain;
Fig. 3 is the absolute error diagram of the distance between the actually reached point and the target point for the complex-waters terrain;
Fig. 4 is the simulation diagram of the simple concentric-circle maze;
Fig. 5 is the absolute error diagram of the distance between the actually reached point and the target point for the simple concentric-circle maze;
Fig. 6 is the simulation diagram of the complex maze;
Fig. 7 is the absolute error diagram of the distance between the actually reached point and the target point for the complex maze;
Fig. 8 is the simulation result diagram of the East Lake background;
Fig. 9 is the absolute error diagram of the distance between the actually reached point and the target point for the East Lake background;
Fig. 10 is the iteration-number diagram of the East Lake background.
Specific embodiment
The present invention will be further explained with reference to the accompanying drawings and examples.
Embodiment one:
The unmanned boat path planning method based on a Q-learning neural network of this embodiment comprises the following steps:
a) Initialize the memory block D;
b) Initialize the Q network and the initial values of the state and action; the Q network involves the following elements: S, A, P_{s,a}, R, where S denotes the set of system states the USV can be in, A denotes the set of actions the USV can take, P_{s,a} denotes the system state transition probability, and R denotes the reward function;
c) Set a training target at random;
d) Randomly select an action a_t, obtain the current reward r_t and the next-moment state s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in the memory block D;
e) Randomly sample a batch of data (s_t, a_t, r_t, s_{t+1}) from the memory block D for training; when the USV reaches the target position, or the maximum time of a round is exceeded, the state is regarded as the terminal state;
f) If s_{t+1} is not the terminal state, return to step d); if s_{t+1} is the terminal state, update the Q network parameters and return to step d); the algorithm ends after repeating n rounds;
g) Set the target and carry out path planning with the trained Q network until the USV reaches the target position.
Here, D is the experience replay memory block, used to store the training samples acquired during USV navigation. Because of experience replay, the multiple samples used in each training step are not consecutive in time, which minimizes the correlation between samples and enhances the stability and accuracy of training.
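A minimal sketch of such an experience replay memory block, written in Python for illustration (the class and method names are placeholders, not the patent's implementation):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size memory block D; the oldest transitions are discarded when full."""
    def __init__(self, capacity=40000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # store one transition (s_t, a_t, r_t, s_{t+1}) plus a terminal flag
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation between samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```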
Embodiment two:
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment one. The traditional Q-learning algorithm is as follows:
Q-learning describes the problem on the basis of a Markov decision process (Markov Decision Process). A Markov decision process contains 4 elements: S, A, P_{s,a}, R. S denotes the set of system states of the USV, i.e. the current state of the USV and the current state of the environment, such as the size and position of obstacles; A denotes the set of actions the USV can take, i.e. the rotation direction of the USV; P_{s,a} denotes the system model, i.e. the system state transition probability, where P(s'|s, a) describes the probability that the system reaches state s' after executing action a in the current state s; R denotes the reward function, which is determined by the current state and the action taken. Q-learning can be regarded as incremental planning that finds a strategy maximizing the overall evaluation. The idea of Q-learning is not to consider the environment model, but to directly optimize a Q function that can be computed iteratively. The function Q(s_t, a_t) is defined as the accumulated discounted reinforcement value obtained by executing action a_t in state s_t and thereafter executing the optimal action sequence, that is:
Q(s_t, a_t) = R(s_t) + γ·max_{a_{t+1}} Q(s_{t+1}, a_{t+1})   (1)
In the formula, s_t is the state of the USV at time t, s_{t+1} is the state of the USV at the next moment, a_t is the action executed at time t, γ is the discount factor with 0 ≤ γ ≤ 1, and R(s_t) is the reward function, whose value is positive or negative. In the initial stage of learning, the Q values may not correctly reflect the strategy they define, and the initial Q_0(s, a) is assumed and given for all states and actions. For a given environment state set S and a possible USV action set A with many choices, the amount of data is large, a large amount of system storage space is needed, and the result cannot be generalized. In order to overcome the above drawbacks, traditional Q-learning is improved: Q-value iteration is realized using a BP neural network, the output of the network corresponds to the Q value of each action, and the input of the network corresponds to the description of the environment state.
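A minimal sketch of such a state-to-Q-value network is given below in plain NumPy; the two hidden-layer sizes follow the 64/32 neurons mentioned in the experiments, while the state and action dimensions are assumed placeholders, not values fixed by the patent.

```python
import numpy as np

class QNetwork:
    """Small BP (feed-forward) network: input = environment state, output = one Q value per action."""
    def __init__(self, state_dim=4, n_actions=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (state_dim, 64)); self.b1 = np.zeros(64)
        self.W2 = rng.normal(0, 0.1, (64, 32));        self.b2 = np.zeros(32)
        self.W3 = rng.normal(0, 0.1, (32, n_actions)); self.b3 = np.zeros(n_actions)

    def forward(self, s):
        h1 = np.tanh(s @ self.W1 + self.b1)      # first hidden layer, 64 neurons
        h2 = np.tanh(h1 @ self.W2 + self.b2)     # second hidden layer, 32 neurons
        return h2 @ self.W3 + self.b3            # Q value of each action

    def greedy_action(self, s):
        return int(np.argmax(self.forward(s)))   # action with the largest predicted Q value
```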
Improved Q-learning path planning algorithm
The Q(λ) algorithm is generated by drawing on the TD(λ) algorithm. Through the idea of backtracking, it allows information to be passed back continuously, so that the action decision of a state is influenced by its succeeding states. If a certain future decision under π is a failed decision, the current decision will also bear the corresponding punishment, and this influence is appended to the current decision; if a certain future decision under π is a correct decision, it likewise influences the current decision, and the current decision is rewarded accordingly. This improvement increases the convergence speed of the algorithm and meets the practicality requirement of learning. The update rule of the improved Q(λ) algorithm is
Q(s_t, a_t) = Q(s_t, a_t) + α·δ'_t   (2)
where the function Q(s_t, a_t) is the value of executing action a_t in state s_t, α is the learning rate, and δ'_t is the TD(0) deviation; the 0 in TD(0) indicates looking 1 step ahead from the current state, and the deviation is:
δ'_t = R(s_t) + γ·V(s_{t+1}) − Q(s_t, a_t)   (3)
where γ is the discount factor, R(s) is the reward function, and V(s) is the value function, V(s) = max_a Q(s, a). Alternatively, the TD(0) deviation can also be defined as
δ_{t+1} = R(s_{t+1}) + γ·V(s_{t+2}) − V(s_{t+1})   (4)
where δ_{t+1} is the TD(0) deviation, R(s) is the reward function, V(s) is the value function, and the 0 in TD(0) indicates looking 1 step ahead from the current state.
Here another discount factor λ ∈ [0, 1] is also used to discount the TD deviations of future steps, giving the update Q(s_t, a_t) = Q(s_t, a_t) + α·δ_t^λ, where the function Q(s_t, a_t) is the value of executing action a_t in state s_t, α is the learning rate, and δ_t^λ is the TD(λ) deviation.
A new parameter λ is introduced here; without increasing the computational complexity, this new parameter makes it possible to comprehensively consider the predictions over all step numbers and, like the γ parameter before, it is used to control the weights. TD(λ) looks several steps ahead, weighted by λ, from the current state.
Here the TD(λ) deviation δ_t^λ is defined in terms of δ'_t, the deviation obtained from past learning, and δ_{t+i}, the deviations of the current and future learning steps, discounted by the discount factor γ and the factor λ, with λ ∈ [0, 1].
Embodiment three
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment two. As long as the future TD deviations are unknown, the above update cannot be carried out; however, they can be computed gradually by using eligibility traces. Below, η_t(s, a) is defined as an indicator function: it returns 1 if (s, a) occurs at time t and 0 otherwise. For simplicity, ignoring the learning efficiency, an eligibility trace e_t(s, a) is defined for each (s, a).
Then the online update at time t is
Q(s, a) = Q(s, a) + α[δ'_t·η_t(s, a) + δ_t·e_t(s, a)]   (8)
where Q(s, a) is the value of executing action a in state s, α is the learning rate, η_t(s, a) is the indicator function, e_t(s, a) is the eligibility trace, δ'_t is the deviation from past learning, and δ_t is the deviation learned now; the update is obtained from the deviation between the accumulated return R(s) and the current estimate V(s), multiplied by the learning rate.
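The following tabular Python sketch illustrates how an eligibility trace can implement this kind of backward-view update; the recursive trace decay e ← γλ·e + η is a standard assumption, since the patent's explicit trace formula is not reproduced legibly here, and the sketch is not the patented implementation.

```python
import numpy as np

def q_lambda_step(Q, e, s, a, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One online Q(lambda) update on a tabular Q and an eligibility trace e (same shape as Q)."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # one-step TD deviation
    e *= gamma * lam                                  # decay all traces (assumed recursive form)
    e[s, a] += 1.0                                    # indicator eta_t(s, a) for the visited pair
    Q += alpha * delta * e                            # spread the deviation along the trace
    return Q, e
```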
Embodiment four:
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment three. Reinforcement learning aims to maximize the expected overall return obtained while the system runs; for this purpose an optimal policy π must be found such that, when the USV makes decisions and acts according to π, the total return obtained is maximal. In general, the objective function of reinforcement learning is one of the following:
V^π(s) = E(R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π)
Q^π(s, a) = E(R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, a_0 = a, π)   (9)
where V^π(s) denotes the expected return that can be obtained by acting according to policy π starting from the current initial state s, Q^π(s, a) denotes the expected return that can be obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states, E(R(s_0) + γR(s_1) + γ²R(s_2) + ...) is the expected overall return obtained while the system runs, R(s_t) denotes the reward function at time t, and γ is the discount factor.
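As a small worked illustration of this discounted-return objective (not part of the patent), the sum inside E(·) can be computed for a finite reward sequence as follows:

```python
def discounted_return(rewards, gamma=0.9):
    """Return R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ... for a finite episode."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total

# e.g. three steps of reward 1 followed by a goal bonus of 10:
# discounted_return([1, 1, 1, 10]) == 1 + 0.9 + 0.81 + 7.29 == 10.0
```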
The purpose of Q-learning is to find the optimal policy π* that maximizes this expected return.
Embodiment five:
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment four. Define Q*(s, a) = Q^{π*}(s, a): it refers to the expected return that can be harvested by executing action a in state s and thereafter making every decision according to the optimal policy in all subsequent states. Assuming Q*(s, a) is known, π* can easily be generated from Q*(s, a): it suffices that π*(s) = argmax_a Q*(s, a) holds for each s. In this way, the problem of finding the optimal policy is transformed into finding Q*(s, a). Since:
Q*(s, a) = R(s_0) + γ·E(R(s_1) + γR(s_2) + ... | s_1, a_1)   (10)
where Q^π(s, a) denotes the expected return obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states, E(R(s_0) + γR(s_1) + γ²R(s_2) + ...) is the expected overall return obtained while the system runs, R(s_t) denotes the reward function at time t, and γ is the discount factor;
and a_1 is determined by π*, i.e. a_1 = π*(s_1), then:
where a_1 denotes the action taken under the optimal policy, π*(s_1) denotes the optimal policy applied at state s_1, and Q^π(s, a) denotes the expected return obtained by taking action a in the current state s and thereafter acting according to policy π in all subsequent states. Then, according to the Bellman equation, the Q function can be found iteratively.
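The step from Q* to π* is just a greedy argmax over actions; a minimal Python illustration (the function name is a placeholder, not the patent's):

```python
import numpy as np

def optimal_policy_from_q(q_values):
    """Given Q*(s, a) for one state s as a vector over actions, return pi*(s) = argmax_a Q*(s, a)."""
    return int(np.argmax(q_values))

# Example: if Q*(s, .) = [0.2, 1.5, -0.3], the optimal policy chooses action index 1.
```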
Embodiment six
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment five. The Bellman equation defines Q*(s, a) in recursive form, so that the Q function can be found iteratively; the Bellman equation is Q*(s, a) = R(s) + γ·E(max_{a'} Q*(s', a') | s, a), where R(s) is the reward function, γ is the discount factor, and the expectation is taken over the next state s' reached after executing action a in state s.
In the traditional Q-learning algorithm, the Q function is stored and updated in the form of a table, but in USV obstacle-avoidance path planning, since obstacles may appear at any position in space, it is difficult for a Q function in table form to describe obstacles appearing in a continuous space. Therefore, on the basis of Q-learning, deep Q-learning is used here: the Q function is fitted with a BP neural network, and the input state s is a continuous variable. In general, the learning process is difficult to converge when the Q function is approximated with a nonlinear function, so experience replay and a target network are used to improve the stability of learning.
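A minimal sketch of how a target network is used in one training step, reusing the illustrative QNetwork and ReplayMemory placeholders sketched earlier; the gradient step that minimizes the squared error (Q(s, a) − y)² by backpropagation is not shown, and none of this is the patented implementation.

```python
import numpy as np

def td_targets(batch, q_target_net, gamma=0.9):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a mini-batch; terminal states keep only r."""
    targets = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(q_target_net.forward(s_next))
        targets.append((s, a, y))
    return targets

def sync_target(q_net, q_target_net):
    """Periodically copy the online network's weights into the target network."""
    for name in ("W1", "b1", "W2", "b2", "W3", "b3"):
        setattr(q_target_net, name, getattr(q_net, name).copy())
```

Freezing the target network between synchronizations keeps the regression target y fixed for a while, which is what stabilizes learning when the Q function is approximated by a nonlinear network.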
Embodiment seven:
The unmanned boat path planning method based on a Q-learning neural network of this embodiment is based on embodiment six. In reinforcement learning, the design of the reward function directly affects the quality of the learning result. In general, the reward function corresponds to a person's description of the task, and prior knowledge for solving the task can be incorporated into learning through the design of the reward function. In USV path planning, the hope is that during navigation the USV reaches the target position as early as possible while safely avoiding collisions with obstacles. Here the reward function is divided into three kinds: the first rewards the USV according to its distance from the target position; the second rewards the USV for reaching the target position; the third punishes the USV for colliding with an obstacle.
In terms of magnitude, the second and third reward values are larger than the first, because for the USV obstacle-avoidance task the main goals are to avoid obstacles and reach the target position, rather than merely to shorten the distance between the USV and the target position. The reason the first reward is added is that, if the USV were only rewarded for reaching the target position and punished for hitting an obstacle, a large number of steps during motion would have a reward of 0; in most cases this would prevent the USV from improving its strategy and make learning inefficient. Adding this reward is equivalent to incorporating human prior knowledge of the task, so that the USV learns and explores more efficiently.
Embodiment eight:
In order to test the path planning algorithm designed here, simulation experiments were carried out on Matlab2014a software. In the experiments, the simulation environment is a 20*20 region, the discount factor γ is 0.9, the size of the memory block D is set to 40000, the number of cycles is 1000, the first layer of the neural network has 64 neurons, and the second layer has 32 neurons. In each round of training, whenever the USV hits an obstacle or reaches the target position, the round ends immediately and a reward is returned.
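Collected as a small Python configuration sketch for reference (the values are taken from the description above; the key names themselves are illustrative assumptions):

```python
# Simulation settings reported in the experiments (key names are illustrative only)
SIM_CONFIG = {
    "grid_size": (20, 20),        # simulated region
    "discount_factor": 0.9,       # gamma
    "replay_capacity": 40000,     # size of memory block D
    "episodes": 1000,             # number of training cycles
    "hidden_layers": (64, 32),    # neurons in the first and second network layers
}
```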
In order to verify the accuracy of the method presented here, maze terrains are used for the tests. Three different terrains are designed for algorithm comparison: a complex-waters terrain (as shown in Fig. 2), a simple concentric-circle maze terrain (as shown in Fig. 4) and a complex maze terrain (as shown in Fig. 6). The improved algorithm presented here and the traditional Q-learning algorithm are simulated on the above terrains. From the path diagrams it can be seen that the route of the improved algorithm, shown in blue, is shorter and more direct than the route simulated by the traditional Q-learning algorithm. From the absolute error diagrams of the distance between the actually reached point and the target point, it can be seen that the improved algorithm converges and stabilizes one third earlier than the traditional Q-learning algorithm.
Embodiment nine:
An experimental simulation is carried out taking the actual environment of the East Lake waters in Lin'an as the background. It can be seen from Fig. 8 that during the simulation the USV never collides with an obstacle and the path is simple and fast. Fig. 9 is the standard error curve and Fig. 10 is the learning curve. It can be seen from these figures that when the number of training iterations reaches 56, the curves tend to be stable, indicating that a safe and efficient overall route has basically been planned and that at this point the USV can avoid obstacles and reach the target position in most cases. It can therefore be concluded that the improved Q-learning algorithm based on the BP neural network converges faster and produces a more optimized path than the traditional Q-learning algorithm.
The above is only a preferred embodiment of the present invention. It should be pointed out that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.