Data-driven adaptive dynamic programming air combat decision method
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a data-driven adaptive dynamic programming air combat decision method.
Background
Air combat decision making for an unmanned combat aerial vehicle aims to let the vehicle gain the advantage, or escape from a disadvantage, during an engagement; the key research problem is to design an efficient autonomous decision mechanism. Autonomous decision making for the unmanned combat aerial vehicle is a mechanism that formulates a tactical plan or selects flight actions in real time according to the actual combat environment, and the sophistication of this decision mechanism reflects the intelligence level of the unmanned combat aerial vehicle in modern air combat. The inputs of the autonomous decision mechanism are the various parameters related to air combat, such as the flight parameters of the aircraft, weapon parameters, three-dimensional scene parameters, and the relative relationship between friendly and enemy aircraft; the decision process is the information processing and computation inside the system; and the output is a tactical plan or specific flight actions.
Adaptive dynamic programming integrates the ideas of dynamic programming and reinforcement learning: it inherits the advantages of dynamic programming while overcoming the curse of dimensionality that dynamic programming suffers from. Its principle is to approximate the performance function and the control policy of traditional dynamic programming with function approximation structures, and to obtain the optimal value function and control policy by means of reinforcement learning, so as to satisfy the Bellman optimality principle. The idea of adaptive dynamic programming is illustrated in fig. 1.
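For reference, the optimality condition that adaptive dynamic programming approximates can be written as the standard continuous-time Bellman (Hamilton-Jacobi-Bellman) equation below; this is textbook background, not an equation quoted from the invention:

```latex
% Standard continuous-time HJB optimality condition approximated by ADP:
% a critic network approximates V^*, an actor network approximates u^*.
0 = \min_{u}\left[ r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \right],
\qquad
u^{*}(x) = \arg\min_{u}\left[ r(x,u) + \nabla V^{*}(x)^{\top} f(x,u) \right].
```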
Air combat decision making is a complex task involving a large amount of information and many variables, which makes conventional manually crafted decision rules difficult to adapt to a changing battlefield environment. Existing air combat decision methods therefore often exhibit the following problems:
1. Static planning methods cannot cope with dynamic environments: conventional decision methods are generally based on preset rules or models and struggle to adapt to a battlefield environment and enemy situation that change in real time.
2. Manual decision making requires substantial time and effort: the decision process must handle a large amount of information and many variables, consumes considerable time and effort, and is prone to omission and misjudgment.
3. Lack of comprehensive consideration and flexible responsiveness: conventional decision methods typically decide on the basis of a single factor or a small number of factors; it is difficult to weigh multiple factors comprehensively and respond flexibly, which may lead to biased or inaccurate decisions.
4. Inability to meet the needs of informationized warfare: the modern air combat environment involves a large volume of rapidly changing information, and the traditional approach of manually crafting decision rules cannot meet the requirements of informationized warfare.
Disclosure of Invention
The invention aims to provide a data-driven adaptive dynamic programming air combat decision method, which mainly solves the problem that traditional manually crafted decision rules are difficult to adapt to a continuously changing battlefield environment.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A data-driven adaptive dynamic programming air combat decision method comprises the following steps:
S1, assuming that the two combat unmanned aerial vehicles are a red unmanned aerial vehicle and a blue unmanned aerial vehicle, and establishing pursuit-evasion problem system models for the red-pursuit/blue-escape and red-escape/blue-pursuit problems, respectively;
S2, solving the unmanned aerial vehicle pursuit-evasion problem by model-free adaptive dynamic programming, and improving the policy by a bounded exploration signal;
S3, acquiring the real-time control laws of the red and blue unmanned aerial vehicles by an offline neural-network model training algorithm, and collecting the state information of the red and blue unmanned aerial vehicles in real time;
And S4, updating the neural network online through an online model training algorithm, realizing the adaptive dynamic programming air combat decision of the red and blue unmanned aerial vehicles in the pursuit-evasion problem.
Further, in the invention, the red-pursuit/blue-escape problem model is established as follows:
The real-time position of the red unmanned aerial vehicle is Xr(t) and the position of the blue unmanned aerial vehicle is Xb(t); the position difference between the two sides is then:
e=Xb(t)-Xr(t) (1)
the tracking-error system is:
ė(t)=Ẋb(t)-Ẋr(t) (2)
wherein ė denotes the time derivative of the position difference e, and Ẋb(t), Ẋr(t) denote the time derivatives of the real-time positions Xb(t), Xr(t) of the blue and red unmanned aerial vehicles, respectively;
assuming that the red pursuer can only measure the three-dimensional movement velocity of the blue unmanned aerial vehicle, equation (2) can be written specifically as:
The system model of the red side pursuing the blue side is expressed as:
wherein Vr is the speed of the red unmanned aerial vehicle, in Mach; χr is its heading angle, in radians; γr is its track inclination angle, in radians; the corresponding time derivatives appear on the left-hand sides of the model; the distance errors ex, ey, ez are in km; g is the gravitational acceleration; Vc is the speed of sound; and nx, ny, nz are the overload control quantities of the red unmanned aerial vehicle.
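Since equations (3)-(4) are not reproduced above, the following sketch illustrates what such a pursuit model can look like in code, assuming standard three-degree-of-freedom point-mass kinematics for the state [Vr, χr, γr, ex, ey, ez]; the exact derivative expressions, constants, and function names here are illustrative assumptions, not the patent's equations.

```python
import numpy as np

G = 9.81    # gravitational acceleration, m/s^2
VC = 340.0  # speed of sound, m/s (the speed state Vr is in Mach)

def pursuit_dynamics(x, u, vb):
    """Sketch of the red-pursuit tracking-error dynamics.
    x  = [Vr, chi_r, gamma_r, ex, ey, ez] (Mach, rad, rad, km, km, km)
    u  = [nx, ny, nz] overload controls
    vb = measured blue velocity vector, km/s (the only blue quantity
         assumed observable, per the measurement assumption above)."""
    Vr, chi, gam, ex, ey, ez = x
    nx, ny, nz = u
    v = Vr * VC                              # red true airspeed, m/s
    dVr = G * (nx - np.sin(gam)) / VC        # Mach/s (assumed form)
    dchi = G * ny / (v * np.cos(gam))        # rad/s (assumed form)
    dgam = G * (nz - np.cos(gam)) / v        # rad/s (assumed form)
    # error e = Xb - Xr, hence de/dt = Vb - (red velocity vector)
    vr_vec = v * np.array([np.cos(gam) * np.cos(chi),
                           np.cos(gam) * np.sin(chi),
                           np.sin(gam)]) / 1000.0   # km/s
    de = vb - vr_vec
    return np.concatenate(([dVr, dchi, dgam], de))
```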
Further, in the invention, the red-escape/blue-pursuit problem model is established as follows:
A virtual-displacement method is adopted to minimize the distance between the own aircraft's reversed displacement and the enemy aircraft, which achieves the effect of maximizing the distance between the own position and the enemy position; the virtual displacement is the displacement produced by the virtual velocity V', namely:
The system model of red-escape/blue-pursuit is expressed as:
wherein the symbols and their meanings are the same as in the pursuit problem.
Further, in the invention, the red-pursuit/blue-escape problem system model is processed as follows:
S11, the nonlinear continuous state-space equation of the unmanned aerial vehicle is abbreviated as:
wherein x=[Vr,χr,γr,ex,ey,ez]T represents the red aircraft state vector, with its time derivative on the left-hand side; u=[nx,ny,nz]T represents the red aircraft control vector; and F(x), G(x) are respectively:
S12, defining a performance index function as:
wherein Q(x,t) is an index function related to the state, and R(u,t) is an index function related to the control quantity;
s13, establishing an angle dominance function of the unmanned aerial vehicle, and setting a speed vector of the unmanned aerial vehicle of the red party as follows:
V=[cosγrcosχr,cosγrsinχr,sinγr]T,
the blue-side unmanned aerial vehicle speed vector is:
Vb=[cosγbcosχb,cosγbsinχb,sinγb]T,
The distance vector from the red unmanned aerial vehicle to the blue unmanned aerial vehicle is erb=[ex,ey,ez]T, and the geometric relationship is as follows:
Obtaining an angle dominance function:
Qα=cαr+(1-c)αb (9)
wherein c=(αr+αb)/(2π);
s14, defining a distance dominance function as follows:
Qd=eTQ1e (10)
wherein e=[ex,ey,ez]T and Q1 is a positive-definite weight matrix;
The state index function of the red side can be expressed as:
Q(x,t)=Qd+Q2Qα (11)
Wherein Q2 is a weight coefficient;
S15, defining a controller index function as:
R(u,t)=(u-u0)TR(u-u0) (12)
wherein R is the control-quantity weight matrix, and u0=[sinγr,0,cosγr]T is the control quantity of the unmanned aerial vehicle in steady flight.
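As a concrete illustration of the index functions in S13-S15, the sketch below assembles the angle, distance, and control cost terms of equations (9)-(12); constructing αr and αb as arccos angles between the velocity directions and the line of sight is an assumption inferred from fig. 3, and the weights Q1, Q2, R are illustrative.

```python
import numpy as np

def state_cost(vr_dir, vb_dir, e, Q1, Q2):
    """Q(x,t) = Qd + Q2*Qalpha, per equations (9)-(11).
    vr_dir, vb_dir: unit velocity-direction vectors of red and blue;
    e = [ex, ey, ez], the red-to-blue distance vector.
    The arccos construction of the angles is an assumed reading of fig. 3."""
    los = e / np.linalg.norm(e)
    alpha_r = np.arccos(np.clip(vr_dir @ los, -1.0, 1.0))
    alpha_b = np.arccos(np.clip(vb_dir @ los, -1.0, 1.0))
    c = (alpha_r + alpha_b) / (2 * np.pi)      # dynamic weight of eq. (9)
    Q_alpha = c * alpha_r + (1 - c) * alpha_b  # angle dominance, eq. (9)
    Q_d = e @ Q1 @ e                           # distance dominance, eq. (10)
    return Q_d + Q2 * Q_alpha                  # state index, eq. (11)

def control_cost(u, gamma_r, R):
    """R(u,t) = (u - u0)^T R (u - u0), eq. (12), penalizing deviation
    from the steady-flight control u0 = [sin(gamma_r), 0, cos(gamma_r)]."""
    du = np.asarray(u) - np.array([np.sin(gamma_r), 0.0, np.cos(gamma_r)])
    return du @ R @ du
```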
Further, the specific implementation method of the step S2 in the present invention is as follows:
A bounded exploration signal ue is defined, and the system model of the red unmanned aerial vehicle, equation (5), is rewritten as:
The performance index function is:
The derivative of the performance index function (7) with respect to time is expressed as:
When the performance index function (7) attains its minimum value, the following Bellman equation is satisfied:
wherein R(j)=Q(x,t)+R(u,t); combining equations (17) and (18), it is possible to obtain:
The optimal control quantity of the real system is as follows:
Solving for G(x) from equation (20) and substituting into equation (19) yields:
Integrating both ends of equation (21) from t0 to t yields:
A neural network is employed to approximate the cost function and the control input, namely:
wherein Wc and Wa are the ideal neural-network weights of the evaluation network and the execution network, respectively; L1 and L2 are the numbers of hidden-layer neurons of the evaluation network and the execution network, respectively; φc and φa denote the activation functions of the two networks; and εc and εa denote their reconstruction errors;
Let the estimated outputs of the evaluation network and the execution network be:
wherein Ŵc and Ŵa are the estimated values of the ideal neural-network weights Wc, Wa, respectively; substituting equation (24) into equation (22) yields the following residual errors:
wherein the control quantity obtained by policy improvement has the expression:
wherein Ω is the exploration set of control quantities, obtained by adding bounded random exploration signals; the evaluation-network weight estimate Ŵc is optimized by a least-squares algorithm, namely:
The execution-network weight estimate Ŵa is likewise optimized by the least-squares algorithm:
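The policy-improvement and least-squares steps of equations (26)-(28) can be sketched as follows; the way the exploration set Ω is sampled, the regressor matrix, and the use of numpy.linalg.lstsq are illustrative assumptions, since the exact regressors of equations (27)-(28) are not reproduced above.

```python
import numpy as np

def improve_policy(u_nominal, cost_to_go, bound, n_candidates=64, rng=None):
    """Eq. (26) sketch: form the exploration set Omega by adding bounded
    random exploration signals to the current actor output, then keep the
    candidate control with the smallest estimated cost-to-go."""
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-bound, bound, size=(n_candidates, len(u_nominal)))
    omega = u_nominal + noise                      # exploration set Omega
    costs = [cost_to_go(u) for u in omega]
    return omega[int(np.argmin(costs))]

def least_squares_weights(Phi, targets):
    """Eqs. (27)-(28) sketch: fit network weights W minimizing
    ||Phi @ W - targets||^2 over the collected data batch."""
    W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return W
```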
Further, in step S3 of the present invention, the offline neural-network model training algorithm comprises the following steps:
S31: by giving different initial states, a data set {xk(t0)} is obtained, and the network weights are initialized with the iteration index j = 0;
S32: obtain the control quantities corresponding to the states according to equation (26), forming the state-control data set;
S33: using the data set, update the evaluation-network weights according to equation (27) and the execution-network weights according to equation (28);
S34: terminate the algorithm if the weight changes of both networks fall below the convergence accuracies ∈a and ∈c; otherwise set j = j + 1 and return to step S32.
Further, in step S4, the neural network is updated online by the online model training algorithm as follows:
S41: the current neural-network weights are Wc, Wa and the online learning rate is α; a real-time data set {x(t), u(t)} is obtained by sampling at a fixed time interval δt, and step S42 is executed after several groups of data have been collected;
S42: obtain the control quantities corresponding to the states according to equation (26), forming the state-control data set;
S43: using the data set, compute the evaluation-network update according to equation (27) and the execution-network update according to equation (28);
S44: update the neural-network weights online with learning rate α, then return to step S41.
Drawings
Fig. 1 is a diagram of a prior art adaptive dynamic programming architecture.
FIG. 2 is a flow chart of the present invention.
Fig. 3 is a schematic view of the angular advantage of the unmanned aerial vehicle according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of virtual displacement principle in an embodiment of the present invention.
Detailed Description
The invention is further illustrated below through the description of an embodiment; the invention includes, but is not limited to, this embodiment.
As shown in fig. 2, in the data-driven adaptive dynamic programming air combat decision method disclosed by the invention, a pursuit-evasion scenario contains one pursuer and one evader; in this embodiment, the two sides are denoted the red side and the blue side. The problem is described here in terms of red pursuit and blue escape: the red unmanned aerial vehicle reduces its distance to the blue unmanned aerial vehicle through maneuvering while avoiding capture by the blue unmanned aerial vehicle, i.e., avoiding being pointed at by the nose of the blue unmanned aerial vehicle and thereby falling into an inferior situation.
In this embodiment, pursuit-evasion problem system models are established for the red-pursuit/blue-escape problem and the red-escape/blue-pursuit problem, respectively.
First, denote the real-time position of the red unmanned aerial vehicle as Xr(t) and the position of the blue unmanned aerial vehicle as Xb(t); the position difference between the two sides is then:
e=Xb(t)-Xr(t) (1)
the tracking-error system is:
ė(t)=Ẋb(t)-Ẋr(t) (2)
wherein ė denotes the time derivative of the position difference e, and Ẋb(t), Ẋr(t) denote the time derivatives of the real-time positions Xb(t), Xr(t) of the blue and red unmanned aerial vehicles, respectively;
assuming that the red pursuer can only measure the three-dimensional movement velocity of the blue unmanned aerial vehicle, equation (2) can be written specifically as:
The system model of the red side pursuing the blue side is expressed as:
wherein Vr is the speed of the red unmanned aerial vehicle, in Mach; χr is its heading angle, in radians; γr is its track inclination angle, in radians; the corresponding time derivatives appear on the left-hand sides of the model; the distance errors ex, ey, ez are in km; g is the gravitational acceleration; Vc is the speed of sound; and nx, ny, nz are the overload control quantities of the red unmanned aerial vehicle, which are subject to saturation.
For convenience of description, the nonlinear continuous state space equation of the unmanned aerial vehicle is abbreviated as
where x=[Vr,χr,γr,ex,ey,ez]T represents the red aircraft state vector, with its time derivative on the left-hand side; u=[nx,ny,nz]T represents the red aircraft control vector; and F(x), G(x) are respectively:
Because the unmanned aerial vehicle pursuit-evasion problem is a nonlinear optimal control problem with actuator saturation, the performance index function is defined as follows:
wherein Q(x,t) is an index function related to the state, and R(u,t) is an index function related to the control quantity;
Next, the angle dominance function of the unmanned aerial vehicle is established. Let the velocity vector of the red unmanned aerial vehicle be:
Vr=[cosγrcosχr,cosγrsinχr,sinγr]T,
the blue-side unmanned aerial vehicle speed vector is:
Vb=[cosγbcosχb,cosγbsinχb,sinγb]T,
The distance vector from the red unmanned aerial vehicle to the blue unmanned aerial vehicle is erb=[ex,ey,ez]T, as shown in fig. 3, and the geometric relationship is as follows:
During air combat, it is desirable that αr and αb be as small as possible so that the red side gains the angular situational advantage. Taking the red side as an example, when αr-(π-αb) < 0, i.e. αr+αb < π, the red side's attack angle is dominant; conversely, if αr+αb > π, the red side's attack angle is at a disadvantage; when αr+αb = π, the attack angles are in equilibrium. The angle dominance function is set as:
Qα=cαr+(1-c)αb (9)
wherein c=(αr+αb)/(2π); the optimization priority between the angles αr and αb can be dynamically adjusted through the weight c: when c < 0.5, the red side's attack angle is dominant, and αb is optimized with emphasis to prevent the blue side from gaining an advantageous angular situation; when c > 0.5, the red side's attack angle is at a disadvantage, and αr should be optimized with emphasis so that the red side gains an advantageous angular situation.
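A small worked example of this weighting, with illustrative angle values:

```python
import numpy as np

# Illustrative values: red is better aligned with the line of sight.
alpha_r, alpha_b = np.pi / 6, np.pi / 3    # 30 and 60 degrees
c = (alpha_r + alpha_b) / (2 * np.pi)      # = 0.25 < 0.5: red dominant
Q_alpha = c * alpha_r + (1 - c) * alpha_b  # eq. (9); emphasis falls on alpha_b
print(c, Q_alpha)                          # 0.25, ~0.916 rad
```

Because c < 0.5 here, the larger weight 1 - c = 0.75 multiplies αb, matching the rule above that the blue angle is optimized with emphasis when the red side holds the angular advantage.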
In the pursuit problem, the goal of the red side is to shorten the distance to the blue side; the distance dominance function is therefore defined as:
Qd=eTQ1e (10)
wherein e=[ex,ey,ez]T and Q1 is a positive-definite weight matrix;
The state index function of the red side can then be expressed as:
Q(x,t)=Qd+Q2Qα (11)
wherein Q2 is a weight coefficient;
In order to meet the control-limit requirement and keep the unmanned aerial vehicle stable in steady flight, the controller index function is defined as follows:
R(u,t)=(u-u0)TR(u-u0) (12)
wherein R is the control-quantity weight matrix, and u0=[sinγr,0,cosγr]T is the control quantity of the unmanned aerial vehicle in steady flight.
For the establishment of the red-escape/blue-pursuit problem model, the escape problem differs from the pursuit problem in that the objective function is the opposite: to maximize the distance between the two aircraft. Meanwhile, in order to evade a missile, when the distance between the unmanned aerial vehicle and the missile becomes small, the unmanned aerial vehicle needs to change its heading and climb angle with large maneuvers. To handle the distance-maximization objective, a virtual-displacement method is adopted: minimizing the distance between the own aircraft's reversed displacement and the enemy aircraft achieves the effect of maximizing the distance between the own position and the enemy position.
As shown in fig. 4, the own aircraft is pursued by the enemy aircraft and aims to maximize the distance between them. A virtual velocity V' is taken opposite in direction to the own velocity vector V, and the distance between the virtual displacement and the enemy aircraft is minimized. The virtual displacement is the displacement produced by V', namely:
The system model of red-escape/blue-pursuit is expressed as:
wherein the symbols and their meanings are the same as in the pursuit problem.
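To make the virtual-displacement idea concrete, the sketch below converts the escape objective into an equivalent pursuit problem by tracking a point displaced along the reversed own-velocity vector; the construction details (time step, error definition) are illustrative assumptions consistent with fig. 4.

```python
import numpy as np

def escape_as_pursuit_error(x_own, v_own, x_enemy, dt):
    """Fig. 4 sketch: replace 'maximize distance to the enemy' with
    'minimize distance between a virtual displacement and the enemy'.
    The virtual velocity V' = -V generates the virtual displacement, so
    the escape problem reuses the pursuit machinery unchanged."""
    v_virtual = -np.asarray(v_own)            # reversed velocity V'
    x_virtual = np.asarray(x_own) + v_virtual * dt
    return np.asarray(x_enemy) - x_virtual    # error driven to zero
```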
Generally, an accurate unmanned aerial vehicle system model cannot be obtained in actual operation; moreover, existing data-based model-free adaptive dynamic programming depends heavily on the data, and policy improvement cannot be performed on the basis of the existing data. Therefore, this embodiment solves the unmanned aerial vehicle pursuit-evasion problem with model-free adaptive dynamic programming and improves the policy with bounded exploration signals.
A bounded exploration signal ue is defined, and the system model of the red unmanned aerial vehicle, equation (5), is rewritten as:
The performance index function is:
The derivative of the performance index function (7) with respect to time is expressed as:
When the performance index function (16) attains its minimum value, the following Bellman equation is satisfied:
wherein R(j)=Q(x,t)+R(u,t); combining equations (17) and (18), it is possible to obtain:
The optimal control quantity of the real system is as follows:
Solving for G(x) from equation (20) and substituting into equation (19) yields:
Integrating both ends of equation (21) from t0 to t yields:
A neural network is employed to approximate the cost function and the control input, namely:
wherein Wc and Wa are the ideal neural-network weights of the evaluation network and the execution network, respectively; L1 and L2 are the numbers of hidden-layer neurons of the evaluation network and the execution network, respectively; φc and φa denote the activation functions of the two networks; and εc and εa denote their reconstruction errors.
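The evaluation (critic) and execution (actor) approximators of equations (23)-(24) can be sketched as single-hidden-layer networks, as below; the tanh basis, the fixed random hidden layers, and the layer sizes are illustrative assumptions.

```python
import numpy as np

class CriticActor:
    """Sketch of eqs. (23)-(24): V(x) ≈ Wc^T phi_c(x), u(x) ≈ Wa^T phi_a(x).
    The tanh hidden layers and sizes L1, L2 are assumptions."""
    def __init__(self, n_state=6, n_ctrl=3, L1=20, L2=20, seed=0):
        rng = np.random.default_rng(seed)
        self.Pc = rng.standard_normal((L1, n_state))  # fixed critic basis
        self.Pa = rng.standard_normal((L2, n_state))  # fixed actor basis
        self.Wc = np.zeros(L1)                        # evaluation weights
        self.Wa = np.zeros((L2, n_ctrl))              # execution weights

    def value(self, x):
        """Evaluation (critic) network output, eq. (23)."""
        return self.Wc @ np.tanh(self.Pc @ x)

    def control(self, x):
        """Execution (actor) network output, eq. (24)."""
        return self.Wa.T @ np.tanh(self.Pa @ x)
```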
Let the estimated outputs of the evaluation network and the execution network be:
wherein Ŵc and Ŵa are the estimated values of the ideal neural-network weights Wc, Wa, respectively; substituting equation (24) into equation (22) yields the following residual errors:
wherein the control quantity obtained by policy improvement has the expression:
wherein Ω is the exploration set of control quantities, obtained by adding bounded random exploration signals; the evaluation-network weight estimate Ŵc is optimized by a least-squares algorithm, namely:
The execution-network weight estimate Ŵa is likewise optimized by the least-squares algorithm:
In this embodiment, an offline neural-network model training algorithm is adopted to obtain the real-time control laws of the red and blue unmanned aerial vehicles, and the control-law information and the state information of the red and blue unmanned aerial vehicles are collected in real time. The algorithm specifically comprises the following steps:
S31: by giving different initial states, a data set {xk(t0)} is obtained, and the network weights are initialized with the iteration index j = 0;
S32: obtain the control quantities corresponding to the states according to equation (26), forming the state-control data set;
S33: using the data set, update the evaluation-network weights according to equation (27) and the execution-network weights according to equation (28);
S34: terminate the algorithm if the weight changes of both networks fall below the convergence accuracies ∈a and ∈c; otherwise set j = j + 1 and return to step S32.
In this embodiment, the neural network is updated online at fixed intervals through an online model training algorithm, realizing the adaptive dynamic programming air combat decision of the red and blue unmanned aerial vehicles in the pursuit-evasion problem. The algorithm specifically comprises the following steps:
S41: the current neural-network weights are Wc, Wa and the online learning rate is α; a real-time data set {x(t), u(t)} is obtained by sampling at a fixed time interval δt, and step S42 is executed after several groups of data have been collected;
S42: obtain the control quantities corresponding to the states according to equation (26), forming the state-control data set;
S43: using the data set, compute the evaluation-network update according to equation (27) and the execution-network update according to equation (28);
S44: update the neural-network weights online with learning rate α, then return to step S41.
Through the above method, the capability of online adaptive policy adjustment is improved, and the adaptability of unmanned aerial vehicle air combat decision making in different scenarios is enhanced. The method does not depend on an aircraft system model, has strong generalization capability, and can be extended to the control of other equipment, such as unmanned ground vehicles and robotic manipulators. The invention thus provides a significant and substantial improvement over the prior art.
The above embodiment is only one of the preferred embodiments of the present invention and should not be used to limit the scope of protection of the present invention; any insubstantial modification or change made within the main design concept and spirit of the present invention that solves the same technical problem remains within the scope of protection of the present invention.