Technical Field
The present invention belongs to the technical field of unmanned aerial vehicles (UAVs), and in particular relates to a data-driven adaptive dynamic programming method for air combat decision-making.
Background Art
The purpose of decision-making for an unmanned combat aerial vehicle is to enable it to gain the advantage in combat, or to turn a disadvantage into an advantage; the key research problem is therefore the design of an efficient autonomous decision-making mechanism. Autonomous decision-making for unmanned combat aerial vehicles concerns how tactical plans are formulated or flight actions are selected in real time according to the actual combat environment, and the quality of this mechanism reflects the level of intelligence of the aircraft in modern air combat. The inputs of the autonomous decision-making mechanism are the various parameters related to the air combat, such as the flight parameters of the aircraft, weapon parameters, three-dimensional scene parameters, and the relative situation of the two sides; the decision-making process is the information processing and computation carried out inside the system; and the output is the tactical plan or the specific flight actions produced by the decision.
Adaptive dynamic programming (ADP) combines the ideas of dynamic programming and reinforcement learning: it inherits the advantages of dynamic programming while overcoming the "curse of dimensionality" from which dynamic programming suffers. The principle of adaptive dynamic programming is to use function approximation structures to approximate the performance function and the control policy of traditional dynamic programming, and to use the idea of reinforcement learning to obtain the optimal value function and control policy satisfying the Bellman optimality principle. The idea of adaptive dynamic programming is illustrated in FIG. 1.
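To make the idea concrete, the following Python sketch shows a generic actor-critic ADP sweep on a discrete-time surrogate problem. It is illustrative only: the feature maps phi_c/phi_a, the step model f, the stage cost r, the discount factor and the candidate action grid are all assumptions for this sketch and are not the specific structures defined later in this document.

```python
import numpy as np

def adp_sweep(states, f, r, phi_c, phi_a, Wc, u_candidates, gamma=0.99):
    """One value-iteration-style sweep with function approximation.

    Critic:  V(x) ~ Wc @ phi_c(x)   (approximate performance function)
    Actor:   u(x) ~ phi_a(x) @ Wa   (approximate control policy)
    """
    Phi_c, targets = [], []
    Phi_a, best_us = [], []
    for x in states:
        # Policy improvement: greedy action over a sampled candidate set,
        # scored by one-step cost plus the critic's estimate of future cost.
        scores = [r(x, u) + gamma * (Wc @ phi_c(f(x, u))) for u in u_candidates]
        u_star = u_candidates[int(np.argmin(scores))]
        # Policy evaluation target for the critic (Bellman backup).
        Phi_c.append(phi_c(x)); targets.append(min(scores))
        Phi_a.append(phi_a(x)); best_us.append(u_star)
    # Refit critic and actor weights by least squares.
    Wc = np.linalg.lstsq(np.array(Phi_c), np.array(targets), rcond=None)[0]
    Wa = np.linalg.lstsq(np.array(Phi_a), np.array(best_us), rcond=None)[0]
    return Wc, Wa
```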
Air combat decision-making is a complex task involving a large amount of information and many variables, so manually specified decision rules struggle to keep up with a constantly changing battlefield environment. Existing air combat decision-making methods therefore often suffer from the following problems:
1. Static planning methods cannot cope with dynamic environments: traditional decision-making methods are usually based on rules or models fixed in advance, and have difficulty adapting to a battlefield environment and enemy situation that change in real time.
2. Manual decision-making costs a great deal of time and effort: the decision process has to handle a large amount of information and many variables, which consumes considerable time and effort and is prone to omissions and misjudgments.
3. Lack of comprehensive consideration and flexibility: traditional methods usually decide on the basis of a single factor or only a few factors, and find it difficult to weigh multiple factors together and respond flexibly, which may lead to biased or inaccurate decisions.
4. Inability to meet the needs of information-based warfare: the modern air combat environment is information-rich and changes rapidly, and manually formulated decision rules can no longer meet the needs of information-based warfare.
Summary of the Invention
The purpose of the present invention is to provide a data-driven adaptive dynamic programming method for air combat decision-making, which mainly solves the problem that manually formulated decision rules have difficulty adapting to a constantly changing battlefield environment.
To achieve the above purpose, the present invention adopts the following technical solution:
A data-driven adaptive dynamic programming air combat decision-making method, comprising the following steps:
S1: assume that the two opposing UAVs are a red UAV and a blue UAV, and establish the UAV pursuit-evasion system models for the red-pursuit/blue-escape problem and the red-escape/blue-pursuit problem respectively;
S2: solve the above UAV pursuit-evasion problem with model-free adaptive dynamic programming, and improve the policy with a bounded exploration signal;
S3: obtain the real-time control laws of the red UAV and the blue UAV with an offline neural network training algorithm, and collect the red-side control law information and the state information of both sides in real time;
S4: update the neural networks online with an online training algorithm, realizing adaptive dynamic programming air combat decision-making for the red and blue UAVs in the pursuit-evasion problem.
Further, in the present invention, the red-pursuit/blue-escape problem model is established as follows:
Let the real-time position of the red UAV be Xr(t) and the position of the blue UAV be Xb(t); the position difference between the two sides is then
e = Xb(t) − Xr(t)   (1)
and the tracking-error system is
ė = Ẋb(t) − Ẋr(t)   (2)
where ė is the time derivative of the position difference e, Ẋb(t) is the time derivative of the blue UAV's real-time position Xb(t), and Ẋr(t) is the time derivative of the red UAV's real-time position Xr(t).
Assuming that the red pursuer can only measure the three-dimensional velocity of the blue UAV, equation (2) can be written in the concrete form of equation (3), and the system model of the red side pursuing the blue side is then expressed as equation (4), where Vr is the red UAV's speed (in Mach); χr is the red UAV's heading angle (in radians); γr is the red UAV's flight-path inclination angle (in radians); ex, ey, ez are the distance errors (in kilometres); the dotted quantities are their respective time derivatives; g is the gravitational acceleration; Vc is the speed of sound; and nx, ny, nz are the overload control quantities of the red UAV.
Further, in the present invention, the red-escape/blue-pursuit problem model is established as follows:
A "virtual displacement" method is adopted: by minimizing the distance between the reverse displacement of the own aircraft and the enemy aircraft, the effect of maximizing the distance between the own position and the enemy position is achieved, where the virtual displacement is the displacement produced by the "virtual displacement velocity" V′. The system model of the red side escaping and the blue side pursuing is then expressed accordingly, with the same symbols and meanings as in the pursuit problem.
Further, in the present invention, the red-pursuit/blue-escape system model is processed as follows:
S11: the nonlinear continuous state-space equation of the UAV is abbreviated as equation (5), where x = [Vr, χr, γr, ex, ey, ez]^T is the red aircraft state vector, u = [nx, ny, nz]^T is the red aircraft control vector, and F(x) and G(x) are the corresponding system functions;
S12: the performance index function is defined as equation (7), where Q(x, t) is the index function related to the state and R(u, t) is the index function related to the control quantity;
S13: establish the UAV angle advantage function. Let the red UAV velocity direction vector be
Vr = [cosγr cosχr, cosγr sinχr, sinγr]^T,
and the blue UAV velocity direction vector be
Vb = [cosγb cosχb, cosγb sinχb, sinγb]^T.
The distance vector from the red UAV to the blue UAV is erb = [ex, ey, ez]^T, with the geometric relationship given by equation (8). The angle advantage function is obtained as
Qα = cαr + (1 − c)αb   (9)
where c = (αr + αb)/(2π);
S14: define the distance advantage function as
Qd = e^T Q1 e   (10)
where e = [ex, ey, ez]^T and Q1 is a positive definite matrix; the state index function of the red side can then be expressed as
Q(x, t) = Qd + Q2 Qα   (11)
where Q2 is a weight coefficient;
S15: define the controller index function as
R(u, t) = (u − u0)^T R (u − u0)   (12)
where R is the control weight matrix and u0 = [sinγr, 0, cosγr]^T is the control quantity of the UAV in steady flight.
Further, in the present invention, step S2 is implemented as follows:
Define a bounded exploration signal ue; the red UAV system model (5) can then be rewritten as equation (15), and the corresponding performance index function as equation (16). The derivative of the performance index function (7) with respect to time is expressed by equation (17). When the performance index function attains its minimum, the following Bellman equation (18) is satisfied, where r(j) = Q(x, t) + R(u, t). Combining equations (17) and (18) yields equation (19). The optimal control quantity of the real system is given by equation (20); solving equation (20) for G and substituting into equation (19) gives equation (21), and integrating both sides of equation (21) from t0 to t gives equation (22).
Neural networks are used to approximate the cost function and the control input, as in equation (23), where Wc and Wa are the ideal neural network weights of the critic (evaluation) network and the actor (execution) network, L1 and L2 are the numbers of hidden-layer neurons of the critic network and the actor network, and the remaining terms are their activation functions and reconstruction errors. Let the estimates produced by the critic network and the actor network be as in equation (24), where Ŵc and Ŵa are the estimates of the ideal weights Wc and Wa. Substituting equation (24) into equation (22) gives the residual error of equation (25), in which the control quantity obtained from the improved policy is expressed by equation (26), where Ω is the exploration set of the control quantity obtained by adding a bounded random exploration signal. The weights Ŵc and Ŵa are then optimized by the least-squares algorithm, as in equations (27) and (28).
Further, in step S3 of the present invention, the offline neural network training algorithm comprises the following steps:
S31: by specifying different initial states, obtain the data set {xk(t0)} and initialize the weights;
S32: obtain the control quantities corresponding to the states according to equation (26), forming the corresponding data set;
S33: using this data set, update Ŵc according to equation (27) and Ŵa according to equation (28);
S34: if the change in Ŵa is smaller than ∈a, or the change in Ŵc is smaller than ∈c, terminate the algorithm; otherwise set j = j + 1 and return to step S32, where ∈a and ∈c are the convergence accuracies.
Further, in step S4, the neural networks are updated online by the online training algorithm as follows:
S41: with the current neural network weights Wc, Wa and online learning rate α, sample at a fixed time interval δt to obtain the real-time data set {x(t), u(t)}; after several groups of data have been collected, go to step S42;
S42: obtain the control quantities corresponding to the states according to equation (26), forming the corresponding data set;
S43: using this data set, compute the new critic weights according to equation (27) and the new actor weights according to equation (28);
S44: update the neural network weights online and return to step S41.
Brief Description of the Drawings
FIG. 1 is a structural diagram of adaptive dynamic programming in the prior art.
FIG. 2 is a schematic flow diagram of the present invention.
FIG. 3 is a schematic diagram of the UAV angle advantage in an embodiment of the present invention.
FIG. 4 is a schematic diagram of the virtual displacement principle in an embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawings and embodiments; the implementations of the present invention include, but are not limited to, the following embodiments.
As shown in FIG. 2, the present invention discloses a data-driven adaptive dynamic programming air combat decision-making method. In the pursuit-evasion scenario there is one pursuer and one evader, represented in this embodiment by the red and blue sides. The problem is described here with the red side pursuing and the blue side escaping: the red UAV maneuvers to reduce its distance to the blue UAV while avoiding capture by the blue UAV, i.e. avoiding having the blue UAV's nose pointed at it and thus falling into a disadvantageous situation.
This embodiment establishes the UAV pursuit-evasion system models for the red-pursuit/blue-escape problem and the red-escape/blue-pursuit problem respectively.
First, let the real-time position of the red UAV be Xr(t) and the position of the blue UAV be Xb(t); the position difference between the two sides is then
e = Xb(t) − Xr(t)   (1)
and the tracking-error system is
ė = Ẋb(t) − Ẋr(t)   (2)
where ė is the time derivative of the position difference e, Ẋb(t) is the time derivative of the blue UAV's real-time position Xb(t), and Ẋr(t) is the time derivative of the red UAV's real-time position Xr(t).
Assuming that the red pursuer can only measure the three-dimensional velocity of the blue UAV, equation (2) can be written in the concrete form of equation (3), and the system model of the red side pursuing the blue side is then expressed as equation (4), where Vr is the red UAV's speed (in Mach); χr is the red UAV's heading angle (in radians); γr is the red UAV's flight-path inclination angle (in radians); ex, ey, ez are the distance errors (in kilometres); the dotted quantities are their respective time derivatives; g is the gravitational acceleration; Vc is the speed of sound; and nx, ny, nz are the overload control quantities of the red UAV, which are normally subject to saturation constraints.
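Since the explicit form of equation (4) is not reproduced above, the following Python sketch gives one plausible reading of the pursuit dynamics: a standard three-degree-of-freedom point-mass model with overload controls, chosen so that u0 = [sinγr, 0, cosγr]^T is indeed the steady-flight control as stated later. The exact form, the value of the speed of sound and the unit conversions are assumptions of this sketch, not a verbatim copy of the patent's formula.

```python
import numpy as np

G = 9.81      # gravitational acceleration, m/s^2
VC = 340.0    # assumed speed of sound, m/s

def pursuit_error_dynamics(x, u, vb_xyz):
    """Time derivative of the pursuit state x = [Vr, chi_r, gamma_r, ex, ey, ez].

    u = [nx, ny, nz] are the overload controls; vb_xyz is the measured blue-side
    velocity vector in km/s. Speeds are in Mach, angles in radians, distances in km.
    """
    Vr, chi, gamma, ex, ey, ez = x
    nx, ny, nz = u
    V = VC * Vr                                   # Mach -> m/s
    dVr    = G * (nx - np.sin(gamma)) / VC        # back to Mach per second
    dchi   = G * ny / (V * np.cos(gamma))
    dgamma = G * (nz - np.cos(gamma)) / V
    # Error dynamics: blue velocity minus red velocity (red velocity in km/s).
    vr_xyz = (V / 1000.0) * np.array([np.cos(gamma) * np.cos(chi),
                                      np.cos(gamma) * np.sin(chi),
                                      np.sin(gamma)])
    dex, dey, dez = np.asarray(vb_xyz, dtype=float) - vr_xyz
    return np.array([dVr, dchi, dgamma, dex, dey, dez])
```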
For convenience of description, the nonlinear continuous state-space equation of the UAV is abbreviated as equation (5), where x = [Vr, χr, γr, ex, ey, ez]^T is the red aircraft state vector, u = [nx, ny, nz]^T is the red aircraft control vector, and F(x) and G(x) are the corresponding system functions. Since the UAV pursuit-evasion problem is a nonlinear optimal control problem with saturated actuators, the performance index function is defined as equation (7), where Q(x, t) is the index function related to the state and R(u, t) is the index function related to the control quantity.
Next, the UAV angle advantage function is established. Let the red UAV velocity direction vector be
Vr = [cosγr cosχr, cosγr sinχr, sinγr]^T,
and the blue UAV velocity direction vector be
Vb = [cosγb cosχb, cosγb sinχb, sinγb]^T.
The distance vector from the red UAV to the blue UAV is erb = [ex, ey, ez]^T; as shown in FIG. 3, the geometric relationship is given by equation (8).
In air combat it is desirable for both αr and αb to be as small as possible so that the red side holds the angular advantage. Taking the red side as an example, when αr − (π − αb) < 0, i.e. αr + αb < π, the red side has the attack-angle advantage; conversely, if αr + αb > π, the red side is at an attack-angle disadvantage; and when αr + αb = π, the two sides are in angular balance. The angle advantage function is set as
Qα = cαr + (1 − c)αb   (9)
where c = (αr + αb)/(2π). The weight c dynamically adjusts the optimization priority of the angles αr and αb: when c < 0.5 the red side has the attack-angle advantage and the optimization should focus on αb to prevent the blue side from gaining an advantageous angular situation; when c > 0.5 the red side is at an attack-angle disadvantage and the optimization should focus on αr so that the red side gains an advantageous angular situation.
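A minimal sketch of the angle advantage computation is given below. Since the geometric relation (8) is not reproduced above, this sketch assumes that αr is the angle between the red velocity direction and the red-to-blue line of sight and αb is the angle between the blue velocity direction and the same line of sight; this reading is an assumption.

```python
import numpy as np

def angle_advantage(vr_dir, vb_dir, e_rb):
    """Angle advantage Q_alpha = c*alpha_r + (1-c)*alpha_b with c = (alpha_r+alpha_b)/(2*pi)."""
    los = np.array(e_rb, dtype=float)
    los /= np.linalg.norm(los)                    # unit line-of-sight vector (red -> blue)
    vr = np.array(vr_dir, dtype=float); vr /= np.linalg.norm(vr)
    vb = np.array(vb_dir, dtype=float); vb /= np.linalg.norm(vb)
    alpha_r = np.arccos(np.clip(np.dot(vr, los), -1.0, 1.0))
    alpha_b = np.arccos(np.clip(np.dot(vb, los), -1.0, 1.0))
    c = (alpha_r + alpha_b) / (2.0 * np.pi)       # dynamic weighting of eq. (9)
    return c * alpha_r + (1.0 - c) * alpha_b
```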
In the pursuit problem, the red side's goal is to shorten its distance to the blue side, so the distance advantage function is defined as
Qd = e^T Q1 e   (10)
where e = [ex, ey, ez]^T and Q1 is a positive definite matrix. The state index function of the red side can then be expressed as
Q(x, t) = Qd + Q2 Qα   (11)
where Q2 is a weight coefficient.
To satisfy the control constraints while keeping the controller stable when the UAV is in steady flight, the controller index function is defined as
R(u, t) = (u − u0)^T R (u − u0)   (12)
where R is the control weight matrix and u0 = [sinγr, 0, cosγr]^T is the control quantity of the UAV in steady flight.
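Putting equations (10)–(12) together, the stage cost used by the pursuit problem can be evaluated as in the sketch below. The weights Q1, Q2 and R are tuning parameters supplied by the user; the angle advantage term is assumed to have been computed as above.

```python
import numpy as np

def stage_cost(x, u, Q1, Q2, R, angle_adv):
    """Stage cost Q(x,t) + R(u,t) assembled from eqs. (10)-(12)."""
    Vr, chi, gamma, ex, ey, ez = x
    u = np.asarray(u, dtype=float)
    e = np.array([ex, ey, ez])
    Qd = e @ Q1 @ e                               # distance advantage, eq. (10)
    Qx = Qd + Q2 * angle_adv                      # state index, eq. (11)
    u0 = np.array([np.sin(gamma), 0.0, np.cos(gamma)])   # steady-flight control
    Ru = (u - u0) @ R @ (u - u0)                  # control index, eq. (12)
    return Qx + Ru
```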
For the red-escape/blue-pursuit problem model, the escape problem differs from the pursuit problem in that the objective is the opposite: to maximize the distance between the two aircraft. At the same time, to evade a missile, when the distance between the UAV and the missile is small the UAV needs to make large maneuvers, changing its heading and climb angle. To handle the maximization of the distance between the two aircraft, the "virtual displacement" method is adopted: minimizing the distance between the own aircraft's reverse displacement and the enemy aircraft achieves the effect of maximizing the distance between the own position and the enemy position.
As shown in FIG. 4, the own aircraft is being pursued by the enemy aircraft and we want to maximize the distance to the enemy; for the "virtual displacement velocity" V′, opposite in direction to the own velocity vector V, this becomes minimizing the distance between the virtual displacement and the enemy aircraft. The virtual displacement is the displacement produced by V′. The system model of the red side escaping and the blue side pursuing is then expressed accordingly, with the same symbols and meanings as in the pursuit problem.
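The virtual displacement trick can be sketched as follows. The explicit displacement formula is not reproduced above, so the finite horizon dt used here is an assumption; the point of the sketch is only that the error to be minimized is measured against the position the evader would reach with the reversed velocity.

```python
import numpy as np

def virtual_error(xr, xb, v_r, dt):
    """Escape-problem error built from the 'virtual displacement' of the red UAV.

    Instead of maximizing the distance to the pursuer, minimize the distance
    between the pursuer xb and the position the evader would reach if it flew
    with the reversed velocity -v_r for dt seconds.
    """
    x_virtual = np.asarray(xr, dtype=float) - np.asarray(v_r, dtype=float) * dt
    return np.asarray(xb, dtype=float) - x_virtual   # error to be minimized
```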
Generally speaking, an accurate UAV system model cannot be obtained in practice, while existing data-based model-free adaptive dynamic programming depends heavily on the data and cannot improve the policy on the basis of the data already collected. This embodiment therefore solves the above UAV pursuit-evasion problem with model-free adaptive dynamic programming and improves the policy with a bounded exploration signal.
Define a bounded exploration signal ue; the red UAV system model (5) can then be rewritten as equation (15), and the corresponding performance index function as equation (16). The derivative of the performance index function (7) with respect to time is expressed by equation (17). When the performance index function (16) attains its minimum, the following Bellman equation (18) is satisfied, where r(j) = Q(x, t) + R(u, t). Combining equations (17) and (18) yields equation (19). The optimal control quantity of the real system is given by equation (20); solving equation (20) for G and substituting into equation (19) gives equation (21), and integrating both sides of equation (21) from t0 to t gives equation (22).
Neural networks are used to approximate the cost function and the control input, as in equation (23), where Wc and Wa are the ideal neural network weights of the critic network and the actor network, L1 and L2 are the numbers of hidden-layer neurons of the critic network and the actor network, and the remaining terms are their activation functions and reconstruction errors. Let the estimates produced by the critic network and the actor network be as in equation (24), where Ŵc and Ŵa are the estimates of the ideal weights Wc and Wa. Substituting equation (24) into equation (22) gives the residual error of equation (25), in which the control quantity obtained from the improved policy is expressed by equation (26), where Ω is the exploration set of the control quantity obtained by adding a bounded random exploration signal. The weights Ŵc and Ŵa are then optimized by the least-squares algorithm, as in equations (27) and (28).
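The batch least-squares refit of the two networks can be sketched as below. The construction of the critic targets from the integrated Bellman relation (22)/(25) and of the improved controls from the exploration set (26) is not reproduced here, so both are passed in as pre-built samples; the helper names are assumptions of this sketch.

```python
import numpy as np

def least_squares_update(critic_samples, actor_samples, phi_c, phi_a):
    """Least-squares refit of critic and actor weights, in the spirit of eqs. (27)-(28).

    critic_samples: list of (state, value_target) pairs.
    actor_samples:  list of (state, improved_control) pairs.
    """
    Phi_c = np.array([phi_c(x) for x, _ in critic_samples])
    y_c = np.array([v for _, v in critic_samples])
    Wc = np.linalg.lstsq(Phi_c, y_c, rcond=None)[0]          # critic weights

    Phi_a = np.array([phi_a(x) for x, _ in actor_samples])
    U = np.array([u for _, u in actor_samples])
    Wa = np.linalg.lstsq(Phi_a, U, rcond=None)[0]            # actor weights
    return Wc, Wa
```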
This embodiment obtains the real-time control laws of the red UAV and the blue UAV with an offline neural network training algorithm, and collects the red-side control law information and the state information of both sides in real time. Specifically:
S31: by specifying different initial states, obtain the data set {xk(t0)} and initialize the weights;
S32: obtain the control quantities corresponding to the states according to equation (26), forming the corresponding data set;
S33: using this data set, update Ŵc according to equation (27) and Ŵa according to equation (28);
S34: if the change in Ŵa is smaller than ∈a, or the change in Ŵc is smaller than ∈c, terminate the algorithm; otherwise set j = j + 1 and return to step S32, where ∈a and ∈c are the convergence accuracies.
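Steps S31–S34 can be arranged as the following structural sketch, reusing the least_squares_update helper above. The rollout and build_samples callables stand in for the data-generation steps specified through equations (26)–(28) but not reproduced here; their names and the convergence test on the weight change are assumptions of this sketch.

```python
import numpy as np

def offline_training(initial_states, rollout, build_samples, phi_c, phi_a,
                     eps_c=1e-4, eps_a=1e-4, max_iter=100):
    """Offline iteration S31-S34 (structural sketch)."""
    Wc, Wa = None, None
    for j in range(max_iter):
        # S31/S32: simulate the current actor from each initial state and
        # assemble the data set of states and corresponding controls.
        trajectories = [rollout(x0, Wa) for x0 in initial_states]
        critic_samples, actor_samples = build_samples(trajectories, Wc, Wa)
        # S33: refit both networks by least squares.
        Wc_new, Wa_new = least_squares_update(critic_samples, actor_samples,
                                              phi_c, phi_a)
        # S34: stop when either weight vector has converged, otherwise iterate.
        if Wc is not None and (np.linalg.norm(Wa_new - Wa) < eps_a
                               or np.linalg.norm(Wc_new - Wc) < eps_c):
            return Wc_new, Wa_new
        Wc, Wa = Wc_new, Wa_new
    return Wc, Wa
```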
In this embodiment, the neural networks are updated online at regular intervals by the online training algorithm, realizing adaptive dynamic programming air combat decision-making for the red and blue UAVs in the pursuit-evasion problem. Specifically:
S41: with the current neural network weights Wc, Wa and online learning rate α, sample at a fixed time interval δt to obtain the real-time data set {x(t), u(t)}; after several groups of data have been collected, go to step S42;
S42: obtain the control quantities corresponding to the states according to equation (26), forming the corresponding data set;
S43: using this data set, compute the new critic weights according to equation (27) and the new actor weights according to equation (28);
S44: update the neural network weights online and return to step S41.
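A structural sketch of one online step is given below, again reusing least_squares_update. The exact weight-update rule of S44 is not reproduced above; blending the freshly fitted weights into the running ones with the learning rate α is one plausible reading and is stated here as an assumption.

```python
def online_update(Wc, Wa, buffer, build_samples, phi_c, phi_a, alpha=0.1):
    """Online step S41-S44 (structural sketch).

    buffer holds the real-time data {x(t), u(t)} sampled every delta_t seconds;
    alpha is the online learning rate.
    """
    critic_samples, actor_samples = build_samples(buffer, Wc, Wa)        # S42
    Wc_new, Wa_new = least_squares_update(critic_samples, actor_samples,
                                          phi_c, phi_a)                  # S43
    Wc = (1 - alpha) * Wc + alpha * Wc_new                               # S44 (assumed form)
    Wa = (1 - alpha) * Wa + alpha * Wa_new
    return Wc, Wa
```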
Through the above method, the present invention improves the ability of the online adaptive adjustment strategy and the adaptability of UAV air combat decision-making to different scenarios. Moreover, the present invention does not depend on the aircraft system model and has strong generalization ability; it can be extended to the control of other equipment, such as unmanned ground vehicles and robotic arms, among other application scenarios. Compared with the prior art, the present invention therefore has outstanding substantive features and represents significant progress.
The above embodiment is only one of the preferred implementations of the present invention and should not be used to limit the protection scope of the present invention. Any insubstantial change or refinement made within the main design concept and spirit of the present invention, provided the technical problem it solves remains consistent with that of the present invention, shall fall within the protection scope of the present invention.