Technical Field
The invention belongs to the field of intelligent decision-making, and in particular relates to an intelligent decision-making method for wargame deduction based on a dual experience pool DDPG (Deep Deterministic Policy Gradient) network.
Background
The purpose of intelligent decision-making is to use human knowledge, with the help of computers, to solve complex decision problems through artificial-intelligence methods. Wargame deduction is a typical complex decision problem. Wargaming is a common form of confrontation in military exercises: a sand table stands in for the real terrain, different pieces stand in for different forces, and, based on a back-end database and electronic situation information, the exercise simulates a live force-on-force confrontation as closely as possible. It can be used to test strategy and tactics and can give commanders inspiration on methods of operation. With the development of artificial-intelligence technology, the fusion of intelligent decision-making with wargaming has become a research hotspot in both fields, and research on intelligent decision-making for wargames has produced many results that are expected to genuinely improve combat effectiveness and deepen the process of military intelligentization.
Existing intelligent decision-making methods fall mainly into two types:
Rule-based methods, such as the decision tree method, solve the decision problem by specifying a different response strategy for each situation. The main problem with this type of technique is that the situations in a wargame are highly complex: a rule-based agent that chooses actions by judging the situation requires far too many branches, and as the complexity of the problem rises, the complexity of the whole decision tree grows exponentially.
Learning-based methods use deep learning and reinforcement learning to build a network model that takes the battlefield situation as its input and outputs the actions that one's own forces should take; the network parameters are updated according to some evaluation so that the whole decision framework is learned, and after a certain amount of training the model can play directly. The main limitation of this type of technique is that the convergence speed of the network model is strongly affected by sample quality, so convergence is not guaranteed to be fast.
Summary of the Invention
The purpose of the present invention is to provide an intelligent decision-making method for wargame deduction based on a dual experience pool DDPG network, so as to solve the problems of the prior art described above.
To achieve the above purpose, the present invention provides an intelligent decision-making method for wargame deduction based on a dual experience pool DDPG network, comprising:
obtaining wargame deduction data and constructing a dual experience pool DDPG model;
preprocessing the wargame deduction data and vectorizing the preprocessed data to obtain vectorized data;
inputting the vectorized data into the dual experience pool DDPG model for training, the training being completed when the dual experience pool DDPG model reaches a preset degree of convergence, and generating intelligent wargame decisions based on the trained dual experience pool DDPG model.
Optionally, obtaining the wargame deduction data comprises running a wargame deduction environment and obtaining the wargame deduction data from that environment.
The wargame deduction data include: own-side entity attribute information, attribute information of enemy entities that have been detected, deduction time, map attribute information, and scoreboard information;
the own-side entity attribute information includes the remaining health of own units, the positions of own units, and the remaining ammunition of own units;
the attribute information of detected enemy entities includes the enemy's remaining health and position;
the map attribute information includes elevation and hex number;
the scoreboard information includes the score obtained so far.
Optionally, in preprocessing the wargame deduction data, the preprocessing takes the form of data cleaning, which comprises:
extracting data from the collected wargame deduction data to obtain standardized data;
classifying the standardized data and eliminating redundant data.
Optionally, extracting data from the collected wargame deduction data to obtain the standardized data comprises:
removing non-standard data during extraction to obtain the standardized data;
the non-standard data include blank-line data and garbled data.
Optionally, classifying the standardized data and eliminating redundant data comprises:
dividing the standardized data into the own-side entity attribute information, the attribute information of detected enemy entities, the deduction time and the scoreboard information;
eliminating redundant data from the classified data, the redundant data comprising information that is useless for decision-making.
Optionally, vectorizing the preprocessed data comprises:
encoding the deduction time, the own-side entity attribute information and the attribute information of detected enemy entities using one-hot encoding;
the map attribute information and the scoreboard information need not be encoded, and the scoreboard information is used directly as part of the vectorized data.
Optionally, constructing the dual experience pool DDPG model comprises:
constructing a DDPG neural network based on the DDPG algorithm architecture, the DDPG neural network comprising an Actor network, a Critic network, an Actor_target network and a Critic_target network;
constructing two experience pools, implemented as multi-dimensional arrays, for storing the experience generated during training;
constructing the dual experience pool DDPG model from the DDPG neural network and the two experience pools.
Optionally, inputting the vectorized data into the dual experience pool DDPG model for training comprises:
inputting the vectorized data into the Actor network and feeding the obtained values into the Critic network for processing;
every preset number of time steps, updating the Actor_target network with the parameters of the Actor network and updating the Critic_target network with the parameters of the Critic network;
each time a training step is completed, storing the current experience in the first experience pool, and, if the reward obtained in the current experience is greater than the average reward in the first experience pool, also saving the current experience to the second experience pool.
Optionally, in updating the Actor_target network based on the parameters of the Actor network, the Actor network is updated by gradient descent; in updating the Critic_target network based on the parameters of the Critic network, the Critic network is likewise updated by gradient descent, and during the update the Critic network's loss function is the mean square error loss.
The technical effects of the present invention are as follows:
Compared with a general reinforcement-learning architecture, the present invention converges faster, saves training time, and learns the overall strategy sooner. The dual experience pool DDPG structure is applied to wargame deduction, and the speed-up provided by the dual experience pools is used to train a usable neural network model more quickly. By screening and exploiting high-quality samples, the problem that model performance depends on sample quality is alleviated to a certain extent.
Brief Description of the Drawings
The drawings forming a part of this application are used to provide a further understanding of the application; the illustrative embodiments and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:
Figure 1 is a flow chart of the method in an embodiment of the present invention;
Figure 2 is a schematic diagram of the training process in an embodiment of the present invention.
Detailed Description of the Embodiments
It should be noted that, provided there is no conflict, the embodiments of this application and the features of the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
It should also be noted that the steps shown in the flow charts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Embodiment 1
As shown in Figures 1 and 2, this embodiment provides an intelligent decision-making method for wargame deduction based on a dual experience pool DDPG network, comprising the following steps:
Step 1. Run the wargame deduction environment and collect data.
Step 2. Clean the data, including data extraction, data classification and elimination of redundant data.
Step 3. Vectorize the text data.
Step 4. Build the dual experience pool DDPG model.
Step 5. Feed the data into the model and fill the two experience pools.
Step 6. Train until the model converges.
Step 1 specifically includes:
1.1 The wargame here is a real-time strategy/tactics wargame, as distinguished from a turn-based wargame.
1.2 The data mainly include own-side entity attribute information, attribute information of enemy entities that have been detected, map attribute information, and scoreboard information. The own-side entity attribute information includes each own unit's remaining health, position, remaining ammunition, and so on; the attribute information of detected enemy entities includes remaining health and position but not ammunition; the map attribute information includes elevation, hex number, and so on; the scoreboard information is the score obtained so far.
Step 2 specifically includes:
2.1 Extract every collected record and remove non-standard data such as blank lines and garbled text, obtaining standardized data.
2.2 Remove redundant data: based on expert experience, discard data that are useless for decision-making so as to shrink the state space. For example, the "whether to aggregate" field given by the environment does not help the agent's decisions and can be deleted.
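As an illustrative, non-limiting sketch of this cleaning step, the following Python code drops blank or garbled lines and strips fields judged useless. The "key=value;key=value" record layout and the field name `is_aggregated` are assumptions made for the example, not the actual format of the deduction environment.

```python
import re

# Hypothetical field names used only for illustration; the real environment defines its own keys.
USELESS_FIELDS = {"is_aggregated"}  # e.g. the "whether to aggregate" flag mentioned above

def clean_records(raw_lines):
    """Drop blank/garbled lines, then strip fields that expert experience marks as useless."""
    records = []
    for line in raw_lines:
        line = line.strip()
        if not line:                         # blank line -> discard
            continue
        if re.search(r"\ufffd", line):       # replacement character as a crude garbled-text test
            continue
        fields = dict(f.split("=", 1) for f in line.split(";") if "=" in f)
        for key in USELESS_FIELDS:           # redundant-data elimination
            fields.pop(key, None)
        records.append(fields)
    return records
```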
Step 3 specifically includes:
3.1 Convert the formatted data output by the environment into vector form. Use one-hot encoding for the deduction time, own-unit information and the enemy-unit information obtained so far; the scoreboard information can be used directly and needs no one-hot conversion. "Formatted data" means that the environment outputs its data in a fixed format, with each kind of data in a predetermined, categorized position.
3.2 To eliminate the influence of different dimensional scales on the data, normalize the data with the following formula:
x′ij = (xij − min(xi)) / (max(xi) − min(xi))
where x′ij is the value of xij after normalization, xij is the j-th dimensional feature of the i-th column, xi is the i-th feature column, min(xi) is the minimum value over all entries of the i-th column, and max(xi) is the maximum value over all entries of the i-th column.
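A minimal NumPy sketch of the encoding and normalization described in 3.1 and 3.2. The vocabulary size of 1800 time steps and the example feature values are assumptions for illustration only.

```python
import numpy as np

def one_hot(index, size):
    """One-hot encode a categorical value such as the deduction time step."""
    v = np.zeros(size, dtype=np.float32)
    v[int(index)] = 1.0
    return v

def min_max_normalize(X):
    """Column-wise min-max normalization: x'_ij = (x_ij - min(x_i)) / (max(x_i) - min(x_i))."""
    X = np.asarray(X, dtype=np.float32)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X - col_min) / span

# Example: encode time step 42 of an assumed 1800-step game, normalize two numeric features
# (e.g. health and ammunition) and append the raw scoreboard value.
state_vec = np.concatenate([
    one_hot(42, 1800),
    min_max_normalize([[120.0, 3.0], [80.0, 1.0]])[0],
    np.array([15.0], dtype=np.float32),   # scoreboard used directly, no encoding
])
```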
Step 4 specifically includes:
Build the neural networks according to the DDPG algorithm architecture. The DDPG architecture requires four neural networks: the Actor network, the Critic network, the Actor_target network and the Critic_target network.
4.1 The convolutional layer uses multiple convolution kernels; different kernels attend to different features and extract features for the more important attributes such as health and coordinates. The update formula of the convolutional neural network is:
xt = σcnn(wcnn ⊙ xt + bcnn)
where xt denotes the current state feature, wcnn the filter weights, bcnn the bias parameters, and σcnn the activation function.
4.2 The Actor network is a three-layer neural network: the first layer is a convolutional layer whose number of neurons is determined by the dimension of the situation information, the second layer has 128 neurons, and the number of neurons in the third layer is determined by the dimension of the action variables.
4.3 The Critic network is a three-layer fully connected neural network: the number of neurons in the first layer is determined jointly by the dimensions of the Actor network's input variables and output variables, the second layer has 128 neurons, and the third layer has 1 neuron.
4.4 The Actor_target network has the same structure as the Actor network, and the Critic_target network has the same structure as the Critic network.
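A PyTorch sketch of the layer sizes described in 4.2 and 4.3. The 1-D convolution over the state vector, the number of channels, and the tanh output activation are assumptions; the description above only fixes the layer widths.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three-layer Actor: convolutional feature layer -> 128 hidden units -> action_dim outputs."""
    def __init__(self, state_dim, action_dim, channels=16):
        super().__init__()
        # A 1-D convolution over the state vector is assumed here; the patent only says "convolutional layer".
        self.conv = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(channels * state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)

    def forward(self, state):                        # state: (batch, state_dim)
        x = torch.relu(self.conv(state.unsqueeze(1)))
        x = torch.relu(self.fc1(x.flatten(1)))
        return torch.tanh(self.fc2(x))               # bounded action vector (assumed activation)

class Critic(nn.Module):
    """Three-layer Critic: (state_dim + action_dim) input neurons -> 128 -> 1 Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(torch.cat([state, action], dim=1)))
        return self.fc2(x)
```

With the dimensions of Embodiment 2 (state_dim = 36, action_dim = 15), the Critic's first layer indeed has 36 + 15 = 51 input neurons.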
4.5 Construct two experience pools to store the experience obtained during deduction. Each experience pool is a multi-dimensional array with the following structure:
each entry is a tuple (St, At, Rt, St+1), where St is the state at the current time step t, At is the action taken at time t, Rt is the reward obtained, St+1 is the state at time t+1, and the entry index i denotes the i-th experience.
The dimension of the experience pool is determined by the following formula:
dim = 2*state_dim + action_dim + 1
where dim is the dimension of the experience pool, state_dim is the dimension of the situation information, action_dim is the dimension of the action vector, and the extra 1 is the dimension required for the reward value.
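A NumPy sketch of one such experience pool, assuming the row layout [s | a | r | s_next] implied by the dimension formula; the capacity value and helper names are illustrative. The overwrite-style update mentioned in step 6.3 is realized here by a circular write pointer.

```python
import numpy as np

class ExperiencePool:
    """Experience pool stored as a 2-D array; each row packs (s_t, a_t, r_t, s_t+1),
    so the row width is dim = 2*state_dim + action_dim + 1."""
    def __init__(self, state_dim, action_dim, capacity=100_000):
        self.dim = 2 * state_dim + action_dim + 1
        self.data = np.zeros((capacity, self.dim), dtype=np.float32)
        self.capacity = capacity
        self.size = 0
        self.ptr = 0                                  # next row to write (overwrite when full)
        self.state_dim, self.action_dim = state_dim, action_dim

    def store(self, s, a, r, s_next):
        self.data[self.ptr] = np.concatenate([s, a, [r], s_next])
        self.ptr = (self.ptr + 1) % self.capacity     # overwrite-style update (see step 6.3)
        self.size = min(self.size + 1, self.capacity)

    def mean_reward(self):
        col = self.state_dim + self.action_dim        # column holding the reward value
        return float(self.data[:self.size, col].mean()) if self.size else 0.0

    def sample(self, batch_size, rng=np.random):
        idx = rng.randint(0, self.size, size=batch_size)
        return self.data[idx]
```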
Step 5 specifically includes:
5.1 Input the vectorized data from step 3 into the Actor network to mine the latent relationships in the situation information and obtain the Actor network's output value.
5.2 Concatenate the output of the convolutional neural network with the output of the Actor network and feed the result into the Critic network to obtain the Critic network's output. Every fixed number of time steps, update the Actor_target network and the Critic_target network with the parameters of the Actor network and the Critic network, respectively. The target-network parameters are updated by soft update, according to the following formula:
θtarg ← ρ·θtarg + (1−ρ)·θ
where θtarg denotes the current parameters of the target network. ρ is generally given a relatively large value to ensure that the parameters are updated slowly; compared with a hard update that copies the parameters directly, this better preserves the stability of the model.
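A minimal PyTorch sketch of this soft update; the default ρ = 0.95 follows the value used in Embodiment 2.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, rho=0.95):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta, applied parameter by parameter."""
    for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)
```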
5.3 Each time the environment returns a complete closed-loop transition (St, At, Rt, St+1), it is stored in experience pool A; if the reward Rt of this experience is greater than the average reward in experience pool A, a copy of it is also placed in experience pool B.
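A short sketch of this dual-pool rule, reusing the hypothetical ExperiencePool class from the sketch after 4.5. Whether the average is taken before or after inserting the new transition is not specified above; it is computed beforehand here.

```python
def store_transition(pool_a, pool_b, s, a, r, s_next):
    """Dual-pool rule: every transition goes into pool A; if its reward beats
    pool A's current average reward, a copy also goes into pool B."""
    avg = pool_a.mean_reward()            # average reward currently held in pool A
    pool_a.store(s, a, r, s_next)
    if r > avg:
        pool_b.store(s, a, r, s_next)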
Step 6 specifically includes:
6.1 The Actor network is updated using gradient descent.
6.2 The Critic network's loss function is the mean square error (MSE) loss, and the Critic network is likewise updated using gradient descent; a combined sketch of these updates is given after step 6.
6.3 The data in the experience pools are updated in an overwriting manner.
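The following PyTorch sketch combines the updates of 6.1 and 6.2 into one training step, using the Actor/Critic, ExperiencePool and soft_update sketches introduced above. The discount factor gamma = 0.99 and the batch layout [s | a | r | s_next] are assumptions; only the MSE Critic loss and gradient-descent updates are stated in the text.

```python
import torch
import torch.nn.functional as F

def train_step(actor, critic, actor_targ, critic_targ,
               actor_opt, critic_opt, batch, state_dim, action_dim,
               gamma=0.99, rho=0.95):
    """One DDPG-style update on a sampled batch whose rows are laid out as [s | a | r | s_next]."""
    batch = torch.as_tensor(batch, dtype=torch.float32)
    s = batch[:, :state_dim]
    a = batch[:, state_dim:state_dim + action_dim]
    r = batch[:, state_dim + action_dim:state_dim + action_dim + 1]
    s_next = batch[:, -state_dim:]

    # Critic update (6.2): MSE between Q(s, a) and the bootstrapped target value.
    with torch.no_grad():
        y = r + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update (6.1): gradient descent on -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft updates of the target networks (step 5.2).
    soft_update(actor_targ, actor, rho)
    soft_update(critic_targ, critic, rho)
```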
Embodiment 2
As shown in Figures 1 and 2, this embodiment provides an intelligent decision-making method for wargame deduction based on a dual experience pool DDPG network: run the wargame deduction environment and collect data; clean the data, including data extraction, data classification and elimination of redundant data; vectorize the text data, build the dual experience pool DDPG model, fill the experience pools, update the network parameters, and train until the model converges. Specifically:
Step 1: Run the wargame deduction environment and collect battle data, including the state at each step, the actions taken by our side, our score, and so on. These data can first be generated by playing a hand-written, hard-coded strategy against the bot built into the deduction environment.
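A sketch of such a data-collection rollout, assuming a gym-style `reset()`/`step()` interface for the deduction environment (the real API is not specified here) and reusing the store_transition helper from Embodiment 1; `scripted_policy` stands for the hand-written, hard-coded strategy.

```python
def collect_episode(env, scripted_policy, pool_a, pool_b):
    """Roll out one game with a hard-coded policy against the built-in bot
    and log every transition into the dual experience pools."""
    s = env.reset()                       # hypothetical gym-style interface
    done = False
    while not done:
        a = scripted_policy(s)            # rule-based action for the current situation
        s_next, r, done, _ = env.step(a)  # reward taken from the scoreboard
        store_transition(pool_a, pool_b, s, a, r, s_next)
        s = s_next
```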
Step 2: Normalize the collected data:
x′ij = (xij − min(xi)) / (max(xi) − min(xi))
where x′ij is the value of xij after normalization, xij is the j-th dimensional feature of the i-th column, xi is the i-th feature column, min(xi) is the minimum value over all entries of the i-th column, and max(xi) is the maximum value over all entries of the i-th column.
Step 3: Build the model. The dual experience pool DDPG network model designed by the present invention consists of five parts: the Actor network, the Critic network, the Actor_target network, the Critic_target network, and the dual experience pools. The input of the Actor network is the currently observed state and its output is the action of each unit in the current state; the input of the Critic network is the currently observed state and the current action of each unit, and its output is an estimated value. The main process is shown in the drawings.
Step 4: The observed state mentioned above is the situation information observable in the environment, including the current score; the step count; the position, health and remaining ammunition of own units; the position and health of enemy units that have been observed; and the position of the control point to be seized. The actions of each unit mentioned above mainly include movement, taking cover, firing, and so on.
In this embodiment, the number of own-side units is 3, and the number of enemy unit counters is also 3. The specific state information is a 36-dimensional vector composed of the own-unit states, the enemy-unit states and the scoreboard information returned by the deduction environment. This vector represents the information on the field that currently requires attention, and it is used as the input of the Actor network. The action of each unit is a 15-dimensional vector composed of actions such as maneuver, maneuver target, seize control, fire, and fire target.
Step 5: The network model interacts with the environment to obtain the state St at the current time, the action At at the current time, the state St+1 at the next time, and the reward information Rt given by the scoreboard. From these, the experience pools are built from tuples of the form (St, At, Rt, St+1), where the index i of an entry denotes the i-th experience. The dual experience pools are denoted experience pool A and experience pool B: pool A stores the battle data as usual, while pool B only accepts those experiences in pool A whose reward value Rt is above the average reward value.
The reward value is defined as the discounted return
Rt = Σi γ^i · r(si, ai)
where γ is the discount factor and r(si, ai) is the reward obtained by taking action ai in state si.
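A small worked example of this discounted return, computed backwards over one episode's step rewards; the discount value γ = 0.9 and the reward sequence are illustrative only.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_i gamma**i * r(s_i, a_i), accumulated backwards over an episode."""
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Example with three step rewards: 5.0 at the last step propagates back through the discount.
print(discounted_return([1.0, 0.0, 5.0]))  # approximately [5.05, 4.5, 5.0]
```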
Step 6: After the experience pools contain a certain amount of data, experiences are sampled from pools A and B in different proportions to train the network, as sketched below. The Actor network has 36 neurons in its first layer, 128 in its second layer and 15 in its third layer; the Critic network has 51 neurons in its first layer, 128 in its second layer and 1 in its third layer.
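A sketch of the mixed-proportion sampling. The text above does not fix the mixing ratio, so the 50/50 split (`ratio_b=0.5`), the batch size, and the assumption that both pools already hold data are illustrative choices; the ExperiencePool sketch from Embodiment 1 is reused.

```python
import numpy as np

def sample_mixed(pool_a, pool_b, batch_size=64, ratio_b=0.5):
    """Draw a training batch with a fixed fraction taken from the high-reward pool B
    and the remainder from the ordinary pool A (pool A is assumed non-empty)."""
    n_b = min(int(batch_size * ratio_b), pool_b.size)
    n_a = batch_size - n_b
    batch = pool_a.sample(n_a)
    if n_b:
        batch = np.concatenate([batch, pool_b.sample(n_b)])
    np.random.shuffle(batch)   # mix rows from the two pools
    return batch
```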
The parameters of the Actor network are updated layer by layer using the gradient descent algorithm.
Following the standard DDPG policy update, the Actor network update minimizes the loss
Lactor(θ) = −(1/N) Σi Q(Si, μθ(Si) | φ)
by gradient descent on the Actor parameters θ, where μθ is the Actor (policy) network and Q(· | φ) is the Critic network.
The Critic network parameters are updated layer by layer using the mean square error loss function and gradient descent.
The Critic network update minimizes the loss
Lcritic(φ) = (1/N) Σi ( Ri + γ·Qtarg(S′i, μtarg(S′i) | φtarg) − Q(Si, Ai | φ) )²
where S′i is the next state of the i-th sampled transition and Qtarg and μtarg denote the target networks.
The Actor_target network is updated using the following formula:
θtarg ← ρ·θtarg + (1−ρ)·θ
The Critic_target network is updated using the following formula:
φtarg ← ρ·φtarg + (1−ρ)·φ
In this embodiment, ρ is set to 0.95.
The above are only preferred specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.