Disclosure of Invention
In view of the above, the present invention provides a target detection and allocation method and device based on multi-agent reinforcement learning, so as to solve the prior-art problem that the combat behavior model in a war game deduction system converges slowly during optimization.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target detection and allocation method based on multi-agent reinforcement learning comprises the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model;
and training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
Further, constructing the reinforcement learning training environment comprises:
mapping the combat simulation engine to the reinforcement learning training environment by adopting the MADDPG algorithm.
Further, the mapping of the combat simulation engine to the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
Further, the training of the combat behavior model to convergence by adopting the reinforcement learning training environment to obtain the artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Further, the training of the artificial intelligence behavior model by adopting a combat simulation engine and the output of an optimization model comprise:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the artificial intelligence behavior model converges.
Further, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
Further, the reinforcement learning training environment utilizes the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Further, the number of the agents is 3.
Furthermore, the combat behavior model adopts a multi-agent artificial neural network.
The embodiment of the application provides a target detection and allocation device based on multi-agent reinforcement learning, including:
the construction module is used for constructing a combat behavior model and a reinforcement learning training environment;
the obtaining module is used for training the combat behavior model to model convergence by adopting a reinforcement learning training environment, and obtaining an artificial intelligence behavior model;
and the output module is used for training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model.
By adopting the technical scheme, the invention can achieve the following beneficial effects:
the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which comprises the steps of constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model; and training the artificial intelligent behavior model by adopting a combat simulation engine, and outputting an optimization model. The invention integrates the reinforcement learning algorithm MADDPG into the war game deduction system, constructs a simple to complex simulation environment, optimizes the reinforcement learning convergence speed, and effectively solves the problem of optimizing the convergence speed by an intelligent body in the war game deduction system.
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, the agents can still achieve the goal well according to the learned strategies.
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. To address the convergence speed of the multiple agents during MADDPG training, the invention takes the MPE (multi-agent particle environment) developed by OpenAI as a basis, removes most of the mathematical operations of the combat models, and retains most of the war game simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the war game simulation and deduction system, which then trains again, thereby effectively solving the problem of slow convergence when the agents are optimized.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
The present invention is inspired by the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) multi-agent algorithm, which makes a series of improvements to the policy gradient algorithm so that it can handle complex multi-agent scenarios that traditional algorithms cannot process. The MADDPG algorithm has the following three characteristics:
1. The optimal policy obtained by learning can give the optimal action using only local information at execution time.
2. There is no need to know the dynamic model of the environment or to impose special communication requirements.
3. The algorithm can be used not only in cooperative environments but also in competitive environments.
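By way of illustration only, the centralized-critic, decentralized-actor structure underlying MADDPG can be sketched as follows. Python with PyTorch is assumed here; the patent does not prescribe a framework, and the network sizes are placeholders.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to its action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous action in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: scores the joint observations and actions of all agents."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # Q-value of the joint state-action pair
        )

    def forward(self, all_obs, all_acts):
        return self.net(torch.cat([all_obs, all_acts], dim=-1))
```

Because only the actor is used at execution time, each combat unit needs nothing but its own local information, which corresponds to characteristic 1 above.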
The invention applies the MADDPG idea to the military simulation field: each combat unit becomes an independent and mutually cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve the goal well according to the learned strategies.
The following describes the target detection and allocation method and device based on multi-agent reinforcement learning provided in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a target detection and allocation method based on multi-agent reinforcement learning provided in the embodiment of the present application includes:
S101, constructing a combat behavior model and a reinforcement learning training environment;
as shown in fig. 2, for the convenience of study, the following settings are made for the battle behavior model:
(1) the target and the target do not overlap;
(2) the target and the radar detection range are not overlapped;
(3) supposing that the unmanned aerial vehicle cluster is fixed at a stable flying height, the measurement accuracy and the ground resolution of the magnetic detector can be ensured.
A group of UAVs detects dynamic and static targets in a large-scale unstructured environment containing obstacles, and the quality of this process is measured by a suitable objective function, wherein the perception radius of the detector changes in real time with the environment and the targets. The overall quality is optimized herein by minimizing the time required to find given static targets, or by maximizing the average number of dynamic targets found within a certain search time, with the expected observation time depending on the task objective.
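For illustration, the two optimization criteria described above can be expressed as a single task-dependent objective; the sketch below and the names detection_times and search_time are assumptions introduced here, not quantities defined by the patent.

```python
def search_objective(detection_times, search_time, static_task=True):
    """Illustrative objective for the UAV search task (assumed formulation).

    detection_times: times at which targets were found during the search.
    search_time:     total search time allotted to the task.
    """
    if static_task:
        # Static targets: minimize the time needed to find the given targets
        # (returned as a cost, so smaller is better).
        return max(detection_times) if detection_times else float("inf")
    # Dynamic targets: maximize the average number of targets found
    # within the given search time (larger is better).
    return len(detection_times) / search_time
```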
The combat behavior model corresponds to a multi-agent artificial neural network; it is the core that generates intelligence and the object of reinforcement learning training.
S102, training the combat behavior model by adopting a reinforcement learning training environment until the model converges to obtain an artificial intelligence behavior model;
the reinforcement learning training environment is an environment set according to a combat simulation engine, such as: the combat simulation engine is a large environment, and the reinforcement learning training environment is a small environment made by extracting necessary factors in the large environment. And mapping the combat behavior model into multiple intelligent agents, and training the multiple intelligent agents to obtain an optimized artificial intelligence behavior model. In the present application, 3 agents are used, and it is understood that 4, 5, and 6 agents may also be used, which is not limited herein.
And S103, training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
The artificial intelligence behavior model obtained after the small-environment pre-training is put into the large environment for compensation training again to obtain the optimization model. Simulation environments are thus constructed from simple to complex, which improves the convergence speed of reinforcement learning.
The working principle of the target detection and allocation method based on multi-agent reinforcement learning is as follows: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model to convergence by adopting the reinforcement learning training environment to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model. The training is divided into two stages, small-environment pre-training and large-environment compensation training, which improves the adaptability of the artificial intelligence behavior model.
In some embodiments, a combat simulation engine is mapped to a reinforcement learning training environment using the MADDPG algorithm.
Preferably, the mapping of the combat simulation engine to the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
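Purely for illustration, the four mapped modules enumerated above can be grouped into one training agent as sketched below; the class and method names (sense, store, select, execute) are assumptions, not interfaces defined by the patent.

```python
class WarGameAgent:
    """One combat unit mapped into the reinforcement learning training environment."""

    def __init__(self, perception, decision, action, memory):
        self.perception = perception  # mapped from the engine's perception model
        self.decision = decision      # mapped from the engine's decision model
        self.action = action          # mapped from the engine's action model
        self.memory = memory          # mapped from the engine's memory model

    def step(self, env_info):
        situation = self.perception.sense(env_info)  # current battlefield situation
        self.memory.store(situation)                 # store the situation
        chosen = self.decision.select(situation)     # choose an action for this situation
        return self.action.execute(chosen)           # execute the selected action
```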
In the present application, because the reinforcement learning training environment differs from the combat simulation engine in its runtime environment and programming language and is therefore difficult to integrate directly, the combat simulation engine is mapped to the reinforcement learning training environment, and the state-of-the-art multi-agent reinforcement learning algorithm MADDPG is adopted; the algorithm framework is shown in figure 3. Because the data volume and computation volume of a real simulation engine are huge, a simplified battlefield environment, referred to as the war game deduction small environment, is first built externally on the basis of OpenAI's MPE. This small environment can provide simple geographic information data and can also generate simple deduction process data.
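One possible way to define such a small environment is sketched below, assuming that OpenAI's multiagent-particle-envs (MPE) package is installed; the scenario contents are placeholders and do not reproduce the combat models of the engine.

```python
# Assumes OpenAI's multiagent-particle-envs (MPE) package.
import numpy as np
from multiagent.core import World, Agent, Landmark
from multiagent.scenario import BaseScenario

class WarGameSmallScenario(BaseScenario):
    def make_world(self):
        world = World()
        # Three agents, as in the embodiment; landmarks stand in for targets.
        world.agents = [Agent() for _ in range(3)]
        for i, agent in enumerate(world.agents):
            agent.name = f"agent_{i}"
            agent.collide = True
        world.landmarks = [Landmark() for _ in range(3)]
        self.reset_world(world)
        return world

    def reset_world(self, world):
        # Simple geographic information: random positions in a unit square.
        for entity in world.agents + world.landmarks:
            entity.state.p_pos = np.random.uniform(-1.0, +1.0, world.dim_p)
            entity.state.p_vel = np.zeros(world.dim_p)

    def reward(self, agent, world):
        # Placeholder; the reward shaping used in the embodiment is described below.
        return 0.0

    def observation(self, agent, world):
        # Relative positions of the landmarks with respect to the agent.
        rel = [lm.state.p_pos - agent.state.p_pos for lm in world.landmarks]
        return np.concatenate([agent.state.p_vel] + rel)
```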
In some embodiments, the training of the combat behavior model to model convergence by using a reinforcement learning training environment to obtain an artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Specifically, the training is divided into two stages, and pre-training in the small environment is performed first. The agent is placed in the reinforcement learning training environment, and the perception agent module calls a simulated sensor and a simulated communication interface to obtain environment information, determines the current battlefield situation, and stores it in the memory agent module. The sensor is, for example, a radar that can obtain the positions of teammates, the positions of enemies, and the like. The agent then moves by selecting an action, such as moving left or right, according to the positional relationship. The reinforcement learning training environment gives feedback to the agent according to the action result and determines a reward function so as to optimize the agent.
The reward function gives a reward of 100 and a penalty of 100, the penalty being given when the agent has a collision.
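A sketch of a reward of this shape is shown below; the patent text only fixes the magnitude 100 and the collision penalty, so the condition for granting the positive reward (here, detecting a target) is an assumption.

```python
def step_reward(target_detected: bool, collided: bool) -> float:
    """Illustrative reward: reward of 100, penalty of 100 on collision."""
    r = 0.0
    if target_detected:  # assumed positive-reward condition (not stated in the text)
        r += 100.0
    if collided:         # penalty given when the agent has a collision
        r -= 100.0
    return r
```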
And continuously optimizing until the intelligent agent converges, and outputting an artificial intelligence behavior model.
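The small-environment pre-training loop described above can be summarized by the following sketch; env, agents and replay_buffer are assumed objects with assumed interfaces, and the convergence test is simplified to a reward threshold for illustration.

```python
def pretrain(env, agents, replay_buffer, max_episodes=1000, reward_threshold=None):
    """Illustrative pre-training loop in the small environment (assumed interfaces)."""
    for episode in range(max_episodes):
        observations = env.reset()                       # initialize the agents
        done, episode_reward = False, 0.0
        while not done:
            # perception -> decision -> action for every agent
            actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
            next_observations, rewards, dones, _ = env.step(actions)
            # the environment feeds the action result back for optimization
            replay_buffer.add(observations, actions, rewards, next_observations, dones)
            for agent in agents:
                agent.update(replay_buffer)              # MADDPG-style parameter update
            observations = next_observations
            episode_reward += sum(rewards)
            done = all(dones)
        if reward_threshold is not None and episode_reward >= reward_threshold:
            break                                        # treated as converged (simplified)
    return agents                                        # the artificial intelligence behavior model
```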
The above is the simulation training process for one engagement. After training on many samples, the combat behavior model gradually converges, producing an artificial intelligence behavior model whose opponent is a reactive behavior model. Because the opponent in the first training stage is a traditional reactive behavior model whose behavior logic is relatively fixed, a second training stage is also needed in order to enlarge the training sample space and improve the adaptability of the artificial intelligence behavior model.
In some embodiments, the training of the artificial intelligence behavior model with the combat simulation engine to output an optimization model includes:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the artificial intelligence behavior model converges.
The artificial intelligence behavior model is then placed into the large environment for compensation training. Because the data acquired in the small-environment pre-training stage is relatively error-prone and the data processing in that stage is not complete enough, a real combat simulation engine is adopted as the environment for re-training after the small-environment pre-training. The perception model, decision model, memory model, and action model of the combat simulation engine are used for training; the training process is the same as that in the reinforcement learning training environment and is not described again here.
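How the experience learned in the small environment might be carried over into the second training stage is sketched below; PyTorch-style state dictionaries, the actor attribute and the checkpoint path are assumptions introduced for illustration.

```python
import torch

def inherit_experience(pretrained_agents, engine_agents, checkpoint="small_env_pretrained.pt"):
    """Illustrative hand-over of small-environment experience to the engine stage."""
    # Save the policy networks learned during small-environment pre-training.
    torch.save([agent.actor.state_dict() for agent in pretrained_agents], checkpoint)
    # Load them as the starting point for compensation training in the
    # combat simulation engine (the large environment).
    states = torch.load(checkpoint)
    for agent, state in zip(engine_agents, states):
        agent.actor.load_state_dict(state)
    return engine_agents
```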
Specifically, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
The end of the training process of the present application falls into two categories: one is reaching the maximum running time, for example, running at most one thousand steps, which can also be understood as one thousand frame periods; the other is reaching the expected optimization goal.
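The two end conditions can be combined into one check, as in the following illustrative snippet; the threshold values are examples and not values taken from the patent.

```python
def should_stop(step, episode_reward, max_steps=1000, target_reward=None):
    """Stop when the maximum running time (in steps / frame periods) is reached
    or when the expected optimization goal has been achieved."""
    if step >= max_steps:
        return True
    if target_reward is not None and episode_reward >= target_reward:
        return True
    return False
```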
Preferably, the reinforcement learning training environment utilizes the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Preferably, the combat behavior model employs a multi-agent artificial neural network.
As shown in fig. 4, an embodiment of the present application provides a target detection and allocation device based on multi-agent reinforcement learning, including:
the construction module 401, used for constructing a combat behavior model and a reinforcement learning training environment;
an obtaining module 402, configured to train the combat behavior model to model convergence by using a reinforcement learning training environment, and obtain an artificial intelligence behavior model;
and an output module 403, configured to train the artificial intelligence behavior model by using a combat simulation engine, and output an optimization model.
The working principle of the target detection and allocation device based on multi-agent reinforcement learning provided by the present application is as follows: the construction module 401 constructs a combat behavior model and a reinforcement learning training environment; the obtaining module 402 trains the combat behavior model to convergence by using the reinforcement learning training environment and obtains an artificial intelligence behavior model; and the output module 403 trains the artificial intelligence behavior model by using a combat simulation engine and outputs an optimization model.
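A minimal sketch of how the three modules of the device could be organized in code is given below; the class name and the three callables are illustrative assumptions that merely mirror the description above.

```python
class TargetDetectionAndAllocationDevice:
    """Device composed of a construction module, an obtaining module and an output module."""

    def __init__(self, build_fn, pretrain_fn, engine_train_fn):
        self._build_fn = build_fn                # construction module behaviour
        self._pretrain_fn = pretrain_fn          # obtaining module behaviour
        self._engine_train_fn = engine_train_fn  # output module behaviour

    def run(self):
        model, small_env, engine = self._build_fn()      # construct model and environments
        ai_model = self._pretrain_fn(model, small_env)   # train to convergence in the small environment
        return self._engine_train_fn(ai_model, engine)   # compensation training; output the optimization model
```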
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the target detection and distribution method based on multi-agent reinforcement learning provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the invention provides a target detection and allocation method and device based on multi-agent reinforcement learning, which integrates the reinforcement learning algorithm MADDPG into a war game deduction system, constructs simulation environments from simple to complex, improves the convergence speed of reinforcement learning, and effectively solves the problem of slow convergence when an agent is optimized in the war game deduction system. The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve the goal well according to the learned strategies. To address the convergence speed of the multiple agents during MADDPG training, the invention takes the MPE (multi-agent particle environment) developed by OpenAI as a basis, removes most of the mathematical operations of the combat models, and retains most of the war game simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the war game simulation and deduction system, which then trains again, thereby effectively solving the problem of slow convergence when the agents are optimized.
It is to be understood that the embodiments of the method provided above correspond to the embodiments of the apparatus described above, and the corresponding specific contents may be referred to each other, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.