Disclosure of Invention
In view of the above, the present invention provides a target detection and allocation method and device based on multi-agent reinforcement learning, so as to solve the prior-art problem that the combat behavior model in a war game deduction system converges slowly during optimization.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target detection and allocation method based on multi-agent reinforcement learning comprises the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model;
and training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
Further, constructing the reinforcement learning training environment comprises:
mapping the combat simulation engine to the reinforcement learning training environment by adopting the MADDPG algorithm.
Further, the mapping of the combat simulation engine to the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
Further, the training of the combat behavior model to convergence by adopting the reinforcement learning training environment to obtain the artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Further, the training of the artificial intelligence behavior model by adopting a combat simulation engine and the output of an optimization model comprise:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the artificial intelligence behavior model converges.
Further, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
Further, the reinforcement learning training environment utilizes the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Further, the number of the agents is 3.
Furthermore, the combat behavior model adopts a multi-agent artificial neural network.
The embodiment of the application provides a target detection and allocation device based on multi-agent reinforcement learning, including:
the construction module is used for constructing a combat behavior model and a reinforcement learning training environment;
the obtaining module is used for training the combat behavior model to model convergence by adopting a reinforcement learning training environment, and obtaining an artificial intelligence behavior model;
and the output module is used for training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model.
By adopting the technical scheme, the invention can achieve the following beneficial effects:
the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which comprises the steps of constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model; and training the artificial intelligent behavior model by adopting a combat simulation engine, and outputting an optimization model. The invention integrates the reinforcement learning algorithm MADDPG into the war game deduction system, constructs a simple to complex simulation environment, optimizes the reinforcement learning convergence speed, and effectively solves the problem of optimizing the convergence speed by an intelligent body in the war game deduction system.
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, the agents can still achieve the goal well according to the learned strategies.
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. To address the convergence speed of the multiple agents during MADDPG training, the invention takes the MPE (multi-agent particle environment) developed by OpenAI as a basis, removes most of the mathematical operations of the combat models, and retains most of the war game simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the war game simulation and deduction system, which then trains again, thereby effectively solving the problem of slow convergence when the agents are optimized.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
The present invention is inspired by the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) multi-agent algorithm, which makes a series of improvements to the policy gradient algorithm so that it can handle complex multi-agent scenarios that traditional algorithms cannot process. The MADDPG algorithm has the following three characteristics:
1. The optimal policy obtained by learning can give the optimal action using only local information at execution time.
2. There is no need to know the dynamic model of the environment or to impose special communication requirements.
3. The algorithm can be used not only in cooperative environments but also in competitive environments.
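By way of illustration only, the centralized-critic, decentralized-actor structure underlying MADDPG can be sketched as follows. Python with PyTorch is assumed here; the patent does not prescribe a framework, and the network sizes are placeholders.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to its action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous action in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: scores the joint observations and actions of all agents."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # Q-value of the joint state-action pair
        )

    def forward(self, all_obs, all_acts):
        return self.net(torch.cat([all_obs, all_acts], dim=-1))
```

Because only the actor is used at execution time, each combat unit needs nothing but its own local information, which corresponds to characteristic 1 above.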
The invention applies the MADDPG idea to the military simulation field: each combat unit becomes an independent and mutually cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve the goal well according to the learned strategies.
The following describes the target detection and allocation method and device based on multi-agent reinforcement learning provided in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a target detection and allocation method based on multi-agent reinforcement learning provided in the embodiment of the present application includes:
S101, constructing a combat behavior model and a reinforcement learning training environment;
as shown in fig. 2, for the convenience of study, the following settings are made for the battle behavior model:
(1) the target and the target do not overlap;
(2) the target and the radar detection range are not overlapped;
(3) supposing that the unmanned aerial vehicle cluster is fixed at a stable flying height, the measurement accuracy and the ground resolution of the magnetic detector can be ensured.
A group of UAVs detects dynamic and static targets in a large-scale unstructured environment containing obstacles, and the quality of this process is measured by a suitable objective function, wherein the perception radius of the detector changes in real time with the environment and the targets. The overall quality is optimized herein by minimizing the time required to find given static targets, or by maximizing the average number of dynamic targets found within a certain search time, with the expected observation time depending on the task objective.
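For illustration, the two optimization criteria described above can be expressed as a single task-dependent objective; the sketch below and the names detection_times and search_time are assumptions introduced here, not quantities defined by the patent.

```python
def search_objective(detection_times, search_time, static_task=True):
    """Illustrative objective for the UAV search task (assumed formulation).

    detection_times: times at which targets were found during the search.
    search_time:     total search time allotted to the task.
    """
    if static_task:
        # Static targets: minimize the time needed to find the given targets
        # (returned as a cost, so smaller is better).
        return max(detection_times) if detection_times else float("inf")
    # Dynamic targets: maximize the average number of targets found
    # within the given search time (larger is better).
    return len(detection_times) / search_time
```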
The combat behavior model corresponds to a multi-agent artificial neural network; it is the core that generates intelligence and the object of reinforcement learning training.
S102, training the combat behavior model by adopting a reinforcement learning training environment until the model converges to obtain an artificial intelligence behavior model;
the reinforcement learning training environment is an environment set according to a combat simulation engine, such as: the combat simulation engine is a large environment, and the reinforcement learning training environment is a small environment made by extracting necessary factors in the large environment. And mapping the combat behavior model into multiple intelligent agents, and training the multiple intelligent agents to obtain an optimized artificial intelligence behavior model. In the present application, 3 agents are used, and it is understood that 4, 5, and 6 agents may also be used, which is not limited herein.
And S103, training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
The artificial intelligence behavior model obtained after the small-environment pre-training is put into the large environment for compensation training again to obtain the optimization model. Simulation environments are thus constructed from simple to complex, which improves the convergence speed of reinforcement learning.
The working principle of the target detection and allocation method based on multi-agent reinforcement learning is as follows: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model to convergence by adopting the reinforcement learning training environment to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model. The training is divided into two stages, small-environment pre-training and large-environment compensation training, which improves the adaptability of the artificial intelligence behavior model.
In some embodiments, a combat simulation engine is mapped to a reinforcement learning training environment using the MADDPG algorithm.
Preferably, the mapping of the combat simulation engine to the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
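Purely for illustration, the four mapped modules enumerated above can be grouped into one training agent as sketched below; the class and method names (sense, store, select, execute) are assumptions, not interfaces defined by the patent.

```python
class WarGameAgent:
    """One combat unit mapped into the reinforcement learning training environment."""

    def __init__(self, perception, decision, action, memory):
        self.perception = perception  # mapped from the engine's perception model
        self.decision = decision      # mapped from the engine's decision model
        self.action = action          # mapped from the engine's action model
        self.memory = memory          # mapped from the engine's memory model

    def step(self, env_info):
        situation = self.perception.sense(env_info)  # current battlefield situation
        self.memory.store(situation)                 # store the situation
        chosen = self.decision.select(situation)     # choose an action for this situation
        return self.action.execute(chosen)           # execute the selected action
```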
In the present application, because the reinforcement learning training environment differs from the combat simulation engine in its runtime environment and programming language and is therefore difficult to integrate directly, the combat simulation engine is mapped to the reinforcement learning training environment, and the state-of-the-art multi-agent reinforcement learning algorithm MADDPG is adopted; the algorithm framework is shown in figure 3. Because the data volume and computation volume of a real simulation engine are huge, a simplified battlefield environment, referred to as the war game deduction small environment, is first built externally on the basis of OpenAI's MPE. This small environment can provide simple geographic information data and can also generate simple deduction process data.
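One possible way to define such a small environment is sketched below, assuming that OpenAI's multiagent-particle-envs (MPE) package is installed; the scenario contents are placeholders and do not reproduce the combat models of the engine.

```python
# Assumes OpenAI's multiagent-particle-envs (MPE) package.
import numpy as np
from multiagent.core import World, Agent, Landmark
from multiagent.scenario import BaseScenario

class WarGameSmallScenario(BaseScenario):
    def make_world(self):
        world = World()
        # Three agents, as in the embodiment; landmarks stand in for targets.
        world.agents = [Agent() for _ in range(3)]
        for i, agent in enumerate(world.agents):
            agent.name = f"agent_{i}"
            agent.collide = True
        world.landmarks = [Landmark() for _ in range(3)]
        self.reset_world(world)
        return world

    def reset_world(self, world):
        # Simple geographic information: random positions in a unit square.
        for entity in world.agents + world.landmarks:
            entity.state.p_pos = np.random.uniform(-1.0, +1.0, world.dim_p)
            entity.state.p_vel = np.zeros(world.dim_p)

    def reward(self, agent, world):
        # Placeholder; the reward shaping used in the embodiment is described below.
        return 0.0

    def observation(self, agent, world):
        # Relative positions of the landmarks with respect to the agent.
        rel = [lm.state.p_pos - agent.state.p_pos for lm in world.landmarks]
        return np.concatenate([agent.state.p_vel] + rel)
```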
In some embodiments, the training of the combat behavior model to model convergence by using a reinforcement learning training environment to obtain an artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Specifically, the training is divided into two stages, and pre-training in the small environment is performed first. The agent is placed in the reinforcement learning training environment, and the perception agent module calls a simulated sensor and a simulated communication interface to obtain environment information, determines the current battlefield situation, and stores it in the memory agent module. The sensor is, for example, a radar that can obtain the positions of teammates, the positions of enemies, and the like. The agent then moves by selecting an action, such as moving left or right, according to the positional relationship. The reinforcement learning training environment gives feedback to the agent according to the action result and determines a reward function so as to optimize the agent.
The reward function gives a reward of 100 and a penalty of 100, the penalty being given when the agent has a collision.
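A sketch of a reward of this shape is shown below; the patent text only fixes the magnitude 100 and the collision penalty, so the condition for granting the positive reward (here, detecting a target) is an assumption.

```python
def step_reward(target_detected: bool, collided: bool) -> float:
    """Illustrative reward: reward of 100, penalty of 100 on collision."""
    r = 0.0
    if target_detected:  # assumed positive-reward condition (not stated in the text)
        r += 100.0
    if collided:         # penalty given when the agent has a collision
        r -= 100.0
    return r
```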
And continuously optimizing until the intelligent agent converges, and outputting an artificial intelligence behavior model.
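The small-environment pre-training loop described above can be summarized by the following sketch; env, agents and replay_buffer are assumed objects with assumed interfaces, and the convergence test is simplified to a reward threshold for illustration.

```python
def pretrain(env, agents, replay_buffer, max_episodes=1000, reward_threshold=None):
    """Illustrative pre-training loop in the small environment (assumed interfaces)."""
    for episode in range(max_episodes):
        observations = env.reset()                       # initialize the agents
        done, episode_reward = False, 0.0
        while not done:
            # perception -> decision -> action for every agent
            actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
            next_observations, rewards, dones, _ = env.step(actions)
            # the environment feeds the action result back for optimization
            replay_buffer.add(observations, actions, rewards, next_observations, dones)
            for agent in agents:
                agent.update(replay_buffer)              # MADDPG-style parameter update
            observations = next_observations
            episode_reward += sum(rewards)
            done = all(dones)
        if reward_threshold is not None and episode_reward >= reward_threshold:
            break                                        # treated as converged (simplified)
    return agents                                        # the artificial intelligence behavior model
```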
The above is the simulation training process for one engagement. After training on many samples, the combat behavior model gradually converges, producing an artificial intelligence behavior model whose opponent is a reactive behavior model. Because the opponent in the first training stage is a traditional reactive behavior model whose behavior logic is relatively fixed, a second training stage is also needed in order to enlarge the training sample space and improve the adaptability of the artificial intelligence behavior model.
In some embodiments, the training of the artificial intelligence behavior model with the combat simulation engine to output an optimization model includes:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the artificial intelligence behavior model converges.
The artificial intelligence behavior model is then placed into the large environment for compensation training. Because the data acquired in the small-environment pre-training stage is relatively error-prone and the data processing in that stage is not complete enough, a real combat simulation engine is adopted as the environment for re-training after the small-environment pre-training. The perception model, decision model, memory model, and action model of the combat simulation engine are used for training; the training process is the same as that in the reinforcement learning training environment and is not described again here.
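How the experience learned in the small environment might be carried over into the second training stage is sketched below; PyTorch-style state dictionaries, the actor attribute and the checkpoint path are assumptions introduced for illustration.

```python
import torch

def inherit_experience(pretrained_agents, engine_agents, checkpoint="small_env_pretrained.pt"):
    """Illustrative hand-over of small-environment experience to the engine stage."""
    # Save the policy networks learned during small-environment pre-training.
    torch.save([agent.actor.state_dict() for agent in pretrained_agents], checkpoint)
    # Load them as the starting point for compensation training in the
    # combat simulation engine (the large environment).
    states = torch.load(checkpoint)
    for agent, state in zip(engine_agents, states):
        agent.actor.load_state_dict(state)
    return engine_agents
```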
Specifically, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
The end of the training process of the present application falls into two categories: one is reaching the maximum running time, for example, running at most one thousand steps, which can also be understood as one thousand frame periods; the other is reaching the expected optimization goal.
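The two end conditions can be combined into one check, as in the following illustrative snippet; the threshold values are examples and not values taken from the patent.

```python
def should_stop(step, episode_reward, max_steps=1000, target_reward=None):
    """Stop when the maximum running time (in steps / frame periods) is reached
    or when the expected optimization goal has been achieved."""
    if step >= max_steps:
        return True
    if target_reward is not None and episode_reward >= target_reward:
        return True
    return False
```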
Preferably, the reinforcement learning training environment utilizes the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Preferably, the combat behavior model employs a multi-agent artificial neural network.
As shown in fig. 4, an embodiment of the present application provides a target detection and allocation device based on multi-agent reinforcement learning, including:
the construction module 401, used for constructing a combat behavior model and a reinforcement learning training environment;
an obtaining module 402, configured to train the combat behavior model to model convergence by using a reinforcement learning training environment, and obtain an artificial intelligence behavior model;
and an output module 403, configured to train the artificial intelligence behavior model by using a combat simulation engine, and output an optimization model.
The working principle of the target detection and allocation device based on multi-agent reinforcement learning provided by the present application is as follows: the construction module 401 constructs a combat behavior model and a reinforcement learning training environment; the obtaining module 402 trains the combat behavior model to convergence by using the reinforcement learning training environment and obtains an artificial intelligence behavior model; and the output module 403 trains the artificial intelligence behavior model by using a combat simulation engine and outputs an optimization model.
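A minimal sketch of how the three modules of the device could be organized in code is given below; the class name and the three callables are illustrative assumptions that merely mirror the description above.

```python
class TargetDetectionAndAllocationDevice:
    """Device composed of a construction module, an obtaining module and an output module."""

    def __init__(self, build_fn, pretrain_fn, engine_train_fn):
        self._build_fn = build_fn                # construction module behaviour
        self._pretrain_fn = pretrain_fn          # obtaining module behaviour
        self._engine_train_fn = engine_train_fn  # output module behaviour

    def run(self):
        model, small_env, engine = self._build_fn()      # construct model and environments
        ai_model = self._pretrain_fn(model, small_env)   # train to convergence in the small environment
        return self._engine_train_fn(ai_model, engine)   # compensation training; output the optimization model
```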
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the target detection and distribution method based on multi-agent reinforcement learning provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the invention provides a target detection and allocation method and device based on multi-agent reinforcement learning, which integrates the reinforcement learning algorithm MADDPG into a war game deduction system, constructs simulation environments from simple to complex, improves the convergence speed of reinforcement learning, and effectively solves the problem of slow convergence when an agent is optimized in the war game deduction system. The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent yet cooperative agent. After acting, each agent leaves its own pheromone, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction between the agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve the goal well according to the learned strategies. To address the convergence speed of the multiple agents during MADDPG training, the invention takes the MPE (multi-agent particle environment) developed by OpenAI as a basis, removes most of the mathematical operations of the combat models, and retains most of the war game simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the war game simulation and deduction system, which then trains again, thereby effectively solving the problem of slow convergence when the agents are optimized.
It is to be understood that the embodiments of the method provided above correspond to the embodiments of the apparatus described above, and the corresponding specific contents may be referred to each other, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.