Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide a spacecraft morphology control method based on an evolutionary algorithm and reinforcement learning. Another object of the present invention is to provide a spacecraft morphology control system based on the evolutionary algorithm and reinforcement learning that implements the above method.
In order to achieve the above purpose, the spacecraft morphology control method based on the evolutionary algorithm and reinforcement learning of the invention specifically comprises the following steps:
S1, establishing an initial population, wherein the individuals in the population are spacecraft of different forms, namely spacecraft with different combinations of functional module types and module quantities;
S2, performing inner-loop initialization learning training on all individuals in the population, and calculating the fitness value of each individual;
S3, selecting the individuals with higher fitness values to form an elite population;
S4, performing uniform crossover and single-point mutation on the elite population by using the crossover and mutation operations of a genetic algorithm to generate elite offspring;
S5, inner-loop reinforcement learning, namely performing learning training on the elite offspring obtained in step S4;
S6, morphological evaluation, namely analyzing the task completion of the spacecraft in the current task scene to obtain a comprehensive morphological evaluation result, taking the evaluation result as the fitness value of the spacecraft morphology and as the evaluation basis for the outer-loop evolution;
S7, forming the optimal individual, namely adding the elite offspring trained by the inner-loop reinforcement learning to the original population, sorting the individuals in the population by fitness value, eliminating the individuals with the lowest fitness until the number of individuals in the population equals the initial population size, and outputting the optimal individual when the outer loop reaches a set number of evolutionary generations.
Further, steps S1, S2, S3, S4 and S7 constitute the outer-loop morphological evolution process, and steps S5 and S6 constitute the inner-loop reinforcement learning process.
Further, the inner loop comprises an agent, a sensor, a controller and an actuator, wherein the agent is the object of study, namely an individual in the population, and the sensor, the controller and the actuator implement the sensing and control actions of the agent.
Further, the reward value obtained in the inner-loop learning training process is used to calculate the fitness values of the individuals in the population and is output to the outer loop.
Further, the outer loop evolves the population generation by generation according to the fitness values of the individuals, thereby realizing the evolution of the spacecraft morphology.
Further, in step S1, an initial population set is defined for the genetic algorithm, the initial population comprising a set number of individuals.
Further, in step S3, a number of individuals are randomly selected from the initial population to hold tournaments, with several groups of tournaments in total; the winner of each group tournament, i.e. the individual with the highest fitness value, is used as a parent, and the winners together constitute the elite population.
Further, in step S4, elite offspring are generated from the elite population by uniform crossover and single-point mutation, using the crossover and mutation operators of an elitist genetic algorithm during population iteration.
Further, in step S5, the inner-loop reinforcement learning stage employs a nested dual-PPO algorithm to perform learning training on the elite offspring.
The spacecraft morphology control system based on the evolutionary algorithm and the reinforcement learning is used for implementing the spacecraft morphology control method based on the evolutionary algorithm and the reinforcement learning.
The method fully combines the characteristics of the space environment and the task requirements, and realizes autonomous generation of the morphology of the modular spacecraft by continuously alternating outer-loop morphological evolution and inner-loop learning training, based on an inner-and-outer-loop algorithm architecture of deep evolutionary reinforcement learning.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, unless explicitly stated or limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected, mechanically connected, electrically connected, directly connected, indirectly connected via an intervening medium, or in communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
Specific embodiments of the present invention are described in detail below with reference to fig. 1-6. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
The spacecraft morphology control method and system based on the evolutionary algorithm and reinforcement learning use an inner-and-outer-loop algorithm framework of deep evolutionary reinforcement learning, combining the evolutionary algorithm and reinforcement learning to realize modular spacecraft morphology optimization for different task scenarios; the structural framework diagram is shown in figure 1.
Wherein, an initial population set is defined for the genetic algorithm, whose elements are the individuals of the population; each individual represents a spacecraft of a different form, namely a spacecraft with a different combination of functional module types and quantities, and the initial population contains a set number of individuals in total.
It is clear that the invention defines the morphology of the spacecraft as a configuration, namely a spacecraft formed by combining different functional modules in different quantities.
The outer loop realizes the generation-by-generation evolution of the population:
Firstly, an initial population is established; then all individuals in the population undergo inner-loop initialization learning training, and the fitness value of each individual is calculated. Here, the fitness value is the result of the fitness function, a quantitative index measuring the quality of an individual, determined by the morphological evaluation rules set forth later. Individuals with higher fitness values are then selected to form the elite population. Uniform crossover and single-point mutation are applied to the elite population, using the crossover and mutation operations of the genetic algorithm, to generate elite offspring.
The inner loop realizes the lifetime evolution process of the individuals in the population:
The inner loop comprises an agent, a sensor, a controller and an actuator. The agent is the object of study, namely a population individual; the sensor, the controller and the actuator implement the sensing and control behaviors of the agent.
Firstly, a training environment is set up for the spacecraft morphology given by the outer loop; the population individuals continuously interact with the task scene, an optimal controller is trained according to the set morphological evaluation rules, configuration and attitude-orbit control strategies are output, and the current-generation optimal configuration is formed according to the reward value obtained for completing the task. The reward value obtained in the inner-loop learning training process is used to calculate the fitness values of the population individuals and is output to the outer loop. The outer loop then evolves the population generation by generation according to these fitness values, realizing the evolution of the spacecraft morphology.
The spacecraft morphology control method based on the evolutionary algorithm and reinforcement learning of the invention comprises the following specific steps:
① Initial population generation, ② inner-loop initial learning training, ③ acquisition of the elite population, ④ generation of elite offspring, ⑤ inner-loop reinforcement learning, ⑥ morphological evaluation, ⑦ formation of the optimal individual. Steps ①, ③, ④ and ⑦ belong to the outer-loop morphological evolution algorithm, and steps ②, ⑤ and ⑥ belong to the inner-loop reinforcement learning algorithm.
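For readability, a minimal Python sketch of the alternation between the outer-loop evolution and the inner-loop learning is given below. All names, quantity bounds and the random stand-in for the inner-loop training are illustrative assumptions, not the disclosed implementation; the individual steps are detailed in the following subsections.

```python
import random

MODULE_TYPES = 10          # assumed: ten functional module types, as listed below
MAX_PER_TYPE = 4           # hypothetical per-type quantity limit

def random_individual():
    """Hypothetical individual: a vector of module quantities (the morphology variables)."""
    return {"modules": [random.randint(0, MAX_PER_TYPE) for _ in range(MODULE_TYPES)],
            "fitness": None}

def inner_loop_training(ind):
    """Stand-in for inner-loop RL training and morphological evaluation (returns a fitness value)."""
    return random.random()  # placeholder; a real system returns the reward-based fitness

def tournament_selection(population, n_elites, k=3):
    """Each tournament compares k random individuals; the fittest becomes an elite parent."""
    return [max(random.sample(population, k), key=lambda i: i["fitness"]) for _ in range(n_elites)]

def crossover_and_mutate(elites):
    """Stand-in for uniform crossover and single-point mutation (see the operator sketches below)."""
    children = []
    for parent in elites:
        child = {"modules": parent["modules"][:], "fitness": None}
        locus = random.randrange(MODULE_TYPES)
        child["modules"][locus] = random.randint(0, MAX_PER_TYPE)
        children.append(child)
    return children

def evolve_morphology(pop_size=20, generations=10, elite_size=6):
    population = [random_individual() for _ in range(pop_size)]           # step 1
    for ind in population:
        ind["fitness"] = inner_loop_training(ind)                         # step 2
    for _ in range(generations):                                          # outer loop
        elites = tournament_selection(population, elite_size)             # step 3
        offspring = crossover_and_mutate(elites)                          # step 4
        for child in offspring:
            child["fitness"] = inner_loop_training(child)                 # steps 5-6
        population = sorted(population + offspring,
                            key=lambda i: i["fitness"], reverse=True)[:pop_size]  # step 7
    return population[0]                                                  # optimal individual

print(evolve_morphology()["modules"])
```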
① Generating an initial population:
In the initial stage, the modular spacecraft is first modeled according to the different functional modules and the different module quantities. A vector n is defined to represent the functional modules inside the spacecraft and their quantities, as follows:
n = [n_1, n_2, n_3, n_4, n_5, n_6, n_7, n_8, n_9, n_10]^T (1);
wherein n_i represents the number of functional modules of each type in the modular spacecraft; specifically, n_1 represents the number of attitude and orbit control modules, n_2 the number of propulsion modules, n_3 the number of energy modules, n_4 the number of management and control modules, n_5 the number of computing modules, n_6 the number of communication modules, n_7 the number of optical sensing modules, n_8 the number of radar sensing modules, n_9 the number of electronic positioning modules, and n_10 the number of electromagnetic interference modules. According to the functional modules in the spacecraft, their quantities and the connection constraints among them, population individuals are randomly generated, and the initial population of the modular spacecraft is thus established.
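As an illustration of step ①, the sketch below randomly generates the module-quantity vector of one individual; the quantity bounds and the requirement of at least one attitude and orbit control module are assumptions made purely for illustration and are not taken from the text.

```python
import random

# Hypothetical bounds on the quantity of each functional module type; the real
# constraints (including inter-module connection constraints) are mission-specific.
MODULE_TYPES = ["attitude_orbit_control", "propulsion", "energy", "management",
                "computing", "communication", "optical_sensing", "radar_sensing",
                "electronic_positioning", "em_interference"]
MIN_COUNT = {"attitude_orbit_control": 1}   # assume at least one AOCS module
MAX_COUNT = 4                               # assumed upper bound per module type

def random_morphology():
    """Generate one individual: the vector n of module quantities (equation (1))."""
    return [random.randint(MIN_COUNT.get(m, 0), MAX_COUNT) for m in MODULE_TYPES]

def initial_population(size):
    return [random_morphology() for _ in range(size)]

population = initial_population(20)
print(population[0])   # e.g. [1, 3, 2, 0, 4, 1, 2, 0, 1, 3]
```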
② Inner-loop initial learning training:
The individuals in the initial population are sent to the inner loop to interact with the task scene, and initial learning training is performed to obtain the fitness value of each individual in the population. In the inner-loop training process, a multi-process mode is used to assign an independent process to the learning training of each individual, realizing parallel computing and improving computational efficiency. The individual fitness function is expressed as:
f = w · Σ_{i=1}^{M} ( R_i / R_i^max ) (2);
wherein M represents the number of typical task scenarios, R_i represents the reward value obtained by the spacecraft through inner-loop reinforcement learning in task scenario i, R_i^max represents the upper threshold of the reward value of task scenario i, and w represents the standard weight of the reward function, used to unify the magnitudes of the reward value and the fitness value.
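A minimal sketch of this fitness computation, assuming the fitness is a weighted sum of the per-scenario rewards normalized by their upper thresholds; the exact functional form used by the invention is not reproduced here.

```python
def fitness(rewards, reward_thresholds, weight=1.0):
    """Aggregate per-scenario inner-loop rewards into one fitness value.

    rewards[i]           -- reward obtained by the spacecraft in task scenario i
    reward_thresholds[i] -- assumed upper threshold of the reward in scenario i
    weight               -- standard weight unifying reward and fitness magnitudes
    """
    return weight * sum(r / r_max for r, r_max in zip(rewards, reward_thresholds))

# Example: three typical task scenarios
print(fitness(rewards=[80.0, 55.0, 120.0], reward_thresholds=[100.0, 100.0, 150.0]))
```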
In the inner-loop reinforcement learning stage, a nested dual-PPO algorithm is adopted to perform learning training on all individuals of the initial population. The PPO algorithm adopts an Actor-Critic architecture to realize the spacecraft configuration and attitude-orbit control strategies. The Actor model uses a neural network to fit the control strategy function of the spacecraft, parameterized by the Actor network parameters to be optimized. The Actor network takes the environmental state information of the task scene as input and outputs the spacecraft configuration-transformation and maneuver strategy. The Critic model likewise fits an evaluation function with a neural network parameterized by its own network parameters, and its output is used to evaluate the policy of the Actor. The PPO algorithm needs to sample a large amount of data to form an experience buffer for training the neural networks and updating the policy, while a single Actor network cannot train and update its parameters while performing the sampling task. Therefore, a separate sampling network with its own parameters is designed specifically for experience sampling, and the Actor network is only responsible for continuously updating its parameters and policy using the experience data.
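The separation between the sampling network and the learning Actor network can be sketched as follows (PyTorch, illustrative only, with assumed state and action dimensions): the sampler holds a frozen copy of the Actor's weights while it fills the experience buffer, and is re-synchronized after each update phase.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative policy network: task-scene state in, action parameters out."""
    def __init__(self, state_dim=32, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))

    def forward(self, state):
        return self.net(state)

actor = Actor()                       # learning network: parameters are optimized from the buffer
sampler = copy.deepcopy(actor)        # sampling network: used only to collect experience
for p in sampler.parameters():
    p.requires_grad_(False)           # the sampler is never trained directly

# ... after each PPO update phase, refresh the sampler with the latest policy:
sampler.load_state_dict(actor.state_dict())
```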
The PPO algorithm takes as its objective cost function the expectation of the dynamically weighted advantage function, which is to be maximized, as follows:
L(θ) = E_t [ π_θ(a_t|s_t) / π_θold(a_t|s_t) · A_t ] (3);
wherein s_t and a_t respectively denote, at the current time step t of the task, the environmental state of the task scene and the action taken by the Actor network; π_θ(a_t|s_t) denotes, under the strategy with the current parameters θ, the probability that the task scene transitions from state s_t to the next state s_{t+1} under the action a_t of the spacecraft, and π_θold(a_t|s_t) denotes the corresponding probability under the parameters θold. The objective of the optimization function is to maximize the expectation of the advantage function A_t under the probability weighting π_θ/π_θold, wherein the advantage function can be expressed as:
A_t = r_t + γ·V(s_{t+1}) - V(s_t) (4);
wherein r_t represents the environmental reward obtained by the spacecraft at time step t through the reinforcement learning reward function, γ represents the reward discount factor, V(s_{t+1}) represents the expected cumulative discounted reward obtained from state s_{t+1}, and V(s_t) represents the expected cumulative discounted reward obtained from state s_t.
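Under the definitions of equations (3) and (4), the one-step advantage and the probability-ratio-weighted surrogate can be sketched as follows for a single sampled transition (plain Python; batching and the additional terms used in practical PPO implementations are omitted).

```python
import math

def advantage(reward, value_s, value_s_next, gamma=0.99):
    """Equation (4): A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * value_s_next - value_s

def ppo_surrogate(logp_new, logp_old, adv):
    """Equation (3): probability-ratio-weighted advantage for one sampled step."""
    ratio = math.exp(logp_new - logp_old)   # pi_theta(a|s) / pi_theta_old(a|s)
    return ratio * adv                      # maximize the expectation of this quantity

# Example for a single transition
a = advantage(reward=1.2, value_s=5.0, value_s_next=5.5)
print(ppo_surrogate(logp_new=-0.9, logp_old=-1.1, adv=a))
```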
The reinforcement learning training framework nested with the dual-PPO algorithm is shown in figure 3. The individual vector n is input into the inner-loop configuration reinforcement learning. The functional modules corresponding to the elements of the vector (n_1 attitude and orbit control modules, n_2 propulsion modules, ..., n_10 electromagnetic interference modules) are processed in order, and each module is placed at a randomly selected position in sequence according to the module connection constraints and the front-to-back installation order (attitude and orbit control, propulsion, and so on). Specifically, a single attitude and orbit control module is used as the initial unit; the set of connection positions of the current configuration that satisfy the connection constraints is solved iteratively, and a new module is added at a randomly selected position to update the configuration, until all functional modules have been traversed. This method yields an initial configuration with connectivity, rationality and randomness.
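The iterative placement of modules described above can be sketched as follows; a unit cubic grid with face-adjacency as the connection rule is an assumption made purely for illustration, and module-type-specific connection constraints are not modeled here.

```python
import random

NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def free_attachment_sites(occupied):
    """All grid cells adjacent to the current configuration that are still empty."""
    sites = set()
    for (x, y, z) in occupied:
        for dx, dy, dz in NEIGHBOR_OFFSETS:
            cell = (x + dx, y + dy, z + dz)
            if cell not in occupied:
                sites.add(cell)
    return list(sites)

def assemble_configuration(module_list):
    """Place modules one by one at random feasible positions, starting from an AOCS module.

    module_list -- module type names in installation order, e.g. the expansion of vector n
    """
    occupied = {(0, 0, 0): module_list[0]}          # initial unit: one attitude/orbit control module
    for module in module_list[1:]:
        site = random.choice(free_attachment_sites(occupied))   # connection space of the configuration
        occupied[site] = module                                 # add the module; the configuration stays connected
    return occupied

config = assemble_configuration(["aocs", "aocs", "propulsion", "energy", "communication"])
print(config)
```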
The configuration strategy output by the Actor1 network of the configuration PPO algorithm is a set of action sequences, which then transform the initial configuration. The action sequence specifies the flip direction of each module. Each action is processed in turn: the motion space of the corresponding module is calculated, and if the flip is feasible (within the motion space) and the configuration after the flip satisfies connectivity and the constraints, the flip is executed and the configuration is updated. The process is iterated on the new configuration until the sequence has been processed, yielding the training configuration.
The training configuration is input into the attitude and orbit control reinforcement learning, and the attitude and orbit control strategy under a typical task scenario is obtained through reinforcement learning. The training configuration completes its state transitions by executing the attitude and orbit control strategy output by the Actor2 network, and obtains a new attitude and orbit control reinforcement learning environment state and the corresponding reward value. The reward function of the attitude and orbit control reinforcement learning consists of action rewards, boundary penalties, success rewards and failure penalties. The Critic2 network evaluates the control strategy of Actor2 according to the attitude and orbit control reinforcement learning environment state and the reward value, and the network parameters of Actor2 are optimized according to this evaluation, improving the attitude and orbit control performance. The graph of the attitude and orbit control reward function of the invention is shown in figure 4. Meanwhile, the attitude and orbit control effect of the controller is evaluated through the efficiency evaluation rules, and the evaluation result is output to the configuration reinforcement learning.
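A schematic composition of the attitude and orbit control reward described above is sketched below; each term is given a simple hypothetical form, and the actual reward shaping shown in figure 4 is not reproduced.

```python
def attitude_orbit_reward(dist_to_target, prev_dist, out_of_bounds, success, failed,
                          k_action=1.0, boundary_penalty=-10.0,
                          success_reward=100.0, failure_penalty=-100.0):
    """Sum of an action reward, a boundary penalty, a success reward and a failure penalty."""
    reward = k_action * (prev_dist - dist_to_target)      # action reward: progress toward the target
    if out_of_bounds:
        reward += boundary_penalty                        # boundary penalty
    if success:
        reward += success_reward                          # success reward
    if failed:
        reward += failure_penalty                         # failure penalty
    return reward

print(attitude_orbit_reward(dist_to_target=4.0, prev_dist=4.5,
                            out_of_bounds=False, success=False, failed=False))
```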
The Critic1 network then evaluates the configuration transformation strategy of Actor1 through the current state information and the evaluation information, and completes the parameter adjustment of the Actor1 network. The configuration reinforcement learning evaluates the strategy of Actor1 through a configuration reward function (in numerical form), whose expression is as follows:
R_c = r_complete + r_time - C_task - C_platform + r_RL ;
wherein r_complete, r_time, C_task, C_platform and r_RL respectively represent the task completion rate reward, the task completion time reward, the task cost penalty, the platform cost penalty and the cumulative reinforcement learning reward obtained during the spacecraft's inner-loop interaction with the typical task scene; their calculation is described under "⑥ Morphological evaluation".
③ Obtaining elite population:
A number of individuals are randomly selected from the initial population to hold tournaments, with several groups of tournaments in total. The tournament size refers to the number of individuals randomly selected for comparison in each competition. The winner of each group tournament, i.e. the individual with the highest fitness value, serves as a parent; together the winners form the elite population.
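A sketch of the tournament selection of step ③, assuming individuals carry precomputed fitness values; the tournament size and the number of groups are illustrative.

```python
import random

def tournament_selection(population, fitness, n_groups, tournament_size=3):
    """Run n_groups tournaments; each winner (highest fitness) becomes an elite parent."""
    elites = []
    for _ in range(n_groups):
        contestants = random.sample(population, tournament_size)   # random individuals per tournament
        winner = max(contestants, key=lambda ind: fitness[ind])
        elites.append(winner)
    return elites

# Example: individuals indexed 0..9 with hypothetical fitness values
pop = list(range(10))
fit = {ind: random.random() for ind in pop}
print(tournament_selection(pop, fit, n_groups=4))
```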
④ Producing elite offspring:
Elite offspring are generated from the elite population by uniform crossover and single-point mutation. Based on the elitist genetic algorithm, crossover and mutation operators are used during population iteration to increase population diversity, enlarge the search space of the algorithm, and allow the spacecraft morphology to evolve fully. The crossover operator adopts the uniform crossover method. The population is traversed; when an individual triggers the crossover operation through the crossover probability, it becomes the first parent, and another individual is then selected as the second parent for uniform crossover. The chromosomes of the individuals are traversed, i.e. each gene locus of their morphology variable matrices is traversed. If a gene locus triggers uniform crossover through the uniform crossover probability, the genes at the corresponding locus of the two parents are exchanged, i.e. the quantities of the corresponding functional modules of the individuals are adjusted. In this process, it is checked whether the module type corresponding to the gene locus is subject to a specific module connection constraint among the model constraints. If a specific module linkage exists, the genes of the specifically linked modules at the parents' crossover locus are exchanged simultaneously, so that the offspring always satisfy the module constraints. If uniform crossover is not triggered, the gene locus is kept unchanged and the operation moves to the next locus, until all gene loci of the morphology variable matrix have been traversed and one uniform crossover is completed.
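A sketch of the uniform crossover described above; the linked-module rule (e.g. optical sensing modules paired with computing modules) is a hypothetical example of a specific module connection constraint, and the crossover probabilities are illustrative.

```python
import random

# Hypothetical linkage: if the locus for one module type is exchanged, its linked
# module type is exchanged at the same time so offspring keep satisfying the constraint.
LINKED_MODULES = {6: 4}   # e.g. locus 6 (optical sensing) linked to locus 4 (computing)

def uniform_crossover(parent_a, parent_b, gene_swap_prob=0.5):
    """Exchange module quantities locus by locus between two morphology vectors."""
    child_a, child_b = parent_a[:], parent_b[:]
    swapped = set()
    for locus in range(len(child_a)):
        if locus in swapped:
            continue
        if random.random() < gene_swap_prob:
            loci = {locus} | ({LINKED_MODULES[locus]} if locus in LINKED_MODULES else set())
            for j in loci - swapped:        # swap linked loci together so the constraint holds
                child_a[j], child_b[j] = child_b[j], child_a[j]
            swapped |= loci
    return child_a, child_b

print(uniform_crossover([1, 2, 0, 3, 1, 0, 2, 1, 0, 1],
                        [2, 1, 1, 0, 2, 1, 0, 0, 1, 2]))
```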
The mutation operator adopts the single-point mutation method. The population is traversed; when an individual triggers the mutation operation through the mutation probability, its chromosome is selected for the operation, i.e. a gene locus in the individual's morphology variable matrix is randomly selected for single-point mutation. During the single-point mutation, the gene at that locus, i.e. the quantity of the corresponding functional module, is adjusted, ensuring random mutation and reconstruction within the range allowed by the module quantity constraints. Meanwhile, it is checked whether the module type corresponding to the gene locus is subject to a specific module connection constraint among the model constraints. If a specific module linkage exists, the module quantity at the locus of the specifically linked module corresponding to the mutated locus is changed synchronously, so that the offspring always satisfy the module constraints. One mutation of the morphology variables is completed through this flow. The complete outer-loop evolutionary algorithm structure is shown in figure 5.
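A corresponding sketch of the single-point mutation; the per-type quantity bound and the rule that linked quantities are kept equal are again illustrative assumptions rather than the disclosed constraint handling.

```python
import random

MAX_COUNT = 4                 # assumed module-quantity upper bound
LINKED_MODULES = {6: 4}       # hypothetical linkage, as in the crossover sketch

def single_point_mutation(individual):
    """Randomly re-draw the module quantity at one locus, within the quantity constraints."""
    child = individual[:]
    locus = random.randrange(len(child))
    child[locus] = random.randint(0, MAX_COUNT)          # random mutation within the allowed range
    if locus in LINKED_MODULES:                           # hypothetical synchronization rule:
        child[LINKED_MODULES[locus]] = child[locus]       # keep linked module quantities equal
    return child

print(single_point_mutation([1, 2, 0, 3, 1, 0, 2, 1, 0, 1]))
```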
⑤ Inner-loop reinforcement learning:
The inner-loop reinforcement learning stage invokes the same nested dual-PPO algorithm as the initial learning training in step ②. Unlike step ②, in this step it is not necessary to input all individuals of the population into the inner-loop reinforcement learning; only the elite offspring obtained in step ④ of the outer loop are learned and trained, and their fitness values are obtained through the fitness function.
⑥ Morphological evaluation:
The task completion of the spacecraft in the current task scene is analyzed, and a comprehensive morphological evaluation method is provided. The reward function value obtained by training in the current task scene and the task execution situation serve as the basis of the comprehensive morphological evaluation, and spacecraft morphology evaluation rules are designed comprising the following indices: the task completion rate, the task completion time, the task reward, the task cost and the platform cost.
The task completion rate evaluates the task execution progress of the spacecraft, and the task completion time is evaluated by the number of training steps of the attitude and orbit control reinforcement learning. Their calculation uses the number of time steps T for which the attitude and orbit control task is executed and the task completion flag d, where d = 1 if the attitude and orbit control reinforcement learning training task is completed and d = 0 otherwise.
The task reward refers to the reward function value of the attitude and orbit control reinforcement learning training, and the task cost evaluates the energy consumption of the spacecraft during task execution; it is expressed as:
C_task = Σ_{i=1}^{T} |F_i| ;
wherein F_i is the control force output by the spacecraft at the i-th step, and T is the number of time steps for which the attitude and orbit control task is executed.
The platform cost evaluates the total number of functional modules required by the spacecraft to complete the task, and is expressed as:
C_platform = Σ_j n_j ;
wherein n_j is the total number of functional modules of each type.
The configuration reward function value is taken as the fitness value of the spacecraft morphology and serves as the evaluation basis for the outer-loop evolution, supporting the morphological evolution.
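The composition of the morphological evaluation can be sketched as follows; the weights and the sign conventions of the penalty terms are assumptions for illustration and do not reproduce the evaluation rules of the invention.

```python
def morphology_evaluation(completed, exec_steps, rl_reward, control_forces, module_counts,
                          w_complete=1.0, w_time=0.01, w_reward=1.0,
                          w_energy=0.001, w_platform=0.1):
    """Combine completion, time, reward, energy cost and platform cost into one fitness value."""
    completion_term = w_complete * (1.0 if completed else 0.0)        # task completion flag d
    time_term = -w_time * exec_steps                                  # longer tasks score lower
    reward_term = w_reward * rl_reward                                # attitude/orbit RL training reward
    energy_cost = -w_energy * sum(abs(f) for f in control_forces)     # task cost: control effort
    platform_cost = -w_platform * sum(module_counts)                  # platform cost: total module count
    return completion_term + time_term + reward_term + energy_cost + platform_cost

print(morphology_evaluation(completed=True, exec_steps=320, rl_reward=85.0,
                            control_forces=[0.4, -0.2, 0.3],
                            module_counts=[1, 2, 1, 1, 1, 1, 1, 0, 0, 0]))
```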
⑦ Optimal individuals:
After the inner-loop reinforcement learning training is completed, the elite offspring with their obtained fitness values are added to the original population where their parents are located, expanding the number of individuals in the population. The individuals in the population are then sorted by fitness value, and the individuals with the lowest fitness are eliminated until the number of individuals in the population equals the initial population size; the optimal individual is output when the outer loop reaches a set number of evolutionary generations. The change process of the outer-loop population fitness values is shown in fig. 6.
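The survivor selection of step ⑦ can be sketched as below; individuals are assumed, for illustration only, to be (morphology, fitness) pairs.

```python
def survivor_selection(population, offspring, pop_size):
    """Merge offspring into the population, sort by fitness, and truncate to the original size."""
    merged = population + offspring
    merged.sort(key=lambda ind: ind[1], reverse=True)   # ind = (morphology, fitness)
    return merged[:pop_size]

# Example with hypothetical fitness values
parents = [("A", 0.62), ("B", 0.48), ("C", 0.31)]
children = [("D", 0.71), ("E", 0.20)]
print(survivor_selection(parents, children, pop_size=3))   # keeps D, A, B
```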
Aiming at the problems of the traditional fixed-structure spacecraft, such as long development cycles, high development cost, functional solidification and difficulty in meeting the flexible response of microsatellites to complex external environments, the invention performs modular modeling of the spacecraft. An inner-and-outer-loop algorithm framework of deep evolutionary reinforcement learning is then adopted: the outer loop realizes the morphological evolution of the spacecraft by topologically recombining the functional modules through the elitist genetic algorithm; the inner loop takes the maximization of the expectation of the dynamically weighted advantage function as the objective cost function and uses reinforcement learning to complete the learning of the spacecraft configuration strategy and attitude-orbit control strategy; the data generated in the inner-loop learning training process are used to calculate the fitness values of the individuals in the population and are output to the outer loop; and the outer loop realizes the morphological evolution of the spacecraft according to the fitness values of the individuals in the population, until the population converges to the optimal individual.
Through the continuous alternation of morphological evolution and learning training, the method realizes the morphology optimization and control of the modular spacecraft, achieving the goals of improving the environmental adaptability, rapid response and task agility of the spacecraft.
Any process or method description in a flowchart of the invention or otherwise described herein may be understood as representing modules, segments or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. These may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device; the computer-readable medium may be any medium that can store, communicate, propagate or transmit a program for use by such a system, apparatus or device, including read-only memory, magnetic disks, optical disks and the like.
In the description herein, reference to the terms "embodiment," "example," etc. means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the different embodiments or examples described in this specification and the features therein may be combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described, it will be understood that the embodiments are illustrative and not to be construed as limiting the invention, and that various changes, modifications, substitutions and alterations may be made by those skilled in the art without departing from the scope of the invention.