Disclosure of Invention
The invention aims to provide a multi-agent task planning method for higher-order thinking development under a hybrid game, so as to solve the problems in the prior art.
To achieve the above purpose, the present invention provides a multi-agent task planning method for higher-order thinking development under a hybrid game, comprising the following steps:
S1, multi-agent reinforcement learning based on two-stage intention sharing under the guidance of learners' higher-order thinking;
S2, multi-agent task coupling planning for learners' higher-order thinking under hybrid-game inspiration, specifically comprising the following steps:
S21, constructing a multi-agent model under hybrid-game inspiration;
S22, multi-agent model strategy optimization oriented to higher-order thinking under hybrid-game inspiration;
S23, multi-agent task coupling planning for learners' higher-order thinking under hybrid-game inspiration;
S3, higher-order thinking development of learners based on the multi-agent migration strategy.
Preferably, step S1 specifically includes:
S11, acquiring and processing data related to the learner's higher-order thinking based on multiple agents. Specifically, through technologies such as speech recognition, image recognition, and machine natural language processing, the learner's thinking features are captured in real time in higher-order-thinking depth discussions involving text, images, and other content types, including but not limited to logical reasoning paths in images, critical-thinking expressions in text, and explanations of innovative viewpoints. The capture formula is as follows:
(1)
where the terms denote, respectively: the higher-order-thinking depth discussion content dataset; the capture function; the speech recognition output; the image recognition output; and the output of the natural language processing module;
The captured higher-order thinking data are preprocessed and then input into a multi-agent system for collaborative processing. Through its internal communication and collaboration mechanism, the multi-agent system performs deep mining and analysis on the higher-order thinking data and extracts key features that reflect the learner's higher-order thinking ability, with the formula:
(2)
where the terms denote, respectively: the set of higher-order thinking feature vectors; the processing function; and the communication and collaboration mechanism of the multi-agent system;
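By way of illustration only, the following Python sketch shows one possible realization of the capture step of formula (1) and the feature-extraction step of formula (2); it assumes the capture function simply bundles the three modality outputs into one record and abstracts the multi-agent communication and collaboration mechanism as a callable. All class, field, and function names are illustrative and are not part of the claimed method.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DiscussionSample:
    """One captured higher-order-thinking discussion record (illustrative fields)."""
    speech_text: str          # output of the speech recognition module
    image_annotations: list   # e.g. reasoning-path elements detected in an image
    nlp_output: dict          # parsed text features (critical remarks, novel viewpoints, ...)

def capture(speech, image, nlp) -> DiscussionSample:
    """Capture function of formula (1): bundle the three modality outputs into one record."""
    return DiscussionSample(speech_text=speech, image_annotations=image, nlp_output=nlp)

def extract_features(dataset: List[DiscussionSample],
                     collaborate: Callable[[List[dict]], List[dict]]) -> List[dict]:
    """Processing function of formula (2): derive per-sample local features and refine
    them through the (abstracted) multi-agent communication/collaboration mechanism."""
    local_views = [{"logic_paths": len(s.image_annotations),
                    "critical_marks": s.nlp_output.get("critical", 0),
                    "novel_points": s.nlp_output.get("novel", 0)}
                   for s in dataset]
    return collaborate(local_views)
```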
S12, multi-agent reinforcement learning based on two-stage intention sharing.
Preferably, step S12 specifically includes:
Each agent maintains its own value network and first independently analyzes the observed higher-order-thinking depth discussion content dataset, generating a preliminary intended action based on its local observation, with the specific formula:
(3)
Subsequently, the first round of intention sharing begins: each agent broadcasts its intended action to all other agents in the system (assuming the total number of agents in the system is N). After the first round of intention sharing is completed, each agent evaluates, based not only on its own local observation but also on the received intended-action information of the other agents, how important this information is to its own current decision;
The potential impact of each piece of received intended-action information on an agent's decision process is measured by computing a V value. Specifically, for a given action k, the maximum and minimum V values of the current agent are computed when another agent provides its action information and takes its different optimal actions, with the specific formulas:
(4)
(5)
To establish a unified quantization standard, these maximum and minimum values are normalized, with the specific formula:
(6)
If the decision importance of one agent's intended-action information to another agent exceeds a preset threshold, the former agent is incorporated into the latter's set of dependent objects;
Then the second round of intention sharing begins: each agent sends its set of dependent objects to all other agents in the system. Through this round of information exchange, the agents construct a more comprehensive dependency graph, a directed graph that intuitively shows the dependencies between the agents.
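As an illustrative sketch of the bookkeeping behind the two sharing rounds, the Python example below measures how much another agent's announced intent can swing the current agent's value (in the spirit of formulas (4)-(6)) and collects the dependent-object sets. Because the exact normalization of formula (6) is not reproduced here, a simple min-max spread is assumed; the value function V, the threshold theta, and all names are illustrative assumptions.

```python
def importance(V, obs_i, intents, j, actions_j):
    """Spread of agent i's value when agent j's announced action is varied over j's
    candidate optimal actions (formulas (4)-(5)), normalized in one plausible way (formula (6))."""
    values = []
    for a_j in actions_j:
        joint = dict(intents)
        joint[j] = a_j                      # replace j's broadcast intent with candidate a_j
        values.append(V(obs_i, joint))
    v_max, v_min = max(values), min(values)
    return (v_max - v_min) / (abs(v_max) + abs(v_min) + 1e-8)

def build_dependency_sets(agents, V_fns, observations, intents, action_spaces, theta=0.1):
    """Round 1 has already broadcast `intents`; round 2 exchanges, for each agent i,
    the set of agents j whose intent importance to i exceeds the threshold theta."""
    depends_on = {i: set() for i in agents}
    for i in agents:
        for j in agents:
            if i != j and importance(V_fns[i], observations[i], intents, j, action_spaces[j]) > theta:
                depends_on[i].add(j)        # edge i -> j of the (possibly cyclic) dependency graph
    return depends_on
```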
Preferably, in step S12, cyclic dependencies in the dependency graph are detected and eliminated by a cyclic-dependency removal algorithm, specifically:
Initialization: for all nodes r, set visited(r) = false and onStack(r) = false;
Here visited(r) is a status flag indicating whether node r has been accessed; in the initialization phase it is assumed that no node has been accessed. onStack(r) is also a status flag, indicating whether node r is currently on the DFS stack; onStack(r) = false indicates that no node is on the stack during initialization;
Depth-first search (DFS): for each node r that has not been accessed, perform the following:
Mark r as accessed, setting visited(r) = true;
Push r onto the DFS stack, setting onStack(r) = true;
For each successor w of r: if w has not been accessed, recursively perform the depth-first search on w; if w has already been accessed and is still on the stack (i.e., visited(w) = true and onStack(w) = true), a circular dependency is detected;
Pop r from the stack, setting onStack(r) = false;
Cycle elimination: once a circular dependency is detected, the dependency relationships are adjusted so that the resulting graph is a directed acyclic graph;
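A minimal Python sketch of the cyclic-dependency removal described above follows; it assumes the dependency graph is stored as a dictionary of successor sets and, as one simple elimination policy, breaks each detected cycle by discarding the back edge that closes it.

```python
def remove_cycles(depends_on):
    """DFS with visited/onStack flags, as in the steps above; returns a directed acyclic graph."""
    visited = {r: False for r in depends_on}
    on_stack = {r: False for r in depends_on}
    dag = {r: set(ws) for r, ws in depends_on.items()}

    def dfs(r):
        visited[r], on_stack[r] = True, True      # mark r accessed and push it onto the DFS stack
        for w in list(dag[r]):
            if not visited[w]:
                dfs(w)                            # recurse into unvisited successors
            elif on_stack[w]:                     # w already on the stack: edge r -> w closes a cycle
                dag[r].discard(w)                 # eliminate the circular dependency
        on_stack[r] = False                       # pop r from the stack

    for r in depends_on:
        if not visited[r]:
            dfs(r)
    return dag
```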
After the acyclic dependency graph is obtained, each agent's decision process follows this rule: if an agent is depended upon by other agents, it does not make a second decision, and its final action is simply its preliminarily generated intended action; conversely, if an agent is not depended upon by any agent, it makes a second decision, combining the intended-action information of the agents it depends on with its own local observation to optimize its action selection.
Preferably, step S21 specifically includes:
Suppose that the learning processes of the other agents affect agent i, with the formula:
(7)
where the terms denote, respectively: the policy update of agent i; the learning rate of agent i; the objective function of agent i; the policy of agent i; the estimate of the other agents' policies; and the gradient operation on the policy update of agent i;
To capture the interactions between agents more accurately, mean-field theory is applied: the average action of the surrounding agents is taken as input to influence and optimize the agent's own policy learning. The policy update formula is:
(8)
where the terms denote, respectively: the policy of the agent before the update; the policy after the update; the state-action value function of the agent; and the average action of the other agents;
To improve the agents' policy effectiveness in a non-stationary environment, a neural network is adopted as the value function approximator in combination with a deep Bayesian policy reuse method, and a distilled policy network is introduced to achieve efficient policy learning and reuse, yielding higher cumulative rewards and better convergence in multiple stochastic game scenarios. The update formula of the value function approximator is as follows:
(9)
where the terms denote, respectively: the parameters of the value function approximator; the current state s; and the agent's immediate reward;
Finally, it is ensured that the agents' algorithm converges to a Nash equilibrium, i.e., each agent's strategy is an optimal response given the strategies of all the other agents, with the formula:
(10)
where the terms denote, respectively: the strategy of agent i; the joint strategy of all agents other than agent i; the utility value of agent i; and the condition that the optimal-response strategy is satisfied for all agents.
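As an illustrative sketch of the mean-field-style update underlying formulas (7)-(10), the Python example below conditions a tabular state-action value on the average action of the other agents and derives a Boltzmann policy from the updated values. The tabular representation, the softmax form, and all parameter names are assumptions made for clarity rather than the exact update of the method.

```python
import numpy as np

ACTIONS = 5  # illustrative size of the discrete action space

def mean_field_q_update(Q, state, a_i, mean_a, reward, next_state, next_mean_a,
                        alpha=0.1, gamma=0.95, beta=1.0):
    """Update Q[(state, action, mean_action)] toward reward + gamma * soft value of the
    next state under the next mean action, then return the softmax policy over the values."""
    key = (state, a_i, mean_a)
    next_qs = np.array([Q.get((next_state, a, next_mean_a), 0.0) for a in range(ACTIONS)])
    m = next_qs.max()
    soft_v = m + np.log(np.exp(beta * (next_qs - m)).sum()) / beta   # log-sum-exp soft value
    Q[key] = Q.get(key, 0.0) + alpha * (reward + gamma * soft_v - Q.get(key, 0.0))
    qs = np.array([Q.get((state, a, mean_a), 0.0) for a in range(ACTIONS)])
    probs = np.exp(beta * (qs - qs.max()))
    return probs / probs.sum()                                       # updated Boltzmann policy
```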
Preferably, step S22 specifically includes:
Strategy optimization is performed on the multi-agent model under hybrid-game inspiration, and efficient joint optimization of the agent strategies and the model is achieved in each training round through a gradient-based optimization strategy. To achieve this optimization objective, the policy function and the auxiliary function are first parameterized; the gradient-based method then updates these parameters according to the following update rules:
(11)
(12)
where the terms denote, respectively: the objective function of the expected cumulative reward; the gradient of the objective function with respect to the policy parameters; the gradient of the objective function with respect to the auxiliary parameters; the update step size of the policy parameters; and the update step size of the auxiliary parameters;
During training, the state and action of each agent at every moment are evaluated through a neural network cost function. First, the learner higher-order thinking data sample tasks are preliminarily processed; then the current states and actions of all tasks are input, and the neural network cost function outputs the estimated value at that moment. To continuously optimize the neural network cost function, a loss function is defined to measure the difference between its output and the actual reward, and gradient descent is then used to update the parameters of the neural network cost function, with the specific update rule:
(13)
where ξ is the update step size of the neural network cost function parameters;
An optimization objective function is constructed to maximize the weighted sum of the expected cumulative internal and external rewards; the objective function is formulated as:
(14)
where the terms denote, respectively: the internal reward obtained when the agent executes the action in the given state; the external reward produced by the auxiliary function; the weight coefficient adjusting the relative importance of the internal and external rewards; and T, the time steps of the training round;
The training process is repeated until the task faced by the agents reaches the predetermined index.
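The following Python (PyTorch) sketch illustrates one possible form of the joint gradient updates of formulas (11)-(13) and the weighted internal/external reward objective of formula (14). Because the exact objective and network interfaces are not reproduced above, a simple policy-gradient surrogate is assumed; the networks, batch fields, and hyperparameter names are illustrative.

```python
import torch

def training_round(policy_net, aux_net, cost_net, batch,
                   eta_theta=3e-4, eta_phi=3e-4, xi=1e-3, lam=0.5):
    """One training round: ascend a surrogate of the weighted internal + external reward
    objective w.r.t. the policy and auxiliary parameters, then fit the neural network
    cost function by gradient descent on its squared error against the observed returns."""
    states, actions, returns = batch["states"], batch["actions"], batch["returns"]

    # policy / auxiliary update (formulas (11)-(12), objective (14))
    probs = policy_net(states)                                          # (B, n_actions), assumed
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    r_int = batch["intrinsic_rewards"]                                  # internal reward
    r_ext = aux_net(states, actions).squeeze(-1)                        # external reward from the auxiliary function
    objective = ((r_int + lam * r_ext) * log_probs).mean()
    pol_opt = torch.optim.SGD(policy_net.parameters(), lr=eta_theta)
    aux_opt = torch.optim.SGD(aux_net.parameters(), lr=eta_phi)
    pol_opt.zero_grad(); aux_opt.zero_grad()
    (-objective).backward()                                             # gradient ascent on the objective
    pol_opt.step(); aux_opt.step()

    # cost-function update (formula (13))
    cost_opt = torch.optim.SGD(cost_net.parameters(), lr=xi)
    loss = ((cost_net(states, actions).squeeze(-1) - returns) ** 2).mean()
    cost_opt.zero_grad(); loss.backward(); cost_opt.step()
    return objective.item(), loss.item()
```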
Preferably, step S23 specifically includes:
First, a multi-task coupling relation matrix is constructed from the outputs of an agent's multi-task policy network. Suppose the policy network of task p has a certain number of outputs and the policy network of task q has its own outputs, where p, q = 1, 2, ..., n and n is the total number of tasks; a multi-task coupling relation matrix is then constructed in which an element is 1 if the o-th output of task p and the corresponding output of task q have a coupling relation, and 0 otherwise;
Then a multi-agent coupling relation matrix is constructed. Whether agents have a cooperative coordination relation is judged according to the actual application conditions, and the cooperative relations among the agents are represented in the multi-agent coupling relation matrix to support the input reconstruction of each agent's task evaluation networks. With N agents in the agent cluster, an N × N multi-agent coupling relation matrix C is constructed, where 1 ≤ i ≤ N and 1 ≤ j ≤ N; if a cooperative coordination relation exists between agent i and agent j, the corresponding element is 1, and 0 otherwise;
According to the multi-task coupling relation matrix and the multi-agent coupling relation matrix C, information closely related to the current task of the current agent is extracted from the two matrices through matrix operations and input into the evaluation network of the policy network, with the formula:
(15)
where I is the input vector of the evaluation network, the operator denotes element-wise multiplication of the multi-task coupling relation matrix and the multi-agent coupling relation matrix, and W is a weight matrix that adjusts the influence of each element on the input of the evaluation network;
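Since formula (15) itself is not reproduced above, the following Python sketch shows one plausible composition of the evaluation-network input from the two coupling matrices: the policy-network outputs are masked by the element-wise product of the task-coupling block and the corresponding agent-coupling entry, then weighted by W and flattened. The shapes and the flattening step are illustrative assumptions.

```python
import numpy as np

def evaluation_input(task_coupling, agent_coupling_ij, raw_outputs, W):
    """Build an evaluation-network input vector from the coupling matrices.
    task_coupling: (m_p, m_q) 0/1 block of the multi-task coupling relation matrix;
    agent_coupling_ij: scalar entry C[i, j] of the multi-agent coupling relation matrix;
    raw_outputs: (m_p, m_q) policy-network outputs; W: weight matrix of the same shape."""
    mask = task_coupling * agent_coupling_ij      # element-wise coupling mask
    return (W * mask * raw_outputs).flatten()     # keep only coupled, weighted outputs

# toy usage: task p has 3 outputs, task q has 4, and agents i and j cooperate (C[i, j] = 1)
B_pq = np.array([[1, 0, 0, 1],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]])
I_vec = evaluation_input(B_pq, 1, np.random.rand(3, 4), np.ones((3, 4)))
```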
After receiving the input vector I, the evaluation network processes it and outputs an evaluation result for the current task. Based on this result, combined with the agent's current state and task target, a heuristic-search task coupling planning algorithm based on a greedy strategy is used to generate a task plan that meets the actual requirements.
Preferably, the heuristic search task coupling planning algorithm specifically comprises:
Initialization: the agent's initial state is set, the task target is G, the current time is t = 0, and the task plan is initialized as the empty set;
Evaluation function: an evaluation function f(S, P, G) is defined to evaluate how difficult or how favorable it is to reach the task target G given the state S and the task plan P, with the formula:
(16)
where the terms denote, respectively: the distance from the current state S to the task target G; the cost of the task plan P; the execution time of the task plan P; and the weight coefficients that adjust the relative importance of each index in the evaluation function;
Heuristic search:
Starting from the current state, generate the set of all next actions;
For each action, obtain the new state reached after the action is executed and update the task plan;
Compute the evaluation function and select the action that makes the evaluation function value optimal as the best action of the current step;
Repeat the above steps until the task target G is reached or no new action can be generated;
Output of the task plan: the finally obtained task plan P is the optimal solution from the initial state to the task target G.
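A minimal Python sketch of the greedy heuristic-search planner described in the steps above follows; it assumes that smaller values of f(S, P, G) are better and that the task model supplies the action-generation, transition, distance, cost, and execution-time callables (all names are illustrative).

```python
def greedy_plan(initial_state, goal, actions_of, step, dist, cost, time_of,
                alpha=1.0, beta=0.1, gamma=0.1, max_steps=1000):
    """Greedy heuristic search: at every step commit to the action minimizing
    f(S, P, G) = alpha * dist(S, G) + beta * cost(P) + gamma * time(P)  (formula (16))."""
    state, plan, t = initial_state, [], 0
    while dist(state, goal) > 0 and t < max_steps:
        candidates = actions_of(state)
        if not candidates:                         # no new action can be generated
            break
        best = None
        for a in candidates:
            new_state = step(state, a)
            new_plan = plan + [a]
            f = alpha * dist(new_state, goal) + beta * cost(new_plan) + gamma * time_of(new_plan)
            if best is None or f < best[0]:
                best = (f, new_state, new_plan)
        _, state, plan = best                      # commit to the best action of this step
        t += 1
    return plan                                    # task plan P from the initial state toward G
```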
Preferably, step S3 specifically includes:
S31, setting a multi-agent migration strategy theoretical framework, which specifically comprises the following steps:
A series of source tasks is first set up and initialized as a set of mutually orthogonal vectors, in which each vector is the representation of one source task; an orthogonalization algorithm is adopted to process the vectors one by one, yielding an orthogonalized task representation set;
Next, the task representation most similar to the new task is searched for in the orthogonalized task representation set and recorded as y. Then a task representation vector z is initialized and the source task representation y is fixed, so as to construct a parameterized forward model that predicts the transition function and the reward function Y(v, c), where v and the next-state variable denote the current state and the next state, respectively, and c denotes the executed action. The task representation vector z is updated by gradient-descent optimization so that the new task is retained in the task space. Task representations are then learned further by modeling the transition function and the reward function, and a population-invariant network structure is designed to handle the interactive and non-interactive actions in the task. For non-interactive actions, a conventional deep network is applied directly to compute the Q value, with the specific formula:
(17)
where the two terms denote, respectively, the Q value between the current state v and the executed action c in a non-interactive setting, and the functional relationship between the current state v and the executed action c in a non-interactive setting;
For interactive actions, a shared network is used: the observation part related to the corresponding entity, concatenated with the task representation z, is taken as input, and the Q-value estimate of the corresponding interactive action is output, with the formula:
(18)
where the term denotes the Q value between the current state v and the executed action c in an interactive setting;
Finally, the computed Q values are combined with the task representation and input into a hybrid network to generate new tasks and perform task coupling planning; the hybrid network comprehensively considers the relevance among tasks to achieve effective migration and utilization of knowledge;
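As an illustrative sketch of the two-branch Q estimation of formulas (17) and (18), the Python (PyTorch) module below scores non-interactive actions from the state alone and scores each interactive action with a shared network applied to the concatenation of the corresponding entity observation and the task representation z. Layer sizes, dimensions, and the concatenated output handed to the hybrid network are assumptions.

```python
import torch
import torch.nn as nn

class TransferQNet(nn.Module):
    """Non-interactive branch: conventional network over the state (formula (17)).
    Interactive branch: a shared network over [entity observation, task representation z] (formula (18))."""
    def __init__(self, state_dim, entity_dim, z_dim, n_noninteractive, hidden=64):
        super().__init__()
        self.non_interactive = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_noninteractive))
        self.shared = nn.Sequential(                       # shared across all entities
            nn.Linear(entity_dim + z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state, entity_obs, z):
        # state: (B, state_dim); entity_obs: (B, E, entity_dim); z: (B, z_dim)
        q_non = self.non_interactive(state)                              # (B, n_noninteractive)
        z_rep = z.unsqueeze(1).expand(-1, entity_obs.size(1), -1)        # broadcast the task representation
        q_int = self.shared(torch.cat([entity_obs, z_rep], dim=-1)).squeeze(-1)   # (B, E)
        return torch.cat([q_non, q_int], dim=-1)           # all action values, passed on to the hybrid network
```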
S32, higher-order thinking development of learners based on the multi-agent migration strategy is realized, specifically:
Based on the multi-agent migration strategy, new tasks are efficiently generated and optimized through the synergy among agents, and deep migration and fusion of knowledge are promoted through planning comparison between source tasks and new tasks. This process not only strengthens the learner's mastery of existing knowledge but also shows a remarkable effect in promoting the development of the learner's higher-order thinking. The source task plan set and the new task plan set are defined, and to quantify the migration potential between a source task and a new task, a migration evaluation model based on task-feature similarity is adopted, with the formula:
(19)
where the terms denote, respectively: a task plan in the source task plan set; the h-th task plan in the new task plan set; the extraction function of each task feature, which maps a task plan to the corresponding feature space; the similarity measure between features, used to compute the similarity between two feature vectors; the weight of each feature, reflecting its importance in the migration evaluation; and d, the total number of features, i.e., the number of task features considered.
The model evaluates the migration potential between a source task plan and a new task plan by computing a weighted sum of their similarities over the respective feature spaces.
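For illustration, the migration evaluation of formula (19) can be sketched in Python as a weighted sum of per-feature similarities between a source task plan and a new task plan; cosine similarity is assumed as the similarity measure, and the feature-extraction functions and weights are supplied by the caller.

```python
import numpy as np

def migration_potential(source_plan, new_plan, feature_fns, weights):
    """Weighted sum over the d task features of the similarity between the two plans
    in each feature space (formula (19)); feature_fns[k] maps a plan to its k-th feature vector."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
    return sum(w * cosine(np.asarray(f(source_plan), dtype=float),
                          np.asarray(f(new_plan), dtype=float))
               for f, w in zip(feature_fns, weights))
```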
Preferably, the orthogonalization algorithm in step S31 is specifically:
a. Initialize the orthogonal vector set as an empty set;
b. For each original vector, compute its inner product with each vector already in the orthogonal vector set to obtain the projection components;
c. Subtract all projection components from the original vector to obtain the orthogonalized vector;
d. Add the orthogonalized vector to the orthogonal vector set;
e. Repeat steps b to d until all original vectors are orthogonalized.
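A minimal Python sketch of steps a to e (classical Gram-Schmidt orthogonalization) follows; the tolerance used to skip nearly dependent vectors is an illustrative assumption.

```python
import numpy as np

def orthogonalize(vectors, eps=1e-10):
    """Steps a-e: for each original vector, subtract its projections onto the vectors
    already in the orthogonal set, then add the result to the set."""
    ortho = []                                            # a. start with an empty orthogonal set
    for v in vectors:
        w = np.asarray(v, dtype=float).copy()
        for u in ortho:
            w -= (np.dot(w, u) / np.dot(u, u)) * u        # b.-c. subtract the projection components
        if np.linalg.norm(w) > eps:                       # skip (near-)linearly dependent vectors
            ortho.append(w)                               # d. add the orthogonalized vector
    return ortho                                          # e. all original vectors processed

# usage: orthogonalize the source-task representations
orthogonal_tasks = orthogonalize([np.array([1.0, 1.0, 0.0]),
                                  np.array([1.0, 0.0, 1.0]),
                                  np.array([0.0, 1.0, 1.0])])
```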
Therefore, with the multi-agent task planning method for higher-order thinking development under a hybrid game, the higher-order thinking features in the collected learner data show a marked upward trend compared with the prior art. In particular, learners' performance improves significantly on key features reflecting higher-order thinking ability, such as problem solving, critical thinking, and innovation. This result provides solid support and a powerful boost for the development of learners' higher-order thinking.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Detailed Description
The following detailed description of the embodiments of the invention, provided in the accompanying drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in Fig. 1, a multi-agent task planning method for higher-order thinking development under a hybrid game is provided:
Learner higher-order thinking refers to the advanced thinking abilities exhibited by a learner, beyond basic cognitive skills (such as memory and understanding), in the course of cognitive development. It covers multiple dimensions such as critical thinking, creative thinking, problem-solving ability, logical reasoning, metacognitive monitoring, and self-regulation, and is the learner's ability to comprehensively use existing knowledge, skills, and strategies to carry out deep analysis, comprehensive judgment, creative problem solving, and innovative thinking when facing complex, unstructured problems. Specifically, the learner's higher-order thinking is embodied in the following aspects:
(1) Critical thinking: the learner can independently think about, screen, analyze, and evaluate information to form his or her own insights and judgments, rather than blindly accepting the views or information of others.
(2) Creative thinking: the learner can break through conventional thinking frameworks, put forward novel and unique ideas, methods, and solutions, and display strong innovation and imagination.
(3) Problem-solving ability: when facing a complex problem, the learner can identify the nature of the problem, formulate a solution, execute it effectively, and evaluate its effect, thereby solving the problem effectively.
(4) Logical reasoning: the learner can reason and deduce from known information and logical rules to reach reasonable conclusions and judgments.
(5) Metacognitive monitoring: the learner can monitor and regulate his or her own cognitive processes, define learning goals and strategies, and reflect on and adjust learning behavior in a timely manner.
(6) Self-regulation: the learner can flexibly adjust his or her mental state and behavior according to learning conditions and changes in the external environment, maintaining a positive attitude and an efficient learning state.
A multi-agent hybrid game is characterized as follows: in an agent system comprising multiple agents with autonomous decision-making capability, each agent adopts a diversified action scheme in its strategy space, including pure strategies and mixed strategies, according to certain rules and constraints. In this process, each agent must consider the direct effect of its own actions while also predicting and responding to the possible reactions of other agents and the potential impact of those reactions on its own benefits. This complex gaming framework, aimed at maximizing the respective benefits, is realized through a dynamic interaction process of mutual competition and cooperation.
The method comprises the following steps:
S1, multi-agent reinforcement learning based on two-stage intention sharing under the guidance of learner high-order thinking;
S11, acquiring and processing relevant data of higher-order thinking of a learner based on multiple agents;
Through technologies such as speech recognition, image recognition, and machine natural language processing, the learner's thinking features are captured in real time in higher-order-thinking depth discussions involving text, images, and other content types, including but not limited to logical reasoning paths in images, critical-thinking expressions in text, and explanations of innovative viewpoints. The capture formula is as follows:
(1)
where the terms denote, respectively: the higher-order-thinking depth discussion content dataset; the capture function; the speech recognition output; the image recognition output; and the output of the natural language processing module;
The captured higher-order thinking data are preprocessed and then input into a multi-agent system for collaborative processing. Through its internal communication and collaboration mechanism, the multi-agent system performs deep mining and analysis on the higher-order thinking data and extracts key features that reflect the learner's higher-order thinking ability, with the formula:
(2)
where the terms denote, respectively: the set of higher-order thinking feature vectors; the processing function; and the communication and collaboration mechanism of the multi-agent system;
S12, multi-agent reinforcement learning based on two-stage intention sharing;
This embodiment performs reinforcement learning on multiple agents based on a two-stage intention-sharing mechanism, aiming to optimize the collaborative decision process of the multi-agent system, as shown in Fig. 2. Each agent maintains its own value network and first independently analyzes the observed higher-order-thinking depth discussion content dataset, generating a preliminary intended action based on its local observation, with the specific formula:
(3)
Subsequently, the first round of intention sharing begins: each agent broadcasts its intended action to all other agents in the system (assuming the total number of agents in the system is N). After the first round of intention sharing is completed, each agent evaluates, based not only on its own local observation but also on the received intended-action information of the other agents, how important this information is to its own current decision;
The potential impact of each piece of received intended-action information on an agent's decision process is measured by computing a V value. Specifically, for a given action k, the maximum and minimum V values of the current agent are computed when another agent provides its action information and takes its different optimal actions, with the specific formulas:
(4)
(5)
To establish a unified quantization standard, these maximum and minimum values are normalized, with the specific formula:
(6)
If the decision importance of one agent's intended-action information to another agent exceeds a preset threshold, the former agent is incorporated into the latter's set of dependent objects;
Then the second round of intention sharing begins: each agent sends its set of dependent objects to all other agents in the system. Through this round of information exchange, the agents construct a more comprehensive dependency graph, a directed graph that intuitively shows the dependencies between the agents.
It should be noted, however, that the initially constructed dependency graph may contain directed loops, i.e., circular dependencies. To solve this problem, this embodiment employs an algorithm that detects and eliminates the cyclic dependencies in the dependency graph, ensuring that the resulting dependency graph is a directed acyclic graph, specifically as follows:
1. Initialization: for all nodes r, set visited(r) = false and onStack(r) = false;
Here visited(r) is a status flag indicating whether node r has been accessed; in the initialization phase it is assumed that no node has been accessed. onStack(r) is also a status flag, indicating whether node r is currently on the DFS stack; onStack(r) = false indicates that no node is on the stack during initialization;
2. Depth-first search (DFS): for each node r that has not been accessed, perform the following:
Mark r as accessed, setting visited(r) = true;
Push r onto the DFS stack, setting onStack(r) = true;
For each successor w of r: if w has not been accessed, recursively perform the DFS on w; if w has already been accessed and is still on the stack (i.e., visited(w) = true and onStack(w) = true), a circular dependency is detected;
Pop r from the stack, setting onStack(r) = false;
3. Cycle elimination: once a circular dependency is detected, the dependency relationships are adjusted so that the resulting graph is a directed acyclic graph;
After the acyclic dependency graph is obtained, each agent's decision process follows this rule: if an agent is depended upon by other agents, it does not make a second decision, and its final action is simply its preliminarily generated intended action; conversely, if an agent is not depended upon by any agent, it makes a second decision, combining the intended-action information of the agents it depends on with its own local observation to optimize its action selection.
S2, multi-agent task coupling planning for learners' higher-order thinking under hybrid-game inspiration;
S21, constructing a multi-agent model under hybrid-game inspiration;
In a multi-agent system, each agent tries to maximize its benefits by optimizing its own strategy; however, due to the interactions between agents, traditional single-agent reinforcement learning methods often struggle to achieve ideal results. This embodiment therefore builds a multi-agent model under hybrid-game inspiration, aiming to realize efficient strategy updating through mutual learning and prediction between agents. Suppose that the learning processes of the other agents affect agent i, with the formula:
(7)
where the terms denote, respectively: the policy update of agent i; the learning rate of agent i; the objective function of agent i; the policy of agent i; the estimate of the other agents' policies; and the gradient operation on the policy update of agent i;
To capture the interactions between agents more accurately, mean-field theory is applied: the average action of the surrounding agents is taken as input to influence and optimize the agent's own policy learning. The policy update formula is:
(8)
where the terms denote, respectively: the policy of the agent before the update; the policy after the update; the state-action value function of the agent; and the average action of the other agents;
To improve the agents' policy effectiveness in a non-stationary environment, a neural network is adopted as the value function approximator in combination with a deep Bayesian policy reuse method, and a distilled policy network is introduced to achieve efficient policy learning and reuse, yielding higher cumulative rewards and better convergence in multiple stochastic game scenarios. The update formula of the value function approximator is as follows:
(9)
where the terms denote, respectively: the parameters of the value function approximator; the current state s; and the agent's immediate reward;
Finally, it is ensured that the agents' algorithm converges to a Nash equilibrium, i.e., each agent's strategy is an optimal response given the strategies of all the other agents, with the formula:
(10)
where the terms denote, respectively: the strategy of agent i; the joint strategy of all agents other than agent i; the utility value of agent i; and the condition that the optimal-response strategy is satisfied for all agents.
S22, multi-agent model strategy optimization oriented to higher-order thinking under hybrid-game inspiration;
Strategy optimization is performed on the multi-agent model under hybrid-game inspiration, and efficient joint optimization of the agent strategies and the model is achieved in each training round through a gradient-based optimization strategy. To achieve this optimization objective, the policy function and the auxiliary function are first parameterized; the gradient-based method then updates these parameters according to the following update rules:
(11)
(12)
where the terms denote, respectively: the objective function of the expected cumulative reward; the gradient of the objective function with respect to the policy parameters; the gradient of the objective function with respect to the auxiliary parameters; the update step size of the policy parameters; and the update step size of the auxiliary parameters;
During training, the state and action of each agent at every moment are evaluated through a neural network cost function. First, the learner higher-order thinking data sample tasks are preliminarily processed; then the current states and actions of all tasks are input, and the neural network cost function outputs the estimated value at that moment. To continuously optimize the neural network cost function, a loss function is defined to measure the difference between its output and the actual reward, and gradient descent is then used to update the parameters of the neural network cost function, with the specific update rule:
(13)
where ξ is the update step size of the neural network cost function parameters;
An optimization objective function is constructed to maximize the weighted sum of the expected cumulative internal and external rewards; the objective function is formulated as:
(14)
where the terms denote, respectively: the internal reward obtained when the agent executes the action in the given state; the external reward produced by the auxiliary function; the weight coefficient adjusting the relative importance of the internal and external rewards; and T, the time steps of the training round;
The training process is repeated until the task faced by the agents reaches the predetermined index.
S23, multi-agent task coupling planning for learners' higher-order thinking under hybrid-game inspiration;
In this embodiment, within the multi-agent model under hybrid-game inspiration, coupling planning is performed on the learner's higher-order thinking data tasks. By constructing a multi-task coupling relation matrix and a multi-agent coupling relation matrix, the coupling relations among the agents and their tasks are accurately described, providing strong support for collaborative decision-making, as shown in Fig. 3.
First, a multi-task coupling relation matrix is constructed from the outputs of an agent's multi-task policy network; it marks the coupled portions of these outputs, reflects the coupling relations among the multiple tasks faced by the current agent, and supports the input reconstruction of each of the agent's task evaluation networks. Suppose the policy network of task p has a certain number of outputs and the policy network of task q has its own outputs, where p, q = 1, 2, ..., n and n is the total number of tasks; a multi-task coupling relation matrix is then constructed in which an element is 1 if the o-th output of task p and the corresponding output of task q have a coupling relation, and 0 otherwise;
Then a multi-agent coupling relation matrix is constructed. Whether agents have a cooperative coordination relation is judged according to the actual application conditions, and the cooperative relations among the agents are represented in the multi-agent coupling relation matrix to support the input reconstruction of each agent's task evaluation networks. With N agents in the agent cluster, an N × N multi-agent coupling relation matrix C is constructed, where 1 ≤ i ≤ N and 1 ≤ j ≤ N; if a cooperative coordination relation exists between agent i and agent j, the corresponding element is 1, and 0 otherwise;
According to the multi-task coupling relation matrix and the multi-agent coupling relation matrix C, information closely related to the current task of the current agent is extracted from the two matrices through matrix operations and input into the evaluation network of the policy network, with the formula:
(15)
where I is the input vector of the evaluation network, the operator denotes element-wise multiplication of the multi-task coupling relation matrix and the multi-agent coupling relation matrix, and W is a weight matrix that adjusts the influence of each element on the input of the evaluation network;
After receiving the input vector I, the evaluation network processes it and outputs an evaluation result for the current task. Based on this result, combined with the agent's current state and task target, a heuristic-search task coupling planning algorithm based on a greedy strategy is used to generate a task plan that meets the actual requirements.
The heuristic search task coupling planning algorithm specifically comprises the following steps:
1. Initialization: the agent's initial state is set, the task target is G, the current time is t = 0, and the task plan is initialized as the empty set;
2. Evaluation function: an evaluation function f(S, P, G) is defined to evaluate how difficult or how favorable it is to reach the task target G given the state S and the task plan P, with the formula:
(16)
where the terms denote, respectively: the distance from the current state S to the task target G; the cost of the task plan P; the execution time of the task plan P; and the weight coefficients that adjust the relative importance of each index in the evaluation function;
3. Heuristic search:
Starting from the current state, generate the set of all next actions;
For each action, obtain the new state reached after the action is executed and update the task plan;
Compute the evaluation function and select the action that makes the evaluation function value optimal as the best action of the current step;
Repeat the above steps until the task target G is reached or no new action can be generated;
4. Output of the task plan: the finally obtained task plan P is the optimal solution from the initial state to the task target G.
S3, high-order thinking development of learners based on multi-agent migration strategies;
S31, setting a multi-agent migration strategy theoretical framework;
In this embodiment, knowledge migration is performed on the learner's higher-order thinking task coupling planning based on the multi-agent migration strategy. A series of source tasks is first set up and initialized as a set of mutually orthogonal vectors, in which each vector is the representation of one source task; an orthogonalization algorithm is adopted to process the vectors one by one, yielding an orthogonalized task representation set;
1. Initialize the orthogonal vector set as an empty set;
2. For each original vector, compute its inner product with each vector already in the orthogonal vector set to obtain the projection components;
3. Subtract all projection components from the original vector to obtain the orthogonalized vector;
4. Add the orthogonalized vector to the orthogonal vector set;
5. Repeat steps 2 to 4 until all original vectors are orthogonalized.
Next, the task representation most similar to the new task is searched for in the orthogonalized task representation set and recorded as y. Then a task representation vector z is initialized and the source task representation y is fixed, so as to construct a parameterized forward model that predicts the transition function and the reward function, where v and the next-state variable denote the current state and the next state, respectively, and c denotes the executed action. The task representation vector z is updated by gradient-descent optimization so that the new task is retained in the task space. Task representations are then learned further by modeling the transition function and the reward function, and a population-invariant network structure is designed to handle the interactive and non-interactive actions in the task. For non-interactive actions, a conventional deep network is applied directly to compute the Q value, with the specific formula:
(17)
where the two terms denote, respectively, the Q value between the current state v and the executed action c in a non-interactive setting, and the functional relationship between the current state v and the executed action c in a non-interactive setting;
For interactive actions, a shared network is used: the observation part related to the corresponding entity, concatenated with the task representation z, is taken as input, and the Q-value estimate of the corresponding interactive action is output, with the formula:
(18)
where the term denotes the Q value between the current state v and the executed action c in an interactive setting;
Finally, the computed Q values are combined with the task representation and input into a hybrid network to generate new tasks and perform task coupling planning; the hybrid network comprehensively considers the relevance among tasks to achieve effective migration and utilization of knowledge;
S32, high-order thinking development of learners based on multi-agent migration strategies;
This embodiment realizes knowledge migration through planning comparison between source tasks and new tasks and promotes the development of the learner's higher-order thinking; in particular, an increase is observed in the key features that reflect the learner's higher-order thinking ability.
Based on the multi-agent migration strategy, new tasks are efficiently generated and optimized through the synergy among agents, and deep migration and fusion of knowledge are promoted through planning comparison between source tasks and new tasks. This process not only strengthens the learner's mastery of existing knowledge but also shows a remarkable effect in promoting the development of the learner's higher-order thinking. The source task plan set and the new task plan set are defined, and to quantify the migration potential between a source task and a new task, a migration evaluation model based on task-feature similarity is adopted, with the formula:
(19)
where the terms denote, respectively: a task plan in the source task plan set; the h-th task plan in the new task plan set; the extraction function of each task feature, which maps a task plan to the corresponding feature space; the similarity measure between features, used to compute the similarity between two feature vectors; the weight of each feature, reflecting its importance in the migration evaluation; and d, the total number of features, i.e., the number of task features considered.
The model evaluates the migration potential between a source task plan and a new task plan by computing a weighted sum of their similarities over the respective feature spaces.
Therefore, with the multi-agent task planning method for higher-order thinking development under a hybrid game, the higher-order thinking features in the collected learner data show a marked upward trend compared with the prior art. In particular, learners' performance improves significantly on key features reflecting higher-order thinking ability, such as problem solving, critical thinking, and innovation, providing solid support and a powerful boost for the development of learners' higher-order thinking.
It should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present invention.