CN119312876B - A multi-agent task planning method for the development of higher-order thinking under hybrid games - Google Patents

A multi-agent task planning method for the development of higher-order thinking under hybrid games

Info

Publication number
CN119312876B
CN119312876B (application CN202411855334.7A)
Authority
CN
China
Prior art keywords
agent, task, representing, function, action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411855334.7A
Other languages
Chinese (zh)
Other versions
CN119312876A (en)
Inventor
韩中美
柯聪聪
黄昌勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU
Priority to CN202411855334.7A
Publication of CN119312876A
Application granted
Publication of CN119312876B
Legal status: Active
Anticipated expiration

Abstract


The present invention discloses a multi-agent task planning method for the development of higher-order thinking under a hybrid game, belonging to the technical field of task coupling planning. The method comprises: S1, multi-agent reinforcement learning based on two-stage intention sharing under the guidance of learners' higher-order thinking; S2, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking, which specifically includes S21, construction of a hybrid-game-inspired multi-agent model, S22, hybrid-game-inspired multi-agent model policy optimization oriented to higher-order thinking, and S23, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking; and S3, embodied development of learners' higher-order thinking based on a multi-agent migration (transfer) strategy. The method provides solid support and a strong impetus for the embodied development of learners' higher-order thinking.

Description

Multi-agent task planning method for higher-order thinking development under hybrid game
Technical Field
The invention relates to the technical field of task coupling planning, and in particular to a multi-agent task planning method for higher-order thinking development under hybrid games.
Background
In today's rapidly developing intelligent age, cultivating higher-order thinking ability is of great importance to learners. Higher-order thinking covers core elements such as critical thinking, creative thinking and problem-solving ability, and is key to learners adapting to complex, changing environments, solving nonlinear problems and achieving innovative breakthroughs. It concerns not only the cognitive development of individuals but also has a profound effect on social progress and technological innovation. Finding an effective method for cultivating higher-order thinking is therefore an important open topic in the field of education.
Multi-agent reinforcement learning is an important branch of artificial intelligence; thanks to its strong adaptability and flexibility, it has great potential for solving complex tasks and optimizing strategies. In a multi-agent system, the agents cooperate or compete with one another, explore the environment together and optimize their own policies to maximize the overall benefit. This mechanism not only matches the complexity of multi-subject interaction in the real world, but also provides an ideal experimental platform for simulating and training higher-order thinking ability.
Hybrid games, an important form in game theory, allow agents to retain a certain randomness in their policy choices, thereby increasing the complexity and uncertainty of the game. Under a hybrid-game framework, an agent must jointly consider the possible actions of opponents, the allocation of its own resources and the trade-off between long-term and short-term benefits, and form a more flexible and robust strategy. This strategy-making process is highly compatible with the critical analysis and creative decision-making found in higher-order thinking, and provides a new perspective and path for its cultivation.
Against this background, the invention provides a multi-agent task planning method for higher-order thinking development under hybrid games. The method builds a multi-agent system that simulates a hybrid game environment, so that learners can continuously exercise and improve their higher-order thinking ability while participating in multi-agent interaction.
In addition, the method focuses on the application of multi-agent migration (i.e., transfer) strategies to promote the self-directed development of learners' higher-order thinking. A migration strategy allows an agent to effectively transfer knowledge and skills learned in one task to another related but different task, thereby speeding up learning and improving learning efficiency. This transfer ability has important practical significance and application value for learners who must quickly invoke existing knowledge and flexibly adjust their strategies when facing new situations and new problems.
Disclosure of Invention
The invention aims to provide a multi-agent task planning method for higher-order thinking development under hybrid games, so as to solve the problems in the prior art.
To achieve the above purpose, the present invention provides a multi-agent task planning method for higher-order thinking development under hybrid games, comprising the following steps:
S1, multi-agent reinforcement learning based on two-stage intention sharing under the guidance of learners' higher-order thinking;
S2, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking, specifically comprising the following steps:
S21, constructing a multi-agent model under the hybrid-game heuristic;
S22, hybrid-game-inspired multi-agent model policy optimization oriented to higher-order thinking;
S23, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking;
S3, embodied development of learners' higher-order thinking based on multi-agent migration strategies.
Preferably, step S1 specifically includes:
S11, multi-agent-based acquisition and processing of data related to learners' higher-order thinking. Specifically, through speech recognition, image recognition, natural language processing and other machine technologies, the thinking features exhibited by learners in text-based and other forms of in-depth higher-order-thinking discussion are captured in real time, including but not limited to logical reasoning paths in images and expressions of critical thinking and explanations of innovative viewpoints in text. The capture formula is:

$D = F(x_{\mathrm{speech}}, x_{\mathrm{image}}, x_{\mathrm{nlp}})$   (1)

where $D$ denotes the in-depth higher-order-thinking discussion content dataset, $F$ the capture function, $x_{\mathrm{speech}}$ the speech-recognition output, $x_{\mathrm{image}}$ the image-recognition output, and $x_{\mathrm{nlp}}$ the output of the natural language processing module;

the captured higher-order thinking data are preprocessed and then fed into the multi-agent system for collaborative processing; through its internal communication and collaboration mechanism, the multi-agent system performs deep mining and analysis of these data and extracts key features that reflect learners' higher-order thinking ability, according to the formula:

$H = P(D, \mathcal{C})$   (2)

where $H$ denotes the set of higher-order-thinking feature vectors, $P$ the processing function, and $\mathcal{C}$ the communication and collaboration mechanism of the multi-agent system;
S12, multi-agent reinforcement learning based on two-stage intention sharing.
Preferably, step S12 specifically includes:
Each agent $i$ maintains a value network $V_i$. It first independently analyzes the observed in-depth higher-order-thinking discussion content dataset and, based on its local observation $o_i$, generates a preliminary intention action $a_i$, with the specific formula:

$a_i = \arg\max_{a} V_i(o_i, a)$   (3)

Subsequently, the first round of intention sharing begins: each agent broadcasts its intention action $a_i$ to all other agents in the system. Assuming the total number of agents in the system is N, after the first round of intention sharing each agent not only relies on its own local observation $o_i$ but also combines the received intention-action information $a_j$ of the other agents and evaluates how important this information is to its own current decision;

the potential impact of each received intention action on the decision process is measured by computing a V value. Specifically, for a given action k provided as action information by the current agent $i$ to agent $j$, the maximum and minimum V values obtained when agent $j$ takes its different optimal actions, $V_{ij}^{\max}(k)$ and $V_{ij}^{\min}(k)$, are computed according to formulas (4) and (5);

to establish a unified quantization standard, the maximum and minimum values are normalized, as in formula (6);

if the decision importance of one agent's intention-action information to another agent exceeds a preset threshold $\delta$, the former agent is added to the dependent-object set of the latter;

then the second round of intention sharing begins, in which each agent sends its dependent-object set to all other agents in the system. Through this round of information exchange, the agents can construct a more comprehensive dependency graph, a directed graph that intuitively shows the dependency relations between agents.
Preferably, in step S12, cyclic dependencies in the dependency graph are detected and eliminated by a cycle-removal algorithm, specifically:
Initialization: for every node r, set visited(r) = false and onStack(r) = false;
visited(r) is a status flag indicating whether node r has been visited; in the initialization phase all nodes are assumed unvisited. onStack(r) is a status flag indicating whether node r is currently on the DFS stack; onStack(r) = false means that no node is on the stack during initialization;
Depth-first search (DFS): for each unvisited node r, do the following:
mark r as visited, setting visited(r) = true;
push r onto the DFS stack, setting onStack(r) = true;
for every successor w of r, if w has not been visited, recursively perform the depth-first search on w; if w has already been visited and is still on the stack (i.e., visited(w) = true and onStack(w) = true), a cyclic dependency is detected;
pop r from the stack, setting onStack(r) = false;
Cycle elimination: once a cyclic dependency is detected, the dependency relations are adjusted to ensure that the resulting graph is a directed acyclic graph;
after the acyclic dependency graph is obtained, each agent's decision process follows this rule: if an agent is depended on by other agents, it does not decide again and its final action is simply its preliminarily generated intention action $a_i$; conversely, if an agent is not depended on by any agent, it decides again, optimizing its action selection according to the intention-action information of the agents it depends on, combined with its own local observation.
Preferably, step S21 specifically includes:
Assume that the learning processes of the other agents affect agent $i$; the formula is:

$\Delta\pi_i = \alpha_i\,\nabla_{\pi_i} J_i(\pi_i, \hat\pi_{-i})$   (7)

where $\Delta\pi_i$ denotes agent $i$'s policy update amount, $\alpha_i$ agent $i$'s learning rate, $J_i$ agent $i$'s objective function, $\pi_i$ agent $i$'s policy, $\hat\pi_{-i}$ the estimate of the other agents' policies, and $\nabla_{\pi_i}$ the gradient operation with respect to agent $i$'s policy;

to capture the interactions between agents more accurately, the average action of the surrounding agents is taken as an input, based on mean-field theory, to influence and optimize the agent's own policy learning. The policy update rule for $\pi_i$ is:

(8)

where $\pi_i^{\mathrm{old}}$ denotes agent $i$'s policy before the update, $\pi_i^{\mathrm{new}}$ the policy after the update, $Q_i(s, a_i, \bar a_{-i})$ agent $i$'s state-action value function, and $\bar a_{-i}$ the average action of the other agents;

to improve the effectiveness of agent policies in non-stationary environments, a neural network is adopted as the value-function approximator in combination with a deep Bayesian policy reuse method, and a distilled policy network is introduced to achieve efficient policy learning and reuse, obtaining higher cumulative rewards and better convergence in a variety of stochastic game scenarios. The update formula of the value-function approximator is:

(9)

where $w$ are the parameters of the value-function approximator, $s$ is the current state, and $r_i$ is the agent's immediate reward;

finally, the agents' algorithm is guaranteed to converge to a Nash equilibrium, i.e., each agent's policy is an optimal response given the policies of all the other agents:

$u_i(\pi_i^{*}, \pi_{-i}^{*}) \ge u_i(\pi_i, \pi_{-i}^{*}) \quad \text{for all } \pi_i \text{ and all } i$   (10)

where $\pi_i^{*}$ denotes agent $i$'s policy, $\pi_{-i}^{*}$ the joint policy of the agents other than agent $i$, $u_i$ agent $i$'s utility value, and the inequality expresses that the optimal-response condition is satisfied for all agents.
Preferably, step S22 specifically includes:
Policy optimization is performed on the hybrid-game-inspired multi-agent model; through a gradient-based optimization strategy, efficient joint optimization of the agent policies and the model is achieved in every training round. To this end, the policy function and the auxiliary function are first parameterized, with parameters denoted $\theta$ and $\phi$ respectively; these parameters are then updated by a gradient-based method, with the update rules:

$\theta \leftarrow \theta + \eta_{\theta}\,\nabla_{\theta} J(\theta, \phi)$   (11)

$\phi \leftarrow \phi + \eta_{\phi}\,\nabla_{\phi} J(\theta, \phi)$   (12)

where $J$ denotes the objective function of the expected cumulative reward, $\nabla_{\theta} J$ the gradient of the objective with respect to the policy parameters $\theta$, $\nabla_{\phi} J$ the gradient with respect to the auxiliary parameters $\phi$, $\eta_{\theta}$ the update step size of the policy parameters, and $\eta_{\phi}$ the update step size of the auxiliary parameters;

during training, the state and action of the agent at each moment are evaluated by a neural-network value function. First, the learners' higher-order-thinking data sample tasks are preliminarily processed; then the current states $s_t$ and actions $a_t$ of all tasks are input, and the neural-network value function $V_{\psi}(s_t, a_t)$ outputs the estimated value at that moment. A loss function $L(\psi)$ is defined to measure the difference between the output of the value function and the actual reward, and gradient descent is then used to update the value-function parameters $\psi$:

$\psi \leftarrow \psi - \xi\,\nabla_{\psi} L(\psi)$   (13)

where $\xi$ is the update step size of the neural-network value-function parameters;

an optimization objective is constructed that maximizes the weighted sum of the expected cumulative internal and external rewards:

$J = \mathbb{E}\Big[\sum_{t=1}^{T}\big(r^{\mathrm{int}}(s_t, a_t) + \lambda\, r^{\mathrm{ext}}_t\big)\Big]$   (14)

where $r^{\mathrm{int}}(s_t, a_t)$ denotes the internal reward obtained when the agent executes action $a_t$ in state $s_t$, $r^{\mathrm{ext}}_t$ the external reward produced by the auxiliary function, $\lambda$ the weight coefficient that adjusts the relative importance of the internal and external rewards, and $T$ the number of time steps in a training round;

the training process is repeated until the tasks faced by the agents reach the predetermined index.
Preferably, step S23 specifically includes:
First, a multi-task coupling relation matrix is set up from the outputs of a given agent's multi-task policy network. Suppose the policy network of task R has $m_R$ outputs and the policy network of task q has $m_q$ outputs, with p, q = 1, 2, ..., n and n the total number of tasks; an $m_R \times m_q$ multi-task coupling relation matrix M is then constructed, in which the element is 1 if the o-th output of task R and the l-th output of task q have a coupling relation, and 0 otherwise;

then each agent contributes to a multi-agent coupling relation matrix: whether agents have a cooperative coordination relation is judged from the actual application situation, the cooperative relations between agents are represented in the matrix, and the input reconstruction of each task evaluation network of each agent is thereby supported. With N agents in the agent cluster, an N × N multi-agent coupling relation matrix C is constructed, where 1 ≤ i ≤ N and 1 ≤ j ≤ N; the corresponding element is 1 if agent i and agent j have a cooperative coordination relation, and 0 otherwise;

according to the multi-task coupling relation matrix M and the multi-agent coupling relation matrix C, matrix operations are used to extract from M and C the information closely related to the current agent's current task, which is input to the evaluation network of the policy network, according to the formula:

$I = (M \odot C)\,W$   (15)

where I denotes the input vector of the evaluation network, $\odot$ denotes the element-wise multiplication of the multi-task coupling relation matrix and the multi-agent coupling relation matrix, and W is a weight matrix that adjusts the influence of each element on the evaluation-network input;

after receiving the input vector I, the evaluation network processes it and outputs an evaluation result for the current task; based on this result, combined with the agent's current state and task goal, a greedy-strategy-based heuristic-search task coupling planning algorithm is used to generate a task plan that meets the actual requirements.
Preferably, the heuristic search task coupling planning algorithm specifically comprises:
Initialization: set the initial state of the agent to $S_0$, the task goal to G, the current time to t = 0, and the task plan P to the empty set;
evaluation function: define an evaluation function f(S, P, G) that evaluates how difficult or how good it is to reach the task goal G from a given state S under task plan P, with the formula:

$f(S, P, G) = \omega_1\, d(S, G) + \omega_2\, \mathrm{cost}(P) + \omega_3\, \mathrm{time}(P)$   (16)

where $d(S, G)$ denotes the distance from the current state S to the task goal G, $\mathrm{cost}(P)$ the cost of the task plan P, $\mathrm{time}(P)$ the execution time of the task plan P, and $\omega_1, \omega_2, \omega_3$ weight coefficients that adjust the relative importance of each index in the evaluation function;
heuristic search:
starting from the current state $S_t$, generate the set of all possible next actions;
for each action, obtain the new state $S_{t+1}$ reached after executing it and update the task plan P;
compute the evaluation function $f(S_{t+1}, P, G)$ and select the action with the best evaluation value as the optimal action of the current step;
repeat the above steps until the task goal G is reached or no new action can be generated;
output of the task plan: the finally obtained task plan P is the optimal solution from the initial state $S_0$ to the task goal G.
Preferably, step S3 specifically includes:
S31, setting up a theoretical framework for the multi-agent migration strategy, specifically as follows:
a series of source tasks is first set and initialized as a set of mutually orthogonal vectors $\{y_1, \dots, y_K\}$, where $y_k$ is the vector representation of the k-th source task; an orthogonalization algorithm is applied to the vectors one by one to obtain the orthogonalized task representation set;
next, the task representation most similar to the new task is searched for in the orthogonalized task representation set and recorded as $y^{*}$. A task representation vector z is then initialized and the source task representation $y^{*}$ is fixed, and a parameterized forward model is constructed that can predict the transfer function $X(v, c)$ and the reward function $Y(v, c)$, where v and $v'$ denote the current and the next state, respectively, and c denotes the executed action. The task representation vector z is updated by gradient-descent optimization so that the new task is retained in the task space; the task representation is then further learned by modeling the transition function and the reward function, and a population-invariant network structure is designed to handle the interactive and non-interactive actions in the task. For a non-interactive action, a conventional deep network is applied directly to compute its Q value, with the specific formula:

$Q_{\mathrm{non}}(v, c) = g_{\mathrm{non}}(v, c)$   (17)

where $Q_{\mathrm{non}}(v, c)$ denotes the Q value between the current state v and the executed action c in the non-interactive setting, and $g_{\mathrm{non}}(v, c)$ the functional relation between the current state v and the executed action c in the non-interactive setting;

for an interactive action, a shared network $g_{\mathrm{int}}$ is used; it takes as input the concatenation of the observation $o_e$ associated with the corresponding entity and the task representation z, and outputs the Q-value estimate of the corresponding interactive action:

$Q_{\mathrm{int}}(v, c) = g_{\mathrm{int}}([\,o_e, z\,], c)$   (18)

where $Q_{\mathrm{int}}(v, c)$ denotes the Q value between the current state v and the executed action c in the interactive setting;

finally, the computed Q values are combined with the task representation and input to a mixing network to generate new tasks and perform task coupling planning; the mixing network comprehensively considers the relations between tasks, achieving effective migration and utilization of knowledge;
S32, embodied development of learners' higher-order thinking based on the multi-agent migration strategy, specifically:
based on the multi-agent migration strategy, new tasks are efficiently generated and optimized through the synergy among agents; further, by comparing the plans of the source tasks and the new tasks, deep migration and fusion of knowledge are promoted. This process not only strengthens learners' mastery of existing knowledge but also shows a remarkable effect in promoting the development of their higher-order thinking. The source task plan set is defined as $\{P^{s}_{g}\}$ and the new task plan set as $\{P^{n}_{h}\}$; to quantify the migration potential between a source task and a new task, a migration evaluation model based on task-feature similarity is adopted, with the formula:

$\mathrm{Sim}(P^{s}_{g}, P^{n}_{h}) = \sum_{k=1}^{d} w_k\,\mathrm{sim}\big(\phi_k(P^{s}_{g}),\, \phi_k(P^{n}_{h})\big)$   (19)

where $P^{s}_{g}$ denotes the g-th task plan in the source task plan set, $P^{n}_{h}$ the h-th task plan in the new task plan set, $\phi_k$ the extraction function of the k-th task feature, which maps a task plan into the k-th feature space, $\mathrm{sim}(\cdot,\cdot)$ a similarity measure between features, used to compute the similarity of two feature vectors, $w_k$ the weight of the k-th feature, reflecting its importance in the migration evaluation, and d the total number of features, i.e., the number of task features considered.
The model evaluates the migration potential between a source task plan and a new task plan by computing the weighted sum of their similarities over the respective feature spaces.
Preferably, the orthogonalization algorithm in step S31 is specifically:
a. initialize the orthogonal vector set as an empty set;
b. for each original vector $y_k$, compute its inner product with every vector already in the orthogonal vector set to obtain the projection components;
c. subtract all projection components from the original vector $y_k$ to obtain the orthogonalized vector $\tilde y_k$;
d. add $\tilde y_k$ to the orthogonal vector set;
e. repeat steps b to d until all original vectors have been orthogonalized.
By adopting the multi-agent task planning method for higher-order thinking development under hybrid games, the higher-order-thinking features extracted from the collected learner data show a clear upward trend compared with the prior art. In particular, learners' performance improves markedly on key features reflecting higher-order thinking ability, such as problem solving, critical thinking and innovation. This result provides solid support and a strong impetus for the embodied development of learners' higher-order thinking.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of the multi-agent task planning method for higher-order thinking development under hybrid games of the present invention;
FIG. 2 is a schematic diagram of multi-agent reinforcement learning based on two-stage intention sharing according to the present invention;
Fig. 3 is a schematic diagram of the multi-agent task coupling planning oriented to learners' higher-order thinking under the hybrid-game heuristic of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention, provided in the accompanying drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, a multi-agent task planning method for higher-order thinking development under hybrid games is provided.
Learners' higher-order thinking refers to the advanced thinking abilities exhibited by a learner, beyond basic cognitive skills (such as memory and understanding), in the course of cognitive development. It covers multiple dimensions including critical thinking, creative thinking, problem-solving ability, logical reasoning, metacognitive monitoring and self-regulation, and is the ability of a learner to comprehensively use existing knowledge, skills and strategies to carry out deep analysis, comprehensive judgment, creative solution and innovative thinking when facing complex, unstructured problems. Specifically, learners' higher-order thinking is embodied in the following aspects:
(1) Critical thinking: the learner can independently think about, screen, analyze and evaluate information to form his or her own insights and judgments, instead of blindly accepting the views or information of others.
(2) Creative thinking: the learner can break through conventional thinking frameworks and put forward novel, unique ideas, methods and solutions, showing strong innovation ability and imagination.
(3) Problem-solving ability: when facing a complex problem, the learner can identify the nature of the problem, formulate a solution, execute it effectively and evaluate its effect, thereby solving the problem effectively.
(4) Logical reasoning: the learner can reason and deduce according to the known information and logical rules to reach reasonable conclusions and judgments.
(5) Metacognitive monitoring: the learner can monitor and regulate his or her own cognitive process, clarify learning goals and strategies, and reflect on and adjust learning behavior in time.
(6) Self-regulation: the learner can flexibly adjust his or her mental state and behavior according to his or her learning situation and changes in the external environment, maintaining a positive learning attitude and an efficient learning state.
A multi-agent hybrid game is characterized as follows: in an agent system containing multiple agents with autonomous decision-making capability, each agent adopts, according to certain rules and constraints, a diversified action scheme in its strategy space that includes both pure strategies and mixed strategies. In this process, an agent must consider the direct effect of its own actions while predicting and responding to the possible reactions of the other agents and the potential influence of those reactions on its own benefit. This complex game framework, aimed at maximizing each agent's respective benefit, is realized through a dynamic interaction process of mutual competition and cooperation.
The method comprises the following steps:
S1, multi-agent reinforcement learning based on two-stage intention sharing under the guidance of learners' higher-order thinking;
S11, multi-agent-based acquisition and processing of data related to learners' higher-order thinking;
through speech recognition, image recognition, natural language processing and other machine technologies, the thinking features exhibited by learners in text-based and other forms of in-depth higher-order-thinking discussion are captured in real time, including but not limited to logical reasoning paths in images and expressions of critical thinking and explanations of innovative viewpoints in text. The capture formula is:

$D = F(x_{\mathrm{speech}}, x_{\mathrm{image}}, x_{\mathrm{nlp}})$   (1)

where $D$ denotes the in-depth higher-order-thinking discussion content dataset, $F$ the capture function, $x_{\mathrm{speech}}$ the speech-recognition output, $x_{\mathrm{image}}$ the image-recognition output, and $x_{\mathrm{nlp}}$ the output of the natural language processing module;

the captured higher-order thinking data are preprocessed and then fed into the multi-agent system for collaborative processing; through its internal communication and collaboration mechanism, the multi-agent system performs deep mining and analysis of these data and extracts key features that reflect learners' higher-order thinking ability, according to the formula:

$H = P(D, \mathcal{C})$   (2)

where $H$ denotes the set of higher-order-thinking feature vectors, $P$ the processing function, and $\mathcal{C}$ the communication and collaboration mechanism of the multi-agent system;
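As a concrete illustration of formulas (1) and (2), the sketch below shows one minimal way such a capture and feature-extraction pipeline could be wired together. The function names, field names and the particular features pooled here (capture, extract_features, the modality keys) are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DiscussionSample:
    """One captured item of an in-depth higher-order-thinking discussion."""
    speech_text: str                  # output of the speech-recognition module
    image_tags: List[str]             # output of the image-recognition module
    nlp_features: Dict[str, float]    # output of the NLP module (e.g., reasoning-path scores)

def capture(speech_out: str, image_out: List[str], nlp_out: Dict[str, float]) -> DiscussionSample:
    """Formula (1): D = F(x_speech, x_image, x_nlp) -- merge the three modality outputs."""
    return DiscussionSample(speech_out, image_out, nlp_out)

def extract_features(dataset: List[DiscussionSample]) -> List[Dict[str, float]]:
    """Formula (2): H = P(D, C) -- a stand-in processing function that pools per-sample
    evidence of critical thinking, creativity and reasoning into feature vectors."""
    features = []
    for d in dataset:
        features.append({
            "critical_thinking": d.nlp_features.get("critical", 0.0),
            "creativity": d.nlp_features.get("novelty", 0.0),
            "reasoning_path_len": float(len(d.image_tags)),
        })
    return features

if __name__ == "__main__":
    D = [capture("because A implies B ...", ["diagram", "arrow"], {"critical": 0.8, "novelty": 0.6})]
    print(extract_features(D))
```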
S12, multi-agent reinforcement learning based on two-stage intention sharing;
this embodiment performs reinforcement learning on multiple agents based on a two-stage intention-sharing mechanism, aiming to optimize the collaborative decision process of the multi-agent system, as shown in FIG. 2. Each agent $i$ maintains a value network $V_i$. It first independently analyzes the observed in-depth higher-order-thinking discussion content dataset and, based on its local observation $o_i$, generates a preliminary intention action $a_i$, with the specific formula:

$a_i = \arg\max_{a} V_i(o_i, a)$   (3)

Subsequently, the first round of intention sharing begins: each agent broadcasts its intention action $a_i$ to all other agents in the system (assuming the total number of agents in the system is N). After the first round of intention sharing is completed, each agent not only relies on its own local observation $o_i$ but also combines the received intention-action information $a_j$ of the other agents and evaluates how important this information is to its own current decision;

the potential impact of each received intention action on the decision process is measured by computing a V value. Specifically, for a given action k provided as action information by the current agent $i$ to agent $j$, the maximum and minimum V values obtained when agent $j$ takes its different optimal actions, $V_{ij}^{\max}(k)$ and $V_{ij}^{\min}(k)$, are computed according to formulas (4) and (5);

to establish a unified quantization standard, the maximum and minimum values are normalized, as in formula (6);

if the decision importance of one agent's intention-action information to another agent exceeds a preset threshold $\delta$, the former agent is added to the dependent-object set of the latter;

then the second round of intention sharing begins, in which each agent sends its dependent-object set to all other agents in the system. Through this round of information exchange, the agents can construct a more comprehensive dependency graph, a directed graph that intuitively shows the dependency relations between agents.
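A minimal sketch of the first intention-sharing round described above, under the assumption that each agent's value network can be queried as V_j(o_j, a_i, a_j). The importance of a received intention is scored by the spread between the best- and worst-case V values; the exact normalization of formula (6) did not survive extraction, so the spread-over-magnitude form used here is an assumption, as are all names.

```python
from typing import Callable, Dict, List, Set

def dependency_sets(
    value_fn: Dict[int, Callable[[object, int, int], float]],  # V_j(o_j, a_i, a_j)
    observations: Dict[int, object],
    intentions: Dict[int, int],          # preliminary intention action a_i per agent
    actions: List[int],
    delta: float,
) -> Dict[int, Set[int]]:
    """For every ordered pair (i, j), score how much knowing agent i's intention matters
    to agent j (in the spirit of formulas (4)-(6)) and build j's dependent-object set."""
    deps: Dict[int, Set[int]] = {j: set() for j in value_fn}
    for j, V in value_fn.items():
        for i, k in intentions.items():
            if i == j:
                continue
            vals = [V(observations[j], k, a_j) for a_j in actions]
            v_max, v_min = max(vals), min(vals)
            # normalized importance of agent i's intention to agent j (assumed form)
            spread = (v_max - v_min) / (abs(v_max) + abs(v_min) + 1e-8)
            if spread > delta:
                deps[j].add(i)   # agent j depends on agent i's intention
    return deps
```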
It should be noted, however, that the initially constructed dependency graph may contain directed cycles, i.e., circular dependencies. To solve this problem, this embodiment uses an algorithm that detects and eliminates circular dependencies in the dependency graph, so as to ensure that the resulting graph is a directed acyclic graph, as follows:
1. initialization: for every node r, set visited(r) = false and onStack(r) = false;
visited(r) is a status flag indicating whether node r has been visited; in the initialization phase all nodes are assumed unvisited. onStack(r) is a status flag indicating whether node r is currently on the DFS stack; onStack(r) = false means that no node is on the stack during initialization;
2. depth-first search (DFS): for each unvisited node r, do the following:
mark r as visited, setting visited(r) = true;
push r onto the DFS stack, setting onStack(r) = true;
for every successor w of r, if w has not been visited, recursively perform DFS on w; if w has already been visited and is still on the stack (i.e., visited(w) = true and onStack(w) = true), a circular dependency is detected;
pop r from the stack, setting onStack(r) = false;
3. cycle elimination: once a circular dependency is detected, the dependency relations are adjusted to ensure that the resulting graph is a directed acyclic graph;
after the acyclic dependency graph is obtained, each agent's decision process follows this rule: if an agent is depended on by other agents, it does not decide again and its final action is simply its preliminarily generated intention action $a_i$; conversely, if an agent is not depended on by any agent, it decides again, optimizing its action selection according to the intention-action information of the agents it depends on, combined with its own local observation.
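The following sketch implements the DFS-based cycle detection just described; the elimination step simply drops the back edge that closes each cycle, which is one possible "adjustment of the dependency relations" (the patent does not specify which edge to remove).

```python
from typing import Dict, List, Set, Tuple

def remove_cycles(graph: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """graph[r] is the set of agents that r depends on. Returns an acyclic copy:
    any back edge found during DFS (visited(w) and onStack(w)) is removed."""
    visited: Dict[int, bool] = {r: False for r in graph}
    on_stack: Dict[int, bool] = {r: False for r in graph}
    back_edges: List[Tuple[int, int]] = []

    def dfs(r: int) -> None:
        visited[r] = True
        on_stack[r] = True                 # push r onto the DFS stack
        for w in graph[r]:
            if not visited.get(w, False):
                dfs(w)                     # recurse on an unvisited successor
            elif on_stack.get(w, False):   # w visited and still on the stack: cycle detected
                back_edges.append((r, w))
        on_stack[r] = False                # pop r from the stack

    for r in graph:
        if not visited[r]:
            dfs(r)

    acyclic = {r: set(ws) for r, ws in graph.items()}
    for r, w in back_edges:                # eliminate the detected circular dependencies
        acyclic[r].discard(w)
    return acyclic

# example: 0 -> 1 -> 2 -> 0 becomes acyclic after dropping one edge
print(remove_cycles({0: {1}, 1: {2}, 2: {0}}))
```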
S2, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking;
S21, constructing a multi-agent model under the hybrid-game heuristic;
In a multi-agent system, each agent tries to maximize its benefit by optimizing its own policy; however, because of the interactions between agents, traditional single-agent reinforcement learning often has difficulty achieving the desired effect. This embodiment therefore builds a multi-agent model under the hybrid-game heuristic, aiming to achieve efficient policy updates through mutual learning and prediction between agents. Assume that the learning processes of the other agents affect agent $i$; the formula is:

$\Delta\pi_i = \alpha_i\,\nabla_{\pi_i} J_i(\pi_i, \hat\pi_{-i})$   (7)

where $\Delta\pi_i$ denotes agent $i$'s policy update amount, $\alpha_i$ agent $i$'s learning rate, $J_i$ agent $i$'s objective function, $\pi_i$ agent $i$'s policy, $\hat\pi_{-i}$ the estimate of the other agents' policies, and $\nabla_{\pi_i}$ the gradient operation with respect to agent $i$'s policy;

to capture the interactions between agents more accurately, the average action of the surrounding agents is taken as an input, based on mean-field theory, to influence and optimize the agent's own policy learning. The policy update rule for $\pi_i$ is:

(8)

where $\pi_i^{\mathrm{old}}$ denotes agent $i$'s policy before the update, $\pi_i^{\mathrm{new}}$ the policy after the update, $Q_i(s, a_i, \bar a_{-i})$ agent $i$'s state-action value function, and $\bar a_{-i}$ the average action of the other agents;

to improve the effectiveness of agent policies in non-stationary environments, a neural network is adopted as the value-function approximator in combination with a deep Bayesian policy reuse method, and a distilled policy network is introduced to achieve efficient policy learning and reuse, obtaining higher cumulative rewards and better convergence in a variety of stochastic game scenarios. The update formula of the value-function approximator is:

(9)

where $w$ are the parameters of the value-function approximator, $s$ is the current state, and $r_i$ is the agent's immediate reward;

finally, the agents' algorithm is guaranteed to converge to a Nash equilibrium, i.e., each agent's policy is an optimal response given the policies of all the other agents:

$u_i(\pi_i^{*}, \pi_{-i}^{*}) \ge u_i(\pi_i, \pi_{-i}^{*}) \quad \text{for all } \pi_i \text{ and all } i$   (10)

where $\pi_i^{*}$ denotes agent $i$'s policy, $\pi_{-i}^{*}$ the joint policy of the agents other than agent $i$, $u_i$ agent $i$'s utility value, and the inequality expresses that the optimal-response condition is satisfied for all agents.
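The exact update rules (8) and (9) did not survive extraction; the sketch below shows one common mean-field realization that matches the surrounding description (a Boltzmann policy over Q(s, a_i, mean action) and a TD-style value update). The discount factor, temperature and array layout are assumptions, not the patent's formulas.

```python
import numpy as np

def boltzmann_policy(q_row: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """A softmax policy over Q(s, a_i, mean action of the other agents)."""
    z = np.exp(beta * (q_row - q_row.max()))
    return z / z.sum()

def mean_field_update(Q: np.ndarray, s: int, a: int, a_mean: int, r: float,
                      s_next: int, a_mean_next: int,
                      alpha: float = 0.1, gamma: float = 0.95) -> None:
    """TD-style update of the value-function approximator Q[s, a, a_mean] toward
    the immediate reward plus the discounted Boltzmann-averaged next value."""
    pi_next = boltzmann_policy(Q[s_next, :, a_mean_next])
    v_next = float(pi_next @ Q[s_next, :, a_mean_next])
    Q[s, a, a_mean] += alpha * (r + gamma * v_next - Q[s, a, a_mean])

# toy usage: 4 states, 3 actions, mean action discretized into 3 bins
Q = np.zeros((4, 3, 3))
mean_field_update(Q, s=0, a=1, a_mean=2, r=1.0, s_next=3, a_mean_next=0)
print(Q[0, 1, 2])
```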
S22, hybrid-game-inspired multi-agent model policy optimization oriented to higher-order thinking;
policy optimization is performed on the hybrid-game-inspired multi-agent model; through a gradient-based optimization strategy, efficient joint optimization of the agent policies and the model is achieved in every training round. To this end, the policy function and the auxiliary function are first parameterized, with parameters denoted $\theta$ and $\phi$ respectively; these parameters are then updated by a gradient-based method, with the update rules:

$\theta \leftarrow \theta + \eta_{\theta}\,\nabla_{\theta} J(\theta, \phi)$   (11)

$\phi \leftarrow \phi + \eta_{\phi}\,\nabla_{\phi} J(\theta, \phi)$   (12)

where $J$ denotes the objective function of the expected cumulative reward, $\nabla_{\theta} J$ the gradient of the objective with respect to the policy parameters $\theta$, $\nabla_{\phi} J$ the gradient with respect to the auxiliary parameters $\phi$, $\eta_{\theta}$ the update step size of the policy parameters, and $\eta_{\phi}$ the update step size of the auxiliary parameters;

during training, the state and action of the agent at each moment are evaluated by a neural-network value function. First, the learners' higher-order-thinking data sample tasks are preliminarily processed; then the current states $s_t$ and actions $a_t$ of all tasks are input, and the neural-network value function $V_{\psi}(s_t, a_t)$ outputs the estimated value at that moment. To continuously optimize the value function, a loss function $L(\psi)$ is defined to measure the difference between its output and the actual reward, and gradient descent is then used to update the value-function parameters $\psi$:

$\psi \leftarrow \psi - \xi\,\nabla_{\psi} L(\psi)$   (13)

where $\xi$ is the update step size of the neural-network value-function parameters;

an optimization objective is constructed that maximizes the weighted sum of the expected cumulative internal and external rewards:

$J = \mathbb{E}\Big[\sum_{t=1}^{T}\big(r^{\mathrm{int}}(s_t, a_t) + \lambda\, r^{\mathrm{ext}}_t\big)\Big]$   (14)

where $r^{\mathrm{int}}(s_t, a_t)$ denotes the internal reward obtained when the agent executes action $a_t$ in state $s_t$, $r^{\mathrm{ext}}_t$ the external reward produced by the auxiliary function, $\lambda$ the weight coefficient that adjusts the relative importance of the internal and external rewards, and $T$ the number of time steps in a training round;

the training process is repeated until the tasks faced by the agents reach the predetermined index.
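A compact sketch of one possible gradient-based realization of the joint update (11)-(14): a REINFORCE-style policy update weighted by the internal reward plus a lambda-weighted external reward, a value-function regression for (13), and an auxiliary head fitted to an auxiliary target for (12). Network shapes, the aux_targets signal and the overall wiring are assumptions; the patent's exact formulas did not survive extraction.

```python
import torch
import torch.nn as nn

# toy sizes: 4-dim state, 2 discrete actions
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 2))   # parameters theta
aux = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))      # auxiliary function, parameters phi
value = nn.Sequential(nn.Linear(5, 16), nn.Tanh(), nn.Linear(16, 1))    # value function, parameters psi

opt_theta = torch.optim.SGD(policy.parameters(), lr=1e-2)   # step size eta_theta, formula (11)
opt_phi = torch.optim.SGD(aux.parameters(), lr=1e-2)        # step size eta_phi, formula (12)
opt_psi = torch.optim.SGD(value.parameters(), lr=1e-2)      # step size xi, formula (13)
lam = 0.5                                                    # weight lambda in objective (14)

def train_step(states, actions, r_int, aux_targets):
    # formula (13): regress V_psi(s_t, a_t) onto the observed reward
    v_in = torch.cat([states, actions.float().unsqueeze(1)], dim=1)
    loss_v = ((value(v_in).squeeze(1) - r_int) ** 2).mean()
    opt_psi.zero_grad(); loss_v.backward(); opt_psi.step()

    # external reward produced by the auxiliary function (detached for the policy update)
    r_ext = aux(states).squeeze(1)
    # formula (14): weighted sum of internal and external rewards, scored by the policy log-prob
    logp = torch.log_softmax(policy(states), dim=1).gather(1, actions.unsqueeze(1)).squeeze(1)
    J = (logp * (r_int + lam * r_ext.detach())).mean()
    opt_theta.zero_grad(); (-J).backward(); opt_theta.step()            # formula (11), gradient ascent

    # formula (12): here the auxiliary parameters are fitted to an (assumed) auxiliary target
    loss_aux = ((aux(states).squeeze(1) - aux_targets) ** 2).mean()
    opt_phi.zero_grad(); loss_aux.backward(); opt_phi.step()

if __name__ == "__main__":
    B = 8
    train_step(torch.randn(B, 4), torch.randint(0, 2, (B,)), torch.randn(B), torch.randn(B))
```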
S23, hybrid-game-inspired multi-agent task coupling planning oriented to learners' higher-order thinking;
in this embodiment, within the hybrid-game-inspired multi-agent model, coupling planning is performed on the learners' higher-order-thinking data tasks. By constructing a multi-task coupling relation matrix and a multi-agent coupling relation matrix, the coupling relations among the agents and their tasks are accurately described, providing strong support for collaborative decision-making, as shown in FIG. 3.
First, a multi-task coupling relation matrix is set up from the outputs of a given agent's multi-task policy network; it represents the coupled parts of these outputs, reflects the coupling relations among the multiple tasks faced by the current agent, and supports the input reconstruction of each task evaluation network of each agent. Suppose the policy network of task R has $m_R$ outputs and the policy network of task q has $m_q$ outputs, with p, q = 1, 2, ..., n and n the total number of tasks; an $m_R \times m_q$ multi-task coupling relation matrix M is then constructed, in which the element is 1 if the o-th output of task R and the l-th output of task q have a coupling relation, and 0 otherwise;

then each agent contributes to a multi-agent coupling relation matrix: whether agents have a cooperative coordination relation is judged from the actual application situation, the cooperative relations between agents are represented in the matrix, and the input reconstruction of each task evaluation network of each agent is thereby supported. With N agents in the agent cluster, an N × N multi-agent coupling relation matrix C is constructed, where 1 ≤ i ≤ N and 1 ≤ j ≤ N; the corresponding element is 1 if agent i and agent j have a cooperative coordination relation, and 0 otherwise;

according to the multi-task coupling relation matrix M and the multi-agent coupling relation matrix C, matrix operations are used to extract from M and C the information closely related to the current agent's current task, which is input to the evaluation network of the policy network, according to the formula:

$I = (M \odot C)\,W$   (15)

where I denotes the input vector of the evaluation network, $\odot$ denotes the element-wise multiplication of the multi-task coupling relation matrix and the multi-agent coupling relation matrix, and W is a weight matrix that adjusts the influence of each element on the evaluation-network input;

after receiving the input vector I, the evaluation network processes it and outputs an evaluation result for the current task; based on this result, combined with the agent's current state and task goal, a greedy-strategy-based heuristic-search task coupling planning algorithm is used to generate a task plan that meets the actual requirements.
The heuristic-search task coupling planning algorithm is specifically as follows:
1. initialization: set the initial state of the agent to $S_0$, the task goal to G, the current time to t = 0, and the task plan P to the empty set;
2. evaluation function: define an evaluation function f(S, P, G) that evaluates how difficult or how good it is to reach the task goal G from a given state S under task plan P, with the formula:

$f(S, P, G) = \omega_1\, d(S, G) + \omega_2\, \mathrm{cost}(P) + \omega_3\, \mathrm{time}(P)$   (16)

where $d(S, G)$ denotes the distance from the current state S to the task goal G, $\mathrm{cost}(P)$ the cost of the task plan P, $\mathrm{time}(P)$ the execution time of the task plan P, and $\omega_1, \omega_2, \omega_3$ weight coefficients that adjust the relative importance of each index in the evaluation function;
3. heuristic search:
starting from the current state $S_t$, generate the set of all possible next actions;
for each action, obtain the new state $S_{t+1}$ reached after executing it and update the task plan P;
compute the evaluation function $f(S_{t+1}, P, G)$ and select the action with the best evaluation value as the optimal action of the current step;
repeat the above steps until the task goal G is reached or no new action can be generated;
4. output of the task plan: the finally obtained task plan P is the optimal solution from the initial state $S_0$ to the task goal G.
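A minimal sketch of the greedy heuristic search above. The evaluation function follows the reconstructed weighted form of formula (16); the grid-style state/goal representation, the action set and the weights are illustrative assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[int, int]

def evaluate(state: State, plan: List[str], goal: State,
             w1: float = 1.0, w2: float = 0.1, w3: float = 0.05) -> float:
    """Formula (16): f = w1*distance(S,G) + w2*cost(P) + w3*time(P); lower is better."""
    dist = abs(state[0] - goal[0]) + abs(state[1] - goal[1])
    return w1 * dist + w2 * len(plan) + w3 * len(plan)

def greedy_plan(start: State, goal: State,
                successors: Callable[[State], Dict[str, State]],
                max_steps: int = 100) -> Optional[List[str]]:
    """Greedy heuristic search: at each step take the action whose resulting state
    has the best evaluation value; stop at the goal or when no new action exists."""
    state, plan = start, []
    for _ in range(max_steps):
        if state == goal:
            return plan
        options = successors(state)
        if not options:
            return None                       # no new action can be generated
        action, state = min(options.items(),
                            key=lambda kv: evaluate(kv[1], plan + [kv[0]], goal))
        plan.append(action)
    return plan if state == goal else None

# usage on a small grid world
moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
succ = lambda s: {a: (s[0] + d[0], s[1] + d[1]) for a, d in moves.items()}
print(greedy_plan((0, 0), (2, 3), succ))
```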
S3, embodied development of learners' higher-order thinking based on multi-agent migration strategies;
S31, setting up a theoretical framework for the multi-agent migration strategy;
In this embodiment, knowledge migration is performed on the learners' higher-order-thinking task coupling plans based on the multi-agent migration strategy. A series of source tasks is first set and initialized as a set of mutually orthogonal vectors $\{y_1, \dots, y_K\}$, where $y_k$ is the vector representation of the k-th source task; an orthogonalization algorithm is applied to the vectors one by one to obtain the orthogonalized task representation set:
1. initialize the orthogonal vector set as an empty set;
2. for each original vector $y_k$, compute its inner product with every vector already in the orthogonal vector set to obtain the projection components;
3. subtract all projection components from the original vector $y_k$ to obtain the orthogonalized vector $\tilde y_k$;
4. add $\tilde y_k$ to the orthogonal vector set;
5. repeat steps 2 to 4 until all original vectors have been orthogonalized.
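The orthogonalization steps above are classical Gram-Schmidt; a direct NumPy sketch follows (the skip of near-zero vectors is an added implementation detail the patent does not mention):

```python
import numpy as np
from typing import List

def orthogonalize(task_vectors: List[np.ndarray], eps: float = 1e-10) -> List[np.ndarray]:
    """Gram-Schmidt over the source-task representations y_1..y_K (steps 1-5 above)."""
    ortho: List[np.ndarray] = []                     # step 1: empty orthogonal set
    for y in task_vectors:
        v = y.astype(float).copy()
        for u in ortho:                              # step 2: projection components onto existing vectors
            v -= (np.dot(y, u) / np.dot(u, u)) * u   # step 3: subtract each projection
        if np.linalg.norm(v) > eps:                  # skip (near-)linearly-dependent vectors
            ortho.append(v)                          # step 4: add the orthogonalized vector
    return ortho                                     # step 5: done once all vectors are processed

tasks = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
for v in orthogonalize(tasks):
    print(np.round(v, 3))
```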
Next, the task representation most similar to the new task is searched for in the orthogonalized task representation set and recorded as $y^{*}$. A task representation vector z is then initialized and the source task representation $y^{*}$ is fixed, and a parameterized forward model is constructed that can predict the transfer function $X(v, c)$ and the reward function $Y(v, c)$, where v and $v'$ denote the current and the next state, respectively, and c denotes the executed action. The task representation vector z is updated by gradient-descent optimization so that the new task is retained in the task space; the task representation is then further learned by modeling the transition function and the reward function, and a population-invariant network structure is designed to handle the interactive and non-interactive actions in the task. For a non-interactive action, a conventional deep network is applied directly to compute its Q value, with the specific formula:

$Q_{\mathrm{non}}(v, c) = g_{\mathrm{non}}(v, c)$   (17)

where $Q_{\mathrm{non}}(v, c)$ denotes the Q value between the current state v and the executed action c in the non-interactive setting, and $g_{\mathrm{non}}(v, c)$ the functional relation between the current state v and the executed action c in the non-interactive setting;

for an interactive action, a shared network $g_{\mathrm{int}}$ is used; it takes as input the concatenation of the observation $o_e$ associated with the corresponding entity and the task representation z, and outputs the Q-value estimate of the corresponding interactive action:

$Q_{\mathrm{int}}(v, c) = g_{\mathrm{int}}([\,o_e, z\,], c)$   (18)

where $Q_{\mathrm{int}}(v, c)$ denotes the Q value between the current state v and the executed action c in the interactive setting;

finally, the computed Q values are combined with the task representation and input to a mixing network to generate new tasks and perform task coupling planning; the mixing network comprehensively considers the relations between tasks, achieving effective migration and utilization of knowledge;
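The sketch below shows one way the non-interactive head (formula (17)), the shared interactive head over [entity observation, z] (formula (18)) and a small mixing network could be composed. Layer sizes, the concatenation layout and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualHeadQ(nn.Module):
    """Non-interactive Q head plus a shared interactive head that consumes
    [entity observation o_e, task representation z]; both feed a mixing network
    that combines the Q values with the task representation."""
    def __init__(self, state_dim=8, obs_dim=6, z_dim=4, n_actions=5):
        super().__init__()
        self.q_non = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
        self.q_int = nn.Sequential(nn.Linear(obs_dim + z_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
        self.mix = nn.Sequential(nn.Linear(2 * n_actions + z_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))

    def forward(self, state, entity_obs, z):
        q_non = self.q_non(state)                                # Q values for non-interactive actions
        q_int = self.q_int(torch.cat([entity_obs, z], dim=-1))   # shared network on [o_e, z]
        return self.mix(torch.cat([q_non, q_int, z], dim=-1))    # mixing-network output

model = DualHeadQ()
q = model(torch.randn(1, 8), torch.randn(1, 6), torch.randn(1, 4))
print(q.shape)  # torch.Size([1, 5])
```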
S32, embodied development of learners' higher-order thinking based on the multi-agent migration strategy;
this embodiment realizes knowledge migration by comparing the plans of the source tasks and the new tasks, promoting the development of learners' higher-order thinking, which manifests in particular as a rise in the key features reflecting learners' higher-order thinking ability.
Based on the multi-agent migration strategy, new tasks are efficiently generated and optimized through the synergy among agents; further, by comparing the plans of the source tasks and the new tasks, deep migration and fusion of knowledge are promoted. This process not only strengthens learners' mastery of existing knowledge but also shows a remarkable effect in promoting the development of their higher-order thinking. The source task plan set is defined as $\{P^{s}_{g}\}$ and the new task plan set as $\{P^{n}_{h}\}$; to quantify the migration potential between a source task and a new task, a migration evaluation model based on task-feature similarity is adopted, with the formula:

$\mathrm{Sim}(P^{s}_{g}, P^{n}_{h}) = \sum_{k=1}^{d} w_k\,\mathrm{sim}\big(\phi_k(P^{s}_{g}),\, \phi_k(P^{n}_{h})\big)$   (19)

where $P^{s}_{g}$ denotes the g-th task plan in the source task plan set, $P^{n}_{h}$ the h-th task plan in the new task plan set, $\phi_k$ the extraction function of the k-th task feature, which maps a task plan into the k-th feature space, $\mathrm{sim}(\cdot,\cdot)$ a similarity measure between features, used to compute the similarity of two feature vectors, $w_k$ the weight of the k-th feature, reflecting its importance in the migration evaluation, and d the total number of features, i.e., the number of task features considered.
The model evaluates the migration potential between a source task plan and a new task plan by computing the weighted sum of their similarities over the respective feature spaces.
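A direct sketch of the migration evaluation of formula (19): each feature extractor maps a task plan to a vector, cosine similarity stands in for sim(·,·), and the weighted sum gives the migration potential. The concrete feature extractors and weights are illustrative assumptions.

```python
import numpy as np
from typing import Callable, List, Sequence

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def migration_potential(source_plan: Sequence[str], new_plan: Sequence[str],
                        extractors: List[Callable[[Sequence[str]], np.ndarray]],
                        weights: List[float]) -> float:
    """Formula (19): Sim = sum_k w_k * sim(phi_k(P_source), phi_k(P_new))."""
    return sum(w * cosine(phi(source_plan), phi(new_plan))
               for phi, w in zip(extractors, weights))

# illustrative feature extractors over an action-sequence task plan
length_feature = lambda p: np.array([len(p), 1.0])
action_histogram = lambda p: np.array([p.count("explore"), p.count("evaluate"), p.count("create")])

src = ["explore", "evaluate", "create", "evaluate"]
new = ["explore", "explore", "evaluate", "create"]
print(migration_potential(src, new, [length_feature, action_histogram], [0.3, 0.7]))
```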
By adopting the multi-agent task planning method for higher-order thinking development under hybrid games, the higher-order-thinking features extracted from the collected learner data show a clear upward trend compared with the prior art. In particular, learners' performance improves markedly on key features reflecting higher-order thinking ability, such as problem solving, critical thinking and innovation, providing solid support and a strong impetus for the embodied development of learners' higher-order thinking.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted by the same, and the modified or substituted technical solution may not deviate from the spirit and scope of the technical solution of the present invention.

Claims (9)

Next, the task representation most similar to the new task is searched for in the orthogonalized task representation set and recorded as $y^{*}$; a task representation vector z is initialized and the source task representation $y^{*}$ is fixed, and a parameterized forward model is constructed for predicting the transfer function $X(v, c)$ and the reward function $Y(v, c)$, where v and $v'$ denote the current and the next state, respectively, and c denotes the executed action; the task representation vector z is updated by gradient-descent optimization so that the new task is retained in the task space; the task representation is then further learned by modeling the transition function and the reward function, a population-invariant network structure is designed, and for non-interactive actions a conventional deep network is applied directly to compute the Q value, with the specific formula:
CN202411855334.7A | 2024-12-17 | 2024-12-17 | A multi-agent task planning method for the development of higher-order thinking under hybrid games | Active | CN119312876B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411855334.7A | 2024-12-17 | 2024-12-17 | CN119312876B (en) A multi-agent task planning method for the development of higher-order thinking under hybrid games

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411855334.7A | 2024-12-17 | 2024-12-17 | CN119312876B (en) A multi-agent task planning method for the development of higher-order thinking under hybrid games

Publications (2)

Publication Number | Publication Date
CN119312876A (en) | 2025-01-14
CN119312876B (granted) | 2025-06-03

Family

ID=94192285

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411855334.7A (Active) | CN119312876B (en) A multi-agent task planning method for the development of higher-order thinking under hybrid games | 2024-12-17 | 2024-12-17

Country Status (1)

Country | Link
CN (1) | CN119312876B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120069491B (en) * | 2025-04-30 | 2025-08-19 | 烟台大学 | Crowd sensing task scheduling method and system based on multi-space modeling and fairness reinforcement learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116029365A (en) * | 2022-12-27 | 2023-04-28 | 南京大学 | Optimal strategy generation method for multi-agent cooperation based on two-stage intention sharing
CN116542470A (en) * | 2023-05-08 | 2023-08-04 | 北京航空航天大学 | Intelligent unmanned cluster layered distributed task planning decision-making method based on hybrid game

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2013138321A1 (en) * | 2012-03-14 | 2013-09-19 | Gamblit Gaming, Llc | Autonomous agent hybrid games
CN117076993A (en) * | 2023-10-16 | 2023-11-17 | 江苏万维艾斯网络智能产业创新中心有限公司 | Multi-agent game decision-making system and method based on cloud protogenesis
CN118014013A (en) * | 2024-01-12 | 2024-05-10 | 东南大学 | Deep reinforcement learning quick search game method and system based on priori policy guidance
CN118468920A (en) * | 2024-04-24 | 2024-08-09 | 中国人民解放军国防科技大学 | A generation-time planning method for large language model agents based on prompt game learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116029365A (en) * | 2022-12-27 | 2023-04-28 | 南京大学 | Optimal strategy generation method for multi-agent cooperation based on two-stage intention sharing
CN116542470A (en) * | 2023-05-08 | 2023-08-04 | 北京航空航天大学 | Intelligent unmanned cluster layered distributed task planning decision-making method based on hybrid game

Also Published As

Publication number | Publication date
CN119312876A (en) | 2025-01-14

Similar Documents

Publication | Publication Date | Title
Wong et al. Deep multiagent reinforcement learning: Challenges and directions
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
CN119312876B (en) A multi-agent task planning method for the development of higher-order thinking under hybrid games
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
Wong et al. Deep multiagent reinforcement learning: Challenges and directions
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN111260039B (en) Video game decision-making method based on auxiliary task learning
CN112215350A (en) Smart agent control method and device based on reinforcement learning
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
CN118796041B (en) A Chinese learning method and system based on multi-agent
CN117391153A (en) Multi-agent online learning method based on decision-making attention mechanism
CN119004986A (en) Man-machine intelligent game countermeasure scene design method in field of chess deduction
Liu et al. Scaling up multi-agent reinforcement learning: An extensive survey on scalability issues
Garcia et al. Inverse engineering preferences in simple games
Kumaran et al. End-to-end procedural level generation in educational games with natural language instruction
Hain et al. Incremental by Design? On the Role of Incumbents in Technology Niches: An Evolutionary Network Analysis
CN119005287A (en) Man-machine reinforcement learning method based on multidimensional human feedback fusion
Apeldoorn. Comprehensible knowledge base extraction for learning agents: practical challenges and applications in games
Dockhorn. Prediction-based search for autonomous game-playing
Blackford et al. The real-time strategy game multi-objective build order problem
Zhang. Key technologies of confrontational intelligent decision support for multi-agent systems
Mohan. Solution of real life optimization problems
Chen. Cooperative and competitive multi-agent deep reinforcement learning
CN111443806A (en) Interactive task control method and device, electronic equipment and storage medium
Navarro et al. Towards real-time agreements

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
