Disclosure of Invention
The technical task of the invention is to provide an intelligent agent service system based on a domestic operating system, so as to solve the problems of poor behavior perception sensitivity, delayed response to preference changes and low decision interpretability that conventional intelligent agent systems exhibit in the domestic ecosystem.
The technical task of the invention is realized in the following way: an intelligent agent service system based on a domestic operating system, which fuses operation behavior perception with reinforcement learning technology and forms a complete perception-modeling-optimization-interpretation technology closed loop through an operation perception engine, a human feedback reinforcement learning center, an MCP protocol adapter and a thinking chain visual designer, thereby realizing the paradigm transition of the intelligent agent from passive execution to active cooperation;
The MCP protocol adapter is used for exchanging the operation instructions recommended by the reinforcement learning center with different data sources and services through a standardized interface and obtaining the results corresponding to the operation instructions, and for transmitting feedback from external systems (such as user scores and eye tracking data) back to the human feedback reinforcement learning center, continuously optimizing its policy network; the thinking chain visual designer converts the complex decision process and data relationships of the human feedback reinforcement learning center into intuitive visual views, helping the user understand the decision logic and behavior patterns of the system, improving the interpretability of the system and the user's trust in it, and guiding the adjustment and optimization of the human feedback reinforcement learning center to form a closed-loop optimization process.
Preferably, the operation perception engine includes:
an operation sequence data acquisition module, which captures user operation behaviors in real time based on a kernel-level hook mechanism of the domestic operating system to form operation sequence data, the user operation behaviors comprising GUI operation events and file system access tracks (a minimal user-space sketch is given after this list);
and a user multidimensional profile construction module, which constructs a multidimensional user behavior profile through event tracing technology and analyzes the contextual relations of operation semantics.
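As an illustration of how the acquisition module can record a file system access track, the following is a minimal sketch in user space built on the `watchdog` library; it is only a stand-in for the kernel-level hook mechanism of the domestic operating system, and the watched directory, capture window and record schema are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

@dataclass
class OperationRecord:
    timestamp: float
    event_type: str   # created / modified / deleted / moved
    path: str

class AccessTrackHandler(FileSystemEventHandler):
    """Appends every file system event to the operation sequence buffer."""
    def __init__(self, sequence):
        self.sequence = sequence

    def on_any_event(self, event):
        if event.is_directory:
            return
        self.sequence.append(
            OperationRecord(time.time(), event.event_type, event.src_path))

if __name__ == "__main__":
    sequence = []                                  # operation sequence data buffer
    observer = Observer()
    observer.schedule(AccessTrackHandler(sequence), path=".", recursive=True)
    observer.start()
    try:
        time.sleep(10)    # capture window; a real engine runs continuously
    finally:
        observer.stop()
        observer.join()
    print(f"captured {len(sequence)} file system events")
```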
Preferably, the operation perception engine also has the following functions:
① Supporting synchronized multi-device perception, i.e. operations on mobile, desktop and cloud terminals are captured in a unified way;
② Adding an abnormal behavior detection function, i.e. real-time alerting on atypical operation patterns;
③ Providing encrypted storage and transmission of behavior data to ensure data security;
④ Supporting plug-in extension, allowing third-party developers to plug in custom behavior perception rules.
Preferably, the reinforcement learning center includes:
a model training module, which trains a behavior-file multi-modal joint probability model, the behavior-file multi-modal joint probability model being a technique that combines behavior data with file data (such as text and images) and models their multi-modal associations within a probabilistic framework;
an optimization engine building module, which builds a policy optimization engine driven by dual-channel feedback and thereby optimizes the behavior-file multi-modal joint probability model;
and a privacy protection and security module, which realizes privacy protection in the reinforcement learning center through privacy protection and security mechanisms.
More preferably, the model training module works as follows:
(1) For each operation behavior, calculating feature values along three dimensions: operation frequency, duration and path complexity; operation frequency counts how many times the user executes each operation behavior within a specific time period, reflecting how often the user uses different operation behaviors; duration records the time the user spends on each operation behavior, reflecting the user's degree of attention to and time invested in different operations; path complexity analyzes the complexity of the path the user takes when executing an operation, such as the directory depth and number of jumps of accessed files, measuring the complexity of the user's operation path;
(2) Treating the user's different operation behaviors as vocabulary and a series of the user's operation sequences as documents, and quantifying the user's preference weight W for different types of operation behaviors with the TF-IDF algorithm (a computational sketch is given after these steps), the specific formula being as follows:
$$W(t,D_i)=\mathrm{TF}(t,D_i)\times\mathrm{IDF}(t)=\frac{n_{t,i}}{\sum_{k} n_{k,i}}\times\log\frac{N}{1+\sum_{j=1}^{N} I(t,D_j)}$$
Wherein, $n_{t,i}$ represents the number of occurrences of the vocabulary $t$ in the document $D_i$; $\sum_{k} n_{k,i}$ represents the total number of vocabulary occurrences in the document $D_i$; $N$ represents the total number of documents; $I(t,D_j)$ represents whether the document $D_j$ contains the vocabulary $t$, taking the value 1 if it does and 0 otherwise;
TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to its frequency in the document but decreases in inverse proportion to its frequency across the corpus, which effectively prevents common words from drowning out key words and improves the correlation between key words and the document;
(3) Converting all operation sequence data into structured feature vectors using the preference weights of the user's different types of operation behaviors together with the three dimension feature values, and using the structured feature vectors as the input for training the behavior-file multi-modal joint probability model;
(4) Calculating the joint probability distribution of operation behaviors and file access behaviors through a dynamic Bayesian network, modeling the user's operation sequences and file access behaviors at different time points as conditional probability distributions, and capturing the causal relations between operation behaviors and file access;
(5) Inputting the structured feature vectors into a multi-layer neural network that outputs a probability distribution over operation suggestions, training the behavior-file joint probability model on historical behavior data with a supervised learning method, constructing the spatio-temporal association between operation sequences and file access, and dynamically updating the probability distribution to reflect the temporal and contextual dependencies of user behavior;
(6) Verifying the performance of the behavior-file multi-modal joint probability model through cross-validation and metric evaluation (such as accuracy, recall and F1 score) to ensure the generalization capability of the model.
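A minimal sketch of step (2), computing the preference weight W directly from the reconstructed formula above; the toy operation sequences are illustrative stand-ins for real operation sequence data, and note that with the $1+\sum_j I(t,D_j)$ smoothing a behavior occurring in every sequence receives a weight of at most zero.

```python
import math
from collections import Counter

def preference_weights(documents):
    """documents: operation sequences (lists of behavior tokens), one per document D_i.
    Returns, per document, the weight W(t, D_i) from the formula above."""
    N = len(documents)
    df = Counter()                        # in how many sequences each behavior t appears
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())      # total vocabulary occurrences in D_i
        weights.append({t: (n / total) * math.log(N / (1 + df[t]))
                        for t, n in counts.items()})
    return weights

# toy data: behaviors as vocabulary, a user's operation sequences as documents
docs = [["open_file", "edit", "save", "open_file"],
        ["open_file", "search", "copy"],
        ["compile", "run", "run", "save"]]
for i, w in enumerate(preference_weights(docs)):
    print(f"D{i}:", {t: round(v, 3) for t, v in w.items()})
```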
More preferably, the working process of the optimization engine building module is specifically as follows:
(1) Designing the feedback channels, comprising an explicit feedback channel and an implicit feedback channel; the explicit channel collects the user's star ratings of intelligent suggestions (on a 1-5 scale) through the user interface; the implicit channel records the user's gaze points, saccade paths and pupil changes during operation via eye tracking, records the user's dwell time on specific operations or interface elements, computes cognitive load indicators such as fixation count and average dwell time from the eye movement and dwell time data, converts the ratings and cognitive load indicators into numeric feedback signals, and feeds them in as the reward function input for reinforcement learning;
(2) Designing a reward function that combines the explicit and implicit feedback signals, the explicit feedback signal being used directly as a reward value and the implicit feedback signal influencing the reward indirectly through the cognitive load indicators (a sketch of such a reward function follows this list);
(3) Using the PPO algorithm to compute the gradient updates of the policy network of the behavior-file multi-modal joint probability model, and ensuring training stability through clipped (truncated) policy updates, wherein the PPO algorithm adopts multi-objective optimization: it optimizes the user's operation path by minimizing path entropy to improve operation efficiency, and it increases the unidentifiability of sensitive operations by maximizing their obfuscation degree to protect user privacy.
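A minimal sketch of such a dual-channel reward function; the channel weights, the rating normalization and the cognitive load formula are assumptions made for the example, not values fixed by the system.

```python
def reward(star_rating=None, fixation_count=0, avg_dwell_ms=0.0,
           w_explicit=1.0, w_load=0.5):
    """Dual-channel reward: the explicit star rating (1-5) is used directly;
    implicit eye-tracking signals enter via a cognitive load index."""
    r = 0.0
    if star_rating is not None:
        r += w_explicit * (star_rating - 3) / 2.0   # map 1..5 onto [-1, 1]
    # more fixations / longer dwell -> higher cognitive load -> lower reward
    load = 0.01 * fixation_count + 0.001 * avg_dwell_ms
    r -= w_load * load
    return r

print(reward(star_rating=5, fixation_count=12, avg_dwell_ms=300))
```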
More preferably, the privacy protection and security module works as follows:
(1) During back propagation, masking the gradients related to sensitive data with a gradient masking technique, ensuring that private data is not leaked;
(2) Obfuscating the behavior features of sensitive operations by adding noise to or transforming the feature vectors, reducing the identifiability of the sensitive operations (a sketch of both mechanisms follows this list);
(3) Periodically evaluating the privacy protection effect of the system to ensure that the privacy protection mechanism remains effective.
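A minimal sketch of steps (1) and (2) in PyTorch, assuming that sensitivity is known at the level of named parameters and feature indices; which parameters and feature dimensions count as sensitive is an assumption of the example.

```python
import torch

def mask_sensitive_gradients(model, sensitive_param_names):
    """Step (1): zero the gradients of sensitive parameters during backpropagation."""
    for name, p in model.named_parameters():
        if name in sensitive_param_names:
            p.register_hook(lambda g: torch.zeros_like(g))

def obfuscate_features(x, sensitive_idx, noise_scale=0.1):
    """Step (2): add noise to the feature dimensions of sensitive operations."""
    x = x.clone()
    x[:, sensitive_idx] += noise_scale * torch.randn_like(x[:, sensitive_idx])
    return x

model = torch.nn.Linear(8, 4)
mask_sensitive_gradients(model, {"bias"})        # which parameters are sensitive: assumed
x = obfuscate_features(torch.randn(2, 8), sensitive_idx=[0, 1])
model(x).sum().backward()
print(model.bias.grad)                           # all zeros: masked by the hook
```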
Preferably, the MCP protocol adapter includes a protocol gateway deployment module and a dynamic service discovery module;
The protocol gateway deployment module is used for deploying the MCP protocol gateway with a client-server architecture, wherein the client receives the operation policies generated by the reinforcement learning center and the server side exposes external data sources and tools; the client and the server perform capability negotiation to determine which functions and services each side provides to the other; the module integrates the JSON-RPC 2.0 standard protocol and supports two communication modes, specifically:
① a local pipe (stdio) mode achieving low-latency responses (<10 ms), suited to processing local operation behavior data (a request sketch follows below);
② a network streaming (SSE) mode supporting highly concurrent calls, suited to processing behavior data in a distributed system;
The dynamic service discovery module is used for identifying available MCP servers through an automatic scanning mechanism, the available MCP servers including local IDE plug-ins, enterprise ERP system interfaces and cloud AI services (such as a Claude reasoning engine); the client sends requests to a server according to the user's request or the needs of the AI model, the server processes the request and may interact with local or remote resources, and after the operation has been executed the server returns the processing result to the client, which passes the information back to the host application; parameterized resource locating is realized with dynamic URI templates while the JSON-RPC 2.0 standard protocol is maintained, ensuring the flexibility and dynamism of service discovery.
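A minimal sketch of a JSON-RPC 2.0 call in the local pipe (stdio) mode: one request is written to an MCP server's standard input as newline-delimited JSON and one response is read back. The server command, method name and parameters are illustrative assumptions, not a definitive client implementation.

```python
import json
import subprocess

def stdio_rpc_call(server_cmd, method, params, req_id=1):
    """Send one JSON-RPC 2.0 request to an MCP server over its stdin/stdout pipe."""
    proc = subprocess.Popen(server_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    request = {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}
    proc.stdin.write(json.dumps(request) + "\n")   # newline-delimited JSON framing
    proc.stdin.flush()
    response = json.loads(proc.stdout.readline())  # the matching response object
    proc.terminate()
    return response.get("result", response.get("error"))

# Hypothetical usage: forward a recommended operation instruction to a tool server.
# result = stdio_rpc_call(["my-mcp-server"], "tools/call",
#                         {"name": "open_document", "arguments": {"path": "report.docx"}})
```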
More preferably, the MCP protocol adapter has the following functions:
① Supporting cross-platform compatibility, namely Windows, Linux, macOS and mobile terminals;
② Supporting protocol version management, allowing seamless switching between different versions of the MCP protocol;
③ Providing a service health monitoring function that monitors the availability of MCP servers in real time;
④ Supporting a service circuit-breaking mechanism that prevents single-point failures from crashing the system.
Preferably, the thinking chain visual designer includes:
a decision traceability model construction module, which constructs a decision traceability model based on a multi-head attention mechanism, fuses time-series behavior data with system state features, and tracks and records the formation process of every decision generated by the reinforcement learning center so as to construct a complete decision chain, the formation process of each decision including its key influencing factors and the context information at decision time;
a decision path reconstruction module, which reconstructs decision paths with an LSTM network weighted by a time decay factor (a sketch follows this list), models and analyzes the time-series data of the decision process, and highlights decision trends and patterns that change over time, helping to understand how decisions evolve;
a causal relation analysis module, which generates an interpretable view with causal relations by combining knowledge graph technology, associating every event and operation in the decision process with the results it produces to form a causal relation graph and reveal the logic and motivation behind decisions;
a visualization output module, which produces a behavior heatmap, a file association network and a policy evolution timeline; the behavior heatmap presents the distribution of operation patterns, showing the user's operation frequencies and patterns at different times and in different scenarios and helping to identify high-frequency operation areas and user behavior habits; the file association network reveals the implicit knowledge structure by displaying the association relations among files; the policy evolution timeline presents how the reinforcement learning policy evolves and is optimized over time;
and a feedback and optimization module, which feeds the visualization results back to the reinforcement learning center as a reference for further optimizing the reinforcement learning policy; by analyzing the visualization output, potential problems and room for improvement in the policy are found, guiding the adjustment and optimization of the reinforcement learning algorithm and forming a closed-loop optimization process.
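A minimal sketch of the decision path reconstruction idea, assuming an exponential time decay applied to the input features of a PyTorch LSTM; the feature dimension, hidden size and decay rate are assumptions made for the example.

```python
import torch
import torch.nn as nn

class DecayWeightedLSTM(nn.Module):
    """LSTM whose inputs are weighted by an exponential time decay factor,
    so that recent decision steps dominate the reconstructed path."""
    def __init__(self, feat_dim=16, hidden_dim=32, decay=0.1):
        super().__init__()
        self.decay = decay
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, x):                       # x: (batch, T, feat_dim)
        T = x.size(1)
        ages = torch.arange(T - 1, -1, -1, dtype=x.dtype, device=x.device)
        w = torch.exp(-self.decay * ages)       # weight 1.0 at the newest step
        out, _ = self.lstm(x * w.view(1, T, 1))
        return out[:, -1]                       # summary of the decision path

path_summary = DecayWeightedLSTM()(torch.randn(4, 20, 16))
print(path_summary.shape)   # torch.Size([4, 32])
```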
The intelligent agent service system based on the domestic operating system has the following advantages:
Firstly, through multi-modal behavior perception, behavior-file multi-modal reinforcement learning optimization and lightweight thinking chain generation technologies, the invention innovatively fuses MCP protocol access, dynamic behavior modeling and a human-machine co-evolution mechanism, solving the problems of poor behavior perception sensitivity, delayed response to preference changes and low decision interpretability that conventional intelligent agent systems exhibit in the domestic ecosystem, and realizing autonomous, controllable and efficient reasoning for intelligent services in a domestic environment;
Secondly, the invention realizes dynamic access to cross-platform agent tools through the Model Context Protocol (MCP), and combines operation behavior perception, file semantic understanding and a human feedback reinforcement learning algorithm to construct a user-personalized thinking chain system;
Thirdly, through the three technical paths of domestic adaptation, security enhancement and ecosystem cooperation, the invention breaks through the core bottlenecks that agent technology faces in practical applications on domestic operating systems;
Fourthly, the invention aims to construct an intelligent agent service system and device based on a domestic operating system, realizing the paradigm transition from passive execution to active cooperation by constructing an operation-feedback-optimization technology closed loop; specifically, a unified tool access framework is constructed with the Model Context Protocol (MCP): seamless integration of heterogeneous tools is realized on a client-server architecture, tool call requests are encapsulated with the JSON-RPC 2.0 protocol, a dynamic service discovery mechanism automatically matches local tools or server-side APIs, and fine-grained permission control ensures that only authenticated devices and users can access system resources; a three-dimensional operation-file-feedback behavior modeling system is established, a bidirectional LSTM analyzes window focus trajectories and gesture operation sequences, and a multi-modal feedback interface converts user ratings into reinforcement learning reward signals; a thinking chain optimization engine driven by a hierarchical PPO algorithm is deployed, in which a meta-policy network fuses operation timing features with file TF-IDF vectors to optimize long-term objectives, while a task policy network achieves fast policy updates through a real-time data pipeline;
Fifthly, the invention guarantees data sovereignty through domestic kernel-level monitoring and realizes transmission encryption with national cryptographic algorithms; it constructs a dynamic behavior-file association model that breaks through the static limitation of traditional log analysis, and innovatively fuses a human feedback mechanism with interpretable AI technology so that the agent's decision process is both evolvable and transparent; experiments show that the invention can improve operation efficiency in common office scenarios by 37%, reduce the misoperation rate by 62%, and provide privacy protection capability conforming to the GB/T 35273 standard.
Detailed Description
An intelligent agent service system based on a domestic operating system according to the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Examples:
As shown in fig. 1, the present embodiment provides an agent service system based on a domestic operating system, which fuses operation behavior perception with reinforcement learning technology and forms a complete perception-modeling-optimization-interpretation technology closed loop through an operation perception engine, a human feedback reinforcement learning center, an MCP protocol adapter and a thinking chain visual designer, thereby realizing the transition of the agent from passive execution to active cooperation;
The MCP protocol adapter is used for exchanging the operation instructions recommended by the reinforcement learning center with different data sources and services through a standardized interface and obtaining the results corresponding to the operation instructions, and for transmitting feedback from external systems (such as user scores and eye tracking data) back to the human feedback reinforcement learning center, continuously optimizing its policy network; the thinking chain visual designer converts the complex decision process and data relationships of the human feedback reinforcement learning center into intuitive visual views, helping the user understand the decision logic and behavior patterns of the system, improving the interpretability of the system and the user's trust in it, and guiding the adjustment and optimization of the human feedback reinforcement learning center to form a closed-loop optimization process.
The operation perception engine in this embodiment includes:
an operation sequence data acquisition module, which captures user operation behaviors in real time based on a kernel-level hook mechanism of the domestic operating system to form operation sequence data, the user operation behaviors comprising GUI operation events and file system access tracks;
and a user multidimensional profile construction module, which constructs a multidimensional user behavior profile through event tracing technology and analyzes the contextual relations of operation semantics.
The operation perception engine in this embodiment also has the following functions:
① Supporting synchronized multi-device perception, i.e. operations on mobile, desktop and cloud terminals are captured in a unified way;
② Adding an abnormal behavior detection function, i.e. real-time alerting on atypical operation patterns;
③ Providing encrypted storage and transmission of behavior data to ensure data security;
④ Supporting plug-in extension, allowing third-party developers to plug in custom behavior perception rules.
The reinforcement learning center in the present embodiment includes:
a model training module, which trains a behavior-file multi-modal joint probability model, the behavior-file multi-modal joint probability model being a technique that combines behavior data with file data (such as text and images) and models their multi-modal associations within a probabilistic framework;
an optimization engine building module, which builds a policy optimization engine driven by dual-channel feedback and thereby optimizes the behavior-file multi-modal joint probability model;
and a privacy protection and security module, which realizes privacy protection in the reinforcement learning center through privacy protection and security mechanisms.
The working process of the model training module in this embodiment is specifically as follows:
(1) For each operation behavior, calculating feature values along three dimensions: operation frequency, duration and path complexity; operation frequency counts how many times the user executes each operation behavior within a specific time period, reflecting how often the user uses different operation behaviors; duration records the time the user spends on each operation behavior, reflecting the user's degree of attention to and time invested in different operations; path complexity analyzes the complexity of the path the user takes when executing an operation, such as the directory depth and number of jumps of accessed files, measuring the complexity of the user's operation path;
(2) Treating the user's different operation behaviors as vocabulary and a series of the user's operation sequences as documents, and quantifying the user's preference weight W for different types of operation behaviors with the TF-IDF algorithm, the specific formula being as follows:
$$W(t,D_i)=\mathrm{TF}(t,D_i)\times\mathrm{IDF}(t)=\frac{n_{t,i}}{\sum_{k} n_{k,i}}\times\log\frac{N}{1+\sum_{j=1}^{N} I(t,D_j)}$$
Wherein, $n_{t,i}$ represents the number of occurrences of the vocabulary $t$ in the document $D_i$; $\sum_{k} n_{k,i}$ represents the total number of vocabulary occurrences in the document $D_i$; $N$ represents the total number of documents; $I(t,D_j)$ represents whether the document $D_j$ contains the vocabulary $t$, taking the value 1 if it does and 0 otherwise;
(3) Converting all operation sequence data into structured feature vectors using the preference weights of the user's different types of operation behaviors together with the three dimension feature values, and using the structured feature vectors as the input for training the behavior-file multi-modal joint probability model;
(4) Calculating the joint probability distribution of operation behaviors and file access behaviors through a dynamic Bayesian network, modeling the user's operation sequences and file access behaviors at different time points as conditional probability distributions, and capturing the causal relations between operation behaviors and file access;
(5) Inputting the structured feature vectors into a multi-layer neural network that outputs a probability distribution over operation suggestions, training the behavior-file joint probability model on historical behavior data with a supervised learning method, constructing the spatio-temporal association between operation sequences and file access, and dynamically updating the probability distribution to reflect the temporal and contextual dependencies of user behavior;
(6) Verifying the performance of the behavior-file multi-modal joint probability model through cross-validation and metric evaluation (such as accuracy, recall and F1 score) to ensure the generalization capability of the model.
The working process of the optimization engine building module in this embodiment is specifically as follows:
(1) Designing the feedback channels, comprising an explicit feedback channel and an implicit feedback channel; the explicit channel collects the user's star ratings of intelligent suggestions (on a 1-5 scale) through the user interface; the implicit channel records the user's gaze points, saccade paths and pupil changes during operation via eye tracking, records the user's dwell time on specific operations or interface elements, computes cognitive load indicators such as fixation count and average dwell time from the eye movement and dwell time data, converts the ratings and cognitive load indicators into numeric feedback signals, and feeds them in as the reward function input for reinforcement learning;
(2) Designing a reward function that combines the explicit and implicit feedback signals, the explicit feedback signal being used directly as a reward value and the implicit feedback signal influencing the reward indirectly through the cognitive load indicators;
(3) Using the PPO algorithm to compute the gradient updates of the policy network of the behavior-file multi-modal joint probability model, and ensuring training stability through clipped (truncated) policy updates, wherein the PPO algorithm adopts multi-objective optimization: it optimizes the user's operation path by minimizing path entropy to improve operation efficiency, and it increases the unidentifiability of sensitive operations by maximizing their obfuscation degree to protect user privacy.
The working process of the privacy protection and security module in this embodiment is specifically as follows:
(1) During back propagation, masking the gradients related to sensitive data with a gradient masking technique, ensuring that private data is not leaked;
(2) Obfuscating the behavior features of sensitive operations by adding noise to or transforming the feature vectors, reducing the identifiability of the sensitive operations;
(3) Periodically evaluating the privacy protection effect of the system to ensure that the privacy protection mechanism remains effective.
The MCP protocol adapter in the embodiment comprises a protocol gateway deployment module and a dynamic service discovery module;
The protocol gateway deployment module is used for deploying the MCP protocol gateway with a client-server architecture, wherein the client receives the operation policies generated by the reinforcement learning center and the server side exposes external data sources and tools; the client and the server perform capability negotiation to determine which functions and services each side provides to the other; the module integrates the JSON-RPC 2.0 standard protocol and supports two communication modes, specifically:
① a local pipe (stdio) mode achieving low-latency responses (<10 ms), suited to processing local operation behavior data;
② a network streaming (SSE) mode supporting highly concurrent calls, suited to processing behavior data in a distributed system;
The dynamic service discovery module is used for identifying available MCP servers through an automatic scanning mechanism, the available MCP servers including local IDE plug-ins, enterprise ERP system interfaces and cloud AI services (such as a Claude reasoning engine); the client sends requests to a server according to the user's request or the needs of the AI model, the server processes the request and may interact with local or remote resources, and after the operation has been executed the server returns the processing result to the client, which passes the information back to the host application; parameterized resource locating is realized with dynamic URI templates while the JSON-RPC 2.0 standard protocol is maintained, ensuring the flexibility and dynamism of service discovery.
The MCP protocol adapter in this embodiment has the following functions:
① Supporting cross-platform compatibility, namely Windows, Linux, macOS and mobile terminals;
② Supporting protocol version management, allowing seamless switching between different versions of the MCP protocol;
③ Providing a service health monitoring function that monitors the availability of MCP servers in real time;
④ Supporting a service circuit-breaking mechanism that prevents single-point failures from crashing the system.
The thinking chain visual designer in this embodiment includes:
a decision traceability model construction module, which constructs a decision traceability model based on a multi-head attention mechanism, fuses time-series behavior data with system state features, and tracks and records the formation process of every decision generated by the reinforcement learning center so as to construct a complete decision chain, the formation process of each decision including its key influencing factors and the context information at decision time;
a decision path reconstruction module, which reconstructs decision paths with an LSTM network weighted by a time decay factor, models and analyzes the time-series data of the decision process, and highlights decision trends and patterns that change over time, helping to understand how decisions evolve;
a causal relation analysis module, which generates an interpretable view with causal relations by combining knowledge graph technology, associating every event and operation in the decision process with the results it produces to form a causal relation graph and reveal the logic and motivation behind decisions;
a visualization output module, which produces a behavior heatmap, a file association network and a policy evolution timeline; the behavior heatmap presents the distribution of operation patterns, showing the user's operation frequencies and patterns at different times and in different scenarios and helping to identify high-frequency operation areas and user behavior habits; the file association network reveals the implicit knowledge structure by displaying the association relations among files; the policy evolution timeline presents how the reinforcement learning policy evolves and is optimized over time;
and a feedback and optimization module, which feeds the visualization results back to the reinforcement learning center as a reference for further optimizing the reinforcement learning policy; by analyzing the visualization output, potential problems and room for improvement in the policy are found, guiding the adjustment and optimization of the reinforcement learning algorithm and forming a closed-loop optimization process.
The working process of this embodiment is specifically as follows:
The method comprises the following steps: S1, capturing the user's operation behaviors in real time based on the operation perception engine, the captured behaviors being the input basis of the subsequent modules (such as the reinforcement learning center and the MCP protocol adapter); that is, user operation behaviors are captured in real time based on the kernel-level hook mechanism of the domestic operating system, including GUI operation events (such as window focus switching, control clicks and shortcut key triggers) and file system access tracks (such as create/read/write/delete operations), forming operation sequence data;
S2, inputting the user operation behavior data captured by the operation perception engine as features into the reinforcement learning center and training the behavior-file multi-modal joint probability model, which receives the user's task instructions and outputs a probability distribution over suggested operation instructions, specifically comprising the following steps:
S201, calculating feature values along three dimensions for each operation behavior: operation frequency, duration and path complexity; operation frequency counts how many times the user executes each operation behavior within a specific time period, reflecting how often the user uses different operation behaviors; duration records the time the user spends on each operation behavior, reflecting the user's degree of attention to and time invested in different operations; path complexity analyzes the complexity of the path the user takes when executing an operation, such as the directory depth and number of jumps of accessed files, measuring the complexity of the user's operation path;
S202, treating the user's different operation behaviors as vocabulary and a series of the user's operation sequences as documents, and quantifying the user's preference weight W for different types of operation behaviors with the TF-IDF algorithm, the specific formula being as follows:
$$W(t,D_i)=\mathrm{TF}(t,D_i)\times\mathrm{IDF}(t)=\frac{n_{t,i}}{\sum_{k} n_{k,i}}\times\log\frac{N}{1+\sum_{j=1}^{N} I(t,D_j)}$$
Wherein, $n_{t,i}$ represents the number of occurrences of the vocabulary $t$ in the document $D_i$; $\sum_{k} n_{k,i}$ represents the total number of vocabulary occurrences in the document $D_i$; $N$ represents the total number of documents; $I(t,D_j)$ represents whether the document $D_j$ contains the vocabulary $t$, taking the value 1 if it does and 0 otherwise;
S203, converting all operation sequence data into structured feature vectors using the calculated preference weights and the extracted three dimension feature values, and using the structured feature vectors as the input for model training;
S204, calculating the joint probability distribution of operation behaviors and file access behaviors through a dynamic Bayesian network, modeling the user's operation sequences and file access behaviors at different time points as conditional probability distributions, and capturing the causal relations between operation behaviors and file access;
S205, constructing a multi-layer neural network that takes the structured feature vectors as input and outputs a probability distribution over operation suggestions (a training sketch follows these steps), training the behavior-file joint probability model on historical behavior data with a supervised learning method, constructing the spatio-temporal association between operation sequences and file access, and dynamically updating the probability distribution to reflect the temporal and contextual dependencies of user behavior;
S206, verifying model performance through cross-validation and metric evaluation (such as accuracy, recall and F1 score) to ensure the generalization capability of the model;
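A minimal sketch of S205, assuming a small PyTorch multi-layer network trained with supervised learning; the layer sizes, the number of candidate operations and the random stand-in data are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Map a structured feature vector to a probability distribution over
# candidate operation instructions.
n_features, n_operations = 32, 10
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_operations),            # logits; softmax applied below / by the loss
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# supervised training on historical behavior data (random stand-in here)
X = torch.randn(256, n_features)            # structured feature vectors
y = torch.randint(0, n_operations, (256,))  # operations the user actually took
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

suggestion_probs = torch.softmax(model(X[:1]), dim=-1)  # distribution over suggestions
print(round(suggestion_probs.sum().item(), 4))          # ~1.0
```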
S3, receiving the user's task instruction, obtaining the optimal operation instruction from the trained behavior-file multi-modal joint probability model, calling external data sources and services through the standardized interface provided by the MCP protocol adapter, executing the corresponding instruction and obtaining the operation result corresponding to the operation instruction; that is, based on the operation policy generated by the reinforcement learning center, the operation instruction is transmitted through the MCP protocol adapter to external systems (such as ERP and AI services) and the corresponding operations are executed automatically; an MCP protocol adaptation framework is constructed, comprising two parts, protocol gateway deployment and dynamic service discovery, wherein the protocol gateway deployment deploys the MCP protocol gateway with a client-server architecture: the client receives the operation policies generated by the reinforcement learning center, the server side performs capability negotiation with external data sources to determine the functions and services each side can provide to the other, and the gateway integrates the JSON-RPC 2.0 standard protocol and supports two communication modes:
① the local pipe (stdio) mode achieves low-latency responses (<10 ms) and suits processing local operation behavior data;
② the network streaming (SSE) mode supports highly concurrent calls and suits processing behavior data in a distributed system;
the dynamic service discovery identifies available MCP servers through an automatic scanning mechanism, including but not limited to local IDE plug-ins, enterprise ERP system interfaces and cloud AI services (such as a Claude reasoning engine); the client sends requests to a server according to the user's request or the needs of the AI model, and the server processes the request and may interact with local or remote resources;
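A minimal sketch of the automatic scanning mechanism, probing candidate endpoints with a JSON-RPC 2.0 request over HTTP; the endpoint list and the "ping" method are assumptions made for the example, and unreachable servers are simply skipped, in the spirit of the service circuit-breaking mechanism mentioned above.

```python
import json
import urllib.request

def discover_servers(candidates, timeout=2.0):
    """Scan candidate MCP endpoints and keep the reachable ones."""
    available = []
    for url in candidates:
        payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "ping"}).encode()
        req = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status == 200:
                    available.append(url)
        except OSError:
            pass    # unreachable server: skip it rather than fail the scan
    return available

# hypothetical local endpoints for the example
print(discover_servers(["http://localhost:8001/mcp", "http://localhost:8002/mcp"]))
```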
S4, collecting the user's feedback on the operation results (such as user scores and eye tracking data) with the dual-channel-feedback-driven policy optimization engine in the reinforcement learning center, and further optimizing the policy to improve the model's operation efficiency and privacy protection capability, comprising the following steps:
S401, designing the feedback channels, comprising an explicit feedback channel and an implicit feedback channel; the explicit channel receives the user's star ratings of intelligent suggestions through the user interface on a 1-5 scale; the implicit channel records the user's gaze points, saccade paths and pupil changes during operation via eye tracking, records the user's dwell time on specific operations or interface elements, computes cognitive load indicators such as fixation count and average dwell time from the eye movement and dwell time data, converts the ratings and cognitive load indicators into numeric feedback signals, and feeds them in as the reward function input for reinforcement learning;
S402, designing a reward function that combines the explicit and implicit feedback signals, the explicit feedback signal being used directly as a reward value and the implicit feedback signal influencing the reward indirectly through the cognitive load indicators;
S403, computing the gradient updates of the policy network of the behavior-file multi-modal joint probability model with the PPO algorithm, and ensuring training stability through clipped (truncated) policy updates, wherein the PPO algorithm adopts multi-objective optimization: it optimizes the user's operation path by minimizing path entropy to improve operation efficiency, and it increases the unidentifiability of sensitive operations by maximizing their obfuscation degree to protect user privacy;
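A minimal sketch of the clipped (truncated) PPO policy update of S403; the toy log-probabilities and advantages are illustrative, and the multi-objective terms (path entropy, sensitive-operation obfuscation) would enter as additional weighted loss terms.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: the probability ratio is truncated to
    [1 - eps, 1 + eps] so one update cannot move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # extra objectives (path entropy, obfuscation) would be added here as weighted terms
    return -torch.min(unclipped, clipped).mean()

# toy batch: log-probs under the old and current policies, plus advantages
logp_old = torch.tensor([0.20, 0.50, 0.30]).log()
logp_new = torch.tensor([0.25, 0.45, 0.30]).log().requires_grad_()
adv = torch.tensor([1.0, -0.5, 0.2])
loss = ppo_clipped_loss(logp_new, logp_old, adv)
loss.backward()    # gradient used to update the policy network
print(loss.item())
```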
S5, converting the decision process and data relationships of the behavior-file multi-modal joint probability model into intuitive visual views with the thinking chain visual designer, helping the user understand the decision logic and behavior patterns of the system, improving the interpretability of the system and the user's trust in it, and guiding the adjustment and optimization of the reinforcement learning algorithm to form a closed-loop optimization process, comprising the following steps:
S501, constructing a decision traceability model based on a multi-head attention mechanism, fusing time-series behavior data with system state features, and tracking and recording the formation process of every decision generated by the reinforcement learning center, including the decision's key influencing factors and the context information at decision time, so as to construct a complete decision chain;
S502, decision path reconstruction: reconstructing decision paths with an LSTM network weighted by a time decay factor, modeling and analyzing the time-series data of the decision process, and highlighting decision trends and patterns that change over time, helping to understand how decisions evolve;
S503, causal relation analysis: generating an interpretable view with causal relations by combining knowledge graph technology, associating every event and operation in the decision process with the results it produces to form a causal relation graph and reveal the logic and motivation behind decisions;
S504, visualization output, specifically as follows:
S50401, a behavior heatmap presenting the distribution of operation patterns (a plotting sketch is given after these steps), showing the user's operation frequencies and patterns at different times and in different scenarios and helping to identify high-frequency operation areas and user behavior habits;
S50402, a file association network revealing the implicit knowledge structure, displaying the association relations among files, including direct reference, indirect association and content similarity, so that latent knowledge structures and information flows can be discovered;
S50403, a policy evolution timeline showing the learning process, presenting how the reinforcement learning policy evolves and is optimized over time, including the key nodes of policy adjustment and the trends of performance indicators, helping to evaluate the learning effect and the convergence of the policy;
and S505, feedback and optimization: feeding the visualization results back to the reinforcement learning center as a reference for further policy optimization; by analyzing the visualization output, potential problems and room for improvement in the policy are found, guiding the adjustment and optimization of the reinforcement learning algorithm and forming a closed-loop optimization process.
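A minimal sketch of the behavior heatmap of S50401, aggregating operation counts per behavior type and hour of day with matplotlib; the behavior vocabulary and the random counts are stand-ins for real operation sequence data collected by the perception engine.

```python
import numpy as np
import matplotlib.pyplot as plt

# Operation counts per behavior type and hour of day; random stand-in data.
behaviors = ["open_file", "edit", "save", "search", "compile"]
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=(len(behaviors), 24))

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(counts, aspect="auto", cmap="hot")
ax.set_yticks(range(len(behaviors)))
ax.set_yticklabels(behaviors)
ax.set_xlabel("hour of day")
ax.set_title("behavior heatmap: operation frequency by hour")
fig.colorbar(im, ax=ax, label="operation count")
fig.tight_layout()
fig.savefig("behavior_heatmap.png")   # high-frequency areas show as bright cells
```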
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.