CN119168590A

Info

Publication number: CN119168590A
Application number: CN202411406798.XA
Authority: CN
Inventors: 曹侃; 袁俊刚; 徐伟
Original assignee: China United Network Communications Group Co Ltd; Unicom Digital Technology Co Ltd; China Unicom Internet of Things Corp Ltd
Current assignee: China United Network Communications Group Co Ltd; Unicom Digital Technology Co Ltd; China Unicom Internet of Things Corp Ltd
Priority date: 2024-10-09
Filing date: 2024-10-09
Publication date: 2024-12-20

Abstract

The application provides a fault detection method, system, apparatus, device, medium and program product, and relates to the technical field of artificial intelligence. The method comprises: when first detection data of a detected object indicates that the detected object is faulty, obtaining a fault detection data set of the detected object, wherein the first detection data is detection data obtained in any time frame, the fault detection data set comprises a plurality of second detection data, and each second detection data is detection data obtained after the detected object becomes faulty; obtaining configuration information of the detected object; judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judgment result indicates that the detected object is faulty; and reporting the fault analysis report to an operation and maintenance platform. The method addresses the slow response and high false alarm rate caused by the low degree of automation of existing operation and maintenance teams.

Description

Fault detection method, system, device, equipment, medium and program product

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a fault detection method, system, apparatus, device, medium, and program product.

Background

With the rapid development of information technology, the demands on system stability and reliability keep increasing, so enterprises attach more and more importance to the role of the operation and maintenance team in digital transformation. To accommodate this change, the operation and maintenance team needs to continually improve its skills and tools and adopt more advanced management strategies, so that the information system can support the rapid development and business innovation of the enterprise.

The existing operation and maintenance team relies on manual intervention to confirm and process alarm information, and lacks intelligent automatic processing capability, so that the fault response time is prolonged, and the probability of human errors is increased.

Disclosure of Invention

The application provides a fault detection method, a system, a device, equipment, a medium and a program product, which are used for solving the technical problems of low response speed and high false alarm rate caused by lower automation degree of the existing operation and maintenance team.

In a first aspect, the present application provides a fault detection method, where the method is applied to a core agent of an agent platform, the method includes:

Acquiring a fault detection data set of a detected object when first detection data of the detected object indicate that the detected object has a fault, wherein the first detection data refers to detection data obtained in any time frame, the fault detection data set comprises a plurality of second detection data, and each second detection data refers to detection data obtained after the detected object has the fault;

Acquiring configuration information of the detected object;

Judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judging result indicates that the detected object is faulty;

and reporting the fault analysis report to an operation and maintenance platform.

Optionally, in the method as described above, the agent platform further includes a monitoring agent, and acquiring the fault detection data set of the detected object when the first detection data of the detected object indicates that the detected object is faulty includes:

obtaining fault cause analysis pushed by the monitoring agent to determine the fault of the detected object, wherein the fault cause analysis is obtained according to the first detection data;

And adjusting the monitoring agent according to the fault cause analysis so as to collect the fault detection data set from the detected object according to the adjusted monitoring agent.

Optionally, the method as described above, wherein the monitoring agent comprises a plurality of sub-agents, and the fault cause analysis comprises a fault prompt word;

and adjusting the monitoring agent according to the fault cause analysis, wherein the method comprises the following steps:

Extracting the fault prompt words from the fault cause analysis;

generating a function instruction through a preset natural language processing model and a function instruction library according to the fault prompt word;

And calling at least one target intelligent agent from the plurality of sub-intelligent agents according to the function instruction, and adjusting the sampling mode of the at least one target intelligent agent.

Optionally, the method as described above, the plurality of sub-agents includes a link agent and a log agent;

Calling at least one target agent from the plurality of sub-agents according to the function instruction, and adjusting a sampling mode of the at least one target agent, including:

When the function instruction indicates calling the link agent, the sampling frequency of the link agent is increased until the number of the plurality of second detection data reaches a first preset number and/or a plurality of adjacent second detection data remain stable;

And when the function instruction indicates to call the log intelligent agent, the sampling range and/or the sampling precision of the log intelligent agent are increased until the number of the plurality of second detection data reaches a second preset number.

Optionally, the method as described above, where the agent platform further includes a configuration management database agent, the obtaining configuration information of the detected object includes:

and according to the fault cause analysis pushed by the monitoring agent, calling the configuration information of the detected object from the configuration information set of a preset configuration management database by the configuration management database agent.

Optionally, in the method as described above, the determining whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and when the determination result indicates that the detected object is faulty, obtaining a fault analysis report of the detected object includes:

Inputting the first detection data, the fault detection data set and the configuration information into a preset fault tree model to obtain a first judgment result output by the fault tree model;

and when the first judging result indicates that the detected object has a fault, extracting the fault analysis report from the fault tree model.

Optionally, in the method as described above, when the first determination result does not indicate the detected object fault, the method further includes:

Reporting the first detection data, the fault detection data set and the configuration information to a fault detection platform so that the fault detection platform can output the fault analysis report;

And updating the fault tree model according to the fault analysis report.

Optionally, in the method as described above, the agent platform further includes an alarm agent, and the reporting the fault analysis report to an operation and maintenance platform includes:

pushing the fault analysis report to the alarm intelligent agent so that the alarm intelligent agent can generate an alarm prompt according to the fault analysis report, and pushing the alarm prompt to the operation and maintenance platform.

In a second aspect, the application provides a fault detection system, comprising at least one detected device, an operation and maintenance platform and an agent platform in communication connection with the operation and maintenance platform;

The intelligent agent platform is used for realizing the fault detection method provided by the first aspect of the application.

Optionally, in the system as described above, the agent platform includes a core agent, and a monitoring agent, a configuration management database agent, and an alarm agent communicatively coupled to the core agent;

the monitoring agent comprises a link agent and a log agent.

In a third aspect, the present application provides a fault detection device, the device being applied to a core agent of an agent platform, the device comprising:

The fault detection data set acquisition module is used for acquiring a fault detection data set of a detected object when first detection data of the detected object indicates that the detected object is faulty, wherein the first detection data is detection data obtained in any time frame, the fault detection data set comprises a plurality of second detection data, and each second detection data is detection data obtained after the detected object becomes faulty;

a configuration information acquisition module, configured to acquire configuration information of the detected object;

The fault analysis report acquisition module is used for judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judging result indicates that the detected object is faulty;

And the fault analysis report reporting module is used for reporting the fault analysis report to the operation and maintenance platform.

In a fourth aspect, the present application provides an electronic device comprising a processor, and a memory communicatively coupled to the processor;

The memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored in the memory to implement the fault detection method provided in the first aspect of the present application.

In a fifth aspect, the present application provides a computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the fault detection method provided in the first aspect of the present application.

In a sixth aspect, an embodiment of the present application provides a computer program product, including a computer program, where the computer program when executed by a processor implements the fault detection method provided in the first aspect of the present application.

The application provides a fault detection method, system, apparatus, device, medium and program product. The fault detection method comprises: acquiring a fault detection data set of a detected object when first detection data of the detected object indicates that the detected object is faulty; acquiring configuration information of the detected object; judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judgment result indicates that the detected object is faulty; and reporting the fault analysis report to an operation and maintenance platform. The method achieves the following technical effects. The core agent obtains the fault analysis report from the first detection data, the fault detection data set and the configuration information of the detected object, and reports the consolidated report to the operation and maintenance platform, which handles the fault directly according to the report. This greatly improves the degree of automation, which shortens the fault response time and reduces the probability of human error. The agent platform supports dynamic expansion and configuration, so the monitoring range and the alarm policy can be adjusted flexibly according to actual requirements. In addition, the agents exchange data and instructions through efficient communication protocols, which ensures the overall performance and stability of the agent platform.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic flow chart of a fault detection method according to an embodiment of the present application;

Fig. 2 is a second schematic flow chart of a fault detection method according to an embodiment of the present application;

Fig. 3 is a third schematic flow chart of a fault detection method according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a fault detection device according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

In the embodiments of the present application, the words "first", "second", etc. are used to distinguish between the same item or similar items that have substantially the same function and effect. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ. It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion. In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more.

In the embodiments of the present application, "when" may refer to the instant at which a certain situation occurs, or to a period of time after the situation occurs, which is not particularly limited. In addition, the model training method provided by the embodiments of the present application is only an example, and the model training method may include more or fewer contents.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.

In order to facilitate the clear description of the technical solutions of the embodiments of the present application, the following simply describes some terms and techniques involved in the embodiments of the present application:

Fine-tuning (Fine-tuning) is a technique widely used in the field of deep learning, particularly in the fields of natural language processing, computer vision, and the like. The basic concept is to adjust model parameters by continuing training on a small-scale task-specific dataset on the basis of a model that has been pre-trained on a large-scale dataset to better adapt it to the new task or dataset.

Function calling (Function Calling) is a concept widely used in programming and in artificial intelligence (Artificial Intelligence, AI) interaction. It refers to triggering certain preset functions or methods by sending specific instructions or commands, in order to perform specific tasks.

Reasoning and acting (Reasoning and Acting, ReAct) is a prompting framework aimed at enhancing the decision-making capability of a large language model (Large Language Model, LLM). The ReAct mode allows the large language model to generate reasoning traces and text actions alternately. The reasoning part helps the model to derive and update action plans and to handle abnormal situations, while the action part enables the model to interact with external sources of information (knowledge and environment) to gather the desired information. In the ReAct framework, the model updates its context window based on new observations and reasons again over the existing information. This mechanism enables the model to adjust its reasoning and action strategies dynamically as the external environment changes.

For a clear understanding of the technical solutions of the present application, the prior art solutions will be described in detail first.

Therefore, to address the technical problems of slow response and high false alarm rate caused by the low degree of automation of existing operation and maintenance teams, the inventors found through research that the problems can be solved as follows: ① a core agent acquires a fault detection data set of a detected object; ② the core agent acquires configuration information of the detected object; ③ the core agent judges whether the detected object is faulty and obtains a fault analysis report when it is; and ④ the core agent reports the fault analysis report.

Based on the creative findings, the technical scheme of the application is provided.

The following describes an application scenario of the fault detection method provided by the embodiment of the present invention.

Embodiments of the present application will now be described with reference to the accompanying drawings.

The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of a fault detection method according to an embodiment of the present application, where the fault detection method is applied to a core agent of an agent platform, and the fault detection method according to the embodiment includes the following steps:

S101, acquiring a fault detection data set of the detected object when the first detection data of the detected object indicates that the detected object is faulty.

In this embodiment, the first detection data refers to detection data obtained in any one time frame, and the fault detection data set includes a plurality of second detection data, where each second detection data refers to detection data obtained after a fault of the detected object.

S102, acquiring configuration information of the detected object.

In this embodiment, a flexible configuration management function is provided, allowing the user to adjust the monitoring range and the alarm policy according to the actual requirements. Meanwhile, the number and types of the intelligent agents are dynamically expanded and configured, and data transmission and instruction interaction are carried out among the intelligent agents through an efficient communication protocol, so that the overall performance and stability of the system are ensured.

S103, judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judging result indicates that the detected object is faulty.

In this embodiment, when the detected object fails, a failure analysis report is output, which includes a failure generation source, failure basic information, a failure diagnosis process, a failure diagnosis conclusion, a failure solution, and the like.

S104, reporting a fault analysis report to the operation and maintenance platform.

In this embodiment, the consolidated fault analysis report is reported to the operation and maintenance platform, so that the operation and maintenance platform handles the fault according to the fault analysis report. The report may take various data forms, such as character strings, videos, pictures and text. It may be reported directly or forwarded through a push server.
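
For illustration only, steps S101 to S104 can be sketched as a single orchestration function of the core agent. This is a minimal sketch under stated assumptions: the names `run_fault_detection`, `FaultReport` and every callable parameter are hypothetical and do not appear in the application, and the judging logic is supplied by the caller rather than fixed here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class FaultReport:
    """Hypothetical, reduced fault analysis report (source and conclusion only)."""
    source: str
    conclusion: str


def run_fault_detection(
    first_data: dict,
    indicates_fault: Callable[[dict], bool],
    collect_fault_dataset: Callable[[], Sequence[dict]],
    get_config: Callable[[], dict],
    judge: Callable[[dict, Sequence[dict], dict], Optional[FaultReport]],
    report: Callable[[FaultReport], None],
) -> Optional[FaultReport]:
    # S101: collect the follow-up data set only if the first data indicates a fault.
    if not indicates_fault(first_data):
        return None
    dataset = collect_fault_dataset()
    # S102: fetch configuration information of the detected object.
    config = get_config()
    # S103: judge again using all three inputs; None means "no fault confirmed".
    result = judge(first_data, dataset, config)
    # S104: report to the operation and maintenance platform only on a confirmed fault.
    if result is not None:
        report(result)
    return result
```

Passing each step in as a callable keeps the sketch independent of how the monitoring agent, the configuration source and the reporting channel are actually implemented.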

The application provides a fault detection method, system, apparatus, device, medium and program product. The fault detection method comprises: acquiring a fault detection data set of a detected object when first detection data of the detected object indicates that the detected object is faulty; acquiring configuration information of the detected object; judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judgment result indicates that the detected object is faulty; and reporting the fault analysis report to an operation and maintenance platform. The method achieves the following technical effects. The consolidated fault analysis report is reported to the operation and maintenance platform, which handles the fault directly according to the report. Consolidating the report in this way greatly improves the degree of automation, which shortens the fault response time and reduces the probability of human error. The system supports dynamic expansion and configuration, so the monitoring range and the alarm policy can be adjusted flexibly according to actual requirements. In addition, the agents exchange data and instructions through efficient communication protocols, which ensures the overall performance and stability of the system.

Fig. 2 is a second schematic flow chart of a fault detection method according to an embodiment of the present application. The present embodiment further explains the fault detection method on the basis of the embodiment provided in fig. 1. In one possible design, the agent platform further comprises a monitoring agent and a configuration management database agent, the monitoring agent comprises a plurality of sub-agents, the fault cause analysis comprises fault prompt words, and further the monitoring agent comprises a link agent and a log agent. Then as shown in fig. 2, the fault detection method includes:

S201, obtaining the fault cause analysis pushed by the monitoring agent to determine that the detected object is faulty.

In this embodiment, the agent platform further includes a monitoring agent, where the failure cause analysis is obtained according to first detection data, where the first detection data is detection data obtained in any one time frame.

The fault cause analysis is data indicating that the detected object is faulty, i.e., alarm information. The alarm information may be collected by the user, or collected by other equipment and then transmitted, and it is referenced to a time frame. Whether the detected object is faulty may be judged by real-time calculation with a given algorithm, determined from historical data, or determined by a technician.

S202, extracting fault prompt words from fault cause analysis.

In this embodiment, the fault cause analysis includes a fault prompt word. The alarm information is analyzed using natural language processing (Natural Language Processing, NLP) technology, in combination with a pre-stored knowledge base and strategy library, to obtain the fault prompt word.
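
As a rough illustration of this extraction step, the sketch below matches alarm text against a small knowledge base. It is a deliberately simplified stand-in: plain phrase matching replaces the NLP model, and the `KNOWLEDGE_BASE` entries, the tag format and the function name `extract_prompt_words` are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical knowledge base: alarm phrase -> fault prompt word.
KNOWLEDGE_BASE: Dict[str, str] = {
    "timeout": "link_timeout",
    "packet loss": "link_packet_loss",
    "exception": "log_exception",
    "out of memory": "log_oom",
}


def extract_prompt_words(alarm_text: str) -> List[str]:
    """Return the prompt words whose phrases occur in the alarm text."""
    # Normalize case and collapse whitespace so multi-word phrases still match.
    text = re.sub(r"\s+", " ", alarm_text.lower())
    return [tag for phrase, tag in KNOWLEDGE_BASE.items() if phrase in text]
```

A call such as `extract_prompt_words("Gateway TIMEOUT; 12% packet  loss on eth0")` yields the two link-related prompt words, in knowledge-base order.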

S203, generating a function instruction through a preset natural language processing model and a function instruction library according to the fault prompt word.

In this embodiment, S201, S202 and S203 belong to the decision stage. In the decision stage, the agent platform needs to judge which sub-agent to call based on the fault prompt words. The Fine-tuning technique is used to optimize the comprehension capability of the agent platform. By fine-tuning on a large-scale operation and maintenance data set, the agent platform can better understand the fault prompt words related to operation and maintenance and accurately select the corresponding sub-agents. Specifically, the operation and maintenance prompt words are analyzed by an NLP model that has undergone Fine-tuning. The model is trained on a large amount of text data related to operation and maintenance, so that it can identify key operation and maintenance events and fault prompt words.
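
The mapping from fault prompt words to function instructions can be pictured as a lookup in a function instruction library. The sketch below is an assumption-laden simplification: a prefix lookup stands in for the fine-tuned NLP model, and the prompt-word format (`link_...`, `log_...`), the `FUNCTION_LIBRARY` contents and the instruction fields are hypothetical.

```python
from typing import Dict, List

# Hypothetical function instruction library keyed by prompt-word prefix.
FUNCTION_LIBRARY: Dict[str, Dict[str, str]] = {
    "link": {"agent": "link_agent", "action": "increase_sampling_frequency"},
    "log": {"agent": "log_agent", "action": "widen_sampling_range"},
}


def generate_function_instructions(prompt_words: List[str]) -> List[Dict[str, str]]:
    """Map prompt words to function instructions, one per target sub-agent."""
    instructions: List[Dict[str, str]] = []
    for word in prompt_words:
        prefix = word.split("_", 1)[0]
        entry = FUNCTION_LIBRARY.get(prefix)
        # Skip unknown prompt words and avoid calling the same sub-agent twice.
        if entry is not None and entry not in instructions:
            instructions.append(entry)
    return instructions
```

Deduplicating by instruction means several link-related prompt words still result in a single call to the link agent.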

S204 is performed when the function instruction indicates calling the link agent, in which case the step following S204 is S206; S205 is performed when the function instruction indicates calling the log agent. The link agent and the log agent may each be called independently, or both may be called simultaneously.

S204, the sampling frequency of the link agent is increased until the number of the plurality of second detection data reaches the first preset number, and/or the adjacent plurality of second detection data is kept stable.

In this embodiment, the monitoring agent includes a plurality of sub-agents, including a link agent and a log agent. The link agent is focused on monitoring the network link, and helps operation and maintenance personnel to know the network health condition by collecting the state information and performance data of the network equipment in real time, so as to discover and solve network faults in time.

Specifically, when the function instruction indicates calling the link agent, multiple rounds of dynamic observation are performed on the action results and observation data of the link agent at the link agent's sampling frequency. Based on the observation results, a preset maximum number of attempts and whether the data has reached a steady state, it is judged whether to exit the loop and end the current processing flow.
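
The stopping rule of S204 can be sketched as a bounded sampling loop. This is a minimal sketch, not the application's implementation: the function name, the `sample` callable (which receives the shortened interval and is expected to do any waiting itself), and all default thresholds are hypothetical.

```python
from typing import Callable, List


def collect_link_samples(
    sample: Callable[[float], float],
    base_interval_s: float = 10.0,
    speedup: float = 2.0,
    target_count: int = 20,
    stable_window: int = 3,
    tolerance: float = 0.01,
    max_attempts: int = 100,
) -> List[float]:
    """Sample at a raised frequency until enough data is collected or it is stable."""
    interval = base_interval_s / speedup  # shorter interval = higher sampling frequency
    samples: List[float] = []
    for _ in range(max_attempts):  # preset maximum number of attempts
        samples.append(sample(interval))
        # Stop condition 1: the first preset number of samples is reached.
        if len(samples) >= target_count:
            break
        # Stop condition 2: the most recent adjacent samples stay within tolerance.
        recent = samples[-stable_window:]
        if len(recent) == stable_window and max(recent) - min(recent) <= tolerance:
            break
    return samples
```

Treating "stable" as a bounded spread over the last few samples is one possible reading; other stability criteria, such as a bounded slope, would fit the same loop.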

S205, increasing the sampling range and/or the sampling precision of the log agent until the number of the plurality of second detection data reaches a second preset number.

In this embodiment, the log agent focuses on collecting and processing log files, and supports fault detection and performance optimization by extracting and analyzing key information in the logs. When the called sub-agent is the log agent, multiple rounds of dynamic analysis are performed on the action results and observation data of the log agent based on its sampling range and/or sampling precision. Based on the analysis results and the preset sampling range and/or sampling precision, it is judged whether to exit the loop and end the current processing flow. The link agent and the log agent may be invoked on the same device at the same time. S206 is executed after S205.
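
One way to picture the widening of the log agent's sampling range and precision is shown below. This is a sketch under assumptions: "range" is interpreted as the time window after the fault and "precision" as the finest log level admitted, and the function name, the `LEVELS` ordering and the default values are hypothetical.

```python
from typing import List, Tuple

LogEntry = Tuple[float, str, str]  # (timestamp in seconds, level, message)
LEVELS = ["ERROR", "WARN", "INFO", "DEBUG"]  # coarse to fine


def collect_log_entries(
    logs: List[LogEntry],
    fault_time: float,
    target_count: int = 5,
    window_s: float = 60.0,
    max_rounds: int = 6,
) -> List[LogEntry]:
    """Widen the time window and log-level depth until enough entries are found."""
    depth = 1
    selected: List[LogEntry] = []
    for _ in range(max_rounds):
        allowed = set(LEVELS[:depth])
        # Only entries recorded after the fault and inside the current window count.
        selected = [
            e for e in logs
            if 0 <= e[0] - fault_time <= window_s and e[1] in allowed
        ]
        if len(selected) >= target_count:  # second preset number reached
            break
        window_s *= 2                          # increase the sampling range
        depth = min(depth + 1, len(LEVELS))    # increase the sampling precision
    return selected
```

If the target is never reached, the function returns whatever the widest round selected, which mirrors exiting the loop on a preset limit.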

S206, collecting a fault detection data set from the detected object according to the adjusted monitoring agent.

In this embodiment, the fault detection data set includes a plurality of second detection data, and each of the second detection data refers to detection data obtained after the detected object becomes faulty. The fault detection data set may be collected only after the detected object becomes abnormal, or it may be collected continuously and retrieved only after the abnormality occurs.

S207, according to fault cause analysis pushed by the monitoring agent, the configuration management database agent invokes the configuration information of the detected object from the configuration information set of the preset configuration management database.

In this embodiment, the agent platform further includes a configuration management database agent. The configuration management database agent is tightly integrated with the configuration management database and is responsible for synchronizing and managing the configuration information of system resources, and necessary configuration data and dependency relation support is provided for other agents.
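
A configuration lookup that also returns dependency relations might look like the following. This is an illustrative sketch only: the in-memory `CMDB` dictionary, its field names (`type`, `ip`, `depends_on`) and the function `fetch_configuration` are hypothetical and stand in for a real configuration management database.

```python
from typing import Dict, List

# Hypothetical configuration information set, keyed by object identifier.
CMDB: Dict[str, dict] = {
    "gateway-01": {"type": "gateway", "ip": "10.0.0.1", "depends_on": ["switch-01"]},
    "switch-01": {"type": "switch", "ip": "10.0.0.2", "depends_on": []},
}


def fetch_configuration(cmdb: Dict[str, dict], object_id: str) -> Dict[str, dict]:
    """Return the object's configuration plus that of its transitive dependencies."""
    result: Dict[str, dict] = {}
    pending: List[str] = [object_id]
    while pending:
        current = pending.pop()
        # Skip unknown objects and anything already visited (guards against cycles).
        if current in result or current not in cmdb:
            continue
        result[current] = cmdb[current]
        pending.extend(cmdb[current].get("depends_on", []))
    return result
```

Returning dependencies alongside the object itself gives the later judging step the context it needs to tell a local fault from an upstream one.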

S204, S205, S206 and S207 belong to the action stage. In the action stage, the Function Calling technique is the core. After the decision, the agent platform needs to call the corresponding sub-agent to perform a specific automated operation, which is completed by calling a preset function or application programming interface (Application Programming Interface, API). The Function Calling technique allows the agent platform to interact with external systems or components in a structured and controllable manner, enabling automated operation and maintenance tasks. The Fine-tuning technique may also be used to optimize the behavior model of the sub-agents. For example, if a sub-agent is responsible for restarting services, its behavior in a particular environment or configuration may be fine-tuned to increase the success rate and efficiency of restarts. To improve execution efficiency, sub-agent invocations may be performed asynchronously, with execution results monitored by a polling or event notification mechanism.
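
The asynchronous invocation of sub-agents mentioned above can be sketched with Python's standard `asyncio` module. The two coroutine functions, their return values and the `REGISTRY` mapping are hypothetical placeholders; the short sleeps merely simulate remote calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def call_link_agent() -> dict:
    await asyncio.sleep(0.01)  # simulated call to the link agent
    return {"agent": "link", "samples": 3}


async def call_log_agent() -> dict:
    await asyncio.sleep(0.01)  # simulated call to the log agent
    return {"agent": "log", "entries": 5}


# Hypothetical registry: instruction name -> preset function to call.
REGISTRY: Dict[str, Callable[[], Awaitable[dict]]] = {
    "link_agent": call_link_agent,
    "log_agent": call_log_agent,
}


async def dispatch(instructions: List[str]) -> List[dict]:
    """Invoke all requested sub-agents concurrently and gather their results."""
    calls = [REGISTRY[name]() for name in instructions if name in REGISTRY]
    # gather() preserves the order of the awaitables it was given.
    return await asyncio.gather(*calls)
```

A synchronous caller can run this with `asyncio.run(dispatch(["link_agent", "log_agent"]))`; here awaiting the gathered results plays the role of the event notification mechanism.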

And S208, judging whether the detected object is faulty according to the first detection data, the fault detection data set and the configuration information, and obtaining a fault analysis report of the detected object when the judging result indicates that the detected object is faulty.

In this embodiment, the agent platform collects action results through multiple channels, such as log records, API return status codes and system monitoring metrics, uses the One-Class SVM algorithm to perform anomaly detection on the collected data, identifies action results that are inconsistent with expectations, and feeds the detection results back to the agent platform in real time so that adjustments can be made in subsequent actions.
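The idea of flagging action results that deviate from expectations can be illustrated with a small standard-library sketch. Note that this is not One-Class SVM: it substitutes a much simpler median-absolute-deviation (modified z-score) test purely to show where such a check sits in the feedback loop. The restart durations are made-up sample values.

```python
from statistics import median
from typing import List

def mad_outliers(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical; anything different is an outlier.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical service-restart durations (seconds) gathered from action logs.
durations = [2.1, 2.0, 2.3, 1.9, 2.2, 9.8, 2.0]
suspicious = mad_outliers(durations)  # indices of results to feed back
```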

In this embodiment, the action effect of S208 is similar to that of S103 in the previous embodiment of the present invention, and will not be described here again.

S209, reporting a fault analysis report to the operation and maintenance platform.

In this embodiment, the action effect of S209 is similar to the action effect of S104 in the previous embodiment of the present invention, and will not be described here again.

The fault detection method provided by this embodiment of the application has the following technical effects. Multi-round dynamic analysis is carried out by adjusting the sampling mode of the sub-agents, such as sampling frequency, sampling range or sampling precision, which improves the accuracy of fault judgment. Determining which sub-agent to call according to the fault prompt words makes fault handling more targeted. The autonomous perception capability of the agent platform is used to achieve full-coverage, efficient monitoring of system resources: each agent is responsible for monitoring tasks in a specific field, and through cooperation the agents ensure that all possible fault points and abnormal conditions are monitored without omission. Unified alarm information templates and standards simplify the processing and analysis of alarm information, and an intelligent analysis algorithm is used to screen, classify and correlate the alarm information, improving its accuracy and readability.

Fig. 3 is a flowchart illustrating a fault detection method according to an embodiment of the present application. The present embodiment further explains the fault detection method on the basis of the embodiments provided in fig. 1 and 2. As shown in fig. 3, S208 includes:

S301, inputting the first detection data, the fault detection data set and the configuration information into a preset fault tree model to obtain a first judgment result output by the fault tree model.

In this embodiment, when a fault of a detected object is found, it is first determined whether a fault analysis report applicable to the present fault processing exists in the fault tree model.

The content indicated by the first judgment result is then evaluated. When the first judgment result indicates that the detected object is faulty, S302 is executed; when the first judgment result does not indicate that the detected object is faulty, S303 is executed.

S302, extracting a fault analysis report from the fault tree model.

In this embodiment, common failure modes and solutions are discovered through an association rule mining algorithm. The root cause of the fault is deduced by combining the results of the multiple rounds of analysis with a professional knowledge base, and the analysis results, processing procedures, action results and other information are integrated into a detailed fault analysis report. Specifically, S209 is executed after S302 is executed.

When a fault analysis report suitable for the fault processing exists in the fault tree model, the processing efficiency can be greatly improved by directly calling the fault analysis report, namely, obtaining the fault analysis report according to historical data.

S303, reporting the first detection data, the fault detection data set and the configuration information to the fault detection platform, so that the fault detection platform outputs a fault analysis report.

In this embodiment, when there is no fault analysis report applicable to the present fault processing in the fault tree model, the fault analysis report is obtained through the fault detection platform. The fault detection platform can be a third party device such as an expert system, an automatic analysis platform and the like. Specifically, S304 is executed after S303 is executed.

S304, updating the fault tree model according to the fault analysis report.

In this embodiment, the fault analysis report is integrated into the fault tree model by an algorithm.

The advantages of this embodiment are as follows: when a fault analysis report applicable to the present fault exists in the fault tree model, calling that report directly, that is, obtaining the fault analysis report from historical data, can greatly improve processing efficiency; and a newly obtained fault analysis report is integrated into the fault tree model by an algorithm so that it can be reused and improved in future operation and maintenance tasks. Specifically, S209 is executed after S304 is executed.
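Steps S301 to S304 amount to a look-up with a fallback and a write-back. The sketch below is a deliberately simplified illustration: it reduces the fault tree model to a dictionary keyed by a hypothetical fault signature, whereas a real fault tree would evaluate the detection data and configuration information against a tree of events. The `Signature` type, the class name and the `fake_platform` stub are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Signature = Tuple[str, str]  # hypothetical key: (object type, symptom)

@dataclass
class FaultTreeModel:
    reports: Dict[Signature, str] = field(default_factory=dict)

    def judge(self, signature: Signature) -> Optional[str]:
        # S301/S302: return the stored report if this fault is already known.
        return self.reports.get(signature)

    def update(self, signature: Signature, report: str) -> None:
        # S304: store the new report so the next occurrence is a direct hit.
        self.reports[signature] = report

def obtain_report(model: FaultTreeModel, signature: Signature,
                  external_platform: Callable[[Signature], str]) -> str:
    report = model.judge(signature)
    if report is not None:
        return report                      # S302: reuse historical report
    report = external_platform(signature)  # S303: ask the detection platform
    model.update(signature, report)        # S304: update the model
    return report

# Usage with a stub standing in for the third-party fault detection platform.
platform_calls: List[Signature] = []

def fake_platform(sig: Signature) -> str:
    platform_calls.append(sig)
    return f"report for {sig[0]}: {sig[1]}"

model = FaultTreeModel()
sig = ("firewall", "frequent switch")
first = obtain_report(model, sig, fake_platform)   # goes through S303 and S304
second = obtain_report(model, sig, fake_platform)  # served from the model
```

In this sketch the second request never reaches the external platform, which is the efficiency gain the embodiment attributes to reusing historical reports.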

In one possible design, where the agent platform further includes an alarm agent, S209 includes:

Pushing the fault analysis report to the alarm agent so that the alarm agent can generate an alarm prompt according to the fault analysis report and push the alarm prompt to the operation and maintenance platform.

In this embodiment, the alarm agent serves as a core processing unit of the system. It is responsible for receiving alarm information from each agent, performing intelligent analysis and processing, and sending alarm notifications to operation and maintenance personnel through multiple channels.

The following is a specific embodiment of invoking a link agent, where the alarm type is a high temperature alarm:

The link agent is deployed in the third row of server racks in machine room A, and detects that the temperature inside those racks has exceeded the preset threshold (normal: below 25 ℃; current: 30 ℃) for 10 minutes.

1) Generation and delivery of alert information

Data information: the temperature value detected by the link agent is 30 ℃, which exceeds the preset threshold of 25 ℃.

Timestamp: 2024-05-10 09:45:00

Transmission mode: the alarm information is automatically pushed to the integrated large model system through the API.

2) Large model analysis and suggestion

Analysis process: the large model parses the alarm information using NLP technology and analyzes it in combination with the rules in the knowledge base and the strategy base.

Generated prompt words: "check the temperature distribution in the cabinet", "analyze the server load and heat dissipation condition", "call the configuration management database agent to obtain the configuration information of the cabinet and the servers".

Recommended actions: explicitly instruct that the link agent be called to perform the temperature distribution check, and that the configuration management database agent be called afterwards.

3) Designating sub-agent actions

Link agent action: increase the temperature sampling frequency from the usual once every 5 minutes to once every 1 minute.

The link agent collects timestamps and the corresponding temperature data for the top, middle, bottom and other areas of the cabinet, that is, the temperature is densely sampled, as shown in Table 1.

The obtained temperature data are then organized, and a temperature distribution heat map of each area of the cabinet is drawn.

The heat map clearly shows that the top area is a hot spot, so it can be concluded from the temperature data that the temperature in the top area of the cabinet is too high.

TABLE 1

Timestamp	Cabinet area	Temperature (℃)
2024/5/10 9:45	Top	35
2024/5/10 9:45	Middle	26
2024/5/10 9:45	Bottom	24
...	...	...
2024/5/10 10:00	Top	34
2024/5/10 10:00	Middle	26.5
2024/5/10 10:00	Bottom	24.2
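As a rough illustration of how the densely sampled data can be reduced to a per-area view before a heat map is drawn, the following sketch groups the six rows listed in Table 1 by cabinet area (the elided rows are simply left out) and compares each area's peak with the 25 ℃ threshold from the example. The variable names are illustrative only.

```python
from collections import defaultdict

# Rows copied from Table 1; the elided middle rows are omitted.
rows = [
    ("2024/5/10 9:45", "Top", 35.0),
    ("2024/5/10 9:45", "Middle", 26.0),
    ("2024/5/10 9:45", "Bottom", 24.0),
    ("2024/5/10 10:00", "Top", 34.0),
    ("2024/5/10 10:00", "Middle", 26.5),
    ("2024/5/10 10:00", "Bottom", 24.2),
]
THRESHOLD_C = 25.0  # preset threshold stated in the example

# Group the readings by cabinet area.
by_area = defaultdict(list)
for _, area, temp in rows:
    by_area[area].append(temp)

peak = {area: max(temps) for area, temps in by_area.items()}
hottest = max(peak, key=peak.get)                       # area with highest peak
over = sorted(a for a, t in peak.items() if t > THRESHOLD_C)
```

On these rows the top area has the highest peak (35 ℃), which matches the conclusion drawn from the heat map; the middle area also exceeds the threshold slightly, while the bottom area stays below it.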

Configuration management database agent action: acquire key configuration information. Table 2 lists the key configuration information obtained from the configuration management database. Combined with the temperature data in Table 1, this key configuration information lays the groundwork for subsequent analysis.

TABLE 2

4) Observing the result of the action

Link agent observation:

Data points: the maximum temperature in the top area reached 35 ℃, while the average temperature in the other areas remained below 27 ℃.

The heat map visually shows a distinct high-temperature region in the top area.

Configuration management database agent observation:

Configuration details: the servers in the top area have higher-end configurations and high power consumption, and the cooling-fan speed of some servers is below the standard value.

5) Dynamic analysis and problem solving

Analysis process: by combining the temperature data with the server configuration information, the system determines that the heat dissipation capacity of the top servers is insufficient because of high-density deployment, high power consumption and insufficient cooling-fan speed.

Problem location: the heat dissipation problem of the servers at the top of the cabinet is confirmed as the source of the overall excessive temperature.

6) Integrating diagnostic conclusions and solutions

Overall diagnostic conclusion: the temperature in the third row of server cabinets in machine room A is too high (up to 35 ℃). The main cause is insufficient heat dissipation of the servers at the top of the cabinet, specifically the low cooling-fan speed of some servers and the high deployment density.

The solution suggestions are as follows:

Immediate measures:

Manually raise the cooling-fan speed of the top servers to the standard value or higher (e.g., 2500 RPM).

Improvement measures:

Optimize the layout of servers in the cabinet and reduce the server density in the top area; consider adding heat dissipation devices, such as extra cooling fans or heat sinks, at the top of the cabinet; periodically inspect and maintain the heat dissipation devices to ensure that they operate normally; and adjust the airflow direction and speed of the machine-room air conditioner so that the top of the cabinet is cooled more fully.

7) Updating fault tree models

The diagnosis process, the solution and the implementation effect for this high-temperature alarm are recorded into the fault tree model, so that similar problems can be quickly referenced and handled in the future.

The following is a specific embodiment of calling the log agent, where the alarm type is a log abnormality alarm:

The log agent finds that the keyword "switch" occurs frequently in the firewall log, which typically indicates that the active and standby firewalls are switching abnormally, or that there is a potential configuration or hardware problem.

1) Alarm information receiving and preliminary analysis

After the log agent generates the alarm information, the system automatically transmits the alarm information to an integrated large model or monitoring center.

The intelligent agent platform is used as a front end sensing unit of the system and is responsible for collecting system resource state information in real time and carrying out preliminary anomaly detection. Once an abnormal condition is found, an alarm is triggered immediately and related information is sent to the monitoring center.

The monitoring center is used as a core part of the system and is responsible for receiving monitoring data and alarm information sent by each intelligent agent. The monitoring center also has intelligent analysis capability, can screen, classify and correlate and analyze the alarm information, and provides accurate alarm view and fault positioning suggestion for operation and maintenance personnel.

Based on keywords in the alarm information such as "log abnormality", "switch keyword" and "egress firewall", the large model or the monitoring center combines a pre-trained knowledge base and strategy base to generate prompt words such as "verify the active/standby state of the firewall", "check the detailed information in the firewall log file", "call the configuration management database agent to obtain the firewall configuration information" and "analyze whether the network traffic is abnormal".

2) Designating corresponding sub-agent to realize specific operation

Log agent action: continue to monitor the firewall log and collect it in more detail, particularly log fragments containing the "switch" keyword, for further analysis.

Configuration management database agent action: retrieve the configuration information of the egress firewall from the configuration management database, including the active/standby switching strategy, equipment model, latest maintenance record, etc.

Network monitoring agent action: analyze the current network traffic, particularly the traffic related to the egress firewall, and check whether the firewall is overloaded due to abnormal traffic.

3) Observing the result of the action

Log agent observation: log analysis confirms that the firewall performed multiple active/standby switchovers within a short time, and a "switch" keyword is recorded before and after each switchover.

Configuration management database agent observation: the information obtained from the configuration management database shows that the firewall configuration has no obvious errors, but the hardware of the main firewall may be near the end of its service life, and no planned maintenance or upgrade has been performed recently.

Network monitoring agent observation: the network traffic is within the normal range, and no abnormal traffic surge or attack behavior is found.
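The log agent's check for a keyword that recurs within a short time can be sketched as a sliding-window count over timestamps. Everything specific here is an assumption made for illustration: the log line format (a `YYYY-MM-DD HH:MM:SS` prefix), the sample lines, the five-minute window and the hit count of three are not taken from the application.

```python
from datetime import datetime, timedelta
from typing import List

def frequent_keyword(lines: List[str], keyword: str,
                     window: timedelta, min_hits: int) -> bool:
    """True if `keyword` appears at least `min_hits` times within any `window`."""
    stamps = []
    for line in lines:
        if keyword in line:
            # Assumed line format: "YYYY-MM-DD HH:MM:SS message"
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    stamps.sort()
    start = 0
    for end in range(len(stamps)):
        # Shrink the window from the left until it spans at most `window`.
        while stamps[end] - stamps[start] > window:
            start += 1
        if end - start + 1 >= min_hits:
            return True
    return False

# Hypothetical firewall log fragment.
sample = [
    "2024-05-10 09:00:01 HA state switch to standby",
    "2024-05-10 09:00:40 session table sync ok",
    "2024-05-10 09:01:15 HA state switch to active",
    "2024-05-10 09:02:03 HA state switch to standby",
]
alarm = frequent_keyword(sample, "switch", timedelta(minutes=5), 3)
```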

4) Multi-round dynamic analysis and problem localization

By combining the observations of the log agent, the configuration management database agent and the network monitoring agent, the system determines through multiple rounds of dynamic analysis that the root of the problem is likely instability caused by aging of the main firewall hardware, which triggers the abnormal active/standby switching. The system judges that the specific problem has been located, so it can exit the loop and enter the next stage, problem solving.

5) Integrating diagnostic process output conclusions

The system integrates the whole fault diagnosis process and outputs the following overall diagnosis conclusion:

Problem description: the egress firewall undergoes frequent abnormal active/standby switching, and the main cause is instability due to aging of the main firewall hardware.

The solution suggestions are as follows:

Immediate measures: temporarily migrate the services to the standby firewall to ensure the continuity of network service.

Short-term measures: arrange an emergency inspection of the main firewall and replace hardware as necessary.

Long-term measures: consider regularly upgrading and replacing firewall hardware, and draw up a maintenance plan to prevent similar problems from recurring.

Follow-up actions: submit the diagnostic conclusions and solution suggestions to the operation and maintenance team, who carry out the specific resolution steps, and monitor the firewall status after resolution.

6) Fault tree model updating

The handling process, lessons learned and solution for this abnormal switching, including the key points of log analysis, firewall maintenance practices and the like, are recorded into the fault tree model, so that similar situations can be quickly referenced and handled in the future.

The embodiment of the application also provides a fault detection system, which comprises at least one detected device, an operation and maintenance platform and an intelligent agent platform in communication connection with the operation and maintenance platform;

the intelligent agent platform is used for realizing the fault detection method of the embodiment of the invention.

Each intelligent agent in the intelligent agent platform is responsible for monitoring tasks in specific fields, including data acquisition, anomaly detection, alarm triggering and other functions, and the intelligent agents perform data transmission and instruction interaction through a preset communication protocol.

In one possible design, the agent platform includes a core agent, and a monitoring agent, a configuration management database agent, and an alarm agent in communication with the core agent;

The monitoring agent includes a link agent and a log agent.

The fault detection system provided in this embodiment has similar implementation principles and technical effects to those of the fault detection method in the foregoing embodiment, and this embodiment is not described herein again.

Fig. 4 is a schematic structural diagram of a fault detection device according to an embodiment of the present application. As shown in fig. 4, in this embodiment, the fault detection device may be located in the electronic apparatus. The fault detection device includes:

A failure detection data set obtaining module 401, configured to obtain a failure detection data set of a detected object when first detection data of the detected object indicates that the detected object is failed, where the first detection data is detection data obtained in any one time frame, and the failure detection data set includes a plurality of second detection data, and each second detection data is detection data obtained after the detected object is failed;

A configuration information acquisition module 402, configured to acquire configuration information of a detected object;

A fault analysis report obtaining module 403, configured to determine whether the detected object is faulty according to the first detection data, the fault detection data set, and the configuration information, and obtain a fault analysis report of the detected object when the determination result indicates that the detected object is faulty;

And the fault analysis report reporting module 404 is configured to report a fault analysis report to the operation and maintenance platform.
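The cooperation of modules 401 to 404 can be pictured as a small pipeline. The sketch below is only one possible arrangement under stated assumptions: each module is modeled as a plain callable, the class name `FaultDetectionDevice`, the dictionary fields (`fault`, `object_id`, `temp_c`, `fan_rpm`) and the stub logic are hypothetical, and the sample values are placeholders rather than data from the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class FaultDetectionDevice:
    get_dataset: Callable[[dict], List[dict]]                    # module 401
    get_config: Callable[[str], dict]                            # module 402
    analyze: Callable[[dict, List[dict], dict], Optional[str]]   # module 403
    report: Callable[[str], None]                                # module 404

    def handle(self, first_data: dict) -> Optional[str]:
        # Only proceed when the first detection data indicates a fault.
        if not first_data.get("fault"):
            return None
        dataset = self.get_dataset(first_data)
        config = self.get_config(first_data["object_id"])
        analysis = self.analyze(first_data, dataset, config)
        if analysis is not None:
            self.report(analysis)  # push to the operation and maintenance platform
        return analysis

# Usage with stub modules.
sent: List[str] = []
device = FaultDetectionDevice(
    get_dataset=lambda d: [{"temp_c": 35.0}, {"temp_c": 34.0}],
    get_config=lambda oid: {"fan_rpm": 1800},
    analyze=lambda d, ds, cfg: (
        "top-of-rack overheating"
        if max(x["temp_c"] for x in ds) > 25 else None
    ),
    report=sent.append,
)
result = device.handle({"object_id": "rack-3", "fault": True, "temp_c": 30.0})
```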

The fault detection device provided in this embodiment may execute the technical scheme of the fault detection method embodiment shown in fig. 1, and its implementation principle and technical effects are similar to those of the fault detection method embodiment shown in fig. 1, and are not described in detail herein.

The fault detection device is further refined below on the basis of the fault detection device provided by the foregoing embodiment.

Optionally, in this embodiment, the agent platform further includes a monitoring agent, and when obtaining the fault detection data set of the detected object in the case where the first detection data indicates that the detected object is faulty, the fault detection data set obtaining module 401 is specifically configured to:

adjust the monitoring agent according to the fault cause analysis, so as to collect the fault detection data set from the detected object through the adjusted monitoring agent.

Optionally, the fault cause analysis includes fault prompt words and the monitoring agent includes a plurality of sub-agents, and when adjusting the monitoring agent according to the fault cause analysis, the fault detection data set obtaining module 401 is specifically configured to perform the following:

Extracting fault prompt words from fault cause analysis;

and calling at least one target agent from the plurality of sub-agents according to the function instruction, and adjusting the sampling mode of the at least one target agent.

Optionally, the fault detection data set obtaining module 401 invokes at least one target agent from the plurality of sub-agents according to the function instruction, and adjusts a sampling mode of the at least one target agent, where the plurality of sub-agents includes a link agent and a log agent;

when the function instruction indicates to call the link agent, the sampling frequency of the link agent is increased until the number of the plurality of second detection data reaches the first preset number, and/or the adjacent plurality of second detection data is kept stable;

and when the function instruction indicates to call the log agent, the sampling range and/or the sampling precision of the log agent are increased until the number of the plurality of second detection data reaches a second preset number.
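The stopping rule for the link agent, namely collecting until a preset number of second detection data is reached and/or adjacent data stay stable, can be sketched as below. The parameter names, the stability criterion (spread of the last few readings within a tolerance) and the sample readings are assumptions for illustration; a real agent would also wait for the shortened sampling interval between reads, which is omitted here.

```python
from typing import Callable, List

def collect_until_stable(read: Callable[[], float], max_samples: int,
                         stable_run: int, tolerance: float) -> List[float]:
    """Collect readings until `max_samples` is reached or the last
    `stable_run` readings differ by at most `tolerance`."""
    samples: List[float] = []
    while len(samples) < max_samples:
        samples.append(read())
        tail = samples[-stable_run:]
        if len(tail) == stable_run and max(tail) - min(tail) <= tolerance:
            break  # adjacent readings have settled
    return samples

# Hypothetical temperature readings delivered by the link agent.
readings = iter([30.0, 33.0, 35.0, 34.8, 34.9, 35.0, 36.0])
data = collect_until_stable(lambda: next(readings), max_samples=10,
                            stable_run=3, tolerance=0.5)
```

With these sample readings the loop stops after five samples, because the last three (35.0, 34.8, 34.9) lie within 0.5 ℃ of each other.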

Optionally, in this embodiment, the agent platform further includes a configuration management database agent, and when the configuration information acquisition module 402 acquires the configuration information of the detected object:

the configuration management database agent retrieves the configuration information of the detected object from the configuration information set of the preset configuration management database according to the fault cause analysis pushed by the monitoring agent.

Optionally, in this embodiment, the failure analysis report obtaining module 403 is specifically configured to:

and when the first judging result indicates that the detected object has faults, extracting a fault analysis report from the fault tree model.

Optionally, when the first judgment result does not indicate the fault of the detected object, the fault analysis report obtaining module 403 reports the first detection data, the fault detection data set and the configuration information to the fault detection platform, so that the fault detection platform outputs a fault analysis report;

and updating the fault tree model according to the fault analysis report.

Optionally, in this embodiment, the failure analysis report reporting module 404 is specifically configured to:

The agent platform further includes an alarm agent, and the fault analysis report is pushed to the alarm agent, so that the alarm agent generates an alarm prompt according to the fault analysis report and pushes the alarm prompt to the operation and maintenance platform.

The fault detection device provided in this embodiment may execute the technical scheme of the above fault detection method embodiment, and its implementation principle and technical effects are similar to those of the above fault detection method embodiment, and are not described in detail herein.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device is intended to represent various devices that may be used to perform the fault detection method, such as microcomputers, single-chip microcomputers, and other suitable computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.

As shown in fig. 5, the electronic device comprises at least one processor 501 and a memory 502. The electronic device further comprises a communication unit 503. The processor 501, the memory 502, and the communication unit 503 are connected via a bus 504.

In a specific implementation, at least one processor 501 executes computer-executable instructions stored in memory 502, such that at least one processor 501 performs the fault detection method as performed on the electronic device side above.

The specific implementation process of the processor 501 may refer to the above-mentioned fault detection method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein.

In the above embodiment, it should be understood that the processor 501 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in a processor.

Memory 502 may comprise high-speed RAM, and may also include non-volatile memory (NVM), such as at least one disk memory.

Bus 504 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 504 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus 504 in the drawings of the present application is not limited to a single bus or a single type of bus.

The scheme provided by the embodiment of the invention is introduced aiming at the functions realized by the electronic equipment and the main control equipment. It will be appreciated that the electronic device or the master device, in order to implement the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. The present embodiments can be implemented in hardware or a combination of hardware and computer software in combination with the various exemplary elements and algorithm steps described in connection with the embodiments disclosed in the embodiments of the present invention. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present invention.

The present application also provides a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the above fault detection method.

The computer readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may also be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC). The processor and the readable storage medium may also reside as discrete components in an electronic device or a host device.

Memory 502 is a non-transitory computer readable storage medium provided by the present invention. The non-transitory computer readable storage medium of the present invention stores computer instructions for causing a computer to execute the fault detection method provided by the present invention.

The memory 502 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the fault detection method in the embodiment of the present invention (e.g., the fault detection data set acquisition module 401, the configuration information acquisition module 402, the fault analysis report acquisition module 403, and the fault analysis report reporting module 404 shown in fig. 4). The processor 501 executes various functional applications and data processing by running non-transitory software programs, instructions, and modules stored in the memory 502, i.e., implements the fault detection method in the method embodiments described above.

Meanwhile, the embodiment also provides a computer program product, which comprises a computer program, and the computer program is used for realizing the fault detection method of the embodiment when being executed by a processor.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present application.

It should be further noted that, although the steps in the flowchart are sequentially shown as indicated by arrows, the steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.

It will be appreciated that the device embodiments described above are merely illustrative and that the device of the application may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.

In addition, each functional unit/module in each embodiment of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.

The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. The processor may be any suitable hardware processor, such as CPU, GPU, FPGA, DSP and an ASIC, etc., unless otherwise specified. Unless otherwise indicated, the storage elements may be any suitable magnetic or magneto-optical storage medium, such as resistive Random Access Memory RRAM (Resistive Random Access Memory), dynamic Random Access Memory DRAM (Dynamic Random Access Memory), static Random Access Memory SRAM (Static Random-Access Memory), enhanced dynamic Random Access Memory EDRAM (ENHANCED DYNAMIC Random Access Memory), high-Bandwidth Memory HBM (High-Bandwidth Memory), hybrid storage cube HMC (Hybrid Memory Cube), etc.

The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The memory includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. The technical features of the foregoing embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, provided the combined features do not contradict one another, all such combinations should be considered to be within the scope of the disclosure.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the application is not limited to the precise structures that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.