CN117221087A - Alarm root cause positioning method, device and medium - Google Patents

Alarm root cause positioning method, device and medium
Info

Publication number
CN117221087A
Authority
CN
China
Prior art keywords
alarm
root cause
cause
data
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311286937.5A
Other languages
Chinese (zh)
Inventor
黄兵明
马瑞涛
乔治
赵慧英
张溶芳
刘楠
吴浩然
赵世琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202311286937.5A
Publication of CN117221087A
Legal status: Pending

Abstract

The invention provides an alarm root cause locating method, device and medium, relating to the technical field of network operation and maintenance. It addresses the insufficient accuracy of root cause location and the insufficient robustness of existing schemes, both of which stem from existing root cause location analysis making insufficient use of the available information. The method comprises: obtaining first root cause alarm information of the alarm data to be analyzed based on alarm association rules; obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, which comprises obtaining the alarm knowledge graph, in which any two connected nodes are a cause node and an effect node and each connecting edge carries a causal edge weight, matching the alarm data to be analyzed against the nodes of the alarm knowledge graph to obtain an alarm causal graph, and obtaining the second root cause alarm information from the alarm causal graph and the causal edge weights; and obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information. By adding alarm knowledge graph analysis, the invention improves the accuracy of root cause location.

Description

Translated from Chinese
Alarm root cause locating method, device and medium

Technical field

The present invention relates to the technical field of network operation and maintenance, and in particular to an alarm root cause locating method, an alarm root cause locating device and a computer-readable storage medium.

Background

With the development of network technology, network systems are becoming increasingly complex, and a network failure produces a large volume of alarm data. In some cases, association rule mining is used to perform root cause location analysis on the alarm data. However, existing analysis methods based on association rule mining make insufficient use of the available information, which can leave both the accuracy of root cause location and the robustness of the scheme inadequate.

Summary of the invention

The technical problem to be solved by the present invention, in view of the above deficiencies, is to provide an alarm root cause locating method, an alarm root cause locating device and a computer-readable storage medium, so as to solve the problem that existing root cause location analysis makes insufficient use of information, resulting in insufficient accuracy of root cause location and insufficient robustness of the scheme.

In a first aspect, the present invention provides an alarm root cause locating method, including:

obtaining first root cause alarm information of the alarm data to be analyzed based on alarm association rules;

obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, including:

obtaining the alarm knowledge graph, in which any two connected nodes are a cause node and an effect node and each connecting edge is assigned a causal edge weight,

matching the alarm data to be analyzed against the nodes of the alarm knowledge graph to obtain an alarm causal graph,

obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;

obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.

In a second aspect, the present invention provides an alarm root cause locating device, including:

a first analysis module, configured to obtain first root cause alarm information of the alarm data to be analyzed based on alarm association rules;

a second analysis module, configured to obtain second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, including:

an alarm knowledge graph unit, configured to obtain the alarm knowledge graph, in which any two connected nodes are a cause node and an effect node and each connecting edge is assigned a causal edge weight,

a second matching unit, connected to the alarm knowledge graph unit and configured to match the alarm data to be analyzed against the nodes of the alarm knowledge graph to obtain an alarm causal graph,

a second locating unit, connected to the second matching unit and configured to obtain the second root cause alarm information according to the alarm causal graph and the causal edge weights;

a third analysis module, connected to the first analysis module and the second analysis module and configured to obtain final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.

In a third aspect, the present invention provides a computer-readable storage medium in which a computer program is stored. When the computer program is run by a processor, the alarm root cause locating method described above is implemented.

The present invention provides an alarm root cause locating method, an alarm root cause locating device and a computer-readable storage medium. The first root cause alarm information of the alarm data to be analyzed, obtained based on alarm association rules, is combined with the second root cause alarm information of the alarm data to be analyzed, obtained based on the alarm knowledge graph, to obtain the final root cause alarm information. In obtaining the second root cause alarm information, the causal relationships between cause nodes and effect nodes in the alarm knowledge graph and the causal edge weights are judged together. Adding alarm knowledge graph analysis on top of association rule mining makes fuller use of the available information, thereby improving the accuracy of root cause location and the robustness of the scheme.

Description of the drawings

Figure 1 is a flow chart of an alarm root cause locating method according to an embodiment of the present invention;

Figure 2 is a flow chart of another alarm root cause locating method according to an embodiment of the present invention;

Figure 3 is a flow chart of an alarm root cause locating method based on alarm association rules according to an embodiment of the present invention;

Figure 4 is a schematic diagram of how the graph changes in an alarm root cause locating method based on an alarm knowledge graph according to an embodiment of the present invention, in which (a) is a schematic diagram of an alarm knowledge graph, (b) is a schematic diagram of an alarm causal graph, (c) is a schematic diagram of an alarm causal subgraph, and (d) is a schematic diagram of another alarm causal subgraph;

Figure 5 is a flow chart of an alarm root cause locating method based on an alarm knowledge graph according to an embodiment of the present invention;

Figure 6 is a schematic structural diagram of an alarm root cause locating device according to an embodiment of the present invention;

Figure 7 is a schematic structural diagram of a second analysis module for alarm root cause location according to an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the technical solutions of the present invention, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

It should be understood that the specific embodiments and drawings described here serve only to explain the present invention and do not limit it.

It should be understood that, where no conflict arises, the embodiments of the present invention and the features of those embodiments may be combined with one another.

It should be understood that, for ease of description, the drawings show only the parts related to the present invention; parts unrelated to the present invention are not shown.

It should be understood that each unit or module involved in the embodiments of the present invention may correspond to a single physical structure or be composed of several physical structures, and that several units or modules may also be integrated into one physical structure.

It should be understood that, where no conflict arises, the functions and steps marked in the flow charts and block diagrams of the present invention may occur in an order different from that marked in the drawings.

It should be understood that the flow charts and block diagrams of the present invention show possible architectures, functions and operations of systems, apparatuses, devices and methods according to the embodiments of the present invention. Each block in a flow chart or block diagram may represent a unit, module, program segment or piece of code containing executable instructions for implementing the specified function. Each block, or combination of blocks, in the block diagrams and flow charts may be implemented by a hardware-based system that performs the specified function, or by a combination of hardware and computer instructions.

It should be understood that the units and modules involved in the embodiments of the present invention may be implemented in software or in hardware; for example, the units and modules may be located in a processor.

To facilitate understanding of the present invention, some related technologies and concepts are introduced first.

With the development of operators' cloud-network integration, and under the new deployment model of the communication cloud based on NFV (Network Functions Virtualization), operators' cloud-network platforms are fully deployed in the cloud and service platform applications are completely decoupled from the underlying cloud resources. State awareness and monitoring of the underlying cloud resources are weak, and collaboration mechanisms between technical disciplines are not yet mature. This leads to problems such as difficult cross-layer fault location and long fault handling times on the cloud-network platform. Intelligent means are urgently needed to support cross-layer operation and maintenance collaboration and to ensure the secure, reliable and continuous operation of cloud-network services.

Against this background of cloud migration, and for lack of alarm correlation and root cause location capabilities, operation and maintenance personnel face the following pain points in practice: 1) information across the service layer and the resource layer of the cloud-network platform is not synchronized, so information is hard to collect; 2) after a failure on the cloud-network platform, the number of alarms is large, work orders are dispatched inaccurately, and the load on operation and maintenance personnel is heavy; 3) the systems involved in the cloud-network platform are increasingly complex, so locating faults manually is difficult and slow; 4) historical fault-handling experience on the cloud-network platform is hard to accumulate. The greatest challenge is therefore how to locate the root cause of alarms and faults across layers, find the root cause quickly among massive numbers of cloud-network alarms, and improve the accuracy of root cause location, thereby supporting accurate work order dispatch on the operator's operation and maintenance platform and promoting a consolidated maintenance system for cloud-based platforms.

Association analysis is expected to become the mainstream approach in the industry for alarm correlation and fault cause location. Association analysis begins with association rule mining, which discovers the relationships between items in a data set and extracts the strong association rules that satisfy given conditions in order to guide the business. Existing association rule analysis methods mainly use traditional data mining algorithms such as Apriori or FP-Growth to mine alarm associations, where:

The Apriori algorithm is one of the most influential algorithms for mining frequent itemsets for Boolean association rules. It uses an iterative method known as level-wise search, in which k-itemsets are used to explore (k+1)-itemsets. The procedure is as follows: first find the frequent 1-itemsets, denoted L1; then use L1 to generate the candidate itemsets C2 and test the candidates in C2 to obtain L2, the frequent 2-itemsets; repeat until no more frequent k-itemsets can be found. To make the level-wise generation of frequent itemsets more efficient, an important property known as the Apriori property is used to prune the search space. It rests on two facts: first, every non-empty subset of a frequent itemset must also be frequent; second, every superset of an infrequent itemset is infrequent;

The FP-Growth algorithm compresses frequent pattern information by building a frequent pattern tree, a relatively compact data structure, and is in essence a depth-first search algorithm. It consists of two main stages, building the frequent pattern tree and mining the frequent patterns in the tree, and needs only two scans of the transaction database to obtain the set of frequent patterns. The procedure is as follows: (1) the first scan of the data set counts the occurrences of each item, records the counts in the header table, and deletes the entries that do not meet the support requirement; (2) the second scan of the data set removes from each record the items that do not meet the support requirement and sorts the remaining items in each record in descending order of occurrence count; (3) the third scan of the data set builds the FP-tree and completes the header table; (1) to (3) are iterated to collect the frequent itemsets.
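The level-wise Apriori search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the invention: the function name `apriori`, the use of an absolute count for `min_support`, and the sample alarm names are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for every frequent itemset.

    `min_support` is an absolute transaction count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 1
    while level:
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical alarm transactions, one set of alarm types per time window.
windows = [
    {"link_down", "port_flap", "vm_unreachable"},
    {"link_down", "port_flap"},
    {"link_down", "vm_unreachable"},
    {"port_flap", "vm_unreachable"},
    {"link_down", "port_flap", "vm_unreachable"},
]
result = apriori(windows, min_support=3)
```

With these sample windows every pair of alarm types is frequent, while the triple appears in only two windows and is discarded at the third level.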

However, existing association rule analysis methods usually make insufficient use of key information such as service topology and resource topology. Even when the two algorithms above are used to mine association rules and root cause location analysis is then performed on those rules, the mined rules depend on how often itemsets occur. They use only the time and frequency information in historical alarm data and cannot take into account factors such as the alarm level and the importance of the originating device in a specific operation and maintenance scenario, so most rules tend to have low confidence and poor effectiveness.

Besides association rule analysis, graphical deduction methods such as causal graphs and fault trees can also be considered. For example, alarm root causes can be located with a causal graph method: for different scenarios and data characteristics, different types of causal graphs are first constructed, and different types of algorithms are then used to search the causal graph for the root alarm. Typical approaches build the causal graph model with the PC algorithm based on conditional probability or co-occurrence features, or build it from call chain features. If a causal-graph-based root alarm locating technique uses only one of alarm association relationships, service topology data and device connection relationships when constructing the causal graph, and no mature scheme exists that considers several kinds of information together, the applicability of the scheme is limited and irregular network changes can significantly degrade root cause location.

In view of these shortcomings and problems, the present invention therefore proposes a cross-layer root alarm locating scheme for cross-domain alarm analysis on operators' cloud-network service platforms such as video ringback tone and 5G messaging. It mainly includes:

1) An improved FP-Growth algorithm. The improvements are as follows: a weight coefficient is set for each frequent itemset according to importance factors such as alarm device association and alarm level, which raises the confidence and effectiveness of the association rules; while frequent itemsets are being mined, child node paths that have already been traversed are marked as processed, which avoids repeated traversal and improves computational efficiency. Cross-layer alarm association rules are mined with the improved FP-Growth algorithm, and cross-layer alarm root cause analysis is then performed;

2) A method for constructing an alarm knowledge graph and a software and hardware knowledge graph for the cloud-network platform. Historical alarm indicators, service indicators, resource indicators, performance indicators and other data are fully used to build the alarm knowledge graph and the software and hardware knowledge graph, and the knowledge graphs are used for cross-layer alarm root cause analysis. Because several knowledge graphs are combined, the scheme can be optimized and iterated, and operation and maintenance cases and experience can be accumulated;

3) The analysis results of the association rule algorithm based on the improved FP-Growth algorithm and the analysis results of the knowledge-graph-based algorithm are judged together through a fusion method such as voting to determine the final root cause alarm, which effectively improves the accuracy of root cause location and the robustness of the scheme.
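As a rough illustration of the fusion step in item 3), the sketch below combines two ranked candidate lists with a weighted vote. The text names voting only as one possible fusion method and does not specify its details, so the rank-based scoring, the `weights` parameter, the tie-breaking rule and the function name `fuse_by_vote` are all assumptions.

```python
from collections import Counter


def fuse_by_vote(rule_based, graph_based, weights=(1.0, 1.0)):
    """Pick one final root cause alarm from two ranked candidate lists.

    Each analyzer votes for its candidates; a candidate ranked r-th (0-based)
    contributes weight / (r + 1). This scoring is an assumption of the sketch.
    """
    scores = Counter()
    for candidates, weight in zip((rule_based, graph_based), weights):
        for rank, alarm in enumerate(candidates):
            scores[alarm] += weight / (rank + 1)
    if not scores:
        return None
    # Iterate in sorted order so that ties resolve to the smallest alarm name.
    return max(sorted(scores), key=lambda alarm: scores[alarm])


# Hypothetical outputs of the rule-based and knowledge-graph-based analyses.
first = ["link_down", "vm_unreachable"]
second = ["host_power_fail", "link_down"]
final = fuse_by_vote(first, second)
```

Here `link_down` wins because both analyses list it, even though only one ranks it first. Adjusting `weights` would let one analysis dominate when it is known to be more reliable.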

Embodiment 1:

As shown in Figure 1, the present invention provides an alarm root cause locating method, including:

S1. obtaining first root cause alarm information of the alarm data to be analyzed based on alarm association rules;

S2. obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, including:

obtaining the alarm knowledge graph, in which any two connected nodes are a cause node and an effect node and each connecting edge is assigned a causal edge weight,

matching the alarm data to be analyzed against the nodes of the alarm knowledge graph to obtain an alarm causal graph,

obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;

S3. obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.

Specifically, in this embodiment, steps S1 and S2 are executed first, in either order, and step S3 is executed once both are complete. Steps S1 and S3 can be carried out in various ways, for example following the specific steps given later in this embodiment. Step S2 includes obtaining an alarm knowledge graph in which nodes are connected according to causal relationships and a causal edge weight is set between each pair of causally related nodes; the weight represents the probability that the occurrence of the cause-node event triggers the effect-node event. The alarm data to be analyzed can therefore be matched against the nodes of the alarm knowledge graph to establish the causal relationships among the alarms to be analyzed, and the second root cause alarm information is then determined from these causal relationships together with the corresponding causal edge weights. Through this process, the first root cause alarm information obtained based on alarm association rules is combined with the second root cause alarm information obtained based on the alarm knowledge graph to obtain the final root cause alarm information. In obtaining the second root cause alarm information, the causal relationships between cause nodes and effect nodes in the alarm knowledge graph and the causal edge weights are judged together. Adding alarm knowledge graph analysis on top of association rule mining makes fuller use of the available information, thereby improving the accuracy of root cause location and the robustness of the scheme.
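The matching part of step S2 can be sketched as follows. The knowledge graph is represented here as a dictionary mapping (cause, effect) pairs to causal edge weights, and the alarm causal graph is the subgraph induced by the alarms that are currently active. How the second root cause is chosen from the weights is not spelled out at this point in the text, so the ranking rule below (alarms with no active cause, ordered by the total weight of their outgoing edges) is purely an assumption, as are all function and alarm names.

```python
def alarm_causal_graph(knowledge_graph, active_alarms):
    """Keep only the cause -> effect edges whose two endpoints are active alarms."""
    active = set(active_alarms)
    return {(cause, effect): weight
            for (cause, effect), weight in knowledge_graph.items()
            if cause in active and effect in active}


def rank_root_candidates(causal_graph, active_alarms):
    """Rank active alarms that no other active alarm explains.

    Score = sum of the causal edge weights leaving the alarm (assumed rule).
    """
    explained = {effect for (_, effect) in causal_graph}
    scores = {}
    for alarm in active_alarms:
        if alarm in explained:
            continue  # some other active alarm is a cause of this one
        scores[alarm] = sum(weight for (cause, _), weight in causal_graph.items()
                            if cause == alarm)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical alarm knowledge graph: (cause node, effect node) -> causal edge weight.
kg = {
    ("power_fail", "host_down"): 0.9,
    ("host_down", "vm_down"): 0.8,
    ("disk_full", "vm_down"): 0.3,
    ("fan_fail", "power_fail"): 0.4,
}
active = ["host_down", "vm_down", "power_fail", "disk_full"]
graph = alarm_causal_graph(kg, active)
ranking = rank_root_candidates(graph, active)
```

In this example `fan_fail` is not among the active alarms, so its edge is dropped and `power_fail` becomes the top candidate with no active cause of its own.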

One specific application scenario of this embodiment is cross-layer alarm analysis for operators' cloud-network service platforms such as video ringback tone and 5G messaging. The overall technical solution can also be as shown in Figure 2 and mainly contains three parts: association rule mining and first root cause location, specifically cross-layer alarm association rule mining and root cause location based on improved FP-Growth; alarm knowledge graph construction and second root cause location, which also includes constructing a software and hardware knowledge graph and using it to assist the alarm knowledge graph in root cause location; and fusion of the root cause location results, specifically fusing the results from the knowledge graphs (including the alarm knowledge graph) and from the rule base (which stores the alarm association rules) to judge the final root cause location result. Executing these three parts first requires alarm data, which includes historical alarm data and the alarm data to be analyzed. The alarm data to be analyzed can be the real-time alarm data in a real-time alarm slice. The historical alarm data is used for association rule mining and for constructing the alarm knowledge graph. First root cause location and second root cause location are then performed on the alarm data to be analyzed, based on the mined alarm association rules and the constructed alarm knowledge graph respectively. Finally, the first root cause location result (the first root cause alarm information) and the second root cause location result (the second root cause alarm information) are fused to output the final root cause alarm information. The final root cause alarm information can include the root alarm (here, the final root cause alarm) and can also include the root cause path corresponding to the root alarm (the final root cause alarm path).

In a possible implementation, before the first root cause alarm information of the alarm data to be analyzed is obtained based on the alarm association rules, the method further includes:

constructing the alarm association rules from historical alarm data based on an improved FP-Growth algorithm, where the improved FP-Growth algorithm adds weight coefficients to the support calculation and/or the confidence calculation of the FP-Growth algorithm, and any two connected nodes in the alarm association rules are a root cause alarm node and a derived alarm node.

Specifically, in this embodiment, as shown in Figure 3, association rule mining and root cause location based on the improved FP-Growth algorithm can be divided into two parts. The first part is the construction of the association rule base: the improved FP-Growth algorithm is used to mine historical alarm data and resource topology data for cross-layer alarm association rules, and experts can also enter association rules manually based on their operation and maintenance experience. The second part is rule-based alarm root cause analysis: the constructed rule base (the association rule base, which contains the alarm association rules) is used to analyze alarm root causes and to output the root alarm and a root cause list (that is, the first root cause alarm information, which can take the form of a first root cause alarm list). The main improvements to the FP-Growth algorithm are as follows: importance factors such as alarm device association and alarm level are determined from the resource topology data, and weight coefficients based on these factors are added to the support calculation and/or the confidence calculation of the FP-Growth algorithm, in order to raise the confidence and effectiveness of the alarm association rules.

In a possible implementation, constructing the alarm association rules from historical alarm data using the improved FP-Growth algorithm specifically includes:

Obtaining historical alarm data that has undergone first preprocessing. The first preprocessing includes: extracting key fields from the historical alarm data; generating first dictionary values, each comprising the alarm node and key fields that identify the historical alarm data; representing the historical alarm data by the first dictionary values; dividing the historical alarm data into different first alarm classes according to the key fields; and obtaining the frequency with which each first alarm class occurs in each historical time window by setting a time window and a sliding step;

Calculating the first weighted support ωs1 of each first alarm class according to Equation (1) from a first weight coefficient ω1 and the frequency; sorting, within each historical time window and by frequency, the first alarm classes whose ωs1 is greater than a second preset value; and building a frequent pattern tree (FP-tree) in which the first alarm classes are inserted as child nodes in their sorted order within each historical time window, where:

where X is a given first alarm class, N(X) is the frequency of first alarm class X, and N is the total frequency of all first alarm classes;
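The body of Equation (1) is not reproduced in this text. Reading the definitions above together with the embodiment's statement that the support of each transaction item is multiplied by the weight coefficient, one consistent form is the following. It is offered as an assumption, not as the published formula:

```latex
\omega_{s1}(X) = \omega_1 \cdot \frac{N(X)}{N} \tag{1}
```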

Obtaining the conditional pattern base of each child node in the FP-tree; calculating the second weighted support ωs2 of each conditional pattern base path according to Equation (2) from a second weight coefficient ω2 and the frequency; and taking the conditional pattern base paths whose ωs2 is greater than a third preset value as frequent itemsets, where:

where X1 is the first alarm class at the tail node of a given conditional pattern base path, Y1 is the first alarm class at the head node of that path, and N(X1∪Y1) is the sum of the frequencies of the first alarm classes on that path;
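The body of Equation (2) is likewise not reproduced here. By analogy with the first weighted support, and assuming the same normalizing total N, a plausible form is the following. The denominator in particular is an assumption:

```latex
\omega_{s2}(X_1 \cup Y_1) = \omega_2 \cdot \frac{N(X_1 \cup Y_1)}{N} \tag{2}
```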

Calculating, according to Equation (3) from a third weight coefficient ω3 and the frequency, the weighted confidence ωc between any two adjacent first alarm classes X2 and Y2 in each frequent itemset; in response to ωc(X2→Y2) < ωc(Y2→X2), arranging Y2 as the root cause alarm node of X2; and repeating until all first alarm classes in each frequent itemset have been arranged, thereby obtaining the alarm association rules, where:

where σ(X2∪Y2) is the frequency with which X2 and Y2 occur together, and σ(X2) is the frequency with which X2 occurs.
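The body of Equation (3) is not reproduced in this text. The definitions above match the standard confidence of an association rule, and the embodiment states that the confidence is multiplied by the weight coefficient ω3, so a consistent form, given here as an assumption, is:

```latex
\omega_c(X_2 \rightarrow Y_2) = \omega_3 \cdot \frac{\sigma(X_2 \cup Y_2)}{\sigma(X_2)} \tag{3}
```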

Specifically, in this embodiment, the steps for constructing the cross-layer alarm association rule library based on improved FP-Growth are shown in Figure 3 and include:

Obtaining data from the data warehouse, including historical alarm data from alarm history data files and resource topology data.

Data preprocessing, which includes, for example, the following. Alarm denoising: redundant and invalid alarms are removed, such as engineering alarms, alarms carrying a cutover flag, and alarms with missing key fields; if several alarms have the same alarm name and the same alarm location, only one is kept. Key information extraction: the useful fields are extracted from the alarm data, typically the alarm title, device type, alarm code, specialty, alarm object, and alarm level. Alarm dictionary integration: alarm fields are concatenated into a dictionary value, for example by joining the "network element ID + alarm title" fields, so that each alarm is uniquely identified. Classification by topology: based on the topological relationships of the alarm objects, alarms from devices with an upstream/downstream relationship are grouped first; this relationship can be determined from the resource device relationships provided by the resource center (which include sharing a transmission link, a rack, and so on). Event window sliding: by setting a time window and a sliding step, the discrete alarm data is converted into a set of alarm transactions.
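The dictionary integration and window sliding steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the "+" separator, and the choice to drop empty windows are all assumptions.

```python
from datetime import datetime, timedelta


def dictionary_value(ne_id, title):
    """Join key fields into one dictionary value (separator is an assumption)."""
    return f"{ne_id}+{title}"


def sliding_window_transactions(alarms, window, step):
    """Turn discrete alarms into transactions.

    alarms: list of (timestamp, dictionary_value) tuples.
    window, step: timedelta objects for the window length and sliding step.
    Returns a list of sets, one per non-empty window.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    if not alarms:
        return []
    alarms = sorted(alarms)
    start, last = alarms[0][0], alarms[-1][0]
    transactions = []
    while start <= last:
        # Collect every alarm whose timestamp falls in [start, start + window).
        items = {key for ts, key in alarms if start <= ts < start + window}
        if items:  # assumption: empty windows carry no information
            transactions.append(items)
        start += step
    return transactions
```

With a step equal to the window length the windows do not overlap; a smaller step gives overlapping windows, so an alarm near a boundary can appear in more than one transaction.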

Association rule mining based on the improved FP-Growth algorithm is executed by the AI engine and includes the following. Building the alarm frequent tree top-down: the alarm set is traversed first and classified by fields such as device type, alarm title, and alarm code; the frequency of each alarm class (first alarm class) is counted and sorted from high to low, and a minimum support is set, with alarms below the minimum support discarded rather than treated as frequent items. Based on key influencing factors such as alarm device association and alarm level, a weight coefficient ω is applied to the initial frequency of each transaction item (first alarm class); here it can be the first weight coefficient ω1, which is larger for more important items and can be set manually. The minimum support used for comparison is therefore a minimum weighted support (the second preset value): the support of each transaction item is multiplied by ω and compared with the minimum weighted support, items above it are kept as frequent items, the frequent items are sorted by frequency of occurrence, alarms below the minimum weighted support are removed, and the frequent tree (FP-tree) is generated.

Mining frequent itemsets of child nodes bottom-up: each lowest-level child node item of the FP-tree is traversed upward to obtain its conditional pattern base. The conditional pattern base is found by traversing upward from the child node to its ancestor nodes, with the frequency of each ancestor node kept equal to that of the child node. From the conditional pattern base, the conditional FP-tree of the child node item is built bottom-up, with the child node item as the suffix. If the conditional FP-tree is a single path, the frequent itemsets are generated directly; if it branches, traversal continues upward and itemsets below the weighted minimum support are removed (the second preset value, which may be the same as the first preset value; the weight applied here can be the second weight coefficient ω2), finally yielding the frequent itemsets of the child node item.

Calculating weighted confidence to generate strong association rules: from the frequent itemsets obtained, the confidence of each alarm item is calculated, and frequent items below the minimum confidence (the third preset value) are removed to generate strong association rules. As before, based on key influencing factors such as alarm device association and alarm level, a third weight coefficient ω3 is set for each frequent itemset (larger for itemsets of higher importance and stronger resource correlation; it can be set manually). The confidence of a frequent itemset multiplied by ω3 is its weighted confidence, which is compared with the minimum weighted confidence. The weighted confidence also identifies the root alarm: for example, if a frequent itemset contains only X and Y and ωc(X→Y) < ωc(Y→X), then the root alarm of the itemset {X, Y} is Y; with more items, the same pairwise comparison is applied repeatedly.
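The weighting and the pairwise root alarm test described above can be sketched as follows. The sketch assumes the weighted support is weight × count / total and the weighted confidence is weight × σ(X∪Y) / σ(X); the function names and the tie handling (returning None) are illustrative assumptions.

```python
def weighted_support(count, total, weight):
    """Support of an item scaled by its weight coefficient."""
    return weight * count / total


def weighted_confidence(transactions, x, y, weight=1.0):
    """Weighted confidence of the rule x -> y over a list of item sets."""
    sigma_x = sum(1 for t in transactions if x in t)
    sigma_xy = sum(1 for t in transactions if x in t and y in t)
    return weight * sigma_xy / sigma_x if sigma_x else 0.0


def root_of_pair(transactions, x, y, weight=1.0):
    """Return the root alarm of the pair {x, y}.

    Following the text: if conf(x -> y) < conf(y -> x), y is the root.
    The symmetric case returns x; a tie returns None (an assumption,
    since the text does not say how ties are resolved).
    """
    c_xy = weighted_confidence(transactions, x, y, weight)
    c_yx = weighted_confidence(transactions, y, x, weight)
    if c_xy < c_yx:
        return y
    if c_yx < c_xy:
        return x
    return None
```

In the example where Y appears only together with X but X also appears alone, conf(Y→X) is 1 and conf(X→Y) is lower, so Y is reported as the root alarm and X as the derived alarm.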

Association rule review and storage: the cross-layer alarm association rules mined above are validated and reviewed by experts. After verification and review, they are consolidated into a cross-specialty (cross-layer) alarm association rule library. Fault propagation chains can then be generated from this library and one-hot encoded to support the subsequent first root cause location analysis.

In a possible implementation, obtaining the conditional pattern base of each child node in the FP-tree specifically includes:

Taking each node at the end of the FP-tree as a child node; for a given child node, traversing upward through all of its ancestor nodes in the FP-tree; taking each path whose nodes are successively connected from top to bottom as a conditional pattern base path; and marking each conditional pattern base path so that it is not repeated in subsequent child node traversals.

Specifically, in this embodiment, the step of building the conditional pattern base of child node items by bottom-up traversal improves the upward traversal strategy: child node paths that have already been traversed are marked as processed, which avoids repeated traversal and improves computational efficiency.

In a possible implementation, obtaining the first root cause alarm information of the alarm data to be analyzed based on the alarm association rules specifically includes:

Performing second preprocessing on the alarm data to be analyzed. The second preprocessing includes: extracting key fields from the alarm data to be analyzed; generating second dictionary values, each comprising the alarm node and key fields that identify the alarm data to be analyzed; and representing the alarm data to be analyzed by the second dictionary values;

Obtaining the first dictionary value of the historical alarm data corresponding to each node in the alarm association rules; calculating the cosine similarity between the second dictionary value and each obtained first dictionary value; and, in response to the cosine similarity being greater than a fourth preset value, annotating the second dictionary value at the corresponding node in the alarm association rules, to obtain the alarm association sub-rules hit by the alarm data to be analyzed;

Based on the sum of the cosine similarities of the hit nodes in each alarm association sub-rule, obtaining the TopN alarm association sub-rules with the largest sums, and taking the TopN final root cause alarm nodes of those sub-rules as the first root cause alarms.

Specifically, in this embodiment, the rule-based alarm root cause analysis built on the constructed association rule library is shown in Figure 3 and includes:

Obtaining real-time alarm data, for example by monitoring the active alarm queues of the cloud resource pool and service platform in real time and taking M alarms from them as the alarm data to be analyzed;

Feeding the real-time alarm data into a rule matcher, which compares the input with the rule library, for example by one-hot encoding the active alarm queue, calculating the cosine similarity between this encoding and the one-hot encoding of each fault propagation chain, and sorting the results in descending order;

Obtaining the root cause list. If one rule is hit (cosine similarity greater than the fourth preset value), the hit rule is analyzed and its root alarm is returned; for example, if rule Z→Y→X is hit, the final root cause alarm node Z is returned as the root alarm (the terms root-source alarm, root alarm, and root cause alarm are used interchangeably). If multiple rules are hit, the TopN root causes (short for root cause alarms, here the first root cause alarms) are returned together with their similarities.
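The rule matcher described above can be sketched as follows, assuming each fault propagation chain is stored as a list ordered from root alarm to last derived alarm (so chain[0] is the root). The function names and the vocabulary construction are illustrative assumptions.

```python
import math


def one_hot(items, vocabulary):
    """Encode a collection of alarm keys as a 0/1 vector over the vocabulary."""
    present = set(items)
    return [1 if v in present else 0 for v in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_rules(active_alarms, chains, threshold, top_n):
    """Return up to top_n (root_alarm, similarity) pairs, best match first.

    A chain is hit when its similarity to the active alarms exceeds
    the threshold (the fourth preset value in the text).
    """
    vocabulary = sorted({a for chain in chains for a in chain} | set(active_alarms))
    query = one_hot(active_alarms, vocabulary)
    hits = []
    for chain in chains:
        sim = cosine(query, one_hot(chain, vocabulary))
        if sim > threshold:
            hits.append((chain[0], sim))  # chain[0] is assumed to be the root
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_n]
```

For the rule Z→Y→X in the text, an active queue containing X, Y, and Z matches the chain exactly and Z is returned as the root alarm.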

In a possible implementation, obtaining the second root cause alarm information from the alarm causal graph and the causal edge weights specifically includes:

Obtaining, from each alarm causal graph, the alarm causal subgraphs formed by consecutive causal edges whose weights are greater than a first preset value;

Taking the final cause node of each alarm causal subgraph as a second root cause alarm, and taking each alarm causal subgraph as a root cause alarm path.

Specifically, in this embodiment, the execution of step S2 can be understood with reference to Figure 4 (a simplified example only). First, the alarm knowledge graph shown in Figure 4(a) is obtained. Its nodes are A1, A2, B1, B2, C1, C2, and C3; each connecting line represents a causal relationship and its arrow gives the causal direction. That is, A1 and A2 are cause nodes of B1 and B2, and B1 and B2 are effect nodes of A1 and A2; B1 and B2 are cause nodes of C1, C2, and C3, and C1, C2, and C3 are effect nodes of B1 and B2. Each connecting line carries a causal edge weight. Suppose that matching the alarm data to be analyzed against the nodes of the alarm knowledge graph in Figure 4(a) yields the alarm causal graph shown in Figure 4(b), that is, the alarm data hits nodes A1, B1, B2, C1, C2, and C3. Assuming the first preset value is 0.5 (an example only), the two alarm causal subgraphs shown in Figures 4(c) and 4(d) are obtained from the causal edge weights. The second root cause alarm information is derived from these two subgraphs: the final cause node of each subgraph is taken as a second root cause alarm, giving A1 and B2, and each subgraph is a root cause path, so root cause alarm A1 can trigger the derived alarms B1, C1, and C2, and B2 can trigger the derived alarm C3. It will be understood that the order of obtaining the alarm causal graph and the alarm causal subgraphs is not limited to the above: the alarm knowledge graph may first be divided into multiple subgraphs according to the causal edge weights, after which node matching directly yields the alarm causal subgraphs.
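The Figure 4 walk-through can be sketched as follows. The edge weights below are hypothetical, since the figure's values are not given in the text; they are chosen so that a threshold of 0.5 reproduces the two subgraphs described. The function name and the choice to ignore hit nodes with no surviving edge are also assumptions.

```python
from collections import defaultdict


def causal_subgraphs(edges, hit_nodes, threshold):
    """Find root cause alarms and the derived alarms reachable from each.

    edges: dict mapping (cause, effect) to a causal edge weight.
    hit_nodes: set of knowledge graph nodes matched by the alarms.
    Keeps only edges between hit nodes whose weight exceeds the threshold,
    then returns {root: sorted list of derived nodes}.
    """
    children = defaultdict(list)
    has_parent = set()
    for (cause, effect), weight in edges.items():
        if cause in hit_nodes and effect in hit_nodes and weight > threshold:
            children[cause].append(effect)
            has_parent.add(effect)
    # A root is a cause node that is not the effect of any surviving edge.
    roots = [node for node in children if node not in has_parent]
    result = {}
    for root in roots:
        seen, stack = set(), [root]
        while stack:
            node = stack.pop()
            for child in children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        result[root] = sorted(seen)
    return result


# Hypothetical weights for the graph of Figure 4(a).
EDGES = {
    ("A1", "B1"): 0.8, ("A1", "B2"): 0.3, ("A2", "B2"): 0.6,
    ("B1", "C1"): 0.7, ("B1", "C2"): 0.6, ("B2", "C3"): 0.9,
}
HITS = {"A1", "B1", "B2", "C1", "C2", "C3"}
```

Calling causal_subgraphs(EDGES, HITS, 0.5) yields A1 with derived alarms B1, C1, and C2, and B2 with derived alarm C3, matching the description of Figures 4(c) and 4(d).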

In a possible implementation, the nodes of the alarm knowledge graph are arranged in layers by alarm node level, and each node corresponds to one alarm fault type at one alarm node level;

Matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain the alarm causal graph specifically includes:

Dividing the alarm data to be analyzed into different third alarm classes according to alarm node level and alarm fault type, and matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes to obtain the alarm causal graph.

Specifically, in this embodiment, the alarm knowledge graph is built in layers, and the layers follow the alarm node levels. An alarm node is the software or hardware that raises an alarm, such as a host, a virtual machine, or a database. Each alarm node level has multiple alarm fault types, which serve as the basis for classifying alarm data and building the alarm knowledge graph. Classification can be performed with a trained alarm data classification model, which is outside the scope of this application. The model classifies the alarm data to be analyzed into results (third alarm classes) that correspond to nodes of the alarm knowledge graph, and matching on these results yields the alarm causal graph.

In a possible implementation, before the alarm knowledge graph is obtained, the method further includes:

Constructing the alarm knowledge graph according to the alarm node level and alarm fault type of historical alarm data. The alarm node levels comprise three layers: physical machine (HOST), virtual machine (VM), and software (SOFTWARE). Each node in the alarm knowledge graph corresponds to one HOST alarm fault type, VM alarm fault type, or SOFTWARE alarm fault type.

Specifically, in this embodiment, as shown in Figure 5, alarm knowledge graph construction and second root cause location analysis are divided into two parts. The first part is construction of the cloud network platform knowledge graphs: using data such as cloud resource pool hardware resources, cloud platform software resources, call chains, and historical alarms, it builds the cloud network platform software and hardware knowledge graph and the alarm knowledge graph. The second part is knowledge-graph-based root cause analysis: it performs alarm root cause location analysis on the knowledge graphs built above and outputs root alarms and root cause paths. The alarm knowledge graph is built from historical alarm data, which is classified by the same method as the alarm data to be analyzed. The alarm node levels are currently HOST (physical machine), VM (virtual machine), and SOFTWARE (software).

In a possible implementation, constructing the alarm knowledge graph according to the alarm node level and alarm fault type of historical alarm data specifically includes:

Dividing the historical alarm data into different second alarm classes according to alarm node level and alarm fault type, the second alarm classes including HOST alarm fault types, VM alarm fault types, and SOFTWARE alarm fault types;

Taking each second alarm class as a node of the alarm knowledge graph, taking HOST alarm fault types as cause nodes of VM alarm fault types and VM alarm fault types as cause nodes of SOFTWARE alarm fault types, and arranging and connecting all second alarm classes in layers;

Calculating, for each pair of cause and effect nodes, the ratio of the number of times historical alarms occurred at both nodes together to the total number of times historical alarms occurred at the cause node, and using this ratio as the causal edge weight of the connecting line between the pair.

Specifically, in this embodiment, as shown in Figure 5, construction of the cloud network platform alarm knowledge graph includes the following. Obtaining historical alarm data from the cloud resource pool and service platform. Alarm data classification (alarm classification, training the classification model): the historical alarm information (historical alarm data) is tokenized, word vectors are computed and fed into the model for training, and alarms are classified into the physical machine layer, virtual machine layer, and software layer. Causal node selection (nodes determined from the classification results): a causal node does not denote an alarm on a specific physical machine or virtual machine IP; it is an abstraction over an alarm type. There are currently three layers, which can be understood with reference to Figure 4: A denotes physical-machine-level alarms, B virtual-machine-level alarms, and C software-level alarms. For example, a downtime alarm on any physical machine is assigned to the [physical machine - downtime] node (A1) of the causal graph; other nodes may include A2 [physical machine - NIC traffic alarm], B1 [VM - downtime], and C1 [JBoss - log OutOfMemory]. After alarm classification, all alarm classes are provisionally treated as causal nodes; the final causal nodes are selected after the causal algorithm has output causal edges and these have been manually screened and confirmed.

Building causal discovery samples: alarm classification has assigned each alarm record to an alarm type (physical machine - xx alarm, virtual machine - xx alarm, software - xx alarm). For example, taking each virtual machine alarm record as the center and given an alarm time slice (1 min, 2 min, and so on), the related alarm records within that time slice are collected as one causal discovery sample; related alarms include alarms on the physical machine hosting the virtual machine and alarms on other virtual machines hosted on the same physical machine. Obtaining causal edges (the causal algorithm discovers causal edges): the conventional PC causal algorithm (the Peter-Clark algorithm, a common causal discovery algorithm) is run on the causal discovery samples to find causal edges. Causal edge weight calculation (weights computed from conditional probability): for each causal edge given by the causal discovery algorithm (connecting a cause node and an effect node), the weight is the ratio of [the number of times the effect node raised an alarm given that the cause node raised an alarm] to [the total number of alarms at the cause node], computed on the causal discovery samples. Forming the alarm knowledge graph: after [alarm classification] --> [building causal discovery samples] --> [causal algorithm discovers causal edges] --> [causal edge weight calculation], all causal edges and their weights have been generated, completing the unified alarm knowledge graph.
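The conditional-probability weight described above can be sketched as follows, assuming each causal discovery sample is represented as the set of alarm types observed in one time slice. The function name and the zero weight for a cause that never occurs are assumptions.

```python
def causal_edge_weight(samples, cause, effect):
    """Estimate P(effect | cause) over causal discovery samples.

    samples: list of sets of alarm types, one set per time slice.
    Returns (# samples containing both) / (# samples containing the cause),
    or 0.0 when the cause never occurs.
    """
    cause_count = sum(1 for s in samples if cause in s)
    both_count = sum(1 for s in samples if cause in s and effect in s)
    return both_count / cause_count if cause_count else 0.0
```

For instance, if a host downtime alarm appears in three samples and a VM downtime alarm accompanies it in two of them, the edge from host downtime to VM downtime gets a weight of 2/3.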

In a possible implementation, the alarm data to be analyzed comprises multiple pieces of real-time alarm data of the cloud network platform within the current time window, obtained by setting a time window and a sliding step. The cloud network platform comprises multiple systems, each of which includes the three layers HOST, VM, and SOFTWARE;

Matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes to obtain the alarm causal graph specifically includes:

Querying the software and hardware knowledge graph of the cloud network platform by the alarm nodes of the alarm data to be analyzed; dividing all third alarm classes by cloud network platform system, according to the system to which each alarm node belongs in the software and hardware knowledge graph; and matching the third alarm classes assigned to each system separately against the nodes of the alarm knowledge graph, to obtain the alarm causal graphs hit by the alarm data to be analyzed in the different systems of the cloud network platform.
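The per-system division described above can be sketched as follows. Here a plain dictionary, node_to_system, stands in for the lookup against the software and hardware knowledge graph; that mapping, the function name, and the choice to skip alarms from unknown nodes are all illustrative assumptions.

```python
def converge_by_system(alarms, node_to_system):
    """Group classified alarms by platform system, then by node.

    alarms: list of (node, alarm_type) pairs.
    node_to_system: dict mapping a node to the system it belongs to
                    (a stand-in for a knowledge graph query).
    Returns {system: {node: [alarm_type, ...]}}.
    """
    grouped = {}
    for node, alarm_type in alarms:
        system = node_to_system.get(node)
        if system is None:
            continue  # assumption: alarms from unknown nodes are skipped
        grouped.setdefault(system, {}).setdefault(node, []).append(alarm_type)
    return grouped
```

Each per-system group can then be matched against the alarm knowledge graph separately, so that alarms from unrelated systems do not end up in the same alarm causal graph.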

Specifically, in this embodiment, alarm root cause analysis for the cloud network platform combines the alarm knowledge graph with the software and hardware knowledge graph, integrating multiple knowledge graphs to improve the root cause location results. A software and hardware knowledge graph of the cloud network platform is therefore required. More specifically, as shown in Figure 5, the software and hardware knowledge graph presents, from a global perspective, the internal logic among the applications, software, virtual machines, and physical machines in the cloud network platform, the call relationships between systems, the physical connections of network devices, and so on. Nodes in the graph include systems, DUs (deployment units), groups (host instance groups), software, virtual machines, physical machines, access switches, core switches, aggregation switches, routers, and so on. Relationships include constitute (composition), call (invocation), logical (logical connection), cluster (aggregation), ship (bearing), host (hosting), connect (physical connection), and so on. The main data sources for building the graph are cloud resource pool data (mainly cloud resource pool hardware data), call chain data, and physical device network connection data (that is, the device connection relationships must be obtained). The graph is first initialized from offline data. As the business changes and expands, old systems go offline and new systems come online, so the graph can be updated on a schedule or periodically to reflect these changes. Assume the cloud resource pool data format is as shown in Table 1 below:

Table 1. Example of cloud resource pool data format

From the data above, the relationship graph HOST (physical machine) -> VM (virtual machine) -> SOFTWARE (software) -> GROUP (group), together with GROUP (WildFly (JBoss AS, the JBoss application server)) -> GROUP (Nginx (a high-performance HTTP and reverse proxy web server)), can be built; this is the single-system relationship graph. Call chain and physical device network connection knowledge construction: call chain data is used to obtain the call relationships between DUs (Distributed Units), the call relationships between systems, the DU/IP (Internet Protocol) mapping, and the logical connections between middleware (that is, cloud platform software resource data and the system and DU call relationships are obtained), from which the business call relationship graph is built. Knowledge graph merging: the single-system relationship graph and the call relationship graph obtained above are merged with the Python NetworkX package (a Python package for creating, manipulating, and studying the structure, dynamics, and functions of complex networks) to form the unified cloud network platform software and hardware knowledge graph. The software and hardware knowledge graph and the alarm knowledge graph are then stored in a graph database, so that they can be queried and invoked in real time to support the subsequent root cause location analysis.

In a possible implementation, obtaining the final root cause alarm information based on the first root cause alarm information and the second root cause alarm information specifically includes:

In response to a plurality of second root cause alarms being determined for a certain cloud network platform system, determining the alarm data to be analyzed that corresponds to both a first root cause alarm and one of the plurality of second root cause alarms as the final root cause alarm of that cloud network platform system, and obtaining the root cause alarm path corresponding to the final root cause alarm.
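The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the function name `fuse_root_causes`, the alarm identifiers, the input layout, and the tie-breaking choice (the first agreeing candidate wins) are all hypothetical.

```python
def fuse_root_causes(first_root_causes, second_root_causes):
    """Pick a final root cause alarm per system.

    first_root_causes: set of alarm ids flagged by the association-rule analysis.
    second_root_causes: {system: [(alarm_id, root_cause_path), ...]} produced by
        the knowledge-graph analysis.
    Returns {system: (alarm_id, root_cause_path)} for every system in which some
    alarm was flagged by both analyses.
    """
    final = {}
    for system, candidates in second_root_causes.items():
        for alarm_id, path in candidates:
            if alarm_id in first_root_causes:
                # Assumption: the first candidate confirmed by both analyses wins.
                final[system] = (alarm_id, path)
                break
    return final


# Hypothetical inputs for two systems.
first = {"a17", "a42"}
second = {
    "system-1": [("a03", ["a03", "a08"]), ("a42", ["a42", "a51", "a60"])],
    "system-2": [("a99", ["a99"])],
}
final = fuse_root_causes(first, second)
```

In this example only `system-1` has an alarm confirmed by both analyses, so it is the only system that receives a final root cause alarm, and its path is taken from the knowledge-graph result.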

Specifically, in this embodiment, the alarm root cause analysis based on the cloud network platform knowledge graph includes the following steps.

Real-time alarm monitoring of the cloud resource pool and the service platform, and obtaining the alarms within the slice window: a time slice granularity is set, and the alarm data within each time slice (1 min, 5 min, etc.) is obtained in real time.

Invoking the classification model to classify the alarms: for the original alarm data, combined with the specific alarm information, monitoring items, and other information, the trained classification model (the same model as that used to construct the alarm knowledge graph may be used) classifies the original alarm data from the three aspects of HOST, VM, and SOFTWARE, for example into vm_network card traffic high, host_disk usage too high, software_web page access failed, and so on.

Alarm convergence (querying the cloud network platform knowledge graph database to obtain the software and hardware knowledge graph nodes, and converging the alarms onto the relevant nodes): the software and hardware knowledge graph is queried and the alarms are converged per system, in the following format: System 1: {software and hardware knowledge graph node 1: [alarm type 1, alarm type 2, ...], software and hardware knowledge graph node 2: [alarm type 1, alarm type 2, ...]}.

Obtaining the alarm causal graph (querying the cloud network platform knowledge graph database to obtain the alarm causal relationships): based on the alarm convergence result, the connection subgraph between all nodes under each system is queried in the graph database at the system level, and the final connection relationships between the nodes under a given system, that is, the alarm causal graph, are obtained.

Deriving the possible root cause paths: based on the alarm causal graph generated above and the weights, the suspected paths are calculated and sorted, and the root alarms and root cause paths are given.

Finally, a fusion algorithm such as voting is used to jointly evaluate the result of the root cause positioning algorithm based on the improved FP-Growth algorithm and the result of the root cause positioning algorithm based on the cloud network platform knowledge graph, and the final root cause is determined. The final root cause may be one final root cause alarm and the corresponding final root cause alarm path obtained for each system of the cloud network platform. Specifically, the real-time alarm data that is determined to be both a first root cause alarm and a second root cause alarm may be taken as the final root cause alarm, and the final root cause alarm path is obtained from the root cause path of the second root cause alarm.
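The alarm convergence step described in this embodiment can be illustrated with a short Python sketch that groups classified alarms into the per-system format {system: {node: [alarm types]}}. The field names `node` and `type`, the `node_to_system` lookup (standing in for a query against the software and hardware knowledge graph), and the de-duplication of alarm types are assumptions made for illustration.

```python
from collections import defaultdict


def converge_alarms(alarms, node_to_system):
    """Group classified alarms as {system: {node: [alarm types]}}."""
    converged = defaultdict(lambda: defaultdict(list))
    for alarm in alarms:
        system = node_to_system.get(alarm["node"])
        if system is None:
            continue  # node is not present in the software/hardware graph
        types = converged[system][alarm["node"]]
        if alarm["type"] not in types:  # keep each alarm type once per node
            types.append(alarm["type"])
    # Convert nested defaultdicts to plain dicts for a stable result.
    return {system: dict(nodes) for system, nodes in converged.items()}


# Hypothetical alarms from one time slice, already classified.
alarms = [
    {"node": "vm-01", "type": "vm_nic_traffic_high"},
    {"node": "host-01", "type": "host_disk_usage_high"},
    {"node": "vm-01", "type": "vm_nic_traffic_high"},
    {"node": "nginx-07", "type": "software_web_access_failed"},
    {"node": "unknown-node", "type": "vm_nic_traffic_high"},
]
node_to_system = {"vm-01": "system-1", "host-01": "system-1", "nginx-07": "system-2"}
converged = converge_alarms(alarms, node_to_system)
```

The converged structure is then the input for the per-system subgraph query that yields the alarm causal graph.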

Embodiment 2:

As shown in Figures 6 and 7, Embodiment 2 of the present invention provides an alarm root cause positioning device, including:

A first analysis module 1, configured to obtain first root cause alarm information of the alarm data to be analyzed based on alarm association rules;

A second analysis module 2, configured to obtain second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, including:

An alarm knowledge graph unit 21, configured to obtain the alarm knowledge graph, wherein two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting line is provided with a causal edge weight,

A second matching unit 22, connected to the alarm knowledge graph unit 21 and configured to match the alarm data to be analyzed against the nodes of the alarm knowledge graph to obtain an alarm causal graph,

A second positioning unit 23, connected to the second matching unit 22 and configured to obtain the second root cause alarm information according to the alarm causal graph and the causal edge weights;

A third analysis module 3, connected to the first analysis module 1 and the second analysis module 2 and configured to obtain final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.

In a possible implementation, the first analysis module 1 further includes:

An alarm association rule unit, configured to construct the alarm association rules from historical alarm data based on an improved FP-Growth algorithm, wherein the improved FP-Growth algorithm includes adding weight coefficients to improve the support calculation and/or the confidence calculation of the FP-Growth algorithm, and two connected nodes in the alarm association rules are a root cause alarm node and a derived alarm node.

In a possible implementation, the alarm association rule unit specifically includes:

A preprocessing subunit, configured to obtain historical alarm data that has undergone first preprocessing, the first preprocessing including: extracting key fields from the historical alarm data; generating first dictionary values that include the alarm node indicating the historical alarm data and the key fields; representing the historical alarm data with the first dictionary values; dividing the historical alarm data into different first alarm classes according to the key fields; and obtaining the frequency with which each first alarm class appears in each historical time window by setting a time window and a sliding step;

A frequent tree subunit, connected to the first preprocessing subunit and configured to: calculate the first weighted support ωs1 of each first alarm class according to formula (1) from the first weight coefficient ω1 and the frequency; sort, within each historical time window and according to the frequency, the first alarm classes whose ωs1 is greater than a second preset value; and build a frequent tree (FP-tree) using the first alarm classes as child nodes in their sorted order within each historical time window. In formula (1), X denotes a given first alarm class, N(X) is the frequency of the first alarm class X, and N is the frequency of all first alarm classes;

A frequent itemset subunit, connected to the frequent tree subunit and configured to: obtain the conditional pattern base of each child node in the FP-tree; calculate the second weighted support ωs2 of each conditional pattern base path according to formula (2) from the second weight coefficient ω2 and the frequency; and take the conditional pattern base paths whose ωs2 is greater than a third preset value as frequent itemsets. In formula (2), X1 denotes the first alarm class of the end node of a given conditional pattern base path, Y1 denotes the first alarm class of the head node of that conditional pattern base path, and N(X1∪Y1) denotes the sum of the frequencies of the first alarm classes on that conditional pattern base path;

An association rule subunit, connected to the frequent itemset subunit and configured to: calculate, according to formula (3) from the third weight coefficient ω3 and the frequency, the weighted confidence ωc between any two adjacent first alarm classes X2 and Y2 in each frequent itemset; and, in response to ωc(X2→Y2) < ωc(Y2→X2), arrange Y2 as the root cause alarm node of X2, until all first alarm classes in each frequent itemset have been arranged, thereby obtaining the alarm association rules. In formula (3), σ(X2∪Y2) denotes the frequency with which X2 and Y2 appear together, and σ(X2) denotes the frequency with which X2 appears.
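Formulas (1) to (3) are not reproduced in this text, so the following Python sketch rests on an assumption: that each weight coefficient simply scales the conventional support N(X)/N or the conventional confidence σ(X2∪Y2)/σ(X2). The function names, the class labels, and the sample counts are hypothetical, and the actual formulas of the invention may differ.

```python
def weighted_support(freq_x, total_freq, weight):
    """Assumed form of the weighted support: weight * N(X) / N."""
    return weight * freq_x / total_freq


def weighted_confidence(co_freq, freq_x, weight):
    """Assumed form of the weighted confidence: weight * sigma(X u Y) / sigma(X)."""
    return weight * co_freq / freq_x


def order_pair(x, y, freq, co_freq, weight=1.0):
    """Return (root_cause, derived) for two adjacent alarm classes.

    Follows the rule in the text: if wc(X -> Y) < wc(Y -> X), Y is arranged as
    the root cause of X. The tie case is not specified in the text; here X is
    kept as the root cause on a tie (an assumption).
    """
    c_xy = weighted_confidence(co_freq, freq[x], weight)
    c_yx = weighted_confidence(co_freq, freq[y], weight)
    return (y, x) if c_xy < c_yx else (x, y)


# Hypothetical frequencies: the VM alarm is common, the host alarm is rare,
# and they appeared together 8 times.
freq = {"vm_nic_traffic_high": 40, "host_disk_usage_high": 10}
support = weighted_support(30, 120, 0.5)
pair = order_pair("vm_nic_traffic_high", "host_disk_usage_high", freq, co_freq=8)
```

Under this assumed form and a shared weight, the comparison favours the rarer class as the root cause: the host alarm is followed by the VM alarm in 8 of its 10 occurrences, whereas the VM alarm is accompanied by the host alarm in only 8 of 40.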

In a possible implementation, the frequent itemset subunit includes a conditional pattern base subunit, configured to:

Take each node located at an end of the FP-tree as a child node; for a given child node, traverse upward through all of its ancestor nodes in the FP-tree, and take the path whose nodes are successively connected from bottom to top as one conditional pattern base path; and mark each conditional pattern base path so as to avoid duplication in subsequent child node traversals.
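The upward traversal described above can be sketched in Python with the tree held as a child-to-parent mapping. The mapping layout, the function name, and the use of a `seen` set as the "mark" are assumptions made for illustration; the FP-tree of the invention may be stored differently.

```python
def conditional_pattern_base_paths(parent):
    """Collect one path per end node, walking from the end node up to the top.

    parent maps each tree node to its parent; nodes directly under the
    (implicit) root map to None.
    """
    has_child = {p for p in parent.values() if p is not None}
    end_nodes = [node for node in parent if node not in has_child]

    seen = set()  # marks paths that were already emitted
    paths = []
    for end_node in end_nodes:
        path = []
        node = end_node
        while node is not None:  # climb until the top-level node is passed
            path.append(node)
            node = parent[node]
        key = tuple(path)
        if key not in seen:
            seen.add(key)
            paths.append(path)
    return paths


# Hypothetical tree:  a -> b -> c   and   a -> d
parent = {"a": None, "b": "a", "c": "b", "d": "a"}
paths = conditional_pattern_base_paths(parent)
```

Each returned path lists the end node first and the head node last, matching the roles of X1 (end node) and Y1 (head node) used for the second weighted support.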

In a possible implementation, the first analysis module 1 specifically includes:

A preprocessing unit, configured to perform second preprocessing on the alarm data to be analyzed, the second preprocessing including: extracting key fields from the alarm data to be analyzed; generating second dictionary values that include the alarm node indicating the alarm data to be analyzed and the key fields; and representing the alarm data to be analyzed with the second dictionary values;

A first matching unit, connected to the second preprocessing unit and configured to: obtain the first dictionary value of the historical alarm data corresponding to each node in the alarm association rules; calculate the cosine similarity between the second dictionary value and each obtained first dictionary value; and, in response to the cosine similarity being greater than a fourth preset value, mark the second dictionary value at the corresponding node in the alarm association rules, so as to obtain the alarm association sub-rules hit by the alarm data to be analyzed;

A first positioning unit, connected to the first matching unit and configured to: obtain, according to the sum of the cosine similarities of the hit nodes in each alarm association sub-rule, the TopN alarm association sub-rules with the largest sums of cosine similarities; and take the TopN final root cause alarm nodes of those TopN alarm association sub-rules as the first root cause alarms.
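A rough Python sketch of the matching and TopN selection follows. It assumes that a dictionary value can be represented as a bag of key-field tokens (a `Counter`), that a rule is a mapping from node names to such bags plus a designated root node, and that each rule node is scored by its best-matching incoming alarm. None of these representations is specified in the text; they are illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_root_causes(alarm_vectors, rules, threshold, n):
    """Rank rules by the summed similarity of their hit nodes; return root nodes."""
    scored = []
    for rule in rules:
        total = 0.0
        for node_vector in rule["nodes"].values():
            best = max(
                (cosine_similarity(v, node_vector) for v in alarm_vectors),
                default=0.0,
            )
            if best > threshold:  # the node counts as hit
                total += best
        if total > 0:
            scored.append((total, rule["root"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [root for _, root in scored[:n]]


# Hypothetical rules and incoming alarms.
rules = [
    {"root": "host_disk_usage_high", "nodes": {
        "host_disk_usage_high": Counter(["host", "disk", "usage", "high"]),
        "vm_io_wait_high": Counter(["vm", "io", "wait", "high"])}},
    {"root": "vm_nic_traffic_high", "nodes": {
        "vm_nic_traffic_high": Counter(["vm", "nic", "traffic", "high"]),
        "software_web_access_failed": Counter(["software", "web", "access", "failed"])}},
]
alarm_vectors = [
    Counter(["host", "disk", "usage", "high"]),
    Counter(["vm", "io", "wait", "high"]),
    Counter(["software", "web", "access", "failed"]),
]
ranked = top_n_root_causes(alarm_vectors, rules, threshold=0.8, n=2)
```

Here both nodes of the first rule are hit while only one node of the second rule is hit, so the first rule's root node ranks first.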

In a possible implementation, the second positioning unit 23 specifically includes:

A weight judgment subunit, configured to obtain, in each alarm causal graph, the alarm causal subgraphs whose consecutive causal edge weights are all greater than a first preset value;

A root cause result subunit, connected to the weight judgment subunit and configured to take the final cause node of each alarm causal subgraph as a second root cause alarm and take each alarm causal subgraph as a root cause alarm path.
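One way to read the two subunits above is sketched below in Python: edges at or below the threshold are dropped, every remaining node without an incoming edge is treated as a final cause node, and the edges reachable from it form its root cause alarm path. The edge-dictionary layout, the function name, and this reading of "consecutive" edges are assumptions, and the sketch assumes the kept edges contain no cycle.

```python
def root_cause_paths(edges, threshold):
    """Return {final_cause_node: [(cause, effect), ...]} for strong causal chains.

    edges maps (cause, effect) pairs of the alarm causal graph to edge weights.
    """
    kept = [pair for pair, weight in edges.items() if weight > threshold]

    children = {}
    effects = set()
    for cause, effect in kept:
        children.setdefault(cause, []).append(effect)
        effects.add(effect)

    # A final cause node has outgoing kept edges but no incoming kept edge.
    final_causes = sorted(c for c in children if c not in effects)

    result = {}
    for root in final_causes:
        seen, stack, subgraph = {root}, [root], []
        while stack:  # depth-first walk over the kept edges
            node = stack.pop()
            for nxt in sorted(children.get(node, [])):
                subgraph.append((node, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result[root] = subgraph
    return result


# Hypothetical alarm causal graph with edge weights.
edges = {
    ("host_disk_usage_high", "vm_io_wait_high"): 0.9,
    ("vm_io_wait_high", "software_web_access_failed"): 0.7,
    ("host_cpu_high", "vm_cpu_high"): 0.3,
}
paths = root_cause_paths(edges, threshold=0.5)
```

The weak `host_cpu_high` edge is discarded, leaving a single chain whose final cause node is `host_disk_usage_high`.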

In a possible implementation, the nodes in the alarm knowledge graph are arranged in layers according to alarm node level, and each node corresponds to one alarm fault type at one alarm node level;

The second matching unit 22 specifically includes:

A classification subunit, configured to divide the alarm data to be analyzed into different third alarm classes according to alarm node level and alarm fault type,

A second matching subunit, connected to the classification subunit and configured to match the alarm data to be analyzed against the nodes of the alarm knowledge graph according to the third alarm classes, so as to obtain the alarm causal graph.

In a possible implementation, the alarm knowledge graph unit 21 is further configured to:

Construct the alarm knowledge graph according to the alarm node levels and alarm fault types of historical alarm data, wherein the alarm node levels include three layers, namely physical machine (HOST), virtual machine (VM), and software (SOFTWARE), and each node in the alarm knowledge graph corresponds to one HOST alarm fault type, VM alarm fault type, or SOFTWARE alarm fault type.

In a possible implementation, the alarm knowledge graph unit 21 specifically includes:

A classification subunit, configured to divide the historical alarm data into different second alarm classes according to alarm node level and alarm fault type, the second alarm classes including HOST alarm fault types, VM alarm fault types, and SOFTWARE alarm fault types;

A node arrangement and connection subunit, connected to the classification subunit and configured to: take each second alarm class as one node of the alarm knowledge graph; take the HOST alarm fault types as cause nodes of the VM alarm fault types and the VM alarm fault types as cause nodes of the SOFTWARE alarm fault types; and arrange and connect all second alarm classes in layers;

A causal edge weight acquisition subunit, connected to the node arrangement and connection subunit and configured to calculate, for each pair of causal nodes, the ratio of the number of times both nodes of the pair raised historical alarms simultaneously to the total number of times the cause node of the pair raised historical alarms, and to use this ratio as the causal edge weight of the connecting line between the pair of causal nodes.
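The causal edge weight defined above is a simple ratio, which the following Python sketch computes from historical data. It assumes that "simultaneously" means "within the same historical time window" and that each window is available as a set of alarm classes; both are illustrative assumptions, as are the function and class names.

```python
def causal_edge_weights(windows, cause_effect_pairs):
    """Weight of (cause, effect) = windows with both / windows with the cause."""
    weights = {}
    for cause, effect in cause_effect_pairs:
        cause_count = sum(1 for w in windows if cause in w)
        both_count = sum(1 for w in windows if cause in w and effect in w)
        # A cause that never fired gets weight 0.0 (assumption).
        weights[(cause, effect)] = both_count / cause_count if cause_count else 0.0
    return weights


# Hypothetical history: the alarm classes seen in four time windows.
windows = [
    {"host_disk_usage_high", "vm_io_wait_high"},
    {"host_disk_usage_high"},
    {"host_disk_usage_high", "vm_io_wait_high"},
    {"vm_io_wait_high"},
]
pairs = [("host_disk_usage_high", "vm_io_wait_high")]
weights = causal_edge_weights(windows, pairs)
```

In this history the host alarm fired in three windows and was accompanied by the VM alarm in two of them, so the edge weight is 2/3.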

In a possible implementation, the alarm data to be analyzed includes multiple pieces of real-time alarm data of the cloud network platform within the current time window, obtained by setting a time window and a sliding step; the cloud network platform includes multiple systems, and each system includes the three layers HOST, VM, and SOFTWARE;

The second matching subunit is specifically configured to:

Query the software and hardware knowledge graph of the cloud network platform according to the alarm nodes of the alarm data to be analyzed; divide all third alarm classes by cloud network platform system according to the system to which each alarm node belongs in the software and hardware knowledge graph; and match the third alarm classes assigned to the different systems of the cloud network platform against the nodes of the alarm knowledge graph respectively, so as to obtain the alarm causal graphs hit by the alarm data to be analyzed in the different systems of the cloud network platform.

In a possible implementation, the third analysis module 3 is specifically configured to:

In response to a plurality of second root cause alarms being determined for a certain cloud network platform system, determine the alarm data to be analyzed that corresponds to both a first root cause alarm and one of the plurality of second root cause alarms as the final root cause alarm of that cloud network platform system, and obtain the root cause alarm path corresponding to the final root cause alarm.

Embodiment 3:

Embodiment 3 of the present invention provides a computer-readable storage medium in which a computer program is stored; when the computer program is run by a processor, the alarm root cause positioning method described in Embodiment 1 is implemented.

The computer-readable storage medium includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data). Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.

In addition, the present invention may also provide a computer device including a memory and a processor, wherein a computer program is stored in the memory, and when the processor runs the computer program stored in the memory, the processor executes the alarm root cause positioning method described in Embodiment 1.

The memory is connected to the processor; the memory may be flash memory, read-only memory, or another type of memory, and the processor may be a central processing unit or a microcontroller.

Embodiments 1 to 3 of the present invention provide an alarm root cause positioning method, an alarm root cause positioning device, and a computer-readable storage medium. The first root cause alarm information of the alarm data to be analyzed, obtained based on alarm association rules, is combined with the second root cause alarm information of the alarm data to be analyzed, obtained based on the alarm knowledge graph, to obtain the final root cause alarm information. In the process of obtaining the second root cause alarm information based on the alarm knowledge graph, the causal relationships and the causal edge weights between cause nodes and effect nodes in the alarm knowledge graph are jointly evaluated to obtain the second root cause alarm information. By adding alarm knowledge graph analysis on top of association rule mining, fuller use is made of the available information, which improves the accuracy of root cause positioning and the robustness of the scheme.

It can be understood that the above implementations are merely exemplary implementations adopted to illustrate the principles of the present invention, and the present invention is not limited thereto. For those of ordinary skill in the art, various modifications and improvements can be made without departing from the spirit and essence of the present invention, and these modifications and improvements are also regarded as falling within the protection scope of the present invention.

Claims (13)

Application number: CN202311286937.5A
Priority date: 2023-10-07
Filing date: 2023-10-07
Title: Alarm root cause positioning method, device and medium
Legal status: Pending
Publication: CN117221087A (en), published 2023-12-12
Family ID: 89035277
Country status: CN (1), CN117221087A (en)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
