WO2015090098A1 - Method and apparatus for realizing fault location - Google Patents

Method and apparatus for realizing fault location

Info

Publication number
WO2015090098A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
conduction chain
conduction
current
monitoring
Prior art date
Application number
PCT/CN2014/087332
Other languages
French (fr)
Chinese (zh)
Inventor
郭宪杰
申山宏
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Priority to CN201480057055.4A (patent CN105659528B)
Publication of WO2015090098A1


Abstract

Disclosed are a method and apparatus for realizing fault location, comprising: acquiring current fault information; establishing, according to the acquired current fault information, a conduction chain set of all monitored objects for all fault types within predetermined time windows of different time points; analysing the relevance among the conduction chains in the conduction chain set to acquire a fault object conduction chain of all the monitored objects with respect to different fault types; and locating a fault object and a fault type according to the fault object conduction chain. The method locates root faults rapidly and accurately, dispatches fault orders efficiently, and improves the efficiency of daily network maintenance and of the fault order dispatch process.

Description

Method and device for realizing fault location
Technical field
The present invention relates to network management technologies, and more particularly to a method and apparatus for implementing fault location.
Background
Existing network management systems are used to manage monitored objects. The parameters of each monitored object, including its name identifier and connection relationships, are usually configured through a network configuration function. For example, the monitored objects may be one switch and four computers, with the switch connected to the four computers. Once this configuration data exists, the management system knows each of its objects, and it usually identifies monitored objects by their identifier names, such as Switcher100, Computer100, Computer101, Computer102 and Computer103.
When the monitoring result for a monitored object reaches a fault threshold, it is usually reported to maintenance personnel. For example, if an alarm is required when CPU utilization reaches 96% or more, the monitored object then sends a message to the monitor (the network management system). The message includes information such as the object type, object identifier, monitored indicator, current indicator value and alarm name, for example: Computer, ID=100, CPU, 98%, computer CPU utilization is too high. From the network management system's point of view, all of this alarm data is reported by the monitored objects, and the message type can be customized.
After alarm data is reported by a monitored object, the message type, message object and object identifier are obtained according to the interface definition. For example, on receiving the message "Computer, ID=100, CPU, 98%, computer CPU utilization is too high" mentioned above, the system knows that Computer100 is in an abnormal state.
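The message layout described above can be illustrated with a small parser. This is a minimal sketch, assuming a five-field comma-separated layout taken from the example message. The `Alarm` class, its field names and the `parse_alarm` function are hypothetical names introduced for illustration; the source only states that the message type is customizable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    object_type: str  # e.g. "Computer"
    object_id: str    # e.g. "100"
    indicator: str    # monitored indicator, e.g. "CPU"
    value: str        # current indicator value, e.g. "98%"
    name: str         # alarm name (free text)

    @property
    def object_name(self) -> str:
        # Identifier name by which the monitored object is recognised.
        return f"{self.object_type}{self.object_id}"


def parse_alarm(message: str) -> Alarm:
    """Parse 'Type,ID=n,Indicator,Value,Alarm name' (assumed layout)."""
    # Accept both ASCII and full-width commas as field separators.
    fields = [f.strip() for f in message.replace(",", ",").split(",", 4)]
    if len(fields) != 5 or not fields[1].startswith("ID="):
        raise ValueError(f"unrecognised alarm message: {message!r}")
    object_type, id_field, indicator, value, name = fields
    return Alarm(object_type, id_field[len("ID="):], indicator, value, name)


alarm = parse_alarm("Computer,ID=100,CPU,98%,computer CPU utilization is too high")
print(alarm.object_name)
```

Joining the object type and identifier gives the name (here Computer100) under which the configuration data knows the object.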
In a complex real network, one fault can cause further monitored objects to fail. Typically, after a power failure none of the monitored objects may work properly, and an interrupted transmission line blocks communication across a whole area. Hundreds of alarms may be reported within one or two minutes. If the root alarm among the reported alarm data can be located quickly and repaired first, the other alarms may clear automatically. How to quickly locate the root alarm data is the focus of prior-art analysis. It usually relies on the connection relationships between monitored network objects (for example, Switcher100 is connected to four computers including Computer100) and on causal relationships between services (for example, power failure and low voltage have a sequential or causal relationship). These connection and causal relationships are summarized into an alarm knowledge base or empirical rules, and the existing alarm knowledge base or alarm experience rules are then used to locate and analyze faults in the alarm data.
Using an existing alarm knowledge base or alarm experience rules to locate and analyze faults in alarm data is the main method of existing network maintenance. However, applying the existing method to monitoring of the entire network produces massive amounts of alarm data, and alarm correlation analysis across network devices and across management systems is very difficult. In particular, periodic network construction and continuous daily maintenance keep the network in a constant state of dynamic change, and dynamic network configuration changes make a priori alarm experience rules highly inaccurate. As a result, root faults cannot be located quickly and accurately, and the efficiency of daily network maintenance and of work-order dispatch cannot be improved.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a method and apparatus for implementing fault location that can locate root faults quickly and accurately and improve the efficiency of daily network maintenance and fault order dispatch.
To achieve the above object, an embodiment of the present invention discloses a method for implementing fault location, including:
acquiring current fault information, where the current fault information includes at least a monitored object, a fault type and time information;
according to the acquired current fault information, obtaining the conduction chain of the monitored object for the current fault type within a predetermined time window of the current time point, and, according to historical fault information for the current fault of the monitored object, establishing a conduction chain set of the current monitored object for the current fault type within predetermined time windows of different time points;
analyzing the correlation between the conduction chains in the established conduction chain set to obtain the fault object conduction chains of all monitored objects for different fault types; and
locating the current fault object and fault type according to the obtained fault object conduction chain.
Optionally, analyzing the correlation between the conduction chains in the established conduction chain set includes analyzing the correlation of object faults between the conduction chains to obtain the fault object conduction chains of all monitored objects for different fault types.
Optionally, the method further has the following feature: a fault metabase is established according to the obtained historical fault information.
Optionally, the method further has the following feature: before the conduction chain set is established, the method further includes determining whether the current fault of the monitored object exists in the historical fault information.
Optionally, the method further has the following feature: analyzing the correlation between the conduction chains in the conduction chain set to obtain the fault object conduction chains of all monitored objects for different fault types includes:
obtaining the number of times each fault type occurs on each monitored object in the conduction chain set, calculating the ratio of that number to the total number of faults of all monitored objects, and taking the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
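The counting and thresholding just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: each conduction chain is assumed to be a list of `(object, fault_type)` pairs, and the function name `fault_object_chain`, the sample object names and the threshold value 0.2 are all hypothetical.

```python
from collections import Counter


def fault_object_chain(chain_set, threshold):
    """Return (object, fault_type) pairs whose share of all faults exceeds threshold."""
    # Count how often each fault type occurs on each monitored object.
    counts = Counter(node for chain in chain_set for node in chain)
    # Total number of faults of all monitored objects in the chain set.
    total = sum(counts.values())
    if total == 0:
        return []
    # Keep only the object faults whose ratio is greater than the threshold.
    return [(node, n / total) for node, n in counts.items() if n / total > threshold]


# Three hypothetical conduction chains recorded for the same triggering fault.
chains = [
    [("Generator1", "low_voltage"), ("Switcher100", "link_down"), ("Computer100", "offline")],
    [("Generator1", "low_voltage"), ("Switcher100", "link_down")],
    [("Generator1", "low_voltage"), ("Switcher100", "link_down"), ("Computer101", "cpu_high")],
]
strong = fault_object_chain(chains, threshold=0.2)
print(strong)
```

With these sample chains there are eight faults in total, so the two object faults that occur three times each have a ratio of 0.375 and pass the assumed threshold, while the two that occur once (0.125) do not.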
Optionally, the method further has the following feature: when no historical fault information exists for the current fault of the monitored object, the method further includes:
analyzing the conduction chain of the monitored object for the current fault type within the predetermined time window of the current time point to obtain the fault object conduction chains of all monitored objects in that conduction chain for different fault types, including:
obtaining the number of times each fault type occurs on each monitored object in the current conduction chain, calculating the ratio of that number to the total number of faults of all monitored objects in the current conduction chain, and taking the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
Optionally, the method further has the following feature: after the fault object conduction chains of all monitored objects for different fault types are obtained, the method further includes:
obtaining fault conduction chains for different monitored objects according to the fault object conduction chain, and locating the fault object and fault type according to those fault conduction chains; or
obtaining object conduction chains for different fault types according to the fault object conduction chain, and locating the fault object and fault type according to those object conduction chains.
An embodiment of the present invention further discloses an apparatus for implementing fault location, including:
a receiving module, configured to obtain current fault information, where the current fault information includes at least a monitored object, a fault type and time information;
a first establishing module, configured to obtain, according to the obtained current fault information, the conduction chain of the monitored object for the current fault type within a predetermined time window of the current time point, and to determine whether historical fault information exists for the current fault of the monitored object;
when historical fault information exists for the current fault of the monitored object, the first establishing module establishes, according to the historical fault information, a conduction chain set of the current monitored object for the current fault type within predetermined time windows of different time points, and sends a first notification to a second establishing module;
the second establishing module, configured to analyze the correlation between the conduction chains in the conduction chain set established by the first establishing module, obtain the fault object conduction chains of all monitored objects for all fault types, and output them to a positioning module; and
the positioning module, configured to locate the fault object and fault type according to the fault object conduction chain from the second establishing module.
Optionally, the second establishing module is configured to analyze the correlation of object faults between the conduction chains in the conduction chain set established by the first establishing module.
Optionally, the apparatus may further have the following feature: the apparatus further includes a fault metadata establishing module, configured to establish a fault metabase according to the obtained fault information and to pass the fault metabase information to the first establishing module.
Optionally, the apparatus may further have the following feature: the second establishing module is specifically configured to:
on receiving the first notification from the first establishing module, obtain the number of times each fault type occurs on each monitored object in the conduction chain set, calculate the ratio of that number to the total number of faults of all monitored objects, and take the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
Optionally, the apparatus may further have the following feature: the first establishing module is further configured to send a second notification to the second establishing module when no historical fault information exists for the current fault of the monitored object;
the second establishing module is further configured to receive the second notification from the first establishing module, obtain the number of times each fault type occurs on each monitored object in the conduction chain of the monitored object for the current fault type within the predetermined time window of the current time point, calculate the ratio of that number to the total number of faults of all monitored objects in the current conduction chain, and take the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
Optionally, the apparatus may further have the following feature: the positioning module is further configured to:
obtain fault conduction chains for different monitored objects according to the fault object conduction chain, and locate the fault object and fault type according to those fault conduction chains;
or obtain object conduction chains for different fault types according to the fault object conduction chain, and locate the fault object and fault type according to those object conduction chains.
The technical solution of the present application includes: obtaining current fault information, where the current fault information includes a monitored object, a fault type and time information; establishing, according to the obtained current fault information, a conduction chain set of all monitored objects for different fault types within predetermined time windows of different time points; analyzing the correlation between the conduction chains in the established conduction chain set to obtain the fault object conduction chains of all monitored objects for all fault types; and locating the fault object and fault type according to the obtained fault object conduction chain. The technical solution does not need to search one by one for the connection relationships between monitored objects or the causal relationships between fault types, which avoids a high time cost and meets real-time requirements. It judges strong correlation rather than insisting on logical causality, which accommodates the uncertainty that changes may introduce. Processing priority is judged by the level of correlation, in line with the available monitoring and maintenance capability, so that faults are located by more flexible means.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an undue limitation on it. In the drawings:
FIG. 1 is a flowchart of a method for implementing fault location according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for implementing fault location according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for implementing fault location according to an embodiment of the present invention.
Preferred embodiments of the invention
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.
FIG. 1 is a flowchart of a method for implementing fault location according to an embodiment of the present invention. The method includes the following steps:
Step 101: acquire current fault information.
The current fault information includes a monitored object, a fault type and time information.
Optionally, the method further includes establishing a fault metabase.
Specifically, the minimum-granularity monitored objects and fault categories are first identified according to the state of the existing fault information of the entire network, and a basic fault metabase is then established from the minimum-granularity monitored objects and fault types.
For example, monitored objects are the main focus of network management. A monitored object with a minor fault can be repaired, whereas one with a serious fault can only be replaced. Each monitored object usually consists of several different components. From a maintenance perspective, a minimum-granularity monitored object is the smallest unit that can be replaced. Take a switch: if it is a small, highly integrated switch whose ports cannot be replaced individually, the whole switch must be replaced after a serious fault on any port, so the minimum granularity of the monitored object is the switch itself. If it is a large switch whose ports can each be replaced as a component, the minimum granularity is each port under the switch, and the port component can be replaced when the port fails. In that case the minimum-granularity monitored object is the port number under the switch.
The fault metabase keeps growing as the network of monitored objects expands and fault types multiply. Because the number of entries in the fault metabase is limited, entries can be added and never deleted, which keeps the metabase usable for the whole history of monitored faults.
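An add-only fault metabase of this kind can be sketched as a pair of sets. This is a minimal sketch under assumptions: the class name `FaultMetabase`, its methods and the `"Switcher200/port3"` naming scheme for a port-level object are hypothetical, chosen only to show the two granularities from the switch example.

```python
class FaultMetabase:
    """Add-only registry of minimum-granularity monitored objects and fault types."""

    def __init__(self):
        self._objects = set()
        self._fault_types = set()

    def register(self, obj, fault_type):
        """Record an object and a fault type; return True if either one was new."""
        is_new = obj not in self._objects or fault_type not in self._fault_types
        self._objects.add(obj)
        self._fault_types.add(fault_type)
        return is_new

    def knows(self, obj, fault_type):
        """Return True if both the object and the fault type are registered."""
        return obj in self._objects and fault_type in self._fault_types

    # Deliberately no remove method: entries are only ever added.


metabase = FaultMetabase()
# Small integrated switch: the switch itself is the smallest replaceable unit.
metabase.register("Switcher100", "port_failure")
# Large switch with replaceable ports: each port is its own monitored object.
metabase.register("Switcher200/port3", "port_failure")
print(metabase.knows("Switcher200/port3", "port_failure"))
```

Leaving out any delete operation mirrors the add-without-delete policy, so every object and fault type that appears in historical fault records can still be looked up.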
Step 102: obtain the conduction chain of the monitored object for the current fault type within a predetermined time window of the current time point, or establish the conduction chain set of the monitored object's current fault type within predetermined time windows of different time points.
Specifically:
First, the conduction chain of the current monitored object for the current fault type within the predetermined time window of the current time point is obtained. If no historical fault information exists before the current fault information is obtained, the method goes directly to step 103.
Second, if historical fault information already exists before the current fault information is obtained, the conduction chain set of the current monitored object for the current fault type within predetermined time windows of different time points is established according to the historical fault information, and the method then goes to step 103.
Preferably, a conduction chain is defined as the sequence of object faults that can be affected after a given object fault occurs.
Step 103: analyze the established conduction chain, or analyze the correlation between the conduction chains in the established conduction chain set, to obtain the fault object conduction chains of all monitored objects for different fault types.
Specifically:
If historical fault information already exists before the current fault information is obtained, the number of times each fault type occurs on each monitored object in the conduction chain set is obtained, the ratio of that number to the total number of faults of all monitored objects is calculated, and the list of monitored objects whose ratio is greater than a predetermined threshold is taken as the fault object conduction chain; or
if no historical fault information exists before the current fault information is obtained, the number of times each fault type occurs on each monitored object in the current conduction chain is obtained, the ratio of that number to the total number of faults of all monitored objects in the current conduction chain is calculated, and the list of monitored objects whose ratio is greater than a predetermined threshold is taken as the fault object conduction chain.
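The ratio test in step 103 can be written compactly. The notation here is an assumption introduced for illustration: c_{n,m} is the number of times fault type F_m occurs on monitored object O_n, r_{n,m} is the resulting ratio, and θ is the predetermined threshold. None of these symbols appears in the source.

```latex
r_{n,m} = \frac{c_{n,m}}{\sum_{n'} \sum_{m'} c_{n',m'}},
\qquad
\text{fault object conduction chain} = \{\, (O_n, F_m) : r_{n,m} > \theta \,\}
```

When historical fault information exists, the counts are taken over the whole conduction chain set; when it does not, they are taken over the current conduction chain only.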
Step 104: locate the fault object and fault type according to the obtained fault object conduction chain.
Specifically:
Fault conduction chains for different monitored objects are obtained according to the fault object conduction chain, and the fault object and fault type are located according to the fault conduction chains; or
object conduction chains for different fault types are obtained according to the fault object conduction chain, and the fault object and fault type are located according to the object conduction chains.
Here, the initially reported current fault information includes basic information such as the monitored object, fault type and time. This current fault information serves as the basic basis for correlation judgment, and the data comes from the network element objects being monitored. If the initial historical data is empty, every correlation is provisionally set to 100% (strong correlation); because the count is only 1, its credibility and priority are lowered. As historical data accumulates, the correlation becomes increasingly computable.
First, the predetermined threshold can be adjusted in practical applications.
Second, the fault object conduction chain is defined as the set of strongly correlated object faults affected by a fault type of a monitored object.
Third, the fault conduction chain is defined as a finite set of strongly correlated faults; that is, when the fault occurs it easily triggers the other fault types on the chain (possibly on different objects).
Finally, the object conduction chain is defined as a finite set of strongly correlated objects; that is, any fault on the object easily affects the other objects on the chain (possibly with different faults).
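One possible reading of these definitions is that the fault conduction chain and the object conduction chain are two groupings of the same fault object conduction chain: by monitored object and by fault type. The sketch below illustrates that reading only. The grouping rule, the function name `split_views` and the sample data are assumptions, since the source does not spell out how the two chains are derived.

```python
from collections import defaultdict


def split_views(fault_object_chain):
    """Group (object, fault_type) pairs by object and by fault type."""
    faults_by_object = defaultdict(list)  # per object: strongly correlated fault types
    objects_by_fault = defaultdict(list)  # per fault type: strongly correlated objects
    for obj, fault in fault_object_chain:
        if fault not in faults_by_object[obj]:
            faults_by_object[obj].append(fault)
        if obj not in objects_by_fault[fault]:
            objects_by_fault[fault].append(obj)
    return dict(faults_by_object), dict(objects_by_fault)


# Hypothetical fault object conduction chain.
chain = [
    ("Generator1", "low_voltage"),
    ("Switcher100", "link_down"),
    ("Switcher101", "link_down"),
    ("Generator1", "power_off"),
]
by_object, by_fault = split_views(chain)
print(by_object)
print(by_fault)
```

Either view can then be used to pick the object and fault type to handle first, which matches the two alternatives given for step 104.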
When a network management system is used to monitor all monitored objects and fault types across the network, the above method abandons the existing statistics-based analysis. Instead, it works from real-time, dynamic fault information, finds the strong correlations in the spatio-temporal distribution of monitored objects and fault types in the network, and judges strong correlation between fault objects with reference to the correlation of object chains in the historical fault information (including but not limited to monitored objects, line connections, fault times and fault types).
Embodiments of the present invention judge strong correlation rather than insisting on logical causality, which accommodates the uncertainty that changes may introduce. Processing priority is judged by the level of correlation, in line with the available monitoring and maintenance capability, so that fault location is achieved by more flexible means.
FIG. 2 is a detailed flowchart of a method for implementing fault location according to an embodiment of the present invention. The method includes the following steps:
Step 201: obtain current fault information, including basic information such as the monitored object, fault type and time.
Step 202: determine whether historical data exists. If it does, go to step 204; if not, go to step 203.
Step 203: obtain the conduction chain L_ij0 within the predetermined time window W of the current time point T_0, then go to step 205.
Specifically, this includes obtaining the conduction chain L_ij0 of the current monitored object for the current fault within the predetermined time window W of the current time point.
Here, the conduction chain L_ij0 denotes the set, formed in time sequence, of all monitored objects and their fault types that appear within the conduction time W after a given fault occurs.
For example, suppose fault F_j, low output voltage of generator O_i, occurs at 20:03 one evening. The sequence of all fault objects that appear within the following time W can be regarded as the nodes on the fault conduction chain of the object fault (O_i, F_j) at that time point, where W is an empirical constant, usually 3 minutes or 5 minutes. If there is no historical information for the object fault (O_i, F_j), the conduction chain obtained at this point is L_ij0.
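The generator example can be sketched as a time-window filter over reported faults. This is a minimal sketch under assumptions: faults are modelled as `(timestamp, object, fault_type)` tuples, the triggering fault is kept as the first node of its own chain, and the function name `conduction_chain`, the object names and the calendar date are hypothetical.

```python
from datetime import datetime, timedelta


def conduction_chain(events, trigger_time, window):
    """Return the (object, fault_type) pairs seen within `window` after trigger_time."""
    end = trigger_time + window
    # Keep only faults inside the window [trigger_time, trigger_time + W].
    in_window = [e for e in events if trigger_time <= e[0] <= end]
    # Order the chain by time of occurrence.
    in_window.sort(key=lambda e: e[0])
    return [(obj, fault) for _, obj, fault in in_window]


# Illustrative alarm log; the date is arbitrary.
events = [
    (datetime(2014, 1, 1, 20, 3), "Generator1", "low_output_voltage"),
    (datetime(2014, 1, 1, 20, 4), "Switcher100", "power_lost"),
    (datetime(2014, 1, 1, 20, 5, 30), "Computer100", "offline"),
    (datetime(2014, 1, 1, 20, 10), "Computer103", "cpu_high"),
]
W = timedelta(minutes=3)  # empirical window, 3 minutes in this sketch
chain_ij0 = conduction_chain(events, datetime(2014, 1, 1, 20, 3), W)
print(chain_ij0)
```

The fault reported at 20:10 falls outside the three-minute window and is therefore not part of this chain.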
Optionally, the method further includes establishing or updating a fault metabase. The fault metabase includes the monitored objects of minimum granularity and the fault types.
Specifically:
Without any prior knowledge, the monitored objects O_n of minimum granularity and the fault types F_m are identified from the existing fault information of the whole network, and a basic fault metabase is established from them.
The fault metabase keeps growing as the monitored network expands and the set of fault types becomes richer.
The initially reported current fault information includes basic information such as the monitored object, the fault type and the time. This information is the basic input for judging correlation and comes from the network elements of the monitored objects. If the initial historical data is empty, every correlation is provisionally set to a 100% strong correlation; because the count is only 1, its credibility and priority are lowered. As historical data accumulates, the correlation can be computed more and more reliably.
A newly added or changed fault type that is not found in the fault metabase is treated as initial fault information and computed as a strong correlation. Likewise, a newly added monitored object, or a monitored object whose identifier has changed, that is not found in the fault metabase is treated as initial fault information and computed as a strong correlation.
For a monitored object whose identifier has changed, the correlation relationships eventually become the same as those the algorithm produced for the original monitored object.
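The handling of unknown objects and fault types described above might look roughly as follows. This is a hedged sketch: the `FaultMetabase` class, the `initial_correlation` function and the dictionary keys are hypothetical, and the text does not prescribe any particular data structure.

```python
class FaultMetabase:
    """Set of known (monitored object, fault type) pairs of minimum granularity."""

    def __init__(self):
        self.entries = set()

    def register(self, obj, fault):
        """Add the pair and return True if it was not known before."""
        key = (obj, fault)
        is_new = key not in self.entries
        self.entries.add(key)
        return is_new


def initial_correlation(metabase, obj, fault):
    """Treat a pair missing from the metabase as initial fault information.

    A new pair gets a provisional 100% correlation with a count of 1 and a
    lowered priority; a known pair returns None so that the normal
    history-based computation can be used instead.
    """
    if metabase.register(obj, fault):
        return {"correlation": 1.0, "count": 1, "low_priority": True}
    return None
```

An object whose identifier has changed simply starts again as a new entry and accumulates its own history.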
Step 204: based on the historical data, establish the set of conduction chains L_ijk at the time points T_k.
This includes: establishing, according to the historical fault information, the set of conduction chains of the current monitored object for the current fault type within the predetermined time window at different time points.
Specifically, the fault type F_j of each monitored object O_i is analyzed, and the set of conduction chains L_ijk at the time points T_k is established.
The set of conduction chains L_ijk is defined as the collection of the object fault time sequence that appears within the time W after the time point T_k at which fault type F_j of object O_i occurs, together with the object fault time sequences in the historical records from before T_k.
For example, suppose fault F_j (low output voltage) of generator O_i first occurred at 18:01 one evening before the current time T_k. The time sequence of all fault objects that appeared at that moment and within the following time W forms one conduction chain. The same object fault then occurred again at other time points, so that at the current time T_k there are k-1 historical records of this object fault. Together with the fault object time sequence at the current time point T_k, this yields a set of k conduction chains of the current monitored object O_i for the current fault type F_j within the predetermined time window W at different time points, where W is an empirical constant, typically 3 or 5 minutes.
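Building the chain set from a fault history can be sketched as below. The function name `chain_set`, the `(time, object, fault)` tuple layout and the use of plain minutes for timestamps are assumptions made to keep the example short.

```python
def chain_set(events, obj, fault, window=3.0):
    """Return one conduction chain per occurrence of (obj, fault).

    events: iterable of (t, obj, fault) tuples, with t in minutes.
    Each chain lists the (obj, fault) pairs reported within `window`
    minutes after that occurrence, in time order.
    """
    events = sorted(events)
    chains = []
    for t0, o, f in events:
        if (o, f) != (obj, fault):
            continue
        chain = [(o2, f2) for t, o2, f2 in events if t0 < t <= t0 + window]
        chains.append(chain)
    return chains
```

If the fault has occurred k times in total, the result holds k chains: k-1 historical ones followed by the chain for the current occurrence.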
Step 205: analyze the correlation of object faults between the conduction chains in the conduction chain set, or the correlation of object faults within the conduction chain in the predetermined time window at the current time point, and obtain the fault object conduction chains L_ij of all monitored objects for all fault types.
Specifically, when historical fault information already exists before the current fault information is obtained, the correlation of object faults between the conduction chains in the set is judged as follows:
The number of times each monitored object experienced each fault in the conduction chain set is obtained; the ratio of that number to the total number of faults of all monitored objects is computed; and the list of monitored objects whose ratio is greater than a predetermined threshold is taken as the fault object conduction chain. Alternatively:
When no historical fault information exists before the current fault information is obtained, the correlation of object faults within the conduction chain in the predetermined time window at the current time point is judged as follows: the number of times each monitored object experienced each fault in the current conduction chain is obtained; the ratio of that number to the total number of faults of all monitored objects in the current conduction chain is computed; and the list of monitored objects whose ratio is greater than the predetermined threshold is taken as the fault object conduction chain.
The predetermined threshold can be adjusted in practical applications.
For example, first assume that fault type F_j of monitored object O_i has occurred at the current time T_k, and that the set of all fault objects within its conduction time W is established as L_ijk = F(O_i, F_j, T_k), k = 1, 2, …, K-1. The historical data is then analyzed: because fault type F_j of monitored object O_i has already occurred K-1 times before, there are K fault conduction chains in total.
Next, the K-th fault conduction chain contains M_k distinct fault objects. The number of times each of them appears in the set of K-1 historical conduction chains is counted, giving the number of occurrences of each of the M_k fault objects. For normalization, the frequency of occurrence can be computed, that is, the number of occurrences as a percentage of the total.
Finally, a fault object with a frequency of 100% has the highest correlation and is in a strong causal relationship. However, because the chain of fault objects in a real production environment changes when the network changes, an empirical threshold of 90% or more can be used, or the priority order of the fault objects can be determined by sorting the frequencies from high to low. The fault object conduction chain L_ij is defined as the set of strongly correlated object faults affected by fault type F_j of object O_i.
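The frequency analysis in this example can be sketched as follows. Two points are assumptions of the sketch rather than statements of the text: the frequency is normalized by the number of historical chains (so that 100% means "present in every earlier chain"), and the threshold test uses "greater than or equal to". The names `correlate` and `strong_members` are hypothetical.

```python
from collections import Counter


def correlate(history, current):
    """Frequency of each member of the current chain across the historical chains.

    history: the K-1 earlier chains; current: the K-th chain.
    Each chain is a list of (obj, fault) pairs.
    With no history, every member provisionally gets 1.0 (seen once).
    """
    members = set(current)
    if not history:
        return {m: 1.0 for m in members}
    counts = Counter(m for chain in history for m in set(chain) if m in members)
    return {m: counts[m] / len(history) for m in members}


def strong_members(freqs, threshold=0.9):
    """Members at or above the threshold, ordered by descending frequency."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [m for m, f in ranked if f >= threshold]
```

The list returned by `strong_members` corresponds to the fault object conduction chain L_ij; lowering the threshold trades precision for robustness against network changes.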
For example, a complex communication network includes network subsystems such as a wireless base station network, a backbone transmission network, an IT monitoring network, and a power and environment monitoring network. To simplify the networking model, assume that it has three monitored nodes: power supply P1, transmission T1 and base station S1. The three objects are causally related: when the power supply is interrupted, the transmission loses power and the base station is also interrupted and cannot provide service; when the power supply is normal but the transmission is abnormally interrupted, the base station cannot provide service either. That is: P1 --> (T1 --> S1).
After a T1 transmission interruption occurs, many fault reports can be found within its time period W. The S1 base station interruption follows it in the time sequence, and other faults are of course also generated around the same time. A correlation analysis against the conduction chains in the historical data shows that (T1 --> S1) occurs with very high frequency, ideally accompanying the T1 fault 100% of the time, whereas other, randomly occurring faults have a comparatively low frequency.
Similarly, after a P1 power failure occurs, T1 and S1 on its conduction chain also appear later in the time sequence, with very high correlation. (P1 --> T1) and (P1 --> S1) are the conduction chains of power supply P1, and P1 --> (T1 --> S1) is a larger conduction chain.
However, when network expansion or a maintenance change causes transmission T1 to be connected to base station S2 instead of S1, the relationship (T1 --> S1) no longer appears and (T1 --> S2) is the new conduction relationship. Because no historical data exists for this relationship at first, it is treated as a strong association that has occurred only once (initially, every relationship that has occurred only once is treated as a 100% strong association, but with lowered priority). (P1 --> T1) and (P1 --> S2) are then the conduction chains of power supply P1, and once the relationship has occurred a second time or more, its priority can be raised.
Step 206: according to the fault object conduction chain L_ij, find the root-cause fault on the fault object conduction chain and locate the monitored object and the fault type.
The above method can generate a strongly associated spanning tree based on monitored objects and fault types. After a fault occurs, all monitored alarms can be presented automatically on the time axis, strongly associated according to the conduction chain L_ij. This presentation helps users analyze and locate faults, makes it easier to issue a single work order for a class of on-site problems, and, combined with historical data, simplifies troubleshooting and improves efficiency.
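The text does not spell out how the root-cause fault is picked from the conduction chains. One plausible reading, sketched here purely as an assumption, is that a root cause is an active fault that does not appear in the conduction chain of any other active fault. The function name `root_causes` and the fault labels are hypothetical.

```python
def root_causes(chains, active):
    """Return the active faults that no other active fault strongly affects.

    chains: {(obj, fault): set of (obj, fault) pairs it strongly affects}
    active: the (obj, fault) pairs currently reported.
    Note: if two active faults affect each other, this sketch finds no root.
    """
    active = set(active)
    affected = set()
    for f in active:
        affected |= chains.get(f, set()) & active
    return sorted(f for f in active if f not in affected)
```

For the P1 --> (T1 --> S1) example, a simultaneous outage of all three nodes points to P1, while an outage of only T1 and S1 points to T1.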
Step 207: on the basis of step 205, the method may further include:
Obtaining, according to the fault object conduction chain L_ij, an object conduction chain L_i across the different fault types, and locating the fault object and the fault type according to the object conduction chain L_i, where:
The object conduction chain L_i is defined as the finite set of objects strongly correlated with object O_i; that is, any fault occurring on this object readily affects the other objects on the chain, possibly causing different faults there.
The object conduction chain L_i is determined as follows:
An object O_i detects multiple fault types, and for each fault type F_j a conduction chain L_ij (j = 1…m) can be computed; each conduction chain contains the affected monitored objects and the faults they detect. Across the object fault sets of these conduction chains, the frequency of every object fault that appears in the sets is computed to judge the correlation between the conduction chains, in the same way as described above.
For example, for multiple boards in a chassis, a severe communication fault detected on the chassis affects the communication capability of each board. Relationships like this, which depend little on the fault type and in which the objects have a parent-child relationship, can be discovered and mined through the object conduction chain, and during fault recovery the parent fault node at the root of the conduction chain can be checked first.
Strongly correlated objects can be grouped into one large object package. Faults in the object package can be assigned to a single on-site repair team, and for strongly correlated faults in the package the fault node at the root of the conduction chain can be checked first. Alternatively:
Step 208: according to the fault object conduction chain L_ij, obtain a fault conduction chain L_j across the different monitored objects, and locate the fault object and the fault type according to the fault conduction chain L_j, where:
The fault conduction chain L_j is defined as the finite set of faults strongly correlated with fault F_j; that is, when this fault occurs it readily triggers the other fault types on the chain, possibly on different monitored objects.
The fault conduction chain L_j is determined as follows: a fault F_j may be detected on multiple objects, and for each fault type F_j a conduction chain L_ij (i = 1…n) can likewise be generated for each object O_i on which it occurs; each conduction chain contains the affected objects and the faults they detect. Across the object fault sets of these conduction chains, the frequency of every object fault that appears in the sets is computed to judge the correlation between the conduction chains, in the same way as described above.
For example, in communication between the upper and lower layers of a protocol stack, lower-layer communication often affects upper-layer communication. When protocol stacks at different layers are monitored, a fault in the lower-layer stack affects the function of the upper-layer stack. Relationships like this, which depend little on the objects themselves and in which the objects are linked by a strong logical association, can be discovered and mined through the fault conduction chain, and during fault recovery the fault node at the root of the conduction chain can be checked first.
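Both the object conduction chain L_i (step 207) and the fault conduction chain L_j (step 208) are described as a frequency analysis across several L_ij chains. A minimal sketch of that shared idea is shown below; the function names, the 90% default threshold and the normalization by the number of chains are assumptions of the sketch.

```python
from collections import Counter


def merge_chains(chains, pick, threshold=0.9):
    """Values (picked from each chain member) present in enough of the chains.

    chains: list of chains, each a list of (obj, fault) pairs.
    pick:   function that selects the object or the fault from a pair.
    """
    if not chains:
        return []
    counts = Counter(v for chain in chains for v in {pick(m) for m in chain})
    return sorted(v for v, c in counts.items() if c / len(chains) >= threshold)


def object_chain(chains_for_object, threshold=0.9):
    """L_i: objects affected whatever fault type occurs on object O_i."""
    return merge_chains(chains_for_object, lambda m: m[0], threshold)


def fault_chain(chains_for_fault, threshold=0.9):
    """L_j: fault types triggered whatever object fault F_j occurs on."""
    return merge_chains(chains_for_fault, lambda m: m[1], threshold)
```

In the chassis example, `object_chain` would be fed the chains L_ij of one chassis for j = 1…m and would return the boards that fail regardless of the fault type; in the protocol stack example, `fault_chain` would be fed the chains of one lower-layer fault for i = 1…n.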
FIG. 3 is a schematic structural diagram of an apparatus for implementing fault location according to an embodiment of the present invention. The apparatus includes a receiving module (30), a fault metabase establishing module (31), a first establishing module (32), a second establishing module (33) and a positioning module (34).
The receiving module is configured to obtain the current fault information, which includes at least the monitored object, the fault type and time information.
The first establishing module is configured to obtain, according to the obtained current fault information, the conduction chain of the monitored object for the current fault type within the predetermined time window at the current time point, and to determine whether the current fault information exists in the historical fault information.
The first establishing module is further configured to, when it determines that the current fault information exists in the historical fault information, establish from the historical fault information the set of conduction chains of the current monitored object for the current fault type within the predetermined time window at different time points, and send a first notification to the second establishing module.
Optionally, the first establishing module is further configured to send a second notification to the second establishing module when it determines that no historical fault information exists before the current fault information is obtained.
The second establishing module is configured to analyze the object faults in the conduction chain, obtained by the first establishing module, of the monitored object for the current fault type within the predetermined time window at the current time point, or to analyze the correlation of object faults between the conduction chains in the conduction chain set established by the first establishing module, so as to obtain the fault object conduction chains of all monitored objects for all fault types and output them to the positioning module.
Optionally, the second establishing module is specifically configured to: on receiving the first notification from the first establishing module, obtain the number of times each monitored object experienced each fault in the conduction chain set, compute the ratio of that number to the total number of faults of all monitored objects, and take the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
Optionally, the second establishing module is further configured to: on receiving the second notification from the first establishing module, obtain the number of times each monitored object experienced each fault in the conduction chain of the monitored object for the current fault type within the predetermined time window at the current time point, compute the ratio of that number to the total number of faults of all monitored objects in the current conduction chain, and take the list of monitored objects whose ratio is greater than the predetermined threshold as the fault object conduction chain.
The positioning module is configured to locate the fault object and the fault type according to the fault object conduction chain from the second establishing module.
Optionally, the positioning module is further configured to:
obtain, according to the fault object conduction chain, fault conduction chains for different monitored objects, and locate the fault object and the fault type according to the obtained fault conduction chains; or obtain, according to the fault object conduction chain, object conduction chains for different fault types, and locate the fault object and the fault type according to the obtained object conduction chains.
Optionally, the apparatus further includes the fault metabase establishing module, configured to establish the fault metabase according to the obtained fault information and to pass the fault metabase information to the first establishing module.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented as a computer program flow. The computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, apparatus or device); when executed, it performs one of the steps of the method embodiments or a combination of them.
Optionally, all or some of the steps of the above embodiments can also be implemented with integrated circuits. The steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The apparatuses, functional modules and functional units in the above embodiments can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices.
When the apparatuses, functional modules and functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc or the like.
Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.
Industrial applicability
The embodiments of the present invention disclose a method and an apparatus for implementing fault location, including: obtaining current fault information; establishing, according to the obtained current fault information, a set of conduction chains of all monitored objects for all fault types within the predetermined time window at different time points; analyzing the correlation between the conduction chains in the conduction chain set to obtain the fault object conduction chains of all monitored objects for different fault types; and locating the fault object and the fault type according to the fault object conduction chains. This enables fast and accurate location of root-cause faults and efficient dispatching of work orders, and improves the efficiency of daily network maintenance and of the fault dispatching process.

Claims (14)

  1. The method according to claim 1, wherein before the conduction chain set is established, the method further includes: determining whether the current fault information exists in the historical fault information; and wherein analyzing the correlation between the conduction chains in the established conduction chain set includes analyzing the correlation of object faults between the conduction chains in the conduction chain set to obtain the fault object conduction chains of all monitored objects for different fault types.
  2. The second establishing module is further configured to: on receiving the second notification from the first establishing module, obtain the number of times each monitored object experienced each fault in the conduction chain of the monitored object for the current fault type within the predetermined time window at the current time point, compute the ratio of that number to the total number of faults of all monitored objects in the current conduction chain, and take the list of monitored objects whose ratio is greater than a predetermined threshold as the fault object conduction chain.
PCT/CN2014/087332 | 2013-12-20 | 2014-09-24 | Method and apparatus for realizing fault location | WO2015090098A1 (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201480057055.4A, CN105659528B (en) | 2013-12-20 | 2014-09-24 | A kind of method and device for realizing fault location

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
CN201310711392.8A, CN104734871A (en) | 2013-12-20 | 2013-12-20 | Method and device for positioning failures
CN201310711392.8 | 2013-12-20

Publications (1)

Publication Number | Publication Date
WO2015090098A1 (en) | 2015-06-25

Family

ID=53402074

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
PCT/CN2014/087332, WO2015090098A1 (en) | Method and apparatus for realizing fault location | 2013-12-20 | 2014-09-24

Country Status (2)

Country | Link
CN (2) | CN104734871A (en)
WO (1) | WO2015090098A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108306747A (en) * | 2017-01-11 | 2018-07-20 | 阿里巴巴集团控股有限公司 | A kind of cloud security detection method, device and electronic equipment
CN111327443A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | A method and device for determining a fault root cause index
CN115988551A (en) * | 2022-12-19 | 2023-04-18 | 南京濠暻通讯科技有限公司 | O-RAN wireless unit fault management method based on ZYNQ

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10275300B2 (en) * | 2015-10-27 | 2019-04-30 | Oracle International Corporation | Systems and methods for prioritizing a support bundle
WO2018010176A1 (en) * | 2016-07-15 | 2018-01-18 | 华为技术有限公司 | Method and device for acquiring fault information
CN106294076B (en) * | 2016-08-24 | 2019-03-15 | 浪潮(北京)电子信息产业有限公司 | A server-related fault prediction method and system thereof
CN108880838B (en) * | 2017-05-10 | 2021-11-09 | 阿里巴巴集团控股有限公司 | Service fault monitoring method and device, computer equipment and readable medium
WO2019006654A1 (en) * | 2017-07-04 | 2019-01-10 | 深圳怡化电脑股份有限公司 | Financial self-service equipment maintenance dispatch generation method, hand-held terminal and electronic device
CN109936470A (en) * | 2017-12-18 | 2019-06-25 | 中国电子科技集团公司第十五研究所 | An anomaly detection method
CN108229613A (en) * | 2017-12-30 | 2018-06-29 | 武汉凌科通光电科技有限公司 | Opto-electronic device Fault Locating Method and system
CN110611604A (en) * | 2019-09-19 | 2019-12-24 | 国家电网有限公司 | Local area network equipment evaluation processing method and device
CN111739188B (en) * | 2019-10-11 | 2022-02-01 | 北京京东乾石科技有限公司 | AGV fault growth rate determination method and apparatus
CN110635960A (en) * | 2019-11-11 | 2019-12-31 | 国家电网有限公司 | Method and device for upgrading communication equipment
CN111143101B (en) * | 2019-12-12 | 2023-07-07 | 东软集团股份有限公司 | Method, device, storage medium and electronic equipment for determining fault source
CN113839804B (en) * | 2020-06-24 | 2023-03-10 | 华为技术有限公司 | Network fault determination method and network equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1713591A (en) * | 2004-06-22 | 2005-12-28 | 中兴通讯股份有限公司 | Alarm correlation analysis of light synchronous transmitting net
CN101442762A (en) * | 2008-12-29 | 2009-05-27 | 中国移动通信集团北京有限公司 | Method and apparatus for analyzing network performance and locating network fault
CN102158360A (en) * | 2011-04-01 | 2011-08-17 | 华中科技大学 | Network fault self-diagnosis method based on causal relationship positioning of time factors
US20120005532A1 (en) * | 2010-07-02 | 2012-01-05 | Oracle International Corporation | Method and apparatus for determining ranked causal paths for faults in a complex multi-host system with probabilistic inference in a time series

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101252477B (en) * | 2008-03-27 | 2010-12-22 | 杭州华三通信技术有限公司 | Determining method and analyzing apparatus of network fault root
CN101854277B (en) * | 2010-06-12 | 2012-04-25 | 河北全通通信有限公司 | Monitoring method for mobile communication operation analysis system
CN103001811B (en) * | 2012-12-31 | 2016-01-06 | 北京启明星辰信息技术股份有限公司 | Fault locating method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108306747A (en) * | 2017-01-11 | 2018-07-20 | 阿里巴巴集团控股有限公司 | A kind of cloud security detection method, device and electronic equipment
CN111327443A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | A method and device for determining a fault root cause index
CN111327443B (en) * | 2018-12-17 | 2022-11-22 | 中国移动通信集团北京有限公司 | Fault root index determination method and device
CN115988551A (en) * | 2022-12-19 | 2023-04-18 | 南京濠暻通讯科技有限公司 | O-RAN wireless unit fault management method based on ZYNQ
CN115988551B (en) * | 2022-12-19 | 2023-09-08 | 南京濠暻通讯科技有限公司 | O-RAN wireless unit fault management method based on ZYNQ

Also Published As

Publication number | Publication date
CN105659528A (en) | 2016-06-08
CN104734871A (en) | 2015-06-24
CN105659528B (en) | 2019-10-08

Legal Events

Date | Code | Title | Description
- | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14872383; Country of ref document: EP; Kind code of ref document: A1
- | NENP | Non-entry into the national phase | Ref country code: DE
- | 122 | Ep: PCT application non-entry in European phase | Ref document number: 14872383; Country of ref document: EP; Kind code of ref document: A1
