



Technical Field
The invention belongs to the technical field of storage system operation and maintenance, and in particular relates to an intelligent method for detecting and recovering from I/O sub-health in a distributed storage system, and to a related intelligent detection and recovery system.
Background
It has been found that, while a distributed storage system is running, a storage system in an I/O (input/output) sub-health state can in most cases still provide I/O read and write services. If the operation and maintenance system fails to detect the sub-health state in time, the fault spreads and can lead to serious consequences such as interruption of storage services.
Distributed Asynchronous Object Storage (DAOS) is the foundation of the exascale storage stack built by Intel. Specifically, DAOS is an open-source, software-defined, scale-out object store that provides storage containers with high bandwidth, low latency, and high IOPS for high-performance computing applications. As a lightweight system, DAOS runs end to end in user space and can bypass the operating system entirely. Rather than carrying over an I/O model designed for high-latency block storage, DAOS adopts an I/O model with native support for fine-grained data access, which unlocks the performance of next-generation storage technologies and supports data-centric workflows. However, the current DAOS system has no effective mechanism for detecting and handling the I/O sub-health state of the storage system, so the system runs at relatively high risk.
No ideal solution to the above problem has yet been proposed.
Summary of the Invention
To address the problem that the DAOS system has no effective mechanism for detecting and handling the I/O sub-health state of the storage system, and therefore runs at relatively high risk, a solution is provided.
I/O statistics points are set on the critical services along the I/O path of the storage system to collect abnormal I/O data, yielding for each service a count of I/O failures and a count of I/Os that exceeded the latency threshold. A sub-health detection algorithm then infers which critical service or node is in an I/O sub-health state (I/O timeout, I/O failure, or I/O hang), and an automated decision selects self-healing or isolation to handle the fault, avoiding fault propagation and service interruption.
Specifically, in a first aspect, the invention provides an intelligent method for detecting and recovering from I/O sub-health in a distributed storage system, the method comprising:
S1: setting I/O statistics points on the critical services of each node along the I/O path of the storage system;
S2: collecting and recording abnormal I/O data at the I/O statistics points;
S3: obtaining, for each I/O statistics point, the categories of abnormal I/O data recorded and the occurrence count of each category;
S4: inferring, from the data obtained in the previous step and by means of a sub-health detection algorithm, which critical services or nodes are in an I/O sub-health state;
S5: handling the critical services or nodes in an I/O sub-health state by self-healing or isolation, so as to avoid fault propagation and service interruption.
Further, according to some embodiments of the invention, the abnormal I/O data in step S2 of the method covers the following three scenarios:
I/O timeout: the I/O latency exceeds a threshold of T seconds (latency requirements differ between services, so T depends on the specific service, for example 800 ms to 3 s);
I/O failure: the I/O returns an error;
I/O hang: the I/O blocks and does not return.
Further, according to some embodiments of the invention, the sub-health detection algorithm in step S4 of the method includes:
(1) I/O anomaly counting method
M consecutive cycles are checked; if abnormal data is detected in N of those cycles, an I/O anomaly is declared, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
I/O is initiated by the client. Along the critical service path across the nodes of the I/O path, an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges the lower-layer service to be abnormal and the lower-layer service does not in turn identify a service further down as abnormal, the I/O anomaly is attributed to that lower-layer service;
when the upper-layer service judges the lower-layer service to be abnormal, the lower-layer service in turn identifies the next service down as abnormal, and that service does not identify anything further down, the I/O anomaly is attributed to that next service down;
and so on: identification proceeds layer by layer according to the above logic until all nodes and critical services with I/O anomalies have been confirmed.
Further, according to some embodiments of the invention, in step S5 of the method, the recovery mode used to handle the critical services or nodes in an I/O sub-health state is selected according to the following decision logic:
(1) if it is finally inferred that only one service on only one node has an I/O anomaly, the service is recovered by self-healing and an alarm is reported; if an I/O anomaly is detected on that service again within a unit time S1, the service is isolated and an alarm is reported;
(2) if it is finally inferred that two or more services on only one node have I/O anomalies, a node-level I/O anomaly is inferred, the node is recovered by restarting it (self-healing), and an alarm is reported; if an I/O anomaly is detected on that node again within a unit time S2, the node is isolated and an alarm is reported;
(3) if it is finally inferred that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention is requested for recovery.
Further, in the above method, the unit time S1 is 12 to 48 hours and the unit time S2 is 3 to 15 days (these unit times are distinct from steps S1 and S2 above).
In a second aspect, the invention further provides an intelligent system for detecting and recovering from I/O sub-health in a distributed storage system, the system comprising:
an abnormal I/O data collection module, configured to collect the abnormal I/O data recorded at each I/O statistics point;
an abnormal I/O data analysis module, configured to analyze the categories of abnormal I/O data and count the occurrences of each category;
a detection algorithm module with a built-in sub-health detection algorithm, configured to infer which critical services or nodes are in an I/O sub-health state;
a recovery decision module with built-in decision logic, configured to select the applicable recovery mode.
Further, according to some embodiments of the invention, the sub-health detection algorithm in the system includes:
(1) I/O anomaly counting method
M consecutive cycles are checked; if abnormal data is detected in N of those cycles, an I/O anomaly is declared, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
I/O is initiated by the client. Along the critical service path across the nodes of the I/O path, an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges the lower-layer service to be abnormal and the lower-layer service does not in turn identify a service further down as abnormal, the I/O anomaly is attributed to that lower-layer service;
when the upper-layer service judges the lower-layer service to be abnormal, the lower-layer service in turn identifies the next service down as abnormal, and that service does not identify anything further down, the I/O anomaly is attributed to that next service down;
and so on: identification proceeds layer by layer according to the above logic until all nodes and critical services with I/O anomalies have been confirmed.
Further, according to some embodiments of the invention, the decision logic in the system includes:
(1) if it is finally inferred that only one service on only one node has an I/O anomaly, the service is recovered by self-healing and an alarm is reported; if an I/O anomaly is detected on that service again within a unit time S1, the service is isolated and an alarm is reported;
(2) if it is finally inferred that two or more services on only one node have I/O anomalies, a node-level I/O anomaly is inferred, the node is recovered by restarting it (self-healing), and an alarm is reported; if an I/O anomaly is detected on that node again within a unit time S2, the node is isolated and an alarm is reported;
(3) if it is finally inferred that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention is requested for recovery.
In a third aspect, the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method for detecting and recovering from I/O sub-health in a distributed storage system.
In summary, the method of the invention has the following characteristics:
(1) Using I/O statistics points embedded in the storage system, the method periodically collects abnormal I/O data on the critical services, automatically infers which services have I/O anomalies by means of the sub-health detection algorithm, and then selects self-healing or isolation to handle the anomaly, which effectively prevents the fault from spreading.
(2) With its built-in sub-health detection algorithm and recovery decision logic, the method detects and recovers from I/O anomalies automatically. It identifies critical services in an I/O sub-health state without human intervention and restores the system by self-healing or isolation, eliminating risk promptly, avoiding service interruption, and improving the stability and safety of system operation.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some of the embodiments of the invention, not all of them; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the monitor cluster structure according to an embodiment of the invention.
Fig. 2 is a schematic flow chart of the I/O anomaly inference method according to an embodiment of the invention.
Note on Fig. 2: service A1 identifies service B2 as abnormal, and services A2 and A3 identify service B1 as abnormal. The services at layer B then make their own identifications: services B1 and B2 both identify service C3 as abnormal, from which it is inferred that service C3 is abnormal.
Fig. 3 is a schematic flow chart of the method of the invention for detecting and recovering from I/O sub-health in a distributed storage system.
Fig. 4 is a schematic structural diagram of the system of the invention for detecting and recovering from I/O sub-health in a distributed storage system.
Detailed Description
To make the purpose, technical solutions, and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to specific embodiments. The described embodiments are only some of the embodiments of the invention, not all of them. The invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention.
It should be noted that, where no conflict arises, the following embodiments and the features in them can be combined with one another; and all other embodiments obtained by a person skilled in the art from the embodiments in this disclosure without creative effort fall within the scope of protection of this disclosure.
It should also be noted that the aspects described herein can be embodied in a wide variety of forms, and any specific structure and/or function described herein is illustrative only. Based on this disclosure, a person skilled in the art will appreciate that an aspect described herein can be implemented independently of any other aspect, and that two or more of these aspects can be combined in various ways. For example, an apparatus can be implemented and/or a method practiced using any number of the aspects set forth herein.
The invention is described in detail below with reference to the embodiments shown in Figs. 1 to 4.
As shown in Fig. 3, the method of the invention for detecting and recovering from I/O sub-health in a distributed storage system includes:
(1) Cluster function
An I/O sub-health detection service, monitor, is deployed in the storage system. The service consists of a server part and a client part. The server component runs as an independent process on each storage node; for reliability, the server components operate in primary/standby mode (monitor_server1, monitor_server2, and monitor_server3 in Fig. 1) and provide cluster primary election, aggregation of I/O sub-health data, inference, and isolation decisions. The client component is loaded into the critical service processes on each storage node and collects I/O latency, I/O error, and I/O hang data for each critical service (service A, service B, and service C in Fig. 2). It reports the data to the monitor_server on its own node, which forwards it to the primary monitor_server for processing.
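The reporting path described above (client, then the local monitor_server, then the primary monitor_server) can be sketched as follows. This is a minimal in-process illustration, not the actual monitor implementation: the class names `AnomalyReport`, `LocalMonitorServer`, and `PrimaryMonitorServer`, and the use of direct method calls in place of network messages, are assumptions made for the example. Primary election is not modeled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyReport:
    """One report from a monitor client (hypothetical message layout)."""
    node: str
    service: str
    kind: str    # "timeout", "failure", or "hang"
    count: int


class PrimaryMonitorServer:
    """Aggregates reports from all nodes; stands in for the elected primary."""

    def __init__(self) -> None:
        # (node, service, kind) -> total number of anomalies reported
        self.totals: Counter = Counter()

    def receive(self, report: AnomalyReport) -> None:
        self.totals[(report.node, report.service, report.kind)] += report.count


class LocalMonitorServer:
    """Per-node server that forwards client reports to the primary."""

    def __init__(self, node: str, primary: PrimaryMonitorServer) -> None:
        self.node = node
        self.primary = primary

    def submit(self, service: str, kind: str, count: int = 1) -> None:
        # Tag the report with this node's name before forwarding it.
        self.primary.receive(AnomalyReport(self.node, service, kind, count))
```

Keying the totals by node, service, and anomaly kind keeps the per-service counts that the later counting and inference steps need.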
(2) Periodic detection and reporting
Statistics points are embedded in the critical services on the I/O path (service A, service B, and service C in Fig. 2). Each critical service uses monitor_client to check periodically for abnormal I/O data and reports it to monitor_server. Abnormal I/O covers the following three scenarios:
I/O timeout: the I/O latency exceeds a threshold of T seconds (latency requirements differ between services, so T depends on the specific service, for example 800 ms to 3 s);
I/O failure: the I/O returns an error;
I/O hang: the I/O blocks and does not return;
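The three scenarios above can be expressed as a small classification function. The sketch below is illustrative only: the `IOSample` record, the `hang_deadline_s` parameter (how long an outstanding I/O may run before it is treated as hung), and the precedence of failure over timeout are assumptions, since the text does not specify how a hang is detected or how overlapping conditions are ranked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IOAnomaly(Enum):
    TIMEOUT = "timeout"
    FAILURE = "failure"
    HANG = "hang"


@dataclass
class IOSample:
    elapsed_s: float      # time since the I/O was issued, in seconds
    completed: bool       # False while the I/O has not returned
    error: bool = False   # True if the I/O returned an error


def classify(sample: IOSample, threshold_s: float,
             hang_deadline_s: float) -> Optional[IOAnomaly]:
    """Return the anomaly type of one I/O sample, or None if it is healthy."""
    if not sample.completed:
        # Still outstanding: a hang only once the assumed deadline has passed.
        return IOAnomaly.HANG if sample.elapsed_s > hang_deadline_s else None
    if sample.error:
        return IOAnomaly.FAILURE
    if sample.elapsed_s > threshold_s:
        return IOAnomaly.TIMEOUT
    return None
```

For instance, with T = 0.8 s, a completed I/O that took 1.5 s is classified as a timeout, while one that took 0.2 s is healthy.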
(3) Intelligent inference based on the sub-health detection algorithm
(3-1) I/O anomaly counting method
I/O timeout: M1 (for example, 8) consecutive cycles are checked; if an I/O latency greater than T seconds is detected in N1 (for example, 4) of those cycles (M1 ≥ N1 ≥ 1), an I/O timeout anomaly is declared;
I/O failure: M2 (for example, 8) consecutive cycles are checked; if an I/O error is detected in N2 (for example, 4) of those cycles (M2 ≥ N2 ≥ 1), an I/O failure anomaly is declared;
I/O hang: if an I/O is detected to block without returning, an I/O hang anomaly is declared.
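The N-of-M rule for timeouts and failures can be sketched with a fixed-size window over the most recent cycles. The class name `CycleWindow` is hypothetical, and the use of a sliding window (re-evaluated every cycle) rather than non-overlapping blocks of M cycles is an assumption; the text says only that M consecutive cycles are checked.

```python
from collections import deque


class CycleWindow:
    """Declares an anomaly when at least n of the last m cycles were abnormal."""

    def __init__(self, m: int, n: int) -> None:
        if not (m >= n >= 1):
            raise ValueError("requires m >= n >= 1")
        self.n = n
        # Keeps only the most recent m cycle results.
        self.cycles: deque = deque(maxlen=m)

    def record(self, abnormal: bool) -> bool:
        """Record one cycle and return True if the N-of-M condition holds."""
        self.cycles.append(abnormal)
        return sum(self.cycles) >= self.n
```

With the example values M1 = 8 and N1 = 4, one window per service and anomaly type would be kept; hangs need no window because a single detection is sufficient.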
(3-2) I/O anomaly inference method
Taking the high-latency scenario as an example, the I/O anomaly inference logic is as follows. As shown in Fig. 2, I/O is initiated by the client and travels along the critical service path client -> service A -> service B -> service C:
(3-2-1) An upper-layer service calls the interface of a lower-layer service. Service A observes an anomaly in service B, such as high latency, and service B is preliminarily inferred to be abnormal. If service B does not in turn identify service C as abnormal, the anomaly is attributed to service B;
(3-2-2) Service A observes an anomaly in service B, such as high latency, and service B is preliminarily inferred to be abnormal. If service B in turn identifies service C as abnormal and service C does not identify anything further down, the anomaly is attributed to service C (if there are lower-level services below service C, the same reasoning continues);
identification proceeds layer by layer according to the above logic until all nodes and critical services with I/O anomalies have been confirmed.
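The layer-by-layer identification can be sketched as a search for services that are identified as abnormal by a caller but do not themselves identify anything below them. The function name and the dictionary representation of identifications are assumptions for illustration, and the sketch assumes the call path is acyclic, as in the client -> A -> B -> C chain.

```python
from typing import Dict, Set


def infer_faulty_services(identified: Dict[str, Set[str]]) -> Set[str]:
    """Return the services to which the anomaly is attributed.

    `identified` maps each upper-layer service to the set of lower-layer
    services it has identified as abnormal. A service is a root cause if
    some caller identified it and it identified nothing further down.
    """
    accused: Set[str] = set().union(*identified.values())
    return {svc for svc in accused if not identified.get(svc)}


# The scenario described for Fig. 2: A1 -> B2, A2 and A3 -> B1, B1 and B2 -> C3.
fig2 = {
    "A1": {"B2"},
    "A2": {"B1"},
    "A3": {"B1"},
    "B1": {"C3"},
    "B2": {"C3"},
}
root_causes = infer_faulty_services(fig2)
```

For the Fig. 2 scenario the result is the single service C3, matching the inference stated for that figure. If B1 and B2 had identified nothing, the result would instead be B1 and B2.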
(4) Intelligent recovery decision
(4-1) If it is finally inferred that only one service on only one node has an I/O anomaly, the service is recovered by self-healing and an alarm is reported; if an I/O anomaly is detected on that service again within the unit time S1 (for example, 24 hours), the service is isolated (that is, stopped) and an alarm is reported;
(4-2) if it is finally inferred that two or more services on only one node have I/O anomalies, a node-level I/O anomaly is inferred, the node is recovered by restarting it (self-healing), and an alarm is reported; if an I/O anomaly is detected on that node again within the unit time S2 (for example, 7 days), the node is isolated (that is, all services on the node are stopped) and an alarm is reported;
(4-3) if it is finally inferred that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention is requested for recovery.
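The three decision rules can be sketched as a small planner that remembers when each service or node was last healed. The class `RecoveryPlanner`, the action names, and the choice to measure recurrence from the time of the last heal are assumptions for illustration; the defaults of 24 hours and 7 days are the example values given above. Every returned action is assumed to be accompanied by an alarm, as each rule requires.

```python
from typing import Dict, Optional, Set, Tuple

Fault = Tuple[str, str]  # (node, service) inferred to have an I/O anomaly


class RecoveryPlanner:
    def __init__(self, s1_seconds: float = 24 * 3600,
                 s2_seconds: float = 7 * 24 * 3600) -> None:
        self.s1 = s1_seconds  # recurrence window for a single service
        self.s2 = s2_seconds  # recurrence window for a whole node
        self._service_healed_at: Dict[Fault, float] = {}
        self._node_healed_at: Dict[str, float] = {}

    @staticmethod
    def _recent(last: Optional[float], now: float, window: float) -> bool:
        return last is not None and now - last <= window

    def decide(self, faults: Set[Fault], now: float) -> Tuple[str, str]:
        """Return (action, target); an alarm is raised for every action."""
        if not faults:
            return ("none", "")
        nodes = sorted({node for node, _ in faults})
        if len(nodes) >= 2:
            # Rule (4-3): group event, ask for manual intervention.
            return ("manual_intervention", ",".join(nodes))
        node = nodes[0]
        if len(faults) == 1:
            # Rule (4-1): one service on one node.
            (fault,) = faults
            target = f"{node}/{fault[1]}"
            if self._recent(self._service_healed_at.get(fault), now, self.s1):
                return ("isolate_service", target)
            self._service_healed_at[fault] = now
            return ("heal_service", target)
        # Rule (4-2): two or more services on the same node.
        if self._recent(self._node_healed_at.get(node), now, self.s2):
            return ("isolate_node", node)
        self._node_healed_at[node] = now
        return ("restart_node", node)
```

In this sketch a first anomaly on node1/serviceB yields `heal_service`, and a second anomaly on the same service one hour later yields `isolate_service`.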
As shown in Fig. 4, the system of the invention for detecting and recovering from I/O sub-health in a distributed storage system includes:
an abnormal I/O data collection module, configured to collect the abnormal I/O data recorded at each I/O statistics point;
an abnormal I/O data analysis module, configured to analyze the categories of abnormal I/O data and count the occurrences of each category;
a detection algorithm module with a built-in sub-health detection algorithm, configured to infer which critical services or nodes are in an I/O sub-health state;
a recovery decision module with built-in decision logic, configured to select the applicable recovery mode.
Each module operates according to the method described above for detecting and recovering from I/O sub-health in a distributed storage system.
The embodiments of the invention are described in a progressive manner, and identical or similar parts of the embodiments can be understood by reference to one another.
The above are only embodiments of the invention and are not intended to limit it. A person skilled in the art can make various modifications and changes to the invention. Any modification, substitution, or the like made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211233825.9A | 2022-10-10 | 2022-10-10 | A distributed storage system I/O sub-health intelligent detection and recovery method |
| Publication Number | Publication Date |
|---|---|
| CN115470061A | 2022-12-13 |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |