CN114024825B - A method for end-to-end fault monitoring of services in a cloud computing environment - Google Patents

A method for end-to-end fault monitoring of services in a cloud computing environment

Info

Publication number
CN114024825B
Authority
CN
China
Prior art keywords
service
alarm
cloud
host
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111285767.XA
Other languages
Chinese (zh)
Other versions
CN114024825A (en)
Inventor
林德生
郑生华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Youke Communication Technology Co ltd
Original Assignee
China Youke Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-11-02
Filing date
2021-11-02
Publication date
2024-12-20

Abstract

Translated from Chinese

The present invention relates to a method for end-to-end fault monitoring of services in a cloud computing environment. The method monitors four dimensions in real time (cloud pool, network, equipment and service) and combines static and dynamic threshold algorithms to detect the relevant faults in each dimension as they occur. It establishes a fault monitoring system spanning the three layers IaaS, PaaS and SaaS, so that faults can be analyzed and diagnosed across all of them. This provides an accurate means of quickly locating service faults, improves service quality, and improves customer service satisfaction.

Description

Service end-to-end fault monitoring method in cloud computing environment
Technical Field
The invention relates to probe deployment and integrated acquisition programs in a cloud computing environment, and in particular to a method for end-to-end fault monitoring of services in a cloud computing environment.
Background
With the rapid development of cloud computing, many telecom operators have migrated a large number of services to the cloud. A cloud computing environment aggregates large amounts of physical and virtual resources and provides services at multiple levels, such as IaaS, PaaS and SaaS. A failure in the IaaS or PaaS layer often causes a failure in the SaaS layer, and each of the three layers may contain multiple links or nodes. Because cloud service data is diverse and the deployment environment is dynamic, cloud computing nodes are often migrated dynamically from one host machine to another when a host machine fails or becomes a performance bottleneck. This migration is prone to abnormalities, which commonly escalate into serious faults and cause service failure. Moreover, a node that becomes abnormal during operation may cause abnormalities in the nodes associated with it and lead to a large-scale service failure. In addition, because of its own dynamic nature, the cloud host running an acquisition node may lose external network connectivity and therefore misjudge faults, which produces interference and reduces the reliability of alarms. At present, when a service failure is discovered, the physical resource nodes along the service's end-to-end path are checked first. This requires cross-department collaboration, so fault location is slow and fault handling is time-consuming. Abnormality monitoring and correct fault judgment for the relevant nodes in a cloud computing environment are therefore problems that need to be solved.
Disclosure of Invention
The invention aims to provide a method for end-to-end fault monitoring of services in a cloud computing environment, so as to make service fault location and handling more efficient and fault judgment more accurate.
To achieve this purpose, the technical scheme of the invention is a method for end-to-end fault monitoring of services in a cloud computing environment, comprising the following steps:
Step S1, deploying an integrated acquisition service that connects to the cloud resource pool through an API interface and obtains information about the cloud hosts, host machines and computing resource pools that carry services on the cloud resource pool;
Step S2, deploying acquisition probes on the host machines in the same computing resource pool, collecting the IaaS layer performance monitoring indicators of each host machine, and sending them to the integrated acquisition service, which judges whether there is an alarm according to a defined threshold;
Step S3, deploying acquisition probes on all cloud hosts and collecting the IaaS layer performance indicators of each cloud host;
Step S4, deploying an acquisition probe on each cloud host on which a PaaS component is installed, collecting the performance indicators of the PaaS component, and sending them to the integrated acquisition service, which judges whether there is an alarm according to a defined threshold;
Step S5, establishing a cross acquisition matrix, monitoring the network quality of the host machines and cloud hosts online in real time, and judging alarms and issuing early warnings according to the principles of time correlation and alarm consistency;
Step S6, deploying an integrated acquisition service, collecting SaaS service status indicators, and judging whether there is an alarm according to a threshold;
Step S7, deploying acquisition probes on each cloud host that carries the service and collecting the SaaS service status indicators of the other end of each link between adjacent nodes, the acquisition probes sending the data to the integrated acquisition service, which judges whether an alarm is required according to a threshold;
Step S8, drawing the service's end-to-end full-process display nodes from bottom to top, and projecting onto each node its alarms and its performance indicators or business indicators.
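As an illustration of step S8, the bottom-up drawing and alarm projection might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `TopologyNode` class, the `project_alarms` function, the numeric layer ranks and the node names in the usage note are all hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field

# Layer names come from the patent; the numeric ranks used to order
# nodes from bottom to top are an assumption of this sketch.
LAYER_RANK = {"IaaS": 0, "PaaS": 1, "SaaS": 2}


@dataclass
class TopologyNode:
    """One display node in the end-to-end topology (hypothetical type)."""
    name: str
    layer: str                                  # "IaaS", "PaaS" or "SaaS"
    alarms: list = field(default_factory=list)  # alarm messages projected here


def project_alarms(nodes, alarms):
    """Attach each (node_name, message) alarm to its node and return
    the nodes ordered bottom-up (IaaS first, SaaS last)."""
    by_name = {node.name: node for node in nodes}
    for node_name, message in alarms:
        if node_name in by_name:                # alarms for unknown nodes are skipped
            by_name[node_name].alarms.append(message)
    return sorted(nodes, key=lambda node: LAYER_RANK[node.layer])
```

Given three hypothetical nodes `order-api` (SaaS), `host-01` (IaaS) and `mysql-01` (PaaS), calling `project_alarms` with one alarm for `host-01` returns the nodes in the order `host-01`, `mysql-01`, `order-api`, with the alarm attached to the first.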
Compared with the prior art, the invention has the beneficial effect that faults can be analyzed and diagnosed across all dimensions, which provides an accurate means of quickly locating service faults, improves service quality, and improves customer service satisfaction.
Drawings
Fig. 1 is a flow chart of monitoring, positioning, collecting and early warning of end-to-end faults of a service in a cloud computing environment.
Fig. 2 is a flow chart of a cross monitoring matrix for monitoring and positioning end-to-end faults of a service in a cloud computing environment according to the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
As shown in figs. 1 and 2, the method for end-to-end fault monitoring of services in a cloud computing environment includes:
1. As shown in fig. 1, the logic for monitoring, locating, acquisition and early warning of end-to-end service faults in a cloud computing environment is as follows:
(1) Deploy probes on the host machines of the computing resource pool obtained through the resource pool API interface, and collect host machine performance indicators, including CPU utilization, memory utilization, file system space utilization and collection time, to monitor whether each host machine is operating normally.
(2) Install an IaaS layer probe on each cloud host that carries the service, and collect cloud host performance indicators, including CPU utilization, memory utilization, file system space and collection time, to monitor whether the cloud host can provide service normally.
(3) Install a probe on each cloud host that carries a PaaS service, and collect performance indicators of the PaaS component, such as the tablespace utilization of a database service or the request waiting time of a message queue, together with the collection time.
(4) Collect the status of the business services through the integrated acquisition service, for example the service response status and service response duration, together with the collection time.
(5) Install probes on the cloud host of each node that carries a SaaS service, and collect the status of the service at the other end of the node, together with the collection time.
(6) The integrated acquisition service schedules each acquisition task automatically, receives the indicator data collected by the probes, and calls a threshold engine to check whether the data exceeds a threshold. If it does, alarm data is generated.
(7) Store, filter and otherwise process the generated alarm data.
(8) Build an end-to-end visual topology according to the logical end-to-end relationships of the service, and project the alarms of the corresponding nodes onto each topology node so that they are presented more intuitively.
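The threshold check in item (6) could look roughly like the sketch below. The abstract mentions static and dynamic threshold algorithms but the text does not define them, so everything here is an assumption: the function names, the metric names, and in particular the dynamic rule (mean plus `k` standard deviations of recent samples) are hypothetical stand-ins, not the patent's algorithm.

```python
from statistics import mean, stdev


def static_breach(value, limit):
    """Static threshold: alarm when the value exceeds a fixed limit."""
    return value > limit


def dynamic_breach(value, history, k=3.0, min_samples=5):
    """Dynamic threshold (assumed rule): alarm when the value exceeds
    mean + k * standard deviation of the recent samples."""
    if len(history) < min_samples:
        return False                    # too little data to form a baseline
    return value > mean(history) + k * stdev(history)


def evaluate(metric, value, history, static_limits):
    """Return an alarm dict if either rule fires, otherwise None.

    static_limits maps metric names to fixed limits; metrics without
    an entry are only checked against the dynamic rule.
    """
    limit = static_limits.get(metric)
    if limit is not None and static_breach(value, limit):
        return {"metric": metric, "value": value, "rule": "static"}
    if dynamic_breach(value, history):
        return {"metric": metric, "value": value, "rule": "dynamic"}
    return None
```

With a hypothetical static limit of 90 for `cpu_utilization`, a sample of 95 raises a static alarm. A latency sample of 80 against recent samples near 20 raises a dynamic alarm even though no fixed limit is configured. Checking the static rule first is a design choice of this sketch, so that a fixed limit still applies before enough history has accumulated.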
2. As shown in fig. 2, the cross acquisition matrix for service network quality in a cloud computing environment is implemented as follows:
(1) Deploy one set of the integrated acquisition service on a cloud host in each of three different network segments to establish the cross acquisition matrix.
(2) Each of the three sets of integrated acquisition services initiates network quality monitoring of the target host machines and cloud hosts through ICMP.
(3) Each of the three sets of integrated acquisition services schedules the threshold engine according to the monitoring strategy to judge whether the network quality exceeds a threshold and should raise an alarm.
(4) According to time correlation, the network quality alarms collected by the three sets of integrated acquisition services are judged to be related. According to the alarm consistency principle, when all three sets issue network quality alarms at similar times, the alarms are judged to be real alarms; otherwise, the integrated acquisition service that is currently generating the alarm is judged to be faulty itself.
(5) Store, filter and otherwise process the judged network quality alarm data.
(6) Project the alarms onto the service's end-to-end visual topology, and mark them with a color different from that of performance index alarms so that they stand out, allowing the location and type of a fault to be judged quickly. Table 1 lists the fault classifications, fault types and related indicators.
TABLE 1
Fault classification | Fault type | Index
Performance index alert | IaaS | CPU utilization, memory utilization, file system space utilization, etc.
Network quality index alarms | IaaS | Network connectivity, network packet loss rate, network delay, etc.
Database performance index | PaaS | Tablespace utilization, failure index, database snoop status, session occupancy, etc.
Message queue performance index | PaaS | Waiting time in request queue, time of node processing, time of sending response, etc.
Business end-to-end service performance | SaaS | Service port connectivity and the like
Business index | SaaS | Traffic call completing rate and the like
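Returning to step (4) of the cross acquisition matrix, the time correlation and alarm consistency rule can be sketched as below. This is an illustrative sketch only: the function name, the tuple layout, the collector and target names, and the 60-second window standing in for "similar times" are assumptions, since the text gives no concrete values.

```python
def classify_network_alarms(alarms, collectors, window_seconds=60):
    """Apply the alarm consistency rule to network quality alarms.

    alarms     -- list of (collector, target, timestamp_seconds) tuples
    collectors -- names of the integrated acquisition services (three in the patent)
    Returns {target: "real_alarm" or "collector_fault"}.
    """
    by_target = {}
    for collector, target, ts in alarms:
        by_target.setdefault(target, []).append((collector, ts))

    verdicts = {}
    for target, reports in by_target.items():
        # Keep only the most recent report from each collector.
        latest = {}
        for collector, ts in reports:
            latest[collector] = max(ts, latest.get(collector, ts))

        all_reported = set(collectors).issubset(latest)
        spread = max(latest.values()) - min(latest.values())
        if all_reported and spread <= window_seconds:
            # Every collector saw the problem at about the same time.
            verdicts[target] = "real_alarm"
        else:
            # Only some collectors alarmed: suspect the collectors themselves.
            verdicts[target] = "collector_fault"
    return verdicts
```

For example, if hypothetical collectors `A`, `B` and `C` all report target `host-01` within 30 seconds of each other, the alarm is classified as real. If only `A` reports target `vm-07`, the verdict is a fault of the reporting collector.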
The above is a preferred embodiment of the present invention. All changes made according to the technical solution of the present invention fall within its scope of protection, provided that the resulting functional effects do not go beyond the scope of the technical solution.

Claims (1)

Translated from Chinese
1. A method for end-to-end fault monitoring of services in a cloud computing environment, characterized by comprising the following steps:
Step S1: Deploy the integrated collection service to connect to the cloud resource pool through the API interface and obtain information about the cloud hosts, host machines and computing resource pools that carry the business on the cloud resource pool;
Step S2: Deploy collection probes on the host machines in the same computing resource pool to collect the IaaS layer performance monitoring indicators of each host machine, and send them to the integrated collection service. The integrated collection service determines whether there is an alarm based on the defined threshold;
Step S3: Deploy a collection probe on each cloud host to collect the IaaS layer performance indicators of each cloud host;
Step S4: Deploy a collection probe on the cloud host where the PaaS component is installed, collect the performance indicators of the PaaS component, and send them to the integrated collection service. The integrated collection service determines whether there is an alarm based on the defined threshold;
Step S5: Establish a cross-collection matrix to monitor the network quality of the host machines and cloud hosts online in real time, and make alarm determinations and early warnings based on the principles of time correlation and alarm consistency;
Step S6: Deploy an integrated collection service to collect SaaS business service status indicators, and determine whether there is an alarm based on the threshold;
Step S7: Deploy a collection probe on each cloud host that carries the service and collect the SaaS service status indicators of the other end between adjacent nodes. The collection probe then sends the data to the integrated collection service, which determines whether an alarm is required based on the threshold;
Step S8: Draw the end-to-end full-process display nodes of the service from bottom to top, and project the alarms and performance indicators or business indicators of each node;
In step S5, the cross-collection matrix is specifically implemented as follows:
(1) Deploy one integrated collection service on a cloud host in each of three different network segments to establish a cross-collection matrix;
(2) The three integrated collection services each initiate network quality monitoring of the target host machine and cloud host through ICMP;
(3) The three integrated collection services dispatch the threshold engine based on the monitoring strategy to determine whether the network quality exceeds the threshold and an alarm should be issued;
(4) Based on time correlation, the network quality alarms collected by the three integrated collection services are judged to be related; based on the alarm consistency principle, when the three integrated collection services all issue network quality alarms at similar times, the corresponding alarms are judged to be real alarms; otherwise, it is judged to be a fault of the integrated collection service that currently generates the alarm;
(5) Store and filter the network quality alarm data that has been judged;
(6) Project the alarms onto the end-to-end visualization topology device of the service to quickly determine the location and type of the fault.
Application CN202111285767.XA, priority date 2021-11-02, filing date 2021-11-02: A method for end-to-end fault monitoring of services in a cloud computing environment (Active; published as CN114024825B (en))

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111285767.XA (CN114024825B (en)) | 2021-11-02 | 2021-11-02 | A method for end-to-end fault monitoring of services in a cloud computing environment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111285767.XA (CN114024825B (en)) | 2021-11-02 | 2021-11-02 | A method for end-to-end fault monitoring of services in a cloud computing environment

Publications (2)

Publication Number | Publication Date
CN114024825A (en) | 2022-02-08
CN114024825B (en) | 2024-12-20

Family

ID=80059607

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111285767.XA (Active, CN114024825B (en)) | A method for end-to-end fault monitoring of services in a cloud computing environment | 2021-11-02 | 2021-11-02

Country Status (1)

Country | Link
CN (1) | CN114024825B (en)


Also Published As

Publication number | Publication date
CN114024825A (en) | 2022-02-08


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
