Service end-to-end fault monitoring method in cloud computing environment
Technical Field
The invention relates to the field of probe deployment and integrated acquisition programs in a cloud computing environment, and in particular to a service end-to-end fault monitoring method in a cloud computing environment.
Background
With the rapid development of cloud computing, many telecom operators have migrated a large number of services to the cloud. The cloud computing environment aggregates a large amount of physical and virtual resources and provides multiple levels of service, such as IaaS, PaaS and SaaS. A failure at the IaaS or PaaS layer often results in a failure at the SaaS layer, and each of the IaaS, PaaS and SaaS layers may contain multiple links or nodes. Because cloud computing service data is diverse and the deployment environment is dynamic, cloud computing nodes are often migrated dynamically from one host machine to another when a host machine fails or becomes a performance bottleneck. This migration process is prone to anomalies, and such anomalies often develop into serious faults that cause service failure. Moreover, a node that becomes abnormal during operation may cause the nodes associated with it to become abnormal as well, leading to a large-scale service failure. On the other hand, because the cloud host on which an acquisition node runs is itself dynamic, a failure of its own external network connection can be misjudged as a fault of the monitored target, which introduces interference and reduces the reliability of alarms. At present, when a service failure is found, the physical resource nodes along the end-to-end path of the service are checked first, which requires cross-department collaboration, so fault location is slow and fault handling is time-consuming. Therefore, anomaly monitoring and correct fault judgment for the relevant nodes in a cloud computing environment are problems to be solved.
Disclosure of Invention
The invention aims to provide a service end-to-end fault monitoring method in a cloud computing environment, so as to improve the efficiency of service fault location and handling in the cloud computing environment and to improve the accuracy of fault judgment.
To achieve this purpose, the technical scheme of the invention is a service end-to-end fault monitoring method in a cloud computing environment, comprising the following steps:
Step S1, deploying an integrated acquisition service, interfacing with a cloud resource pool through an API, and acquiring information on the cloud hosts, host machines and computing resource pools of the services carried on the cloud resource pool;
Step S2, deploying acquisition probes on the host machines in the same computing resource pool, acquiring the IaaS layer performance monitoring indexes of each host machine, and sending them to the integrated acquisition service, wherein the integrated acquisition service judges whether an alarm exists according to a defined threshold;
Step S3, deploying acquisition probes on all cloud hosts, and acquiring the IaaS layer performance indexes of each cloud host;
Step S4, deploying an acquisition probe on each cloud host on which a PaaS component is installed, acquiring the performance indexes of the PaaS component, and sending them to the integrated acquisition service, wherein the integrated acquisition service judges whether an alarm exists according to a defined threshold;
Step S5, establishing a cross acquisition matrix, monitoring the network quality of the host machines and cloud hosts online in real time, and judging alarms and issuing early warnings according to the principles of time correlation and alarm consistency;
Step S6, deploying an integrated acquisition service, acquiring the SaaS service state indexes of the service, and judging whether an alarm exists according to a threshold;
Step S7, deploying acquisition probes on the cloud hosts carrying the service, acquiring the SaaS service state indexes of the adjacent node at the other end, and sending the data to the integrated acquisition service, wherein the integrated acquisition service judges whether an alarm is required according to a threshold;
Step S8, drawing the nodes of the end-to-end full flow of the service from bottom to top, and projecting onto each node its alarms and its performance indexes or service indexes.
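Steps S2, S4, S6 and S7 all end with the integrated acquisition service comparing a collected index against a defined threshold. The following is a minimal sketch of such a check, not the implementation described by the invention. The `Sample` record, the metric names and the threshold values are all hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One index value reported by a probe (hypothetical record layout)."""
    node: str         # host machine or cloud host identifier
    layer: str        # "IaaS", "PaaS" or "SaaS"
    metric: str       # e.g. "cpu_utilization"
    value: float
    timestamp: float  # acquisition time, seconds since epoch


# Hypothetical thresholds in percent; the source gives no concrete values.
THRESHOLDS = {
    "cpu_utilization": 90.0,
    "memory_utilization": 85.0,
    "fs_space_utilization": 80.0,
}


def check_threshold(sample: Sample, thresholds=THRESHOLDS) -> Optional[dict]:
    """Return an alarm record if the sample exceeds its threshold, else None."""
    limit = thresholds.get(sample.metric)
    if limit is None or sample.value <= limit:
        # No threshold defined for this metric, or the value is within limits.
        return None
    return {
        "node": sample.node,
        "layer": sample.layer,
        "metric": sample.metric,
        "value": sample.value,
        "threshold": limit,
        "timestamp": sample.timestamp,
    }
```

Keeping the acquisition time in every sample and alarm is what later allows alarms from different collectors to be correlated by time.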
Compared with the prior art, the invention has the following beneficial effects: faults can be analyzed and diagnosed comprehensively, which provides an accurate means of rapidly locating service faults, improves service quality and improves customer satisfaction.
Drawings
Fig. 1 is a flow chart of the monitoring, locating, acquisition and early warning of service end-to-end faults in a cloud computing environment.
Fig. 2 is a flow chart of the cross monitoring matrix used for monitoring and locating service end-to-end faults in a cloud computing environment according to the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
As shown in Fig. 1 and Fig. 2, the method for monitoring service end-to-end faults in a cloud computing environment includes:
1. As shown in Fig. 1, the logic for monitoring, locating, acquisition and early warning of service end-to-end faults in the cloud computing environment is as follows:
(1) Deploy probes on the host machines of the computing resource pool obtained through the resource pool API, and collect the performance indexes of each host machine, including CPU utilization, memory utilization, file system space utilization and acquisition time, to monitor whether the host machine is operating normally.
(2) Install an IaaS layer probe on each cloud host carrying the service, and collect the performance indexes of the cloud host, including CPU utilization, memory utilization, file system space utilization and acquisition time, to monitor whether the cloud host carrying the service can provide service normally.
(3) Install a probe on each cloud host carrying a PaaS service, and collect the performance indexes of the PaaS component together with the acquisition time, for example the tablespace utilization of a database service or the request latency of a message queue.
(4) Acquire the state of the business service through the integrated acquisition service, for example the service response status, the service response duration and the acquisition time.
(5) Install a probe on the cloud host of each node carrying the SaaS service, and collect the state of the service at the other end of the node together with the acquisition time.
(6) The integrated acquisition service automatically schedules each acquisition task, receives the index data collected by the probes, and then calls a threshold engine to check whether the index data exceeds a threshold; if it does, alarm data is generated.
(7) Store and filter the generated alarm data, among other processing.
(8) Establish an end-to-end visual topology according to the logical end-to-end relationships of the service, and project the alarms of the corresponding nodes onto each topology node so that they are presented more intuitively.
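Item (8) above, projecting alarms onto a bottom-up topology, can be illustrated with a small sketch. It assumes a single linear service chain and alarm records that carry a `node` key. The node names and the `lowest_alarmed_node` helper are hypothetical and are not taken from the source.

```python
from collections import defaultdict

# Hypothetical service chain, listed bottom-up:
# host machine -> cloud host -> PaaS component -> SaaS service.
TOPOLOGY = [
    {"node": "host-1", "layer": "IaaS"},
    {"node": "vm-1", "layer": "IaaS"},
    {"node": "db-1", "layer": "PaaS"},
    {"node": "order-service", "layer": "SaaS"},
]


def project_alarms(topology, alarms):
    """Attach each alarm to its topology node; nodes without alarms get an empty list."""
    by_node = defaultdict(list)
    for alarm in alarms:
        by_node[alarm["node"]].append(alarm)
    return [
        {**entry, "alarms": by_node.get(entry["node"], [])}
        for entry in topology
    ]


def lowest_alarmed_node(projected):
    """Return the first (lowest) node in the chain that has an alarm, or None."""
    for entry in projected:
        if entry["alarms"]:
            return entry["node"]
    return None
```

Because lower-layer failures often propagate upward, the lowest alarmed node is a reasonable first place to look. Treating it as the fault location is an assumption of this sketch, not a rule stated in the source.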
2. As shown in Fig. 2, the implementation steps of the cross acquisition matrix for service network quality in the cloud computing environment are as follows:
(1) Deploy one set of integrated acquisition services on a cloud host in each of three different network segments to establish the cross acquisition matrix.
(2) Each of the three sets of integrated acquisition services initiates network quality monitoring of the target host machines and cloud hosts through ICMP.
(3) Each of the three sets of integrated acquisition services schedules the threshold engine according to the monitoring strategy to judge whether the network quality exceeds a threshold and an alarm should be raised.
(4) Judge, according to time correlation, whether the network quality alarms collected by the three sets of integrated acquisition services are correlated. According to the alarm consistency principle, when all three sets of acquisition services raise network quality alarms at similar times, the alarms are judged to be real alarms; otherwise, the integrated acquisition service that raised the alarm is judged to be faulty.
(5) Store and filter the judged network quality alarm data, among other processing.
(6) Project the alarms onto the service end-to-end visual topology, using a color identifier different from that of performance index alarms so that they stand out and the location and type of a fault can be judged quickly. Table 1 correlates fault classifications, fault types and indexes.
TABLE 1
| Fault classification | Fault type | Index |
| Performance index alarm | IaaS | CPU utilization, memory utilization, file system space utilization, etc. |
| Network quality index alarm | IaaS | Network connectivity, network packet loss rate, network delay, etc. |
| Database performance index | PaaS | Tablespace utilization, invalid indexes, database listener status, session occupancy, etc. |
| Message queue performance index | PaaS | Waiting time in the request queue, node processing time, response sending time, etc. |
| Service end-to-end performance | SaaS | Service port connectivity, etc. |
| Business index | SaaS | Service call completion rate, etc. |
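The time correlation and alarm consistency rule of the cross acquisition matrix can be sketched as follows. This is a simplified illustration rather than the patented procedure. It assumes three collectors with hypothetical names, a hypothetical 60-second window for "similar time", and that all alarms passed in for one target belong to a single evaluation round.

```python
from collections import defaultdict

COLLECTORS = {"collector-a", "collector-b", "collector-c"}  # hypothetical names
WINDOW_SECONDS = 60.0  # hypothetical definition of "similar time"


def classify_network_alarms(alarms, collectors=COLLECTORS, window=WINDOW_SECONDS):
    """Classify network quality alarms per monitored target.

    Each alarm is a dict with "collector", "target" and "timestamp" keys.
    Returns {target: (verdict, collectors_that_alarmed)} where verdict is
    "real" when every collector alarmed within the window, and
    "collector_fault" otherwise.
    """
    by_target = defaultdict(list)
    for alarm in alarms:
        by_target[alarm["target"]].append(alarm)

    result = {}
    for target, items in by_target.items():
        times = [a["timestamp"] for a in items]
        seen = {a["collector"] for a in items}
        # Time correlation: all alarms for this target fall within one window.
        correlated = max(times) - min(times) <= window
        # Alarm consistency: every collector in the matrix reported the target.
        if correlated and seen == set(collectors):
            result[target] = ("real", sorted(seen))
        else:
            result[target] = ("collector_fault", sorted(seen))
    return result
```

A target reported by only one or two collectors is attributed to those collectors' own network connectivity, which is how the matrix filters out misjudgments caused by the acquisition nodes themselves.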
The above is a preferred embodiment of the present invention. All changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the resulting functional effects do not exceed the scope of the technical solution.