Disclosure of Invention
The invention aims to disclose a monitoring management system based on cloud computing, which monitors the nodes in cloud computing and automatically recovers from potential faults or faults that have already occurred, so that the stability and availability of a large-scale cluster server based on cloud computing are guaranteed.
In order to achieve the above object, the present invention provides a monitoring management system based on cloud computing, which is used for monitoring and managing the operation state of a large-scale cluster server in cloud computing, and includes:
a data acquisition unit, comprising: a monitoring client for collecting node data in the large-scale cluster server in real time, and three monitoring databases for storing the node data;
the monitoring management system further comprises:
a fault feature library and a fault processing unit; wherein,
the fault feature library is used for defining and storing fault feature items; the monitoring client verifies the node data acquired in real time against the fault feature items in the fault feature library to judge whether the node data indicates a fault, and if so, sends a fault instruction to the fault processing unit;
and the fault processing unit is used for responding to the fault instruction sent by the monitoring client, generating a fault processing strategy and sending the fault processing strategy to the large-scale cluster server.
As a further improvement of the present invention, the fault processing unit includes a fault monitoring engine, a fault early warning engine, and a fault recovery engine; the fault monitoring engine receives the fault instruction sent by the monitoring client after verification and sends it to the fault early warning engine and the fault recovery engine, and the fault recovery engine generates a fault processing policy and feeds it back to the fault monitoring engine.
As a further improvement of the invention, the large-scale cluster server comprises a plurality of physical machines, which are virtualized into a plurality of virtual machines having a distributed data structure.
As a further improvement of the present invention, the data acquisition unit further includes an administrator interface module, which is used to receive the fault feature item defined by initialization, and output the fault feature item to the fault feature library for storage.
As a further improvement of the present invention, the present invention further includes a Web client remotely connected to the fault processing unit and embedded in the visualization device, so as to create and display the operation status of each data node in the large-scale cluster server in real time, and a user can manually configure user configuration information through the Web client.
As a further improvement of the present invention, the user configuration information includes: fault monitoring strategy, fault early warning strategy, fault recovery strategy and user-defined fault characteristic item.
As a further improvement of the invention, the visualization device comprises a mobile phone and a personal computer.
As a further improvement of the invention, the fault feature library comprises a MySQL database.
Compared with the prior art, the invention has the beneficial effects that: the invention can automatically carry out data acquisition, early warning and fault resolution for faults of each node in the large-scale cluster server based on cloud computing, so that the stability and availability of the large-scale cluster server are improved.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
The present invention aims to implement unified state monitoring, fault detection, early warning, alarming and automatic recovery for the physical resources (including physical computing devices, physical storage devices, physical network devices and physical security devices) and virtual resources (including virtual computing devices, virtual storage devices, virtual network devices and virtual security devices) of a large-scale or super-large-scale cluster server in cloud computing through the monitoring management system 100 according to the specific embodiments described below, so as to ensure that the physical resources and the virtual resources of each data node in cloud computing are in a healthy and highly available state.
The first embodiment is as follows:
Please refer to the first embodiment of the monitoring management system based on cloud computing shown in Fig. 1, which discloses a monitoring management system 100 based on cloud computing, configured to monitor and manage the operation state of a large-scale cluster server 11 in cloud computing.
In the present embodiment, the monitoring management system 100 includes:
a data acquisition unit 10, comprising: a monitoring client 102 for collecting node data in the large-scale cluster server 11 in real time, and three monitoring databases 111, 112, 113 for storing the node data. Specifically, the monitoring client 102 in the data acquisition unit 10 acquires data such as CPU, disk I/O, port, process and DNS data from different network nodes, or from nodes of different network types, in real time through a virtual monitoring network (VMN), horizontally splits the data into three parts, and sends a storage access request; the three horizontally split parts are then stored in the monitoring databases 111, 112 and 113 respectively by calling the MySQL data interfaces of the monitoring databases 111, 112 and 113, ready for real-time access by the monitoring client 102.
In the present embodiment, the monitoring databases 111, 112, and 113 are each a MySQL database.
The operation of horizontally splitting the node data can be completed on the MySQL database side, which avoids the bottleneck of a single heavily loaded table holding a large or extremely large amount of data and keeps transaction processing relatively simple. The horizontally split data stored in the different monitoring databases 111, 112, 113 helps the monitoring client 102 to find abnormal changes in the data of each node in the large-scale cluster server 11 in time and, based on a large amount of historical data, to give early warning of small abnormal changes in each data node.
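For illustration only, the following Python sketch shows one possible way of routing a node's monitoring data to one of the three monitoring databases 111, 112, 113; the hash-based split rule and all names are assumptions for the sketch and are not taken from the present description.

    import hashlib

    MONITORING_DATABASES = ("monitor_db_111", "monitor_db_112", "monitor_db_113")

    def shard_for(node_id: str) -> str:
        """Map a node to one of the three monitoring databases (111, 112, 113)."""
        digest = hashlib.md5(node_id.encode("utf-8")).hexdigest()
        return MONITORING_DATABASES[int(digest, 16) % 3]

    def store_sample(node_id: str, metrics: dict) -> None:
        target = shard_for(node_id)
        # In the real system this step would call the MySQL data interface of the
        # selected monitoring database; here only the routing decision is shown.
        print("write to", target, ":", node_id, metrics)

    store_sample("compute-node-07", {"cpu": 0.42, "disk_io": 118, "port_80": "open"})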
In this embodiment, the monitoring management system 100 further includes: a fault feature library 103 and a fault processing unit 40.
Specifically, the fault feature library 103 is configured to define and store fault feature items, and the monitoring client 102 verifies the node data acquired in real time against the fault feature items in the fault feature library 103 to determine whether the node data indicates a fault; if so, it sends a fault instruction to the fault processing unit 40. The fault feature library 103 may be a MySQL database or an Oracle database, and is preferably a MySQL database.
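The verification described above may be understood through the following illustrative Python sketch, in which the schema of a fault feature item (metric, comparison, threshold) and the function names are hypothetical assumptions rather than part of the present description.

    import operator

    # Hypothetical fault feature items: (metric, comparison, threshold, fault type).
    FAULT_FEATURES = [
        {"metric": "cpu", "op": operator.gt, "threshold": 0.95, "fault": "CPU_OVERLOAD"},
        {"metric": "disk_io_wait", "op": operator.gt, "threshold": 0.50, "fault": "DISK_IO_STALL"},
        {"metric": "dns_ok", "op": operator.eq, "threshold": False, "fault": "DNS_FAILURE"},
    ]

    def send_fault_instruction(node_id, fault_type):
        # Stand-in for the fault instruction sent to the fault processing unit 40.
        print("fault instruction:", node_id, fault_type)

    def verify(node_id, sample):
        """Check real-time node data against the fault feature items."""
        for item in FAULT_FEATURES:
            value = sample.get(item["metric"])
            if value is not None and item["op"](value, item["threshold"]):
                send_fault_instruction(node_id, item["fault"])

    verify("compute-node-07", {"cpu": 0.98, "disk_io_wait": 0.10, "dns_ok": True})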
In this embodiment, the fault processing unit 40 is configured to respond to the fault instruction sent by the monitoring client 102, generate a fault processing policy, and send the fault processing policy to the large-scale cluster server 11.
Specifically, the fault processing unit 40 includes a fault monitoring engine 401, a fault early warning engine 402, and a fault recovery engine 403; the fault monitoring engine 401 receives the fault instruction sent by the monitoring client 102 after verification and sends it to the fault early warning engine 402 and the fault recovery engine 403, and the fault recovery engine 403 generates a fault handling policy and feeds it back to the fault monitoring engine 401.
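The message flow among the three engines can be summarized by the following illustrative Python sketch; the class and method names are assumptions, and only the described flow (monitoring engine to early warning and recovery engines, policy fed back to the monitoring engine) is taken from the text.

    class FaultRecoveryEngine:
        def generate_policy(self, instruction):
            # A handling policy could be, e.g., "switch to the standby network"
            # or "start a replica on a healthy computing node".
            return {"fault": instruction["type"], "action": "start_replica"}

    class FaultEarlyWarningEngine:
        def warn(self, instruction):
            print("early warning:", instruction["type"], "on", instruction["node"])

    class FaultMonitoringEngine:
        def __init__(self, warning_engine, recovery_engine):
            self.warning_engine = warning_engine
            self.recovery_engine = recovery_engine

        def on_fault_instruction(self, instruction):
            self.warning_engine.warn(instruction)                       # alarm / early warning
            policy = self.recovery_engine.generate_policy(instruction)  # build handling policy
            return policy  # fed back here, then sent on to the large-scale cluster server

    engine = FaultMonitoringEngine(FaultEarlyWarningEngine(), FaultRecoveryEngine())
    print(engine.on_fault_instruction({"type": "NETWORK_FAULT", "node": "compute-node-07"}))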
The large-scale cluster server 11 includes a plurality of physical machines 11b (PMs), which are virtualized into a plurality of virtual machines 11a (VMs) having a distributed data structure.
After the administrator defines a common fault and initializes the fault feature library 103, once the monitoring client 102 in the data acquisition unit 10 finds abnormal monitoring data, the monitoring client 102 may upload the fault information through the Virtual Monitoring Network (VMN) to the fault feature library 103 in the data acquisition unit 10 for verification.
Firstly, the mass data stored in the monitoring databases 111, 112, 113 is read and mined to obtain data such as the I/O, response time, communication rate and online rate of the abnormal monitoring item; this data is compared with normal data, and a deviation amount is calculated from the monitoring data. If the deviation amount exceeds a threshold value provided by the fault recovery engine 403, a fault point is counted;
Secondly, the Globally Unique Identifier (GUID) corresponding to the fault point is queried in the monitoring databases 111, 112, 113, and all possible fault types are found in the fault feature library 103 through the GUID.
Next, the fault recovery engine 403 troubleshoots the fault types found in the previous step one by one in an exploratory manner, taking a cloud server network interruption as an example. The possible fault types include: a network fault of the rack where the cloud server is located, a network fault of the computing node where the cloud server is located, a fault of the cloud server itself, and the like.
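The deviation check and fault-point counting described in the steps above may be illustrated by the following Python sketch; the relative-deviation formula is an assumption, since the present description only states that a deviation amount is compared with a threshold.

    def deviation(observed, baseline):
        """Relative deviation of an observed monitoring value from its normal baseline."""
        return abs(observed - baseline) / max(abs(baseline), 1e-9)

    def is_fault_point(observed, baseline, threshold):
        # If the deviation exceeds the threshold provided by the fault recovery
        # engine 403, the monitored item is counted as a fault point; its GUID is
        # then queried in the monitoring databases and used to look up all
        # possible fault types in the fault feature library 103.
        return deviation(observed, baseline) > threshold

    print(is_fault_point(observed=950, baseline=120, threshold=0.5))  # True -> fault point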
Next, an example of "network failure" and "disk failure" occurring in a certain node in the large-scale cluster server 11 will be described in detail.
When a "network failure" occurs, thefailure recovery engine 403 firstly pins the chassis gateway, if the network cannot be connected, it is determined that a network failure occurs in the chassis, and immediately starts a recovery measure: switching to a standby network and starting the standby network; and if the network is unobstructed, then the computing node where the Ping cloud server is located.
Similarly, if the network of the computing node where the cloud server is located cannot be reached, the node network is judged to be at fault and a recovery measure is started immediately: copies of all cloud servers on that computing node are copied to and started on computing nodes which are operating normally.
If the network of the computing node is unobstructed, the fault is judged to be a network fault of the cloud server itself, and a copy of the cloud server is immediately copied to and started on another computing node which is operating normally. In this way the fault type is determined by checking step by step, and the faulty link is recovered automatically. After the fault recovery task is completed, the monitoring client 102 verifies the fault processing, and if the processing is complete, the automatic fault recovery workflow ends.
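The exploratory checks for a "network failure" can be summarized by the following illustrative Python sketch, in which ping() is a stub rather than a real network probe and the returned recovery measures merely restate the steps above.

    UNREACHABLE = {"rack-gateway-3"}   # example scenario: the rack network is down

    def ping(host):
        """Stub standing in for a real ICMP probe of the host."""
        return host not in UNREACHABLE

    def recover_network_fault(rack_gateway, compute_node, cloud_server):
        if not ping(rack_gateway):
            return "rack network fault: switch to and start the standby network"
        if not ping(compute_node):
            return ("compute node network fault: copy and start replicas of all cloud "
                    "servers of that node on normally operating computing nodes")
        return ("network fault of cloud server " + cloud_server +
                ": copy and start its replica on another normally operating node")

    print(recover_network_fault("rack-gateway-3", "compute-node-07", "cloud-server-42"))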
When a "disk failure" occurs, the exploratory troubleshooting identifies the faulty disk. The GUID of the faulty disk is immediately read from the monitoring databases 111, 112, 113, the two copies of the data of the faulty disk stored on other storage servers are located, a normally operating storage server node with a low load rate is selected in the large-scale cluster server 11 according to the load rate, and the data is copied from the two copies to that storage server node.
After copying is completed, the VM table is searched in the fault feature library 103, the VM associated with the GUID of the faulty disk is found, the correspondence table between the VM and the disk is modified, the correspondence between the faulty disk and the VM is deleted, and the GUID of the new storage server is written. If other types of faults are encountered, the above process is executed repeatedly.
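The disk-failure recovery steps may be illustrated by the following Python sketch; the in-memory dictionaries are hypothetical stand-ins for the monitoring databases 111, 112, 113 and for the VM table in the fault feature library 103.

    REPLICAS = {"disk-guid-123": ["storage-srv-2", "storage-srv-5"]}   # two stored copies
    LOAD_RATE = {"storage-srv-1": 0.21, "storage-srv-4": 0.77, "storage-srv-9": 0.14}
    VM_DISK_TABLE = {"vm-8": "disk-guid-123"}                          # VM -> disk GUID

    def recover_disk_fault(failed_disk_guid):
        copies = REPLICAS[failed_disk_guid]                   # locate the two data copies
        target = min(LOAD_RATE, key=LOAD_RATE.get)            # healthy node with lowest load
        print("copy data of", failed_disk_guid, "from", copies, "to", target)
        for vm, disk in VM_DISK_TABLE.items():                # rewrite the VM/disk mapping
            if disk == failed_disk_guid:
                VM_DISK_TABLE[vm] = "guid-of-" + target       # delete old mapping, write new GUID
        return VM_DISK_TABLE

    print(recover_disk_fault("disk-guid-123"))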
In this embodiment, the monitoring management system 100 further includes a Web client 50 remotely connected to the fault processing unit 40 and embedded in the visualization device 501, so as to create and display the operating status of each data node in the large-scale cluster server 11 in real time, and the user configuration information can be manually configured through the Web client 50.
Specifically, the user configuration information includes: a fault monitoring strategy, a fault early warning strategy, a fault recovery strategy and user-defined fault feature items. In a preferred embodiment, the visualization device 501 is a mobile phone, further preferably a smartphone, which displays to the user in real time, through a 2G/3G/4G wireless network, information such as the operation status of each data node in the large-scale cluster server 11 presented by the fault processing unit 40.
Different fault handling and recovery modes are configured on the Web page provided by the fault recovery engine 403, and the fault recovery engine 403 recovers various faults according to the preset rules.
Through the Web page configuration provided by the fault recovery engine 403, a user who is deploying and using cloud services such as cloud servers, load balancing and relational data storage can log in to the fault processing unit 40 and, according to the requirements of the user's applications on service continuity and data consistency, perform personalized configuration of the fault monitoring frequency, the monitoring granularity, the monitoring items and the fault handling mode; the configuration information can be stored in the fault recovery engine 403 through an API.
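For illustration only, the user configuration information described above might take a shape similar to the following Python sketch; the field names and the save_config() call are assumptions, not an API defined in the present description.

    user_config = {
        "fault_monitoring_frequency_s": 30,       # how often node data is sampled
        "monitoring_granularity": "per-process",
        "monitoring_items": ["cpu", "disk_io", "ports", "processes", "dns"],
        "fault_handling_mode": "auto_recover",    # e.g. automatic recovery vs. alarm only
        "custom_fault_features": [
            {"metric": "response_time_ms", "op": ">", "threshold": 800},
        ],
    }

    def save_config(config):
        """Stand-in for the API that stores the configuration in the fault recovery engine 403."""
        print("saving configuration:", config)

    save_config(user_config)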
The second embodiment is as follows:
Please refer to Fig. 2, which shows another embodiment of the monitoring management system based on cloud computing according to the present invention. The main difference between this embodiment and the first embodiment is that, in this embodiment, the data acquisition unit 10 in the monitoring management system 100 further includes an administrator interface module 104, which is used to receive the fault feature items defined at initialization and to output them to the fault feature library 103 for storage.
Meanwhile, in this embodiment, the visualization device 501 is a personal computer, which may display to the user in real time, through other network connection methods such as WLAN, the Internet, or WAP, information such as the operation state of each node in the large-scale cluster server 11 presented by the fault processing unit 40.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
Furthermore, it should be understood that although the present description is set forth in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and those skilled in the art should take the description as a whole, since the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.