Disclosure of Invention
The invention aims to disclose a monitoring management system based on cloud computing, which monitors the nodes in cloud computing and automatically recovers from potential faults or faults that have already occurred, so that the stability and availability of a large-scale cluster server based on cloud computing are guaranteed.
In order to achieve the above object, the present invention provides a monitoring management system based on cloud computing, which is used for monitoring and managing the operation state of a large-scale cluster server in cloud computing, and includes:
a data acquisition unit, comprising: a monitoring client for collecting node data in the large-scale cluster server in real time, and three monitoring databases for storing the node data;
the monitoring management system further comprises:
a fault feature library and a fault processing unit; wherein,
the fault feature library is used for defining and storing fault feature items; the monitoring client verifies the node data acquired in real time against the fault feature items in the fault feature library to judge whether the node data indicates a fault, and if so, sends a fault instruction to the fault processing unit;
and the fault processing unit is used for responding to the fault instruction sent by the monitoring client, generating a fault processing strategy and sending the fault processing strategy to the large-scale cluster server.
As a further improvement of the present invention, the fault processing unit includes a fault monitoring engine, a fault early warning engine, and a fault recovery engine; the fault monitoring engine receives the fault instruction sent by the monitoring client after verification and sends it to the fault early warning engine and the fault recovery engine, and the fault recovery engine generates a fault processing policy and feeds it back to the fault monitoring engine.
As a further improvement of the invention, the large-scale cluster server comprises a plurality of physical machines, which are virtualized into a plurality of virtual machines having a distributed data structure.
As a further improvement of the present invention, the data acquisition unit further includes an administrator interface module, which is used to receive the fault feature item defined by initialization, and output the fault feature item to the fault feature library for storage.
As a further improvement of the present invention, the present invention further includes a Web client remotely connected to the fault processing unit and embedded in the visualization device, so as to create and display the operation status of each data node in the large-scale cluster server in real time, and a user can manually configure user configuration information through the Web client.
As a further improvement of the present invention, the user configuration information includes: fault monitoring strategy, fault early warning strategy, fault recovery strategy and user-defined fault characteristic item.
As a further improvement of the invention, the visualization device comprises a mobile phone and a personal computer.
As a further improvement of the invention, the fault feature library comprises a MySQL database.
Compared with the prior art, the invention has the beneficial effects that: the invention can automatically carry out data acquisition, early warning and fault resolution for faults of each node in the large-scale cluster server based on cloud computing, so that the stability and availability of the large-scale cluster server are improved.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
The present invention aims to implement unified state monitoring, fault detection, early warning, alarming and automatic recovery for the physical resources (including physical computing devices, physical storage devices, physical network devices and physical security devices) and virtual resources (including virtual computing devices, virtual storage devices, virtual network devices and virtual security devices) of a large-scale or super-large-scale cluster server in cloud computing through the monitoring management system 100 according to the specific embodiments described below, so as to ensure that the physical resources and the virtual resources of each data node in cloud computing are in a healthy and highly available state.
The first embodiment is as follows:
Please refer to the first embodiment of the monitoring management system based on cloud computing shown in Fig. 1, which discloses a monitoring management system 100 based on cloud computing, configured to monitor and manage the operation state of a large-scale cluster server 11 in cloud computing.
In the present embodiment, the monitoring management system 100 includes:
a data acquisition unit 10, comprising: a monitoring client 102 for collecting node data in the large-scale cluster server 11 in real time, and three monitoring databases 111, 112, 113 for storing the node data. Specifically, the monitoring client 102 in the data acquisition unit 10 acquires data such as CPU, disk I/O, port, process and DNS data from different network nodes, or from nodes of different network types, in real time through a virtual monitoring network (VMN), horizontally splits the data into three parts, and sends a storage access request; the three horizontally split parts are then stored in the monitoring databases 111, 112 and 113 respectively by calling the MySQL data interfaces of the monitoring databases 111, 112 and 113, ready for real-time access by the monitoring client 102.
In the present embodiment, the monitoring databases 111, 112, and 113 are each a MySQL database.
The operation of horizontally splitting the node data can be completed on the MySQL database side, which avoids the bottleneck of a single heavily loaded table holding a large or extremely large amount of data and keeps transaction processing relatively simple. The horizontally split data stored in the different monitoring databases 111, 112, 113 helps the monitoring client 102 to find abnormal changes in the data of each node in the large-scale cluster server 11 in time and, based on a large amount of historical data, to give early warning of small abnormal changes in each data node.
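For illustration only, the following Python sketch shows one possible way of routing a node's monitoring data to one of the three monitoring databases 111, 112, 113; the hash-based split rule and all names are assumptions for the sketch and are not taken from the present description.

    import hashlib

    MONITORING_DATABASES = ("monitor_db_111", "monitor_db_112", "monitor_db_113")

    def shard_for(node_id: str) -> str:
        """Map a node to one of the three monitoring databases (111, 112, 113)."""
        digest = hashlib.md5(node_id.encode("utf-8")).hexdigest()
        return MONITORING_DATABASES[int(digest, 16) % 3]

    def store_sample(node_id: str, metrics: dict) -> None:
        target = shard_for(node_id)
        # In the real system this step would call the MySQL data interface of the
        # selected monitoring database; here only the routing decision is shown.
        print("write to", target, ":", node_id, metrics)

    store_sample("compute-node-07", {"cpu": 0.42, "disk_io": 118, "port_80": "open"})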
In this embodiment, the monitoring management system 100 further includes: a fault feature library 103 and a fault processing unit 40.
Specifically, the fault feature library 103 is configured to define and store fault feature items, and the monitoring client 102 verifies the node data acquired in real time against the fault feature items in the fault feature library 103 to determine whether the node data indicates a fault; if so, it sends a fault instruction to the fault processing unit 40. The fault feature library 103 may be a MySQL database or an Oracle database, and is preferably a MySQL database.
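The verification described above may be understood through the following illustrative Python sketch, in which the schema of a fault feature item (metric, comparison, threshold) and the function names are hypothetical assumptions rather than part of the present description.

    import operator

    # Hypothetical fault feature items: (metric, comparison, threshold, fault type).
    FAULT_FEATURES = [
        {"metric": "cpu", "op": operator.gt, "threshold": 0.95, "fault": "CPU_OVERLOAD"},
        {"metric": "disk_io_wait", "op": operator.gt, "threshold": 0.50, "fault": "DISK_IO_STALL"},
        {"metric": "dns_ok", "op": operator.eq, "threshold": False, "fault": "DNS_FAILURE"},
    ]

    def send_fault_instruction(node_id, fault_type):
        # Stand-in for the fault instruction sent to the fault processing unit 40.
        print("fault instruction:", node_id, fault_type)

    def verify(node_id, sample):
        """Check real-time node data against the fault feature items."""
        for item in FAULT_FEATURES:
            value = sample.get(item["metric"])
            if value is not None and item["op"](value, item["threshold"]):
                send_fault_instruction(node_id, item["fault"])

    verify("compute-node-07", {"cpu": 0.98, "disk_io_wait": 0.10, "dns_ok": True})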
In this embodiment, the fault processing unit 40 is configured to respond to the fault instruction sent by the monitoring client 102, generate a fault processing policy, and send the fault processing policy to the large-scale cluster server 11.
Specifically, the fault processing unit 40 includes a fault monitoring engine 401, a fault early warning engine 402, and a fault recovery engine 403; the fault monitoring engine 401 receives the fault instruction sent by the monitoring client 102 after verification and sends it to the fault early warning engine 402 and the fault recovery engine 403, and the fault recovery engine 403 generates a fault handling policy and feeds it back to the fault monitoring engine 401.
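The message flow among the three engines can be summarized by the following illustrative Python sketch; the class and method names are assumptions, and only the described flow (monitoring engine to early warning and recovery engines, policy fed back to the monitoring engine) is taken from the text.

    class FaultRecoveryEngine:
        def generate_policy(self, instruction):
            # A handling policy could be, e.g., "switch to the standby network"
            # or "start a replica on a healthy computing node".
            return {"fault": instruction["type"], "action": "start_replica"}

    class FaultEarlyWarningEngine:
        def warn(self, instruction):
            print("early warning:", instruction["type"], "on", instruction["node"])

    class FaultMonitoringEngine:
        def __init__(self, warning_engine, recovery_engine):
            self.warning_engine = warning_engine
            self.recovery_engine = recovery_engine

        def on_fault_instruction(self, instruction):
            self.warning_engine.warn(instruction)                       # alarm / early warning
            policy = self.recovery_engine.generate_policy(instruction)  # build handling policy
            return policy  # fed back here, then sent on to the large-scale cluster server

    engine = FaultMonitoringEngine(FaultEarlyWarningEngine(), FaultRecoveryEngine())
    print(engine.on_fault_instruction({"type": "NETWORK_FAULT", "node": "compute-node-07"}))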
The large-scale cluster server 11 includes a plurality of physical machines 11b (PMs), which are virtualized into a plurality of virtual machines 11a (VMs) having a distributed data structure.
After the administrator defines a common fault and initializes the fault feature library 103, once the monitoring client 102 in the data acquisition unit 10 finds abnormal monitoring data, the monitoring client 102 may upload the fault information through the Virtual Monitoring Network (VMN) to the fault feature library 103 in the data acquisition unit 10 for verification.
Firstly, the mass data stored in the monitoring databases 111, 112, 113 is read and mined to obtain data such as the I/O, response time, communication rate and online rate of the abnormal monitoring item; this data is compared with normal data, and a deviation amount is calculated from the monitoring data. If the deviation amount exceeds a threshold value provided by the fault recovery engine 403, a fault point is counted;
Secondly, the Globally Unique Identifier (GUID) corresponding to the fault point is queried in the monitoring databases 111, 112, 113, and all possible fault types are found in the fault feature library 103 through the GUID.
Next, the fault recovery engine 403 troubleshoots the fault types found in the previous step one by one in an exploratory manner, taking a cloud server network interruption as an example. The possible fault types include: a network fault of the rack where the cloud server is located, a network fault of the computing node where the cloud server is located, a fault of the cloud server itself, and the like.
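The deviation check and fault-point counting described in the steps above may be illustrated by the following Python sketch; the relative-deviation formula is an assumption, since the present description only states that a deviation amount is compared with a threshold.

    def deviation(observed, baseline):
        """Relative deviation of an observed monitoring value from its normal baseline."""
        return abs(observed - baseline) / max(abs(baseline), 1e-9)

    def is_fault_point(observed, baseline, threshold):
        # If the deviation exceeds the threshold provided by the fault recovery
        # engine 403, the monitored item is counted as a fault point; its GUID is
        # then queried in the monitoring databases and used to look up all
        # possible fault types in the fault feature library 103.
        return deviation(observed, baseline) > threshold

    print(is_fault_point(observed=950, baseline=120, threshold=0.5))  # True -> fault point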
Next, an example of "network failure" and "disk failure" occurring in a certain node in the large-scale cluster server 11 will be described in detail.
When a "network failure" occurs, thefailure recovery engine 403 firstly pins the chassis gateway, if the network cannot be connected, it is determined that a network failure occurs in the chassis, and immediately starts a recovery measure: switching to a standby network and starting the standby network; and if the network is unobstructed, then the computing node where the Ping cloud server is located.
Similarly, if the network of the computing node where the cloud server is located cannot be reached, the node network is judged to be at fault and a recovery measure is started immediately: copies of all cloud servers on that computing node are copied to and started on computing nodes which are operating normally.
If the network of the computing node is unobstructed, the fault is judged to be a network fault of the cloud server itself, and a copy of the cloud server is immediately copied to and started on another computing node which is operating normally. In this way the fault type is determined by checking step by step, and the faulty link is recovered automatically. After the fault recovery task is completed, the monitoring client 102 verifies the fault processing, and if the processing is complete, the automatic fault recovery workflow ends.
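The exploratory checks for a "network failure" can be summarized by the following illustrative Python sketch, in which ping() is a stub rather than a real network probe and the returned recovery measures merely restate the steps above.

    UNREACHABLE = {"rack-gateway-3"}   # example scenario: the rack network is down

    def ping(host):
        """Stub standing in for a real ICMP probe of the host."""
        return host not in UNREACHABLE

    def recover_network_fault(rack_gateway, compute_node, cloud_server):
        if not ping(rack_gateway):
            return "rack network fault: switch to and start the standby network"
        if not ping(compute_node):
            return ("compute node network fault: copy and start replicas of all cloud "
                    "servers of that node on normally operating computing nodes")
        return ("network fault of cloud server " + cloud_server +
                ": copy and start its replica on another normally operating node")

    print(recover_network_fault("rack-gateway-3", "compute-node-07", "cloud-server-42"))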
When a "disk failure" occurs, the exploratory troubleshooting identifies the faulty disk. The GUID of the faulty disk is immediately read from the monitoring databases 111, 112, 113, the two copies of the data of the faulty disk stored on other storage servers are located, a normally operating storage server node with a low load rate is selected in the large-scale cluster server 11 according to the load rate, and the data is copied from the two copies to that storage server node.
After copying is completed, the VM table is searched in the fault feature library 103, the VM associated with the GUID of the faulty disk is found, the correspondence table between the VM and the disk is modified, the correspondence between the faulty disk and the VM is deleted, and the GUID of the new storage server is written. If other types of faults are encountered, the above process is executed repeatedly.
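The disk-failure recovery steps may be illustrated by the following Python sketch; the in-memory dictionaries are hypothetical stand-ins for the monitoring databases 111, 112, 113 and for the VM table in the fault feature library 103.

    REPLICAS = {"disk-guid-123": ["storage-srv-2", "storage-srv-5"]}   # two stored copies
    LOAD_RATE = {"storage-srv-1": 0.21, "storage-srv-4": 0.77, "storage-srv-9": 0.14}
    VM_DISK_TABLE = {"vm-8": "disk-guid-123"}                          # VM -> disk GUID

    def recover_disk_fault(failed_disk_guid):
        copies = REPLICAS[failed_disk_guid]                   # locate the two data copies
        target = min(LOAD_RATE, key=LOAD_RATE.get)            # healthy node with lowest load
        print("copy data of", failed_disk_guid, "from", copies, "to", target)
        for vm, disk in VM_DISK_TABLE.items():                # rewrite the VM/disk mapping
            if disk == failed_disk_guid:
                VM_DISK_TABLE[vm] = "guid-of-" + target       # delete old mapping, write new GUID
        return VM_DISK_TABLE

    print(recover_disk_fault("disk-guid-123"))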
In this embodiment, the monitoring management system 100 further includes a Web client 50 remotely connected to the fault processing unit 40 and embedded in the visualization device 501, so as to create and display the operating status of each data node in the large-scale cluster server 11 in real time, and the user configuration information can be manually configured through the Web client 50.
Specifically, the user configuration information includes: a fault monitoring strategy, a fault early warning strategy, a fault recovery strategy and user-defined fault feature items. In a preferred embodiment, the visualization device 501 is a mobile phone, further preferably a smartphone, which displays to the user in real time, through a 2G/3G/4G wireless network, information such as the operation status of each data node in the large-scale cluster server 11 presented by the fault processing unit 40.
Different fault handling and recovery modes are configured on the Web page provided by the fault recovery engine 403, and the fault recovery engine 403 recovers various faults according to the preset rules.
Through the Web page configuration provided by the fault recovery engine 403, a user who is deploying and using cloud services such as cloud servers, load balancing and relational data storage can log in to the fault processing unit 40 and, according to the requirements of the user's applications on service continuity and data consistency, perform personalized configuration of the fault monitoring frequency, the monitoring granularity, the monitoring items and the fault handling mode; the configuration information can be stored in the fault recovery engine 403 through an API.
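For illustration only, the user configuration information described above might take a shape similar to the following Python sketch; the field names and the save_config() call are assumptions, not an API defined in the present description.

    user_config = {
        "fault_monitoring_frequency_s": 30,       # how often node data is sampled
        "monitoring_granularity": "per-process",
        "monitoring_items": ["cpu", "disk_io", "ports", "processes", "dns"],
        "fault_handling_mode": "auto_recover",    # e.g. automatic recovery vs. alarm only
        "custom_fault_features": [
            {"metric": "response_time_ms", "op": ">", "threshold": 800},
        ],
    }

    def save_config(config):
        """Stand-in for the API that stores the configuration in the fault recovery engine 403."""
        print("saving configuration:", config)

    save_config(user_config)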
The second embodiment is as follows:
Please refer to Fig. 2, which shows another embodiment of the monitoring management system based on cloud computing according to the present invention. The main difference between this embodiment and the first embodiment is that, in this embodiment, the data acquisition unit 10 in the monitoring management system 100 further includes an administrator interface module 104, which is used to receive the fault feature items defined at initialization and to output them to the fault feature library 103 for storage.
Meanwhile, in this embodiment, the visualization device 501 is a personal computer, which may display to the user in real time, through other network connection methods such as WLAN, the Internet, or WAP, information such as the operation state of each node in the large-scale cluster server 11 presented by the fault processing unit 40.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
Furthermore, it should be understood that although the present description is set forth in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and those skilled in the art should take the description as a whole, since the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.