BACKGROUND
1. Field of the Invention
Embodiments of the present invention generally relate to cluster management and, more particularly, to a method and apparatus for proactively monitoring application health data to achieve workload management and high availability.
2. Description of the Related Art
A computing environment may include a computer cluster (e.g., a plurality of client computers coupled to a server computer or a plurality of server computers (i.e., a peer-to-peer cluster)) that hosts multiple applications. The applications generally depend on system resources, such as software resources (e.g., operating system (OS), device drivers and/or the like) and hardware resources (e.g., storage resources, processors and/or the like), in order to provide services to the client computers. In addition, the system resources are shared dynamically across the various applications in the computing environment. In operation, dependencies on the system resources may introduce a failure, which affects application performance within the cluster. For example, on occurrence of a failure and/or non-availability of computer resources, a particular application may become non-responsive and/or terminate.
Generally, a system administrator of the computing environment desires that the applications run continuously and/or uninterruptedly. For example, the system administrator monitors a status of the particular application using clustering software and/or health monitoring software. However, the status does not indicate application health of the particular application. Hence, the clustering software and/or the health monitoring software are not cluster-aware and cannot provide corrective measures that leverage a clustering infrastructure to ensure that the particular application is performing optimally.
Currently, the clustering software and/or the health monitoring software do not ascertain a cause of the failures in the clustering environment. If a failure occurs, the clustering software employs a single static (i.e., pre-defined) priority list of nodes to which a particular application can failover. However, such a static list does not account for a cause of the failure. As such, the static priority list may indicate a target node that is also affected by the failure and is therefore not suitable for operating the particular application. For example, the static priority list may indicate that the particular application is to failover to a Node1, a Node2 or a Node3 in the stated order. Furthermore, the Node1 and the Node2 share a router that the Node3 does not use. If the particular application is operating on the Node1 when the router fails, the particular application will failover to the Node2 even though the failure most likely affects the Node2 as well as the Node1 but does not affect the Node3. As such, the Node3 is a better choice but is not selected because the static priority list does not account for the cause of the failure.
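By way of illustration and not limitation, the router example above may be sketched in Python as follows. The router names, the `ROUTER_OF` mapping and both function names are hypothetical and are introduced only to contrast a static priority list with a selection that accounts for the cause of the failure.

```python
# Static priority list from the example: Node1, then Node2, then Node3.
STATIC_PRIORITY = ["Node1", "Node2", "Node3"]

# Assumed topology (hypothetical names): Node1 and Node2 share a router
# that Node3 does not use.
ROUTER_OF = {"Node1": "RouterA", "Node2": "RouterA", "Node3": "RouterB"}


def static_failover(active, priority=STATIC_PRIORITY):
    """Return the first listed node other than the active node."""
    for node in priority:
        if node != active:
            return node
    return None


def cause_aware_failover(active, failed_router, priority=STATIC_PRIORITY):
    """Return the first listed node that does not depend on the failed router."""
    for node in priority:
        if node != active and ROUTER_OF[node] != failed_router:
            return node
    return None


# The application runs on Node1 when the shared router fails.
static_choice = static_failover("Node1")
aware_choice = cause_aware_failover("Node1", "RouterA")
```

In this sketch the static list yields the Node2, which shares the failed router, whereas the selection that considers the cause of the failure yields the Node3.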
Accordingly, there is a need in the art for a method and apparatus for proactively monitoring application health data to achieve workload management and high availability.
SUMMARY OF THE INVENTION
Embodiments of the present invention comprise a method and apparatus for proactively monitoring application health data to achieve workload management and high availability. In one embodiment, a method for processing application health data to improve application performance within a cluster includes accessing at least one of performance data or event information associated with at least one application to determine application health data and examining the application health data to identify an application of the at least one application to migrate.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIG. 1 is a block diagram of a system for proactively monitoring application health data to achieve workload management and high availability according to one or more embodiments;
FIG. 2 is a functional block diagram of a proactive monitoring system for examining application health data to improve application performance within a cluster according to one or more embodiments;
FIG. 3 is a functional block diagram of an event-based monitoring system for processing event information to improve application performance within a cluster according to one or more embodiments;
FIG. 4 is a functional block diagram of a performance-based monitoring system for processing performance data to improve application performance within a cluster according to one or more embodiments;
FIG. 5 is a functional block diagram of a cluster architecture for processing application health data to improve application performance within a three-node cluster in accordance with dynamic priority list information according to one or more embodiments; and
FIG. 6 is a flow diagram of a method for processing application health data to improve application performance within a cluster according to one or more embodiments.
DETAILED DESCRIPTION
FIG. 1 is a block diagram of a system 100 for proactively monitoring application health data to achieve workload management and high availability according to various embodiments. The system 100 comprises a server 102 and a node 104 that are coupled to each other through a network 106. It is appreciated that the system 100 may comprise a plurality of nodes that provide application services to one or more client computers. Alternatively, the system 100 may include a peer-to-peer cluster.
The server 102 is a type of computing device (e.g., a laptop, a desktop, a Personal Digital Assistant (PDA) and/or the like), such as those generally known in the art. The server 102 includes a Central Processing Unit (CPU) 118, various support circuits 120 and a memory 122. The CPU 118 may comprise one or more commercially available microprocessors or microcontrollers that facilitate data processing and storage. The various support circuits 120 facilitate the operation of the CPU 118 and include one or more clock circuits, power supplies, cache, input/output circuits, and the like. The memory 122 comprises at least one of Read Only Memory (ROM), Random Access Memory (RAM), disk drive storage, optical storage, removable storage and/or the like. The memory 122 includes various software packages, such as clustering software 124. The clustering software 124 may be high availability clustering server software. Furthermore, the memory 122 includes various data, such as dynamic priority list information 126.
The node 104 is a type of computing device (e.g., a laptop, a desktop, a Personal Digital Assistant (PDA) and/or the like), such as those generally known in the art. Generally, the node 104 includes various computing resources, such as application resources, replication resources, database resources, network resources, storage resources and/or the like. The node 104 includes a Central Processing Unit (CPU) 108, various support circuits 110 and a memory 112. The CPU 108 may comprise one or more commercially available microprocessors or microcontrollers that facilitate data processing and storage. The various support circuits 110 facilitate the operation of the CPU 108 and include one or more clock circuits, power supplies, cache, input/output circuits, and the like. The memory 112 comprises at least one of Read Only Memory (ROM), Random Access Memory (RAM), disk drive storage, optical storage, removable storage and/or the like. The memory 112 includes various software packages, such as an agent 114. Furthermore, the agent 114 includes a monitor 116 as a subcomponent.
According to one or more embodiments, the node 104 provides one or more application services to an end user (e.g., a client computer), where each application service depends upon various computing resources. For example, an application service may be a database service, which depends on one or more computer resources, such as network resources (e.g., Virtual IP addresses, network interface cards (NIC) and/or the like), storage resources (e.g., physical disks, magnetic tape drives and/or the like), software resources (e.g., operating system processes, application processes and/or the like), file system resources (e.g., mounted volumes, network shared partitions and/or the like) and/or the like. During ordinary cluster operations, a failure may occur at the one or more computer resources at the node 104. For example, one or more hard disks at the node 104 may be operating poorly due to various failures, malfunctions and/or errors and therefore, may affect the performance of the application.
The network 106 comprises a communication system that connects computers by wire, cable, fiber optic and/or wireless link facilitated by various types of well-known network elements, such as hubs, switches, routers, and the like. The network 106 may employ various well-known protocols to communicate information amongst the network resources. For example, the network 106 may be a part of the Internet or an intranet using various communications infrastructure such as Ethernet, WiFi, WiMax, General Packet Radio Service (GPRS), and the like.
In one embodiment, the agent 114 provides an interface to the clustering software 124 for monitoring and/or controlling access to the computer resources. The agent 114 may be a replication resource agent, an application resource agent, a database resource agent and/or the like. For example, the clustering software 124 deploys a replication agent to the node 104 in order to monitor and/or control access to the replication resources. According to one embodiment, the agent 114 monitors event logs and/or performance counters and accesses the event information and/or the performance data. In another embodiment, the agent 114 uses the monitor 116 to access the event logs and/or the performance counters and extract the event information and/or the performance data associated with one or more clustered applications running on the node 104.
According to one or more embodiments, the clustering software 124 includes software code for providing high availability to the one or more applications running on the node 104. Generally, the clustering software 124 monitors and/or controls a status of various computers, such as the server 102, the node 104 and one or more additional nodes within the cluster. In one embodiment, the clustering software 124 monitors the online and/or offline status of the computer resources. Accordingly, the clustering software 124 controls the computer resources. For example, if a computer resource is not online, the clustering software 124 brings that computer resource online for the application.
According to one or more embodiments, the dynamic priority list information 126 includes one or more dynamic priority lists for selecting a target node for operating an application based on the application health data. For example, the application is failed over to the target node. Each dynamic priority list within the dynamic priority list information 126 may be used to identify one or more target nodes for operating the application. As explained further below, a dynamic priority list from the dynamic priority list information 126 may be selected based on the event information and/or the performance data. In other words, the dynamic priority list information 126 includes a dynamic priority list that corresponds with a particular event. In one embodiment, the clustering software 124 selects a dynamic priority list from the dynamic priority list information 126 that corresponds to a failure, such as a Fiber Switch Down event.
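As a non-limiting sketch, the selection of a dynamic priority list that corresponds with a particular event may be expressed in Python as shown below. The event keys, the node names, the default list and the `select_target` function are hypothetical; the description does not specify how the dynamic priority list information is stored or queried.

```python
# Hypothetical dynamic priority lists, one per event of interest.
DYNAMIC_PRIORITY_LISTS = {
    "FIBER_SWITCH_DOWN": ["NodeC", "NodeB", "NodeA"],
    "OUT_OF_MEMORY": ["NodeB", "NodeC", "NodeA"],
}

# Assumed fallback list used when no list corresponds with the event.
DEFAULT_PRIORITY_LIST = ["NodeA", "NodeB", "NodeC"]


def select_target(event, active_node, online_nodes):
    """Pick the highest-priority online node from the list matching the event."""
    priority = DYNAMIC_PRIORITY_LISTS.get(event, DEFAULT_PRIORITY_LIST)
    for node in priority:
        if node != active_node and node in online_nodes:
            return node
    return None  # no suitable target node is available


online = {"NodeA", "NodeB", "NodeC"}
target = select_target("FIBER_SWITCH_DOWN", "NodeA", online)
```

Keying the lists by event allows the same active node to fail over to different target nodes depending on which failure was observed.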
Generally, the application provides the event information and/or the performance data through event logs and/or performance counters, respectively. The event information includes logged events that indicate application health. The event information may indicate a failure of node resources. Additionally, the event information may indicate non-availability of computer resources for the application. The performance data includes performance counter values that indicate application health. The performance data may indicate one or more poorly functioning computer resources. The application health data 117 includes the event information in the form of error messages, warning messages and/or the like.
The clustering software 124 may notify the administrator to perform a necessary action that optimizes the application performance in the clustering computing environment. The administrator may be alerted through a user interface for the clustering software 124 or through a notification service. For example, on being notified, the administrator may switch the application over to another node. Alternatively, the clustering software 124 automatically performs the necessary action depending on the extracted information, and thus optimizes the application performance in the clustering computing environment.
In one embodiment, the application health data 117 indicates states of various computing resources associated with the node 104. In one embodiment, the monitor 116 accesses performance data and/or event information to determine the application health data 117 as explained further below. Furthermore, the monitor 116 examines the application health data 117 to identify an application to migrate to a target node. Accordingly, the monitor 116 communicates the application health data 117 to the clustering software 124. As such, the clustering software 124 may failover or migrate the application to a target node of the one or more target nodes to ensure high availability of the application in accordance with the dynamic priority list information 126 as explained further below.
FIG. 2 is a functional block diagram of a proactive monitoring system 200 for examining application health data 117 to improve application performance within a cluster according to one or more embodiments of the present invention. The proactive monitoring system 200 improves a performance of an application 202 using event information and/or performance data.
Various computing resources are configured for the application 202, such as a Central Processing Unit (CPU) 204, a network 206, a disk 208 and a memory 210. In one embodiment, one or more capacities of the various computing resources are allocated to the application 202. The disk 208 generally includes one or more data storage devices (e.g., hard disk drives, optical drives and/or the like) and storage management software that provides an interface to the one or more data storage devices for the application 202.
According to one or more embodiments, the proactive monitoring system 200 includes the agent 114, the monitor 116 and the clustering software 124. The application 202 depends on the computing resources, such as the CPU 204, the network 206 and/or the like. Additionally, the application 202 utilizes the computing resources for providing application services to an end user. According to one or more embodiments, the application 202 typically logs events that are raised during execution to generate the event log 212. As such, the event information helps in proactively monitoring health of the application 202.
Generally, an application vendor provides information associated with the events of the application 202. For example, a MICROSOFT SQL Server application may raise an event during execution, such as MSSQLSERVER_823. According to the vendor information, this event indicates that an error has occurred in the MS SQL Server. Such errors may be caused by computer resource failures, such as device driver failures, lower-layer software failures and/or the like. Accordingly, the monitor 116 compares the raised event with events that indicate application health to determine the application health data 117. If the monitor 116 determines that the raised event indicates poor application health, then the application 202 is to be migrated to a target node. In one embodiment, the clustering software 124 selects the target node from one or more target nodes in accordance with the dynamic priority list information 126 as explained further below.
According to various embodiments, performance data is stored in the performance counters 214. Additionally, the application vendor provides threshold values of the performance counters. For example, a threshold value for MICROSOFT EXCHANGE 2003 performance counter “MSExchangeIS\Exchmem: Number of heaps with memory errors” is zero. Such a performance counter indicates the total number of Exchmem heaps with failed allocations due to insufficient available memory. In one embodiment, the monitor 116 accesses current performance data values regarding various computer resources, such as the CPU 204, the network 206, the disk 208, the memory 210 and/or the like, stored in the performance counters 214. For example, the monitor 116 compares the current performance data values with the threshold values to determine the application health data 117. If the monitor 116 determines that one or more current performance data values deviate from the threshold values, the application 202 is to be migrated to another node (i.e., redundant system) in order to ensure optimal application performance.
According to one or more embodiments, the monitor 116 identifies optimal application capacities for various computing resources using the performance data from the performance counters 214. For example, the proactive monitoring system 200 operates a MICROSOFT EXCHANGE application in which the monitor 116 accesses one or more performance counters “Processor\% Processor Time” and “System\Processor Queue Length”. Based on such performance counters, the monitor 116 determines processor utilization data for the CPU 204. Furthermore, the monitor 116 accesses one or more additional performance counters, such as “Process(STORE)\% Processor Time”, “Process(inetinfo)\% Processor Time”, “Process(EMSMTA)\% Processor Time”, to compute processor utilization data for the MICROSOFT EXCHANGE application.
Accordingly, in response to the application health data 117, the clustering software 124 may notify the administrator to take a corrective action. Optionally, the clustering software 124 automatically performs the corrective action and thereby optimizes the application performance and ensures high availability in the computer cluster environment. The clustering software 124 may identify one or more target nodes for operating the application in accordance with the dynamic priority list information 126. In one embodiment, the clustering software 124 selects a dynamic priority list that corresponds with a particular event within the event log 212. The clustering software 124 selects a target node of the one or more target nodes having a highest priority. Then, the clustering software 124 migrates the application 202 to the target node.
FIG. 3 is a functional block diagram of an event-based monitoring system 300 for processing event information to improve application performance within a cluster according to one or more embodiments. The event-based monitoring system 300 includes a monitoring database 302, an event log 304, an event listener 306, and an event processor 308. In one embodiment, the monitoring database 302, the event listener 306 and the event processor 308 form a monitor (e.g., the monitor 116 of FIG. 1) that is a subcomponent of an agent (e.g., the agent 114 of FIG. 1).
According to one or more embodiments, the monitoring database 302 defines one or more events that indicate application health. In one embodiment, the monitoring database 302 includes one or more warnings, failures and/or errors, which may be exposed during execution of an application (e.g., the application 202 of FIG. 2). The one or more events may indicate non-availability of required computing resources. For example, the application, such as MICROSOFT SQL Server, may raise a “The server is out of memory” event because of non-availability of one or more required memory resources. Furthermore, the monitoring database 302 may include one or more statically analyzed events. The one or more statically analyzed events may be stored in one or more files in an Extensible Markup Language (XML) format. In one embodiment, the one or more files are editable. For example, an administrator may add a new event for proactively monitoring application health to the monitoring database 302.
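For illustration only, an editable XML file of monitored events might be read and extended as sketched below in Python using the standard `xml.etree.ElementTree` module. The element names, the attribute names and the `DISK_IO_WARNING` and `OUT_OF_MEMORY` identifiers are assumptions; the description states only that the events may be stored in XML and that an administrator may add new events.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: one <event> element per monitored event.
MONITORING_XML = """<monitoring>
  <event id="MSSQLSERVER_823" severity="error"/>
  <event id="OUT_OF_MEMORY" severity="error"/>
</monitoring>"""


def load_events(root):
    """Map each monitored event identifier to its severity."""
    return {e.get("id"): e.get("severity") for e in root.iter("event")}


def add_event(root, event_id, severity):
    """Add a new monitored event, as an administrator might."""
    ET.SubElement(root, "event", {"id": event_id, "severity": severity})


root = ET.fromstring(MONITORING_XML)
before = load_events(root)

add_event(root, "DISK_IO_WARNING", "warning")
after = load_events(root)

# Serialize the edited database back to XML text.
updated_xml = ET.tostring(root, encoding="unicode")
```

Keeping the definitions in a data file rather than in code is what would allow events to be added without changing the monitor itself.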
According to one or more embodiments, the application logs events raised during execution to the event log 304. Generally, the logged events may be warnings, errors, failures and/or the like. Every time the application logs an event in the event log 304, the event listener 306 accesses and analyzes the logged event.
In one embodiment, the event listener 306 compares the logged event with the events defined in the monitoring database 302 to determine application health data. If the logged event matches an event that indicates application health, the event listener 306 caches the logged event as application health data. For example, the cached event may indicate that the application is starving for computer resources and therefore, performing poorly. In other words, the application is unable to utilize a required and/or minimal amount of a particular computer resource.
In one embodiment, the event processor 308 examines one or more cached events after a monitoring cycle (e.g., a user-defined time period) in a chronological order. As such, the one or more cached events reflect various states of computer resources. For example, the cached events may indicate one or more device driver errors at a particular node that are disrupting performance of a SQL Server application. Hence, the SQL Server application is starving for computer resources (e.g., CPU/memory resources). For instance, if a required load of a particular computer resource exceeds an available capacity due to various errors (e.g., device driver errors, lower subsystem errors, operating system errors and/or the like), the SQL Server application may still be alive but cannot function properly.
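The listener and processor behavior described above may be sketched, under stated assumptions, in Python as follows. The class and function names, the timestamps, and the `OUT_OF_MEMORY` and `INFO_STARTUP` identifiers are hypothetical; only MSSQLSERVER_823 and the "The server is out of memory" message appear in the description.

```python
from dataclasses import dataclass, field

# Events treated as indicating application health (assumed identifiers,
# except MSSQLSERVER_823).
MONITORED_EVENTS = {"MSSQLSERVER_823", "OUT_OF_MEMORY"}


@dataclass(frozen=True)
class LoggedEvent:
    timestamp: float  # seconds since an arbitrary epoch
    event_id: str
    message: str


@dataclass
class EventListener:
    monitored: set
    cache: list = field(default_factory=list)

    def on_event(self, event):
        """Cache the event if it matches a monitored event."""
        if event.event_id in self.monitored:
            self.cache.append(event)
            return True
        return False


def process_cycle(listener):
    """At the end of a monitoring cycle, return cached events oldest first."""
    ordered = sorted(listener.cache, key=lambda e: e.timestamp)
    listener.cache.clear()
    return ordered


listener = EventListener(MONITORED_EVENTS)
listener.on_event(LoggedEvent(20.0, "MSSQLSERVER_823", "I/O error"))
listener.on_event(LoggedEvent(15.0, "INFO_STARTUP", "service started"))
listener.on_event(LoggedEvent(10.0, "OUT_OF_MEMORY", "The server is out of memory"))

cycle_events = process_cycle(listener)
needs_migration = bool(cycle_events)  # simplistic policy assumed for this sketch
```

Here any cached event marks the application as a migration candidate; an actual embodiment could apply a richer policy to the chronologically ordered events.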
In one embodiment, the event processor 308 examines the one or more cached events to identify an application to migrate to a target node. For example, one or more target nodes may be redundant systems configured to a SQL Server application service group. Accordingly, the one or more target nodes may or may not be suitable for operating the SQL Server application. In one embodiment, clustering software selects the target node of the one or more target nodes in accordance with a dynamic priority list based on the one or more cached events as explained further below. For example, some of the one or more target nodes may be susceptible to the one or more device driver errors and are not to be selected as the target node for operating the SQL Server application.
In one embodiment, the event processor 308 may determine that the one or more device driver errors reflect poor application health. The SQL Server application service group is not able to function despite being alive and thus, is to be migrated to the target node of the one or more target nodes in order to improve application health. After the migration, the SQL Server application service group will most likely operate properly. Alternatively, the event processor 308 determines that one or more low-priority application service groups are to be migrated to the target node in order to increase available capacities of various computer resources.
According to one embodiment, the event processor 308 communicates the cached events in order to notify an administrator of the one or more device driver errors. Such a notification may prompt the administrator to perform a corrective measure. In one embodiment, the event processor 308 communicates the cached events to clustering software (e.g., the clustering software 124 of FIG. 1). The clustering software notifies the administrator and/or performs the corrective measure against the raised events. For example, the clustering software may migrate the application to another node (i.e., a target node). As another example, the clustering software may failover the application to the other node in response to a particular cached event (e.g., a storage device failure).
FIG. 4 is a functional block diagram of a performance-based monitoring system 400 for processing performance data to improve application performance within a cluster according to one or more embodiments. The performance-based monitoring system 400 includes a monitoring database 402, a performance data collector 404, a system monitor 406, a performance data analyzer 408 and a performance data processor 410. In one embodiment, the monitoring database 402, the performance data collector 404, the performance data analyzer 408 and the performance data processor 410 form a monitor (e.g., the monitor 116 of FIG. 1) that is a subcomponent of an agent (e.g., the agent 114 of FIG. 1).
According to one or more embodiments, the monitoring database 402 defines one or more performance counters that indicate application health. In one embodiment, the monitoring database 402 includes pre-defined threshold values for the one or more performance counters. Additionally, the monitoring database 402 stores the performance counters as one or more files in an Extensible Markup Language (XML) format. Furthermore, an administrator may edit the threshold values associated with the one or more performance counters. The administrator may add a performance counter to the monitoring database 402. The administrator may also remove a performance counter from the monitoring database 402.
The performance data collector 404 accesses the one or more performance counters specified in the monitoring database 402 at user-defined intervals. Generally, at an end of each interval, the performance data collector 404 caches current values of the one or more performance counters using the system monitor 406. Subsequently, the performance data analyzer 408 accesses the cached current values of the one or more performance counters and determines one or more mean values, which are compared with the pre-defined threshold values to determine application health data. The performance data analyzer 408 stores the application health data in a cache. As described herein, the application health data may indicate that the application is starving for computer resources and performing poorly. In one embodiment, the administrator defines a time interval for determining the application health data. In another embodiment, the performance data analyzer 408 determines the application health data in response to a health check request from clustering software.
For example, the performance data analyzer 408 compares the cached current values with user-defined threshold values to generate a comparison result. One or more computer resource bottlenecks may be identified based on the comparison result. In one embodiment, the performance data analyzer 408 logs the one or more computer resource bottlenecks to the comparison result. Then, the performance data analyzer 408 stores the comparison result in a cache as application health data. For example, a threshold value for MICROSOFT EXCHANGE performance counter “Memory\Page Reads/sec” may be defined as equal to or less than one hundred reads per second. Accordingly, the performance data analyzer 408 compares the “Memory\Page Reads/sec” performance counter value with the threshold value to detect a memory resource bottleneck, which indicates that MICROSOFT Exchange is starving for memory resources.
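As a minimal sketch, and not as a definitive implementation, the mean-versus-threshold comparison may be written in Python as follows. The sample values and the `find_bottlenecks` function are hypothetical, and the threshold is treated here as an upper limit of one hundred page reads per second, consistent with the example above.

```python
from statistics import mean

# Upper limits per performance counter (the 100 reads/sec value follows the
# example in the description; treating it as a maximum is an assumption).
THRESHOLDS = {r"Memory\Page Reads/sec": 100.0}


def find_bottlenecks(cached_samples, thresholds):
    """Compare the mean of cached counter values with each threshold."""
    bottlenecks = []
    for counter, limit in thresholds.items():
        values = cached_samples.get(counter, [])
        if not values:
            continue  # nothing was collected for this counter
        average = mean(values)
        if average > limit:
            bottlenecks.append(
                {"counter": counter, "mean": average, "threshold": limit}
            )
    return bottlenecks


# Values cached by a collector at the end of three intervals (made-up data).
samples = {r"Memory\Page Reads/sec": [80, 120, 160]}
health_data = find_bottlenecks(samples, THRESHOLDS)
```

Averaging over several intervals, rather than reacting to a single reading, keeps one transient spike from being reported as a bottleneck.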
In one embodiment, the performance data processor 410 examines one or more cached computer resource bottlenecks (i.e., application health data) after a monitoring cycle (e.g., a user-defined time period) in a chronological order. As such, the cached computer resource bottlenecks reflect various states of computer resources. For example, the cached computer resource bottlenecks may indicate a performance counter that is below a pre-defined threshold. Computer resources associated with the performance counter are causing a MICROSOFT Exchange application to operate poorly. Hence, the MICROSOFT Exchange application is starving for the computer resources (e.g., CPU/memory resources). For instance, if a required load of a particular computer resource exceeds an available capacity as indicated by the performance counter, the MICROSOFT Exchange application may not perform adequately.
In one embodiment, the performance data processor 410 examines the cached computer resource bottlenecks to identify an application to migrate to one or more target nodes. For example, the one or more target nodes may be redundant systems configured to a MICROSOFT Exchange application service group. Accordingly, the one or more target nodes may or may not be suitable for operating the MICROSOFT Exchange application. In one embodiment, clustering software selects a target node of the one or more target nodes based on the cached computer resource bottlenecks in accordance with a dynamic priority list as explained further below. For example, some of the one or more target nodes may have current performance data values that are below pre-defined threshold values and are not to be selected as the target node for operating the MICROSOFT Exchange application.
In one embodiment, the performance data processor 410 may determine that the one or more performance counters reflect poor application health and thus, the MICROSOFT Exchange application service group is to be migrated to the target node of the one or more target nodes in order to improve application health. Alternatively, the performance data processor 410 determines that one or more low-priority application service groups are to be migrated to the target node in order to increase available capacities of various computer resources for the MICROSOFT Exchange application service group.
According to one embodiment, the performance data processor 410 communicates the application health data in order to notify an administrator as to poor application health. Such a notification may prompt the administrator to perform a corrective measure. In one embodiment, the performance data processor 410 communicates the application health data to clustering software (e.g., the clustering software 124 of FIG. 1). The clustering software notifies the administrator and/or performs the corrective measure. For example, the clustering software may migrate the application to another node (i.e., a target node). As another example, the clustering software may failover the application to the other node in response to a failure.
FIG. 5 is a functional block diagram that illustrates a cluster architecture 500 for processing application health data to improve application performance within a three-node cluster in accordance with dynamic priority list information according to one or more embodiments. The cluster architecture 500 provides nodes (i.e., redundant systems) for operating an application service group. As an example and not as limitation, the cluster architecture 500 comprises a three-node cluster 502 that includes a node 504, a node 506 and a node 508. Each node of the three-node cluster 502 is coupled to a storage disk 510 through a fiber switch 512 and/or a fiber switch 514.
Generally, the three-node cluster 502 employs clustering software for operating application service groups and controlling access to various computer resources. Each node of the three-node cluster 502 is a type of computing device (e.g., a laptop, a desktop, a Personal Digital Assistant (PDA) and/or the like), such as those generally known in the art. Each node of the three-node cluster 502 includes one or more computer resources (e.g., memory resources) for operating the application service group. For example, the node 504 and the node 508 include two GB of Random Access Memory (RAM) each whereas the node 506 includes four GB of RAM.
Generally, the storage disk 510 includes one or more data storage devices (e.g., hard disk drives, optical drives, magnetic tape drives and/or the like). Fiber switches, such as the fiber switch 512 and the fiber switch 514, provide communication paths between the three-node cluster 502 and the storage disk 510. The node 504 and the node 506 are coupled to the storage disk 510 through the fiber switch 512. Furthermore, the node 508 is coupled to the storage disk 510 using the fiber switch 514.
Generally, the application service group comprises various computing resources, such as hardware resources and/or software resources, operating together to provide services to the user. Additionally, the application service group is executed on an active node (e.g., the node 504). For example, in response to a failure and/or non-availability of computing resources on the active node, the application service group is failed over from the active node to a target node to achieve high availability. In one embodiment, the target node may have a highest priority amongst one or more target nodes (e.g., redundant systems for the application service group). As such, clustering software migrates or fails over the application service group to the target node having the highest priority amongst the one or more target nodes.
In one embodiment, the node 504 is the active node for operating the application service group. During execution, the application service group running on the node 504 determines non-availability of one or more computing resources through event logs and/or performance counters. For example, the application service group may require four GB of RAM to operate. As such, target nodes having at least four GB of RAM have higher priorities than target nodes that have less than four GB of RAM. Accordingly, the node 506 has a highest priority in the three-node cluster 502. Consequently, the application service group is switched or failed over to the node 506 instead of the node 508.
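The memory-based prioritization described above can be illustrated with a minimal sketch. The `Node` type, the `rank_targets` function and the node names used as strings are hypothetical and introduced only for illustration; the specification does not define a data model or a ranking routine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a target node and its installed RAM."""
    name: str
    ram_gb: int


def rank_targets(targets, required_ram_gb):
    """Order target nodes so that nodes meeting the RAM requirement come first.

    sorted() is stable, so nodes within the same group keep their input order.
    """
    # The key is False (sorts first) for nodes with enough RAM, True otherwise.
    return sorted(targets, key=lambda node: node.ram_gb < required_ram_gb)


# Target nodes of the three-node example; node 504 is the active node.
targets = [Node("node508", 2), Node("node506", 4)]
ranked = rank_targets(targets, required_ram_gb=4)
highest_priority = ranked[0]
```

Under these assumptions, `highest_priority` is the entry for the node 506, since it is the only target node with at least four GB of RAM.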
Based on application health data associated with the application service group, the application is migrated from the active node to a target node to improve workload management in accordance with one or more dynamic priority lists. Such dynamic priority lists may be associated with one or more events. For example, a certain dynamic priority list may indicate that the application is to fail over to the node 506 and then, the node 508 if there is an error associated with memory resources. However, another dynamic priority list indicates that the application is to fail over to the node 508 before the node 506 if there is a particular error associated with the fiber switches, such as a Fiber Switch Down event. Hence, if the Fiber Switch Down event occurs, then the other dynamic priority list is selected instead of the certain dynamic priority list and the node 508 is selected as the target node.
According to one or more embodiments, the fiber switch 512 may be operating poorly and affecting communications between the node 504 and the storage disk 510. Accordingly, communications between the node 506 and the storage disk 510 are also affected because the node 504 and the node 506 share the fiber switch 512. Since the node 508 employs the fiber switch 514, a failure associated with the fiber switch 512, such as a Fiber Switch Down event, does not affect communications between the node 508 and the storage disk 510. As such, a dynamic list that corresponds with the failure associated with the fiber switch 512 is selected. Such a dynamic list indicates that the node 508 has a highest priority within the three-node cluster 502 instead of the node 506. Hence, the node 508 is selected as the target node by clustering software in order to ensure that a particular application is operating properly.
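The event-dependent selection described above can be sketched as a lookup from an event type to a dynamic priority list. The event identifiers `MEMORY_ERROR` and `FIBER_SWITCH_DOWN`, the dictionary layout and the `select_target` function are assumptions made for illustration; the specification does not prescribe how the dynamic priority list information is stored.

```python
# Hypothetical dynamic priority list information: one ordered list per event.
DYNAMIC_PRIORITY_LISTS = {
    # Memory-related error: prefer the node with more RAM (node 506).
    "MEMORY_ERROR": ["node506", "node508"],
    # Fiber switch 512 is down: prefer the node on fiber switch 514 (node 508).
    "FIBER_SWITCH_DOWN": ["node508", "node506"],
}


def select_target(event, active_node):
    """Return the highest-priority target node for the given event.

    The active node is skipped because the application is moving away from it.
    Returns None when the event has no list or no other node is listed.
    """
    for candidate in DYNAMIC_PRIORITY_LISTS.get(event, []):
        if candidate != active_node:
            return candidate
    return None


target_on_switch_failure = select_target("FIBER_SWITCH_DOWN", active_node="node504")
target_on_memory_error = select_target("MEMORY_ERROR", active_node="node504")
```

In this sketch the cause of the failure, rather than a single static ordering, determines which list is consulted, which is the distinction the dynamic priority lists are meant to capture.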
FIG. 6 is a flow diagram of a method 600 for processing application health data to improve application performance within a cluster according to one or more embodiments. The method 600 starts at step 602 and proceeds to step 604, at which a node (e.g., the node 104 of FIG. 1) is monitored.
At step 606, an event log (e.g., the event log 212 of FIG. 2) and performance counters (e.g., the performance counters 214 of FIG. 2) are accessed. At step 608, application health data (e.g., the application health data 117 of FIG. 1) is examined. In an embodiment, the application health data is determined from the events raised by the application and/or performance counter values of the application and/or the computing resources.
At step 610, a determination is made as to whether the application is to be migrated. If it is determined that the application is to be migrated (option “YES”), the method 600 proceeds to step 612. If, at step 610, it is determined that the application is not to be migrated (option “NO”), then the method 600 proceeds to step 614. At step 612, the application health data is communicated to clustering software (e.g., the clustering software 124 of FIG. 1). As described above, the clustering software may perform various operations using the application health data.
In one embodiment, the clustering software displays the application health data to a system administrator and/or sends a notification (e.g., an email) that includes a corrective measure (e.g., migrate the application). In another embodiment, the clustering software selects a target node based on the application health data in accordance with dynamic priority list information (e.g., the dynamic priority list information of FIG. 1). Once the target node is selected, the clustering software stops the application running on an active node. Then, the application is restarted on the target node. After step 612, the method 600 proceeds to step 614, at which the method 600 ends.
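The stop-then-restart sequence performed by the clustering software can be sketched as follows. The `ClusteringSoftware` class, its `migrate` method and the action log are hypothetical constructs used only to make the ordering of operations concrete; they do not correspond to an interface of any particular clustering product.

```python
class ClusteringSoftware:
    """Hypothetical stand-in that records the operations it would perform."""

    def __init__(self, active_node):
        self.active_node = active_node
        self.actions = []  # ordered record of (operation, application, node)

    def migrate(self, application, priority_list):
        """Move the application to the first listed node other than the active one."""
        target = next(
            (node for node in priority_list if node != self.active_node), None
        )
        if target is None:
            return None  # no suitable target node; the application stays put
        # Stop the application on the active node before restarting elsewhere.
        self.actions.append(("stop", application, self.active_node))
        self.actions.append(("start", application, target))
        self.active_node = target
        return target


software = ClusteringSoftware(active_node="node504")
selected = software.migrate("exchange", ["node508", "node506"])
```

Here the priority list passed to `migrate` is assumed to have already been chosen from the dynamic priority list information based on the application health data.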
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.