Disclosure of Invention
Therefore, the invention provides a unified construction method of heterogeneous cloud platforms, which is used for solving the problem that the resource management efficiency of heterogeneous clusters is affected by adopting the same operation parameters to finish the task processing aiming at the heterogeneous cloud platforms under all operation conditions in the prior art.
In order to achieve the above object, the present invention provides a unified construction method for heterogeneous cloud platforms, including:
completing the construction of a cloud platform;
distributing a plurality of tasks, determining response time length of each task, determining whether the operation of the cloud platform is qualified or not based on the average response time length and variance of each response time length, dividing each task abnormal category into a scheduling abnormal category and a transmission abnormal category based on log information when determining operation abnormality, and respectively acquiring proportion of the scheduling abnormal category and proportion of the transmission abnormal category;
determining whether to process the facility deployment or the service flow according to the result of comparing the proportion of the scheduling abnormal category with the proportion of the transmission abnormal category;
determining a processing mode for the facility deployment based on the proportion of the scheduling abnormal category, wherein the processing mode comprises adjusting the number of servers to a corresponding value or determining that the heterogeneous support mode is to be changed;
determining a processing mode for the service flow based on the proportion of the transmission abnormal category, wherein the processing mode comprises adjusting the current limiting standard to a corresponding value or determining that the interface is to be updated;
And deploying online when the operation of the cloud platform is qualified, and continuously monitoring after online.
Further, the process of determining whether the cloud platform is qualified for operation based on the average response time length and the variance of each response time length includes:
acquiring average response time length, and determining whether the operation of the cloud platform is qualified or not based on the average response time length;
If the average response time length is less than or equal to the first preset time length, judging that the cloud platform is qualified in operation;
If the average response time length is smaller than or equal to a second preset time length and larger than the first preset time length, determining whether the operation of the cloud platform is qualified or not based on the variance of each response time length;
If the average response time length is longer than the second preset time length, judging that the cloud platform runs abnormally, dividing each task abnormal category based on log information, and adjusting the running parameters of the cloud platform based on each task abnormal category proportion.
Further, the process of determining whether the cloud platform is qualified for operation based on the variance of each response time length includes:
If the variance of each response time length is smaller than or equal to the preset time variance, judging that the cloud platform operates abnormally, dividing each task abnormal category based on log information, and adjusting the operation parameters of the cloud platform based on each task abnormal category proportion;
if the variance of the response time lengths is larger than the preset time variance, the first preset time length is adjusted based on the response floating parameter, and whether the operation of the cloud platform is qualified is determined again based on the adjusted first preset time length.
Further, the process of adjusting the first preset time period based on the response floating parameter comprises the following steps:
Determining response floating parameters, and calculating the difference value between the variance of each response time length and the variance of the preset time to obtain the response floating parameters;
the first predetermined time period determined based on the response float parameter is positively correlated with the response float parameter.
Further, the process of dividing each task anomaly category based on the log information includes:
screening logs related to task scheduling, traversing the logs to determine abnormal prompt entries;
determining a task with a task retry prompt entry as a scheduling exception category;
and determining the task with the data packet loss prompting entry as a transmission abnormality type.
Further, the process of adjusting the operation parameters of the cloud platform based on the abnormal class proportion of each task comprises the following steps:
determining the proportion of the scheduling abnormal categories, and calculating the proportion of the number of the scheduling abnormal category tasks to the total number of the tasks to obtain the proportion of the scheduling abnormal categories;
Determining the proportion of the transmission abnormal categories, and calculating the proportion of the number of the transmission abnormal category tasks to the total number of the tasks to obtain the proportion of the transmission abnormal categories;
comparing the proportion of the scheduling exception categories with the proportion of the transmission exception categories, wherein,
If the proportion of the scheduling abnormality categories is greater than the proportion of the transmission abnormality categories, determining a processing mode aiming at facility deployment based on the proportion of the scheduling abnormality categories;
if the proportion of the scheduling exception categories is smaller than or equal to the proportion of the transmission exception categories, determining a processing mode aiming at the service flow based on the proportion of the transmission exception categories.
Further, determining a processing manner for the facility deployment based on the proportion of the scheduling anomaly category includes:
If the proportion of the scheduling exception categories is smaller than or equal to the preset exception proportion, the number of servers is adjusted to a corresponding value based on the proportion difference value between the preset exception proportion and the scheduling exception categories;
if the proportion of the scheduling abnormal category is larger than the preset abnormal proportion, it is determined that the heterogeneous support mode is to be changed.
Further, determining a processing manner for the service flow based on the proportion of the transmission anomaly class includes:
If the proportion of the transmission abnormal type is smaller than or equal to the preset abnormal proportion, the current limiting standard is adjusted to a corresponding value based on the CPU utilization rate;
If the proportion of the transmission abnormal category is larger than the preset abnormal proportion, it is determined that the interface is to be updated.
Further, the number of servers is adjusted to a corresponding value based on a ratio difference of a preset anomaly ratio and a scheduling anomaly category, wherein,
Marking the ratio difference value between the preset abnormal ratio and the scheduling abnormal category as a scheduling ratio difference value;
the adjustment amplitude of the number of servers determined based on the scheduling ratio difference is inversely proportional to the scheduling ratio difference.
Further, the current limit criteria is adjusted to a corresponding value based on CPU utilization, wherein,
The current limit criteria determined based on CPU utilization is inversely proportional to CPU utilization.
Compared with the prior art, the method has the beneficial effect that whether the operation of the heterogeneous cloud platform is qualified is monitored according to the average response time length of the distributed tasks and the variance of the response time lengths. When the operation of the cloud platform is determined to be abnormal, the abnormal category of each task is determined based on log information, so that a targeted processing mode is chosen for the specific abnormal condition, and the operation parameters of the cloud platform are adjusted based on the proportion of each task abnormal category, thereby improving the resource management efficiency of the heterogeneous cluster.
Further, when the average response time length is smaller than or equal to the second preset time length and larger than the first preset time length, whether the operation of the heterogeneous cloud platform is qualified is determined in combination with the variance of the response time lengths. When the variance is larger than the preset time variance, the average response time length has been lengthened by the excessively long response times of a portion of the tasks. Considering that monitoring deviation caused by network anomalies in some regions affects the judgment accuracy, the evaluation standard for the average response time length is adjusted, and whether the operation of the cloud platform is qualified is determined again according to the adjusted evaluation standard, thereby improving the resource management efficiency of the heterogeneous cluster.
Further, when the operation of the cloud platform is judged to be abnormal, specific remedial measures are determined. The abnormal category of each task is determined from the logs: the scheduling abnormal category indicates that the response time is too long because of a task scheduling abnormality, and the transmission abnormal category indicates that the response time is too long because of a data transmission abnormality. The proportion of each category is counted. When the proportion of the scheduling abnormal category is larger than the proportion of the transmission abnormal category, a large number of response time abnormalities are caused by task scheduling abnormalities, and the scheduling abnormality is processed. In that case, when the proportion of the scheduling abnormal category is larger than the preset abnormal proportion, the heterogeneous architecture has a strategy abnormality in resource allocation and management, and compatibility abnormalities between the heterogeneous mode and the components or systems on which the tasks depend cause the response times of a large number of tasks to be too long, so it is determined that the heterogeneous support mode is to be changed. When the proportion of the scheduling abnormal category is smaller than or equal to the preset abnormal proportion, only a small number of tasks have excessively long response times, and the number of servers is increased, which improves the operation stability of the heterogeneous cloud platform and further improves the utilization rate of the servers.
Further, when the proportion of the scheduling abnormal category is smaller than or equal to the proportion of the transmission abnormal category, the task response times are too long because of data transmission abnormalities, and the service flow is processed. In that case, when the proportion of the transmission abnormal category is larger than the preset abnormal proportion, a large number of tasks suffer from data transmission abnormalities because compatibility problems in the interface cause data packets to be mishandled and lost, so the interface is updated. When the proportion of the transmission abnormal category is smaller than or equal to the preset abnormal proportion, only a small number of tasks have excessively long response times, which result from unreasonable task allocation caused by a current limiting standard that is too low; the current limiting standard is then adjusted, which effectively improves the response speed of the tasks, improves the processing efficiency of the cloud platform for the tasks, and further improves the resource management efficiency of the heterogeneous cluster.
Detailed Description
The invention will be further described with reference to examples for the purpose of making the objects and advantages of the invention more apparent, it being understood that the specific examples described herein are given by way of illustration only and are not intended to be limiting.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In addition, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, integrally connected, mechanically connected, electrically connected, directly connected, indirectly connected through an intermediate medium, or in communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Referring to FIG. 1, FIG. 2, FIG. 3, and FIG. 4, which are respectively a flowchart of the steps of the unified construction method of heterogeneous cloud platforms according to an embodiment of the present invention, a logic decision diagram for determining whether the construction is qualified based on the average response time length, a logic decision diagram for determining whether the construction is qualified based on the variance of the response time lengths, and a logic decision diagram for determining the processing mode for the facility deployment based on the proportion of the scheduling abnormal category, the unified construction method of heterogeneous cloud platforms according to an embodiment of the present invention includes:
upgrading and adjusting hardware, installing virtualization software and an operating system on a server, setting and managing network infrastructure, configuring a virtualization environment, constructing a containerization environment and constructing a storage resource pool;
integrating an automatic deployment tool and building a monitoring and alarming system;
Distributing a plurality of tasks, determining response time length of each task, determining whether the operation of the cloud platform is qualified based on the average response time length and variance of each response time length, dividing each task abnormality category based on log information when the operation abnormality of the cloud platform is determined, and adjusting the operation parameters of the cloud platform based on each task abnormality category proportion, wherein the method comprises the following steps:
heterogeneous support mode change, server quantity redetermining, interface updating or current limiting standard redetermining;
and (5) deploying on-line, and continuously monitoring after the on-line.
Specifically, the process of facility deployment includes upgrading and adjusting hardware; installing virtualization software and an operating system on the servers; setting up and managing the network infrastructure, including physical networks and virtual networks; installing KVM on the servers and configuring the virtualization environment; installing Docker and Kubernetes to build a containerized environment; and using a distributed storage service or GlusterFS to build a storage resource pool. These steps are not described in detail.
The process of integrating the automation tools and determining the service flow includes integrating the automated deployment tools Ansible and Terraform; building the monitoring and alarm system with Prometheus and Grafana; analyzing the service flow, determining its requirements, and designing it; and integrating the Kubernetes orchestration tool to optimize the service flow. These steps are not described in detail.
It should be noted that the data in this embodiment are obtained by using the method of the present invention to comprehensively analyze and evaluate the historical detection data from the three months before the current detection together with the corresponding historical detection results. Before the current detection, the value of each preset parameter standard for the current detection is determined comprehensively from the cloud platform operating conditions corresponding to the 25863 batches of task dispatches in the previous three months, the response time of each dispatched task, the detection data of each batch of tasks, and the judgment results. Those skilled in the art will understand that, for an individual parameter, the method may select the most frequent value in the data distribution as the preset standard parameter, use a value obtained by weighted summation as the preset standard parameter, or use another selection approach, provided that the resulting value allows the method to distinguish clearly between the different specific situations in the individual judgment process.
Specifically, the process of determining whether the cloud platform is qualified for operation based on the average response time length and the variance of each response time length includes:
Determining whether the operation of the cloud platform is qualified or not based on the average response time length;
If the average response time length is less than or equal to the first preset time length, judging that the operation of the cloud platform is qualified, and deploying online;
If the average response time length is smaller than or equal to a second preset time length and larger than the first preset time length, determining whether the operation of the cloud platform is qualified or not based on the variance of each response time length;
If the average response time length is longer than the second preset time length, judging that the cloud platform runs abnormally, dividing each task abnormal category based on log information, and adjusting the running parameters of the cloud platform based on each task abnormal category proportion.
Specifically, the first preset duration T1 is selected within the interval [0.5,2], and the second preset duration T2 is selected within the interval [5,30], wherein the units of the preset durations are seconds.
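The three-way decision on the average response time length can be sketched as follows. This is a minimal illustration only: the function name, the enum, and the default values T1 = 1.0 s and T2 = 10.0 s are assumptions picked from inside the stated intervals, not values fixed by the embodiment.

```python
from enum import Enum
from statistics import fmean


class Verdict(Enum):
    QUALIFIED = "qualified"            # deploy online
    CHECK_VARIANCE = "check_variance"  # fall through to the variance step
    ABNORMAL = "abnormal"              # classify task anomalies from logs


def classify_by_mean(response_times, t1=1.0, t2=10.0):
    """Classify platform operation from the average response time (seconds).

    t1 and t2 are the first and second preset time lengths; the defaults
    are illustrative picks from [0.5, 2] and [5, 30].
    """
    mean = fmean(response_times)
    if mean <= t1:
        return Verdict.QUALIFIED
    if mean <= t2:
        return Verdict.CHECK_VARIANCE
    return Verdict.ABNORMAL
```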
Specifically, whether the operation of the heterogeneous cloud platform is qualified or not is monitored according to the average response time length of each distributed task and the variance of each response time length, when the operation abnormality of the cloud platform is determined, the task abnormality categories are divided based on log information, so that the processing mode is determined according to specific abnormality conditions, the operation parameters of the cloud platform are adjusted based on the task abnormality category proportion, and the resource management efficiency of the heterogeneous cluster is improved.
Specifically, the process of determining whether the cloud platform is qualified for operation based on the variance of each response time length includes:
If the variance of each response time length is smaller than or equal to the preset time variance, judging that the cloud platform operates abnormally, dividing each task abnormal category based on log information, and adjusting the operation parameters of the cloud platform based on each task abnormal category proportion;
if the variance of the response time lengths is larger than the preset time variance, the first preset time length is adjusted based on the response floating parameter, and whether the operation of the cloud platform is qualified is determined again based on the adjusted first preset time length.
Specifically, the preset time variance C0 is selected within the interval [0.05T, 0.09T], where T is the average response time length.
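A minimal sketch of the variance step is given below. It assumes the population variance is meant (the embodiment does not say whether population or sample variance is used) and takes the coefficient k = 0.07 as an illustrative value inside [0.05, 0.09]; the function name and return labels are hypothetical.

```python
from statistics import fmean, pvariance


def variance_step(response_times, k=0.07):
    """Decide the next action when T1 < mean <= T2.

    C0 = k * T, where T is the average response time and k is an
    assumed coefficient chosen from [0.05, 0.09].
    """
    mean = fmean(response_times)
    c0 = k * mean
    variance = pvariance(response_times)  # assumption: population variance
    if variance <= c0:
        # Response times are uniformly slow: treat as abnormal operation.
        return "abnormal"
    # A few slow tasks dominate: relax the first preset time length.
    return "adjust_first_preset_time_length"
```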
Specifically, when the average response time length is smaller than or equal to the second preset time length and larger than the first preset time length, whether the operation of the heterogeneous cloud platform is qualified is determined in combination with the variance of the response time lengths. When the variance is larger than the preset time variance, the average response time length has been lengthened by the excessively long response times of a portion of the tasks. Considering the influence on judgment accuracy of monitoring deviation caused by network anomalies in some regions, the evaluation standard for the average response time length is adjusted, and whether the operation of the cloud platform is qualified is determined again according to the adjusted evaluation standard, thereby improving the resource management efficiency of the heterogeneous cluster.
Specifically, the process of adjusting the first preset time period based on the response floating parameter comprises the following steps:
Determining response floating parameters, and calculating the difference value between the variance of each response time length and the variance of the preset time to obtain the response floating parameters;
the first predetermined time period determined based on the response float parameter is positively correlated with the response float parameter.
In this embodiment, optionally,
The response float parameter is compared with a first preset float comparison threshold and a second preset float comparison threshold,
If the response floating parameter is smaller than or equal to a first preset floating comparison threshold value, the first preset time length is adjusted to a corresponding value by using a first preset evaluation adjustment coefficient, and the first preset time length adjusted by using the first preset evaluation adjustment coefficient is 1.11 times of the initial first preset time length;
If the response floating parameter is smaller than or equal to the second preset floating comparison threshold and larger than the first preset floating comparison threshold, the second preset assessment adjustment coefficient is used for adjusting the first preset duration to a corresponding value, and the first preset duration adjusted by the second preset assessment adjustment coefficient is 1.17 times of the initial first preset duration;
if the response floating parameter is larger than the second preset floating comparison threshold value, the third preset assessment adjustment coefficient is used for adjusting the first preset duration to a corresponding value, and the first preset duration adjusted by the third preset assessment adjustment coefficient is 1.23 times of the initial first preset duration;
wherein the first preset float comparison threshold is 2.1C0 and the second preset float comparison threshold is 5.3C0.
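The tiered adjustment above can be written compactly. The sketch below uses the coefficients and thresholds stated in this embodiment (1.11, 1.17, 1.23 and 2.1C0, 5.3C0); the function and parameter names are illustrative assumptions.

```python
def adjusted_first_preset(t1, variance, c0):
    """Return the adjusted first preset time length.

    The response floating parameter is the difference between the variance
    of the response time lengths and the preset time variance C0.
    """
    floating = variance - c0
    if floating <= 2.1 * c0:
        factor = 1.11
    elif floating <= 5.3 * c0:
        factor = 1.17
    else:
        factor = 1.23
    return t1 * factor
```

The factor grows with the floating parameter, which matches the stated positive correlation between the adjusted first preset time length and the response floating parameter.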
Specifically, the process of redefining whether the operation of the cloud platform is qualified based on the adjusted first preset time length includes:
if the average response time length is less than or equal to the adjusted first preset time length, judging that the operation of the cloud platform is qualified, and deploying online;
if the average response time length is longer than the adjusted first preset time length, judging that the cloud platform operates abnormally, dividing each task abnormal category based on log information, and adjusting the operation parameters of the cloud platform based on each task abnormal category proportion.
Specifically, the process of dividing each task abnormality category based on log information includes:
screening logs related to task scheduling, traversing the logs to determine abnormal prompt entries;
determining a task with a task retry prompt entry as a scheduling exception category;
and determining the task with the data packet loss prompting entry as a transmission abnormality type.
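The log traversal can be sketched as follows. The log format is not specified in this embodiment, so the (task id, message) pairs and the marker strings "task retry" and "packet loss" are hypothetical; the sketch also assumes that a task showing both kinds of entry is counted in both categories.

```python
RETRY_MARKER = "task retry"   # hypothetical text of a task retry prompt entry
LOSS_MARKER = "packet loss"   # hypothetical text of a packet loss prompt entry


def categorize_tasks(log_entries):
    """Split tasks into scheduling and transmission anomaly categories.

    log_entries: iterable of (task_id, message) pairs taken from the
    logs related to task scheduling.
    """
    scheduling, transmission = set(), set()
    for task_id, message in log_entries:
        text = message.lower()
        if RETRY_MARKER in text:
            scheduling.add(task_id)
        if LOSS_MARKER in text:
            transmission.add(task_id)
    return {"scheduling": scheduling, "transmission": transmission}
```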
Specifically, a process for adjusting operation parameters of a cloud platform based on abnormal class proportions of tasks includes:
determining the proportion of the scheduling abnormal categories, and calculating the proportion of the number of the scheduling abnormal category tasks to the total number of the tasks to obtain the proportion of the scheduling abnormal categories;
Determining the proportion of the transmission abnormal categories, and calculating the proportion of the number of the transmission abnormal category tasks to the total number of the tasks to obtain the proportion of the transmission abnormal categories;
Comparing the proportion of the scheduling abnormal category with the proportion of the transmission abnormal category;
if the proportion of the scheduling abnormality categories is greater than the proportion of the transmission abnormality categories, determining a processing mode aiming at facility deployment based on the proportion of the scheduling abnormality categories;
if the proportion of the scheduling exception categories is smaller than or equal to the proportion of the transmission exception categories, determining a processing mode aiming at the service flow based on the proportion of the transmission exception categories.
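A minimal sketch of the proportion calculation and the resulting branch is shown below; the function name and the returned labels are illustrative assumptions.

```python
def choose_processing_target(n_scheduling, n_transmission, n_total):
    """Compute both category proportions and pick what to process.

    Returns (target, p_scheduling, p_transmission), where target is
    "facility_deployment" or "service_flow".
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_scheduling = n_scheduling / n_total
    p_transmission = n_transmission / n_total
    if p_scheduling > p_transmission:
        target = "facility_deployment"
    else:  # ties go to the service flow, as stated above
        target = "service_flow"
    return target, p_scheduling, p_transmission
```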
Specifically, determining a processing manner for a facility deployment based on a proportion of scheduling anomaly categories includes:
If the proportion of the scheduling exception categories is smaller than or equal to the preset exception proportion, the number of servers is adjusted to a corresponding value based on the proportion difference value between the preset exception proportion and the scheduling exception categories;
if the proportion of the scheduling abnormal category is larger than the preset abnormal proportion, it is determined that the heterogeneous support mode is to be changed.
Specifically, the preset abnormal proportion S0 is selected within the interval [0.62, 0.78].
Specifically, when the operation of the cloud platform is judged to be abnormal, specific remedial measures are determined. The abnormal category of each task is determined from the logs: the scheduling abnormal category indicates that the response time is too long because of a task scheduling abnormality, and the transmission abnormal category indicates that the response time is too long because of a data transmission abnormality. The proportion of each category is calculated. When the proportion of the scheduling abnormal category is larger than the proportion of the transmission abnormal category, a large number of response time abnormalities are caused by task scheduling abnormalities, and the scheduling abnormality is processed. In that case, when the proportion of the scheduling abnormal category is larger than the preset abnormal proportion, the heterogeneous architecture has a strategy abnormality in resource allocation and management, and compatibility abnormalities between the heterogeneous mode and the components or systems on which the tasks depend cause the response times of a large number of tasks to be too long, so it is determined that the heterogeneous support mode is to be changed. When the proportion of the scheduling abnormal category is smaller than or equal to the preset abnormal proportion, only a small number of tasks have excessively long response times due to scheduling problems, and the number of servers is increased, which improves data processing efficiency, improves the operation stability of the heterogeneous cloud platform, and further improves the utilization rate of the servers.
Specifically, to meet the compatibility requirements of CPUs with different architectures, the scheme adopts the CPUID instruction and microcode update technology. The characteristics of a processor can be identified through the CPUID instruction, so that a corresponding running environment is provided for the virtual machine, and microcode updates ensure that the processor supports the latest instruction sets and features.
Specifically, determining a processing manner for a service flow based on a proportion of transmission anomaly categories includes:
If the proportion of the transmission abnormal type is smaller than or equal to the preset abnormal proportion, the current limiting standard is adjusted to a corresponding value based on the CPU utilization rate;
If the proportion of the transmission abnormal category is larger than the preset abnormal proportion, it is determined that the interface is to be updated.
Specifically, when the proportion of the scheduling abnormal category is smaller than or equal to the proportion of the transmission abnormal category, the task response times are too long because of data transmission abnormalities, and the service flow is processed. In that case, when the proportion of the transmission abnormal category is larger than the preset abnormal proportion, a large number of tasks suffer from data transmission abnormalities because compatibility problems in the interface cause data packets to be mishandled and lost, so the interface is updated. When the proportion of the transmission abnormal category is smaller than or equal to the preset abnormal proportion, only a small number of tasks have excessively long response times, which result from unreasonable task allocation caused by a current limiting standard that is too low; the current limiting standard is then adjusted, which effectively improves the response speed of the tasks, improves the processing efficiency of the cloud platform for the tasks, and further improves the resource management efficiency of the heterogeneous cluster.
Specifically, for interface updating, the cloud engine provides a physical server interface, a virtual machine interface, a storage device interface, and a network interface; it supports management of resource pools of x86 physical servers and minicomputer servers, and provides management functions such as adding, controlling, removing, and querying status. On the virtualization platform, the cloud engine supports access management for the server virtualization technologies PowerVM, VMware, Xen, and KVM, and provides creation, operation, deletion, configuration change, and status query for virtual machines. Because each virtualization technology exposes different open interfaces, access management of the underlying virtualization technologies is handled separately for each. Updating an interface includes determining the interface version currently in use, determining the available updated interface versions, automatically selecting a version, backing up the related configuration files and data, and installing and configuring the new interface. This is prior art and is not described further.
The CPU utilization is the proportion of time during which the CPU is in the working state within a historical time period. The length of the historical time period is not limited, provided that it can characterize the workload and resource utilization conditions; this is not described in detail.
Specifically, the current limit criterion is the maximum data transmission amount allowed to pass in a specific time interval, which is not described in detail.
Specifically, the number of servers is adjusted to a corresponding value based on the difference between the preset exception proportion and the proportion of the scheduling exception category, wherein:
the difference between the preset exception proportion and the proportion of the scheduling exception category is recorded as the scheduling proportion difference;
the adjustment amplitude of the number of servers, determined based on the scheduling proportion difference, is inversely proportional to the scheduling proportion difference.
In this embodiment, optionally,
the scheduling proportion difference is compared with a first preset scheduling comparison threshold and a second preset scheduling comparison threshold;
if the scheduling proportion difference is less than or equal to the first preset scheduling comparison threshold, the number of servers is determined to be a first number, the first number being 1.32 times the current number of servers;
if the scheduling proportion difference is greater than the first preset scheduling comparison threshold and less than or equal to the second preset scheduling comparison threshold, the number of servers is determined to be a second number, the second number being 1.25 times the current number of servers;
if the scheduling proportion difference is greater than the second preset scheduling comparison threshold, the number of servers is determined to be a third number, the third number being 1.11 times the current number of servers;
wherein the first preset scheduling comparison threshold is 0.25S0 and the second preset scheduling comparison threshold is 0.63S0.
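The tiered adjustment above can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, S0 is passed in as a plain number, and rounding the result up to a whole server is an assumption, since the text does not specify a rounding rule.

```python
def adjusted_server_count(current_servers, scheduling_diff, s0):
    """Map the scheduling proportion difference to a new server count."""
    first_threshold = 0.25 * s0   # first preset scheduling comparison threshold
    second_threshold = 0.63 * s0  # second preset scheduling comparison threshold

    # Multipliers are kept as integer percentages so the arithmetic is exact.
    if scheduling_diff <= first_threshold:
        percent = 132  # first number: 1.32 x current servers
    elif scheduling_diff <= second_threshold:
        percent = 125  # second number: 1.25 x current servers
    else:
        percent = 111  # third number: 1.11 x current servers

    # Ceiling division (assumption: fractional servers are rounded up).
    return -(-current_servers * percent // 100)
```

A smaller scheduling proportion difference yields a larger increase, which matches the inverse relationship stated above.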
Specifically, the current limiting standard is adjusted to a corresponding value based on the CPU utilization, wherein:
the current limiting standard determined based on the CPU utilization is inversely proportional to the CPU utilization.
In this embodiment, optionally,
the CPU utilization is compared with a first preset occupancy comparison threshold and a second preset occupancy comparison threshold;
if the CPU utilization is less than or equal to the first preset occupancy comparison threshold, the current limiting standard is determined to be a first current limiting standard, the first current limiting standard being 1.35 times the initial current limiting standard;
if the CPU utilization is greater than the first preset occupancy comparison threshold and less than or equal to the second preset occupancy comparison threshold, the current limiting standard is determined to be a second current limiting standard, the second current limiting standard being 1.24 times the initial current limiting standard;
if the CPU utilization is greater than the second preset occupancy comparison threshold, the current limiting standard is determined to be a third current limiting standard, the third current limiting standard being 1.12 times the initial current limiting standard;
wherein the first preset occupancy comparison threshold is 50% and the second preset occupancy comparison threshold is 70%.
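As an illustration, the mapping from CPU utilization to the current limiting standard might be written as below. The function and parameter names are hypothetical, and expressing the CPU utilization as a fraction between 0 and 1 is an assumption made for the sketch.

```python
def adjusted_current_limit(initial_limit, cpu_utilization):
    """Scale the initial current limiting standard by a CPU-dependent factor."""
    first_threshold = 0.50   # first preset occupancy comparison threshold (50%)
    second_threshold = 0.70  # second preset occupancy comparison threshold (70%)

    if cpu_utilization <= first_threshold:
        factor = 1.35  # first current limiting standard
    elif cpu_utilization <= second_threshold:
        factor = 1.24  # second current limiting standard
    else:
        factor = 1.12  # third current limiting standard

    return initial_limit * factor
```

The lower the CPU utilization, the more headroom remains, so the current limiting standard is raised by a larger factor.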
Specifically, once the adjustment of the current limiting standard is complete, the adjusted current limiting standard is compared with a preset maximum current limiting standard. If the adjusted current limiting standard is less than or equal to the preset maximum current limiting standard, the adjusted current limiting standard is used as the operating parameter of the cloud platform. If the adjusted current limiting standard is greater than the preset maximum current limiting standard, the preset maximum current limiting standard is used as the operating parameter of the cloud platform, and the number of servers is set to the third number.
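The capping step can be sketched as follows. The names are hypothetical; leaving the number of servers unchanged when no capping occurs and rounding the third number up to a whole server are assumptions, since the text specifies the server count only for the capped case.

```python
def apply_limit_cap(adjusted_limit, max_limit, current_servers):
    """Return (current limiting standard to use, number of servers)."""
    if adjusted_limit <= max_limit:
        # Within the preset maximum: use the adjusted value as-is.
        # Assumption: the server count is left unchanged in this case.
        return adjusted_limit, current_servers

    # Above the preset maximum: cap the standard and move to the third
    # number of servers (1.11 x current), rounded up by ceiling division.
    third_number = -(-current_servers * 111 // 100)
    return max_limit, third_number
```

In the capped case, the additional servers compensate for the throughput that the current limiting standard alone can no longer provide.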
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.