CN116010064A


Info

Publication number: CN116010064A
Application number: CN202310086733.0A
Authority: CN
Inventors: 蒲菊华; 孟巧岚; 陈烨轩; 王元宏
Original assignee: Beihang University
Current assignee: International Innovation Research Institute Of Beihang University In Hangzhou; Beihang University
Priority date: 2023-01-16
Filing date: 2023-01-16
Publication date: 2023-04-25

Abstract


The invention discloses a method, system, and device for DAG job scheduling and cluster management. The system comprises a task management module and a cluster management module that are connected to each other. The task management module receives and parses DAG jobs and schedules the subtasks obtained from parsing onto suitable computing nodes. The cluster management module maintains the state of the cluster nodes and controls cluster change operations. The invention thereby achieves DAG job scheduling and cluster management.

Description


Method, system and device for DAG job scheduling and cluster management

Technical field

The present invention relates to the field of DAG job scheduling and cluster management, and in particular to a method, system, and device for DAG job scheduling and cluster management.

Background

The current Kubernetes scheduling strategy schedules each Pod serially and independently, and can be divided into two phases: node filtering and node scoring. In the filtering phase, all nodes in the cluster that can satisfy the Pod's requirements (CPU, memory, etc.) are selected to form a node list. In the scoring phase, each node in that list is evaluated according to a set of scoring rules, and the most suitable node is selected for the Pod.
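The two phases described above can be illustrated with a minimal sketch. It does not use any Kubernetes API; the dictionary fields (`cpu`, `mem`, `cpu_free`, `mem_free`) and the function name `schedule_pod` are hypothetical, and the scoring rule is passed in by the caller because the text does not fix one.

```python
def schedule_pod(pod, nodes, score):
    """Pick a node for one pod: filter first, then score.

    pod   -- dict with hypothetical keys "cpu" and "mem"
    nodes -- list of dicts with hypothetical keys "name", "cpu_free", "mem_free"
    score -- callable mapping a node dict to a number (higher is better)
    """
    # Filtering phase: keep only nodes that can satisfy the pod's requests.
    feasible = [
        n for n in nodes
        if n["cpu_free"] >= pod["cpu"] and n["mem_free"] >= pod["mem"]
    ]
    if not feasible:
        return None  # no node fits; the pod stays unscheduled
    # Scoring phase: choose the best-scoring node among the feasible ones.
    return max(feasible, key=score)
```

Each pod is handled on its own in this sketch, which is the limitation the following paragraph points out for DAG jobs.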

The native Kubernetes scheduling method is fairly basic. It mainly schedules individual tasks or application containers and does not account for complex scenarios such as DAG jobs. This results in low scheduling efficiency, which seriously affects job completion time and the utilization of computing resources, and lowers the quality of service perceived by users. Designing effective scheduling techniques for Kubernetes therefore remains an open problem. In addition, Kubernetes does not automatically manage nodes according to the number of tasks currently waiting to be executed. For example, when node load is too high and task scheduling fails, it does not attempt to add nodes; when node load is low and there are few tasks, it likewise does not attempt to remove nodes to improve resource utilization. With the rapid development of cloud computing, major public cloud vendors can provide almost unlimited cloud resources, so improving the resource utilization of Kubernetes clusters has also become a key problem.

Summary of the invention

The purpose of the present invention is to provide a method, system, and device for DAG job scheduling and cluster management, aiming to solve the problems of DAG job scheduling and cluster management.

The present invention provides a DAG job scheduling and cluster management method, including:

S1. Receiving and parsing a DAG job through a task management module, and scheduling the subtasks obtained from parsing the DAG job onto suitable computing nodes;

S2. Maintaining the state of the cluster nodes and controlling cluster change operations through a cluster management module.

The present invention also provides a DAG job scheduling and cluster management system, including:

a task management module and a cluster management module, the task management module being connected to the cluster management module;

the task management module is used to receive and parse DAG jobs and to schedule the subtasks obtained from parsing onto suitable computing nodes;

the cluster management module is used to maintain the state of the cluster nodes and to control cluster change operations.

An embodiment of the present invention also provides a DAG job scheduling and cluster management device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above method.

An embodiment of the present invention also provides a computer-readable storage medium storing a program for implementing information transfer, wherein the program, when executed by a processor, implements the steps of the above method.

With the embodiments of the present invention, DAG job scheduling and cluster management can be achieved.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other purposes, features, and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.

Brief description of the drawings

In order to illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 2 is a detailed schematic diagram of the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the cluster management module of the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of task aggregation in the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of task scheduling in the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of node allocation in the DAG job scheduling and cluster management system according to an embodiment of the present invention;

Fig. 7 is a flowchart of the method for DAG job scheduling and cluster management according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the device for DAG job scheduling and cluster management according to an embodiment of the present invention.

Detailed description

The technical solutions of the present invention are described clearly and completely below in conjunction with the embodiments. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

System embodiment

According to an embodiment of the present invention, a DAG job scheduling and cluster management system is provided. Fig. 1 is a schematic diagram of the DAG job scheduling and cluster management system according to an embodiment of the present invention. As shown in Fig. 1, the system specifically includes:

The cluster management module, which interacts with the task management module by maintaining the state of the cluster nodes and controlling cluster change operations.

The task management module is specifically used to: receive and parse DAG jobs; recursively compute, starting from the first task of the job, the latest start time, latest completion time, and priority of each task; update the task queues after parsing a job submitted by a user; and schedule the subtasks obtained from parsing the DAG job onto suitable computing nodes.

The task management module is specifically used to traverse the tasks in the DAG job. If subtask 1 and subtask 2 have a direct data dependency (subtask 1 must transfer data to subtask 2 after it finishes computing), subtask 2 is the only direct successor of subtask 1, and subtask 1 is the only direct predecessor of subtask 2, then the two subtasks are clustered and treated as a single task. The computation cost of the clustered task is the sum of the computation costs of the two subtasks, and its predecessors and successors are the predecessors of subtask 1 and the successors of subtask 2, respectively. The clustering operation is applied to the DAG job repeatedly until no new clustered task is produced;

recursively compute, starting from the first task of the job, the latest start time, latest completion time, and priority of each task;

traverse all tasks obtained from parsing the DAG job. A task with no predecessors is inserted into the ready queue to wait for allocation by the scheduler; otherwise it is inserted into the waiting queue. The waiting queue is also updated whenever a task in the cluster finishes computing: the system checks whether the waiting queue contains successors of the finished task and, if so, updates their state. It then determines whether each updated task satisfies the execution condition, namely that all of its predecessors have completed; if so, the task is added to the ready queue to wait for scheduling;

the order of task scheduling is determined by the state of the ready queue. The ready queue is first traversed to determine whether there are urgent tasks, that is, tasks whose latest start time is already earlier than the current time; any such task is scheduled immediately. After the urgent tasks have been ordered, the remaining tasks are scheduled in order of priority.

The cluster management module is specifically used as follows: the node agents and the central server interact periodically through RPC; each node agent reports its own state to the central server, and the central server replies to the corresponding node agent with the desired node state;

when a node's lease ends and its computing resources need to be released, an offline operation is performed on the node. The central server sets the desired state of the node to offline and writes the new desired state into its next reply to that node agent's report. After receiving it, the node agent begins changing the Kubernetes-related services on the node and, once finished, reports the new node state to the central server, completing the offline operation;

when a scale-out request is sent, the cluster management module performs an online operation on the target node: the central server modifies the desired state of the node, replies to the node's report, the node agent modifies the state of the Kubernetes services on the node, and the node agent reports the node information.

The specific implementation is as follows:

The system mainly comprises two modules: the cluster management module and the task management module.

The main function of the task management module is to receive and parse DAG jobs submitted by users and to schedule their subtasks onto suitable computing nodes. The waiting task queue consists of tasks that do not yet satisfy the condition that all of their predecessors have completed. Because the input of such a task depends on its predecessors, the task cannot start computing promptly while any predecessor is unfinished, even if it has been assigned to a computing node. The ready task queue consists of tasks whose predecessors have all completed. The node state is the same as that held in the cluster management module.

The main function of the cluster management module is to maintain the state of the cluster nodes and to control cluster change operations, including adding a new node to the cluster and removing a node from the cluster. The cluster node state includes the hardware load of each computing node (CPU, memory, etc.) and the number of tasks waiting to be executed on that node. The cluster management module provides the state of the nodes in the cluster to the task management module, and receives and executes the cluster change operations issued by the task management module. The cluster management module is divided into a central side and a node side. An agent deployed on each computing node of the cluster forms the node side, and the node agent must implement declarative management of its node. The node side sends node-related information to the central server on the central side, and accepts and maintains the declarative node state issued by the central side.

The task management module and the cluster management module interact through the cluster node state and cluster change operations. While the system is running, the cluster node state is updated by the cluster management module and read by the task management module; cluster change operations are produced by the task scheduler in the task management module and sent to the computing resource manager in the cluster management module. The task scheduler schedules tasks according to the information stored in the cluster node state of the cluster management module, and submits cluster change operations to the computing resource manager according to the current state of the task queues.

1. After a DAG job submitted by a user is received, it is parsed.

Task clustering is performed first. Task clustering is the process of aggregating several interdependent subtasks in the DAG and treating them as a single task. It reduces the time spent on data transfer between tasks and improves scheduling efficiency.

The clustering operation is as follows. The tasks in the DAG job are traversed. If subtask 1 and subtask 2 have a direct data dependency (subtask 1 must transfer data to subtask 2 after it finishes computing), subtask 2 is the only direct successor of subtask 1, and subtask 1 is the only direct predecessor of subtask 2, then the two subtasks are clustered and treated as a single task. The computation cost of the clustered task is the sum of the computation costs of the two tasks, and its predecessors and successors are the predecessors of subtask 1 and the successors of subtask 2, respectively. In this phase, the system applies the clustering operation to the DAG job repeatedly until no new clustered task is produced.
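The clustering rule can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Task` class, its field names, the `cluster_tasks` function, and the `"A+B"` naming of merged tasks are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cost: float                                # estimated computation cost
    preds: set = field(default_factory=set)    # names of direct predecessors
    succs: set = field(default_factory=set)    # names of direct successors


def cluster_tasks(tasks):
    """Merge a -> b whenever b is a's only successor and a is b's only predecessor.

    `tasks` maps task name -> Task and is modified in place. The scan is
    repeated until no further merge is possible.
    """
    merged = True
    while merged:
        merged = False
        for a in list(tasks.values()):
            if len(a.succs) != 1:
                continue
            b = tasks[next(iter(a.succs))]
            if b.preds != {a.name}:
                continue
            # Clustered task: summed cost, a's predecessors, b's successors.
            c = Task(f"{a.name}+{b.name}", a.cost + b.cost,
                     set(a.preds), set(b.succs))
            for p in c.preds:
                tasks[p].succs.discard(a.name)
                tasks[p].succs.add(c.name)
            for s in c.succs:
                tasks[s].preds.discard(b.name)
                tasks[s].preds.add(c.name)
            del tasks[a.name], tasks[b.name]
            tasks[c.name] = c
            merged = True
            break  # the graph changed, so restart the scan
    return tasks
```

For a job with the chain A -> B -> C followed by a fan-out from C to D and E, the sketch merges A, B, and C into one task and leaves D and E untouched, because C has two successors.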

Next, the latest start time, latest completion time, and priority of each task are computed recursively, starting from the first task of the job.

The latest start time of a task is its latest completion time minus its estimated running time. Latest completion times are derived from the deadline of the job as a whole. The latest completion time of the last task of the job is the deadline of the job itself; subtracting that task's estimated running time gives its latest start time. The latest completion time of every other task is determined by the latest start time of its direct successor plus the data transfer time; when a task has multiple direct successors, the value is computed for each of them and the maximum is taken.
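The recursion can be sketched as below. One point is an assumption on our part and departs from the literal wording above: the sketch subtracts the transfer time from each successor's latest start time and takes the minimum over successors, which is the conventional backward pass and the reading under which no successor's latest start time is missed. The function name `latest_times` and the shapes of its arguments are hypothetical.

```python
from functools import lru_cache


def latest_times(succs, run_time, transfer, deadline):
    """Return {task: (latest_start, latest_finish)}.

    succs    -- task -> list of direct successors
    run_time -- task -> estimated running time
    transfer -- (task, successor) -> data transfer time
    deadline -- deadline of the whole job
    """
    @lru_cache(maxsize=None)
    def latest_finish(t):
        if not succs[t]:
            return deadline  # a final task must finish by the job deadline
        # Assumption: subtract the transfer time and take the minimum, so
        # that every successor can still start by its own latest start time.
        return min(latest_start(s) - transfer[(t, s)] for s in succs[t])

    def latest_start(t):
        return latest_finish(t) - run_time[t]

    return {t: (latest_start(t), latest_finish(t)) for t in succs}
```

With a diamond-shaped job A -> {B, C} -> D, running times 2, 3, 1, 2, unit transfer times, and a deadline of 20, the sketch gives D the window (18, 20), B (14, 17), C (16, 17), and A (11, 13).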

The latest start time of a task is used to assess how urgent the job is, and directly affects the decisions the scheduler makes when allocating resources. For example, if the scheduler finds while scheduling a task that the machines available in the current cluster cannot meet the task's latest completion time, it decides to scale out the cluster, adding nodes so that the latest completion time can be met.

The priority of a task is the sum of the computation time and data transfer time of all tasks in a particular set associated with that task. This set is the descendant task set, whose elements are the task itself and its descendants. When computing the priorities of all tasks in a DAG job, the direct time cost of every task is computed first, that is, the computation time the task requires plus the data transfer time to all of its successors. Then the descendants of each task are computed. To reduce the time complexity of this step, a bitmap is used to store the descendant task set of each task, with the bits of the bitmap corresponding one-to-one to the subtasks of the DAG job. For example, if subtask 2 is a successor of subtask 1, the bit for subtask 2 in the bitmap of subtask 1 is set to 1; otherwise it is set to 0. The descendant task set of a task can then be obtained simply by taking the union of the descendant task sets of all of its successors. Note that in the bitmap representation of a task's descendant task set, the bit for the task itself should also be set to 1.
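A minimal sketch of the bitmap-based priority computation follows. It uses Python integers as bitmaps; the function name `priorities`, the argument shapes, and the requirement that the caller supplies the tasks in topological order are assumptions made for illustration.

```python
def priorities(order, succs, compute, transfer):
    """Return {task: priority}.

    order    -- tasks in topological order (predecessors before successors)
    succs    -- task -> list of direct successors
    compute  -- task -> computation time
    transfer -- (task, successor) -> data transfer time
    """
    index = {t: i for i, t in enumerate(order)}  # one bit per subtask
    # Direct cost: own computation time plus transfers to all direct successors.
    direct = {
        t: compute[t] + sum(transfer[(t, s)] for s in succs[t]) for t in order
    }
    desc = {}
    for t in reversed(order):        # successors are processed before t
        bits = 1 << index[t]         # a task belongs to its own descendant set
        for s in succs[t]:
            bits |= desc[s]          # union of the successors' descendant sets
        desc[t] = bits
    # Priority: sum of direct costs over every task whose bit is set.
    return {
        t: sum(direct[u] for u in order if (desc[t] >> index[u]) & 1)
        for t in order
    }
```

Because the descendant set is a set, a task reachable along several paths is counted once. In the diamond A -> {B, C} -> D with computation times 2, 3, 1, 2 and unit transfer times, D contributes only once to the priority of A, which comes out as 12.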

3. After the job submitted by the user has been parsed, the task queues are updated.

All tasks obtained from parsing the DAG job are traversed. A task with no predecessors is inserted into the ready queue to wait for allocation by the scheduler; otherwise it is inserted into the waiting queue. Besides the update performed after a job is submitted and parsed, the waiting queue is also updated whenever a task in the cluster finishes computing. When a task finishes, the system checks whether the waiting queue contains successors of the finished task and, if so, updates their state. It then determines whether each updated task satisfies the execution condition, namely that all of its predecessors have completed; if so, the task is added to the ready queue to wait for scheduling.
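The queue maintenance described above can be sketched as follows. The class name `TaskQueues`, its attributes, and the method `on_task_finished` are hypothetical; plain lists stand in for whatever queue structure the system actually uses.

```python
class TaskQueues:
    """Ready and waiting queues for the tasks of one parsed DAG job."""

    def __init__(self, preds):
        # preds: task -> set of direct predecessors
        self.remaining = {t: set(p) for t, p in preds.items()}
        # Tasks without predecessors are ready immediately.
        self.ready = [t for t, p in self.remaining.items() if not p]
        self.waiting = [t for t, p in self.remaining.items() if p]

    def on_task_finished(self, done):
        """Update waiting successors of `done`; promote those now runnable."""
        for t in list(self.waiting):
            if done in self.remaining[t]:
                self.remaining[t].discard(done)
                if not self.remaining[t]:   # all predecessors have completed
                    self.waiting.remove(t)
                    self.ready.append(t)
```

A short usage example, with removal from the ready list standing in for dispatch by the scheduler:

```python
q = TaskQueues({"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}})
q.ready.remove("A")        # the scheduler dispatches A
q.on_task_finished("A")    # B and C become ready; D still waits for both
```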

4. Task scheduling proceeds concurrently with job parsing and queue updating.

The scheduler determines the order of task scheduling from the state of the ready queue. It first traverses the ready queue to determine whether there are urgent tasks, that is, tasks whose latest start time is already earlier than the current time; any such task is scheduled immediately. After the urgent tasks have been ordered, the scheduler schedules the remaining tasks in order of priority.
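The ordering rule can be sketched as below. Two details are assumptions, since the text does not specify them: urgent tasks are ordered among themselves by latest start time (most overdue first), and a larger priority value is scheduled earlier. The function name `scheduling_order` is hypothetical.

```python
def scheduling_order(ready, latest_start, priority, now):
    """Return the ready tasks in the order they should be scheduled.

    ready        -- list of ready task names
    latest_start -- task -> latest start time
    priority     -- task -> priority value
    now          -- current time, in the same unit as latest_start
    """
    # Urgent tasks: the latest start time has already passed.
    urgent = [t for t in ready if latest_start[t] < now]
    rest = [t for t in ready if latest_start[t] >= now]
    # Assumption: the most overdue urgent task goes first.
    urgent.sort(key=lambda t: latest_start[t])
    # Assumption: a larger priority value is scheduled earlier.
    rest.sort(key=lambda t: priority[t], reverse=True)
    return urgent + rest
```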

For resource allocation, the scheduler must read the state of all nodes in the current cluster in order to find a suitable node, and compute the earliest completion time of the task to be scheduled on each node. The scheduler first filters out nodes that cannot meet the task's latest completion time, since assigning the task to such a node could delay completion of the job as a whole. If no nodes remain after this round of filtering, the scheduler adds a new node to the cluster and assigns the task to it. If the current cluster does contain nodes that meet the latest completion time of the task, the scheduler prefers nodes that can complete the task within their current lease period, so that when overall cluster load is low, the other nodes can leave the cluster promptly once their current lease period ends, which improves resource utilization. If several nodes can complete the task within the current lease period, the scheduler applies a greedy strategy and selects the node that can currently complete the task soonest. Under this strategy, a task is more likely to be assigned to the node that is executing its predecessor, which reduces the data transfer overhead between tasks and thus improves the overall completion time of the job. If no node can complete the task within the current lease period, the same greedy strategy is used to select a node from among all nodes that meet the latest completion time of the task.
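The node selection policy can be sketched as follows. The `Node` class and its fields are hypothetical, and the sketch assumes the earliest completion time of the task on each node has already been computed. Returning `None` stands for the decision to scale out the cluster.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    finish_time: float   # earliest time this node could complete the task
    lease_end: float     # end of the node's current lease period


def choose_node(nodes, latest_finish):
    """Return the chosen Node, or None if a new node must be added."""
    # Filter out nodes that cannot meet the task's latest completion time.
    feasible = [n for n in nodes if n.finish_time <= latest_finish]
    if not feasible:
        return None  # signal a scale-out request
    # Prefer nodes that can finish within their current lease period.
    in_lease = [n for n in feasible if n.finish_time <= n.lease_end]
    # Greedy choice: the node that completes the task soonest.
    return min(in_lease or feasible, key=lambda n: n.finish_time)
```

Note that a node finishing later but inside its lease is preferred over one finishing sooner but past its lease, which matches the goal of letting lightly loaded nodes leave the cluster when their lease ends.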

Cluster management module workflow:

The cluster management module is mainly responsible for declaratively carrying out the cluster change operations required by the task management module.

1. First, the node agents and the central server interact periodically through RPC. Each node agent reports its own state to the central server, and the central server replies to the corresponding node agent with the desired node state.

2. When a node's lease ends and its computing resources need to be released, the cluster management module performs an offline operation on the node. The central server first sets the desired state of the node to offline and writes the new desired state into its next reply to that node agent's report. After receiving it, the node agent begins changing the Kubernetes-related services on the node and, once finished, reports the new node state to the central server, completing the offline operation.

3. When the task management module sends a scale-out request, the cluster management module performs an online operation on the target node. As with taking a node offline, the flow is: the central server modifies the desired state of the node, replies to the node's report, the node agent modifies the state of the Kubernetes services on the node, and the node agent reports the node information.
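The report-and-reply loop in the three steps above can be sketched as follows. This is an in-process illustration under several assumptions: a direct method call stands in for the periodic RPC, the class and method names (`CentralServer`, `NodeAgent`, `heartbeat`, `tick`) and the state strings are hypothetical, and the step that changes the Kubernetes services on the node is left as a placeholder.

```python
class CentralServer:
    def __init__(self):
        self.desired = {}    # node name -> desired state
        self.reported = {}   # node name -> last reported state

    def set_desired(self, node, state):
        """Record a cluster change operation, e.g. "online" or "offline"."""
        self.desired[node] = state

    def heartbeat(self, node, current_state):
        """Handle one periodic report; the return value is the reply."""
        self.reported[node] = current_state
        return self.desired.get(node, current_state)


class NodeAgent:
    def __init__(self, name, server, state="offline"):
        self.name, self.server, self.state = name, server, state

    def tick(self):
        """One reporting period: report state, then reconcile with the reply."""
        desired = self.server.heartbeat(self.name, self.state)
        if desired != self.state:
            self.apply(desired)

    def apply(self, desired):
        # Placeholder: a real agent would start or stop the Kubernetes
        # services on the node here before updating its state.
        self.state = desired
```

In this sketch a change takes two reporting periods to become visible at the central server: the first reply carries the new desired state, and the second report confirms that the node has reached it.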

The invention provides a scheduling system for directed acyclic graph (DAG) jobs that processes user-submitted DAG jobs promptly and dynamically, and manages cluster resources dynamically according to the job load. The system parses a user's job, evaluates the priority, latest start time, and latest completion time of its subtasks, and allocates computing resources on the basis of this information and the cluster node state when scheduling tasks. In this way it improves node utilization and saves computing resources while ensuring, as far as possible, that user jobs complete on time. To support the cluster management operations required when the system schedules tasks, the invention also provides a cluster management method that uses node agents to solve the Kubernetes configuration problem.

The technical solution proposed by the present invention designs a job scheduling system based on a Kubernetes cluster and a scheduling method that takes both user job deadlines and computing resource utilization into account, solving the scheduling problem for DAG jobs. Compared with the prior art, the method of the invention achieves higher resource utilization and lowers the cost of leasing computing resources while satisfying user job deadlines. The invention also integrates a cluster management method suited to the system, which can quickly complete node change operations during task scheduling and thus manage and control computing resources.

Method embodiment

According to an embodiment of the present invention, a method for DAG job scheduling and cluster management is provided. Fig. 7 is a flowchart of the method for DAG job scheduling and cluster management according to an embodiment of the present invention. As shown in Fig. 7, the method specifically includes:

S2. Maintaining the state of the cluster nodes and controlling cluster change operations through the cluster management module, which thereby interacts with the task management module.

S1 specifically includes: receiving and parsing DAG jobs; recursively computing, starting from the first task of the job, the latest start time, latest completion time, and priority of each task; updating the task queues after parsing a job submitted by a user; and scheduling the subtasks obtained from parsing the DAG job onto suitable computing nodes.

Receiving and parsing DAG jobs, recursively computing the latest start time, latest completion time, and priority of each task starting from the first task of the job, updating the task queues after parsing a job submitted by a user, and scheduling the subtasks obtained from parsing the DAG job onto suitable computing nodes specifically includes:

对DAG作业中的任务进行遍历，如果两个子任务一和子任务二存在直接的数据依赖关系，子任务一计算完成后需要传输数据到子任务二，且子任务二是子任务一的唯一直接后序任务，子任务一是子任务二的唯一直接前序任务，那么则将两个子任务进行聚类，视为一个任务，聚类后任务的计算开销是两个子任务计算开销之和，聚类后任务的前序任务和后序任务分别为子任务一的前序任务的子任务二的后序任务，对DAG作业不断执行聚类操作直到没有新的聚类任务产生；Traverse the tasks in the DAG job. If there is a direct data dependency between the two subtasks 1 and 2, the calculation of subtask 1 needs to transfer data to subtask 2, and subtask 2 is the only direct successor of subtask 1. If subtask 1 is the only direct preorder task of subtask 2, then the two subtasks are clustered and regarded as one task, and the computational overhead of the clustered task is the sum of the computational overhead of the two subtasks, clustering The pre-order task and the post-sequence task of the post-task are respectively the post-sequence tasks of the pre-sequence task of sub-task 1 and the post-sequence task of sub-task 2, and the clustering operation is continuously performed on the DAG job until no new clustering tasks are generated;

The order in which tasks are scheduled is determined by the state of the task ready queue. The ready queue is first traversed to check whether it contains an urgent task, that is, a task whose latest start time is already earlier than the current time; if so, that task is scheduled immediately. Once the urgent tasks have been ordered, the remaining tasks are scheduled in order of priority.
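The clustering and ordering steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the `"A+B"` naming of merged tasks, the tuple layout of ready-queue entries, the rule that the most overdue urgent task goes first, and the convention that a larger number means higher priority are all hypothetical.

```python
def cluster_chains(cost, edges):
    """Merge every pair (u, v) where v is u's only direct successor and
    u is v's only direct predecessor; repeat until no such pair remains."""
    cost = dict(cost)
    edges = set(edges)
    while True:
        succ, pred = {}, {}
        for u, v in edges:
            succ.setdefault(u, []).append(v)
            pred.setdefault(v, []).append(u)
        pair = next(
            ((u, v) for u, v in sorted(edges)
             if len(succ[u]) == 1 and len(pred[v]) == 1),
            None,
        )
        if pair is None:
            return cost, edges  # no new clustered task was produced
        u, v = pair
        merged = f"{u}+{v}"  # hypothetical naming scheme
        # Cost of the clustered task is the sum of the two subtask costs.
        cost[merged] = cost.pop(u) + cost.pop(v)
        rewired = set()
        for a, b in edges:
            if (a, b) == (u, v):
                continue  # the internal edge disappears
            # Predecessors of u and successors of v now attach to the merged task.
            rewired.add((merged if a in (u, v) else a,
                         merged if b in (u, v) else b))
        edges = rewired


def scheduling_order(ready, now):
    """ready: list of (name, latest_start, priority) tuples.
    Urgent tasks (latest_start < now) come first, then the rest by priority."""
    urgent = sorted((t for t in ready if t[1] < now), key=lambda t: t[1])
    rest = sorted((t for t in ready if t[1] >= now), key=lambda t: -t[2])
    return [t[0] for t in urgent + rest]


# Hypothetical job: A -> B -> C, then C fans out to D and E.
merged_cost, merged_edges = cluster_chains(
    {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5},
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")],
)
order = scheduling_order(
    [("t1", 50, 1), ("t2", 5, 0), ("t3", 40, 9), ("t4", 8, 3)], now=10
)
```

Here the chain A, B, C collapses into one task of cost 6, while D and E stay separate because C has two successors. In the ready queue, `t2` and `t4` are urgent at `now=10` and are placed ahead of `t3` and `t1`.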

S2 specifically includes: the node agent and the central server interact periodically via RPC; the node agent reports its own state to the central server, and the central server replies to the corresponding node agent with the desired node state;
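The report-and-reply exchange in S2 can be illustrated with a small in-process sketch. A plain method call stands in for the RPC, and the class names, the state strings `"ready"` and `"cordoned"`, and the rule that the agent adopts the desired state it receives are assumptions made for illustration; the text only says that the server replies with the desired state.

```python
class CentralServer:
    """Keeps the desired state per node and the last state each node reported."""

    def __init__(self, desired):
        self.desired = dict(desired)  # node name -> desired state
        self.reported = {}            # node name -> last reported state

    def heartbeat(self, node, state):
        # Stand-in for the RPC handler: record the report, reply with the
        # desired state (or echo the current one if none is configured).
        self.reported[node] = state
        return self.desired.get(node, state)


class NodeAgent:
    """Reports its own state and receives the desired state in return."""

    def __init__(self, name, state, server):
        self.name = name
        self.state = state
        self.server = server

    def tick(self):
        # One periodic exchange; a real agent would run this on a timer.
        desired = self.server.heartbeat(self.name, self.state)
        if desired != self.state:
            # Assumption: the agent converges to the desired state itself.
            self.state = desired


server = CentralServer({"node-1": "cordoned"})
agent = NodeAgent("node-1", "ready", server)
agent.tick()
first_report = server.reported["node-1"]
agent.tick()
second_report = server.reported["node-1"]
```

After the first exchange the server has seen `"ready"` and the agent has switched to `"cordoned"`; the second exchange reports the new state back, so the server's view and the desired state agree.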

Device embodiment one

An embodiment of the present invention provides a device for DAG job scheduling and cluster management. As shown in FIG. 8, the device includes a memory 80, a processor 82, and a computer program stored in the memory 80 and executable on the processor 82; when executed by the processor, the computer program implements the steps of the above method embodiments.

Device embodiment two

An embodiment of the present invention provides a computer-readable storage medium on which a program implementing information transmission is stored; when executed by the processor 82, the program implements the steps of the above method embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A system for DAG job scheduling and cluster management, characterized in that it comprises,

2. The system according to claim 1, characterized in that the task management module is specifically configured to: receive and parse the DAG job; recursively compute, starting from the first task of the job, the latest start time, latest finish time, and priority of each task; update the task queue once the job submitted by the user has been parsed; and schedule the subtasks obtained by parsing the DAG job onto suitable computing nodes.

3. The system according to claim 1, characterized in that the task management module is specifically configured to traverse the tasks in the DAG job; if subtask 1 and subtask 2 have a direct data dependency, that is, subtask 1 must transfer data to subtask 2 after its computation completes, and subtask 2 is the only direct successor of subtask 1 while subtask 1 is the only direct predecessor of subtask 2, then the two subtasks are clustered and treated as a single task, the computation cost of the clustered task being the sum of the computation costs of the two subtasks, and the predecessors and successors of the clustered task being the predecessors of subtask 1 and the successors of subtask 2, respectively; the clustering operation is applied to the DAG job repeatedly until no new clustered task is produced;

4. The system according to claim 3, characterized in that the cluster management module is specifically configured such that: the node agent and the central server interact periodically via RPC; the node agent reports its own state to the central server, and the central server replies to the corresponding node agent with the desired node state;

5. A method for DAG job scheduling and cluster management, characterized in that it comprises,

6. The method according to claim 5, characterized in that S1 specifically comprises: receiving and parsing the DAG job; recursively computing, starting from the first task of the job, the latest start time, latest finish time, and priority of each task; updating the task queue once the job submitted by the user has been parsed; and scheduling the subtasks obtained by parsing the DAG job onto suitable computing nodes.

7. The method according to claim 6, characterized in that receiving and parsing the DAG job, recursively computing the latest start time, latest finish time, and priority of each task starting from the first task of the job, updating the task queue once the job submitted by the user has been parsed, and scheduling the subtasks obtained by parsing the DAG job onto suitable computing nodes specifically includes:

8. The method according to claim 7, characterized in that S2 specifically includes: the node agent and the central server interact periodically via RPC; the node agent reports its own state to the central server, and the central server replies to the corresponding node agent with the desired node state;

9. A device for DAG job scheduling and cluster management, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for DAG job scheduling and cluster management according to any one of claims 1 to 4.

10. A computer-readable storage medium, characterized in that a program implementing information transfer is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the method for DAG job scheduling and cluster management according to any one of claims 1 to 4.