Technical Field
The present invention relates to the field of computer parallel computing, and in particular to a job scheduling method and device for streaming data.
Background
In recent years, with the continuous development of applications such as real-time search, advertisement recommendation, social networking, and online log analysis, a new data form, streaming data, has been emerging. Streaming data refers to a large, fast, uninterrupted sequence of events. In different scenarios, streaming data can take many forms, such as real-time queries, user clicks, online logs, and streaming media. Streaming applications emphasize real-time interaction; high-latency responses seriously degrade their functionality or user experience. Owing to the importance and uniqueness of streaming data, a batch of streaming data processing systems has emerged, such as Yahoo!'s S4 system.
Events are the basic building blocks of streaming data and appear as key-value pairs. A processing unit is the basic unit for processing events; it has a specific event type and key and handles only events with the corresponding type and key. A processing unit receives streaming data, processes the events in it, and then emits new events or publishes results directly.
Streaming data processing is characterized by heavy communication traffic, intensive computation, and fast response requirements, placing high demands on system performance. In streaming data processing, processing units are distributed over physical nodes and consume physical resources such as CPU, memory, and network bandwidth; an overloaded physical node directly degrades the performance of the processing units it hosts. In addition, processing units must communicate with one another to transmit streaming data, which incurs network latency and communication cost. How to schedule processing units appropriately according to the characteristics of streaming data is therefore a key problem in streaming data processing.
Existing streaming data processing systems do not jointly consider inter-unit communication and the real-time load of physical nodes; they merely ensure that processing units are evenly distributed by count. However, because different processing units demand different amounts of each resource, balance in count does not imply balance in resource usage or actual load. Moreover, processing units that communicate frequently with each other may be scheduled to remote nodes, increasing their communication cost and latency. Existing scheduling methods therefore cannot satisfy the needs of streaming data processing well, leading to unbalanced node loads and high communication latency and degrading the overall performance of streaming data processing.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a streaming-data-oriented job scheduling method and device that compute each physical node's non-local communication count and dominant resource ratio, rank the physical nodes by these criteria, and thereby schedule processing units onto physical nodes with fewer non-local communications and lower load.
The technical solution of the present invention for solving the above technical problem is as follows: a job scheduling method for streaming data, comprising the following steps:
Step 1: a scheduling manager obtains a job to be scheduled in real time from a scheduling queue that stores jobs to be scheduled, and, according to the information of the job, uses a directed acyclic graph to generate a processing unit queue comprising a plurality of processing units; the scheduling manager is deployed on a high-specification physical node, and each of the other physical nodes, on which no scheduling manager is deployed, runs one executor;
Step 2: the scheduling manager selects a physical node for each processing unit in the processing unit queue according to the non-local communication count and dominant resource ratio of each physical node, and assigns all processing units to their corresponding physical nodes; each physical node may host zero or more processing units;
Step 3: when starting a processing unit, the executor first creates a Linux container on its physical node and then starts the processing unit inside the Linux container.
The beneficial effects of the present invention are as follows: the method gathers processing units that communicate frequently onto the same physical node, reducing network communication across physical nodes. Meanwhile, when communication costs are equal or a physical node is overloaded, the method selects a less-loaded physical node for deployment, preventing overload. The present invention reduces the communication cost and network latency of streaming data processing, achieves load balancing, and improves the overall performance of streaming data processing.
On the basis of the above technical solution, the present invention can be further improved as follows.
Further, step 1 specifically comprises the following steps:
Step 1.1: the scheduling manager obtains the job to be scheduled from the scheduling queue on a first-in-first-out basis;
Step 1.2: according to the predetermined business requirement information and concurrency requirement information of the job to be scheduled, the job is decomposed into a directed acyclic graph whose vertices are processing units and whose edges are data paths;
Step 1.3: a vertex with no incoming edges is selected from the directed acyclic graph and added to the processing unit queue;
Step 1.4: the vertex selected in step 1.3 is deleted from the directed acyclic graph, together with all edges emanating from it;
Step 1.5: it is determined whether any vertices remain in the directed acyclic graph; if so, return to step 1.3; if not, end.
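Steps 1.2 to 1.5 describe a topological sort of the job's DAG (Kahn's algorithm). A minimal sketch in Python, assuming the job has already been decomposed into an adjacency list of processing-unit identifiers (the data layout is illustrative, not the system's actual representation):

```python
from collections import deque

def build_unit_queue(dag):
    """Topologically sort a DAG given as {vertex: [successors]}.

    Repeatedly take a vertex with no incoming edges (step 1.3),
    remove it and its outgoing edges (step 1.4), until no vertices
    remain (step 1.5).
    """
    indegree = {v: 0 for v in dag}
    for succs in dag.values():
        for s in succs:
            indegree[s] += 1
    # Vertices with no incoming edges are ready to be dequeued.
    ready = deque(sorted(v for v, d in indegree.items() if d == 0))
    queue = []
    while ready:
        v = ready.popleft()
        queue.append(v)
        for s in dag[v]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(queue) != len(dag):
        raise ValueError("graph contains a cycle; not a valid job DAG")
    return queue
```

The resulting queue guarantees that every unit is scheduled only after all units that feed data into it.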
Further, step 2 further comprises the following steps:
Step 2.1: the non-local communication count and dominant resource ratio of each physical node are computed, wherein the non-local communication count is the number of already-scheduled processing units that need to communicate over the network with the processing unit to be scheduled and do not reside on the same physical node, and the dominant resource ratio is the highest demand-to-availability ratio among the demand-to-availability ratios of the various resources;
Step 2.2: the physical nodes are sorted according to their non-local communication counts and dominant resource ratios to obtain a sorted list, and every physical node in the list is marked as "unread";
Step 2.3: the first "unread" physical node in the sorted list is selected and marked as "read";
Step 2.4: it is determined whether the weighted load value of the selected physical node is less than a predetermined value; if so, the processing unit to be scheduled is placed on that physical node, and the procedure ends;
Step 2.5: it is determined whether any "unread" physical nodes remain in the sorted list; if so, return to step 2.3; if not, the first physical node in the sorted list is selected.
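Steps 2.1 to 2.5 amount to ranking the candidate nodes and then walking the ranked list until a node below the load threshold is found, falling back to the best-ranked node when all are overloaded. A hedged Python sketch, where the node fields and the threshold value are assumptions for illustration:

```python
def select_node(nodes, threshold=0.8):
    """Pick a physical node for one processing unit per steps 2.1-2.5.

    `nodes` is a list of dicts with precomputed 'nonlocal_comm',
    'dominant_ratio' and 'weighted_load' fields (step 2.1 is assumed
    done).  Nodes are ranked ascending by (non-local communication
    count, dominant resource ratio); the first node whose weighted
    load is below the threshold wins, otherwise the first node in
    the ranked list is used.
    """
    ranked = sorted(nodes, key=lambda n: (n['nonlocal_comm'],
                                          n['dominant_ratio']))  # step 2.2
    for node in ranked:                        # steps 2.3/2.5: scan "unread" nodes
        if node['weighted_load'] < threshold:  # step 2.4
            return node
    return ranked[0]                           # all overloaded: best-ranked node
```

Marking nodes "read"/"unread" in the original corresponds to the iteration order over the ranked list.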
Further, computing the non-local communication count of a physical node further comprises:
initializing the non-local communication count to 0;
assuming that the processing unit to be scheduled has already been placed on the current physical node, and then traversing all processing units that have already been scheduled;
if an already-scheduled processing unit needs to communicate with the processing unit to be scheduled and the two are not on the same physical node, incrementing the current non-local communication count by one; the final value obtained is the non-local communication count.
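This count can be sketched as follows, assuming each scheduled unit records the node it runs on and the unit to be scheduled records its communication peers (the field names are hypothetical):

```python
def nonlocal_comm_count(candidate_node, pending_unit, scheduled_units):
    """Non-local communication count of `candidate_node` for `pending_unit`.

    Hypothetically place the pending unit on the candidate node, then
    count every already-scheduled unit that communicates with it but
    resides on a different physical node.
    """
    count = 0
    for unit in scheduled_units:
        communicates = unit['id'] in pending_unit['peers']
        if communicates and unit['node'] != candidate_node:
            count += 1
    return count
```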
Further, the dominant resource ratio is specifically computed as follows:
the demand-to-availability ratio of each resource on the physical node is computed, the demand-to-availability ratio being the ratio of the processing unit's demand for a resource to the physical node's available amount of that resource;
the highest among the demand-to-availability ratios of the various resources is selected as the dominant resource ratio of the physical node.
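A sketch of this computation in Python; the infinity marking for nodes with insufficient resources follows the rule given later in the embodiment, and the resource names are illustrative:

```python
def dominant_resource_ratio(demand, available):
    """Highest demand-to-availability ratio across all resources.

    `demand` and `available` map resource names (e.g. 'cpu',
    'mem_gb', 'net_mbits') to amounts.  If any resource's demand
    exceeds its availability, the node cannot host the unit, so its
    ratio is marked as infinity and it is excluded from the ranking.
    """
    ratios = []
    for resource, need in demand.items():
        have = available[resource]
        if need > have:
            return float('inf')  # insufficient resources on this node
        ratios.append(need / have)
    return max(ratios)
```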
Further, in step 2.2, sorting the physical nodes in ascending order according to their non-local communication counts and dominant resource ratios specifically comprises the following steps:
two physical nodes first compare their non-local communication counts; if the counts differ, the node with the smaller count is ranked first; if they are equal, their dominant resource ratios are compared, and the node with the smaller dominant resource ratio is ranked first.
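This two-level comparison is a lexicographic sort on the pair (non-local communication count, dominant resource ratio); in Python a tuple sort key expresses it directly (the field names are illustrative):

```python
nodes = [
    {'name': 'a', 'nonlocal_comm': 2, 'dominant_ratio': 0.50},
    {'name': 'b', 'nonlocal_comm': 2, 'dominant_ratio': 0.30},
    {'name': 'c', 'nonlocal_comm': 1, 'dominant_ratio': 0.90},
]
# Primary key: non-local communication count; tie-break: dominant resource ratio.
ranked = sorted(nodes, key=lambda n: (n['nonlocal_comm'], n['dominant_ratio']))
print([n['name'] for n in ranked])  # c first (fewest non-local), then b, then a
```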
Further, the weighted load value is specifically computed as:
weighted load value = CPU utilization × 0.3 + memory utilization × 0.3 + network bandwidth utilization × 0.4.
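As a direct transcription of the formula (the utilizations are taken to be fractions in [0, 1]):

```python
def weighted_load(cpu_util, mem_util, net_util):
    """Weighted load value with the fixed 0.3 / 0.3 / 0.4 weights."""
    return cpu_util * 0.3 + mem_util * 0.3 + net_util * 0.4
```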
Further, a streaming-data-oriented job scheduling device comprises a scheduling manager, executors, and a scheduling queue;
the scheduling manager is deployed on a high-specification physical node and is configured to obtain jobs to be scheduled from the scheduling queue in real time; to generate, using a directed acyclic graph and according to the information of the job to be scheduled, a processing unit queue comprising a plurality of processing units; to select a physical node for each processing unit in the queue according to the non-local communication count and dominant resource ratio of each physical node; and to assign all processing units to their corresponding physical nodes, each physical node being provided with at least one processing unit;
the executor is configured to start processing units: it places each processing unit dispatched to its physical node by the scheduling manager inside a Linux container and starts the processing unit inside the Linux container;
the scheduling queue is deployed on the same physical node as the scheduling manager and is used to store jobs to be scheduled.
Further, the scheduling manager comprises a collection module and a scheduling module;
the collection module is configured to collect, for each executor, the IP address and communication port of the physical node where the executor resides, the total and available amount of each resource, the executor's resource usage, and the resource usage of its processing units;
the scheduling module is configured to obtain jobs to be scheduled from the scheduling queue, generate a processing unit queue comprising a plurality of processing units according to the job information, select a physical node for each processing unit in the queue, and dispatch all processing units to their corresponding physical nodes.
Further, each processing unit is assigned a unique processing unit identifier.
Brief Description of the Drawings
Fig. 1 is a flowchart of the steps of the method of the present invention;
Fig. 2 is a flowchart of generating the processing unit queue according to the present invention;
Fig. 3 is a schematic diagram of a directed acyclic graph of processing units according to the present invention;
Fig. 4 is a schematic diagram of a transformation of the directed acyclic graph of processing units according to the present invention;
Fig. 5 is a flowchart of the processing unit scheduling method of the present invention;
Fig. 6 is a flowchart of deploying a processing unit;
Fig. 7 is a structural diagram of the device of the present invention.
In the drawings, the parts represented by the reference numerals are listed as follows:
1. scheduling manager; 2. executor; 3. scheduling queue.
Detailed Description
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given are intended only to explain the present invention and not to limit its scope.
As shown in Fig. 1, which is a flowchart of the steps of the method of the present invention; Fig. 2 is a flowchart of generating the processing unit queue according to the present invention; Fig. 3 is a schematic diagram of a directed acyclic graph of processing units according to the present invention; Fig. 4 is a schematic diagram of a transformation of that directed acyclic graph; Fig. 5 is a flowchart of the processing unit scheduling method of the present invention; Fig. 6 is a flowchart of deploying a processing unit; Fig. 7 is a structural diagram of the device of the present invention.
Embodiment 1
This embodiment of the present invention implements a streaming data processing system comprising a plurality of executors and one scheduling manager. An executor is a daemon process running on a physical node; except for the physical node hosting the scheduling manager, every physical node managed by the system runs one executor.
An executor can start and stop processing units on its physical node. When starting a processing unit, the executor first creates a Linux container with a specified resource capacity on the physical node, and then starts the task the processing unit needs to execute inside the Linux container. Processing units correspond one-to-one with Linux containers: each processing unit is placed in its own Linux container. A Linux container can allocate specified resources to the processes inside it; since the streaming data processing model is usually accompanied by high-volume communication, the resource types allocated by this system are fairly comprehensive, including CPU, memory, and network bandwidth. In this way, each processing unit runs inside a Linux container, uses the specified resources allocated by the system, and runs independently of the others, which achieves resource isolation, avoids resource contention, and improves the overall performance and running stability of the processing units.
Meanwhile, the executor also monitors the running state and resource usage of its processing units; since each Linux container contains exactly one processing unit, monitoring a processing unit reduces to monitoring the resource usage of its Linux container. The executor periodically sends heartbeats to the collection module of the scheduling manager. Each time a heartbeat is due, the executor aggregates the resource usage of the processing units it manages and the executor's overall resource usage, organizes them into a heartbeat, and sends it to the collection module. The heartbeat interval can be set and managed through a configuration file.
In a streaming data processing system, event sequences are passed between processing units, so the present invention must support communication between processing units; to this end the system provides a namespace mechanism. The system assigns a globally unique identifier (ID) to each processing unit; at initialization, a processing unit only needs to record the IDs of the processing units it communicates with and the corresponding business logic relationships. The system's namespace maintains the mapping from each processing unit's ID to its communication address (IP address and port). When a processing unit communicates with another processing unit for the first time, it first queries the namespace to obtain that unit's communication address and then communicates with it.
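The namespace described above is essentially a registry mapping a unit's ID to its (IP, port) address, consulted once before first contact. A minimal in-memory sketch; the class and method names are illustrative assumptions, not the system's actual API:

```python
class NameSpace:
    """Maps globally unique processing-unit IDs to (IP, port) addresses."""

    def __init__(self):
        self._addresses = {}

    def register(self, unit_id, ip, port):
        # Called when a unit is started on some physical node.
        self._addresses[unit_id] = (ip, port)

    def lookup(self, unit_id):
        # A unit resolves a peer's address before its first communication.
        return self._addresses[unit_id]

ns = NameSpace()
ns.register('pe-42', '10.0.0.7', 9100)
```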
The scheduling manager is the core manager of the system and comprises two parts: a collection module and a scheduling module. To avoid having too many threads inside one program, which would affect its performance and stability, the system implements the two modules as separate processes that communicate with each other via Remote Procedure Call (RPC). The two modules could in theory be deployed on different physical nodes, but to reduce communication overhead they should be deployed on the same physical node in practice.
The collection module maintains global executor resource information, including the IP address and communication port of the physical node hosting each executor as well as the total and available amount of each resource; the scheduling module performs scheduling on the basis of this resource information. After the scheduling module starts or stops a processing unit, the collection module updates the global resource information according to that processing unit's resource demand and deployment node. In addition, the collection module receives the heartbeats periodically sent by the executors, which carry the resource usage of each executor and of its processing units, mainly the states of the executors and processing units and the utilization of each resource.
The scheduling module periodically fetches tasks to be scheduled from the scheduling queue, generates processing units according to the task information, and, on the basis of the global resource information obtained from the collection module, uses the processing unit scheduling method to schedule and start the processing units. In addition, according to the system's operating needs or the system administrator's instructions, the scheduling module can control and dynamically migrate processing units. The system administrator or an external program interacts with the whole system through a client, specifically by interacting with the scheduling module via the client; the interaction includes submitting tasks or issuing instructions.
This embodiment has two kinds of configuration files, used by the scheduling manager and the executors respectively. The scheduling manager's configuration file includes the communication addresses of the scheduling module and the collection module, resource allocation policy options, Linux container configuration information, and so on; the modules read the configuration file for initialization at startup. The executor configuration file includes the executor's communication port, the communication address of the collection module, the network interface bound on the local physical node, and other information; at startup, an executor likewise initializes itself from its configuration file and sends a heartbeat to the collection module to register.
Fig. 1 is a flowchart of the steps of the method of the present invention. The method performs scheduling and deployment according to the following steps:
Step 1: the scheduling manager obtains a job to be scheduled in real time from the scheduling queue that stores jobs to be scheduled, and, according to the information of the job, uses a directed acyclic graph to generate a processing unit queue comprising a plurality of processing units; the scheduling manager is deployed on a high-specification physical node, and each of the other physical nodes, on which no scheduling manager is deployed, runs one executor;
Step 2: the scheduling manager selects a physical node for each processing unit in the processing unit queue according to the non-local communication count and dominant resource ratio of each physical node, and assigns all processing units to their corresponding physical nodes; each physical node may host zero or more processing units;
Step 3: when starting a processing unit, the executor first creates a Linux container on its physical node and then starts the processing unit inside the Linux container.
In this embodiment of the present invention, step 1 obtains the job to be scheduled as follows: according to the first-in-first-out principle, the job to be scheduled is taken from the job queue. The system's job scheduling queue is shared by all users; users submit jobs with a client management tool, and jobs submitted earlier are scheduled first.
Fig. 2 is a flowchart of generating the processing unit queue according to an embodiment of the present invention, used to generate the queue of processing units to be scheduled according to the job information. Its steps are as follows:
Step 1 specifically comprises the following steps:
Step 1.1: the scheduling manager obtains the job to be scheduled from the scheduling queue on a first-in-first-out basis;
Step 1.2: according to the predetermined business requirement information and concurrency requirement information of the job to be scheduled, the job is decomposed into a directed acyclic graph whose vertices are processing units and whose edges are data paths;
Step 1.3: a vertex with no incoming edges is selected from the directed acyclic graph and added to the processing unit queue;
Step 1.4: the vertex selected in step 1.3 is deleted from the directed acyclic graph, together with all edges emanating from it;
Step 1.5: it is determined whether any vertices remain in the directed acyclic graph; if so, return to step 1.3; if not, end.
In this embodiment, the directed acyclic graph described in step 1.2 consists of vertices and edges, where the vertices are processing units and the edges are data paths. The processing unit queue is the result of topologically sorting the processing units. A topological sort is an ordering of the vertices of a directed acyclic graph such that if there is a path from vertex A to vertex B, then B appears after A in the ordering.
Fig. 3 is a schematic diagram of a directed acyclic graph of processing units according to an embodiment of the present invention. In the example shown in Fig. 3, the system generates eight processing units A, B, C, D, E, F, G, and H according to the job information. External streaming data flows into the system through processing units A, B, and C and, after a series of processing steps, finally converges at processing unit H, which outputs the result. From the perspective of a single processing unit, taking D as an example, it receives the event sequences sent by A, B, and C, generates new events after processing, and sends the generated events to F.
Fig. 4 is a schematic diagram of a transformation of the directed acyclic graph of processing units according to an embodiment of the present invention. A topological sort is obtained by repeatedly transforming the directed acyclic graph: each time, a vertex with no incoming edges is selected from the graph, and that vertex and the edges emanating from it are deleted. Fig. 4 shows an intermediate form obtained by applying this transformation to Fig. 3: in Fig. 3 we select vertex A, which has no incoming edges, and delete the two edges emanating from A, yielding the directed acyclic graph shown in Fig. 4. A complete topological sort of the directed acyclic graph in Fig. 3 yields the sequence A, B, C, D, E, F, G, H.
Fig. 5 is a flowchart of the processing unit scheduling method according to an embodiment of the present invention, used to schedule a processing unit and select a suitable physical node for it. Its steps are as follows:
Step 2.1: the non-local communication count and dominant resource ratio of each physical node are computed, wherein the non-local communication count is the number of already-scheduled processing units that need to communicate over the network with the processing unit to be scheduled and do not reside on the same physical node, and the dominant resource ratio is the highest demand-to-availability ratio among the demand-to-availability ratios of the various resources;
Step 2.2: the physical nodes are sorted according to their non-local communication counts and dominant resource ratios to obtain a sorted list L, and every physical node in the list is marked as "unread";
Step 2.3: the first "unread" physical node N in the sorted list L is selected and marked as "read";
Step 2.4: it is determined whether the weighted load value of the selected physical node is less than 80%; if so, the processing unit to be scheduled is placed on that physical node, and the procedure ends;
Step 2.5: it is determined whether any "unread" physical nodes remain in the sorted list L; if so, return to step 2.3; if not, the first physical node in the sorted list L is selected.
In this embodiment, the "non-local communication count" mentioned in step 2.1 refers to the number of processing units that need to communicate over the network with the processing unit to be scheduled and are not on the same physical node. The non-local communication count of a physical node is computed as follows: initialize the count to 0 and assume that the processing unit to be scheduled has already been deployed to that physical node; then traverse all processing units that have already been scheduled, and if a processing unit needs to communicate with the processing unit to be scheduled and the two are not on the same physical node, increment the count by one. The final value obtained is the non-local communication count. Taking the processing units shown in Fig. 3 as an example, assume that the system has already scheduled processing units A, B, and C onto three different physical nodes a, b, and c, and now needs to schedule processing unit D. Since D needs to communicate with A, B, and C, the non-local communication count of each of nodes a, b, and c with respect to D is 2. If the system also contained another physical node d, the non-local communication count of d would be 3.
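The worked example above can be checked mechanically. The sketch below inlines the count for the placement of D, with A, B, and C already on nodes a, b, and c (the names are taken from the example):

```python
scheduled = {'A': 'a', 'B': 'b', 'C': 'c'}   # already-scheduled unit -> node
peers_of_d = {'A', 'B', 'C'}                 # units D must communicate with

def nonlocal_count(candidate):
    # Assume D is placed on `candidate`; count peers living on other nodes.
    return sum(1 for unit, node in scheduled.items()
               if unit in peers_of_d and node != candidate)

print({n: nonlocal_count(n) for n in ['a', 'b', 'c', 'd']})
# each of a, b, c scores 2; node d, hosting none of A/B/C, scores 3
```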
In this embodiment, step 2.1 computes a physical node's dominant resource ratio as follows: for each resource, compute the demand-to-availability ratio, i.e. the processing unit's demand for that resource divided by the node's available amount of it; the largest of these ratios is the node's dominant resource ratio. For example, suppose the system tracks three resources: CPU, memory, and network bandwidth. A processing unit demands <1 CPU, 2 GB memory, 1 Mbit/s>, while the physical node currently has 3 CPUs, 10 GB memory, and 4 Mbit/s available; the demand-to-availability ratios for CPU, memory, and network bandwidth are then 1/3, 1/5, and 1/4, respectively. Since 1/3 is the largest, the node's dominant resource ratio is 1/3 and its dominant resource is CPU.
Note that if the available amount of any resource on a physical node is smaller than the processing unit's demand for it, the node is considered to lack sufficient resources: its dominant resource ratio is marked as infinity and it does not participate in the ranking. For example, if a processing unit demands <1 CPU, 2 GB memory, 1 Mbit/s> while a node has 3 CPUs, 1 GB memory, and 4 Mbit/s available, the node does not have enough memory, so its dominant resource ratio is marked as infinity. If no physical node has sufficient resources, this scheduling attempt fails; the processing unit is put back into the scheduling queue and rescheduled after a period of time.
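The two examples above can be reproduced with a short sketch (the dictionary-based resource vectors are an assumption of this example):

```python
import math

def dominant_resource_ratio(demand, available):
    """Dominant resource ratio of a physical node for one processing
    unit: the largest per-resource demand/availability ratio. A node
    that cannot satisfy some resource demand gets infinity, which
    excludes it from the ranking."""
    if any(demand[r] > available.get(r, 0) for r in demand):
        return math.inf  # insufficient resources on this node
    return max(demand[r] / available[r] for r in demand)

demand = {"cpu": 1, "mem_gb": 2, "net_mbps": 1}
print(dominant_resource_ratio(demand, {"cpu": 3, "mem_gb": 10, "net_mbps": 4}))
# -> 0.333...: the ratios are 1/3, 1/5, 1/4, so CPU is the dominant resource
print(dominant_resource_ratio(demand, {"cpu": 3, "mem_gb": 1, "net_mbps": 4}))
# -> inf: only 1 GB memory is available against a 2 GB demand
```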
In this embodiment, step 2.2 sorts the physical nodes in ascending order by non-local communication count and dominant resource ratio. Two physical nodes are first compared by non-local communication count: if the counts differ, the node with the smaller count is ranked first; if they are equal, their dominant resource ratios are compared, and the node with the smaller ratio is ranked first.
In this embodiment, the weighted load value in step 2.4 is a weighted load computed from the CPU, memory, and network bandwidth resources: weighted load = CPU utilization × 0.3 + memory utilization × 0.3 + network bandwidth utilization × 0.4.
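The node-selection procedure of steps 2.1 through 2.5 can be sketched as a single function. This is a hedged reading of the text, not a definitive implementation; the dictionary shapes for `unit` and `nodes` are assumptions of this example:

```python
import math

def select_node(unit, nodes, scheduled, comm_graph):
    """Rank feasible nodes by (non-local communication count, dominant
    resource ratio), both ascending; take the first node whose weighted
    load is below 80%, falling back to the best-ranked node if all are
    loaded, or None if no node has enough resources."""
    def comm_count(node):
        # non-local communications if `unit` were placed on `node`
        partners = comm_graph.get(unit["id"], set())
        return sum(1 for u in scheduled
                   if u["id"] in partners and u["node"] != node["name"])

    def drr(node):
        # dominant resource ratio; infinity if any resource is short
        if any(unit["demand"][r] > node["avail"][r] for r in unit["demand"]):
            return math.inf
        return max(unit["demand"][r] / node["avail"][r] for r in unit["demand"])

    def load(node):
        c, m, n = node["util"]  # CPU, memory, bandwidth utilization
        return 0.3 * c + 0.3 * m + 0.4 * n

    # step 2.2: ascending sort over nodes with sufficient resources
    ranked = sorted((n for n in nodes if drr(n) < math.inf),
                    key=lambda n: (comm_count(n), drr(n)))
    if not ranked:
        return None  # scheduling fails; the unit is re-queued
    # steps 2.3-2.4: first node whose weighted load is below 80%
    for node in ranked:
        if load(node) < 0.8:
            return node
    # step 2.5: every node is loaded; fall back to the best-ranked one
    return ranked[0]
```

The 80% threshold and the 0.3/0.3/0.4 weights come directly from the text; everything else about the data layout is illustrative.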
Figure 6 is a flow chart of deploying a processing unit according to an embodiment of the present invention. Step 3 deploys the processing unit after a physical node has been selected, and specifically includes:
Step 3.1: subtract the processing unit's resource quota from the global resource information;
Step 3.2: send the processing unit's information to the selected physical node;
Step 3.3: create a Linux container (Linux Container) on that physical node and set the container's resource quota according to the processing unit's resource requirements;
Step 3.4: start the processing unit inside the Linux container created in step 3.3.
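The deployment procedure above can be sketched as an orchestration function. The `create_container` and `start_in_container` callbacks are hypothetical stand-ins for the executor's container operations, not APIs defined by the invention:

```python
def deploy(unit, node, global_resources, create_container, start_in_container):
    """Sketch of steps 3.1-3.4 under assumed data shapes."""
    # 3.1: subtract the unit's quota from the global resource table
    avail = global_resources[node]
    for resource, need in unit["demand"].items():
        avail[resource] -= need
    # 3.2: sending the unit's information to the node is elided here
    # 3.3: create a container sized to the unit's resource demand
    container = create_container(node, unit["demand"])
    # 3.4: start the unit inside the container
    start_in_container(container, unit)
    return container

# Example with stub callbacks:
resources = {"node-a": {"cpu": 8, "mem_gb": 15}}
unit = {"demand": {"cpu": 2, "mem_gb": 4}}
deploy(unit, "node-a", resources,
       create_container=lambda node, quota: {"node": node, "quota": quota},
       start_in_container=lambda container, u: None)
print(resources["node-a"])  # {'cpu': 6, 'mem_gb': 11}
```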
In this embodiment, the global resource information referred to in step 3.1 is the total amount and available amount of each resource on every physical node, maintained by the collection module of the scheduling manager. A physical node's total resources may be expressed as <32 CPUs, 64 GB memory, 100 Mb/s>, with its current availability expressed as <8 CPUs, 15 GB memory, 30 Mb/s>. Suppose the selected physical node a has <8 CPUs, 15 GB memory, 30 Mb/s> available and the processing unit to be deployed demands <2 CPUs, 4 GB memory, 10 Mb/s>; the unit's demand is then subtracted from a's availability, and after the update node a has <6 CPUs, 11 GB memory, 20 Mb/s> available.
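The bookkeeping in this example is simple per-resource subtraction; a minimal sketch, with the resource vectors represented as dictionaries (an assumption of this example):

```python
def reserve(available, demand):
    """Step 3.1: subtract the unit's resource quota from the node's
    available resources in the global resource table."""
    return {r: available[r] - demand[r] for r in available}

node_a = {"cpu": 8, "mem_gb": 15, "net_mbps": 30}
unit_demand = {"cpu": 2, "mem_gb": 4, "net_mbps": 10}
print(reserve(node_a, unit_demand))
# {'cpu': 6, 'mem_gb': 11, 'net_mbps': 20}, matching the example in the text
```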
The system uses Linux containers (Linux Container) to provide a resource-isolated environment for processing units. A Linux container is a lightweight virtualization technology whose performance overhead is very small, usually negligible, making it well suited to the performance-sensitive nature of streaming data processing. In addition, Linux containers can adjust their resource capacity dynamically, which makes the system's automatic resource-quota scaling method feasible. The processing unit is started inside the Linux container; by the nature of Linux containers, the unit can only use the resource quota allocated to it and will not seize resources from other processing units when resources are tight.
A stream-data-oriented job scheduling device comprises a scheduling manager 1, executors 2, and a scheduling queue 3.
The scheduling manager 1 is deployed on a high-specification physical node. It fetches jobs to be scheduled from the scheduling queue 3 in real time, uses a directed acyclic graph built from each job's information to generate a processing unit queue containing multiple processing units, selects a physical node for each processing unit in the queue according to the nodes' non-local communication counts and dominant resource ratios, and assigns all processing units to their corresponding physical nodes; each physical node is provided with at least one processing unit.
The executor 2 starts processing units: it places each processing unit that the scheduling manager 1 dispatches to its physical node inside a Linux container, and starts the processing unit inside that container.
The scheduling queue 3 is deployed on the same physical node as the scheduling manager and stores the jobs to be scheduled.
The scheduling manager 1 comprises a collection module and a scheduling module.
The collection module collects, for the physical node hosting each executor, the node's IP address, communication port, and total and available amount of each resource, as well as the executor's resource usage and the processing units' resource usage.
The scheduling module fetches jobs to be scheduled from the scheduling queue, generates a processing unit queue containing multiple processing units from each job's information, selects a physical node for each processing unit in the queue, and dispatches all processing units to their corresponding physical nodes.
Each processing unit is provided with a unique processing unit identifier.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310451552.XA | 2013-09-27 | 2013-09-27 | Job scheduling method and device for streaming data |
| Publication Number | Publication Date |
|---|---|
| CN103491024A | 2014-01-01 |
| CN103491024B | 2017-01-11 |
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |