
Multi-job distributed training system and method

Info

Publication number
CN119201416A
Authority
CN
China
Prior art keywords
training
data
network
gradient
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410033375.1A
Other languages
Chinese (zh)
Inventor
赵伯罕
徐葳
李强
龙利民
胡勇超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tuling Artificial Intelligence Institute Nanjing Co ltd
Tsinghua University
Original Assignee
Tuling Artificial Intelligence Institute Nanjing Co ltd
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tuling Artificial Intelligence Institute Nanjing Co ltd and Tsinghua University
Priority to CN202410033375.1A
Publication of CN119201416A
Legal status: Pending

Abstract


The present application discloses a multi-job distributed training system and method, the multi-job distributed training system comprising: a storage service module, used to manage stored training data and distribute training data to each training task according to the ID of each training task and its iteration number; a naming control module, used to assign an ID to each training task to locate the training data of each training task, wherein each training task is completed by a computing node; a host agent module, used to obtain the distributed training data corresponding to a training task from the storage service module to enable the computing node to perform gradient calculation, and send the gradient data obtained by the computing node to the network; and receive aggregated data from the network and send it to each computing node; a switch module, used to perform aggregate calculation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and send the aggregated data to each computing node in the network through the network.

Description

Multi-job distributed training system and method
Technical Field
The present application relates to the field of computer technology, and in particular, to a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product.
Background
Training a deep neural network is a huge task involving large amounts of data and computation. Single-machine training can no longer meet the training requirements, and multi-machine distributed training (Distributed Training, DT) has become a research hotspot.
Currently, the multiple GPUs performing training tasks in a distributed training network each train the deep neural network locally, and each GPU sends its computed gradient to a parameter server for gradient aggregation. The parameter server then sends the aggregated gradient back to each GPU, which performs the next iteration of training based on the aggregated gradient it receives. Because of the large data traffic transmitted over the network during distributed training, communication overhead becomes a major bottleneck; for example, data communication can take more than half of the training time, which limits the efficiency and scalability of distributed training jobs.
Therefore, in the network of distributed training, how to reduce the communication overhead and thus improve the efficiency and the scalability of the distributed training is a technical problem to be solved.
Disclosure of Invention
In view of the above-mentioned drawbacks of the related art, an object of the present application is to provide a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product, so as to solve the technical problem of how to reduce communication overhead in a distributed training network, thereby improving the efficiency and scalability of distributed training.
To achieve the above and other related objects, a first aspect of the present application provides a multi-job distributed training system applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes for participating in the distributed training. The multi-job distributed training system includes: a storage service module for managing stored training data and distributing training data to each training task according to the ID of each training task and the number of iterations thereof; a naming control module for assigning an ID to each training task to locate the training data of each training task, where each training task is completed by one computing node; a host agent module for acquiring the distributed training data corresponding to a training task from the storage service module to cause the computing node to perform gradient computation, sending the gradient data obtained by the computing node to the network, and receiving aggregated data from the network and sending it to each computing node; and a switch module for performing aggregate computation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and sending the aggregated data to each computing node in the network through the network.
A second aspect of the present application provides a multi-job distributed training method applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The multi-job distributed training method comprises: assigning an ID to each training task to locate the training data of each training task, wherein each training task is completed by one computing node, the training data is managed by a storage service module, and the storage service module distributes the training data to each training task according to the ID of each training task and its iteration number; acquiring the training data of a training task to enable the computing node to perform gradient computation to obtain gradient data and send the gradient data to the network; performing aggregation computation on the gradient data corresponding to each training task acquired from the network to obtain aggregated data and broadcasting the aggregated data to the network through the network; and the computing nodes receiving the aggregated data from the network to execute the next iteration of training.
The third aspect of the application provides a storage server applied to a network for carrying out distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training, the storage server comprises a memory for storing training data for carrying out distributed training on the deep neural network, a processor for managing the stored training data and distributing the training data for each training task according to the ID of each training task and the iteration number thereof, wherein the ID of each training task is used for positioning the training data of each training task, each training task is completed by one computing node, the computing nodes carry out gradient computation according to the distributed training data of the corresponding training task to obtain gradient data and send the gradient data to the network and acquire aggregate data from the network, and the aggregate data is obtained through aggregation computation by a programmable switch.
The fourth aspect of the application provides a computer device applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The computer device comprises a memory and a processor, wherein the processor is used for acquiring distributed training data corresponding to a training task to enable the computing node to perform gradient calculation, sending the gradient data obtained by the computing node to the network, and receiving aggregated data from the network and sending the aggregated data to each computing node; each training task is completed by one computing node, the training data corresponding to a training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the ID of each training task is used for locating the training data of each training task, and the aggregated data is obtained by performing aggregation calculation through a programmable switch.
The fifth aspect of the application provides a computer device applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The computer device comprises a memory and a processor, wherein the processor is used for allocating an ID to each training task to locate the training data of each training task; each training task is completed by one computing node, the training data of the corresponding training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the computing nodes perform gradient calculation according to the distributed training data of the corresponding training task to obtain gradient data and send the gradient data to the network, and the aggregated data received by the computing nodes is obtained through aggregation calculation by a programmable switch.
The sixth aspect of the application provides a programmable switch applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The programmable switch comprises a network port and an aggregation calculation module, wherein the aggregation calculation module is used for performing aggregation calculation according to the gradient data corresponding to each training task acquired from the network port to obtain the aggregated data, and sending the aggregated data to each computing node in the network through the network port; each training task is completed by one computing node, the training data corresponding to a training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the ID of each training task is used for locating the training data of each training task, and the gradient data is obtained by each computing node performing gradient calculation according to the distributed training data corresponding to a training task.
A seventh aspect of the present application provides a computer device, comprising a storage device for storing at least one program, and a processing device, connected to the storage device, for implementing the multi-job distributed training method as described in any one of the embodiments disclosed in the second aspect of the present application when the at least one program is called from the storage device and executed.
An eighth aspect of the present application provides a computer readable storage medium storing at least one program which when invoked and executed by a processor of a computer implements a multi-job distributed training method as described in any of the embodiments disclosed in the second aspect of the present application.
A ninth aspect of the application provides a computer program product which, when run on a computer, causes the computer to perform a multi-job distributed training method as described in any of the embodiments disclosed in the second aspect of the application.
In summary, the present application provides a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product,
The switch module is utilized to aggregate gradient data, the aggregated data is sent to each computing node in the network, and the storage service module is utilized to manage training data, so that the communication overhead of the parameter server can be reduced, the utilization rate of the network can be improved, and the efficiency and the expandability of distributed training can be further improved.
Furthermore, the shortest waiting time during aggregation is realized by simultaneously sending the same gradient blocks of different computing nodes, so that the computing efficiency of distributed training can be improved, and the expenditure of an aggregator can be reduced. Moreover, the application uses a computing node as a standby parameter server which is only started when the data packet is lost, thereby reducing the traffic in the network in the normal aggregation process. And the computing task on the computing node can be transferred to other computing nodes when the computing node fails during training, so that the subsequent distributed training is not influenced.
Drawings
The specific features of the application are set forth in the appended claims. The features and advantages of the application will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings. The brief description of the drawings is as follows:
FIG. 1 is a diagram illustrating a topology of a network for distributed training of a deep neural network in one embodiment of the present application.
Fig. 2 is a schematic diagram of data transmission in a network when a computing node fails according to an embodiment of the present application.
Fig. 3a and 3b show schematic diagrams of the gradient blocks obtained by blocking the gradient in different embodiments of the present application, respectively.
FIG. 4 shows a comparison of time for one iterative calculation of different models for different distributed training systems.
FIG. 5 shows a graph comparing the utilization of the GPU and network for different distributed training systems.
Fig. 6 shows a graph comparing training times for different distributed training systems at different iterations in the event of a failure introduced at iteration 70.
FIG. 7 is a flow chart of a multi-job distributed training method according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Further advantages and effects of the present application will become apparent to those skilled in the art from the disclosure of the present application, which is described by the following specific examples.
As described in the background, when training a deep neural network, data-parallel distributed training is often run on common GPU servers as an economical, efficient, and scalable way to support deep learning. Because DT jobs usually involve many sub-second DT tasks performing multiple iterations on the GPU, this work becomes complicated: each task loads input data and exchanges computed gradients, and if this is not planned carefully, communication time may block GPU utilization, thus limiting the efficiency and scalability of DT jobs.
While the clustered network architecture proposed by industry is an explorable optimization direction, for example, the NVIDIA A100 architecture suggests that each server use eight 200Gbps network interface cards for GPU communication and add an additional one for training data access, this architecture incurs high network costs. With recent advances in programmable switches such as Tofino, in-network computing (INC) has also been used to perform gradient aggregation. These systems, commonly referred to as in-network aggregation (In-Network Aggregation, INA), significantly reduce latency and network bandwidth usage by eliminating the software overhead of host-based parameter servers.
While existing in-network aggregation (INA) systems achieve good results, they typically use programmable switches as computational accelerators for only a single job. When concurrent training jobs run in a shared cluster, performance may degrade, and loading training data over the same network links may slow down tasks. Conventional INA systems require all training data to be partitioned and replicated before a job begins; such manual and static data management is detrimental to fast failover.
Therefore, in a distributed training network, how to reduce the communication overhead and thus improve the efficiency and scalability of distributed training is a technical problem to be solved. In the prior art, a programmable switch can be used to aggregate gradients in advance on the network path over which the gradients are transmitted, reducing the number of gradients to be aggregated by the parameter server, relieving congestion at the parameter server, and reducing its communication overhead. However, at present the switch is simply used to perform gradient aggregation, while management and coordination of training data are absent in the distributed training network; this lack of management and coordination of training data reduces network utilization and therefore also reduces the efficiency of distributed training.
In view of this, in some embodiments of the present application, a multi-job distributed training system, a multi-job distributed training method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product are disclosed, where the switch module is used to aggregate gradient data and send the aggregated aggregate data to each computing node in a network, and the storage service module is used to manage training data, so that communication overhead of a parameter server can be reduced, and network utilization can be improved, thereby improving efficiency and scalability of distributed training.
In the present application, the multi-job distributed training system is applied to a network for performing distributed training on a deep neural network. The deep neural network includes, but is not limited to, a convolutional neural network for image recognition, a generative adversarial network for generating data such as images, and a long short-term memory network for speech recognition. Training is typically done in a distributed manner in order to shorten the training time of the deep neural network. Distributed training refers to dividing the task of an iterative training process into a plurality of training tasks that are executed simultaneously by a plurality of computing nodes. In some embodiments of the application, a computing node is a computer device comprising one or more GPUs, and the plurality of computing nodes establish communication with servers or switches in the network through the network.
Referring to fig. 1, a topology diagram of a network for distributed training of a deep neural network in an embodiment of the present application is shown, wherein the network includes a multi-job distributed training system 1 and a plurality of computing nodes 2 for participating in the distributed training. The multi-job distributed training system is communicatively connected with the plurality of computing nodes so as to distribute the training tasks and training data of each iteration to each computing node and to acquire the gradient data calculated by each computing node. Each computing node performs gradient calculations using the training task and training data it acquires. In an embodiment, one computing node 2 includes at least one GPU for performing computations; for example, one computing node 2 includes 1, 2, 4, 8, 16, or 32 GPUs.
In one embodiment, referring to fig. 1, as shown, the multi-job distributed system 1 includes a storage service module 10, a naming control module 11, a host agent module 12, and a switch module 13.
The naming control module 11 is configured to assign an ID to each training task to locate the training data of each training task. The storage service module 10 is configured to manage the stored training data and distribute training data to each training task according to the ID of each training task and its iteration number. The host agent module 12 is configured to obtain the distributed training data corresponding to a training task from the storage service module so that the computing node performs gradient computation and generates gradient data, which it sends to the host agent module 12; the host agent module 12 then sends the obtained gradient data to the switch module 13 in the network. The switch module 13 is configured to perform aggregate computation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and to send the aggregated data to each computing node in the network through the network. For example, the switch module 13 sends the aggregated data to the host agent module 12 in the network, and the host agent module 12 communicates with each computing node 2 to send the aggregated data to each computing node so that each computing node can start the next iteration.
The storage service module 10 is a software tool or software module that can process data (e.g., training data) by means of the hardware of a computer device or of the storage server described later and the operating environment provided by an operating system. In one example, the storage service module 10 may run on a processor (CPU) coupled to a memory (e.g., a memory including at least one SSD). Because the training data is read-only, in an example with 40 computing nodes and 8 GPUs in each computing node, the application only needs 12 SSDs (the transmission rate of each SSD is not less than 4GB/s, i.e., a bandwidth of not less than 32 Gbps) and 4 network interface cards of 100Gbps; this configuration not only meets the data transmission requirements but also reduces the configuration cost of the network.
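As a rough arithmetic check of the figures above (an illustrative back-of-the-envelope sketch, not part of the original disclosure), 12 such SSDs provide roughly as much read bandwidth as the four 100Gbps network interface cards can carry:

```python
# Back-of-the-envelope check of the example storage configuration above.
ssd_count, ssd_gbps = 12, 32        # each SSD: >= 4 GB/s, i.e. roughly 32 Gbps
nic_count, nic_gbps = 4, 100        # four 100 Gbps network interface cards

storage_bw = ssd_count * ssd_gbps   # 384 Gbps aggregate SSD read bandwidth
network_bw = nic_count * nic_gbps   # 400 Gbps aggregate NIC bandwidth

# 384 Gbps of read-only training data roughly matches the 400 Gbps the NICs
# can deliver, so adding more SSDs would not speed up data delivery.
print(storage_bw, network_bw)       # -> 384 400
```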
In one embodiment, the storage service module 10 includes one or more physical or logical blocks of computer instructions organized as objects, procedures, or functions. However, the executables of a module need not be physically located together, but may comprise disparate commands stored in different locations which, when joined logically together, achieve the stated purpose of the storage service module 10. In one example, the storage service module 10 includes C++ code.
The storage service module 10 is used for managing stored training data. In one embodiment, all training data for training the deep neural network is stored in the memory, and all stored training data is centrally managed by the unified storage service module 10. For example, the storage service module 10 allocates training data according to the number of iterations set by the user and the number of computing nodes in each iteration; the storage service module 10 stores each allocated portion of data in the memory and also records the storage address of each allocated portion. For example, if 10000 training data items (e.g., 10000 images) are needed for training the deep neural network, the user sets the number of iterations to 10, and the number of computing nodes in each iteration is 10, then the storage service module 10 divides the 10000 images into 100 parts and stores them in the memory respectively. It should be noted that, when the number of GPUs in a computing node is two or more, the training data acquired by the computing node may be further allocated among its GPUs.
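The partitioning described above can be sketched as follows. This is a minimal illustrative sketch in Python; the function and field names are assumptions for illustration and not part of the disclosed embodiments.

```python
# Illustrative sketch: the storage service splits the training set into
# (iteration, task) shards and records where each shard is stored.
def partition_training_data(samples, num_iterations, nodes_per_iteration):
    """Return a mapping (iteration, task_index) -> list of samples."""
    shards = {}
    shard_size = len(samples) // (num_iterations * nodes_per_iteration)
    idx = 0
    for it in range(num_iterations):
        for task in range(nodes_per_iteration):
            shards[(it, task)] = samples[idx:idx + shard_size]
            idx += shard_size
    return shards

# Example from the text: 10000 images, 10 iterations, 10 computing nodes
# per iteration -> 100 shards of 100 images each.
shards = partition_training_data(list(range(10000)), 10, 10)
assert len(shards) == 100 and len(shards[(0, 0)]) == 100
```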
The storage service module 10 is configured to distribute training data for each training task according to the ID of each training task and the iteration number thereof.
In one embodiment, the naming control module 11 first divides the task required for training the deep neural network into a plurality of training tasks according to the number of computing nodes, and each computing node completes one training task; in other words, the computation performed by each computing node is one training task. The naming control module 11 then assigns an ID to each training task to locate the training data of each training task. For this purpose, the naming control module 11 is further configured to determine the correspondence between the ID of each training task, its iteration number, and the training data, that is, to determine the mapping between the iteration number, the ID of each training task, and the training data, for example, by determining the storage address of the training data corresponding to each iteration number and each ID. It should be noted that training data corresponding to training tasks with different IDs in the same iteration are different, and training data corresponding to training tasks with the same ID in different iterations are also different. Then, the naming control module 11 sends the correspondence to the storage service module 10. Finally, the storage service module 10 distributes training data for a training task based on the correspondence and the request data of the training task generated by the host agent module 12. Specifically, the host agent module 12 generates request data according to the ID assigned by the naming control module 11 and sends it through the switch module 13; the request data includes the ID of the training task and its iteration number. The storage service module 10 can then locate which training data the training task with that ID requires in that iteration according to the correspondence and the ID of the training task and its iteration number included in the request data, and sends the training data through the switch module 13 to the host agent module 12 for caching.
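The ID assignment, correspondence mapping, and request/lookup flow above can be sketched as follows. This is an illustrative Python sketch; the class, method, and field names are assumptions for illustration and not the patent's actual interfaces.

```python
# Illustrative sketch of the naming-control / storage-service interaction.
class NamingControl:
    def __init__(self):
        self.next_id = 0
        self.mapping = {}                  # (task_id, iteration) -> storage address

    def assign_id(self):
        task_id, self.next_id = self.next_id, self.next_id + 1
        return task_id

    def register(self, task_id, iteration, address):
        self.mapping[(task_id, iteration)] = address

class StorageService:
    def __init__(self, mapping, storage):
        self.mapping = mapping             # correspondence sent by the naming control module
        self.storage = storage             # storage address -> training data shard

    def distribute(self, request):
        # The request carries the task ID and its iteration number (sent via the switch).
        address = self.mapping[(request["task_id"], request["iteration"])]
        return self.storage[address]

# Minimal usage example.
naming = NamingControl()
tid = naming.assign_id()
naming.register(tid, iteration=0, address="shard-0")
storage = StorageService(naming.mapping, {"shard-0": ["img_0", "img_1"]})
batch = storage.distribute({"task_id": tid, "iteration": 0})
```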
The naming control module 11 is a software tool or software module capable of processing data, and processes the data by means of a hardware device in a computer device and an operating environment provided by an operating system, for example, determining the ID of each training task and the corresponding relationship between the iteration number and resources. In one example, the naming control module 11 may run on a processor (CPU) of a computer device.
In one embodiment, the naming control module 11 includes one or more physical or logical blocks of computer instructions organized as objects, programs, or functions. However, the executables of a module need not be physically located together, but may comprise disparate commands stored in different locations which, when joined logically together, achieve the stated purpose of the naming control module 11. In an example, the naming control module 11 includes Python code.
In one embodiment, the naming control module 11 is configured to assign an ID to each training task to locate training data of each training task, wherein each training task is performed by a computing node. Specifically, the manner in which the naming control module 11 is used to assign an ID to each training task is the same as or similar to that described above, and will not be described herein. It should be noted that, when the number of GPUs in one computing node is two or more, the computing node may reassign the acquired training tasks by the number of GPUs.
The naming control module 11 is further configured to determine a correspondence between the ID of each training task and the iteration number thereof and the training data, where the step of determining the correspondence between the ID of each training task and the iteration number thereof and the training data by the naming control module 11 is the same as or similar to the foregoing, and will not be described herein.
Further, the naming control module sends the correspondence to the storage service module, so that the storage service module can determine the training data corresponding to the request data according to the correspondence and send the training data through the switch module to the host agent module, which can then cache the training data.
The host agent module 12 is a software tool or software module that processes data (e.g., gradient data) by means of hardware devices in the computer device and the operating environment provided by the operating system. In one example, the host agent module 12 may run on a processor (CPU) coupled to a memory.
In one embodiment, the host agent module 12 includes one or more physical or logical blocks of computer instructions organized as objects, programs, or functions. However, the modules' executables need not be physically located together, but may comprise disparate commands stored in disparate locations which, when joined logically together, achieve the stated purpose for the host agent module 12. In one example, the host agent module 12 includes C++ code.
The host agent module 12 is configured to obtain the distributed training data corresponding to a training task from the storage service module 10 so that the computing node 2 performs gradient computation, to send the gradient data obtained by the computing node 2 to the network, and to receive aggregated data from the network and send it to each computing node.
In one embodiment, each computing node 2 communicates directly with the host agent module 12 for data interaction. For example, the computing node 2 receives from the host agent module 12 the aggregated data sent by the switch module 13; the computing node 2 sends gradient data to the host agent module 12 so that it is sent to the switch module for aggregate calculation; and the computing node obtains the training data corresponding to a training task, which is obtained from the storage service module 10 and cached by the host agent module 12.
To buffer the training data and gradient data, the host agent module 12 includes a training data buffer for buffering training data acquired from the storage service module and a gradient data buffer for buffering gradient data acquired from the computing nodes. In one example, the training data buffer and the gradient data buffer are constructed using DPDK to minimize the software stack overhead on the path to the network interface card, thereby reducing fluctuations in software latency. For example, the host agent module continues to buffer training data into the training data buffer when the buffer is idle and during idle periods of bandwidth (e.g., when the host agent module is neither receiving nor transmitting gradient data).
In distributed training, a computing node can only start the next round of iterative training after receiving the aggregated data calculated by the switch module. The completion time of the aggregate computation is determined by the computing node that sends its gradient data last. Thus, the shortest waiting time is achieved if the gradient data of all computing nodes are sent simultaneously. Further, if one computing node sends gradient block 1 first while another computing node sends gradient block 2 first, i.e., they send in different orders, the two gradient blocks will require different aggregators on the switch and consume twice the registers; for example, gradient block 1 occupies aggregator 1 and gradient block 2 occupies aggregator 2. Each aggregator includes a plurality of registers. The gradient blocks will be described in detail later.
For this purpose, firstly, in an iterative process, the computing node blocks a gradient obtained by performing a training task according to a preset block rule to obtain gradient blocks. In other words, all the computing nodes block the gradient according to the preset block rule. The preset partitioning rules are equal-length partitioning rules or unequal-length partitioning rules. The partitioning rule with equal length refers to that the data length of each partitioned gradient block is equal, for example, when partitioning, a preset number of bytes are partitioned into one gradient block, for example, 1MB of bytes are partitioned into one gradient block. The unequal length partitioning rule means that the data length of each partitioned gradient block is not completely equal, for example, different preset numbers of bytes are partitioned into different gradient blocks during partitioning.
Because the neural network model of each computing node is consistent, the array form of the gradient obtained by each computing node executing a training task in one iteration process is consistent (for example, the gradient obtained by each computing node is a one-dimensional array comprising 256 numbers), and the data length of the gradient obtained by each computing node is consistent under the same number representation mode in the CPU. In other words, the data size of each gradient is also uniform. For example, the gradient computed by each compute node includes 100MB bytes. Therefore, the gradient blocks obtained by dividing the gradient into blocks are identical in all the calculation nodes.
Further, each gradient block is also provided with an identification to identify the position of the gradient block in the gradient. Referring to fig. 3a and 3b, there are shown schematic diagrams of the gradient blocks obtained by blocking the gradient according to the present application in different embodiments, wherein the gradient block 1 is shown in the beginning region of the gradient, the gradient block 2 is shown in the middle region of the gradient, and the gradient block 3 is shown in the end region of the gradient.
After computing the gradient blocks, the computing node generates the gradient data to be sent to the host agent module. The gradient data includes the ID of the training task and a gradient block, and further includes the identifier of the gradient block.
It should be noted that, when a computing node includes multiple GPUs, the computing node first locally aggregates the gradients obtained by its GPUs to obtain the gradient corresponding to the training task, and then divides it into gradient blocks. Furthermore, when the computing node divides the gradient, it may divide the fully computed gradient of a training task, or it may divide only the part of the gradient computed so far; after the division, the fully computed gradient blocks are sent to the host agent module. For example, if the gradient of a training task includes 100MB and the computing node has only computed the partial gradient of the first 10MB, it blocks the first 10MB and sends the blocks to the host agent module.
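A minimal sketch of the equal-length blocking and packaging described above follows. The block length and field names are illustrative assumptions, not the patent's actual data layout.

```python
import numpy as np

def block_gradient(gradient, block_len):
    """Split a flattened gradient into equal-length blocks, each tagged with
    an identifier giving its position in the gradient (illustrative only)."""
    blocks = []
    for block_id, start in enumerate(range(0, gradient.size, block_len)):
        blocks.append({"block_id": block_id,
                       "values": gradient[start:start + block_len]})
    return blocks

def make_gradient_data(task_id, blocks):
    # Gradient data sent to the host agent module: task ID + block + identifier.
    return [{"task_id": task_id, "block_id": b["block_id"], "values": b["values"]}
            for b in blocks]

# Because every node runs the same model, block_gradient produces identically
# shaped blocks on every computing node.
grad = np.arange(276, dtype=np.float32)
packets = make_gradient_data(task_id=7, blocks=block_gradient(grad, block_len=92))
```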
In order to determine that the same gradient block (e.g., gradient block 1) of every computing node is ready on the host agent module, in other words, to ensure that the same gradient block of every computing node is already in the gradient data buffer of the host agent module, the host agent module is further configured to generate a bitmap for each computing node according to the received gradient data sent by that computing node and to send the bitmaps to the switch module. The bitmap indicates whether a gradient block exists by a value of 0 or 1 at the corresponding data position; for example, a value of 1 indicates that the gradient block corresponding to that data position has been received, and a value of 0 indicates that it has not been received. In an example, please continue to refer to fig. 3a and 3b: each computing node divides the gradient into gradient block 1, gradient block 2, and gradient block 3, and the bitmap of a computing node generated by the host agent module is (1, 0, 0), which indicates that the host agent module has received gradient block 1 sent by that computing node but has not received gradient block 2 or gradient block 3.
And the switch module is further used for generating a global bitmap according to the received bitmaps of the computing nodes and sending the global bitmap to the host agent module. The global bitmap indicates whether the host agent module receives all the gradient blocks at the data location by using a value 0 or a value 1 at the data location, for example, a value 1 indicates that the host agent module receives all the gradient blocks corresponding to the data location sent by each computing node, and a value 0 indicates that the host agent module does not receive all the gradient blocks corresponding to the data location sent by each computing node.
In an example, referring to fig. 3a and 3b, each computing node divides the gradient into gradient block 1, gradient block 2, and gradient block 3; the host agent module generates the bitmaps of the three computing nodes, (1, 0, 0), (1, 0, 0), and (1, 0, 0), and sends them to the switch module; the switch module generates a global bitmap (1, 0, 0), where the value 1 indicates that the host agent module has received all of the gradient blocks 1 sent by the computing nodes, and the two values of 0 indicate that the host agent module has not received all of the gradient blocks 2 and gradient blocks 3 sent by the computing nodes.
After the switch module sends the generated global bitmap to the host agent module, the host agent module is further configured to simultaneously send the gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap to the switch module.
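The per-node bitmaps and the global bitmap can be sketched as below. This is an illustrative Python sketch under the assumption that the global bitmap is simply the bitwise AND of the per-node bitmaps, which matches the example above.

```python
# Illustrative sketch: per-node bitmaps and the global bitmap.
def node_bitmap(received_block_ids, num_blocks):
    # 1 = the host agent has cached this gradient block for the node.
    return [1 if i in received_block_ids else 0 for i in range(num_blocks)]

def global_bitmap(bitmaps):
    # 1 only if the corresponding block has arrived from every computing node.
    return [int(all(bm[i] for bm in bitmaps)) for i in range(len(bitmaps[0]))]

# Example from the text: three nodes, three blocks, only gradient block 1 is
# complete on all nodes.
bitmaps = [node_bitmap({0}, 3), node_bitmap({0}, 3), node_bitmap({0}, 3)]
assert global_bitmap(bitmaps) == [1, 0, 0]
```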
In an embodiment, the host agent module sends a gradient data as a data packet to be aggregated to the switch module, so that the switch module allocates an aggregator for the data packet to be aggregated corresponding to the same gradient data to perform aggregation calculation, for example, when the data length of the gradient data is smaller than the number of registers in the aggregator, the host agent module may send the gradient data as a data packet to be aggregated to the switch module. For example, as shown in fig. 3a and fig. 3b, the host agent module simultaneously sends 3 gradient data corresponding to the gradient blocks 1 of the cached 3 computing nodes as data packets to be aggregated to the switch module, so that the switch module distributes the same aggregator to aggregate and calculate the 3 data packets to be aggregated.
In another embodiment, when the gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap is to be sent to the switch module simultaneously, the host agent module divides the gradient block in each gradient data according to the number of registers in the aggregator; specifically, it divides the gradient block using the number of registers in the aggregator as the division length, so that the data length of each divided segment is smaller than or equal to the number of registers in one aggregator. Further, after the division, the host agent module generates a data packet to be aggregated according to the divided data, the position of the divided data, and the ID.
For example, the gradient block 1 in fig. 3a is used for generating a data packet 1 to be aggregated, a data packet 2 to be aggregated and a data packet 3 to be aggregated, wherein the position of data in the data packet 1 to be aggregated is located from the beginning position to the 92 th position of the gradient block 1, the position of data in the data packet 2 to be aggregated is located from the 93 rd position to the 184 th position of the gradient block 1, and the position of data in the data packet 3 to be aggregated is located from the 185 th position to the 276 th position of the gradient block 1. The host agent module simultaneously sends a plurality of data packets to be aggregated, which correspond to each gradient data, to the switch module so that the switch module distributes the same aggregator for the same data packets to be aggregated, which correspond to different gradient data, to perform aggregation calculation.
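The splitting of one gradient block into register-sized packets can be sketched as follows. This is an illustrative Python sketch; the packet fields are assumptions for illustration, not the patent's packet format.

```python
# Illustrative sketch: splitting one gradient block into packets that fit
# the aggregator (segment length <= number of registers per aggregator).
def split_block_into_packets(task_id, block_id, values, registers_per_aggregator):
    packets = []
    for seq, start in enumerate(range(0, len(values), registers_per_aggregator)):
        packets.append({
            "task_id": task_id,
            "block_id": block_id,
            "seq": seq,                     # position of the segment within the block
            "values": values[start:start + registers_per_aggregator],
        })
    return packets

# With 276 values and 92 registers this yields three packets covering
# positions 0-91, 92-183 and 184-275 of the first gradient block, matching
# the data packets 1, 2 and 3 in the example above.
pkts = split_block_into_packets(task_id=7, block_id=0,
                                values=list(range(276)),
                                registers_per_aggregator=92)
assert len(pkts) == 3
```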
It should be noted that, a transmission rule may be preset in the host agent module, where the transmission rule specifies a transmission sequence of the gradient blocks, and when there are multiple 1 values in the global bitmap, the transmission sequence of the gradient blocks may be coordinated according to the transmission rule. For example, the transmission rule is to transmit a gradient block near the start position of the gradient first. For another example, the transmission rule is to transmit data near the start position of the gradient block.
Thus, the shortest waiting time during aggregation is achieved by simultaneously sending the same gradient blocks of the computing nodes, which improves the computing efficiency of distributed training and reduces the overhead of the aggregators.
In order to further improve synchronicity, the host agent module is further configured to simultaneously start all the computing nodes in the network so that each computing node begins the next iteration when the last gradient data of the current iteration is received. Specifically, when all gradient blocks corresponding to each training task in one iteration of training have been received, the host agent module simultaneously starts the computing nodes to begin the next iteration. In effect, the last gradient data acts as a barrier signal: when it arrives, the host agent module immediately releases the waiting computing nodes to begin the next round of iteration.
Computing nodes may fail during training. If a fault cannot be monitored and handled in a timely and effective manner, the training efficiency of the deep neural network is greatly reduced. However, a computing node missing a single acknowledgement (ACK) during gradient data transmission does not mean that the computing node has failed; it may well be caused by factors such as network instability.
To this end, a computing node retransmits gradient data to the host agent module when it misses a single Acknowledgement (ACK). When a computing node retransmits gradient data to the host agent module multiple times within a preset time, the host agent module generates a fault report for the computing node and sends the fault report to the naming control module. Wherein the preset time is exemplified by 3000 milliseconds. Further, referring to fig. 2, a schematic diagram of data transmission in a network when a computing node fails according to an embodiment of the present application is shown. After receiving the fault report sent by the host agent module 12, the naming control module 11 determines the number of computing nodes capable of working normally in the network according to the obtained fault report and sends the number of computing nodes capable of working normally to the storage service module 10, so that the storage service module 10 redistributes training data according to the number of computing nodes capable of working normally.
In an embodiment, the storage service module 10 redistributes all training data corresponding to the iteration number when the computing node fails according to the number of computing nodes that can work normally. For example, in an iterative process, the storage service module allocates 100 pictures to 5 computing nodes, but after one computing node fails, the storage service module allocates 100 pictures to the remaining 4 computing nodes for computation.
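The failure-report and redistribution path described above can be sketched as follows. This is an illustrative Python sketch; the retry threshold, class names, and callbacks are assumptions for illustration (the text only gives 3000 milliseconds as an example window and says "multiple times").

```python
import time

# Illustrative sketch: repeated retransmissions from the same node within a
# window trigger a fault report, after which the training data for the
# current iteration is redistributed over the remaining healthy nodes.
RETRY_WINDOW_S = 3.0     # the text gives 3000 milliseconds as an example
RETRY_THRESHOLD = 3      # assumed value; the patent only says "multiple times"

class FailureMonitor:
    def __init__(self, report_failure):
        self.report_failure = report_failure   # callback into the naming control module
        self.retransmits = {}                  # node_id -> recent retransmission times

    def on_retransmit(self, node_id):
        now = time.monotonic()
        recent = [t for t in self.retransmits.get(node_id, [])
                  if now - t < RETRY_WINDOW_S]
        recent.append(now)
        self.retransmits[node_id] = recent
        if len(recent) >= RETRY_THRESHOLD:
            self.report_failure(node_id)

def handle_fault_report(node_id, healthy_nodes, storage_service):
    # Naming control removes the failed node and tells the storage service to
    # redistribute the iteration's data, e.g. 100 images over 4 instead of 5 nodes.
    healthy_nodes.discard(node_id)
    storage_service.redistribute(num_nodes=len(healthy_nodes))
```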
Therefore, the subsequent iterative computation is not influenced when the computing node fails, so that the success rate of distributed training is improved.
The switch module 13 is a software tool or software module that can process data, and is used to describe the corresponding processing and operation of the data plane of the programmable switch on the data packet, for example, the switch module describes parsing, processing, modifying, forwarding logic, etc. of the data packet. In a practical embodiment, the programmable switch is exemplified by a 64×100Gbps programmable switch.
In an embodiment, the switch module 13 includes one or more physical or logical blocks of computer instructions organized as objects, procedures, or functions. However, the executable files of a module need not be physically located together, but may include different commands stored in different locations that, when logically connected together, achieve the specified goals of the switch module 13. In one example, the switch module 13 includes P4 code. The P4 language is a programming language for describing the behavior of the data plane.
The switch module is used for carrying out aggregation calculation according to gradient data corresponding to each training task acquired from the network to obtain the aggregation data, and sending the aggregation data to each calculation node in the network through the network.
In an embodiment, when the host agent module sends the gradient data, it sends each piece of gradient data as a data packet to be aggregated, and the switch module assigns the same aggregator to the same gradient data of different computing nodes to perform the aggregation computation. The aggregator used for aggregation in the switch is composed of a plurality of registers, and the number of registers in the aggregator is the same as the size of one data packet; for example, if a packet processed by the switch includes 92 bits, the aggregator includes 92 registers. One aggregator is used to aggregate the gradient data corresponding to the same gradient block of different computing nodes; for example, one aggregator aggregates the gradient data corresponding to gradient block 1 of the different computing nodes. When the switch module has completely aggregated the gradient data sent by all computing nodes in one iteration, it generates the aggregated data and sends it to all computing nodes in the network through the network. For example, the switch module sends the aggregated data to each computing node through a host agent module so that each computing node performs the next iteration according to the aggregated data.
In another embodiment, when gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap is simultaneously sent to the switch module, the host agent module simultaneously sends a plurality of data packets to be aggregated corresponding to each gradient data to the switch module, and the switch module provides an aggregator for the plurality of data packets to be aggregated corresponding to each gradient data to perform aggregation calculation. Specifically, the switch module distributes the same aggregator for the same data packet to be aggregated corresponding to different gradient data to perform aggregation calculation. The same data packet to be aggregated refers to that data in the data packet to be aggregated is located at the same position of the gradient block. For example, the data packets 1 to be aggregated are all located from the start position to the 92 th position of different gradient blocks, and the data packets 1 to be aggregated of different gradient data are distributed to the same aggregator for calculation.
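The aggregator-assignment logic above, which the switch data plane implements in P4, can be sketched in host-language form as follows. The class and packet field names are illustrative assumptions and do not reflect the patent's actual packet format.

```python
# Host-language sketch of the in-switch aggregation logic (illustrative only).
class Aggregator:
    def __init__(self, num_registers):
        self.registers = [0.0] * num_registers   # one register per value in a packet
        self.contributors = set()                # training-task IDs already added

    def add(self, task_id, values):
        for i, v in enumerate(values):
            self.registers[i] += v
        self.contributors.add(task_id)

class SwitchAggregation:
    def __init__(self, num_registers, num_tasks):
        self.num_registers = num_registers
        self.num_tasks = num_tasks               # tasks (nodes) expected per iteration
        self.aggregators = {}                    # (block_id, seq) -> Aggregator

    def on_packet(self, pkt):
        # The same segment (block_id, seq) coming from different computing
        # nodes is always mapped onto the same aggregator.
        key = (pkt["block_id"], pkt["seq"])
        agg = self.aggregators.setdefault(key, Aggregator(self.num_registers))
        agg.add(pkt["task_id"], pkt["values"])
        if len(agg.contributors) == self.num_tasks:
            # All computing nodes have contributed: emit the aggregated data.
            return {"block_id": pkt["block_id"], "seq": pkt["seq"],
                    "values": list(agg.registers)}
        return None
```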
Currently, the size of a packet that a switch can aggregate at a time is related to the number of registers. The switch module of the present application therefore allows a data packet to be aggregated to access the registers of the switch during the egress processing stage of the pipeline as well, which increases the number of registers available for aggregation within a single pipeline pass. This further increases the payload size of the packets processed by the switch pipeline, resulting in higher goodput for the same number of packets.
Further, in order to ensure that a data packet to be aggregated received by the switch module is a packet that is allowed to access the registers of the switch, the switch module is further configured to perform an initial check on the acquired data packet to be aggregated to determine whether it is allowed to access an aggregator of the switch for aggregation calculation. Specifically, when the switch module receives a data packet to be aggregated, it performs the initial check on the packet; a packet that passes the initial check is allowed to access the registers of the switch in the pipeline for aggregation calculation, while a packet that does not pass the initial check is forwarded normally by the switch module, that is, forwarded according to its destination address.
In an embodiment, the switch module determines, during initial inspection, whether the ID corresponding to the data packet to be aggregated is a data packet registered by the naming control module, whether the data packet overflows the aggregation, and whether the data packet uses the parameter server as a destination address. If the ID corresponding to the data packet to be aggregated is registered by the naming control module, the data packet to be aggregated does not overflow the aggregation, and the data packet to be aggregated does not take the parameter server as the destination address, the data packet to be aggregated can pass the initial inspection, otherwise, the data packet to be aggregated does not pass the initial inspection.
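The three conditions of the initial check can be sketched as follows. This is an illustrative Python sketch; the function signatures, packet fields, and the `would_overflow` helper are assumptions for illustration, not the P4 implementation.

```python
# Illustrative sketch of the initial check: a data packet may only use an
# aggregator if its task ID is registered with the naming control module,
# the aggregation would not overflow, and the packet's destination is not
# the (backup) parameter server.
def initial_check(pkt, registered_ids, parameter_server_addr, would_overflow):
    if pkt["task_id"] not in registered_ids:
        return False          # ID not registered by the naming control module
    if would_overflow(pkt):
        return False          # aggregation would overflow the aggregator
    if pkt["dst_addr"] == parameter_server_addr:
        return False          # retransmission destined for the backup parameter server
    return True

def handle_packet(pkt, switch, registered_ids, parameter_server_addr, would_overflow):
    if initial_check(pkt, registered_ids, parameter_server_addr, would_overflow):
        return switch.aggregate(pkt)   # allowed to access the aggregator registers
    return switch.forward(pkt)         # otherwise forwarded by its destination address
```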
The switch module is further configured to determine whether the gradient data of all training tasks have been aggregated: if the gradient data of all training tasks in one iteration have not yet all been aggregated, the aggregated data obtained so far is discarded; if it is determined that the gradient data of all training tasks in one iteration have been aggregated, the aggregated data is recirculated after aggregation so that it can be read out. For example, the switch module determines whether the gradient data of all training tasks have been aggregated by determining whether the data packet to be aggregated is the last data packet in the iteration: if the packet is not the last data packet to be aggregated, the aggregation is not complete; otherwise, it is complete. Further, when recirculating the aggregated data, the switch module does not need to perform the initial check on it; the aggregated data only needs to pass through the pipeline once more. It should be noted that the aggregated data is recirculated once only in order to read the full width of the aggregated data; in other words, 2n bits of aggregated data may be written in a single memory access, but only n bits may be read, so the data must be recirculated once more to read all 2n bits of aggregated data.
In order to improve the computational efficiency of the switch module and reduce traffic, the switch module does not back up the aggregated data. When a computing node does not receive the aggregated data, it retransmits its gradient data, and the host agent module then retransmits the corresponding data packet to be aggregated; the switch module forwards the retransmitted data packet to the parameter server module, which sends the aggregated data back to the switch module so that the switch module can deliver it to the computing node that did not receive it. It should be noted that the destination address of the retransmitted data packet to be aggregated is the parameter server module. The parameter server module backs up the aggregated data of each iteration and is configured on any one computing node in the network. The parameter server module is a software tool or software module that can process data and describes the corresponding processing and operations performed by that computing node on the data.
The application does not have a parameter server alone, but uses a computing node as a standby parameter server, and the standby parameter server is started only when data is lost, thereby reducing the traffic in the network in the normal aggregation process.
Referring to fig. 4, a comparison diagram of the time of performing one iteration calculation on different models by different distributed training systems is shown, and the calculation time of the distributed training system of the present application in one iteration is shorter than the calculation time of the ATP system, switchML-1 system, and SwitchML-4 system in one iteration on all the deep neural network models shown in fig. 4.
Higher GPU and network utilization leads to better performance under the same workload. Referring to FIG. 5, which compares the GPU and network utilization of different distributed training systems, we ran 400 different training task combinations in three distributed training systems (the ATP system, the SwitchML-1 system, and the multi-job distributed training system of the present application, shown by the red lines) to compare GPU and network utilization. Compared with the GPU utilization of the ATP system and the SwitchML-1 system, the GPU utilization of the present application is improved by 100 percent and 50 percent, respectively. For the network, the present application achieves 3.2 times and 2.0 times the utilization of the ATP system and the SwitchML-1 system, respectively.
Referring to fig. 6, which compares the training time of different distributed training systems at different iterations when a fault is introduced at the 70th iteration, we ran 120 VGG16 iterations using three computing nodes in three distributed training systems (the ATP system, the SwitchML-1 system, and the multi-job distributed training system of the present application, shown by the red lines). At the 70th iteration, we introduced a fault by shutting down one computing node. The computing node was then restarted at the 100th iteration. As shown, after the fault is injected, all computing nodes in the ATP system and the SwitchML-1 system stop working and fail to recover. In contrast, the multi-job distributed training system of the present application automatically retries the current iteration after a recovery time of a few seconds and then continues the next iteration using the remaining two computing nodes. We observe that each iteration from the 71st to the 100th takes longer, since there are only two computing nodes instead of three. We can also see that the training time recovers immediately after the computing node is restarted at the 100th iteration.
In summary, in the multi-job distributed training system disclosed by the application, the switch module is utilized to aggregate gradient data and send the aggregated data to each computing node in the network, and the storage service module is utilized to manage training data, so that the communication overhead of the parameter server can be reduced, the utilization rate of the network can be improved, and the efficiency and the expandability of distributed training can be further improved.
In addition, the application achieves the shortest waiting time during aggregation by having different computing nodes send the same gradient blocks at the same time, which improves the computing efficiency of distributed training and reduces the overhead of the aggregator. When the host agent module receives the last gradient data of an iteration, it starts all the computing nodes in the network at the same time, so that each computing node begins the next iteration together, further improving synchronization. Moreover, the application uses one computing node as a standby parameter server that is activated only when data is lost, thereby reducing the traffic in the network during the normal aggregation process. Furthermore, when a computing node fails during training, its computing task can be transferred to other computing nodes, so subsequent distributed training is not affected.
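For illustration only, the synchronized start can be pictured with the minimal sketch below, assuming a simple UDP control channel; the message format, addresses, and function names are hypothetical and not defined by the application.

```python
import socket

START_MSG = b"START_NEXT_ITER"                                  # hypothetical control message
NODE_ADDRS = [("10.0.1.%d" % i, 9100) for i in range(1, 4)]     # assumed compute-node addresses

def on_gradient_block(received_blocks, total_blocks, sock):
    """Called by the host agent whenever a gradient block of the current
    iteration arrives; once the last block is in, every node is started at
    the same time so the next iteration begins in lockstep."""
    if received_blocks == total_blocks:
        for addr in NODE_ADDRS:
            sock.sendto(START_MSG, addr)
```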
In an embodiment, please refer to fig. 7, which is a flowchart of a multi-job distributed training method according to an embodiment of the present application; the method includes steps S10, S11, S12, and S13. The multi-job distributed training method may be performed by the devices in the network that perform distributed training on the deep neural network as described above (e.g., a storage server configured with the storage service module, a computer device configured with the naming control module, a computer device configured with the host agent module, and a programmable switch configured with the switch module).
In step S10, the computer device configured with the naming control module assigns an ID to each training task to locate the training data of that training task. The training data is managed by the storage service module, which distributes training data to each training task according to the task's ID and its number of iterations.
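Purely for illustration, the sketch below shows one way a storage service could use a task's ID and iteration count to pick the batch it serves; the in-memory layout, class name, and sharding rule are assumptions of this sketch rather than details of the storage service module.

```python
class StorageService:
    """Toy in-memory stand-in for the storage service module."""

    def __init__(self, datasets):
        # datasets: mapping of task ID -> list of training samples
        self.datasets = datasets

    def distribute(self, task_id, iteration, batch_size):
        """Return the batch for (task_id, iteration).

        The task ID locates the dataset and the iteration count selects the
        slice of it that is served, so each training task always receives the
        data meant for its current iteration.
        """
        data = self.datasets[task_id]
        start = (iteration * batch_size) % len(data)
        # Wrap around when the slice crosses the end of the dataset.
        return (data + data)[start:start + batch_size]


# Usage: the host agent of task "job-7" (hypothetical ID) requests its batch
# for the 12th iteration.
svc = StorageService({"job-7": list(range(1000))})
batch = svc.distribute("job-7", iteration=12, batch_size=32)
```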
In step S11, the computer device configured with the host agent module acquires the training data of a training task so that the computing node can perform gradient computation to obtain gradient data, and sends the gradient data to the network.
In step S12, the programmable switch configured with the switch module performs aggregation on the gradient data corresponding to each training task acquired from the network to obtain the aggregated data, and broadcasts the aggregated data to the network.
In step S13, the computing node receives aggregated data from the network to perform training for the next iteration.
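Putting steps S10-S13 together, one iteration on a single compute node could be sketched as below; every function called here (fetch_training_data, compute_gradients, send_to_switch, recv_aggregated, apply_update) is a placeholder standing in for the corresponding module, not an API defined by the present application.

```python
def run_iteration(task_id, iteration, model):
    # S10: the ID assigned by the naming control module locates the data.
    # S11: the host agent fetches that data and the node computes gradients.
    batch = fetch_training_data(task_id, iteration)   # served by the storage service
    grads = compute_gradients(model, batch)            # local gradient computation
    send_to_switch(task_id, iteration, grads)          # host agent sends to the network
    # S12 happens inside the network: the switch aggregates the gradient data
    # of every node of this task and broadcasts the result.
    agg = recv_aggregated(task_id, iteration)          # S13: receive the aggregated data
    apply_update(model, agg)                           # ready for the next iteration
    return model
```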
The working mode of each step in the multi-job distributed training method of the present application is the same as or similar to the working mode of the corresponding module in the multi-job distributed training system, and is not described again here.
In some embodiments, the present application further provides a storage server applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The storage server includes a memory and a processor. Further, the storage server also includes a network port for connecting to the network.
In some embodiments, the memory may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing training data or for storing the software program corresponding to the storage service module, which is executed by the processor. In an example, the memory is an SSD.
In some embodiments, the processor includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
The memory is used for storing training data for distributed training of the deep neural network.
The processor is used for managing the stored training data and for distributing training data to each training task according to the task's ID and its number of iterations. The ID of each training task is used for locating the training data of that task, and each training task is completed by one computing node. The computing node performs gradient calculation on the distributed training data of its training task to obtain gradient data, sends the gradient data to the network, and obtains the aggregated data from the network, where the aggregated data is obtained by the switch through aggregation calculation.
The manner in which the processor manages the training data and distributes the training data for each training task is the same as or similar to the working manner of the storage service module in the multi-job distributed training system, and will not be described in detail herein.
In some embodiments, the application also provides a computer device configured to execute the corresponding functions of the host agent module in the multi-job distributed training system. The computer device is applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The computer device includes a memory and a processor. Further, the computer device also includes a network port for connecting to the network.
In some embodiments, the memory of the computer device may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory of the computer device is used for storing the software program corresponding to the host agent module, which is executed by the processor of the computer device.
In some embodiments, the processor of the computer device includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor.
The processor of the computer device is used for acquiring the distributed training data corresponding to a training task so that the computing node can perform gradient calculation, for sending the gradient data obtained by the computing node to the network, and for receiving the aggregated data from the network and sending it to each computing node. Each training task is completed by one computing node; the training data corresponding to each training task is distributed by the storage server according to the task's ID and its number of iterations; the ID of each training task is used for locating the training data of that task; and the aggregated data is obtained by the switch through aggregation calculation.
In this embodiment, the working manner of the processor of the computer device is the same as or similar to that of the host agent module in the multi-job distributed training system, which is not described herein.
In another embodiment, the present application further provides a computer device configured to perform the corresponding function of the naming control module in the multi-job distributed training system. The computer device is applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The computer device includes a memory and a processor. Further, the computer device also includes a network port for connecting to the network.
In some embodiments, the memory of the computer device may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory of the computer device is used for storing the software program corresponding to the naming control module, which is executed by the processor of the computer device.
In some embodiments, the processor of the computer device includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor.
The processor of the computer device is used for assigning an ID to each training task to locate the training data of that training task. Each training task is completed by one computing node; the training data of the corresponding training task is distributed to each training task by the storage server according to the task's ID and its number of iterations; the computing node performs gradient calculation on the distributed training data of its training task to obtain gradient data, sends the gradient data to the network, and obtains the aggregated data from the network; and the aggregated data is obtained by the switch through aggregation calculation.
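A minimal sketch of the ID-assignment role of this processor, assuming IDs are sequential integers kept in a small registry that maps each ID to the location of its training data; the registry structure and the example path are hypothetical.

```python
import itertools

class NamingControl:
    """Assigns an ID to each training task and records where its data lives."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._registry = {}   # task ID -> data location used by the storage service

    def register_task(self, data_location):
        task_id = next(self._next_id)
        self._registry[task_id] = data_location
        return task_id

    def locate(self, task_id):
        """Return the data location so training data can be distributed."""
        return self._registry[task_id]


# Usage: a new training task is registered and its ID later locates its data.
nc = NamingControl()
tid = nc.register_task("/data/job-a/shard-0")   # hypothetical location
print(tid, nc.locate(tid))
```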
In this embodiment, the working manner of the processor of the computer device is the same as or similar to the working manner of the naming control module in the multi-job distributed training system, and will not be described herein.
In one embodiment, the application also provides a programmable switch applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The switch includes a network port and an aggregation module. In a practical embodiment, the programmable switch is exemplified by a 64×100 Gbps programmable switch. In an example, the programmable switch may also be a Tofino switch.
In an embodiment, the network port is used to connect the switch to the network, and the network port includes an ethernet interface, a fibre channel interface, and the like.
The processor of the programmable switch may include an integrated circuit chip having signal processing capabilities, or a general-purpose processor, such as a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, which may implement or perform the functions of the method steps and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor. The processor of the programmable switch further includes registers used for the aggregation calculation. In an example, the processor of the programmable switch includes a Barefoot Tofino chip.
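To make the register-based aggregation concrete, the following software sketch mimics what the switch pipeline does per gradient slot: accumulate the values and a contributor count in registers, then release the sum once every worker of the task has contributed. The slot layout, worker count, and integer gradient values are assumptions of this sketch, not a description of the Tofino data plane.

```python
class SwitchAggregator:
    """Software model of per-slot register aggregation on the switch."""

    def __init__(self, num_workers, slot_count):
        self.num_workers = num_workers
        self.sums = [None] * slot_count    # value registers, one block per slot
        self.counts = [0] * slot_count     # contributor-count registers

    def on_packet(self, slot, values):
        """Accumulate one worker's gradient block for a slot.

        Returns the aggregated block when the last worker has contributed
        (at which point it would be broadcast to all workers), else None.
        """
        if self.counts[slot] == 0:
            self.sums[slot] = list(values)
        else:
            self.sums[slot] = [a + b for a, b in zip(self.sums[slot], values)]
        self.counts[slot] += 1
        if self.counts[slot] == self.num_workers:
            result = self.sums[slot]
            self.counts[slot] = 0          # free the slot for the next gradient block
            return result
        return None


# Usage: three workers send the same gradient block (slot 0) at the same time.
agg = SwitchAggregator(num_workers=3, slot_count=8)
agg.on_packet(0, [1, 2, 3])
agg.on_packet(0, [4, 5, 6])
print(agg.on_packet(0, [7, 8, 9]))   # -> [12, 15, 18], ready to broadcast
```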
Further, the programmable switch further includes a memory, and the switch module described above may be configured in the memory of the programmable switch.
Referring to fig. 8, which is a schematic structural diagram of a computer device according to an embodiment of the present application, the computer device 3 includes a storage device 30 and a processing device 31 connected to the storage device 30. Further, the computer device 3 includes an interface device 32.
In some embodiments, the storage device 30 is configured to store at least one program that is executable by the processing device 31, so that the processing device 31 can work with the storage device 30 to implement the multi-job distributed training method described above with respect to fig. 7, including steps S10-S13. Herein, the storage device 30 includes, but is not limited to, read-only memory, random access memory, and nonvolatile memory. For example, the storage device 30 includes a flash memory device or another non-volatile solid-state storage device. In some embodiments, the storage device 30 may also include memory remote from the one or more processing devices 31, such as network-attached storage accessed via RF circuitry or external ports and a communication network, where the communication network may be the Internet, one or more intranets, a local area network, a wide area network, a storage area network, or a suitable combination thereof. A memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In some embodiments, the processing device 31 includes one or more processors. The processing means 31 is operable to perform data read and write operations with the storage means 30. The processing means 31 comprise one or more general purpose microprocessors, one or more special purpose processors, one or more digital signal processors, one or more field programmable logic arrays, or any combination thereof.
In some embodiments, the interface device 32 includes at least one interface unit, each for outputting a visual interface, receiving a man-machine interaction event generated according to a technician's operation, and the like. For example, the interface device 32 includes, but is not limited to, a serial interface such as an HDMI interface or a USB interface, or a parallel interface, etc. In one embodiment, the interface device 32 further comprises a network communication unit, which is a device for transmitting data using a wired or wireless network, and includes, for example, but not limited to, an integrated circuit including a network card, a local area network module such as a WiFi module or a bluetooth module, a wide area network module such as a mobile network, etc.
The present application also provides a computer readable storage medium storing at least one program which when invoked and executed by a processor of a computer implements a multi-job distributed training method as described above with respect to fig. 7 including steps S10-S13.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the multi-job distributed training method described above with respect to fig. 7 including steps S10-S13.
If implemented in the form of a software functional unit and sold or used as a standalone product, the method may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, the software product comprising several instructions for enabling a device on which the storage medium is installed to perform all or part of the steps of the methods according to the embodiments of the present application.
In the embodiments provided herein, the computer storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB disk, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described by the multi-job distributed training methodology of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed in the present application may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer storage medium. Tangible, non-transitory computer storage media can be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In summary, the multi-job distributed training system, the multi-job distributed training method, the storage server, the computer device, the programmable switch, the computer-readable storage medium, and the computer program product disclosed by the application use the switch module to aggregate gradient data and send the aggregated data to each computing node in the network, and use the storage service module to manage the training data, which reduces the communication overhead of the parameter server, improves the utilization of the network, and thereby improves the efficiency and scalability of distributed training.
Furthermore, the shortest waiting time during aggregation is achieved by having different computing nodes send the same gradient blocks at the same time, which improves the computing efficiency of distributed training and reduces the overhead of the aggregator; and when the host agent module receives the last gradient data of an iteration, all the computing nodes in the network are started at the same time, so that each computing node begins the next iteration together, further improving synchronization. Moreover, the application uses one computing node as a standby parameter server that is activated only when data is lost, thereby reducing the traffic in the network during the normal aggregation process. In addition, when a computing node fails during training, its computing task can be transferred to other computing nodes, so subsequent distributed training is not affected.
The above embodiments merely illustrate the principles and effects of the present application and are not intended to limit the application. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the application. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present application.

Claims (25)

CN202410033375.1A — priority date 2024-01-09 — filing date 2024-01-09 — Multi-job distributed training system and method — status: Pending — publication: CN119201416A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410033375.1A | 2024-01-09 | 2024-01-09 | CN119201416A (en) Multi-job distributed training system and method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410033375.1A | 2024-01-09 | 2024-01-09 | CN119201416A (en) Multi-job distributed training system and method

Publications (1)

Publication Number | Publication Date
CN119201416A (en) | 2024-12-27

Family

ID=94053402

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410033375.1A | CN119201416A (en), Pending | 2024-01-09 | 2024-01-09

Country Status (1)

Country | Link
CN (1) | CN119201416A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119442237A (en)* | 2025-01-08 | 2025-02-14 | 北京简网科技有限公司 | Virus protection method based on host security, computer equipment



Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
