Disclosure of Invention
The invention provides a deep learning large model training method and a deep learning large model training system for heterogeneous devices, which solve the problem that conventional schemes cannot effectively utilize heterogeneous GPU clusters when training deep learning large models.
According to a first aspect of an embodiment of the present invention, there is provided a deep learning large model training method for heterogeneous devices, including:
Dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Taking each small batch as the input of the virtual equipment to carry out training of a deep learning large model, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
Further, the training samples in the training data set are divided into a plurality of large batches meeting the first scale requirement, and each large batch is divided into a plurality of small batches meeting the second scale requirement, specifically, the training data set is divided into a plurality of large batches, wherein each large batch comprises a plurality of training samples; in performing training of pipeline parallelism among virtual devices, each large batch is divided into a plurality of small batches to realize forward propagation and backward propagation alternately among the virtual devices, wherein each virtual device takes one small batch as a basic unit of input.
Further, the dividing of the different network layers of the deep learning large model to be trained into a plurality of stages is specifically as follows: the different network layers of the deep learning large model to be trained are divided into a plurality of stages, each stage is mapped to a separate virtual device, and the one or more homogeneous GPU devices in the virtual device are cooperatively responsible for the forward propagation and backward propagation calculation of all layers of that stage.
Further, when a pipeline parallel processing mode is adopted among the virtual devices, a preset double buffer mechanism is adopted. Specifically, each virtual device maintains two weight versions, one new and one old; after each small batch in the virtual device completes back propagation, the weights are updated and a new weight version is generated; all small batches in the same large batch use the weight version of the same training stage, and the old weight version is discarded if and only if all small batches in the same large batch have completed back propagation.
Further, the communication mode between the virtual devices is specifically as follows: when two adjacent virtual devices of the same type transmit an activation value or gradient, point-to-point communication is adopted; when virtual devices of different types transmit an activation value or gradient, a customized communication mode is adopted.
The customized communication mode is specifically as follows: when data parallelism is adopted inside the virtual device, in forward propagation the small batches are divided into micro-batches meeting a preset scale requirement and sent to the corresponding GPUs for execution, and gradient aggregation is carried out through the AllReduce operation in the backward propagation stage; when tensor parallelism is adopted inside the virtual device, tensor distribution is carried out through the Scatter operation in forward propagation and tensor collection is carried out through the ALLGATHER operation in backward propagation; when a pipeline strategy is adopted inside the virtual device, point-to-point communication is adopted between the virtual devices.
Further, in the deep learning large model training method, the selection of the parallel strategy inside the virtual devices is specifically as follows: for each parallel strategy, simulation training is carried out with different batch sizes, and the optimal parallel strategy, the batch size during training, and the proportion of heterogeneous devices are determined by dynamically adjusting the proportion of heterogeneous devices so as to maximize the comprehensive utilization rate of the heterogeneous GPU devices.
According to a second aspect of the embodiment of the present invention, there is provided a deep learning large model training system for heterogeneous devices, including:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, where the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage; in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
According to a third aspect of embodiments of the present invention, there is provided an electronic device including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the deep learning large model training method for heterogeneous devices as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium storing computer instructions that, when executed by a processor, perform a deep learning large model training method for heterogeneous devices as described above.
Compared with the prior art, the invention has the beneficial effects that:
The scheme of the invention provides a deep learning large model training method and system for heterogeneous devices based on the proposed concept of virtual devices. In the scheme, the different network layers of the deep learning large model to be trained are divided into a plurality of stages, and the forward propagation and backward propagation computation of all network layers in each stage is executed by an independent virtual device; meanwhile, GPU resources of different configurations can be utilized in coordination, so that efficient model training is realized by combining the proposed hybrid parallel training strategies (including the pipeline parallel processing mode or combined pipeline-and-data parallel processing mode adopted among the virtual devices, and the data parallel processing mode, tensor parallel processing mode and the like adopted inside the virtual devices);
the scheme provides an automatic parallel-strategy selection mechanism: the optimal parallel strategy, the batch size during training and the proportion of heterogeneous devices are determined by maximizing the comprehensive utilization rate of the heterogeneous GPU devices, so that the optimal parallel strategy can be obtained automatically for model training.
The scheme provides a customized communication mode among the virtual devices, so that the communication efficiency of hybrid parallelism among the virtual devices of different types is effectively improved, and the training efficiency of the model is further improved.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Term interpretation:
Scatter is a distributed operation that distributes data from a data source (typically on a root node) to multiple receiving nodes, the root node dividing its data into multiple portions, and then sending the portions to the different receiving nodes, respectively;
ALLGATHER is an operation of collecting the data on each node to all nodes: each node sends its own data to the other nodes, and finally every node holds the data of all nodes;
AllReduce is an operation of performing some sort of reduction operation (e.g., summing, averaging, maximizing, etc.) on the data at each node and broadcasting the result to all nodes.
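For concreteness, the following minimal Python sketch illustrates the three collective operations using torch.distributed; the backend, tensor shapes, and launch method (e.g., torchrun) are illustrative assumptions and are not part of the claimed scheme.

```python
# Minimal sketch of the three collectives using torch.distributed.
# Assumes the script is launched with torchrun so that a process group
# can be initialized; shapes and values are illustrative only.
import torch
import torch.distributed as dist

def demo_collectives():
    dist.init_process_group(backend="gloo")          # or "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    # Scatter: the root (rank 0) splits its data and sends one chunk to each rank.
    recv = torch.zeros(2)
    chunks = [torch.full((2,), float(i)) for i in range(world)] if rank == 0 else None
    dist.scatter(recv, scatter_list=chunks, src=0)

    # AllGather: every rank contributes its chunk and ends up with all chunks.
    gathered = [torch.zeros(2) for _ in range(world)]
    dist.all_gather(gathered, recv)

    # AllReduce: element-wise reduction (here a sum) whose result lands on every rank.
    dist.all_reduce(recv, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()
```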
The scheme of the embodiment is mainly used for realizing efficient hybrid parallel training in heterogeneous GPU equipment environments so as to reduce the cost and resource threshold of large-scale basic model training, and specifically solves the following problems:
The computing load is unbalanced, namely, GPUs with different performances (such as H100, A100, T4 and the like) can cause the high-performance GPUs to wait for the computation of the low-performance GPUs during training, so that time is wasted, and the overall training efficiency is reduced.
Memory bottleneck: as basic model parameters and input data batches increase, the memory requirements of devices increase dramatically, and low-memory devices may face OOM (Out of Memory) problems.
The communication efficiency between heterogeneous GPUs is low, the computing power and memory difference of different types of GPUs make the cooperation between devices more complex, and the traditional parallel training method may not fully optimize the communication between heterogeneous devices.
Optimizing the training strategy and the load scheduling, namely reducing the negative influence of the device performance difference and the memory difference on training through reasonable scheduling and load balancing strategy, and maximizing the calculation efficiency of various devices.
Some previous work used one-forward-one-backward (1F1B) scheduling to order the processing of micro-batches: a micro-batch enters the back-propagation phase immediately after its forward propagation is completed, overlapping the computation and communication of different inputs in a pipelined fashion and thereby reducing point-to-point communication between workers. However, this approach requires storing multiple copies of activation information. Based on this, some researchers have proposed a double-buffer weight updating mechanism that improves throughput and memory efficiency by limiting the number of weight copies that must be maintained for gradient computation, but it is limited to homogeneous devices; the scheme of this embodiment further extends this advantage to heterogeneous GPU environments.
Because the amount of data used for basic model training is huge and ever-increasing, it is often necessary to increase the batch size of the input data in order to improve the generalization ability of the model and reduce the training time. As the basic model parameters and input data batches increase, the memory requirements also increase dramatically, and low-memory devices often face memory overflow problems. Therefore, the scheme of this embodiment reasonably allocates batches of appropriate sizes according to the performance of the heterogeneous devices and the GPU memory capacity, so as to avoid OOM (Out of Memory) errors on low-end (budget) graphics cards.
According to the scheme, a mixed parallel strategy is provided for heterogeneous GPU equipment, so that training of a basic model can be conducted by coordinating various heterogeneous GPU resources, computing capacities of different equipment are fully utilized, and training cost and threshold of a base model are effectively reduced. The main innovation of the scheme of the embodiment is that:
(1) A novel hybrid parallel training method is provided, which is specially designed for heterogeneous GPU equipment.
(2) And customizing the communication modes among the virtual devices of different device types so as to improve the hybrid parallel communication efficiency among the devices of different types.
(3) A hardware-aware policy search algorithm is presented by which to search for hybrid parallel policies to speed training on heterogeneous GPUs.
(4) The effectiveness of the scheme described in this embodiment is demonstrated by training a large base model in heterogeneous GPU clusters.
In order to solve the above problems, the solution of the present embodiment provides a deep learning large model training method for heterogeneous devices, which aims to automatically select an optimal parallel strategy by adopting a strategy search algorithm based on hardware perception, and coordinate the utilization of different GPU resources so as to realize efficient basic model training.
In order to simplify the complexity of heterogeneous devices, the solution described in this embodiment introduces a core concept, namely the virtual device, which is defined as follows: a virtual device is a logical representation of GPU resources and corresponds to one or more homogeneous GPU devices. By abstracting physically heterogeneous devices into logically homogeneous virtual devices, the complexity of hybrid heterogeneous device training is reduced. The reason for constructing a virtual device from homogeneous devices is that devices of the same type have consistent computing capacity and memory capacity, which avoids the waiting problem caused by communication synchronization among the devices inside a virtual device.
Specifically, as shown in fig. 8, the embodiment provides a deep learning large model training method for heterogeneous equipment, which includes the following processing procedures:
Step 1, dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Step 2, dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Step 3, training a deep learning large model by taking each small batch as the input of the virtual equipment, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
In a specific implementation, the step 1 specifically includes the following processing procedures:
The different network layers of the deep-learning large model to be trained are divided into a plurality of stages, each stage is mapped to a separate virtual device, and one or more devices in the virtual devices are cooperatively responsible for forward propagation and backward propagation computation of all layers of the stage.
In a specific implementation, the step 2 specifically includes the following processing procedures:
The training data set is first divided into a plurality of large batches, each large batch containing a number of training samples. When performing pipeline-parallel (PP) training, each large batch is further subdivided into a plurality of small batches, so that forward propagation and backward propagation alternate between the virtual devices, each virtual device taking one small batch as its basic unit of input.
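As an illustration of the two-level batching described above, the following Python sketch splits a sample list into large batches and then into small batches; the function name and batch sizes are illustrative only.

```python
# Minimal sketch of the two-level batching: data set -> large batches -> small batches.
from typing import List, Sequence

def split_batches(samples: Sequence, large_batch_size: int,
                  small_batch_size: int) -> List[List[list]]:
    large_batches = [list(samples[i:i + large_batch_size])
                     for i in range(0, len(samples), large_batch_size)]
    return [[lb[j:j + small_batch_size]
             for j in range(0, len(lb), small_batch_size)]
            for lb in large_batches]

# Example: 32 samples -> 2 large batches of 16 -> 4 small batches of 4 each.
nested = split_batches(list(range(32)), large_batch_size=16, small_batch_size=4)
assert len(nested) == 2 and len(nested[0]) == 4 and len(nested[0][0]) == 4
```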
In a specific implementation, in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices, wherein:
(1) Pipeline parallelism (namely Pipeline Parallelism, PP for short) among virtual devices is specifically as follows:
The PP between virtual devices divides the different layers of the large base model into multiple phases, each phase being mapped to a separate virtual device, one or more of the virtual devices being responsible for forward and backward propagation computations of all layers of the phase in concert.
In a pipeline system, synchronous pipeline scheduling easily increases the idle time of GPUs; in particular, when the computing loads of the stages are uneven, computing resources sit idle and training time is prolonged. In asynchronous pipeline scheduling, the computation of each stage does not need to be completed synchronously, and each stage can be executed independently according to its own progress, which effectively increases GPU utilization. However, because the different stages are not synchronized, the forward propagation and backward propagation of a stage may use different versions of the parameters, making the gradient computation incorrect and affecting the convergence of the model; a suitable synchronization mechanism is therefore required to avoid this problem. Common synchronization strategies include gradient accumulation, delayed parameter update, parameter servers or AllReduce, and weight version control to solve the convergence problem in PP.
To maximize GPU utilization, the present method uses asynchronous PP to extend the 1F1B (one-forward-one-backward) scheduling strategy to heterogeneous devices. As shown in fig. 3, each virtual device alternates forward and backward propagation between different inputs, and by asynchronously passing forward activations and backward gradients, communication and computation can be overlapped, improving pipeline efficiency. In order to ensure that consistent parameter versions are used in the different stages, so that the model can converge effectively, while reducing the GPU memory occupied by multiple weight versions, the method uses a double buffering mechanism oriented to virtual devices.
Unlike the conventional approach, we first divide the training data set into a plurality of large batches, each of which contains several training samples. In performing training of PP, the large batch is further subdivided into a plurality of small batches for alternating forward and backward propagation between the various virtual devices. Each virtual device takes a small batch as a basic unit, and all small batches in the same large batch use the weight of the same version and perform gradient accumulation and updating on the fine granularity level.
The fine granularity level performs gradient accumulation and updating, which means that the physical devices in each virtual device calculate and accumulate gradients when processing each small batch and then perform weight updating when appropriate.
In particular, the double buffering mechanism requires each virtual device to maintain two different weight versions, one new and one old. After each small batch completes its back propagation, the weights are updated and a new version is generated. Since some small batches whose forward propagation is in progress were started with the old weight version, these small batches must also use the old weight version for gradient accumulation during back propagation in order to guarantee convergence. Only after all small batches in the same large batch have completed their back-propagation updates is the old version of the weights discarded and the new weight version used to process newly entered small batches. For example, as shown in fig. 3, the forward propagation of small batch 8 in virtual device 1 does not use the weight version produced by the update of small batch 4 in virtual device 1, which ensures that small batch 8 and small batches 5, 6 and 7 in the same large batch are updated with the same weight version. After the virtual devices have processed the back propagation of small batch 8, the devices within all virtual devices discard the old version of the weights and enable the new weights to process newly entered small batches.
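The following minimal Python sketch illustrates one possible realization of the double-buffer weight versioning described above; the class name, the scalar weights and the update rule are illustrative stand-ins rather than the actual implementation.

```python
# Sketch of double-buffer weight versioning for one pipeline stage (illustrative).
class DoubleBufferStage:
    def __init__(self, weights, lr=0.1):
        self.current = dict(weights)   # version used by every small batch of the
        self.pending = dict(weights)   # active large batch; pending collects updates
        self.lr = lr

    def weights_for_forward(self):
        # Every small batch of the same large batch sees this same version.
        return self.current

    def apply_backward(self, grads):
        # Each completed back-propagation refines the new (pending) version.
        for k, g in grads.items():
            self.pending[k] -= self.lr * g

    def end_of_large_batch(self):
        # Only now is the old version discarded and the new one promoted.
        self.current = dict(self.pending)

# Usage: four small batches of one large batch, then promote the new version.
stage = DoubleBufferStage({"w": 1.0})
for _ in range(4):
    _ = stage.weights_for_forward()
    stage.apply_backward({"w": 0.5})     # illustrative gradient
stage.end_of_large_batch()
```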
In further embodiments, it is considered that the communication between virtual devices may vary across different pipeline stages. To reduce the communication overhead, the way in which the virtual devices communicate is customized. When an activation value or gradient is transmitted between two adjacent virtual devices of the same type, point-to-point Send/Receive communication is usually adopted, because the internal device arrangement of the two virtual devices is the same; when an activation value or gradient is transmitted between virtual devices of different types, the communication method is customized.
As shown in FIG. 4, a customized communication mode for pipeline parallelism between heterogeneous virtual devices when data parallelism is used inside a virtual device (such as virtual device 2) is illustrated. Specifically, when DP is used inside virtual device 2, in forward propagation GPU A splits the small batch into smaller micro-batches and sends them to the corresponding GPUs for execution, which reduces the memory occupation of a single device; in the back-propagation stage, gradient aggregation is performed using AllReduce, which efficiently completes gradient synchronization within the virtual device. As shown in FIG. 5, a customized communication mode for pipeline parallelism between heterogeneous virtual devices when tensor parallelism is used inside a virtual device (such as virtual device 2) is illustrated. Specifically, when TP is used inside virtual device 2, forward propagation distributes the tensors through Scatter, so that each device only processes a part of the tensor, which significantly reduces the computational burden of a single device and enables more efficient parallel computation; back propagation then collects the tensors through ALLGATHER and aggregates them onto each device for global gradient computation and parameter updating.
It should be noted that, because different parallel strategies (DP (Data Parallelism), TP (Tensor Parallelism), PP, DP+TP, DP+PP) may be used inside a given virtual device (such as virtual device 2), pipeline parallelism between virtual devices gives rise to various communication modes; only the two typical cases of DP and TP are selected and illustrated here. When a PP strategy is used inside the virtual device, the communication between virtual devices is point-to-point Send/Receive communication, whose mechanism is simple and is not repeated here.
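The following hedged Python sketch, based on torch.distributed, illustrates how the communication call issued between adjacent virtual devices could be selected according to the parallel strategy used inside the virtual device; the function names, rank lists, and group handles are assumptions, and the sketch is not a complete pipeline runtime.

```python
# Sketch of strategy-dependent communication between adjacent virtual devices.
import torch
import torch.distributed as dist

def send_activation(act: torch.Tensor, strategy: str, peer_ranks):
    if strategy == "PP":                       # same layout on both sides: point-to-point
        dist.send(act, dst=peer_ranks[0])
    elif strategy == "DP":                     # split the small batch into micro-batches
        for chunk, dst in zip(act.chunk(len(peer_ranks), dim=0), peer_ranks):
            dist.send(chunk.contiguous(), dst=dst)
    elif strategy == "TP":                     # distribute tensor slices (Scatter-style)
        for shard, dst in zip(act.chunk(len(peer_ranks), dim=-1), peer_ranks):
            dist.send(shard.contiguous(), dst=dst)

def sync_backward(grad: torch.Tensor, strategy: str, group=None):
    if strategy == "DP":                       # gradient aggregation via AllReduce
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    elif strategy == "TP":                     # collect shards via AllGather
        shards = [torch.empty_like(grad) for _ in range(dist.get_world_size(group))]
        dist.all_gather(shards, grad, group=group)
        return torch.cat(shards, dim=-1)
    return grad
```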
This customized communication strategy optimizes the communication overhead, remarkably improves memory utilization and computational efficiency, fully exploits the scalability and flexibility of multi-GPU or multi-node hardware, and effectively solves the problems of gradient synchronization and resource utilization in large-scale training; it is particularly suitable for the distributed training of oversized models such as Transformers.
(2) Pipeline parallelism and data parallelism (namely PP+DP) between virtual devices are combined, and specifically:
DP is widely used to accelerate DNN execution. For DP between virtual devices, DP can be nested on top of PP when the available heterogeneous device resources meet the device requirements. As shown in fig. 6, replicas are obtained by replicating the different stages of the model and are distributed to identical virtual devices in the same machine, and the virtual devices each process different small batches of data, so that all virtual devices participate in the computation at the same time. In addition, efficient DP synchronization between virtual devices can be performed through the PCIe connections of the motherboard within the machine; the orange arrows represent the synchronization process, which uses communication primitives (e.g., AllReduce) to aggregate gradients across devices and ensure the consistency and accuracy of the model. DP between virtual devices further improves the utilization of computing resources, reduces memory pressure, improves throughput, and benefits the efficient training of ultra-large-scale deep learning models.
In one or more embodiments, pipeline parallelism and data parallelism among the virtual devices can be used in a nested manner, namely a PP+DP processing mode, which can further expand the scale of the trainable model significantly and support larger-scale model training by using more heterogeneous GPU device clusters, as sketched below.
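The following sketch, assuming torch.distributed process groups are already initialized, illustrates how replicas of the same pipeline stage could form a process group and synchronize gradients with AllReduce; the rank layout and function names are illustrative.

```python
# Sketch of gradient synchronization across replicas of the same pipeline stage.
import torch
import torch.distributed as dist

def make_stage_replica_groups(ranks_per_stage):
    # e.g. ranks_per_stage = [[0, 4], [1, 5], [2, 6], [3, 7]] for 4 stages x 2 replicas
    return [dist.new_group(ranks=r) for r in ranks_per_stage]

def sync_stage_gradients(model: torch.nn.Module, group):
    world = dist.get_world_size(group)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
            p.grad.div_(world)   # average across the stage's replicas
```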
(3) The data parallelism in the virtual equipment is specifically as follows:
The joint use of heterogeneous GPU devices is achieved by DP, as shown in part (a) of fig. 1. In this heterogeneous environment, virtual device 1 is composed of GPU A and is responsible for executing pipeline stage S1, while virtual device 2 is composed of four GPU B devices and executes stage S2 in DP mode. The method adjusts dynamically according to the actual device performance and does not restrict the GPU types or their DP proportions.
Specifically, GPU A in virtual device 1 processes four micro-batches of data simultaneously (we assume 1 small batch = 4 micro-batches), and then each micro-batch is distributed to one of the 4 GPU B devices in virtual device 2, with each GPU B processing one micro-batch of data.
By implementing DP, the method alleviates the disadvantage of GPU B relative to GPU A in terms of memory and computing power: more of the workload is effectively allocated to the GPU with stronger performance, while the weaker GPUs take on fewer tasks, thereby achieving load balancing among heterogeneous GPUs and optimizing overall performance.
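As a small illustration, the following Python sketch splits one small batch handled by GPU A into micro-batches for the DP GPUs of the next virtual device; the shapes and the 4-way split are examples taken from the description, not fixed parameters of the scheme.

```python
# Sketch of the micro-batch split used when DP runs inside a virtual device.
import torch

def split_small_batch(small_batch: torch.Tensor, num_dp_gpus: int):
    # dim 0 is the sample dimension; each chunk becomes one micro-batch.
    return list(small_batch.chunk(num_dp_gpus, dim=0))

small_batch = torch.randn(16, 1024)              # 16 samples handled by GPU A
micro_batches = split_small_batch(small_batch, num_dp_gpus=4)
assert len(micro_batches) == 4 and micro_batches[0].shape[0] == 4
```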
(4) Tensor parallelism in the virtual device is specifically:
The TP-based device matching method is illustrated in part (b) of fig. 1. Virtual device 1 represents the first execution phase, in which GPU A processes 1 small batch of data so as to fully utilize its computing resources. The four GPU B devices in virtual device 2 are responsible for tensor-slice processing: each GPU receives one tensor slice (e.g., A1, A2, A3, A4) and performs a matrix multiplication (e.g., X·A1, X·A2, etc.) with the input data X, realizing fine-grained parallelism. The computation results are then combined by AllReduce or ALLGATHER operations to form the complete tensor matrix. In this way, hardware resources are utilized to the maximum extent, the parallel system has greater flexibility, and the training and inference of large-scale deep learning models become more efficient.
Of these, allReduce and ALLGATHER are two collective communication operations commonly used in distributed computing, commonly used for data interactions between multiple nodes, multiple processors, or multiple GPUs.
In more embodiments, when the computing power of the two types of GPU devices differs greatly, TP and DP can be dynamically combined; as shown in part (a) of fig. 2, combining different degrees of TP and DP further balances the performance and memory differences of the heterogeneous GPUs. In addition, the combination of TP and DP maintains high computational efficiency, and each device only needs to store part of the model parameters and part of the data, which avoids the memory bottleneck faced by low-memory GPU devices.
(5) Pipelined parallelism within virtual devices
In addition to DP and TP, the PP mechanism may be used in the virtual device, as shown in part (c) in fig. 1, the GPU with higher performance calculates more stage tasks, while the GPU device with lower performance advances in a pipeline manner by decomposing the calculation into a plurality of subtasks, so as to reduce the pressure of the video memory. The strategy of the heterogeneous PP can achieve better load balancing among heterogeneous devices, reduce performance bottlenecks caused by overload of a single GPU, and ensure that each device can operate efficiently within the performance limit of the device.
In further embodiments, PP inside the virtual device may be used in combination with DP. As shown in part (b) of fig. 2, when the DP degree is 2, the input data of each GPU in virtual device 2 when executing the pipeline is reduced from 4 micro-batches to 2 micro-batches. Although PP inside the virtual device introduces some pipeline-bubble overhead, it remains an effective strategy on platforms with good communication (such as a single machine with multiple GPUs).
In one or more embodiments, a hardware-aware policy search method is provided, by which a search over hybrid parallel strategies can be carried out to accelerate model training on heterogeneous GPUs. The core idea of the method is to search, through a depth-first search (DFS) strategy, over different batch sizes and different ratios of heterogeneous GPUs under the different parallel strategies (i.e., DP, TP, PP, DP+TP, DP+PP inside a virtual device), dynamically adjusting the ratio of heterogeneous GPUs for load balancing under the condition that the heterogeneous devices do not run out of memory (OOM), so as to optimize the comprehensive utilization rate of the computing resources of the heterogeneous GPUs and find the optimal hybrid parallel strategy. The depth-first search strategy adopts the following idea:
A parallel strategy is selected in turn; under that strategy, simulation training is performed with different batch sizes bs, and the ratio of heterogeneous devices is dynamically adjusted to maximize the comprehensive utilization rate of the heterogeneous GPU devices. After the algorithm search is completed, a concrete usable training configuration is obtained, including the parallel strategy, the batch size during training, and the ratio of heterogeneous devices. Similar in spirit to DFS, the best heterogeneous device ratio is searched under different parallel strategies and different batch sizes in a depth-first manner, so as to achieve the maximum comprehensive utilization of the heterogeneous GPU devices. Specifically:
First, the algorithm initializes the optimal plan to be empty and initializes the numbers of heterogeneous GPU devices of each type (i.e., the number of class-A GPUs and the number of class-B GPUs); then, for the current strategy and batch size, the utilization evaluation function is used to calculate the utilization of the currently configured heterogeneous GPUs (that of the class-A GPUs and that of the class-B GPUs), after which the comprehensive utilization of the heterogeneous devices is calculated;
In the utilization calculation, for the two types of heterogeneous devices, only the sub-models (i.e., network layers) corresponding to the devices need to be cut out of the deep learning large model to be trained, or the number of layers and parameters in the model is reduced for simulation training; the performance of the devices during actual model training is simulated through the simplified model structure without actually executing the training of the deep learning large model, which effectively improves the processing efficiency.
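The following Python sketch illustrates one way such a reduced proxy model and its per-iteration timing could look; the layer sizes, depth, and function names are assumptions, and the timing here is CPU-side for illustration only.

```python
# Sketch of a reduced "proxy" model used for simulation-based timing.
import time
import torch
import torch.nn as nn

def build_proxy(hidden: int = 512, layers: int = 2) -> nn.Module:
    # A few layers stand in for the full stage assigned to a device.
    return nn.Sequential(*[nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
                           for _ in range(layers)])

def time_one_iteration(model: nn.Module, batch_size: int, hidden: int = 512) -> float:
    x = torch.randn(batch_size, hidden)
    model.zero_grad(set_to_none=True)
    t0 = time.perf_counter()
    model(x).sum().backward()            # one simulated forward + backward pass
    return time.perf_counter() - t0
```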
Further, the batch size refers to the amount of training data input to the GPU at one time, i.e., the number of data samples processed by the GPU. During the search, the batch size is set to 1 in the initial stage, and when searching under the different strategies, the batch size bs is gradually increased until an OOM error occurs on the device.
During the search, the method dynamically adjusts and increases the number of devices of the more heavily loaded type so as to balance the load of the heterogeneous GPU devices, and compares the result with the current optimal plan. If the comprehensive utilization rate of the current strategy is higher than that of the existing optimal strategy, the optimal strategy is updated. If the comprehensive utilization rate of the current strategy decreases after adjustment, or the number of available devices cannot meet the adjusted count, or the GPU memory limit is exceeded, the search process for the current strategy is skipped and the next parallel strategy is searched. Finally, by gradually adjusting the configuration and evaluating the comprehensive GPU utilization of each combination, a hybrid parallel strategy that is applicable to the heterogeneous GPU resources and maximizes the utilization of the heterogeneous devices is found.
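The following Python sketch shows one possible realization of the search loop described above; the strategy list, the batch-size doubling schedule, the device-count-weighted aggregation of utilization, and the signature of evaluate_util are all assumptions made for illustration, not the patent's exact algorithm.

```python
# Sketch of a depth-first style search over strategy, batch size and device ratio.
def search_plan(evaluate_util, max_a: int, max_b: int, max_bs: int = 4096,
                strategies=("DP", "TP", "PP", "DP+TP", "DP+PP")):
    best_util, best_plan = 0.0, None
    for strategy in strategies:
        bs = 1
        while bs <= max_bs:
            n_a, n_b = 1, 1
            feasible = False
            while n_a <= max_a and n_b <= max_b:
                u_a = evaluate_util(strategy, bs, n_a, n_b, "A")
                u_b = evaluate_util(strategy, bs, n_a, n_b, "B")
                if u_a == 0.0 or u_b == 0.0:        # OOM or otherwise infeasible
                    break
                feasible = True
                # assumed aggregation: device-count-weighted average utilization
                util = (u_a * n_a + u_b * n_b) / (n_a + n_b)
                if util > best_util:
                    best_util, best_plan = util, (strategy, bs, n_a, n_b)
                # give one more device to the more heavily loaded type
                if u_a >= u_b:
                    n_a += 1
                else:
                    n_b += 1
            if not feasible:
                break                                # larger batch sizes will also fail
            bs *= 2                                  # grow the batch size until OOM
    return best_plan
```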
The utilization evaluation function operates as follows: it first checks whether the current configuration exceeds the GPU memory limit; if so, it returns 0, otherwise it calls the CalcUtil function, which evaluates the device performance utilization under the specified configuration and is used in the subsequent calculation of the comprehensive utilization of the heterogeneous devices.
Specifically, when model training is performed on the GPU, if the memory required by the current configuration exceeds the available GPU memory, a "RuntimeError: CUDA out of memory" exception is thrown. The utilization evaluation function captures this exception; if an OOM record is generated when checking the current configuration, 0 is returned to indicate that the configuration is infeasible. Meanwhile, during the simulation training the time taken to complete one iteration is recorded, and the CalcUtil function estimates the achieved FLOPS (floating-point operations per second) from the scale of the current computation task (such as the input data size and the number of parameters) and the recorded computation time, then divides it by the theoretical FLOPS of the GPU to obtain the device performance utilization.
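The following Python sketch illustrates an OOM-guarded utilization estimate in the spirit of this description; the function name calc_util, the callable run_one_iteration, and the FLOPS inputs are illustrative assumptions rather than the actual identifiers of the scheme.

```python
# Sketch: return estimated utilization in [0, 1], or 0.0 if the configuration OOMs.
import time
import torch

def calc_util(run_one_iteration, flops_per_iteration: float,
              peak_flops: float) -> float:
    try:
        t0 = time.perf_counter()
        run_one_iteration()                      # one simulated training iteration
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
    except RuntimeError as e:                    # e.g. "CUDA out of memory"
        if "out of memory" in str(e).lower():
            return 0.0
        raise
    achieved_flops = flops_per_iteration / elapsed
    return min(achieved_flops / peak_flops, 1.0)
```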
In general, the method eventually finds an optimal execution plan that balances between GPU resources and task demands by incrementally adjusting the configuration and evaluating the resource utilization of each combination.
Further, to verify the effectiveness of the scheme of the present embodiment, heterogeneous GPU clusters and environments are simulated using two types of GPUs, NVIDIA A100 (40 GB × 12) and NVIDIA Tesla T4 (16 GB × 36). Each node is configured with 64 GB of memory and 2 homogeneous GPUs, and runs the Ubuntu 20.04 LTS operating system. The GPUs within a node are connected through PCIe 3.0 x16, and the nodes are connected through an InfiniBand (IB) network with an inter-node bandwidth of about 5 Gbps, ensuring high efficiency and low latency of data transmission. To evaluate the performance of the method in various heterogeneous GPU environments, this embodiment selects a GPT-3-1.3B architecture model for performance testing and compares it with existing methods to verify the effectiveness of the method.
The scheme described in this embodiment compares performance against the popular parallel training frameworks Gpipe and HetPipe. GPT-3 models with different numbers of layers (layer number = 4/6/8) are used for performance testing, with the horizontal axis representing the global batch size and the vertical axis representing throughput. As shown in fig. 7(a) to 7(c), the performance of the scheme of this embodiment is significantly better than Gpipe and HetPipe at every global batch size, and the advantage becomes more prominent as the global batch size increases. The difference stems from the fact that the scheme of this embodiment adopts a more flexible and efficient heterogeneous parallel strategy, which optimizes load distribution and memory usage among different GPUs and significantly improves throughput. In addition, the double buffering strategy allows the method to maintain high performance and avoid memory overflow at larger batch sizes, improving overall throughput. In contrast, Gpipe is not optimized for heterogeneous devices, which leads to OOM on low-memory devices, while HetPipe is constrained by a centralized parameter server, which limits its throughput improvement and reduces system scalability. The evaluation results show that the scheme of this embodiment improves the training speed by 180% and 40%, respectively, compared with these prior methods.
In one or more embodiments, there is provided a deep learning large model training system for heterogeneous devices corresponding to the above method, comprising:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, where the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage; in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
It can be understood that the system in this embodiment corresponds to the method in the foregoing embodiment, and its technical details are described in the first embodiment, so that details are not repeated here.
In further embodiments, there is also provided:
an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of embodiment one. For brevity, the description is omitted here.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The steps of the method in the first embodiment may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.