
Deep learning large model training method and system for heterogeneous equipment

Info

Publication number
CN119557113B
Authority
CN
China
Prior art keywords
training
virtual
deep learning
heterogeneous
devices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510131779.9A
Other languages
Chinese (zh)
Other versions
CN119557113A (en)
Inventor
赵志刚
刘福来
李传涛
肖连辉
王春晓
李响
李锦涛
王雨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
National Supercomputing Center in Jinan
Original Assignee
Qilu University of Technology
National Supercomputing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology and National Supercomputing Center in Jinan
Priority to CN202510131779.9A
Publication of CN119557113A
Application granted
Publication of CN119557113B
Legal status: Active
Anticipated expiration

Abstract

Translated from Chinese


The present invention provides a deep learning large model training method and system for heterogeneous devices, belonging to the field of model training technology. To solve the problem that traditional solutions cannot effectively utilize heterogeneous GPU clusters when training deep learning large models, the solution builds on the proposed concept of virtual devices: the different network layers of the deep learning large model to be trained are divided into several stages, and the forward-propagation and backward-propagation computation of all network layers in each stage is performed by an independent virtual device. At the same time, the proposed hybrid parallel training strategy coordinates the utilization of heterogeneous GPU resources to achieve efficient model training.

Description

Deep learning large model training method and system for heterogeneous equipment
Technical Field
The invention belongs to the technical field of model training, and particularly relates to a deep learning large model training method and system for heterogeneous equipment.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of technology, the GPU (Graphics Processing Unit) hardware of data centers is continually updated, but many older devices with lower performance still exist, and particularly in large data centers the role of these old devices in distributed training is not negligible. For some small and medium-sized data centers, the number of high-performance computing resources, such as NVIDIA H100 and A100 GPUs, is usually limited, so training a basic model is difficult to complete, while low-performance economical GPUs, such as NVIDIA Tesla T4 and P4, struggle to play a role in large-model training. To reduce the training cost and resource threshold of the basic model, hybrid parallel training using heterogeneous GPUs becomes the best choice. However, if GPUs with different performance, such as H100, A100, T4, and P4, are simply combined for training, a great deal of time may be wasted while the high-performance GPUs wait for the computation of the low-performance GPUs due to unbalanced computation load, significantly reducing the overall training efficiency.
Currently, in research on training large DNN (Deep Neural Network) models with heterogeneous GPUs, HetPipe (a hybrid parallel method combining pipeline model parallelism and data parallelism) aggregates multiple heterogeneous GPU resources into virtual workers of similar performance and uses traditional synchronization protocols and a centralized parameter server to handle the static heterogeneity of different GPU versions; internally, each virtual worker processes small batches with pipeline parallelism, but externally the whole pipeline is treated only as a single worker in data parallelism, so the method is limited by the bottleneck of the centralized parameter server. The SWARM method (a model-parallel training method designed for heterogeneous, unreliable devices with low-speed connections) creates temporary randomized pipelines between nodes and rebalances them when faults occur, but its rebalancing process is slow and cannot guarantee that devices run efficiently.
Disclosure of Invention
The invention provides a deep learning large model training method and system for heterogeneous equipment, which solve the problem that traditional schemes cannot effectively utilize heterogeneous GPU clusters when training deep learning large models.
According to a first aspect of an embodiment of the present invention, there is provided a deep learning large model training method for heterogeneous devices, including:
Dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward-propagation and backward-propagation computation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Taking each small batch as the input of the virtual equipment to carry out training of a deep learning large model, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
Further, the training samples in the training data set are divided into a plurality of large batches meeting the first scale requirement, and each large batch is divided into a plurality of small batches meeting the second scale requirement, specifically, the training data set is divided into a plurality of large batches, wherein each large batch comprises a plurality of training samples; in performing training of pipeline parallelism among virtual devices, each large batch is divided into a plurality of small batches to realize forward propagation and backward propagation alternately among the virtual devices, wherein each virtual device takes one small batch as a basic unit of input.
Further, the different network layers of the deep learning large model to be trained are divided into a plurality of stages, specifically: the different network layers of the deep learning large model to be trained are divided into a plurality of stages, each stage is mapped to a single virtual device, and the one or more homogeneous GPU devices in the virtual device are cooperatively responsible for the forward-propagation and backward-propagation computation of all layers of that stage.
Further, when a pipeline parallel processing mode is adopted among the virtual devices, a preset double-buffer mechanism is adopted, specifically: two weight versions, a new one and an old one, are maintained in each virtual device; after each small batch in the virtual device completes backward propagation, the weights are updated and a new-version weight is generated; all the small batches in the same large batch use the weight version of the same training stage; and the old-version weights are discarded if and only if all the small batches in the same large batch have completed backward propagation.
Further, the communication mode between the virtual devices is specifically as follows: when two adjacent virtual devices transmit activation values or gradients, point-to-point communication is adopted, and when virtual devices of different types transmit activation values or gradients, a customized communication mode is adopted.
The customized communication mode is as follows: when data parallelism is adopted inside the virtual device, in forward propagation the small batch is divided into micro-batches meeting a preset scale requirement and sent to the corresponding GPUs for execution, and gradient aggregation is performed through the AllReduce operation in the backward-propagation stage; when tensor parallelism is adopted inside the virtual device, tensors are distributed through the Scatter operation in forward propagation and collected through the AllGather operation in backward propagation; and when a pipeline strategy is adopted inside the virtual device, point-to-point communication is adopted between the virtual devices.
Further, in the deep learning large model training method, the selection of the parallel strategy inside the virtual devices is as follows: for each parallel strategy, simulation training is carried out with different batch sizes, and the optimal parallel strategy, the batch size used during training, and the ratio of heterogeneous devices are determined by dynamically adjusting the ratio of heterogeneous devices and maximizing the comprehensive utilization rate of the heterogeneous GPU devices.
According to a second aspect of the embodiment of the present invention, there is provided a deep learning large model training system for heterogeneous devices, including:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, wherein the forward-propagation and backward-propagation computation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model all the small batches in the same large batch use the weight version of the same training stage; among the virtual devices a pipeline parallel processing mode or a processing mode combining pipeline parallelism and data parallelism is adopted, and inside the virtual devices a data parallel processing mode, a tensor parallel processing mode, or a pipeline parallel processing mode is adopted.
According to a third aspect of embodiments of the present invention, there is provided an electronic device including a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform a deep learning large model training method for heterogeneous devices as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium storing computer instructions that, when executed by a processor, perform a deep learning large model training method for heterogeneous devices as described above.
Compared with the prior art, the invention has the beneficial effects that:
The scheme of the invention provides a deep learning large model training method and system for heterogeneous equipment based on the proposed virtual device concept. The different network layers of the deep learning large model to be trained are divided into a plurality of stages, and the forward-propagation and backward-propagation computation of all network layers in each stage is executed by an independent virtual device. Meanwhile, the utilization of heterogeneous GPU resources can be coordinated by combining the proposed hybrid parallel training strategy (comprising a pipeline parallel processing mode or a combination of pipeline parallelism and data parallelism adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode, and the like adopted inside the virtual devices), thereby realizing efficient model training.
The scheme provides an automatic selection strategy for the parallel strategy: the optimal parallel strategy, the batch size used during training, and the ratio of heterogeneous devices are determined by maximizing the comprehensive utilization rate of the heterogeneous GPU devices, so that the optimal parallel strategy can be obtained automatically for model training.
The scheme provides a customized communication mode among the virtual devices, so that the communication efficiency of hybrid parallelism among the virtual devices of different types is effectively improved, and the training efficiency of the model is further improved.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of parallel processing in a virtual device according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a combined tensor-parallel and data-parallel process within a virtual device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of alternate forward and reverse propagation of virtual devices between different inputs according to an embodiment of the present invention;
FIG. 4 is the customized communication mode of pipeline parallelism between heterogeneous virtual devices when data parallelism is used inside a virtual device (e.g., virtual device 2) in forward or backward propagation according to an embodiment of the present invention;
FIG. 5 is the customized communication mode of pipeline parallelism between heterogeneous virtual devices when tensor parallelism is used inside a virtual device (e.g., virtual device 2) in forward or backward propagation according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a combination of pipeline parallelism and data parallelism between virtual devices according to an embodiment of the present invention;
FIG. 7 (a) is a comparative schematic diagram of the performance of a 4-layer GPT-3 model using the prior art method and the present method described in the examples of the present invention;
FIG. 7 (b) is a comparative schematic diagram of the performance of a 6-layer GPT-3 model using the prior art method and the present method described in the examples of the present invention;
FIG. 7 (c) is a comparative schematic diagram of the performance of the 8-layer GPT-3 model using the prior art method and the present method described in the examples of the present invention;
Fig. 8 is a flowchart of a deep learning large model training method for heterogeneous devices according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Term interpretation:
Scatter is a distributed operation that distributes data from a data source (typically on a root node) to multiple receiving nodes, the root node dividing its data into multiple portions, and then sending the portions to the different receiving nodes, respectively;
ALLGATHER is an operation of collecting the data on each node to all nodes: each node sends its own data to the other nodes, and finally every node holds the data of all nodes;
AllReduce is an operation of performing some reduction (e.g., summing, averaging, maximizing) on the data of each node and broadcasting the result to all nodes; a minimal sketch of all three collectives follows this list.
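The sketch below simulates the effect of the three collectives on a list of per-rank buffers in a single Python process; the four-rank layout and the toy data are made up for illustration, and a real system would perform these operations with a collective-communication library (e.g., NCCL or MPI) rather than plain lists.

```python
# Single-process simulation of Scatter, AllGather, and AllReduce.
# Each element of the input list stands for the local buffer of one GPU/rank.

def scatter(root_data, world_size):
    """Root splits its data into world_size chunks; rank i receives chunk i."""
    chunk = len(root_data) // world_size
    return [root_data[i * chunk:(i + 1) * chunk] for i in range(world_size)]

def all_gather(per_rank_buffers):
    """Every rank ends up holding the concatenation of all ranks' buffers."""
    gathered = [x for buf in per_rank_buffers for x in buf]
    return [list(gathered) for _ in per_rank_buffers]

def all_reduce(per_rank_buffers, op=sum):
    """Element-wise reduction (here: sum) whose result is broadcast to every rank."""
    reduced = [op(vals) for vals in zip(*per_rank_buffers)]
    return [list(reduced) for _ in per_rank_buffers]

if __name__ == "__main__":
    print(scatter([1, 2, 3, 4, 5, 6, 7, 8], world_size=4))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(all_gather([[1, 2], [3, 4]]))                     # both ranks hold [1, 2, 3, 4]
    print(all_reduce([[1, 2], [3, 4]]))                     # both ranks hold [4, 6]
```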
The scheme of the embodiment is mainly used for realizing efficient hybrid parallel training in heterogeneous GPU equipment environments so as to reduce the cost and resource threshold of large-scale basic model training, and specifically solves the following problems:
Unbalanced computing load: GPUs with different performance (such as H100, A100, and T4) cause the high-performance GPUs to wait for the computation of the low-performance GPUs during training, wasting time and reducing overall training efficiency.
Memory bottleneck: as basic model parameters and input data batches increase, the memory requirements of devices increase dramatically, and low-memory devices may face OOM (Out Of Memory) errors.
Low communication efficiency between heterogeneous GPUs: the differences in computing power and memory between different types of GPUs make cooperation between devices more complex, and traditional parallel training methods may not fully optimize the communication between heterogeneous devices.
Optimizing the training strategy and load scheduling: reducing the negative influence of device performance and memory differences on training through reasonable scheduling and load-balancing strategies, and maximizing the computing efficiency of all kinds of devices.
Some previous work used one-forward-one-backward (1F1B) scheduling to order the processing of micro-batches: a micro-batch enters the backward-propagation phase immediately after its forward propagation is completed, overlapping the computation and communication of different inputs in a pipelined fashion and thereby reducing point-to-point communication between workers. However, this approach requires storing multiple copies of activation information. On this basis, some researchers proposed a double-buffer weight-updating mechanism that limits the number of weight backups by merging gradients, which improves throughput and memory efficiency but is limited to homogeneous devices; the scheme of this embodiment further extends this advantage to heterogeneous GPU environments.
Because the amount of data for basic model training is huge and ever increasing, it is often necessary to increase the batch size of the input data in order to improve the generalization ability of the model and reduce the training time. As the basic model parameters and input data batches increase, the memory requirements also increase dramatically, and low-memory devices often face memory overflow problems. Therefore, the scheme of this embodiment reasonably distributes batches of appropriate sizes according to the performance of the heterogeneous devices and the memory capacity of the GPUs, so that OOM (Out of Memory) errors on economical GPUs are avoided.
According to the scheme, a mixed parallel strategy is provided for heterogeneous GPU equipment, so that training of a basic model can be conducted by coordinating various heterogeneous GPU resources, computing capacities of different equipment are fully utilized, and training cost and threshold of a base model are effectively reduced. The main innovation of the scheme of the embodiment is that:
(1) A novel hybrid parallel training method is provided, which is specially designed for heterogeneous GPU equipment.
(2) And customizing the communication modes among the virtual devices of different device types so as to improve the hybrid parallel communication efficiency among the devices of different types.
(3) A hardware-aware policy search algorithm is presented by which to search for hybrid parallel policies to speed training on heterogeneous GPUs.
(4) The effectiveness of the scheme described in this embodiment is demonstrated by training a large base model in heterogeneous GPU clusters.
In order to solve the above problems, the solution of the present embodiment provides a deep learning large model training method for heterogeneous devices, which aims to automatically select an optimal parallel strategy by adopting a strategy search algorithm based on hardware perception, and coordinate the utilization of different GPU resources so as to realize efficient basic model training.
In order to simplify the complexity of heterogeneous devices, the solution described in this embodiment introduces a core concept, namely the virtual device, which is defined as follows: a virtual device is a logical representation of GPU resources and corresponds to one or more homogeneous GPU devices. By abstracting physically heterogeneous devices into logically homogeneous virtual devices, the complexity of hybrid heterogeneous device training is reduced. Homogeneous devices are selected to construct a virtual device because devices of the same type have consistent computing capacity and memory capacity, which avoids the waiting problem caused by communication synchronization among the devices inside the virtual device.
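The sketch below shows one way this abstraction could be encoded; the PhysicalGPU/VirtualDevice names and fields are hypothetical and only capture the single property the definition above requires, namely that every GPU inside a virtual device is of the same type.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhysicalGPU:
    name: str          # device type, e.g. "A100" or "T4"
    memory_gb: int
    device_id: int

@dataclass
class VirtualDevice:
    """Logical device: one or more homogeneous GPUs that jointly execute the
    forward and backward passes of one pipeline stage."""
    gpus: List[PhysicalGPU] = field(default_factory=list)

    def add(self, gpu: PhysicalGPU) -> None:
        if self.gpus and gpu.name != self.gpus[0].name:
            raise ValueError("a virtual device may only contain GPUs of one type")
        self.gpus.append(gpu)

    @property
    def gpu_type(self) -> str:
        return self.gpus[0].name

def build_virtual_devices(cluster: List[PhysicalGPU]) -> List[VirtualDevice]:
    """Group a heterogeneous cluster into homogeneous virtual devices."""
    by_type: Dict[str, VirtualDevice] = {}
    for gpu in cluster:
        by_type.setdefault(gpu.name, VirtualDevice()).add(gpu)
    return list(by_type.values())

# Example: one A100 plus four T4s -> two virtual devices, one per GPU type.
cluster = [PhysicalGPU("A100", 40, 0)] + [PhysicalGPU("T4", 16, i) for i in range(1, 5)]
print([(vd.gpu_type, len(vd.gpus)) for vd in build_virtual_devices(cluster)])
# [('A100', 1), ('T4', 4)]
```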
Specifically, as shown in fig. 8, the embodiment provides a deep learning large model training method for heterogeneous equipment, which includes the following processing procedures:
Step 1, dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward-propagation and backward-propagation computation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Step 2, dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Step 3, training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallelism and data parallelism is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode, or a pipeline parallel processing mode is adopted inside the virtual devices.
In a specific implementation, the step 1 specifically includes the following processing procedures:
The different network layers of the deep-learning large model to be trained are divided into a plurality of stages, each stage is mapped to a separate virtual device, and one or more devices in the virtual devices are cooperatively responsible for forward propagation and backward propagation computation of all layers of the stage.
In a specific implementation, the step2 specifically includes the following processing procedures:
The training data set is first divided into a plurality of large batches, each large batch containing a number of training samples. When performing training with pipeline parallelism (PP), each large batch is further subdivided into a plurality of small batches, so that forward propagation and backward propagation alternate between the virtual devices, with each virtual device taking one small batch as its basic unit of input.
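A minimal sketch of this two-level split; the batch sizes are arbitrary example values, not the "first scale requirement" and "second scale requirement" of the patent.

```python
def split_dataset(samples, large_batch_size, small_batch_size):
    """Split a dataset into large batches, and each large batch into the small
    batches that serve as the basic input unit of a virtual device."""
    large_batches = [samples[i:i + large_batch_size]
                     for i in range(0, len(samples), large_batch_size)]
    return [
        [lb[j:j + small_batch_size] for j in range(0, len(lb), small_batch_size)]
        for lb in large_batches
    ]

# 32 samples -> 2 large batches of 16, each split into 4 small batches of 4.
schedule = split_dataset(list(range(32)), large_batch_size=16, small_batch_size=4)
assert len(schedule) == 2 and len(schedule[0]) == 4 and len(schedule[0][0]) == 4
```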
In a specific implementation, in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices, wherein:
(1) Pipeline parallelism (Pipeline Parallelism, PP for short) among virtual devices is specifically as follows:
The PP between virtual devices divides the different layers of the large base model into multiple phases, each phase being mapped to a separate virtual device, one or more of the virtual devices being responsible for forward and backward propagation computations of all layers of the phase in concert.
In a pipeline system, synchronous pipeline scheduling easily increases the idle time of the GPUs, especially when the computing load of each stage is uneven, which leaves computing resources idle and prolongs training time. In asynchronous pipeline scheduling, the computation of the stages does not need to be completed synchronously and each stage can proceed independently at its own pace, which effectively increases GPU utilization; however, because different stages are not synchronized, the forward and backward propagation of a stage may use different versions of the parameters, making the gradient computation incorrect and affecting model convergence, so a suitable synchronization mechanism is required to avoid this problem. Common synchronization strategies for solving convergence problems in PP include gradient accumulation, delayed parameter updates, parameter servers or AllReduce, and weight version control.
To maximize GPU utilization, the present method uses asynchronous PP to extend the one-forward-one-backward (1F1B) scheduling strategy to heterogeneous devices. As shown in fig. 3, each virtual device alternates forward and backward propagation between different inputs, and by asynchronously passing forward activations and backward gradients, communication and computation can be overlapped, making the pipeline more efficient. In order to ensure that consistent parameter versions are used in different stages, to ensure that the model can converge effectively, and to reduce the GPU memory occupied by multiple weight versions, the method uses a double-buffering mechanism oriented to virtual devices.
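For intuition, the sketch below enumerates the per-stage operation order of a plain 1F1B schedule (warm-up forwards, steady-state alternation, cool-down backwards); it models only the ordering of forward (F) and backward (B) steps, not the heterogeneous timing, the asynchronous communication, or the weight versioning discussed here.

```python
def one_f_one_b_schedule(num_stages: int, num_microbatches: int, stage: int):
    """Return the ('F', i) / ('B', i) operation order for one pipeline stage
    under a 1F1B schedule."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops, next_fwd, next_bwd = [], 0, 0
    for _ in range(warmup):                     # warm-up: forward passes only
        ops.append(("F", next_fwd))
        next_fwd += 1
    while next_fwd < num_microbatches:          # steady state: alternate 1F1B
        ops.append(("F", next_fwd))
        next_fwd += 1
        ops.append(("B", next_bwd))
        next_bwd += 1
    while next_bwd < num_microbatches:          # cool-down: drain backward passes
        ops.append(("B", next_bwd))
        next_bwd += 1
    return ops

# Stage 0 of a 4-stage pipeline with 8 small batches:
# 3 warm-up forwards (F0-F2), then F3/B0, F4/B1, ..., then the remaining backwards.
print(one_f_one_b_schedule(num_stages=4, num_microbatches=8, stage=0))
```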
Unlike the conventional approach, we first divide the training data set into a plurality of large batches, each of which contains several training samples. In performing training of PP, the large batch is further subdivided into a plurality of small batches for alternating forward and backward propagation between the various virtual devices. Each virtual device takes a small batch as a basic unit, and all small batches in the same large batch use the weight of the same version and perform gradient accumulation and updating on the fine granularity level.
Gradient accumulation and updating at a fine-grained level means that the physical devices in each virtual device compute and accumulate gradients when processing each small batch, and then update the weights at the appropriate time.
In particular, the double-buffering mechanism requires each virtual device to maintain two weight versions, a new one and an old one. After each small batch completes backward propagation, the weights are updated and a new version is generated. Since some small batches whose forward propagation is still in progress were started with the old weight version, these small batches also need to accumulate gradients using the old weight version during backward propagation in order to guarantee convergence. Only after all small batches in the same large batch have completed their backward-propagation updates are the old-version weights discarded, and the new weight version is then used to process newly entered small batches. For example, as shown in fig. 3, the forward-propagation process of small batch 8 in virtual device 1 does not use the weight version produced by the weight update of small batch 4 in virtual device 1, thus ensuring that small batch 8 and small batches 5, 6, and 7 in the same large batch are updated using the same weight version. After the virtual devices have processed the backward propagation of small batch 8, the devices within all virtual devices discard the old version of the weights and use the new weights to process newly entered small batches.
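A minimal sketch of the double-buffer rule with a scalar weight and a toy SGD step standing in for the real optimizer; the only behavior it encodes is the rule described above (small batches of an in-flight large batch keep reading the old version, and the old version is discarded only after the last of them finishes backward propagation).

```python
class DoubleBufferedWeights:
    """Per-virtual-device old/new weight versions, as in the double-buffer mechanism."""

    def __init__(self, init_weights: float, small_batches_per_large_batch: int):
        self.old = init_weights      # version the current large batch started with
        self.new = init_weights      # version updated after every backward pass
        self.per_large = small_batches_per_large_batch
        self.done_backward = 0

    def weights_for_forward(self) -> float:
        # All small batches of the in-flight large batch read the old version.
        return self.old

    def apply_backward(self, grad: float, lr: float = 0.1) -> None:
        # Each small batch updates the new version after its backward pass.
        self.new -= lr * grad
        self.done_backward += 1
        if self.done_backward == self.per_large:
            # Only now may the old version be discarded.
            self.old = self.new
            self.done_backward = 0

buf = DoubleBufferedWeights(init_weights=1.0, small_batches_per_large_batch=4)
for grad in [0.1, 0.2, 0.1, 0.2]:
    _ = buf.weights_for_forward()    # stays 1.0 inside this large batch
    buf.apply_backward(grad)
print(buf.weights_for_forward())     # ≈ 0.94: the new version becomes visible only now
```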
In further embodiments, it is considered that the communication between virtual devices may vary across different pipeline stages. To reduce the communication overhead, we customize the way in which virtual devices communicate. When activation values or gradients are transmitted between two adjacent virtual devices whose internal device arrangement is the same, point-to-point Send/Receive communication is usually adopted; when activation values or gradients are transmitted between virtual devices of different types, the communication method is customized.
As shown in FIG. 4, the customized communication mode of pipeline parallelism between heterogeneous virtual devices when data parallelism is used inside a virtual device (such as virtual device 2) works as follows: when DP is used in virtual device 2, in forward propagation GPU A splits the small batch into smaller micro-batches and sends them to the corresponding GPUs for execution, which reduces the memory footprint of a single device; in the backward-propagation stage, gradient aggregation is performed with AllReduce, efficiently completing gradient synchronization inside the virtual device. As shown in FIG. 5, the customized communication mode of pipeline parallelism between heterogeneous virtual devices when tensor parallelism is used inside a virtual device (such as virtual device 2) works as follows: when TP is used in virtual device 2, forward propagation distributes the tensor through Scatter so that each device processes only part of the tensor, which significantly reduces the computational burden on a single device and enables more efficient parallel computation; backward propagation then collects the tensors through AllGather and aggregates them onto each device for global gradient computation and parameter updating.
It should be noted that, because different parallel strategies (DP (Data Parallelism), TP (Tensor Parallelism), PP, DP+TP, DP+PP) may be used inside a given virtual device (such as virtual device 2), pipeline parallelism between virtual devices gives rise to various communication modes; only the two typical cases of DP and TP are selected and illustrated here. When a PP strategy is used inside a virtual device, the communication between virtual devices is point-to-point Send/Receive communication, whose mechanism is simple and is not repeated here.
The customized communication strategy optimizes the communication overhead, significantly improves memory utilization and computing efficiency, fully exploits the scalability and flexibility of multi-GPU or multi-node hardware, and effectively solves the problems of gradient synchronization and resource utilization in large-scale training; it is particularly suitable for the distributed training of very large models such as Transformers.
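The sketch below expresses the two inter-virtual-device patterns with torch.distributed primitives. It assumes a default process group has already been initialized (e.g., via torch.distributed.init_process_group) and that receiving ranks know the tensor shapes in advance; the rank lists, groups, and shapes are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.distributed as dist

def send_activation(act: torch.Tensor, dst: int) -> None:
    """Point-to-point Send between two adjacent virtual devices of the same kind."""
    dist.send(act, dst=dst)

def recv_activation(shape, src: int) -> torch.Tensor:
    """Matching point-to-point Receive on the downstream virtual device."""
    act = torch.empty(shape)
    dist.recv(act, src=src)
    return act

def dp_forward_split_send(small_batch: torch.Tensor, dp_ranks) -> None:
    """Upstream side of the customized DP pattern (FIG. 4): split the small batch
    into micro-batches and send one to each GPU of the DP virtual device."""
    for micro, dst in zip(small_batch.chunk(len(dp_ranks)), dp_ranks):
        dist.send(micro.contiguous(), dst=dst)

def dp_backward_allreduce(local_grad: torch.Tensor, dp_group) -> torch.Tensor:
    """Backward side of the customized DP pattern: average gradients across the
    DP virtual device with AllReduce."""
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM, group=dp_group)
    local_grad /= dist.get_world_size(group=dp_group)
    return local_grad
```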
(2) Pipeline parallelism and data parallelism (namely PP+DP) between virtual devices are combined, and specifically:
DP is widely used to accelerate DNN execution. For DP between virtual devices, DP can be nested on top of PP when the available heterogeneous device resources meet the device requirements. As shown in fig. 6, we replicate the different stages of the model to obtain copies, distribute the copies to identical virtual devices in the same machine, and let the virtual devices process different small batches of data respectively, so that all the virtual devices participate in the computation at the same time. In addition, efficient DP synchronization between virtual devices can be performed through the PCIe connections of the motherboard within the machine; the orange arrows represent the synchronization process, which uses communication primitives (e.g., AllReduce) to aggregate gradients across devices and ensure the consistency and accuracy of the model. DP between the virtual devices further improves the utilization rate of computing resources, reduces memory pressure, improves throughput, and is beneficial to the efficient training of ultra-large-scale deep learning models.
In one or more embodiments, pipeline parallelism and data parallelism among the virtual devices can be used in a nested manner, namely, a processing mode of pp+dp, which can significantly further expand the scale of the trainable model and support larger-scale model training by using more heterogeneous GPU device clusters.
(3) The data parallelism in the virtual equipment is specifically as follows:
The combination of heterogeneous GPU devices is achieved through DP, as shown in part (a) of fig. 1. In this heterogeneous environment, virtual device 1 is composed of GPU A and is responsible for executing pipeline stage S1, while virtual device 2 is composed of four GPU Bs and executes stage S2 in DP mode. The method adjusts dynamically according to actual device performance and does not restrict the GPU type or the DP ratio.
Specifically, GPU A in virtual device 1 processes one small batch of data (we assume 1 small batch = 4 micro-batches), and then distributes the micro-batches to the 4 GPU Bs in virtual device 2, with each GPU B processing one micro-batch of data.
By implementing DP, the method compensates for GPU B's disadvantage in memory and computing power relative to GPU A. More workload is effectively distributed to the GPUs with stronger performance while the weaker GPUs take on fewer tasks, thereby realizing load balancing among heterogeneous GPUs and optimizing overall performance.
(4) Tensor parallelism in the virtual device is specifically:
The TP-based device matching method is illustrated in part (b) of fig. 1. Virtual device 1 executes the first stage, in which GPU A processes one small batch of data to fully utilize its computing resources. The four GPU Bs in virtual device 2 are responsible for tensor-slice processing: each GPU receives one tensor slice (e.g., A1, A2, A3, A4) and performs a matrix multiplication (e.g., X·A1, X·A2, etc.) with the input data X, achieving fine-grained parallelism. The results are then combined by AllReduce or AllGather operations to form the complete output tensor. In this way, hardware resources are utilized to the maximum extent, the parallel system gains greater flexibility, and the training and inference of large-scale deep learning models become more efficient.
AllReduce and AllGather are two collective communication operations commonly used in distributed computing for data interaction between multiple nodes, processors, or GPUs.
In more embodiments, when the computing power of the two GPU device types differs greatly, TP and DP can be dynamically combined; as shown in part (a) of fig. 2, combining different TP and DP dimensions further balances the performance and memory differences of the heterogeneous GPUs. In addition, the combination of TP and DP maintains high computational efficiency, and each device only needs to store part of the model parameters and part of the data, which avoids the memory bottleneck faced by low-memory GPU devices.
(5) Pipelined parallelism within virtual devices
In addition to DP and TP, the PP mechanism may also be used inside a virtual device. As shown in part (c) of fig. 1, the GPU with higher performance computes more stage tasks, while the GPU with lower performance advances in a pipelined manner by decomposing its computation into a plurality of subtasks, thereby reducing GPU memory pressure. This heterogeneous PP strategy achieves better load balancing among heterogeneous devices, reduces performance bottlenecks caused by overloading a single GPU, and ensures that each device operates efficiently within its performance limits.
In further embodiments, PP within a virtual device may be used in combination with DP. As shown in part (b) of fig. 2, when the DP degree is 2, the input data of each GPU in virtual device 2 during pipeline execution is reduced from 4 micro-batches to 2 micro-batches. Although the PP method inside a virtual device introduces some pipeline-bubble overhead, it is still an effective strategy on platforms with good communication (such as a single machine with multiple GPUs).
In one or more embodiments, this embodiment provides a hardware-aware policy search method, through which a hybrid parallel strategy can be searched to accelerate model training on heterogeneous GPUs. The core idea of the method is to search, through a depth-first search (DFS) strategy, over different batch sizes and ratios of the heterogeneous GPUs under the different parallel strategies (i.e., DP, PP, TP, DP+TP, and DP+PP inside a virtual device), and to dynamically adjust the ratio of heterogeneous GPUs for load balancing under the condition that the heterogeneous devices do not run out of memory, so as to optimize the comprehensive utilization rate of the heterogeneous GPUs' computing resources and find the optimal hybrid parallel strategy. The depth-first search strategy adopts the following idea:
A parallel strategy is selected in turn; under that strategy, simulation training is performed with different batch sizes bs, and the ratio of heterogeneous devices is dynamically adjusted to maximize the comprehensive utilization rate of the heterogeneous GPU devices. After the algorithm search is completed, a concrete usable training configuration is obtained, comprising the parallel strategy, the batch size used during training, and the ratio of heterogeneous devices. Similar to the idea of DFS, the best heterogeneous device ratio is searched under different parallel strategies and different batch sizes in a depth-first manner, so as to maximize the comprehensive utilization rate of the heterogeneous GPU devices. Specifically:
First, the algorithm initializes the optimal plan to empty and initializes the numbers of heterogeneous GPU devices, i.e., the number of class A GPUs and the number of class B GPUs. Then, for the current strategy and batch size, the utilization of each type of GPU in the currently configured heterogeneous cluster (class A and class B) is calculated with the utilization function, and the comprehensive utilization of the heterogeneous devices is then computed from these values;
In the utilization calculation, for the two types of heterogeneous devices, only the sub-models (i.e., network layers) corresponding to the devices need to be cut out of the deep learning large model to be trained, or the number of layers and parameters in the model is reduced for simulation training; the performance of the devices during actual model training is simulated through a simplified model structure without actually executing the full training of the deep learning large model, which effectively improves processing efficiency.
Further, the batch size refers to the amount of training data fed to a GPU at one time, i.e., the number of data samples the GPU processes in one step. At the initial stage the batch size is set to 1, and while searching under different strategies the batch size bs is gradually increased until an OOM error occurs on a device.
During the search, the method dynamically increases the number of devices assigned to the more heavily loaded device type in order to balance the load of the heterogeneous GPU devices, and compares the result with the current optimal plan. If the comprehensive utilization rate of the current strategy is higher than that of the existing optimal strategy, the optimal strategy is updated. If the comprehensive utilization rate decreases after adjustment, or the available number of devices cannot meet the adjusted number, or the GPU memory limit is exceeded, the search for the current strategy is skipped and the next parallel strategy is searched. Finally, by gradually adjusting the configuration and evaluating the comprehensive GPU utilization of each combination, a hybrid parallel strategy that fits the heterogeneous GPU resources and maximizes the utilization of the heterogeneous devices is found.
The utilization function works as follows:
It first checks whether the current configuration exceeds the GPU memory limit; if so, it returns 0, otherwise it calls the CalcUtil function, which evaluates the device performance utilization under the specified configuration and is used for the subsequent calculation of the comprehensive utilization of the heterogeneous devices.
Specifically, when model training is performed on a GPU, if the memory required by the current configuration exceeds the GPU's available memory, a RuntimeError: CUDA out of memory exception is thrown. The utilization function captures this exception, and if the check shows that the current configuration produces an OOM record, it returns 0 to indicate that the configuration is infeasible. Meanwhile, during simulation training the time to complete one iteration is recorded, and the CalcUtil function estimates the achieved FLOPS (floating-point operations per second) from the scale of the current computing task (such as the input data size and the number of parameters) and the computation time, then divides it by the GPU's theoretical FLOPS to obtain the device performance utilization.
In general, the method eventually finds an optimal execution plan that balances between GPU resources and task demands by incrementally adjusting the configuration and evaluating the resource utilization of each combination.
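A condensed sketch of this search loop, assuming a caller-supplied simulate(strategy, bs, n_a, n_b) probe that returns the two per-type utilizations of the reduced-model run, or None on OOM; the candidate strategy list, the device-count-weighted formula for the comprehensive utilization, and the enumeration order are assumptions that stand in for the patent's simulation and adjustment details.

```python
from itertools import count

STRATEGIES = ["DP", "TP", "PP", "DP+TP", "DP+PP"]   # candidate intra-virtual-device strategies

def search_plan(max_a: int, max_b: int, simulate):
    """Depth-first style search over (strategy, batch size, #class-A GPUs, #class-B GPUs).

    `simulate(strategy, bs, n_a, n_b)` returns (util_a, util_b) for a feasible
    configuration, or None if the configuration runs out of GPU memory (OOM)."""
    best_plan, best_util = None, 0.0
    for strategy in STRATEGIES:                      # level 1: parallel strategy
        for bs in count(1):                          # level 2: grow the batch size
            any_feasible = False
            for n_a in range(1, max_a + 1):          # level 3: adjust the ratio of
                for n_b in range(1, max_b + 1):      #          heterogeneous devices
                    result = simulate(strategy, bs, n_a, n_b)
                    if result is None:               # OOM: configuration infeasible
                        continue
                    any_feasible = True
                    util_a, util_b = result
                    # Comprehensive utilization: device-count-weighted average (assumed form).
                    util = (n_a * util_a + n_b * util_b) / (n_a + n_b)
                    if util > best_util:
                        best_plan, best_util = (strategy, bs, n_a, n_b), util
            if not any_feasible:                     # every ratio OOMs at this batch size:
                break                                # move on to the next strategy
    return best_plan, best_util
```

In practice, simulate would run the reduced-model profiling described above, timing one iteration and dividing the achieved FLOPS by each GPU type's theoretical FLOPS.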
Further, to verify the effectiveness of the scheme of this embodiment, two types of GPUs, NVIDIA A100 (40 GB × 12) and NVIDIA Tesla T4 (16 GB × 36), are used to simulate heterogeneous GPU clusters and environments. Each node is configured with 64 GB of memory and 2 homogeneous GPUs and runs the Ubuntu 20.04 LTS operating system. The GPUs within a node are connected through PCIe 3.0 x16, and the nodes are connected through an InfiniBand (IB) network with an inter-node bandwidth of about 5 Gbps, ensuring efficient, low-latency data transmission. To evaluate the performance of the method in various heterogeneous GPU environments, this embodiment selects a GPT-3 1.3B architecture model for performance testing and compares it with existing methods to verify the effectiveness of the method.
The scheme described in this embodiment is compared with the popular parallel training frameworks GPipe and HetPipe. GPT-3 models with different numbers of layers (4, 6, and 8) are used for performance testing, with the horizontal axis representing the global batch size and the vertical axis representing throughput. As shown in fig. 7 (a) to 7 (c), the performance of the scheme of this embodiment is significantly better than GPipe and HetPipe at any global batch size, and its advantage becomes more pronounced as the global batch size increases. The difference stems from the fact that the scheme of this embodiment adopts a more flexible and efficient heterogeneous parallel strategy, which optimizes load distribution and memory usage among different GPUs and thus significantly improves throughput. In addition, the double-buffering strategy enables the method to maintain high performance and avoid memory overflow at larger batch sizes, improving overall throughput. In contrast, GPipe is not optimized for heterogeneous devices, which leads to OOM on low-memory devices, while HetPipe is constrained by its centralized parameter server, which limits the throughput improvement and reduces system scalability. The evaluation results show that the scheme of this embodiment improves training speed by 180% and 40%, respectively, compared with the existing methods.
In one or more embodiments, there is provided a deep learning large model training system for heterogeneous devices corresponding to the above method, comprising:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, wherein the forward-propagation and backward-propagation computation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model all the small batches in the same large batch use the weight version of the same training stage; among the virtual devices a pipeline parallel processing mode or a processing mode combining pipeline parallelism and data parallelism is adopted, and inside the virtual devices a data parallel processing mode, a tensor parallel processing mode, or a pipeline parallel processing mode is adopted.
It can be understood that the system in this embodiment corresponds to the method in the foregoing embodiment, and its technical details are described in the first embodiment, so that details are not repeated here.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of embodiment one. For brevity, the description is omitted here.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in the first embodiment may be directly implemented as a hardware processor executing or implemented by a combination of hardware and software modules in the processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

5. The deep learning large model training method for heterogeneous devices according to claim 4, wherein the customized communication mode is specifically as follows: when data parallelism is adopted inside the virtual device, in forward propagation the small batch is divided into micro-batches meeting a preset scale requirement and sent to the corresponding GPUs for execution, and gradient aggregation is performed with the AllReduce operation in backward propagation; when tensor parallelism is adopted inside the virtual device, tensors are distributed through the Scatter operation in forward propagation and collected through the AllGather operation in backward propagation; and when a pipeline strategy is adopted inside the virtual device, point-to-point communication is adopted between the virtual devices.
CN202510131779.9A · 2025-02-06 · 2025-02-06 · Deep learning large model training method and system for heterogeneous equipment · Active · CN119557113B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510131779.9A (CN119557113B, en) | 2025-02-06 | 2025-02-06 | Deep learning large model training method and system for heterogeneous equipment


Publications (2)

Publication Number | Publication Date
CN119557113A (en) | 2025-03-04
CN119557113B (en) | 2025-06-06

Family

ID=94746513

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510131779.9A (Active; CN119557113B, en) | Deep learning large model training method and system for heterogeneous equipment | 2025-02-06 | 2025-02-06

Country Status (1)

Country | Link
CN (1) | CN119557113B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120429090B (en)* | 2025-07-03 | 2025-09-02 | 齐鲁工业大学(山东省科学院) | A pipeline parallel training method suitable for heterogeneous devices
CN120409596B (en)* | 2025-07-03 | 2025-09-05 | 齐鲁工业大学(山东省科学院) | Deep learning model training method suitable for heterogeneous environments of network and equipment
CN120508395B (en)* | 2025-07-17 | 2025-09-16 | 北京达佳互联信息技术有限公司 | Model training method, device and storage medium based on heterogeneous GPU cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20220006360A (en)* | 2020-07-08 | 2022-01-17 | 울산과학기술원 | Machine learning training method based on parametric synchronization model and training system thereof
CN116185604A (en)* | 2022-12-13 | 2023-05-30 | 山东省计算中心(国家超级计算济南中心) | Pipeline parallel training method and system for deep learning model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US11436019B2 (en)* | 2019-07-15 | 2022-09-06 | Microsoft Technology Licensing, LLC | Data parallelism in distributed training of artificial intelligence models
CN116405392B (en)* | 2023-04-21 | 2025-09-19 | 上海交通大学 | Distributed training communication optimization method and system for bandwidth limited environment
CN116883229A (en)* | 2023-07-20 | 2023-10-13 | 东南大学 | Pipeline parallel method to accelerate neural network training in heterogeneous GPU clusters
CN117808079A (en)* | 2023-12-15 | 2024-04-02 | 河南师范大学 | Neural network training method and device based on few learning parameters
CN118535340A (en)* | 2024-06-04 | 2024-08-23 | 中科视语(北京)科技有限公司 | Deep learning model optimization method for large-scale distributed training
CN119005269B (en)* | 2024-06-24 | 2025-09-19 | 国网江苏省电力有限公司扬州供电分公司 | Assembly line training optimization method based on DAG structure deep learning model task placement


Also Published As

Publication number | Publication date
CN119557113A (en) | 2025-03-04


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
