Disclosure of Invention
The invention provides a deep learning large model training method and a deep learning large model training system for heterogeneous devices, which solve the problem that conventional schemes cannot effectively utilize heterogeneous GPU clusters when training deep learning large models.
According to a first aspect of an embodiment of the present invention, there is provided a deep learning large model training method for heterogeneous devices, including:
Dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Taking each small batch as the input of the virtual equipment to carry out training of a deep learning large model, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
Further, the training samples in the training data set are divided into a plurality of large batches meeting the first scale requirement, and each large batch is divided into a plurality of small batches meeting the second scale requirement, specifically, the training data set is divided into a plurality of large batches, wherein each large batch comprises a plurality of training samples; in performing training of pipeline parallelism among virtual devices, each large batch is divided into a plurality of small batches to realize forward propagation and backward propagation alternately among the virtual devices, wherein each virtual device takes one small batch as a basic unit of input.
Further, the dividing of the different network layers of the deep learning large model to be trained into a plurality of stages is specifically as follows: the different network layers of the deep learning large model to be trained are divided into a plurality of stages, each stage is mapped to a separate virtual device, and the one or more homogeneous GPU devices in the virtual device are cooperatively responsible for the forward propagation and backward propagation calculation of all layers of that stage.
Further, when a pipeline parallel processing mode is adopted among the virtual devices, a preset double buffer mechanism is adopted. Specifically, each virtual device maintains two weight versions, one new and one old; after each small batch in the virtual device completes back propagation, the weights are updated and a new weight version is generated; all small batches in the same large batch use the weight version of the same training stage, and the old weight version is discarded if and only if all small batches in the same large batch have completed back propagation.
Further, the communication mode between the virtual devices is specifically as follows: when two adjacent virtual devices of the same type transmit an activation value or gradient, point-to-point communication is adopted; when virtual devices of different types transmit an activation value or gradient, a customized communication mode is adopted.
The customized communication mode is specifically as follows: when data parallelism is adopted inside the virtual device, in forward propagation the small batches are divided into micro-batches meeting a preset scale requirement and sent to the corresponding GPUs for execution, and gradient aggregation is carried out through the AllReduce operation in the backward propagation stage; when tensor parallelism is adopted inside the virtual device, tensor distribution is carried out through the Scatter operation in forward propagation and tensor collection is carried out through the ALLGATHER operation in backward propagation; when a pipeline strategy is adopted inside the virtual device, point-to-point communication is adopted between the virtual devices.
Further, in the deep learning large model training method, the selection of the parallel strategy inside the virtual devices is specifically as follows: for each parallel strategy, simulation training is carried out with different batch sizes, and the optimal parallel strategy, the batch size during training, and the proportion of heterogeneous devices are determined by dynamically adjusting the proportion of heterogeneous devices so as to maximize the comprehensive utilization rate of the heterogeneous GPU devices.
According to a second aspect of the embodiment of the present invention, there is provided a deep learning large model training system for heterogeneous devices, including:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, where the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage; in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
According to a third aspect of embodiments of the present invention, there is provided an electronic device including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the deep learning large model training method for heterogeneous devices as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium storing computer instructions that, when executed by a processor, perform a deep learning large model training method for heterogeneous devices as described above.
Compared with the prior art, the invention has the beneficial effects that:
The scheme of the invention provides a deep learning large model training method and system for heterogeneous devices based on the proposed concept of virtual devices. In the scheme, the different network layers of the deep learning large model to be trained are divided into a plurality of stages, and the forward propagation and backward propagation computation of all network layers in each stage is executed by an independent virtual device; meanwhile, GPU resources of different configurations can be utilized in coordination, so that efficient model training is realized by combining the proposed hybrid parallel training strategies (including the pipeline parallel processing mode or combined pipeline-and-data parallel processing mode adopted among the virtual devices, and the data parallel processing mode, tensor parallel processing mode and the like adopted inside the virtual devices);
the scheme provides an automatic parallel-strategy selection mechanism: the optimal parallel strategy, the batch size during training and the proportion of heterogeneous devices are determined by maximizing the comprehensive utilization rate of the heterogeneous GPU devices, so that the optimal parallel strategy can be obtained automatically for model training.
The scheme provides a customized communication mode among the virtual devices, so that the communication efficiency of hybrid parallelism among the virtual devices of different types is effectively improved, and the training efficiency of the model is further improved.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Term interpretation:
Scatter is a distributed operation that distributes data from a data source (typically on a root node) to multiple receiving nodes, the root node dividing its data into multiple portions, and then sending the portions to the different receiving nodes, respectively;
ALLGATHER is an operation of collecting the data on each node to all nodes: each node sends its own data to the other nodes, and finally every node holds the data of all nodes;
AllReduce is an operation of performing some sort of reduction operation (e.g., summing, averaging, maximizing, etc.) on the data at each node and broadcasting the result to all nodes.
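For concreteness, the following minimal Python sketch illustrates the three collective operations using torch.distributed; the backend, tensor shapes, and launch method (e.g., torchrun) are illustrative assumptions and are not part of the claimed scheme.

```python
# Minimal sketch of the three collectives using torch.distributed.
# Assumes the script is launched with torchrun so that a process group
# can be initialized; shapes and values are illustrative only.
import torch
import torch.distributed as dist

def demo_collectives():
    dist.init_process_group(backend="gloo")          # or "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    # Scatter: the root (rank 0) splits its data and sends one chunk to each rank.
    recv = torch.zeros(2)
    chunks = [torch.full((2,), float(i)) for i in range(world)] if rank == 0 else None
    dist.scatter(recv, scatter_list=chunks, src=0)

    # AllGather: every rank contributes its chunk and ends up with all chunks.
    gathered = [torch.zeros(2) for _ in range(world)]
    dist.all_gather(gathered, recv)

    # AllReduce: element-wise reduction (here a sum) whose result lands on every rank.
    dist.all_reduce(recv, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()
```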
The scheme of the embodiment is mainly used for realizing efficient hybrid parallel training in heterogeneous GPU equipment environments so as to reduce the cost and resource threshold of large-scale basic model training, and specifically solves the following problems:
The computing load is unbalanced, namely, GPUs with different performances (such as H100, A100, T4 and the like) can cause the high-performance GPUs to wait for the computation of the low-performance GPUs during training, so that time is wasted, and the overall training efficiency is reduced.
Memory bottleneck: as basic model parameters and input data batches increase, the memory requirements of devices increase dramatically, and low-memory devices may face OOM (Out of Memory) problems.
The communication efficiency between heterogeneous GPUs is low, the computing power and memory difference of different types of GPUs make the cooperation between devices more complex, and the traditional parallel training method may not fully optimize the communication between heterogeneous devices.
Optimizing the training strategy and the load scheduling, namely reducing the negative influence of the device performance difference and the memory difference on training through reasonable scheduling and load balancing strategy, and maximizing the calculation efficiency of various devices.
Some previous work used one-forward-one-backward (1F1B) scheduling to order the processing of micro-batches: a micro-batch enters the back-propagation phase immediately after its forward propagation is completed, overlapping the computation and communication of different inputs in a pipelined fashion and thereby reducing point-to-point communication between workers. However, this approach requires storing multiple copies of activation information. Based on this, some researchers have proposed a double-buffer weight updating mechanism that improves throughput and memory efficiency by limiting the number of weight copies that must be maintained for gradient computation, but it is limited to homogeneous devices; the scheme of this embodiment further extends this advantage to heterogeneous GPU environments.
Because the amount of data used for basic model training is huge and ever-increasing, it is often necessary to increase the batch size of the input data in order to improve the generalization ability of the model and reduce the training time. As the basic model parameters and input data batches increase, the memory requirements also increase dramatically, and low-memory devices often face memory overflow problems. Therefore, the scheme of this embodiment reasonably allocates batches of appropriate sizes according to the performance of the heterogeneous devices and the GPU memory capacity, so as to avoid OOM (Out of Memory) errors on low-end (budget) graphics cards.
According to the scheme, a mixed parallel strategy is provided for heterogeneous GPU equipment, so that training of a basic model can be conducted by coordinating various heterogeneous GPU resources, computing capacities of different equipment are fully utilized, and training cost and threshold of a base model are effectively reduced. The main innovation of the scheme of the embodiment is that:
(1) A novel hybrid parallel training method is provided, which is specially designed for heterogeneous GPU equipment.
(2) And customizing the communication modes among the virtual devices of different device types so as to improve the hybrid parallel communication efficiency among the devices of different types.
(3) A hardware-aware policy search algorithm is presented by which to search for hybrid parallel policies to speed training on heterogeneous GPUs.
(4) The effectiveness of the scheme described in this embodiment is demonstrated by training a large base model in heterogeneous GPU clusters.
In order to solve the above problems, the solution of the present embodiment provides a deep learning large model training method for heterogeneous devices, which aims to automatically select an optimal parallel strategy by adopting a strategy search algorithm based on hardware perception, and coordinate the utilization of different GPU resources so as to realize efficient basic model training.
In order to simplify the complexity of heterogeneous devices, the solution described in this embodiment introduces a core concept, namely the virtual device, which is defined as follows: a virtual device is a logical representation of GPU resources and corresponds to one or more homogeneous GPU devices. By abstracting physically heterogeneous devices into logically homogeneous virtual devices, the complexity of hybrid heterogeneous device training is reduced. The reason for constructing a virtual device from homogeneous devices is that devices of the same type have consistent computing capacity and memory capacity, which avoids the waiting problem caused by communication synchronization among the devices inside a virtual device.
Specifically, as shown in fig. 8, the embodiment provides a deep learning large model training method for heterogeneous equipment, which includes the following processing procedures:
Step 1, dividing different network layers of a deep learning large model to be trained into a plurality of stages, wherein the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
Step 2, dividing training samples in a training data set into a plurality of large batches meeting the first scale requirement, and dividing each large batch into a plurality of small batches meeting the second scale requirement;
Step 3, training a deep learning large model by taking each small batch as the input of the virtual equipment, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage;
In the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
In a specific implementation, the step 1 specifically includes the following processing procedures:
The different network layers of the deep-learning large model to be trained are divided into a plurality of stages, each stage is mapped to a separate virtual device, and one or more devices in the virtual devices are cooperatively responsible for forward propagation and backward propagation computation of all layers of the stage.
In a specific implementation, the step 2 specifically includes the following processing procedures:
The training data set is first divided into a plurality of large batches, each large batch containing a number of training samples. When performing pipeline-parallel (PP) training, each large batch is further subdivided into a plurality of small batches, so that forward propagation and backward propagation alternate between the virtual devices, each virtual device taking one small batch as its basic unit of input.
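As an illustration of the two-level batching described above, the following Python sketch splits a sample list into large batches and then into small batches; the function name and batch sizes are illustrative only.

```python
# Minimal sketch of the two-level batching: data set -> large batches -> small batches.
from typing import List, Sequence

def split_batches(samples: Sequence, large_batch_size: int,
                  small_batch_size: int) -> List[List[list]]:
    large_batches = [list(samples[i:i + large_batch_size])
                     for i in range(0, len(samples), large_batch_size)]
    return [[lb[j:j + small_batch_size]
             for j in range(0, len(lb), small_batch_size)]
            for lb in large_batches]

# Example: 32 samples -> 2 large batches of 16 -> 4 small batches of 4 each.
nested = split_batches(list(range(32)), large_batch_size=16, small_batch_size=4)
assert len(nested) == 2 and len(nested[0]) == 4 and len(nested[0][0]) == 4
```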
In a specific implementation, in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices, wherein:
(1) Pipeline parallelism (namely Pipeline Parallelism, PP for short) among virtual devices is specifically as follows:
The PP between virtual devices divides the different layers of the large base model into multiple phases, each phase being mapped to a separate virtual device, one or more of the virtual devices being responsible for forward and backward propagation computations of all layers of the phase in concert.
In a pipeline system, synchronous pipeline scheduling easily increases the idle time of GPUs; in particular, when the computing loads of the stages are uneven, computing resources sit idle and training time is prolonged. In asynchronous pipeline scheduling, the computation of each stage does not need to be completed synchronously, and each stage can be executed independently according to its own progress, which effectively increases GPU utilization. However, because the different stages are not synchronized, the forward propagation and backward propagation of a stage may use different versions of the parameters, making the gradient computation incorrect and affecting the convergence of the model; a suitable synchronization mechanism is therefore required to avoid this problem. Common synchronization strategies include gradient accumulation, delayed parameter update, parameter servers or AllReduce, and weight version control to solve the convergence problem in PP.
To maximize GPU utilization, the present method uses asynchronous PP to extend the 1F1B (one-forward-one-backward) scheduling strategy to heterogeneous devices. As shown in fig. 3, each virtual device alternates forward and backward propagation between different inputs, and by asynchronously passing forward activations and backward gradients, communication and computation can be overlapped, improving pipeline efficiency. In order to ensure that consistent parameter versions are used in the different stages, so that the model can converge effectively, while reducing the GPU memory occupied by multiple weight versions, the method uses a double buffering mechanism oriented to virtual devices.
Unlike the conventional approach, we first divide the training data set into a plurality of large batches, each of which contains several training samples. In performing training of PP, the large batch is further subdivided into a plurality of small batches for alternating forward and backward propagation between the various virtual devices. Each virtual device takes a small batch as a basic unit, and all small batches in the same large batch use the weight of the same version and perform gradient accumulation and updating on the fine granularity level.
The fine granularity level performs gradient accumulation and updating, which means that the physical devices in each virtual device calculate and accumulate gradients when processing each small batch and then perform weight updating when appropriate.
In particular, the double buffering mechanism requires each virtual device to maintain two different weight versions, one new and one old. After each small batch completes its back propagation, the weights are updated and a new version is generated. Since some small batches whose forward propagation is in progress were started with the old weight version, these small batches must also use the old weight version for gradient accumulation during back propagation in order to guarantee convergence. Only after all small batches in the same large batch have completed their back-propagation updates is the old version of the weights discarded and the new weight version used to process newly entered small batches. For example, as shown in fig. 3, the forward propagation of small batch 8 in virtual device 1 does not use the weight version produced by the update of small batch 4 in virtual device 1, which ensures that small batch 8 and small batches 5, 6 and 7 in the same large batch are updated with the same weight version. After the virtual devices have processed the back propagation of small batch 8, the devices within all virtual devices discard the old version of the weights and enable the new weights to process newly entered small batches.
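The following minimal Python sketch illustrates one possible realization of the double-buffer weight versioning described above; the class name, the scalar weights and the update rule are illustrative stand-ins rather than the actual implementation.

```python
# Sketch of double-buffer weight versioning for one pipeline stage (illustrative).
class DoubleBufferStage:
    def __init__(self, weights, lr=0.1):
        self.current = dict(weights)   # version used by every small batch of the
        self.pending = dict(weights)   # active large batch; pending collects updates
        self.lr = lr

    def weights_for_forward(self):
        # Every small batch of the same large batch sees this same version.
        return self.current

    def apply_backward(self, grads):
        # Each completed back-propagation refines the new (pending) version.
        for k, g in grads.items():
            self.pending[k] -= self.lr * g

    def end_of_large_batch(self):
        # Only now is the old version discarded and the new one promoted.
        self.current = dict(self.pending)

# Usage: four small batches of one large batch, then promote the new version.
stage = DoubleBufferStage({"w": 1.0})
for _ in range(4):
    _ = stage.weights_for_forward()
    stage.apply_backward({"w": 0.5})     # illustrative gradient
stage.end_of_large_batch()
```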
In further embodiments, it is considered that the communication between virtual devices may vary across different pipeline stages. To reduce the communication overhead, the way in which the virtual devices communicate is customized. When an activation value or gradient is transmitted between two adjacent virtual devices of the same type, point-to-point Send/Receive communication is usually adopted, because the internal device arrangement of the two virtual devices is the same; when an activation value or gradient is transmitted between virtual devices of different types, the communication method is customized.
As shown in FIG. 4, a customized communication mode for pipeline parallelism between heterogeneous virtual devices when data parallelism is used inside a virtual device (such as virtual device 2) is illustrated. Specifically, when DP is used inside virtual device 2, in forward propagation GPU A splits the small batch into smaller micro-batches and sends them to the corresponding GPUs for execution, which reduces the memory occupation of a single device; in the back-propagation stage, gradient aggregation is performed using AllReduce, which efficiently completes gradient synchronization within the virtual device. As shown in FIG. 5, a customized communication mode for pipeline parallelism between heterogeneous virtual devices when tensor parallelism is used inside a virtual device (such as virtual device 2) is illustrated. Specifically, when TP is used inside virtual device 2, forward propagation distributes the tensors through Scatter, so that each device only processes a part of the tensor, which significantly reduces the computational burden of a single device and enables more efficient parallel computation; back propagation then collects the tensors through ALLGATHER and aggregates them onto each device for global gradient computation and parameter updating.
It should be noted that, because different parallel strategies (DP (Data Parallelism), TP (Tensor Parallelism), PP, DP+TP, DP+PP) may be used inside a given virtual device (such as virtual device 2), pipeline parallelism between virtual devices gives rise to various communication modes; only the two typical cases of DP and TP are selected and illustrated here. When a PP strategy is used inside the virtual device, the communication between virtual devices is point-to-point Send/Receive communication, whose mechanism is simple and is not repeated here.
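The following hedged Python sketch, based on torch.distributed, illustrates how the communication call issued between adjacent virtual devices could be selected according to the parallel strategy used inside the virtual device; the function names, rank lists, and group handles are assumptions, and the sketch is not a complete pipeline runtime.

```python
# Sketch of strategy-dependent communication between adjacent virtual devices.
import torch
import torch.distributed as dist

def send_activation(act: torch.Tensor, strategy: str, peer_ranks):
    if strategy == "PP":                       # same layout on both sides: point-to-point
        dist.send(act, dst=peer_ranks[0])
    elif strategy == "DP":                     # split the small batch into micro-batches
        for chunk, dst in zip(act.chunk(len(peer_ranks), dim=0), peer_ranks):
            dist.send(chunk.contiguous(), dst=dst)
    elif strategy == "TP":                     # distribute tensor slices (Scatter-style)
        for shard, dst in zip(act.chunk(len(peer_ranks), dim=-1), peer_ranks):
            dist.send(shard.contiguous(), dst=dst)

def sync_backward(grad: torch.Tensor, strategy: str, group=None):
    if strategy == "DP":                       # gradient aggregation via AllReduce
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    elif strategy == "TP":                     # collect shards via AllGather
        shards = [torch.empty_like(grad) for _ in range(dist.get_world_size(group))]
        dist.all_gather(shards, grad, group=group)
        return torch.cat(shards, dim=-1)
    return grad
```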
This customized communication strategy optimizes the communication overhead, remarkably improves memory utilization and computational efficiency, fully exploits the scalability and flexibility of multi-GPU or multi-node hardware, and effectively solves the problems of gradient synchronization and resource utilization in large-scale training; it is particularly suitable for the distributed training of oversized models such as Transformers.
(2) Pipeline parallelism and data parallelism (namely PP+DP) between virtual devices are combined, and specifically:
DP is widely used to accelerate DNN execution. For DP between virtual devices, DP can be nested on top of PP when the available heterogeneous device resources meet the device requirements. As shown in fig. 6, replicas are obtained by replicating the different stages of the model and are distributed to identical virtual devices in the same machine, and the virtual devices each process different small batches of data, so that all virtual devices participate in the computation at the same time. In addition, efficient DP synchronization between virtual devices can be performed through the PCIe connections of the motherboard within the machine; the orange arrows represent the synchronization process, which uses communication primitives (e.g., AllReduce) to aggregate gradients across devices and ensure the consistency and accuracy of the model. DP between virtual devices further improves the utilization of computing resources, reduces memory pressure, improves throughput, and benefits the efficient training of ultra-large-scale deep learning models.
In one or more embodiments, pipeline parallelism and data parallelism among the virtual devices can be used in a nested manner, namely a PP+DP processing mode, which can further expand the scale of the trainable model significantly and support larger-scale model training by using more heterogeneous GPU device clusters, as sketched below.
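The following sketch, assuming torch.distributed process groups are already initialized, illustrates how replicas of the same pipeline stage could form a process group and synchronize gradients with AllReduce; the rank layout and function names are illustrative.

```python
# Sketch of gradient synchronization across replicas of the same pipeline stage.
import torch
import torch.distributed as dist

def make_stage_replica_groups(ranks_per_stage):
    # e.g. ranks_per_stage = [[0, 4], [1, 5], [2, 6], [3, 7]] for 4 stages x 2 replicas
    return [dist.new_group(ranks=r) for r in ranks_per_stage]

def sync_stage_gradients(model: torch.nn.Module, group):
    world = dist.get_world_size(group)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
            p.grad.div_(world)   # average across the stage's replicas
```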
(3) The data parallelism in the virtual equipment is specifically as follows:
The joint use of heterogeneous GPU devices is achieved by DP, as shown in part (a) of fig. 1. In this heterogeneous environment, virtual device 1 is composed of GPU A and is responsible for executing pipeline stage S1, while virtual device 2 is composed of four GPU B devices and executes stage S2 in DP mode. The method adjusts dynamically according to the actual device performance and does not restrict the GPU types or their DP proportions.
Specifically, GPU A in virtual device 1 processes four micro-batches of data simultaneously (we assume 1 small batch = 4 micro-batches), and then each micro-batch is distributed to one of the 4 GPU B devices in virtual device 2, with each GPU B processing one micro-batch of data.
By implementing DP, the method alleviates the disadvantage of GPU B relative to GPU A in terms of memory and computing power: more of the workload is effectively allocated to the GPU with stronger performance, while the weaker GPUs take on fewer tasks, thereby achieving load balancing among heterogeneous GPUs and optimizing overall performance.
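As a small illustration, the following Python sketch splits one small batch handled by GPU A into micro-batches for the DP GPUs of the next virtual device; the shapes and the 4-way split are examples taken from the description, not fixed parameters of the scheme.

```python
# Sketch of the micro-batch split used when DP runs inside a virtual device.
import torch

def split_small_batch(small_batch: torch.Tensor, num_dp_gpus: int):
    # dim 0 is the sample dimension; each chunk becomes one micro-batch.
    return list(small_batch.chunk(num_dp_gpus, dim=0))

small_batch = torch.randn(16, 1024)              # 16 samples handled by GPU A
micro_batches = split_small_batch(small_batch, num_dp_gpus=4)
assert len(micro_batches) == 4 and micro_batches[0].shape[0] == 4
```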
(4) Tensor parallelism in the virtual device is specifically:
The TP-based device matching method is illustrated in part (b) of fig. 1. Virtual device 1 represents the first execution phase, in which GPU A processes 1 small batch of data so as to fully utilize its computing resources. The four GPU B devices in virtual device 2 are responsible for tensor-slice processing: each GPU receives one tensor slice (e.g., A1, A2, A3, A4) and performs a matrix multiplication (e.g., X·A1, X·A2, etc.) with the input data X, realizing fine-grained parallelism. The computation results are then combined by AllReduce or ALLGATHER operations to form the complete tensor matrix. In this way, hardware resources are utilized to the maximum extent, the parallel system has greater flexibility, and the training and inference of large-scale deep learning models become more efficient.
Of these, allReduce and ALLGATHER are two collective communication operations commonly used in distributed computing, commonly used for data interactions between multiple nodes, multiple processors, or multiple GPUs.
In more embodiments, when the computing power of the two types of GPU devices differs greatly, TP and DP can be dynamically combined; as shown in part (a) of fig. 2, combining different degrees of TP and DP further balances the performance and memory differences of the heterogeneous GPUs. In addition, the combination of TP and DP maintains high computational efficiency, and each device only needs to store part of the model parameters and part of the data, which avoids the memory bottleneck faced by low-memory GPU devices.
(5) Pipelined parallelism within virtual devices
In addition to DP and TP, the PP mechanism may be used in the virtual device, as shown in part (c) in fig. 1, the GPU with higher performance calculates more stage tasks, while the GPU device with lower performance advances in a pipeline manner by decomposing the calculation into a plurality of subtasks, so as to reduce the pressure of the video memory. The strategy of the heterogeneous PP can achieve better load balancing among heterogeneous devices, reduce performance bottlenecks caused by overload of a single GPU, and ensure that each device can operate efficiently within the performance limit of the device.
In further embodiments, PP inside the virtual device may be used in combination with DP. As shown in part (b) of fig. 2, when the DP degree is 2, the input data of each GPU in virtual device 2 when executing the pipeline is reduced from 4 micro-batches to 2 micro-batches. Although PP inside the virtual device introduces some pipeline-bubble overhead, it remains an effective strategy on platforms with good communication (such as a single machine with multiple GPUs).
In one or more embodiments, a hardware-aware policy search method is provided, by which a search over hybrid parallel strategies can be carried out to accelerate model training on heterogeneous GPUs. The core idea of the method is to search, through a depth-first search (DFS) strategy, over different batch sizes and different ratios of heterogeneous GPUs under the different parallel strategies (i.e., DP, TP, PP, DP+TP, DP+PP inside a virtual device), dynamically adjusting the ratio of heterogeneous GPUs for load balancing under the condition that the heterogeneous devices do not run out of memory (OOM), so as to optimize the comprehensive utilization rate of the computing resources of the heterogeneous GPUs and find the optimal hybrid parallel strategy. The depth-first search strategy adopts the following idea:
A parallel strategy is selected in turn; under that strategy, simulation training is performed with different batch sizes bs, and the ratio of heterogeneous devices is dynamically adjusted to maximize the comprehensive utilization rate of the heterogeneous GPU devices. After the algorithm search is completed, a concrete usable training configuration is obtained, including the parallel strategy, the batch size during training, and the ratio of heterogeneous devices. Similar in spirit to DFS, the best heterogeneous device ratio is searched under different parallel strategies and different batch sizes in a depth-first manner, so as to achieve the maximum comprehensive utilization of the heterogeneous GPU devices. Specifically:
First, the algorithm initializes the optimal plan to be empty and initializes the numbers of heterogeneous GPU devices of each type (i.e., the number of class-A GPUs and the number of class-B GPUs); then, for the current strategy and batch size, the utilization evaluation function is used to calculate the utilization of the currently configured heterogeneous GPUs (that of the class-A GPUs and that of the class-B GPUs), after which the comprehensive utilization of the heterogeneous devices is calculated;
In the utilization calculation, for the two types of heterogeneous devices, only the sub-models (i.e., network layers) corresponding to the devices need to be cut out of the deep learning large model to be trained, or the number of layers and parameters in the model is reduced for simulation training; the performance of the devices during actual model training is simulated through the simplified model structure without actually executing the training of the deep learning large model, which effectively improves the processing efficiency.
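The following Python sketch illustrates one way such a reduced proxy model and its per-iteration timing could look; the layer sizes, depth, and function names are assumptions, and the timing here is CPU-side for illustration only.

```python
# Sketch of a reduced "proxy" model used for simulation-based timing.
import time
import torch
import torch.nn as nn

def build_proxy(hidden: int = 512, layers: int = 2) -> nn.Module:
    # A few layers stand in for the full stage assigned to a device.
    return nn.Sequential(*[nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
                           for _ in range(layers)])

def time_one_iteration(model: nn.Module, batch_size: int, hidden: int = 512) -> float:
    x = torch.randn(batch_size, hidden)
    model.zero_grad(set_to_none=True)
    t0 = time.perf_counter()
    model(x).sum().backward()            # one simulated forward + backward pass
    return time.perf_counter() - t0
```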
Further, the batch size refers to the amount of training data input to the GPU at one time, i.e., the number of data samples processed by the GPU. During the search, the batch size is set to 1 in the initial stage, and when searching under the different strategies, the batch size bs is gradually increased until an OOM error occurs on the device.
During the search, the method dynamically adjusts and increases the number of devices of the more heavily loaded type so as to balance the load of the heterogeneous GPU devices, and compares the result with the current optimal plan. If the comprehensive utilization rate of the current strategy is higher than that of the existing optimal strategy, the optimal strategy is updated. If the comprehensive utilization rate of the current strategy decreases after adjustment, or the number of available devices cannot meet the adjusted count, or the GPU memory limit is exceeded, the search process for the current strategy is skipped and the next parallel strategy is searched. Finally, by gradually adjusting the configuration and evaluating the comprehensive GPU utilization of each combination, a hybrid parallel strategy that is applicable to the heterogeneous GPU resources and maximizes the utilization of the heterogeneous devices is found.
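The following Python sketch shows one possible realization of the search loop described above; the strategy list, the batch-size doubling schedule, the device-count-weighted aggregation of utilization, and the signature of evaluate_util are all assumptions made for illustration, not the patent's exact algorithm.

```python
# Sketch of a depth-first style search over strategy, batch size and device ratio.
def search_plan(evaluate_util, max_a: int, max_b: int, max_bs: int = 4096,
                strategies=("DP", "TP", "PP", "DP+TP", "DP+PP")):
    best_util, best_plan = 0.0, None
    for strategy in strategies:
        bs = 1
        while bs <= max_bs:
            n_a, n_b = 1, 1
            feasible = False
            while n_a <= max_a and n_b <= max_b:
                u_a = evaluate_util(strategy, bs, n_a, n_b, "A")
                u_b = evaluate_util(strategy, bs, n_a, n_b, "B")
                if u_a == 0.0 or u_b == 0.0:        # OOM or otherwise infeasible
                    break
                feasible = True
                # assumed aggregation: device-count-weighted average utilization
                util = (u_a * n_a + u_b * n_b) / (n_a + n_b)
                if util > best_util:
                    best_util, best_plan = util, (strategy, bs, n_a, n_b)
                # give one more device to the more heavily loaded type
                if u_a >= u_b:
                    n_a += 1
                else:
                    n_b += 1
            if not feasible:
                break                                # larger batch sizes will also fail
            bs *= 2                                  # grow the batch size until OOM
    return best_plan
```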
The utilization evaluation function operates as follows: it first checks whether the current configuration exceeds the GPU memory limit; if so, it returns 0, otherwise it calls the CalcUtil function, which evaluates the device performance utilization under the specified configuration and is used in the subsequent calculation of the comprehensive utilization of the heterogeneous devices.
Specifically, when model training is performed on the GPU, if the memory required by the current configuration exceeds the available GPU memory, a "RuntimeError: CUDA out of memory" exception is thrown. The utilization evaluation function captures this exception; if an OOM record is generated when checking the current configuration, 0 is returned to indicate that the configuration is infeasible. Meanwhile, during the simulation training the time taken to complete one iteration is recorded, and the CalcUtil function estimates the achieved FLOPS (floating-point operations per second) from the scale of the current computation task (such as the input data size and the number of parameters) and the recorded computation time, then divides it by the theoretical FLOPS of the GPU to obtain the device performance utilization.
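The following Python sketch illustrates an OOM-guarded utilization estimate in the spirit of this description; the function name calc_util, the callable run_one_iteration, and the FLOPS inputs are illustrative assumptions rather than the actual identifiers of the scheme.

```python
# Sketch: return estimated utilization in [0, 1], or 0.0 if the configuration OOMs.
import time
import torch

def calc_util(run_one_iteration, flops_per_iteration: float,
              peak_flops: float) -> float:
    try:
        t0 = time.perf_counter()
        run_one_iteration()                      # one simulated training iteration
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
    except RuntimeError as e:                    # e.g. "CUDA out of memory"
        if "out of memory" in str(e).lower():
            return 0.0
        raise
    achieved_flops = flops_per_iteration / elapsed
    return min(achieved_flops / peak_flops, 1.0)
```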
In general, the method eventually finds an optimal execution plan that balances between GPU resources and task demands by incrementally adjusting the configuration and evaluating the resource utilization of each combination.
Further, to verify the effectiveness of the scheme of the present embodiment, heterogeneous GPU clusters and environments are simulated using two types of GPUs, NVIDIA A100 (40 GB × 12) and NVIDIA Tesla T4 (16 GB × 36). Each node is configured with 64 GB of memory and 2 homogeneous GPUs, and runs the Ubuntu 20.04 LTS operating system. The GPUs within a node are connected through PCIe 3.0 x16, and the nodes are connected through an InfiniBand (IB) network with an inter-node bandwidth of about 5 Gbps, ensuring high efficiency and low latency of data transmission. To evaluate the performance of the method in various heterogeneous GPU environments, this embodiment selects a GPT-3-1.3B architecture model for performance testing and compares it with existing methods to verify the effectiveness of the method.
The scheme described in this embodiment compares performance against the popular parallel training frameworks Gpipe and HetPipe. GPT-3 models with different numbers of layers (layer number = 4/6/8) are used for performance testing, with the horizontal axis representing the global batch size and the vertical axis representing throughput. As shown in fig. 7(a) to 7(c), the performance of the scheme of this embodiment is significantly better than Gpipe and HetPipe at every global batch size, and the advantage becomes more prominent as the global batch size increases. The difference stems from the fact that the scheme of this embodiment adopts a more flexible and efficient heterogeneous parallel strategy, which optimizes load distribution and memory usage among different GPUs and significantly improves throughput. In addition, the double buffering strategy allows the method to maintain high performance and avoid memory overflow at larger batch sizes, improving overall throughput. In contrast, Gpipe is not optimized for heterogeneous devices, which leads to OOM on low-memory devices, while HetPipe is constrained by a centralized parameter server, which limits its throughput improvement and reduces system scalability. The evaluation results show that the scheme of this embodiment improves the training speed by 180% and 40%, respectively, compared with these prior methods.
In one or more embodiments, there is provided a deep learning large model training system for heterogeneous devices corresponding to the above method, comprising:
The virtual device construction unit is used for dividing different network layers of the deep learning large model to be trained into a plurality of stages, where the forward propagation and backward propagation calculation of all network layers in each stage is executed by an independent virtual device, and the virtual device is composed of one or more homogeneous GPU devices in a heterogeneous device cluster;
The training data set dividing unit is used for dividing training samples in the training data set into a plurality of large batches meeting the first scale requirement and dividing each large batch into a plurality of small batches meeting the second scale requirement;
The training unit is used for training the deep learning large model by taking each small batch as the input of the virtual devices, wherein in the training of the deep learning large model, all small batches in the same large batch use the weight version of the same training stage; in the training of the deep learning large model, a pipeline parallel processing mode or a processing mode combining pipeline parallel processing and data parallel processing is adopted among the virtual devices, and a data parallel processing mode, a tensor parallel processing mode or a pipeline parallel processing mode is adopted inside the virtual devices.
It can be understood that the system in this embodiment corresponds to the method in the foregoing embodiment, and its technical details are described in the first embodiment, so that details are not repeated here.
In further embodiments, there is also provided:
an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of embodiment one. For brevity, the description is omitted here.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The steps of the method in the first embodiment may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.