
Multi-job distributed training system and method

Info

Publication number
CN119201416A
Authority
CN
China
Prior art keywords
training
data
network
gradient
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410033375.1A
Other languages
Chinese (zh)
Inventor
赵伯罕
徐葳
李强
龙利民
胡勇超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tuling Artificial Intelligence Institute Nanjing Co ltd
Tsinghua University
Original Assignee
Tuling Artificial Intelligence Institute Nanjing Co ltd
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tuling Artificial Intelligence Institute Nanjing Co ltd and Tsinghua University
Priority to CN202410033375.1A
Publication of CN119201416A
Legal status: Pending

Abstract


The present application discloses a multi-job distributed training system and method, the multi-job distributed training system comprising: a storage service module, used to manage stored training data and distribute training data to each training task according to the ID of each training task and its iteration number; a naming control module, used to assign an ID to each training task to locate the training data of each training task, wherein each training task is completed by a computing node; a host agent module, used to obtain the distributed training data corresponding to a training task from the storage service module to enable the computing node to perform gradient calculation, and send the gradient data obtained by the computing node to the network; and receive aggregated data from the network and send it to each computing node; a switch module, used to perform aggregate calculation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and send the aggregated data to each computing node in the network through the network.

Description

Multi-job distributed training system and method
Technical Field
The present application relates to the field of computer technology, and in particular, to a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product.
Background
Training a deep neural network is a huge task involving large amounts of data and computation. Single-machine training can no longer meet the training requirements, and multi-machine distributed training (Distributed Training, DT) has become a research hotspot.
Currently, the multiple GPUs performing training tasks in a distributed training network each train the deep neural network locally, and each GPU sends its computed gradient to a parameter server for gradient aggregation. The parameter server then sends the aggregated gradient back to each GPU, which performs the next iteration of training based on the aggregated gradient it receives. Because of the large data traffic transmitted over the network during distributed training, communication overhead becomes a major bottleneck; for example, data communication can take more than half of the training time, which limits the efficiency and scalability of distributed training jobs.
Therefore, in the network of distributed training, how to reduce the communication overhead and thus improve the efficiency and the scalability of the distributed training is a technical problem to be solved.
Disclosure of Invention
In view of the above-mentioned drawbacks of the related art, an object of the present application is to provide a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product, so as to solve the technical problem of how to reduce communication overhead in a distributed training network, thereby improving the efficiency and scalability of distributed training.
To achieve the above and other related objects, a first aspect of the present application provides a multi-job distributed training system applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes for participating in the distributed training. The multi-job distributed training system includes: a storage service module for managing stored training data and distributing training data to each training task according to the ID of each training task and the number of iterations thereof; a naming control module for assigning an ID to each training task to locate the training data of each training task, where each training task is completed by one computing node; a host agent module for acquiring the distributed training data corresponding to a training task from the storage service module to cause the computing node to perform gradient computation, sending the gradient data obtained by the computing node to the network, and receiving aggregated data from the network and sending it to each computing node; and a switch module for performing aggregate computation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and sending the aggregated data to each computing node in the network through the network.
A second aspect of the present application provides a multi-job distributed training method applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The multi-job distributed training method comprises: assigning an ID to each training task to locate the training data of each training task, wherein each training task is completed by one computing node, the training data is managed by a storage service module, and the storage service module distributes the training data to each training task according to the ID of each training task and its iteration number; acquiring the training data of a training task to enable the computing node to perform gradient computation to obtain gradient data and send the gradient data to the network; performing aggregation computation on the gradient data corresponding to each training task acquired from the network to obtain aggregated data and broadcasting the aggregated data to the network through the network; and the computing nodes receiving the aggregated data from the network to execute the next iteration of training.
The third aspect of the application provides a storage server applied to a network for carrying out distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training, the storage server comprises a memory for storing training data for carrying out distributed training on the deep neural network, a processor for managing the stored training data and distributing the training data for each training task according to the ID of each training task and the iteration number thereof, wherein the ID of each training task is used for positioning the training data of each training task, each training task is completed by one computing node, the computing nodes carry out gradient computation according to the distributed training data of the corresponding training task to obtain gradient data and send the gradient data to the network and acquire aggregate data from the network, and the aggregate data is obtained through aggregation computation by a programmable switch.
The fourth aspect of the application provides a computer device applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The computer device comprises a memory and a processor, wherein the processor is used for acquiring distributed training data corresponding to a training task to enable the computing node to perform gradient calculation, sending the gradient data obtained by the computing node to the network, and receiving aggregated data from the network and sending the aggregated data to each computing node; each training task is completed by one computing node, the training data corresponding to a training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the ID of each training task is used for locating the training data of each training task, and the aggregated data is obtained by performing aggregation calculation through a programmable switch.
The fifth aspect of the application provides a computer device applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The computer device comprises a memory and a processor, wherein the processor is used for allocating an ID to each training task to locate the training data of each training task; each training task is completed by one computing node, the training data of the corresponding training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the computing nodes perform gradient calculation according to the distributed training data of the corresponding training task to obtain gradient data and send the gradient data to the network, and the aggregated data received by the computing nodes is obtained through aggregation calculation by a programmable switch.
The sixth aspect of the application provides a programmable switch applied to a network for performing distributed training on a deep neural network, wherein the network comprises a plurality of computing nodes for participating in the distributed training. The programmable switch comprises a network port and an aggregation calculation module, wherein the aggregation calculation module is used for performing aggregation calculation according to the gradient data corresponding to each training task acquired from the network port to obtain the aggregated data, and sending the aggregated data to each computing node in the network through the network port; each training task is completed by one computing node, the training data corresponding to a training task is distributed to each training task by a storage server according to the ID of each training task and its iteration number, the ID of each training task is used for locating the training data of each training task, and the gradient data is obtained by each computing node performing gradient calculation according to the distributed training data corresponding to a training task.
A seventh aspect of the present application provides a computer device, comprising a storage device for storing at least one program, and a processing device, connected to the storage device, for implementing the multi-job distributed training method as described in any one of the embodiments disclosed in the second aspect of the present application when the at least one program is called from the storage device and executed.
An eighth aspect of the present application provides a computer readable storage medium storing at least one program which when invoked and executed by a processor of a computer implements a multi-job distributed training method as described in any of the embodiments disclosed in the second aspect of the present application.
A ninth aspect of the application provides a computer program product which, when run on a computer, causes the computer to perform a multi-job distributed training method as described in any of the embodiments disclosed in the second aspect of the application.
In summary, the present application provides a multi-job distributed training system and method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product,
The switch module is utilized to aggregate gradient data, the aggregated data is sent to each computing node in the network, and the storage service module is utilized to manage training data, so that the communication overhead of the parameter server can be reduced, the utilization rate of the network can be improved, and the efficiency and the expandability of distributed training can be further improved.
Furthermore, the shortest waiting time during aggregation is realized by simultaneously sending the same gradient blocks of different computing nodes, so that the computing efficiency of distributed training can be improved, and the expenditure of an aggregator can be reduced. Moreover, the application uses a computing node as a standby parameter server which is only started when the data packet is lost, thereby reducing the traffic in the network in the normal aggregation process. And the computing task on the computing node can be transferred to other computing nodes when the computing node fails during training, so that the subsequent distributed training is not influenced.
Drawings
The specific features of the application are set forth in the appended claims. The features and advantages of the application will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings. The brief description of the drawings is as follows:
FIG. 1 is a diagram illustrating a topology of a network for distributed training of a deep neural network in one embodiment of the present application.
Fig. 2 is a schematic diagram of data transmission in a network when a computing node fails according to an embodiment of the present application.
Fig. 3a and 3b show schematic diagrams of the gradient blocks obtained by blocking the gradient in different embodiments of the present application, respectively.
FIG. 4 shows a comparison of time for one iterative calculation of different models for different distributed training systems.
FIG. 5 shows a graph comparing the utilization of the GPU and network for different distributed training systems.
Fig. 6 shows a graph comparing training times for different distributed training systems at different iterations in the event of a failure introduced at iteration 70.
FIG. 7 is a flow chart of a multi-job distributed training method according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Further advantages and effects of the present application will become apparent to those skilled in the art from the disclosure of the present application, which is described by the following specific examples.
As described in the background, when training a deep neural network, data-parallel distributed training is often run on common GPU servers as an economical, efficient, and scalable way to support deep learning. Because DT jobs usually involve many sub-second DT tasks performing multiple iterations on the GPU, this work becomes complicated: each task loads input data and exchanges computed gradients, and if this is not planned carefully, communication time may block GPU utilization, thus limiting the efficiency and scalability of DT jobs.
While the clustered network architecture proposed by industry is an explorable optimization direction, for example, the NVIDIA A100 architecture suggests that each server use eight 200Gbps network interface cards for GPU communication and add an additional one for training data access, this architecture incurs high network costs. With recent advances in programmable switches such as Tofino, in-network computing (INC) has also been used to perform gradient aggregation. These systems, commonly referred to as in-network aggregation (In-Network Aggregation, INA), significantly reduce latency and network bandwidth usage by eliminating the software overhead of host-based parameter servers.
While existing in-network aggregation (INA) systems achieve good results, they typically use programmable switches as computational accelerators for only a single job. When concurrent training jobs run in a shared cluster, performance may degrade, and loading training data over the same network links may slow down tasks. Conventional INA systems require all training data to be partitioned and replicated before a job begins; such manual and static data management is detrimental to fast failover.
Therefore, in a distributed training network, how to reduce the communication overhead and thus improve the efficiency and scalability of distributed training is a technical problem to be solved. In the prior art, a programmable switch can be used to aggregate gradients in advance on the network path over which the gradients are transmitted, reducing the number of gradients to be aggregated by the parameter server, relieving congestion at the parameter server, and reducing its communication overhead. However, at present the switch is simply used to perform gradient aggregation, while management and coordination of training data are absent in the distributed training network; this lack of management and coordination of training data reduces network utilization and therefore also reduces the efficiency of distributed training.
In view of this, in some embodiments of the present application, a multi-job distributed training system, a multi-job distributed training method, a storage server, a computer device, a programmable switch, a computer readable storage medium, and a computer program product are disclosed, where the switch module is used to aggregate gradient data and send the aggregated aggregate data to each computing node in a network, and the storage service module is used to manage training data, so that communication overhead of a parameter server can be reduced, and network utilization can be improved, thereby improving efficiency and scalability of distributed training.
In the present application, the multi-job distributed training system is applied to a network for performing distributed training on a deep neural network. The deep neural network includes, but is not limited to, a convolutional neural network for image recognition, a generative adversarial network for generating data such as images, and a long short-term memory network for speech recognition. Training is typically done in a distributed manner in order to shorten the training time of the deep neural network. Distributed training refers to dividing the task of an iterative training process into a plurality of training tasks that are executed simultaneously by a plurality of computing nodes. In some embodiments of the application, a computing node is a computer device comprising one or more GPUs, and the plurality of computing nodes establish communication with servers or switches in the network through the network.
Referring to fig. 1, a topology diagram of a network for distributed training of a deep neural network in an embodiment of the present application is shown, wherein the network includes a multi-job distributed training system 1 and a plurality of computing nodes 2 for participating in the distributed training. The multi-job distributed training system is communicatively connected with the plurality of computing nodes so as to distribute the training tasks and training data of each iteration to each computing node and to acquire the gradient data calculated by each computing node. Each computing node performs gradient calculations using the training task and training data it acquires. In an embodiment, one computing node 2 includes at least one GPU for performing computations; for example, one computing node 2 includes 1, 2, 4, 8, 16, or 32 GPUs.
In one embodiment, referring to fig. 1, as shown, the multi-job distributed system 1 includes a storage service module 10, a naming control module 11, a host agent module 12, and a switch module 13.
The naming control module 11 is configured to assign an ID to each training task to locate the training data of each training task. The storage service module 10 is configured to manage the stored training data and distribute training data to each training task according to the ID of each training task and its iteration number. The host agent module 12 is configured to obtain the distributed training data corresponding to a training task from the storage service module so that the computing node performs gradient computation and generates gradient data, which it sends to the host agent module 12; the host agent module 12 then sends the obtained gradient data to the switch module 13 in the network. The switch module 13 is configured to perform aggregate computation according to the gradient data corresponding to each training task obtained from the network to obtain the aggregated data, and to send the aggregated data to each computing node in the network through the network. For example, the switch module 13 sends the aggregated data to the host agent module 12 in the network, and the host agent module 12 communicates with each computing node 2 to send the aggregated data to each computing node so that each computing node can start the next iteration.
The storage service module 10 is a software tool or software module that can process data (e.g., training data) by means of the hardware of a computer device or of the storage server described later and the operating environment provided by an operating system. In one example, the storage service module 10 may run on a processor (CPU) coupled to a memory (e.g., a memory including at least one SSD). Because the training data is read-only, in an example with 40 computing nodes and 8 GPUs in each computing node, the application only needs 12 SSDs (the transmission rate of each SSD is not less than 4GB/s, i.e., a bandwidth of not less than 32 Gbps) and 4 network interface cards of 100Gbps; this configuration not only meets the data transmission requirements but also reduces the configuration cost of the network.
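As a rough arithmetic check of the figures above (an illustrative back-of-the-envelope sketch, not part of the original disclosure), 12 such SSDs provide roughly as much read bandwidth as the four 100Gbps network interface cards can carry:

```python
# Back-of-the-envelope check of the example storage configuration above.
ssd_count, ssd_gbps = 12, 32        # each SSD: >= 4 GB/s, i.e. roughly 32 Gbps
nic_count, nic_gbps = 4, 100        # four 100 Gbps network interface cards

storage_bw = ssd_count * ssd_gbps   # 384 Gbps aggregate SSD read bandwidth
network_bw = nic_count * nic_gbps   # 400 Gbps aggregate NIC bandwidth

# 384 Gbps of read-only training data roughly matches the 400 Gbps the NICs
# can deliver, so adding more SSDs would not speed up data delivery.
print(storage_bw, network_bw)       # -> 384 400
```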
In one embodiment, the storage service module 10 includes one or more physical or logical blocks of computer instructions organized as objects, procedures, or functions. However, the executables of a module need not be physically located together, but may comprise disparate commands stored in different locations which, when joined logically together, achieve the stated purpose of the storage service module 10. In one example, the storage service module 10 includes C++ code.
The storage service module 10 is used for managing stored training data. In one embodiment, all training data for training the deep neural network is stored in the memory, and all stored training data is centrally managed by the unified storage service module 10. For example, the storage service module 10 allocates training data according to the number of iterations set by the user and the number of computing nodes in each iteration; the storage service module 10 stores each allocated portion of data in the memory and also records the storage address of each allocated portion. For example, if 10000 training data items (e.g., 10000 images) are needed for training the deep neural network, the user sets the number of iterations to 10, and the number of computing nodes in each iteration is 10, then the storage service module 10 divides the 10000 images into 100 parts and stores them in the memory respectively. It should be noted that, when the number of GPUs in a computing node is two or more, the training data acquired by the computing node may be further allocated among its GPUs.
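The partitioning described above can be sketched as follows. This is a minimal illustrative sketch in Python; the function and field names are assumptions for illustration and not part of the disclosed embodiments.

```python
# Illustrative sketch: the storage service splits the training set into
# (iteration, task) shards and records where each shard is stored.
def partition_training_data(samples, num_iterations, nodes_per_iteration):
    """Return a mapping (iteration, task_index) -> list of samples."""
    shards = {}
    shard_size = len(samples) // (num_iterations * nodes_per_iteration)
    idx = 0
    for it in range(num_iterations):
        for task in range(nodes_per_iteration):
            shards[(it, task)] = samples[idx:idx + shard_size]
            idx += shard_size
    return shards

# Example from the text: 10000 images, 10 iterations, 10 computing nodes
# per iteration -> 100 shards of 100 images each.
shards = partition_training_data(list(range(10000)), 10, 10)
assert len(shards) == 100 and len(shards[(0, 0)]) == 100
```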
The storage service module 10 is configured to distribute training data for each training task according to the ID of each training task and the iteration number thereof.
In one embodiment, the naming control module 11 first divides the task required for training the deep neural network into a plurality of training tasks according to the number of computing nodes, and each computing node completes one training task; in other words, the computation performed by each computing node is one training task. The naming control module 11 then assigns an ID to each training task to locate the training data of each training task. For this purpose, the naming control module 11 is further configured to determine the correspondence between the ID of each training task, its iteration number, and the training data, that is, to determine the mapping between the iteration number, the ID of each training task, and the training data, for example, by determining the storage address of the training data corresponding to each iteration number and each ID. It should be noted that training data corresponding to training tasks with different IDs in the same iteration are different, and training data corresponding to training tasks with the same ID in different iterations are also different. Then, the naming control module 11 sends the correspondence to the storage service module 10. Finally, the storage service module 10 distributes training data for a training task based on the correspondence and the request data of the training task generated by the host agent module 12. Specifically, the host agent module 12 generates request data according to the ID assigned by the naming control module 11 and sends it through the switch module 13; the request data includes the ID of the training task and its iteration number. The storage service module 10 can then locate which training data the training task with that ID requires in that iteration according to the correspondence and the ID of the training task and its iteration number included in the request data, and sends the training data through the switch module 13 to the host agent module 12 for caching.
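The ID assignment, correspondence mapping, and request/lookup flow above can be sketched as follows. This is an illustrative Python sketch; the class, method, and field names are assumptions for illustration and not the patent's actual interfaces.

```python
# Illustrative sketch of the naming-control / storage-service interaction.
class NamingControl:
    def __init__(self):
        self.next_id = 0
        self.mapping = {}                  # (task_id, iteration) -> storage address

    def assign_id(self):
        task_id, self.next_id = self.next_id, self.next_id + 1
        return task_id

    def register(self, task_id, iteration, address):
        self.mapping[(task_id, iteration)] = address

class StorageService:
    def __init__(self, mapping, storage):
        self.mapping = mapping             # correspondence sent by the naming control module
        self.storage = storage             # storage address -> training data shard

    def distribute(self, request):
        # The request carries the task ID and its iteration number (sent via the switch).
        address = self.mapping[(request["task_id"], request["iteration"])]
        return self.storage[address]

# Minimal usage example.
naming = NamingControl()
tid = naming.assign_id()
naming.register(tid, iteration=0, address="shard-0")
storage = StorageService(naming.mapping, {"shard-0": ["img_0", "img_1"]})
batch = storage.distribute({"task_id": tid, "iteration": 0})
```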
The naming control module 11 is a software tool or software module capable of processing data, and processes the data by means of a hardware device in a computer device and an operating environment provided by an operating system, for example, determining the ID of each training task and the corresponding relationship between the iteration number and resources. In one example, the naming control module 11 may run on a processor (CPU) of a computer device.
In one embodiment, the naming control module 11 includes one or more physical or logical blocks of computer instructions organized as objects, programs, or functions. However, the executables of a module need not be physically located together, but may comprise disparate commands stored in different locations which, when joined logically together, achieve the stated purpose of the naming control module 11. In an example, the naming control module 11 includes Python code.
In one embodiment, the naming control module 11 is configured to assign an ID to each training task to locate training data of each training task, wherein each training task is performed by a computing node. Specifically, the manner in which the naming control module 11 is used to assign an ID to each training task is the same as or similar to that described above, and will not be described herein. It should be noted that, when the number of GPUs in one computing node is two or more, the computing node may reassign the acquired training tasks by the number of GPUs.
The naming control module 11 is further configured to determine a correspondence between the ID of each training task and the iteration number thereof and the training data, where the step of determining the correspondence between the ID of each training task and the iteration number thereof and the training data by the naming control module 11 is the same as or similar to the foregoing, and will not be described herein.
Further, the naming control module sends the correspondence to the storage service module, so that the storage service module can determine the training data corresponding to the request data according to the correspondence and send the training data through the switch module to the host agent module, which can then cache the training data.
The host agent module 12 is a software tool or software module that processes data (e.g., gradient data) by means of hardware devices in the computer device and the operating environment provided by the operating system. In one example, the host agent module 12 may run on a processor (CPU) coupled to a memory.
In one embodiment, the host agent module 12 includes one or more physical or logical blocks of computer instructions organized as objects, programs, or functions. However, the modules' executables need not be physically located together, but may comprise disparate commands stored in disparate locations which, when joined logically together, achieve the stated purpose for the host agent module 12. In one example, the host agent module 12 includes C++ code.
The host agent module 12 is configured to obtain the distributed training data corresponding to a training task from the storage service module 10 so that the computing node 2 performs gradient computation, to send the gradient data obtained by the computing node 2 to the network, and to receive aggregated data from the network and send it to each computing node.
In one embodiment, each computing node 2 communicates directly with the host agent module 12 for data interaction. For example, the computing node 2 receives from the host agent module 12 the aggregated data sent by the switch module 13; the computing node 2 sends gradient data to the host agent module 12 so that it is sent to the switch module for aggregate calculation; and the computing node obtains the training data corresponding to a training task, which is obtained from the storage service module 10 and cached by the host agent module 12.
To buffer the training data and gradient data, the host agent module 12 includes a training data buffer for buffering training data acquired from the storage service module and a gradient data buffer for buffering gradient data acquired from the computing nodes. In one example, the training data buffer and the gradient data buffer are constructed using DPDK to minimize the software stack overhead on the path to the network interface card, thereby reducing fluctuations in software latency. For example, the host agent module continues to buffer training data into the training data buffer when the buffer is idle and during idle periods of bandwidth (e.g., when the host agent module is neither receiving nor transmitting gradient data).
In distributed training, a computing node can only start the next round of iterative training after receiving the aggregated data calculated by the switch module. The completion time of the aggregate computation is determined by the computing node that sends its gradient data last. Thus, the shortest waiting time is achieved if the gradient data of all computing nodes are sent simultaneously. Further, if one computing node sends gradient block 1 first while another computing node sends gradient block 2 first, i.e., they send in different orders, the two gradient blocks will require different aggregators on the switch and consume twice the registers; for example, gradient block 1 occupies aggregator 1 and gradient block 2 occupies aggregator 2. Each aggregator includes a plurality of registers. The gradient blocks will be described in detail later.
For this purpose, firstly, in an iterative process, the computing node blocks a gradient obtained by performing a training task according to a preset block rule to obtain gradient blocks. In other words, all the computing nodes block the gradient according to the preset block rule. The preset partitioning rules are equal-length partitioning rules or unequal-length partitioning rules. The partitioning rule with equal length refers to that the data length of each partitioned gradient block is equal, for example, when partitioning, a preset number of bytes are partitioned into one gradient block, for example, 1MB of bytes are partitioned into one gradient block. The unequal length partitioning rule means that the data length of each partitioned gradient block is not completely equal, for example, different preset numbers of bytes are partitioned into different gradient blocks during partitioning.
Because the neural network model of each computing node is consistent, the array form of the gradient obtained by each computing node executing a training task in one iteration process is consistent (for example, the gradient obtained by each computing node is a one-dimensional array comprising 256 numbers), and the data length of the gradient obtained by each computing node is consistent under the same number representation mode in the CPU. In other words, the data size of each gradient is also uniform. For example, the gradient computed by each compute node includes 100MB bytes. Therefore, the gradient blocks obtained by dividing the gradient into blocks are identical in all the calculation nodes.
Further, each gradient block is also provided with an identification to identify the position of the gradient block in the gradient. Referring to fig. 3a and 3b, there are shown schematic diagrams of the gradient blocks obtained by blocking the gradient according to the present application in different embodiments, wherein the gradient block 1 is shown in the beginning region of the gradient, the gradient block 2 is shown in the middle region of the gradient, and the gradient block 3 is shown in the end region of the gradient.
After computing the gradient blocks, the computing node generates the gradient data to be sent to the host agent module. The gradient data includes the ID of the training task and a gradient block, and further includes the identifier of the gradient block.
It should be noted that, when a computing node includes multiple GPUs, the computing node first locally aggregates the gradients obtained by its GPUs to obtain the gradient corresponding to the training task, and then divides it into gradient blocks. Furthermore, when the computing node divides the gradient, it may divide the fully computed gradient of a training task, or it may divide only the part of the gradient computed so far; after the division, the fully computed gradient blocks are sent to the host agent module. For example, if the gradient of a training task includes 100MB and the computing node has only computed the partial gradient of the first 10MB, it blocks the first 10MB and sends the blocks to the host agent module.
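A minimal sketch of the equal-length blocking and packaging described above follows. The block length and field names are illustrative assumptions, not the patent's actual data layout.

```python
import numpy as np

def block_gradient(gradient, block_len):
    """Split a flattened gradient into equal-length blocks, each tagged with
    an identifier giving its position in the gradient (illustrative only)."""
    blocks = []
    for block_id, start in enumerate(range(0, gradient.size, block_len)):
        blocks.append({"block_id": block_id,
                       "values": gradient[start:start + block_len]})
    return blocks

def make_gradient_data(task_id, blocks):
    # Gradient data sent to the host agent module: task ID + block + identifier.
    return [{"task_id": task_id, "block_id": b["block_id"], "values": b["values"]}
            for b in blocks]

# Because every node runs the same model, block_gradient produces identically
# shaped blocks on every computing node.
grad = np.arange(276, dtype=np.float32)
packets = make_gradient_data(task_id=7, blocks=block_gradient(grad, block_len=92))
```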
In order to determine that the same gradient block (e.g., gradient block 1) of every computing node is ready on the host agent module, in other words, to ensure that the same gradient block of every computing node is already in the gradient data buffer of the host agent module, the host agent module is further configured to generate a bitmap for each computing node according to the received gradient data sent by that computing node and to send the bitmaps to the switch module. The bitmap indicates whether a gradient block exists by a value of 0 or 1 at the corresponding data position; for example, a value of 1 indicates that the gradient block corresponding to that data position has been received, and a value of 0 indicates that it has not been received. In an example, please continue to refer to fig. 3a and 3b: each computing node divides the gradient into gradient block 1, gradient block 2, and gradient block 3, and the bitmap of a computing node generated by the host agent module is (1, 0, 0), which indicates that the host agent module has received gradient block 1 sent by that computing node but has not received gradient block 2 or gradient block 3.
And the switch module is further used for generating a global bitmap according to the received bitmaps of the computing nodes and sending the global bitmap to the host agent module. The global bitmap indicates whether the host agent module receives all the gradient blocks at the data location by using a value 0 or a value 1 at the data location, for example, a value 1 indicates that the host agent module receives all the gradient blocks corresponding to the data location sent by each computing node, and a value 0 indicates that the host agent module does not receive all the gradient blocks corresponding to the data location sent by each computing node.
In an example, referring to fig. 3a and 3b, each computing node divides the gradient into gradient block 1, gradient block 2, and gradient block 3; the host agent module generates the bitmaps of the three computing nodes, (1, 0, 0), (1, 0, 0), and (1, 0, 0), and sends them to the switch module; the switch module generates a global bitmap (1, 0, 0), where the value 1 indicates that the host agent module has received all of the gradient blocks 1 sent by the computing nodes, and the two values of 0 indicate that the host agent module has not received all of the gradient blocks 2 and gradient blocks 3 sent by the computing nodes.
After the switch module sends the generated global bitmap to the host agent module, the host agent module is further configured to simultaneously send the gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap to the switch module.
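The per-node bitmaps and the global bitmap can be sketched as below. This is an illustrative Python sketch under the assumption that the global bitmap is simply the bitwise AND of the per-node bitmaps, which matches the example above.

```python
# Illustrative sketch: per-node bitmaps and the global bitmap.
def node_bitmap(received_block_ids, num_blocks):
    # 1 = the host agent has cached this gradient block for the node.
    return [1 if i in received_block_ids else 0 for i in range(num_blocks)]

def global_bitmap(bitmaps):
    # 1 only if the corresponding block has arrived from every computing node.
    return [int(all(bm[i] for bm in bitmaps)) for i in range(len(bitmaps[0]))]

# Example from the text: three nodes, three blocks, only gradient block 1 is
# complete on all nodes.
bitmaps = [node_bitmap({0}, 3), node_bitmap({0}, 3), node_bitmap({0}, 3)]
assert global_bitmap(bitmaps) == [1, 0, 0]
```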
In an embodiment, the host agent module sends a gradient data as a data packet to be aggregated to the switch module, so that the switch module allocates an aggregator for the data packet to be aggregated corresponding to the same gradient data to perform aggregation calculation, for example, when the data length of the gradient data is smaller than the number of registers in the aggregator, the host agent module may send the gradient data as a data packet to be aggregated to the switch module. For example, as shown in fig. 3a and fig. 3b, the host agent module simultaneously sends 3 gradient data corresponding to the gradient blocks 1 of the cached 3 computing nodes as data packets to be aggregated to the switch module, so that the switch module distributes the same aggregator to aggregate and calculate the 3 data packets to be aggregated.
In another embodiment, when the gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap is to be sent to the switch module simultaneously, the host agent module divides the gradient block in each gradient data according to the number of registers in the aggregator; specifically, it divides the gradient block using the number of registers in the aggregator as the division length, so that the data length of each divided segment is smaller than or equal to the number of registers in one aggregator. Further, after the division, the host agent module generates a data packet to be aggregated according to the divided data, the position of the divided data, and the ID.
For example, the gradient block 1 in fig. 3a is used for generating a data packet 1 to be aggregated, a data packet 2 to be aggregated and a data packet 3 to be aggregated, wherein the position of data in the data packet 1 to be aggregated is located from the beginning position to the 92 th position of the gradient block 1, the position of data in the data packet 2 to be aggregated is located from the 93 rd position to the 184 th position of the gradient block 1, and the position of data in the data packet 3 to be aggregated is located from the 185 th position to the 276 th position of the gradient block 1. The host agent module simultaneously sends a plurality of data packets to be aggregated, which correspond to each gradient data, to the switch module so that the switch module distributes the same aggregator for the same data packets to be aggregated, which correspond to different gradient data, to perform aggregation calculation.
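The splitting of one gradient block into register-sized packets can be sketched as follows. This is an illustrative Python sketch; the packet fields are assumptions for illustration, not the patent's packet format.

```python
# Illustrative sketch: splitting one gradient block into packets that fit
# the aggregator (segment length <= number of registers per aggregator).
def split_block_into_packets(task_id, block_id, values, registers_per_aggregator):
    packets = []
    for seq, start in enumerate(range(0, len(values), registers_per_aggregator)):
        packets.append({
            "task_id": task_id,
            "block_id": block_id,
            "seq": seq,                     # position of the segment within the block
            "values": values[start:start + registers_per_aggregator],
        })
    return packets

# With 276 values and 92 registers this yields three packets covering
# positions 0-91, 92-183 and 184-275 of the first gradient block, matching
# the data packets 1, 2 and 3 in the example above.
pkts = split_block_into_packets(task_id=7, block_id=0,
                                values=list(range(276)),
                                registers_per_aggregator=92)
assert len(pkts) == 3
```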
It should be noted that, a transmission rule may be preset in the host agent module, where the transmission rule specifies a transmission sequence of the gradient blocks, and when there are multiple 1 values in the global bitmap, the transmission sequence of the gradient blocks may be coordinated according to the transmission rule. For example, the transmission rule is to transmit a gradient block near the start position of the gradient first. For another example, the transmission rule is to transmit data near the start position of the gradient block.
Thus, the shortest waiting time during aggregation is achieved by simultaneously sending the same gradient blocks of the computing nodes, which improves the computing efficiency of distributed training and reduces the overhead of the aggregators.
In order to further improve synchronicity, the host agent module is further configured to simultaneously start all the computing nodes in the network so that each computing node begins the next iteration when the last gradient data of the current iteration is received. Specifically, when all gradient blocks corresponding to each training task in one iteration of training have been received, the host agent module simultaneously starts the computing nodes to begin the next iteration. In effect, the last gradient data acts as a barrier signal: when it arrives, the host agent module immediately releases the waiting computing nodes to begin the next round of iteration.
Computing nodes may fail during training. If a fault cannot be monitored and handled in a timely and effective manner, the training efficiency of the deep neural network is greatly reduced. However, a computing node missing a single acknowledgement (ACK) during gradient data transmission does not mean that the computing node has failed; it may well be caused by factors such as network instability.
To this end, a computing node retransmits gradient data to the host agent module when it misses a single Acknowledgement (ACK). When a computing node retransmits gradient data to the host agent module multiple times within a preset time, the host agent module generates a fault report for the computing node and sends the fault report to the naming control module. Wherein the preset time is exemplified by 3000 milliseconds. Further, referring to fig. 2, a schematic diagram of data transmission in a network when a computing node fails according to an embodiment of the present application is shown. After receiving the fault report sent by the host agent module 12, the naming control module 11 determines the number of computing nodes capable of working normally in the network according to the obtained fault report and sends the number of computing nodes capable of working normally to the storage service module 10, so that the storage service module 10 redistributes training data according to the number of computing nodes capable of working normally.
In an embodiment, the storage service module 10 redistributes all training data corresponding to the iteration number when the computing node fails according to the number of computing nodes that can work normally. For example, in an iterative process, the storage service module allocates 100 pictures to 5 computing nodes, but after one computing node fails, the storage service module allocates 100 pictures to the remaining 4 computing nodes for computation.
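The failure-report and redistribution path described above can be sketched as follows. This is an illustrative Python sketch; the retry threshold, class names, and callbacks are assumptions for illustration (the text only gives 3000 milliseconds as an example window and says "multiple times").

```python
import time

# Illustrative sketch: repeated retransmissions from the same node within a
# window trigger a fault report, after which the training data for the
# current iteration is redistributed over the remaining healthy nodes.
RETRY_WINDOW_S = 3.0     # the text gives 3000 milliseconds as an example
RETRY_THRESHOLD = 3      # assumed value; the patent only says "multiple times"

class FailureMonitor:
    def __init__(self, report_failure):
        self.report_failure = report_failure   # callback into the naming control module
        self.retransmits = {}                  # node_id -> recent retransmission times

    def on_retransmit(self, node_id):
        now = time.monotonic()
        recent = [t for t in self.retransmits.get(node_id, [])
                  if now - t < RETRY_WINDOW_S]
        recent.append(now)
        self.retransmits[node_id] = recent
        if len(recent) >= RETRY_THRESHOLD:
            self.report_failure(node_id)

def handle_fault_report(node_id, healthy_nodes, storage_service):
    # Naming control removes the failed node and tells the storage service to
    # redistribute the iteration's data, e.g. 100 images over 4 instead of 5 nodes.
    healthy_nodes.discard(node_id)
    storage_service.redistribute(num_nodes=len(healthy_nodes))
```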
Therefore, the subsequent iterative computation is not influenced when the computing node fails, so that the success rate of distributed training is improved.
The switch module 13 is a software tool or software module that can process data, and is used to describe the corresponding processing and operation of the data plane of the programmable switch on the data packet, for example, the switch module describes parsing, processing, modifying, forwarding logic, etc. of the data packet. In a practical embodiment, the programmable switch is exemplified by a 64×100Gbps programmable switch.
In an embodiment, the switch module 13 includes one or more physical or logical blocks of computer instructions organized as objects, procedures, or functions. However, the executable files of a module need not be physically located together, but may include different commands stored in different locations that, when logically connected together, achieve the specified goals of the switch module 13. In one example, the switch module 13 includes P4 code. The P4 language is a programming language for describing the behavior of the data plane.
The switch module is used for carrying out aggregation calculation according to gradient data corresponding to each training task acquired from the network to obtain the aggregation data, and sending the aggregation data to each calculation node in the network through the network.
In an embodiment, when the host agent module sends the gradient data, it sends each piece of gradient data as a data packet to be aggregated, and the switch module assigns the same aggregator to the same gradient data of different computing nodes to perform the aggregation computation. The aggregator used for aggregation in the switch is composed of a plurality of registers, and the number of registers in the aggregator is the same as the size of one data packet; for example, if a packet processed by the switch includes 92 bits, the aggregator includes 92 registers. One aggregator is used to aggregate the gradient data corresponding to the same gradient block of different computing nodes; for example, one aggregator aggregates the gradient data corresponding to gradient block 1 of the different computing nodes. When the switch module has completely aggregated the gradient data sent by all computing nodes in one iteration, it generates the aggregated data and sends it to all computing nodes in the network through the network. For example, the switch module sends the aggregated data to each computing node through a host agent module so that each computing node performs the next iteration according to the aggregated data.
In another embodiment, when gradient data of each computing node corresponding to a data position with a value of 1 in the global bitmap is simultaneously sent to the switch module, the host agent module simultaneously sends a plurality of data packets to be aggregated corresponding to each gradient data to the switch module, and the switch module provides an aggregator for the plurality of data packets to be aggregated corresponding to each gradient data to perform aggregation calculation. Specifically, the switch module distributes the same aggregator for the same data packet to be aggregated corresponding to different gradient data to perform aggregation calculation. The same data packet to be aggregated refers to that data in the data packet to be aggregated is located at the same position of the gradient block. For example, the data packets 1 to be aggregated are all located from the start position to the 92 th position of different gradient blocks, and the data packets 1 to be aggregated of different gradient data are distributed to the same aggregator for calculation.
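The aggregator-assignment logic above, which the switch data plane implements in P4, can be sketched in host-language form as follows. The class and packet field names are illustrative assumptions and do not reflect the patent's actual packet format.

```python
# Host-language sketch of the in-switch aggregation logic (illustrative only).
class Aggregator:
    def __init__(self, num_registers):
        self.registers = [0.0] * num_registers   # one register per value in a packet
        self.contributors = set()                # training-task IDs already added

    def add(self, task_id, values):
        for i, v in enumerate(values):
            self.registers[i] += v
        self.contributors.add(task_id)

class SwitchAggregation:
    def __init__(self, num_registers, num_tasks):
        self.num_registers = num_registers
        self.num_tasks = num_tasks               # tasks (nodes) expected per iteration
        self.aggregators = {}                    # (block_id, seq) -> Aggregator

    def on_packet(self, pkt):
        # The same segment (block_id, seq) coming from different computing
        # nodes is always mapped onto the same aggregator.
        key = (pkt["block_id"], pkt["seq"])
        agg = self.aggregators.setdefault(key, Aggregator(self.num_registers))
        agg.add(pkt["task_id"], pkt["values"])
        if len(agg.contributors) == self.num_tasks:
            # All computing nodes have contributed: emit the aggregated data.
            return {"block_id": pkt["block_id"], "seq": pkt["seq"],
                    "values": list(agg.registers)}
        return None
```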
Currently, the size of a packet that a switch can aggregate at a time is related to the number of registers. The switch module of the present application therefore allows a data packet to be aggregated to access the registers of the switch during the egress processing stage of the pipeline as well, which increases the number of registers available for aggregation within a single pipeline pass. This further increases the payload size of the packets processed by the switch pipeline, resulting in higher goodput for the same number of packets.
Further, in order to ensure that a data packet to be aggregated received by the switch module is a packet that is allowed to access the registers of the switch, the switch module is further configured to perform an initial check on the acquired data packet to be aggregated to determine whether it is allowed to access an aggregator of the switch for aggregation calculation. Specifically, when the switch module receives a data packet to be aggregated, it performs the initial check on the packet; a packet that passes the initial check is allowed to access the registers of the switch in the pipeline for aggregation calculation, while a packet that does not pass the initial check is forwarded normally by the switch module, that is, forwarded according to its destination address.
In an embodiment, the switch module determines, during initial inspection, whether the ID corresponding to the data packet to be aggregated is a data packet registered by the naming control module, whether the data packet overflows the aggregation, and whether the data packet uses the parameter server as a destination address. If the ID corresponding to the data packet to be aggregated is registered by the naming control module, the data packet to be aggregated does not overflow the aggregation, and the data packet to be aggregated does not take the parameter server as the destination address, the data packet to be aggregated can pass the initial inspection, otherwise, the data packet to be aggregated does not pass the initial inspection.
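The three conditions of the initial check can be sketched as follows. This is an illustrative Python sketch; the function signatures, packet fields, and the `would_overflow` helper are assumptions for illustration, not the P4 implementation.

```python
# Illustrative sketch of the initial check: a data packet may only use an
# aggregator if its task ID is registered with the naming control module,
# the aggregation would not overflow, and the packet's destination is not
# the (backup) parameter server.
def initial_check(pkt, registered_ids, parameter_server_addr, would_overflow):
    if pkt["task_id"] not in registered_ids:
        return False          # ID not registered by the naming control module
    if would_overflow(pkt):
        return False          # aggregation would overflow the aggregator
    if pkt["dst_addr"] == parameter_server_addr:
        return False          # retransmission destined for the backup parameter server
    return True

def handle_packet(pkt, switch, registered_ids, parameter_server_addr, would_overflow):
    if initial_check(pkt, registered_ids, parameter_server_addr, would_overflow):
        return switch.aggregate(pkt)   # allowed to access the aggregator registers
    return switch.forward(pkt)         # otherwise forwarded by its destination address
```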
The switch module is further configured to determine whether the gradient data of all training tasks have been aggregated: if the gradient data of all training tasks in one iteration have not yet all been aggregated, the aggregated data obtained so far is discarded; if it is determined that the gradient data of all training tasks in one iteration have been aggregated, the aggregated data is recirculated after aggregation so that it can be read out. For example, the switch module determines whether the gradient data of all training tasks have been aggregated by determining whether the data packet to be aggregated is the last data packet in the iteration: if the packet is not the last data packet to be aggregated, the aggregation is not complete; otherwise, it is complete. Further, when recirculating the aggregated data, the switch module does not need to perform the initial check on it; the aggregated data only needs to pass through the pipeline once more. It should be noted that the aggregated data is recirculated once only in order to read the full width of the aggregated data; in other words, 2n bits of aggregated data may be written in a single memory access, but only n bits may be read, so the data must be recirculated once more to read all 2n bits of aggregated data.
In order to improve the computational efficiency of the switch module and reduce traffic, the switch module does not back up the aggregated data. When a computing node does not receive the aggregated data, it retransmits its gradient data, and the host agent module then retransmits the corresponding data packet to be aggregated; the switch module forwards the retransmitted data packet to the parameter server module, which sends the aggregated data back to the switch module so that the switch module can deliver it to the computing node that did not receive it. It should be noted that the destination address of the retransmitted data packet to be aggregated is the parameter server module. The parameter server module backs up the aggregated data of each iteration and is configured on any one computing node in the network. The parameter server module is a software tool or software module that can process data and describes the corresponding processing and operations performed by that computing node on the data.
The application does not have a parameter server alone, but uses a computing node as a standby parameter server, and the standby parameter server is started only when data is lost, thereby reducing the traffic in the network in the normal aggregation process.
Referring to fig. 4, a comparison diagram of the time of performing one iteration calculation on different models by different distributed training systems is shown, and the calculation time of the distributed training system of the present application in one iteration is shorter than the calculation time of the ATP system, switchML-1 system, and SwitchML-4 system in one iteration on all the deep neural network models shown in fig. 4.
Higher GPU and network utilization leads to better performance under the same workload. Referring to FIG. 5, which compares the GPU and network utilization of different distributed training systems, we ran 400 different training task combinations in three distributed training systems (the ATP system, the SwitchML-1 system, and the multi-job distributed training system of the present application, shown by the red lines) to compare GPU and network utilization. Compared with the GPU utilization of the ATP system and the SwitchML-1 system, the GPU utilization of the present application is improved by 100 percent and 50 percent, respectively. For the network, the present application achieves 3.2 times and 2.0 times the utilization of the ATP system and the SwitchML-1 system, respectively.
Referring to fig. 6, which compares the training time of different distributed training systems at different iterations when a fault is introduced at the 70th iteration, we ran 120 VGG16 iterations using three computing nodes in three distributed training systems (the ATP system, the SwitchML-1 system, and the multi-job distributed training system of the present application, shown by the red lines). At the 70th iteration, we introduced a fault by shutting down one computing node. The computing node was then restarted at the 100th iteration. As shown, after the fault is injected, all computing nodes in the ATP system and the SwitchML-1 system stop working and fail to recover. In contrast, the multi-job distributed training system of the present application automatically retries the current iteration after a recovery time of a few seconds and then continues the next iteration using the remaining two computing nodes. We observe that each iteration from the 71st to the 100th takes longer, since there are only two computing nodes instead of three. We can also see that the training time recovers immediately after the computing node is restarted at the 100th iteration.
In summary, in the multi-job distributed training system disclosed by the application, the switch module is utilized to aggregate gradient data and send the aggregated data to each computing node in the network, and the storage service module is utilized to manage training data, so that the communication overhead of the parameter server can be reduced, the utilization rate of the network can be improved, and the efficiency and the expandability of distributed training can be further improved.
In addition, the application achieves the shortest waiting time during aggregation by having different computing nodes send the same gradient blocks at the same time, which improves the computing efficiency of distributed training and reduces the overhead of the aggregator. When the host agent module receives the last gradient data of an iteration, it starts all the computing nodes in the network at the same time, so that each computing node begins the next iteration together, further improving synchronization. Moreover, the application uses one computing node as a standby parameter server that is activated only when data is lost, thereby reducing the traffic in the network during the normal aggregation process. Furthermore, when a computing node fails during training, its computing task can be transferred to other computing nodes, so subsequent distributed training is not affected.
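For illustration only, the synchronized start can be pictured with the minimal sketch below, assuming a simple UDP control channel; the message format, addresses, and function names are hypothetical and not defined by the application.

```python
import socket

START_MSG = b"START_NEXT_ITER"                                  # hypothetical control message
NODE_ADDRS = [("10.0.1.%d" % i, 9100) for i in range(1, 4)]     # assumed compute-node addresses

def on_gradient_block(received_blocks, total_blocks, sock):
    """Called by the host agent whenever a gradient block of the current
    iteration arrives; once the last block is in, every node is started at
    the same time so the next iteration begins in lockstep."""
    if received_blocks == total_blocks:
        for addr in NODE_ADDRS:
            sock.sendto(START_MSG, addr)
```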
In an embodiment, please refer to fig. 7, which is a flowchart of a multi-job distributed training method according to an embodiment of the present application; the method includes steps S10, S11, S12, and S13. The multi-job distributed training method may be performed by the devices in the network that perform distributed training on the deep neural network as described above (e.g., a storage server configured with the storage service module, a computer device configured with the naming control module, a computer device configured with the host agent module, and a programmable switch configured with the switch module).
In step S10, the computer device configured with the naming control module assigns an ID to each training task to locate the training data of that training task. The training data is managed by the storage service module, which distributes training data to each training task according to the task's ID and its number of iterations.
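Purely for illustration, the sketch below shows one way a storage service could use a task's ID and iteration count to pick the batch it serves; the in-memory layout, class name, and sharding rule are assumptions of this sketch rather than details of the storage service module.

```python
class StorageService:
    """Toy in-memory stand-in for the storage service module."""

    def __init__(self, datasets):
        # datasets: mapping of task ID -> list of training samples
        self.datasets = datasets

    def distribute(self, task_id, iteration, batch_size):
        """Return the batch for (task_id, iteration).

        The task ID locates the dataset and the iteration count selects the
        slice of it that is served, so each training task always receives the
        data meant for its current iteration.
        """
        data = self.datasets[task_id]
        start = (iteration * batch_size) % len(data)
        # Wrap around when the slice crosses the end of the dataset.
        return (data + data)[start:start + batch_size]


# Usage: the host agent of task "job-7" (hypothetical ID) requests its batch
# for the 12th iteration.
svc = StorageService({"job-7": list(range(1000))})
batch = svc.distribute("job-7", iteration=12, batch_size=32)
```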
In step S11, the computer device configured with the host agent module acquires the training data of a training task so that the computing node can perform gradient computation to obtain gradient data, and sends the gradient data to the network.
In step S12, the programmable switch configured with the switch module performs aggregation on the gradient data corresponding to each training task acquired from the network to obtain the aggregated data, and broadcasts the aggregated data to the network.
In step S13, the computing node receives aggregated data from the network to perform training for the next iteration.
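Putting steps S10-S13 together, one iteration on a single compute node could be sketched as below; every function called here (fetch_training_data, compute_gradients, send_to_switch, recv_aggregated, apply_update) is a placeholder standing in for the corresponding module, not an API defined by the present application.

```python
def run_iteration(task_id, iteration, model):
    # S10: the ID assigned by the naming control module locates the data.
    # S11: the host agent fetches that data and the node computes gradients.
    batch = fetch_training_data(task_id, iteration)   # served by the storage service
    grads = compute_gradients(model, batch)            # local gradient computation
    send_to_switch(task_id, iteration, grads)          # host agent sends to the network
    # S12 happens inside the network: the switch aggregates the gradient data
    # of every node of this task and broadcasts the result.
    agg = recv_aggregated(task_id, iteration)          # S13: receive the aggregated data
    apply_update(model, agg)                           # ready for the next iteration
    return model
```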
The working mode of each step in the multi-job distributed training method of the present application is the same as or similar to the working mode of the corresponding module in the multi-job distributed training system, and is not described again here.
In some embodiments, the present application further provides a storage server applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The storage server includes a memory and a processor. Further, the storage server also includes a network port for connecting to the network.
In some embodiments, the memory may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing training data or for storing the software program corresponding to the storage service module, which is executed by the processor. In an example, the memory is an SSD.
In some embodiments, the processor includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
The memory is used for storing training data for distributed training of the deep neural network.
The processor is used for managing the stored training data and for distributing training data to each training task according to the task's ID and its number of iterations. The ID of each training task is used for locating the training data of that task, and each training task is completed by one computing node. The computing node performs gradient calculation on the distributed training data of its training task to obtain gradient data, sends the gradient data to the network, and obtains the aggregated data from the network, where the aggregated data is obtained by the switch through aggregation calculation.
The manner in which the processor manages the training data and distributes the training data for each training task is the same as or similar to the working manner of the storage service module in the multi-job distributed training system, and will not be described in detail herein.
In some embodiments, the application also provides a computer device configured to execute the corresponding functions of the host agent module in the multi-job distributed training system. The computer device is applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The computer device includes a memory and a processor. Further, the computer device also includes a network port for connecting to the network.
In some embodiments, the memory of the computer device may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory of the computer device is used for storing the software program corresponding to the host agent module, which is executed by the processor of the computer device.
In some embodiments, the processor of the computer device includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor.
The processor of the computer device is used for acquiring the distributed training data corresponding to a training task so that the computing node can perform gradient calculation, for sending the gradient data obtained by the computing node to the network, and for receiving the aggregated data from the network and sending it to each computing node. Each training task is completed by one computing node; the training data corresponding to each training task is distributed by the storage server according to the task's ID and its number of iterations; the ID of each training task is used for locating the training data of that task; and the aggregated data is obtained by the switch through aggregation calculation.
In this embodiment, the working manner of the processor of the computer device is the same as or similar to that of the host agent module in the multi-job distributed training system, which is not described herein.
In another embodiment, the present application further provides a computer device configured to perform the corresponding function of the naming control module in the multi-job distributed training system. The computer device is applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The computer device includes a memory and a processor. Further, the computer device also includes a network port for connecting to the network.
In some embodiments, the memory of the computer device may include Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory of the computer device is used for storing the software program corresponding to the naming control module, which is executed by the processor of the computer device.
In some embodiments, the processor of the computer device includes an integrated circuit chip having signal processing capabilities, or a general-purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and which may implement or perform the functions of the methods and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor.
The processor of the computer device is used for assigning an ID to each training task to locate the training data of that training task. Each training task is completed by one computing node; the training data of the corresponding training task is distributed to each training task by the storage server according to the task's ID and its number of iterations; the computing node performs gradient calculation on the distributed training data of its training task to obtain gradient data, sends the gradient data to the network, and obtains the aggregated data from the network; and the aggregated data is obtained by the switch through aggregation calculation.
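A minimal sketch of the ID-assignment role of this processor, assuming IDs are sequential integers kept in a small registry that maps each ID to the location of its training data; the registry structure and the example path are hypothetical.

```python
import itertools

class NamingControl:
    """Assigns an ID to each training task and records where its data lives."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._registry = {}   # task ID -> data location used by the storage service

    def register_task(self, data_location):
        task_id = next(self._next_id)
        self._registry[task_id] = data_location
        return task_id

    def locate(self, task_id):
        """Return the data location so training data can be distributed."""
        return self._registry[task_id]


# Usage: a new training task is registered and its ID later locates its data.
nc = NamingControl()
tid = nc.register_task("/data/job-a/shard-0")   # hypothetical location
print(tid, nc.locate(tid))
```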
In this embodiment, the working manner of the processor of the computer device is the same as or similar to the working manner of the naming control module in the multi-job distributed training system, and will not be described herein.
In one embodiment, the application also provides a programmable switch applied to a network for performing distributed training on a deep neural network, where the network includes a plurality of computing nodes participating in the distributed training. The switch includes a network port and an aggregation module. In a practical embodiment, the programmable switch is exemplified by a 64×100 Gbps programmable switch. In an example, the programmable switch may also be a Tofino switch.
In an embodiment, the network port is used to connect the switch to the network, and the network port includes an ethernet interface, a fibre channel interface, and the like.
The processor of the programmable switch may include an integrated circuit chip having signal processing capabilities, or a general-purpose processor, such as a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, which may implement or perform the functions of the method steps and modules disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor. The processor of the programmable switch further includes registers used for the aggregation calculation. In an example, the processor of the programmable switch includes a Barefoot Tofino chip.
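To make the register-based aggregation concrete, the following software sketch mimics what the switch pipeline does per gradient slot: accumulate the values and a contributor count in registers, then release the sum once every worker of the task has contributed. The slot layout, worker count, and integer gradient values are assumptions of this sketch, not a description of the Tofino data plane.

```python
class SwitchAggregator:
    """Software model of per-slot register aggregation on the switch."""

    def __init__(self, num_workers, slot_count):
        self.num_workers = num_workers
        self.sums = [None] * slot_count    # value registers, one block per slot
        self.counts = [0] * slot_count     # contributor-count registers

    def on_packet(self, slot, values):
        """Accumulate one worker's gradient block for a slot.

        Returns the aggregated block when the last worker has contributed
        (at which point it would be broadcast to all workers), else None.
        """
        if self.counts[slot] == 0:
            self.sums[slot] = list(values)
        else:
            self.sums[slot] = [a + b for a, b in zip(self.sums[slot], values)]
        self.counts[slot] += 1
        if self.counts[slot] == self.num_workers:
            result = self.sums[slot]
            self.counts[slot] = 0          # free the slot for the next gradient block
            return result
        return None


# Usage: three workers send the same gradient block (slot 0) at the same time.
agg = SwitchAggregator(num_workers=3, slot_count=8)
agg.on_packet(0, [1, 2, 3])
agg.on_packet(0, [4, 5, 6])
print(agg.on_packet(0, [7, 8, 9]))   # -> [12, 15, 18], ready to broadcast
```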
Further, the programmable switch further includes a memory, and the switch module described above may be configured in the memory of the programmable switch.
Referring to fig. 8, which is a schematic structural diagram of a computer device according to an embodiment of the present application, the computer device 3 includes a storage device 30 and a processing device 31 connected to the storage device 30. Further, the computer device 3 includes an interface device 32.
In some embodiments, the storage device 30 is configured to store at least one program that is executable by the processing device 31, so that the processing device 31 can work with the storage device 30 to implement the multi-job distributed training method described above with respect to fig. 7, including steps S10-S13. Herein, the storage device 30 includes, but is not limited to, read-only memory, random access memory, and nonvolatile memory. For example, the storage device 30 includes a flash memory device or another non-volatile solid-state storage device. In some embodiments, the storage device 30 may also include memory remote from the one or more processing devices 31, such as network-attached storage accessed via RF circuitry or external ports and a communication network, where the communication network may be the Internet, one or more intranets, a local area network, a wide area network, a storage area network, or a suitable combination thereof. A memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In some embodiments, the processing device 31 includes one or more processors. The processing means 31 is operable to perform data read and write operations with the storage means 30. The processing means 31 comprise one or more general purpose microprocessors, one or more special purpose processors, one or more digital signal processors, one or more field programmable logic arrays, or any combination thereof.
In some embodiments, the interface device 32 includes at least one interface unit, each for outputting a visual interface, receiving a man-machine interaction event generated according to a technician's operation, and the like. For example, the interface device 32 includes, but is not limited to, a serial interface such as an HDMI interface or a USB interface, or a parallel interface, etc. In one embodiment, the interface device 32 further comprises a network communication unit, which is a device for transmitting data using a wired or wireless network, and includes, for example, but not limited to, an integrated circuit including a network card, a local area network module such as a WiFi module or a bluetooth module, a wide area network module such as a mobile network, etc.
The present application also provides a computer readable storage medium storing at least one program which when invoked and executed by a processor of a computer implements a multi-job distributed training method as described above with respect to fig. 7 including steps S10-S13.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the multi-job distributed training method described above with respect to fig. 7 including steps S10-S13.
If implemented in the form of a software functional unit and sold or used as a standalone product, the method may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, the software product comprising several instructions for enabling a device on which the storage medium is installed to perform all or part of the steps of the methods according to the embodiments of the present application.
In the embodiments provided herein, the computer storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB disk, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described by the multi-job distributed training methodology of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed in the present application may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer storage medium. Tangible, non-transitory computer storage media can be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In summary, the multi-job distributed training system, the multi-job distributed training method, the storage server, the computer device, the programmable switch, the computer-readable storage medium, and the computer program product disclosed by the application use the switch module to aggregate gradient data and send the aggregated data to each computing node in the network, and use the storage service module to manage the training data, which reduces the communication overhead of the parameter server, improves the utilization of the network, and thereby improves the efficiency and scalability of distributed training.
Furthermore, the shortest waiting time during aggregation is achieved by having different computing nodes send the same gradient blocks at the same time, which improves the computing efficiency of distributed training and reduces the overhead of the aggregator; and when the host agent module receives the last gradient data of an iteration, all the computing nodes in the network are started at the same time, so that each computing node begins the next iteration together, further improving synchronization. Moreover, the application uses one computing node as a standby parameter server that is activated only when data is lost, thereby reducing the traffic in the network during the normal aggregation process. In addition, when a computing node fails during training, its computing task can be transferred to other computing nodes, so subsequent distributed training is not affected.
The above embodiments merely illustrate the principles and effects of the present application and are not intended to limit the application. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the application. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present application.

Claims (25)

CN202410033375.1A — priority date 2024-01-09 — filing date 2024-01-09 — Multi-job distributed training system and method — status: Pending — publication: CN119201416A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410033375.1A | 2024-01-09 | 2024-01-09 | CN119201416A (en) Multi-job distributed training system and method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410033375.1A | 2024-01-09 | 2024-01-09 | CN119201416A (en) Multi-job distributed training system and method

Publications (1)

Publication Number | Publication Date
CN119201416A (en) | 2024-12-27

Family

ID=94053402

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410033375.1A | CN119201416A (en), Pending | 2024-01-09 | 2024-01-09

Country Status (1)

Country | Link
CN (1) | CN119201416A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119442237A (en)* | 2025-01-08 | 2025-02-14 | 北京简网科技有限公司 | Virus protection method based on host security, computer equipment



Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
