CN112506673A

Movatterモバイル変換

Info

Publication number: CN112506673A
Application number: CN202110153068.3A
Authority: CN
Inventors: 邹昊东; 张明明; 俞俊; 陈海洋; 夏飞; 王鹏飞; 范磊; 陶晔波; 许明杰; 王琳
Original assignee: NARI Technology Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: NARI Technology Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-03-16
Anticipated expiration: 2041-02-04
Also published as: CN112506673B

Abstract

The invention discloses a collaborative model training task configuration method facing intelligent edge computing, which is used for edge computing nodes and comprises one or more training time slots, wherein each training time slot comprises the following steps: sending a model training request to one or more mobile devices; receiving an availability status and a user data size of a current time slot reported from one or more mobile devices; determining the number of training small wheels required by the training of the mobile equipment and the interaction model participating in the training based on the previously obtained task configuration result and the current available state of each mobile equipment; and performing interactive model training with the mobile equipment participating in training until the number of the determined training small rounds is reached, constructing and solving an optimization problem aiming at minimizing the use of edge training resources according to the training effect and the user data scale reported by each mobile equipment, and obtaining a new task configuration result. Compared with other methods, the method has the advantages of much less training resource consumption and little difference in precision.

Description

Translated fromChinese

面向智能边缘计算的协同模型训练任务配置方法A collaborative model training task configuration method for intelligent edge computing

技术领域technical field

本发明涉及一种协同模型训练任务配置方法，具体涉及一种面向智能边缘计算的协同模型训练任务配置方法。The invention relates to a collaborative model training task configuration method, in particular to a collaborative model training task configuration method oriented to intelligent edge computing.

背景技术Background technique

在用户使用移动设备，如手机、平板电脑等的过程中会产生大量用户数据，包括浏览记录、打字记录以及各类日志信息等。这些数据在被分析处理后，能够帮助服务提供商进行更好的服务部署与提供。这类分析处理手段往往借助于机器学习模型。具体来说，一个机器学习模型包含模型结构和模型参数，以及该机器学习模型在某一特定数据集上所体现出的精度，如用分类模型对某一数据集进行分类，得到的正确分类比例作为模型精度。那么，服务提供商的目标，就是要利用各用户分布式产生的用户数据，对某一特定的机器学习模型进行训练，以求能够得到最好的模型精度。这样，服务提供商就可以利用这些机器学习模型对外提供更好的推断类服务。例如，在用户浏览商品的时候结合商品分类为用户进行商品推荐；在用户打字的时候，结合上下文进行热词推荐；或者在导航的时候，利用更精确的导航模型进行路线制定。When users use mobile devices, such as mobile phones and tablet computers, a large amount of user data will be generated, including browsing records, typing records, and various log information. After these data are analyzed and processed, they can help service providers to deploy and provide better services. Such analytical processing methods often rely on machine learning models. Specifically, a machine learning model includes the model structure and model parameters, as well as the accuracy of the machine learning model on a specific data set, such as the correct classification ratio obtained by classifying a data set with a classification model as model accuracy. Then, the goal of the service provider is to use the user data distributed by each user to train a specific machine learning model in order to obtain the best model accuracy. In this way, service providers can use these machine learning models to provide better inference-like services to the outside world. For example, when users browse products, they can recommend products for users in combination with product categories; when users are typing, they can recommend hot words in combination with context; or when navigating, use more accurate navigation models to formulate routes.

虽然把所有的用户数据汇聚到数据中心进行处理，就可以按照如上的训练方式得到机器学习模型。但是，在边缘环境下，这样的原始数据汇聚是被禁止的。其原因是：1）出于隐私的保护，用户往往不愿将自己的原始数据进行上传。2）服务提供商往往是租借运营商的边缘设备进行计算和传输。将各移动设备上的用户数据传输至数据中心，会带来高昂的跨域传输。这里的跨域同时包含两个含义：跨地域范围传输以及跨运营商至数据中心传输。Although all user data is aggregated into the data center for processing, the machine learning model can be obtained according to the above training method. However, in edge environments, such aggregation of raw data is prohibited. The reasons are: 1) For the protection of privacy, users are often reluctant to upload their own original data. 2) Service providers often rent operators' edge devices for computing and transmission. Transferring user data from various mobile devices to a data center can result in expensive cross-domain transfers. The cross-domain here also includes two meanings: cross-regional transmission and cross-operator-to-data center transmission.

由于边缘场景下用户使用移动设备的习惯不同，即使用设备的时间、频度不同以及使用过程中产生的数据规模、内容不同，以至于在进行分布式机器学习训练的时候存在着不确定性。即使某一时段内设备固定，所有用户数据均已生成，如何利用移动设备和边缘计算节点进行分布式机器学习训练，以能够在保证模型训练精度下尽可能节省边缘训练资源是关键问题。Due to the different habits of users using mobile devices in edge scenarios, that is, the time and frequency of using the devices, and the scale and content of data generated during use, there is uncertainty when conducting distributed machine learning training. Even if the device is fixed for a certain period of time and all user data has been generated, how to use mobile devices and edge computing nodes for distributed machine learning training to save edge training resources as much as possible while ensuring model training accuracy is a key issue.

发明内容SUMMARY OF THE INVENTION

发明目的：为了克服现有技术中存在的不足，本发明一方面提供一种面向智能边缘计算的协同模型训练任务配置方法，以解决分布式机器学习训练数据难以共享的问题，且在保证精度的情况下，尽量节省资源消耗。Purpose of the invention: In order to overcome the deficiencies in the prior art, on the one hand, the present invention provides a collaborative model training task configuration method oriented to intelligent edge computing, so as to solve the problem that the distributed machine learning training data is difficult to share, and ensure the accuracy. In this case, try to save resource consumption as much as possible.

技术方案：本发明的一种面向智能边缘计算的协同模型训练任务配置方法，用于边缘计算节点且包括一或多个训练时隙。该方法的每一训练时隙包括如下步骤：向一或多个边缘设备发送模型训练请求；接收来自所述一或多个边缘设备响应于所述模型训练请求而汇报的当前时隙的可用状态和用户数据规模；基于上一训练时隙得到的任务配置结果，从当前可用边缘设备中选定参与训练的边缘设备，并确定交互模型训练所需的训练小轮数目；与参与训练的边缘设备进行交互模型训练直至达到确定的训练小轮数目；根据训练效果和各边缘设备汇报的当前时隙的用户数据规模，构建以最小化边缘训练资源的使用为目标的优化问题并求解，得到用于下一训练时隙的新的任务配置结果。Technical solution: A collaborative model training task configuration method for intelligent edge computing of the present invention is used for edge computing nodes and includes one or more training time slots. Each training time slot of the method includes the steps of: sending a model training request to one or more edge devices; receiving the available status of the current time slot reported by the one or more edge devices in response to the model training request and user data scale; based on the task configuration results obtained in the previous training time slot, select edge devices participating in training from the currently available edge devices, and determine the number of training rounds required for interactive model training; Carry out interactive model training until the number of training rounds is reached; according to the training effect and the user data scale of the current time slot reported by each edge device, construct and solve the optimization problem aiming at minimizing the use of edge training resources, and obtain the The new task configuration result for the next training slot.

进一步地，任务配置结果包括：用于决策是否选择第i个边缘设备在训练时隙t内参与训练的参与者决策量

和用于决策训练时隙t内训练小轮数目的辅助决策量

。Further, the task configuration result includes: a participant decision amount for deciding whether to select thei -th edge device to participate in the training in the training time slott

and the auxiliary decision amount used to decide the number of training rounds in the training time slott

.

进一步地，训练时隙t内所需训练小轮数目

通过下式计算：Further, the number of training rounds required in the training time slott

Calculated by the following formula:

=K

，

=K

,

其中，K为常数。whereK is a constant.

进一步地，与参与训练的边缘设备进行交互模型训练时，训练时隙t内每一训练小轮具体包括：Further, when performing interactive model training with edge devices participating in the training, each training wheel in the training time slott specifically includes:

（1）边缘计算节点将先前训练所得的全局训练模型参数

、各边缘设备的局部精度修正梯度

以及全局精度修正梯度

发送至所有可用边缘设备；参与训练的边缘设备根据接收到的数据和自身的局部精度损失函数

分别计算各自对全局训练模型参数的更新

；t为当前交互训练的序数，j为当前训练小轮的序数，i为各边缘设备的序数；

=0；(1) The edge computing node uses the previously trained global training model parameters

, the local accuracy correction gradient of each edge device

and the global accuracy correction gradient

Sent to all available edge devices; the edge device participating in the training uses the received data and its own local accuracy loss function

Calculate the respective updates to the global training model parameters separately

;t is the ordinal number of the current interactive training,j is the ordinal number of the current training round, andi is the ordinal number of each edge device;

=0;

（2）边缘计算节点在接收到所有参与训练的边缘设备发送的对全局训练模型参数的更新

的基础上，计算得到新的全局模型参数

并发送给所有参与训练的边缘设备进行验证；所有参与训练的边缘设备基于

分别进行计算，各自得到新的局部精度

、新的局部精度修正梯度

、新的局部汇聚性能

，并发送给边缘计算节点进行更新记录；(2) The edge computing node receives the update of the parameters of the global training model sent by all the edge devices participating in the training

On the basis of , the new global model parameters are calculated

and sent to all edge devices participating in training for verification; all edge devices participating in training are based on

Calculated separately, each obtains a new local accuracy

, new local accuracy correction gradient

, the new local convergence performance

, and send it to the edge computing node to update the record;

（3）边缘计算节点基于接收到的各边缘设备的局部精度修正梯度

计算得到新的全局精度修正梯度

；(3) The edge computing node corrects the gradient based on the received local accuracy of each edge device

Calculate the new global accuracy correction gradient

;

（4）若当前训练小轮达到当前训练时隙t所需的训练小轮数目

，边缘计算节点还将最新的全局模型参数

发送给未参与训练的边缘设备；未参与训练的边缘设备基于

计算得到各自的第

个训练小轮后新的局部精度

，并发送给边缘计算节点进行更新记录。(4) If the current training round reaches the number of training rounds required for the current training time slott

, the edge computing node will also update the latest global model parameters

Sent to edge devices not participating in training; edge devices not participating in training are based on

Calculate the respective

New local accuracy after training epochs

, and send it to the edge computing node to update the record.

进一步地，步骤（1）中，参与训练的边缘设备根据接收到的数据分别计算各自对全局训练模型参数的更新

，具体包括：各参与训练的边缘设备利用获取的

、

和自身的局部精度损失函数

构建优化函数

，并以最小化所述优化函数

的方式得到

；所述优化函数

表示为：Further, in step (1), the edge devices participating in the training calculate their respective updates to the parameters of the global training model according to the received data.

, specifically including: each edge device participating in the training uses the acquired data

,

and its own local accuracy loss function

Build optimization functions

, and to minimize the optimization function

way to get

; the optimization function

Expressed as:

，

,

其中

、

均为确定的参数。in

,

are definite parameters.

进一步地，步骤（2）中，新的全局模型参数

通过下式计算得到：Further, in step (2), the new global model parameters

It is calculated by the following formula:

，

,

其中，

为当前训练时隙t内可用边缘设备的集合，

为训练时隙t中用于指示第i个边缘设备是否参与训练的变量，

等于0或1。in,

is the set of available edge devices in the current training time slott ,

is the variable used to indicate whether thei -th edge device participates in the training in the training time slott ,

Equal to 0 or 1.

进一步地，步骤（2）中，新的局部精度

是由各边缘设备将

代入自身的局部精度损失函数

而得到；新的局部精度修正梯度

是基于新的局部精度

而得到；新的局部汇聚性能

是通过下式得到：Further, in step (2), the new local precision

by each edge device

Substitute into its own local precision loss function

and get; the new local accuracy correction gradient

is based on the new local precision

and get; new local convergence performance

is obtained by:

。

.

进一步地，步骤（3）中，新的全局精度修正梯度

通过下式得到：Further, in step (3), the new global accuracy correction gradient

It is obtained by the following formula:

，

,

其中，

为当前训练时隙t内可用边缘设备的集合。in,

is the set of available edge devices in the current training time slott .

进一步地，所述训练效果包括：训练时隙t内达到确定的训练小轮数目

后的全局模型参数

、各边缘设备实际观测到的局部汇聚性能

和各边缘设备在各训练小轮中更新的局部精度

；其中，

。Further, the training effect includes: reaching a certain number of training small rounds in the training time slott

global model parameters after

, the actual observed local convergence performance of each edge device

and the local accuracy updated by each edge device in each training epoch

;in,

.

进一步地，所述优化问题表示为：Further, the optimization problem is expressed as:

目标函数：

，Objective function:

,

约束条件：Restrictions:

1)

，1)

,

2)

，2)

,

3)

，3)

,

4)

，4)

,

5)

，5)

,

其中，

为用于决策训练时隙t+1内训练小轮数目的辅助决策量；

为训练时隙t内可用边缘设备的集合，由汇报的当前可用状态确定；

为用于决策是否选择第i个边缘设备在训练时隙t+1内参与训练的参与者决策量；m为移动网络可并发传输的容量上限；

为在当前边缘网络中对模型参数和梯度进行一次传输的规模；

为训练时隙t内边缘网络中的可用带宽；

为训练时隙t内第i个边缘设备针对单个数据样本的计算代价；

为训练时隙t内第i个边缘设备的用户数据规模；

为全局精度损失函数，且

=

，

；

为设定的全局精度损失；

为当前训练时隙t内交互训练后所有边缘设备的局部汇聚性能的最大值，且

=

=

，

为训练时隙t内达到确定的训练小轮数目

后第i个边缘设备实际观测到的局部汇聚性能。in,

is the auxiliary decision amount used to decide the number of training rounds in the training time slott +1;

is the set of available edge devices in the training time slott , determined by the current available state reported;

is the decision amount of the participants used to decide whether to select thei -th edge device to participate in the training in the training time slott +1;m is the upper limit of the concurrent transmission capacity of the mobile network;

is the size of one transfer of model parameters and gradients in the current edge network;

is the available bandwidth in the edge network within the training time slott ;

is the computational cost of thei -th edge device for a single data sample in the training time slott ;

is the user data size of thei -th edge device in the training time slott ;

is the global precision loss function, and

=

,

;

is the set global precision loss;

is the maximum local convergence performance of all edge devices after interactive training in the current training time slott , and

=

,

A certain number of training epochs is reached within the training time slott

The actual observed local convergence performance of the lasti -th edge device.

有益效果：与现有技术相比，本发明具有以下优点：Beneficial effect: Compared with the prior art, the present invention has the following advantages:

1、通过在交互训练过程中构建优化函数进行更新，边缘设备无需直接发送自身模型参数的原始数据至边缘计算节点，而是发送与自身原始数据相关的更新数据。这可以解决由于隐私的保护，用户往往不愿将自己的原始数据进行上传的问题，适用于分布式机器学习训练。1. By constructing an optimization function for updating in the interactive training process, the edge device does not need to directly send the original data of its own model parameters to the edge computing node, but sends the updated data related to its own original data. This can solve the problem that users are often reluctant to upload their own original data due to privacy protection, and is suitable for distributed machine learning training.

2、基于交互训练的效果构建最小化边缘训练资源的使用为目标的优化问题并求解，从而决策得到下一训练时隙的任务配置，包括下一训练时隙各边缘设备的选择偏好和训练小轮数目。根据决策结果，从可用边缘设备中选取参与训练的边缘设备进行训练，而不需要使所有边缘设备均参与训练，这使本申请的协同训练任务配置方法至少减少27%的资源消耗开销，训练精度则最多降低4%。换言之，本发明能够在保证训练精度的同时，大幅度减少训练资源消耗。2. Based on the effect of interactive training, an optimization problem with the goal of minimizing the use of edge training resources is constructed and solved, so as to obtain the task configuration of the next training time slot, including the selection preference and training time of each edge device in the next training time slot. number of rounds. According to the decision result, the edge devices participating in the training are selected from the available edge devices for training, and all edge devices do not need to participate in the training, which reduces the resource consumption overhead by at least 27% in the collaborative training task configuration method of the present application, and the training accuracy It is reduced by up to 4%. In other words, the present invention can greatly reduce the consumption of training resources while ensuring the training accuracy.

附图说明Description of drawings

图1是面向智能边缘计算的协同模型训练系统的结构示意图；Figure 1 is a schematic structural diagram of a collaborative model training system for intelligent edge computing;

图2是应用任务配置方法后实际使用各边缘计算资源作训练的训练资源花费变化情况；Figure 2 shows the change in the cost of training resources actually using each edge computing resource for training after applying the task configuration method;

图3是应用任务配置方法后全局精度的变化情况；Figure 3 shows the change of global accuracy after applying the task configuration method;

图4是应用任务配置方法后最大局部汇聚性能的变化情况。Figure 4 shows the change of the maximum local convergence performance after applying the task configuration method.

具体实施方式Detailed ways

下面结合附图对本发明所公开的方法做进一步的详细介绍。The method disclosed in the present invention will be further described in detail below with reference to the accompanying drawings.

本发明的面向智能边缘计算的协同模型训练任务配置方法用于边缘计算节点，且包括一或多个训练时隙。每一训练时隙包括如下步骤：The intelligent edge computing-oriented collaborative model training task configuration method of the present invention is used for edge computing nodes, and includes one or more training time slots. Each training slot includes the following steps:

S1：向一或多个边缘设备发送模型训练请求。这里的边缘设备可以是连接边缘计算节点的移动设备、笔记本电脑等。S1: Send a model training request to one or more edge devices. The edge devices here can be mobile devices, laptops, etc. connected to edge computing nodes.

S2：接收来自所述一或多个边缘设备响应于所述模型训练请求而汇报的当前时隙的可用状态和用户数据规模。S2: Receive the available status and user data scale of the current time slot reported by the one or more edge devices in response to the model training request.

S3：基于上一训练时隙得到的任务配置结果，从当前可用边缘设备中选定参与训练的边缘设备，并确定交互模型训练所需的训练小轮数目。其中，该步骤中上一训练时隙得到的任务配置结果包括：用于决策是否选择第i个边缘设备在训练时隙t内参与训练的参与者决策量

和用于决策训练时隙t内训练小轮数目的辅助决策量

。S3: Based on the task configuration result obtained in the previous training time slot, select edge devices participating in training from currently available edge devices, and determine the number of training rounds required for training the interactive model. Wherein, the task configuration result obtained in the previous training time slot in this step includes: the decision amount of the participants used to decide whether to select thei -th edge device to participate in the training in the training time slott

.

其中，当前训练时隙t内所需训练小轮数目

通过下式计算：Among them, the number of training rounds required in the current training time slott

Calculated by the following formula:

=K

，

=K

,

其中，K为常数。whereK is a constant.

当前训练时隙t内参与训练的边缘设备则基于参与者决策量

从处于可用状态的边缘设备中选定。参与者决策量

实际上用于表征对各边缘设备选择的偏好程度，该偏好程度作为概率用于在各边缘设备间进行选择以确定参与训练的边缘设备，确定时具体可以采用Pair-wise Rounding算法或者DepRound算法求解得到。各训练时隙的参与者决策量

由上一训练时隙决策得到，具体计算方法会在后续内容中进行介绍。对于初始训练时隙，由于没有上一训练时隙的训练效果作参考，因此将选择所有可用边缘设备作为参与者。The edge devices participating in the training in the current training time slott are based on the decision volume of the participants

Select from edge devices that are available. Participant decision volume

In fact, it is used to characterize the degree of preference for the selection of each edge device. The degree of preference is used as a probability to select among the edge devices to determine the edge device participating in the training. When determining, the Pair-wise Rounding algorithm or the DepRound algorithm can be used to solve the problem. get. Participant decision volume for each training slot

It is obtained from the previous training time slot decision, and the specific calculation method will be introduced in the subsequent content. For the initial training slot, since there is no training effect from the previous training slot for reference, all available edge devices will be selected as participants.

S4：与参与训练的边缘设备进行交互模型训练直至达到确定的训练小轮数目。S4: Perform interactive model training with edge devices participating in training until a certain number of training rounds is reached.

在步骤S4中，与参与训练的边缘设备进行交互模型训练时，每一训练小轮具体包括如下过程：In step S4, when performing interactive model training with edge devices participating in the training, each training round specifically includes the following processes:

S41：边缘计算节点将先前训练所得的全局训练模型参数

、各边缘设备的局部精度修正梯度

以及全局精度修正梯度

分别计算各自对全局训练模型参数的更新

=0。S41: The edge computing node uses the previously trained global training model parameters

, the local accuracy correction gradient of each edge device

and the global accuracy correction gradient

=0.

其中，参与训练的边缘设备根据接收到的数据分别计算各自对全局训练模型参数的更新

，具体包括：Among them, the edge devices participating in the training calculate their respective updates to the parameters of the global training model according to the received data.

, including:

各参与训练的边缘设备利用获取的

、

和自身的局部精度损失函数

构建优化函数

，并以最小化所述优化函数

的方式得到

；其中，优化函数

表示为：Each edge device participating in the training uses the obtained

,

and its own local accuracy loss function

Build optimization functions

, and to minimize the optimization function

way to get

; where the optimization function

Expressed as:

，

,

其中

、

均为确定的参数。in

,

are definite parameters.

通过构建优化函数

，边缘设备无需直接发送自身模型参数的原始数据至边缘计算节点，而是发送与自身原始数据相关的更新数据。这可以解决由于隐私的保护，用户往往不愿将自己的原始数据进行上传的问题，适用于分布式机器学习训练。By building an optimization function

, the edge device does not need to directly send the original data of its own model parameters to the edge computing node, but sends updated data related to its own original data. This can solve the problem that users are often reluctant to upload their own original data due to privacy protection, and is suitable for distributed machine learning training.

S42：边缘计算节点在接收到所有参与训练的边缘设备发送的对全局训练模型参数的更新

的基础上，计算得到新的全局模型参数

分别进行计算，各自得到新的局部精度

、新的局部精度修正梯度

、新的局部汇聚性能

，并发送给边缘计算节点进行更新记录。S42: The edge computing node receives the update of the parameters of the global training model sent by all the edge devices participating in the training

On the basis of , the new global model parameters are calculated

Calculated separately, each obtains a new local accuracy

, new local accuracy correction gradient

, the new local convergence performance

, and send it to the edge computing node to update the record.

该步骤中：新的全局模型参数

通过下式计算得到：In this step: New global model parameters

It is calculated by the following formula:

，

,

其中，

为当前训练时隙t内可用边缘设备的集合，

等于0或1且基于参与者决策量

得到。一般来说，

等于0时指示对应边缘设备i未参与训练，

等于1时指示对应边缘设备i参与训练，或者也可以反过来使用。in,

is the set of available edge devices in the current training time slott ,

Equal to 0 or 1 and based on participant decision volume

get. Generally speaking,

When it is equal to 0, it indicates that the corresponding edge devicei does not participate in the training,

When it is equal to 1, it indicates that the corresponding edge devicei participates in the training, or it can be used in reverse.

新的局部精度

是由各边缘设备将

代入自身的局部精度损失函数

而得到；新的局部精度修正梯度

是基于新的局部精度

而得到；新的局部汇聚性能

是通过下式得到：new local precision

by each edge device

Substitute into its own local precision loss function

and get; the new local accuracy correction gradient

is based on the new local precision

and get; new local convergence performance

is obtained by:

。

.

S43：边缘计算节点基于接收到的各局部精度修正梯度

计算得到新的全局精度修正梯度

。S43: The edge computing node corrects the gradient based on the received local accuracy

Calculate the new global accuracy correction gradient

.

该步骤中，新的全局精度修正梯度

通过下式得到：In this step, the new global accuracy correction gradient

It is obtained by the following formula:

，

,

其中，

为当前训练时隙t内可用边缘设备的集合。in,

is the set of available edge devices in the current training time slott .

若当前训练小轮达到当前训练时隙t所需的训练小轮数目

，边缘计算节点还将最新的全局模型参数

发送给未参与训练的边缘设备；未参与训练的边缘设备基于

计算得到各自的第

个训练小轮后新的局部精度

，并发送给边缘计算节点进行更新记录。If the current training round reaches the number of training rounds required for the current training time slott

, the edge computing node will also update the latest global model parameters

Calculate the respective

New local accuracy after training epochs

, and send it to the edge computing node to update the record.

S5：根据训练效果和各边缘设备汇报的当前时隙的用户数据规模，构建以最小化边缘训练资源的使用为目标的优化问题并求解，得到用于下一训练时隙的新的任务配置结果。S5: According to the training effect and the user data scale of the current time slot reported by each edge device, construct and solve an optimization problem aiming at minimizing the use of edge training resources, and obtain a new task configuration result for the next training time slot .

在步骤S5中，训练效果包括：训练时隙t内达到确定的训练小轮数目

后的全局模型参数

、各边缘设备实际观测到的局部汇聚性能

和各边缘设备在各训练小轮中更新的局部精度

；其中，

。In step S5, the training effect includes: reaching a certain number of training small rounds within the training time slott

global model parameters after

, the actual observed local convergence performance of each edge device

and the local accuracy updated by each edge device in each training epoch

;in,

.

边缘计算节点的总体目标旨在为所有训练（

次训练时隙的训练），在各自训练满足精度的条件下最小化边缘训练资源的使用，因此，建立的优化问题如下：The overall goal of edge computing nodes is to provide

training time slots), and minimize the use of edge training resources under the condition that the respective training satisfies the accuracy. Therefore, the established optimization problem is as follows:

优化目标：

；optimize the target:

;

约束条件：Restrictions:

1) 对于边缘传输限制：

1) For edge transfer restrictions:

2) 对于参与者选择控制：

2) For participant selection control:

3) 对于全局后验精度要求：

3) For global posterior accuracy requirements:

4) 对于决策的定义域限制：

4) Domain restrictions for decision making:

式中，

为训练时隙t内训练所需的小轮数目，

为训练所得的全局汇聚精度，

为全局精度损失；

为用于决策是否选择第i个边缘设备在训练时隙t内参与训练的参与者决策量；

为训练时隙t内边缘网络中的可用带宽；

为训练时隙t内第i个边缘设备的用户数据规模；m为移动网络可并发传输的容量上限；

=

=

，

为训练时隙t内达到确定的训练小轮数目

后第i个边缘设备实际观测到的局部汇聚性能；

为训练时隙t内

个训练小轮后的模型参数；

为全局精度损失函数，且

=

，

。In the formula,

is the number of epochs required for training in training time slott ,

is the global pooling accuracy obtained from training,

is the global precision loss;

is the decision amount of the participants used to decide whether to select thei -th edge device to participate in the training in the training time slott ;

is the available bandwidth in the edge network within the training time slott ;

is the user data scale of thei -th edge device in the training time slott ;m is the upper limit of the concurrent transmission capacity of the mobile network;

=

,

A certain number of training epochs is reached within the training time slott

The actual observed local convergence performance of the lasti -th edge device;

for the training time slott

model parameters after a training round;

is the global precision loss function, and

=

,

.

由于是在线场景，决策时无法准确观测到实际的训练效果，因此上述优化问题在实际中只能分解到每一训练时隙中，进行每一次子问题的求解。再者，每一次子问题的求解中，并不能提前观测到该次训练的局部汇聚精度

=

，以及全局汇聚精度

。因此，需要利用上一训练时隙各边缘设备的训练效果作为参考，以近似替代当前还未训练无法得到的全局和局部汇聚精度。综上，对于训练时隙t内的训练，子问题实际为：Because it is an online scene, the actual training effect cannot be accurately observed during decision-making, so the above optimization problem can only be decomposed into each training time slot in practice, and each sub-problem can be solved. Furthermore, in the solution of each sub-problem, the local convergence accuracy of the training cannot be observed in advance.

=

, and the global pooling accuracy

. Therefore, it is necessary to use the training effect of each edge device in the previous training time slot as a reference to approximate the global and local convergence accuracy that cannot be obtained without training currently. In summary, for the training in the training time slott , the sub-problems are actually:

目标函数：

，Objective function:

,

约束条件：Restrictions:

1)

，1)

,

2)

，2)

,

3)

，3)

,

4)

，4)

,

5)

，5)

,

其中，

为用于决策训练时隙t+1内训练小轮数目的辅助决策量；

为用于决策是否选择第i个边缘设备在训练时隙t+1内参与训练的参与者决策量。in,

is the amount of participant decision-making used to decide whether to select thei -th edge device to participate in training in training slott +1.

在上述子问题求解中，虽然目标函数里

是二次，但含义和没有二次一样，二次是为了说明在实数域上是凸函数。上述子问题中

可利用诸如IPOPT+AMPL等成熟的求解工具进行求解。In the above sub-problem solution, although the objective function

It is quadratic, but the meaning is the same as no quadratic. The quadratic is to illustrate that it is a convex function in the real number field. in the above sub-problems

It can be solved using mature solving tools such as IPOPT+AMPL.

图1以四个边缘设备的选择为例展示了本发明的面向智能边缘计算的协同模型训练系统的结构，其中所有的边缘设备均与同一个边缘计算节点相连并进行数据交互，且边缘网络能够允许传输的最大容量可以包含四个边缘设备；下面以两次全局模型训练为例，对本发明的面向智能边缘计算的协同模型训练任务配置方法作进一步的说明：Fig. 1 shows the structure of the intelligent edge computing-oriented collaborative model training system of the present invention by taking the selection of four edge devices as an example, wherein all edge devices are connected to the same edge computing node and perform data exchange, and the edge network can The maximum capacity allowed for transmission can include four edge devices; the following takes two global model training as an example to further illustrate the method for configuring the collaborative model training task for intelligent edge computing of the present invention:

（1）在第一次模型训练请求到达时，需要训练的数据分布在三个可用的边缘设备上；由于没有之前的已训练效果作为参考，因此将这三个可用边缘设备都认为是分布式机器学习训练的参与者，三个参与者向边缘计算节点汇报其用户数据规模；(1) When the first model training request arrives, the data to be trained is distributed on the three available edge devices; since there is no previous training effect as a reference, the three available edge devices are considered distributed. Participants in machine learning training, three participants report their user data scale to edge computing nodes;

（2）边缘计算节点初始化全局模型（边缘计算节点维护）、各边缘设备的精度修正梯度以及全局精度修正梯度；(2) The edge computing node initializes the global model (edge computing node maintenance), the accuracy correction gradient of each edge device, and the global accuracy correction gradient;

（3）边缘计算节点将全局模型参数、各边缘设备的精度修正梯度和全局精度修正梯度下发至三个边缘设备；(3) The edge computing node sends the global model parameters, the accuracy correction gradient of each edge device, and the global accuracy correction gradient to the three edge devices;

（4）各边缘设备接收到来自边缘计算设备的信息后，利用自身设备上的用户数据构造精度损失函数，并以最小化

的形式得到

，获得的过程为不断迭代更新

；(4) After each edge device receives the information from the edge computing device, it uses the user data on its own device to construct an accuracy loss function, and minimizes the loss function.

obtained in the form of

, the obtained process is continuous iterative update

;

（5）各边缘设备利用

更新自己的局部模型，并利用自身的精度损失函数进行一次验证，得到局部精度、局部汇聚性能和局部精度修正梯度；(5) Utilization of edge devices

Update your own local model, and use your own accuracy loss function to perform a verification to obtain the local accuracy, local convergence performance and local accuracy correction gradient;

（6）各边缘设备将

、局部精度、局部汇聚性能和局部精度修正梯度发送给边缘计算节点；(6) Each edge device will

, local accuracy, local convergence performance and local accuracy correction gradient are sent to edge computing nodes;

（7）边缘计算节点根据各边缘设备发送的

的基础上进行全局模型更新；利用各边缘设备发送的局部精度修正梯度进行全局精度修正梯度更新；并记录局部汇聚性能；(7) The edge computing node sends the data sent by each edge device according to the

The global model is updated on the basis of ; the global accuracy correction gradient is updated by using the local accuracy correction gradient sent by each edge device; and the local convergence performance is recorded;

（8）由于当前参与的为所有边缘设备，所以全局精度仅为各边缘设备局部精度的加权平均；(8) Since all edge devices are currently participating, the global accuracy is only the weighted average of the local accuracy of each edge device;

（9）不断进行步骤（3）到步骤（8），直到训练小轮数达到由

所确定的

；(9) Continue to perform steps (3) to (8) until the number of training rounds reaches from

determined

;

（10）观察到三个边缘设备的局部训练效果，即每次小轮的局部汇聚性能，并根据此进行三个边缘设备的偏好修正；(10) Observe the local training effect of the three edge devices, that is, the local convergence performance of each small round, and carry out the preference correction of the three edge devices according to this;

（11）第二次分布式机器学习模型训练请求到达，当前可用的是四个边缘设备；(11) The second distributed machine learning model training request arrives, and four edge devices are currently available;

（12）由于第一个边缘设备在上一次训练中的局部汇聚性能不好，因此边缘计算节点结合各边缘设备的选择偏好，选择除第一个边缘设备外的其他边缘设备作为参与者；(12) Since the local convergence performance of the first edge device in the last training is not good, the edge computing node selects other edge devices except the first edge device as participants in combination with the selection preferences of each edge device;

（13）为第二次分布式机器学习训练进行步骤（2）到步骤（10）；(13) Perform steps (2) to (10) for the second distributed machine learning training;

（14）在第二次分布式机器学习训练的步骤（8）中，虽然第一个边缘设备没有进行分布式机器学习训练的参与，但是在验证的时候，仍然需要从边缘计算节点获取最新的模型参数，并利用自身的精度损失函数进行一次验证，得到局部精度，并发送给边缘计算节点。(14) In step (8) of the second distributed machine learning training, although the first edge device does not participate in the distributed machine learning training, it is still necessary to obtain the latest data from the edge computing node during verification. model parameters, and use its own accuracy loss function to perform a verification to obtain the local accuracy and send it to the edge computing node.

实验的效果如图2至图4所示，图2展示了在应用动态任务调整方法后，在不断进行分布式机器学习训练过程中的边缘计算资源消耗变化（已按最大值进行归一化），边缘训练资源消耗为边缘计算节点和各个边缘设备上计算资源花费和每一小轮传输花费的总和，训练资源消耗对比其他方法总是最少，至少减少27%的开销；图3展示了在应用动态任务调整方法后，在不断进行分布式机器学习训练过程中的全局精度变化，实际上对应的是建模中的全局后验精度

，为在所有设备上验证得到的精度，所提方法最多降低4%的训练精度；图4展示了在应用动态任务调整方法后，在不断进行分布式机器学习训练过程中最大局部汇聚性能的变化，即建模中的

=

=

。The effects of the experiments are shown in Figures 2 to 4. Figure 2 shows the changes in edge computing resource consumption during continuous distributed machine learning training after applying the dynamic task adjustment method (normalized by the maximum value) , the edge training resource consumption is the sum of the computing resource consumption on the edge computing node and each edge device and the cost of each small round of transmission. Compared with other methods, the training resource consumption is always the least, reducing the overhead by at least 27%; Figure 3 shows the application After the dynamic task adjustment method, the global accuracy change in the process of continuous distributed machine learning training actually corresponds to the global posterior accuracy in modeling

, in order to verify the obtained accuracy on all devices, the proposed method reduces the training accuracy by up to 4%; Figure 4 shows the change in the maximum local convergence performance during continuous distributed machine learning training after applying the dynamic task adjustment method , that is, in the modeling

=

.

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本发明是参照根据本发明实施例的方法、设备（系统）、计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

最后应当说明的是：以上实施例仅用以说明本发明的技术方案而非对其限制，尽管参照上述实施例对本发明进行了详细的说明，本发明中的控制节点与边缘计算节点的交互方式，收集反馈信息内容与在线调度方法在各系统中均适用，所属领域的普通技术人员应当理解：依然可以对本发明的具体实施方式进行修改或者等同替换，而未脱离本发明精神和范围的任何修改或者等同替换，其均应涵盖在本发明的权利要求保护范围之内。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, the interaction mode between the control node and the edge computing node in the present invention , the content of the collected feedback information and the online scheduling method are applicable in each system. Those of ordinary skill in the art should understand that the specific embodiments of the present invention can still be modified or equivalently replaced without departing from the spirit and scope of the present invention. Any modification Or equivalent replacements, all of which should be covered within the protection scope of the claims of the present invention.

Claims

1. A collaborative model training task configuration method facing intelligent edge computing is used for edge computing nodes and comprises one or more training time slots, and is characterized in that each training time slot comprises the following steps:

sending a model training request to one or more edge devices;

receiving an availability status and a user data size of a current time slot reported by the one or more edge devices in response to the model training request;

selecting edge equipment participating in training from the current available edge equipment based on a task configuration result obtained in the last training time slot, and determining the number of training small wheels required by interactive model training;

performing interactive model training with edge equipment participating in training until the number of the determined training small wheels is reached; and according to the training effect and the user data scale of the current time slot reported by each edge device, constructing and solving an optimization problem aiming at minimizing the use of edge training resources to obtain a new task configuration result for the next training time slot.

2. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 1, wherein the task configuration result includes: for deciding whether to select firstiAn edge device in a training time slottParticipant decision making for internal participation training

And for decision training time slotstDecision-making aid for number of inner training small wheel

。

3. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 2, wherein the training time slot is a training time slottNumber of small wheels for training

Calculated by the following formula:

=K

，

wherein,Kis a constant.

4. The intelligent edge computing-oriented collaborative model training task configuration method of claim 1, wherein in interactive model training with an edge device participating in training, a training time slot is usedtEach training small wheel specifically comprises:

(1) the edge computing node compares the parameters of the previously trained global training model

Local accuracy correction gradient of each edge device

And global precision correction gradient

Sending to all available edge devices; the edge device participating in training according to the received data and the local precision loss function of the edge device

Separately computing respective updates to global training model parameters

；tFor the ordinal number of the current interactive training,jfor the ordinal number of the current training small round,iordinal number for each edge device;

=0；

(2) the edge computing node receives the update of the global training model parameters sent by all the edge devices participating in the training

On the basis of the global model parameters, new global model parameters are obtained by calculation

And sending the data to all the edge devices participating in training for verification; all edge devices participating in training are based on

Respectively calculating to obtain new local precision

New local accuracy correction gradient

New local convergence performance

And sending the data to the edge computing node for updating the record;

(3) the edge computing node corrects the gradient based on the received local precision of each edge device

Calculating to obtain a new global precision correction gradient

；

(4) If the current training small wheel reaches the current training time slottNumber of training wheels required

The edge computing node also updates the global model parameters

Sending the data to the edge device which does not participate in training; edge device not participating in training based on

Calculate to obtain the respective second

New local accuracy after each training small wheel

And sending the data to the edge computing node for updating the record.

5. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 4, wherein in the step (1), the edge devices participating in training respectively calculate respective updates to global training model parameters according to the received data

The method specifically comprises the following steps:

each edge device involved in training utilizes the obtained

、

And local loss of precision function of itself

Constructing an optimization function

And to minimize said optimization function

In such a manner as to obtain

；

The optimization function

Expressed as:

，

wherein

、

Are all determined parameters.

6. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 4, wherein in the step (2), new global model parameters

Calculated by the following formula:

，

wherein,

time slot for current trainingtThe set of inner available edge devices,

for training time slotstFor indicating the firstiA variable of whether an individual edge device is involved in training,

equal to 0 or 1.

7. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 5, wherein in the step (2), the new local precision

Is formed by edge devices

Substituting into its own local loss of precision function

And then obtaining; new local precision correction gradient

Based on new local precisions

And then obtaining; new local convergence performance

Is obtained by the following formula:

。

8. the intelligent edge computing-oriented collaborative model training task configuration method according to claim 4, wherein in the step (3), a new global accuracy correction gradient is adopted

Obtained by the following formula:

，

wherein,

time slot for current trainingtA set of inner available edge devices.

9. The intelligent edge computing-oriented collaborative model training task configuration method according to claim 4, wherein the training effect includes: training time slottReach a certain number of training wheels internally

Latter global model parameters

Local convergence performance actually observed by each edge device

And local accuracy of each edge device updated in each training small round

(ii) a Wherein,

。

10. the intelligent edge computing-oriented collaborative model training task configuration method according to claim 4, wherein the optimization problem is expressed as:

an objective function: