CN110232448A - Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting - Google Patents

Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting
Download PDF

Info

Publication number
CN110232448A
Authority
CN
China
Prior art keywords
split
eigenvalue
feature
gain
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910274219.3A
Other languages
Chinese (zh)
Inventor
杨萃
黄晓鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN201910274219.3A
Publication of CN110232448A
Legal status: Pending


Abstract

Translated from Chinese

The invention discloses a method for improving the effect of feature values in gradient boosting tree models and preventing overfitting. By adding the pre-discretization feature values to the loss function, the method obtains the optimal split point together with a feature-value weight and bias, thereby exploiting as much of the pre-discretization data as possible. For data whose input features correlate strongly with the output target, the model performs considerably better than an ordinary gradient boosting tree. The invention also provides a t-distribution method for preventing overfitting that screens split points via the law of large numbers; in practice this finds more accurate split points and prevents overfitting. The invention addresses the problem that gradient boosting decision tree models consider only the magnitude of feature values after discretization, ignoring the true distribution of the values before discretization, as well as the overfitting problem. The invention can be widely applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields.

Description

Translated from Chinese

Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Technical Field

The invention relates to machine learning models, and in particular to a method that addresses the insensitivity of gradient boosting tree models to the numerical values of features, together with a new method for preventing model overfitting.

Background

With the rapid development of big data, data mining technology has been widely applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields. The gradient boosting tree algorithm has certain advantages over other machine learning models: it trains quickly, and the importance of and interactions between features can be analyzed from a trained model to extract new features.

However, existing gradient boosting tree algorithms such as XGBoost and LightGBM share a fundamental limitation: they consider only the magnitude of feature values after discretization, not the true distribution of the values before discretization. When building a tree, the model first partitions the (continuous) feature values into discrete bins and then searches for split points among those bins, so part of the information in the data is lost at discretization time. For example, given the feature values 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 1.6, 1.7, 1.8 and two split points, the split points found may be 0.45 and 0.55, and the feature values discretize to 0, 0, 0, 0, 1, 1, 1, 2, 2, 2. Throughout this process the gradient boosting tree cares only about the discretized values and ignores the true distribution of the data before discretization.
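As a concrete illustration of the binning just described, the following minimal sketch reproduces the example with NumPy; the split points 0.45 and 0.55 are the ones named in the text.

```python
import numpy as np

# Feature values before discretization (the example above)
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 1.6, 1.7, 1.8])

# The two split points found by the binning procedure
split_points = np.array([0.45, 0.55])

# np.digitize assigns each value the index of its bin; everything about the
# within-bin distribution (0.5 appearing three times, the gap before 1.6) is lost
bins = np.digitize(x, split_points)
print(bins)  # [0 0 0 0 1 1 1 2 2 2]
```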

The improved model described here also discretizes the data, but after discretization it goes on to use as much of the pre-discretization data as possible, recovering as much of the lost information as it can.

Summary of the Invention

The purpose of the present invention is to overcome the above shortcomings of the prior art by providing a method for improving the effect of feature values in gradient boosting tree models and preventing overfitting.

The technical solution adopted by the present invention to solve the above problems is as follows.

A gradient boosting tree model that improves the effect of feature values and prevents overfitting, comprising the following steps:

Step 1: For the sample set D, determine the model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; let the number of samples be n and the number of features be m. Define a loss function; it may be chosen as log loss or MSE, but is not limited to these.

Step 2: Normalize the feature values xij.

Step 3: Initialize the predicted values to the mean of yi, i.e. ŷ(0) = (1/n) Σi yi.

Step 4: Discretize the feature values xij to obtain all split points; the number of split points is s.

Step 5: Compute the first-order derivative gi and second-order derivative hi of the loss for each input sample.

Step 6: At the k-th leaf node (if k is 0, D0 = D), for each split point, the node's samples Dk are pre-split into a left sample L and a right sample R with L ∪ R = Dk. Traverse all split points and compute, for all feature values of the left sample L and right sample R, the feature-value weight w1, feature-value bias w2, and the corresponding split gain. This yields s pairs of left and right samples, and s × m triples of (w1, w2, gain).

Step 7: If the user-defined loss function is the MSE function, i.e. l = (ŷi − yi)²/2, apply the t-distribution overfitting-prevention procedure; if it is not, go directly to step 8.

Step 8: Among the s × m gains, find the largest gain and record the corresponding split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond, but do not split yet.

Step 9: Among all nodes, find the node with the largest gain and split it.

Step 10: Repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes. At this point one weak decision tree has been constructed.

Step 11: Split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree; the positions of leaf and non-leaf nodes in the gradient boosting tree are shown in Figure 1.

Step 12: For the data at each leaf node, update the predicted value: ŷi(t) = ŷi(t−1) + η(w1 xir + w2), where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree.

Step 13: Repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.
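The outer loop of steps 5–13 can be summarized in the sketch below. This is a simplification under stated assumptions: the loss is MSE, and each leaf outputs the usual constant −G/(H+λ) as in an ordinary gradient boosting tree, whereas the patent fits a per-leaf pair (w1, w2) and outputs w1·xir + w2 for a selected feature r; all names and defaults are illustrative.

```python
import numpy as np

def score(g, h, mask, lam=1.0):
    # Leaf quality (higher is better): the standard second-order score, used
    # here as a stand-in for the patent's Score, whose closed form involves
    # the fitted pair (w1, w2)
    G, H = g[mask].sum(), h[mask].sum()
    return G * G / (2.0 * (H + lam))

def grow_tree(X, g, h, n_leaves=4):
    n, m = X.shape
    leaves = [np.ones(n, dtype=bool)]            # root holds all samples (step 6, k = 0)
    while len(leaves) < n_leaves:                # steps 6-10: grow greedily
        best = None                              # (gain, leaf index, left mask, right mask)
        for li, mask in enumerate(leaves):
            for j in range(m):
                for thr in np.unique(X[mask, j])[:-1]:   # candidate split points (step 4)
                    L = mask & (X[:, j] <= thr)
                    R = mask & (X[:, j] > thr)
                    gain = score(g, h, L) + score(g, h, R) - score(g, h, mask)
                    if best is None or gain > best[0]:
                        best = (gain, li, L, R)
        if best is None:                         # no admissible split anywhere
            break
        _, li, L, R = best
        leaves[li:li + 1] = [L, R]               # step 9: split the best node
    return leaves

def train(X, y, n_trees=10, lr=0.1, lam=1.0):
    y_hat = np.full(len(y), y.mean())            # step 3: initialize to the mean
    for _ in range(n_trees):                     # step 13: boosting rounds
        g, h = y_hat - y, np.ones(len(y))        # step 5: MSE derivatives (h_i = 1)
        for mask in grow_tree(X, g, h):
            w = -g[mask].sum() / (h[mask].sum() + lam)   # constant leaf output (stand-in)
            y_hat[mask] += lr * w                # step 12: update predictions
    return y_hat

# Illustrative usage:
# rng = np.random.default_rng(0)
# X = rng.random((200, 3)); y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 200)
# y_hat = train(X, y)
```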

Preferably, the normalization of the feature values xij in step 2 includes the following options:

1) xij = (xij − μj)/σj, where μj and σj are the mean and standard deviation of feature xj.

2) xij = (xij − μj)/σj as in 1), followed by xij = tanh(xij).

3) First discretize xij, then set xij = (xij − μj)/σj.

4) xij = (xij − μj)/σj, then remove the outliers in xij.

Options 2–4 are intended to avoid instability caused by outliers.
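A sketch of the four options follows, assuming equal-frequency binning for option 3 and clipping as one reading of "remove outliers" in option 4 (both are assumptions; the patent leaves these details open, and n_bins and clip are illustrative values):

```python
import numpy as np

def standardize(X):
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard constant features
    return (X - mu) / sigma

def normalize(X, method=1, n_bins=10, clip=3.0):
    # Options 1-4 of step 2
    X = np.asarray(X, dtype=float)
    if method == 3:
        # 3) discretize first (equal-frequency bins), then standardize
        qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
        edges = np.quantile(X, qs, axis=0)                 # shape (n_bins - 1, m)
        X = np.stack([np.digitize(X[:, j], np.unique(edges[:, j]))
                      for j in range(X.shape[1])], axis=1).astype(float)
    Z = standardize(X)                                     # 1) z-score
    if method == 2:
        Z = np.tanh(Z)                                     # 2) squash extremes into (-1, 1)
    elif method == 4:
        Z = np.clip(Z, -clip, clip)                        # 4) curb outliers
    return Z
```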

Preferably, the first-order derivative gi and second-order derivative hi of the input samples in step 5 are computed as follows:

gi = ∂l(yi, ŷi(t−1))/∂ŷi(t−1) and hi = ∂²l(yi, ŷi(t−1))/∂(ŷi(t−1))², where for t > 1, ŷi(t−1) denotes the estimate of yi after the (t−1)-th weak decision tree, and for t = 1 it denotes the default estimate of yi, i.e. the mean of yi.

Here l is the loss function; any loss function for which these first- and second-order derivatives exist can be used to train the model.
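For the two losses the patent names, the derivatives take the familiar forms below (standard results, not specific to the patent); note that for MSE hi is identically 1, which is what the t-distribution step relies on.

```python
import numpy as np

def grad_hess_mse(y, y_hat):
    # l = (y_hat - y)^2 / 2  =>  g = y_hat - y, h = 1
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logloss(y, y_hat):
    # y in {0, 1}; y_hat is a raw score squashed by a sigmoid
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```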

Preferably, in step 6, the feature-value weight w1, feature-value bias w2, and corresponding split gain are computed as follows:

Step 6.1: Construct the objective function of the algorithm,

Obj(t) = Σi l(yi, ŷi(t−1) + w1(t)(xi,p(xi)) · xi,p(xi) + w2(t)(xi,p(xi))) + Ω(ft) + const,

where p(xi) indicates that the selected feature of sample xi is the p(xi)-th feature; xir is the selected feature at the leaf node; w1(t)(·) is the feature-value weight function and w2(t)(·) the feature-value bias function of the t-th weak decision tree; Ω is a regularization function, and const is a constant.

The goal of training is to minimize the objective function Obj(t).

A second-order Taylor expansion of l around ŷi(t−1) gives

Obj(t) ≈ Σi [ gi(w1 xir + w2) + ½ hi(w1 xir + w2)² ] + γ1|w1| + γ2|w2| + const,   (Equation 1)

where r denotes the r-th feature at the k-th node, i the i-th sample, and γ1, γ2 are regularization coefficients specified by the user.

Note: q(xir) indicates that, when the selected feature is the r-th feature, sample xi is mapped to the q(xir)-th leaf node.

Step 6.2: Reduce the objective to the problem of minimizing a quadratic function of the two variables w1 and w2.

Minimizing Equation 1 yields the optimal solution (w1, w2): setting the partial derivatives with respect to w1 and w2 to zero gives the linear system

w1 Σi hi xir² + w2 Σi hi xir = −Σi gi xir,
w1 Σi hi xir + w2 Σi hi = −Σi gi,

whose solution is the minimizer.

Step 6.3: Applying l1 regularization to this solution gives the final w1 and w2 for all feature values at all split points; what remains undetermined is which split points and which feature values to select. The optimal split point is then found by exhaustively traversing all split points and all features: for each split point and each feature, compute the gain. The decision tree is built greedily, splitting only the samples Dk at one leaf node into a left sample L and a right sample R at each step; the gain is the change in the objective caused by adding the split node. First define the node scores ScoreC, ScoreL and ScoreR,

where C is the current node to be split and rC its already-determined selected feature; L and R are the left and right nodes split from C; rL is the set of all features of the left sample to traverse once the split point is fixed, and rR the corresponding set for the right sample. Then

gain = ScoreL + ScoreR − ScoreC.

By traversing all s split points and m feature values, s × m candidate gain values are obtained.
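The per-leaf optimization and the gain can be sketched as follows. The closed form for (w1, w2) is reconstructed from the quadratic objective above by solving the two normal equations; the patent additionally applies l1 regularization with γ1, γ2, which this sketch omits, and its exact Score expression is not reproduced in the source, so the score here is simply the decrease of the quadratic objective at the fitted (w1, w2).

```python
import numpy as np

def fit_leaf(x, g, h, eps=1e-12):
    # Minimize sum_i [ g_i*u_i + 0.5*h_i*u_i^2 ] with u_i = w1*x_i + w2:
    # setting the two partial derivatives to zero gives a 2x2 linear system
    A = np.array([[np.sum(h * x * x), np.sum(h * x)],
                  [np.sum(h * x),     np.sum(h)]])
    b = -np.array([np.sum(g * x), np.sum(g)])
    w1, w2 = np.linalg.solve(A + eps * np.eye(2), b)   # eps guards singular A
    return w1, w2

def leaf_score(x, g, h):
    # Objective decrease achieved by the fitted (w1, w2) at this leaf
    w1, w2 = fit_leaf(x, g, h)
    u = w1 * x + w2
    return -(np.sum(g * u) + 0.5 * np.sum(h * u * u))

def split_gain(x, g, h, left):
    # gain = Score_L + Score_R - Score_C, as in the text
    return (leaf_score(x[left], g[left], h[left])
            + leaf_score(x[~left], g[~left], h[~left])
            - leaf_score(x, g, h))
```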

Preferably, the t-distribution overfitting-prevention method of step 7 is as follows: when the user-defined loss function is MSE, hi = 1 and Hk = Nk, the number of samples at the node. For the samples at a node, the mathematical expectation of the leaf weight w, viewed as a random variable, is approximately equal to its sample mean; assuming w follows a t distribution, compute the confidence interval of w and judge whether the candidate weight falls outside it.

The specific procedure is: if the user-defined loss function is MSE, filter the gains as follows.

Assume the per-sample contributions to w are independent and identically normally distributed; then the studentized weight satisfies w ~ t(Hk − 1).

Define the sample variance s² of the per-sample contributions, and the confidence bounds

upper confidence bound U = w + s/n × tα/2(Hk − 1),
lower confidence bound L = w − s/n × tα/2(Hk − 1).

Filter the gains against these bounds, where α is the confidence level, specified by the user: when a gain meets the requirement, its value is kept; when it does not, it is set to 0.
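A sketch of the screening step with SciPy. Here w_values stands for the per-sample quantities whose mean forms the leaf weight (an illustrative name); the source gives the bounds as w ± s/n × tα/2(Hk − 1), with s/n rather than the usual s/√n, and that is what is coded.

```python
import numpy as np
from scipy import stats

def t_confidence_bounds(w_values, alpha=0.05):
    # Two-sided t interval around the leaf weight w (the sample mean),
    # with H_k - 1 = n - 1 degrees of freedom as in step 7
    n = len(w_values)
    w = np.mean(w_values)
    s = np.std(w_values, ddof=1)                  # sample standard deviation
    half = s / n * stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)
    return w - half, w + half

def filter_gain(gain, w_candidate, w_values, alpha=0.05):
    # Keep the gain only if the candidate weight stays inside the interval;
    # otherwise zero it out, as the text prescribes
    lower, upper = t_confidence_bounds(w_values, alpha)
    return gain if lower <= w_candidate <= upper else 0.0
```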

Preferably, step 8: among the s × m gains, find the largest gain and determine the split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond at that maximum, but do not split yet.

Preferably, in step 9, once the s × m gains of the newly split nodes are obtained from step 8, the model compares the gains of all nodes and selects the node with the largest gain for splitting.

Compared with the prior art, the present invention has the following advantages and effects: 1) The invention improves the gradient boosting tree by adding the model's input features xij to the loss function l and training additional weights, which to a certain extent resolves the problem that the gradient boosting tree considers only the magnitude of feature values after discretization and not their true distribution before discretization. For data whose input features xij correlate strongly with the output target yi, the model performs considerably better than an ordinary gradient boosting tree.

2) The invention also provides a t-distribution method for preventing overfitting; in practical applications it finds more accurate split points and prevents overfitting.

Brief Description of the Drawings

Figure 1 shows the positions of leaf nodes and non-leaf nodes in the decision tree.

Figure 2 is a flowchart of the gradient boosting tree model of the present invention.

Detailed Description

The present invention is further described below with reference to the drawings, but the claimed scope is not limited to the described embodiments. Any process or notation not specified in detail below can be understood or implemented by those skilled in the art from the prior art.

The present invention can be applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields. The gradient boosting tree algorithm has certain advantages over other machine learning models: it trains quickly, and the importance of and interactions between features can be analyzed from a trained model to extract new features.

As shown in Figure 2, the method provided by this embodiment for improving the effect of feature values in gradient boosting tree models and preventing overfitting comprises the following steps.

Step 1: For the sample set D (e.g. images or advertisement-prediction data), determine the gradient boosting tree model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; the number of samples is n and the number of features is m. Define the loss function.

Step 2: Normalize the feature values xij: xij = (xij − μj)/σj.

Step 3: Initialize the predicted values to the mean of yi.

Step 4: Discretize the feature values xij to obtain all split points.

Step 5: Compute gi and hi for the input samples as defined above.

Take the input features xij as the data of the root node, and mark that node as a new node.

Step 6: For every new node, partition the data into two parts L and R according to each split point, and compute for both parts, under every feature r, the weights w1 and w2 as in steps 6.1–6.3 above. From these, compute the gain by which the objective decreases when the node's data are split.

Step 7: If the user-defined loss function is MSE, apply the special overfitting-prevention procedure to filter the gains:

Assume the per-sample contributions to w are independent and identically normally distributed; then the studentized weight satisfies w ~ t(Hk − 1).

Define the sample variance s², and the confidence bounds

upper confidence bound U = w + s/n × tα/2(Hk − 1),
lower confidence bound L = w − s/n × tα/2(Hk − 1),

and filter the gains against these bounds.

Step 8: Among the s × m gains, find the largest gain and determine the split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond at that maximum, but do not split yet.

In step 9, once the s × m gains of the newly split nodes are obtained from step 8, the model compares the gains of all nodes and selects the node with the largest gain, splitting it into two new nodes.

Step 10: Repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes. At this point one weak decision tree has been constructed.

Step 11: Split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree; the positions of leaf and non-leaf nodes in the gradient boosting tree are shown in Figure 1.

Step 12: For the data at each leaf node, update the predicted value, where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree.

Step 13: Repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.

As described above, the invention improves the gradient boosting tree by adding the model's input features xij to the loss function l and training additional weights, which to a certain extent resolves the problem that the gradient boosting tree considers only the magnitude of feature values after discretization and not their true distribution before discretization. For data whose input features xij correlate strongly with the output target yi, the model performs considerably better than an ordinary gradient boosting tree. A t-distribution method for preventing overfitting is also provided; in practical applications it finds more accurate split points and prevents overfitting.

Claims (6)

Translated from Chinese
1. A method for improving the effect of feature values in a gradient boosting tree model and preventing overfitting, characterized by comprising the following steps:

Step 1: for the sample set D, determine the gradient boosting tree model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; the number of samples is n and the number of features is m; define a loss function;

Step 2: normalize the feature values xij;

Step 3: initialize the predicted values to the mean of yi;

Step 4: discretize the feature values xij to obtain all split points; the number of split points is s;

Step 5: compute the first-order derivative gi and second-order derivative hi of each input sample;

Step 6: at the k-th leaf node, for each split point, pre-split the node's samples Dk into a left sample L and a right sample R with L ∪ R = Dk (if k is 0, D0 = D); traverse all split points and compute, for all feature values of the left sample L and right sample R, the feature-value weight w1, feature-value bias w2, and the corresponding split gain, yielding s pairs of left and right samples and s × m triples of (w1, w2, gain);

Step 7: if the user-defined loss function is the MSE function, apply the t-distribution overfitting-prevention procedure; if it is not, go directly to step 8;

Step 8: among the s × m gains, find the largest gain together with the corresponding split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond, but do not split yet;

Step 9: among all nodes, find the node with the largest gain and split it;

Step 10: repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes, at which point one weak decision tree has been constructed;

Step 11: split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree;

Step 12: for the data at each leaf node, update the predicted value of the t-th weak decision tree, where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree;

Step 13: repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.

2. The method of claim 1, characterized in that the loss function defined in step 1 is log loss or MSE.

3. The method of claim 2, characterized in that the normalization of the feature values xij in step 2 comprises:

1) xij = (xij − μj)/σj, where μj and σj are the mean and standard deviation of feature xj;

2) xij = (xij − μj)/σj, followed by xij = tanh(xij);

3) first discretizing xij, then setting xij = (xij − μj)/σj;

4) xij = (xij − μj)/σj, then removing the outliers in xij.

4. The method of claim 3, characterized in that the first-order derivative gi and second-order derivative hi in step 5 are computed as the first and second partial derivatives of the loss l with respect to the current prediction, where for t > 1 the prediction is the estimate of yi after the (t−1)-th weak decision tree, and for t = 1 it is the default estimate of yi; l is the loss function, and any loss function for which these derivatives exist can be used to train the model.

5. The method of claim 4, characterized in that step 6 specifically comprises: constructing the objective function, where r denotes the r-th feature, xir is the selected feature at the leaf node, w1(t)(·) is the feature-value weight function and w2(t)(·) the feature-value bias function of the t-th weak decision tree, Ω is a regularization function, and const is a constant; minimizing the objective Obj(t) to obtain the optimal solution (w1, w2), where k denotes the k-th leaf node and r the r-th feature; applying l1 regularization, at which point w1 and w2 are obtained for all feature values at all split points; then finding the optimal split point by traversing all split points and all features and computing the gain for each; the decision tree is built greedily, splitting only the samples Dk at one leaf node into a left sample L and a right sample R at each step, the gain being the change in the objective caused by adding the split node; first define the node scores, where C is the current node to be split and rC its already-determined selected feature, L and R are the left and right nodes split from C, rL is the set of all features of the left sample to traverse once the split point is fixed, and rR the corresponding set for the right sample; then gain = ScoreL + ScoreR − ScoreC; traversing all s split points and m feature values yields s × m candidate gain values.

6. The method of any one of claims 1 to 5, characterized in that step 7 is specifically: if the user-defined loss function is MSE, filter the gains as follows: assume w ~ t(Hk − 1); define the sample variance; set the upper confidence bound U = w + s/n × tα/2(Hk − 1) and the lower confidence bound L = w − s/n × tα/2(Hk − 1); filter the gains against these bounds, where α is the confidence level specified by the user; when a gain meets the requirement, keep its value; when it does not, set it to 0.
CN201910274219.3A | priority date 2019-04-08 | filing date 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting | Pending | CN110232448A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910274219.3A | 2019-04-08 | 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910274219.3A | 2019-04-08 | 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Publications (1)

Publication Number | Publication Date
CN110232448A | 2019-09-13

Family

ID=67860681

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910274219.3A (Pending) | CN110232448A, Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting | 2019-04-08 | 2019-04-08

Country Status (1)

Country | Link
CN | CN110232448A (en)


Cited By (17)

* Cited by examiner, † Cited by third party

Publication Number | Priority Date | Publication Date | Assignee | Title
CN110728317A* | 2019-09-30 | 2020-01-24 | 腾讯科技(深圳)有限公司 | Training method and system of decision tree model, storage medium and prediction method
CN110866528B* | 2019-10-28 | 2023-11-28 | 腾讯科技(深圳)有限公司 | A model training method, energy consumption efficiency prediction method, device and medium
CN110866528A* | 2019-10-28 | 2020-03-06 | 腾讯科技(深圳)有限公司 | Model training method, energy consumption use efficiency prediction method, device and medium
CN110990829A* | 2019-11-21 | 2020-04-10 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training GBDT model in trusted execution environment
CN111126628A* | 2019-11-21 | 2020-05-08 | 支付宝(杭州)信息技术有限公司 | Method, apparatus and device for training GBDT model in trusted execution environment
CN111157092A* | 2020-01-02 | 2020-05-15 | 深圳市汉德网络科技有限公司 | Vehicle-mounted weighing automatic calibration method and computer readable storage medium
CN111464510A* | 2020-03-18 | 2020-07-28 | 华南理工大学 | A network real-time intrusion detection method based on fast gradient boosting tree model
CN112784492A* | 2021-01-26 | 2021-05-11 | 上海黑瞳信息技术有限公司 | Automatic modeling system for machine learning
CN112836741A* | 2021-02-01 | 2021-05-25 | 深圳无域科技技术有限公司 | Crowd sketch extraction method, system, equipment and computer readable medium for coupling decision tree
CN112836741B* | 2021-02-01 | 2024-05-24 | 深圳无域科技技术有限公司 | Crowd portrayal extraction method, system, equipment and computer readable medium for coupling decision tree
CN113204857A* | 2021-03-15 | 2021-08-03 | 北京锐达芯集成电路设计有限责任公司 | Method for predicting residual life of electronic device based on extreme gradient lifting tree algorithm
CN113095390A* | 2021-04-02 | 2021-07-09 | 东北大学 | Walking stick motion analysis system and method based on cloud database and improved ensemble learning
CN113095390B* | 2021-04-02 | 2024-06-04 | 东北大学 | Cane motion analysis method based on cloud database and improved ensemble learning
CN113722739B* | 2021-09-06 | 2024-04-09 | 京东科技控股股份有限公司 | Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739A* | 2021-09-06 | 2021-11-30 | 京东科技控股股份有限公司 | Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114916913B* | 2022-05-09 | 2023-01-13 | 东北大学 | A portable sleep breathing state real-time monitoring system and method
CN114916913A* | 2022-05-09 | 2022-08-19 | 东北大学 | A portable real-time monitoring system and method of sleep breathing state

Similar Documents

Publication | Title
CN110232448A | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting
WO2022121289A1 | Methods and systems for mining minority-class data samples for training neural network
US11593611B2 | Neural network cooperation
CN114841257B | A small sample target detection method based on self-supervised contrast constraints
CN112163426A | Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
US20180046915A1 | Compression of deep neural networks with proper use of mask
CN107688849A | A kind of dynamic strategy fixed point training method and device
CN112766603B | Traffic flow prediction method, system, computer equipment and storage medium
CN110555084A | Remote supervision relation classification method based on PCNN and multi-layer attention
CN109740057A | An Augmented Neural Network and Information Recommendation Method Based on Knowledge Extraction
CN116032775B | Industrial control network anomaly detection method oriented to concept drift
CN116310542A | An Image Classification Method Based on Improved Cross-Entropy Loss Function
CN114897144A | Complex value time sequence signal prediction method based on complex value neural network
CN110263860A | A kind of freeway traffic flow prediction technique and device
CN112132096A | A Behavioral Modal Recognition Method for Randomly Configured Networks with Dynamically Updating Output Weights
CN120296148A | A method and device for searching discrete prompt words in a large language model
CN106407932A | Handwritten number recognition method based on fractional calculus and generalized inverse neural network
CN116821436B | Fuzzy query-oriented character string predicate accurate selection estimation method
CN114708185A | Target detection method, system and equipment based on big data enabling and model flow
CN117792737A | A network intrusion detection method, device, electronic equipment and storage medium
WO2024012179A1 | Model training method, target detection method and apparatuses
Putrada et al. | NoCASC: A Novel Optimized Cost-Complexity Pruning for AdaBoost Model Compression on Edge Computing-Based Smart Lighting
CN114936598A | Cross-domain small sample learning method, learning system, electronic device and storage medium
CN115597867A | A Bearing Fault Diagnosis Method Based on Squirrel Algorithm Optimizing Neural Network
CN114385805B | Small sample learning method for improving adaptability of deep text matching model

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-09-13
