CN110232448A - Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting - Google Patents

Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting
Download PDF

Info

Publication number
CN110232448A
Authority
CN
China
Prior art keywords
split
eigenvalue
feature
gain
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910274219.3A
Other languages
Chinese (zh)
Inventor
杨萃
黄晓鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN201910274219.3A
Publication of CN110232448A
Legal status: Pending


Abstract

Translated from Chinese

The invention discloses a method for improving the effect of feature values in gradient boosting tree models and preventing overfitting. By adding the pre-discretization feature values to the loss function, the method obtains the optimal split point together with a feature-value weight and bias, thereby exploiting as much of the pre-discretization data as possible. For data whose input features correlate strongly with the output target, the model performs considerably better than an ordinary gradient boosting tree. The invention also provides a t-distribution method for preventing overfitting that screens split points via the law of large numbers; in practice this finds more accurate split points and prevents overfitting. The invention addresses the problem that gradient boosting decision tree models consider only the magnitude of feature values after discretization, ignoring the true distribution of the values before discretization, as well as the overfitting problem. The invention can be widely applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields.

Description

Translated from Chinese

Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Technical Field

The invention relates to machine learning models, and in particular to a method that addresses the insensitivity of gradient boosting tree models to the numerical values of features, together with a new method for preventing model overfitting.

Background

With the rapid development of big data, data mining technology has been widely applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields. The gradient boosting tree algorithm has certain advantages over other machine learning models: it trains quickly, and the importance of and interactions between features can be analyzed from a trained model to extract new features.

However, existing gradient boosting tree algorithms such as XGBoost and LightGBM share a fundamental limitation: they consider only the magnitude of feature values after discretization, not the true distribution of the values before discretization. When building a tree, the model first partitions the (continuous) feature values into discrete bins and then searches for split points among those bins, so part of the information in the data is lost at discretization time. For example, given the feature values 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 1.6, 1.7, 1.8 and two split points, the split points found may be 0.45 and 0.55, and the feature values discretize to 0, 0, 0, 0, 1, 1, 1, 2, 2, 2. Throughout this process the gradient boosting tree cares only about the discretized values and ignores the true distribution of the data before discretization.
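As a concrete illustration of the binning just described, the following minimal sketch reproduces the example with NumPy; the split points 0.45 and 0.55 are the ones named in the text.

```python
import numpy as np

# Feature values before discretization (the example above)
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 1.6, 1.7, 1.8])

# The two split points found by the binning procedure
split_points = np.array([0.45, 0.55])

# np.digitize assigns each value the index of its bin; everything about the
# within-bin distribution (0.5 appearing three times, the gap before 1.6) is lost
bins = np.digitize(x, split_points)
print(bins)  # [0 0 0 0 1 1 1 2 2 2]
```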

The improved model described here also discretizes the data, but after discretization it goes on to use as much of the pre-discretization data as possible, recovering as much of the lost information as it can.

Summary of the Invention

The purpose of the present invention is to overcome the above shortcomings of the prior art by providing a method for improving the effect of feature values in gradient boosting tree models and preventing overfitting.

The technical solution adopted by the present invention to solve the above problems is as follows.

A gradient boosting tree model that improves the effect of feature values and prevents overfitting, comprising the following steps:

Step 1: For the sample set D, determine the model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; let the number of samples be n and the number of features be m. Define a loss function; it may be chosen as log loss or MSE, but is not limited to these.

Step 2: Normalize the feature values xij.

Step 3: Initialize the predicted values to the mean of yi, i.e. ŷ(0) = (1/n) Σi yi.

Step 4: Discretize the feature values xij to obtain all split points; the number of split points is s.

Step 5: Compute the first-order derivative gi and second-order derivative hi of the loss for each input sample.

Step 6: At the k-th leaf node (if k is 0, D0 = D), for each split point, the node's samples Dk are pre-split into a left sample L and a right sample R with L ∪ R = Dk. Traverse all split points and compute, for all feature values of the left sample L and right sample R, the feature-value weight w1, feature-value bias w2, and the corresponding split gain. This yields s pairs of left and right samples, and s × m triples of (w1, w2, gain).

Step 7: If the user-defined loss function is the MSE function, i.e. l = (ŷi − yi)²/2, apply the t-distribution overfitting-prevention procedure; if it is not, go directly to step 8.

Step 8: Among the s × m gains, find the largest gain and record the corresponding split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond, but do not split yet.

Step 9: Among all nodes, find the node with the largest gain and split it.

Step 10: Repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes. At this point one weak decision tree has been constructed.

Step 11: Split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree; the positions of leaf and non-leaf nodes in the gradient boosting tree are shown in Figure 1.

Step 12: For the data at each leaf node, update the predicted value: ŷi(t) = ŷi(t−1) + η(w1 xir + w2), where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree.

Step 13: Repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.
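The outer loop of steps 5–13 can be summarized in the sketch below. This is a simplification under stated assumptions: the loss is MSE, and each leaf outputs the usual constant −G/(H+λ) as in an ordinary gradient boosting tree, whereas the patent fits a per-leaf pair (w1, w2) and outputs w1·xir + w2 for a selected feature r; all names and defaults are illustrative.

```python
import numpy as np

def score(g, h, mask, lam=1.0):
    # Leaf quality (higher is better): the standard second-order score, used
    # here as a stand-in for the patent's Score, whose closed form involves
    # the fitted pair (w1, w2)
    G, H = g[mask].sum(), h[mask].sum()
    return G * G / (2.0 * (H + lam))

def grow_tree(X, g, h, n_leaves=4):
    n, m = X.shape
    leaves = [np.ones(n, dtype=bool)]            # root holds all samples (step 6, k = 0)
    while len(leaves) < n_leaves:                # steps 6-10: grow greedily
        best = None                              # (gain, leaf index, left mask, right mask)
        for li, mask in enumerate(leaves):
            for j in range(m):
                for thr in np.unique(X[mask, j])[:-1]:   # candidate split points (step 4)
                    L = mask & (X[:, j] <= thr)
                    R = mask & (X[:, j] > thr)
                    gain = score(g, h, L) + score(g, h, R) - score(g, h, mask)
                    if best is None or gain > best[0]:
                        best = (gain, li, L, R)
        if best is None:                         # no admissible split anywhere
            break
        _, li, L, R = best
        leaves[li:li + 1] = [L, R]               # step 9: split the best node
    return leaves

def train(X, y, n_trees=10, lr=0.1, lam=1.0):
    y_hat = np.full(len(y), y.mean())            # step 3: initialize to the mean
    for _ in range(n_trees):                     # step 13: boosting rounds
        g, h = y_hat - y, np.ones(len(y))        # step 5: MSE derivatives (h_i = 1)
        for mask in grow_tree(X, g, h):
            w = -g[mask].sum() / (h[mask].sum() + lam)   # constant leaf output (stand-in)
            y_hat[mask] += lr * w                # step 12: update predictions
    return y_hat

# Illustrative usage:
# rng = np.random.default_rng(0)
# X = rng.random((200, 3)); y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 200)
# y_hat = train(X, y)
```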

Preferably, the normalization of the feature values xij in step 2 includes the following options:

1) xij = (xij − μj)/σj, where μj and σj are the mean and standard deviation of feature xj.

2) xij = (xij − μj)/σj as in 1), followed by xij = tanh(xij).

3) First discretize xij, then set xij = (xij − μj)/σj.

4) xij = (xij − μj)/σj, then remove the outliers in xij.

Options 2–4 are intended to avoid instability caused by outliers.
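A sketch of the four options follows, assuming equal-frequency binning for option 3 and clipping as one reading of "remove outliers" in option 4 (both are assumptions; the patent leaves these details open, and n_bins and clip are illustrative values):

```python
import numpy as np

def standardize(X):
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard constant features
    return (X - mu) / sigma

def normalize(X, method=1, n_bins=10, clip=3.0):
    # Options 1-4 of step 2
    X = np.asarray(X, dtype=float)
    if method == 3:
        # 3) discretize first (equal-frequency bins), then standardize
        qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
        edges = np.quantile(X, qs, axis=0)                 # shape (n_bins - 1, m)
        X = np.stack([np.digitize(X[:, j], np.unique(edges[:, j]))
                      for j in range(X.shape[1])], axis=1).astype(float)
    Z = standardize(X)                                     # 1) z-score
    if method == 2:
        Z = np.tanh(Z)                                     # 2) squash extremes into (-1, 1)
    elif method == 4:
        Z = np.clip(Z, -clip, clip)                        # 4) curb outliers
    return Z
```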

Preferably, the first-order derivative gi and second-order derivative hi of the input samples in step 5 are computed as follows:

gi = ∂l(yi, ŷi(t−1))/∂ŷi(t−1) and hi = ∂²l(yi, ŷi(t−1))/∂(ŷi(t−1))², where for t > 1, ŷi(t−1) denotes the estimate of yi after the (t−1)-th weak decision tree, and for t = 1 it denotes the default estimate of yi, i.e. the mean of yi.

Here l is the loss function; any loss function for which these first- and second-order derivatives exist can be used to train the model.
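For the two losses the patent names, the derivatives take the familiar forms below (standard results, not specific to the patent); note that for MSE hi is identically 1, which is what the t-distribution step relies on.

```python
import numpy as np

def grad_hess_mse(y, y_hat):
    # l = (y_hat - y)^2 / 2  =>  g = y_hat - y, h = 1
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logloss(y, y_hat):
    # y in {0, 1}; y_hat is a raw score squashed by a sigmoid
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```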

Preferably, in step 6, the feature-value weight w1, feature-value bias w2, and corresponding split gain are computed as follows:

Step 6.1: Construct the objective function of the algorithm,

Obj(t) = Σi l(yi, ŷi(t−1) + w1(t)(xi,p(xi)) · xi,p(xi) + w2(t)(xi,p(xi))) + Ω(ft) + const,

where p(xi) indicates that the selected feature of sample xi is the p(xi)-th feature; xir is the selected feature at the leaf node; w1(t)(·) is the feature-value weight function and w2(t)(·) the feature-value bias function of the t-th weak decision tree; Ω is a regularization function, and const is a constant.

The goal of training is to minimize the objective function Obj(t).

A second-order Taylor expansion of l around ŷi(t−1) gives

Obj(t) ≈ Σi [ gi(w1 xir + w2) + ½ hi(w1 xir + w2)² ] + γ1|w1| + γ2|w2| + const,   (Equation 1)

where r denotes the r-th feature at the k-th node, i the i-th sample, and γ1, γ2 are regularization coefficients specified by the user.

Note: q(xir) indicates that, when the selected feature is the r-th feature, sample xi is mapped to the q(xir)-th leaf node.

Step 6.2: Reduce the objective to the problem of minimizing a quadratic function of the two variables w1 and w2.

Minimizing Equation 1 yields the optimal solution (w1, w2): setting the partial derivatives with respect to w1 and w2 to zero gives the linear system

w1 Σi hi xir² + w2 Σi hi xir = −Σi gi xir,
w1 Σi hi xir + w2 Σi hi = −Σi gi,

whose solution is the minimizer.

Step 6.3: Applying l1 regularization to this solution gives the final w1 and w2 for all feature values at all split points; what remains undetermined is which split points and which feature values to select. The optimal split point is then found by exhaustively traversing all split points and all features: for each split point and each feature, compute the gain. The decision tree is built greedily, splitting only the samples Dk at one leaf node into a left sample L and a right sample R at each step; the gain is the change in the objective caused by adding the split node. First define the node scores ScoreC, ScoreL and ScoreR,

where C is the current node to be split and rC its already-determined selected feature; L and R are the left and right nodes split from C; rL is the set of all features of the left sample to traverse once the split point is fixed, and rR the corresponding set for the right sample. Then

gain = ScoreL + ScoreR − ScoreC.

By traversing all s split points and m feature values, s × m candidate gain values are obtained.
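The per-leaf optimization and the gain can be sketched as follows. The closed form for (w1, w2) is reconstructed from the quadratic objective above by solving the two normal equations; the patent additionally applies l1 regularization with γ1, γ2, which this sketch omits, and its exact Score expression is not reproduced in the source, so the score here is simply the decrease of the quadratic objective at the fitted (w1, w2).

```python
import numpy as np

def fit_leaf(x, g, h, eps=1e-12):
    # Minimize sum_i [ g_i*u_i + 0.5*h_i*u_i^2 ] with u_i = w1*x_i + w2:
    # setting the two partial derivatives to zero gives a 2x2 linear system
    A = np.array([[np.sum(h * x * x), np.sum(h * x)],
                  [np.sum(h * x),     np.sum(h)]])
    b = -np.array([np.sum(g * x), np.sum(g)])
    w1, w2 = np.linalg.solve(A + eps * np.eye(2), b)   # eps guards singular A
    return w1, w2

def leaf_score(x, g, h):
    # Objective decrease achieved by the fitted (w1, w2) at this leaf
    w1, w2 = fit_leaf(x, g, h)
    u = w1 * x + w2
    return -(np.sum(g * u) + 0.5 * np.sum(h * u * u))

def split_gain(x, g, h, left):
    # gain = Score_L + Score_R - Score_C, as in the text
    return (leaf_score(x[left], g[left], h[left])
            + leaf_score(x[~left], g[~left], h[~left])
            - leaf_score(x, g, h))
```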

Preferably, the t-distribution overfitting-prevention method of step 7 is as follows: when the user-defined loss function is MSE, hi = 1 and Hk = Nk, the number of samples at the node. For the samples at a node, the mathematical expectation of the leaf weight w, viewed as a random variable, is approximately equal to its sample mean; assuming w follows a t distribution, compute the confidence interval of w and judge whether the candidate weight falls outside it.

The specific procedure is: if the user-defined loss function is MSE, filter the gains as follows.

Assume the per-sample contributions to w are independent and identically normally distributed; then the studentized weight satisfies w ~ t(Hk − 1).

Define the sample variance s² of the per-sample contributions, and the confidence bounds

upper confidence bound U = w + s/n × tα/2(Hk − 1),
lower confidence bound L = w − s/n × tα/2(Hk − 1).

Filter the gains against these bounds, where α is the confidence level, specified by the user: when a gain meets the requirement, its value is kept; when it does not, it is set to 0.
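A sketch of the screening step with SciPy. Here w_values stands for the per-sample quantities whose mean forms the leaf weight (an illustrative name); the source gives the bounds as w ± s/n × tα/2(Hk − 1), with s/n rather than the usual s/√n, and that is what is coded.

```python
import numpy as np
from scipy import stats

def t_confidence_bounds(w_values, alpha=0.05):
    # Two-sided t interval around the leaf weight w (the sample mean),
    # with H_k - 1 = n - 1 degrees of freedom as in step 7
    n = len(w_values)
    w = np.mean(w_values)
    s = np.std(w_values, ddof=1)                  # sample standard deviation
    half = s / n * stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)
    return w - half, w + half

def filter_gain(gain, w_candidate, w_values, alpha=0.05):
    # Keep the gain only if the candidate weight stays inside the interval;
    # otherwise zero it out, as the text prescribes
    lower, upper = t_confidence_bounds(w_values, alpha)
    return gain if lower <= w_candidate <= upper else 0.0
```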

Preferably, step 8: among the s × m gains, find the largest gain and determine the split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond at that maximum, but do not split yet.

Preferably, in step 9, once the s × m gains of the newly split nodes are obtained from step 8, the model compares the gains of all nodes and selects the node with the largest gain for splitting.

Compared with the prior art, the present invention has the following advantages and effects: 1) The invention improves the gradient boosting tree by adding the model's input features xij to the loss function l and training additional weights, which to a certain extent resolves the problem that the gradient boosting tree considers only the magnitude of feature values after discretization and not their true distribution before discretization. For data whose input features xij correlate strongly with the output target yi, the model performs considerably better than an ordinary gradient boosting tree.

2) The invention also provides a t-distribution method for preventing overfitting; in practical applications it finds more accurate split points and prevents overfitting.

Brief Description of the Drawings

Figure 1 shows the positions of leaf nodes and non-leaf nodes in the decision tree.

Figure 2 is a flowchart of the gradient boosting tree model of the present invention.

Detailed Description

The present invention is further described below with reference to the drawings, but the claimed scope is not limited to the described embodiments. Any process or notation not specified in detail below can be understood or implemented by those skilled in the art from the prior art.

The present invention can be applied to advertisement prediction, artificial intelligence, image recognition, speech recognition, and other fields. The gradient boosting tree algorithm has certain advantages over other machine learning models: it trains quickly, and the importance of and interactions between features can be analyzed from a trained model to extract new features.

As shown in Figure 2, the method provided by this embodiment for improving the effect of feature values in gradient boosting tree models and preventing overfitting comprises the following steps.

Step 1: For the sample set D (e.g. images or advertisement-prediction data), determine the gradient boosting tree model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; the number of samples is n and the number of features is m. Define the loss function.

Step 2: Normalize the feature values xij: xij = (xij − μj)/σj.

Step 3: Initialize the predicted values to the mean of yi.

Step 4: Discretize the feature values xij to obtain all split points.

Step 5: Compute gi and hi for the input samples as defined above.

Take the input features xij as the data of the root node, and mark that node as a new node.

Step 6: For every new node, partition the data into two parts L and R according to each split point, and compute for both parts, under every feature r, the weights w1 and w2 as in steps 6.1–6.3 above. From these, compute the gain by which the objective decreases when the node's data are split.

Step 7: If the user-defined loss function is MSE, apply the special overfitting-prevention procedure to filter the gains:

Assume the per-sample contributions to w are independent and identically normally distributed; then the studentized weight satisfies w ~ t(Hk − 1).

Define the sample variance s², and the confidence bounds

upper confidence bound U = w + s/n × tα/2(Hk − 1),
lower confidence bound L = w − s/n × tα/2(Hk − 1),

and filter the gains against these bounds.

Step 8: Among the s × m gains, find the largest gain and determine the split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond at that maximum, but do not split yet.

In step 9, once the s × m gains of the newly split nodes are obtained from step 8, the model compares the gains of all nodes and selects the node with the largest gain, splitting it into two new nodes.

Step 10: Repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes. At this point one weak decision tree has been constructed.

Step 11: Split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree; the positions of leaf and non-leaf nodes in the gradient boosting tree are shown in Figure 1.

Step 12: For the data at each leaf node, update the predicted value, where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree.

Step 13: Repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.

As described above, the invention improves the gradient boosting tree by adding the model's input features xij to the loss function l and training additional weights, which to a certain extent resolves the problem that the gradient boosting tree considers only the magnitude of feature values after discretization and not their true distribution before discretization. For data whose input features xij correlate strongly with the output target yi, the model performs considerably better than an ordinary gradient boosting tree. A t-distribution method for preventing overfitting is also provided; in practical applications it finds more accurate split points and prevents overfitting.

Claims (6)

Translated from Chinese
1. A method for improving the effect of feature values in a gradient boosting tree model and preventing overfitting, characterized by comprising the following steps:

Step 1: for the sample set D, determine the gradient boosting tree model's input features xij and output variable yi, where i denotes the i-th sample and j the j-th feature; the number of samples is n and the number of features is m; define a loss function;

Step 2: normalize the feature values xij;

Step 3: initialize the predicted values to the mean of yi;

Step 4: discretize the feature values xij to obtain all split points; the number of split points is s;

Step 5: compute the first-order derivative gi and second-order derivative hi of each input sample;

Step 6: at the k-th leaf node, for each split point, pre-split the node's samples Dk into a left sample L and a right sample R with L ∪ R = Dk (if k is 0, D0 = D); traverse all split points and compute, for all feature values of the left sample L and right sample R, the feature-value weight w1, feature-value bias w2, and the corresponding split gain, yielding s pairs of left and right samples and s × m triples of (w1, w2, gain);

Step 7: if the user-defined loss function is the MSE function, apply the t-distribution overfitting-prevention procedure; if it is not, go directly to step 8;

Step 8: among the s × m gains, find the largest gain together with the corresponding split point, feature-value weight w1, feature-value bias w2, and the selected feature r to which w1 and w2 correspond, but do not split yet;

Step 9: among all nodes, find the node with the largest gain and split it;

Step 10: repeat steps 6–10 for the two newly split nodes until the number of leaf nodes exceeds the user-specified number of leaf nodes, at which point one weak decision tree has been constructed;

Step 11: split the data set onto the leaf nodes according to the split points at the non-leaf nodes of the tree;

Step 12: for the data at each leaf node, update the predicted value of the t-th weak decision tree, where η is the learning rate, i denotes the i-th sample, k the k-th leaf node, r the r-th feature, and t the t-th weak decision tree;

Step 13: repeat steps 5–13 until the number of weak decision trees reaches the user-specified total or the validation-set accuracy no longer improves.

2. The method of claim 1, characterized in that the loss function defined in step 1 is log loss or MSE.

3. The method of claim 2, characterized in that the normalization of the feature values xij in step 2 comprises:

1) xij = (xij − μj)/σj, where μj and σj are the mean and standard deviation of feature xj;

2) xij = (xij − μj)/σj, followed by xij = tanh(xij);

3) first discretizing xij, then setting xij = (xij − μj)/σj;

4) xij = (xij − μj)/σj, then removing the outliers in xij.

4. The method of claim 3, characterized in that the first-order derivative gi and second-order derivative hi in step 5 are computed as the first and second partial derivatives of the loss l with respect to the current prediction, where for t > 1 the prediction is the estimate of yi after the (t−1)-th weak decision tree, and for t = 1 it is the default estimate of yi; l is the loss function, and any loss function for which these derivatives exist can be used to train the model.

5. The method of claim 4, characterized in that step 6 specifically comprises: constructing the objective function, where r denotes the r-th feature, xir is the selected feature at the leaf node, w1(t)(·) is the feature-value weight function and w2(t)(·) the feature-value bias function of the t-th weak decision tree, Ω is a regularization function, and const is a constant; minimizing the objective Obj(t) to obtain the optimal solution (w1, w2), where k denotes the k-th leaf node and r the r-th feature; applying l1 regularization, at which point w1 and w2 are obtained for all feature values at all split points; then finding the optimal split point by traversing all split points and all features and computing the gain for each; the decision tree is built greedily, splitting only the samples Dk at one leaf node into a left sample L and a right sample R at each step, the gain being the change in the objective caused by adding the split node; first define the node scores, where C is the current node to be split and rC its already-determined selected feature, L and R are the left and right nodes split from C, rL is the set of all features of the left sample to traverse once the split point is fixed, and rR the corresponding set for the right sample; then gain = ScoreL + ScoreR − ScoreC; traversing all s split points and m feature values yields s × m candidate gain values.

6. The method of any one of claims 1 to 5, characterized in that step 7 is specifically: if the user-defined loss function is MSE, filter the gains as follows: assume w ~ t(Hk − 1); define the sample variance; set the upper confidence bound U = w + s/n × tα/2(Hk − 1) and the lower confidence bound L = w − s/n × tα/2(Hk − 1); filter the gains against these bounds, where α is the confidence level specified by the user; when a gain meets the requirement, keep its value; when it does not, set it to 0.
CN201910274219.3A | priority date 2019-04-08 | filing date 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting | Pending | CN110232448A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910274219.3A | 2019-04-08 | 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910274219.3A | 2019-04-08 | 2019-04-08 | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting

Publications (1)

Publication Number | Publication Date
CN110232448A | 2019-09-13

Family

ID=67860681

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910274219.3A (Pending) | CN110232448A, Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting | 2019-04-08 | 2019-04-08

Country Status (1)

Country | Link
CN | CN110232448A (en)


Cited By (17)

* Cited by examiner, † Cited by third party

Publication Number | Priority Date | Publication Date | Assignee | Title
CN110728317A* | 2019-09-30 | 2020-01-24 | 腾讯科技(深圳)有限公司 | Training method and system of decision tree model, storage medium and prediction method
CN110866528B* | 2019-10-28 | 2023-11-28 | 腾讯科技(深圳)有限公司 | A model training method, energy consumption efficiency prediction method, device and medium
CN110866528A* | 2019-10-28 | 2020-03-06 | 腾讯科技(深圳)有限公司 | Model training method, energy consumption use efficiency prediction method, device and medium
CN110990829A* | 2019-11-21 | 2020-04-10 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training GBDT model in trusted execution environment
CN111126628A* | 2019-11-21 | 2020-05-08 | 支付宝(杭州)信息技术有限公司 | Method, apparatus and device for training GBDT model in trusted execution environment
CN111157092A* | 2020-01-02 | 2020-05-15 | 深圳市汉德网络科技有限公司 | Vehicle-mounted weighing automatic calibration method and computer readable storage medium
CN111464510A* | 2020-03-18 | 2020-07-28 | 华南理工大学 | A network real-time intrusion detection method based on fast gradient boosting tree model
CN112784492A* | 2021-01-26 | 2021-05-11 | 上海黑瞳信息技术有限公司 | Automatic modeling system for machine learning
CN112836741A* | 2021-02-01 | 2021-05-25 | 深圳无域科技技术有限公司 | Crowd sketch extraction method, system, equipment and computer readable medium for coupling decision tree
CN112836741B* | 2021-02-01 | 2024-05-24 | 深圳无域科技技术有限公司 | Crowd portrayal extraction method, system, equipment and computer readable medium for coupling decision tree
CN113204857A* | 2021-03-15 | 2021-08-03 | 北京锐达芯集成电路设计有限责任公司 | Method for predicting residual life of electronic device based on extreme gradient lifting tree algorithm
CN113095390A* | 2021-04-02 | 2021-07-09 | 东北大学 | Walking stick motion analysis system and method based on cloud database and improved ensemble learning
CN113095390B* | 2021-04-02 | 2024-06-04 | 东北大学 | Cane motion analysis method based on cloud database and improved ensemble learning
CN113722739B* | 2021-09-06 | 2024-04-09 | 京东科技控股股份有限公司 | Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739A* | 2021-09-06 | 2021-11-30 | 京东科技控股股份有限公司 | Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114916913B* | 2022-05-09 | 2023-01-13 | 东北大学 | A portable sleep breathing state real-time monitoring system and method
CN114916913A* | 2022-05-09 | 2022-08-19 | 东北大学 | A portable real-time monitoring system and method of sleep breathing state

Similar Documents

Publication | Title
CN110232448A | Method for improving the effect of feature values in gradient boosting tree models and preventing overfitting
WO2022121289A1 | Methods and systems for mining minority-class data samples for training neural network
US11593611B2 | Neural network cooperation
CN114841257B | A small sample target detection method based on self-supervised contrast constraints
CN112163426A | Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
US20180046915A1 | Compression of deep neural networks with proper use of mask
CN107688849A | A kind of dynamic strategy fixed point training method and device
CN112766603B | Traffic flow prediction method, system, computer equipment and storage medium
CN110555084A | Remote supervision relation classification method based on PCNN and multi-layer attention
CN109740057A | An Augmented Neural Network and Information Recommendation Method Based on Knowledge Extraction
CN116032775B | Industrial control network anomaly detection method oriented to concept drift
CN116310542A | An Image Classification Method Based on Improved Cross-Entropy Loss Function
CN114897144A | Complex value time sequence signal prediction method based on complex value neural network
CN110263860A | A kind of freeway traffic flow prediction technique and device
CN112132096A | A Behavioral Modal Recognition Method for Randomly Configured Networks with Dynamically Updating Output Weights
CN120296148A | A method and device for searching discrete prompt words in a large language model
CN106407932A | Handwritten number recognition method based on fractional calculus and generalized inverse neural network
CN116821436B | Fuzzy query-oriented character string predicate accurate selection estimation method
CN114708185A | Target detection method, system and equipment based on big data enabling and model flow
CN117792737A | A network intrusion detection method, device, electronic equipment and storage medium
WO2024012179A1 | Model training method, target detection method and apparatuses
Putrada et al. | NoCASC: A Novel Optimized Cost-Complexity Pruning for AdaBoost Model Compression on Edge Computing-Based Smart Lighting
CN114936598A | Cross-domain small sample learning method, learning system, electronic device and storage medium
CN115597867A | A Bearing Fault Diagnosis Method Based on Squirrel Algorithm Optimizing Neural Network
CN114385805B | Small sample learning method for improving adaptability of deep text matching model

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-09-13
