
Technical Field
The present invention relates to a method, a system, and a storage medium for evaluating the importance of machine learning hyperparameters.
Background
Machine learning provides important technical support for data processing and data classification. However, model selection and hyperparameter tuning remain two major difficulties for users, which has led to the emergence of automated machine learning systems. Such systems use automated machine learning algorithms to automate data preprocessing, algorithm selection, and hyperparameter tuning, improving the accuracy of data classification and prediction while freeing users from the burdensome tasks of choosing algorithms and repeatedly tuning parameters.
Because the core of automated machine learning is automated algorithm selection and automated hyperparameter configuration, such a system reduces the machine learning process to the Combined Algorithm Selection and Hyper-parameter optimization (CASH) problem. The CASH problem treats the choice of algorithm as a new root-level hyperparameter, thereby mapping the problem of selecting an algorithm and its hyperparameter values onto the problem of selecting hyperparameter values alone. By likewise treating data preprocessing and feature selection techniques as hyperparameters, the system can select them automatically. The resulting hyperparameter optimization problem can be solved with the classical Bayesian optimization algorithm, which improves the accuracy of data classification and prediction.
However, in current automated machine learning systems, the hyperparameter configuration module is configured entirely by experience, or the configurations of several hyperparameters are adjusted one by one based on the final results of repeated iterations. This has the following drawbacks: it wastes machine learning time, the repeated iterations waste computing resources, and adjusting all hyperparameters without regard to their importance wastes the user's time and effort.
Summary of the Invention
The present invention provides a method, a system, and a storage medium for evaluating the importance of machine learning hyperparameters. The technical problem to be solved is how to accurately evaluate the importance of the hyperparameters of a machine learning algorithm, and how to use that evaluation to guide automated hyperparameter configuration and to make hyperparameter configuration more interpretable.
As a first aspect of the present invention:
A method for evaluating the importance of machine learning hyperparameters, comprising:
Step (1): obtain from the open machine learning environment OpenML a number of new datasets of a type similar to that of the target dataset, and extract a meta-feature vector from each new dataset, so that each new dataset is represented by its meta-feature vector;
collect from OpenML data on the performance of the classification algorithm to be evaluated under different hyperparameter configurations;
store the meta-feature vector of each new dataset, together with the performance data corresponding to the different hyperparameter configurations, in a corresponding historical dataset;
Step (2): extract the meta-feature vector of the target dataset to represent it, compute the distance between the meta-feature vector of the target dataset and that of each historical dataset, and obtain the sequence of historical datasets ordered from nearest to farthest from the target dataset;
Step (3): execute the Relief-Cluster algorithm in turn on the f historical datasets nearest to the target dataset: obtain the weight of each class of hyperparameter with the Relief algorithm, compute the average weight of each class of hyperparameter, and use the average weights to obtain a preliminary ranking of hyperparameter importance; use a clustering algorithm to further verify the accuracy of the importance evaluation; finally, obtain the hyperparameter importance ranking of the classification algorithm to be evaluated.
The method for evaluating the importance of machine learning hyperparameters further comprises:
Step (4): according to the obtained hyperparameter importance ranking of the classification algorithm to be evaluated, set the several highest-ranked hyperparameters, and then use the classification algorithm with those hyperparameters set to classify the data to be classified.
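As a minimal sketch of the averaging in step (3) and the selection in step (4), the snippet below averages per-dataset weights and keeps the highest-ranked hyperparameters for tuning. The function names, the hyperparameter names (`C`, `gamma`, `tol`), and every weight value are hypothetical and chosen only for illustration.

```python
from statistics import mean


def rank_hyperparameters(weights_per_dataset):
    """Average each hyperparameter's weight over the f nearest datasets
    and return (names sorted by descending average weight, averages)."""
    names = weights_per_dataset[0].keys()
    avg = {h: mean(w[h] for w in weights_per_dataset) for h in names}
    ranking = sorted(avg, key=avg.get, reverse=True)
    return ranking, avg


def select_for_tuning(ranking, k):
    """Keep only the k most important hyperparameters for tuning."""
    return ranking[:k]


# Hypothetical Relief weights obtained on f = 3 historical datasets.
weights = [
    {"C": 0.42, "gamma": 0.31, "tol": -0.05},
    {"C": 0.38, "gamma": 0.45, "tol": 0.01},
    {"C": 0.50, "gamma": 0.20, "tol": -0.02},
]
ranking, avg = rank_hyperparameters(weights)
top = select_for_tuning(ranking, k=2)
print(ranking, top)
```

Under these assumed values, only the two top-ranked hyperparameters would be searched, while the remaining one could be left at its default.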
In step (1), each dataset $D_i$ is described by a vector $V_i$ of $F$ meta-features.
In step (1), the meta-features include simple meta-features, statistical meta-features of the dataset, and importance meta-features;
the simple meta-features include the number of samples, the number of features, the number of classes, or the number of missing values in the dataset;
the statistical meta-features of the dataset include the mean, the variance, or the kurtosis of the distance vector;
the importance meta-features include the performance obtained by running machine learning algorithms on the dataset.
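The meta-feature vector can be illustrated with a short sketch. It computes the four simple meta-features and two statistical ones (mean and variance); pooling all feature values into a single mean and variance is a simplifying assumption, and the kurtosis and importance meta-features are omitted. The function name `meta_features` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def meta_features(X, y):
    """Return a small meta-feature vector for a dataset.

    X is a list of rows (None marks a missing value), y the class labels.
    """
    n_samples = len(X)
    n_features = len(X[0])
    n_classes = len(set(y))
    n_missing = sum(v is None for row in X for v in row)
    # Assumption: statistics are pooled over all observed feature values.
    observed = [v for row in X for v in row if v is not None]
    return [
        float(n_samples),
        float(n_features),
        float(n_classes),
        float(n_missing),
        mean(observed),
        pvariance(observed),
    ]


# Toy dataset: 3 samples, 2 features, one missing value.
X = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
y = [0, 1, 0]
v = meta_features(X, y)
print(v)
```

In practice the components would usually be rescaled to comparable ranges before any distance is computed, since sample counts and variances live on very different scales.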
In step (1), the performance of the classification algorithm to be evaluated under different hyperparameter configurations includes the misclassification rate or the RMSE;
In addition, for many common algorithms, OpenML already contains very comprehensive performance data covering different hyperparameter configurations on a wide range of datasets; that is, for each dataset $D_i$, the hyperparameter configurations $\theta_i$ of the classification algorithm to be evaluated and the corresponding performance values $y_i$ are collected.
For the target dataset $D_{N'}$, the meta-feature vector $V_{N'}$ is extracted to represent it. Based on the principle that dissimilar datasets also differ in the hyperparameter configurations that suit an algorithm, the distances between meta-feature vectors are used to obtain the sequence of distances between the target dataset and the historical datasets. For the f historical datasets nearest to the target dataset, the algorithm's performance data under different hyperparameter configurations are used to evaluate hyperparameter importance;
The distance $d_{pn}(D_{N'}, D_i)$ between the target dataset $D_{N'}$ and a historical dataset $D_i$ is measured by the distance between their meta-feature vectors:
$d_{pn}(D_{N'}, D_i) = \|V_{N'} - V_i\|_{pn}$
where $V_{N'}$ is the meta-feature vector of dataset $D_{N'}$, $V_i$ is the meta-feature vector of historical dataset $D_i$, and $pn$ denotes the p-norm.
Comparing these distances yields an ordering $\pi(1), \dots, \pi(N)$ of the historical datasets from nearest to farthest from the target dataset, where $d_{pn}(D_{N'}, D_{\pi(1)}) \le d_{pn}(D_{N'}, D_{\pi(2)}) \le \dots \le d_{pn}(D_{N'}, D_{\pi(N)})$.
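The distance computation and the nearest-first ordering can be sketched as follows. The dataset names `D1` to `D3`, the three-component meta-feature vectors, and the function names are hypothetical; the choice p = 2 is only an example, since the text leaves p open.

```python
def p_norm_distance(v, w, p=2):
    """p-norm of the difference between two meta-feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1.0 / p)


def nearest_datasets(target_vec, history, f, p=2):
    """Return the names of the f historical datasets nearest to the target.

    history maps a dataset name to its meta-feature vector.
    """
    ranked = sorted(
        history,
        key=lambda name: p_norm_distance(target_vec, history[name], p),
    )
    return ranked[:f]


# Hypothetical meta-feature vectors of three historical datasets.
history = {
    "D1": [3.0, 2.0, 0.5],
    "D2": [10.0, 8.0, 0.1],
    "D3": [4.0, 2.0, 0.4],
}
target = [4.0, 3.0, 0.5]
nearest = nearest_datasets(target, history, f=2)
print(nearest)
```

Only the datasets returned here would be passed on to the importance evaluation; the rest are treated as too dissimilar to be informative.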
Following the ordering $\pi(1), \dots, \pi(N)$ of the historical datasets from nearest to farthest from the target dataset, the Relief-Cluster algorithm is executed in turn on the f historical datasets nearest to the target dataset. First, the average weight of each class of hyperparameter obtained by the Relief algorithm gives a preliminary evaluation of hyperparameter importance; then the r(C) index of the clustering algorithm is used to further verify the accuracy of that evaluation. These two steps are repeated m times, and the importance evaluation corresponding to the largest r(C) is selected. The resulting hyperparameter importance ranking of the classification algorithm to be evaluated is then used to guide automated hyperparameter tuning of that algorithm on the target dataset.
Obtaining the weight of each class of hyperparameter with the Relief algorithm comprises:
A threshold is set according to the magnitude of the performance values under the different hyperparameter configurations, and the performance data for the different hyperparameter configurations in the historical dataset are divided into high-performance samples and low-performance samples. The Relief algorithm first randomly selects a sample $s_i$ from the performance data, and then selects, from the high-performance samples and from the low-performance samples respectively, the sample nearest to $s_i$;
The nearest sample $s_j$ of the same class as $s_i$ is denoted $M$, and the nearest sample $s_j$ of a different class from $s_i$ is denoted $Q$. The weight $w_h$ of each class of hyperparameter $h$ is updated according to formula (1):
$w_h = w_h - \mathrm{diff}(h, s_i, M)/rt + \mathrm{diff}(h, s_i, Q)/rt$ (1)
$\mathrm{diff}(h, s_i, M)$ denotes the difference between samples $s_i$ and $M$ on hyperparameter $h$;
$\mathrm{diff}(h, s_i, Q)$ denotes the difference between samples $s_i$ and $Q$ on hyperparameter $h$;
The difference $\mathrm{diff}(h, s_i, s_j)$ between two samples $s_i$ and $s_j$ on hyperparameter $h$ is defined as follows:
If hyperparameter $h$ is a nominal (categorical) hyperparameter, $\mathrm{diff}(h, s_i, s_j) = 0$ when $s_{ih} = s_{jh}$ and $\mathrm{diff}(h, s_i, s_j) = 1$ otherwise;
If hyperparameter $h$ is a numerical hyperparameter, $\mathrm{diff}(h, s_i, s_j) = |s_{ih} - s_{jh}| / (\max_h - \min_h)$;
where $1 \le i \ne j \le m$ and $1 \le h \le ph$; $\max_h$ and $\min_h$ are the maximum and minimum values of hyperparameter $h$ in the sample set; $m$ is the number of samples, each containing $ph$ hyperparameters; $rt$ is the number of iterations, with $rt > 1$ to avoid the randomness of a single draw; $s_{ih}$ is the value of hyperparameter $h$ in sample $s_i$, and $s_{jh}$ is its value in sample $s_j$.
Formula (1) shows that a hyperparameter that contributes strongly to high performance differs greatly between samples of different classes and little between samples of the same class, so a hyperparameter with discriminating power has a positive weight.
To avoid the randomness of a single draw, the update is iterated $rt > 1$ times, yielding the importance-weight ranking of each class of hyperparameter.
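A minimal sketch of this Relief step is given below, assuming the standard Relief difference function (0/1 for nominal values, range-normalized absolute difference for numerical ones) and a performance measure where higher is better. The function names, the threshold 0.75, and all sample values are hypothetical; hyperparameter 0 is constructed to drive performance and hyperparameter 1 to be noise.

```python
import random


def label_by_threshold(performances, threshold):
    """1 = high-performance sample, 0 = low-performance sample
    (assumes higher is better; invert for an error rate)."""
    return [1 if y >= threshold else 0 for y in performances]


def diff(h, a, b, ranges, nominal):
    """Difference of two samples on hyperparameter h."""
    if h in nominal:
        return 0.0 if a[h] == b[h] else 1.0
    lo, hi = ranges[h]
    return abs(a[h] - b[h]) / (hi - lo) if hi > lo else 0.0


def relief_weights(samples, labels, nominal=(), rt=20, seed=0):
    """Apply formula (1) rt times and return one weight per hyperparameter.
    Each class must contain at least two samples."""
    rng = random.Random(seed)
    n_hp = len(samples[0])
    nominal = set(nominal)
    ranges = {h: (min(s[h] for s in samples), max(s[h] for s in samples))
              for h in range(n_hp) if h not in nominal}

    def distance(a, b):
        return sum(diff(h, a, b, ranges, nominal) for h in range(n_hp))

    w = [0.0] * n_hp
    for _ in range(rt):
        i = rng.randrange(len(samples))
        s_i = samples[i]
        same = [s for j, s in enumerate(samples)
                if j != i and labels[j] == labels[i]]
        other = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        M = min(same, key=lambda s: distance(s_i, s))   # nearest same-class
        Q = min(other, key=lambda s: distance(s_i, s))  # nearest other-class
        for h in range(n_hp):
            w[h] += (diff(h, s_i, Q, ranges, nominal)
                     - diff(h, s_i, M, ranges, nominal)) / rt
    return w


samples = [(0.9, 0.1), (0.8, 0.9), (0.95, 0.45),
           (0.1, 0.2), (0.2, 0.8), (0.05, 0.6)]
performances = [0.91, 0.88, 0.93, 0.60, 0.62, 0.58]
labels = label_by_threshold(performances, threshold=0.75)
w = relief_weights(samples, labels)
print(w)
```

On this toy data the weight of hyperparameter 0 comes out positive and that of hyperparameter 1 negative, which matches the reading of formula (1): a discriminating hyperparameter differs more across classes than within a class.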
Using a clustering algorithm to further verify the accuracy of the hyperparameter importance evaluation comprises:
According to the obtained importance-weight ranking, the hyperparameters in the top k classes are clustered and the hyperparameter importance is computed. Let S be the hyperparameter sample set, T the size of that set, K the number of classes to which the hyperparameter samples belong, $p_{ik}$ the probability that sample i belongs to class k, $C_k$ the actual class label of a hyperparameter sample, and C the hyperparameter set. The importance measure r(C) of C is expressed in terms of the following quantities:
F(C) denotes the difference between the clustering result on the hyperparameter set C and the class labels over the entire hyperparameter sample set, $F_i(C)$ denotes the difference between the clustering result on C and the class labels within each class, and $X_i$ denotes the set of hyperparameter samples of the i-th class.
The higher the value of r(C), the stronger the correlation between the clustering result and the actual class labels, and the greater the influence of the hyperparameter set C on classification. The importance evaluation corresponding to the largest r(C) is selected.
The class labels are the high-performance and low-performance labels.
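The verification idea, clustering on a candidate hyperparameter subset and checking agreement with the performance labels, can be sketched as follows. This is not the r(C) measure itself: cluster purity is swapped in as a simple stand-in score, and a small deterministic k-means is used as the clustering algorithm. All names and sample values are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Tiny k-means; farthest-point initialisation keeps it deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign


def purity(assign, labels):
    """Stand-in for r(C): share of samples matching their cluster's
    majority label (1.0 means clusters and labels agree perfectly)."""
    total = 0
    for c in set(assign):
        member_labels = [l for a, l in zip(assign, labels) if a == c]
        total += max(member_labels.count(l) for l in set(member_labels))
    return total / len(labels)


def score_subset(samples, labels, subset):
    """Cluster the samples using only the hyperparameters in subset."""
    projected = [tuple(s[h] for h in subset) for s in samples]
    return purity(kmeans(projected, k=len(set(labels))), labels)


samples = [(0.9, 0.1), (0.8, 0.9), (0.95, 0.45),
           (0.1, 0.2), (0.2, 0.8), (0.05, 0.6)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = high performance, 0 = low performance
score_important = score_subset(samples, labels, [0])
score_noise = score_subset(samples, labels, [1])
print(score_important, score_noise)
```

A subset whose clusters reproduce the high/low performance split scores higher, which is the property the text attributes to a large r(C).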
As a second aspect of the present invention:
A system for evaluating the importance of machine learning hyperparameters, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, carry out the steps of any of the methods described above.
As a third aspect of the present invention:
A computer-readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of any of the methods described above.
Beneficial effects of the present invention:
The invention can accurately evaluate the importance of the hyperparameters of a machine learning algorithm, and the evaluation can be used to guide automated hyperparameter configuration and to make it more interpretable. It characterizes the hyperparameter importance of the machine learning algorithm itself, providing a useful reference and good interpretability for the hyperparameter configuration process.
(1) It saves resources and time: by supplying suitable prior knowledge and narrowing the search space, it gives the hyperparameter configuration process a degree of guidance, in place of the previous fully black-box procedure.
(2) It lets users see directly which classes of hyperparameters have the greater influence on algorithm performance.
Brief Description of the Drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the application. The illustrative embodiments of the application and their descriptions serve to explain the application and do not unduly limit it.
Fig. 1 is a flowchart of the method provided by the present invention.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to further explain the application. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which this application belongs.
The invention makes full use of the many datasets in the open machine learning environment OpenML and of the performance data of each dataset under a variety of algorithms. It combines a meta-learning approach to compute the distance between the target dataset and the historical datasets, and uses the Relief algorithm and a clustering algorithm to obtain the importance ranking of each class of hyperparameter of the classification algorithm to be evaluated. The ranking is then used to guide automated hyperparameter tuning of that algorithm on the target dataset. By supplying suitable prior knowledge and narrowing the search space, the invention gives the hyperparameter configuration process a degree of guidance in place of the previous fully black-box procedure; it also lets users see directly which classes of hyperparameters have the greater influence on algorithm performance.
As shown in Fig. 1, the present invention comprises the following steps:
Step A: obtain different datasets from OpenML and extract meta-features from each, so that each dataset can be represented by its meta-features. At the same time, collect data on the performance $y_i$ (for example, the misclassification rate or the RMSE) of the classification algorithm to be evaluated under different hyperparameter configurations $\theta_i$, and store the meta-feature vector of each dataset together with the performance data for the different hyperparameter configurations in the historical dataset sample library;
The meta-features extracted in step A fall into three main groups: simple meta-features (for example, the number of samples, features, classes, and missing values in the dataset), statistical meta-features of the dataset (for example, the mean, the variance, and the kurtosis of the distance vector), and importance meta-features (for example, the performance obtained by running machine learning algorithms on the dataset).
Step B: for the target dataset, meta-features are likewise extracted to represent it. Based on the principle that dissimilar datasets also differ in the hyperparameter configurations that suit an algorithm, the distances between meta-feature vectors are used to obtain the sequence of distances between the target dataset and the historical datasets. For the f historical datasets nearest to the target dataset, the performance data of the classification algorithm to be evaluated under different hyperparameter configurations can be used to evaluate hyperparameter importance;
In step B, the distance between the target dataset $D_{N'}$ and each historical dataset $D_i$ ($i = 1, 2, \dots, N$) is measured by the distance between their meta-feature vectors, using the p-norm commonly used to measure the difference between dataset meta-feature vectors: $d_{pn}(D_{N'}, D_i) = \|V_{N'} - V_i\|_{pn}$. Comparing these distances yields an ordering $\pi(1), \dots, \pi(N)$ of the historical datasets from nearest to farthest from the target dataset, where $d_{pn}(D_{N'}, D_{\pi(1)}) \le d_{pn}(D_{N'}, D_{\pi(2)}) \le \dots \le d_{pn}(D_{N'}, D_{\pi(N)})$.
Step C: following the ordering of the historical datasets from nearest to farthest from the target dataset, the proposed Relief-Cluster algorithm is executed in turn on the f historical datasets nearest to the target dataset. First, the average weight of each class of hyperparameter obtained by the Relief algorithm gives a preliminary evaluation of hyperparameter importance; then the r(C) index of the clustering algorithm is used to further verify the accuracy of that evaluation. These two steps are repeated m times, and the importance evaluation corresponding to the largest r(C) is selected. The resulting hyperparameter importance ranking of the classification algorithm to be evaluated is then used to guide automated hyperparameter tuning of that algorithm on the target dataset.
In the present invention, step C specifically comprises the following steps:
Step C1: a threshold is set according to the magnitude of the performance values under the different hyperparameter configurations, dividing the data into a high-performance class and a low-performance class. The Relief algorithm first randomly selects a sample $s_i$ from the hyperparameter sample set, and then selects from each of the two classes the sample nearest to $s_i$. The nearest sample of the same class as $s_i$ is denoted $M$, and the nearest sample of a different class is denoted $Q$. The weight $w_h$ of each class of hyperparameter $h$ is updated according to formula (1):
$w_h = w_h - \mathrm{diff}(h, s_i, M)/rt + \mathrm{diff}(h, s_i, Q)/rt$ (1)
In formula (1), the difference between two samples $s_i$ and $s_j$ ($1 \le i \ne j \le m$) on hyperparameter $h$ ($1 \le h \le ph$) is defined as follows:
If hyperparameter $h$ is a nominal (categorical) hyperparameter, $\mathrm{diff}(h, s_i, s_j) = 0$ when $s_{ih} = s_{jh}$ and $\mathrm{diff}(h, s_i, s_j) = 1$ otherwise;
If hyperparameter $h$ is a numerical hyperparameter, $\mathrm{diff}(h, s_i, s_j) = |s_{ih} - s_{jh}| / (\max_h - \min_h)$;
where $\max_h$ and $\min_h$ are the maximum and minimum values of hyperparameter $h$ in the sample set, and $s_{ih}$ and $s_{jh}$ are the values of hyperparameter $h$ in samples $s_i$ and $s_j$.
Formula (1) shows that a hyperparameter that contributes strongly to high performance should differ greatly between samples of different classes and little between samples of the same class, so a hyperparameter with discriminating power should have a positive weight. To avoid the randomness of a single draw, the above process is iterated $rt > 1$ times.
Step C2: according to the importance-weight ranking obtained in the previous step, the hyperparameters in the top k classes are clustered and the feature importance is computed. Let S be the hyperparameter sample set, T the size of that set, K the number of classes to which the hyperparameter samples belong, $p_{ik}$ the probability that sample i belongs to class k, $C_k$ the actual class label of a hyperparameter sample, and C a hyperparameter subset. The importance measure r(C) of C can be expressed in terms of the following quantities:
F(C) denotes the difference between the clustering result on the hyperparameter subset C and the class labels over the entire hyperparameter sample set, $F_i(C)$ denotes the corresponding difference within each class, and $X_i$ denotes the set of hyperparameter samples of the i-th class. The higher the value of r(C), the stronger the correlation between the clustering result and the actual class labels, and the greater the influence of the hyperparameter subset C on classification.
The two steps above are iterated m times and the hyperparameter importance ranking corresponding to the largest r(C) is selected. The resulting ranking is then used to guide automated hyperparameter tuning of the classification algorithm to be evaluated on the target dataset.
Flow of the Relief-Cluster algorithm in the present invention:
Input: hyperparameter sample set S, number of hyperparameter classes hc, number of samplings/iterations rt
Output: clustering evaluation index r(C), hyperparameter importance weight matrix W
Randomly select a sample $s_i$ from S;
From the samples of the same class as $s_i$, select the nearest neighbor of $s_i$, denoted M;
From the samples of a different class from $s_i$, select the nearest neighbor of $s_i$, denoted Q;
Update the hyperparameter importance weight vector W using formula (1);
Select a hyperparameter subset of size X;
Cluster the samples on the hyperparameter subset;
Compute the correlation r(C) between the clustering result and the actual result;
From the m values of r(C), select the hyperparameter importance ranking corresponding to the largest value;
End
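The outer loop of this flow, which repeats the two steps m times and keeps the ranking with the largest score, can be sketched as a small driver. The callables `relief_fn` and `score_fn` are hypothetical placeholders for any implementation of the Relief step and of the cluster-based score; the toy stand-ins below return fixed values purely to make the control flow visible.

```python
def relief_cluster(samples, labels, m, k, relief_fn, score_fn):
    """Repeat (Relief ranking, cluster check) m times and return the
    (score, ranking, weights) triple with the largest score."""
    best = None
    for run in range(m):
        weights = relief_fn(samples, labels, seed=run)
        ranking = sorted(range(len(weights)),
                         key=lambda h: weights[h], reverse=True)
        score = score_fn(samples, labels, ranking[:k])
        if best is None or score > best[0]:
            best = (score, ranking, weights)
    return best


def toy_relief(samples, labels, seed):
    # Stand-in: pretend each run yields slightly different weights.
    return [[0.2, 0.5], [0.7, -0.1], [0.6, 0.0]][seed]


def toy_score(samples, labels, subset):
    # Stand-in: subsets containing hyperparameter 0 separate the classes.
    return 1.0 if 0 in subset else 0.5


score, ranking, weights = relief_cluster(
    [], [], m=3, k=1, relief_fn=toy_relief, score_fn=toy_score)
print(score, ranking, weights)
```

Ties are resolved in favor of the earliest run here; that tie-breaking rule is an assumption of the sketch, as the text does not specify one.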
The above are only preferred embodiments of the present application and are not intended to limit it. A person skilled in the art may make various modifications and changes to the application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its scope of protection.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810270934.5A | 2018-03-29 | 2018-03-29 | Method, system and storage medium for evaluating the importance of machine learning hyperparameters |
| Publication Number | Publication Date |
|---|---|
| CN108446741A (en) | 2018-08-24 |
| CN108446741B (en) | 2020-01-07 |
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200107 |