CN108446741B

Info

Publication number: CN108446741B
Application number: CN201810270934.5A
Authority: CN
Inventors: 孙运雷; 魏倩; 孔言
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2020-01-07
Anticipated expiration: 2038-03-29
Also published as: CN108446741A

Abstract

Translated from Chinese

The invention discloses a method, system and storage medium for evaluating the importance of machine learning hyperparameters. Different datasets are obtained from OpenML, meta-features are extracted to represent each dataset, and performance data of the classification algorithm to be evaluated under different hyperparameter configurations are collected. Meta-features are likewise extracted to represent the target dataset, and the distances between meta-feature vectors give a sequence of historical datasets ordered by increasing distance from the target dataset. The performance data of the classification algorithm under different hyperparameter configurations are then used to evaluate hyperparameter importance: following the ordered sequence, the proposed Relief and clustering algorithms are executed in turn on the first m historical datasets closest to the target dataset, yielding a hyperparameter importance ranking for the classification algorithm that guides its automated tuning process. The invention thereby offers guidance for tuning the hyperparameters of an otherwise black-box classification algorithm, saving time and improving efficiency.

Description

Method, system and storage medium for evaluating the importance of machine learning hyperparameters

Technical field

The present invention relates to a method, a system and a storage medium for evaluating the importance of machine learning hyperparameters.

Background

Machine learning provides important technical support for data processing and data classification. However, model selection and parameter tuning remain two major problems for users, which has led to the emergence of automated machine learning systems. Such systems use automated machine learning algorithms to automate data preprocessing, algorithm selection and parameter tuning, improving the accuracy of data classification and prediction while freeing users from the burdensome tasks of selecting algorithms and repeatedly tuning parameters.

Because the core of automated machine learning is automated algorithm selection and automated hyperparameter configuration, such a system reduces the machine learning process to the Combined Algorithm Selection and Hyper-parameter optimization (CASH) problem. The CASH problem treats the choice of algorithm as a new root-level hyperparameter, thereby mapping the problem of choosing both an algorithm and its hyperparameter values onto the problem of choosing hyperparameter values alone. By likewise treating data preprocessing and feature selection techniques as hyperparameters, the system can select them automatically. The resulting hyperparameter optimization problem can be solved with the classical Bayesian optimization algorithm, which improves the accuracy of data classification and prediction.
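The CASH reduction can be illustrated with a minimal sketch. The search space below, the two algorithms, their hyperparameter names and ranges, and the helper `sample_configuration` are all hypothetical and are not taken from the patent; the point is only that the algorithm choice becomes one more hyperparameter at the root of the configuration.

```python
import random

# Hypothetical search space: the algorithm itself is the root-level
# hyperparameter, and each algorithm carries its own child hyperparameters.
SEARCH_SPACE = {
    "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
    "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 20)},
}


def sample_configuration(rng):
    """Draw one joint (algorithm, hyperparameter values) configuration."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    config = {"algorithm": algorithm}
    for name, (low, high) in SEARCH_SPACE[algorithm].items():
        if isinstance(low, int) and isinstance(high, int):
            config[name] = rng.randint(low, high)   # integer-valued range
        else:
            config[name] = rng.uniform(low, high)   # real-valued range
    return config


config = sample_configuration(random.Random(0))
```

An optimizer such as Bayesian optimization would replace the random draw here; the sketch only shows the shape of the joint configuration it would search over.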

However, the hyperparameter configuration module of current automated machine learning systems is configured either purely from experience or by adjusting the hyperparameters one by one based on the results of repeated iterations. This wastes machine learning time, and the repeated iterations waste computing resources; tuning all hyperparameters without regard to their importance also wastes the user's time and effort.

Summary of the invention

The present invention provides a method, system and storage medium for evaluating the importance of machine learning hyperparameters. The technical problem to be solved is how to accurately evaluate the hyperparameter importance of a machine learning algorithm and use it to guide automated hyperparameter configuration and to make that configuration more interpretable.

As a first aspect of the present invention:

A method for evaluating the importance of machine learning hyperparameters, comprising:

Step (1): Obtain from the open machine learning environment OpenML several new datasets of a type similar to the target dataset, and extract a meta-feature vector for each new dataset, so that every new dataset is represented by its meta-feature vector;

Collect from OpenML data on the performance of the classification algorithm to be evaluated under different hyperparameter configurations;

Store the meta-feature vector of each new dataset and the performance data corresponding to the different hyperparameter configurations in the corresponding historical dataset;

Step (2): Extract the meta-feature vector of the target dataset to represent it, calculate the distance between the meta-feature vector of the target dataset and that of each historical dataset, and obtain the sequence of historical datasets ordered from nearest to farthest from the target dataset;

Step (3): Execute the Relief-Cluster algorithm in turn on the first f historical datasets closest to the target dataset: from the weights of each type of hyperparameter obtained by the Relief algorithm, calculate the average weight of each type of hyperparameter, and use these average weights to obtain a preliminary importance ranking of the hyperparameters; use a clustering algorithm to further verify the accuracy of the importance evaluation; finally, obtain the hyperparameter importance ranking of the classification algorithm to be evaluated.

The method for evaluating the importance of machine learning hyperparameters further comprises the following step:

Step (4): According to the obtained hyperparameter importance ranking, set the several highest-ranked hyperparameters, and then use the classification algorithm with these settings to classify the data to be classified.
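One way step (4) could be used in practice is sketched below, assuming all hyperparameters are numeric and that tuning is done by a plain random search. The function `tune_top_k`, the toy objective `toy_error` and every name and range in the usage lines are hypothetical; the patent does not prescribe a particular search procedure.

```python
import random


def tune_top_k(ranking, space, defaults, evaluate, k=2, trials=20, seed=0):
    """Random search over the k most important hyperparameters only.

    ranking  : hyperparameter names, most important first
    space    : name -> (low, high) range for numeric hyperparameters
    defaults : name -> default value kept for hyperparameters left untuned
    evaluate : callable(config) -> error to minimise (hypothetical)
    """
    rng = random.Random(seed)
    tuned = ranking[:k]
    best_config, best_error = dict(defaults), evaluate(defaults)
    for _ in range(trials):
        candidate = dict(defaults)
        for name in tuned:
            low, high = space[name]
            candidate[name] = rng.uniform(low, high)
        error = evaluate(candidate)
        if error < best_error:
            best_config, best_error = candidate, error
    return best_config, best_error


def toy_error(config):
    # Hypothetical objective: only "C" matters, with its optimum at 1.0.
    return (config["C"] - 1.0) ** 2


ranking = ["C", "gamma", "degree"]
space = {"C": (0.0, 2.0), "gamma": (0.0, 1.0), "degree": (1.0, 5.0)}
defaults = {"C": 2.0, "gamma": 0.1, "degree": 3.0}
best_config, best_error = tune_top_k(ranking, space, defaults, toy_error, k=2)
```

Restricting the search to the top-ranked hyperparameters shrinks the search space, which is the saving in time and resources that the method aims for.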

In step (1), each dataset D_i is described by a vector V_i of F meta-features.

In step (1), the meta-features include simple meta-features, statistical meta-features of the dataset, and importance meta-features;

The simple meta-features include: the number of samples, the number of features, the number of classes, or the number of missing values in the dataset;

The statistical meta-features of the dataset include: the mean, the variance, or the kurtosis of the distance vector;

The importance meta-features include: the performance obtained by running machine learning algorithms on the dataset.
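A minimal sketch of meta-feature extraction follows. It covers only the simple and statistical groups, and it assumes, for brevity, that the statistics are taken over all observed feature values; the patent's exact statistics (for example, what the "distance vector" is computed from) are not specified here, and importance meta-features are omitted because they require training a model. The function name `extract_meta_features` is hypothetical.

```python
import numpy as np


def extract_meta_features(X, y):
    """Return a small meta-feature vector for a dataset (X, y).

    X: 2-D float array (samples x features); NaN marks a missing value.
    y: 1-D array of class labels.
    """
    X = np.asarray(X, dtype=float)
    values = X[~np.isnan(X)]          # observed (non-missing) entries
    mean = values.mean()
    var = values.var()
    # Non-excess kurtosis of the observed values; 0.0 if they are constant.
    kurt = ((values - mean) ** 4).mean() / var ** 2 if var > 0 else 0.0
    return np.array([
        X.shape[0],                   # number of samples
        X.shape[1],                   # number of features
        len(np.unique(y)),            # number of classes
        int(np.isnan(X).sum()),       # number of missing values
        mean, var, kurt,              # statistical meta-features
    ])


X = [[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]]
y = [0, 1, 0]
v = extract_meta_features(X, y)
```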

In step (1), the performance of the classification algorithm to be evaluated under different hyperparameter configurations includes the misclassification rate or the RMSE;

In addition, for many common algorithms OpenML already contains very comprehensive performance data for different hyperparameter configurations on a wide range of datasets. From these, the hyperparameter configurations θ_i of the classification algorithm to be evaluated on dataset D_i, together with the corresponding performance y_i, are collected.

For the target dataset D_N′, the meta-feature vector V_N′ is extracted to represent it. Based on the principle that dissimilar datasets also call for different hyperparameter configurations of the algorithm, the distances between meta-feature vectors are used to obtain the distance sequence between the target dataset and the historical datasets. For the first f historical datasets closest to the target dataset, the algorithm's performance data under different hyperparameter configurations are used to evaluate hyperparameter importance;

The distance d_pn(D_N′, D_i) between the target dataset D_N′ and a historical dataset D_i is measured by the distance between their meta-feature vectors:

d_pn(D_N′, D_i) = ||V_N′ - V_i||_pn

where V_N′ denotes the meta-feature vector of the target dataset D_N′, V_i denotes the meta-feature vector of the historical dataset D_i, and pn denotes the p-norm.

Comparing these distances yields the sequence π(1), ..., π(N) of historical datasets ordered from nearest to farthest from the target dataset, where π(1) indexes the historical dataset nearest to the target dataset and π(N) the farthest.
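The distance computation and the ordering π(1), ..., π(N) can be sketched as follows. The function name `rank_historical_datasets` and the toy vectors are hypothetical; `p=2` is just one possible choice of norm.

```python
import numpy as np


def rank_historical_datasets(target_vec, historical_vecs, p=2):
    """Return indices of historical datasets sorted from nearest to farthest.

    target_vec      : meta-feature vector V_N' of the target dataset
    historical_vecs : (N, F) array whose row i is the vector V_i
    p               : order of the p-norm
    """
    target_vec = np.asarray(target_vec, dtype=float)
    historical_vecs = np.asarray(historical_vecs, dtype=float)
    # d_pn(D_N', D_i) = ||V_N' - V_i||_p for every historical dataset i
    distances = np.linalg.norm(historical_vecs - target_vec, ord=p, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances[order]


order, sorted_distances = rank_historical_datasets(
    [0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
)
```

Meta-features such as sample counts and variances live on very different scales, so in practice one would probably standardise each meta-feature before taking the norm; the source text does not say whether this is done.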

Following the sequence π(1), ..., π(N), the Relief-Cluster algorithm is executed in turn on the first f historical datasets closest to the target dataset. First, the average weight of each type of hyperparameter obtained by the Relief algorithm gives a preliminary evaluation of hyperparameter importance; then the r(C) index of the clustering algorithm is used to further verify the accuracy of that evaluation. These two steps are repeated m times, and the importance evaluation with the largest r(C) is selected. The result is the hyperparameter importance ranking of the classification algorithm to be evaluated, which is then used to guide the automated tuning of that algorithm on the target dataset.

Obtaining the weight of each type of hyperparameter with the Relief algorithm includes:

A threshold is set according to the magnitude of the performance data under the different hyperparameter configurations, and the performance data in the historical dataset are divided into high-performance samples and low-performance samples. The Relief algorithm first randomly selects a sample s_i from the performance data, and then selects from the high-performance samples and from the low-performance samples the sample nearest to s_i in each;

The nearest sample s_j of the same class as s_i is denoted M, and the nearest sample s_j of a different class is denoted Q. The weight w_h of each type of hyperparameter h is updated according to formula (1):

w_h = w_h - diff(h, s_i, M)/rt + diff(h, s_i, Q)/rt    (1)

diff(h, s_i, M) denotes the difference between the samples s_i and M on hyperparameter h;

diff(h, s_i, Q) denotes the difference between the samples s_i and Q on hyperparameter h;

The difference diff(h, s_i, s_j) between two samples s_i and s_j on hyperparameter h is defined as follows:

If hyperparameter h is nominal (categorical), diff(h, s_i, s_j) = 0 when s_ih = s_jh, and 1 otherwise;

If hyperparameter h is numerical, diff(h, s_i, s_j) = |s_ih - s_jh| / (max_h - min_h);

where 1 ≤ i ≠ j ≤ m and 1 ≤ h ≤ ph; max_h and min_h are the maximum and minimum values of hyperparameter h in the sample set; m is the number of samples; each sample contains ph hyperparameters; rt is the number of iterations, with rt > 1 to avoid the randomness of a single draw; s_ih is the value of hyperparameter h in sample s_i, and s_jh is its value in sample s_j.

Formula (1) shows that a hyperparameter that contributes strongly to high performance differs greatly between samples of different classes and little between samples of the same class, so a hyperparameter with discriminating power receives a positive weight.

To avoid the randomness of a single draw, the procedure is iterated rt > 1 times, which yields the importance weight ranking of each type of hyperparameter.
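The Relief step can be sketched as below. This is a minimal illustration under several assumptions that are not stated in the source: the threshold is taken to be the median error, nominal hyperparameters are assumed to be numerically encoded, the distance between two samples is the sum of their per-hyperparameter differences, and each class is assumed to contain at least two samples. The function name `relief_weights` and the toy grid are hypothetical.

```python
import numpy as np


def relief_weights(configs, labels, is_nominal, rt=50, seed=0):
    """Relief weights for hyperparameters, following update rule (1).

    configs    : (m, ph) array, one hyperparameter configuration per row
    labels     : (m,) array, 1 = high-performance sample, 0 = low-performance
    is_nominal : (ph,) booleans marking nominal hyperparameters
    rt         : number of sampling iterations
    """
    configs = np.asarray(configs, dtype=float)
    labels = np.asarray(labels)
    is_nominal = np.asarray(is_nominal, dtype=bool)
    span = configs.max(axis=0) - configs.min(axis=0)   # max_h - min_h
    span[span == 0] = 1.0   # constant column: avoid division by zero

    def diff(a, b):
        # Per-hyperparameter difference between two samples, each in [0, 1].
        numeric = np.abs(a - b) / span
        nominal = (a != b).astype(float)
        return np.where(is_nominal, nominal, numeric)

    rng = np.random.default_rng(seed)
    weights = np.zeros(configs.shape[1])
    for _ in range(rt):
        i = rng.integers(len(configs))
        dists = np.array([diff(configs[i], row).sum() for row in configs])
        dists[i] = np.inf   # never pick the sample itself
        same = labels == labels[i]
        hit = configs[np.where(same, dists, np.inf).argmin()]     # M
        miss = configs[np.where(~same, dists, np.inf).argmin()]   # Q
        weights += (diff(configs[i], miss) - diff(configs[i], hit)) / rt
    return weights


# Toy data: the error depends only on the first hyperparameter.
grid = np.array([(a, b) for a in range(4) for b in range(4)], dtype=float)
errors = grid[:, 0]
labels = (errors <= np.median(errors)).astype(int)   # assumed threshold
weights = relief_weights(grid, labels, is_nominal=[False, False])
```

On this toy grid the first hyperparameter, which alone determines the label, ends up with the larger weight, in line with the remark that discriminating hyperparameters receive positive weights.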

Using the clustering algorithm to further verify the accuracy of the hyperparameter importance evaluation includes:

According to the importance weight ranking obtained above, the samples are clustered on the top k types of hyperparameters and the hyperparameter importance is calculated. Let S be the hyperparameter sample set, T the size of S, K the number of classes to which the samples belong, p_ik the probability that a sample belongs to class k, C_k the actual class label of a sample, and C the hyperparameter set. The importance measure r(C) of C is defined in terms of the following quantities: F(C) denotes the difference between the clustering result on the hyperparameter set C and the class labels over the entire sample set, F_i(C) denotes the difference between the clustering result on C and the class labels within each class, and X_i denotes the set of samples of the i-th class.

The higher the value of r(C), the stronger the correlation between the clustering result and the actual class labels, and the greater the influence of the hyperparameter set C on classification. The importance evaluation with the largest r(C) is selected.

The class labels are the high-performance and low-performance labels.
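The idea of checking a hyperparameter subset by clustering can be illustrated with a stand-in score. The sketch below does not implement the patent's r(C), whose formula is not reproduced in this text; it swaps in plain k-means with cluster purity, which shares only the property that a higher value means the clusters agree better with the high/low performance labels. The function names and toy data are hypothetical.

```python
import numpy as np


def kmeans_labels(points, n_clusters=2, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = points[assign == c].mean(axis=0)
    return assign


def cluster_label_agreement(configs, labels, subset):
    """Purity of k-means clusters built on the chosen hyperparameter columns.

    Stand-in for r(C): NOT the patent's measure, only a simple score in
    (0, 1] that grows as the clusters line up with the class labels.
    """
    configs = np.asarray(configs, dtype=float)
    labels = np.asarray(labels)
    assign = kmeans_labels(configs[:, subset], n_clusters=len(np.unique(labels)))
    matched = sum(np.bincount(labels[assign == c]).max()
                  for c in np.unique(assign))
    return matched / len(labels)


# Toy data: column 0 separates the two classes, column 1 does not.
configs = [[0.0, 0], [0.1, 5], [0.2, 0], [5.0, 5], [5.1, 0], [5.2, 5]]
labels = [1, 1, 1, 0, 0, 0]
score_informative = cluster_label_agreement(configs, labels, subset=[0])
score_uninformative = cluster_label_agreement(configs, labels, subset=[1])
```

A subset made of genuinely important hyperparameters should score higher than one made of unimportant ones, which is the role r(C) plays in selecting the final ranking.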

As a second aspect of the present invention:

A system for evaluating the importance of machine learning hyperparameters, comprising a memory, a processor, and computer instructions stored in the memory and run on the processor; when run by the processor, the computer instructions carry out the steps of any of the methods described above.

As a third aspect of the present invention:

A computer-readable storage medium storing computer instructions which, when run by a processor, carry out the steps of any of the methods described above.

Beneficial effects of the present invention:

The invention can accurately evaluate the hyperparameter importance of a machine learning algorithm, and the result serves to guide automated hyperparameter configuration and to make it more interpretable. It describes the importance of the hyperparameters of the machine learning algorithm itself, providing a useful reference and good interpretability for the configuration process. The technical problem this module addresses is how to accurately evaluate the hyperparameter importance of a machine learning algorithm and use it to guide automated hyperparameter configuration and to make that configuration more interpretable.

(1) It saves resources and time: by supplying suitable prior knowledge and narrowing the search space, it gives the hyperparameter configuration process a degree of guidance and moves it away from the fully black-box state of the past.

(2) It also lets users see intuitively which types of hyperparameters affect algorithm performance the most.

Description of drawings

The accompanying drawings, which form part of the present application, provide a further understanding of it; the illustrative embodiments and their descriptions explain the present application and do not unduly limit it.

Fig. 1 is the flow chart of the present invention;

Detailed description

It should be noted that the following detailed description is illustrative and is intended to further explain the present application. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by one of ordinary skill in the art to which this application belongs.

The invention makes full use of the many datasets in the open machine learning environment OpenML and the performance data recorded for each of them under a variety of algorithms. It combines a meta-learning approach to calculate the distance between the target dataset and the historical datasets, and uses the Relief algorithm and a clustering algorithm to rank the importance of each type of hyperparameter of the classification algorithm to be evaluated; the ranking is then used to guide the automated tuning of that algorithm on the target dataset. By supplying suitable prior knowledge and narrowing the search space, the invention gives the hyperparameter configuration process a degree of guidance and moves it away from the fully black-box state of the past; it also lets users see intuitively which types of hyperparameters affect algorithm performance the most.

As shown in Fig. 1, the present invention comprises the following steps:

Step A. Obtain different datasets from OpenML and extract meta-features for each, so that every dataset can be represented by its meta-features. At the same time, collect data on the performance y_i (for example, the misclassification rate or the RMSE) of the classification algorithm to be evaluated under different hyperparameter configurations θ_i, and store the meta-feature vector of each dataset together with the performance data corresponding to the different hyperparameter configurations in the historical dataset sample library;

The meta-features extracted in step A fall into three main groups: simple meta-features (for example, the number of samples, features, classes and missing values in the dataset), statistical meta-features of the dataset (for example, the mean, the variance and the kurtosis of the distance vector), and importance meta-features (for example, the performance obtained by running machine learning algorithms on the dataset).

Step B. Meta-features are likewise extracted to represent the target dataset. Based on the principle that dissimilar datasets also call for different hyperparameter configurations of the algorithm, the distances between meta-feature vectors are used to obtain the distance sequence between the target dataset and the historical datasets. For the first f historical datasets closest to the target dataset, the performance data of the classification algorithm to be evaluated under different hyperparameter configurations can be used to evaluate hyperparameter importance;

In step B, the distance between the target dataset D_N′ and each historical dataset D_i (i = 1, 2, …, N) is measured by the distance between their meta-feature vectors, using the p-norm commonly employed to measure the difference between dataset meta-feature vectors: d_pn(D_N′, D_i) = ||V_N′ - V_i||_pn. Comparing these distances yields the sequence π(1), ..., π(N) of historical datasets ordered from nearest to farthest from the target dataset, where π(1) indexes the historical dataset nearest to the target dataset and π(N) the farthest.

Step C. Following this ordered sequence, the proposed Relief-Cluster algorithm is executed in turn on the first f historical datasets closest to the target dataset. First, the average weight of each type of hyperparameter obtained by the Relief algorithm gives a preliminary evaluation of hyperparameter importance; then the r(C) index of the clustering algorithm is used to further verify the accuracy of that evaluation. These two steps are repeated m times and the importance evaluation with the largest r(C) is selected, giving the hyperparameter importance ranking of the classification algorithm to be evaluated, which is then used to guide the automated tuning of that algorithm on the target dataset.

In the present invention, step C specifically comprises the following steps:

Step C1. A threshold is set according to the magnitude of the performance data under the different hyperparameter configurations, dividing the data into a high-performance class and a low-performance class. The Relief algorithm first randomly selects a sample s_i from the hyperparameter sample set, and then selects from each of the two classes the sample nearest to s_i. The nearest sample of the same class as s_i is denoted M and the nearest sample of a different class is denoted Q. The weight w_h of each type of hyperparameter h is updated according to formula (1):

w_h = w_h - diff(h, s_i, M)/rt + diff(h, s_i, Q)/rt    (1)

In this formula, the difference between two samples s_i and s_j (1 ≤ i ≠ j ≤ m) on hyperparameter h (1 ≤ h ≤ ph) is defined as follows:

If hyperparameter h is nominal (categorical), diff(h, s_i, s_j) = 0 when s_ih = s_jh, and 1 otherwise;

If hyperparameter h is numerical, diff(h, s_i, s_j) = |s_ih - s_jh| / (max_h - min_h);

where s_ih is the value of hyperparameter h in sample s_i, and max_h and min_h are the maximum and minimum values of hyperparameter h in the sample set.

Formula (1) shows that a hyperparameter that contributes strongly to high performance should differ greatly between samples of different classes and little between samples of the same class, so a hyperparameter with discriminating power should receive a positive weight. To avoid the randomness of a single draw, the above procedure is iterated rt > 1 times.

Step C2. According to the importance weight ranking obtained in the previous step, the samples are clustered on the top k types of hyperparameters and the feature importance is calculated. Let S be the hyperparameter sample set, T the size of S, K the number of classes to which the samples belong, p_ik the probability that a sample belongs to class k, C_k the actual class label of a sample, and C the hyperparameter subset. The importance measure r(C) of C is defined in terms of the following quantities: F(C) denotes the difference between the clustering result on the hyperparameter subset C and the class labels over the entire sample set, F_i(C) denotes the difference within each class, and X_i denotes the set of samples of the i-th class. The higher the value of r(C), the stronger the correlation between the clustering result and the actual class labels, and the greater the influence of the hyperparameter subset C on classification.

These two steps are iterated m times and the hyperparameter importance ranking with the largest r(C) is selected; the resulting ranking is then used to guide the automated tuning of the classification algorithm to be evaluated on the target dataset.

Procedure of the Relief-Cluster algorithm in the present invention:

Input: hyperparameter sample set S, number of hyperparameter types hc, number of sampling iterations rt

Output: clustering evaluation index r(C), hyperparameter importance weight matrix W

Randomly select a sample s_i from S;

From the samples of the same class as s_i, select the nearest neighbor of s_i, denoted M;

From the samples of a different class from s_i, select the nearest neighbor of s_i, denoted Q;

Update the hyperparameter importance weight vector W using formula (1);

Select a hyperparameter subset of size X;

Cluster the samples on the hyperparameter subset;

Calculate the correlation r(C) between the clustering result and the actual labels;

From the m values of r(C), select the hyperparameter importance ranking corresponding to the largest;

End
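The final aggregation, in which the Relief weights from the f nearest historical datasets are averaged and turned into a ranking, can be sketched as follows. The function name `rank_hyperparameters`, the hyperparameter names and the weight values are hypothetical, and a plain unweighted mean is assumed because the source speaks only of an "average weight".

```python
import numpy as np


def rank_hyperparameters(weight_matrix, names):
    """Average per-dataset Relief weights and rank the hyperparameters.

    weight_matrix : (f, ph) array; row j holds the weights obtained on the
                    j-th nearest historical dataset
    names         : ph hyperparameter names
    """
    mean_weights = np.asarray(weight_matrix, dtype=float).mean(axis=0)
    order = np.argsort(-mean_weights, kind="stable")   # largest weight first
    return [(names[i], float(mean_weights[i])) for i in order]


W = [[0.4, -0.1, 0.2],
     [0.6, 0.1, 0.0]]
ranking = rank_hyperparameters(W, ["C", "gamma", "degree"])
```

Weighting the rows by closeness to the target dataset would be a natural variation, but nothing in the source indicates that this is done.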

The above are only preferred embodiments of the present application and are not intended to limit it; those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A system for classifying data based on machine learning hyperparameter importance evaluation, characterized by comprising:

a historical data set acquisition module configured to: acquire, from the open machine learning environment OpenML, a plurality of new data sets of a type similar to the target data set, and extract meta-features from each new data set so that each new data set is represented by a meta-feature vector;

collect, from OpenML, data on the performance of the classification algorithm to be evaluated under different hyperparameter configurations;

store the meta-feature vector of each new data set and the performance data corresponding to the different hyperparameter configurations in corresponding historical data sets;

a distance sequence acquisition module configured to: extract the meta-feature vector of the target data set to represent the target data set, calculate the distance between the meta-feature vector of the target data set and the meta-feature vector of each historical data set, and obtain a sequence of the historical data sets ordered from nearest to farthest from the target data set;

an output module configured to: execute a Relief-Cluster algorithm in turn on the first f historical data sets closest to the target data set: calculate the average weight of each type of hyperparameter from the weights of each type of hyperparameter obtained by the Relief algorithm, and use these average weights to obtain a preliminary importance ordering of the hyperparameters; then verify the accuracy of the hyperparameter importance evaluation using a clustering algorithm; and finally obtain the hyperparameter importance ranking of the classification algorithm to be evaluated;

a classification module configured to: set the several top-ranked hyperparameters according to the obtained hyperparameter importance ranking of the classification algorithm to be evaluated, and then classify the data to be classified using the classification algorithm with the set parameters.
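The averaging and ranking step of the output module can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `rank_hyperparameters`, the hyperparameter names, and the weight values are all hypothetical, and the clustering-based verification step is omitted because the claim does not specify it.

```python
from statistics import mean


def rank_hyperparameters(weights_per_dataset):
    """Average Relief weights over several data sets and rank hyperparameters.

    weights_per_dataset: list of dicts, one per historical data set,
    each mapping a hyperparameter name to its Relief weight.
    Returns (ranking, average_weights), most important first.
    """
    names = list(weights_per_dataset[0])
    avg = {h: mean(w[h] for w in weights_per_dataset) for h in names}
    ranking = sorted(names, key=lambda h: avg[h], reverse=True)
    return ranking, avg


# Hypothetical Relief weights from the f = 3 nearest historical data sets.
weights = [
    {"C": 0.40, "gamma": 0.30, "tol": 0.02},
    {"C": 0.20, "gamma": 0.50, "tol": 0.01},
    {"C": 0.30, "gamma": 0.40, "tol": 0.03},
]
ranking, avg = rank_hyperparameters(weights)
print(ranking)
```

Under this sketch, the classification module would then tune only the first few entries of `ranking` and leave the rest at their defaults.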

2. The system of claim 1, wherein each data set $D_i$ in the historical data set acquisition module is described as a vector represented by F meta-features

3. The system of claim 1, wherein the meta-features in the historical data set acquisition module include: simple meta-features, statistical meta-features and significance meta-features of the data set;

the simple meta-features include: the number of data set samples, the number of features, the number of categories, or the number of missing values;

the statistical meta-features include: the mean, the variance, or the kurtosis of the distance vector;

the significance meta-features include: the performance obtained by running a machine learning algorithm on the data set.
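As a rough illustration of the simple and statistical meta-features named above, the sketch below computes a few of them for a small in-memory data set. The function name `simple_meta_features`, the use of `None` for missing values, and pooling all feature values before taking the mean and variance are assumptions made for this example; the significance meta-features are left out because they require running a learner.

```python
from statistics import mean, pvariance


def simple_meta_features(rows, labels):
    """Compute a few simple and statistical meta-features.

    rows: list of feature lists, with None marking a missing value.
    labels: list of class labels, one per row.
    """
    observed = [v for row in rows for v in row if v is not None]
    return {
        "n_samples": len(rows),
        "n_features": len(rows[0]) if rows else 0,
        "n_classes": len(set(labels)),
        "n_missing": sum(v is None for row in rows for v in row),
        # Pooled over all observed values (an assumption of this sketch).
        "mean": mean(observed),
        "variance": pvariance(observed),
    }


# Hypothetical toy data set: three samples, two features, one missing value.
rows = [[1.0, 2.0], [3.0, None], [5.0, 4.0]]
labels = ["a", "b", "a"]
features = simple_meta_features(rows, labels)
print(features)
```

The resulting dictionary values, taken in a fixed order, would form the meta-feature vector of the data set.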

4. The system of claim 1, wherein the performance of the classification algorithm to be evaluated in the historical data set acquisition module under different hyper-parameter configurations comprises: misclassification rate or RMSE.
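The two performance measures named in this claim have standard definitions, sketched below. The function names are chosen for this example only.

```python
import math


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))
print(rmse([1.0, 2.0], [1.0, 4.0]))
```

Both measures are lower-is-better, which matters when performance data are later split into high-performance and low-performance samples.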

5. The system of claim 1, wherein the distance between meta-feature vectors is used to measure the distance $d_{pn}(D_{N'}, D_i)$ between the target data set $D_{N'}$ and a historical data set $D_i$:

$d_{pn}(D_{N'}, D_i) = \lVert V_{N'} - V_i \rVert_{pn}$

where $V_{N'}$ denotes the meta-feature vector of the target data set $D_{N'}$, $V_i$ denotes the meta-feature vector of the historical data set $D_i$, and p denotes the p-norm;

and comparing the distances between the meta-feature vector of the target data set and the meta-feature vectors of the historical data sets to obtain an ordered sequence $\pi(1)$ of the historical data sets by distance from the target data set, from near to far.
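The distance computation and near-to-far ordering in this claim can be sketched as below. The function names, the data set names, and the meta-feature values are hypothetical; the sketch also assumes the meta-features are already on comparable scales, which the claim does not state.

```python
def p_norm_distance(u, v, p=2):
    """p-norm of the difference between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def nearest_datasets(target, history, p=2):
    """Return historical data set names ordered from nearest to farthest.

    target: meta-feature vector of the target data set.
    history: dict mapping a data set name to its meta-feature vector.
    """
    return sorted(history, key=lambda name: p_norm_distance(target, history[name], p))


# Hypothetical meta-feature vectors.
target = [0.0, 0.0]
history = {"D1": [3.0, 4.0], "D2": [1.0, 1.0], "D3": [0.0, 2.0]}
order = nearest_datasets(target, history, p=2)
print(order)
```

The first f entries of `order` are the historical data sets on which the Relief-Cluster step would then be run.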

6. The system of claim 1, wherein,

obtaining the weight of each type of hyperparameter by the Relief algorithm comprises the following steps:

setting a threshold according to the magnitude of the performance data under different hyperparameter configurations, and dividing the performance data corresponding to the different hyperparameter configurations in a historical data set into high-performance samples and low-performance samples; the Relief algorithm randomly selects a sample $s_i$ from the performance data and then selects, from each of the high-performance samples and the low-performance samples, the sample nearest to $s_i$;

the nearest sample $s_j$ of the same class as $s_i$ is denoted $M$, and the nearest sample $s_j$ of a different class from $s_i$ is denoted $Q$; the weight $w_h$ of each type of hyperparameter $h$ is updated according to equation (1):

$w_h = w_h - \mathrm{diff}(h, s_i, M)/rt + \mathrm{diff}(h, s_i, Q)/rt$ (1)

$\mathrm{diff}(h, s_i, M)$ denotes the difference between the two samples $s_i$ and $M$ in the hyperparameter $h$;

$\mathrm{diff}(h, s_i, Q)$ denotes the difference between the two samples $s_i$ and $Q$ in the hyperparameter $h$;

the difference $\mathrm{diff}(h, s_i, s_j)$ between two samples $s_i$ and $s_j$ in the hyperparameter $h$ is defined as:

if the hyperparameter $h$ is a scalar-type hyperparameter,

if the hyperparameter $h$ is a numerical hyperparameter,

where $1 \le i \le j \le m$ and $1 \le h \le ph$; $max_h$ is the maximum value of the hyperparameter $h$ in the sample set, $min_h$ is the minimum value of the hyperparameter $h$ in the sample set, $m$ denotes the number of samples, each sample contains $ph$ hyperparameters, $rt$ denotes the number of iterations, $rt > 1$, $s_{ih}$ denotes the value of the hyperparameter $h$ in sample $s_i$, and $s_{jh}$ denotes the value of the hyperparameter $h$ in sample $s_j$.
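The weight update of equation (1) can be sketched as follows. The two diff formulas are not reproduced in this text, so the sketch substitutes the standard Relief definitions as an assumption: 0 or 1 for a non-numerical hyperparameter depending on equality, and the absolute difference divided by $max_h - min_h$ for a numerical one. The function names, the sample values, and the convention that a score at or above the threshold counts as high performance are also assumptions of this example.

```python
import random


def diff(h, a, b, ranges, categorical):
    """Standard Relief difference (assumed here, not taken from the claim)."""
    if h in categorical:
        return 0.0 if a[h] == b[h] else 1.0
    lo, hi = ranges[h]
    return abs(a[h] - b[h]) / (hi - lo) if hi > lo else 0.0


def relief_weights(samples, performance, threshold, rt, categorical=(), seed=0):
    """samples: list of dicts (hyperparameter -> value); performance: scores."""
    rng = random.Random(seed)
    names = list(samples[0])
    ranges = {h: (min(s[h] for s in samples), max(s[h] for s in samples))
              for h in names if h not in categorical}
    # Assumption: higher score is better, so >= threshold means high performance.
    high = [score >= threshold for score in performance]

    def dist(a, b):
        return sum(diff(h, a, b, ranges, categorical) for h in names)

    def nearest(i, same_class):
        candidates = [j for j in range(len(samples))
                      if j != i and (high[j] == high[i]) == same_class]
        return samples[min(candidates, key=lambda j: dist(samples[i], samples[j]))]

    w = {h: 0.0 for h in names}
    for _ in range(rt):
        i = rng.randrange(len(samples))
        M, Q = nearest(i, True), nearest(i, False)
        for h in names:  # equation (1)
            w[h] += (diff(h, samples[i], Q, ranges, categorical)
                     - diff(h, samples[i], M, ranges, categorical)) / rt
    return w


# Hypothetical configurations: "C" drives the score, "tol" does not.
samples = [{"C": 0.1, "tol": 0.5}, {"C": 0.2, "tol": 0.1},
           {"C": 0.9, "tol": 0.4}, {"C": 1.0, "tol": 0.2}]
performance = [0.60, 0.62, 0.90, 0.92]
w = relief_weights(samples, performance, threshold=0.8, rt=10)
print(w)
```

In this toy example the weight of `C` comes out positive and the weight of `tol` negative, matching the intuition that a hyperparameter is important when it separates high-performance from low-performance configurations. For a lower-is-better measure such as misclassification rate, the threshold comparison would be reversed.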

7. A system for classifying data based on machine learning hyperparameter importance evaluation, characterized by comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the following steps:

step (1): acquiring, from the open machine learning environment OpenML, a plurality of new data sets of a type similar to the target data set, and extracting meta-features from each new data set so that each new data set is represented by a meta-feature vector;

step (2): extracting the meta-feature vector of the target data set to represent the target data set, calculating the distance between the meta-feature vector of the target data set and the meta-feature vector of each historical data set, and obtaining a sequence of the historical data sets ordered from nearest to farthest from the target data set;

step (3): executing a Relief-Cluster algorithm in turn on the first f historical data sets closest to the target data set: calculating the average weight of each type of hyperparameter from the weights of each type of hyperparameter obtained by the Relief algorithm, and using these average weights to obtain a preliminary importance ordering of the hyperparameters; then verifying the accuracy of the hyperparameter importance evaluation using a clustering algorithm; and finally obtaining the hyperparameter importance ranking of the classification algorithm to be evaluated;

step (4): setting the several top-ranked hyperparameters according to the obtained hyperparameter importance ranking of the classification algorithm to be evaluated, and then classifying the data to be classified using the classification algorithm with the set parameters.

8. The system of claim 7, wherein in step (1), each data set $D_i$ is described as a vector represented by F meta-features

9. The system of claim 7, wherein in step (1), the meta-features comprise: simple meta-features, statistical meta-features and significance meta-features of the data set;

10. The system of claim 7, wherein the performance of the classification algorithm to be evaluated in step (1) under different hyper-parameter configurations comprises: misclassification rate or RMSE.

11. The system of claim 7, wherein the distance between meta-feature vectors is used to measure the distance $d_{pn}(D_{N'}, D_i)$ between the target data set $D_{N'}$ and a historical data set $D_i$:

$d_{pn}(D_{N'}, D_i) = \lVert V_{N'} - V_i \rVert_{pn}$;

and comparing the distances between the meta-features of the target data set and the meta-features of the historical data sets to obtain an ordered sequence $\pi(1)$ of the historical data sets by distance from the target data set, from near to far.

12. The system of claim 7, wherein,

$w_h = w_h - \mathrm{diff}(h, s_i, M)/rt + \mathrm{diff}(h, s_i, Q)/rt$ (1)

if the hyperparameter $h$ is a scalar-type hyperparameter,

if the hyperparameter $h$ is a numerical hyperparameter,

where $1 \le i \le j \le m$ and $1 \le h \le ph$; $max_h$ is the maximum value of the hyperparameter $h$ in the sample set, $min_h$ is the minimum value of the hyperparameter $h$ in the sample set, $m$ denotes the number of samples, each sample contains $ph$ hyperparameters, $rt$ denotes the number of iterations, $rt > 1$, $s_{ih}$ denotes the value of the hyperparameter $h$ in sample $s_i$, and $s_{jh}$ denotes the value of the hyperparameter $h$ in sample $s_j$.

13. A computer readable storage medium having computer instructions embodied thereon, said computer instructions when executed by a processor performing the steps of:

14. The medium of claim 13, wherein in step (1), each data set $D_i$ is described as a vector represented by F meta-features

15. The medium of claim 13, wherein in step (1), the meta-feature comprises: simple meta-features, statistical meta-features and significance meta-features of the data set;

16. The medium of claim 13, wherein the performance of the classification algorithm under evaluation in step (1) under different hyper-parameter configurations comprises: misclassification rate or RMSE.

17. The medium of claim 13, wherein the distance between meta-feature vectors is used to measure the distance $d_{pn}(D_{N'}, D_i)$ between the target data set $D_{N'}$ and a historical data set $D_i$:

$d_{pn}(D_{N'}, D_i) = \lVert V_{N'} - V_i \rVert_{pn}$;

where $V_{N'}$ denotes the meta-feature vector of the target data set $D_{N'}$, $V_i$ denotes the meta-feature vector of the historical data set $D_i$, and p denotes the p-norm;

18. The medium of claim 13, wherein the weight for each type of hyperparameter obtained by the Relief algorithm comprises:

$w_h = w_h - \mathrm{diff}(h, s_i, M)/rt + \mathrm{diff}(h, s_i, Q)/rt$ (1)

if the hyperparameter $h$ is a scalar-type hyperparameter,

if the hyperparameter $h$ is a numerical hyperparameter,