Technical Field
The present invention relates to the technical field of data mining, and in particular to a feature selection method and device for high-dimensional data.
Background Art
The rapidly developing information society generates massive amounts of data every day, and quickly discovering useful information in these data has become an urgent problem. Researchers have approached this problem from the perspective of machine learning models and have made remarkable progress. However, high-complexity models and high-dimensional feature spaces are increasingly unable to meet the pressing requirements of big-data applications, and feature spaces often contain a large amount of useless information. Only with an appropriate feature selection method can effective features be extracted from massive data, thereby improving the efficiency and accuracy with which machine learning models process the data; feature selection also helps prevent overfitting and removes noise. As an important preprocessing step in machine learning and data mining, feature selection has therefore long been a research hotspot in the machine learning field.
The choice of evaluation metric and search algorithm is crucial to feature selection. Commonly used metrics include distance-based, information-theoretic, and consistency-based measures. Distance-based metrics and the Pearson coefficient can only measure linear relationships between variables, whereas metrics such as information gain and mutual information can also capture nonlinear relationships. Generating a feature subset usually requires a corresponding search algorithm; among the many search strategies, the approximate Markov blanket condition performs well in terms of both computational complexity and the classification accuracy of the selected features. However, it has an obvious drawback: it cannot account for the redundancy between a feature and the already selected feature subset.
Summary of the Invention
In view of the defects in the prior art, the present invention provides a feature selection method and device for high-dimensional data. Existing metrics can measure only linear relationships or limited nonlinear relationships between variables, so the maximal information coefficient (MIC) is introduced into feature selection: MIC can broadly measure linear and nonlinear relationships between variables, and even non-functional relationships that cannot be expressed by a single function. Although MIC is very effective for measuring dependence, it can only quantify the relevance and redundancy between individual variables. A new measure, mMIC (the effective value), is therefore proposed and applied to the approximate Markov blanket condition, so as to solve the problem that the prior art yields low feature selection accuracy because it cannot handle the redundancy between a feature and a feature subset in high-dimensional data sets.
The present invention proposes a feature selection method for high-dimensional data, comprising:
obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples, and a class set, the class set comprising the class of each sample;
calculating the maximal information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the already selected feature subset;
obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value.
Preferably, the step of calculating the maximal information coefficient MIC between each feature in the feature set and the class set specifically comprises:
calculating, by formula (1), the maximal information coefficient MIC between each feature in the feature set and the class set;
where B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n features are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized maximum mutual information between the features and the samples under the x×y grid partition.
Preferably, the step of obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value specifically comprises:
obtaining, by formula (2), the effective value of each feature according to the maximal information coefficient MIC and the redundancy value;
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the last term is the redundancy value.
Preferably, before the step of obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, the method further comprises:
defining an approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the step of obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, specifically comprises:
selecting features from the feature set in turn according to the maximal information coefficient MIC, and deleting each selected feature from the feature set;
obtaining the effective value of the selected feature according to its maximal information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
Preferably, the step of obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, further comprises:
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the selected feature, and obtaining the effective value of each screened feature according to formula (2);
judging whether the effective value of each screened feature is greater than or equal to the preset threshold; if not, deleting the screened feature from the feature set and selecting the next feature from the feature set.
The present invention also proposes a feature selection device for high-dimensional data, comprising:
an acquisition module, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples, and a class set, the class set comprising the class of each sample;
a processing module, configured to calculate the maximal information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the already selected feature subset;
a selection module, configured to obtain the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and to select a feature subset from the feature set according to the effective value.
Preferably, the processing module is specifically configured to calculate, by formula (1), the maximal information coefficient MIC between each feature in the feature set and the class set;
where B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n features are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized maximum mutual information between the features and the samples under the x×y grid partition.
Preferably, the selection module is specifically configured to obtain, by formula (2), the effective value of each feature according to the maximal information coefficient MIC and the redundancy value;
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the last term is the redundancy value.
Preferably, the device further comprises a predefinition module;
the predefinition module is configured to define, before the step of obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, an approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module is further configured to select features from the feature set in turn according to the maximal information coefficient MIC and delete each selected feature from the feature set; to obtain the effective value of the selected feature according to its maximal information coefficient MIC and redundancy value; and to judge whether the effective value is greater than or equal to a preset threshold and, if so, add the feature to the optimal subset.
Preferably, the selection module is further configured to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the selected feature, and to obtain the effective value of each screened feature according to formula (2); and to judge whether the effective value of each screened feature is greater than or equal to the preset threshold and, if not, delete the screened feature from the feature set and select the next feature from the feature set.
As can be seen from the above technical solution, the feature selection method for high-dimensional data proposed by the present invention introduces the maximal information coefficient into feature selection and performs feature selection on high-dimensional data based on maximal information, thereby overcoming the shortcoming of the prior art that only the relevance and redundancy between two individual features can be considered, and improving the classification accuracy of the selected features.
Description of the Drawings
The features and advantages of the present invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:
FIG. 1 is a schematic flowchart of a feature selection method for high-dimensional data according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a feature selection method for high-dimensional data according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a feature selection device for high-dimensional data according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic flowchart of a feature selection method for high-dimensional data according to an embodiment of the present invention. Referring to FIG. 1, the method comprises:
110: obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples, and a class set, the class set comprising the class of each sample;
120: calculating the maximal information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the already selected feature subset;
130: obtaining the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value.
The present invention introduces the maximal information coefficient into feature selection and performs feature selection on high-dimensional data based on maximal information, which addresses the problem of low feature selection accuracy caused by the difficulty of handling the redundancy between features and feature subsets in high-dimensional data sets, and improves the classification accuracy of the selected features.
In this embodiment, the process of calculating the MIC in step 120 specifically comprises:
calculating, by formula (1), the maximal information coefficient MIC between each feature in the feature set and the class set;
where B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n features are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized maximum mutual information between the features and the samples under the x×y grid partition.
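As a concrete illustration of step 120 (a sketch under stated assumptions, not part of the claimed method), the per-feature MIC values can be computed with the open-source minepy package, assuming it is installed; the helper name mic_scores and its arguments are chosen for this example only, and the alpha parameter plays the role of the exponent in the grid bound B(n).

```python
# Illustrative sketch: MIC between every feature column and the class labels,
# computed with the third-party minepy package (assumed available).
import numpy as np
from minepy import MINE

def mic_scores(X, c, alpha=0.6):
    """Return MIC(f_k, c) for every column of X.

    X : array of shape (n_samples, n_features)
    c : array of shape (n_samples,) holding the class of each sample
    """
    mine = MINE(alpha=alpha, c=15)        # alpha controls the grid bound B(n)
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        mine.compute_score(X[:, k].astype(float), c.astype(float))
        scores[k] = mine.mic()            # normalized to [0, 1]
    return scores
```

The features can then be ranked in descending order of these scores, as required by the sorting step described later.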
In this embodiment, step 130 specifically comprises:
obtaining, by formula (2), the effective value of each feature according to the maximal information coefficient MIC and the redundancy value;
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the last term is the redundancy value.
In this embodiment, before step 130, the method further comprises:
defining an approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, step 130 specifically comprises:
selecting features from the feature set in turn according to the maximal information coefficient MIC, and deleting each selected feature from the feature set;
obtaining the effective value of the selected feature according to its maximal information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset;
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the selected feature, and obtaining the effective value of each screened feature according to formula (2);
judging whether the effective value of each screened feature is greater than or equal to the preset threshold; if not, deleting the screened feature from the feature set and selecting the next feature from the feature set, until the feature set F is empty.
FIG. 2 is a schematic flowchart of a feature selection method for high-dimensional data according to another embodiment of the present invention. The principle of the present invention is described in detail below with reference to FIG. 2.
The method comprises an initialization stage and a feature deletion stage.
1. The initialization stage comprises:
S1: a given data set D has m features and n samples; its feature set is F = {f_1, f_2, ..., f_m}, and the class set c = {c_1, c_2, ..., c_n} contains the class of each sample in the data set. Data preprocessing is performed, the optimal feature subset S is set to the empty set, and the parameter θ is set; the parameter θ here is the aforementioned preset threshold.
2. The feature deletion stage comprises the following steps:
S2: calculating the maximal information coefficient between each feature in the feature set and the class set, and sorting the features in descending order of their MIC(c; f_i) values, where f_i is the i-th feature and 0 < i ≤ m;
S3: processing the feature set using the approximate Markov blanket condition and the effective-value (mMIC) evaluation function proposed by the present invention, deleting irrelevant and redundant features, and obtaining the final feature subset.
Preferably, step S1 specifically comprises:
S11: performing data preprocessing on the data set D to obtain the required file format;
S12: initializing the optimal feature subset S to the empty set, and initializing the parameter θ.
Preferably, step S2 specifically comprises:
S21: for each feature f_i in the feature set F, calculating the maximal information coefficient MIC(c; f_i) between the feature and the class set;
S22: sorting the features in descending order of MIC(c; f_i).
Preferably, the approximate Markov blanket condition in step S3 is defined as follows:
for two features f_i and f_j (i ≠ j, 0 < j ≤ m) and class c, f_i is an approximate Markov blanket of f_j if:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j).
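For illustration, this condition can be written as a small predicate over precomputed MIC values (a minimal sketch; the function name is not prescribed by the invention):

```python
def is_approx_markov_blanket(mic_fi_c, mic_fj_c, mic_fi_fj):
    """f_i is an approximate Markov blanket of f_j: f_i is more relevant to the
    class than f_j, and f_j is more strongly related to f_i than to the class."""
    return mic_fi_c > mic_fj_c and mic_fj_c < mic_fi_fj
```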
The maximal information coefficient is then calculated by the following formula:
where B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1. In general, B(n) = n^0.6 gives the best results. x and y denote the numbers of segments into which the ranges of the two variables are divided, and M(D)_{x,y} denotes the normalized maximum mutual information of the two variables under the x×y grid partition.
M(D)_{x,y} is calculated as follows:
where MI*(D, x, y) denotes the maximum mutual information under the x×y grid partition.
MI*(D, x, y) is calculated as follows:
MI*(D, x, y) = max MI(D|G)
where D|G denotes the data set D partitioned by an x×y grid G, after which the mutual information is computed for each grid partition. The mutual information in the formula is calculated as follows:
where A = {a_i, i = 1...n} and B = {b_i, i = 1...n}.
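For reference, the quantities described above can be written out in standard notation (this restates the usual definition of the maximal information coefficient, consistent with the description):

```latex
\mathrm{MI}(A,B) = \sum_{a \in A}\sum_{b \in B} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)}
\qquad\text{(mutual information of the partitioned variables)}

\mathrm{MI}^{*}(D,x,y) = \max_{G}\,\mathrm{MI}(D|G)
\qquad\text{(maximum over all } x \times y \text{ grids } G\text{)}

M(D)_{x,y} = \frac{\mathrm{MI}^{*}(D,x,y)}{\log \min\{x,y\}}
\qquad\text{(normalization into } [0,1]\text{)}

\mathrm{MIC}(D) = \max_{x\,y \,<\, B(n)} M(D)_{x,y},
\qquad B(n) = n^{0.6} \text{ in practice.}
```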
Preferably, in step S3, the evaluation function mMIC based on the maximal information coefficient measures both the relevance between a feature and the class and the redundancy between that feature and the feature subset, and thereby judges the quality of the feature.
The mMIC evaluation function is calculated as follows:
where S_main is the currently selected feature subset and S_residue is the remaining feature subset; for simplicity and convenience of expression, i and j denote features f_i and f_j respectively. The formula expresses that the quality of a feature f_j selected from the remaining feature subset is determined by its relevance to the class set and its redundancy with the currently selected feature subset.
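One form of the mMIC evaluation function that is consistent with this description — given here only as an assumption, not as the exact expression of formula (2) — penalizes the relevance MIC(f_j, c) by the average redundancy with the already selected subset:

```latex
% Assumed form of the mMIC effective value; the exact expression of formula (2) may differ.
\mathrm{mMIC}(f_j) \;=\; \mathrm{MIC}(f_j, c)
\;-\; \frac{1}{\lvert S_{\mathrm{main}} \rvert} \sum_{f_i \in S_{\mathrm{main}}} \mathrm{MIC}(f_i, f_j),
\qquad f_j \in S_{\mathrm{residue}}.
```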
Preferably, step S3 comprises the following steps:
S31: repeating the following operations until F is the empty set:
a. selecting from the feature set F the feature with the largest MIC(c; f_i) value;
b. deleting feature f_i from the feature set F; if f_i is in the redundant subset S_re, computing its mMIC value and, if the mMIC value is less than θ, returning to step a; otherwise adding f_i directly to the optimal subset S and proceeding to step c with f_i as the principal element;
c. searching the feature set F for all features for which the principal element f_i selected in step a is an approximate Markov blanket, adding each such feature f_j to S_re, and computing the mMIC values of all the selected features; if the mMIC value of a feature f_j is less than θ, deleting f_j from F;
d. when the above process ends, the output feature subset S is the optimal feature subset.
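Putting steps S1-S3 together, the procedure can be sketched in Python as follows. This is an illustrative sketch under two assumptions: mic(u, v) stands for any pairwise MIC estimator (for example the minepy-based helper sketched earlier), and the mMIC value follows the assumed relevance-minus-average-redundancy form noted above; it is not the definitive implementation.

```python
# Illustrative sketch of the initialization and feature-deletion stages (S1-S3).
import numpy as np

def mmic(j, S_main, rel, pair_mic):
    """Assumed effective value of feature j given the selected subset S_main."""
    if not S_main:
        return rel[j]
    return rel[j] - np.mean([pair_mic(i, j) for i in S_main])

def select_features(X, c, mic, theta):
    m = X.shape[1]
    F = list(range(m))                              # remaining feature indices
    rel = {k: mic(X[:, k], c) for k in F}           # MIC(c; f_k) for every feature
    F.sort(key=lambda k: rel[k], reverse=True)      # S2: descending relevance

    cache = {}
    def pair_mic(i, j):                             # cached pairwise MIC(f_i, f_j)
        key = (min(i, j), max(i, j))
        if key not in cache:
            cache[key] = mic(X[:, i], X[:, j])
        return cache[key]

    S, S_re = [], set()                             # S1: optimal subset empty, threshold theta
    while F:                                        # S31: repeat until F is empty
        i = F.pop(0)                                # a. largest remaining MIC(c; f_i)
        if i in S_re and mmic(i, S, rel, pair_mic) < theta:
            continue                                # b. redundant and ineffective: drop f_i
        S.append(i)                                 # b. keep f_i as the principal element
        survivors = []
        for j in F:                                 # c. features blanketed by f_i
            blanketed = rel[i] > rel[j] and rel[j] < pair_mic(i, j)
            if blanketed:
                S_re.add(j)
                if mmic(j, S, rel, pair_mic) < theta:
                    continue                        # delete f_j from F
            survivors.append(j)
        F = survivors
    return S                                        # d. optimal feature subset
```

A call such as select_features(X, c, mic=mic_estimator, theta=0.1) returns the indices of the selected features; the threshold θ is data dependent, and the value 0.1 is only a placeholder.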
In summary, by incorporating mMIC into the approximate Markov blanket model, the present invention enables the approximate Markov blanket condition to weigh the relevance between an individual feature and the class against the redundancy between that feature and the feature subset when deciding whether to keep or discard the feature. This preserves the efficiency of feature selection under the approximate Markov blanket condition while ensuring the accuracy of the selected features.
FIG. 3 is a schematic structural diagram of a feature selection device for high-dimensional data according to an embodiment of the present invention. Referring to FIG. 3, the device comprises:
an acquisition module 310, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples, and a class set, the class set comprising the class of each sample;
a processing module 320, configured to calculate the maximal information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the already selected feature subset;
a selection module 330, configured to obtain the effective value of each feature according to the maximal information coefficient MIC and the redundancy value, and to select a feature subset from the feature set according to the effective value.
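For illustration, the module split of FIG. 3 might be organized in code as follows; the class and method names are placeholders chosen for this sketch and are not prescribed by the invention (select_features refers to the sketch given earlier).

```python
# Illustrative sketch of the device of FIG. 3.
class AcquisitionModule:                    # module 310
    def acquire(self, load_fn):
        """Load the raw data set: feature matrix X and class labels c."""
        X, c = load_fn()                    # any callable returning (X, c)
        return X, c

class ProcessingModule:                     # module 320
    def __init__(self, mic):
        self.mic = mic                      # pairwise MIC estimator
    def relevance(self, X, c):
        """MIC between every feature and the class set."""
        return [self.mic(X[:, k], c) for k in range(X.shape[1])]

class SelectionModule:                      # module 330
    def __init__(self, theta):
        self.theta = theta                  # preset threshold on the effective value
    def select(self, X, c, mic):
        return select_features(X, c, mic, self.theta)
```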
The present invention introduces the maximal information coefficient into feature selection and performs feature selection on high-dimensional data based on maximal information, thereby overcoming the shortcoming of the prior art that only the relevance and redundancy between two individual features can be considered, and improving the classification accuracy of the selected features.
As the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the method embodiment.
In a feasible embodiment, the processing module 320 is specifically configured to calculate, by formula (1), the maximal information coefficient MIC between each feature in the feature set and the class set;
where B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n features are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized maximum mutual information between the features and the samples under the x×y grid partition.
As the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the method embodiment.
In a feasible embodiment, the selection module 330 is specifically configured to obtain, by formula (2), the effective value of each feature according to the maximal information coefficient MIC and the redundancy value;
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the last term is the redundancy value.
As the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the method embodiment.
In a feasible embodiment, the device further comprises a predefinition module 340;
the predefinition module 340 is configured to define, before the effective value of each feature is obtained according to the maximal information coefficient MIC and the redundancy value, an approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module 330 is further configured to: select features from the feature set in turn according to the maximal information coefficient MIC and delete each selected feature from the feature set; obtain the effective value of the selected feature according to its maximal information coefficient MIC and redundancy value, judge whether the effective value is greater than or equal to a preset threshold and, if so, add the feature to the optimal subset; screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the selected feature, and obtain the effective value of each screened feature according to formula (2); and judge whether the effective value of each screened feature is greater than or equal to the preset threshold and, if not, delete the screened feature from the feature set and select the next feature from the feature set, until the feature set is empty.
As the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the method embodiment.
In a feasible embodiment, the selection module 330 is further configured to select the next feature from the feature set when the effective value of the feature is less than the preset threshold.
As the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the method embodiment.
It should be noted that the components of the device of the present invention are logically divided according to the functions they implement, but the present invention is not limited thereto; the components may be re-divided or combined as required.
The components of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. In this device, a PC remotely controls the equipment or device via the Internet and precisely controls each step of its operation. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for carrying out part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, and the files or documents generated by the program lend themselves to statistical analysis, producing data reports, CPK reports, and the like, enabling batch testing and statistics of power amplifiers. It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.