

Technical Field
The present invention relates to the technical field of machine learning, and more particularly to a multi-label classification method for data with partially missing and unknown class labels.
Background
Multi-label learning is an active research topic in machine learning and has attracted wide attention from researchers in academia and industry in recent years. In multi-label learning, each sample can belong to several class labels at the same time. For example, a film can simultaneously be an "action film", a "war film", and a "thriller". Multi-label learning is widely used in practice, for example in text classification, image and video annotation, music classification, and product recommendation.
The main task of multi-label learning is to learn, from a given training dataset, an effective multi-label classification model that can predict one or more possible class labels for a new sample. Researchers have proposed many methods for this problem. Existing multi-label learning methods mostly assume that the class label set of the training dataset is complete and that all label values are known. During annotation, an annotator assigns one or more relevant class labels to each sample. The process is time-consuming and laborious, and it is difficult for an annotator to assign every relevant class accurately, especially when the total number of class labels is large. As a result, the annotations are prone to being partially missing, or even completely missing, meaning that some class labels are not assigned to any sample. In addition, the semantics of multi-label data are complex, and some class labels may lie beyond the annotators' knowledge, which also leaves those labels unassigned to any sample. Because these completely missing class labels are unknown during training, learning becomes considerably harder.
Researchers have proposed some multi-label classification methods that handle missing labels, but these methods can only handle partially missing values and cannot handle datasets that contain unknown class labels. They are mainly based on matrix completion, or on ignoring the missing entries when constructing the classification loss. Both strategies require every class to have at least one positive sample. Consequently, when the data contains unknown class labels, whose label values are completely missing, none of the existing methods applies. Two methods have been proposed that can handle unknown class labels: the multi-instance multi-label learning method in the presence of novel class instances published by A. Pham et al. at the International Conference on Machine Learning, and the multi-instance multi-label learning method for discovering multiple novel labels published by Zhu Yue et al. at the annual conference of the Association for the Advancement of Artificial Intelligence. However, both methods apply only to multi-instance multi-label learning. They cannot be used for multi-label learning in the general, single-instance setting, and they cannot handle partially missing labels.
A search found Chinese patent application No. 201911306128.X, published on April 21, 2020, titled "A latent class discovery and classification method in multi-label classification". That application combines known-label classification with latent-label discovery and classification in one framework. It uses non-negative matrix factorization to decompose the feature matrix into an approximate solution of the complete class label matrix and a coefficient matrix, constrains the known part of the approximate solution to agree with the true values, and at the same time builds a classification model from sample features to complete labels in order to discover latent label types. Through latent-label discovery it mines valuable implicit information from the data. It exploits the correlation between known and latent labels by constraining strongly correlated classes to have similar classification model coefficients, so that they yield similar predictions. Known-label classification and latent-label classification thus guide and reinforce each other, which improves classification performance on both and benefits the multi-label learning task.
However, that application assumes that the values of the known labels are fully observed, and the performance of its algorithm degrades when some known label values are missing. In practice, when the data contains unknown new labels, it is even more common for the values of the known labels to be partially missing, so the application introduces errors when used in real settings.
Summary of the Invention
1. Technical problem to be solved by the invention
In the prior art, traditional multi-label learning methods assume that the number of class labels in a dataset is fixed and that all label values are known. They cannot handle datasets that contain unknown class labels, which reduces classification accuracy. To address this problem, the present invention provides a multi-label classification method for data with partially missing and unknown class labels. The invention proposes an effective learning method that discovers the unknown class labels in a dataset and builds a multi-label classification model over both known and unknown class labels, making multi-label classification results more accurate.
2. Technical solution
To achieve the above object, the technical solution provided by the invention is as follows.
A multi-label classification method for data with partially missing and unknown class labels according to the invention comprises the following steps:
Step 1. Perform feature extraction and class annotation on the training data to obtain the data feature representation matrix X and the known class label matrix Y;
Step 2. Compute the similarity matrix S from the feature representation matrix X and the known class label matrix Y;
Step 3. Decompose the similarity matrix S to obtain an approximate representation H of the complete class label matrix that preserves the similarity structure of the samples, and constrain the corresponding part of H to agree with the known class label matrix Y obtained in Step 1;
Step 4. Use matrix reconstruction to refine the approximate representation H of the complete class label matrix, yielding the reconstructed result HC;
Step 5. Build a linear classification model that maps the data feature representation matrix X to the reconstructed result HC of the complete class label matrix, and impose a sparsity constraint on the model coefficients W to learn label-specific features; in addition, use the model coefficients W to describe the semantics of the newly discovered labels;
Step 6. Use manifold regularization to constrain the similarity of the model coefficients corresponding to any two class labels, and thereby refine the complete label matrix H;
Step 7. Given a test sample t, feed it into the final classification model learned in Steps 1 to 6, and output the predictions for the test sample on the known class labels and the unknown class labels.
3. Beneficial effects
Compared with the known prior art, the technical solution provided by the invention has the following notable effects:
(1) Existing multi-label learning methods assume that all class labels of the dataset are known and cannot simultaneously handle partially missing and unknown class labels, which reduces the accuracy of multi-label classification. The invention integrates the handling of partially missing labels, the handling of unknown class labels, and classification into a unified framework. It uses matrix factorization to decompose the similarity matrix computed from the feature matrix and the class label matrix, obtaining an approximate solution of the complete label matrix from which unknown class labels are discovered. It constrains the known-label part of the approximate solution, which contains partially missing values, to agree with the values actually observed. At the same time it builds a multi-label classification model from sample features to complete labels, which can predict both known and newly discovered class labels for new samples.
(2) By discovering unknown class labels, the method can mine valuable implicit information from the data. Through label reconstruction it can also handle partially missing label values. By modeling the correlations between known and unknown class labels, the invention improves classification accuracy on the known class labels and also improves the model's ability to discover unknown class labels. The label-specific features learned by the model can effectively describe the semantic concepts of the newly discovered classes.
Description of the Drawings
Fig. 1 is a framework diagram of the multi-label classification model of the invention;
Fig. 2 is a table of semantic descriptions for five new class labels.
Detailed Description
For a better understanding of the invention, it is described in detail below with reference to the drawings and an embodiment.
The invention first computes a sample similarity matrix from the sample features and the known label values, and then applies non-negative matrix factorization to this similarity matrix to obtain approximate values for the unknown new labels while preserving the neighborhood structure among samples. Second, the invention uses matrix reconstruction and label correlations to jointly refine the entire label matrix, including both partially missing and completely missing labels. Finally, because the label matrix contains partially and completely missing values, the label correlation matrix cannot be computed directly, and a correlation matrix computed from the approximate complete label matrix would itself contain errors. The invention therefore lets the model learn the correlations automatically, folding correlation learning into the model optimization.
Example 1
With reference to Fig. 1, the multi-label classification method of this embodiment for data with partially missing and unknown class labels comprises two stages, model training and label prediction. The specific steps are as follows:
(1) Model training
Step 1. Perform feature extraction and class annotation on the training data to obtain the data feature representation matrix X and the known class label matrix Y, where Y contains partially missing values, and set the number of unknown class labels to an integer r. Specifically:
The training data features are represented as a two-dimensional real matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of samples, d is the number of features, and $\mathbb{R}$ is the real number field. $Y \in \{0,1\}^{n \times q}$ is the class label matrix of the known classes of the training data, and q is the number of known class labels. The element in row i and column j of Y is denoted $Y_{ij}$. $Y_{ij} = 1$ indicates that the i-th sample belongs to the j-th class label; $Y_{ij} = 0$ indicates either that the i-th sample does not belong to the j-th class or that the value is missing. Here i is a positive integer from 1 to n and j is a positive integer from 1 to q. With the number of unknown class labels set to r, the complete class label matrix is a two-dimensional matrix of size $n \times l$, where $l = q + r$ is the total number of class labels and each element takes a value in $\{0,1\}$.
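The notation of Step 1 can be made concrete with a small sketch. The toy sizes, the random data, and the helper array `observed_mask` are illustrative assumptions: the source only states that a missing entry is stored as 0 in Y and does not describe a separate mask.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 4   # number of samples, number of features (toy sizes, assumed)
q, r = 3, 2   # known labels, and the chosen number of unknown labels
l = q + r     # total number of class labels

# Feature matrix X in R^{n x d}.
X = rng.normal(size=(n, d))

# Fully annotated labels (never available in practice, used here only to simulate data).
Y_full = (rng.random(size=(n, q)) > 0.5).astype(int)

# Hypothetical mask of entries the annotator actually provided.
observed_mask = rng.random(size=(n, q)) > 0.3

# Known label matrix Y in {0,1}^{n x q}: a 0 means "negative or missing".
Y = Y_full * observed_mask

# The complete label matrix to be recovered has shape (n, l).
complete_shape = (n, l)
```

Note that after this step a 0 in `Y` is ambiguous, which is exactly why the later steps refine the label matrix instead of trusting `Y` directly.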
Step 2. Compute the similarity matrix S from the feature representation matrix X of the training data and the known class label matrix Y using a Gaussian distance function. Specifically:
The sample similarity matrix $S \in \mathbb{R}^{n \times n}$ is computed from X and Y with a Gaussian distance function, where the element $S_{ij}$ in row i and column j is given by formula (1).
In formula (1), exp is the exponential function; $x_i$ and $x_j$ denote the i-th and j-th rows of the feature representation matrix X; and $y_i$ and $y_j$ denote the i-th and j-th rows of the known class label matrix Y. The formula also uses the set of k-nearest-neighbor samples of $x_j$.
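The following sketch shows one plausible way to build such a similarity matrix. It is not the patented formula (1), whose exact form is not shown in this text. The assumptions are: features and labels are concatenated before the Gaussian kernel is applied, neighbors are searched in feature space only, the mask is symmetrized, and `sigma` and `k` are free parameters.

```python
import numpy as np

def gaussian_knn_similarity(X, Y, k=2, sigma=1.0):
    """Gaussian similarity on [features, labels], kept only between k-nearest neighbours."""
    X = np.asarray(X, dtype=float)
    joint = np.hstack([X, np.asarray(Y, dtype=float)])  # assumption: simple concatenation

    # Pairwise squared distances on the joint representation.
    sq = np.sum(joint**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * joint @ joint.T, 0.0)
    S = np.exp(-dist2 / (2.0 * sigma**2))

    # k nearest neighbours of each sample, searched in feature space (assumption).
    fsq = np.sum(X**2, axis=1)
    fdist2 = fsq[:, None] + fsq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(fdist2, np.inf)              # a sample is not its own neighbour
    nn = np.argsort(fdist2, axis=1)[:, :k]

    mask = np.zeros(S.shape, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    mask[rows, nn.ravel()] = True
    mask = mask | mask.T                          # symmetrise so S stays symmetric
    np.fill_diagonal(mask, True)
    return S * mask
```

Keeping S symmetric matters for the next step, because a factorization of the form $HH^{T}$ can only approximate a symmetric matrix.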
Step 3. Decompose the similarity matrix S to obtain an approximate representation H of the complete class label matrix that preserves the similarity structure of the samples, and constrain the corresponding part of H to agree with the known class label matrix Y obtained in Step 1:
Using non-negative matrix factorization, the sample similarity matrix $S \in \mathbb{R}^{n \times n}$ is decomposed as $S \approx HH^{T}$. This is approximately equivalent to clustering the n training samples into l classes with the K-Means algorithm, where $H \in [0,1]^{n \times l}$ is the soft clustering indicator matrix and $H^{T}$ is the transpose of H. H can therefore serve as an approximate representation of the complete class label matrix. Because H is obtained by decomposing S, H and S share the same inter-sample similarity structure.
The corresponding part of H is then constrained to agree with the known class label matrix Y. Because permuting the columns of H does not change the product $HH^{T}$, the first q columns of H can be assumed, without loss of generality, to correspond to Y. Let the matrix P be the first q columns of the $l \times l$ identity matrix. Right-multiplying H by P returns the first q columns of H, and the error between HP and Y is then minimized, with the error measured in the Frobenius norm. This gives the minimization objective of formula (2).
In formula (2), $\Omega \in \{0,1\}^{n \times n}$ encodes the neighborhood relations to be preserved, where 1 means preserve and 0 means do not preserve. $P \in \{0,1\}^{l \times q}$ is the mapping matrix, equal to the first q columns of the $l \times l$ identity matrix. The matrix H is the model parameter to be solved for, and $\lambda_0$ is a non-negative weight chosen from $\{10^{-1}, 10^{0}, 10^{1}\}$.
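A short sketch can make the role of P concrete and show how an objective of this kind could be evaluated. The selection matrix follows the description above. The form of the objective, $\|\Omega \circ (S - HH^{T})\|_F^2 + \lambda_0 \|HP - Y\|_F^2$, is an assumed reading based on the symbols described for formula (2), and the function names are hypothetical.

```python
import numpy as np

def selection_matrix(l, q):
    """First q columns of the l x l identity, so that H @ P equals H[:, :q]."""
    return np.eye(l)[:, :q]

def step3_objective(S, H, Y, Omega, lam0):
    """Assumed form: ||Omega * (S - H H^T)||_F^2 + lam0 * ||H P - Y||_F^2."""
    P = selection_matrix(H.shape[1], Y.shape[1])
    fit = np.linalg.norm(Omega * (S - H @ H.T), "fro") ** 2   # neighbourhood-preserving fit
    consistency = np.linalg.norm(H @ P - Y, "fro") ** 2       # agreement with known labels
    return fit + lam0 * consistency
```

For example, with `H` of shape `(n, 5)` and `Y` of shape `(n, 3)`, `H @ selection_matrix(5, 3)` is the same array as `H[:, :3]`; writing it as a matrix product keeps the objective differentiable in H in the usual matrix-calculus form.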
Step 4. Use matrix reconstruction to refine the complete label matrix H, obtaining HC. Specifically:
Learn a reconstruction coefficient matrix $C \in \mathbb{R}^{l \times l}$. The j-th label value $H_{ij}$ of the i-th sample can be reconstructed from that sample's values on the remaining labels; in matrix form, $H \approx HC$. When a label value is missing, it can be recovered in this way. Solving for the reconstruction coefficient matrix C therefore refines the label matrix H into HC, which gives the updated minimization objective of formula (3).
In formula (3), $C \in \mathbb{R}^{l \times l}$ is the reconstruction coefficient matrix, and the matrices H and C are the model parameters to be solved for. $\lambda_1$ is a non-negative weight chosen from $\{10^{0}, 10^{1}, 10^{2}\}$.
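The reconstruction idea can be illustrated with H held fixed. This is a minimal sketch, not the joint optimization of formula (3): it fits each column of C by ridge-regularized least squares, and both the ridge term `alpha` and the zero diagonal (so that a label is rebuilt only from the remaining labels) are assumptions.

```python
import numpy as np

def fit_reconstruction_matrix(H, alpha=1e-2):
    """Fit C (l x l, zero diagonal) so that H[:, j] ~= H[:, others] @ C[others, j]."""
    n, l = H.shape
    C = np.zeros((l, l))
    for j in range(l):
        others = [k for k in range(l) if k != j]
        A = H[:, others]
        # Ridge-regularised normal equations (the ridge term is an assumption).
        coef = np.linalg.solve(A.T @ A + alpha * np.eye(l - 1), A.T @ H[:, j])
        C[others, j] = coef
    return C
```

Once C is available, `H @ C` gives the refined label matrix, in which an entry that was missing for one label is filled in from the sample's scores on correlated labels.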
Step 5. Build a linear classification model that maps the data feature representation matrix X to the reconstructed result HC of the complete class label matrix, and impose a sparsity constraint on the model coefficients to learn label-specific features. In addition, use the model coefficients W to describe the semantics of the newly discovered labels. Specifically:
A multivariate linear regression model is used as the classifier. Based on the feature representation X of the training data, a linear classification model $f(X, W) = XW$ is learned that maps to the refined complete class label matrix HC. An L1 regularization constraint is imposed on the model parameters $W \in \mathbb{R}^{d \times l}$ in order to learn the label-specific features in the feature representation, which are used to describe the semantic concepts of the unknown class labels. This gives the updated minimization objective of formula (4).
In formula (4), H is constrained so that $H \in [0,1]^{n \times l}$. The matrices W, H, and C are the model parameters to be solved for, and $\lambda_2$ is a non-negative weight chosen from $\{10^{-1}, 10^{0}, 10^{1}\}$.
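The effect of the L1 constraint on W can be sketched with the target held fixed. The source does not state which optimizer is used. The example below assumes a plain proximal-gradient (ISTA) update for $\tfrac{1}{2}\|XW - T\|_F^2 + \lambda_2\|W\|_1$, where `T` stands in for HC, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||A||_1 (element-wise shrinkage toward zero)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def fit_sparse_linear(X, T, lam2=0.1, n_iter=500):
    """Minimise 0.5 * ||X W - T||_F^2 + lam2 * ||W||_1 by proximal gradient (ISTA)."""
    d, l = X.shape[1], T.shape[1]
    W = np.zeros((d, l))
    lipschitz = np.linalg.norm(X, 2) ** 2      # largest eigenvalue of X^T X
    step = 1.0 / max(lipschitz, 1e-12)
    for _ in range(n_iter):
        grad = X.T @ (X @ W - T)               # gradient of the smooth least-squares term
        W = soft_threshold(W - step * grad, step * lam2)
    return W
```

The shrinkage step sets small coefficients exactly to zero, which is what lets each column of W single out a small set of features for its label.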
For the last r columns of the model coefficient matrix W, each column is sorted in descending order of the absolute values of its elements, and the features corresponding to the 10 largest coefficients in each column form a label-specific feature set. This feature set can be used to describe the semantic concept of a newly discovered class. Specifically:
Let the feature set of the dataset be $f = \{f_1, f_2, \ldots, f_d\}$, and let the last r columns of W be $\{w_{q+1}, w_{q+2}, \ldots, w_{q+r}\}$, which are the model coefficients of the r newly discovered class labels. The L1 regularization constraint performs feature selection: it filters out features with weak discriminative power for a class and retains features with strong discriminative power. The larger the absolute value of an element of W, the stronger its discriminative power. Each $|w_{q+i}|$ is sorted in descending order, and the features corresponding to its 10 largest coefficients form the label-specific feature set used to describe the semantic concept of the i-th newly discovered unknown class label.
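The ranking described above can be sketched in a few lines. The function name and the `feature_names` argument are illustrative; the rule itself (sort each of the last r columns by absolute value and keep the top 10) follows the text.

```python
import numpy as np

def class_specific_features(W, feature_names, r, top=10):
    """For each of the last r columns of W, list the `top` features with largest |coefficient|."""
    result = []
    for col in W[:, -r:].T:
        order = np.argsort(-np.abs(col))[:top]   # indices in descending order of |w|
        result.append([feature_names[i] for i in order])
    return result
```

For text data, `feature_names` would typically be the vocabulary, so the returned lists are directly readable as keyword descriptions of each new label.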
Step 6. Let the model learn the correlations between class labels automatically, use manifold regularization to constrain the similarity of the model coefficients corresponding to any two class labels, and thereby refine the complete label matrix H.
Because the known labels in the label matrix have partially missing values and the values of the unknown labels are entirely unknown, class correlations computed directly from the approximate representation H are inaccurate. This method therefore learns $Z^{T}Z$ directly to approximate the Laplacian matrix of the true label correlation matrix. The correlations between labels are modeled, and manifold regularization constrains the model coefficients of correlated labels to be similar. The final objective to be solved is formula (5).
In formula (5), l is the total number of class labels and m is a hyperparameter. $\mathrm{diag}(Z^{T}Z)$ returns the vector formed by the main-diagonal elements of $Z^{T}Z$; the constraint $\mathrm{diag}(Z^{T}Z) = 1$ rules out the trivial solution $Z = 0$. The matrices W, H, C, and Z are the model parameters to be solved for, $\lambda_3$ is a non-negative weight chosen from $\{5^{0}, 5^{1}, 5^{2}, 5^{3}\}$, and $\mathrm{tr}(\cdot)$ denotes the matrix trace.
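Two pieces of this step lend themselves to a short sketch: enforcing $\mathrm{diag}(Z^{T}Z) = 1$ and evaluating a trace penalty on W. The shape of Z (m rows, l columns), the penalty form $\mathrm{tr}(W Z^{T} Z W^{T})$, and the use of column normalization to satisfy the constraint are all assumptions, since formula (5) is not shown in this text.

```python
import numpy as np

def normalize_columns(Z, eps=1e-12):
    """Rescale each column of Z to unit length, which makes diag(Z^T Z) equal to 1."""
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    return Z / np.maximum(norms, eps)

def manifold_penalty(W, Z):
    """tr(W (Z^T Z) W^T), computed as ||Z W^T||_F^2 without forming the l x l matrix."""
    return np.linalg.norm(Z @ W.T, "fro") ** 2
```

The identity $\mathrm{tr}(W Z^{T} Z W^{T}) = \|Z W^{T}\|_F^2$ also shows why $Z = 0$ would trivially minimize the penalty, and hence why a constraint on the diagonal is needed.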
(2) Label prediction
Step 7. Given a test sample t, feed it into the final classification model learned in Steps 1 to 6, and output the predictions for the test sample on the known class labels and the unknown class labels. Specifically:
Given the feature representation $x_t \in \mathbb{R}^{1 \times d}$ of a test sample t and the model coefficients W obtained in the training stage, the predicted value for the test sample t is
$$f(x_t, W) = x_t W \qquad (6)$$
Then, from the predicted value of the test sample t given by formula (6) and a chosen classification threshold $\tau$, the final output label vector $y_t \in \{0,1\}^{1 \times l}$ of the test sample over the q known classes and the r unknown classes is computed as
$$y_t(i) = [\, f(x_t, W)_i > \tau \,], \quad 1 \le i \le l \qquad (7)$$
In formula (7), $[\cdot]$ is the indicator function, which returns 1 when the enclosed condition holds and 0 otherwise. When $[\, f(x_t, W)_i > \tau \,]$ equals 1, the test sample t belongs to the i-th class; otherwise it does not. The first q elements of the vector $y_t$ indicate whether the test sample belongs to each of the q known classes, and the last r elements indicate whether the test sample belongs to each of the r newly discovered unknown class labels.
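Formulas (6) and (7) translate almost directly into code. In the sketch below, the function name and the default threshold `tau=0.5` are assumptions; the source leaves the value of the threshold open.

```python
import numpy as np

def predict_labels(x_t, W, q, tau=0.5):
    """Return (known, new) 0/1 label vectors for one test sample."""
    scores = np.asarray(x_t, dtype=float) @ W    # formula (6): f(x_t, W) = x_t W
    y_t = (scores > tau).astype(int)             # formula (7): indicator of score > tau
    return y_t[:q], y_t[q:]                      # first q known labels, last r new labels
```

Splitting the output at index q mirrors the column ordering fixed in Step 3, where the first q columns of H were tied to the known label matrix Y.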
The semantic description approach of this embodiment works well for text data, where the meaning of a new label can be read directly from the words. Fig. 2 gives an example on a computer-related dataset, showing the label-specific features of the new class labels learned by the method of this embodiment. For datasets such as images and videos, high-level features are needed, for example features learned by a deep learning model.
This embodiment integrates the learning of partially missing class labels and of unknown class labels in multi-label learning into a unified framework. It uses a Gaussian distance function to compute a sample similarity matrix from the feature matrix and the label matrix with missing values, and uses non-negative matrix factorization to decompose the similarity matrix into an approximate solution of the complete class label matrix, constraining part of the approximate solution to agree with the observed label values. At the same time it builds a classification model from sample features to complete labels, models the correlations between the known labels and the newly discovered unknown labels, and constrains any two strongly correlated classes to have similar classification model coefficients. Through label reconstruction it iteratively refines the complete label matrix and thereby learns a more accurate classification model. This embodiment not only addresses partially missing values in the known class labels but also discovers unknown class labels in multi-label data, mining valuable implicit information from the data. The multi-label classification model proposed in this embodiment can predict both known and newly discovered class labels for new samples.
The invention and its embodiments have been described above schematically, and the description is not limiting. What is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it. Therefore, structures and embodiments similar to this technical solution that a person of ordinary skill in the art, inspired by this disclosure, designs without inventive effort and without departing from the spirit of the invention shall fall within the scope of protection of the invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010870298.7A | 2020-08-26 | 2020-08-26 | Multi-label classification method with partial deletion and unknown class labels |
| Publication Number | Publication Date |
|---|---|
| CN112132186A | 2020-12-25 |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-12-25 |