CN112132186A


Info

Publication number: CN112132186A
Application number: CN202010870298.7A
Authority: CN
Inventors: 黄俊; 屈喜文; 郑啸; 陶陶; 袁志祥; 程泽凯; 秦锋
Original assignee: Anhui University of Technology AHUT
Current assignee: Anhui University of Technology AHUT
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-25

Abstract

Translated from Chinese

The invention discloses a multi-label classification method for data with partially missing and unknown class labels, belonging to the technical field of machine learning. The invention handles partially missing class labels and unknown class labels in a single framework. A Gaussian distance function is used to compute a sample similarity matrix, which is then factorized to obtain an approximate solution of the complete class label matrix; part of this approximate solution is constrained to agree with the observed labels. At the same time, a classification model from sample features to the complete labels is built, the correlation between the known labels and the newly discovered unknown labels is modeled, strongly correlated labels are constrained to have similar classification model coefficients, and the complete label matrix is optimized iteratively, so that an accurate classification model is learned. The invention not only handles partially missing values in the known class labels but also discovers unknown class labels in multi-label data, mining valuable implicit information from the data.

Description


A multi-label classification method with partially missing and unknown class labels

Technical field

The present invention relates to the technical field of machine learning, and more particularly to a multi-label classification method for data with partially missing and unknown class labels.

Background art

Multi-label learning is an active research topic in machine learning and has received extensive attention from researchers in academia and industry in recent years. In multi-label learning, each sample can belong to several class labels at the same time; for example, a movie can simultaneously be an "action movie", a "war movie", and a "thriller". Multi-label learning is widely used in practice, for example in text classification, image and video annotation, music classification, and product recommendation.

The main task of multi-label learning is to learn, from a given training dataset, an effective multi-label classification model that can predict one or more possible class labels for new samples. Researchers have proposed many methods for this problem. Existing multi-label learning methods mostly assume that the class label set of the training dataset is complete and that all label values are known. During annotation, however, an annotator assigns one or more relevant class labels to each sample. This process is time-consuming and labor-intensive, and it is difficult for the annotator to assign every relevant class accurately, especially when the total number of class labels is large. The annotation results can therefore be partially missing, or even completely missing, meaning that some class labels are not assigned to any sample. In addition, the semantics of multi-label data are complex, and some class labels may lie beyond the scope of human cognition, so that these labels are likewise not assigned to any sample. Such completely missing class labels are unknown during training, which makes learning difficult.

Researchers have proposed some multi-label classification methods that handle missing labels, but these methods only handle partially missing values and cannot handle datasets with unknown class labels. They are mainly based on matrix completion, or they ignore the missing entries when constructing the classification loss function. Both strategies require at least one positive sample for each class. Therefore, when the data contain unknown class labels, whose label values are completely missing, none of the existing methods can handle them. Two methods have been proposed for the case of unknown class labels: the multi-instance multi-label learning method with novel class instances published by A. Pham et al. at the International Conference on Machine Learning, and the multi-instance multi-label learning method for discovering multiple novel labels published by Zhu Yue et al. at the annual conference of the Association for the Advancement of Artificial Intelligence. However, both methods apply only to multi-instance multi-label learning. They cannot be used for multi-label learning in the general case, that is, single-instance multi-label learning, and they cannot handle partially missing labels.

A search found Chinese patent application No. 201911306128.X, published on April 21, 2020 and entitled "A latent class discovery and classification method in multi-label classification". That application integrates known-label classification with latent-label discovery and classification in one framework. Using non-negative matrix factorization, it decomposes the feature matrix into an approximate solution of the complete class label matrix and a coefficient matrix, and constrains the known part of the approximate solution to agree with the true values. At the same time it builds a classification model from sample features to the complete labels and discovers latent label types. Latent label discovery mines valuable implicit information from the data. The correlation between known and latent labels is used to constrain strongly correlated classes to have similar classification model coefficients and hence similar predictions, so that known-label classification and latent-label classification guide and reinforce each other, which improves the classification performance on both.

However, that application assumes that the values of the known labels are completely observed; when some known label values are missing, the performance of its algorithm degrades. In practice, when the data contain unknown new labels, it is even more common for the values of the known labels to be partially missing, so that application incurs errors when applied to real data.

Summary of the invention

1. Technical problem to be solved by the invention

In the prior art, traditional multi-label learning methods assume that the number of class labels in the dataset is fixed and that all label values are known. They cannot handle datasets with unknown class labels, which reduces classification accuracy. To address this problem, the present invention provides a multi-label classification method for data with partially missing and unknown class labels. It proposes an effective learning method that discovers the unknown class labels in the dataset and builds a multi-label classification model over both the known and the unknown class labels, making the multi-label classification results more accurate.

2. Technical solution

To achieve the above object, the technical solution provided by the invention is as follows.

The multi-label classification method with partially missing and unknown class labels of the present invention comprises the following steps:

Step 1. Perform feature extraction and class labeling on the training data to obtain a data feature representation matrix X and a known class label matrix Y.

Step 2. Compute the similarity matrix S from the feature representation matrix X and the known class label matrix Y.

Step 3. Factorize the similarity matrix S to obtain an approximate representation H of the complete class label matrix that preserves the sample similarity structure, and constrain part of H to agree with the known class label matrix Y obtained in Step 1.

Step 4. Use matrix reconstruction to optimize the approximate representation H of the complete class label matrix, refining H into the reconstruction result HC.

Step 5. Build a linear classification model that maps the data feature representation matrix X to the reconstruction result HC of the complete class label matrix, and impose a sparsity constraint on the model coefficients W to learn label-specific features; at the same time, use the model coefficients W to describe the newly discovered labels semantically.

Step 6. Use manifold regularization to constrain the similarity of the model coefficients corresponding to any two class labels, and thereby optimize the complete label matrix H.

Step 7. Given a test sample t, feed it into the final classification model learned in Steps 1 to 6 and output its predictions on the known and unknown class labels.

3. Beneficial effects

Compared with the known prior art, the technical solution provided by the invention has the following notable effects:

(1) Existing multi-label learning methods assume that all class labels of the dataset are known and cannot simultaneously handle partially missing and unknown class labels, which reduces classification accuracy. The present invention integrates the handling and classification of partially missing and unknown class labels into a unified framework. Using matrix factorization, it factorizes the similarity matrix computed from the feature matrix and the class label matrix to obtain an approximate solution of the complete label matrix, and thereby discovers unknown class labels. The part of the approximate solution that corresponds to the known class labels, which have partially missing values, is constrained to agree with the actually observed values. At the same time, a multi-label classification model from sample features to the complete labels is built, which can predict both the known classes and the newly discovered class labels for new samples.

(2) By discovering unknown class labels, the method mines valuable implicit information from the data. Through label reconstruction, it handles partially missing label values. By modeling the correlation between known and unknown class labels, it improves the classification accuracy on the known class labels and also improves the model's ability to discover unknown class labels. The label-specific features learned by the model can effectively describe the semantic concepts of the newly discovered classes.

Brief description of the drawings

Fig. 1 is a framework diagram of the model of the multi-label classification method of the present invention;

Fig. 2 is a table of semantic descriptions of five new class labels.

Detailed description

For a further understanding of the invention, the invention is described in detail below with reference to the drawings and embodiments.

The invention first computes a sample similarity matrix from the sample features and the known label values, and then factorizes this matrix by non-negative matrix factorization to obtain approximate values for the unknown new labels while preserving the neighborhood structure among samples. Second, the invention uses matrix reconstruction together with label correlations to jointly optimize the whole label matrix, including both partially missing and completely missing labels. Finally, because the label matrix has partially and completely missing values, the label correlation matrix cannot be computed directly, and computing it from the approximate complete label matrix also introduces error. The invention therefore lets the model learn the correlations automatically, integrating correlation learning into the model optimization.

Example 1

With reference to Fig. 1, the multi-label classification method with partially missing and unknown class labels of this embodiment comprises two phases, model training and label prediction. The specific steps are as follows:

(1) Model training

Step 1. Perform feature extraction and class labeling on the training data to obtain a data feature representation matrix X and a known class label matrix Y, where Y has partially missing values, and set the number of unknown class labels to an integer r. Specifically:

The feature representation of the training data is assumed to be a two-dimensional real matrix X ∈ ℝ^{n×d}, where n is the number of samples, d is the number of features, and ℝ denotes the real number field. Y ∈ {0, 1}^{n×q} is the class label matrix of the known classes of the training data, where q is the number of known class labels and Y_ij denotes the element in row i and column j of Y. Y_ij = 1 means that the i-th sample belongs to the j-th class label; Y_ij = 0 means either that the i-th sample does not belong to the j-th class or that the value is missing. Here i is a positive integer between 1 and n, and j is a positive integer between 1 and q. With the number of unknown class labels set to r, the complete class label matrix is a two-dimensional matrix of size n×l, where l = q + r is the total number of class labels and each element takes a value in {0, 1}.

Step 2. Compute the similarity matrix S from the feature representation matrix X of the training data and the known class label matrix Y using a Gaussian distance function. Specifically:

The sample similarity matrix S, of size n×n, is computed from X and Y with a Gaussian distance function. The element S_ij in row i and column j is given by formula (1), which is not reproduced in this text.

In formula (1), exp is the exponential function, x_i and x_j denote the i-th and j-th rows of the feature representation matrix X, and y_i and y_j denote the i-th and j-th rows of the known class label matrix Y. The remaining symbol in formula (1) denotes the set of k-nearest-neighbor samples of x_j.
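Because formula (1) is not reproduced in this text, the following Python sketch shows only one plausible Gaussian similarity, under stated assumptions: the distance is the sum of squared Euclidean distances in feature space and label space, neighbors are found in feature space, entries outside the k-nearest-neighbor sets are set to 0, and the bandwidth `sigma` and the function name are hypothetical. It should not be read as the patent's exact formula.

```python
import numpy as np

def gaussian_knn_similarity(X, Y, k=2, sigma=1.0):
    """Assumed form (not the patent's formula (1)):
    S_ij = exp(-(||x_i - x_j||^2 + ||y_i - y_j||^2) / (2 * sigma^2))
    if x_i and x_j are k-nearest neighbors of each other in either direction,
    and S_ij = 0 otherwise."""
    X = np.asarray(X, dtype=float)
    Z = np.hstack([X, np.asarray(Y, dtype=float)])  # features and labels side by side
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

    # k nearest neighbors in feature space, excluding the sample itself
    sqx = np.sum(X ** 2, axis=1)
    dx2 = sqx[:, None] + sqx[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dx2, np.inf)
    nn = np.argsort(dx2, axis=1)[:, :k]

    n = X.shape[0]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nn.ravel()] = True
    mask = mask | mask.T                            # keep S symmetric

    return np.where(mask, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0)
```

Restricting S to neighbor pairs keeps the matrix sparse, so that the later factorization only has to preserve local structure.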

Step 3. Factorize the similarity matrix S to obtain an approximate representation H of the complete class label matrix that preserves the sample similarity structure, and constrain part of H to agree with the known class label matrix Y obtained in Step 1. Specifically:

Using non-negative matrix factorization, the sample similarity matrix S is factorized as HH^T. This is approximately equivalent to clustering the n training samples into l clusters with the K-means algorithm, where H ∈ [0, 1]^{n×l} is the soft cluster indicator matrix and H^T is the transpose of H. H can therefore serve as an approximate representation of the complete class label matrix. Because H is obtained by factorizing S, H and S share the same inter-sample similarity structure.

Part of H is constrained to agree with the known class label matrix Y. Permuting the columns of H does not change the product HH^T, so the first q columns of H can be assumed to correspond to Y. Let P be the first q columns of the l×l identity matrix; right-multiplying H by P returns the first q columns of H. The error between HP and Y, measured with the Frobenius norm, is then minimized. This gives the minimization objective of formula (2), which is not reproduced in this text.

In formula (2), Ω ∈ {0, 1}^{n×n} indicates which neighbor relations are to be preserved (1 means preserve, 0 means do not preserve); P is the mapping matrix, equal to the first q columns of the l×l identity matrix; H is the model parameter to be solved; and λ₀ is a non-negative weight coefficient chosen from {10^-1, 10^0, 10^1}.
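The role of P and the two terms described for Step 3 can be illustrated with a short Python sketch. The selection property of P follows directly from the text. The objective function below is an assumption pieced together from the description (a masked factorization error plus λ₀ times the agreement error), since formula (2) itself is not reproduced here; the function names are hypothetical.

```python
import numpy as np

def selection_matrix(l, q):
    """P: the first q columns of the l x l identity matrix, so H @ P == H[:, :q]."""
    return np.eye(l)[:, :q]

def step3_objective(S, H, Y, Omega, lam0):
    """Assumed objective (not confirmed by the source):
    ||Omega * (S - H H^T)||_F^2 + lam0 * ||H P - Y||_F^2,
    where * is the element-wise product."""
    P = selection_matrix(H.shape[1], Y.shape[1])
    fit = np.linalg.norm(Omega * (S - H @ H.T), "fro") ** 2
    agree = np.linalg.norm(H @ P - Y, "fro") ** 2
    return fit + lam0 * agree
```

With q known labels and l = q + r total labels, `H @ selection_matrix(l, q)` picks out exactly the columns that are compared with Y, leaving the last r columns free to represent the unknown labels.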

Step 4. Use matrix reconstruction to optimize the complete label matrix H, obtaining HC. Specifically:

A reconstruction coefficient matrix C of size l×l is learned. The j-th label value H_ij of the i-th sample can be reconstructed from that sample's values on the remaining labels, which in matrix form is written H ≈ HC. When a label value is missing, it can be obtained in this way. Solving for C therefore refines the label matrix H into HC, and gives the updated minimization objective of formula (3), which is not reproduced in this text.

In formula (3), C is the reconstruction coefficient matrix, and H and C are the model parameters to be solved. λ₁ is a non-negative weight coefficient chosen from {10^0, 10^1, 10^2}.
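The idea of reconstructing each label from the remaining labels can be sketched in Python as follows. This is a stand-in, not the patent's joint optimization: it fits C column by column with ridge regression while H is held fixed, and it assumes a zero diagonal so that a label is never reconstructed from itself. The function name and the `ridge` parameter are hypothetical.

```python
import numpy as np

def fit_reconstruction(H, ridge=1e-3):
    """Column-wise ridge regression: label j is predicted from the other labels.
    Returns an l x l matrix C with zero diagonal such that H @ C approximates H."""
    n, l = H.shape
    C = np.zeros((l, l))
    for j in range(l):
        others = [k for k in range(l) if k != j]
        A = H[:, others]
        # regularized normal equations: (A^T A + ridge * I) c = A^T h_j
        c = np.linalg.solve(A.T @ A + ridge * np.eye(l - 1), A.T @ H[:, j])
        C[others, j] = c
    return C
```

When two labels co-occur strongly, the corresponding entry of C is large, so a missing value on one label can be recovered from the observed value on the other through H @ C.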

Step 5. Build a linear classification model that maps the data feature representation matrix X to the reconstructed complete class label matrix HC, and impose a sparsity constraint on the model coefficients to learn label-specific features. At the same time, use the model coefficients W to describe the newly discovered labels semantically. Specifically:

A multiple linear regression model is used as the classifier. From the feature representation X of the training data, a linear classification model f(X, W) = XW is learned that maps to the optimized complete class label matrix HC, and an L₁ regularization constraint is imposed on the model parameter W to learn the label-specific features in the feature representation, which are used to describe the semantic concepts of the unknown class labels. This gives the updated minimization objective of formula (4), which is not reproduced in this text.

In formula (4), H is subject to the constraint H ∈ [0, 1]^{n×l}; W, H and C are the model parameters to be solved; and λ₂ is a non-negative weight coefficient chosen from {10^-1, 10^0, 10^1}.

For the last r columns of W, the entries of each column are sorted in descending order of absolute value, and the features corresponding to the 10 largest coefficients in each column form a label-specific feature set. This set can be used to describe the semantic concept of a newly discovered class. Specifically:

Let the feature set of the dataset be f = {f₁, f₂, ..., f_d}, and let the last r columns of W be {w^{q+1}, w^{q+2}, ..., w^{q+r}}, which are the model coefficients of the r newly discovered class labels. The L₁ regularization constraint acts as feature selection: it filters out features with weak discriminative power for a class and retains those with strong discriminative power. The larger the absolute value of an element of W, the stronger the discriminative power of the corresponding feature. Each |w^{q+i}| is sorted in descending order, and the features corresponding to its 10 largest coefficients form the label-specific feature set that describes the semantic concept of the i-th newly discovered unknown class label.
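The feature ranking described above can be sketched in Python. The function name and the `feature_names` list are hypothetical; the ranking rule (descending absolute value over the last r columns of W, keeping the top 10) follows the text.

```python
import numpy as np

def label_specific_features(W, feature_names, r, top=10):
    """For each of the last r columns of W, return the names of the `top`
    features with the largest absolute coefficients, in descending order."""
    result = []
    start = W.shape[1] - r                  # first column of the new labels
    for col in W[:, start:].T:
        order = np.argsort(-np.abs(col), kind="stable")[:top]
        result.append([feature_names[i] for i in order])
    return result
```

For text data, where each feature is a word, the returned lists can be read directly as keyword descriptions of the newly discovered labels.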

Step 6. Let the model learn the correlations between class labels automatically, and use manifold regularization to constrain the similarity of the model coefficients of any two class labels, thereby optimizing the complete label matrix H.

Because the known labels in the label matrix have partially missing values and the values of the unknown labels are entirely unknown, label correlations computed directly from the approximate representation H are inaccurate. The method therefore learns Z^T Z directly to approximate the Laplacian matrix of the true label correlation matrix. The correlation between labels is modeled, and manifold regularization constrains the model coefficients of correlated labels to be similar. The final objective to be solved is formula (5), which is not reproduced in this text.

In formula (5), Z is an m×l matrix, where l is the total number of class labels and m is a hyperparameter. diag(Z^T Z) returns the vector formed by the elements on the main diagonal of Z^T Z; the constraint diag(Z^T Z) = 1 rules out the trivial solution Z = 0. W, H, C and Z are the model parameters to be solved; λ₃ is a non-negative weight coefficient chosen from {5^0, 5^1, 5^2, 5^3}; and tr(·) denotes the matrix trace.
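The constraint diag(Z^T Z) = 1 and the trace-based regularizer can be illustrated in Python. Formula (5) is not reproduced in this text, so the penalty tr(W Z^T Z W^T) below is the usual manifold-regularization form and is an assumption, as are the function names; W is taken to be d×l and Z to be m×l.

```python
import numpy as np

def normalize_columns(Z):
    """Scale each column of Z to unit length so that diag(Z^T Z) = 1."""
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def manifold_penalty(W, Z):
    """Assumed regularizer tr(W Z^T Z W^T), which equals ||Z W^T||_F^2."""
    return np.trace(W @ Z.T @ Z @ W.T)
```

Keeping every column of Z at unit length makes Z = 0 infeasible, which is why the constraint prevents the penalty from being driven to zero by a degenerate Z.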

(2) Label prediction

Step 7. Given a test sample t, feed it into the final classification model learned in Steps 1 to 6 and output its predictions on the known and unknown class labels. Specifically:

Given the feature representation x_t of a test sample t, a row vector of length d, the predicted value for t is computed from the model coefficients W obtained in the training phase:

f(x_t, W) = x_t W (6)

Then, from the predicted value given by formula (6) and a preset classification threshold τ, the final output label vector y_t ∈ {0, 1}^{1×l} of the test sample over the q known classes and the r unknown classes is computed:

y_t(i) = [f(x_t, W)_i > τ], 1 ≤ i ≤ l (7)

In formula (7), [·] is the indicator function, which returns 1 when the condition inside it holds and 0 otherwise. [f(x_t, W)_i > τ] = 1 means that the test sample t belongs to the i-th class; otherwise it does not. The first q elements of y_t indicate whether the test sample belongs to each of the q known classes, and the last r elements indicate whether it belongs to each of the r newly discovered unknown class labels.
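Formulas (6) and (7) translate almost directly into code. The following Python sketch assumes x_t is a NumPy row vector and W is the learned d×l coefficient matrix; the function names and the default threshold `tau=0.5` are hypothetical, since the text does not fix a value for τ.

```python
import numpy as np

def predict_labels(x_t, W, tau=0.5):
    """Formulas (6) and (7): threshold the linear scores x_t W at tau."""
    scores = x_t @ W                     # formula (6)
    return (scores > tau).astype(int)    # formula (7), indicator function

def split_known_unknown(y_t, q):
    """First q entries: known classes; remaining r entries: discovered classes."""
    y_t = np.ravel(y_t)
    return y_t[:q], y_t[q:]
```

Splitting the output at position q separates the predictions for the annotated classes from those for the labels discovered during training.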

The semantic description in this embodiment works well for text data, where the meaning of a new label can be read directly from the words. Fig. 2 gives an example for a computer-related dataset, showing the label-specific features of the new class labels learned by the method of this embodiment. For datasets such as images and videos, high-level features are required, for example features learned by a deep learning model.

This embodiment integrates the learning of partially missing class labels and unknown class labels in multi-label learning into a unified framework. A Gaussian distance function is used to compute a sample similarity matrix from the feature matrix and the label matrix with missing values. The similarity matrix is factorized by non-negative matrix factorization to obtain an approximate solution of the complete class label matrix, part of which is constrained to agree with the observed labels. At the same time, a classification model from sample features to the complete labels is built, the correlation between the known labels and the newly discovered unknown labels is modeled, any two strongly correlated classes are constrained to have similar classification model coefficients, and the complete label matrix is optimized iteratively through label reconstruction, so that a more accurate classification model is learned. This embodiment not only handles partially missing values in the known class labels but also discovers unknown class labels in multi-label data, mining valuable implicit information from the data. The multi-label classification model proposed in this embodiment can predict both the known classes and the newly discovered class labels for new samples.

The invention and its embodiments have been described above schematically, and the description is not limiting. What is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it. Therefore, if a person of ordinary skill in the art, inspired by this disclosure and without departing from the spirit of the invention, devises without inventive effort structures and embodiments similar to this technical solution, they shall all fall within the protection scope of the invention.

Claims

1. A multi-label classification method with partially missing and unknown class labels, characterized by comprising the following steps:

step one, performing feature extraction and class labeling on training data to obtain a data feature representation matrix X and a known class label matrix Y;

step two, computing a similarity matrix S from the feature representation matrix X and the known class label matrix Y;

step three, factorizing the similarity matrix S to obtain an approximate representation H of the complete class label matrix that preserves the sample similarity structure, and constraining part of the approximate representation H to agree with the known class label matrix Y obtained in step one;

step four, optimizing the approximate representation H of the complete class label matrix by matrix reconstruction, refining H into a reconstruction result HC;

step five, constructing a linear classification model that maps the data feature representation matrix X to the reconstruction result HC of the complete class label matrix, imposing a sparsity constraint on the model coefficients W, and learning label-specific features; at the same time, describing the newly discovered labels semantically by means of the model coefficients W;

step six, using manifold regularization to constrain the similarity of the model coefficients corresponding to any two class labels, and thereby optimizing the complete label matrix H;

step seven, given a test sample t, feeding the test sample t into the final classification model learned in steps one to six, and outputting the predictions of the test sample on the known class labels and the unknown class labels.

2. The multi-label classification method according to claim 1, characterized in that: in step one, $X \in \mathbb{R}^{n \times d}$ is a two-dimensional real matrix, where n is the number of samples, d is the number of features, and $\mathbb{R}$ is the field of real numbers; the known class label matrix $Y \in \{0,1\}^{n \times q}$ contains some missing values, where q is the number of known class labels; the number of unknown class labels is set to r, and the complete class label matrix is a two-dimensional matrix of size $n \times l$, where $l = q + r$ is the total number of class labels and each element takes a value in $\{0, 1\}$.

3. The multi-label classification method according to claim 2, characterized in that: in step two, a Gaussian distance function is applied to the feature representation matrix X and the known class label matrix Y to obtain the similarity matrix $S \in \mathbb{R}^{n \times n}$, whose element $S_{ij}$ in row i and column j is computed according to formula (1); in formula (1), exp is the exponential function, $x_i$ and $x_j$ denote the i-th and j-th rows of the feature representation matrix X, $y_i$ and $y_j$ denote the i-th and j-th rows of the known class label matrix Y, and the neighborhood term denotes the set of k nearest neighbor samples of $x_j$.
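As an illustration of the kind of computation claim 3 describes, the following is a minimal sketch of a Gaussian similarity restricted to k nearest neighbors. Formula (1) is not reproduced in this text, so the way X and Y are combined here (row-wise concatenation, with missing label entries coded as 0), the bandwidth `sigma`, and the function name `gaussian_knn_similarity` are all assumptions, not the patent's definition.

```python
import numpy as np

def gaussian_knn_similarity(X, Y, k=1, sigma=1.0):
    """Gaussian similarity between samples, kept only for k-nearest-neighbor pairs."""
    # Assumption: features and (0-filled) labels are concatenated per sample.
    Z = np.hstack([X, Y]).astype(float)
    sq = np.sum(Z ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at 0 against round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    S = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(D2, np.inf)              # a sample is not its own neighbor
    nn = np.argsort(D2, axis=1)[:, :k]        # indices of the k nearest neighbors
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.arange(len(Z))[:, None], nn] = True
    mask |= mask.T                            # keep a pair if either side is a neighbor
    return S * mask

# Hypothetical toy data: two well-separated pairs of samples.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y_demo = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
S_demo = gaussian_knn_similarity(X_demo, Y_demo, k=1)
```

Only pairs inside each cluster keep a non-zero similarity in this toy case; the diagonal is zero because self-similarity is masked out.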

4. The multi-label classification method according to claim 3, characterized in that: in step three, the sample similarity matrix S is decomposed into $HH^{T}$ using non-negative matrix factorization, where $H \in [0,1]^{n \times l}$ is an approximate representation of the complete class label matrix and $H^{T}$ is the transpose of H; the matrix P is used to constrain part of H to be consistent with the known class label matrix Y, which gives the minimization objective, where $\Omega \in \{0,1\}^{n \times n}$ indicates which neighbor relations are to be preserved (1 for preserved, 0 for not preserved); P is a mapping matrix formed by the first q columns of the $l \times l$ identity matrix; the matrix H is the model parameter to be solved; and $\lambda_0$ is a non-negative weight coefficient with values in $\{10^{-1}, 10^{0}, 10^{1}\}$.
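The factorization in claim 4 can be sketched with projected gradient descent. The objective itself is not reproduced in this text, so the sketch assumes the loss $\|\Omega \circ (S - HH^{T})\|_F^2$, enforces $H \in [0,1]^{n \times l}$ by clipping, and enforces $HP = Y$ only on observed entries by overwriting them after each step; the role of $\lambda_0$ is omitted because it cannot be recovered here. The function name, the `observed` mask, the step size and the toy data are hypothetical.

```python
import numpy as np

def constrained_symnmf(S, Omega, Y, observed, r, steps=500, lr=1e-2, seed=0):
    """Approximate S by H @ H.T with H in [0, 1] and known labels kept fixed."""
    n, q = Y.shape
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.0, 1.0, size=(n, q + r))   # q known + r unknown label columns
    H[:, :q][observed] = Y[observed]
    for _ in range(steps):
        R = Omega * (S - H @ H.T)                # masked residual
        grad = -2.0 * (R + R.T) @ H              # gradient of the assumed loss
        H = np.clip(H - lr * grad, 0.0, 1.0)     # project onto the box [0, 1]
        H[:, :q][observed] = Y[observed]         # H P = Y on observed entries
    return H

# Hypothetical toy problem: 4 samples, 1 known label, 1 unknown label.
S_toy = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
Omega_toy = np.ones((4, 4)) - np.eye(4)
Y_toy = np.array([[1.0], [0.0], [0.0], [0.0]])
observed_toy = np.array([[True], [False], [True], [False]])
H_toy = constrained_symnmf(S_toy, Omega_toy, Y_toy, observed_toy, r=1)
```

Multiplying H by P simply selects its first q columns, which is why the constraint is written here as a slice assignment.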

5. The multi-label classification method according to claim 4, characterized in that: in step four, C is the $l \times l$ reconstruction coefficient matrix; the result $H_{ij}$ of the i-th sample on the j-th label is reconstructed from that sample's results on the remaining labels, i.e. $H_{ij} \approx \sum_{k \neq j} H_{ik} C_{kj}$, whose matrix form is $H \approx HC$, from which the label values of the missing results are obtained; the approximate representation H of the complete class label matrix is optimized into HC, giving the updated minimization objective, where the matrices H and C are the model parameters to be solved, and $\lambda_1$ is a non-negative weight coefficient with values in $\{10^{0}, 10^{1}, 10^{2}\}$.
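The reconstruction $H \approx HC$ in claim 5 can be illustrated by fitting each column of C separately. This is a sketch under assumptions: H is held fixed, each label is regressed on the remaining labels by ridge-regularized least squares (so the diagonal of C is zero), and the small `ridge` term is a stand-in for numerical stability, not the patent's $\lambda_1$ term.

```python
import numpy as np

def fit_reconstruction(H, ridge=1e-3):
    """Fit C with zero diagonal so that H @ C approximates H column by column."""
    n, l = H.shape
    C = np.zeros((l, l))
    for j in range(l):
        others = [i for i in range(l) if i != j]
        A = H[:, others]
        # Solve (A^T A + ridge * I) c = A^T h_j for the coefficients of label j.
        c = np.linalg.solve(A.T @ A + ridge * np.eye(l - 1), A.T @ H[:, j])
        C[others, j] = c
    return C

# Hypothetical label matrix in which the third label duplicates the first.
H_demo = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0]])
C_demo = fit_reconstruction(H_demo)
H_rec = H_demo @ C_demo
```

In this toy case the third column of `H_rec` is recovered almost exactly from the first, which is the mechanism by which correlated labels can fill in missing entries.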

6. The multi-label classification method according to claim 5, characterized in that: in step five, a multivariate linear regression model is used as the classifier; based on the feature representation X of the training data, a linear classification model $f(X, W)$ that maps to the optimized complete class label matrix HC is learned, and an $L_1$ regularization constraint is imposed on the model parameters W (a $d \times l$ matrix) in order to learn the label-specific features in the feature representation, giving the updated minimization objective subject to

$$H \in [0,1]^{n \times l}, \quad HP = Y \qquad (4)$$

in formula (4), the matrices W, H and C are the model parameters to be solved, and $\lambda_2$ is a non-negative weight coefficient with values in $\{10^{-1}, 10^{0}, 10^{1}\}$.
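One common way to fit an $L_1$-regularized linear model such as the one in claim 6 is proximal gradient descent (ISTA). The sketch below assumes the target matrix T (standing for HC) is fixed and minimizes $\tfrac{1}{2}\|XW - T\|_F^2 + \lambda\|W\|_1$; the patent's full objective also updates H and C, which is not shown, and the solver choice and names are assumptions.

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||A||_1 (element-wise shrinkage)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def fit_sparse_linear(X, T, lam=0.1, steps=500):
    """ISTA for min_W 0.5 * ||X W - T||_F^2 + lam * ||W||_1."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant: squared spectral norm
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(steps):
        grad = X.T @ (X @ W - T)             # gradient of the squared loss
        W = soft_threshold(W - grad / L, lam / L)
    return W

# Hypothetical example with an identity design, where the solution is known in closed form.
X_id = np.eye(3)
T_id = np.array([[1.0], [0.05], [0.0]])
W_id = fit_sparse_linear(X_id, T_id, lam=0.1)
```

With an identity design the minimizer is just the soft-thresholded target, so small coefficients are driven exactly to zero; this sparsity is what lets each label keep only its own relevant features.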

7. The multi-label classification method according to claim 6, characterized in that: in step five, each of the last r columns of the model coefficients W is sorted in descending order of the absolute values of its elements, and the features corresponding to the 10 largest coefficients in each column form a label-specific feature set; let the feature set of the data set be $f = \{f_1, f_2, \ldots, f_d\}$; the last r columns of W, $\{w^{q+1}, w^{q+2}, \ldots, w^{q+r}\}$, are the model coefficients of the newly discovered class labels; for each $|w^{q+i}|$, the entries are sorted in descending order and the features corresponding to the 10 largest coefficients are taken to form a label-specific feature set that describes the semantic concept of the i-th newly discovered unknown class label.
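The feature selection in claim 7 amounts to a per-column sort by absolute value. The following sketch assumes W is a NumPy array of shape d × (q + r); the function name, the `top` parameter (10 in the claim) and the toy numbers are illustrative.

```python
import numpy as np

def label_specific_features(W, q, feature_names, top=10):
    """For each of the last r columns of W, list the features with the largest |coefficient|."""
    result = []
    for j in range(q, W.shape[1]):
        order = np.argsort(-np.abs(W[:, j]))[:top]   # descending by absolute value
        result.append([feature_names[i] for i in order])
    return result

# Hypothetical coefficients: 4 features, q = 2 known labels, r = 1 new label.
W_small = np.array([[0.3, 0.0, 0.1],
                    [0.0, 0.2, -0.9],
                    [0.5, 0.0, 0.0],
                    [0.0, 0.7, 0.5]])
names = ["f1", "f2", "f3", "f4"]
top_features = label_specific_features(W_small, q=2, feature_names=names, top=2)
```

For the single new label in this toy case the two selected features are `f2` and `f4`, since their coefficients have the largest magnitudes in the last column.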

8. The multi-label classification method according to claim 7, characterized in that: in step six, $Z^{T}Z$ is learned directly to approximate the Laplacian matrix of the true label correlation matrix, and manifold regularization is used to constrain the similarity of the model coefficients corresponding to correlated labels, from which the final objective to be solved is obtained, where Z is a matrix with m rows and l columns, l is the total number of class labels, and m is a hyperparameter; $Z^{T}$ is the transpose of Z and $W^{T}$ is the transpose of W; $\mathrm{diag}(Z^{T}Z)$ returns the vector formed by the main-diagonal elements of $Z^{T}Z$, and the constraint $\mathrm{diag}(Z^{T}Z) = 1$ prevents the trivial solution $Z = 0$; the matrices W, H, C and Z are the model parameters to be solved; $\lambda_3$ is a non-negative weight coefficient with values in $\{5^{0}, 5^{1}, 5^{2}, 5^{3}\}$; and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.
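The regularization formula of claim 8 is not reproduced in this text. Assuming it takes the usual manifold-regularization form $\mathrm{tr}(W Z^{T} Z W^{T})$, which matches the symbols the claim defines, the sketch below evaluates that penalty and shows one way to satisfy $\mathrm{diag}(Z^{T}Z) = 1$ by rescaling the columns of Z. Both function names and the random data are hypothetical.

```python
import numpy as np

def normalize_columns(Z):
    """Rescale each column of Z to unit length so that diag(Z^T Z) = 1.

    Assumes no column of Z is entirely zero.
    """
    return Z / np.linalg.norm(Z, axis=0, keepdims=True)

def manifold_penalty(W, Z):
    """tr(W Z^T Z W^T), computed as the squared Frobenius norm of Z W^T."""
    return np.sum((Z @ W.T) ** 2)

# Hypothetical sizes: d = 5 features, l = 3 labels, m = 4 rows in Z.
rng = np.random.default_rng(0)
W_rand = rng.normal(size=(5, 3))
Z_unit = normalize_columns(rng.normal(size=(4, 3)))
penalty = manifold_penalty(W_rand, Z_unit)
```

Computing the penalty as $\|ZW^{T}\|_F^2$ avoids forming the d × d product explicitly, and the unit-length columns rule out the degenerate choice Z = 0 that would make the penalty vanish.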

9. The multi-label classification method with partially missing and unknown class labels according to any one of claims 1 to 8, characterized in that: in step seven, given the feature representation $x_t \in \mathbb{R}^{1 \times d}$ of a test sample t, the predicted value for the test sample t is obtained from the model coefficients W learned in the training stage:

$$f(x_t, W) = x_t W \qquad (6)$$

10. The multi-label classification method according to claim 9, characterized in that: in step seven, from the predicted value of the test sample t and a set classification threshold $\tau$, the final output label vector $y_t \in \{0,1\}^{1 \times l}$ of the test sample over the q known classes and the r unknown classes is computed as

$$y_t(i) = [\, f(x_t, W)_i > \tau \,], \quad 1 \le i \le l \qquad (7)$$

where $[\cdot]$ is the indicator function, which returns 1 when the condition inside it holds and 0 otherwise; when $[\, f(x_t, W)_i > \tau \,]$ equals 1, the test sample t belongs to the i-th class, otherwise it does not; the first q elements of the vector $y_t$ indicate whether the test sample belongs to each of the q known classes, and the last r elements indicate whether the test sample t belongs to each of the r newly discovered unknown class labels.
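Formulas (6) and (7) translate directly into a matrix product followed by a threshold. A minimal sketch, assuming `x_t` is a 1 × d NumPy row vector and W is d × l; the threshold value 0.5 and the toy numbers are illustrative only.

```python
import numpy as np

def predict_labels(x_t, W, tau=0.5):
    """Return (scores, labels): scores = x_t W as in (6), labels = [scores > tau] as in (7)."""
    scores = x_t @ W
    labels = (scores > tau).astype(int)
    return scores, labels

# Hypothetical model with d = 2 features, q = 2 known labels and r = 1 new label.
W_model = np.array([[0.9, 0.2, 0.6],
                    [0.1, 0.8, 0.0]])
x_test = np.array([[1.0, 0.0]])
scores_t, y_t = predict_labels(x_test, W_model, tau=0.5)
q_known = 2
known_part, new_part = y_t[:, :q_known], y_t[:, q_known:]
```

The first q entries of `y_t` answer the known classes and the remaining r entries answer the newly discovered ones, so a single product serves both kinds of label.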