Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams.
Background Art
The goal of a multi-label learning model is to predict all of the labels that correspond to each input instance. Processing multi-label text streams poses several challenges.
One such challenge is high dimensionality. As the number of features in text data grows, so does the dimensionality of the problem, which leads to difficult feature selection, increased computational complexity, and the curse of dimensionality. In high dimensions, feature selection becomes harder because many of the numerous features may be redundant or irrelevant, and choosing an appropriate feature subset is crucial for improving classification performance and reducing computational cost. High-dimensional data also makes computation more expensive: operations such as computing distances, similarities, and probabilities become more time-consuming, lowering the efficiency of the algorithm. The curse of dimensionality refers to the fact that in high-dimensional spaces the sample density becomes very sparse, degrading the performance of classification algorithms.
Another challenge is label scarcity. Without enough instances it is difficult to accurately learn and predict rare labels: lacking labeled instances, the model cannot capture the characteristics and patterns of these labels, and prediction performance suffers.
High dimensionality and label scarcity are therefore key challenges in processing multi-label text streams. Appropriate methods are needed to handle feature selection, computational complexity, and the curse of dimensionality, and to improve the ability to learn and predict rare labels, so as to raise the performance and efficiency of the algorithm.
Summary of the Invention
In view of the deficiencies of the prior art, the present invention provides an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams. It addresses the problems that arise in high dimensions, where the dimensionality of the problem grows with the number of features in the text data, leading to difficult feature selection, increased computational complexity, and the curse of dimensionality.
The above technical objectives of the present invention are achieved through the following technical solutions:
An online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams, comprising model initialization, a classification stage, and model maintenance.
Take $D_{init}$ labeled documents and create $Z_{min}$ micro-clusters per label, the micro-clusters containing those documents. The initial model makes a prediction for every arriving document in the stream: for each arriving document, the model computes cluster-document probabilities and, based on the probability scores, selects the $k$ nearest micro-clusters $Z_d$. The number of labels $Y$ to predict equals the number $l_{count}$ of micro-clusters whose score lies above the mean of the $Z_d$ distribution. When only one cluster lies above the mean, the label with the highest cluster count among the nearest clusters is predicted; otherwise, the sums of cluster probabilities per label are compared, and the label co-occurrence scores of the nearest labels are used in the prediction. After prediction, if the arriving document carries no labels, it is added to the nearest micro-cluster of each predicted label; otherwise, the arriving document is added to the nearest micro-cluster of each ground-truth label, and the shared-term score $V_{d \cap z}$ between the document and each mispredicted label's cluster is lowered.
The present invention is further configured as follows: a micro-cluster is defined as an 8-tuple $z = \{m_z, n_z^w, n_z, L_z, r_z, u_z, ta_z, cw_z\}$, where $m_z$ is the number of documents, $n_z^w$ is the frequency of word $w$ in the cluster, $n_z = \sum_w n_z^w$ is the sum of the frequency counts of all words in the cluster, $L_z$ stores the labels assigned to the micro-cluster, $r_z$ holds the decay weight, $u_z$ is the timestamp of the last update, $ta_z$ records the arrival timestamps of words, and $cw_z$ is the word-word co-occurrence score matrix. Each entry $cw_z^{w_i, w_j}$ ($i \neq j$) accumulates the co-occurrence of the word pair $(w_i, w_j)$ over the documents in the cluster, where $n_{d'}^w$ is the frequency count of word $w$ in document $d'$ and the within-document frequency ratio of $w_i$ to $w_j$ must satisfy a fixed threshold condition.
The present invention is further configured as follows: during model initialization, $D_{init}$ labeled instances are used. For each label $l \in L$, a group of equal-sized instance sets $S_i$ is selected from the given $D_{init}$, formally defined as $S_{init} = \{S_1, \ldots, S_{|L|}\}$. LDA is used to model the latent subspace of the text data, and the first-order label co-occurrence weight matrix $LCM$ is computed, in which each entry estimates, by a heuristic probability, the weight between labels $l_i$ and $l_j$, defined as the conditional probability of observing $l_j$ given $l_i$ in the initial data.
The present invention is further configured as follows: after initializing the clusters and the label co-occurrence score matrix, a classification process is carried out for each arriving document, comprising two steps: computing the similarity score between each arriving instance in the stream and all active clusters in the model, and predicting labels by observing the $k$ nearest clusters.
The present invention is further configured as follows: during similarity computation, a probability score is defined that combines the popularity of a cluster with the smoothed frequencies of the document's terms within it, where $D$ is the total number of active documents in model $M$ and $ICF_w$ is the inverse cluster frequency, used to compute the importance of a term's weight from how few of the active clusters contain the term $w$.
The present invention is further configured as follows: model maintenance comprises concept evolution, deletion of outdated label-related terms, and merging of micro-clusters.
The present invention has the following beneficial effects:
1. Efficient processing of multi-label text streams: the online semi-supervised classification algorithm proposed by the present invention can process multi-label text stream data in real time. It adapts dynamically to concept drift and label sparsity in the stream while keeping computational and memory requirements low, so the algorithm runs efficiently on large-scale data streams and in real-time applications.
2. Model adaptivity: the algorithm of the present invention has strong adaptive ability and can handle concept drift and changes in label cardinality. It adjusts the model promptly, captures new concepts and label associations in the stream, and makes accurate predictions on new instances, maintaining accuracy and performance in dynamic environments.
3. The algorithm of the present invention takes label correlations into account when predicting labels. By embedding label co-occurrence probabilities and cluster similarities, it captures the relationships between labels and improves prediction performance.
Brief Description of the Drawings
Figure 1 is the overall flow chart of the online semi-supervised classification algorithm proposed in this application.
Detailed Description of Embodiments
The technical solutions of the present invention are further described below with reference to the accompanying drawings and embodiments.
This application proposes an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams. Using only a small number of labeled instances, the algorithm dynamically maintains a term subspace for each label together with a set of evolving micro-clusters. For multi-label classification, it uses a non-parametric Dirichlet model to predict from the $k$ nearest micro-clusters. To handle gradual concept drift in the term space, a triangular-number time function is used to measure the difference between term arrival times and cluster lifetime. For abrupt concept drift, the algorithm takes two steps: (a) deleting outdated micro-clusters with an exponential decay function, and (b) creating new micro-clusters via the Chinese restaurant process under a Dirichlet process.
The online semi-supervised classification model comprises three stages: model initialization, the classification stage, and model maintenance. Labels are assumed to be available for only a small fraction of the document stream, while the majority of documents arrive unlabeled. Figure 1 shows the overall flow chart of the algorithm model of this application. First, a small number of labeled documents ($D_{init}$) are taken, and a small number of micro-clusters ($Z_{min}$) containing those documents are created for each label. For each arriving document, the model computes cluster-document probabilities and then, based on the probability scores, selects the $k$ nearest micro-clusters ($Z_d$). At this stage, the number of labels to predict ($Y$) equals the number of micro-clusters scoring above the mean of the $Z_d$ distribution ($l_{count}$). If only one cluster lies above the mean, the label with the highest cluster count among the nearest clusters is predicted. Otherwise, the sums of cluster probabilities per label are compared, and the label co-occurrence scores of the nearest labels are taken into account in the prediction. After prediction, if the arriving document has no labels, it is added to the nearest micro-cluster of each predicted label. Otherwise, the arriving document is added to the nearest micro-cluster of each ground-truth label, and the shared-term score ($V_{d \cap z}$) between the document and each mispredicted label's cluster is slightly reduced. A minimal sketch of this stream loop is given below.
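As an illustration only, the following Python sketch renders the stream loop just described. The model's internal steps are stubbed out; the names `predict_labels`, `add_to_nearest_cluster`, `penalize_shared_terms`, and `maintain` are hypothetical placeholders, not the patent's exact procedures, and their internals are sketched in the later embodiments.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    terms: dict          # term -> frequency count
    labels: set = None   # None for the unlabeled majority of the stream

class StubModel:
    """Placeholder: the real steps are sketched in the embodiments below."""
    def predict_labels(self, doc): return {"sports"}
    def add_to_nearest_cluster(self, doc, label): print("add to", label)
    def penalize_shared_terms(self, doc, label): print("penalize", label)
    def maintain(self): pass   # decay, deletion, merging

def process_stream(model, stream):
    for doc in stream:
        predicted = model.predict_labels(doc)
        if doc.labels is None:
            # Unlabeled document: trust the predicted labels.
            for label in predicted:
                model.add_to_nearest_cluster(doc, label)
        else:
            # Labeled document: attach to ground-truth clusters and
            # slightly lower V(d ∩ z) for each mispredicted label.
            for label in doc.labels:
                model.add_to_nearest_cluster(doc, label)
            for label in predicted - doc.labels:
                model.penalize_shared_terms(doc, label)
        model.maintain()

process_stream(StubModel(), [Doc({"goal": 2}), Doc({"vote": 1}, {"politics"})])
```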
Micro-clusters:
A micro-cluster (in the stream) is represented by a cluster feature set containing various statistics about the instances within it. The feature set of micro-cluster $z$ is defined as the 8-tuple $z = \{m_z, n_z^w, n_z, L_z, r_z, u_z, ta_z, cw_z\}$. Here, $m_z$ is the number of documents, $n_z^w$ is the frequency of word $w$, $n_z = \sum_w n_z^w$ is the sum of the frequency counts of all words in the cluster, $L_z$ stores the labels assigned to the micro-cluster, $r_z$ holds the decay weight, $u_z$ is the timestamp of the last update, and $ta_z$ records word arrival timestamps. $cw_z$ is the word-word co-occurrence score matrix, each entry $cw_z^{w_i, w_j}$ ($i \neq j$) accumulating the co-occurrence of the pair $(w_i, w_j)$ over the documents in the cluster, where $n_{d'}^w$ is the frequency count of word $w$ in document $d'$ and the within-document frequency ratio of $w_i$ to $w_j$ must satisfy a fixed threshold condition.
A document $d$ is added to cluster $z$ through the additive property of the micro-cluster:
$m_z = m_z + 1$
$n_z^w = n_z^w + n_d^w$ for each word $w$ in $d$
$cw_z = cw_z \cup cw_d$
$n_z = n_z + N_d$
$ta_z = ta_z \cup ta_d$
Here, $N_d$ is the total number of words in (i.e., the length of) document $d$, $n_d^w$ is the frequency count of each word in the document, $cw_d$ denotes the document's co-occurrence matrix, and $ta_d$ merges the arrival times of the terms associated with the document. The update timestamp $u_z$ of the micro-cluster is set to the model's current timestamp. The complexity of adding a document to and updating a cluster is $O(\bar{N}_d)$, where $\bar{N}_d$ is the average length of a document. A sketch of this additive update follows.
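For illustration, a minimal Python sketch of this additive update follows, mirroring the 8-tuple above; the co-occurrence update is simplified to plain pair counts and omits the patent's ratio condition on word pairs.

```python
from collections import defaultdict
from itertools import combinations

class MicroCluster:
    def __init__(self, label, now=0):
        self.m = 0                       # m_z: number of documents
        self.tf = defaultdict(int)       # n_z^w: per-word frequency
        self.n = 0                       # n_z: total word count
        self.labels = {label}            # L_z: assigned labels
        self.r = 1.0                     # r_z: decay weight
        self.u = now                     # u_z: last-update timestamp
        self.ta = defaultdict(list)      # ta_z: word arrival timestamps
        self.cw = defaultdict(int)       # cw_z: word-pair co-occurrence scores

    def add(self, doc_tf, now):
        """Additive update for a document with term frequencies doc_tf."""
        self.m += 1                                  # m_z = m_z + 1
        for w, f in doc_tf.items():                  # linear in doc length
            self.tf[w] += f                          # n_z^w += n_d^w
            self.ta[w].append(now)                   # ta_z merges arrivals
        self.n += sum(doc_tf.values())               # n_z += N_d
        for wi, wj in combinations(sorted(doc_tf), 2):
            self.cw[(wi, wj)] += 1                   # cw_z ∪ cw_d (simplified)
        self.u, self.r = now, 1.0                    # refresh timestamp, weight

z = MicroCluster("sports")
z.add({"match": 2, "goal": 1}, now=1)
print(z.m, z.n, dict(z.tf))   # 1 3 {'match': 2, 'goal': 1}
```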
Initialization begins with the learning model, which takes $D_{init}$ labeled instances. First, for each label $l \in L$, a set of equal-sized instance groups $S_i$ is selected from the given $D_{init}$; formally, $S_{init} = \{S_1, \ldots, S_{|L|}\}$, where, because the data is multi-label, the $S_i$ may or may not overlap. The instances of each label group $S_i$ are then partitioned into $Z_{min}$ clusters, using LDA to efficiently model the latent subspace of the text data. The first-order label co-occurrence weight matrix $LCM$ is computed, in which each entry estimates, by a heuristic probability, the weight between labels $l_i$ and $l_j$, defined as the conditional probability of observing $l_j$ given $l_i$ in the initial data (Equation 2). One way to compute such a matrix is sketched below.
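As one possible realization (the patent's exact heuristic formula is described only in words above, so this estimator is an assumption), the matrix can be built from label co-occurrence counts in the initial labeled set:

```python
from collections import Counter
from itertools import permutations

def label_cooccurrence_matrix(label_sets):
    """label_sets: one set of labels per document in D_init.
    Returns LCM[(l_i, l_j)] ~ P(l_j | l_i), a heuristic conditional
    probability (illustrative estimator, not the patent's verbatim formula)."""
    singles, pairs = Counter(), Counter()
    for labels in label_sets:
        singles.update(labels)                 # per-label document counts
        pairs.update(permutations(labels, 2))  # ordered label pair counts
    return {(li, lj): c / singles[li] for (li, lj), c in pairs.items()}

lcm = label_cooccurrence_matrix([{"sports", "politics"}, {"sports"}, {"politics"}])
print(lcm)   # {('sports', 'politics'): 0.5, ('politics', 'sports'): 0.5}
```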
After initializing the clusters and the label co-occurrence score matrix, each arriving document is classified. The classification stage consists of two steps: (1) computing the similarity score between each arriving instance in the stream and all active clusters in the model, and (2) predicting labels by observing the $k$ nearest clusters.
Similarity computation: over the past decade, a range of algorithms have predicted labels using distance-based or probability-based similarity. The former type of measure suffers from the curse of dimensionality; for the latter, designing a probability-based similarity remains a challenging task. Most probability-based methods handle either short texts (instances with only low-dimensional representations) or long texts (instances with only high-dimensional representations). To handle multi-label documents containing both short and long texts, a novel probability score is therefore defined (Equation 3), which combines the popularity of a cluster with the smoothed in-cluster frequencies of the document's terms.
$D$ is the total number of active documents in model $M$. $ICF_w$ is the inverse cluster frequency, used to compute the importance of a term's weight: the fewer active clusters contain the term $w$, the higher its weight. A hypothetical rendering of this score is sketched below.
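Since the score's formula does not survive in this text, the sketch below is only a hypothetical rendering consistent with the description: a smoothed multinomial log-likelihood whose term contributions are weighted by an assumed IDF-style inverse cluster frequency, plus a cluster-popularity term. It reuses the `MicroCluster` sketch above; `beta` and `vocab` are illustrative smoothing parameters.

```python
import math

def icf(word, clusters):
    """Inverse cluster frequency: terms found in fewer clusters weigh more
    (an assumed IDF-style form, for illustration)."""
    containing = sum(1 for z in clusters if word in z.tf)
    return math.log((len(clusters) + 1) / (containing + 1)) + 1.0

def cluster_doc_score(doc_tf, z, clusters, beta=0.01, vocab=10_000):
    """Log probability score of cluster z for a document (Equation 3 stand-in)."""
    score = math.log(z.m + 1)                                # cluster popularity
    for w, f in doc_tf.items():
        p = (z.tf.get(w, 0) + beta) / (z.n + beta * vocab)   # smoothed frequency
        score += icf(w, clusters) * f * math.log(p)          # ICF-weighted term
    return score
```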
Label prediction: after computing the similarity between the arriving document and all active clusters of the model using Equation 3, the $k$ nearest clusters with the highest probability scores are selected for label prediction. The number of labels predicted is the number of clusters above the mean of the normalized probability distribution over the $k$ nearest clusters. The label co-occurrence probabilities, i.e., the conditional probabilities of Equation 2, were already computed during initialization; labels are predicted by combining frequently occurring nearest label pairs with their cluster counts. After prediction, the document is added to the selected micro-clusters in the course of cluster selection (Equation 2) and new-cluster creation (Equation 4). A sketch of this prediction step follows.
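The following sketch operates on precomputed positive cluster scores; the tie-breaking by label co-occurrence is reduced here to summing normalized cluster probabilities per label, so the names and the simplification are illustrative rather than the patent's verbatim rule.

```python
from collections import Counter

def predict(doc_scores, cluster_label, k=5):
    """doc_scores: {cluster_id: probability score}; cluster_label: {id: label}."""
    zd = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
    total = sum(doc_scores[z] for z in zd)
    norm = {z: doc_scores[z] / total for z in zd}      # normalized over Z_d
    mean = 1.0 / len(zd)
    l_count = sum(1 for z in zd if norm[z] > mean)     # number of labels to predict
    if l_count == 1:
        # One dominant cluster: take the most frequent label among Z_d.
        return [Counter(cluster_label[z] for z in zd).most_common(1)[0][0]]
    per_label = Counter()
    for z in zd:
        per_label[cluster_label[z]] += norm[z]         # sum probabilities per label
    return [l for l, _ in per_label.most_common(l_count)]

scores = {0: 0.40, 1: 0.25, 2: 0.15, 3: 0.12, 4: 0.08}
labels = {0: "sports", 1: "sports", 2: "politics", 3: "politics", 4: "tech"}
print(predict(scores, labels))   # ['sports', 'politics']
```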
Model maintenance comprises concept evolution of the non-parametric model, deletion of outdated label-related terms, and merging of micro-clusters.
Concept evolution of the non-parametric model: each active micro-cluster covers a different term space with a different distribution. The basic idea is therefore to check whether a new document belongs to the nearest term space of an existing active micro-cluster (the $k$ nearest micro-clusters) or to a new term space. Using Equation 3, the probability with respect to existing micro-clusters is computed; the probability of creating a new micro-cluster is defined by Equation 4.
This probability reflects cluster popularity, where $\alpha$ is a small pseudo-count and $\beta$ is a small pseudo-occurrence count for the terms of the new micro-cluster. A micro-cluster is therefore selected for the arriving document only when its probability under Equation 3 exceeds the new-cluster probability under Equation 4; otherwise a new micro-cluster is created. A sketch of this decision follows.
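A sketch of this decision under an assumed standard Dirichlet-process form (the exact Equation 4 survives only in words above, so the concrete expression here is an assumption using the stated pseudo-counts $\alpha$ and $\beta$):

```python
import math

def new_cluster_score(doc_tf, D, alpha=0.05, beta=0.01, vocab=10_000):
    """Log probability of opening a new micro-cluster for the document
    (Chinese restaurant process; alpha and beta as defined above)."""
    score = math.log(alpha * D / (D - 1 + alpha * D))    # 'new table' mass
    for w, f in doc_tf.items():
        score += f * math.log(beta / (beta * vocab))     # pseudo-occurrences only
    return score

def select_or_create(doc_tf, clusters, D, score_fn):
    """Join the best existing micro-cluster, or return None to create one."""
    best = max(clusters, key=lambda z: score_fn(doc_tf, z, clusters))
    if score_fn(doc_tf, best, clusters) >= new_cluster_score(doc_tf, D):
        return best
    return None   # concept evolution: a new micro-cluster is created
```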
Deleting outdated label-related terms: besides creating micro-clusters for current concepts, the model must also remove outdated ones. A forgetting mechanism is used to delete outdated micro-clusters: under the decay mechanism, if a micro-cluster is not updated, its importance weight ($r_z$) keeps decreasing over time, whereas a micro-cluster that keeps being updated captures a current concept. For simplicity, $r_z$ decays exponentially in the time elapsed since the cluster's last update, $t_M - u_z$, where $t_M$ is the model's current time. Whenever a document is added to micro-cluster $z$, $r_z$ is reset to 1. If a micro-cluster receives no documents for some time, its score approaches zero, leading to the deletion of the micro-cluster. A sketch under an assumed decay form follows.
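The decay formula itself is not reproduced in this text, so the sketch below assumes the common exponential form $r_z = 2^{-\lambda (t_M - u_z)}$, with $\lambda$ and the deletion threshold as illustrative parameters:

```python
def decay_and_prune(clusters, t_m, lam=0.1, eps=1e-3):
    """Fade unupdated micro-clusters and delete those with near-zero weight."""
    survivors = []
    for z in clusters:
        z.r = 2.0 ** (-lam * (t_m - z.u))   # decays with time since last update
        if z.r > eps:
            survivors.append(z)             # still represents a live concept
    return survivors                        # the rest are deleted
```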
The label-related core terms of the different term spaces may or may not be mutually exclusive; in either case, the importance of these label-related terms also changes over time in the text stream. To address this, a cluster feature $ta_z$ is maintained that stores the arrival time series of the terms in each micro-cluster. Based on triangular-number time, the age of the cluster is computed, and the recency score of each term is computed accordingly. If a term's recency score falls below a defined threshold $\Gamma$, the term is no longer considered representative of the cluster's active concept. A triangular time decay function is used to fade outdated terms in the model, defined as follows:
$\Delta f(T) = (T^2 + T)/2$
Here, $T$ is the timestamp number. The recency score of a term ($recency_w$) is measured as the ratio of the sum of its triangular arrival times to the age of the cluster. For example, if $m_z = 9$, the age of cluster $z$ is:
$Age_z = \Delta f(9) - \Delta f(1) = ((9^2 + 9)/2) - ((1^2 + 1)/2) = 45 - 1 = 44$
Suppose the word $w_i$ appears in three documents of the cluster, D1, D2, and D8, with arrival times 1, 2, and 8 respectively. The arrival times of $w_i$ in the cluster are stored as $w_i = \{timestamp: 1, timestamp: 2, timestamp: 8\}$.
The recency score of $w_i$ is then computed as $recency_{w_i} = \frac{\Delta f(1) + \Delta f(2) + \Delta f(8)}{Age_z} = \frac{1 + 3 + 36}{44} \approx 0.91$.
It can be seen that, as new documents arrive in a micro-cluster, more and more terms become unimportant; in this way, terms are gradually filtered out automatically over time. A sketch of this computation follows.
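For illustration, the triangular-number bookkeeping above can be coded as follows; note that the recency formula is reconstructed from the worked example and is an assumption:

```python
def tri(t):
    """Triangular-number time: Δf(T) = (T² + T) / 2."""
    return (t * t + t) / 2

def cluster_age(m_z):
    return tri(m_z) - tri(1)          # e.g. m_z = 9 gives 45 - 1 = 44

def recency(arrivals, age):
    """Assumed form: sum of triangular arrival times over cluster age."""
    return sum(tri(t) for t in arrivals) / age

age = cluster_age(9)
print(recency([1, 2, 8], age))        # (1 + 3 + 36) / 44 ≈ 0.909
# Terms whose recency falls below the threshold Γ are dropped from the cluster.
```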
Merging micro-clusters: the multinomial distribution parameter $\beta$ governs the computed homogeneity between documents and micro-clusters. However, as the number of noise terms in a cluster grows over time, the probability mass of the noise terms can come to dominate and overwhelm the core terms. After concept drift in the label-related terms and the gradual removal of noisy and outdated terms, two micro-clusters may share a highly overlapping term space, affecting cluster granularity. In this application, the cluster merging process is incorporated into computing the probability between two clusters using Equation 3. Over time, if a cluster is not updated, the value of $r_z$ approaches zero, indicating that the cluster is outdated. Therefore, before deleting such a cluster, its probability with respect to every active cluster carrying the same label is computed, together with the probability that the cluster would instead form a new cluster (Equation 4). If the new-cluster probability exceeds the probabilities of the active clusters, the micro-cluster is deleted; otherwise, it is merged with the nearest micro-cluster carrying the same label. A sketch of this decision follows.
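A sketch of the merge-or-delete step, reusing the scoring functions sketched above and treating the stale cluster's term counts as a pseudo-document; this is an illustrative reading of the step, not the patent's verbatim procedure:

```python
def merge_or_delete(stale, clusters, D, score_fn, new_score_fn):
    """Return the cluster to merge `stale` into, or None to delete it."""
    same_label = [z for z in clusters
                  if z is not stale and z.labels & stale.labels]
    if not same_label:
        return None                        # nothing shares a label: delete
    pseudo_doc = dict(stale.tf)            # stale cluster viewed as a document
    best = max(same_label, key=lambda z: score_fn(pseudo_doc, z, clusters))
    if new_score_fn(pseudo_doc, D) > score_fn(pseudo_doc, best, clusters):
        return None                        # 'new cluster' wins: delete stale
    return best                            # merge into nearest same-label cluster
```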
Compared with the prior art, the present invention offers a number of advantages and positive effects: it can effectively process multi-label text streams, adapt to dynamic changes in the environment, and improve prediction performance and efficiency.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.