Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams.
Background Art
The goal of a multi-label learning model is to predict all of the labels that correspond to each input instance. Processing multi-label text streams poses several challenges.
One such challenge is high dimensionality. As the number of features in text data grows, so does the dimensionality of the problem, which leads to difficult feature selection, increased computational complexity, and the curse of dimensionality. In high dimensions, feature selection becomes harder because many of the numerous features may be redundant or irrelevant, and choosing an appropriate feature subset is crucial for improving classification performance and reducing computational cost. High-dimensional data also makes computation more expensive: operations such as computing distances, similarities, and probabilities become more time-consuming, lowering the efficiency of the algorithm. The curse of dimensionality refers to the fact that in high-dimensional spaces the sample density becomes very sparse, degrading the performance of classification algorithms.
Another challenge is label scarcity. Without enough instances it is difficult to accurately learn and predict rare labels: lacking labeled instances, the model cannot capture the characteristics and patterns of these labels, and prediction performance suffers.
High dimensionality and label scarcity are therefore key challenges in processing multi-label text streams. Appropriate methods are needed to handle feature selection, computational complexity, and the curse of dimensionality, and to improve the ability to learn and predict rare labels, so as to raise the performance and efficiency of the algorithm.
Summary of the Invention
In view of the deficiencies of the prior art, the present invention provides an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams. It addresses the problems that arise in high dimensions, where the dimensionality of the problem grows with the number of features in the text data, leading to difficult feature selection, increased computational complexity, and the curse of dimensionality.
The above technical objectives of the present invention are achieved through the following technical solutions:
An online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams, comprising model initialization, a classification stage, and model maintenance.
Take $D_{init}$ labeled documents and create $Z_{min}$ micro-clusters per label, the micro-clusters containing those documents. The initial model makes a prediction for every arriving document in the stream: for each arriving document, the model computes cluster-document probabilities and, based on the probability scores, selects the $k$ nearest micro-clusters $Z_d$. The number of labels $Y$ to predict equals the number $l_{count}$ of micro-clusters whose score lies above the mean of the $Z_d$ distribution. When only one cluster lies above the mean, the label with the highest cluster count among the nearest clusters is predicted; otherwise, the sums of cluster probabilities per label are compared, and the label co-occurrence scores of the nearest labels are used in the prediction. After prediction, if the arriving document carries no labels, it is added to the nearest micro-cluster of each predicted label; otherwise, the arriving document is added to the nearest micro-cluster of each ground-truth label, and the shared-term score $V_{d \cap z}$ between the document and each mispredicted label's cluster is lowered.
The present invention is further configured as follows: a micro-cluster is defined as an 8-tuple $z = \{m_z, n_z^w, n_z, L_z, r_z, u_z, ta_z, cw_z\}$, where $m_z$ is the number of documents, $n_z^w$ is the frequency of word $w$ in the cluster, $n_z = \sum_w n_z^w$ is the sum of the frequency counts of all words in the cluster, $L_z$ stores the labels assigned to the micro-cluster, $r_z$ holds the decay weight, $u_z$ is the timestamp of the last update, $ta_z$ records the arrival timestamps of words, and $cw_z$ is the word-word co-occurrence score matrix. Each entry $cw_z^{w_i, w_j}$ ($i \neq j$) accumulates the co-occurrence of the word pair $(w_i, w_j)$ over the documents in the cluster, where $n_{d'}^w$ is the frequency count of word $w$ in document $d'$ and the within-document frequency ratio of $w_i$ to $w_j$ must satisfy a fixed threshold condition.
The present invention is further configured as follows: during model initialization, $D_{init}$ labeled instances are used. For each label $l \in L$, a group of equal-sized instance sets $S_i$ is selected from the given $D_{init}$, formally defined as $S_{init} = \{S_1, \ldots, S_{|L|}\}$. LDA is used to model the latent subspace of the text data, and the first-order label co-occurrence weight matrix $LCM$ is computed, in which each entry estimates, by a heuristic probability, the weight between labels $l_i$ and $l_j$, defined as the conditional probability of observing $l_j$ given $l_i$ in the initial data.
The present invention is further configured as follows: after initializing the clusters and the label co-occurrence score matrix, a classification process is carried out for each arriving document, comprising two steps: computing the similarity score between each arriving instance in the stream and all active clusters in the model, and predicting labels by observing the $k$ nearest clusters.
The present invention is further configured as follows: during similarity computation, a probability score is defined that combines the popularity of a cluster with the smoothed frequencies of the document's terms within it, where $D$ is the total number of active documents in model $M$ and $ICF_w$ is the inverse cluster frequency, used to compute the importance of a term's weight from how few of the active clusters contain the term $w$.
The present invention is further configured as follows: model maintenance comprises concept evolution, deletion of outdated label-related terms, and merging of micro-clusters.
The present invention has the following beneficial effects:
1. Efficient processing of multi-label text streams: the online semi-supervised classification algorithm proposed by the present invention can process multi-label text stream data in real time. It adapts dynamically to concept drift and label sparsity in the stream while keeping computational and memory requirements low, so the algorithm runs efficiently on large-scale data streams and in real-time applications.
2. Model adaptivity: the algorithm of the present invention has strong adaptive ability and can handle concept drift and changes in label cardinality. It adjusts the model promptly, captures new concepts and label associations in the stream, and makes accurate predictions on new instances, maintaining accuracy and performance in dynamic environments.
3. The algorithm of the present invention takes label correlations into account when predicting labels. By embedding label co-occurrence probabilities and cluster similarities, it captures the relationships between labels and improves prediction performance.
Brief Description of the Drawings
Figure 1 is the overall flow chart of the online semi-supervised classification algorithm proposed in this application.
Detailed Description of Embodiments
The technical solutions of the present invention are further described below with reference to the accompanying drawings and embodiments.
This application proposes an online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams. Using only a small number of labeled instances, the algorithm dynamically maintains a term subspace for each label together with a set of evolving micro-clusters. For multi-label classification, it uses a non-parametric Dirichlet model to predict from the $k$ nearest micro-clusters. To handle gradual concept drift in the term space, a triangular-number time function is used to measure the difference between term arrival times and cluster lifetime. For abrupt concept drift, the algorithm takes two steps: (a) deleting outdated micro-clusters with an exponential decay function, and (b) creating new micro-clusters via the Chinese restaurant process under a Dirichlet process.
The online semi-supervised classification model comprises three stages: model initialization, the classification stage, and model maintenance. Labels are assumed to be available for only a small fraction of the document stream, while the majority of documents arrive unlabeled. Figure 1 shows the overall flow chart of the algorithm model of this application. First, a small number of labeled documents ($D_{init}$) are taken, and a small number of micro-clusters ($Z_{min}$) containing those documents are created for each label. For each arriving document, the model computes cluster-document probabilities and then, based on the probability scores, selects the $k$ nearest micro-clusters ($Z_d$). At this stage, the number of labels to predict ($Y$) equals the number of micro-clusters scoring above the mean of the $Z_d$ distribution ($l_{count}$). If only one cluster lies above the mean, the label with the highest cluster count among the nearest clusters is predicted. Otherwise, the sums of cluster probabilities per label are compared, and the label co-occurrence scores of the nearest labels are taken into account in the prediction. After prediction, if the arriving document has no labels, it is added to the nearest micro-cluster of each predicted label. Otherwise, the arriving document is added to the nearest micro-cluster of each ground-truth label, and the shared-term score ($V_{d \cap z}$) between the document and each mispredicted label's cluster is slightly reduced. A minimal sketch of this stream loop is given below.
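As an illustration only, the following Python sketch renders the stream loop just described. The model's internal steps are stubbed out; the names `predict_labels`, `add_to_nearest_cluster`, `penalize_shared_terms`, and `maintain` are hypothetical placeholders, not the patent's exact procedures, and their internals are sketched in the later embodiments.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    terms: dict          # term -> frequency count
    labels: set = None   # None for the unlabeled majority of the stream

class StubModel:
    """Placeholder: the real steps are sketched in the embodiments below."""
    def predict_labels(self, doc): return {"sports"}
    def add_to_nearest_cluster(self, doc, label): print("add to", label)
    def penalize_shared_terms(self, doc, label): print("penalize", label)
    def maintain(self): pass   # decay, deletion, merging

def process_stream(model, stream):
    for doc in stream:
        predicted = model.predict_labels(doc)
        if doc.labels is None:
            # Unlabeled document: trust the predicted labels.
            for label in predicted:
                model.add_to_nearest_cluster(doc, label)
        else:
            # Labeled document: attach to ground-truth clusters and
            # slightly lower V(d ∩ z) for each mispredicted label.
            for label in doc.labels:
                model.add_to_nearest_cluster(doc, label)
            for label in predicted - doc.labels:
                model.penalize_shared_terms(doc, label)
        model.maintain()

process_stream(StubModel(), [Doc({"goal": 2}), Doc({"vote": 1}, {"politics"})])
```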
Micro-clusters:
A micro-cluster (in the stream) is represented by a cluster feature set containing various statistics about the instances within it. The feature set of micro-cluster $z$ is defined as the 8-tuple $z = \{m_z, n_z^w, n_z, L_z, r_z, u_z, ta_z, cw_z\}$. Here, $m_z$ is the number of documents, $n_z^w$ is the frequency of word $w$, $n_z = \sum_w n_z^w$ is the sum of the frequency counts of all words in the cluster, $L_z$ stores the labels assigned to the micro-cluster, $r_z$ holds the decay weight, $u_z$ is the timestamp of the last update, and $ta_z$ records word arrival timestamps. $cw_z$ is the word-word co-occurrence score matrix, each entry $cw_z^{w_i, w_j}$ ($i \neq j$) accumulating the co-occurrence of the pair $(w_i, w_j)$ over the documents in the cluster, where $n_{d'}^w$ is the frequency count of word $w$ in document $d'$ and the within-document frequency ratio of $w_i$ to $w_j$ must satisfy a fixed threshold condition.
A document $d$ is added to cluster $z$ through the additive property of the micro-cluster:
$m_z = m_z + 1$
$n_z^w = n_z^w + n_d^w$ for each word $w$ in $d$
$cw_z = cw_z \cup cw_d$
$n_z = n_z + N_d$
$ta_z = ta_z \cup ta_d$
Here, $N_d$ is the total number of words in (i.e., the length of) document $d$, $n_d^w$ is the frequency count of each word in the document, $cw_d$ denotes the document's co-occurrence matrix, and $ta_d$ merges the arrival times of the terms associated with the document. The update timestamp $u_z$ of the micro-cluster is set to the model's current timestamp. The complexity of adding a document to and updating a cluster is $O(\bar{N}_d)$, where $\bar{N}_d$ is the average length of a document. A sketch of this additive update follows.
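For illustration, a minimal Python sketch of this additive update follows, mirroring the 8-tuple above; the co-occurrence update is simplified to plain pair counts and omits the patent's ratio condition on word pairs.

```python
from collections import defaultdict
from itertools import combinations

class MicroCluster:
    def __init__(self, label, now=0):
        self.m = 0                       # m_z: number of documents
        self.tf = defaultdict(int)       # n_z^w: per-word frequency
        self.n = 0                       # n_z: total word count
        self.labels = {label}            # L_z: assigned labels
        self.r = 1.0                     # r_z: decay weight
        self.u = now                     # u_z: last-update timestamp
        self.ta = defaultdict(list)      # ta_z: word arrival timestamps
        self.cw = defaultdict(int)       # cw_z: word-pair co-occurrence scores

    def add(self, doc_tf, now):
        """Additive update for a document with term frequencies doc_tf."""
        self.m += 1                                  # m_z = m_z + 1
        for w, f in doc_tf.items():                  # linear in doc length
            self.tf[w] += f                          # n_z^w += n_d^w
            self.ta[w].append(now)                   # ta_z merges arrivals
        self.n += sum(doc_tf.values())               # n_z += N_d
        for wi, wj in combinations(sorted(doc_tf), 2):
            self.cw[(wi, wj)] += 1                   # cw_z ∪ cw_d (simplified)
        self.u, self.r = now, 1.0                    # refresh timestamp, weight

z = MicroCluster("sports")
z.add({"match": 2, "goal": 1}, now=1)
print(z.m, z.n, dict(z.tf))   # 1 3 {'match': 2, 'goal': 1}
```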
Initialization begins with the learning model, which takes $D_{init}$ labeled instances. First, for each label $l \in L$, a set of equal-sized instance groups $S_i$ is selected from the given $D_{init}$; formally, $S_{init} = \{S_1, \ldots, S_{|L|}\}$, where, because the data is multi-label, the $S_i$ may or may not overlap. The instances of each label group $S_i$ are then partitioned into $Z_{min}$ clusters, using LDA to efficiently model the latent subspace of the text data. The first-order label co-occurrence weight matrix $LCM$ is computed, in which each entry estimates, by a heuristic probability, the weight between labels $l_i$ and $l_j$, defined as the conditional probability of observing $l_j$ given $l_i$ in the initial data (Equation 2). One way to compute such a matrix is sketched below.
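As one possible realization (the patent's exact heuristic formula is described only in words above, so this estimator is an assumption), the matrix can be built from label co-occurrence counts in the initial labeled set:

```python
from collections import Counter
from itertools import permutations

def label_cooccurrence_matrix(label_sets):
    """label_sets: one set of labels per document in D_init.
    Returns LCM[(l_i, l_j)] ~ P(l_j | l_i), a heuristic conditional
    probability (illustrative estimator, not the patent's verbatim formula)."""
    singles, pairs = Counter(), Counter()
    for labels in label_sets:
        singles.update(labels)                 # per-label document counts
        pairs.update(permutations(labels, 2))  # ordered label pair counts
    return {(li, lj): c / singles[li] for (li, lj), c in pairs.items()}

lcm = label_cooccurrence_matrix([{"sports", "politics"}, {"sports"}, {"politics"}])
print(lcm)   # {('sports', 'politics'): 0.5, ('politics', 'sports'): 0.5}
```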
After initializing the clusters and the label co-occurrence score matrix, each arriving document is classified. The classification stage consists of two steps: (1) computing the similarity score between each arriving instance in the stream and all active clusters in the model, and (2) predicting labels by observing the $k$ nearest clusters.
Similarity computation: over the past decade, a range of algorithms have predicted labels using distance-based or probability-based similarity. The former type of measure suffers from the curse of dimensionality; for the latter, designing a probability-based similarity remains a challenging task. Most probability-based methods handle either short texts (instances with only low-dimensional representations) or long texts (instances with only high-dimensional representations). To handle multi-label documents containing both short and long texts, a novel probability score is therefore defined (Equation 3), which combines the popularity of a cluster with the smoothed in-cluster frequencies of the document's terms.
$D$ is the total number of active documents in model $M$. $ICF_w$ is the inverse cluster frequency, used to compute the importance of a term's weight: the fewer active clusters contain the term $w$, the higher its weight. A hypothetical rendering of this score is sketched below.
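Since the score's formula does not survive in this text, the sketch below is only a hypothetical rendering consistent with the description: a smoothed multinomial log-likelihood whose term contributions are weighted by an assumed IDF-style inverse cluster frequency, plus a cluster-popularity term. It reuses the `MicroCluster` sketch above; `beta` and `vocab` are illustrative smoothing parameters.

```python
import math

def icf(word, clusters):
    """Inverse cluster frequency: terms found in fewer clusters weigh more
    (an assumed IDF-style form, for illustration)."""
    containing = sum(1 for z in clusters if word in z.tf)
    return math.log((len(clusters) + 1) / (containing + 1)) + 1.0

def cluster_doc_score(doc_tf, z, clusters, beta=0.01, vocab=10_000):
    """Log probability score of cluster z for a document (Equation 3 stand-in)."""
    score = math.log(z.m + 1)                                # cluster popularity
    for w, f in doc_tf.items():
        p = (z.tf.get(w, 0) + beta) / (z.n + beta * vocab)   # smoothed frequency
        score += icf(w, clusters) * f * math.log(p)          # ICF-weighted term
    return score
```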
Label prediction: after computing the similarity between the arriving document and all active clusters of the model using Equation 3, the $k$ nearest clusters with the highest probability scores are selected for label prediction. The number of labels predicted is the number of clusters above the mean of the normalized probability distribution over the $k$ nearest clusters. The label co-occurrence probabilities, i.e., the conditional probabilities of Equation 2, were already computed during initialization; labels are predicted by combining frequently occurring nearest label pairs with their cluster counts. After prediction, the document is added to the selected micro-clusters in the course of cluster selection (Equation 2) and new-cluster creation (Equation 4). A sketch of this prediction step follows.
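The following sketch operates on precomputed positive cluster scores; the tie-breaking by label co-occurrence is reduced here to summing normalized cluster probabilities per label, so the names and the simplification are illustrative rather than the patent's verbatim rule.

```python
from collections import Counter

def predict(doc_scores, cluster_label, k=5):
    """doc_scores: {cluster_id: probability score}; cluster_label: {id: label}."""
    zd = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
    total = sum(doc_scores[z] for z in zd)
    norm = {z: doc_scores[z] / total for z in zd}      # normalized over Z_d
    mean = 1.0 / len(zd)
    l_count = sum(1 for z in zd if norm[z] > mean)     # number of labels to predict
    if l_count == 1:
        # One dominant cluster: take the most frequent label among Z_d.
        return [Counter(cluster_label[z] for z in zd).most_common(1)[0][0]]
    per_label = Counter()
    for z in zd:
        per_label[cluster_label[z]] += norm[z]         # sum probabilities per label
    return [l for l, _ in per_label.most_common(l_count)]

scores = {0: 0.40, 1: 0.25, 2: 0.15, 3: 0.12, 4: 0.08}
labels = {0: "sports", 1: "sports", 2: "politics", 3: "politics", 4: "tech"}
print(predict(scores, labels))   # ['sports', 'politics']
```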
Model maintenance comprises concept evolution of the non-parametric model, deletion of outdated label-related terms, and merging of micro-clusters.
Concept evolution of the non-parametric model: each active micro-cluster covers a different term space with a different distribution. The basic idea is therefore to check whether a new document belongs to the nearest term space of an existing active micro-cluster (the $k$ nearest micro-clusters) or to a new term space. Using Equation 3, the probability with respect to existing micro-clusters is computed; the probability of creating a new micro-cluster is defined by Equation 4.
This probability reflects cluster popularity, where $\alpha$ is a small pseudo-count and $\beta$ is a small pseudo-occurrence count for the terms of the new micro-cluster. A micro-cluster is therefore selected for the arriving document only when its probability under Equation 3 exceeds the new-cluster probability under Equation 4; otherwise a new micro-cluster is created. A sketch of this decision follows.
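A sketch of this decision under an assumed standard Dirichlet-process form (the exact Equation 4 survives only in words above, so the concrete expression here is an assumption using the stated pseudo-counts $\alpha$ and $\beta$):

```python
import math

def new_cluster_score(doc_tf, D, alpha=0.05, beta=0.01, vocab=10_000):
    """Log probability of opening a new micro-cluster for the document
    (Chinese restaurant process; alpha and beta as defined above)."""
    score = math.log(alpha * D / (D - 1 + alpha * D))    # 'new table' mass
    for w, f in doc_tf.items():
        score += f * math.log(beta / (beta * vocab))     # pseudo-occurrences only
    return score

def select_or_create(doc_tf, clusters, D, score_fn):
    """Join the best existing micro-cluster, or return None to create one."""
    best = max(clusters, key=lambda z: score_fn(doc_tf, z, clusters))
    if score_fn(doc_tf, best, clusters) >= new_cluster_score(doc_tf, D):
        return best
    return None   # concept evolution: a new micro-cluster is created
```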
Deleting outdated label-related terms: besides creating micro-clusters for current concepts, the model must also remove outdated ones. A forgetting mechanism is used to delete outdated micro-clusters: under the decay mechanism, if a micro-cluster is not updated, its importance weight ($r_z$) keeps decreasing over time, whereas a micro-cluster that keeps being updated captures a current concept. For simplicity, $r_z$ decays exponentially in the time elapsed since the cluster's last update, $t_M - u_z$, where $t_M$ is the model's current time. Whenever a document is added to micro-cluster $z$, $r_z$ is reset to 1. If a micro-cluster receives no documents for some time, its score approaches zero, leading to the deletion of the micro-cluster. A sketch under an assumed decay form follows.
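The decay formula itself is not reproduced in this text, so the sketch below assumes the common exponential form $r_z = 2^{-\lambda (t_M - u_z)}$, with $\lambda$ and the deletion threshold as illustrative parameters:

```python
def decay_and_prune(clusters, t_m, lam=0.1, eps=1e-3):
    """Fade unupdated micro-clusters and delete those with near-zero weight."""
    survivors = []
    for z in clusters:
        z.r = 2.0 ** (-lam * (t_m - z.u))   # decays with time since last update
        if z.r > eps:
            survivors.append(z)             # still represents a live concept
    return survivors                        # the rest are deleted
```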
The label-related core terms of the different term spaces may or may not be mutually exclusive; in either case, the importance of these label-related terms also changes over time in the text stream. To address this, a cluster feature $ta_z$ is maintained that stores the arrival time series of the terms in each micro-cluster. Based on triangular-number time, the age of the cluster is computed, and the recency score of each term is computed accordingly. If a term's recency score falls below a defined threshold $\Gamma$, the term is no longer considered representative of the cluster's active concept. A triangular time decay function is used to fade outdated terms in the model, defined as follows:
$\Delta f(T) = (T^2 + T)/2$
Here, $T$ is the timestamp number. The recency score of a term ($recency_w$) is measured as the ratio of the sum of its triangular arrival times to the age of the cluster. For example, if $m_z = 9$, the age of cluster $z$ is:
$Age_z = \Delta f(9) - \Delta f(1) = ((9^2 + 9)/2) - ((1^2 + 1)/2) = 45 - 1 = 44$
Suppose the word $w_i$ appears in three documents of the cluster, D1, D2, and D8, with arrival times 1, 2, and 8 respectively. The arrival times of $w_i$ in the cluster are stored as $w_i = \{timestamp: 1, timestamp: 2, timestamp: 8\}$.
The recency score of $w_i$ is then computed as $recency_{w_i} = \frac{\Delta f(1) + \Delta f(2) + \Delta f(8)}{Age_z} = \frac{1 + 3 + 36}{44} \approx 0.91$.
It can be seen that, as new documents arrive in a micro-cluster, more and more terms become unimportant; in this way, terms are gradually filtered out automatically over time. A sketch of this computation follows.
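For illustration, the triangular-number bookkeeping above can be coded as follows; note that the recency formula is reconstructed from the worked example and is an assumption:

```python
def tri(t):
    """Triangular-number time: Δf(T) = (T² + T) / 2."""
    return (t * t + t) / 2

def cluster_age(m_z):
    return tri(m_z) - tri(1)          # e.g. m_z = 9 gives 45 - 1 = 44

def recency(arrivals, age):
    """Assumed form: sum of triangular arrival times over cluster age."""
    return sum(tri(t) for t in arrivals) / age

age = cluster_age(9)
print(recency([1, 2, 8], age))        # (1 + 3 + 36) / 44 ≈ 0.909
# Terms whose recency falls below the threshold Γ are dropped from the cluster.
```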
Merging micro-clusters: the multinomial distribution parameter $\beta$ governs the computed homogeneity between documents and micro-clusters. However, as the number of noise terms in a cluster grows over time, the probability mass of the noise terms can come to dominate and overwhelm the core terms. After concept drift in the label-related terms and the gradual removal of noisy and outdated terms, two micro-clusters may share a highly overlapping term space, affecting cluster granularity. In this application, the cluster merging process is incorporated into computing the probability between two clusters using Equation 3. Over time, if a cluster is not updated, the value of $r_z$ approaches zero, indicating that the cluster is outdated. Therefore, before deleting such a cluster, its probability with respect to every active cluster carrying the same label is computed, together with the probability that the cluster would instead form a new cluster (Equation 4). If the new-cluster probability exceeds the probabilities of the active clusters, the micro-cluster is deleted; otherwise, it is merged with the nearest micro-cluster carrying the same label. A sketch of this decision follows.
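A sketch of the merge-or-delete step, reusing the scoring functions sketched above and treating the stale cluster's term counts as a pseudo-document; this is an illustrative reading of the step, not the patent's verbatim procedure:

```python
def merge_or_delete(stale, clusters, D, score_fn, new_score_fn):
    """Return the cluster to merge `stale` into, or None to delete it."""
    same_label = [z for z in clusters
                  if z is not stale and z.labels & stale.labels]
    if not same_label:
        return None                        # nothing shares a label: delete
    pseudo_doc = dict(stale.tf)            # stale cluster viewed as a document
    best = max(same_label, key=lambda z: score_fn(pseudo_doc, z, clusters))
    if new_score_fn(pseudo_doc, D) > score_fn(pseudo_doc, best, clusters):
        return None                        # 'new cluster' wins: delete stale
    return best                            # merge into nearest same-label cluster
```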
Compared with the prior art, the present invention offers a number of advantages and positive effects: it can effectively process multi-label text streams, adapt to dynamic changes in the environment, and improve prediction performance and efficiency.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.