CN115827861A - Context-enhanced Dirichlet model supporting online clustering of short text streams - Google Patents


Info

Publication number
CN115827861A
CN115827861A (application CN202211504585.1A)
Authority
CN
China
Prior art keywords
cluster
model
document
documents
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211504585.1A
Other languages
Chinese (zh)
Other versions
CN115827861B (en)
Inventor
瑞嘉
杰库码
任晓龙
邵俊明
李想
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202211504585.1A (granted as CN115827861B)
Publication of CN115827861A
Application granted
Publication of CN115827861B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The invention provides a context-enhanced Dirichlet model that supports online clustering of short text streams, comprising the following steps. Step 1: based on a computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster. Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than a pseudo-probability, the document is treated as the emergence of a new topic and a new cluster is created. Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units. The invention offers high model efficiency and strong robustness.

Description

Translated from Chinese
A context-enhanced Dirichlet model supporting online clustering of short text streams

Technical field

The present invention relates to the technical fields of artificial intelligence and natural language processing, and in particular to a context-enhanced Dirichlet model supporting online clustering of short text streams.

Background

Over the past decade, large volumes of short-text data have been generated daily on social media, such as tweets, Facebook posts, and question-answering platforms. In recent years, clustering of such continuously arriving short texts has attracted wide attention owing to its many applications, such as topic tracking and news recommendation.

Textual representation of documents is an important task in natural language processing. Representations are typically obtained by selecting a suitable set of terms (a low-dimensional vocabulary) and their relative importance (a weighting scheme) to capture document content. In particular, term-weighting schemes help capture the context of words in a specific sentence or document. For example, bag-of-words (BoW) representations and TF-IDF weighting are widely used as term indices and importance scores, respectively. Methods have been proposed for learning term specificity (a weighting scheme) to generate word embeddings on a static corpus. In a streaming environment, however, such representations cannot be mapped directly: term weights evolve, streaming documents must be represented by micro-clusters whose term subspaces overlap, and the term distribution is unknown and may change over time.

Unlike long text documents such as blogs and academic papers, short texts contain few words, which makes processing them both essential and challenging. Moreover, the unique properties of streams (high speed, concept drift, and concept evolution) further complicate the clustering task, since it must handle data under time and space constraints. Unlike static text collections, the occurrence of concept evolution and concept drift in high-speed text streams cannot be known in advance, which gives rise to the need to process the stream in an online manner.

Summary of the invention

To address the deficiencies of the prior art, the present invention provides a context-enhanced Dirichlet model supporting online clustering of short text streams, which offers high model efficiency and strong robustness.

The above technical objective of the present invention is achieved through the following technical solution:

A context-enhanced Dirichlet model supporting online clustering of short text streams, comprising the following steps:

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster;

Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created;

Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units.

The invention is further configured such that each pseudo-cluster in the model is represented as a 6-tuple

Figure SMS_1

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure SMS_2] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure SMS_3] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure SMS_4]; the score between w_i and w_j is defined as

Figure SMS_5

where [Figure SMS_6] is the term frequency of w_i in document d, and l_z and u_z store the cluster's decay weight and last-update timestamp. A CF set has two important properties, addable and deletable, which allow the cluster to be updated incrementally over time; the addable property enables the cluster to be updated by adding new documents to it, as defined below.

The invention is further configured as follows. Definition 1: a document d is added to cluster z by updating the cluster through the addable property:

m_z = m_z + 1

Figure SMS_7

cw_z = cw_z ∪ cw_d

N_z = N_z + N_d;

Definition 2: a document d is removed from cluster z through the deletable property:

m_z = m_z - 1

Figure SMS_8

cw_z = cw_z - cw_d

N_z = N_z - N_d,

where N_d denotes the total number of words in the document and cw_d is the window-based co-occurrence matrix of the document; cw_d contains the frequency ratio between two neighboring terms, defined as

Figure SMS_9

where [Figure SMS_10] is the term frequency of w_i in the document, for (w_i, w_j) ∈ d.

The invention is further configured such that initially the model contains no clusters, so a new empty cluster is created and the first arriving document is added to it; each subsequent arriving document is either added to an existing (active) cluster z (z ∈ M) of the model or triggers the creation of a new cluster; the similarity between document d and the active cluster z is computed as

Figure SMS_11

The invention further proposes a weight to compute the specificity of a term: if a term co-occurs with different terms each time it appears, the term is less specific and can therefore be used with many concepts, whereas highly specific terms have a smaller number of co-occurring neighbors; moreover, the expected number of neighbors of a term depends on the defined window size, so the word specificity S of w_i is defined as:

Figure SMS_12

where δ is the neighboring window size and g(w_i) is defined as

Figure SMS_13

Figure SMS_14

Figure SMS_15

g(w_i) computes the ratio between the neighborhood population and the defined window size (σ), together with the term's frequency in the document, where [Figure SMS_16] is the number of unique neighbors of term w_i in the model and [Figure SMS_17] is the total term frequency; the score is bounded with respect to the window size by applying the hyperbolic tangent sigmoid function.

The invention is further configured as follows: at initialization, the first document of the stream creates a new cluster; to automatically identify new topics in arriving documents over time, the probability p(z_new|d) is derived from the transformation p(z_d|G_z) = p(G_z) · p(d|G_z) as follows:

Figure SMS_18

where αD represents the pseudo-population of documents, V_z is the average vocabulary size of the active clusters [Figure SMS_19], and the value of β contributes the pseudo-word similarity with the new cluster; the equation [Figure SMS_20] gives the condition for creating a new cluster.

The invention further introduces a window-based term co-occurrence matrix in which each term can only be paired with neighbors within a distance of [Figure SMS_21], while the sentence order is preserved.

The invention is further configured such that, to maintain the current distribution of concepts, the model keeps active clusters and removes outdated ones; the decay weight of each cluster is updated over time as

Figure SMS_22

where t_c denotes the current timestamp of the model and u_z stores the cluster's last-update timestamp. Initially, the decay weight of each new cluster is set to 1; if l_z is approximately zero, the cluster is scheduled for removal from the model, i.e., the micro-cluster no longer captures the current term distribution of the topics in the text stream.

The beneficial effects of the invention are as follows: a context-enhanced Dirichlet model supporting online clustering of short text streams that exploits the distribution of unique neighbor populations across the whole model. For active clusters, an episodic inference procedure is proposed that reduces the cluster sparsity of the model. In addition, EINDM merges highly similar clusters, automatically yielding a cluster population close to the true one. Extensive empirical analyses demonstrate the efficiency and robustness of the proposed model across different parameter ranges. EINDM derives a new word-specificity term-weighting scheme from the model-wide distribution of unique neighbor populations, while the episodic inference procedure reduces cluster sparsity. Compared with recent state-of-the-art clustering models, EINDM achieves the best performance in terms of NMI, homogeneity, and cluster purity.

Brief description of the drawings

Figure 1 is a schematic diagram of the algorithm flow;

Figure 2 is a schematic example of the term window; a cross marks a term window position that cannot move.

Detailed description of the embodiments

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

Topic generation is modeled with a Dirichlet distribution; the most important task is to formulate, via probability distributions, the relationship between arriving documents and topics (micro-clusters).

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model (an active topic) or creates a new cluster (a new topic has arrived).

Step 2: if the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created.

Step 3: as documents arrive, the model checks whether old clusters (outdated topics) should be removed.

In this way, recent topic clusters with the current distribution remain active in the model. To infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units. Each key process is described in detail next.
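The three steps above can be sketched as a single online loop. This is a minimal illustration, not the patented algorithm itself: the callbacks `similarity`, `pseudo_prob`, and `purge` are hypothetical stand-ins for the document-cluster similarity (Equation 7), the new-cluster condition (Equation 13), and the decay-based deletion described later.

```python
import random
from collections import deque

def online_cluster(stream, similarity, pseudo_prob, purge, rho=100, psi=500, eta=50):
    """Skeleton of the three-step online loop (Steps 1-3 above)."""
    clusters = []                  # active micro-clusters (topics)
    recent = deque(maxlen=psi)     # sliding window of the psi newest documents
    for t, doc in enumerate(stream, start=1):
        recent.append(doc)
        # Steps 1-2: join the best-matching active cluster, or found a new one
        scored = [(similarity(doc, z), z) for z in clusters]
        score, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        if best is None or score < pseudo_prob(doc, clusters):
            clusters.append({"docs": [doc]})      # a new topic has arrived
        else:
            best["docs"].append(doc)
        # Step 3: drop outdated clusters (stale topics)
        clusters = [z for z in clusters if not purge(z, t)]
        # every rho time units, resample eta random docs from the recent psi
        if t % rho == 0:
            sample = random.sample(list(recent), min(eta, len(recent)))
            # (the number of active clusters would be re-inferred from `sample`)
    return clusters
```

With a toy overlap similarity and a fixed pseudo-probability, documents sharing terms land in the same cluster while unrelated ones open new clusters.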

1. Cluster feature set. Each pseudo-cluster in the model is represented as a 6-tuple

Figure SMS_23

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure SMS_24] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure SMS_25] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure SMS_26]; the score between w_i and w_j is defined as

Figure SMS_27

where [Figure SMS_28] is the term frequency of w_i in document d. l_z and u_z store the cluster's decay weight and last-update timestamp. A CF set has two important properties: (i) addable and (ii) deletable. These properties allow the cluster to be updated incrementally over time. The addable property enables a cluster to be updated by adding new documents to it. Definition 1: a document d is added to cluster z by updating the cluster through the addable property.

m_z = m_z + 1

Figure SMS_29

cw_z = cw_z ∪ cw_d

N_z = N_z + N_d

Definition 2: a document d is removed from cluster z through the deletable property.

m_z = m_z - 1

Figure SMS_30

cw_z = cw_z - cw_d

N_z = N_z - N_d

Here, N_d denotes the total number of words in the document, and cw_d is the window-based co-occurrence matrix of the document; cw_d contains the frequency ratio between two neighboring terms, defined as:

Figure SMS_31
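A minimal sketch of the CF set and its addable/deletable updates (Definitions 1 and 2). Two simplifications are assumed: the "three-dimensional" co-occurrence matrix cw is stored as a counter over ordered term pairs within a window of size δ, and raw counts are kept rather than the frequency-ratio scores of the equation above.

```python
from collections import Counter

def doc_cooccurrence(words, delta=2):
    """Window-based co-occurrence counts cw_d of a tokenized document:
    each term is paired only with neighbors within distance delta,
    preserving sentence order (see the window-based matrix below)."""
    cw = Counter()
    for i, wi in enumerate(words):
        for j in range(i + 1, min(i + delta + 1, len(words))):
            cw[(wi, words[j])] += 1
    return cw

class ClusterFeature:
    """6-tuple CF set: (m_z, term frequencies, cw_z, N_z, l_z, u_z)."""
    def __init__(self):
        self.m = 0              # m_z: number of documents in the cluster
        self.tf = Counter()     # term -> frequency over the cluster's documents
        self.cw = Counter()     # cw_z: term-pair co-occurrence counts
        self.N = 0              # N_z: total word count
        self.decay = 1.0        # l_z: decay weight (initialized to 1)
        self.updated = 0        # u_z: last-update timestamp

    def add(self, words, t, delta=2):      # Definition 1 (addable)
        self.m += 1
        self.tf.update(words)
        self.cw.update(doc_cooccurrence(words, delta))
        self.N += len(words)
        self.updated = t

    def remove(self, words, t, delta=2):   # Definition 2 (deletable)
        self.m -= 1
        self.tf.subtract(words)
        self.cw.subtract(doc_cooccurrence(words, delta))
        self.N -= len(words)
        self.updated = t
```

Adding and then removing the same document restores the empty statistics, which is what makes the incremental updates reversible.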

2. Document-cluster similarity. Initially there are no clusters in the model, so a new empty cluster is created and the first arriving document is added to it. Each subsequent arriving document is either added to an existing (active) cluster z (z ∈ M) of the model (based on Equation 7) or triggers the creation of a new cluster (based on Equation 13). The similarity between document d and an active cluster z (the model is denoted [Figure SMS_32]) is computed as

Figure SMS_33

The definitions of all symbols are given in the table below.

Figure SMS_34

Table 1: Symbols and notation

Here, the first part of this equation reflects the cluster popularity p(G_z), while the remaining two parts compute the homogeneity p(d|G_z). The second part of the equation captures similarity in the single-term space. The third part addresses term ambiguity by computing similarity in the semantic term space (i.e., the co-occurrence of term pairs). The second part of the homogeneity, in which the value of β acts as a pseudo-weight for unseen words, is based on a multinomial distribution ([Figure SMS_35]) and uses the term occurrences of cluster [Figure SMS_36] while capturing homogeneity (similarity). However, unlike static settings, where the inverse document frequency can be used to compute term importance in the global space, the term-document distribution here is unknown. Therefore, a similar weighted score, called the inverse cluster frequency ICF_w, is defined to compute term importance:

Figure SMS_37

The denominator of Equation 8 is the number of active clusters in the model that contain the word w, and the numerator is the total number of active clusters. This means that a word contained in many clusters is less important; if a word w is contained in only a few clusters, its importance is higher. To capture this behavior, a new word-specificity weight S_w is introduced, which is defined in Equation 9.
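The ICF idea can be sketched directly from that description. Equation 8 itself is an image in the source, so the logarithm and the +1 smoothing below are assumptions by analogy with inverse document frequency; clusters are simplified to sets of terms.

```python
import math

def icf(word, clusters):
    """Inverse cluster frequency of `word`, per the description of Eq. 8:
    total active clusters over the number of active clusters containing it.
    The log and +1 smoothing are assumptions (IDF analogy)."""
    active = len(clusters)
    containing = sum(1 for terms in clusters if word in terms)
    return math.log((active + 1) / (containing + 1))
```

A word present in fewer clusters receives a larger ICF, matching the intent stated above.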

3. Word specificity: a new weight is proposed to compute the specificity of a term. The basic idea is that if a term co-occurs with different terms each time it appears, the term is less specific and can therefore be used with many concepts; highly specific terms, in contrast, have a smaller number of co-occurring neighbors. Furthermore, the expected number of neighbors of a term depends on the defined window size (a model parameter). Therefore, taking the co-occurrence window-size constraint into account, the word specificity S of w_i is defined as

Figure SMS_38

Here, δ is the neighboring window size, and g(w_i) is defined as

Figure SMS_39

Figure SMS_40

Figure SMS_41

g(w_i) computes the ratio between the neighborhood population and the defined window size (σ), together with the term's frequency in the document. Here, [Figure SMS_42] is the number of unique neighbors of term w_i in the model, and [Figure SMS_43] is the total term frequency. The score is bounded with respect to the window size by applying the hyperbolic tangent sigmoid function.
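The intent of the specificity weight can be illustrated as follows. The exact combination in Equations 9-12 is not recoverable from the text (the formulas are images), so this sketch only assumes what the prose states: the ratio of unique neighbors to the slots the window allows is bounded with tanh, and fewer unique neighbors means higher specificity. The `2 * delta * term_freq` neighbor budget is a hypothetical normalization.

```python
import math

def specificity(word, neighbors, term_freq, delta=2):
    """Hedged sketch of the word-specificity weight S(w_i).
    neighbors: set of unique co-occurring terms of `word` in the model;
    term_freq: total frequency of `word` across the model."""
    expected = 2 * delta * term_freq          # max neighbor slots (assumption)
    ratio = len(neighbors) / max(1, expected) # neighborhood population ratio
    return 1.0 - math.tanh(ratio)             # fewer neighbors -> more specific
```

A term seen five times with a single recurring neighbor scores higher than one seen five times with ten distinct neighbors, which is the behavior the paragraph above describes.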

4. Automatic cluster creation: at initialization, the first document of the stream creates a new cluster. To automatically identify new topics in arriving documents over time, a probability is needed that detects a new distribution of words and creates a new cluster whenever a document does not belong to any active cluster. By transforming the formula p(z_d|G_z) = p(G_z) · p(d|G_z), the probability p(z_new|d) is derived as follows.

Figure SMS_44

Here, αD represents the pseudo-population of documents, V_z is the average vocabulary size of the active clusters [Figure SMS_45], and the value of β contributes the pseudo-word similarity with the new cluster. Equation 14 gives the condition for creating a new cluster (see line 10 of Algorithm 1).
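Equation 13 is only an image in the source, so the following is a hedged reconstruction along the lines of standard Dirichlet multinomial mixture models: a pseudo-cluster holding αD documents supplies the popularity part, and since every word of d is unseen by the new cluster, each word draw contributes a pseudo-weight β against the average vocabulary size.

```python
def p_new_cluster(doc_words, D, alpha, beta, V_bar):
    """Hedged sketch of p(z_new | d). D: documents seen so far; alpha, beta:
    Dirichlet pseudo-counts; V_bar: average vocabulary size of active clusters.
    The exact form in the patent may differ."""
    # cluster-popularity part: a pseudo-cluster holding alpha*D documents
    p = (alpha * D) / (D - 1 + alpha * D)
    # homogeneity part: every word of d is unseen, so each draw has weight beta
    for i, _ in enumerate(doc_words):
        p *= beta / (V_bar * beta + i)
    return p
```

A larger average vocabulary V_bar dilutes the pseudo-word weight and thus lowers the chance of opening a new cluster, matching the role of β described above.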

Figure SMS_46

Figure SMS_47

Algorithm 1

5. Window-based co-occurrence matrix: the patent introduces a window-based term co-occurrence matrix in which each term can only be paired with neighbors within a distance of [Figure SMS_48], while the sentence order is preserved. An example is shown in Figure 2. In this way, the co-occurrence matrix cw_d of a document has at most O(δ·N_d) entries, where N_d is the document length. The weight between two neighboring terms w_i and w_j in a cluster is defined in Equation 6, with i ≠ j, (i − δ) ≤ j ≤ (i + δ), and δ ≥ 1.

6. Episodic inference: inference procedures have been shown to be useful in generative processes for reducing cluster sparsity. Their iterative versions have two main drawbacks: (i) a single batch may fail to capture the current distribution, and (ii) they add processing-time cost, making them unsuitable for high-speed streams. In contrast, an episodic inference procedure is proposed that not only reduces the processing cost effectively but is also able to cover the distribution of the stream.

7. Deleting outdated clusters: to maintain the current distribution of concepts, the model needs to keep active clusters (current concepts) and remove outdated ones (old concepts). A decay mechanism based on the stream speed is adopted, which updates the importance score (decay weight) of each cluster over time. If a cluster has not received documents recently, its decay weight l_z decreases over time. The decay weight of each cluster is updated over time as

Figure SMS_49

Here, t_c denotes the current timestamp of the model and u_z stores the cluster's last-update timestamp. Initially, the decay weight of each new cluster is set to 1 (see line 10 of Algorithm 2). If l_z is approximately zero, the cluster is scheduled for removal from the model (see line 6 of Algorithm 3), i.e., the micro-cluster no longer captures the current term distribution of the topics in the text stream.
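The decay update itself (Equation 15) is an image in the source, so the sketch below assumes a common exponential form, 2 raised to −λ(t_c − u_z), with a hypothetical rate λ; it only preserves the stated behavior: fresh clusters start at weight 1, and clusters whose weight is approximately zero are removed.

```python
import math

def update_decay(cluster, t_c, lambda_=0.5):
    """cluster: dict with decay weight 'l' and last-update timestamp 'u'.
    Exponential decay in the gap t_c - u is an assumption; a cluster
    updated at t_c keeps weight 1."""
    cluster["l"] = math.pow(2.0, -lambda_ * (t_c - cluster["u"]))
    return cluster["l"]

def sweep(clusters, t_c, eps=1e-3):
    """Keep only clusters whose decay weight is not approximately zero."""
    return [z for z in clusters if update_decay(z, t_c) > eps]
```

A cluster last updated long ago decays below the threshold and is dropped, while a recently updated one survives the sweep.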

Figure SMS_50

Figure SMS_51

Algorithm 2

The present invention thus proposes a context-enhanced Dirichlet model supporting online clustering of short text streams that exploits the distribution of unique neighbor populations across the whole model. For active clusters, an episodic inference procedure is proposed that reduces the cluster sparsity of the model. In addition, EINDM merges highly similar clusters, automatically yielding a cluster population close to the true one. Extensive empirical analyses demonstrate the efficiency and robustness of the proposed model across different parameter ranges. EINDM derives a new word-specificity term-weighting scheme from the model-wide distribution of unique neighbor populations, while the episodic inference procedure reduces cluster sparsity. Compared with recent state-of-the-art clustering models, EINDM achieves the best performance in terms of NMI, homogeneity, and cluster purity.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims (8)

Translated from Chinese
1. A context-enhanced Dirichlet model supporting online clustering of short text streams, characterized by comprising the following steps:

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster;

Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created;

Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units.

2. The context-enhanced Dirichlet model supporting online clustering of short text streams according to claim 1, characterized in that each pseudo-cluster in the model is represented as a 6-tuple

Figure FDA0003968616540000011

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure FDA0003968616540000012] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure FDA0003968616540000013] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure FDA0003968616540000014]; the score between w_i and w_j is defined as

Figure FDA0003968616540000015

where [Figure FDA0003968616540000016] is the term frequency of w_i in document d, and l_z and u_z store the cluster's decay weight and last-update timestamp; a CF set has two important properties, addable and deletable, which allow the cluster to be updated incrementally over time, the addable property enabling the cluster to be updated by adding new documents to it.
3. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 2, characterized in that:
Definition 1: a document d is added to cluster z by updating the cluster with the addable property:
mz = mz + 1
Figure FDA0003968616540000017
cwz = cwz ∪ cwd
Nz = Nz + Nd;
Definition 2: a document d is deleted from cluster z with the deletable property:
mz = mz - 1
Figure FDA0003968616540000021
cwz = cwz - cwd
Nz = Nz - Nd,
where Nd denotes the total number of words in the document, and cwd is the window-based co-occurrence matrix of the document; cwd contains the frequency ratio between two adjacent terms and is defined as
Figure FDA0003968616540000022
where
Figure FDA0003968616540000023
is the frequency of wi in the document for the term pair (wi, wj) ∈ d.
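The cluster-feature (CF) updates of Definitions 1 and 2 can be sketched as a small Python class. This is a minimal illustration, not the patent's implementation: the exact per-term update hidden in the figure is assumed here to be a plain term-frequency vector update, and the class and attribute names are hypothetical.

```python
from collections import Counter

class MicroCluster:
    """Cluster feature (CF) set: document count, term frequencies,
    co-occurrence counts, total word count, decay weight, last update."""

    def __init__(self, timestamp=0):
        self.m_z = 0             # number of documents in the cluster
        self.n_z = Counter()     # term -> frequency over the cluster's documents
        self.cw_z = Counter()    # (w_i, w_j) -> co-occurrence count
        self.N_z = 0             # total number of words in the cluster
        self.l_z = 1.0           # decay weight, 1.0 for a fresh cluster
        self.u_z = timestamp     # timestamp of the last update

    def add(self, doc_terms, doc_cooc, timestamp):
        """Definition 1 (addable): add a document to the cluster."""
        self.m_z += 1
        self.n_z.update(doc_terms)
        self.cw_z.update(doc_cooc)
        self.N_z += sum(doc_terms.values())
        self.u_z = timestamp

    def remove(self, doc_terms, doc_cooc):
        """Definition 2 (deletable): remove a previously added document."""
        self.m_z -= 1
        self.n_z.subtract(doc_terms)
        self.cw_z.subtract(doc_cooc)
        self.N_z -= sum(doc_terms.values())
```

Because every update is a sum of counts, adding and then removing the same document restores the CF set exactly, which is what makes the incremental maintenance of Definitions 1 and 2 possible.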
4. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 3, characterized in that: initially the model contains no clusters, so a new empty cluster is created and the first arriving document is added to it; each subsequently arriving document is either added to an existing (active) cluster z (z ∈ M) of the model or triggers the creation of a new cluster; the similarity between document d and an active cluster z is computed as
Figure FDA0003968616540000024
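The similarity formula itself is only available as a figure in the original. As a hedged sketch, the standard document-to-cluster score used in Dirichlet-multinomial stream clustering (the MStream family of models) is shown below; it is an assumption about the general form, not the patent's exact equation, and all parameter names are illustrative.

```python
from collections import Counter

def cluster_probability(doc, cluster_m, cluster_n, cluster_N, D, alpha, beta, V):
    """Standard Dirichlet-multinomial score of a document against one cluster:
    a population term m_z / (D - 1 + alpha*D) times a word-likelihood product
    smoothed by beta over a vocabulary of size V.
    doc: term Counter; cluster_m/cluster_n/cluster_N: CF statistics;
    D: number of documents seen so far."""
    prob = cluster_m / (D - 1 + alpha * D)
    i = 0
    for w, freq in doc.items():
        for j in range(freq):
            # (n_z^w + beta + j - 1) / (N_z + V*beta + i - 1), 0-based here
            prob *= (cluster_n[w] + beta + j) / (cluster_N + V * beta + i)
            i += 1
    return prob
```

In use, an arriving document would be scored against every active cluster and assigned to the highest-scoring one, unless the new-cluster probability of claim 6 dominates.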
5. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 4, characterized in that: a weight is proposed to measure the specificity of a term; if a term co-occurs with different terms each time, this indicates that the term is not very specific and can therefore be used with many concepts, whereas a highly specific term has a smaller number of co-occurring neighbors; moreover, the normal number of neighbors of a term depends on the defined window size, so the word specificity S of wi is defined as:
Figure FDA0003968616540000025
where δ is the neighborhood window size, and g(wi) is defined as
Figure FDA0003968616540000031
Figure FDA0003968616540000032
Figure FDA0003968616540000033
g(wi) computes the ratio between the neighborhood population and the defined window size (σ) together with the term's frequency in the documents, where
Figure FDA0003968616540000034
is the number of unique neighbors of the term wi in the model, and
Figure FDA0003968616540000035
is the total term frequency; the score is computed by bounding the window size with the hyperbolic tangent sigmoid function.
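The exact definitions of S and g(wi) are only given in the figures. A sketch consistent with the surrounding description (a term with many distinct neighbors relative to the window size is less specific, and the score is bounded with the hyperbolic tangent) might look as follows; the precise combination with term frequency is an assumption, and the function signature is hypothetical.

```python
import math
from collections import defaultdict

def word_specificity(term, neighbors, window_size):
    """Hedged sketch of the specificity weight of claim 5: the ratio of a
    term's unique co-occurring neighbors to the window size, pushed through
    tanh so the score stays bounded; fewer neighbors -> higher specificity.
    neighbors: dict mapping each term to the set of its co-occurring terms."""
    unique = len(neighbors[term])
    ratio = unique / window_size      # neighborhood population vs. window size
    return 1.0 - math.tanh(ratio)     # bounded in (0, 1], decreasing in ratio
```

Under this sketch a term that always appears next to the same one or two neighbors keeps a score near 1, while a promiscuous term that co-occurs with many different neighbors decays toward 0.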
6. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 5, characterized in that: at initialization, the first document of the stream creates a new cluster; in order to automatically recognize new topics in arriving documents over time, the probability p(znew|d) is derived from the transformation p(zd|Gz) = p(Gz)·p(d|Gz) as follows,
Figure FDA0003968616540000036
where αD represents the pseudo-population of documents, and Vz is the average vocabulary size of the active clusters
Figure FDA0003968616540000037
the value of β contributes to computing the pseudo-word similarity with the new cluster; the equation
Figure FDA0003968616540000038
gives the condition for creating a new cluster.
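The new-cluster probability is only given as a figure. In Dirichlet-process stream models the standard form, which matches the description here (pseudo-population αD, pseudo-word smoothing via β, average active vocabulary Vz), treats the new cluster as an empty one; the sketch below assumes that form and uses illustrative parameter names.

```python
from collections import Counter

def new_cluster_probability(doc, D, alpha, beta, V_bar):
    """Assumed DP-style probability that a document starts a new cluster:
    an empty cluster with pseudo-population alpha*D and pseudo term counts
    governed by beta over the average active vocabulary size V_bar.
    doc: term Counter; D: number of documents seen so far."""
    prob = (alpha * D) / (D - 1 + alpha * D)
    i = 0
    for w, freq in doc.items():
        for j in range(freq):
            # empty cluster: term count 0, so only the beta pseudo-counts remain
            prob *= (beta + j) / (V_bar * beta + i)
            i += 1
    return prob
```

A new cluster would then be created whenever this score exceeds the best score among the active clusters, which is the condition the figure in claim 6 expresses.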
7. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 6, characterized in that: a window-based term co-occurrence matrix is introduced, in which each term can only be paired with neighbors within a distance of
Figure FDA0003968616540000039
while the sentence order is preserved.
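The window-based co-occurrence matrix of claim 7 can be built in a few lines; the construction below pairs each term only with the terms that follow it within a fixed distance, preserving sentence order. The `window` parameter name is illustrative.

```python
from collections import Counter

def cooccurrence_matrix(tokens, window):
    """Build a window-based co-occurrence Counter: each term is paired with
    the terms that follow it within `window` positions, and the ordered pair
    (w_i, w_j) keeps the original sentence order."""
    cw = Counter()
    for i, w_i in enumerate(tokens):
        for w_j in tokens[i + 1 : i + 1 + window]:
            cw[(w_i, w_j)] += 1
    return cw
```

Because the pairs are ordered, (w_i, w_j) and (w_j, w_i) are distinct entries, which is what "maintaining sentence order" requires.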
8. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 7, characterized in that: in order to maintain the current distribution of concepts, the model needs to keep active clusters and remove outdated ones; the decay weight of each cluster is updated over time as
Figure FDA0003968616540000041
where tc denotes the current timestamp of the model and uz stores the cluster's last-update timestamp; initially, the decay weight of each new cluster is set to 1, and when lz is approximately zero the cluster is removed from the model, i.e., the micro-cluster can no longer capture the current term distribution of the topics in the text stream.
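The decay formula itself is only shown as a figure. An exponential form consistent with the description (weight 1 at the last update, shrinking toward zero as the gap tc - uz grows) is sketched below; the base-2 decay and the rate parameter `lam` are assumptions, not the patent's stated equation.

```python
def decayed_weight(t_c, u_z, lam=0.5):
    """Assumed exponential decay: 1.0 when the cluster was just updated
    (t_c == u_z), approaching 0 as the time since the last update grows."""
    return 2.0 ** (-lam * (t_c - u_z))

def prune(clusters, t_c, threshold=1e-3, lam=0.5):
    """Remove outdated clusters whose decayed weight is approximately zero.
    clusters: dict mapping a cluster id to a record holding its 'u_z'."""
    return {z: cf for z, cf in clusters.items()
            if decayed_weight(t_c, cf['u_z'], lam) >= threshold}
```

With this form, a cluster that stops receiving documents decays geometrically and is eventually dropped, which keeps only the recent topic distribution active in the model.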
CN202211504585.1A, filed 2022-11-29 (priority 2022-11-29): Method for constructing context-enhanced Dirichlet model supporting online clustering of short text streams. Status: Active. Granted as CN115827861B (en).


Publications (2)
CN115827861A (publication): 2023-03-21
CN115827861B (grant): 2025-07-18

Family

ID=85532362


Cited By (2)
CN116861287A (published 2023-10-10, 电子科技大学长三角研究院(湖州)): Online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams
CN118069835A (published 2024-05-24, 成都飞机工业(集团)有限责任公司): Method, device, equipment and medium for constructing a knowledge base for aircraft manufacturing

Patent Citations (2)
US20120330958A1 (published 2012-12-27, Microsoft Corporation): Regularized Latent Semantic Indexing for Topic Modeling
CN113271292A (published 2021-08-17, 中国科学院信息工程研究所): Malicious domain name cluster detection method and device based on word vectors

Non-Patent Citations (1)
宿青: "基于LDA模型的聚类检索应用" (Clustering retrieval applications based on the LDA model), 中国新通信, no. 05, 5 March 2017, pages 43-44




Legal Events
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
