CN115827861A - Context-enhanced Dirichlet model supporting online clustering of short text streams - Google Patents


Info

Publication number
CN115827861A
CN115827861A (application CN202211504585.1A)
Authority
CN
China
Prior art keywords
cluster
model
document
documents
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211504585.1A
Other languages
Chinese (zh)
Other versions
CN115827861B (en)
Inventor
瑞嘉
杰库码
任晓龙
邵俊明
李想
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202211504585.1A (granted as CN115827861B)
Publication of CN115827861A
Application granted
Publication of CN115827861B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The invention provides a context-enhanced Dirichlet model that supports online clustering of short text streams, comprising the following steps. Step 1: based on a computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster. Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than a pseudo-probability, the document is treated as the emergence of a new topic and a new cluster is created. Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units. The invention offers high model efficiency and strong robustness.

Description

Translated from Chinese
A context-enhanced Dirichlet model supporting online clustering of short text streams

Technical field

The present invention relates to the technical fields of artificial intelligence and natural language processing, and in particular to a context-enhanced Dirichlet model supporting online clustering of short text streams.

Background

Over the past decade, large volumes of short-text data have been generated daily on social media, such as tweets, Facebook posts, and question-answering platforms. In recent years, clustering of such continuously arriving short texts has attracted wide attention owing to its many applications, such as topic tracking and news recommendation.

Textual representation of documents is an important task in natural language processing. Representations are typically obtained by selecting a suitable set of terms (a low-dimensional vocabulary) and their relative importance (a weighting scheme) to capture document content. In particular, term-weighting schemes help capture the context of words in a specific sentence or document. For example, bag-of-words (BoW) representations and TF-IDF weighting are widely used as term indices and importance scores, respectively. Methods have been proposed for learning term specificity (a weighting scheme) to generate word embeddings on a static corpus. In a streaming environment, however, such representations cannot be mapped directly: term weights evolve, streaming documents must be represented by micro-clusters whose term subspaces overlap, and the term distribution is unknown and may change over time.

Unlike long text documents such as blogs and academic papers, short texts contain few words, which makes processing them both essential and challenging. Moreover, the unique properties of streams (high speed, concept drift, and concept evolution) further complicate the clustering task, since it must handle data under time and space constraints. Unlike static text collections, the occurrence of concept evolution and concept drift in high-speed text streams cannot be known in advance, which gives rise to the need to process the stream in an online manner.

Summary of the invention

To address the deficiencies of the prior art, the present invention provides a context-enhanced Dirichlet model supporting online clustering of short text streams, which offers high model efficiency and strong robustness.

The above technical objective of the present invention is achieved through the following technical solution:

A context-enhanced Dirichlet model supporting online clustering of short text streams, comprising the following steps:

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster;

Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created;

Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units.

The invention is further configured such that each pseudo-cluster in the model is represented as a 6-tuple

Figure SMS_1

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure SMS_2] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure SMS_3] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure SMS_4]; the score between w_i and w_j is defined as

Figure SMS_5

where [Figure SMS_6] is the term frequency of w_i in document d, and l_z and u_z store the cluster's decay weight and last-update timestamp. A CF set has two important properties, addable and deletable, which allow the cluster to be updated incrementally over time; the addable property enables the cluster to be updated by adding new documents to it, as defined below.

The invention is further configured as follows. Definition 1: a document d is added to cluster z by updating the cluster through the addable property:

m_z = m_z + 1

Figure SMS_7

cw_z = cw_z ∪ cw_d

N_z = N_z + N_d;

Definition 2: a document d is removed from cluster z through the deletable property:

m_z = m_z - 1

Figure SMS_8

cw_z = cw_z - cw_d

N_z = N_z - N_d,

where N_d denotes the total number of words in the document and cw_d is the window-based co-occurrence matrix of the document; cw_d contains the frequency ratio between two neighboring terms, defined as

Figure SMS_9

where [Figure SMS_10] is the term frequency of w_i in the document, for (w_i, w_j) ∈ d.

The invention is further configured such that initially the model contains no clusters, so a new empty cluster is created and the first arriving document is added to it; each subsequent arriving document is either added to an existing (active) cluster z (z ∈ M) of the model or triggers the creation of a new cluster; the similarity between document d and the active cluster z is computed as

Figure SMS_11

The invention further proposes a weight to compute the specificity of a term: if a term co-occurs with different terms each time it appears, the term is less specific and can therefore be used with many concepts, whereas highly specific terms have a smaller number of co-occurring neighbors; moreover, the expected number of neighbors of a term depends on the defined window size, so the word specificity S of w_i is defined as:

Figure SMS_12

where δ is the neighboring window size and g(w_i) is defined as

Figure SMS_13

Figure SMS_14

Figure SMS_15

g(w_i) computes the ratio between the neighborhood population and the defined window size (σ), together with the term's frequency in the document, where [Figure SMS_16] is the number of unique neighbors of term w_i in the model and [Figure SMS_17] is the total term frequency; the score is bounded with respect to the window size by applying the hyperbolic tangent sigmoid function.

The invention is further configured as follows: at initialization, the first document of the stream creates a new cluster; to automatically identify new topics in arriving documents over time, the probability p(z_new|d) is derived from the transformation p(z_d|G_z) = p(G_z) · p(d|G_z) as follows:

Figure SMS_18

where αD represents the pseudo-population of documents, V_z is the average vocabulary size of the active clusters [Figure SMS_19], and the value of β contributes the pseudo-word similarity with the new cluster; the equation [Figure SMS_20] gives the condition for creating a new cluster.

The invention further introduces a window-based term co-occurrence matrix in which each term can only be paired with neighbors within a distance of [Figure SMS_21], while the sentence order is preserved.

The invention is further configured such that, to maintain the current distribution of concepts, the model keeps active clusters and removes outdated ones; the decay weight of each cluster is updated over time as

Figure SMS_22

where t_c denotes the current timestamp of the model and u_z stores the cluster's last-update timestamp. Initially, the decay weight of each new cluster is set to 1; if l_z is approximately zero, the cluster is scheduled for removal from the model, i.e., the micro-cluster no longer captures the current term distribution of the topics in the text stream.

The beneficial effects of the invention are as follows: a context-enhanced Dirichlet model supporting online clustering of short text streams that exploits the distribution of unique neighbor populations across the whole model. For active clusters, an episodic inference procedure is proposed that reduces the cluster sparsity of the model. In addition, EINDM merges highly similar clusters, automatically yielding a cluster population close to the true one. Extensive empirical analyses demonstrate the efficiency and robustness of the proposed model across different parameter ranges. EINDM derives a new word-specificity term-weighting scheme from the model-wide distribution of unique neighbor populations, while the episodic inference procedure reduces cluster sparsity. Compared with recent state-of-the-art clustering models, EINDM achieves the best performance in terms of NMI, homogeneity, and cluster purity.

Brief description of the drawings

Figure 1 is a schematic diagram of the algorithm flow;

Figure 2 is a schematic example of the term window; a cross marks a term window position that cannot move.

Detailed description of the embodiments

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

Topic generation is modeled with a Dirichlet distribution; the most important task is to formulate, via probability distributions, the relationship between arriving documents and topics (micro-clusters).

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model (an active topic) or creates a new cluster (a new topic has arrived).

Step 2: if the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created.

Step 3: as documents arrive, the model checks whether old clusters (outdated topics) should be removed.

In this way, recent topic clusters with the current distribution remain active in the model. To infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units. Each key process is described in detail next.
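The three steps above can be sketched as a single online loop. This is a minimal illustration, not the patented algorithm itself: the callbacks `similarity`, `pseudo_prob`, and `purge` are hypothetical stand-ins for the document-cluster similarity (Equation 7), the new-cluster condition (Equation 13), and the decay-based deletion described later.

```python
import random
from collections import deque

def online_cluster(stream, similarity, pseudo_prob, purge, rho=100, psi=500, eta=50):
    """Skeleton of the three-step online loop (Steps 1-3 above)."""
    clusters = []                  # active micro-clusters (topics)
    recent = deque(maxlen=psi)     # sliding window of the psi newest documents
    for t, doc in enumerate(stream, start=1):
        recent.append(doc)
        # Steps 1-2: join the best-matching active cluster, or found a new one
        scored = [(similarity(doc, z), z) for z in clusters]
        score, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        if best is None or score < pseudo_prob(doc, clusters):
            clusters.append({"docs": [doc]})      # a new topic has arrived
        else:
            best["docs"].append(doc)
        # Step 3: drop outdated clusters (stale topics)
        clusters = [z for z in clusters if not purge(z, t)]
        # every rho time units, resample eta random docs from the recent psi
        if t % rho == 0:
            sample = random.sample(list(recent), min(eta, len(recent)))
            # (the number of active clusters would be re-inferred from `sample`)
    return clusters
```

With a toy overlap similarity and a fixed pseudo-probability, documents sharing terms land in the same cluster while unrelated ones open new clusters.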

1. Cluster feature set. Each pseudo-cluster in the model is represented as a 6-tuple

Figure SMS_23

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure SMS_24] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure SMS_25] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure SMS_26]; the score between w_i and w_j is defined as

Figure SMS_27

where [Figure SMS_28] is the term frequency of w_i in document d. l_z and u_z store the cluster's decay weight and last-update timestamp. A CF set has two important properties: (i) addable and (ii) deletable. These properties allow the cluster to be updated incrementally over time. The addable property enables a cluster to be updated by adding new documents to it. Definition 1: a document d is added to cluster z by updating the cluster through the addable property.

m_z = m_z + 1

Figure SMS_29

cw_z = cw_z ∪ cw_d

N_z = N_z + N_d

Definition 2: a document d is removed from cluster z through the deletable property.

m_z = m_z - 1

Figure SMS_30

cw_z = cw_z - cw_d

N_z = N_z - N_d

Here, N_d denotes the total number of words in the document, and cw_d is the window-based co-occurrence matrix of the document; cw_d contains the frequency ratio between two neighboring terms, defined as:

Figure SMS_31
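A minimal sketch of the CF set and its addable/deletable updates (Definitions 1 and 2). Two simplifications are assumed: the "three-dimensional" co-occurrence matrix cw is stored as a counter over ordered term pairs within a window of size δ, and raw counts are kept rather than the frequency-ratio scores of the equation above.

```python
from collections import Counter

def doc_cooccurrence(words, delta=2):
    """Window-based co-occurrence counts cw_d of a tokenized document:
    each term is paired only with neighbors within distance delta,
    preserving sentence order (see the window-based matrix below)."""
    cw = Counter()
    for i, wi in enumerate(words):
        for j in range(i + 1, min(i + delta + 1, len(words))):
            cw[(wi, words[j])] += 1
    return cw

class ClusterFeature:
    """6-tuple CF set: (m_z, term frequencies, cw_z, N_z, l_z, u_z)."""
    def __init__(self):
        self.m = 0              # m_z: number of documents in the cluster
        self.tf = Counter()     # term -> frequency over the cluster's documents
        self.cw = Counter()     # cw_z: term-pair co-occurrence counts
        self.N = 0              # N_z: total word count
        self.decay = 1.0        # l_z: decay weight (initialized to 1)
        self.updated = 0        # u_z: last-update timestamp

    def add(self, words, t, delta=2):      # Definition 1 (addable)
        self.m += 1
        self.tf.update(words)
        self.cw.update(doc_cooccurrence(words, delta))
        self.N += len(words)
        self.updated = t

    def remove(self, words, t, delta=2):   # Definition 2 (deletable)
        self.m -= 1
        self.tf.subtract(words)
        self.cw.subtract(doc_cooccurrence(words, delta))
        self.N -= len(words)
        self.updated = t
```

Adding and then removing the same document restores the empty statistics, which is what makes the incremental updates reversible.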

2. Document-cluster similarity. Initially there are no clusters in the model, so a new empty cluster is created and the first arriving document is added to it. Each subsequent arriving document is either added to an existing (active) cluster z (z ∈ M) of the model (based on Equation 7) or triggers the creation of a new cluster (based on Equation 13). The similarity between document d and an active cluster z (the model is denoted [Figure SMS_32]) is computed as

Figure SMS_33

The definitions of all symbols are given in the table below.

Figure SMS_34

Table 1: Symbols and notation

Here, the first part of this equation reflects the cluster popularity p(G_z), while the remaining two parts compute the homogeneity p(d|G_z). The second part of the equation captures similarity in the single-term space. The third part addresses term ambiguity by computing similarity in the semantic term space (i.e., the co-occurrence of term pairs). The second part of the homogeneity, in which the value of β acts as a pseudo-weight for unseen words, is based on a multinomial distribution ([Figure SMS_35]) and uses the term occurrences of cluster [Figure SMS_36] while capturing homogeneity (similarity). However, unlike static settings, where the inverse document frequency can be used to compute term importance in the global space, the term-document distribution here is unknown. Therefore, a similar weighted score, called the inverse cluster frequency ICF_w, is defined to compute term importance:

Figure SMS_37

The denominator of Equation 8 is the number of active clusters in the model that contain the word w, and the numerator is the total number of active clusters. This means that a word contained in many clusters is less important; if a word w is contained in only a few clusters, its importance is higher. To capture this behavior, a new word-specificity weight S_w is introduced, which is defined in Equation 9.
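The ICF idea can be sketched directly from that description. Equation 8 itself is an image in the source, so the logarithm and the +1 smoothing below are assumptions by analogy with inverse document frequency; clusters are simplified to sets of terms.

```python
import math

def icf(word, clusters):
    """Inverse cluster frequency of `word`, per the description of Eq. 8:
    total active clusters over the number of active clusters containing it.
    The log and +1 smoothing are assumptions (IDF analogy)."""
    active = len(clusters)
    containing = sum(1 for terms in clusters if word in terms)
    return math.log((active + 1) / (containing + 1))
```

A word present in fewer clusters receives a larger ICF, matching the intent stated above.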

3. Word specificity: a new weight is proposed to compute the specificity of a term. The basic idea is that if a term co-occurs with different terms each time it appears, the term is less specific and can therefore be used with many concepts; highly specific terms, in contrast, have a smaller number of co-occurring neighbors. Furthermore, the expected number of neighbors of a term depends on the defined window size (a model parameter). Therefore, taking the co-occurrence window-size constraint into account, the word specificity S of w_i is defined as

Figure SMS_38

Here, δ is the neighboring window size, and g(w_i) is defined as

Figure SMS_39

Figure SMS_40

Figure SMS_41

g(w_i) computes the ratio between the neighborhood population and the defined window size (σ), together with the term's frequency in the document. Here, [Figure SMS_42] is the number of unique neighbors of term w_i in the model, and [Figure SMS_43] is the total term frequency. The score is bounded with respect to the window size by applying the hyperbolic tangent sigmoid function.
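The intent of the specificity weight can be illustrated as follows. The exact combination in Equations 9-12 is not recoverable from the text (the formulas are images), so this sketch only assumes what the prose states: the ratio of unique neighbors to the slots the window allows is bounded with tanh, and fewer unique neighbors means higher specificity. The `2 * delta * term_freq` neighbor budget is a hypothetical normalization.

```python
import math

def specificity(word, neighbors, term_freq, delta=2):
    """Hedged sketch of the word-specificity weight S(w_i).
    neighbors: set of unique co-occurring terms of `word` in the model;
    term_freq: total frequency of `word` across the model."""
    expected = 2 * delta * term_freq          # max neighbor slots (assumption)
    ratio = len(neighbors) / max(1, expected) # neighborhood population ratio
    return 1.0 - math.tanh(ratio)             # fewer neighbors -> more specific
```

A term seen five times with a single recurring neighbor scores higher than one seen five times with ten distinct neighbors, which is the behavior the paragraph above describes.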

4. Automatic cluster creation: at initialization, the first document of the stream creates a new cluster. To automatically identify new topics in arriving documents over time, a probability is needed that detects a new distribution of words and creates a new cluster whenever a document does not belong to any active cluster. By transforming the formula p(z_d|G_z) = p(G_z) · p(d|G_z), the probability p(z_new|d) is derived as follows.

Figure SMS_44

Here, αD represents the pseudo-population of documents, V_z is the average vocabulary size of the active clusters [Figure SMS_45], and the value of β contributes the pseudo-word similarity with the new cluster. Equation 14 gives the condition for creating a new cluster (see line 10 of Algorithm 1).
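Equation 13 is only an image in the source, so the following is a hedged reconstruction along the lines of standard Dirichlet multinomial mixture models: a pseudo-cluster holding αD documents supplies the popularity part, and since every word of d is unseen by the new cluster, each word draw contributes a pseudo-weight β against the average vocabulary size.

```python
def p_new_cluster(doc_words, D, alpha, beta, V_bar):
    """Hedged sketch of p(z_new | d). D: documents seen so far; alpha, beta:
    Dirichlet pseudo-counts; V_bar: average vocabulary size of active clusters.
    The exact form in the patent may differ."""
    # cluster-popularity part: a pseudo-cluster holding alpha*D documents
    p = (alpha * D) / (D - 1 + alpha * D)
    # homogeneity part: every word of d is unseen, so each draw has weight beta
    for i, _ in enumerate(doc_words):
        p *= beta / (V_bar * beta + i)
    return p
```

A larger average vocabulary V_bar dilutes the pseudo-word weight and thus lowers the chance of opening a new cluster, matching the role of β described above.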

Figure SMS_46

Figure SMS_47

Algorithm 1

5. Window-based co-occurrence matrix: the patent introduces a window-based term co-occurrence matrix in which each term can only be paired with neighbors within a distance of [Figure SMS_48], while the sentence order is preserved. An example is shown in Figure 2. In this way, the co-occurrence matrix cw_d of a document has at most O(δ·N_d) entries, where N_d is the document length. The weight between two neighboring terms w_i and w_j in a cluster is defined in Equation 6, with i ≠ j, (i − δ) ≤ j ≤ (i + δ), and δ ≥ 1.

6. Episodic inference: inference procedures have been shown to be useful in generative processes for reducing cluster sparsity. Their iterative versions have two main drawbacks: (i) a single batch may fail to capture the current distribution, and (ii) they add processing-time cost, making them unsuitable for high-speed streams. In contrast, an episodic inference procedure is proposed that not only reduces the processing cost effectively but is also able to cover the distribution of the stream.

7. Deleting outdated clusters: to maintain the current distribution of concepts, the model needs to keep active clusters (current concepts) and remove outdated ones (old concepts). A decay mechanism based on the stream speed is adopted, which updates the importance score (decay weight) of each cluster over time. If a cluster has not received documents recently, its decay weight l_z decreases over time. The decay weight of each cluster is updated over time as

Figure SMS_49

Here, t_c denotes the current timestamp of the model and u_z stores the cluster's last-update timestamp. Initially, the decay weight of each new cluster is set to 1 (see line 10 of Algorithm 2). If l_z is approximately zero, the cluster is scheduled for removal from the model (see line 6 of Algorithm 3), i.e., the micro-cluster no longer captures the current term distribution of the topics in the text stream.
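The decay update itself (Equation 15) is an image in the source, so the sketch below assumes a common exponential form, 2 raised to −λ(t_c − u_z), with a hypothetical rate λ; it only preserves the stated behavior: fresh clusters start at weight 1, and clusters whose weight is approximately zero are removed.

```python
import math

def update_decay(cluster, t_c, lambda_=0.5):
    """cluster: dict with decay weight 'l' and last-update timestamp 'u'.
    Exponential decay in the gap t_c - u is an assumption; a cluster
    updated at t_c keeps weight 1."""
    cluster["l"] = math.pow(2.0, -lambda_ * (t_c - cluster["u"]))
    return cluster["l"]

def sweep(clusters, t_c, eps=1e-3):
    """Keep only clusters whose decay weight is not approximately zero."""
    return [z for z in clusters if update_decay(z, t_c) > eps]
```

A cluster last updated long ago decays below the threshold and is dropped, while a recently updated one survives the sweep.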

Figure SMS_50

Figure SMS_51

Algorithm 2

The present invention thus proposes a context-enhanced Dirichlet model supporting online clustering of short text streams that exploits the distribution of unique neighbor populations across the whole model. For active clusters, an episodic inference procedure is proposed that reduces the cluster sparsity of the model. In addition, EINDM merges highly similar clusters, automatically yielding a cluster population close to the true one. Extensive empirical analyses demonstrate the efficiency and robustness of the proposed model across different parameter ranges. EINDM derives a new word-specificity term-weighting scheme from the model-wide distribution of unique neighbor populations, while the episodic inference procedure reduces cluster sparsity. Compared with recent state-of-the-art clustering models, EINDM achieves the best performance in terms of NMI, homogeneity, and cluster purity.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims (8)

Translated from Chinese
1. A context-enhanced Dirichlet model supporting online clustering of short text streams, characterized by comprising the following steps:

Step 1: based on the computed probability, each arriving document is either added to an active cluster of the model or placed in a newly created cluster;

Step 2: when the probability of the document belonging to every existing cluster in the model is smaller than the pseudo-probability, the document is regarded as the emergence of a new topic, and a new cluster is created;

Step 3: as documents arrive, the model checks for and deletes old clusters (i.e., outdated topics), so that only recent topic clusters with the current distribution remain active in the model; to infer the number of active clusters, a random number η of documents drawn from the most recent ψ documents is resampled after every interval of ρ time units.

2. The context-enhanced Dirichlet model supporting online clustering of short text streams according to claim 1, characterized in that each pseudo-cluster in the model is represented as a 6-tuple

Figure FDA0003968616540000011

where z denotes the micro-cluster, m_z is the number of documents in it, [Figure FDA0003968616540000012] is a two-dimensional matrix containing the terms of the documents in z and their frequencies, N_z is the total number of words in the documents of z, [Figure FDA0003968616540000013] (N_d) is the total number of words in document d, and the tuple element cw_z is a three-dimensional matrix that stores term co-occurrences in [Figure FDA0003968616540000014]; the score between w_i and w_j is defined as

Figure FDA0003968616540000015

where [Figure FDA0003968616540000016] is the term frequency of w_i in document d, and l_z and u_z store the cluster's decay weight and last-update timestamp; a CF set has two important properties, addable and deletable, which allow the cluster to be updated incrementally over time, the addable property enabling the cluster to be updated by adding new documents to it.
3. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 2, characterized in that:
Definition 1: a document d is added to cluster z by updating the cluster with the addable property:
mz = mz + 1
Figure FDA0003968616540000017
cwz = cwz ∪ cwd
Nz = Nz + Nd;
Definition 2: a document d is deleted from cluster z with the deletable property:
mz = mz - 1
Figure FDA0003968616540000021
cwz = cwz - cwd
Nz = Nz - Nd,
where Nd denotes the total number of words in the document, and cwd is the window-based co-occurrence matrix of the document; cwd contains the frequency ratio between two adjacent terms and is defined as
Figure FDA0003968616540000022
where
Figure FDA0003968616540000023
is the frequency of wi in the document for the term pair (wi, wj) ∈ d.
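The cluster-feature (CF) updates of Definitions 1 and 2 can be sketched as a small Python class. This is a minimal illustration, not the patent's implementation: the exact per-term update hidden in the figure is assumed here to be a plain term-frequency vector update, and the class and attribute names are hypothetical.

```python
from collections import Counter

class MicroCluster:
    """Cluster feature (CF) set: document count, term frequencies,
    co-occurrence counts, total word count, decay weight, last update."""

    def __init__(self, timestamp=0):
        self.m_z = 0             # number of documents in the cluster
        self.n_z = Counter()     # term -> frequency over the cluster's documents
        self.cw_z = Counter()    # (w_i, w_j) -> co-occurrence count
        self.N_z = 0             # total number of words in the cluster
        self.l_z = 1.0           # decay weight, 1.0 for a fresh cluster
        self.u_z = timestamp     # timestamp of the last update

    def add(self, doc_terms, doc_cooc, timestamp):
        """Definition 1 (addable): add a document to the cluster."""
        self.m_z += 1
        self.n_z.update(doc_terms)
        self.cw_z.update(doc_cooc)
        self.N_z += sum(doc_terms.values())
        self.u_z = timestamp

    def remove(self, doc_terms, doc_cooc):
        """Definition 2 (deletable): remove a previously added document."""
        self.m_z -= 1
        self.n_z.subtract(doc_terms)
        self.cw_z.subtract(doc_cooc)
        self.N_z -= sum(doc_terms.values())
```

Because every update is a sum of counts, adding and then removing the same document restores the CF set exactly, which is what makes the incremental maintenance of Definitions 1 and 2 possible.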
4. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 3, characterized in that: initially the model contains no clusters, so a new empty cluster is created and the first arriving document is added to it; each subsequently arriving document is either added to an existing (active) cluster z (z ∈ M) of the model or triggers the creation of a new cluster; the similarity between document d and an active cluster z is computed as
Figure FDA0003968616540000024
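The similarity formula itself is only available as a figure in the original. As a hedged sketch, the standard document-to-cluster score used in Dirichlet-multinomial stream clustering (the MStream family of models) is shown below; it is an assumption about the general form, not the patent's exact equation, and all parameter names are illustrative.

```python
from collections import Counter

def cluster_probability(doc, cluster_m, cluster_n, cluster_N, D, alpha, beta, V):
    """Standard Dirichlet-multinomial score of a document against one cluster:
    a population term m_z / (D - 1 + alpha*D) times a word-likelihood product
    smoothed by beta over a vocabulary of size V.
    doc: term Counter; cluster_m/cluster_n/cluster_N: CF statistics;
    D: number of documents seen so far."""
    prob = cluster_m / (D - 1 + alpha * D)
    i = 0
    for w, freq in doc.items():
        for j in range(freq):
            # (n_z^w + beta + j - 1) / (N_z + V*beta + i - 1), 0-based here
            prob *= (cluster_n[w] + beta + j) / (cluster_N + V * beta + i)
            i += 1
    return prob
```

In use, an arriving document would be scored against every active cluster and assigned to the highest-scoring one, unless the new-cluster probability of claim 6 dominates.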
5. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 4, characterized in that: a weight is proposed to measure the specificity of a term; if a term co-occurs with different terms each time, this indicates that the term is not very specific and can therefore be used with many concepts, whereas a highly specific term has a smaller number of co-occurring neighbors; moreover, the normal number of neighbors of a term depends on the defined window size, so the word specificity S of wi is defined as:
Figure FDA0003968616540000025
where δ is the neighborhood window size, and g(wi) is defined as
Figure FDA0003968616540000031
Figure FDA0003968616540000032
Figure FDA0003968616540000033
g(wi) computes the ratio between the neighborhood population and the defined window size (σ) together with the term's frequency in the documents, where
Figure FDA0003968616540000034
is the number of unique neighbors of the term wi in the model, and
Figure FDA0003968616540000035
is the total term frequency; the score is computed by bounding the window size with the hyperbolic tangent sigmoid function.
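The exact definitions of S and g(wi) are only given in the figures. A sketch consistent with the surrounding description (a term with many distinct neighbors relative to the window size is less specific, and the score is bounded with the hyperbolic tangent) might look as follows; the precise combination with term frequency is an assumption, and the function signature is hypothetical.

```python
import math
from collections import defaultdict

def word_specificity(term, neighbors, window_size):
    """Hedged sketch of the specificity weight of claim 5: the ratio of a
    term's unique co-occurring neighbors to the window size, pushed through
    tanh so the score stays bounded; fewer neighbors -> higher specificity.
    neighbors: dict mapping each term to the set of its co-occurring terms."""
    unique = len(neighbors[term])
    ratio = unique / window_size      # neighborhood population vs. window size
    return 1.0 - math.tanh(ratio)     # bounded in (0, 1], decreasing in ratio
```

Under this sketch a term that always appears next to the same one or two neighbors keeps a score near 1, while a promiscuous term that co-occurs with many different neighbors decays toward 0.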
6. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 5, characterized in that: at initialization, the first document of the stream creates a new cluster; in order to automatically recognize new topics in arriving documents over time, the probability p(znew|d) is derived from the transformation p(zd|Gz) = p(Gz)·p(d|Gz) as follows,
Figure FDA0003968616540000036
where αD represents the pseudo-population of documents, and Vz is the average vocabulary size of the active clusters
Figure FDA0003968616540000037
the value of β contributes to computing the pseudo-word similarity with the new cluster; the equation
Figure FDA0003968616540000038
gives the condition for creating a new cluster.
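The new-cluster probability is only given as a figure. In Dirichlet-process stream models the standard form, which matches the description here (pseudo-population αD, pseudo-word smoothing via β, average active vocabulary Vz), treats the new cluster as an empty one; the sketch below assumes that form and uses illustrative parameter names.

```python
from collections import Counter

def new_cluster_probability(doc, D, alpha, beta, V_bar):
    """Assumed DP-style probability that a document starts a new cluster:
    an empty cluster with pseudo-population alpha*D and pseudo term counts
    governed by beta over the average active vocabulary size V_bar.
    doc: term Counter; D: number of documents seen so far."""
    prob = (alpha * D) / (D - 1 + alpha * D)
    i = 0
    for w, freq in doc.items():
        for j in range(freq):
            # empty cluster: term count 0, so only the beta pseudo-counts remain
            prob *= (beta + j) / (V_bar * beta + i)
            i += 1
    return prob
```

A new cluster would then be created whenever this score exceeds the best score among the active clusters, which is the condition the figure in claim 6 expresses.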
7. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 6, characterized in that: a window-based term co-occurrence matrix is introduced, in which each term can only be paired with neighbors within a distance of
Figure FDA0003968616540000039
while the sentence order is preserved.
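The window-based co-occurrence matrix of claim 7 can be built in a few lines; the construction below pairs each term only with the terms that follow it within a fixed distance, preserving sentence order. The `window` parameter name is illustrative.

```python
from collections import Counter

def cooccurrence_matrix(tokens, window):
    """Build a window-based co-occurrence Counter: each term is paired with
    the terms that follow it within `window` positions, and the ordered pair
    (w_i, w_j) keeps the original sentence order."""
    cw = Counter()
    for i, w_i in enumerate(tokens):
        for w_j in tokens[i + 1 : i + 1 + window]:
            cw[(w_i, w_j)] += 1
    return cw
```

Because the pairs are ordered, (w_i, w_j) and (w_j, w_i) are distinct entries, which is what "maintaining sentence order" requires.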
8. The context-enhanced Dirichlet model supporting online clustering of short text streams as claimed in claim 7, characterized in that: in order to maintain the current distribution of concepts, the model needs to keep active clusters and remove outdated ones; the decay weight of each cluster is updated over time as
Figure FDA0003968616540000041
where tc denotes the current timestamp of the model and uz stores the cluster's last-update timestamp; initially, the decay weight of each new cluster is set to 1, and when lz is approximately zero the cluster is removed from the model, i.e., the micro-cluster can no longer capture the current term distribution of the topics in the text stream.
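The decay formula itself is only shown as a figure. An exponential form consistent with the description (weight 1 at the last update, shrinking toward zero as the gap tc - uz grows) is sketched below; the base-2 decay and the rate parameter `lam` are assumptions, not the patent's stated equation.

```python
def decayed_weight(t_c, u_z, lam=0.5):
    """Assumed exponential decay: 1.0 when the cluster was just updated
    (t_c == u_z), approaching 0 as the time since the last update grows."""
    return 2.0 ** (-lam * (t_c - u_z))

def prune(clusters, t_c, threshold=1e-3, lam=0.5):
    """Remove outdated clusters whose decayed weight is approximately zero.
    clusters: dict mapping a cluster id to a record holding its 'u_z'."""
    return {z: cf for z, cf in clusters.items()
            if decayed_weight(t_c, cf['u_z'], lam) >= threshold}
```

With this form, a cluster that stops receiving documents decays geometrically and is eventually dropped, which keeps only the recent topic distribution active in the model.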
CN202211504585.1A, filed 2022-11-29 (priority 2022-11-29): Method for constructing context-enhanced Dirichlet model supporting online clustering of short text streams. Status: Active. Granted as CN115827861B (en).


Publications (2)
CN115827861A (publication): 2023-03-21
CN115827861B (grant): 2025-07-18

Family

ID=85532362


Cited By (2)
CN116861287A (published 2023-10-10, 电子科技大学长三角研究院(湖州)): Online semi-supervised classification algorithm based on multi-label evolving high-dimensional text streams
CN118069835A (published 2024-05-24, 成都飞机工业(集团)有限责任公司): Method, device, equipment and medium for constructing a knowledge base for aircraft manufacturing

Patent Citations (2)
US20120330958A1 (published 2012-12-27, Microsoft Corporation): Regularized Latent Semantic Indexing for Topic Modeling
CN113271292A (published 2021-08-17, 中国科学院信息工程研究所): Malicious domain name cluster detection method and device based on word vectors

Non-Patent Citations (1)
宿青: "基于LDA模型的聚类检索应用" (Clustering retrieval applications based on the LDA model), 中国新通信, no. 05, 5 March 2017, pages 43-44




Legal Events
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
