CN105787097A - Distributed index establishment method and system based on text clustering - Google Patents

Distributed index establishment method and system based on text clustering

Info

Publication number
CN105787097A
CN105787097A, CN201610154682.0A, CN201610154682A, CN 105787097 A
Authority
CN
China
Prior art keywords
text
index
distributed
vocabulary
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610154682.0A
Other languages
Chinese (zh)
Inventor
林格
邓现
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201610154682.0A
Publication of CN105787097A
Legal status: Pending (current)

Abstract

Translated from Chinese

The invention discloses a distributed index construction method and system based on text clustering. The method includes: performing formatting and word-segmentation preprocessing on unstructured text and storing the preprocessing results on the original distributed nodes; filtering the preprocessing results and performing feature extraction to obtain text vocabulary feature vectors; clustering the text vocabulary feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of the feature vectors; distributing each of the K clusters over one or more distributed nodes; and using an index engine to build full-text indexes for the K clusters distributed over the nodes, obtaining K full-text indexes. Embodiments of the invention construct a distributed indexing scheme for retrieval, give users a fast indexing method, and improve the user experience.

Description

Translated from Chinese
A Distributed Index Construction Method and System Based on Text Clustering

Technical Field

The invention relates to the technical field of retrieval index construction, and in particular to a distributed index construction method and system based on text clustering.

Background Art

In traditional structured information management, indexing technology is usually used to retrieve information. In a distributed network environment, however, the scale of knowledge grows very quickly and the size of index files increases sharply with it: not only can the index no longer be stored in a centralized way, but retrieval efficiency is also severely degraded by the huge index library. To address this, an indexing method based on document partitioning has been proposed, but it partitions the collection randomly, so the resulting subsets have equivalent distributions and every sub-index still has to be searched at query time, which makes retrieval expensive.

Text clustering is based on the clustering hypothesis: objects in the same class are highly similar, while objects in different classes differ considerably. It is an unsupervised machine learning method. Unlike text classification, text clustering requires neither a training process nor manually labeled categories; it automatically groups different texts into different categories, offering flexibility and a high degree of automation.

Distributed technology mainly comprises two basic functions: distributed storage and parallel computing. Distributed storage provides a transparent, consistent file access system while physically storing massive data in a distributed manner; parallel computing scatters massive input data across multiple nodes, lets the nodes compute in parallel, and finally merges the results of all nodes into the final result.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art. The present invention provides a distributed index construction method and system based on text clustering, which constructs a distributed indexing scheme for retrieval, gives users a fast indexing method, and improves the user experience.

To solve the above problems, the present invention proposes a distributed index construction method based on text clustering, the method comprising:

formatting and word-segmentation preprocessing of unstructured text, and storing the preprocessing results on distributed nodes;

filtering the preprocessing results and performing feature extraction to obtain processed text vocabulary feature vectors;

clustering the text vocabulary feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of the text vocabulary feature vectors;

distributing each of the K clusters over one or more distributed nodes;

using an index engine to build full-text indexes for the K clusters distributed over one or more distributed nodes, obtaining K full-text indexes.

Preferably, formatting and word-segmentation preprocessing of the unstructured text and storing the preprocessing results on distributed nodes includes:

performing unified format processing on unstructured texts of different formats on each distributed node to obtain first texts with a consistent format;

performing word segmentation on the first text and extracting keywords from the segmentation result to obtain the keyword vocabulary of the first text;

storing the keyword vocabulary on the distributed nodes using the combination "key = text number, value = text vocabulary".

Preferably, filtering the preprocessing results and performing feature extraction to obtain processed text feature vectors includes:

processing the text stored on the distributed nodes by parallel computing to obtain the word frequency of the vocabulary in the text;

comparing the word frequency with a first threshold and keeping the words whose word frequency is greater than the first threshold;

calculating the TF-IDF values of those words, comparing the TF-IDF values with a second threshold, and keeping the second vocabulary whose TF-IDF value is greater than the second threshold;

extracting features from the second vocabulary, assigning weights to the second vocabulary, and obtaining the feature vectors of the second vocabulary.

Preferably, clustering the text feature vectors with the Canopy-Kmeans clustering algorithm includes:

performing preliminary clustering of the text vocabulary feature vectors with Canopy clustering to obtain preliminary clusters of the feature vectors centered on Canopy centers;

performing Kmeans clustering on the preliminary clusters to obtain K clusters of the text vocabulary feature vectors.

Preferably, using an index engine to build full-text indexes for the K clusters distributed over one or more distributed nodes includes:

processing the clusters on each distributed node with the index engine to build full-text indexes of those clusters;

merging the full-text indexes of the clusters on all distributed nodes to obtain K full-text indexes.

Correspondingly, the present invention also provides a distributed index construction system based on text clustering, the system comprising:

a preprocessing module, configured to perform formatting and word-segmentation preprocessing on unstructured text and store the preprocessing results on distributed nodes;

a filtering and feature extraction module, configured to filter the preprocessing results and perform feature extraction to obtain processed text vocabulary feature vectors;

a clustering module, configured to cluster the text vocabulary feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of the text vocabulary feature vectors;

a cluster distribution module, configured to distribute each of the K clusters over one or more distributed nodes;

an index construction module, configured to use an index engine to build full-text indexes for the K clusters distributed over one or more distributed nodes and obtain K full-text indexes.

Preferably, the preprocessing module includes:

a format unification unit, configured to perform unified format processing on unstructured texts of different formats on each distributed node and obtain first texts with a consistent format;

a word segmentation and keyword extraction unit, configured to perform word segmentation on the first text, extract keywords from the segmentation result, and obtain the keyword vocabulary of the first text;

a storage unit, configured to store the keyword vocabulary on the distributed nodes using the combination "key = text number, value = text vocabulary".

Preferably, the filtering and feature extraction module includes:

a parallel computing unit, configured to process the text stored on the distributed nodes by parallel computing and obtain the word frequency of the vocabulary in the text;

a first comparison unit, configured to compare the word frequency with a first threshold and keep the words whose word frequency is greater than the first threshold;

a second comparison unit, configured to calculate the TF-IDF values of those words, compare the TF-IDF values with a second threshold, and keep the second vocabulary whose TF-IDF value is greater than the second threshold;

a feature extraction unit, configured to extract features from the second vocabulary, assign weights to the second vocabulary, and obtain the feature vectors of the second vocabulary.

Preferably, the clustering module includes:

a first clustering unit, configured to perform preliminary clustering of the text vocabulary feature vectors with Canopy clustering and obtain preliminary clusters centered on Canopy centers;

a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters and obtain K clusters of the text vocabulary feature vectors.

Preferably, the index construction module includes:

a node index construction unit, configured to process the clusters on each distributed node with the index engine and build full-text indexes of those clusters;

an index merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.

In the implementation of the present invention, the text is formatted, segmented, filtered, feature-extracted, and clustered, and full-text indexes are built from the processing results. This constructs a distributed indexing scheme for retrieval, gives users a fast indexing method, and improves the user experience.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the distributed index construction method based on text clustering according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the preprocessing steps according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of the text feature vector acquisition steps according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the preprocessing module according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the filtering and feature extraction module according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Fig. 1 is a schematic flowchart of the distributed index construction method based on text clustering according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S11: format the unstructured text and perform word-segmentation preprocessing, and store the preprocessing results on distributed nodes;

S12: filter the preprocessing results and perform feature extraction to obtain processed text vocabulary feature vectors;

S13: cluster the text vocabulary feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of the text vocabulary feature vectors;

S14: distribute each of the K clusters over one or more distributed nodes;

S15: use an index engine to build full-text indexes for the K clusters distributed over one or more distributed nodes, and obtain K full-text indexes.

Further explanation of S11:

Texts in the database have inconsistent structures. The unstructured text is therefore formatted into structured text with a uniform format, word-segmentation preprocessing is then applied, the segmentation result is obtained, and the result is stored on the distributed nodes.
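As an illustration of this step, the following sketch normalizes heterogeneous records into a uniform plain-text format and applies a simple word segmentation. It is a minimal example under assumed inputs: the record shapes and the regex tokenizer are illustrative only (a production system for Chinese text would use a dedicated word segmenter), and none of the names below come from the patent.

import re
from html import unescape
def to_plain_text(record):
    """Normalize one heterogeneous record (HTML, dict, or plain string) into plain text."""
    if isinstance(record, dict):                      # e.g. {"title": ..., "body": ...}
        text = " ".join(str(v) for v in record.values())
    else:
        text = str(record)
    text = unescape(re.sub(r"<[^>]+>", " ", text))    # drop HTML-like markup
    return re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace, lowercase
def segment(text):
    """Toy word segmentation: split on non-word characters (stand-in for a real segmenter)."""
    return [tok for tok in re.split(r"\W+", text) if tok]
# Example: two differently formatted inputs become uniform (doc_id, tokens) records.
raw_docs = {"d1": "<p>Distributed  indexing of TEXT</p>", "d2": {"title": "text clustering", "body": "index"}}
first_texts = {doc_id: to_plain_text(r) for doc_id, r in raw_docs.items()}
segmented = {doc_id: segment(t) for doc_id, t in first_texts.items()}
print(segmented)  # {'d1': ['distributed', 'indexing', 'of', 'text'], 'd2': ['text', 'clustering', 'index']}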

Further, Fig. 2 is a schematic flowchart of the preprocessing steps according to an embodiment of the present invention. As shown in Fig. 2, this step includes:

S111: perform unified format processing on the unstructured texts of different formats on each distributed node to obtain first texts with a consistent format;

S112: perform word segmentation on the first text and extract keywords from the segmentation result to obtain the keyword vocabulary of the first text;

S113: store the keyword vocabulary on the distributed nodes using the combination "key = text number, value = text vocabulary".

Further explanation of S111:

The unstructured texts of various formats distributed on the distributed nodes are processed into a unified format, so that first texts with a uniform format are obtained.

Further explanation of S112:

Word segmentation is performed on the first text, the separated words in the first text are extracted, and the extracted words are used as keywords, thereby obtaining the keyword vocabulary of the first text.

Further explanation of S113:

Each text and the keyword vocabulary extracted from it are combined in the form "key = text number, value = text vocabulary", and the combined keys and values are then stored on the distributed nodes.
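The following sketch illustrates the "key = text number, value = text vocabulary" storage scheme under simplifying assumptions: the node count, the hash-based placement, and the in-memory dictionaries standing in for node-local storage are all hypothetical and are not prescribed by the patent.

from collections import defaultdict
def store_keyword_vocabulary(segmented_docs, node_count=3):
    """Map each (text number, vocabulary) pair onto a distributed node.
    segmented_docs: {text_number: [word, ...]} produced by the preprocessing step.
    Returns {node_id: {text_number: vocabulary}} as a stand-in for node-local storage."""
    nodes = defaultdict(dict)
    for text_number, vocabulary in segmented_docs.items():
        node_id = hash(text_number) % node_count    # illustrative placement rule
        nodes[node_id][text_number] = vocabulary    # key = text number, value = text vocabulary
    return dict(nodes)
# Example usage with the toy output of the previous sketch.
placement = store_keyword_vocabulary({"d1": ["distributed", "indexing", "text"],
                                      "d2": ["text", "clustering", "index"]})
print(placement)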

Further explanation of S12:

For the texts and vocabularies produced by the above steps, the word frequency of each word in a text is calculated and compared with a first threshold, and the words whose word frequency is greater than the first threshold are kept. The TF-IDF value of each remaining word is then calculated and compared with a second threshold, and the second vocabulary whose TF-IDF value is greater than the second threshold is kept. The remaining second vocabulary is then weighted according to its TF-IDF values, and the feature vectors of the second vocabulary are extracted.

Further, Fig. 3 is a schematic flowchart of the text feature vector acquisition steps according to an embodiment of the present invention. As shown in Fig. 3, this step includes:

S121: process the text stored on the distributed node by parallel computing and obtain the word frequency of the vocabulary in the text;

S122: check whether the word frequency is greater than the first threshold; if so, go to S123; if not, remove the word corresponding to that word frequency;

S123: keep the words whose word frequency is greater than the first threshold and calculate their TF-IDF values;

S124: check whether the TF-IDF value is greater than the second threshold; if so, go to S125; if not, remove the word corresponding to that TF-IDF value;

S125: keep the second vocabulary whose TF-IDF value is greater than the second threshold;

S126: extract features from the second vocabulary, assign weights to the second vocabulary, and obtain the feature vectors of the second vocabulary.

Further explanation of S121:

Word frequency is the frequency with which a word appears in a text. The frequency of word t in text d is tf(t, d) = count(t, d) / count(d), i.e., the number of occurrences of the word divided by the total number of words in the text. The text is processed with this formula to obtain the word frequency of each word in the text.
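A minimal sketch of the word-frequency computation described above, using only the Python standard library; the per-node parallelization is omitted here, and the {text number: [words]} layout follows the earlier sketches rather than anything fixed by the patent.

from collections import Counter
def term_frequencies(words):
    """tf(t, d) = count(t, d) / count(d): occurrences of t divided by the total word count of d."""
    total = len(words)
    counts = Counter(words)
    return {t: c / total for t, c in counts.items()}
doc = ["text", "clustering", "text", "index"]
print(term_frequencies(doc))  # {'text': 0.5, 'clustering': 0.25, 'index': 0.25}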

Further explanation of S122:

Within the same text, the more often a word appears, the more important it is, whereas words with a low word frequency generally cannot represent the text. A first threshold is therefore set, the word frequency is compared with the first threshold, words whose word frequency is smaller than the first threshold are removed, and words whose word frequency is greater than the first threshold are kept. The first threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.

Further explanation of S123:

After the words remaining from the comparison with the first threshold are obtained, their TF-IDF values are calculated.

Further explanation of S124:

The TF-IDF values are compared with a second threshold, words whose TF-IDF value is smaller than the second threshold are removed, and words whose TF-IDF value is greater than the second threshold are kept. The second threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.

Further explanation of S125:

The second vocabulary whose TF-IDF value is greater than the second threshold is saved.

Further explanation of S126:

The TF-IDF weight w of a word t in a text d is calculated as:

w(t, d) = TF(t, d) × log(1 / DF(t));

where DF(t) is the document frequency, i.e., the proportion of texts containing the word t: DF(t) = n(t) / n, the ratio of the number of texts containing t to the total number of texts; and TF(t, d) is the word frequency of t in text d.

If a word appears frequently in one text but rarely in other texts, it can be considered to have good class-discriminating ability and to be suitable for representing the text, so feature vectors can be further extracted.

The vector space model (VSM) is used to represent texts. For a text d(t_1, t_2, …, t_n) containing n feature terms, each feature term t_k is assigned the weight w_k computed by TF-IDF, which expresses the importance of that feature in the text; that is, the text can be represented by the feature vector d(w_1, w_2, …, w_n), where w_k is the TF-IDF weight of feature term t_k, and the corresponding word weight is assigned according to this TF-IDF weight.
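The following sketch combines the two threshold filters and the TF-IDF weighting above into vector-space representations. It is illustrative only: the 0.01 thresholds follow the embodiment, but the toy corpus, the dense-vector layout over a shared term list, and the natural-log base are assumptions, not requirements of the patent.

import math
from collections import Counter
def build_feature_vectors(docs, tf_threshold=0.01, tfidf_threshold=0.01):
    """docs: {text_number: [words]}. Returns ({text_number: [w_1, ..., w_n]}, term list)."""
    n = len(docs)
    # Word frequency tf(t, d) = count(t, d) / count(d), filtered by the first threshold.
    tf = {}
    for d, words in docs.items():
        counts, total = Counter(words), len(words)
        tf[d] = {t: c / total for t, c in counts.items() if c / total > tf_threshold}
    # Document frequency DF(t) = n(t) / n, then w(t, d) = TF(t, d) * log(1 / DF(t)),
    # filtered by the second threshold.
    df = Counter(t for d in tf for t in tf[d])
    weights = {d: {t: f * math.log(n / df[t]) for t, f in tf[d].items()
                   if f * math.log(n / df[t]) > tfidf_threshold} for d in tf}
    # Vector space model: one weight per feature term, zero if the term was filtered out.
    terms = sorted({t for d in weights for t in weights[d]})
    return {d: [weights[d].get(t, 0.0) for t in terms] for d in weights}, terms
vectors, terms = build_feature_vectors({"d1": ["text", "index", "text"],
                                        "d2": ["cluster", "cluster", "index"]})
print(terms, vectors)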

Further explanation of S13:

First, Canopy clustering is used to perform preliminary clustering of the text vocabulary feature vectors, obtaining preliminary clusters centered on Canopy centers; then Kmeans clustering is applied to these preliminary clusters to obtain the K clusters of the text vocabulary feature vectors.

Further, the Canopy clustering algorithm is simple, fast, and accurate. When dealing with massive high-dimensional data, especially when the amount of data is huge, using Canopy clustering for preliminary processing can effectively improve efficiency. The Canopy clustering algorithm is as follows:

(1) Initialize the set of feature vectors as a list and choose two distance thresholds T1 and T2.

(2) Randomly take an object d from the list as a Canopy center, mark it as c, and delete d from the list.

(3) Compute the distance between every object d_i in the list and c. If distance < T1, add the object to Canopy c; if distance < T2, delete the object from the list, meaning it can no longer serve as a Canopy center.

(4) Add the resulting Canopy c to the canopy list.

(5) Repeat steps 2, 3, and 4 until the list is empty; the canopy list is the final Canopy clustering result.

Because of the high dimensionality of the text vocabulary feature vectors, the cosine distance is used as the distance measure.

Specifically, the cosine distance between feature vector A and feature vector B is calculated as:

Cosine_distance(A, B) = 1 − (Σ_{i=1..n} a_i × b_i) / (sqrt(Σ_{i=1..n} a_i²) × sqrt(Σ_{i=1..n} b_i²));

where the feature vector A is expressed as A = (a_1, a_2, …, a_n), the feature vector B is expressed as B = (b_1, b_2, …, b_n), and i = 1, 2, …, n.
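The following is a minimal, sequential sketch of steps (1) through (5) using the cosine distance above. The thresholds T1 and T2 and the toy vectors are placeholders chosen for illustration, and the parallel, per-node execution discussed later is not shown.

import math, random
def cosine_distance(a, b):
    """Cosine_distance(A, B) = 1 - dot(A, B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0
def canopy(vectors, t1, t2):
    """Steps (1)-(5): returns a list of (center, members) pairs; typically T1 > T2."""
    pool = list(vectors)
    canopies = []
    while pool:
        center = pool.pop(random.randrange(len(pool)))  # step (2): random Canopy center
        members, remaining = [center], []
        for v in pool:
            dist = cosine_distance(v, center)           # step (3): distance to the center
            if dist < t1:
                members.append(v)                       # close enough to join this Canopy
            if dist >= t2:
                remaining.append(v)                     # points with dist < T2 leave the pool
        pool = remaining
        canopies.append((center, members))              # step (4)
    return canopies                                     # step (5): pool empty, done
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(canopy(vectors, t1=0.5, t2=0.2))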

The Kmeans clustering algorithm is then applied to the preliminary clustering result. The basic idea of Kmeans is to take k objects in the space as centers and assign every object to the nearest center; through repeated iterations, the value of each cluster centroid is recalculated and updated until the cluster centroids no longer change.

For this embodiment, the original Kmeans clustering algorithm is modified as follows:

(1) The result of the Canopy clustering algorithm is used as the input of the Kmeans clustering algorithm, i.e., the Canopy centers produced by Canopy clustering serve as the initial centroids of Kmeans, and each feature vector has already been assigned to a corresponding centroid.

(2) For each feature vector, compute the distance from the feature vector to every centroid and assign it to the nearest cluster centroid; the distance is still the cosine distance used in the Canopy clustering algorithm.

(3) Recompute the mean of each cluster to obtain new cluster centroids.

(4) Compute the variance error value E of all data objects with respect to their corresponding cluster centroids. If E is greater than a threshold, repeat steps 2 and 3; otherwise the clustering ends.

The calculation formula of E is:

E = (1/n) Σ_x ||x − u_{k(x)}||²;

where x is the text vector of a document, k(x) denotes the cluster containing vector x, u_{k(x)} is the centroid vector of that cluster, and n is the number of document vectors.
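A minimal single-node sketch of the modified Kmeans iteration, steps (1) through (4), with the error measure E above. The Canopy centers are passed in as initial centroids; the stopping threshold, the iteration cap, and the toy data are assumptions made for illustration only.

import math
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0
def kmeans_from_canopy(vectors, init_centroids, e_threshold=0.05, max_iter=100):
    """Modified Kmeans: Canopy centers as initial centroids, cosine distance for assignment,
    stop when E = (1/n) * sum ||x - u_k(x)||^2 falls below a threshold."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(max_iter):                      # iteration cap is a safety guard, not in the patent
        # Step (2): assign every vector to the nearest centroid by cosine distance.
        clusters = [[] for _ in centroids]
        for x in vectors:
            j = min(range(len(centroids)), key=lambda i: cosine_distance(x, centroids[i]))
            clusters[j].append(x)
        # Step (3): recompute each centroid as the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        # Step (4): variance error E over all vectors to their centroids.
        n = len(vectors)
        e = sum(sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j]))
                for j, members in enumerate(clusters) for x in members) / n
        if e <= e_threshold:
            break
    return centroids, clusters
cents, clus = kmeans_from_canopy([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], [[1, 0], [0, 1]])
print(cents, clus)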

Parallel optimization design: a local kmeans step is first performed on each node. For every vector on a node, its distance to each global centroid is computed locally and the vector is assigned to the nearest global centroid, giving the global cluster assignment; the mean of each local cluster on the node is computed to obtain the local centroid and the local variance error value; the local centroids and local error values of all nodes are then merged into the global centroids and the total error value E, and E determines whether to continue iterating or to end the clustering, finally yielding the K clusters and their centroids.

The formula for the global centroid is:

v_i = (v_i[1]·m_1 + … + v_i[j]·m_j + … + v_i[s]·m_s) / (m_1 + … + m_s)

where v_i is the computed global centroid vector of the i-th cluster;

v_i[j] is the local centroid vector of the i-th cluster on the j-th distributed node, and m_j is the number of document vectors of that cluster on that node. The formula for the global variance error value is E = (E_1·n_1 + … + E_j·n_j + … + E_t·n_t) / (n_1 + … + n_t), where E_j is the variance error value of the j-th node, n_j is the total number of vectors on that node, and t is the total number of nodes.
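The merging step above is a weighted average of per-node statistics. The sketch below assumes each node reports, for every cluster, its local centroid and local vector count, plus a node-level error value and vector total; this data layout is hypothetical and only illustrates the two formulas.

def merge_local_results(local_centroids, local_counts, local_errors, local_totals):
    """local_centroids[j][i]: centroid of cluster i on node j; local_counts[j][i]: its vector count.
    local_errors[j], local_totals[j]: node j's variance error E_j and total vector count n_j."""
    k = len(local_centroids[0])
    dims = len(local_centroids[0][0])
    global_centroids = []
    for i in range(k):
        # v_i = (v_i[1]*m_1 + ... + v_i[s]*m_s) / (m_1 + ... + m_s)
        m_total = sum(counts[i] for counts in local_counts)
        global_centroids.append([
            sum(local_centroids[j][i][d] * local_counts[j][i] for j in range(len(local_centroids))) / m_total
            for d in range(dims)])
    # E = (E_1*n_1 + ... + E_t*n_t) / (n_1 + ... + n_t)
    global_e = sum(e * n for e, n in zip(local_errors, local_totals)) / sum(local_totals)
    return global_centroids, global_e
# Two nodes, two clusters, 2-dimensional centroids.
print(merge_local_results(
    local_centroids=[[[1.0, 0.0], [0.0, 1.0]], [[0.8, 0.2], [0.2, 0.8]]],
    local_counts=[[3, 1], [1, 3]],
    local_errors=[0.02, 0.04],
    local_totals=[4, 4]))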

Further explanation of S14:

Each of the K clusters obtained in the above steps is distributed over one or more distributed nodes.

Further explanation of S15:

The index engine processes the clusters on each distributed node and builds full-text indexes for them; the full-text indexes of the clusters on all distributed nodes are then merged to obtain K full-text indexes.

Further, a full-text index is built for the clusters on each distributed node according to the specific index engine, and the cluster indexes belonging to the same cluster on all nodes are merged, yielding the global full-text indexes of the K clusters.
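The patent leaves the index engine unspecified. As a stand-in, the sketch below builds a plain inverted index (term to posting set of document ids) per cluster on each node and then merges the per-node indexes of the same cluster into K global indexes; this is only a toy replacement for a real full-text engine.

from collections import defaultdict
def build_node_index(node_docs):
    """node_docs: {cluster_id: {doc_id: [words]}} on one node.
    Returns {cluster_id: {term: set(doc_ids)}}, a toy full-text (inverted) index per cluster."""
    node_index = {}
    for cluster_id, docs in node_docs.items():
        inverted = defaultdict(set)
        for doc_id, words in docs.items():
            for w in words:
                inverted[w].add(doc_id)
        node_index[cluster_id] = dict(inverted)
    return node_index
def merge_indexes(per_node_indexes):
    """Merge the per-node indexes of the same cluster into K global full-text indexes."""
    merged = defaultdict(lambda: defaultdict(set))
    for node_index in per_node_indexes:
        for cluster_id, inverted in node_index.items():
            for term, postings in inverted.items():
                merged[cluster_id][term] |= postings
    return {c: dict(inv) for c, inv in merged.items()}
node_a = build_node_index({0: {"d1": ["text", "index"]}, 1: {"d3": ["cluster"]}})
node_b = build_node_index({0: {"d2": ["text", "search"]}})
print(merge_indexes([node_a, node_b]))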

The following is the process by which a user retrieves with search keywords in an embodiment of the present invention:

The input query string is segmented and keywords are extracted; the similarity between the query and the sub-collections is then computed with an index selection algorithm, and the indexes meeting certain conditions are selected.

An index selection algorithm based on the query space is given as follows:

Define the query space inside the system as P = {p_1, p_2, …, p_i}, where p_i denotes one historical query record; the cluster index libraries are S = {S_1, S_2, …, S_j}; and rel(q|S_j) denotes the degree of relevance between the index library S_j and the current query q.

The algorithm steps are:

(1) Compute the relevance rel(p_i|S_j) between each index library and the historical query p_i. If S_j is not in SET(p_i), then rel(p_i|S_j) = 0; otherwise the relevance rel(p_i|S_j) is calculated as:

rel(p_i | S_j) = ( Σ_{doc ∈ top-T} rel(p_i | doc) ) / T;

where rel(p_i|doc) is the relevance between the historical query and a document: when the document belongs to cluster S_j, rel(p_i|doc) = 1, otherwise rel(p_i|doc) = 0; T is a predefined value, the number of top-ranked documents in the scoring list to be considered. In this embodiment T is set to 20, i.e., the documents whose relevance ranks in the top 20 are selected.

(2) Select the k most similar historical queries: compute the similarity sim(q|p_i) between the input query q and each historical query with the cosine distance measure, and select the k queries with the highest similarity; the value of k that gives the best result can be obtained by experiment.

(3) Compute the relevance rel(q|S_j) between the current query q and each index library S_j from the information of the similar queries, sort by rel(q|S_j), and select the more relevant index libraries.

The relevance rel(q|S_j) between the current query q and the index library S_j is calculated as:

rel(q | S_j) = Σ_k rel(p_i | S_j) × sim(q | p_i);

where rel(p_i|S_j) is the relevance between the index library S_j and the historical query p_i, sim(q|p_i) is the similarity between the current query q and the historical query p_i, and k is the number of historical queries most similar to the current query q over which the sum is taken.

(4) After the query has been processed, the system collects the user's feedback, such as the links the user actually clicked, and finally adds this query to the query space and updates the query library, completing one query.
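The sketch below walks through steps (1) to (3) on toy data. The query and history are represented as sparse term-weight dictionaries, the top-T document lists per historical query are given directly, and k and the selection cutoff are arbitrary; all of these are illustrative assumptions, not details fixed by the patent.

import math
def cosine_similarity(a, b):
    """sim(q|p_i) on sparse term->weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
def select_indexes(query, history, top_docs, doc_cluster, clusters, k=2, top_n=1):
    """history: {p_i: vector}; top_docs: {p_i: [doc ids in its top-T list]};
    doc_cluster: {doc_id: cluster}; returns the top_n most relevant cluster index libraries."""
    # Step (1): rel(p_i | S_j) = (sum of rel(p_i | doc) over the top-T docs) / T.
    rel_p = {p: {s: sum(1 for d in docs if doc_cluster[d] == s) / len(docs) for s in clusters}
             for p, docs in top_docs.items()}
    # Step (2): the k historical queries most similar to q.
    sims = sorted(((cosine_similarity(query, v), p) for p, v in history.items()), reverse=True)[:k]
    # Step (3): rel(q | S_j) = sum over those k queries of rel(p_i | S_j) * sim(q | p_i).
    rel_q = {s: sum(rel_p[p][s] * sim for sim, p in sims) for s in clusters}
    return sorted(rel_q, key=rel_q.get, reverse=True)[:top_n]
history = {"p1": {"text": 1.0, "cluster": 0.5}, "p2": {"image": 1.0}}
top_docs = {"p1": ["d1", "d2"], "p2": ["d3", "d4"]}
doc_cluster = {"d1": "S1", "d2": "S1", "d3": "S2", "d4": "S2"}
print(select_indexes({"text": 1.0}, history, top_docs, doc_cluster, clusters=["S1", "S2"]))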

Retrieval is performed on the qualifying indexes; the retrieval results of the individual indexes are merged and sorted by computing scores with global information such as the global document frequency, giving the final retrieval result and completing the retrieval of the query. The score Score(q, d) of a retrieval result d for the query q is computed as:

Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t ∈ q} ( TF(t, d) × IDF(t)² × t.getBoost() × norm(t, d) );

where t is each keyword extracted from the query q; TF(t, d) is the word frequency of t in document d and IDF(t) is the inverse document frequency; t.getBoost() is the importance assigned to the keyword in the query input; norm(t, d) is the weighting and length factor of the document set at indexing time; coord(q, d) is a scoring factor, such that the more query terms appear in the document, the higher the match; and queryNorm(q) normalizes the query so that different queries can be compared directly.
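This scoring formula follows the practical scoring function familiar from Lucene-style engines. The sketch below evaluates it directly from precomputed components; all factor values are placeholders, and queryNorm, boosts, and norms would normally come from the engine rather than being supplied by hand.

def score(query_terms, doc_terms, tf, idf, boost, norm, query_norm):
    """Score(q,d) = coord(q,d) * queryNorm(q) * sum_t( TF(t,d) * IDF(t)^2 * boost(t) * norm(t,d) )."""
    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)   # more matched query terms -> higher factor
    body = sum(tf[t] * idf[t] ** 2 * boost.get(t, 1.0) * norm.get(t, 1.0) for t in matched)
    return coord * query_norm * body
print(score(query_terms=["text", "cluster"],
            doc_terms={"text", "index"},
            tf={"text": 0.5}, idf={"text": 1.2},
            boost={"text": 1.0}, norm={"text": 0.8},
            query_norm=0.7))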

Correspondingly, Fig. 4 is a schematic structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention. As shown in Fig. 4, the system includes:

a preprocessing module 11, configured to perform formatting and word-segmentation preprocessing on unstructured text and store the preprocessing results on distributed nodes;

a filtering and feature extraction module 12, configured to filter the preprocessing results and perform feature extraction to obtain processed text vocabulary feature vectors;

a clustering module 13, configured to cluster the text vocabulary feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of the text vocabulary feature vectors;

a cluster distribution module 14, configured to distribute each of the K clusters over one or more distributed nodes;

an index construction module 15, configured to use an index engine to build full-text indexes for the K clusters distributed over one or more distributed nodes and obtain K full-text indexes.

Preferably, Fig. 5 is a schematic structural diagram of the preprocessing module according to an embodiment of the present invention. As shown in Fig. 5, the preprocessing module 11 includes:

a format unification unit 111, configured to perform unified format processing on unstructured texts of different formats on each distributed node and obtain first texts with a consistent format;

a word segmentation and keyword extraction unit 112, configured to perform word segmentation on the first text, extract keywords from the segmentation result, and obtain the keyword vocabulary of the first text;

a storage unit 113, configured to store the keyword vocabulary on the distributed nodes using the combination "key = text number, value = text vocabulary".

Preferably, Fig. 6 is a schematic structural diagram of the filtering and feature extraction module according to an embodiment of the present invention. As shown in Fig. 6, the filtering and feature extraction module 12 includes:

a parallel computing unit 121, configured to process the text stored on the distributed nodes by parallel computing and obtain the word frequency of the vocabulary in the text;

a first comparison unit 122, configured to compare the word frequency with a first threshold and keep the words whose word frequency is greater than the first threshold;

a second comparison unit 123, configured to calculate the TF-IDF values of those words, compare the TF-IDF values with a second threshold, and keep the second vocabulary whose TF-IDF value is greater than the second threshold;

a feature extraction unit 124, configured to extract features from the second vocabulary, assign weights to the second vocabulary, and obtain the feature vectors of the second vocabulary.

Preferably, the clustering module 13 includes:

a first clustering unit, configured to perform preliminary clustering of the text vocabulary feature vectors with Canopy clustering and obtain preliminary clusters centered on Canopy centers;

a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters and obtain K clusters of the text vocabulary feature vectors.

Preferably, the index construction module 15 includes:

a node index construction unit, configured to process the clusters on each distributed node with the index engine and build full-text indexes of those clusters;

an index merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.

Specifically, for the working principles of the functional modules of the system of this embodiment, reference may be made to the corresponding descriptions of the method embodiments, which are not repeated here.

In the implementation of the present invention, the text is formatted, segmented, filtered, feature-extracted, and clustered, and full-text indexes are built from the processing results. This constructs a distributed indexing scheme for retrieval, gives users a fast indexing method, and improves the user experience.

Those of ordinary skill in the art will understand that all or part of the steps of the various methods in the above embodiments can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

In addition, the distributed index construction method and system based on text clustering provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementation and the scope of application based on the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610154682.0A | 2016-03-16 | 2016-03-16 | Distributed index establishment method and system based on text clustering (CN105787097A, en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610154682.0A | 2016-03-16 | 2016-03-16 | Distributed index establishment method and system based on text clustering (CN105787097A, en)

Publications (1)

Publication Number | Publication Date
CN105787097A | 2016-07-20

Family

ID=56394027

Family Applications (1)

Application Number | Status | Publication
CN201610154682.0A | Pending | CN105787097A (en)

Country Status (1)

Country | Link
CN | CN105787097A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US9058321B2 (en) * | 2008-05-16 | 2015-06-16 | Enpluz, LLC | Support for international search terms—translate as you crawl
CN102591978A (en) * | 2012-01-05 | 2012-07-18 | 复旦大学 | Distributed text copy detection system
CN103577418A (en) * | 2012-07-24 | 2014-02-12 | 北京拓尔思信息技术股份有限公司 | Massive document distribution searching duplication removing system and method
CN102831253A (en) * | 2012-09-25 | 2012-12-19 | 北京科东电力控制系统有限责任公司 | Distributed full-text retrieval system
CN105069101A (en) * | 2015-08-07 | 2015-11-18 | 桂林电子科技大学 | Distributed index construction and search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

Title
冯汝伟: "分布式环境下基于文本聚类的海量非结构化知识管理" (Massive unstructured knowledge management based on text clustering in a distributed environment), China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (22)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN106484813A (en) * | 2016-09-23 | 2017-03-08 | 广东港鑫科技有限公司 | A kind of big data analysis system and method
CN106886613A (en) * | 2017-05-03 | 2017-06-23 | 成都云数未来信息科学有限公司 | A kind of Text Clustering Method of parallelization
CN106886613B (en) * | 2017-05-03 | 2020-06-26 | 成都云数未来信息科学有限公司 | Parallelized text clustering method
CN108172304A (en) * | 2017-12-18 | 2018-06-15 | 广州七乐康药业连锁有限公司 | A kind of medical information visible processing method and system based on user's medical treatment feedback
CN108172304B (en) * | 2017-12-18 | 2021-04-02 | 广州七乐康药业连锁有限公司 | Medical information visualization processing method and system based on user medical feedback
CN108062306A (en) * | 2017-12-29 | 2018-05-22 | 国信优易数据有限公司 | A kind of index system establishment system and method for business environment evaluation
CN110674243A (en) * | 2019-07-02 | 2020-01-10 | 厦门耐特源码信息科技有限公司 | Corpus index construction method based on dynamic K-means algorithm
CN110956213A (en) * | 2019-11-29 | 2020-04-03 | 珠海大横琴科技发展有限公司 | Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN113407700A (en) * | 2021-07-06 | 2021-09-17 | 中国工商银行股份有限公司 | Data query method, device and equipment
CN113641870B (en) * | 2021-10-18 | 2022-02-11 | 北京微播易科技股份有限公司 | Vector index construction method, vector retrieval method and system corresponding to method
CN113641870A (en) * | 2021-10-18 | 2021-11-12 | 北京微播易科技股份有限公司 | Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN115526165A (en) * | 2021-12-19 | 2022-12-27 | 北京智美互联科技有限公司 | Traffic platform monitoring method and system based on word frequency weight
CN115526165B (en) * | 2021-12-19 | 2025-08-15 | 北京国瑞数智技术有限公司 | Flow platform monitoring method and system based on word frequency weight
CN114691868A (en) * | 2022-03-16 | 2022-07-01 | 中国工商银行股份有限公司 | Text clustering method, device and electronic device
CN115203378A (en) * | 2022-09-09 | 2022-10-18 | 北京澜舟科技有限公司 | Retrieval enhancement method, system and storage medium based on pre-training language model
CN115203378B (en) * | 2022-09-09 | 2023-01-24 | 北京澜舟科技有限公司 | Retrieval enhancement method, system and storage medium based on pre-training language model
CN116340991A (en) * | 2023-02-02 | 2023-06-27 | 魔萌动漫文化传播(深圳)有限公司 | Big data management method and device for IP gallery material resources and electronic equipment
CN116340991B (en) * | 2023-02-02 | 2023-11-07 | 魔萌动漫文化传播(深圳)有限公司 | Big data management method and device for IP gallery material resources and electronic equipment
CN116795946A (en) * | 2023-06-25 | 2023-09-22 | 中国标准化研究院 | A distributed standard knowledge management service system and method
CN116910186A (en) * | 2023-09-12 | 2023-10-20 | 南京信息工程大学 | Text index model construction method, index method, system and terminal
CN116910186B (en) * | 2023-09-12 | 2023-11-21 | 南京信息工程大学 | A text index model construction method, indexing method, system and terminal
CN118211038A (en) * | 2024-05-21 | 2024-06-18 | 中电科大数据研究院有限公司 | A multidimensional data processing and analysis method, device, system and storage medium

Similar Documents

Publication | Title
CN105787097A (en) | Distributed index establishment method and system based on text clustering
US11573996B2 (en) | System and method for hierarchically organizing documents based on document portions
Papadakis et al. | Blocking and filtering techniques for entity resolution: A survey
CN106649455B (en) | Standardized system classification and command set system for big data development
CN105045875B (en) | Personalized search and device
CN101685455B (en) | Method and system for data retrieval
CN111581354A (en) | A method and system for calculating similarity of FAQ questions
KR101072691B1 (en) | Method for searching database using relevance feedback and storage medium of storing program for executing the same
US20070185901A1 (en) | Creating Taxonomies And Training Data For Document Categorization
CN111026710A (en) | Data set retrieval method and system
Papadakis et al. | A survey of blocking and filtering techniques for entity resolution
CN104834735A (en) | A method for automatic extraction of document summaries based on word vectors
CN107862070A (en) | Online class based on text cluster discusses the instant group technology of short text and system
CN110633365A (en) | A hierarchical multi-label text classification method and system based on word vectors
CN107291895B (en) | A Fast Hierarchical Document Query Method
CN115563313A (en) | Semantic retrieval system for literature and books based on knowledge graph
CN109299357B (en) | A method for topic classification of Lao texts
CN106708929A (en) | Video program searching method and device
CN117057346A (en) | Domain keyword extraction method based on weighted textRank and K-means
Zhang | Start small, build complete: Effective and efficient semantic table interpretation using tableminer
Asa et al. | A comprehensive survey on extractive text summarization techniques
Dobrynin et al. | Building a Full-text Search Index Using "Transformer" Neural Network
Terko et al. | Neurips conference papers classification based on topic modeling
Vadivel et al. | An Effective Document Category Prediction System Using Support Vector Machines, Mann-Whitney Techniques
Pushplata et al. | An analytical assessment on document clustering

Legal Events

Code | Title | Description
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2016-07-20
