CN108846016A


Info

Publication number: CN108846016A
Application number: CN201810422499.3A
Authority: CN
Inventors: 金城; 陶仕谦; 唐士芳; 吴渊; 张玥杰; 冯瑞; 薛向阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-05-05
Filing date: 2018-05-05
Publication date: 2018-11-20
Anticipated expiration: 2038-05-05
Also published as: CN108846016B

Abstract


The invention belongs to the technical field of text search engines and specifically provides a search algorithm oriented to Chinese word segmentation. The algorithm has two stages: an offline index construction stage and an online search stage. In the offline index construction stage, the set of suffix strings is first extracted from the set of all original strings, and an improved suffix tree is then generated from the suffix string set. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the degree of matching between the keyword and each query result is then quantified, and finally the query results are sorted from the highest to the lowest degree of matching and returned. The invention uses an improved suffix-tree-based index structure to balance index construction time against space consumption, and searching with this index structure is far more efficient than computing the matching degree by brute force over the result set and then sorting.

Description


A Search Algorithm for Chinese Word Segmentation

Technical Field

The invention belongs to the technical field of text search engines, and in particular relates to a search algorithm oriented to Chinese word segmentation.

Background

A search engine is an online information retrieval tool that returns to the user a series of results matching the user's search keywords. We live in an era of information explosion. Faced with an enormous volume of information, quickly and accurately locating what the user wants is one of the most pressing needs, and information retrieval technology has therefore been developed and applied rapidly.

The most common form of search is text search. Whether the user's target resource is text, an image, audio, or even video, as long as the input is text, the search falls within the scope of the present invention. Besides the whole-web search offered by Google, Bing, Yahoo, and others, demand for search within specific domains is also growing. In a specific domain (for example, one covering only TV programs), the types of resources are limited, so the search conditions can generally be made very explicit, and the size of the data set is also within an acceptable range. Under these conditions, many targeted optimizations can be made to a search engine.

The main techniques currently used in Chinese search systems include inverted indexes, forward indexes, signature files, and suffix trees. Of these, the inverted index has the best overall performance and is the most widely used. In practice, however, applying the inverted index model to a large text collection places severe demands on CPU resources, memory space, and I/O.

Summary of the Invention

The purpose of the invention is to propose a search algorithm oriented to Chinese word segmentation for use in an intelligent Chinese search engine system, so that the system can quickly return search results for a keyword and present them to the user sorted from the highest to the lowest degree of matching.

The search algorithm proposed by the invention can be divided into two main stages: an offline index construction stage and an online search stage. In the offline index construction stage, the set of suffix strings is first extracted from the set of all original strings, and an improved suffix tree is then generated from the suffix string set. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the degree of matching between the keyword and each query result is then quantified, and finally the query results are sorted from the highest to the lowest degree of matching and returned.

1. Offline index construction stage. The specific steps are:

(1) Generate the suffix string sets from the original data set

Let T(S) denote the original data set, consisting of strings S that contain separators ($) and a terminator (#), where the i-th string has index ID i (1≤i≤n). Let WBS denote a suffix string that begins at a separator boundary, and NWBS a suffix string that does not. The suffix string sets T(WBS) and T(NWBS), each carrying index IDs, are generated from T(S) as follows:

Step 1: Traverse all strings in T(S) and extract all suffix strings of each string s_i, forming the sets T^*(s₁), T^*(s₂), …, T^*(s_n) [1]. A suffix string is a substring of a string S that runs from position i to the terminator at the end of S; that is, if S is written as C₁C₂…C_n, then C_iC_{i+1}…C_n is called a suffix string of S (1≤i≤n);

Step 2: Remove from T^*(s₁), T^*(s₂), …, T^*(s_n) all suffix strings that begin with a separator ($) or a terminator (#);

Step 3: Traverse all suffix strings in T^*(s_i). If the first character of a suffix string is the same as the first character of the original string, or the same as the first character after a separator ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS).
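The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `suffix_sets` and the choice to represent each entry as the suffix text with the index ID appended (for example `大学#1`) are assumptions made for the example. Step 3 is implemented literally as worded, by comparing first characters.

```python
from typing import List, Tuple

SEP, END = "$", "#"  # separator and terminator symbols used in the text


def suffix_sets(dataset: List[str]) -> Tuple[List[str], List[str]]:
    """Return (T(WBS), T(NWBS)) for strings such as '复旦$大学#'.

    Hypothetical helper: index IDs are 1-based positions in `dataset`.
    """
    wbs: List[str] = []
    nwbs: List[str] = []
    for index_id, s in enumerate(dataset, start=1):
        if not s:
            continue
        # First character of the string and of every segment after a separator.
        heads = {s[0]} | {s[k + 1] for k, ch in enumerate(s[:-1]) if ch == SEP}
        for start in range(len(s)):          # Step 1: every suffix of s
            suffix = s[start:]
            if suffix[0] in (SEP, END):      # Step 2: drop '$...' and '#...'
                continue
            # Step 3: classify by first character, then append the index ID.
            target = wbs if suffix[0] in heads else nwbs
            target.append(f"{suffix}{index_id}")
    return wbs, nwbs
```

For the single string `复旦$大学#`, this yields `复旦$大学#1` and `大学#1` in T(WBS), and `旦$大学#1` and `学#1` in T(NWBS).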

(2) Build an improved suffix tree for each of the suffix string sets T(WBS) and T(NWBS)

The improved suffix tree builds on the traditional suffix tree [1] by storing the label of each edge in a node instead. Each node thus serves as a storage unit, whose structure is shown in Figure 1. A node stores a node label, a terminator child pointer, a separator child pointer, a set of general child pointers, and a matching index ID sequence, where the node label is a terminator, a separator, or a general character string.

The specific steps for building an improved suffix tree for any suffix string set T are as follows:

Step 1: Create an improved suffix tree containing a single node whose node label, child pointers, and matching index ID sequence are all empty. This node is the root node of the improved suffix tree.

Step 2: Insert all elements of the suffix string set T into the improved suffix tree one by one. The insertion of each suffix string starts from the root node and searches for the insertion position.

Taking the improved suffix tree in Figure 2 as an example, there are three cases when inserting a suffix string:

Case ①: If the suffix string to be inserted already appears in the current tree, simply append the index ID to the matching index ID sequence of the node. For example, if the suffix string to be inserted is "学生#2" ("student"), the "学生" node already appears in the current tree, so the index ID is appended directly to the node's matching index ID sequence. The result is shown in Figure 3(a).

Case ②: If a prefix of the suffix string to be inserted matches an existing node, only the remaining nodes need to be added. For example, if the suffix string to be inserted is "复旦$学生#3" ("Fudan", "student"), the "复旦" node already exists, so the nodes "学生" and "#" are added directly. The result is shown in Figure 3(b).

Case ③: If the suffix string to be inserted matches only a prefix of the current node's label, the current node is split first and the other nodes are then inserted. For example, if the suffix string to be inserted is "大$学生#4", its prefix "大" is the same as the prefix of the current node "大学" ("university"), so the current node must be split before the other nodes are inserted. The result is shown in Figure 3(c).

Step 3: Recursively construct the matching index ID sequence of each node. As described above, the matching index ID sequences of the terminator nodes are already complete once all suffix strings have been inserted. It therefore remains only to construct the matching index ID sequence Q(N(s)) of every non-terminator node N(s), as given by formula (1):

Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)

where N(s#), N(s$), and N(s*) denote, respectively, the terminator child node, the separator child node, and all general child nodes of node N(s).
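The construction procedure above (node structure, the three insertion cases, and formula (1)) can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the patent's implementation: the names `Node`, `tokenize`, `insert_general`, `insert`, and `build_ids` are hypothetical, general children are kept in a dictionary keyed by the first character of their label, suffix strings are inserted naively one at a time (not with Ukkonen's on-line construction), and formula (1) is read as concatenation in the order terminator child, separator child, general children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str = ""                        # "#", "$", or a general string
    end: Optional["Node"] = None           # terminator child
    sep: Optional["Node"] = None           # separator child
    general: Dict[str, "Node"] = field(default_factory=dict)
    ids: List[int] = field(default_factory=list)  # matching index ID sequence


def tokenize(suffix: str) -> List[str]:
    """Split e.g. '大$学生#' into ['大', '$', '学生', '#']."""
    tokens, buf = [], ""
    for ch in suffix:
        if ch in "$#":
            if buf:
                tokens.append(buf)
                buf = ""
            tokens.append(ch)
        else:
            buf += ch
    if buf:
        tokens.append(buf)
    return tokens


def insert_general(node: Node, text: str) -> Node:
    """Insert a general string below `node`; return the node where it ends."""
    while text:
        child = node.general.get(text[0])
        if child is None:                  # case 2: add a new node
            child = Node(label=text)
            node.general[text[0]] = child
            return child
        k = 0                              # length of the common prefix
        while k < min(len(text), len(child.label)) and text[k] == child.label[k]:
            k += 1
        if k < len(child.label):           # case 3: split the existing node
            tail = Node(label=child.label[k:], end=child.end,
                        sep=child.sep, general=child.general)
            child.label = child.label[:k]
            child.end, child.sep = None, None
            child.general = {tail.label[0]: tail}
        node, text = child, text[k:]
    return node


def insert(root: Node, suffix: str, index_id: int) -> None:
    """Insert one suffix string (without its trailing ID) with its index ID."""
    node = root
    for tok in tokenize(suffix):
        if tok == "#":
            if node.end is None:
                node.end = Node(label="#")
            node = node.end
            node.ids.append(index_id)      # case 1 when the path already exists
        elif tok == "$":
            if node.sep is None:
                node.sep = Node(label="$")
            node = node.sep
        else:
            node = insert_general(node, tok)


def build_ids(node: Node) -> List[int]:
    """Formula (1): fill Q(N(s)) for non-terminator nodes, bottom-up."""
    if node.label == "#":
        return node.ids
    ids: List[int] = []
    for child in (node.end, node.sep, *node.general.values()):
        if child is not None:
            ids.extend(build_ids(child))
    node.ids = ids
    return ids
```

Inserting `大学#` with ID 1, then `大$学生#` with ID 4, then `大学#` with ID 2 splits the `大学` node into `大` and `学`, and `build_ids` then gives the `大` node the sequence 4, 1, 2 under the ordering assumed here.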

2. Online search stage. The specific steps are:

(1) Matching node query

For any node N(s), the process of querying the matching nodes of the string c₁…c_n starting from N(s) is given by formula (2):

where R(N(s)) denotes the query result, N(s) is the matching node, and s is the node label.

Given the query string c₁…c_n, first search all child nodes of the root node for the child node N(s) whose label begins with c₁, then execute R(N(s), c₁…c_n) to find all matching nodes, finally obtaining the search result R(N(s)) = (S, Q(N(s))), where Q(N(s)) is the matching index ID sequence of N(s).

(2) Sort the result set

Negative entropy is defined to measure how well the query string c₁…c_n matches a search result string s: the smaller the value, the lower the degree of matching; the larger the value, the higher the degree of matching.

The negative entropy value is assumed to be computed as follows (the initial value is 0):

(a) Obtain the position i of c₁ in s;

(b) Starting from i, traverse s toward its end until a separator $, a terminator #, or the end of s is encountered; suppose m characters are traversed in the process;

(c) If the end of s is encountered, check whether the last character is the terminator #. If it is, increase the negative entropy value by m² and end the algorithm; otherwise, increase the negative entropy value by m and end the algorithm;

(d) If a separator $ is encountered, increase the negative entropy value by m², and the algorithm ends;

(e) Update i to the position of the character following the separator $ that was encountered, and return to (b).

Compute the word segmentation negative entropy value of every s in the result set according to the steps above, and sort the result set by this value in descending order.
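The scoring steps (a) to (e) can be sketched as follows. This is one possible reading, not a definitive implementation, and it rests on several assumptions: the function name `negative_entropy` is hypothetical; m counts only ordinary characters, not the delimiter itself; a string that does not contain c₁ scores 0; and, because step (d) as worded ends the algorithm and would leave step (e) unreachable, the sketch assumes that after adding m² at a separator the scan continues with step (e).

```python
def negative_entropy(query: str, s: str) -> int:
    """Score how well `query` matches result string `s` (larger is better).

    Assumed reading: each segment completed by '$' or '#' adds m squared,
    a trailing segment cut off by the end of `s` adds m.
    """
    if not query:
        return 0
    i = s.find(query[0])               # (a) position of c1 in s
    if i < 0:
        return 0                       # assumption: no occurrence scores 0
    score = 0
    while True:
        m = 0
        while i < len(s) and s[i] not in "$#":   # (b) scan one segment
            m += 1
            i += 1
        if i >= len(s):                # (c) end of s without a terminator
            return score + m
        if s[i] == "#":                # (c) segment closed by the terminator
            return score + m * m
        score += m * m                 # (d) segment closed by a separator
        i += 1                         # (e) continue after the separator
```

A result set can then be ordered with `sorted(results, key=lambda s: negative_entropy(query, s), reverse=True)`, where `results` is an assumed list of result strings.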

(3) Eliminate duplicates in the result set and generate the search result sequence

Take the Q(N(s)) of each entry of the sorted result set in turn, perform the operation below, and place the outcome in the search result sequence, which is initially empty. Formula (3) gives the operation performed on Q(N(s_i)):

SR(i) = (D(Q(N(s_i))) - SR(i-1)) ∩ SR(i-1), 1≤i≤n    (3)

where SR(i) denotes the search result sequence after the matching index ID sequence Q(N(s_i)) of the i-th node has been merged in, and SR(1) and SR(n) are the initial and final states of the search result sequence, respectively; D(Q(N(s_i))) denotes deduplicating Q(N(s_i)); (D-SR) denotes removing from the deduplicated Q(N(s_i)) the index IDs that already appear in the search result sequence; and (D-SR)∩SR denotes appending (D-SR) to the end of the current search result sequence SR.

After the above steps, the final search result sequence is SR(n).
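The merge described by formula (3) amounts to an order-preserving deduplication across the sorted ID sequences. A minimal sketch, assuming the input is a list of ID lists already sorted by descending negative entropy (the function name `merge_results` is hypothetical):

```python
from typing import Iterable, List


def merge_results(id_sequences: Iterable[List[int]]) -> List[int]:
    """Build the search result sequence from ID sequences in rank order."""
    result: List[int] = []   # the search result sequence, initially empty
    seen = set()             # index IDs already placed in the result
    for seq in id_sequences:
        for index_id in seq:
            # Skipping seen IDs covers both D (duplicates inside seq)
            # and D - SR (IDs already in the result sequence).
            if index_id not in seen:
                seen.add(index_id)
                result.append(index_id)   # append to the end of SR
    return result
```

Using a set alongside the list keeps each membership check at constant expected time while the list preserves the ranking order.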

The invention uses an improved suffix-tree-based index structure to strike a good balance between index construction time and space consumption. Searching with this index structure is far more efficient than computing the matching degree by brute force over the result set and then sorting. Moreover, compared with fuzzy retrieval implemented on other full-text index structures, the index structure of the invention achieves high search efficiency at a lower cost in construction time and memory.

Brief Description of the Drawings

Figure 1: Structure of an improved suffix tree node.

Figure 2: An example of an improved suffix tree.

Figure 3: Comparison of the different cases when inserting a suffix string.

Detailed Description

To study the search performance of the invention on data sets of different sizes, we constructed five data sets containing 10,000, 20,000, 50,000, 100,000, and 200,000 entries, and on each data set ran multiple groups of comparison experiments against the inverted-index-based Lucene engine.

For each length from 2 to 4, 25 search strings were randomly generated, giving 75 search strings in total. Each search string was searched 100,000 times, and the time taken by each search was recorded, provided the search results were correct.

To allow Lucene to perform the same task as the index of the invention, when the initial-letter index is built a space is inserted between every two characters of the initial-letter sequence so that each character is treated as a word; spaces are likewise inserted between the characters of each search string, so as to provide the same search functionality as the invention.

The experimental results are shown in Table 1:

Table 1: Comparison of search time between the index of the invention and the Lucene index

As the table shows, the algorithm of the invention achieves better search efficiency than Lucene on every data set, and the advantage is more pronounced on small data sets: for data sets smaller than 50,000 entries, its search efficiency reaches 7 to 10 times that of Lucene.

References:

[1] E. Ukkonen, On-Line Construction of Suffix Trees, Algorithmica, 14 (1995), 249-260.