Movatterモバイル変換


[0]ホーム

URL:


CN108846016A - A kind of searching algorithm towards Chinese word segmentation - Google Patents

A kind of searching algorithm towards Chinese word segmentation
Download PDF

Info

Publication number
CN108846016A
CN108846016ACN201810422499.3ACN201810422499ACN108846016ACN 108846016 ACN108846016 ACN 108846016ACN 201810422499 ACN201810422499 ACN 201810422499ACN 108846016 ACN108846016 ACN 108846016A
Authority
CN
China
Prior art keywords
node
suffix
string
index
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810422499.3A
Other languages
Chinese (zh)
Other versions
CN108846016B (en
Inventor
金城
陶仕谦
唐士芳
吴渊
张玥杰
冯瑞
薛向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan UniversityfiledCriticalFudan University
Priority to CN201810422499.3ApriorityCriticalpatent/CN108846016B/en
Publication of CN108846016ApublicationCriticalpatent/CN108846016A/en
Application grantedgrantedCritical
Publication of CN108846016BpublicationCriticalpatent/CN108846016B/en
Activelegal-statusCriticalCurrent
Anticipated expirationlegal-statusCritical

Links

Classifications

Landscapes

Abstract

Translated fromChinese

本发明属于文本搜索引擎技术领域,具体为一种面向中文分词的搜索算法。本发明算法主要分为两个阶段:离线构建索引阶段和在线查找阶段。在离线构建索引阶段,首先提取所有原始字符串集合的后缀串集合,然后由后缀串集合生成改进的后缀树;在在线查找阶段,首先根据基于后缀树的索引模型得到关键词的查询结果,然后量化关键词和查询结果的匹配程度,最后将查询结果按匹配程序由高到低排序后返回。本发明通过一种改进的基于后缀树的索引结构来平衡索引构建时间和占用空间,使用本发明的索引结构的搜索效率远高于对结果集暴力计算匹配度并排序的效率。

The invention belongs to the technical field of text search engines, in particular to a search algorithm oriented to Chinese word segmentation. The algorithm of the present invention is mainly divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, first extract the suffix string set of all the original string sets, and then generate an improved suffix tree from the suffix string set; in the online search stage, first obtain the keyword query results according to the index model based on the suffix tree, and then Quantify the matching degree between keywords and query results, and finally sort the query results according to the matching program from high to low and return them. The present invention uses an improved suffix tree-based index structure to balance the index construction time and occupied space, and the search efficiency using the index structure of the present invention is much higher than the efficiency of violently calculating the matching degree and sorting the result set.

Description

Translated fromChinese
一种面向中文分词的搜索算法A Search Algorithm for Chinese Word Segmentation

技术领域technical field

本发明属于文本搜索引擎技术领域,具体涉及一种面向中文分词的搜索算法。The invention belongs to the technical field of text search engines, and in particular relates to a search algorithm oriented to Chinese word segmentation.

背景技术Background technique

搜索引擎是一种在线信息搜索工具,将符合用户搜索关键字的一系列搜索结果返回给用户。当今社会是个信息爆炸的时代,面对着数不尽的信息,如何快速精确定位用户想要的信息是最迫切的需求之一,信息搜索技术也因此得到快速的发展和应用。A search engine is an online information search tool that returns a series of search results matching the user's search keywords to the user. Today's society is an era of information explosion. Faced with countless information, how to quickly and accurately locate the information users want is one of the most urgent needs. Information search technology has also been rapidly developed and applied.

搜索最常见的形式是文本搜索,无论用户的目标资源是文字、图像、音频甚至是视频,只要输入的格式是文本,都可以归结到本发明搜索的范围内。现在除了谷歌、必应、雅虎等提供的全网站搜索功能外,特定领域的搜索需求也越来越大。在特定领域中(比如仅面向电视节目),由于资源的种类有局限性,所以搜索的条件一般能做到十分明确,另外数据集的大小也在可接受的范围内,在这些前提下可以对搜索引擎做出很多有针对性的优化。The most common form of search is text search. No matter the user's target resource is text, image, audio or even video, as long as the input format is text, it can be attributed to the scope of the search of the present invention. Now, in addition to the full-site search functions provided by Google, Bing, Yahoo, etc., the demand for search in specific fields is also increasing. In a specific field (such as only for TV programs), due to the limitations of the types of resources, the search conditions can generally be very clear, and the size of the data set is also within an acceptable range. Search engines do a lot of targeted optimization.

目前中文搜索系统的相关技术主要有倒排索引、正排索引、署名文件、后缀树等。其中倒排索引综合性能较好且最常用,但在实际应用中,应用倒排索引模型处理大文本集合时,对CPU资源、内存空间和I/O都是十分严峻的考验。At present, the relevant technologies of the Chinese search system mainly include inverted index, forward index, signature file, suffix tree and so on. Among them, the inverted index has better comprehensive performance and is the most commonly used. However, in practical applications, when applying the inverted index model to process large text collections, it is a very severe test for CPU resources, memory space, and I/O.

发明内容Contents of the invention

本发明的目的在于提出一种面向中文分词的搜索算法,应用于智能化的中文搜索引擎系统,使之能够快速地根据关键字返回搜索结果,并将结果按匹配程度由高到低排序后展示给用户。The purpose of the present invention is to propose a search algorithm oriented to Chinese word segmentation, which is applied to an intelligent Chinese search engine system, so that it can quickly return search results according to keywords, and display the results after sorting them according to the degree of matching from high to low to the user.

本发明提出的面向中文分词的搜索算法,主要可以分为两个阶段:离线构建索引阶段和在线查找阶段。在离线构建索引阶段,首先提取所有原始字符串集合的后缀串集合,然后由后缀串集合生成改进的后缀树;在在线查找阶段,首先根据基于后缀树的索引模型得到关键词的查询结果,然后量化关键词和查询结果的匹配程度,最后将查询结果按匹配程序由高到低排序后返回。The search algorithm oriented to Chinese word segmentation proposed by the present invention can be mainly divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, first extract the suffix string set of all the original string sets, and then generate an improved suffix tree from the suffix string set; in the online search stage, first obtain the keyword query results according to the index model based on the suffix tree, and then Quantify the matching degree between keywords and query results, and finally sort the query results according to the matching program from high to low and return them.

一、离线构建索引阶段,具体步骤为:1. Offline index construction phase, the specific steps are:

(1)由原数据集生成后缀串集合(1) Generate a set of suffix strings from the original data set

T(S)表示带有分隔符($)和结束符(#)的字符串S所组成的原数据集,其中第i个字符串的索引ID为i(1≤i≤n)。假设WBS表示从分隔符处开始的后缀串,NWBS表示不从分隔符处开始的后缀串。由T(S)生成带索引ID的后缀串集合T(WBS)和T(NWBS)的具体步骤如下:T(S) represents the original data set composed of character strings S with separators ($) and terminators (#), where the index ID of the i-th string is i (1≤i≤n). Assume that WBS indicates a suffix string starting from a delimiter, and NWBS indicates a suffix string not starting from a delimiter. The specific steps of generating the suffix string sets T(WBS) and T(NWBS) with index ID from T(S) are as follows:

第一步:遍历T(S)中的所有字符串,提取每个字符串的所有后缀串si,构成集合T*(s1),T*(s2)…T*(sn)[1]。其中后缀串是指字符串S从位置i开始到S末尾结束符的一个子串,即若S用C1C2…Cn表示,则CiCi+1…Cn称为S的一个后缀串(1≤i≤n);Step 1: traverse all strings in T(S), extract all suffixes si of each string, and form a set T* (s1 ),T* (s2 )…T* (sn )[ 1]. The suffix string refers to a substring of the string S starting from position i to the terminator at the end of S, that is, if S is represented by C1 C2 ... Cn , then Ci Ci+1 ... Cn is called one of S suffix string (1≤i≤n);

第二步:剔除T*(s1),T*(s2)…T*(sn)中所有以分隔符($)或结束符(#)为首的后缀串;Step 2: Eliminate all suffix strings in T* (s1 ), T* (s2 )...T* (sn ) that start with a separator ($) or a terminator (#);

第三步:遍历T*(si)中所有后缀串,若后缀串的首字符跟原字符串的首字符相同,或者跟原字符串中分隔符($)后的首字符相同,则在该后缀串末尾添加索引ID后加入至T(WBS),反之,则在该后缀串末尾添加索引ID后加入至T(NWBS)。Step 3: traverse all suffix strings in T* (si ), if the first character of the suffix string is the same as the first character of the original string, or the same as the first character after the separator ($) in the original string, then in Add the index ID at the end of the suffix string and add it to T(WBS), otherwise, add the index ID at the end of the suffix string and add it to T(NWBS).

(2)对后缀串集合T(WBS)和T(NWBS)分别建立改进后缀树(2) Establish an improved suffix tree for the suffix string sets T(WBS) and T(NWBS) respectively

改进后缀树是在传统后缀树[1]的基础上,将每条边上的标识存放到节点中。即把每个节点作为一个存储单元,其结构如图1所示。节点存储信息包括节点标识、结束符子节点指针、分隔符子节点指针、一般子节点指针集和匹配索引ID序列,其中节点标识为结束符、分隔符或一般字符串。The improved suffix tree is based on the traditional suffix tree [1], and stores the identification on each edge into the node. That is, each node is regarded as a storage unit, and its structure is shown in Figure 1. Node storage information includes node identifier, terminator child node pointer, delimiter child node pointer, general child node pointer set and matching index ID sequence, where the node identifier is terminator, separator or general character string.

对任意后缀串集合T建立改进后缀树的具体步骤如下:The specific steps to build an improved suffix tree for any set of suffix strings T are as follows:

第一步:创建一棵只包含一个节点的改进后缀树,该节点的节点标识、所有子节点指针和匹配索引ID序列均为空,把这个节点记为改进后缀树的根节点root。Step 1: Create an improved suffix tree containing only one node. The node ID, all child node pointers and matching index ID sequences of this node are empty, and this node is recorded as the root node root of the improved suffix tree.

第二步:把后缀串集合T中所有元素依次插入到改进后缀树中。每个后缀串的插入过程都是从根节点出发,寻找插入位置。Step 2: Insert all elements in the suffix string set T into the improved suffix tree in sequence. The insertion process of each suffix string starts from the root node to find the insertion position.

以图2中的改进后缀树为例子,在插入后缀串时分为以下三种情况:Taking the improved suffix tree in Figure 2 as an example, there are three situations when inserting a suffix string:

情况①:如需要插入的后缀串在当前树中已经出现,则直接在节点的匹配索引ID序列中添加索引号即可。例如要插入的后缀串为“学生#2”,由于“学生”节点在当前树中已经出现,因此直接在节点的匹配索引ID序列中添加索引号即可,结果如图3(a)所示。Situation ①: If the suffix string to be inserted already appears in the current tree, just add the index number directly to the matching index ID sequence of the node. For example, the suffix string to be inserted is "student #2", since the "student" node has already appeared in the current tree, so just add the index number directly to the matching index ID sequence of the node, the result is shown in Figure 3(a) .

情况②:如需插入的后缀串的前缀与当前已有节点相同,则是需要直接添加节点即可。例如需要插入的后缀串为“复旦$学生#3”,由于“复旦”节点已经存在,所以直接添加节点“学生”和“#”即可,结果如图3(b)所示。Situation ②: If the prefix of the suffix string to be inserted is the same as that of the existing node, it is necessary to add the node directly. For example, the suffix string to be inserted is "Fudan$student#3". Since the "Fudan" node already exists, it is enough to directly add the nodes "Student" and "#", and the result is shown in Figure 3(b).

情况③:如需要插入的后缀串与当前节点中的前缀相同,则先分裂当前节点,然后再插入其他节点。例如需要插入的后缀串为“大$学生#4”,由于后缀串的前缀“大”与当前节点“大学”中的前缀相同,所以需要先分裂当前节点,然后再插入其他节点,结果如图3(c)所示。Case ③: If the suffix string to be inserted is the same as the prefix in the current node, first split the current node, and then insert other nodes. For example, the suffix string to be inserted is "big$student#4". Since the prefix "big" of the suffix string is the same as the prefix in the current node "university", it is necessary to split the current node first and then insert other nodes. The result is shown in the figure 3(c).

第三步:递归构造每个节点的匹配索引ID序列。由前可知,结束符节点的匹配索引ID序列在全部后缀串插入完成时已经构造完成。因此,只需构造所有非结束符节点N(s)的匹配索引ID序列Q(N(s)),具体方法如公式(1)所示:Step 3: Recursively construct the matching index ID sequence of each node. It can be seen from the foregoing that the matching index ID sequence of the terminator node has been constructed when all suffix strings are inserted. Therefore, it is only necessary to construct the matching index ID sequence Q(N(s)) of all non-terminator nodes N(s), the specific method is shown in formula (1):

Q(N(s))=Q(N(s#))Q(N(s$))Q(N(s*))# (1)Q(N(s))=Q(N(s#))Q(N(s$))Q(N(s*))# (1)

其中,N(s#),N(s$)和N(s*)分别表示节点N(s)的结束符子节点,分隔符子节点和所有一般子节点。Among them, N(s#), N(s$) and N(s*) respectively denote the terminator child node, separator child node and all general child nodes of node N(s).

二、在线查找阶段,具体步骤为:Second, the online search stage, the specific steps are:

(1)匹配点查询(1) Matching point query

对任意节点N(s),从N(s)出发,查询字符串c1…cn的匹配节点的过程如公式(2)所示:For any node N(s), starting from N(s), the process of querying the matching nodes of the string c1 ... cn is shown in formula (2):

其中,R(N(s))表示查询结果,N(s)为匹配节点,s为节点标识。Among them, R(N(s)) represents the query result, N(s) is the matching node, and s is the node identifier.

给出查询字符串c1…cn,首先查找根节点的所有子节点,找到节点标识的首字符等于c1的子节点N(s),然后执行R(N(s),c1…cn),找到所有匹配点,最终得到搜索结果R(N(s))=(S,Q(N(s)))。其中,Q(N(s))为N(s)的匹配索引ID序列。Given the query string c1 ...cn , first search for all child nodes of the root node, find the child node N(s) whose first character of the node ID is equal to c1 , and then execute R(N(s),c1 ...cn ), find all matching points, and finally obtain the search result R(N(s))=(S,Q(N(s))). Wherein, Q(N(s)) is the matching index ID sequence of N(s).

(2)对结果集排序(2) Sort the result set

定义负熵来衡量查询字符串c1…cn和搜索结果字符串s的匹配程度,熵值越小,匹配程度越低;反之,熵值越大,匹配程度越高。Negative entropy is defined to measure the matching degree between the query string c1 ...cn and the search result string s. The smaller the entropy value, the lower the matching degree; conversely, the larger the entropy value, the higher the matching degree.

假设负熵值的计算算法如下(初始熵值为0):Assume that the calculation algorithm for the negative entropy value is as follows (the initial entropy value is 0):

(a)获取从c1在s中的位置i;(a) Get the position i in s from c1 ;

(b)从i开始向后遍历s,直到遇到分隔符$或者终止符#或者s的结尾,假设期间遍历了m个字符;(b) Traverse s backwards from i until encountering separator $ or terminator # or the end of s, assuming m characters are traversed during the period;

(c)如果遇到的是s的结尾,判断最后一个字符是否为终止符#,如果是,则负熵值增加m2,算法结束;否则,负熵值增加m,算法结束;(c) If the end of s is encountered, judge whether the last character is a terminator #, if yes, the negative entropy value increases by m2 , and the algorithm ends; otherwise, the negative entropy value increases by m, and the algorithm ends;

(d)如果遇到的是分隔符$,负熵值增加m2,算法结束;(d) If the separator $ is encountered, the negative entropy value increases by m2 , and the algorithm ends;

(e)将i更新为遇到的分隔符$之后一个字符的位置,回到(b)。(e) Update i to the position of one character after the separator $ encountered, and return to (b).

根据以上步骤计算结果集中所有s的分词负熵值,按其值由大到小对结果集进行排序。Calculate the word segmentation negative entropy values of all s in the result set according to the above steps, and sort the result set according to their values from large to small.

(3)消除结果集中的重复项并生成搜索结果序列(3) Eliminate duplicates in the result set and generate a sequence of search results

依次取出排序后结果集的Q(N(s)),执行相应操作后放入搜索结果序列中,搜索结果序列初始值为空。公式(3)是对Q(N(si))执行的具体操作:Take out the Q(N(s)) of the sorted result set in turn, and put them into the search result sequence after performing the corresponding operation. The initial value of the search result sequence is empty. Formula (3) is the specific operation performed on Q(N(si )):

SR(i)=(D(Q(N(si)))-SR(i-1))∩SR(i-1),1≤i≤n# (3)SR(i)=(D(Q(N(si )))-SR(i-1))∩SR(i-1), 1≤i≤n# (3)

其中,SR(i)表示合并完第i个节点的匹配索引ID序列Q(N(si))后的搜索结果序列,SR(1)和SR(n)分别为搜索结果序列的初始状态和最终状态;D(Q(N(si)))表示对Q(N(si))执行去重操作;(D-SR)表示在去重后的Q(N(si))中去除已经在搜索结果序列中出现过的索引号;(D-SR)∩SR表示将(D-SR)添加至当前搜索结果序列SR的末尾。Among them, SR(i) represents the search result sequence after merging the matching index ID sequence Q(N(si )) of the i-th node, SR(1) and SR(n) are the initial state and Final state; D(Q(N(si ))) means to perform a deduplication operation on Q(N(si )); (D-SR) means to remove An index number that has already appeared in the search result sequence; (D-SR)∩SR means adding (D-SR) to the end of the current search result sequence SR.

经过上述步骤后,最终得到的搜索结果序列为SR(n)。After the above steps, the finally obtained search result sequence is SR(n).

本发明通过一种改进的基于后缀树的索引结构来很好的平衡索引构建时间和占用空间,使用本发明的索引结构的搜索效率远高于对结果集暴力计算匹配度并排序的效率,且相比较于其他全文索引结构实现的模糊检索,本发明的索引结构采用更少的构建时间和占用内存代价的同时就能有很高的搜索效率。The present invention uses an improved suffix tree-based index structure to well balance the index construction time and occupied space. The search efficiency using the index structure of the present invention is much higher than the efficiency of violently calculating the matching degree and sorting the result set, and Compared with the fuzzy retrieval realized by other full-text index structures, the index structure of the present invention can achieve high search efficiency while using less construction time and memory cost.

附图说明Description of drawings

图1:改进后缀树节点的结构图。Figure 1: Structural diagram of improved suffix tree nodes.

图2:改进后缀树示例图。Figure 2: An example diagram of an improved suffix tree.

图3:插入后缀串时不同情况对比图。Figure 3: Comparison of different situations when inserting suffix strings.

具体实施方式Detailed ways

为研究本发明在不同大小数据集上的搜索性能,我们分别构建了数据量为10000、20000、50000、100000和200000的五个数据集,并在各数据集上与基于倒排表的Lucene引擎进行多组对比实验。In order to study the search performance of the present invention on data sets of different sizes, we constructed five data sets with a data volume of 10,000, 20,000, 50,000, 100,000 and 200,000 respectively, and compared each data set with the Lucene engine based on the inverted list Carry out multiple comparison experiments.

随机生成长度为2-4不等的搜索字符串各25个,共同构成75种搜索字符串。对于每一种搜索字符串,都进行100000次搜索,在搜索结果正确的前提下,记录每次搜索的时间消耗。Randomly generate 25 search strings each with a length ranging from 2 to 4 to form 75 search strings. For each search string, 100,000 searches are performed, and the time consumption of each search is recorded on the premise that the search results are correct.

为了让Lucene能够完成与本发明索引相同的任务,当建立首字母索引时在首字母序列的每个字符间加入空格,使每个字符被认为是一个单词,在搜索字符串的每个字符间也加入空格,用以实现本发明相同的搜索功能。In order to allow Lucene to complete the same task as the index of the present invention, a space is added between each character of the initial sequence when establishing the initial index, so that each character is considered as a word, and between each character of the search string Spaces are also added to realize the same search function of the present invention.

实验结果如表1所示:The experimental results are shown in Table 1:

表1本发明索引和Lucene索引搜索时间对比Table 1 Comparison of search time between the index of the present invention and the index of Lucene

由表可见,本发明算法在任何数据集上都有着比Lucene更好的搜索效率,并且结果在小数据集上更加明显,在数据集小于50000的情况下,使用本发明算法的搜索效率可以达到Lucene的7-10倍。It can be seen from the table that the algorithm of the present invention has better search efficiency than Lucene on any data set, and the result is more obvious on small data sets. When the data set is less than 50000, the search efficiency of the algorithm of the present invention can reach 7-10 times that of Lucene.

参考文选:Selected references:

[1]E.Ukkonen,On-Line Construction of Suffix Trees,Algorithmica,14(1995),249-260。[1] E. Ukkonen, On-Line Construction of Suffix Trees, Algorithmica, 14(1995), 249-260.

Claims (3)

Translated fromChinese
1.一种面向中文分词的搜索算法,其特征在于,分为两个阶段:离线构建索引阶段和在线查找阶段;1. A search algorithm for Chinese word segmentation is characterized in that it is divided into two stages: an offline index construction stage and an online search stage;(一)离线构建索引阶段,具体步骤为:(1) Offline index construction stage, the specific steps are:(1)由原数据集生成后缀串集合(1) Generate a set of suffix strings from the original data setT(S)表示带有分隔符($)和结束符(#)的字符串S所组成的原数据集,其中第i个字符串的索引ID为i,1≤i≤n,假设WBS表示从分隔符处开始的后缀串,NWBS表示不从分隔符处开始的后缀串;由T(S)生成带索引ID的后缀串集合T(WBS)和T(NWBS)的具体步骤如下:T(S) represents the original data set composed of character strings S with separators ($) and terminators (#), where the index ID of the i-th string is i, 1≤i≤n, assuming that WBS represents The suffix string starting from the delimiter, NWBS means the suffix string not starting from the delimiter; the specific steps of generating the suffix string sets T(WBS) and T(NWBS) with index ID by T(S) are as follows:第一步:遍历T(S)中的所有字符串,提取每个字符串的所有后缀串si,构成集合T*(s1),T*(s2)…T*(sn),其中后缀串是指字符串S从位置i开始到S末尾结束符的一个子串,即若S用C1C2…Cn表示,则CiCi+1…Cn称为S的一个后缀串,1≤i≤n;Step 1: traverse all strings in T(S), extract all suffixes si of each string, and form a set T* (s1 ), T* (s2 )...T* (sn ), The suffix string refers to a substring of the string S starting from position i to the terminator at the end of S, that is, if S is represented by C1 C2 ... Cn , then Ci Ci+1 ... Cn is called one of S Suffix string, 1≤i≤n;第二步:剔除集合T*(s1),T*(s2)…T*(sn)中所有以分隔符($)或结束符(#)为首的后缀串;Step 2: Eliminate all suffix strings starting with a separator ($) or a terminator (#) in the set T* (s1 ), T* (s2 )...T* (sn );第三步:遍历T*(si)中所有后缀串,若后缀串的首字符跟原字符串的首字符相同,或者跟原字符串中分隔符($)后的首字符相同,则在该后缀串末尾添加索引ID后加入至T(WBS),反之,则在该后缀串末尾添加索引ID后加入至T(NWBS);Step 3: traverse all suffix strings in T* (si ), if the first character of the suffix string is the same as the first character of the original string, or the same as the first character after the separator ($) in the original string, then in Add the index ID at the end of the suffix string and add it to T(WBS); otherwise, add the index ID at the end of the suffix string and add it to T(NWBS);(2)对后缀串集合T(WBS)和T(NWBS)分别建立改进后缀树(2) Establish an improved suffix tree for the suffix string sets T(WBS) and T(NWBS) respectively所谓改进后缀树是在传统后缀树的基础上,将每条边上的标识存放到节点中,即把每个节点作为一个存储单元,节点存储信息包括节点标识、结束符子节点指针、分隔符子节点指针、一般子节点指针集和匹配索引ID序列,其中节点标识为结束符、分隔符或一般字符串;The so-called improved suffix tree is based on the traditional suffix tree, storing the identification of each edge in the node, that is, each node is used as a storage unit, and the node storage information includes the node identification, the terminator child node pointer, the separator A child node pointer, a general set of child node pointers, and a sequence of matching index IDs, where node identifiers are terminators, delimiters, or general strings;对任意后缀串集合T建立改进后缀树的具体步骤如下:The specific steps to build an improved suffix tree for any set of suffix strings T are as follows:第一步:创建一棵只包含一个节点的改进后缀树,该节点的节点标识、所有子节点指针和匹配索引ID序列均为空,把这个节点记为改进后缀树的根节点root;Step 1: Create an improved suffix tree containing only one node. The node ID, all child node pointers and matching index ID sequences of this node are empty, and this node is recorded as the root node root of the improved suffix tree;第二步:把后缀串集合T中所有元素依次插入到改进后缀树中;每个后缀串的插入过程都是从根节点出发,寻找插入位置;Step 2: Insert all elements in the suffix string set T into the improved suffix tree in turn; the insertion process of each suffix string starts from the root node to find the insertion position;第三步:递归构造每个节点的匹配索引ID序列;由前可知,结束符节点的匹配索引ID序列在全部后缀串插入完成时已经构造完成;只需按公式(1)构造所有非结束符节点N(s)的匹配索引ID序列Q(N(s)):The third step: recursively construct the matching index ID sequence of each node; as we can see, the matching index ID sequence of the terminator node has been constructed when all suffix strings are inserted; only need to construct all non-terminators according to formula (1) Matching index ID sequence Q(N(s)) for node N(s):Q(N(s))=Q(N(s#))Q(N(s$))Q(N(s*))# (1)Q(N(s))=Q(N(s#))Q(N(s$))Q(N(s*))# (1)其中,N(s#),N(s$)和N(s*)分别表示节点N(s)的结束符子节点,分隔符子节点和所有一般子节点;Among them, N(s#), N(s$) and N(s*) respectively represent the terminator sub-node, delimiter sub-node and all general sub-nodes of node N(s);(二)在线查找阶段,具体步骤为:(2) Online search stage, the specific steps are:(1)匹配点查询(1) Matching point query对任意节点N(s),从N(s)出发,按公式(2)查询字符串c1…cn的匹配节点:For any node N(s), starting from N(s), query the matching nodes of the string c1 ... cn according to the formula (2):其中,R(N(s))表示查询结果,N(s)为匹配节点,s为节点标识;Among them, R(N(s)) represents the query result, N(s) is the matching node, and s is the node identifier;给出查询字符串c1…cn,首先查找根节点的所有子节点,找到节点标识的首字符等于c1的子节点N(s),然后执行R(N(s),c1…cn),找到所有匹配点,最终得到搜索结果R(N(s))=(S,Q(N(s)));其中,Q(N(s))为N(s)的匹配索引ID序列;Given the query string c1 ...cn , first search for all child nodes of the root node, find the child node N(s) whose first character of the node ID is equal to c1 , and then execute R(N(s),c1 ...cn ), find all matching points, and finally get the search result R(N(s))=(S,Q(N(s))); among them, Q(N(s)) is the matching index ID of N(s) sequence;(2)对结果集排序(2) Sort the result set定义负熵来衡量查询字符串c1…cn和搜索结果字符串s的匹配程度,熵值越小,匹配程度越低;反之,熵值越大,匹配程度越高;Negative entropy is defined to measure the matching degree of the query string c1 ...cn and the search result string s, the smaller the entropy value, the lower the matching degree; conversely, the larger the entropy value, the higher the matching degree;计算所有s的分词负熵值,按其值由大到小对结果集进行排序;Calculate the word segmentation negative entropy value of all s, and sort the result set according to its value from large to small;(3)消除结果集中的重复项并生成搜索结果序列(3) Eliminate duplicates in the result set and generate a sequence of search results依次取出排序后结果集的Q(N(s)),执行相应操作后放入搜索结果序列中,搜索结果序列初始值为空;所述执行相应操作是按公式(3)是对Q(N(si))执行如下操作:The Q(N(s)) of the sorted result set is taken out successively, put into the search result sequence after performing the corresponding operation, and the initial value of the search result sequence is empty; the corresponding operation is performed according to the formula (3) for Q(N (si )) performs the following operations:SR(i)=(D(Q(N(si)))-SR(i-1))∩SR(i-1),1≤i≤n# (3)SR(i)=(D(Q(N(si )))-SR(i-1))∩SR(i-1), 1≤i≤n# (3)其中,SR(i)表示合并完第i个节点的匹配索引ID序列Q(N(si))后的搜索结果序列,SR(1)和SR(n)分别为搜索结果序列的初始状态和最终状态;D(Q(N(si)))表示对Q(N(si))执行去重操作;(D-SR)表示在去重后的Q(N(si))中去除已经在搜索结果序列中出现过的索引号;(D-SR)∩SR表示将(D-SR)添加至当前搜索结果序列SR的末尾;Among them, SR(i) represents the search result sequence after merging the matching index ID sequence Q(N(si )) of the i-th node, SR(1) and SR(n) are the initial state and Final state; D(Q(N(si ))) means to perform a deduplication operation on Q(N(si )); (D-SR) means to remove The index number that has appeared in the search result sequence; (D-SR)∩SR means adding (D-SR) to the end of the current search result sequence SR;最终得到的搜索结果序列为SR(n)。The final search result sequence is SR(n).2.根据权利要求1所述的面向中文分词的搜索算法,其特征在于,所述每个后缀串的插入过程都是从根节点出发,寻找插入位置,分为如下3种情况:2. the search algorithm facing Chinese word segmentation according to claim 1, is characterized in that, the insertion process of each suffix string all starts from root node, finds insertion position, is divided into following 3 situations:情况①:如需要插入的后缀串在当前树中已经出现,则直接在节点的匹配索引ID序列中添加索引号即可;Situation ①: If the suffix string to be inserted already appears in the current tree, just add the index number directly to the matching index ID sequence of the node;情况②:如需插入的后缀串的前缀与当前已有节点相同,则直接添加节点即可;Situation ②: If the prefix of the suffix string to be inserted is the same as the existing node, just add the node directly;情况③:如需要插入的后缀串与当前节点中的前缀相同,则先分裂当前节点,然后再插入其他节点。Case ③: If the suffix string to be inserted is the same as the prefix in the current node, first split the current node, and then insert other nodes.3.根据权利要求1所述的面向中文分词的搜索算法,其特征在于,所述计算s的分词负熵值的步骤如下:设初始熵值为0;3. the search algorithm facing Chinese word segmentation according to claim 1, is characterized in that, the step of the word segmentation negative entropy value of described calculation s is as follows: set initial entropy value as 0;(a)获取从c1在s中的位置i;(a) Get the position i in s from c1 ;(b)从i开始向后遍历s,直到遇到分隔符$或者终止符#或者s的结尾,假设期间遍历了m个字符;(b) Traverse s backwards from i until encountering separator $ or terminator # or the end of s, assuming m characters are traversed during the period;(c)如果遇到的是s的结尾,判断最后一个字符是否为终止符#,如果是,则负熵值增加m2,算法结束;否则,负熵值增加m,算法结束;(c) If the end of s is encountered, judge whether the last character is a terminator #, if yes, the negative entropy value increases by m2 , and the algorithm ends; otherwise, the negative entropy value increases by m, and the algorithm ends;(d)如果遇到的是分隔符$,负熵值增加m2,算法结束;(d) If the separator $ is encountered, the negative entropy value increases by m2 , and the algorithm ends;(e)将i更新为遇到的分隔符$之后一个字符的位置,回到步骤(b)。(e) Update i to the position of a character after the encountered delimiter $, and return to step (b).
CN201810422499.3A2018-05-052018-05-05 A Search Algorithm for Chinese Word SegmentationActiveCN108846016B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN201810422499.3ACN108846016B (en)2018-05-052018-05-05 A Search Algorithm for Chinese Word Segmentation

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN201810422499.3ACN108846016B (en)2018-05-052018-05-05 A Search Algorithm for Chinese Word Segmentation

Publications (2)

Publication NumberPublication Date
CN108846016Atrue CN108846016A (en)2018-11-20
CN108846016B CN108846016B (en)2021-08-20

Family

ID=64212741

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN201810422499.3AActiveCN108846016B (en)2018-05-052018-05-05 A Search Algorithm for Chinese Word Segmentation

Country Status (1)

CountryLink
CN (1)CN108846016B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN109686413A (en)*2018-12-242019-04-26杭州费尔斯通科技有限公司A kind of chemical molecular formula search method based on es inverted index
CN110597855A (en)*2019-08-142019-12-20中山大学 A data storage method, terminal equipment, and computer-readable storage medium
CN112232903A (en)*2020-09-272021-01-15北京五八信息技术有限公司Business object display method and device
CN112802553A (en)*2020-12-292021-05-14北京优迅医疗器械有限公司Method for comparing genome sequencing sequence and reference genome based on suffix tree algorithm
CN112966505A (en)*2021-01-212021-06-15哈尔滨工业大学Method, device and storage medium for extracting persistent hot phrases from text corpus
WO2021139154A1 (en)*2020-01-102021-07-15百度在线网络技术(北京)有限公司Data prefetching method and apparatus, electronic device, and computer-readable storage medium
CN113450028A (en)*2021-08-312021-09-28深圳格隆汇信息科技有限公司Behavior fund analysis method and system
CN115244539A (en)*2020-05-182022-10-25谷歌有限责任公司Word or word segment lemmatization inference method
CN118820246A (en)*2024-09-182024-10-22西安博达软件股份有限公司 A dynamic index optimization method, device, system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN103631929A (en)*2013-12-092014-03-12江苏金智教育信息技术有限公司Intelligent prompt method, module and system for search
CN103838785A (en)*2012-11-272014-06-04大连灵动科技发展有限公司Vertical search engine in patent field
CN103838783A (en)*2012-11-272014-06-04大连灵动科技发展有限公司 A Suffix Tree Clustering Method Suitable for Chinese Web Documents
CN107844731A (en)*2016-09-172018-03-27复旦大学Long-term sequence δ abnormal point detecting methods based on probabilistic suffix tree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN103838785A (en)*2012-11-272014-06-04大连灵动科技发展有限公司Vertical search engine in patent field
CN103838783A (en)*2012-11-272014-06-04大连灵动科技发展有限公司 A Suffix Tree Clustering Method Suitable for Chinese Web Documents
CN103631929A (en)*2013-12-092014-03-12江苏金智教育信息技术有限公司Intelligent prompt method, module and system for search
CN107844731A (en)*2016-09-172018-03-27复旦大学Long-term sequence δ abnormal point detecting methods based on probabilistic suffix tree

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZENG D等: "Domain-specific Chinese word segmentation using suffix tree and mutual information", 《INFORMATION SYSTEMS FRONTIERS》*
袁津生等: "改进后缀树的中文检索结果聚类研究", 《计算机工程与应用》*
韦美峰等: "基于后缀树聚类的主题搜索引擎研究", 《情报理论与实践》*

Cited By (15)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN109686413A (en)*2018-12-242019-04-26杭州费尔斯通科技有限公司A kind of chemical molecular formula search method based on es inverted index
CN110597855B (en)*2019-08-142022-03-29中山大学Data query method, terminal device and computer readable storage medium
CN110597855A (en)*2019-08-142019-12-20中山大学 A data storage method, terminal equipment, and computer-readable storage medium
US12124447B2 (en)2020-01-102024-10-22Baidu Online Network Technology (Beijing) Co., Ltd.Data prefetching method and apparatus, electronic device, and computer-readable storage medium
WO2021139154A1 (en)*2020-01-102021-07-15百度在线网络技术(北京)有限公司Data prefetching method and apparatus, electronic device, and computer-readable storage medium
US11763083B2 (en)2020-05-182023-09-19Google LlcInference methods for word or wordpiece tokenization
CN115244539A (en)*2020-05-182022-10-25谷歌有限责任公司Word or word segment lemmatization inference method
CN112232903A (en)*2020-09-272021-01-15北京五八信息技术有限公司Business object display method and device
CN112802553A (en)*2020-12-292021-05-14北京优迅医疗器械有限公司Method for comparing genome sequencing sequence and reference genome based on suffix tree algorithm
CN112802553B (en)*2020-12-292024-03-15北京优迅医疗器械有限公司Suffix tree algorithm-based genome sequencing sequence and reference genome comparison method
CN112966505B (en)*2021-01-212021-10-15哈尔滨工业大学 A method, device and storage medium for extracting persistent hot phrases from text corpus
CN112966505A (en)*2021-01-212021-06-15哈尔滨工业大学Method, device and storage medium for extracting persistent hot phrases from text corpus
CN113450028A (en)*2021-08-312021-09-28深圳格隆汇信息科技有限公司Behavior fund analysis method and system
CN118820246A (en)*2024-09-182024-10-22西安博达软件股份有限公司 A dynamic index optimization method, device, system and storage medium
CN118820246B (en)*2024-09-182025-01-24西安博达软件股份有限公司 A dynamic index optimization method, device, system and storage medium

Also Published As

Publication numberPublication date
CN108846016B (en)2021-08-20

Similar Documents

PublicationPublication DateTitle
CN108846016A (en)A kind of searching algorithm towards Chinese word segmentation
CN104866593B (en)A kind of database search method of knowledge based collection of illustrative plates
CN104268280B (en)A kind of Hierarchical storage and querying method based on key value database
CN103646032B (en)A kind of based on body with the data base query method of limited natural language processing
CN103198079B (en)The implementation method of relevant search and device
CN104991905B (en)A kind of mathematic(al) representation search method based on level index
CN105045875B (en)Personalized search and device
US9020951B2 (en)Methods for indexing and searching based on language locale
CN104239513A (en)Semantic retrieval method oriented to field data
CN103365992B (en)Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN106503223B (en) An online housing search method and device combining location and keyword information
CN102236706B (en)Fast fuzzy pinyin inquiry method of mass Chinese file names
CN102915381B (en)Visual network retrieval based on multi-dimensional semantic presents system and presents control method
WO2007085187A1 (en)Method of data retrieval, method of generating index files and search engine
CN112800023B (en)Multi-model data distributed storage and hierarchical query method based on semantic classification
CN102867049B (en)Chinese PINYIN quick word segmentation method based on word search tree
CN104346331A (en)Retrieval method and system for XML database
CN102314464B (en)Lyrics searching method and lyrics searching engine
CN105404677A (en)Tree structure based retrieval method
CN108241709B (en) A data integration method, device and system
US8364684B2 (en)Methods for prefix indexing
CN107229714B (en)Full-text search engine based on distributed database
CN105824956A (en)Inverted index model based on link list structure and construction method of inverted index model
CN104268176A (en)Recommendation method and system based on search keyword
CN110909128B (en)Method, equipment and storage medium for carrying out data query by using root list

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
GR01Patent grant
GR01Patent grant

[8]ページ先頭

©2009-2025 Movatter.jp