Summary of the invention
Do not support the function of fuzzy search at present electronic dictionary, purpose of the present invention designs a kind of electronic dictionary fuzzy searching method exactly, to improve retrieval precision.
In order to realize the object of the invention, the present invention proposes a kind of electronic dictionary fuzzy searching method, described electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and described concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of described keyword dictionary and the described entry dictionary, said method comprising the steps of:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to the participle step retrieves wherein one or more entries of each keyword correspondence from described keyword index table, calculate the word of described input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Preferably, wherein step (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Preferably, wherein the editing distance described in the step (b) can refer to and allow an entry S1 become the character sum that another entry S2 need operate, and described operation comprises increase, deletion or substitute character.
Preferably, wherein step (c) back also comprises the step that shows result for retrieval.
Preferably, wherein said entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Preferably, wherein step (a) is preceding also comprises: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention has positive effect: owing to can support fuzzy search, make the user when input has some wrong individual characters in the entry, the result that script can not retrieve by the accurate retrieval mode of tradition, obtain result for retrieval soon by method and apparatus of the present invention, both improve speed, also improved retrieval precision.The present invention can be used for the electronic dictionary of handwriting input; And in the OCR dictionary for translation.
Embodiment
Describe the electronic dictionary of support fuzzy search of the present invention in detail below in conjunction with accompanying drawing.
The data structure of the electronic dictionary of support fuzzy search of the present invention as shown in Figure 1, comprises an entry dictionary 10 and a keyword dictionary 20.Entry dictionary 10 refers to traditional electronic dictionary, is that example comprises a series of entries as " People's Bank of China " with Sino-British electronic dictionary.Keyword dictionary 20 has comprised the keyword that each entry can be divided in the entry dictionary, has comprised " China ", " people ", " bank " three keywords as " People's Bank of China ".And the electronic dictionary of support fuzzy search of the present invention also comprises a keyword index table (not shown), this concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary 20 and the entry dictionary 10, generally speaking, a keyword is to there being several entries.As shown in Figure 1, keyword " people " has write down all and has comprised keyword " people's " the index of entry (as " Chinese people ", " people " etc.) in the entry dictionary in the keyword dictionary.Index relative please see Figure the arrow in 1.
Described entry dictionary 10 is promptly supported the realization of the electronic dictionary of accurate coupling or prefix, can use plurality of data structures (Hash table, search tree, Trie tree etc.).The present invention has adopted even numbers group Trie to set and has realized.The data structure of even numbers group Trie tree is two linear arrays, and one is the base array, and one is the check array.The base array is used for determining the transfer of state, and the check array is used to check the correctness of transfer.With the Chinese-English Dictionary is example, at first the basic Chinese characters among all GB2312 is changed into the sequence code of 1-6768, with the basic value as state exchange; Then the sequence code of all Chinese characters is put into the base array as original state; Next put the follow-up Chinese character sequence code of different entries into array, generate new state, and the base value of original state in the array is adjusted, can put into array to guarantee all follow-up Chinese characters; By that analogy, up to depositing all entry states in array; The final state of representing even numbers group Trie tree simultaneously with negative value.
The realization of described keyword dictionary 20 also can be used plurality of data structures (Hash table, search tree, Trie tree etc.).The same with entry dictionary 10, the present invention has adopted even numbers group Trie to set and has realized, and has added an index and write down the index of all entries that comprise this keyword in entry dictionary 10 in each keyword structure.
The construction method of dictionary configuration of the present invention as shown in Figure 1 is as follows:
1) each entry that will comprise in the entry storehouse of all entries is divided into keyword, and keeps keyword and its semantic identical in former entry.Form crucial dictionary." China ", " people ", " bank " three key words have been divided into as " People's Bank of China ".
2) so crucial dictionary is created as keyword dictionary 20.
3) use the entry dictionary of setting up according to the entry storehouse 10, when setting up these two dictionaries, corresponding to each keyword a keyword index table is arranged, the one or more entries of this index point will write down " People's Bank of China " index in entry dictionary 10 respectively as " People's Bank of China " in the concordance list of " China ", " people ", " bank " three keywords.
The search method of the electronic dictionary of support fuzzy search of the present invention, at first the word to user's input carries out traditional accurate coupling retrieval, if find accurate coupling retrieval then the direct accurately coupling result for retrieval that shows; If can not find, then carry out fuzzy search of the present invention.
Fig. 2 is the schematic flow sheet of fuzzy search in the retrieval flow in the one embodiment of the invention.The fuzzy matching retrieval may further comprise the steps:
Step 210, at first the word to input uses keyword dictionary 20 to adopt reverse maximum syncopation to carry out participle, the word of input is divided into several keywords, as " People's Bank of China " being divided into " China ", " people ", " bank " three keywords; If word segmentation result is arranged then enter next step, otherwise show and do not retrieve; The word segmentation result of this step has obtained one or more keywords.
Step 220, the keyword that obtains according to word segmentation result retrieves wherein each keyword corresponding entries from concordance list, calculate the word of input and the editing distance between these entries respectively, during as retrieval " the Shen state people ", the keyword that entry is divided into is " Shen ", " state ", " people ", wherein choose the longest keyword " people " corresponding entries in concordance list " People's Republic of China (PRC) " is arranged, " Chinese people ", " the People's Bank ", " PLA " etc., with the input entry " the Shen state people " editing distance be respectively " People's Republic of China (PRC) "-5 (twice replacement operation, " Shen state "=>China; Insert operation three times, insert " republic "), " Chinese people "-1 (replacement operation, " Shen "=>" in "), " the People's Bank "-4 (twice deletion action, deletion " Shen state "; " bank " inserted in twice insertion operation), " PLA "-5 (twice deletion action, deletion " Shen state "; Insert operation three times and insert " PLA ") etc.
In the present invention, weigh the similarity of two entries, refer to and allow the character sum of the operation that an entry S1 becomes another entry S2 to be needed (increase, deletion is replaced) with " editing distance ".In Chinese one and Chinese character calculate a character, a character calculated in a letter in English or other phonetic languages, for example S1=" the Shen state people " then " Shen " among the S1 can be replaced to S2=" Chinese people " " in " to obtain " Chinese people " consistent with S2, owing to carried out replacement operation one time, here editing distance is 1.
Step 230 sorts from small to large to editing distance, and tabulation is sorted from small to large according to editing distance as the entry in the previous step, obtains " Chinese people "-1, " the People's Bank "-4, " People's Republic of China (PRC) "-5, " PLA "-5;
Step 240, return the result of the less several entries of the entry result of editing distance minimum or editing distance, during as retrieval " the Shen state people " in previous step, because the editing distance minimum of " Chinese people " is 1, so return entry " Chinese people " and corresponding explanation the thereof.
Can be transported to devices such as display by the result who obtains after the inventive method retrieval, also can deliver to other modules carries out further data processing, because of not being emphasis of the present invention, no longer describes in detail.
The present invention also provides the fuzzy search device based on the above-mentioned electronic dictionary fuzzy retrieval method, and electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary and the entry dictionary, and the fuzzy search device comprises with lower module:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to word-dividing mode retrieves wherein one or more entries of each keyword correspondence from the keyword index table, calculates the word of input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Wherein module (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Wherein the editing distance in the module (b) refers to and allows entry S1 become the character sum that entry S2 need operate, and operation comprises increase, deletion or substitute character.
Wherein module (c) back also comprises the module that shows result for retrieval.
Wherein the entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Wherein module (a) is preceding also comprises with lower module: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention also provides the electronic dictionary corresponding to above-mentioned search method and indexing unit, comprises the entry dictionary that stores a plurality of entries, also comprises:
Store the keyword dictionary of a plurality of keywords;
The keyword index table;
And as the fuzzy search device in the previous technique scheme;
Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised this keyword in each keyword of keyword dictionary and the entry dictionary.
The present invention can be used for a lot of occasions, can be used for the electronic dictionary of handwriting input; Also can be used in a as previously mentioned OCR dictionary for translation, the character of the wrong knowledge of recognition result possibility of output is finished in OCR identification, if use traditional retrieval mode in dictionary, to can not find result for retrieval probably, but be to use dictionary of the present invention to carry out fuzzy search and just can retrieve the result that the user needs, improved user's satisfaction.
It should be noted that the foregoing description is example and unrestricted the present invention, those skilled in the art can design a lot of alternate embodiments and not break away from the scope of appended claims.