CN100437585C


Info

Publication number: CN100437585C
Application number: CNB2006101128224A
Authority: CN
Inventors: 曹勇刚; 曹羽中; 金茂忠; 刘超
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2006-09-04
Filing date: 2006-09-04
Publication date: 2008-11-26
Anticipated expiration: 2026-09-04
Also published as: CN1916905A

Abstract

A method for retrieval prompting (search suggestions) based on inverted lists. A main inverted list and a secondary inverted list are built for use by a main search engine and a secondary search engine, respectively. The query string entered by the user is segmented into words; the main search engine retrieves the documents containing those words and ranks them by relevance. The same query string is also split into characters; the secondary search engine retrieves the words containing each character of the query string and ranks them by priority. When retrieval prompts are presented to the user, each retrieved word is displayed together with the number of documents that contain it.

Description

Method for retrieval prompting based on an inverted list

Technical field

The present invention relates to computer information retrieval technology, and in particular to a retrieval prompting method based on an inverted list.

Background technology

Users of search engines often look for content that is unfamiliar to them (novel content): they are not clear about what they need, or do not know how to express it. Except for a few popular terms (such as celebrity names or the names of news events), the query words a user enters are not always the best ones. In another situation, the user has no specific target and only wants a rough overview of unknown content of interest within some scope, and is then even less able to express the need in query words. In a last situation, the user does not know that the relevant content exists and so never thinks of searching for it, although that content is what they would want; or the user believes relevant content exists, but the index actually contains none, for example because some information was not collected automatically, or because it was treated as harmful and was blocked or discarded.

These situations make it hard for users to express their needs, so the search engine cannot easily find the content they want; we call this the inconsistency between user needs and content. Resolving it requires the system to present relevant words derived from the content, so that the user can select or click them to search instead of having to enter the correct query words. Studies show that users usually start from a short query, inspect the results, revise the query and search again, repeating this until they find the target; giving more prompts alongside the query results speeds up this process.

Current search engines mainly generate retrieval prompts from statistics on the query words users enter: the query words entered by all users are counted to obtain the popularity of each, and the most popular terms similar to the current user's query are then selected as retrieval prompts. This approach always prompts the user with the most popular terms; it is reasonable to a degree, but those terms are not necessarily what the user actually wants.

Summary of the invention

In view of this, the object of the present invention is to provide a method for retrieval prompting based on an inverted list, which generates retrieval prompts from the content of the documents to be retrieved.

To this end, the present invention adopts the following method:

A method for retrieval prompting based on an inverted list, comprising the following steps:

● Establish the main inverted list used by the main search engine

All documents to be retrieved are segmented into words, and the segmented words are indexed to build an inverted list keyed by word, whose value is the list of numbers of the documents containing that word; this is called the main inverted list. The part that uses the main inverted list to index and retrieve documents is the main search engine, which retrieves the documents containing the words of a query string;

● Establish the secondary inverted list used by the secondary search engine

The words obtained in the previous step are further split into characters, and the characters are indexed to build an inverted list keyed by character, whose value is the set of words containing that character; this is called the secondary inverted list. The secondary inverted list also stores the document frequency of each word. The part that uses the secondary inverted list to index and retrieve words is the secondary search engine, which retrieves the words containing the characters of a query string;

● Retrieve documents with the main search engine

The query string entered by the user is segmented into words, the main search engine retrieves the documents containing those words, and all retrieved documents are ranked by relevance to obtain the sorted sequence of retrieved documents;

● Retrieve words with the secondary search engine

The query string entered by the user is split into characters, the secondary search engine retrieves the words containing each character of the query string, and all retrieved words are ranked by priority as follows to obtain the sorted sequence of retrieved words:

First, for each retrieved word in the secondary inverted list, the similarity between each character of the query string and that word is computed:

similarity(character, word) = (number of times the character occurs in the word) * log(reciprocal of the number of words in the secondary inverted list that contain the character),

Then the priority of the word is computed, that is, priority(word) = sqrt(number of documents of the main search engine in which the word occurs) * (sum of the similarities between each character of the query string and the word);

● Retrieval prompting

When retrieval prompts are presented to the user, the retrieved words are displayed in the order of the word sequence returned by the secondary search engine, and each word is followed by the number of documents that contain it.

In addition:

Before the step of establishing the secondary inverted list used by the secondary search engine, the words in the main inverted list used by the main search engine may first be screened to remove unwanted words.

During screening, the word length and the number of documents containing the word may be used as screening conditions.

The method of the present invention thus generates retrieval prompts from the content of the documents to be retrieved. Compared with retrieval prompting based on query-word statistics, it has the following advantages:

(1) From the viewpoint of information theory, the more frequently a word occurs across contexts, the less information it carries. Content-based retrieval prompting can surface rare, information-rich words, whereas prompting based on query statistics can only surface well-known words that carry little information.

(2) A word prompted from content is guaranteed to correspond to actual content. This is not so for prompting based on query words, since users may have entered query words that return no results or that are misleading.

(3) With content-based retrieval prompting, because terminology in document content is relatively uniform, the prompted words are less redundant and span a wider range, giving the user broader prompts. This is not so for prompting based on query words: because users do not know in advance which words the target documents use, different users query a popular topic with different query words, different combinations and different orders, so the prompts end up filled with many popular words of duplicate meaning.

(4) Content-based prompting can surface little-known terms and thereby broaden the user's knowledge. This is not so for prompting based on query words: only query words that users have already used and that meet the statistical requirements can be prompted. Even if the indexed content covers a related topic, as long as users do not know of it, nobody or hardly anybody will query it, the system cannot prompt users to query it, and it remains unknown or known only to a few.

(5) Content-based prompting can state the number of relevant documents exactly, and is relatively more efficient. Because the prompted words come directly from the inverted list, the number of corresponding documents is readily available. With prompting based on query words, obtaining the corresponding document count requires either running the query or recording result counts in an extra cache, which costs far more for the same function. Moreover, the result of a parsed query reflects documents that may contain any permutation or combination of the query terms, so the count obtained is inaccurate (severely overestimated).

Description of drawings

Fig. 1 is an architecture diagram of the present invention;

Fig. 2 is a flowchart of the present invention;

Fig. 3 is a schematic diagram of the main inverted list of the present invention;

Fig. 4 is a schematic diagram of the secondary inverted list of the present invention;

Fig. 5 shows the prompts provided by Google Suggest based on query statistics;

Fig. 6 shows the prompts provided by Baidu's related search based on query statistics;

Fig. 7 shows the content-based retrieval prompts given by a new search engine built according to the present invention.

Embodiment

The inverted list is a data structure commonly used in search engines. It is keyed by word, with the set of documents containing that word as its entry, so the documents containing one or more given words can be found quickly. Besides the list of document numbers for each word, the inverted list also stores the number of documents containing the word (the document frequency, df), the number of times the word occurs in a given document (the term frequency, tf), and even information such as the positions at which the word occurs in a document. The words in the inverted list and their document counts therefore in effect form a word-frequency dictionary built from a large corpus, which can serve as a basis for retrieval prompting. When a user does not know which terms to use to search for content of interest, the user can enter characters or words believed to be related to that content; the system searches the words already in the inverted list, prompts the words related to the user's input, and lists how many documents each word occurs in. The user can then refine the search according to the retrieval prompts.

The Software Engineering Institute of Beijing University of Aeronautics and Astronautics has developed a Chinese word segmentation program, BUAASEISEG. This segmenter favors long-word segmentation and has strong capabilities for recognizing new words and named entities such as technical terms, personal names, place names, organization names and institution names. BUAASEISEG uses an iterative binary segmentation method that combines the local probability of a candidate word in the article, its global probability in a word-frequency dictionary, and the number of transitions from the candidate word to subsequent words. It performs context-sensitive new-word recognition and disambiguation online: given sufficient context, it can recognize new words of various types (not limited to personal, place and organization names) and resolve various kinds of ambiguity. Named entities that occur with high frequency in an article are segmented as whole words. For example, BUAASEISEG segments "Beijing University of Aeronautics and Astronautics" as a single word, whereas a typical Chinese segmentation algorithm would split it into four words: "Beijing", "aviation", "space flight" and "university". BUAASEISEG also supports English named-entity recognition, so that "software engineering", for example, can be added to the inverted list as one complete word rather than being split into the two words "software" and "engineering". (For a detailed description of the BUAASEISEG segmentation system, see: Cao Yonggang, Cao Yuzhong, Jin Maozhong, Liu Chao. Adaptive Chinese word segmentation for information retrieval. Journal of Software, 2006, 17(3): 356-363.)

The present invention is a method for retrieval prompting based on an inverted list. Its implementation steps are described below with reference to the architecture shown in Fig. 1 and the flowchart shown in Fig. 2.

Step a) Establish the main inverted list used by the main search engine

All documents or web pages to be retrieved are segmented into words, and the segmented words are indexed. During indexing, an inverted list is built that is keyed by word and whose value is the list of numbers of the documents containing that word; its structure is shown in Fig. 3. We call the part that indexes and retrieves documents the main search engine; it retrieves the documents containing the words of a query string.

If the documents to be retrieved contain Chinese, any Chinese word segmentation algorithm or segmentation software (segmenter) can be used. Using a particular segmentation algorithm, such as the Chinese segmentation software BUAASEISEG, yields better retrieval prompts.
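As an illustration of step a), the following minimal sketch builds a word-keyed inverted list that records, for each word, the documents containing it and the term frequency in each. The `tokenize` function is a placeholder assumption (a whitespace split standing in for a real segmenter such as BUAASEISEG), and the sample documents are invented for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Placeholder segmenter (assumption): a whitespace split stands in
    # for a real word segmenter.
    return text.lower().split()


def build_main_index(docs):
    """Map each word to {document number: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical sample documents, keyed by document number.
docs = {
    1: "robot arm control",
    2: "robot vision",
    3: "vision sensor control",
}
main_index = build_main_index(docs)

# Document frequency (df): number of documents containing each word.
doc_freq = {word: len(postings) for word, postings in main_index.items()}
```

Keeping the postings as a dictionary per word makes both the document list and the document frequency available from the same structure, which is what later steps rely on.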

Step b) Screen the words in the main inverted list from step a)

Word length and the number of documents containing the word can be used as screening conditions to remove unwanted words (such as single-character words or words that occur rarely). Different screening methods can be adopted as needed, or screening can be omitted. The screening rule used here is that the word length must be greater than or equal to 2 and the word must occur in at least five documents (that is, DF ≥ 5). Which screening conditions to choose depends mainly on the size and content of the document set to be indexed and is determined by experiment.
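The screening rule of step b) can be sketched as a simple filter over a word-to-document-frequency mapping. The function name `screen_words`, its parameter names and the sample counts are assumptions made for this example; the thresholds 2 and 5 are the ones stated in the text.

```python
def screen_words(doc_freq, min_len=2, min_df=5):
    """Keep words that are long enough and occur in enough documents."""
    return {
        word: df
        for word, df in doc_freq.items()
        if len(word) >= min_len and df >= min_df
    }


# Hypothetical word -> document frequency mapping.
doc_freq = {"robot": 12, "a": 40, "robotics": 7, "xqz": 1, "arm": 5}
kept = screen_words(doc_freq)
```

Both thresholds are parameters because the text notes that the conditions should be tuned to the size and content of the indexed document set.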

Step c) Establish the secondary inverted list used by the secondary search engine

The segmented words (or, if screening was performed, the words remaining after screening) are split into single characters (that is, English is split into individual words and Chinese into individual Chinese characters) and indexed, building an inverted list keyed by character whose value is the set of words containing that character. This is called the "secondary inverted list"; its structure is shown in Fig. 4. We call the part that uses the secondary inverted list to index and retrieve words the secondary search engine. The secondary search engine retrieves the words that contain the characters of a query string (in any order). The secondary inverted list also stores the document frequency of each word (that is, the number of documents that contain the word).
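A minimal sketch of step c) follows. It assumes that a run of ASCII letters or digits counts as one unit (an English word) and that every other non-space character, such as a Chinese character, is a unit of its own; the helper names `split_units` and `build_secondary_index` and the sample words are hypothetical.

```python
import re
from collections import defaultdict


def split_units(term):
    # ASCII letter/digit runs stay whole; any other non-space character
    # (for example a Chinese character) becomes its own unit.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", term)


def build_secondary_index(word_doc_freq):
    """Map each unit to the set of indexed words that contain it."""
    index = defaultdict(set)
    for word in word_doc_freq:
        for unit in split_units(word):
            index[unit.lower()].add(word)
    return dict(index)


# Hypothetical words from the main inverted list with their document frequencies.
word_doc_freq = {
    "software engineering": 8,
    "software": 30,
    "北京航空航天大学": 12,
    "航空": 25,
}
secondary = build_secondary_index(word_doc_freq)
```

The document frequencies stay in `word_doc_freq`, so a lookup in `secondary` followed by a lookup in `word_doc_freq` yields each candidate word together with its document count.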

Step d) Retrieve documents with the main search engine

When the user searches, the query string is first segmented into words with the segmentation algorithm used in step a), and the main search engine retrieves the documents containing those words. All retrieved documents are then ranked by relevance (computed with the vector space model commonly used by search engines) to obtain the search results, that is, the sorted sequence of retrieved documents. Any relevance ranking algorithm from the field of information retrieval can be used here, such as TF*IDF or PageRank; the particular ranking algorithm affects only the precision and recall of the document search results, not the effect of retrieval prompting.
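Step d) can be illustrated with a small TF*IDF ranking over a word-keyed inverted list. This is only a sketch under stated assumptions: the text does not fix a weighting, so the smoothed form `log(N / df) + 1` is an assumption, and the function name `search` and the sample index are invented for the example.

```python
import math


def search(query_words, main_index, num_docs):
    """Return document numbers ranked by summed TF*IDF score (best first)."""
    scores = {}
    for word in query_words:
        postings = main_index.get(word, {})
        if not postings:
            continue
        # Assumed IDF form; the text names TF*IDF without fixing the formula.
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; ties are broken by document number.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical main inverted list: word -> {document number: term frequency}.
main_index = {"robot": {1: 1, 2: 1}, "vision": {2: 1, 3: 3}}
ranked = search(["robot", "vision"], main_index, num_docs=3)
```

Swapping in another ranking function changes only the order of `ranked`; the prompt words produced by the secondary search engine do not depend on it.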

Step e) Retrieve words with the secondary search engine

The query string entered by the user in step d) is split into characters, and the secondary search engine retrieves all words that contain each character of the query string, together with the document frequency of each word. All retrieved words are then ranked by priority to obtain the sorted sequence of retrieved words.

Step f) Provide retrieval prompts

When retrieval prompts are presented to the user, the retrieved words are displayed in the order of the word sequence returned by the secondary search engine, and each word is followed by its document frequency (that is, the number of documents that contain the word).

In step e), the present invention uses a priority ranking algorithm for prompt words (described in detail below) to rank all retrieved words by priority. The 10 prompt words with the highest priority (the number can be adjusted as needed) are displayed on the search results page obtained in step d) (typically at the top, at the bottom, or in both places). Each prompt word carries a hyperlink and may also be preceded by a check box; the user can click a single term or tick several terms of interest to run a further, more precise search. The user can also click the "more retrieval prompts" link to view all retrieved prompt words.

The priority ranking algorithm for prompt words used in the present invention is as follows:

The query string sequence entered by the user is split into single characters, that is, sequence = {char[1], char[2], ..., char[n]}. The characters char[1], char[2], ..., char[n] are then matched against the secondary inverted list: if a word word[j] in the secondary inverted list contains char[1], char[2], ..., char[n] (in any order), then word[j] is a prompt word. For each of char[1], char[2], ..., char[n], a similarity score with word[j] is then computed by the formula:

sim(char[i], word[j]) = TF * IDF

where: sim(char[i], word[j]) is the similarity score between character char[i] and prompt word word[j];

TF is the number of times character char[i] occurs in word word[j];

IDF is the logarithm of the reciprocal of the number of words in the secondary inverted index that contain character char[i].

The priority of a prompt word is then computed as follows:

Priority(word[j]) = boost(word[j]) * Σ sim(char[i], word[j])

That is, the similarity scores of char[1], char[2], ..., char[n] with word[j] are summed and then multiplied by the weight of word[j] itself, boost(word[j]), which gives the priority score Priority(word[j]) of word[j]. The weight boost is computed as follows:

boost(word[j]) = sqrt(docFreq(word[j]))

where: docFreq(word[j]) is the frequency with which prompt word word[j] occurs across all documents of the main search engine, that is, the number of documents in the main search engine that contain word[j], and sqrt denotes the square root.

The benefit of weighting prompt words is that words with a high document frequency are prompted first. On the one hand, this keeps the few words mis-segmented in step a) from being ranked near the top (mis-segmented words usually have a very low document frequency). On the other hand, the words that occur most often in documents are likely also the words people use most and care about most, so users can quickly find the words they need.
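The ranking described above can be sketched as follows. Several points are assumptions rather than part of the text: each character of the query is treated as one unit; a word qualifies only if it contains every query character; and IDF is computed as `log(N / n) + 1`, where N is the number of indexed words and n the number containing the character, because the literal "log of the reciprocal of n" is never positive. The function name `rank_prompts` and the sample document frequencies are invented for the example.

```python
import math


def rank_prompts(query, word_doc_freq, top_n=10):
    """Return (word, document frequency, priority) tuples, best first."""
    chars = list(query)  # assumption: one unit per character
    words = list(word_doc_freq)
    total = len(words)
    # Number of indexed words containing each query character.
    containing = {c: sum(1 for w in words if c in w) for c in set(chars)}

    scored = []
    for word in words:
        # Assumed reading: a prompt word contains every query character.
        if not all(c in word for c in chars):
            continue
        sim_sum = 0.0
        for c in chars:
            tf = word.count(c)  # occurrences of the character in the word
            idf = math.log(total / containing[c]) + 1.0  # assumed IDF form
            sim_sum += tf * idf
        boost = math.sqrt(word_doc_freq[word])  # sqrt of document frequency
        scored.append((word, word_doc_freq[word], boost * sim_sum))

    scored.sort(key=lambda item: (-item[2], item[0]))
    return scored[:top_n]


# Hypothetical words with their document frequencies in the main index.
word_doc_freq = {"北京航空航天大学": 16, "北方航空": 4, "航空": 100, "北京": 400}
prompts = rank_prompts("北航", word_doc_freq)
lines = [f"{word} ({df})" for word, df, _ in prompts]
```

Here "航空" and "北京" are excluded because each lacks one of the two query characters, and the `lines` list pairs each prompt word with its document count in the way step f) displays them.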

The present invention does not depend on a specific hardware environment and can use the hardware configuration of existing search engines. By following the steps of this specification, any existing search engine with inverted-list indexing and query capabilities can be modified to achieve the goal of the invention: content-based retrieval prompting.

The differences between the prompts produced by the present invention and the query-statistics-based prompts of existing search engines are set out in the Summary of the invention. Figs. 5-7 give a concrete example (searching for "robot") comparing prompting results with the popular search engines Google and Baidu; DiMoor is a new search engine that the inventors built by following the steps of this specification on top of the open-source search engine Nutch (http://lucene.Apache.org/nutch/). As Figs. 5 and 6 show, with prompting based on query statistics, users repeatedly searching the same hot topic with different query words makes the prompts narrow and inconsistent (in Figs. 5 and 6, because game fans outnumber science fans, almost all prompted words relate to games, and the robot-related words from science are buried). Fig. 7, by contrast, reveals more generally the various words related to the query that the indexed documents contain, and also shows the number of documents containing each prompt word (as Fig. 7 shows, popular-science articles introducing robotics are in fact no fewer online than game walkthroughs about robot cheats).

For example, in a system using our method, when the user enters the term "Beijing Institute of Aeronautics", the system can prompt "Beijing University of Aeronautics and Astronautics", "China Northern Airline" and so on, and can also prompt the names of subordinate units of Beijing University of Aeronautics and Astronautics, such as "Computer Institute of Beijing University of Aeronautics and Astronautics", "Software Engineering Institute of Beijing University of Aeronautics and Astronautics" and "Aerospace Institute of Beijing University of Aeronautics and Astronautics". If the user wants to use a search engine to learn about some subordinate units of the university but does not know their exact names, such retrieval prompts are very valuable.

Claims

1. A method for retrieval prompting based on an inverted list, characterized in that it comprises the following steps:

● Establish the main inverted list used by the main search engine

All documents to be retrieved are segmented into words, and the segmented words are indexed to build an inverted list keyed by word, whose value is the list of numbers of the documents containing that word; this is called the main inverted list. The part that uses the main inverted list to index and retrieve documents is the main search engine, which retrieves the documents containing the words of a query string;

● Establish the secondary inverted list used by the secondary search engine

● Retrieve documents with the main search engine

● Retrieve words with the secondary search engine

● Retrieval prompting

2. The method for retrieval prompting based on an inverted list according to claim 1, characterized in that:

Before the step of establishing the secondary inverted list used by the secondary search engine, the words in the main inverted list used by the main search engine are first screened to remove unwanted words.

3. The method for retrieval prompting based on an inverted list according to claim 2, characterized in that:

When the words in the main inverted list used by the main search engine are screened, the word length and the number of documents containing the word may be used as screening conditions.

4. The method for retrieval prompting based on an inverted list according to claim 3, characterized in that: the screening condition is that the word length must be greater than or equal to 2 and the word must occur in at least five documents.

5. The method for retrieval prompting based on an inverted list according to claim 1, characterized in that:

In the step of retrieving documents with the main search engine, the retrieved documents are ranked by relevance using the TF*IDF algorithm or the PageRank algorithm.