US5745602A - Automatic method of selecting multi-word key phrases from a document - Google Patents

Publication number
US5745602A
Authority
US
United States
Prior art keywords
word
phrase
processor
phrases
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US08/432,383
Inventor
Francine R. Chen
Steven B. Putz
Daniel C. Brotsky
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xerox Corp
Original Assignee
Xerox Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xerox Corp
Priority to US08/432,383
Assigned to XEROX CORPORATION. Assignment of assignors interest (see document for details). Assignors: BROTSKY, DANIEL C., CHEN, FRANCINE R., PUTZ, STEVEN B.
Priority to JP10578696A (JP3653141B2)
Priority to EP96303094A (EP0741364A1)
Application granted
Publication of US5745602A
Assigned to BANK ONE, NA, AS ADMINISTRATIVE AGENT. Security interest (see document for details). Assignors: XEROX CORPORATION
Assigned to JPMORGAN CHASE BANK, AS COLLATERAL AGENT. Security agreement. Assignors: XEROX CORPORATION
Anticipated expiration
Assigned to XEROX CORPORATION. Release by secured party (see document for details). Assignors: JPMORGAN CHASE BANK, N.A. AS SUCCESSOR-IN-INTEREST ADMINISTRATIVE AGENT AND COLLATERAL AGENT TO JPMORGAN CHASE BANK
Expired - Lifetime

Abstract

An automatic method of generating key phrases for a machine readable document. The method begins by breaking the text of the document into multi-word phrases free of stop words which begin and end acceptably. Afterward, the most frequent phrases are selected as key word phrases.

Description

FIELD OF THE INVENTION
The present invention relates to a method of automatic text processing. In particular, the present method relates to an automatic method of selecting key phrases from a machine readable document.
BACKGROUND OF THE INVENTION
A key word list allows a reader to determine the content of a document without reading that document. A key word list for a document can be created subsequent to document creation either automatically or using human intelligence and labor. Using human labor to generate a key word list can be expensive. In contrast, automatic techniques of generating a key word list can be less expensive.
Both natural language processing and statistical techniques have been used to automatically generate key word lists for documents. Natural language processing attempts to understand natural language text and is therefore computationally intensive. Statistical techniques allow quicker generation of key word lists because no effort is made to understand the text. In 1969 Carroll and Roeloffs disclosed a method for selecting key words in "Computer Selection of Keywords Using Word-Frequency Analysis." Carroll and Roeloffs selected key words based upon the relative frequency of words within each document as well as across a document corpus. Because of the use of word frequency across a document corpus, the method of Carroll and Roeloffs is not fast enough without preprocessing for those searchers who desire immediate results or do not possess a corpus of related documents.
SUMMARY OF THE INVENTION
An object of the present invention is to provide an automatic method of key phrase selection that can be executed quickly to produce reasonable key phrases.
Another object of the present invention is to provide an automatic method of key phrase selection that depends neither upon natural language processing, nor upon corpus-dependent information.
An automatic method of generating key phrases for a machine readable document will be described. The method begins by generating from the document text multi-word candidate phrases. Candidate phrases are phrases free of stop words that begin and end acceptably. Afterward, the most frequent candidate phrases are selected as key word phrases.
Other objects, features, and advantages of the present invention will be apparent from the accompanying drawings and detailed description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings. In the accompanying drawings similar references indicate similar elements.
FIG. 1 illustrates a computer system which automatically selects key phrases from a machine readable document.
FIG. 2 is a flow diagram of the method of selecting key phrases from a machine readable document.
FIG. 3 is a flow diagram of the method of generating candidate phrases from phrases.
FIG. 4 illustrates in flow diagram form an alternate method of selecting key phrases.
FIG. 5 illustrates in flow diagram form an alternate method of generating candidate phrases.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
FIG. 1 illustrates in block diagram form computer system 10 in which the present method is implemented. The present method alters the operation of computer system 10, allowing it to select key phrases from any document presented in machine readable form. Briefly described, computer system 10 selects key phrases by breaking the text of the machine readable document into multi-word candidate phrases. Candidate phrases do not include stop words and begin and end with acceptable words. Finally, the most frequent candidate phrases are selected as key phrases. Two methods of selecting key phrases using computer system 10 will be described in detail below.
A. Key Phrase Selection Computer System
Prior to a more detailed discussion of the present method, consider computer system 10. Computer system 10 includes monitor 12 for visually displaying information to a computer user. Computer system 10 also outputs information to the computer user via printer 13. Computer system 10 provides the computer user multiple avenues to input data. Keyboard 14 allows the computer user to input data to computer system 10 by typing. By moving mouse 16 the computer user is able to move a pointer displayed on monitor 12 and to select displayed icons. The computer user may also input information to computer system 10 by writing on tablet 18 with a stylus or pen 20. Alternately, the computer user can input data stored in machine readable form on a magnetic medium, such as a floppy disk, by inserting the disk into floppy disk drive 22. Optical character recognition unit (OCR unit) 24 permits the computer user to input hardcopy documents 26 into computer system 10, which OCR unit 24 then converts into a coded electronic representation, typically American Standard Code for Information Interchange (ASCII).
Processor 11 controls and coordinates the operations of computer system 10 to execute the commands of the computer user. Processor 11 determines and takes the appropriate action in response to each user command by executing instructions stored electronically in memory. Typically, operating instructions for processor 11 are stored in solid state memory 28, allowing frequent and rapid access to the instructions. Semiconductor memory devices that can be used to realize memory 28 include read only memories (ROM), random access memories (RAM), dynamic random access memories (DRAM), programmable read only memories (PROM), erasable programmable read only memories (EPROM), and electrically erasable programmable read only memories (EEPROM), such as flash memories.
B. One Method of Selecting Key Phrases
FIG. 2 illustrates in flow diagram form the instructions 40 executed by processor 11 to select key phrases from a machine readable document. Instructions 40 may be stored in solid state memory 28 or on a floppy disk placed within floppy disk drive 22. Instructions 40 may be realized in any computer language, including LISP and C++. Execution of instructions 40 is initiated by selection and input of a machine readable document. If desired, prior to initiating execution of instructions 40 the computer user may also change the number of key phrases selected, denoted "P," from the default number. The default number may be set to any arbitrary value. In one embodiment, the default value is set to five key phrases.
Processor 11 responds to the selection of a tokenized document by branching to step 42. As used herein, a tokenized document is one for which sentence boundaries and word tokens have been identified. During step 42 processor 11 examines the tokenized document and generates multi-word phrases. That is to say, processor 11 extracts from each sentence non-overlapping phrases of two or more words. Stop words are preferably excluded from the phrases generated during step 42 so that each word of a phrase conveys meaning relevant to the document theme. Stop words are words such as pronouns, prepositions, determiners, and "to be" verbs that convey little meaning relevant to document theme. Excluding stop words from phrases has the advantage of producing compact key phrases and reducing the processing time required during steps subsequent to step 42. Processor 11 excludes stop words by comparing each word token of each sentence to the words of a stop list. Processor 11 ends one phrase and begins another whenever it encounters a stop word in a sentence. Consequently, the phrases generated are composed of adjacent terms. As a result of efforts during step 42 a list of phrases is generated. The phrase list complete, processor 11 branches to step 43 from step 42.
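The phrase generation of step 42 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `generate_phrases`, the small `STOP_WORDS` set, and the choice to compare words to the stop list in lower case are all assumptions introduced here.

```python
# Hypothetical, deliberately tiny stop list; a real stop list would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "is", "was", "it", "and"}

def generate_phrases(sentences, stop_words=STOP_WORDS):
    """Split each tokenized sentence into runs of adjacent non-stop words.

    `sentences` is a list of sentences, each a list of word tokens.
    Only runs of two or more words are kept as phrases.
    """
    phrases = []
    for tokens in sentences:
        current = []
        for token in tokens:
            if token.lower() in stop_words:
                # A stop word ends the current phrase and begins another.
                if len(current) >= 2:
                    phrases.append(current)
                current = []
            else:
                current.append(token)
        # The end of the sentence also ends the current phrase.
        if len(current) >= 2:
            phrases.append(current)
    return phrases
```

For the single sentence `["The", "Southern", "Pacific", "Company", "exerted", "great", "influence", "in", "California"]`, this sketch yields one phrase, "Southern Pacific Company exerted great influence"; the lone word "California" is dropped because it does not form a multi-word phrase.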
During step 43 processor 11 determines for subsequent use the frequency within the document of each word on the phrase list. Depending upon the tokenizer used during step 42, processor 11 may be able to determine the frequency of each word on the phrase list by consulting a term list, which lists each word of the document and identifies each sentence in which that word occurs. With such a list, processor 11 need only count the number of sentence IDs for each word on the phrase list. Afterward, processor 11 branches from step 43 to step 44.
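A minimal sketch of step 43 follows, assuming a term list that records one sentence ID per occurrence of a word, so that counting IDs gives the number of occurrences. The names `build_term_list` and `word_frequencies`, and the lower-casing of words, are assumptions made for this illustration; the text does not specify how the tokenizer builds its term list.

```python
from collections import defaultdict

def build_term_list(sentences):
    """Map each (lower-cased) word to the sentence IDs in which it occurs.

    Assumption: one ID is recorded per occurrence, so a word appearing twice
    in sentence 1 contributes the ID 1 twice.
    """
    term_list = defaultdict(list)
    for sentence_id, tokens in enumerate(sentences):
        for token in tokens:
            term_list[token.lower()].append(sentence_id)
    return term_list

def word_frequencies(phrases, term_list):
    """Return the frequency of every word that appears on the phrase list."""
    freqs = {}
    for phrase in phrases:
        for word in phrase:
            key = word.lower()
            # Frequency is simply the number of sentence IDs listed for the word.
            freqs[key] = len(term_list.get(key, []))
    return freqs
```

Because the frequencies come from a list built during tokenization, this sketch never rescans the document text itself.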
During step 44 processor 11 generates candidate phrases from the phrases on the phrase list. Processor 11 considers a number of factors while generating candidate phrases. Processor 11 examines the beginning and ending words of a phrase to determine whether they are appropriate for a candidate phrase. This ensures that the key phrases selected subsequently will appear reasonable. How processor 11 performs these tasks will be discussed in greater detail later with respect to FIG. 3. During step 44 processor 11 also examines each word of a phrase to determine whether that word is frequent. The frequency of words within phrases and the frequency of the phrases themselves are used to select key phrases for a document because of the belief that the most frequent phrases are most likely to be indicative of document content. Processor 11 considers a word frequent if it occurs in the document at least a minimum number of times. That is to say, processor 11 compares the number of occurrences of a word within the document to a threshold. If the number of occurrences exceeds the threshold, processor 11 considers the term frequent. Infrequent terms are excluded from candidate phrases. For brief documents the threshold is preferably set to one. As a result, only terms occurring at least twice are considered frequent. For longer documents, a higher threshold may be desirable. Armed with a list of candidate phrases, processor 11 branches from step 44 to step 46.
With step 46, processor 11 begins the task of selecting P key phrases from the list of candidate phrases. Processor 11 starts by sorting the candidate phrase list according to the number of occurrences within the document of each candidate phrase. Candidate phrases which occur frequently are placed higher on the sorted list of candidate phrases than candidate phrases that occur less frequently. Ties between candidate phrases can be sorted in a number of fashions, including by candidate phrase length measured in terms of number of words or characters, according to which phrase candidate includes the most frequent word, or in terms of highest average word frequency. As a result of step 46 processor 11 possesses an ordered list of candidate phrases. Afterward, processor 11 branches to step 48 from step 46.
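The sort of step 46 might look like the following sketch. The function name `sort_candidates` and the representation of phrases as tuples of words are assumptions; so is the tie-breaking rule, which here places longer phrases (measured in words) first. The text lists phrase length as one possible tie-breaker but does not say which direction it should run.

```python
def sort_candidates(counts):
    """Order candidate phrases for step 46.

    `counts` maps each candidate phrase (a tuple of words) to its number of
    occurrences. Phrases are sorted by descending count; ties are broken by
    descending length in words (an assumed choice among those the text allows).
    """
    return sorted(counts, key=lambda phrase: (-counts[phrase], -len(phrase)))
```

Usage example:

```python
counts = {
    ("great", "influence"): 2,
    ("Southern", "Pacific", "Company"): 2,
    ("text", "analysis"): 3,
}
ordered = sort_candidates(counts)
# ordered[0] is ("text", "analysis"); the three-word phrase precedes the
# two-word phrase among the candidates tied at two occurrences.
```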
During step 48 processor 11 prepares to begin selecting key phrases from the candidate phrase list by setting the number of key phrases selected to zero. That done, processor 11 branches to step 50. Processor 11 determines during step 50 whether P key phrases have been selected yet. All key phrases have not yet been selected if the number selected does not equal P. Processor 11 responds to this situation by branching to step 52 from step 50.
Processor 11 examines the candidate phrase on the top of the sorted candidate phrase list during step 52. For brevity, call that phrase the "current phrase." Processor 11 determines in step 52 whether the current phrase is a variant of one of the already selected key phrases. As used herein, a variant is a phrase that is related to another phrase, but differing in word order, or word stem. For example, possible variants of "text analysis system" include "system analyzes text," "document analysis system," and "document processing system." A number of automatic text processing techniques can be used to perform variant analysis; therefore, variant analysis will not be discussed in detail herein.
Based upon the variant analysis, processor 11 takes one of two paths from step 52. If the candidate phrase at the top of the sorted candidate phrase list is not a variant of one of the key phrases, processor 11 branches to step 54 from step 52. During step 54 processor 11 removes the current candidate phrase from the sorted candidate phrase list and places the current candidate phrase on the key phrase list. Afterward, processor 11 advances to step 56 from step 54 and increments by one the number of key phrases selected. That done, processor 11 returns to step 50.
The actions of processor 11 differ when the variant analysis of step 52 indicates that the current candidate phrase is a variant of one of the key phrases. In response, processor 11 branches to step 58 from step 52. During step 58 processor 11 removes the current candidate phrase from the sorted candidate phrase list and then modifies the key phrase list, if appropriate. In one embodiment the phrase already on the key phrase list will be removed and replaced if it is a subphrase of the phrase just selected from the sorted candidate phrase list. Thus, for example, processor 11 would exclude the subphrase "Southern California" rather than "Southern California coast." Other methods of determining which variant to exclude can be used during step 58, such as excluding the least frequent variant of a phrase. Afterward, processor 11 returns to step 50 from step 58.
Upon return to step 50, processor 11 determines whether P key phrases have been selected. If not, processor 11 branches through steps 52, 54, 56, and 58 until P key phrases have been selected from the sorted candidate phrase list. When that occurs, processor 11 branches from step 50 to step 60, selection of key phrases for the document complete.
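The loop of steps 48 through 60 can be sketched as below. Because the text leaves variant analysis open, this sketch substitutes a deliberately simple stand-in: two phrases are treated as variants only when one is a contiguous subphrase of the other. That test, the helper names `is_subphrase` and `select_key_phrases`, the tuple representation of phrases, and the early exit when the candidate list runs out are all assumptions of this illustration.

```python
def is_subphrase(short, long):
    """True if the word tuple `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def select_key_phrases(sorted_candidates, p=5):
    """Select up to `p` key phrases from candidates already sorted by frequency."""
    key_phrases = []                              # step 48: none selected yet
    remaining = list(sorted_candidates)
    # Step 50: continue until P phrases are selected. Stopping when the
    # candidate list is exhausted is an added safeguard, not in the text.
    while len(key_phrases) < p and remaining:
        current = remaining.pop(0)                # step 52: top of the list
        variant_index = None
        for i, key in enumerate(key_phrases):
            # Stand-in variant test (assumption): subphrase containment only.
            if is_subphrase(key, current) or is_subphrase(current, key):
                variant_index = i
                break
        if variant_index is None:
            key_phrases.append(current)           # steps 54 and 56
        elif is_subphrase(key_phrases[variant_index], current):
            # Step 58: the longer phrase replaces its subphrase.
            key_phrases[variant_index] = current
        # Otherwise the current phrase is itself a subphrase and is dropped.
    return key_phrases
```

With the sorted candidates "Southern California," "text analysis," "Southern California coast," and "great influence" and P set to three, the sketch replaces "Southern California" with "Southern California coast" and then goes on to select "great influence" as the third key phrase.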
B1. Generation of Candidate Phrases
FIG. 3 illustrates in detail the activities of step 44 to break phrases into candidate phrases that are maximally long and begin and end acceptably. Briefly described, processor 11 begins by examining each word of the selected phrase a word at a time to determine whether that word is frequent. Because the candidate phrases generated during step 44 are composed entirely of adjacent and frequent terms, one phrase may generate multiple candidate phrases or none, depending upon the length of the phrase and the location of infrequent terms within the phrase. Once the first frequent word of the selected phrase is identified, processor 11 determines whether that word represents an acceptable beginning for a candidate phrase. After identifying an acceptable beginning word for a candidate phrase, processor 11 continues building the candidate phrase from frequent terms of the selected phrase until the last word of the candidate phrase is identified. Processor 11 then examines the last word of the candidate phrase to determine whether it represents an acceptable ending for a candidate phrase. If not, processor 11 removes words from the end of the candidate phrase until an acceptable ending word is discovered. Processor 11 then determines whether the resulting candidate phrase is of sufficient length. Processor 11 stores the candidate phrase if it includes a sufficient number of words.
Given that introduction, consider now a situation that aids the detailed discussion of instructions 44. First, assume the list of phrases generated during step 42 includes: "Southern Pacific Company exerted great influence," "four years later," and "fee versus free." Second, assume also that the words occurring more than once within the document include: "Southern," "Pacific," "Company," "great," "influence," "years," "later," "versus" and "free." Third, assume that the bad beginning list includes: "versus." Fourth and finally, assume that the bad ending list includes: "versus" and "later."
Generation of candidate phrases begins in step 70 with the selection of one of the phrases from the phrase list. Assume that processor 11 selects "Southern Pacific Company exerted great influence" the first pass through step 70. Afterward, processor 11 branches from step 70 to step 72.
During step 72 processor 11 selects for examination one of the words of the selected phrase. Preferably, examination of the words of the selected phrase proceeds sequentially from left to right. The selected phrase may also be examined by proceeding sequentially from right to left provided that instructions 44 are modified to check for an acceptable ending prior to checking for an acceptable beginning. Regardless of the direction processor 11 proceeds in its examination of the words of the selected phrase, the words must be examined sequentially to ensure that each candidate phrase generated is composed of adjacent terms. Processor 11 preferably selects "Southern" in its first pass through step 72. Having selected a word from the selected phrase, processor 11 branches from step 72 to step 74.
Processor 11 determines during step 74 whether the selected word is frequent. Processor 11 does so by comparing the number of occurrences of the selected word to a threshold. The value of the threshold is a design choice dependent upon the length of the document for which the key phrases are being generated. In one embodiment, the threshold is set to one so that each word must occur at least twice to be considered frequent.
As a result of step 74, phrases are broken into maximally long, non-overlapping subphrases. Thus, for example, the phrase "New Mexican border" produces only the candidate phrase "New Mexican border," not the subphrases "New Mexican" and "Mexican border." Using only maximally long candidate phrases may produce spurious candidate phrases; however, these candidate phrases are unlikely to be selected as key phrases because of their low frequency of occurrence. In contrast, subphrases generated from maximally long candidate phrases are not likely to be excluded as key phrases because they are likely to occur more frequently because of their smaller number of words. Consequently, producing reasonable key phrases using subphrases of maximally long candidate phrases requires modifying the present method.
Because "Southern" is a frequent word given our assumptions, processor 11 responds by branching to step 76 from step 74. Processor 11 enters step 76 when a potential beginning word of a candidate phrase has been identified. Processor 11 determines during step 76 whether the selected word represents an acceptable beginning for a candidate phrase. Processor 11 does so by searching a bad beginning list for the selected word. The bad beginning list includes words that are not acceptable beginnings for a key phrase. The bad beginning list for English language text is likely to be brief; however, the tendency is to include a word on the bad beginning list when in doubt to reduce the possibility of generating key phrases that appear spurious or unreasonable. For non-English documents, different words should be included on the bad beginning list. For example, the French equivalent for "of," "de," should not be included on the stop word list because French noun phrases are of the form "noun de adjective." To prevent generating key phrases beginning "de adjective," "de" should be included on a French bad beginning list.
The word "Southern" constitutes an acceptable beginning for a key phrase given our assumptions; therefore, processor 11 branches from step 76 to step 78.
Processor 11 begins the process of building a new candidate phrase during step 78, which shall be referred to as the current candidate phrase. During step 78 processor 11 adds the selected word to the current candidate phrase. That done, processor 11 begins the effort to add adjacent, frequent terms from the selected phrase to the current phrase candidate by advancing to step 80 from step 78. Processor 11 determines during step 80 whether the selected phrase includes any additional terms that have yet to be examined. Processor 11 has not yet examined all the words of the selected phrase and so branches to step 81 from step 80. During step 81 processor 11 selects the next word of the selected phrase for possible inclusion in the current phrase candidate. Given the selected phrase and proceeding from left to right sequentially, processor 11 selects "Pacific" during step 81. Subsequently, during step 82, processor 11 decides that the selected word is frequent. In response, processor 11 returns to step 78 from step 82. Processor 11 adds "Pacific" to the current phrase candidate during step 78, which becomes "Southern Pacific" as a result. That done, processor 11 advances to step 80 and discovers that the selected phrase includes words that have yet to be examined.
Processor 11 selects "Company" during step 81 and advances to step 82. Processor 11 discovers that the selected word is frequent because it occurs more than once in the document. Consequently, processor 11 branches to step 78 from step 82 and adds the selected word to the current candidate phrase. As a result, the current phrase candidate becomes "Southern Pacific Company." Afterward processor 11 branches to step 80 from step 78.
During step 80 processor 11 discovers that the selected phrase includes words that have not been examined yet. Accordingly, during step 81 processor 11 selects the next word of the selected phrase, "exerted." Processor 11 discovers during the subsequent step that "exerted" is not a frequent word within the selected document. The occurrence of an infrequent word adjacent to the right-most word of the current phrase candidate ends it. As a result, processor 11 will not add the selected word, nor any others, to the current phrase candidate. Processor 11 responds to this situation by branching to step 84 from step 82.
During step 84 processor 11 determines whether the last word of the current candidate phrase is an acceptable ending by searching for that word on the bad ending list. Words on the bad ending list are those that may cause a key phrase to appear spurious or unreasonable. As with the bad beginning list, words placed on the bad ending list may vary depending upon the language of the natural language text being analyzed. Given our previous assumptions, "Company" represents an acceptable ending. Having generated a candidate phrase composed entirely of adjacent, frequent terms and that ends and begins acceptably, processor 11 advances to step 88 from step 84.
Processor 11 determines during step 88 whether the current candidate phrase includes more than one word. Single word phrases are not selected as key phrases according to the present method because without linguistic information about the word it is likely to appear spurious on a key phrase list. Rather than taking the time to obtain such linguistic information, single word phrases are not accepted as phrase candidates. Because the current candidate phrase includes more than one word, processor 11 advances to step 90 from step 88.
Processor 11 compares the current candidate phrase to the phrase candidates listed to date during step 90. As the current candidate phrase is the first one generated, the first pass through step 90 processor 11 finds that the current candidate phrase is not on the list of candidate phrases. In response, processor 11 adds the current candidate phrase to the list of candidate phrases during step 94 and sets to one the count for that candidate phrase. Later, processor 11 uses the counts associated with candidate phrases to select key phrases. Afterward, processor 11 branches to step 96 from step 94 to begin construction of another candidate phrase.
Efforts to construct another candidate phrase begin with step 96 by determining whether all words of the selected phrase have been examined. The words "great influence" of the selected phrase have not yet been examined, so processor 11 responds by returning to step 72 from step 96 to continue its examination of the selected phrase. Processor 11 selects "great" as the selected word during step 72. Afterward, processor 11 branches through steps 74, 76, 78, 80, 81, 82, 84, and 88 in the manner just described and builds another candidate phrase, "great influence," from the selected phrase. Eventually processor 11 branches to step 90 from step 88. If the current candidate phrase is already included on the list of candidate phrases, processor 11 branches to step 92 from step 90. During step 92 processor 11 increments by one the count of the current candidate phrase. That done, processor 11 branches from step 92 to step 96.
Upon return to step 96 processor 11 discovers that all words of the selected phrase have been examined. Consequently, processor 11 advances to step 70 from step 96. During step 70 processor 11 selects "four years later" as the selected phrase. Subsequently, during step 72 processor 11 designates "four" as the selected word. Processor 11 discovers during step 74 that "four" is not a frequent word within the selected document. In response, processor 11 advances to step 96 from step 74. During step 96 processor 11 determines that the selected phrase includes words that have not yet been examined. Processor 11 returns to step 72 from step 96 to select the next word of the selected phrase. Processor 11 selects "years" as the selected word and determines that the selected word is frequent. Consequently, processor 11 advances to step 76. During step 76 processor 11 searches the bad beginning list for "years" and does not find it. Thus, "years" represents an acceptable beginning.
Processor 11 continues building the current candidate phrase by branching to step 78 from step 76. The selected word is added to the current candidate phrase during step 78. In the following step, step 80, processor 11 determines whether the selected phrase includes any other words that have not yet been examined. The selected phrase does, so during step 81 processor 11 designates "later" as the selected word. Processor 11 then discovers during step 82 that "later" is a frequent word within the selected document. Processor 11 responds by branching to step 78 and adding the selected word to the current candidate phrase. As a result of this action, the current candidate phrase becomes "years later." Afterward, processor 11 branches from step 78 to step 80.
During step 80 processor 11 determines whether additional words can be added to the current candidate phrase by determining whether the selected phrase includes any additional words. Processor 11 has examined all words of the selected phrase so there will be no further additions to the current candidate phrase. Processor 11 responds by advancing to step 84 from step 80. During step 84 processor 11 determines whether the current candidate phrase ends acceptably by searching the bad ending list for "later." Processor 11 responds to the discovery of "later" on the bad ending list by branching from step 84 to step 86. During that step processor 11 removes from the current candidate phrase the last word, making the current candidate phrase "years." Afterward, processor 11 returns to step 84 from step 86 to examine once again the last word of the current candidate phrase. Processor 11 does not find "years" on the bad ending list and responds by branching to step 88 from step 84. During step 88 processor 11 determines whether the current candidate phrase is a multi-word phrase. The current candidate phrase includes only one word, so processor 11 discards the current candidate phrase and branches up to step 96 from step 88.
Processor 11 discovers during step 96 that it must select another phrase for examination because all words of the currently selected phrase have already been examined. As a result, processor 11 advances to step 98 and discovers that there are additional phrases that it has not examined yet. Processor 11 returns to step 70 and selects the phrase "fee versus free." Subsequently, processor 11 selects "fee" for examination and branches from step 72 to step 74.
Processor 11 discovers during step 74 that "fee" is an infrequent word. In response, processor 11 returns to step 72 and selects the next word of the selected phrase, "versus." Processor 11 regards "versus" as a frequent word because it appears more than once within the selected document. Accordingly, processor 11 branches to step 76 from step 74. Processor 11 searches the bad beginning list during step 76 for the selected word and discovers it there. In response, processor 11 branches from step 76 to step 96. Not all words of the selected phrase have been examined yet so processor 11 returns to step 72 from step 96. Processor 11 selects another word during step 72 and advances to step 74. Processor 11 determines that the selected word, "free," is a frequent term within the selected document during step 74. Further, during the following step processor 11 determines that the selected word is an acceptable beginning. In response, processor 11 branches to step 78 and executes steps 78, 80, 84, 88, 96, and 98 in the manner previously described. Processor 11 continues executing instructions 44 until it is discovered during step 98 that all phrases have been examined.
When that occurs, processor 11 branches to step 100 from step 98, having completed the task of generating phrase candidates.
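The walk through FIG. 3 can be condensed into the following Python sketch. It is an illustration under stated assumptions rather than the patented instructions: the function name `generate_candidates`, the dictionary of word frequencies, the tuple keys, and the exact-case word matching are all choices made here. Step numbers in the comments refer to the steps described above.

```python
def generate_candidates(phrases, freq, bad_begin, bad_end, threshold=1):
    """Break phrases into maximally long candidate phrases and count them.

    phrases:   list of phrases, each a list of words (output of step 42)
    freq:      dict mapping each word to its number of occurrences (step 43)
    bad_begin: set of words that may not begin a candidate phrase
    bad_end:   set of words that may not end a candidate phrase
    threshold: a word is frequent if it occurs more than `threshold` times
    """
    counts = {}
    for phrase in phrases:                            # steps 70 and 98
        i = 0
        while i < len(phrase):                        # steps 72 and 96
            word = phrase[i]
            # Steps 74 and 76: look for a frequent, acceptable beginning word.
            if freq.get(word, 0) <= threshold or word in bad_begin:
                i += 1
                continue
            current = [word]                          # step 78
            i += 1
            # Steps 80 to 82: extend with adjacent frequent words.
            while i < len(phrase) and freq.get(phrase[i], 0) > threshold:
                current.append(phrase[i])
                i += 1
            # Steps 84 and 86: strip unacceptable ending words.
            while current and current[-1] in bad_end:
                current.pop()
            # Step 88: single-word candidates are discarded.
            if len(current) > 1:
                key = tuple(current)
                counts[key] = counts.get(key, 0) + 1  # steps 90, 92, 94
    return counts
```

Run on the three example phrases with the assumed frequent words, bad beginning list, and bad ending list given earlier, the sketch produces exactly two candidate phrases, "Southern Pacific Company" and "great influence," each with a count of one. "Years later" is trimmed to the single word "years" and discarded, and "fee versus free" yields only the single word "free," which is likewise discarded.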
C. Alternate Method of Selecting Key Phrases
FIG. 4 illustrates in flow diagram form alternate instructions 40a for selecting key phrases from a document in machine readable form. Instructions 40a may be stored in solid state memory 28 or on a floppy disk placed within floppy disk drive 22. Instructions 40a may be realized in any computer language, including LISP and C++.
Instructions 40a differ from instructions 40 in that processor 11 may not necessarily select the same phrases as key phrases as would be selected using instructions 40. Instructions 40a also differ from instructions 40 by permitting processor 11 to select key phrases more quickly. Instructions 40a permit processor 11 to extract the information required from the document in a single pass, rather than requiring two passes as required by instructions 40. Instructions 40a achieve this speed advantage through increased memory use as compared to instructions 40. Despite these differences, instructions 40a closely resemble instructions 40. Because of this resemblance, FIG. 4 illustrates only steps 44a and 45. Instructions 40a include no analog to steps 42 or 46. FIG. 4 does not illustrate steps 48-60 because these steps are essentially identical for both methods of selecting key phrases. Consequently, steps 48-60 need not be described in the following discussion of instructions 40a.
Processor 11 begins execution of instructions 40a with step 44a. During step 44a processor 11 generates a table of candidate phrases by identifying stop words, and acceptable beginning and ending words. During step 44a processor 11 does not consider whether the words included within a candidate phrase are frequent.
Before discussing how the table of candidate phrases is built during step 44a, consider first the contents of the phrase table. The phrase table includes a phrase count and two representations of each candidate phrase: a generic form representation and a surface form representation. If these representations differ at all, they differ as to capitalization of the words of the candidate phrase. The generic form representation of the candidate phrase is a downcased version of the candidate phrase, which may not occur within the document. Processor 11 uses generic form representations as keys into the phrase table by determining the generic form for the candidate phrase and searching for that generic form representation within the phrase table. If processor 11 encounters the generic form representation of a candidate phrase within the phrase table, then that candidate phrase need not be added to the phrase table. Instead, processor 11 increments the phrase count associated with the generic form. The surface form representation represents one of the occurrences of the candidate phrase as actually capitalized. The surface form representation permits processor 11 to present to the computer user each key phrase as actually capitalized at least once within the document. Preferably, the surface form representation always represents the occurrence of the candidate phrase with the fewest capital letters.
Processor 11 represents both the generic and surface forms of candidate phrases as strings of word IDs. Each word ID is an integer number unique to one ASCII representation of a word. Consequently, different capitalizations of the same word will have different word IDs because of the differing ASCII representations. For example, the phrases "hate speech" and "Hate speech" have different ASCII representations, and the words "hate" and "Hate" have different word IDs. Processor 11 obtains the word IDs from a word ID table. Processor 11 generates the word ID table during step 44a, concurrently with the phrase table. Each time a word is selected for examination during step 44a, processor 11 searches the word ID table for that word's ASCII representation. If the word ID table does not include the word's ASCII representation, processor 11 adds that representation to the word ID table and assigns a unique integer number to function as the word's ID. Processor 11 stores other useful information in the word ID table to speed the generation of the phrase table. Prior to beginning analysis of the document, processor 11 initializes the word ID table by adding the words from the stop, bad beginning, and bad ending lists to the table and setting the flag or flags associated with each word. Thus, for example, when adding the stop word "the" to the word ID table, the stop word flag associated with "the" will be set. As a consequence of adding the words of these lists to the word ID table, processor 11 need consult only the word ID table to retrieve all information specific to a particular word.
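The word ID table described above can be sketched as a hash map from a word's exact spelling to an integer ID plus three flags. The following is an illustrative sketch only. The names `WordEntry`, `WordIdTable`, and `lookup` are hypothetical, and the rule that a newly seen capitalization inherits the flags of its lowercase form is an assumption the specification does not state.

```python
from dataclasses import dataclass


@dataclass
class WordEntry:
    word_id: int                # unique per exact spelling of the word
    is_stop: bool = False
    bad_beginning: bool = False
    bad_ending: bool = False


class WordIdTable:
    def __init__(self, stop_words=(), bad_beginnings=(), bad_endings=()):
        self._entries = {}      # dict lookup stands in for "known hashing techniques"
        # Seed the table from the three lists before the document is read.
        for word in stop_words:
            self.lookup(word).is_stop = True
        for word in bad_beginnings:
            self.lookup(word).bad_beginning = True
        for word in bad_endings:
            self.lookup(word).bad_ending = True

    def lookup(self, word):
        """Return the entry for word, adding it with a fresh ID if absent."""
        entry = self._entries.get(word)
        if entry is None:
            # Assumption: a new capitalization copies the lowercase form's flags.
            base = self._entries.get(word.lower())
            entry = WordEntry(
                word_id=len(self._entries),
                is_stop=base.is_stop if base else False,
                bad_beginning=base.bad_beginning if base else False,
                bad_ending=base.bad_ending if base else False,
            )
            self._entries[word] = entry
        return entry


# Hypothetical usage.
table = WordIdTable(stop_words=["the"], bad_beginnings=["versus"])
```

Because every flag lives in the same entry as the ID, a single lookup answers all three list-membership questions for a word, which is the speed benefit the paragraph describes.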
Known hashing techniques can be used to efficiently locate information within the word ID table and the phrase table during the execution of instructions 44a. Consequently, there will be no discussion of how processor 11 retrieves information from these tables while executing instructions 44a.
Equipped with that description of the phrase table and word ID table, consider FIG. 5, which illustrates in detail instructions 44a for generating candidate phrases. Instructions 44a generate candidate phrases in substantially the same manner as discussed previously with respect to instructions 44. Consequently, the following discussion assumes knowledge of that previous discussion and focuses on the differences between the two methods of generating candidate phrases. Differences between instructions 44 and 44a arise because instructions 44a generate candidate phrases from the tokenized document, which includes stop words, without any a priori knowledge of word frequency within the document. As a consequence, instructions 44a search for stop words but not infrequent terms. Not using word frequency to end candidate phrases increases both the average length and the number of candidate phrases, as compared to the candidate phrases generated using instructions 44.
Execution of instructions 44a begins with step 70a. During step 70a processor 11 selects a sentence as a possible source of candidate phrases, rather than a phrase as is the case during step 70. Afterward, during step 72a, processor 11 designates as the selected word one of the words of the selected sentence. From step 72a processor 11 advances to step 74a. During step 74a processor 11 determines whether the selected word is a stop word by consulting the appropriate entry in the word ID table and determining whether the associated stop word flag is set. If so, the selected word is not an acceptable word for a phrase and processor 11 advances to step 96. Execution of steps 96 and 98 proceeds in essentially the same manner discussed previously. On the other hand, if the selected word is not a stop word, then processor 11 branches to step 76.
From step 76 generation of candidate phrases proceeds in substantially the same manner discussed previously with respect to instructions 44 with three minor differences. First, processor 11 consults the word ID table during steps 76, 82a, and 86 to determine whether the selected word is on any of the bad beginning, bad ending, or stop lists, rather than consulting the lists themselves. If processor 11 cannot find the selected word in the word ID table, then during step 76 processor 11 adds an entry for that word to the table. Second, during step 82a, processor 11 excludes words from the current phrase based upon whether they are stop words, rather than their frequency within the document, as is the case during step 82 of FIG. 3.
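The single-pass candidate generation of instructions 44a can be approximated by the following sketch, which scans one tokenized sentence. It is not the flow of FIG. 5 step for step. The function name `candidate_phrases` is hypothetical, plain sets stand in for the flags of the word ID table, and resuming the scan at the stop word that ended the phrase is an assumption.

```python
def candidate_phrases(sentence_words, stop_words, bad_beginnings, bad_endings):
    """Return multi-word candidate phrases found in one sentence.

    A phrase starts at a word that is neither a stop word nor a bad
    beginning, extends until a stop word or the end of the sentence, and
    is then trimmed of trailing bad-ending words.
    """
    candidates = []
    i, n = 0, len(sentence_words)
    while i < n:
        word = sentence_words[i].lower()
        if word in stop_words or word in bad_beginnings:
            i += 1                      # not an acceptable beginning
            continue
        j = i
        while j < n and sentence_words[j].lower() not in stop_words:
            j += 1                      # extend the phrase up to a stop word
        phrase = list(sentence_words[i:j])
        while phrase and phrase[-1].lower() in bad_endings:
            phrase.pop()                # remove unacceptable endings
        if len(phrase) > 1:             # only multi-word phrases qualify
            candidates.append(phrase)
        i = j                           # assumption: resume at the stop word
    return candidates


# Hypothetical usage.
found = candidate_phrases(
    "the free speech versus".split(),
    stop_words={"the"}, bad_beginnings={"versus"}, bad_endings={"versus"},
)
```

Note that word frequency plays no part in this sketch, which is consistent with the observation above that instructions 44a tend to yield longer and more numerous candidates than instructions 44.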
After generating a candidate phrase, processor 11 advances to step 90, ready to determine how to modify the phrase table. Processor 11 begins this task by generating the generic form and surface form representations of the current candidate phrase using the word ID table and then searching the phrase table for the generic form representation of the current candidate phrase. Discovery of the generic form representation in the phrase table indicates that the current candidate phrase is already included within the phrase table. In response, processor 11 proceeds to step 92 to increment the count associated with the candidate phrase. During step 92 processor 11 may also modify the surface form representation stored in the phrase table if the stored representation includes more uppercase words than the surface form of the current candidate phrase. Preferably, no modification of the stored surface form representation is made when the current phrase includes more uppercase letters than the stored surface form representation. On the other hand, if processor 11 cannot locate the generic form representation of the current candidate, then processor 11 exits step 90, bound for step 94. During step 94 processor 11 adds both the generic form representation and the surface form representation of the current phrase to the phrase table, as well as setting the associated phrase count to one.
After generating all possible candidate phrases during step 44a, processor 11 advances to step 45a, illustrated in FIG. 4. During step 45a processor 11 selects a subset of the candidate phrases from the phrase table. Processor 11 does so by selecting a subset of the most frequently occurring candidate phrases within the document. The number of phrases selected during step 45a should exceed the number of key phrases to be output, P, but is otherwise a design choice. After executing step 45a, selection of key phrases proceeds as discussed previously.
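The subset selection of step 45a amounts to ranking candidate phrases by count and keeping more than P of them. A minimal sketch follows. The function name `select_candidate_subset`, the `subset_size` parameter, and the alphabetical tie-break are hypothetical choices made only to keep the result deterministic.

```python
def select_candidate_subset(phrase_counts, subset_size, p):
    """Return the subset_size most frequent phrases, most frequent first.

    phrase_counts maps a phrase (a tuple of words) to its phrase count.
    subset_size must exceed p, the number of key phrases to be output.
    """
    if subset_size <= p:
        raise ValueError("subset_size should exceed the number of key phrases P")
    # Sort by descending count; ties are broken alphabetically (assumption).
    ranked = sorted(phrase_counts.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:subset_size]]


# Hypothetical usage with made-up counts.
counts = {
    ("free", "speech"): 5,
    ("hate", "speech"): 3,
    ("first", "amendment"): 3,
    ("fee", "schedule"): 1,
}
subset = select_candidate_subset(counts, subset_size=3, p=2)
```

How far `subset_size` should exceed P is left open here, as it is in the text, which calls it a design choice.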
D. Summary
Thus, a method of selecting multi-word key phrases from a machine readable document has been described. The method begins by breaking the text of the document into multi-word phrases free of stop words that begin and end acceptably. Afterward, the most frequent phrases are selected as key word phrases.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (4)

What is claimed is:
1. An automatic method of selecting key phrases from a document presented in machine readable form to a processor, the document including a first multiplicity of words, some of the words forming phrases, the processor implementing the method by executing instructions stored in a memory device coupled to the processor, the method comprising the processor implemented steps of:
a) generating from the document a multiplicity of phrases not including stop words;
b) generating candidate phrases from the multiplicity of phrases, candidate phrases including more than one word and being composed of maximally long, non-overlapping subphrases, generating candidate phrases by the substeps of:
b1) selecting a one of the multiplicity of phrases as a selected phrase;
b2) selecting as a selected word a yet to be examined word of the selected phrase;
b3) determining whether the selected word is a frequent word;
b4) if the selected word is an infrequent word:
A) if all words of the selected phrase have not been examined, repeating steps b2) through b3);
B) if all words of the selected phrase have been examined, repeating steps b1) through b3);
C) if the selected word is a frequent word:
C1) determining whether the selected word is an acceptable beginning for a key phrase by searching a list of bad beginning words;
C2) if the selected word is not an acceptable beginning for a key phrase:
i) determining whether all words of the selected phrase have been examined:
ii) if all words of the selected phrase have not been examined, repeating steps b2) through b3);
iii) if all words of the selected phrase have been examined, repeating steps b1) through b4);
C3) if the selected word is an acceptable beginning for a key phrase:
i) adding the selected word to a current phrase;
ii) if all words of the selected phrase have not been examined selecting as a selected word a yet to be examined word of the selected phrase;
iii) determining whether the selected word is a stop word; and
iv) if the selected word is not a stop word, repeating steps C3i) through C3iii)
c) selecting as key phrases a subset of most frequently occurring of the candidate phrases.
2. The method of claim 1 wherein step C3) further comprises the steps of:
v) if the selected word is an infrequent word or if all words of the selected phrase have been examined:
vA) determining whether a last word of the current phrase is an acceptable ending for a key phrase by searching a list of bad ending words;
vB) if the last word of the current phrase is not an acceptable ending for a key phrase, removing the last word of the current phrase and repeating step vA);
if the last word of the current phrase is an acceptable ending for a key phrase, determining whether the current phrase includes more than one word; and
if the current phrase includes more than one word adding the current phrase to a list of candidate phrases.
3. The method of claim 1 wherein step a) further comprises:
assigning a unique integer number to represent each word of the document;
representing each candidate phrase as a string of integer numbers, each integer number of a string representing a word of the candidate phrase; and
storing each string of integer numbers in a table.
4. The method of claim 3 wherein step a) further comprises:
storing in a word table the unique integer number associated with each word of the document;
for each word in the word table storing an indication of whether the word represents an acceptable beginning and an acceptable ending for candidate phrases.
US08/432,383 | 1995-05-01 | 1995-05-01 | Automatic method of selecting multi-word key phrases from a document | Expired - Lifetime | US5745602A (en)

Priority Applications (3)

Application Number | Publication | Priority Date | Filing Date | Title
US08/432,383 | US5745602A (en) | 1995-05-01 | 1995-05-01 | Automatic method of selecting multi-word key phrases from a document
JP10578696A | JP3653141B2 (en) | 1995-05-01 | 1996-04-25 | An automatic method for selecting a key phrase for a processor from a machine-readable document
EP96303094A | EP0741364A1 (en) | 1995-05-01 | 1996-05-01 | Automatic method of selecting multi-word key phrases from a document

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
US08/432,383 | US5745602A (en) | 1995-05-01 | 1995-05-01 | Automatic method of selecting multi-word key phrases from a document

Publications (1)

Publication Number | Publication Date
US5745602A (en) | 1998-04-28

Family

ID=23715929

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US08/432,383 | Expired - Lifetime | US5745602A (en) | 1995-05-01 | 1995-05-01 | Automatic method of selecting multi-word key phrases from a document

Country Status (3)

Country | Link
US (1) | US5745602A (en)
EP (1) | EP0741364A1 (en)
JP (1) | JP3653141B2 (en)

Also Published As

Publication number | Publication date
JPH08305730A (en) | 1996-11-22
JP3653141B2 (en) | 2005-05-25
EP0741364A1 (en) | 1996-11-06

Similar Documents

Publication | Publication Date | Title
US5745602A (en) | Automatic method of selecting multi-word key phrases from a document
Masui | An efficient text input method for pen-based computers
EP0737927B1 (en) | Automatic method of generating thematic summaries
JP3759242B2 (en) | Feature probability automatic generation method and system
US5579224A (en) | Dictionary creation supporting system
EP0971294A2 (en) | Method and apparatus for automated search and retrieval processing
JP5231698B2 (en) | How to predict how to read Japanese ideograms
EP1290574A2 (en) | System and method for matching a textual input to a lexical knowledge base and for utilizing results of that match
JP5447368B2 (en) | NEW CASE GENERATION DEVICE, NEW CASE GENERATION METHOD, AND NEW CASE GENERATION PROGRAM
JP3952964B2 (en) | Reading information determination method, apparatus and program
US20040054677A1 (en) | Method for processing text in a computer and a computer
JP2002189734A (en) | Search term extraction device and search term extraction method
JP4047895B2 (en) | Document proofing apparatus and program storage medium
JPH09325962A (en) | Document proofreading device and program storage medium
JP4047894B2 (en) | Document proofing apparatus and program storage medium
JP3324910B2 (en) | Japanese analyzer
JPH01287774A (en) | Japanese data input processor
JP4318223B2 (en) | Document proofing apparatus and program storage medium
JP2005189955A (en) | Document processing method, document processing apparatus, control program, and recording medium
JP2856775B2 (en) | Document creation device
Mráková et al. | From Czech morphology through partial parsing to disambiguation
JPH1040267A (en) | Document summary viewer
JP3216725B2 (en) | Sentence structure analyzer
JPS62221065A (en) | Document creation method
JPH0275059A (en) | Japanese error correction processing device

Legal Events

Date | Code | Title | Description
AS | Assignment

Owner name: XEROX CORPORATION, CONNECTICUT

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: CHEN, FRANCINE R.; PUTZ, STEVEN B.; BROTSKY, DANIEL C.; REEL/FRAME: 007480/0404

Effective date: 19950501

STCF | Information on status: patent grant

Free format text: PATENTED CASE

FPAY | Fee payment

Year of fee payment: 4

AS | Assignment

Owner name: BANK ONE, NA, AS ADMINISTRATIVE AGENT, ILLINOIS

Free format text: SECURITY INTEREST; ASSIGNOR: XEROX CORPORATION; REEL/FRAME: 013153/0001

Effective date: 20020621

AS | Assignment

Owner name: JPMORGAN CHASE BANK, AS COLLATERAL AGENT, TEXAS

Free format text: SECURITY AGREEMENT; ASSIGNOR: XEROX CORPORATION; REEL/FRAME: 015134/0476

Effective date: 20030625

FPAY | Fee payment

Year of fee payment: 8

FPAY | Fee payment

Year of fee payment: 12

AS | Assignment

Owner name: XEROX CORPORATION, CONNECTICUT

Free format text: RELEASE BY SECURED PARTY; ASSIGNOR: JPMORGAN CHASE BANK, N.A. AS SUCCESSOR-IN-INTEREST ADMINISTRATIVE AGENT AND COLLATERAL AGENT TO JPMORGAN CHASE BANK; REEL/FRAME: 066728/0193

Effective date: 20220822

