RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 13/501,370 filed on Apr. 11, 2012 by the inventors of the present application and titled Method and System for Document Presentation and Analysis which, in turn, claimed the benefit under 35 U.S.C. §371 of International Application No. PCT/US10/52321, the entire contents of each of which are incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to the field of document analysis, and more particularly to methods and systems for rapidly determining relevancy of one or more documents.
BACKGROUND
Document research involves identifying relevant subject matter or concepts within a document or set of documents. Search engines, for example, use “key” words or phrases as search arguments to locate text passages containing those words or phrases. The passages may or may not be relevant, however, regardless of the instance of the argument. Finding relevant subject matter involves not just the instance of the word or phrase, but the context in which it is found. The preceding and succeeding words that surround a keyword in a passage influence the meaning or effect of its use.
Sometimes the search for context, as opposed to an instance of a keyword, can be narrowed by using additional descriptive terms. Boolean operators are used by almost all search engines to link words separated by the operators in some logic set. For example, the operator “AND” implies the set of all instances of word number one used in conjunction with word number two; the operator “OR”, by contrast, implies the set of all instances of word number one combined with the set of all instances of word number two. In mathematical language, the first set is an intersection set and the second, a union set.
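By way of a non-limiting illustration, the set logic of the “AND” and “OR” operators described above may be sketched as follows. The corpus, the function name documents_containing, and the whole-word matching rule are assumptions made only for this example and are not part of any particular search engine.

```python
def documents_containing(word, corpus):
    """Return the set of document ids whose text contains the word (whole-word match)."""
    word = word.lower()
    return {doc_id for doc_id, text in corpus.items() if word in text.lower().split()}


# Hypothetical four-document corpus used only for illustration.
corpus = {
    1: "a pump drives fluid through a valve",
    2: "the valve is spring loaded",
    3: "a pump with an impeller",
    4: "a sensor reports temperature",
}

pump_docs = documents_containing("pump", corpus)
valve_docs = documents_containing("valve", corpus)

# "pump AND valve" is the intersection of the two sets.
and_hits = pump_docs & valve_docs
# "pump OR valve" is the union of the two sets.
or_hits = pump_docs | valve_docs
```

In this sketch the intersection contains only document 1, while the union contains documents 1, 2 and 3.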
Wildcards, indicated by some symbol like “*” or “$”, can be used to substitute for letters, prefixes or endings, thereby picking up the alternative forms in which a word might appear. Proximity indicators, such as “ADJ”, “NEAR”, “WITH” and “SAME”, are used together with Boolean operators to indicate how far apart two words may appear in a text passage. This gives the document researcher a means for assessing context. Two words used in the same sentence, or in the same paragraph, can indicate a contextual nexus.
In the current state-of-the-art, finding contextual meaning involves reading whole passages or entire documents where keywords are located. Since the quality of document research is defined in the negative as not missing any relevant passages in a field of inquiry, the researcher can ill-afford to simply spot-read. Search engines can find the keywords, but it is the reading task that defines not only the quality but the time spent on a properly conducted search exercise. Any artifice which reduces reading time without compromising quality becomes highly desirable for productivity reasons.
U.S. Published Application No. 20050210042 to Goedken shows methods and apparatus to search and analyze prior art. Goedken shows the benefit of grouping conceptually related words to a single color, and then highlighting those words in the text of a patent document. Goedken also recognizes the benefit of counting elements for reporting purposes (see FIG. 14a). Goedken, however, does not show a system for rapidly displaying the text of a document alongside an indexed color coded chart for allowing quick navigation and quickly showing the user prevalence of various concepts inside of a document. These are important shortcomings because the patent researcher requires a system for acquiring an initial understanding of a document in 1-2 seconds. The patent researcher must view thousands of documents in a typical search, and if the initial document inquiry takes more than a few seconds, then a patent search can become economically unfeasible.
U.S. Published Application No. 20060156222 to Chi shows a method for automatically performing conceptual highlighting in electronic text. Chi has also noticed that conceptually related words can be grouped together and highlighted the same color. However, Chi has not provided for additional features that enable rapid initial understanding of a document. For example, Chi doesn't teach methods of removing passages of no relevance to the reader's interest. In addition, Chi doesn't show methods of removing all but the most relevant passages. Moreover, Chi also doesn't show a method of providing rapid understanding (1-2 seconds) of a document, such that a researcher can make the quick decision of whether or not to start reading a document.
U.S. Pat. No. 7,194,693 to Cragun shows an apparatus and method for automatically highlighting text in an electronic document. However, highlighting is determined by user preferences and scroll speed. Cragun does not show features that allow rapid, staged understanding of a document that are required by the researcher wrestling with large numbers of long documents.
U.S. Pat. No. 6,823,331 to Abu-Hakima shows a concept identification system and method for use in reducing and/or representing text content of an electronic document. Although Abu-Hakima provides for counting and ranking, there are no tools for rapid understanding of the document once it is presented.
U.S. Published Application 20090276694 to Henry shows a System and Method for Document Display. Like the present invention, Henry has found the usefulness in presenting reference characters along with names on or near the figures to which they relate. However, Henry has not taught a search system where the reference characters are rapidly located for the searcher, and presented for quick navigation through the document. Moreover, Henry has decided to retrieve characters from drawings, where the present invention contains a method for hunting patent text for reference characters.
U.S. Published Application 20040113916 to Ungar shows a perceptual-based color selection for text highlighting. The text color choice is based upon factors such as the total amount of highlighted display.
Several problems still exist in prior art. First, most search systems rely on a researcher to limit a document set using a combination of keyword and classification. But since a researcher is looking for multiple concepts simultaneously, limiting a search with a set of keywords will inevitably miss references showing the concepts that were not part of the immediate search. This is exacerbated when a searcher is looking for ten or more concepts simultaneously. Clearly, a better system would involve reviewing large sets of documents for all concepts simultaneously. However, the labor involved in reading large sets of long documents makes this approach impractical. Therefore, a system is required that enables rapid manual review of large sets of lengthy documents for multiple concepts simultaneously.
Embodiments of the present invention address many of the shortfalls in the prior art while presenting, what will hereinafter become apparent to be, a pioneering document analysis technology.
SUMMARY OF THE INVENTION
It is a first object of the present invention to enable location and loading of groups of words having relevance in a research project. It is a second object to provide an interface that enables rapid (1-2 second) first level of relevance determination through color coding of concept blocks. Yet another object of the present invention is to provide an interface that enables quick (5-10 second) second level of relevance determination through multi-colored highlighting of keywords. It is yet another object to provide multiple user options for removal of non-relevant passages in a document. Yet another object is to provide for optional display of only the highest relevance passages for high speed patent searching. Still another object is to provide an interface that enables rapid location in patent text of any reference numeral from the figures. Yet another object is to provide an interface that enables rapid location of passages related to figure numbers. Still another object of the present invention is to provide an interface with rapid location of patent and published application numbers inside a body of text.
The present invention is a document presentation system that enables a researcher to quickly assess relevance of a document in the context of a search project. With the present invention, the researcher can locate potentially relevant areas of a document database, and then review large numbers of documents for the presence of multiple concepts. The invention contains GUI tools that enable the researcher to first load multiple keyword groups into blocks of conceptually related keywords. As the researcher navigates from document to document, the keywords are counted, and the keyword blocks are colored according to the highest keyword occurrence in each keyword block. This enables the researcher to make a first level of relevance determination within 1-2 seconds of loading the document. If multiple colors aside from red are observed, the researcher can then inspect for passages of relevance. Only passages containing a user specified number of keywords are presented, so that the researcher does not read and page through long documents. In addition, each passage has all keywords color coded, such that all keywords from a given block are made the same color. When the researcher observes multi-colored passages, he or she can quickly inspect the passage by scanning from keyword to keyword, enabling a second level of understanding within just 5-10 more seconds. In addition, the researcher is provided with the ability to scroll the document from keyword to keyword by clicking in the keyword blocks. Particularly dense keyword areas are shown on a keyword density scrollbar, enabling the researcher to jump directly to keyword dense sections of the document. In addition, the researcher can instruct the interface to automatically remove all but the most relevant passages, which are defined as those with the highest number of keyword blocks represented therein.
Moreover, the document is processed to present a bill of material (BOM) table and a figures table, both of which provide document navigation. With these navigation tools, a patent researcher can view patent images in one window and quickly locate passages in the text where reference characters reside (using the BOM table) or where figures are discussed (using the figs table). In addition, the interface presents any patent numbers or published application numbers discussed in the document, which provides quick adding of applicant cited documents to a standard backward citation search. An additional tool provides the ability to tag each document according to relevance and according to presence or absence of multiple user defined concepts.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a block diagram illustrating a document analysis system in accordance with an exemplary embodiment of the invention.
FIG. 1B is a sample of a document.
FIG. 2A is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2B is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2C is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2D is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2E is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2F is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2G is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2H is a diagram of a project file created and used by the present invention.
FIG. 2I is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2J is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 2K is an interface diagram in accordance with an exemplary embodiment of the invention.
FIG. 3A is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3B is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3C is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3D is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3E is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3F is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3G is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 3H is a block color scheme table.
FIG. 3I is a document text color scheme table.
FIG. 4 is a flow diagram illustrating another process that may be carried out in accordance with the exemplary system of FIG. 1.
FIG. 5 is a block diagram illustrating a document analysis system in accordance with another exemplary embodiment of the invention.
DETAILED DESCRIPTION
Reference will now be made in detail to the present exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings.
Referring to FIG. 1, a block diagram is shown illustrating a document analysis system 100 in accordance with an exemplary embodiment of the invention. The document analysis system 100 comprises a client device 110. The client device 110 includes a document analysis module 112, an interface module 114 and a user Input/Output (I/O) interface 118. By way of example, the client device 110 may be a computing device having a processor, such as a personal computer, a phone, a mobile phone, or a personal digital assistant. The document analysis system 100 may also comprise a document provider 130 and a network 120. The document provider 130 is configured to deliver one or more documents, labeled generally as 132. By way of example, the documents 132 may be electronic files containing patent data or any type of electronic file that contains textual data. See FIG. 1B for an example of a document 132. As seen, the document 132 has multiple document classifications 135 that are further divided into a class 136 and a subclass 137. In addition, notice that the body of the document is composed of multiple sections (e.g., abstract, description, claims), and that the sections are further divided into document paragraphs 138. The document 132 may also contain BOM items 267, which are also known as reference characters, patent reference numbers 260, and figure numbers 268. The document provider 130 may be a remote server running a search engine such as that provided by the United States Patent and Trademark Office (USPTO), FreePatentsOnLine, Micropatent®, Delphion®, PatentCafe®, Thomson Innovation or Google®. The document provider 130 may retrieve the data from a local repository or from one or more remote document repositories. Examples of such a document repository include patent databases including those provided by EP (European patents), WO (PCT publications), JP (Japan abstracts) and DWPI (Derwent World Patent Index for patent families).
The document provider 130 may alternatively be a cloud based bulk storage system such as Amazon Simple Storage Service. The interface module 114 is configured to receive one or more documents 132 from the document provider 130 by way of the network 120. By way of example, the network may be the Internet. The interface module 114 may alternatively be configured to receive one or more documents 132 through the user I/O interface 118. In such an embodiment, the documents 132 may be stored on a portable storage device (not shown) such as a CD, DVD or solid state device and the user I/O interface 118 may include a communications interface such as a wireless interface, a CD/DVD drive or a USB drive for retrieving data from the portable storage device. The documents 132 may alternately be paper-based documents and may be provided to the interface module 114 by use of a scanner (not shown) that is configured with the I/O interface 118. The client device 110 may also include a data storage element 116. The interface module 114 is also configured to receive a set of one or more concepts from a researcher by way of the I/O interface 118. The I/O interface 118 may also include at least one input device such as a keyboard, mouse, microphone or a touch screen for receiving the concepts from the researcher. Each concept is comprised of one or more text-based keywords or sets of text-based keywords which are used by the document analysis module 112 to analyze each of the documents 132. The document analysis module 112 generates statistical data based on the user-defined concepts and the documents 132. The statistical data may be used by the researcher to quickly assess the relevancy of each document 132 to each of the user-defined concepts. The document analysis module 112 may transmit the statistical data to the interface module 114, which presents the data to the researcher by way of the I/O interface 118.
The I/O interface 118 may also include a display such as an LCD or CRT monitor configured to display a graphical user interface (GUI) for presenting information such as the statistical data to the researcher. The GUI will now be discussed in greater detail.
Referring now to FIGS. 2A, 2B, 2C, 2D, 2E, and 2F, diagrams are shown illustrating a document analysis graphical user interface (GUI) 200 in accordance with an exemplary embodiment of the invention. FIGS. 3A-3F, which illustrate an exemplary computer-implemented process 300 for performing document analysis, will also be discussed. At a first step labeled as 310, the interface module 114 will receive concept data from the researcher. The interface module 114 first generates a document analysis GUI 200 and displays the GUI 200 to the researcher by way of the display device included with the user I/O interface 118. As shown in FIG. 2A, the document analysis GUI 200 includes a document relevance interface 220, a document management interface 250, and a document image window 254. The document image window 254 displays non-textual data such as images or drawings that may be associated with the currently selected document, thus providing an additional means for assessing the relevance of the document. As seen in FIG. 2F, the researcher may start a research project by entering one or more concepts 272. Each concept 272 may have one or more words or word groups associated therewith. As shown in FIG. 2B, the document analysis GUI 200 includes a keyword entry interface 210. The keyword entry interface 210 comprises multiple rows of alphanumeric entry fields 212. One or more keywords 213 may be entered by a researcher into each entry field 212, wherein the keywords 213 in each entry field 212 are conceptually related, such that each line represents a keyword group 214. The researcher is also provided with a user thesaurus 211 and a web thesaurus 219. The user thesaurus 211 can be edited and stored in a data storage element 116, and the web thesaurus 219 may be accessed through the network 120 by the interface module 114. Five alphanumeric entry fields 212 are shown to be filled in FIG. 2B. Each concept 272 and corresponding keyword group 214 may be determined manually by the researcher or may be received from an external source.
By way of example, the concepts may be reduced to a manageable number of concepts (e.g., 4-5 concepts). Keywords 213 may then be chosen for each of the concepts and entered into one of the alphanumeric fields 212 to form the keyword group 214. After entering each of the desired concepts, the researcher may then exit the keyword entry interface 210 and proceed to analysis of a set of documents based on the user-defined concepts.
At a next step labeled as 320, the interface module 114 will receive one or more documents 132. As discussed, the interface module 114 is configured to receive the one or more documents 132 from the document provider 130 by way of the network 120. The interface module 114 may be configured to allow the researcher to request a predetermined set of documents 132. By way of example, the researcher may initiate a request for a specific set of patent documents or a set of patent documents that fall within a specific category or classification. The researcher may also initiate a search of a remote document repository through a search interface window 230 (shown in FIG. 2D) provided by the document analysis GUI 200. The search may be initiated by entering a set of search parameters, such as keywords, into one or more search fields 232 located on the search interface window 230. Boolean operators, wildcards and proximity indicators may be used to link the keywords together in logic sets. The search interface window 230 may also provide a search assistance window 234 that allows the previously defined keywords 213 to be added to the set of search parameters in response to a user action (e.g., a mouse click). The search assistance window 234 thereby facilitates the loading of search parameters into the one or more search fields 232. In addition, the researcher is provided with a classification search list 290, which contains a table for documenting the search project strategy (discussed in detail later). The researcher may pick classification codes from the classification search list 290. As discussed, the interface module 114 may alternatively be configured to receive one or more documents 132 through the user I/O interface 118.
In such an embodiment, the documents 132 may be stored on a portable storage device (not shown) such as a CD, DVD or solid state device and the user I/O interface 118 may include a communications interface such as a wireless interface, a CD/DVD drive or a USB drive for retrieving data from the portable storage device. Upon receiving the one or more reference documents 132, the interface module 114 will populate a document management table 252 located on a document management interface 250 (shown in FIG. 2E) with selectable rows 253, each having information descriptive of one of the received reference documents 132. By way of example, each row may include a reference document number 255 and a document title 256.
At a next step, labeled as 330, the document analysis module 112 performs analysis of the one or more reference documents 132 received by the interface module 114 relative to the user-defined concepts also received by the interface module 114. As shown in FIG. 2C, the document analysis GUI 200 includes the document relevance interface 220. The document relevance interface 220 comprises a keyword table 222 and a document text window 226. When the researcher selects (by way of a mouse click or similar navigation event) one of the rows that appear in the document management table 252, processed text 228 or corresponding text of the reference document becomes viewable in the document text window 226. Each keyword entered in the alphanumeric entry fields 212 is listed in a separate row of a first column 223 of the keyword table 222. The keyword table 222 also includes a second column 224. The second column 224 displays a numeric value that represents the number of times the corresponding keyword in the first column 223 appears in the processed text 228 of the currently selected document. The keywords are arranged in keyword blocks 225, wherein each block 225 contains all of the keywords from a single keyword group 214. In addition, each keyword block 225 has a highest occurring keyword 235, which is the keyword from that block that appears most often in the document. The keyword blocks 225 may be visually separated by bold horizontal bars, labeled generally as 229. When a document is first selected by the researcher, the document analysis module 112 will retrieve the document 132 through the interface module 114 and generate the processed text 228. The document analysis module 112 will use a block color scheme 236 to determine a color for each keyword block 225. According to the block color scheme 236, the color is determined from the highest occurring keyword 235 in each keyword block 225.
The keyword table colors are selected by the document analysis module 112 from one of a set of predetermined colors in the block color scheme, each color corresponding to a range of instances of appearances of a keyword in the document 132. See FIG. 3H for an example of a block color scheme 236. As seen, red signifies lowest occurrence, and green signifies highest occurrence. All intermediate integers receive different colors along a red-green continuum. After determining a color for each keyword block 225, the document analysis module 112 will instruct the interface module 114 at step 340 to highlight each corresponding block 225 with that color. By viewing the colored keyword blocks 225 in the keyword table 222, a researcher may then make a rapid decision regarding the potential relevance of the selected document 132 to the previously defined concepts. More specifically, the researcher can use the colored keyword blocks 225 to make an initial relevance assessment within 1-2 seconds. If multiple colors, other than red, are observed in the initial relevance assessment, the researcher may then scan the processed text 228 to locate paragraphs having multiple colors, which would correspond to multiple concepts. If multi-colored paragraphs are noticed, the researcher may then decide to read portions of the processed text 228 to make a second determination as to relevance within 5-10 seconds. Finally, a researcher may view the original document 132 in the document image window 254 to make a final determination for tagging the document 132.
In addition, the count of instances for each keyword 213 may be transformed by the document analysis module 112 into a normalized count so that the length of the selected document 132 is substantially eliminated as a variable. The computation for the normalized count involves dividing the totality of the text characters in the selected document by five (the average letter count for a word in the English language) to obtain a normalized word count. Next, the count of instances for each keyword 213 is divided by the normalized word count to find density. This is followed by multiplying density by 2500 (an arbitrary constant) and rounding to result in the normalized count expressed in integers. In one aspect of the exemplary embodiment, one of the keyword table colors is associated with a normalized count value of 10 or greater, another keyword table color with a value of 9, and a third keyword table color with a value of 8, and so on until the zero color is assigned. Steps 330 and 340 may be repeated for each of the received reference documents 132 as indicated by dashed arrow 350.
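By way of a non-limiting illustration, the normalized count computation described above may be sketched as follows. The function names, the round-half-up rule, and the treatment of an empty document are assumptions made for this example only; the description specifies only the division by five, the multiplication by 2500, and rounding to an integer.

```python
import math

AVERAGE_WORD_LENGTH = 5      # average letter count per English word, per the description
SCALING_CONSTANT = 2500      # arbitrary constant, per the description


def normalized_count(keyword_count, total_characters):
    """Scale a raw keyword count so that document length is largely factored out."""
    normalized_word_count = total_characters / AVERAGE_WORD_LENGTH
    if normalized_word_count == 0:
        return 0  # assumption: an empty document yields a zero count
    density = keyword_count / normalized_word_count
    # Assumption: round half up; the description only says "rounding".
    return math.floor(density * SCALING_CONSTANT + 0.5)


def color_bucket(count):
    """Map a normalized count to one of eleven color slots (0 through 10 or greater)."""
    return min(count, 10)
```

For example, a keyword appearing 12 times in a 30,000-character document corresponds to a normalized word count of 6,000, a density of 0.002, and a normalized count of 5.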
As seen in FIG. 2C, a keyword density scrollbar 227 may also be provided having integrated colors which correspond to such sections of text where highlighted keywords are tightly grouped. By way of example, the scrollbar 227 may be divided vertically into density sections 238, wherein the number of sections corresponds to the number of document paragraphs 138 appearing in the processed text 228. Colors may be assigned to each density section 238 according to the number of keyword groups 214 that appear in each document paragraph 138. The researcher can then rapidly scroll through long documents directly to areas where multiple keyword groups 214 are represented.
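By way of a non-limiting illustration, one way of assigning a color to each density section from the number of keyword groups found in the corresponding paragraph may be sketched as follows. The function name density_colors and the linear red-to-green RGB interpolation are assumptions for this example; the description states only that the scrollbar colors depend on the number of keyword groups in each paragraph.

```python
def density_colors(groups_per_paragraph):
    """Return one (red, green, blue) tuple per paragraph.

    A paragraph with no keyword groups is pure red, and a paragraph with the
    highest number of groups found in the document is pure green.
    """
    highest = max(groups_per_paragraph, default=0)
    colors = []
    for count in groups_per_paragraph:
        if highest == 0:
            colors.append((255, 0, 0))  # no keyword groups anywhere: all red
            continue
        green = 255 * count // highest  # integer interpolation, assumed linear
        colors.append((255 - green, green, 0))
    return colors


# Hypothetical document with four paragraphs containing 1, 2, 0 and 3 keyword groups.
section_colors = density_colors([1, 2, 0, 3])
```

In this sketch the fourth density section is pure green and the third is pure red, so a researcher dragging the scrollbar would be drawn to the fourth paragraph first.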
As discussed, when the researcher selects one of the rows that appear in the document management table 252, the processed text 228 of the corresponding reference document 132 becomes viewable in the document text window 226 and an image of the document 132 becomes viewable in the document image window 254. In addition, the document analysis module 112 will assign a unique keyword color to each block of keywords (each block of keywords corresponding to one concept) for subsequent highlighting in the document text window 226. Thereby, each keyword within a keyword block 225 or logical set of keywords will have the same unique color. The document analysis module 112 then instructs the interface module 114 at step 340 to display the keywords highlighted with the corresponding unique keyword colors in the document text window 226. In this manner, a scrolling scan of the displayed text may reveal sections of text where highlighted keywords are tightly grouped together. When keywords highlighted with different colors appear within a section, such a localized array might indicate a confluence of concepts and a nexus of context. The need for reading can be reduced by the collage of highlighted words in the localized array, the collage potentially communicating the meaning of a passage in the same way that a word with missing letters is recognizable. Thus a quick confirmation of relevance can be made by a person in a glancing inspection.
With reference now to FIGS. 3B-3G, the generation of the document relevance interface 220 will be discussed in greater detail. As seen in FIG. 3B, the four basic inputs are the document 132, the keyword groups 214, the static parameters 240, and the interface settings 245. With these inputs, the document analysis module 112 runs processes 600, 630, 640 to generate the document relevance interface 220.
Process 600 Generate Processed Text 228
Referring to FIG. 2C and FIG. 3C, process 600 begins when the researcher navigates to a document 132 using the document management interface 250. At step 601, if section selector 218 is Bill of Material (BOM), then proceed to step 602, where the description field of the document is selected and passed to step 650. Here the “Build BOM” subroutine is executed and the resulting text becomes the processed text 228, which is displayed in the document text window 226. The result is a single column of the reference characters followed by item names. Returning to step 601 and proceeding to step 604, if section selector 218 is “Class”, then proceed to step 605, and select all document classifications 135. Next, at step 606, retrieve full class schedules for each document classification 135, which become the processed text 228 and are displayed in the document text window 226. Returning to step 604 and proceeding to step 607, if section selector 218 is “Citations”, then proceed to step 608. Select the Citations section of the document 132, and proceed to step 609. Select the description section of the document, and proceed to process 640. Append the examiner citations from the citations section to the patent and application numbers found in process 640. The resulting delimited list of patent reference numbers becomes the processed text 228, which is displayed in the document text window 226. Returning to step 607 and proceeding to step 611, if Summary Only or SO=Yes, then proceed to step 612 and remove all text related to prior art and background by searching for words such as SUMMARY or BRIEF SUMMARY. Proceeding to step 613, first select the document section (i.e., if section selector 218 is “Claims”, select the claims section). Next, separate the selected section into an array of paragraphs using carriage returns as the delimiter to make 1d-array 670. Next, count the total number of occurrences of any keyword from each keyword group 214 in each paragraph in 1d-array 670, and store as 2d-array 671.
Next, use the 2d-array 671 to find the number of different keyword groups 214 represented in each paragraph (i.e., the number of non-zero cells in each row of 2d-array 671), and store as 1d-array 672. Next, if Keyword Setting or KW Setting 215=KW1, then proceed to step 615, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 of zero (so that the end display shows only paragraphs with at least one keyword group 214 represented). Returning to step 614, and on to step 616, if KW Setting 215=KWII, then proceed to step 617, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 of zero or one (so that the end display shows only paragraphs with at least two keyword groups 214 represented). Returning to step 616, and on to step 618, if KW Setting 215=KW Hot, then proceed to step 619, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 that is less than the highest number found anywhere in 1d-array 672 (so that the end display shows only paragraphs with the highest number of keyword groups 214 represented). Next, assign colors to each density section 238 of the keyword density scrollbar 227 using the 1d-array 672 and a color scheme of 1) green=highest number in 1d-array 672, 2) red=0, 3) all intermediate numbers receive an intermediate color along the red-green spectrum. Moving now to step 620, assign unique colors to each keyword group 214 using a document text color scheme 237, wherein each color is picked for its ability to stand out on a white background and also be contrasted from the other colors. See an example of the document text color scheme in FIG. 3I. At step 621, if highlight setting 217=All, then proceed to step 622 and convert 1d-array 670 to regular text, and highlight all keywords according to the color scheme developed in step 620.
Returning to step 621, and on to step 623, convert 1d-array 670 to regular text, and highlight the keywords in the visible window according to the color scheme developed in step 620. Display the result as the processed text 228 in the document text window 226 of the document relevance interface 220.
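
By way of a non-limiting illustration, the paragraph counting and filtering of steps 613 through 619 may be sketched in Python as follows. The function names `count_groups` and `filter_paragraphs` are hypothetical and do not appear in the figures. The sketch further assumes that paragraphs are delimited by newline characters, that blank paragraphs are discarded, and that keywords are matched as case-insensitive substrings; none of these details is specified above.

```python
def count_groups(section_text, keyword_groups):
    """Return (paragraphs, counts, groups_present) for one document section.

    paragraphs     -> analogue of 1d-array 670
    counts         -> analogue of 2d-array 671 (rows: paragraphs, columns: groups)
    groups_present -> analogue of 1d-array 672 (non-zero cells per row)
    """
    # Assumption: newline-delimited paragraphs, blank ones dropped.
    paragraphs = [p for p in section_text.split("\n") if p.strip()]
    counts = []
    for paragraph in paragraphs:
        lower = paragraph.lower()
        # Assumption: case-insensitive substring matching of each keyword.
        counts.append([sum(lower.count(k.lower()) for k in group)
                       for group in keyword_groups])
    groups_present = [sum(1 for c in row if c > 0) for row in counts]
    return paragraphs, counts, groups_present


def filter_paragraphs(paragraphs, groups_present, kw_setting):
    """Keep only paragraphs that satisfy the selected keyword setting."""
    if kw_setting == "KW1":        # at least one keyword group represented
        threshold = 1
    elif kw_setting == "KWII":     # at least two keyword groups represented
        threshold = 2
    elif kw_setting == "KW Hot":   # only paragraphs matching the highest count
        threshold = max(groups_present, default=0)
    else:                          # assumption: any other setting keeps everything
        threshold = 0
    return [p for p, n in zip(paragraphs, groups_present) if n >= threshold]
```

For example, with the keyword groups `[["gear", "cog"], ["shaft"], ["motor"]]`, a paragraph mentioning both a gear and a shaft has two groups represented and therefore survives the KW1, KWII, and KW Hot settings when no other paragraph represents more groups.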
Process 630 Generate Keyword Table 222:
Referring to FIG. 3D, first, at step 631, count the total number of occurrences of each keyword 213 and store the counts in 1d-array 673. Isolate the highest number representing each keyword group 214 from 1d-array 673, and store these to 1d-array 674. Next, at step 632, assign colors to 1d-array 674 according to the block color scheme 236 from FIG. 3H (i.e., red = 0, yellow = 5, green = 10 or more, with all intermediate numbers between 0 and 10 receiving a different color along a red-green continuum). Arrange the keyword groups 214 for display in the first column 223, and 1d-array 673 in the second column. Separate each keyword group 214 with a horizontal bar 229 to form multiple keyword blocks 225. Index the processed text 228 against the keywords 213, such that mouse clicks in any row will cause scrolling to keyword locations in the text 228. As seen in FIG. 2i, the index provides the researcher with rapid scrolling to and bolding of the keyword that is clicked in the keyword table.
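
By way of a non-limiting illustration, the counting of step 631 and the color assignment of step 632 may be sketched in Python as follows. The names `keyword_counts` and `block_color` are hypothetical. The text above fixes only three points of the block color scheme (red at 0, yellow at 5, green at 10 or more); the linear interpolation in RGB space between those points, and the case-insensitive substring matching, are assumptions of this sketch.

```python
def keyword_counts(text, keyword_groups):
    """Count each keyword per group (analogue of 1d-array 673, kept grouped)."""
    lower = text.lower()
    # Assumption: case-insensitive substring matching.
    return [[lower.count(k.lower()) for k in group] for group in keyword_groups]


def block_color(highest, full_scale=10):
    """Map a group's highest keyword count to an (R, G, B) tuple.

    0 maps to red, full_scale / 2 to yellow, full_scale or more to green.
    Assumption: linear interpolation between those anchor colors.
    """
    t = min(max(highest, 0), full_scale) / full_scale
    if t <= 0.5:
        # Red to yellow: the green channel ramps up.
        return (255, round(255 * t * 2), 0)
    # Yellow to green: the red channel ramps down.
    return (round(255 * (1 - t) * 2), 255, 0)


def block_colors(counts):
    """Color each keyword block by its highest count (analogue of 1d-array 674)."""
    return [block_color(max(group, default=0)) for group in counts]
```

Taking the highest count per group, rather than the sum, mirrors the description above: one frequently used keyword is enough to mark its block as strongly present.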
Process 640 Generate Patent References 260:
Referring to FIG. 3E, first, at step 641, select the description from the document 132 and convert all characters to lower case. Remove all non-alphabetic and non-numeric characters such as slashes, commas, periods, etc. Next, at step 642, search for any words preceded by phrases such as “patent”, “us”, “u.s.”, or “no.” If such words are numeric, add them to a 1d-array 675 of patent reference numbers 260. Next, search for any words that are 6, 7, or 11 characters long and are composed entirely of numeric characters, and add them to 1d-array 675.
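
By way of a non-limiting illustration, steps 641 and 642 may be sketched in Python as follows. The name `patent_references` is hypothetical. The sketch assumes that the two searches are combined into a single pass, that duplicates are dropped, and that, because punctuation has already been removed, the cue phrases “u.s.” and “no.” are matched in their stripped forms “us” and “no”.

```python
import re

# Cue words after punctuation removal ("u.s." -> "us", "no." -> "no").
CUES = {"patent", "us", "no"}


def patent_references(description):
    """Return candidate patent reference numbers (analogue of 1d-array 675)."""
    # Step 641: lower-case, then drop everything except letters, digits, whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", "", description.lower())
    words = cleaned.split()

    refs = []
    for i, word in enumerate(words):
        if not word.isdigit():
            continue
        # Step 642, first search: a numeric word preceded by a cue word.
        cued = i > 0 and words[i - 1] in CUES
        # Step 642, second search: 6, 7, or 11 digits long.
        typical_length = len(word) in (6, 7, 11)
        # Assumption: duplicates are not repeated in the list.
        if (cued or typical_length) and word not in refs:
            refs.append(word)
    return refs
```

Removing punctuation first is what allows a number written as “7,123,456” to be recognized as the seven-character word “7123456”, and a publication number written as “2010/0123456” to be recognized as an eleven-character word.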
Process 650 Generate BOM Table 262:
The BOM table will contain BOM items 267, which are also known as reference characters, and are found throughout patent text as seen in FIG. 1B. Referring to FIG. 3F, first, at step 651, select the description from document 132. Next, at step 652, search for words that start with numbers and load them to a BOM candidate array 676. Next, search for words that start with a left parenthesis and are immediately followed by a number, and add them to the BOM candidate array 676. Next, at step 653, retrieve the three words previous to each element in the BOM candidate array 676. Eliminate candidates where the preceding words contain words such as fig, figure, or figs. Next, eliminate candidates that are not immediately succeeded by a space, right parenthesis, period, or comma. Index with processed text 228. Next, at step 654, load the remaining candidate numbers into the BOM Table 642. Index the BOM candidate array 676 with processed text 228, such that mouse clicks in any row will cause scrolling to BOM item locations in the processed text 228. As seen in FIG. 2K, the index provides the researcher with rapid scrolling to and bolding of the BOM item 267 that is clicked in the BOM table.
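
By way of a non-limiting illustration, the candidate search and elimination of steps 652 and 653 may be sketched in Python as follows. The name `bom_candidates` is hypothetical. The sketch assumes whitespace-delimited words and returns the three preceding words alongside each reference character as context; how item names are derived from that context is not specified above and is not attempted here.

```python
import re

FIGURE_WORDS = {"fig", "figure", "figs"}


def bom_candidates(description):
    """Return (reference_character, preceding_words) pairs from a description."""
    words = description.split()
    items = []
    for i, word in enumerate(words):
        # Step 652: a word starting with a number, optionally after "(".
        match = re.match(r"\(?(\d\w*)(.?)", word)
        if not match:
            continue
        number, following = match.groups()
        # Step 653: must be followed by a space (end of word), ")", ".", or ",".
        if following not in ("", ")", ".", ","):
            continue
        # Step 653: drop candidates whose three preceding words name a figure.
        preceding = words[max(0, i - 3):i]
        normalized = [w.lower().strip(".,;:()") for w in preceding]
        if any(w in FIGURE_WORDS for w in normalized):
            continue
        items.append((number, " ".join(preceding)))
    return items
```

For the sentence “As shown in FIG. 2C, the housing 12 holds a gear (14) and a shaft 16.”, the word “2C,” is eliminated because “FIG.” is among its three preceding words, while 12, 14, and 16 remain as candidates.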
Process 660 Generate Figs Table 261:
The Figs table will contain figure numbers 268, which are found throughout patent text as seen in FIG. 1B. Referring to FIG. 3G, first, at step 661, select the description of document 132. Next, at step 662, search for words immediately preceded by words such as fig, fig., figure, figs., or figs, and add them to a figs candidate array 677. Next, at step 663, remove elements from the figs candidate array 677 that do not start with a number (i.e., allow 1, 2, 2C, 2D). Next, at step 664, index with processed text 228, and load the figs candidate array 677 and associated index (for quick mouse scrolling) into the figs table 261. As seen in FIG. 2J, the index provides the researcher with rapid scrolling to and bolding of the figure number 268 that is clicked in the figs table.
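
By way of a non-limiting illustration, steps 662 through 664 may be sketched in Python as follows. The name `figure_numbers` is hypothetical. The sketch assumes whitespace-delimited words, strips trailing punctuation from each candidate, and uses the word position as a stand-in for the index into the processed text 228.

```python
FIGURE_CUES = {"fig", "fig.", "figure", "figs", "figs."}


def figure_numbers(description):
    """Return (figure_number, word_position) pairs found in a description."""
    words = description.split()
    figures = []
    for i in range(1, len(words)):
        # Step 662: the word must be immediately preceded by a figure cue.
        if words[i - 1].lower() not in FIGURE_CUES:
            continue
        candidate = words[i].strip(".,;:()")
        # Step 663: keep only candidates that start with a number (1, 2, 2C, 2D).
        if candidate[:1].isdigit():
            # Step 664: the word position serves as a simple scroll index.
            figures.append((candidate, i))
    return figures
```

Because only the word immediately following the cue is examined, a phrase such as “the figure shows” is rejected at step 663, since “shows” does not start with a number.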
Referring now to FIG. 4, another exemplary method 400 for performing document analysis will now be discussed. Steps 410 through 440 proceed in a similar manner to steps 310 through 340 of the computer-implemented process 300. The present embodiment additionally provides a step 450 for receiving and storing data from the researcher that indicates the determined relevancy of the currently selected document 132 to the one or more user-defined concepts. As discussed, the interface module 114 will populate the document management table 252 (shown in FIG. 2E) with selectable rows 253, each having information descriptive of one of the received reference documents. In the exemplary embodiment, the document management table 252 also includes one or more additional columns for allowing the researcher to indicate (by way of a mouse-click or similar navigation event) the relevance of the currently selected document. Each row of the document management table 252 may have a relevancy value column 257 that contains an input field for indicating the overall relevancy of the associated reference document. By way of example, the interface module 114 may provide the researcher with the ability to select indicia (e.g., using a drop-down menu list) such as “A” for highest relevance, “B” for suspected relevance, and “C” for uncertain relevance. Irrelevant documents may be marked with an “I” to place a marker in the file indicating that a reference document was reviewed. Each row of the document management table 252 may also have one or more additional columns, labeled generally as 258, that contain an input field for indicating whether a specific concept has been verified to appear in the currently selected reference document. The interface module 114 may provide the researcher with the ability to toggle a field (one such field is labeled as 259) corresponding to a specific concept “on” or “off” (e.g., by a mouse-click) when indicating whether a particular concept does or does not exist.
A column may be provided for each of the previously discussed concepts. However, in another embodiment the interface module 112 may provide the researcher with a concept management window 270 (see FIG. 2F) for allowing the researcher to define different concepts 272 from which the additional columns 258 may be derived. In this manner, the researcher may be able to track higher-level or more abstract concepts than were initially defined and may also provide more user-friendly naming of the concepts (useful, for example, for report generation). The interface module 112 may also store the previously discussed relevancy indicators in a data repository such as the database labeled as 116 in FIG. 1. By storing each of the indicators, the interface module 114 is able to generate reports that may include a reduced, and more relevant, set of reference documents 132 than was initially received by the client device 110. Steps 430 through 450 may be repeated for each of the received reference documents 132, as indicated by dashed arrow 460.
In this manner, a document analysis system is provided that includes a computing device having program modules executable by a processor, the program modules configured to rapidly transform a first set of data files representative of a plurality of reference documents into a second set of data files representative of a subset of the plurality of reference documents, the subset having textual content particularly relevant to one or more received concepts.
Referring to FIG. 5, a block diagram is shown illustrating a document analysis system 500 in accordance with another exemplary embodiment of the invention. The document analysis system 500 is similar to the document analysis system 100 of FIG. 1 but provides a client-server architecture. Accordingly, document analysis system 500 includes a client device 510 and a server device 580. The server device 580 may be a computing device having a processor, such as a personal computer, or may be implemented on a high performance server, such as an HP, IBM or Sun computer using an operating system such as, but not limited to, Solaris or UNIX. The server device 580 includes a document analysis module 512 similar in function to the document analysis module 112 of the embodiment of FIG. 1.
Thus, a document analysis system having the benefits of allowing for rapid and accurate assessment of the relevancy of a document or set of documents to one or more concepts is contemplated. The document analysis system receives one or more concepts along with one or more reference documents and generates various sensory indicators that assist a researcher in assessing the relevance of each of the received documents to the received concepts. In one aspect, the document analysis system displays a table of keywords separated into blocks, each block of keywords corresponding to one of the concepts. The document analysis module will highlight each block of keywords with a color, the color based on the highest count of a keyword within each group of keywords. The color of a block thus indicates the relative presence of a concept in the document. In another aspect, the document analysis system determines a unique color for each block of keywords and then displays the text of the reference document with each occurrence of a keyword highlighted with the color of its associated keyword block. In this manner a researcher can quickly identify passages that contain multiple concepts.