System and method for block segmentation, identification and indexing of visual elements and searching documents

CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application claims priority from United States Provisional Patent Application No. 61/247,973, entitled "System and Method for Segmentation, Identification, and Indexing of Visual Elements and Searching for Documents," filed on October 2, 2009, which is incorporated herein by reference.
FIELD OF THE DISCLOSURE
The present disclosure relates generally to methods and systems for searching data sources. More particularly, it relates to a method of customizing a computer search according to the needs of the searcher. It further relates to a method of displaying search results in such a way as to facilitate an easy understanding of the nature and scope of the information obtained by the search.
BACKGROUND
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In today's information age, users are able to access large amounts of data through their local computers, and a virtually unlimited amount of data through intranets and the global computer network known as the Internet.
Users typically use search engines, which are ubiquitous in the art and well known in various forms, in order to find the desired information. Some search engines are embedded in programs; information in a document that is open in a program is typically found using such an embedded search engine, such as the Notepad search function. Desktop search engines, on the other hand, enable users to find information on local computers; typical commonly used desktop search engines include XP Search and OS Finder. Web search engines enable users to find information via the Internet (or intranets). Some search engines are hybrid in that they search both local and remote data sources.
To use a search engine, a user seeking information on a desired topic typically enters a search query consisting of keywords or keyword groups related to the topic into a search interface of the search engine. If the search is performed in a single document, the search engine will typically highlight the matches in the document. If the search is performed over multiple documents, the search engine typically displays a report with links to a prioritized list of related documents that contain the search keywords. Each result will also typically include a short text summary. The summary is one or more text segments in the document that contain the keywords of the search query.
Despite the many uses and advantages of existing search engines, there are deficiencies in the art. A typical Internet search through a web search engine finds a large amount of irrelevant data. The user must then expend considerable time and effort filtering the information to find the relatively few web pages that meet his or her needs.
The reason search engines return so many irrelevant results is that indexing and searching by keywords alone is not a sufficient approach. For example, in existing search engines, it is not possible for a user interested in "India" to specify and qualify the search results to a "key/value" pair such as "capital/New Delhi".
Another disadvantage of existing search engines is that they are of little use to users who do not yet know the keywords for the topic of interest. For example, if the user wants to find a movie similar to Jurassic Park, it is not useful to search by the keywords "Jurassic Park" and "similar", since this would still return pages about "Jurassic Park" that also happen to include the word "similar".
Yet another drawback of existing search engines is that they do not present the results in a manner that makes it easy for the user to understand the nature and type of the search results.
Systems for searching an intranet, extranet, local area network, personal computer, or even a single document typically suffer from the same disadvantages described above.
In view of the above disadvantages, there is a need for an efficient way to search for useful information related to a topic of interest from a data source.
SUMMARY
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an exhaustive description of the disclosure. It is intended neither to identify key or critical elements of the disclosure nor to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
The present disclosure may be equally applicable to searching information from data sources on intranets, extranets, large or small networks, personal computers, and personal programs/documents/files. Thus, while the disclosure and the examples of use presented herein are sometimes described in terms of Internet searches, this should be understood as an example of the use and utility of the present disclosure, and not as any limitation on the scope of its use. Rather, the present disclosure should be understood to be equally applicable to systems such as intranets, wide area networks, local area networks, personal computer systems, and personal programs/documents/files.
The present disclosure is equally applicable to searching using any protocol and technology, known in the art or developed in the future, for communicating or transmitting data, such as (but not limited to) HTTP, HTTPS, FTP, File, TCP/IP, and POP3. Thus, while the disclosed examples of use are sometimes described in terms of HTTP and TCP/IP, this should be understood as an example of the use and utility of the present disclosure, and not as any limitation on the scope of its use. Rather, the present disclosure should be understood to be equally applicable to any type of local or network protocol and technique known in the art or developed in the future for transmitting and/or receiving data.
The present disclosure may be equally applicable to searching and returning links to documents containing text and optional presentation semantics (display appearance instructions), such as (but not limited to) HTML, DHTML, XML, SGML, PDF, e-mail, Word documents, PowerPoint documents, newsgroup postings, multimedia objects, image interchange format pictures, and/or Shockwave Flash files. The presentation semantics can be defined implicitly or explicitly in languages known in the art or developed in the future, such as (but not limited to) CSS. In the case of explicitly defined presentation semantics, the presentation semantics may be included with the data in the same file, defined in an external file, or a combination of both. An external presentation semantics file of a document is defined herein as an introduction file. Thus, while the disclosure and the examples of use presented herein are sometimes described in terms of HTML and CSS, this should be understood as an example of the use and utility of the present disclosure, and not as any limitation on the scope of its use. Rather, the present disclosure should be understood to apply equally to any document, file, or collection of files integrated as a unit, known in the art or developed in the future, including text, audio and video objects, images, and other multimedia objects with optional presentation semantic instructions.
The present disclosure relates generally to methods and systems for searching information from data sources. More particularly, it relates to a method of visually identifying, extracting, and indexing snippets or sections of a document, and to a method of matching them to paragraph, table, list, menu, fixed-width text, key/value, chart, question/answer, timeline, and interaction types (hereinafter referred to as "visual elements"), similar to the way people classify content by viewing a document on a display interface.
In an aspect of the disclosure, a person identifies and/or extracts visual elements and their visual element types in a document through presentation of the document on a display interface such as a display screen or paper. In another aspect of the disclosure, the system automatically identifies and/or extracts visual elements and their visual element types with the help of blocks and configuration files. In yet another aspect of the disclosure, a combination of human and system effort can be used for visual element identification and extraction.
A block is a logical unit of a document. A configuration file is a set of rules designed to identify and classify matching blocks into visual element types when the blocks are displayed as part of the document on a display interface, much as a person classifies blocks by visually viewing them.
It is another object of the present disclosure to index and rank the found visual elements.
It is another object of the present disclosure to provide an interface for a user to search for information related to a topic by specifying the results as one or more visual element types, wherein the search interface provides suggestions to the user during the search query entry phase and further suggestions to the user in the results report.
It is another object of the present disclosure to display the prioritized search results in the form of a horizontal list and/or a vertical list and/or a grid as feedback to the user search query.
It is another object of the present disclosure to present a brief text summary for each search result. This summary is one or more text snippets in a document that contain keywords of the search query. In another aspect of the disclosure, the result summary is presented in the same visual element type as found in the initial document, such as a result summary presented as a table when the result was found in a table in the initial document. In yet another aspect, the result summary is presented with both the same visual element type and the same presentation semantics as found in the initial document.
It is another object of the present disclosure to present advertisements in a results report when a user performs a search using a search query that also contains a visual element type. In another aspect of the disclosure, a merchant may participate in bidding for ad slots on a result report derived from a search query that also includes a visual element type. In yet another aspect of the present disclosure, a document author or document owner may pay for a visual element of the document to be indexed and included as part of a search result.
Brief Description of Drawings
FIG. 1 is a block diagram illustrating an example search engine system in accordance with an embodiment of the present disclosure.
FIG. 2 is a block diagram of a computing device illustrating the example search engine of FIG. 1.
FIG. 3 depicts source content data of an example document.
FIG. 4 depicts example representation semantics of the source content data of FIG. 3.
FIG. 5 depicts example metadata for the source content data of FIG. 3.
FIG. 6 depicts example criteria data for block identification.
FIG. 7 depicts example criteria data for block identification.
FIG. 8 depicts an example of the source content data of FIG. 3 presented to a display device.
FIG. 9 depicts the rendering of FIG. 8 demonstrating the block partitioning performed by the block partitioning and indexing logic component of FIG. 2.
FIG. 10 depicts an example block partitioning performed by the block partitioning and indexing logic component of FIG. 2.
FIG. 11 depicts an example block partitioning performed by the block partitioning and indexing logic component of FIG. 2.
FIG. 12 depicts the merging of the blocks of FIG. 11.
FIG. 13 is a partial table showing importance values for each font family, font size, and font weight combination.
FIG. 14 depicts example block partitioning performed by the block partitioning and indexing logic component of FIG. 2.
FIG. 15 depicts example source content data containing a "list" type of visual element.
FIG. 16 depicts example source content data containing a "fixed width text" type visual element.
FIG. 17 depicts example presentation semantic data for the source content data of FIG. 16.
FIG. 18 depicts example source content data containing a "list" type of visual element.
FIG. 19 depicts example presentation semantic data for the source content data of FIG. 18.
FIG. 20 depicts example source content data containing a "list" type of visual element.
FIG. 21 depicts example source content data containing a "paragraph" type of visual element.
FIG. 22 depicts example presentation semantic data for the source content data of FIG. 21.
FIG. 23 depicts example source content data containing a "table" type visual element.
FIG. 24 depicts example presentation semantic data for the source content data of FIG. 23.
FIG. 25 depicts example source content data containing a "table" type visual element.
FIG. 26 depicts example presentation semantic data for the source content data of FIG. 25.
FIG. 27 depicts example source content data containing "key/value" type visual elements.
FIG. 28 depicts example presentation semantic data for the source content data of FIG. 27.
FIG. 29 depicts example source content data containing a visual element of the "question/answer" type.
FIG. 30 depicts example source content data containing a "menu" type visual element.
FIG. 31 depicts example presentation semantic data for the source content data of FIG. 30.
FIG. 32 depicts example source content data containing a "fixed width text" type visual element.
FIG. 33 depicts example presentation semantic data for the source content data of FIG. 32.
FIG. 34 depicts example source content data containing "timeline" type visual elements.
FIG. 35 depicts example source content data containing visual elements of the "chart" type.
FIG. 36 depicts example source content data containing "interactive" type visual elements.
FIG. 37 depicts example source content data shown in accordance with an embodiment of the disclosure.
FIG. 38 is a flow diagram depicting an example structure and function of the block partitioning and indexing logic component of FIG. 2.
FIG. 39 depicts an example Graphical User Interface (GUI) that may be presented to a user by the search engine system of FIG. 1.
FIG. 40 depicts another example GUI that may be presented to a user by the search engine system of FIG. 1.
FIG. 41 depicts an example GUI that provides keyword suggestions to a user when entering a search query in the search engine system of FIG. 1.
FIG. 42 depicts an example result report that may be presented to a user by the search engine system of FIG. 1 as feedback to a user search for a "list" type visual element.
FIG. 43 depicts another example result report that may be presented to a user by the search engine system of FIG. 1 as feedback to a user search for a "list" type visual element.
FIG. 44 depicts an example result report that may be presented to a user by the search engine system of FIG. 1 as feedback to a user search for "list" and "table" type visual elements.
Detailed Description
The present disclosure relates to systems and methods for searching and indexing documents. A system according to an embodiment of the present disclosure uses a crawler that locates documents (or web pages) on a network. Once the documents are located, the system divides each located document into blocks based on predefined rules. In addition, the system locates visual elements, such as tables, paragraphs, headings, lists, and fixed-width text, within each document based on predefined rules. The visual elements are determined by analyzing the document source content, the presentation semantics of the document, and metadata related to the document. Once these visual elements are found, they are indexed. The user may then search for documents that contain particular visual elements.
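The crawl-partition-classify-index flow described above can be sketched as follows. This is a minimal illustrative sketch only: the function names and the toy classification rules are assumptions for this example, not the disclosure's actual configuration-file rules.

```python
# Hypothetical sketch of the indexing pipeline: classify each block into a
# visual element type, then index blocks by type. All rules are toy examples.

def classify_visual_element(block_text):
    """Classify a block's text into a visual element type using toy rules."""
    lines = [ln for ln in block_text.splitlines() if ln.strip()]
    if len(lines) > 1 and all(ln.lstrip().startswith(("-", "*")) for ln in lines):
        return "list"        # several lines, each bullet-marked
    if "|" in block_text:
        return "table"       # column separators suggest tabular data
    return "paragraph"       # fallback: running text

def index_document(blocks):
    """Map each visual element type to the blocks classified under it."""
    index = {}
    for block in blocks:
        index.setdefault(classify_visual_element(block), []).append(block)
    return index

doc_blocks = ["- cats\n- dogs", "Name | Capital", "Plain running text."]
print(index_document(doc_blocks))
```

A user query for a given visual element type would then be answered from the per-type entries of such an index rather than from a flat keyword index.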
FIG. 1 depicts a web search engine system 100 according to an example embodiment of the present disclosure. System 100 includes a web server 101, a search engine server 102, and a client 103. Web server 101, search engine server 102, and client 103 all communicate over network 104.
Network 104 may include any type of network known in the art or developed in the future. In this regard, the network 104 may be an Ethernet network, a Local Area Network (LAN), or a Wide Area Network (WAN), such as the Internet or a combination of networks.
The example search engine server 102 includes a crawler logic component 105, a block partitioning and indexing logic component 106, and a search engine logic component 107, as well as document and presentation data 108, index data 109, and advertisement data 110.
In the example search engine server 102, the crawler logic component 105 retrieves web documents, particularly HTML web pages, and their associated Cascading Style Sheet (CSS) introduction files and stores them in the document and presentation data 108. The crawler logic component 105 is an automated browser that follows each link it encounters in the crawled documents. Each link identifies a web page 111 provided by the web server 101. For simplicity, the web server 101 is shown in FIG. 1 as serving only one web page 111. However, so long as the web server 101 is communicatively connected to the network 104, the web server 101 may serve a plurality of documents, and the crawler logic component 105 may retrieve any document identified by a link and served by the web server 101. It should be noted that the web page 111 may be a web page or a document.
It should be noted that the crawler logic component 105 may store additional information related to the document in the document and presentation data 108, such as the link identifying the document, the date and time the document was last modified, the date and time the document was crawled, and the size of the document, among others.
It should also be noted that in those instances where the documents to be searched and their associated introduction files are otherwise made available to the block partitioning and indexing logic component 106, the crawler logic component 105 may not be required.
The search engine server 102 also includes the block partitioning and indexing logic component 106, which analyzes the documents and their associated introduction files in the document and presentation data 108. For each document, the block partitioning and indexing logic component 106 divides the document into logical units, also referred to herein as blocks, and identifies the visual elements that are part of each block with the aid of a configuration file. It also creates an index of the identified visual elements in the index data 109. The block partitioning and indexing logic component 106 will be further explained with reference to FIGS. 3 through 38.
Once the index is created and stored in the index data 109, the user 113, through a client logic component 112 running on the client computing device 103, can enter a search query composed of keywords and one or more visual element types that identify the types of information the user is interested in searching for. Example interfaces presented by the client logic component 112 to the user 113 for receiving user search queries will be further detailed with reference to FIGS. 39 through 44.
The client logic component 112 may include, for example, an Internet browser; however, other types of client logic components 112 for interacting with the user 113 and communicating with the search engine logic component 107 may be used in other embodiments of the present disclosure. The client logic component 112 transmits the user search query to the search engine server 102 over the network 104. The search engine logic component 107, upon receiving the user search query, examines the index data 109 to determine whether it contains terms that match the user search query, narrowed by the visual element type in the user search query. If so, the search engine logic component 107 compiles a prioritized list of all documents that contain all or some of the keywords in the specified visual element type and returns the list to the client logic component 112, which displays the results to the user 113 in a window.
In another embodiment, the search engine logic component 107, upon receiving the user search query, may not narrow the search results by the visual element type in the user search query, but may instead give a higher relevance or ranking to documents in which the keywords are found within the visual element type specified in the user search query. Thus, if two web pages (or documents) have words that match the keywords in the user search query, and one of the two web pages has the keywords in a visual element of the type specified in the user search query but the other conditions are the same, then, as feedback to the search query, the web page having the keywords in the visual element of the specified type will have a higher ranking in the search results sent to the user. Accordingly, the search results are based not only on whether and to what extent a given web page has words that match the keywords, but also on the context in which the matching words are used (e.g., whether the matching words appear in a visual element of the specified type).
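The ranking idea in this embodiment can be sketched as follows: a keyword match found inside the visual element type named in the query outranks an ordinary match elsewhere in the document. The scoring function, block representation, and boost weight are illustrative assumptions, not the disclosure's actual ranking formula.

```python
# Toy ranking sketch: documents are lists of (visual_element_type, text) blocks.
# A match inside the requested visual element type earns a boosted score.

def score(document, keywords, wanted_type, boost=2.0):
    """Sum match scores over a document's blocks; boost in-type matches."""
    total = 0.0
    for element_type, text in document:
        for kw in keywords:
            if kw.lower() in text.lower():
                total += boost if element_type == wanted_type else 1.0
    return total

page_a = [("paragraph", "dinosaur movies"), ("table", "Jurassic Park 1993")]
page_b = [("paragraph", "Jurassic Park dinosaur movies")]

# With wanted_type "table", page_a outranks page_b even though both match.
ranked = sorted([("a", page_a), ("b", page_b)],
                key=lambda p: score(p[1], ["jurassic"], "table"),
                reverse=True)
print([name for name, _ in ranked])  # ['a', 'b']
```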
In another embodiment, search engine logic 107 may include advertisements derived from advertisement data 110 in addition to the search results as feedback to the user search query.
FIG. 2 depicts an example search engine server 102, according to an embodiment of the present disclosure. Search engine server 102 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or utility of the disclosure.
The search engine server 102 may include a bus 206, a processor 201, a memory 202, a network device 203, an input device 204, and an output device 205. The bus 206 may include a pathway that allows communication among the components of the search engine server 102.
The memory 202 stores the crawler logic component 105, the block partitioning and indexing logic component 106, the search engine logic component 107, the document and presentation data 108, the index data 109, and the advertisement data 110. These components may be implemented in software, hardware, firmware, or a combination thereof. In this example embodiment, they are implemented as software stored in the memory 202.
Memory 202 may be any type of computer memory known in the art or developed in the future for electronically storing data and/or logic, including volatile and non-volatile memory. In this regard, the memory 202 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, any magnetic computer storage device including hard disks, floppy disks, or magnetic tape, and optical disks.
The processor 201 includes processing hardware for interpreting or executing tasks or instructions stored in the memory 202. It should be noted that the processor 201 may be a microprocessor, digital processor, or other type of circuit configured to interpret and/or execute instructions.
Network device 203 may be any type of network device (e.g., a modem) known in the art or developed in the future for communicating over network 104 (fig. 1). In this regard, search engine server 102 (FIG. 1) communicates with network server 101 (FIG. 1) and client computing device 103 (FIG. 1) over network 104 (FIG. 1) through the network device 203.
The input device 204 may be any type of input device known in the art or developed in the future for receiving data from the user 113 (FIG. 1). As examples, the input device 204 may be a keyboard, mouse, touch screen, serial port, scanner, camera, or microphone.
The output device 205 may be any type of output device known in the art or developed in the future for displaying data to the user 113 (FIG. 1). As examples, the output device 205 may be a liquid crystal display (LCD) or other type of video display device, a speaker, or a printer.
It should be noted that the present disclosure may also be practiced in distributed computing environments where multiple computing devices that are communicatively connected to the network perform the tasks or instructions of the search engine server 102 (FIG. 1).
It should also be noted that the components of the search engine server 102 may be implemented by software, hardware, firmware, or any combination thereof. In this example search engine server 102, all of the components are implemented by software stored in the memory 202, as shown in FIG. 2.
FIGS. 3-14 illustrate the division of a document into blocks by the block partitioning and indexing logic component 106, which enhances the ability of the system to locate visual elements within the document.
As previously described, a block is a logical unit of a document. One way to interpret blocks is as follows: letters make up words, words make up sentences, sentences make up blocks, blocks make up larger blocks, and the document itself is the largest block. Depending on the document type, line feeds, tags, presentation semantics, and/or calculated data can help identify blocks.
As described above, blocks may be nested, because a block may contain internal blocks. A block that contains an internal block may be referred to as the parent block of the internal block, and accordingly, the internal block may be referred to as a child block (or sub-block). A parent block encloses all of its children. It should be noted that a block may have multiple ancestor blocks due to nesting. A block without sub-blocks is a special type of block, which may be referred to as a block item (block entry). The entire document is a logical unit and is therefore also a block, referred to as the root block. Each block except the root block has a parent block.
To record the nesting of blocks, each block is assigned a level. Levels are assigned such that two blocks with the same number of parent blocks have the same level, and two blocks with different numbers of parent blocks have different levels. In an example embodiment of the present disclosure, the level of a block is equal to the number of its parent blocks, and the level of the root block is zero.
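The nesting and level assignment described above can be sketched as a small tree structure, where a block's level counts its parent (ancestor) blocks and the root document block sits at level zero. The `Block` class and its fields are illustrative assumptions for this sketch, not part of the disclosure.

```python
# Sketch of block nesting: level == number of ancestor blocks, root == 0.

class Block:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def level(self):
        """The root block has level zero; each nesting step adds one."""
        return 0 if self.parent is None else self.parent.level + 1

root = Block()         # the entire document is the root block
section = Block(root)  # a block nested in the document
item = Block(section)  # a block with no sub-blocks, i.e. a block item
print(root.level, section.level, item.level)  # 0 1 2
```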
Once the crawler logic component 105 (FIG. 1) downloads a document and any introduction files associated with the document into the document and presentation data 108 (FIG. 1), the block partitioning and indexing logic component 106 (FIG. 1) analyzes the document source content and its presentation semantics and generates initial metadata. Metadata, as the term is used herein, is generally understood to include any aggregated, derived, or calculated information relating to the document. In view of this, metadata may include such things as tags, tag attributes, implicit and explicit presentation semantics, location data for the text (if the document is presented on a display device), comments, additional calculated values about the text itself (such as whether the text is a block, whether the text is a block item, average font size, etc.), and additional values calculated from previously identified/calculated metadata. These examples of metadata are typical examples, and other types of metadata may be used in other embodiments.
It should be noted that as the block partitioning and indexing logic component 106 (FIG. 1) executes, the initial metadata may be augmented with additional metadata. It should also be noted that different segmentations of the document may result in different metadata, and that the amount and type of metadata aggregated may vary from document type to document type.
FIG. 3 depicts source content 301 of a portion of an example HTML document 300 downloaded by the crawler logic component 105 (FIG. 1) and stored in the document and presentation data 108 (FIG. 1). The source content 301 is composed of <DIV> element 302 and <DIV> element 303. Two empty <BR> elements 304 and 305 are nested in <DIV> element 303. It should be noted that the source content 301 is composed of markup content and text content.
FIG. 4 depicts a presentation semantics portion 401 of example presentation semantics 400 for the HTML document 300 (FIG. 3), downloaded by the crawler logic component 105 (FIG. 1) and stored in the document and presentation data 108 (FIG. 1). The presentation semantics 401 relate to the source content 301 (FIG. 3).
FIG. 5 depicts a portion of metadata 500 generated by the block partitioning and indexing logic component 106 (FIG. 1) for <DIV> element 303 (FIG. 3). The metadata 500 includes a plurality of metadata attributes 501-512, each composed of a key/value pair. Metadata attributes 501 and 502 are derived from the tag of <DIV> element 303 (FIG. 3). Metadata attribute 503 is derived from the explicit presentation semantics 401 (FIG. 4). Metadata attributes 504, 505, and 506 are derived from the implicit presentation semantics of document 300 (FIG. 3). Metadata attributes 507 through 510 are derived from the rendering of document 300 (FIG. 3). Metadata attributes 507 and 508 identify the starting location on the display where the text content of <DIV> element 303 (FIG. 3), as seen by the user, begins. Metadata attributes 509 and 510 identify the width and height, respectively, of the text content of <DIV> element 303 (FIG. 3) as viewed by the user on the display interface. It should be noted that the rendering of documents for computing metadata may be done on a virtual display interface in the memory 202 (FIG. 2).
Metadata attributes 511 and 512 are calculated by the block partitioning and indexing logic component 106 (FIG. 1) and augment the existing metadata of <DIV> element 303 (FIG. 3) after the block identification process, as will be further detailed in FIGS. 6-12. Assume that <DIV> element 303 (FIG. 3) is identified as a block by the block partitioning and indexing logic component 106 (FIG. 1) during block identification. This information is stored in metadata attribute 511, and metadata attribute 512 holds the calculated level of the block.
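Metadata of the kind described for FIG. 5 can be sketched as a plain key/value mapping. The specific keys and values below are assumptions chosen to mirror the categories named above (tag, explicit and implicit presentation semantics, rendering data, and calculated block attributes); they are not the actual attributes 501-512 of the disclosure.

```python
# Illustrative metadata for a <DIV> element, grouped by where each
# key/value pair would come from. All concrete values are made up.

div_metadata = {
    "start-tag": "div",          # from the element's markup tag
    "color": "#333333",          # from explicit presentation semantics (CSS)
    "display": "block",          # from implicit presentation semantics
    "x": 10, "y": 120,           # rendering: where the text content starts
    "width": 480, "height": 40,  # rendering: extent of the text content
}

# Attributes calculated during block identification augment the metadata:
div_metadata["is-block"] = True
div_metadata["level"] = 2
print(sorted(div_metadata))
```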
After the initial computation of metadata from the document source content and presentation semantics, the next step performed by the block partitioning and indexing logic component 106 (FIG. 1) is to identify blocks. To determine whether a segment of the document is a block, the block partitioning and indexing logic component 106 (FIG. 1) checks whether the segment satisfies all of the criteria in at least one set of block identification criteria. Each criterion in a set is a rule on metadata attributes that evaluates to true or false. The sets of block identification criteria are designed in such a way that only segments that are blocks evaluate to true for all criteria in a set.
One way to design a set of block identification criteria is to visually observe documents and identify their blocks, generate metadata for the document content, and then identify the metadata attributes that distinguish those particular portions that can be visually identified as blocks.
It should be noted that there may be multiple sets of block identification criteria and a document segment is a block if it satisfies at least one set of block identification criteria.
FIG. 6 depicts an example of a set of block identification criteria 600. Criteria set 600 consists of only one criterion 601. Criterion 601 evaluates to true for any segment having a metadata attribute whose key equals "display" and whose corresponding value equals "block". It should be noted that <DIV> element 303 (FIG. 3) has a metadata attribute 505 (FIG. 5) that satisfies criterion 601, and thus criteria set 600. The <DIV> element 303 (FIG. 3) is therefore a block, because it satisfies at least one set of criteria.
FIG. 7 depicts an example of another block identification criteria set 700. Criteria set 700 consists of two criteria, 701 and 702. Criterion 701 requires that there be a metadata attribute keyed "start-tag" whose corresponding value is "br". Criterion 702 requires that there be no metadata attribute keyed "display" with a non-null value. The metadata (not shown) of <BR> elements 304 (FIG. 3) and 305 (FIG. 3) satisfies both criteria of criteria set 700. Thus, an empty <BR/> element is also a block.
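By way of a non-limiting illustration, the criteria-set evaluation described above may be sketched as follows. The dictionary-based metadata representation and the attribute keys are illustrative assumptions; the two criteria sets mirror the examples of FIGS. 6 and 7.

```python
# A minimal sketch of block identification: each document segment's
# metadata is assumed to be a flat dict of attribute keys to values.

def satisfies_set(metadata, criteria_set):
    # A criteria set is satisfied only if every criterion evaluates to true.
    return all(criterion(metadata) for criterion in criteria_set)

def is_block(metadata, criteria_sets):
    # A document segment is a block if it satisfies at least one set.
    return any(satisfies_set(metadata, s) for s in criteria_sets)

# Criteria set 600: a single criterion on the "display" attribute.
criteria_set_600 = [lambda md: md.get("display") == "block"]

# Criteria set 700: a "br" start tag and no non-null "display" attribute.
criteria_set_700 = [
    lambda md: md.get("start-tag") == "br",
    lambda md: md.get("display") is None,
]

all_sets = [criteria_set_600, criteria_set_700]
div_metadata = {"start-tag": "div", "display": "block"}  # cf. element 303
br_metadata = {"start-tag": "br"}                        # cf. element 304

print(is_block(div_metadata, all_sets))  # True
print(is_block(br_metadata, all_sets))   # True
```

A segment whose metadata satisfies neither set, such as a plain inline element, evaluates to false and is not treated as a block.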
FIG. 8 is a diagram of a portion 801 of document 300 (FIG. 3), identified by source content 301 (FIG. 3), as seen by a user on a display interface. It should be noted that although the source content 301 (FIG. 3) consists of both markup content and text content, the user sees only the text content when the document is presented on the display interface.
FIG. 9 is a diagram showing all blocks in a portion of document 300 (FIG. 3) identified by source content 301 (FIG. 3). Solid-line boxes are used in FIG. 9 to identify blocks 901 through 906, which are identified by the block segmentation and indexing logic component 106 (FIG. 1) through the sets of block identification criteria. Note that block item 902 is a sub-block of block 901. Note further that block 904 is an immediate sub-block of block 903, and that block 905 is a sub-block of block 903 but not an immediate sub-block of block 903.
In an example embodiment of the present disclosure, every text segment in the document may belong to one block item (not necessarily the same block item), and no text segment should belong to multiple block items. The text in block item 902 satisfies these conditions because it belongs to one and only one block item, 902. However, it should be noted that text segments 907 through 911 are each part of a block but not part of any block item. These independent text segments are designated as block items by the block segmentation and indexing logic component 106 (FIG. 1). Text segment 907 is designated as a block item whose parent is block 903. Similarly, the remaining text segments 908 through 911 are each designated as block items. It should be noted that a block item is also a block.
It should be noted that in other embodiments, only some, or even none, of the independent text segments may be designated as block items.
At this stage, the block segmentation and indexing logic component 106 (FIG. 1) has divided the entire document into blocks, and a list of the identified blocks is ready. Next, the block segmentation and indexing logic component 106 (FIG. 1) performs a series of operations on the blocks in the block list.
The first block operation performed by the block segmentation and indexing logic component 106 (FIG. 1) is to identify and remove any empty blocks from the block list. To accomplish this, the block segmentation and indexing logic component 106 (FIG. 1) traverses all blocks in the block list in descending order of level and checks whether each block is empty. If a block is determined to be empty, it is removed from the block list. A block is considered empty if nothing belonging to the block is present or drawn on the display interface. Empty blocks may be identified by sets of empty block identification criteria: to be identified as empty, a block must satisfy all criteria in at least one set of empty block identification criteria. Metadata attributes such as a "display" value of none, a "visibility" value of hidden, a display size of zero, an "overflow" value of hidden, or neither the text nor the borders of the block being visible help identify empty blocks and are candidates for criteria in the sets of empty block identification criteria.
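The empty-block pass described above may be sketched as follows. The block records, level values, and attribute names ("display", "visibility", and so on) are illustrative stand-ins for the generalized metadata attributes, not the disclosure's exact representation.

```python
# Sketch: removing empty blocks from the block list in descending order
# of level, using sets of empty-block identification criteria.

def is_empty_block(md):
    # All criteria in at least one empty-block criteria set must hold.
    empty_criteria_sets = [
        [lambda m: m.get("display") == "none"],
        [lambda m: m.get("visibility") == "hidden"],
        [lambda m: m.get("size") == 0],
        [lambda m: not m.get("text-visible") and not m.get("border-visible")],
    ]
    return any(all(c(md) for c in cs) for cs in empty_criteria_sets)

def remove_empty_blocks(block_list):
    # Traverse a sorted copy so removal from block_list is safe.
    for block in sorted(block_list, key=lambda b: b["level"], reverse=True):
        if is_empty_block(block["metadata"]):
            block_list.remove(block)
    return block_list

blocks = [
    {"id": 904, "level": 3, "metadata": {"text-visible": True}},
    {"id": 905, "level": 4, "metadata": {"text-visible": False,
                                         "border-visible": False}},
]
remove_empty_blocks(blocks)
print([b["id"] for b in blocks])  # [904]
```

Block 905, with neither visible text nor visible borders, satisfies the last criteria set and is removed, mirroring the treatment of blocks 905 and 906 below.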
Again, this technique of identifying empty blocks can be refined by processing documents and comparing the results with visual observation of the rendered documents. If no portion of a block is rendered on the screen and the program fails to identify the block as empty, then there is a combination of generalized metadata attributes identifying empty blocks that needs to be incorporated into the block segmentation and indexing logic component 106 (FIG. 1). On the other hand, if the block segmentation and indexing logic component 106 (FIG. 1) identifies a block as empty while some portion of that block is visible when the document is presented on the display interface, then a generalized set of metadata attributes used to identify empty blocks is wrongly identifying as empty a block that is not empty, and that set must be corrected.
As previously described, the block segmentation and indexing logic component 106 (FIG. 1) identifies blocks 901 through 911 in the source content 301 (FIG. 3). As may be inferred from the presentation of source content 301 (FIG. 3) in FIG. 8, no portion of blocks 905 and 906 is displayed on the display interface, and thus blocks 905 and 906 are empty blocks. The block segmentation and indexing logic component 106 (FIG. 1) determines that blocks 905 and 906 are empty because neither the text nor the borders of the two blocks are visible, and it removes the two blocks from the block list. It should also be noted that removing blocks 905 and 906 leaves block 904 with three sub-block items, 908, 909, and 910, instead of five.
The next operation performed by the block segmentation and indexing logic component 106 (FIG. 1) is to check whether any block items are covered. It is known from the position metadata that each block item occupies a rectangular area. If two or more block items overlap each other, all of the overlapping block items may be deleted from the block list.
It should be noted that in other embodiments, only blocks that are covered by other blocks may be deleted from the block list. In yet another embodiment, only blocks that are covered by other non-transparent blocks may be deleted from the block list.
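Since each block item occupies a rectangular area per its position metadata, the overlap check may be sketched as a standard rectangle-intersection test. The (left, top, width, height) layout of the position metadata is an illustrative assumption.

```python
# Sketch of the block-item overlap test: two axis-aligned rectangles
# given as (left, top, width, height) tuples overlap if each starts
# before the other ends on both axes.
def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

print(overlaps((0, 0, 10, 10), (5, 5, 10, 10)))  # True
print(overlaps((0, 0, 10, 10), (20, 0, 5, 5)))   # False
```

Note that with strict inequalities, block items that merely touch at an edge are not treated as overlapping.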
The next block operation performed by the block segmentation and indexing logic component 106 (FIG. 1) is to identify and remove intermediate blocks from the block list. A block is considered an intermediate block if it has only one immediate sub-block.
FIG. 10 is a diagram showing a portion of a document (not to scale) as seen by a user, with boxes added to the diagram to designate the blocks identified in that portion by the block segmentation and indexing logic component 106 (FIG. 1). The segment consists of four blocks 1001 through 1004. Assume that the level of block 1001 is 2, the level of block 1002 is 3, and the levels of blocks 1003 and 1004 are 4. The block segmentation and indexing logic component 106 (FIG. 1) determines that block 1001 has only one immediate sub-block, 1002, so block 1001 is an intermediate block and is removed from the block list. The sub-blocks 1002, 1003, and 1004 of intermediate block 1001 now fall under the immediate parent block (not shown) of block 1001. Also, the levels of all blocks within intermediate block 1001 are recalculated, so that block 1002 has a new level of 2 and blocks 1003 and 1004 have a new level of 3.
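The removal of an intermediate block and the recalculation of levels may be sketched as follows. The parent-pointer representation is an illustrative assumption, and the absolute level values here depend on how deep the assumed parent block 1000 itself sits (it is treated as the root), so they differ from the example figures by a constant offset.

```python
# Sketch of intermediate-block elimination over blocks stored as
# {id: {"parent": id_or_None}}.

def level(blocks, bid):
    # A block's level is recomputed as the length of its parent chain.
    lvl, parent = 0, blocks[bid]["parent"]
    while parent is not None:
        lvl, parent = lvl + 1, blocks[parent]["parent"]
    return lvl

def remove_intermediate_block(blocks, bid):
    # A block with exactly one immediate sub-block is intermediate:
    # reparent its child to its own parent, then delete it.
    children = [k for k, b in blocks.items() if b["parent"] == bid]
    if len(children) != 1:
        return False  # not an intermediate block
    blocks[children[0]]["parent"] = blocks[bid]["parent"]
    del blocks[bid]
    return True

# Block 1001 has the single immediate sub-block 1002 (cf. FIG. 10);
# block 1000 stands in for the parent block that is not shown.
blocks = {
    1000: {"parent": None},
    1001: {"parent": 1000},
    1002: {"parent": 1001},
    1003: {"parent": 1002},
    1004: {"parent": 1002},
}
remove_intermediate_block(blocks, 1001)
print(level(blocks, 1002), level(blocks, 1003))  # 1 2
```

After removal, block 1002 hangs directly from block 1000, and the levels of blocks 1003 and 1004 each decrease by one, matching the recalculation described above.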
It should be noted that some other blocks, such as table row blocks, header blocks, footer blocks, and title blocks, may also be considered intermediate blocks.
The next operation performed by the block segmentation and indexing logic component 106 (FIG. 1) is to merge eligible blocks into a larger block. When a user views a presented document segment and recognizes a single logical unit, while the block segmentation and indexing logic component 106 (FIG. 1) identifies more than one block in the same document segment, the two or more identified blocks become candidates for merging into a single block.
Assume that FIG. 11 depicts (not to scale) a portion of the rendered document, and assume further that the block segmentation and indexing logic component 106 (FIG. 1) identifies three blocks 1101, 1102, and 1103, indicated by the added boxes in FIG. 11. The block segmentation and indexing logic component 106 (FIG. 1) identifies three blocks, whereas a user viewing the document recognizes only one logical unit. Because block 1102 begins with a date, and may therefore be part of a timeline visual element, while block 1103 does not begin with a date, the two blocks can be merged. When blocks 1102 and 1103 are merged into a single block 1201 (FIG. 12), block 1101 qualifies as an intermediate block; removing it results in recalculating the parent block and level of block 1201 (FIG. 12). FIG. 12 depicts the same portion of the document as FIG. 11 after the eligible blocks have been merged, with an added box identifying the merged block. The merging results in removing the three blocks 1101, 1102, and 1103 from the block list and adding the one block 1201.
It should be noted that, in order to determine eligible blocks to be merged, analysis of block data and analysis of neighboring block data may be included in addition to the metadata.
The next operation performed by the block segmentation and indexing logic component 106 (FIG. 1) is to calculate the font metadata attributes block item font family, block item font size, and block item font weight for each block item in the block list. Since different text segments in a block item may have different font characteristics (font family, font size, and font weight), it is useful to calculate typical font characteristics for all of the text in the block item. The block item font family is the typical font family, the block item font size the typical font size, and the block item font weight the typical font weight of all the text in the block item.
To calculate the typical font metadata attributes for a block item, a triple consisting of the font family metadata attribute value, the font size metadata attribute value, and the font weight metadata attribute value is prepared for each visible word in the text of the block item. The font family, font size, and font weight of the most frequently occurring triple (the statistical mode) are the respective metadata attribute values of the block item font family, block item font size, and block item font weight. The total number of distinct triples in a block item may also be useful and may be stored in another metadata attribute, the block item variation value, for that block item.
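The statistical-mode computation over per-word font triples may be sketched as follows; the attribute names in the returned dictionary are illustrative.

```python
# Sketch of computing the typical (modal) font triple for a block item
# from one (family, size, weight) triple per visible word.
from collections import Counter

def typical_font(word_triples):
    counts = Counter(word_triples)
    # most_common(1) yields the modal triple and its count.
    (family, size, weight), _ = counts.most_common(1)[0]
    return {
        "block-item-font-family": family,
        "block-item-font-size": size,
        "block-item-font-weight": weight,
        # Variation value: number of distinct triples in the block item.
        "block-item-variation": len(counts),
    }

words = [("Arial", 8, 400)] * 5 + [("Arial", 12, 700)] * 2
print(typical_font(words))
# {'block-item-font-family': 'Arial', 'block-item-font-size': 8,
#  'block-item-font-weight': 400, 'block-item-variation': 2}
```

Swapping `Counter` for a maximum over font size, weight, or importance would give the alternative selection rules mentioned below for block items with only a few words.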
It should be noted that for block items whose text consists of only a few words, it may be useful to calculate the font metadata attributes not from the most frequently occurring triple but from the triple with the largest font size value, the largest font weight value, or the largest importance value. The importance value of a triple is detailed in FIG. 13. It should also be noted that in other embodiments, subscript words, superscript words, and words belonging to particular font families such as Webdings, in addition to invisible words, need not be considered in calculating block item font metadata attributes. In yet another embodiment, triples may be prepared for characters instead of words, or the statistical average may be used instead of the statistical mode in calculating block item font metadata attributes.
The block segmentation and indexing logic component 106 (FIG. 1) may also calculate an additional metadata attribute, block item importance, as defined in FIG. 13, which applies only to block items. The block item importance metadata attribute of a block item may be considered the importance of the block item relative to the remainder of the document. The block item importance metadata attribute of a block item is a function of the block item font family, the block item font size, and the block item font weight of the block. Generally, the larger the block item font size or block item font weight, all other conditions being equal, the greater the block item importance.
In an example embodiment of the present disclosure, the block segmentation and indexing logic component 106 (FIG. 1) provides a lookup table pre-populated with importance values for each combination (triple) of font family, font size, and font weight. FIG. 13 depicts a portion of the lookup table 1300. The triple 1301 consists of the font family "Arial", the font size "8", and the font weight "700", and its importance value 1302 is 1.1. Thus, for a block item having a block item font family value of Arial, a block item font size value of 8, and a block item font weight value of 700, the block item importance is 1.1.
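The lookup may be sketched as a dictionary keyed by (family, size, weight) triples. Only the Arial/8/700 entry is taken from FIG. 13; the other entries and the default value are illustrative assumptions.

```python
# Sketch of the pre-populated importance lookup table of FIG. 13.
importance_table = {
    ("Arial", 8, 400): 1.0,   # assumed baseline entry
    ("Arial", 8, 700): 1.1,   # triple 1301 with importance value 1302
    ("Arial", 12, 700): 1.4,  # assumed entry
}

def block_item_importance(family, size, weight, default=1.0):
    # Fall back to an assumed default for triples not in the table.
    return importance_table.get((family, size, weight), default)

print(block_item_importance("Arial", 8, 700))  # 1.1
```

In the range-valued embodiment noted below, each table value would instead be a (lower, upper) pair.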
It should be noted that in another embodiment, the importance values of the lookup table may be ranges with lower and upper limits rather than single numbers, so that the block item importance is a range value.
It should be noted that in other embodiments, the block segmentation and indexing logic component 106 (FIG. 1) may perform additional block operations that further divide blocks into multiple blocks or merge blocks into larger blocks. These operations may be required if the division of the document into blocks by the block segmentation and indexing logic component 106 (FIG. 1) does not produce the same results as a user viewing the presented document and manually dividing it into logical units. One example of a division of a block into multiple blocks that may be required is when the immediate sub-blocks follow a pattern of a block with a greater block item importance value followed by a series of blocks with lower block item importance values.
Once all block operations are completed, the next step performed by the block segmentation and indexing logic component 106 (FIG. 1) is to identify the title block item for each block in the block list that is not itself a block item. The title block item of a block is usually located at the top of the block, possibly with a larger font size, a larger font weight, a different font family, centering, or a different background color than the remaining sub-blocks of the block.
In an example embodiment of the present disclosure, for an immediate child block item of a block to be designated the title block of its immediate parent block, the immediate child block needs to be located within the first three blocks of the parent block, and/or be centered, and/or have a different foreground or background color than the remaining immediate blocks, and/or have a larger block item font size and/or block item font weight and/or block item importance than the remaining child block items (not necessarily immediate blocks).
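A simplified rendering of this test is sketched below, keeping only the position condition and the size/weight/importance comparisons; the centering and color conditions are omitted for brevity, and the field names are illustrative.

```python
# Sketch of title-block identification: an item in the first three
# sub-block items qualifies if it strictly exceeds all remaining items
# in font size, font weight, or importance.

def find_title_block(sub_items):
    # sub_items: dicts in document order with "size", "weight", "importance".
    # Returns the index of the title block item, or None.
    for i, item in enumerate(sub_items[:3]):
        rest = sub_items[:i] + sub_items[i + 1:]
        if any(all(item[k] > other[k] for other in rest)
               for k in ("size", "weight", "importance")):
            return i
    return None

items = [
    {"size": 14, "weight": 700, "importance": 1.4},  # candidate title
    {"size": 10, "weight": 400, "importance": 1.0},
    {"size": 10, "weight": 400, "importance": 1.0},
]
print(find_title_block(items))  # 0
```

When no item dominates the rest, as with block 1406 below, the function returns None and the block has no title block.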
It should be noted that other embodiments may use other metadata conditions in identifying the title block item of a block. It should also be noted that a user visually observing the document and the block segmentation and indexing logic component 106 (FIG. 1) must be able to identify the same title block items in a block. If the block segmentation and indexing logic component 106 (FIG. 1) identifies a wrong title block item in a block, or fails to identify the correct title block item in a block, the metadata conditions identifying the title block item must be modified.
FIG. 14 is a diagram showing a portion of a document (not to scale) as seen by a user, with boxes added to the diagram to designate the blocks identified in that portion by the block segmentation and indexing logic component 106 (FIG. 1). It is assumed that the block item font size and block item importance metadata attribute values of all sub-block items in block 1401 have been calculated. It is also assumed that block item 1402 has the largest block item font size and block item importance metadata attribute values among all sub-block items of block 1401. Block item 1402 is also located within the first three blocks of block 1401. Thus, block item 1402 is the title block of block 1401. Similarly, among all sub-block items of block 1403, block item 1404 has the largest block item font size and block item importance values and is located first among all sub-blocks of block 1403. Thus, block item 1404 is the title block of block 1403. None of the immediate block items of block 1406 satisfies the conditions required to become a title block. Thus, block 1406 has no title block.
Assume that block 1406 is identified as a list visual element, as described in further detail herein, and also assume that block 1405 is identified as the title of the list visual element. Since the visual element is part of block 1403 and block item 1404 is its title block, block item 1404 may also be considered a title of the list visual element. Likewise, block 1403 is part of block 1401 and block 1402 is its title block, so block 1402 may also be considered a title of the list visual element. In an example embodiment of the present disclosure, a visual element may have only one title. Because title 1405 is the closest of the three identified titles to the visual element, it is considered the title of the list visual element. In another embodiment, a visual element may have multiple titles, and all three identified titles may be considered titles of the visual element.
Once the title blocks are identified, the next step performed by the block segmentation and indexing logic component 106 (FIG. 1) is to identify and index visual elements. There are two types of visual elements: inline visual elements and block visual elements. An inline visual element comprises a portion of a block item, and a block visual element comprises one or more blocks. Typically, inline visual elements are found in a sentence of a block item.
The identification and location of the different visual elements is now further detailed with reference to FIGS. 15 through 37. In particular, dividing the document into blocks also aids the process of locating visual elements within the document. Rectangular dashed boxes are superimposed on the source content in FIGS. 15 through 37 to depict the blocks identified by the block segmentation and indexing logic component 106 (FIG. 1). As will be detailed in FIGS. 15 through 37, the block segmentation and indexing logic component 106 (FIG. 1) identifies and indexes visual elements in the source content with the aid of the data and metadata rules of the configuration files for inline visual elements and block visual elements.
To identify inline visual elements, the block segmentation and indexing logic component 106 (FIG. 1) traverses all block items in the block list. For each block item, the block segmentation and indexing logic component 106 (FIG. 1) identifies sentences and finds visual elements in each sentence with the help of data and metadata rules. The block segmentation and indexing logic component 106 (FIG. 1) also creates an index in the index data 109 (FIG. 1) for each found visual element together with its visual element title and visual element characteristics.
It should be noted that a set of data and metadata rules may identify not only a visual element but also the type of the visual element, the title of the visual element, and other visual-element-specific characteristics. It should also be noted that there may be several sets of rules for identifying inline visual elements in a sentence or portion of a sentence, and that the sentence or portion of a sentence is an inline visual element if it satisfies at least one of those sets.
As in FIG. 15, a user viewing a presentation (not shown) of document source content 1501 on a display interface identifies a list visual element in the second sentence of block 1502. A generalized set of data and metadata rules (a sentence containing the word "are" followed by a ":", thereafter a series of words separated by ",", followed by the word "and" and a final group of words not separated by ",") identifies the second sentence in block 1502 as a list visual element, just as a human user does. Also, the part of the sentence preceding the phrase "are:" is identified as the title of the visual element, and each group of words delimited by "," or by the word "and" following the phrase "are:" is identified as a list item. It should be noted that a human user recognizes the same parts of the sentence as the title and the list items.
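One rough rendering of this generalized rule is a regular expression over the sentence text, sketched below. The pattern and the example sentence are illustrative; the disclosure's rule sets operate on data and metadata and need not be regular expressions.

```python
# Sketch of the "X are: a, b, and c" inline-list rule: the part before
# "are:" is the title, and the comma/"and"-delimited groups are items.
import re

LIST_RE = re.compile(
    r"^(?P<title>.+?\bare):\s*"
    r"(?P<items>(?:[^,]+,\s*)+and\s+[^,.]+)\.?$")

def parse_inline_list(sentence):
    m = LIST_RE.match(sentence.strip())
    if not m:
        return None
    items = [p.strip() for p in m.group("items").split(",")]
    items[-1] = re.sub(r"^and\s+", "", items[-1])  # drop the final "and"
    return {"title": m.group("title"), "items": items}

print(parse_inline_list("The primary colors are: red, green, and blue."))
# {'title': 'The primary colors are', 'items': ['red', 'green', 'blue']}
```

A sentence that does not follow this shape returns None, which is precisely the case where an additional rule set, as discussed next, would be needed.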
It should be noted that the generalized set of data and metadata rules described above may not identify all inline list visual elements. If a human user viewing the document identifies an inline list visual element and the set of data and metadata rules described above does not, a new set of data and metadata rules may be required for the block segmentation and indexing logic component 106 (FIG. 1) to identify inline list visual elements of the different format. Thus, for each inline visual element type there may be several sets of data and metadata rules, and a sentence or portion of a sentence must satisfy at least one of the sets to be identified as the type of visual element that the set is designed to identify.
FIG. 16 depicts source content 1601 and FIG. 17 depicts the presentation semantics 1701 applicable to the source content 1601. The inline visual element identified in the source content 1601 is a fixed-width text visual element. The single word "text-percentage" appears in the font "Courier", which represents fixed-width text. Thus, the single word constitutes a visual element. It should be noted that a user viewing segment 1601 also recognizes the single word "text-percentage" as fixed-width text when the segment is presented.
To identify block visual elements in a document, the block segmentation and indexing logic component 106 (FIG. 1) begins with the block having the highest level in the block list (excluding block items) and determines whether the entire block with all its sub-blocks matches a configuration file, i.e., a set of rules identifying a visual element. If the entire block matches the configuration file, the resulting visual element, consisting of the entire block along with its title and visual element characteristics, is indexed and stored in the index data 109 (FIG. 1), and the block is removed from the block list. If the entire block does not match the configuration file but a subset of contiguous sub-blocks does, the resulting visual element, consisting of that subset of sub-blocks along with its title and visual element characteristics, is indexed and stored in the index data 109 (FIG. 1), and the matching subset of sub-blocks is removed from the block list. If not even a subset of contiguous sub-blocks matches the configuration file, the entire block and all of its sub-blocks are removed from the block list. Whenever blocks are removed, all blocks remaining in the block list are checked again, and any applicable block operations described in further detail herein, including, for example, removing empty blocks, eliminating intermediate blocks, block merging, or block division, are performed. A new block (not a block item) with the highest level is then selected from the block list, and the process repeats until no blocks remain to be matched against the configuration files.
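The outer scan can be sketched as the loop below. For brevity the sketch omits the contiguous-subset matching and the interleaved block operations, and the profiles are simple predicates over illustrative block fields rather than full configuration files.

```python
# Simplified sketch of the block-visual-element scan: repeatedly take
# the highest-level remaining block, try each configuration-file
# predicate, index on a match, and remove the block either way.

def identify_block_visual_elements(block_list, profiles, index_data):
    while block_list:
        block = max(block_list, key=lambda b: b["level"])
        for vtype, matches in profiles.items():
            if matches(block):
                index_data.append({"type": vtype, "block": block["id"]})
                break
        block_list.remove(block)
    return index_data

profiles = {
    # Illustrative list profile: a two-column table whose first-column
    # cells share the same short text (cf. FIG. 18).
    "list": lambda b: b.get("columns") == 2 and b.get("col1-uniform", False),
}
blocks = [{"id": 1802, "level": 3, "columns": 2, "col1-uniform": True},
          {"id": 1401, "level": 2}]
result = identify_block_visual_elements(blocks, profiles, [])
print(result)  # [{'type': 'list', 'block': 1802}]
```

In the full procedure, a failed whole-block match would fall back to contiguous sub-block subsets before the block is discarded.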
A configuration file identifies a particular type of block visual element and is composed of a set of rules. Each rule in the configuration file consists of two parts. The first part identifies one or more blocks. The second part evaluates one or more data and/or metadata attributes of the identified blocks. For example, a configuration file rule may be: all sub-blocks of a block that share a given table column index metadata attribute value must have the same text and must not have a bounding box metadata attribute value equal to zero. If one or more blocks satisfy all of the rules of at least one configuration file, the blocks may be identified as a visual element of the type that the configuration file is designed to identify.
A configuration file may also identify the title of the visual element and other visual element characteristics. If the configuration file does not identify any title, the title block of one of the identified visual element's parent blocks may be considered the title of the visual element.
As shown in FIGS. 18 and 19, when document source content 1800 is rendered (not shown), the visual element recognized by a human user is a list visual element. A configuration file for determining list visual elements looks for blocks consisting of a two-column table whose block items in the first column of the table all have the same letter-free, number-free text with a character length of less than three. Block 1802 and its sub-block items 1803 through 1808 satisfy this configuration file's conditions, so the entire block 1802 is a list visual element. Also, the block item 1801 preceding the list visual element contains the phrase "list of", so the block segmentation and indexing logic component 106 (FIG. 1) identifies the preceding block item as the title of the identified list visual element.
It should be noted that the block segmentation and indexing logic component 106 (FIG. 1) computes table metadata attributes for the <TABLE> element spanning the entire block 1802 when preparing the metadata. These table metadata attributes may include a "table row" attribute having a value of 3 and a "table column" metadata attribute having a value of 2. Also, for each <TD> element spanning one of the block items 1803 through 1808, the block segmentation and indexing logic component 106 (FIG. 1) computes table cell metadata attributes. For the <TD> element spanning the entire block item 1805, a "table cell row index" metadata attribute identifies the row index value as 2, and a "table cell column index" metadata attribute identifies the column index value as 1. The configuration file identifying block 1802 as a list may utilize these table and table cell metadata attributes.
The configuration file also identifies the list visual element characteristics. Each of block items 1804, 1806, and 1808, having a "table cell column index" value of 2, is identified as a list item. The configuration file also determines, from the text of block item 1803, that the list is one with star symbols.
The presentation semantics 1900 cause block 1802, consisting of the <TABLE> element, to be rendered without a bounding box, which further supports identifying the recognized visual element as a list visual element rather than a table.
As in FIG. 20, when the document source content 2000 is presented, the visual element recognized by a human user is also a list visual element. A configuration file for determining list visual elements may look for blocks consisting of an unordered list with at least two sub-block items, where neither the character "-" nor ":" is embedded in the text content of the sub-blocks. Such a configuration file identifies block 2002 as a list visual element. And because block item 2001, preceding block 2002, has a block item importance metadata attribute value larger than those of the other block items 2003 through 2006, and consists of only one sentence containing a plurality of words, block item 2001 is identified as the title of the list visual element.
It should be noted that the block segmentation and indexing logic component 106 (FIG. 1) computes list metadata attributes for the <UL> element spanning the entire block 2002 when preparing the metadata. These list metadata attributes may include a list size metadata attribute with a value of 4, because there are four <LI> elements in the <UL> element. Also, for each <LI> element spanning one of the block items 2003 through 2006, the block segmentation and indexing logic component 106 (FIG. 1) computes list item metadata attributes. For the <LI> element spanning the entire block item 2005, the list item index metadata attribute value identifying its index is 3. The configuration file identifying block 2002 as a list may utilize these list and list item metadata attributes.
The configuration file also identifies the list visual element characteristics. Each of the block items 2003 through 2006 is identified as a list item in view of the list item index metadata attribute of each <LI> element in the block items 2003 through 2006. Also, from metadata inferred from the implicit presentation semantics of the <UL> element spanning block 2002, the configuration file identifies the list as one with solid circle (bullet) symbols.
It should be noted that the above two configuration files for determining whether a block is a list visual element are not exhaustive, and additional configuration file rules may also be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a list visual element.
It should also be noted that while two configuration files are provided here to identify list visual elements, more configuration files identifying list visual elements may be required if a user viewing a presented document identifies a list in a segment of the document and neither of the two configuration files identifies that segment as a list visual element.
As with FIGS. 21 and 22, when the document source content 2100 is rendered according to the presentation semantics 2200, the visual element recognized by a human user is a paragraph visual element. A configuration file for determining paragraph visual elements may look for blocks consisting of at least three sentences and/or at least two hundred words. Such a configuration file identifies block 2101 as a paragraph visual element. Since no title is identified by the configuration file, one or all of the title blocks of the parent blocks (not shown) of block 2101 may be considered the title of the identified paragraph visual element.
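This paragraph configuration file reduces to a simple predicate on the block's text, sketched below; the naive sentence split on terminal punctuation is an illustrative simplification.

```python
# Sketch of the paragraph profile: at least three sentences and/or at
# least two hundred words.
import re

def is_paragraph(text):
    # Naive sentence segmentation: split on runs of ./!/? plus whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = text.split()
    return len(sentences) >= 3 or len(words) >= 200

print(is_paragraph("One. Two. Three sentences here."))  # True
print(is_paragraph("Too short."))                       # False
```

A block such as 2101 that meets either threshold is identified as a paragraph visual element.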
It should be noted that in addition to the title, the configuration file that identifies a paragraph visual element may also identify the size, the number of sentences, and other paragraph characteristics of the paragraph visual element.
It should be noted that the configuration file rules described for determining whether a block is a paragraph visual element are not exhaustive, and additional rules may also be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a paragraph visual element.
It should also be noted that while one configuration file is provided here to identify paragraph visual elements, more configuration files identifying paragraph visual elements may be required if a user viewing the rendered document identifies a paragraph in a segment of the document and the configuration file cannot identify that segment as a paragraph visual element.
As with FIGS. 23 and 24, when the document source content 2300 is rendered according to the presentation semantics 2400, the visual element recognized by a human user is a table visual element. A configuration file for identifying tables may look for blocks consisting of a table in which the first-column blocks do not all have the same text, at least one first-column block has text content longer than five characters, and at least one first-column block has text content that does not end with a punctuation mark. Such configuration file rules identify block 2302 as a table visual element. Block 2301, preceding block 2302, ends with the text "following table:", and the last sentence of block 2301, containing the text "following table:", constitutes the title of the identified table visual element. Since block items 2303 and 2304 have a larger block item importance metadata attribute value than the other sub-blocks in block 2302, are both part of the first row, and both have the tag label <TH>, both blocks are identified as table headers.
Note that this configuration file may also identify blocks 2305 through 2310 as table cells, because a <TD> element spans the entirety of each of these blocks. The configuration file may also identify the table visual element as one having three rows and two columns.
As shown in FIGS. 25 and 26, when document source content 2500 is presented according to the presentation semantics 2600, the visual element recognized by a human user is a table visual element. A configuration file for identifying table visual elements may look for blocks whose presentation semantic "display" is "table" with visible borders, whose sub-blocks have the presentation semantic "display" of "table-cell" with visible borders, and whose table cell blocks span more than one row and column. Such a configuration file identifies block 2502 as a table visual element. Similar to block 2301 (FIG. 23), the last sentence of block 2501 is identified as the title of the identified table visual element. Since blocks 2503 and 2504 are part of the first row and have a greater block item importance than the other blocks in block 2502, these two blocks are identified as table headers. Because blocks 2505 through 2510 have explicit metadata attributes with the "display" value "table-cell", they are identified as table cells.
It should be noted that the two configuration file rules above for determining whether a block is a table visual element are not exhaustive, and additional configuration file rules may also be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a table visual element.
It should also be noted that while two configuration files are provided to identify table visual elements, more configuration files identifying table visual elements may be required if a user viewing the presented document identifies a table in a segment of the document and neither of the two configuration files identifies the segment as a table visual element.
As shown in FIGS. 27 and 28, when the document source content 2700 is presented according to the presentation semantics 2800, the visual element that a human user recognizes is a key/value visual element. The configuration file for determining key/value visual elements finds a two-column table block in which the text content of each first-column block, except the block of the first row, ends with a colon, and in which the block of the first row spans both columns. Such a configuration file identifies block 2701 as a key/value visual element, where block 2702 of the first row is the title of the identified key/value visual element. Also, with this configuration file, blocks 2703, 2705, and 2707 of the first column would all be identified as keys, and blocks 2704, 2706, and 2708 of the second column would all be identified as the values for those keys.
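The key/value profile just described can be sketched as follows. This is an illustrative sketch under assumed representations: the single-cell first row models a row spanning both columns, and the function name, title handling, and sample data are assumptions, not part of the disclosed system.

```python
# Illustrative sketch of the key/value profile: a two-column table qualifies
# when every first-column cell after the first row ends with a colon and the
# first row spans both columns (modeled here as a single-cell row).

def extract_key_values(rows):
    """Return (title, {key: value}) if the profile matches, else None."""
    if not rows or len(rows[0]) != 1:  # first row must span both columns
        return None
    body = rows[1:]
    if not body or any(len(r) != 2 or not r[0].rstrip().endswith(":")
                       for r in body):
        return None
    title = rows[0][0]
    pairs = {r[0].rstrip().rstrip(":"): r[1] for r in body}
    return title, pairs

# A block shaped like block 2701: title row 2702, then key/value rows.
block_2701 = [["Patient Record"],
              ["Name:", "Alice"],
              ["Age:", "42"],
              ["Condition:", "Diabetes"]]
print(extract_key_values(block_2701))
```

Under these assumptions, "Patient Record" becomes the title and the colon-terminated first-column cells become the keys for the second-column values.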
It should be noted that the above-described configuration file rules for determining whether a block is a key/value visual element are not exhaustive, and additional configuration file rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a key/value visual element.
It should also be noted that while one profile is provided to identify key/value visual elements, more profiles identifying key/value visual elements may be required if a user viewing the presented document identifies a key/value attribute in a segment and the profile cannot identify the segment as a key/value visual element.
As shown in FIG. 29, when document source content 2900 is presented, the visual element recognized by a human user is a question/answer visual element. The configuration file used to determine a question/answer visual element finds a block whose text content begins with the string "Q:" and whose immediately following block begins with the string "A:". Such a configuration file identifies blocks 2901 and 2902 as a question/answer visual element, where block 2901, whose text content begins with "Q:", is identified as the question, and block 2902, whose text content begins with "A:", is identified as the answer.
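The pairing rule just described can be sketched as follows. This is an illustrative sketch only; the block representation (plain text strings), the function name, and the sample content are assumptions for illustration.

```python
# Illustrative sketch of the question/answer profile: a block beginning with
# "Q:" that is immediately followed by a block beginning with "A:" forms a
# question/answer visual element.

def find_question_answer_pairs(blocks):
    """blocks: list of block text strings, in document order."""
    pairs = []
    for q, a in zip(blocks, blocks[1:]):
        if q.lstrip().startswith("Q:") and a.lstrip().startswith("A:"):
            pairs.append((q, a))
    return pairs

blocks = ["Q: What is polyuria?",
          "A: Excessive urination.",
          "Unrelated paragraph."]
print(find_question_answer_pairs(blocks))
```

Under these assumptions, the first two blocks are paired as one question/answer visual element, while the trailing paragraph is ignored.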
It should be noted that the above-described configuration file rules for determining whether a block is a question/answer visual element are not exhaustive, and additional configuration file rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a question/answer visual element.
It should also be noted that while one profile is provided to identify the question/answer visual elements, more profiles identifying the question/answer visual elements may be required if a user viewing the presented document identifies one question/answer visual element in a segment and the profile cannot identify the segment as a question/answer visual element.
As shown in FIGS. 30 and 31, when the document source content 3000 is presented according to the presentation semantics 3100, the visual element recognized by a human user is a menu visual element. The configuration file for determining menu visual elements finds an unordered list block whose display is inline, all of whose sub-list-item blocks are hyperlinks, and which is located in the top 20% of the document. Such a configuration file would identify block 3001 as a menu visual element. Each of the list item block elements 3002 through 3005 would also be identified as a menu item of the identified menu visual element.
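The menu profile just described can be sketched as follows. This is an illustrative sketch under an assumed block representation (a dict with tag, display, position, and item fields); the field names, threshold encoding, and sample data are assumptions, not the disclosed implementation.

```python
# Illustrative sketch of the menu profile: an unordered list displayed
# inline, whose list items are all hyperlinks, positioned within the top
# 20% of the document.

def is_menu_visual_element(block, document_height):
    """block: assumed dict with 'tag', 'display', 'top', and 'items' fields."""
    return bool(block["tag"] == "ul"
                and block["display"] == "inline"
                and block["items"]
                and all(item["is_hyperlink"] for item in block["items"])
                and block["top"] <= 0.20 * document_height)

# A block shaped like block 3001: an inline list of hyperlinked menu items.
block_3001 = {"tag": "ul", "display": "inline", "top": 50,
              "items": [{"is_hyperlink": True, "text": "Home"},
                        {"is_hyperlink": True, "text": "News"}]}
print(is_menu_visual_element(block_3001, document_height=2000))
```

With these assumptions, the sample block qualifies because 50 falls within the top 20% of a 2000-unit document and every item is a hyperlink.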
It should be noted that the configuration file rules described above for determining whether a block is a menu visual element are not exhaustive, and additional configuration file rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a menu visual element.
It should also be noted that while one profile is provided to identify menu visual elements, more profiles identifying menu visual elements may be required if a user viewing the presented document identifies one menu in a segment and the profile cannot identify the segment as a menu visual element.
As shown in FIGS. 32 and 33, when the document source content 3200 is rendered according to the presentation semantics 3300, the visual element recognized by a human user is a fixed width textual visual element. The configuration file for determining whether the document contains a fixed width textual visual element finds a table block in which, except for the block of the first row, all block item font family metadata attribute values are the same as a fixed width font family. Such a configuration file would identify block 3201 as a fixed width textual visual element. Block 3202 of the first row has the largest block item importance value and has different presentation semantics than the other rows; thus, block 3202 is identified as the title of the identified fixed width textual visual element.
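The font-family rule just described can be sketched as follows. This is an illustrative sketch only; the set of monospace family names, the row representation, and the sample data are assumptions introduced for illustration.

```python
# Illustrative sketch of the fixed-width-text profile: every row except the
# first must carry a fixed-width (monospace) font-family metadata value.

MONOSPACE_FAMILIES = {"courier", "courier new", "monospace", "consolas"}

def is_fixed_width_visual_element(rows):
    """rows: list of (text, font_family) tuples, one per block row."""
    if len(rows) < 2:
        return False
    return all(family.lower() in MONOSPACE_FAMILIES
               for _, family in rows[1:])

# A block shaped like block 3201: title row 3202 followed by monospace rows.
block_3201 = [("Quarterly results", "Arial"),
              ("Q1   10.2", "Courier New"),
              ("Q2   11.7", "Courier New")]
print(is_fixed_width_visual_element(block_3201))
```

Under these assumptions, the block qualifies because every row after the title uses a monospace family, regardless of the title row's own font.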
It should be noted that the configuration file rules described above for determining whether a block is a fixed width textual visual element are not exhaustive, and additional configuration file rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a fixed width textual visual element.
It should also be noted that while one profile is provided to identify fixed width textual visual elements, more profiles identifying fixed width textual visual elements may be required if a user viewing the document being rendered identifies one fixed width text in a segment and the profile cannot identify the segment as a fixed width textual visual element.
As shown in FIG. 34, when the document source content 3400 is presented, the visual element that a human user recognizes is a timeline visual element. The configuration file for identifying timeline visual elements finds an unordered list block whose entries all start with a year followed by "-" or ":". Such a configuration file would identify block 3402 as a timeline visual element. The block immediately above block 3402 has a sentence containing the word "timeline", and that sentence is identified as the title of the identified timeline visual element. The unordered list entries 3403 through 3407 are all identified as timeline events.
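The year-prefix rule just described can be sketched with a regular expression. This is an illustrative sketch only; the four-digit-year pattern, function name, and sample entries are assumptions for illustration.

```python
# Illustrative sketch of the timeline profile: every unordered-list entry
# must begin with a year followed by "-" or ":".

import re

YEAR_PREFIX = re.compile(r"^\s*\d{4}\s*[-:]")  # e.g. "1921:" or "1922 -"

def is_timeline_visual_element(list_items):
    """list_items: text of each unordered-list entry."""
    return bool(list_items) and all(YEAR_PREFIX.match(item)
                                    for item in list_items)

entries = ["1921: Insulin discovered",
           "1922 - First patient treated",
           "1982: Synthetic insulin approved"]
print(is_timeline_visual_element(entries))
```

With these assumptions, the sample list qualifies because every entry opens with a four-digit year and a "-" or ":" separator.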
It should be noted that the configuration file rules described above for determining whether a block is a timeline visual element are not exhaustive, and other rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a timeline visual element.
It should also be noted that while one profile is provided to identify timeline visual elements, more profiles identifying timeline visual elements may be required if a user viewing the rendered document identifies one timeline in a segment and the profile cannot identify the segment as a timeline visual element.
As shown in FIG. 35, when document source content 3500 is presented, the visual element recognized by a human user is a chart visual element. The configuration file for recognizing chart visual elements finds a block with a picture whose alternative text contains the word "chart", where the block immediately above it has a sentence containing the word "chart". Such a configuration file would identify block 3502 as a chart visual element. Block 3501, consisting of a single sentence, is identified as the title of the identified chart visual element.
It should be noted that the configuration file rules described above for determining whether a block is a chart visual element are not exhaustive, and additional configuration file rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is a chart visual element.
It should also be noted that while one profile is provided to identify chart visual elements, more profiles identifying chart visual elements may be required if a user viewing the presented document identifies one chart in a segment and the profile cannot identify the segment as a chart visual element.
As shown in FIG. 36, when the document segment 3600 is presented, the visual element recognized by a human user is an interactive visual element. The configuration file used to identify interactive visual elements finds a block containing an object, where the block item immediately above it has a sentence containing the word "interactive". Such a configuration file would identify block 3602 as an interactive visual element. Block 3601, consisting of a single sentence, is identified as the title of the identified interactive visual element.
It should be noted that the configuration file rules described above for determining whether a block is an interactive visual element are not exhaustive, and other rules may be considered by the block segmentation and indexing logic component 106 (FIG. 1) in determining whether a block is an interactive visual element.
It should also be noted that while one profile is provided to identify interactive visual elements, more profiles identifying interactive visual elements may be required if a user viewing the presented document identifies one interactive object in a segment and the profile cannot identify the segment as an interactive visual element.
FIG. 37 shows source content 3700 in which the document author provides hints to the block segmentation and indexing logic component 106 (FIG. 1) to identify a timeline visual element. The hint vse timeline 3708 identifies the <DIV> element that spans the entire block 3701 as containing one timeline visual element. The hint vse title 3709 identifies the <H2> element that spans block 3702 as the title of the timeline visual element. The hint vse event 3710 identifies the <LI> element that spans block 3703 as a timeline event. Similarly, the vse event hints in blocks 3704 through 3707 identify each <LI> element spanning blocks 3704 through 3707 as a timeline event.
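Hint-assisted identification of this kind can be sketched as follows. This is an illustrative sketch only; the hyphenated hint spellings ("vse-timeline", "vse-title", "vse-event"), the element representation, and the sample document are assumptions patterned on the hints described above, not the disclosed format.

```python
# Illustrative sketch: scan class-attribute values for predefined hint names
# and record which element each hint applies to.

HINTS = {"vse-timeline": "timeline",  # assumed spellings, for illustration
         "vse-title": "title",
         "vse-event": "event"}

def collect_hints(elements):
    """elements: list of (tag, class_attr, text) tuples from the document."""
    found = []
    for tag, class_attr, text in elements:
        for cls in class_attr.split():
            if cls in HINTS:
                found.append((HINTS[cls], tag, text))
    return found

# A document fragment shaped like blocks 3701-3703.
doc = [("div", "vse-timeline", ""),
       ("h2", "vse-title", "Diabetes timeline"),
       ("li", "vse-event", "1921: Insulin discovered")]
print(collect_hints(doc))
```

Splitting the class attribute on whitespace mirrors how HTML class lists work, so a hint can coexist with the author's own styling classes.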
It should be noted that the hints may be predefined and may be specified by the search engine server 102 (FIG. 1) for use by the author of the document. It should also be noted that while the class attribute is used to convey hints in this example embodiment, other existing or future-developed methods, such as the Resource Description Framework (RDF), Resource Description Framework in attributes (RDFa), and/or microformats, may also be used to specify hints in other embodiments.
It should be noted that predefined hints may also be specified by search engine server 102 (FIG. 1) for picture, table, list, menu, chart, fixed width text, interactive, key/value, and question/answer visual element types and provided to the document author for use in identifying visual elements in the document. It should also be noted that, in addition to identifying titles, these predefined hints may also be specified by search engine server 102 (FIG. 1) and provided to the document author for identifying other visual element features.
It should be noted that in one embodiment, if hints are present, the hints may be used only to verify that the block segmentation and indexing logic component 106 (FIG. 1) correctly identified the visual elements and/or visual element features. In another embodiment, the hints may be used as a substitute for the configuration files for identifying visual elements and/or visual element features. In yet another embodiment, the hints and configuration files may be used together to identify visual elements and/or visual element features.
FIG. 38 is a flow diagram depicting an example high-level structure and functionality of the block segmentation and indexing logic component 106 shown in FIG. 1 and described herein. In step 3800, the block segmentation and indexing logic component 106 (FIG. 1) identifies a candidate document to process. In step 3801, the block segmentation and indexing logic component 106 (FIG. 1) generates preliminary metadata for the document and the document content segments. In step 3802, using the preliminary metadata, the document is divided into logical units called blocks and a block list is prepared using the block identification rules described above. In step 3803, the block segmentation and indexing logic component 106 (FIG. 1) performs the block operations described above, which may add, delete, or modify blocks in the block list. In step 3804, the block segmentation and indexing logic component 106 (FIG. 1) identifies a title (if any) for each block in the block list that is not also a block item. In step 3805, the inline visual elements of each block item in the block list are identified by using the data and metadata rules described above, and any inline visual elements found are indexed. In step 3806, the block visual elements of each block in the block list are identified by using the configuration files described above, and any block visual elements found are indexed.
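The pipeline of FIG. 38 can be sketched end to end as follows. This is an illustrative sketch only: the function names are assumptions, and the per-step rule logic is reduced to trivial placeholders so the flow of metadata from step to step is visible.

```python
# Illustrative sketch of the FIG. 38 pipeline: each step enriches the
# metadata produced by the previous one.

def process_document(document):
    metadata = generate_preliminary_metadata(document)          # step 3801
    blocks = segment_into_blocks(document, metadata)            # step 3802
    blocks = apply_block_operations(blocks)                     # step 3803
    for block in blocks:
        block["title"] = identify_title(block)                  # step 3804
        block["inline_elements"] = find_inline_elements(block)  # step 3805
        block["visual_element"] = match_profiles(block)         # step 3806
    return blocks

# Placeholder implementations (assumed) so the sketch runs end to end:
def generate_preliminary_metadata(doc): return {"length": len(doc)}
def segment_into_blocks(doc, meta): return [{"text": p} for p in doc.split("\n\n")]
def apply_block_operations(blocks): return [b for b in blocks if b["text"].strip()]
def identify_title(block): return block["text"].split(".")[0]
def find_inline_elements(block): return []
def match_profiles(block): return "paragraph"

result = process_document("First paragraph.\n\nSecond paragraph.")
print([b["visual_element"] for b in result])
```

The design point the sketch illustrates is the one stated in the text: every step after 3801 both consumes and augments the accumulated metadata.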
It should be noted that each of steps 3802 through 3806 generates further metadata that augments the metadata generated in step 3801.
It should be noted that although dividing the document into blocks and performing block operations helps identify visual elements, in other embodiments, visual elements may be identified by applying data and metadata rules to source content segments without identifying blocks.
Once the documents found by the crawler logic component 105 have been divided into blocks and the visual elements in these documents have been identified and indexed, the visual elements in these documents may be searched.
Thus, FIG. 39 depicts an example Graphical User Interface (GUI) that may be used in one embodiment of the present disclosure. Such a GUI may be displayed to user 113 (FIG. 1) by the client logic component 112 (FIG. 1) or to user 114 (FIG. 1) by the search engine logic component 107 (FIG. 1).
The GUI 3900 is comprised of a plurality of buttons 3901 through 3910. In addition, the GUI 3900 is also comprised of a text field 3911 for entering keywords that the user 113 or 114 wishes to search for and a "Search" button 3912 for starting a search.
Each of the above-described buttons 3901 through 3910 corresponds to a different visual element type to be searched. The user selects one or more visual element types by selecting the corresponding buttons 3901 through 3910.
If the user wishes to search for the keywords entered in the text field 3911 and obtain "Paragraph" results, the user selects button 3901. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Table" results, the user selects button 3902. If the user wishes to search for the keywords entered in the text field 3911 and obtain "List" results, the user selects button 3903. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Menu" results, the user selects button 3904. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Graphs" or "Charts" results, the user selects button 3905. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Fixed Width Text" results, the user selects button 3906. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Interactive Data" results, the user selects button 3907. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Key/Value" results, the user selects button 3908. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Question/Answer" results, the user selects button 3909. If the user wishes to search for the keywords entered in the text field 3911 and obtain "Timeline" results, the user selects button 3910.
The search engine logic component 107 (FIG. 1) also supports the use of operators and modifiers. Operators are predefined codewords whose syntax is specified by the search engine logic component 107 (FIG. 1); operators are entered into the text field 3911 and are not treated as keywords by the search engine logic component 107 (FIG. 1). In an example embodiment of the present disclosure, the codeword of an operator is not case-sensitive, and it is entered in a syntax in which the operator codeword is always followed by a ":" and possibly by a search keyword.
To search for "diabetes" and obtain "Paragraph" results, assuming that the codeword defined by the search engine logic component 107 (FIG. 1) for "Paragraph" results is "p", the user may enter "P:diabetes" or "p:diabetes" into the text field 3911 and select the "Search" button 3912. Similarly, other visual elements may have other codewords. Also, to search for "diabetes" and obtain "Paragraph" or "Table" results, assuming that the codeword for "Paragraph" results is "p" and the codeword for "Table" results is "tb", the user may enter "p||tb:diabetes" or "tb||p:diabetes" into the text field 3911 and select the "Search" button 3912. It should be noted that "||" represents "or" in this syntax and is a modifier.
The "or" modifier may also be used as part of a keyword. For example, to search for "cars" or "vans" and obtain "Paragraph" results, the user may enter "p:cars||vans" into the text field 3911 and select the "Search" button 3912. Alternatively, the user may enter "cars||vans" into the text field 3911, select the "Paragraph" button 3901, and then select the "Search" button 3912.
The "not" modifier is used to exclude particular results. For example, to obtain "Table" results for "jaguar" (the cat) and not "jaguar" (the car), the user may enter "tb:jaguar -car" into the text field 3911 and select the "Search" button 3912. Alternatively, the user may enter "jaguar -car" into the text field 3911, select the "Table" button 3902, and then select the "Search" button 3912.
The predefined operator "comp" lets the user search all visual element types at once, eliminating the need to select all of the buttons 3901 through 3910. To search for "diabetes" and obtain results for all visual element types, the user may enter "COMP:diabetes" or "comp:diabetes" into the text field 3911 and select the "Search" button 3912.
The predefined operator "site" lets the user confine the search results to a location or domain. To search for "diabetes" only in the news.bbc.co.uk domain and obtain "Table" results, the user may enter "site:news.bbc.co.uk tb:diabetes" or "tb:diabetes site:news.bbc.co.uk" into the text field 3911 and select the "Search" button 3912. Alternatively, the user may enter "site:news.bbc.co.uk diabetes" or "diabetes site:news.bbc.co.uk" into the text field 3911, select the "Table" button 3902, and then select the "Search" button 3912.
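The operator and modifier syntax described above can be sketched as a small parser. This is an illustrative sketch only; the tokenization, the returned structure, and the treatment of edge cases are assumptions, not the disclosed grammar.

```python
# Illustrative sketch of the query syntax: codewords such as "p", "tb",
# "comp", and "site" are followed by ":", "||" is the "or" modifier, and a
# leading "-" excludes a term. Codewords are case-insensitive.

def parse_query(query):
    types, site, keywords, excluded = set(), None, [], []
    for token in query.split():
        head, sep, rest = token.partition(":")
        if sep and head.lower() == "site":
            site = rest                         # e.g. site:news.bbc.co.uk
        elif sep:
            # e.g. "p:diabetes" or "p||tb:diabetes"
            types.update(t.lower() for t in head.split("||"))
            if rest:
                keywords.append(rest)
        elif token.startswith("-"):
            excluded.append(token[1:])          # e.g. -car
        else:
            keywords.append(token)
    return {"types": types, "site": site,
            "keywords": keywords, "excluded": excluded}

print(parse_query("site:news.bbc.co.uk tb:diabetes"))
print(parse_query("p||tb:jaguar -car"))
```

With these assumptions, "p||tb:jaguar -car" yields the visual element types {"p", "tb"}, the keyword "jaguar", and the excluded term "car".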
If desired, the user may specify visual element features as part of a search query. This may be done through a GUI component or an operator, and may narrow the results to, or increase the relevance of, documents whose visual elements satisfy the specified visual element features.
When searching for "Paragraph" results, the user may specify that the keyword must be part of the title, or that the "Paragraph" results must have at least, exactly, or at most a particular length.
When searching for "Table" results, the user may specify that the keyword must be part of the table header, or must be part of a cell of the table, or that the "Table" results must have at least, exactly, or at most a particular number of rows or columns.
When searching for "List" results, the user may specify that the keyword must be part of the title, or must be part of a list item, or that the "List" results must have at least, exactly, or at most a particular number of list items.
When searching for "Menu" results, the user may specify that he is interested in "Menu" results that are displayed horizontally or vertically.
When searching for "Graphs" or "Charts" results, the user may specify that the keyword must be part of the title, or must be part of the chart title, or that the "Graphs" or "Charts" results must be histograms, harris charts, huffman charts, bar charts, line charts, spline area charts, range charts, stock spark charts, ring charts, bubble charts, candlestick charts, or pie charts.
When searching for "Fixed Width Text" results, the user may specify that the keyword must be part of the title, or that the "Fixed Width Text" results must have at least, exactly, or at most a particular length.
When searching for "Key/Value" results, the user may specify that the keyword must be part of the title, or must be part of a "Key", or must be part of a "Value", or that the "Key/Value" results must have at least, exactly, or at most a particular number of key/value entries.
When searching for "Question/Answer" results, the user may specify that the keyword must be part of the title, or must be part of "Question", or must be part of "Answer".
When searching for "Timeline" results, the user can specify that the keywords must be part of the title, or must be part of the Timeline event.
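The "at least / exactly / at most" feature constraints described above can be sketched as a simple result filter. This is an illustrative sketch only; the constraint encoding, field names, and sample results are assumptions, not the disclosed query model.

```python
# Illustrative sketch of visual-element feature constraints: a result may be
# required to have at least, exactly, or at most a given count of rows,
# columns, list items, or key/value entries.

def satisfies(count, op, bound):
    return {"at_least": count >= bound,
            "exactly": count == bound,
            "at_most": count <= bound}[op]

def filter_results(results, feature, op, bound):
    """Keep only results whose numeric feature satisfies the constraint."""
    return [r for r in results if satisfies(r[feature], op, bound)]

# Assumed sample "Table" results with a row-count feature.
tables = [{"url": "a", "rows": 3}, {"url": "b", "rows": 7}]
print(filter_results(tables, "rows", "at_least", 5))
```

The same filter applies unchanged to list-item counts or key/value-entry counts by swapping the feature name, which is why a single constraint vocabulary suffices across the visual element types above.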
FIG. 40 depicts GUI 4000, which operates in the same manner as GUI 3900 except that, instead of buttons 3901 through 3910, it is composed of selection checkboxes 4001 through 4010. In operation, a user 113 or 114 (FIG. 1) selects one or more of the selection checkboxes 4001 through 4010, enters a keyword in the text field 4011, and then selects the "Search" button 4012.
FIG. 41 depicts a GUI 4100 that the client logic component 112 (FIG. 1) displays to a user 113 (FIG. 1), or that the search engine logic component 107 (FIG. 1) displays to a user 114 (FIG. 1), and that provides suggestions 4104 to the user as the user enters a search keyword 4102 for the visual element type "list". The user has pressed button 4101, which indicates that the search will be limited to the visual element type "list". The diagonal background on the button 4101 shown in FIG. 41 is added to indicate the pressed state of the button and is not part of the actual GUI 4100. The suggestions 4104 are given based on the "list" selection 4101 and the search query terms, and are updated as the user updates the search query 4102. If the user likes a suggestion from suggestion box 4104, the user can pick the suggestion and press the "Search" button 4103 to execute that suggested query.
It should be noted that the suggestions 4104 may depend on the visual element type selected in the search query, and that selecting different visual element types may result in different keyword suggestions 4104. It should also be noted that in another embodiment, not only are the suggestions 4104 updated as the user enters the search keyword 4102, but the results of the search query may also be updated immediately as the user types, without the user pressing the "Search" button 4103.
FIG. 42 depicts a GUI 4200 that the client logic component 112 (FIG. 1) displays to the user 113 (FIG. 1), or that the search engine logic component 107 (FIG. 1) displays to the user 114, showing the search results displayed after the user pressed the "Search" button 4203 to perform a search using the search keyword "Diabetes Symptoms" 4202. The user has pressed button 4201, which indicates that the search is restricted to the visual element type "list". The diagonal background on the button 4201 shown in FIG. 42 is added to indicate the pressed state of the button and is not part of the actual GUI 4200. The search results 4204, 4205, and 4206 are displayed to the user one after the other in a vertical format. Each search result 4204, 4205, 4206 has a title, which is also a link to the originating web page 111 (FIG. 1). The title is followed by a short summary, which is the portion of the web page that is relevant to the search query. The summary is displayed in the same visual element type as found in the originating web page 111 (FIG. 1). The search result 4204 shows "Polyuria" and "Polydipsia" as a circular-bullet list because the block segmentation and indexing logic component 106 (FIG. 1) indexed and extracted the list from the web page 111 (FIG. 1), where it was presented as a circular-bullet list. The search result 4205 shows "Weight Loss" and "Polydipsia" as a numbered list because the block segmentation and indexing logic component 106 (FIG. 1) indexed and extracted the list from the web page 111 (FIG. 1), where it was presented as a numbered list. The search result 4206 shows "Blurred Vision" and "Weight Loss" as a lower-case-lettered list because the block segmentation and indexing logic component 106 (FIG. 1) indexed and extracted the list from the web page 111 (FIG. 1), where it was presented as a lower-case-lettered list. The summary is followed by a URI to the web page 111 (FIG. 1). The GUI 4200 has an advertisement section 4208 to the right of the search results 4204, 4205, and 4206. Merchants may bid for their advertisements to be displayed in the advertisement section 4208. If too many search results would be displayed on a single page, the results may be divided into multiple pages for display, and the user may access those pages using the pagination control 4207.
FIG. 43 shows another example in which search results 4305 and 4306 are displayed side by side. The user can use the previous page link 4307 or the next page link 4308 to view more search results, if any. Advertisement sections 4304 and 4309 are places where merchants can bid to place their advertisements. Although other locations for the advertisement sections are not shown in GUI 4300, the advertisement sections are not limited to positions above and below the search results; they may also be placed to the right, to the left, or anywhere else on the results page.
FIG. 44 shows another example, in which search results 4405, 4406, 4407, 4408, 4409, and 4410 are displayed in a grid view. GUI 4400 depicts a user performing a search for the keyword "Diabetes" 4403. GUI 4400 also depicts the user restricting the search results to "table" and "list" by pressing buttons 4401 and 4402, respectively. The diagonal background on the buttons 4401 and 4402 shown in FIG. 44 is added to indicate the pressed state of the buttons and is not part of the actual GUI 4400. The search result 4407 displays a summary of a portion of the page in tabular form because the block segmentation and indexing logic component 106 (FIG. 1) extracted the summary from the web page 111 (FIG. 1), where that portion was presented in tabular form.
While the present disclosure has been described in detail with reference to certain preferred embodiments, various changes and modifications may be suggested to one skilled in the art, and it is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.