A SEMANTIC ORGANIZATION AND RETRIEVAL SYSTEM AND METHODS THEREOF
FIELD OF INVENTION
The present invention relates to a semantic organization and retrieval system of text based web artifacts, and methods thereof.
BACKGROUND OF INVENTION
Currently, indexing and retrieving documents from a search engine requires a mechanism where an end user stores the documents in an organized hierarchy on media such as a hard disk drive (HDD), floppy disk drive (FDD), Digital Video Disc (DVD) or Compact Disc (CD). This leads to an ever-growing requirement for data space that can become unmanageable.
In existing searching practice, a user is responsible for manually identifying categories of a document and hierarchically storing the same in folders in the media as described above. This is a time consuming process as each document has to be internalized manually to identify its category.
Further, current search methods have been extensively developed based on keyword search, which may not accurately portray a user's intent of searching. The use of keywords to search documents may result in an inability to retrieve artifacts containing intended concepts which are related to keywords, but not the keyword in itself. U.S. 6,687,689 B1 describes a system and methods for document retrieval, which assumes that documents are organized in a database along with a certain word set representing a context of the documents. However, the documents may not necessarily be organized in this manner and may not be an accurate representation of context of said documents. The resulting retrieved documents may not be well matched to the query.
Therefore, there is a need for a solution in order to search for relevant data from a resource in an efficient manner, wherein the search results are to be as accurate as possible.
SUMMARY OF INVENTION
Accordingly there is provided a semantic organization and retrieval system of text based web artifacts, the system includes a knowledge base, a plurality of semantic processors connectable to the knowledge base, a semantic organizer connectable to the plurality of semantic processors and a semantic retriever connectable to the plurality of semantic processors.
There is also provided a method of semantic data organization of text based web artifacts, the method includes the steps of storing metadata information of web artifacts, categorizing the web artifacts and performing at least one conceptual representation of contents in web artifacts.
There is further provided a method of semantic data retrieval of text based organized web artifacts from metadata information of said web artifacts, the method includes the steps of accepting at least one query from user, filtering organized artifacts using metadata information, representing context of query in at least one conceptual representation, matching the at least one conceptual representation for semantic similarity and retrieving a plurality of web artifact pointers based on user sorted relevance ranking of semantic similarity indexes.
The present invention consists of several novel features and a combination of parts hereinafter fully described and illustrated in the accompanying description and drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be fully understood from the detailed description given herein below and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, wherein:
Figure 1 is a block diagram illustrating architecture of an organization phase of a preferred embodiment of a semantic organizer and retrieval system and method; and
Figure 2 is a block diagram illustrating architecture of a retrieval phase of a preferred embodiment of a semantic organizer and retrieval system and method.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention relates to a semantic organization and retrieval system of text based web artifacts, and methods thereof. Hereinafter, this specification will describe the present invention according to the preferred embodiment of the present invention. However, it is to be understood that limiting the description to the preferred embodiment of the invention is merely to facilitate discussion of the present invention and it is envisioned that those skilled in the art may devise various modifications and equivalents without departing from the scope of the appended claims.
The following detailed description of the preferred embodiment will now be described in accordance with the attached drawings, either individually or in combination.
The present invention provides a semantic organization and retrieval system (100, 200) of text based web artifacts as seen in Figures 1 and 2. The system (100, 200) includes a knowledge base (126), a plurality of semantic processors connectable to the knowledge base (126), a semantic organizer (110) connectable to the plurality of semantic processors and a semantic retriever (210) connectable to the plurality of semantic processors. The plurality of semantic processors may be selected from, but not restricted to, semantic interpreter (112), parser (114), conceptual graph processor (124), Personal Information Manager (PIM) (116), PIM database (128), source preprocessor (118), document analyzer (120), document categorizer (122), semantic similarity matching unit (212) and summarizing means (134). It is to be appreciated that the system (100, 200) may include other related processors according to application of the system (100, 200).
An example of the knowledge base (126) is a conceptual graph knowledge base. The conceptual graph knowledge base is built for representing hierarchical structure and relations between concepts and relations in natural language. The knowledge base (126) may be extended to incorporate various domains including, but not limited to, medical, engineering and computing. The present embodiment of the invention includes a knowledge base (126) populated for general uses as well as the medical domain. The knowledge base (126) includes concept type hierarchy, relation type hierarchy, type definitions, schemas, prototypes and instances.
A method of semantic data organization of text based web artifacts is described herein as seen in Figure 1. The method includes the steps of storing metadata information of web artifacts, categorizing the web artifacts and performing at least one conceptual representation of contents in web artifacts. The method is described in further detail below.
Metadata information of a plurality of web based text documents, such as web Uniform Resource Locators (URLs), is provided to a source preprocessor (118). The source preprocessor (118) extracts text out of the plurality of documents, writes the extracted text to a file and saves it to a file server (130). An example of metadata information is publication information. An example of at least one conceptual representation is at least one conceptual graph.
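The text extraction and file-saving step performed by the source preprocessor (118) might be sketched as follows. This is a minimal illustration only, assuming the document content has already been fetched as an HTML string; the names `TextExtractor`, `extract_text` and `preprocess`, as well as the file naming scheme, are hypothetical and do not appear in the specification.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def preprocess(url: str, html: str, out_dir: Path) -> Path:
    """Write the extracted text to a file in out_dir (standing in for the
    file server) and return its path. The file name is derived from the URL
    by replacing non-alphanumeric characters; this scheme is an assumption."""
    safe_name = "".join(c if c.isalnum() else "_" for c in url) + ".txt"
    out_path = out_dir / safe_name
    out_path.write_text(extract_text(html), encoding="utf-8")
    return out_path
```

In this sketch a local directory stands in for the file server (130); a real deployment would substitute whatever storage interface the file server exposes.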
The URLs are then provided to a document analyzer (120) which removes stop words and counts occurrences of at least one predetermined keyword. A feature vector is generated for the plurality of documents based on the occurrence count and the at least one predetermined keyword. A document categorizer (122) receives the feature vector along with a specified model file. The document is then categorized, wherein the document is identified to be in a particular category in the specified model file. A summarizing means (134) summarizes the categorized document and saves it on the file server (130).
Each sentence from the summarized document is then provided to a parser (114) that parses natural language. A plurality of syntactical structures, which include a constituent tree and a linkage tree of a sentence, is then obtained. The obtained syntactical structures for each sentence are processed individually by the Semantic Interpreter (112). The Semantic Interpreter (112) uses the linkage and constituent information to produce at least one conceptual graph using a set of rules to identify a plurality of words in sentences as concepts, schemas and relations. The Semantic Interpreter (112) communicates with a conceptual graph processor (124) to perform various conceptual graph operations such as, but not limited to, load graph, maximal join, join, load schema, load relation, load prototype and load type definition, to generate the conceptual graph representing the meaning of the sentence. The at least one conceptual graph is stored in the conceptual graph knowledge base (126) and any graph references are returned to the Semantic Organizer (110).
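The stop-word removal, keyword counting and categorization steps described above can be illustrated with the following sketch. The stop-word list, the keyword list and the nearest-centroid rule are all assumptions made for illustration: the specification does not state the format of the model file or the classification technique, so `categorize` here is a simple stand-in rather than the categorizer (122) itself.

```python
import re
from collections import Counter

# Illustrative values only; the specification does not list them.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "to", "for"})
KEYWORDS = ("diabetes", "insulin", "network", "protocol")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def feature_vector(text, keywords=KEYWORDS, stop_words=STOP_WORDS):
    """Remove stop words, then count occurrences of each predetermined keyword."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


def categorize(vector, model):
    """Pick the category whose centroid is closest to the feature vector.

    `model` maps a category name to a centroid vector. This hypothetical
    structure stands in for the 'specified model file'.
    """
    def squared_distance(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))

    return min(model, key=lambda category: squared_distance(model[category]))
```

For example, with a hypothetical model `{"medical": [1, 1, 0, 0], "computing": [0, 0, 1, 1]}`, a document mentioning "insulin" twice and "diabetes" once would be assigned to the "medical" category.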
The conceptual graphs are sent to the conceptual graph processor (124) wherein a maximal join operation is conducted resulting in a single conceptual graph that represents meaning of summarized content in a document. The graph references, metadata and the category information of the document are then stored in a personal information manager database (128).
A method of semantic data retrieval of text based organized web artifacts from metadata information of said web artifacts is described herein as seen in Figure 2. The method includes the steps of accepting at least one query from user, filtering organized artifacts using metadata information, representing context of query in a conceptual representation, matching the conceptual representation for semantic similarity and retrieving a plurality of web artifact pointers based on user sorted relevance ranking of semantic similarity indexes.
A user enters a natural language query into a Semantic Retriever (210). Metadata information and context of the query are then extracted from the natural language query. A plurality of documents is retrieved from a Personal Information Manager database (128) based on the extracted metadata information. Next, the context of the query is parsed and sent to the Semantic Interpreter (112) to generate at least one conceptual representation such as at least one conceptual graph. The at least one conceptual graph is stored in the knowledge base (126) and any references related to the at least one conceptual graph are returned to the Semantic Retriever (210).
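The metadata-based filtering of stored documents described above might look like the following sketch. The record layout (`PimRecord`) and the exact-match filtering rule are assumptions; the specification states only that graph references, metadata and category information are stored in the PIM database (128) and that documents are retrieved using extracted metadata information.

```python
from dataclasses import dataclass, field


@dataclass
class PimRecord:
    """Hypothetical shape of one entry in the PIM database."""
    url: str
    category: str
    graph_ref: str  # reference to the document's conceptual graph
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(records, criteria):
    """Keep records whose metadata matches every key/value pair in criteria."""
    return [
        record for record in records
        if all(record.metadata.get(key) == value for key, value in criteria.items())
    ]
```

Only the records that survive this filter would then have their conceptual graphs compared with the conceptual graph of the query, which keeps the number of graph matching operations small.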
The conceptual graphs of the retrieved documents are matched semantically, one by one, with the at least one conceptual graph of the natural language query. A semantic similarity matching unit (212) generates a similarity index based on the matching process. In this embodiment, the semantic similarity matching unit (212) is able to receive two or more conceptual graphs simultaneously. However, one skilled in the art would appreciate that the number of conceptual graphs processed may vary depending on application of the method. The similarity matching unit (212) calls a maximal join size operation in the conceptual graph processor (124), wherein a plurality of input graphs are processed, yielding the number of concepts which are maximally matched in a maximally extended resultant graph.
A formula as seen below is applied to find a similarity index (ranging from 0 to 1):
\[ \text{Similarity index} = \frac{\text{maxJoinSize}}{\text{cg1size} + \text{cg2size} - \text{maxJoinSize}} \]
Wherein, maxJoinSize is the number of concepts which are maximally matched in both conceptual graphs, cg1size is the number of concept nodes in a first conceptual graph and cg2size is the number of concept nodes in a second conceptual graph, in an example where two conceptual graphs are matched. A similarity index of 1 indicates a best match while 0 indicates a worst match. A list of retrieved documents is sorted according to the similarity index, which is an indicator of the relevance of the retrieved documents to the natural language query. URLs of the retrieved documents are returned to the user.
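The similarity index and the subsequent ranking can be sketched as below. The sketch takes maxJoinSize as an input, since computing it requires the maximal join size operation of the conceptual graph processor (124), which is not reproduced here. The function names and the tuple layout of `candidates` are hypothetical.

```python
def similarity_index(max_join_size, cg1_size, cg2_size):
    """Return maxJoinSize / (cg1size + cg2size - maxJoinSize), in [0, 1]."""
    if max_join_size < 0 or max_join_size > min(cg1_size, cg2_size):
        raise ValueError("max_join_size must lie between 0 and the smaller graph size")
    denominator = cg1_size + cg2_size - max_join_size
    if denominator == 0:
        raise ValueError("both conceptual graphs are empty")
    return max_join_size / denominator


def rank_documents(query_graph_size, candidates):
    """Sort candidate documents by descending similarity to the query graph.

    `candidates` is a list of (url, document_graph_size, max_join_size) tuples.
    Returns a list of (url, similarity_index) pairs, best match first.
    """
    scored = [
        (url, similarity_index(max_join, query_graph_size, doc_size))
        for url, doc_size, max_join in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

As a worked example, a query graph with 5 concept nodes and a document graph with 4 concept nodes that share 3 maximally matched concepts give 3 / (5 + 4 - 3) = 0.5. Two identical graphs of 4 concept nodes give 4 / (4 + 4 - 4) = 1, the best match.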
The present invention is an improvement over storing a web artifact on a local file space, in that the web artifact URL is stored in a database, which addresses the problem of unmanageable data space expansion. Personalized storage of the web artifacts is also an improvement over identifying a document category manually in known systems. As the present invention is able to perform a search based on a natural language query, this is an improvement over using a keyword search, because a natural language query takes context into consideration unlike the keyword search. Automatic ranking of documents based on relevance, together with personalized categorization of the documents, also improves on accuracy of search compared to manually analyzing all retrieved documents. The described methods and system (100, 200) can be applied to, but are not restricted to, organizing and retrieving web based text artifacts on a basis of metadata, category and conceptual graphs of said web artifacts. Therefore, the described system (100, 200) and methods are able to transform natural language queries into an appropriate internal semantic query. This produces results that directly answer the query rather than relying on users to check each search result for relevancy.