WO2011093691A2 - A semantic organization and retrieval system and methods thereof - Google Patents

A semantic organization and retrieval system and methods thereof

Info

Publication number
WO2011093691A2
Authority
WO
WIPO (PCT)
Prior art keywords
semantic
web
artifacts
conceptual
knowledge base
Prior art date
Application number
PCT/MY2010/000300
Other languages
French (fr)
Other versions
WO2011093691A3 (en)
Inventor
Navnit Singh Biring
Arun Anand
Lukose Dickson
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2011093691A2
Publication of WO2011093691A3

Abstract

A semantic organization and retrieval system (100, 200) of text based web artifacts is provided, the system (100,200) includes a knowledge base (126), a plurality of semantic processors connectable to the knowledge base (126), a semantic organizer (110) connectable to the plurality of semantic processors and a semantic retriever (210) connectable to the plurality of semantic processors.

Description

A SEMANTIC ORGANIZATION AND RETRIEVAL SYSTEM AND METHODS THEREOF
FIELD OF INVENTION
The present invention relates to a semantic organization and retrieval system of text based web artifacts, and methods thereof.
BACKGROUND OF INVENTION
Currently, indexing and retrieving documents from a search engine requires a mechanism where an end user stores the documents in an organized hierarchy on media such as a hard disk drive (HDD), floppy disk drive (FDD), Digital Video Disc (DVD) or Compact Disc (CD). This creates an ever-growing requirement for data space that can become unmanageable.
In existing searching practice, a user is responsible for manually identifying the categories of a document and hierarchically storing the document in folders on the media described above. This is a time-consuming process, as each document has to be read and understood manually to identify its category.
Further, current search methods have been extensively developed based on keyword search, which may not accurately portray a user's search intent. The use of keywords to search documents may result in an inability to retrieve artifacts containing intended concepts which are related to the keywords but do not contain the keywords themselves. U.S. 6,687,689 B1 describes a system and methods for document retrieval, which assumes that documents are organized in a database along with a certain word set representing the context of the documents. However, documents may not necessarily be organized in this manner, and the word set may not be an accurate representation of the context of said documents. The resulting retrieved documents may not be well matched to the query.
Therefore, there is a need for a solution in order to search for relevant data from a resource in an efficient manner, wherein the search results are to be as accurate as possible.
SUMMARY OF INVENTION
Accordingly there is provided a semantic organization and retrieval system of text based web artifacts, the system includes a knowledge base, a plurality of semantic processors connectable to the knowledge base, a semantic organizer connectable to the plurality of semantic processors and a semantic retriever connectable to the plurality of semantic processors.
There is also provided a method of semantic data organization of text based web artifacts, the method includes the steps of storing metadata information of web artifacts, categorizing the web artifacts and performing at least one conceptual representation of contents in web artifacts.
There is further provided a method of semantic data retrieval of text based organized web artifacts from metadata information of said web artifacts, the method includes the steps of accepting at least one query from user, filtering organized artifacts using metadata information, representing context of query in at least one conceptual representation, matching the at least one conceptual representation for semantic similarity and retrieving a plurality of web artifact pointers based on user sorted relevance ranking of semantic similarity indexes.
The present invention consists of several novel features and a combination of parts hereinafter fully described and illustrated in the accompanying description and drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be fully understood from the detailed description given herein below and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, wherein:
Figure 1 is a block diagram illustrating architecture of an organization phase of a preferred embodiment of a semantic organizer and retrieval system and method; and
Figure 2 is a block diagram illustrating architecture of a retrieval phase of a preferred embodiment of a semantic organizer and retrieval system and method.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention relates to a semantic organization and retrieval system of text based web artifacts, and methods thereof. Hereinafter, this specification will describe the present invention according to the preferred embodiment of the present invention. However, it is to be understood that limiting the description to the preferred embodiment of the invention is merely to facilitate discussion of the present invention and it is envisioned that those skilled in the art may devise various modifications and equivalents without departing from the scope of the appended claims.
The following detailed description of the preferred embodiment will now be described in accordance with the attached drawings, either individually or in combination.
The present invention provides a semantic organization and retrieval system (100, 200) of text based web artifacts as seen in Figures 1 and 2. The system (100, 200) includes a knowledge base (126), a plurality of semantic processors connectable to the knowledge base (126), a semantic organizer (110) connectable to the plurality of semantic processors and a semantic retriever (210) connectable to the plurality of semantic processors. The plurality of semantic processors may be selected from, but is not restricted to, a semantic interpreter (112), parser (114), conceptual graph processor (124), Personal Information Manager (PIM) (116), PIM database (128), source preprocessor (118), document analyzer (120), document categorizer (122), semantic similarity matching unit (212) and summarizing means (134). It is to be appreciated that the system (100, 200) may include other related processors according to the application of the system (100, 200).
An example of the knowledge base (126) is a conceptual graph knowledge base. The conceptual graph knowledge base is built for representing hierarchical structure and relations between concepts and relations in natural language. The knowledge base (126) may be extended to incorporate various domains including, but not limited to, medical, engineering and computing. The present embodiment of the invention includes a knowledge base (126) populated for general uses as well as the medical domain. The knowledge base (126) includes concept type hierarchy, relation type hierarchy, type definitions, schemas, prototypes and instances.
A method of semantic data organization of text based web artifacts is described herein as seen in Figure 1. The method includes the steps of storing metadata information of web artifacts, categorizing the web artifacts and performing at least one conceptual representation of contents in web artifacts. The method is further described in detail.
Metadata information of a plurality of web based text documents, such as web Uniform Resource Locators (URLs), is provided to a source preprocessor (118). The source preprocessor (118) extracts text out of the plurality of documents, writes the extracted text to a file and saves it to a file server (130). An example of metadata information is publication information. An example of at least one conceptual representation is at least one conceptual graph.
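By way of illustration only, the text extraction performed by the source preprocessor (118) might be sketched as follows. The specification does not state how text is extracted; the class name `TextExtractor` and the function `extract_text` are hypothetical, and the sketch assumes the web artifact is an HTML page already fetched as a string. Fetching the URL and saving the result to the file server (130) are omitted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content.

    Hypothetical helper; not part of the described system.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A call such as `extract_text(page_source)` would yield the plain text that is subsequently summarized and parsed.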
The URLs are then provided to a document analyzer (120) which removes stop words and counts occurrences of at least one predetermined keyword. A feature vector is generated for the plurality of documents based on the occurrence count and the at least one predetermined keyword. A document categorizer (122) receives the feature vector along with a specified model file. The document is then categorized, wherein the document is identified to be in a particular category in the specified model file. A summarizing means (134) summarizes the categorized document and saves it on the file server (130). Each sentence from the summarized document is then provided to a parser (114) that parses natural language. A plurality of syntactical structures, which include constituent tree and linkage tree of a sentence, is then obtained. The obtained syntactical structures for each sentence are processed individually by the Semantic Interpreter (112). The Semantic Interpreter (112) uses the linkage and constituent information to produce at least one conceptual graph using a set of rules to identify a plurality of words in sentences as concepts, schemas and relations. The Semantic Interpreter (112) communicates with a conceptual graph processor (124) to perform various conceptual graph operations such as but not limited to, load graph, maximal join, join, load schema, load relation, load prototype, load type definition, to generate the conceptual graph representing the meaning of the sentence. The at least one conceptual graph is stored in the conceptual graph knowledge base (126) and any graph references are returned to the Semantic Organizer (110).
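The feature vector generation performed by the document analyzer (120) may be illustrated by the following minimal sketch. The stop word list, the keyword list and the function name `feature_vector` are assumptions made for illustration; the specification does not enumerate the predetermined keywords, and the categorization against a model file by the document categorizer (122) is not shown.

```python
import re
from collections import Counter

# Illustrative values only; the actual stop words and predetermined
# keywords are not given in the specification.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}
KEYWORDS = ["semantic", "graph", "patient", "diagnosis"]


def feature_vector(text, keywords=KEYWORDS, stop_words=STOP_WORDS):
    """Return the occurrence count of each keyword, in keyword order."""
    # Lower-case the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words before counting occurrences.
    counts = Counter(t for t in tokens if t not in stop_words)
    # A Counter returns 0 for keywords that never occur.
    return [counts[k] for k in keywords]
```

Each position of the resulting vector corresponds to one predetermined keyword, so vectors of different documents are directly comparable by a categorizer.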
The conceptual graphs are sent to the conceptual graph processor (124) wherein a maximal join operation is conducted resulting in a single conceptual graph that represents meaning of summarized content in a document. The graph references, metadata and the category information of the document are then stored in a personal information manager database (128).
A method of semantic data retrieval of text based organized web artifacts from metadata information of said web artifacts is described herein as seen in Figure 2. The method includes the steps of accepting at least one query from user, filtering organized artifacts using metadata information, representing context of query in a conceptual representation, matching the conceptual representation for semantic similarity and retrieving a plurality of web artifact pointers based on user sorted relevance ranking of semantic similarity indexes.
A user enters a natural language query into a Semantic Retriever (210). Metadata information and context of the query is then extracted from the natural language query. A plurality of documents is retrieved from a Personal Information Manager database (128) based on the extracted metadata information. Next, the context of the query is parsed and sent to the Semantic Interpreter (112) to generate at least one conceptual representation such as at least one conceptual graph. The at least one conceptual graph is stored in the knowledge base (126) and any references related to the at least one conceptual graph are returned to the Semantic Retriever (210).
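The metadata-based filtering step may be sketched as below, assuming purely for illustration that the Personal Information Manager database (128) is a relational store. The table name `artifacts`, its columns and the functions `open_pim_db`, `store_artifact` and `filter_by_metadata` are hypothetical; the specification states only that graph references, metadata and category information are stored.

```python
import sqlite3


def open_pim_db():
    """Create an in-memory stand-in for the PIM database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifacts ("
        "url TEXT PRIMARY KEY, category TEXT, publisher TEXT, graph_ref TEXT)"
    )
    return conn


def store_artifact(conn, url, category, publisher, graph_ref):
    """Store the URL, category, metadata and graph reference of one artifact."""
    conn.execute(
        "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
        (url, category, publisher, graph_ref),
    )


def filter_by_metadata(conn, category=None, publisher=None):
    """Return (url, graph_ref) pairs matching the metadata that was supplied."""
    clauses, params = [], []
    if category is not None:
        clauses.append("category = ?")
        params.append(category)
    if publisher is not None:
        clauses.append("publisher = ?")
        params.append(publisher)
    sql = "SELECT url, graph_ref FROM artifacts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY url"
    return conn.execute(sql, params).fetchall()
```

Only the artifacts returned by such a filter would need their conceptual graphs matched against the query graph, which keeps the semantic matching step small.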
The at least one conceptual graph of each retrieved document is matched semantically, one by one, with the at least one conceptual graph of the natural language query. A semantic similarity matching unit (212) generates a similarity index based on the matching process. The semantic similarity matching unit (212) is able to receive more than two conceptual graphs simultaneously in this embodiment. However, one skilled in the art would appreciate that the method may also process a plurality of conceptual graphs depending on the application of the method. The similarity matching unit (212) calls a maximal join size operation in the conceptual graph processor (124), wherein a plurality of input graphs are processed, resulting in the number of concepts which are maximally matched in a maximally extended resultant graph.
A formula as seen below is applied to find a similarity index (ranging from 0 to 1):
$$\text{Similarity index} = \frac{\text{maxJoinSize}}{\text{cg1size} + \text{cg2size} - \text{maxJoinSize}}$$
wherein maxJoinSize is the number of concepts which are maximally matched in both conceptual graphs, cg1size is the number of concept nodes in a first conceptual graph and cg2size is the number of concept nodes in a second conceptual graph, in an example where two conceptual graphs are matched. A similarity index of 1 indicates a best match while 0 indicates a worst match. A list of retrieved documents is sorted according to the similarity index, which is an indicator of the relevance of the retrieved documents to the natural language query. URLs of the retrieved documents are returned to the user.
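The similarity index and the subsequent ranking may be illustrated by the following sketch. The function `similarity_index` follows the formula given above. The function `approx_max_join_size` is a deliberate simplification and an assumption: it treats each conceptual graph as a set of concept labels and counts the shared labels, whereas the described system obtains maxJoinSize from a maximal join in the conceptual graph processor (124). All function names are hypothetical.

```python
def similarity_index(max_join_size, cg1_size, cg2_size):
    """maxJoinSize / (cg1size + cg2size - maxJoinSize), in the range 0 to 1."""
    denominator = cg1_size + cg2_size - max_join_size
    if denominator == 0:
        # Both graphs are empty; treat as no match (an assumption).
        return 0.0
    return max_join_size / denominator


def approx_max_join_size(concepts_a, concepts_b):
    """Simplified stand-in for the maximal join: count shared concept labels."""
    return len(set(concepts_a) & set(concepts_b))


def rank_documents(query_concepts, documents):
    """Return (url, index) pairs sorted from best match to worst match.

    `documents` maps a URL to the concept labels of its conceptual graph.
    """
    query_set = set(query_concepts)
    scored = []
    for url, concepts in documents.items():
        doc_set = set(concepts)
        shared = approx_max_join_size(query_set, doc_set)
        scored.append((url, similarity_index(shared, len(query_set), len(doc_set))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, two graphs of 3 and 4 concept nodes with 2 maximally matched concepts give an index of 2 / (3 + 4 - 2) = 0.4, and identical graphs give an index of 1.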
The present invention is an improvement over storing a web artifact on a local file space: the web artifact URL is stored in a database, which addresses the problem of unmanageable data space expansion. Personalized storage of the web artifacts is also an improvement over identifying a document category manually as in known systems. As the present invention is able to perform a search based on a natural language query, it is an improvement over a keyword search, because a natural language query takes context into consideration, unlike the keyword search. By personalizing the categorization of search documents, automatic ranking of documents based on relevance also improves the accuracy of search compared to manually analyzing all retrieved documents. The described methods and system (100, 200) can be applied to, but are not restricted to, organizing and retrieving web based text artifacts on the basis of metadata, category and conceptual graphs of said web artifacts. Therefore, the described system (100, 200) and methods are able to transform natural language queries into an appropriate internal semantic query. This produces results that directly answer the query rather than relying on users to check each search result for relevancy.

Claims

1. A semantic organization and retrieval system (100, 200) of text based web artifacts, the system (100, 200) includes:
i. a means of access to a knowledge base (126);
ii. a plurality of semantic processors connectable to the knowledge base (126);
iii. a semantic organizer (110) connectable to the plurality of semantic processors; and
iv. a semantic retriever (210) connectable to the plurality of semantic processors.
2. The system (100, 200) as claimed in claim 1, wherein the plurality of semantic processors include semantic interpreter (112), parser (114), conceptual graph processor (124), Personal Information Manager (PIM) (116), PIM database (128), source preprocessor (118), document analyzer (120), document categorizer (122), semantic similarity matching unit (212) and summarizing means (134).
3. The system (100, 200) as claimed in claim 1, wherein the knowledge base (126) includes concept type hierarchy, relation type hierarchy, type definitions, schemas, prototypes and instances.
4. The system (100,200) as claimed in claim 3, wherein the knowledge base (126) is a conceptual graph knowledge base.
5. A method of semantic data organization of text based web artifacts, the method includes the steps of:
storing metadata information of web artifacts;
categorizing the web artifacts; and
performing at least one conceptual representation of contents in web artifacts.
6. The method as claimed in claim 5, wherein the metadata information is a web Uniform Resource Locator (URL).
7. The method as claimed in claim 5, wherein the metadata information is publication information.
8. The method as claimed in claim 5, wherein the at least one conceptual representation is at least one conceptual graph.
9. A method of semantic data retrieval of text based organized web artifacts from metadata information of said web artifacts, the method includes the steps of:
i. accepting at least one query from user;
ii. filtering organized artifacts using metadata information;
iii. representing context of query in at least one conceptual representation;
iv. matching the at least one conceptual representation for semantic similarity; and
v. retrieving a plurality of web artifact pointers based on user sorted relevance ranking of semantic similarity indexes.
10. The method as claimed in claim 9, wherein the plurality of web artifact pointers are a plurality of Uniform Resource Locators (URLs).
11. The method as claimed in claim 9, wherein the at least one conceptual representation is at least one conceptual graph.
Application number: PCT/MY2010/000300 | Priority date: 2010-01-27 | Filing date: 2010-11-25 | Title: A semantic organization and retrieval system and methods thereof | Publication: WO2011093691A2 (en)

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
MYPI2010000435 | 2010-01-27 | |
MYPI2010000435A (MY159332A (en)) | 2010-01-27 | 2010-01-27 | A semantic organization and retrieval system and methods thereof

Publications (2)

Publication Number | Publication Date
WO2011093691A2 (en) | 2011-08-04
WO2011093691A3 (en) | 2011-11-24

Family

ID=44320018

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
PCT/MY2010/000300 | A semantic organization and retrieval system and methods thereof (WO2011093691A2 (en)) | 2010-01-27 | 2010-11-25

Country Status (2)

Country | Link
MY (1) | MY159332A (en)
WO (1) | WO2011093691A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103678302A (en) * | 2012-08-30 | 2014-03-26 | 北京百度网讯科技有限公司 | Document structuration organizing method and device
WO2015030796A1 (en) * | 2013-08-30 | 2015-03-05 | Intel Corporation | Extensible context-aware natural language interactions for virtual personal assistants

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP0962873A1 (en) * | 1998-06-02 | 1999-12-08 | International Business Machines Corporation | Processing of textual information and automated apprehension of information
US6453315B1 (en) * | 1999-09-22 | 2002-09-17 | Applied Semantics, Inc. | Meaning-based information organization and retrieval
US6704728B1 (en) * | 2000-05-02 | 2004-03-09 | Iphase.Com, Inc. | Accessing information from a collection of data
US20100036797A1 (en) * | 2006-08-31 | 2010-02-11 | The Regents Of The University Of California | Semantic search engine
JP2009134580A (en) * | 2007-11-30 | 2009-06-18 | Canon Inc | Document database system and image input device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103678302A (en) * | 2012-08-30 | 2014-03-26 | 北京百度网讯科技有限公司 | Document structuration organizing method and device
WO2015030796A1 (en) * | 2013-08-30 | 2015-03-05 | Intel Corporation | Extensible context-aware natural language interactions for virtual personal assistants
US10127224B2 (en) | 2013-08-30 | 2018-11-13 | Intel Corporation | Extensible context-aware natural language interactions for virtual personal assistants

Also Published As

Publication number | Publication date
WO2011093691A3 (en) | 2011-11-24
MY159332A (en) | 2016-12-30


Legal Events

Date | Code | Title | Description
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10844825; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 10844825; Country of ref document: EP; Kind code of ref document: A2

