Background Art
Search tools such as Google, Baidu and Yahoo are the main means by which people obtain information from the network. The working principle of a search engine comprises roughly three processes. (1) Gathering information from the Internet: a web spider periodically crawls the websites and web pages on the Internet. (2) Organizing the information and building an index database: an indexing system program analyzes the collected pages and extracts relevant information such as the URL of each page, its encoding type, the keywords contained in the page content, keyword positions, generation time, size, and link relationships with other pages; a relevance algorithm then computes the relevance (or importance) of each page to each keyword in its content and hyperlinks, and the web page index database is built from this information. (3) Searching and ranking in the index database: after the user enters a query keyword on the search engine interface, the search system program finds all pages in the web page index database that match the keyword and ranks them by the precomputed relevance values; the higher the relevance, the higher the rank. Finally, the page generation system assembles the link addresses, page summaries and other content of the search results and returns them to the user.
Most current search engines are based on keyword matching and have little semantic reasoning capability. Although Google has adopted some natural language processing techniques, such as synonym expansion, it cannot resolve the semantic relations between concepts. This reduces precision to some extent, so the returned results are not the information the user wants. In addition, a user's query often depends heavily on a particular professional domain, such as the marine domain. For example, suppose a user wants to search for marine-domain information on "DIP" (dissolved inorganic phosphorus). As shown in Fig. 4, the query usually returns a large amount of "DIP" information from other domains, such as "Dual Inline Package", a packaging technology in microelectronics. Because such results are irrelevant to the user's purpose, the user is clearly dissatisfied with them.
An ontology, defined as "an explicit formal specification of a shared conceptualization", is a model obtained by abstracting the relevant concepts of phenomena in the objective world; the meaning expressed by the conceptual model is independent of any specific environmental state. An ontology embodies commonly accepted knowledge and reflects the set of concepts recognized in the relevant domain, so it provides a common understanding and description of domain knowledge that can be better shared, exchanged and reused. Because the concepts that make up an ontology and the relations between them are precisely defined, an ontology can eliminate polysemy, synonymy and word-sense ambiguity, thereby giving domain knowledge a clear, definite and complete definition and description. The goal of ontology research is a knowledge representation method that lets machines share and process information as humans do. Ontologies are now widely used in fields such as knowledge representation and information retrieval.
Summary of the invention
To overcome the deficiency of existing search engines in semantic retrieval, the invention provides an information retrieval optimization method based on a domain ontology.
The technical solution of the invention is an information retrieval optimization method based on a domain ontology, comprising the following steps:
(1) obtaining, through the search interface of the retrieval system, the query keyword submitted by the user;
(2) within the domain expected by the user, and according to an established domain ontology, semantically expanding the submitted query keyword by ontology reasoning to obtain one or more new query strings;
(3) submitting the expanded query strings to one or more search engines for retrieval;
(4) de-duplicating, ranking and integrating the results returned by each search engine;
(5) displaying the final results to the user through the search interface.
The semantic expansion based on the domain ontology in step (2) uses one, two or all of the following approaches:
1. Optimization based on the is-a relation
The is-a relation (inheritance) expresses the classification of concepts: the instances of a parent concept are the union of the instances of its child concepts. A child concept adds constraints to its parent, so it is also called a specialization of the parent concept. A concept is relatively likely to occur in the same document as its direct parent or child concept. Therefore, when searching for documents about a concept A, the parent concept P or a child concept C of A can be used as a constraint to improve precision. A concept can thus be optimized into a query pair consisting of the concept itself and its parent or child concept.
2. Optimization based on the part-of relation
The part-of relation expresses the whole-part relationship and describes the connection between a concept and its part concepts. The components of a concept are closely related to the domain to which the concept belongs, so documents that match a part concept are usually also associated with its whole concept. A concept can thus be optimized into a query pair consisting of the concept itself and one of its part concepts.
3. Optimization based on the equivalent-class relation
The equivalent-class relation is used to handle synonymy in domain knowledge. With it, a concept in the user's query can be mapped to its equivalent synonyms, which improves the precision of information retrieval. The equivalent-class relation is usually used as an auxiliary to the two preceding optimization methods.
The concepts within a query pair are joined by a logical AND or OR: AND improves query precision, and OR improves recall.
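The three expansion approaches can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Concept` class, the `expand_query` function and the toy ontology are hypothetical, and OR-ing equivalent-class synonyms with the keyword is only one possible reading of "auxiliary method". The concept names are borrowed from the marine ecology example in the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A domain-ontology concept with the relations used for expansion (hypothetical structure)."""
    name: str
    is_a_related: list = field(default_factory=list)     # direct parent or child concepts
    part_of_related: list = field(default_factory=list)  # whole or part concepts
    equivalents: list = field(default_factory=list)      # equivalent-class synonyms


def expand_query(keyword, ontology, operator="AND"):
    """Return expanded query strings pairing the keyword with related concepts."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    concept = ontology.get(keyword)
    if concept is None:
        return [keyword]  # unknown concept: fall back to the original keyword
    # Assumption: synonyms are OR-ed with the keyword inside each query pair.
    term = keyword
    if concept.equivalents:
        term = "(" + " OR ".join([keyword, *concept.equivalents]) + ")"
    related = concept.is_a_related + concept.part_of_related
    if not related:
        return [term]
    return [f"{term} {operator} {other}" for other in related]


# Toy ontology fragment; the real ontology would be loaded from the server.
toy_ontology = {
    "DIP": Concept(
        "DIP",
        is_a_related=["InorganicNutrient"],
        part_of_related=["Phytoplankton", "Seawater"],
    )
}
print(expand_query("DIP", toy_ontology))
```

Passing `operator="OR"` yields the recall-oriented variant of each pair.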
In step (4), the results returned by each search engine are de-duplicated, ranked and integrated. An algorithm that can be used is as follows:
(1) The URL of each search result is processed: the part of the URL string before "#" is taken as the final link address. If MD5(URL_A) = MD5(URL_B), the pages corresponding to URL_A and URL_B are considered duplicates, and the duplicate is removed;
(2) The ranking algorithm considers two aspects:
1. The semantic distance Dist(C_i, C_j) between the concepts in the query string, where C_i and C_j are two concepts in the query string:
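The de-duplication rule in (1) can be sketched as follows, assuming each search result is a dictionary with a `url` key (a hypothetical record layout). `urldefrag` from the Python standard library strips the part of the URL from "#" onward, and `hashlib.md5` produces the digest that is compared.

```python
import hashlib
from urllib.parse import urldefrag


def url_key(url):
    """MD5 digest of the URL with everything from '#' onward removed."""
    base, _fragment = urldefrag(url)
    return hashlib.md5(base.encode("utf-8")).hexdigest()


def deduplicate(results):
    """Keep the first record for each distinct fragment-free URL, preserving order."""
    seen = set()
    unique = []
    for record in results:
        key = url_key(record["url"])
        if key not in seen:
            seen.add(key)
            # Store the fragment-free URL as the final link address.
            unique.append({**record, "url": urldefrag(record["url"])[0]})
    return unique
```

Comparing digests rather than raw strings keeps the set of seen keys at a fixed size per entry, which may matter when result sets from several engines are merged.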
Formula 1
Formula 1 is expressed in terms of the following quantities: the sum of the weighted distances of the edges on the shortest path linking nodes C_i and C_j in the ontology tree; the weighted distances from node C_i and from node C_j to their lowest common ancestor node; N_LCA, the weighted distance from the lowest common ancestor to the root node; and ε, a constant determined by the weighting coefficients.
The semantic weights of the different relations between concepts are given in Table 1.
Table 1. Semantic distance weights
In Table 1, the blank-operation symbol denotes a null operation, and its combination with a row denotes a single operation; E denotes the equivalent-class relation; G denotes the is-a relation directed from child concept to parent concept; S denotes the is-a relation directed from parent concept to child concept; P denotes the part-of relation.
Because the semantic similarity of two concepts is an inverse function of their semantic distance, and the similarity is 1 when the distance is 0, the similarity between C_i and C_j can be simplified to:
Formula 2
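The tree quantities that Formula 1 relies on (the weighted distance of each node to the lowest common ancestor, and N_LCA) can be computed as sketched below. The sketch does not evaluate Formula 1 or Formula 2 themselves. The edge weights in `ASSUMED_WEIGHTS`, the parent-pointer representation and the tree fragment in the usage note are all assumptions made for illustration; the actual weights come from Table 1, and direction-dependent weights (G versus S) are ignored here for simplicity.

```python
# Hypothetical weights per relation type; Table 1 holds the actual values.
ASSUMED_WEIGHTS = {"is-a": 1.0, "part-of": 2.0}


def path_to_root(node, parent):
    """Return [(ancestor, weighted distance from node)] from node up to the root."""
    path = [(node, 0.0)]
    dist = 0.0
    while node in parent:
        node, relation = parent[node]
        dist += ASSUMED_WEIGHTS[relation]
        path.append((node, dist))
    return path


def lca_distances(a, b, parent):
    """Return (lca, dist(a, lca), dist(b, lca), dist(lca, root)) in a single tree."""
    path_a = path_to_root(a, parent)
    dist_from_b = dict(path_to_root(b, parent))
    for node, dist_a in path_a:
        if node in dist_from_b:
            # Distance from the common ancestor to the root along a's path.
            root_dist = path_a[-1][1] - dist_a
            return node, dist_a, dist_from_b[node], root_dist
    raise ValueError("nodes are not in the same tree")
```

For a hypothetical fragment such as `{"Nutrient": ("Thing", "is-a"), "InorganicNutrient": ("Nutrient", "is-a"), "DIP": ("InorganicNutrient", "is-a")}`, calling `lca_distances("DIP", "InorganicNutrient", parent)` returns the ancestor together with the three distances. In a tree, the shortest path between two nodes passes through their lowest common ancestor, so the first two distances add up to the weighted path length.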
2. The relevance Rank(Query, Abstract) between the query string and a search result record.
Formula 3
In Formula 3, Rank(C_i, Abstract) is the relevance between each concept C_i in the query string Query and the search result abstract Abstract, and n is the number of concepts in Query.
Formula 4
In Formula 4, m = Time(C_i, Abstract) is the number of times concept C_i occurs in the abstract Abstract; len(Abstract) is the length of the abstract Abstract; Index(C_i, j, Abstract) is the position of the j-th occurrence of concept C_i in the abstract Abstract.
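The three inputs of Formula 4, namely Time(C_i, Abstract), len(Abstract) and Index(C_i, j, Abstract), can be gathered as sketched below. The function name `occurrence_stats` is hypothetical, and the sketch assumes case-insensitive matching, non-overlapping occurrences, and positions and length measured in characters; the text does not fix any of these choices. It does not evaluate Formula 4 itself.

```python
def occurrence_stats(concept, abstract):
    """Return (m, length, positions) for a concept in an abstract.

    m         -- number of occurrences, Time(C_i, Abstract)
    length    -- len(Abstract), in characters of the lower-cased text
    positions -- 0-based character offsets; positions[j - 1] plays the
                 role of Index(C_i, j, Abstract)
    """
    text = abstract.lower()
    needle = concept.lower()
    if not needle:
        return 0, len(text), []
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + len(needle))
    return len(positions), len(text), positions
```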
3. For the original query keyword K_i and the expanded query string Query, the semantic similarity Sim(K_i, C_j) between K_i and each concept C_j in Query is obtained; the matching degree R of a retrieval result can then be calculated:
R = α·Sim(K_i, C_j) + β·Rank(Query, Abstract)    (Formula 5)
In Formula 5, α and β are constants representing the weights of the semantic similarity of the expanded keywords and of the abstract relevance, respectively, where α ∈ (0, 1), β ∈ (0, 1) and α + β = 1.
4. The retrieval results are sorted in decreasing order of R.
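Formula 5 and the final sort can be sketched as follows, assuming the similarity and relevance values have already been computed and stored under the hypothetical keys `sim` and `rank`. The defaults α = 0.6 and β = 0.4 are the values used in the embodiment.

```python
def matching_degree(sim, rank, alpha=0.6, beta=0.4):
    """Formula 5: R = alpha * Sim + beta * Rank, with alpha + beta = 1."""
    if not (0 < alpha < 1 and 0 < beta < 1) or abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must lie in (0, 1) and sum to 1")
    return alpha * sim + beta * rank


def rank_results(results, alpha=0.6, beta=0.4):
    """Return the results sorted in decreasing order of matching degree R."""
    return sorted(
        results,
        key=lambda r: matching_degree(r["sim"], r["rank"], alpha, beta),
        reverse=True,
    )
```

Because `sorted` is stable, records with equal R keep their original relative order, for instance the order in which the search engine returned them.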
The invention uses the semantic advantages of an ontology to improve the recall and precision of information retrieval within a domain. With this method, the user's query keyword is semantically expanded using the domain ontology to obtain one or more new query strings, which are then submitted to a Web search engine; the search results are ranked and organized and finally displayed to the user. Because the new query strings take into account the relations between domain concepts, such as hypernyms, hyponyms and synonyms, the recall of retrieval is improved. At the same time, because the ontology is domain-specific, the retrieval results are confined to the scope of the domain concerned, and a large amount of information unrelated to the domain is filtered out, which improves the precision of retrieval.
Embodiment
The invention is described in further detail below through a specific embodiment in the marine ecology domain.
The invention proposes an information retrieval optimization method based on a domain ontology, which is described below with reference to the drawings, taking the marine ecology domain as an example.
The workflow of the key steps of the invention is shown in Fig. 2. Taking the marine ecology domain as an example, when the query "DIP" is submitted, the concrete implementation steps are:
1. The server builds a marine ecology ontology and stores it as ocean.ont; a fragment of the ontology is shown in Fig. 1;
2. On the user side, the query keyword "DIP" is submitted through the search interface (Portal) shown in Fig. 3;
3. The server obtains the submitted query keyword and uses HozoAPI to perform semantic reasoning over the ocean.ont ontology for query optimization (Query Optimizer). For the concept "DIP", the related concepts obtained are InorganicNutrient, via the is-a relation, and Phytoplankton and Seawater, via the part-of relation. From these concepts and their relations, three new query strings are obtained: "InorganicNutrient+DIP", "DIP+Phytoplankton" and "DIP+Seawater";
4. The three query strings are each sent to a Web search engine (Web Search Engine), which retrieves three result sets from the World Wide Web; the first 30 records of each result set are taken, giving the result sets Result_1, Result_2 and Result_3;
5. The server merges Result_1, Result_2 and Result_3, removes duplicates and re-ranks the records to obtain the final result set Result. The main algorithm is as follows:
(1) The URL of each search result is processed: the part of the URL string before "#" is taken as the final link address. If MD5(URL_A) = MD5(URL_B), the pages corresponding to URL_A and URL_B are considered duplicates.
(2) The ranking algorithm considers two aspects:
1. The semantic distance Dist(C_i, C_j) between the concepts in the query string, where C_i and C_j are two concepts in the query string.
Formula 1 is used to calculate the semantic distance between C_i and C_j, and Formula 2 is used to calculate their semantic similarity.
2. Formula 3 is used to calculate the relevance between the query string and the search result record.
3. For the original query keyword K_i and the expanded query string Query, the semantic similarity Sim(K_i, C_j) between K_i and each concept in Query is obtained, and Formula 5, R = α·Sim(K_i, C_j) + β·Rank(Query, Abstract), is used to calculate the matching degree; the retrieval results are sorted in decreasing order of R.
The calculation is now illustrated with the query string "InorganicNutrient+DIP". The two concepts are denoted C_IN and C_DIP, respectively.
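From Fig. 1 together with Table 1, the required weighted distances can be determined, with N_LCA = 2; ε is taken as 1. The semantic distance is then calculated by Formula 1, and the semantic similarity by Formula 2.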
The parameters used to calculate Rank(Query, Abstract) are shown in Fig. 5.
Formula 5 is then applied with α = 0.6 and β = 0.4, and the record with the highest resulting matching degree is ranked first.
6. Result is displayed to the user through the search interface, as shown in Fig. 6.
The process above describes the domain-specific retrieval optimization method with the marine ecology retrieval system OASIS and the interface of Fig. 3 as the default. The same kind of domain-specific retrieval system can be used for other domains, provided the ontology of the relevant domain is adopted. For a general-purpose search engine, a domain keyword field can be added to the search interface so that the domain the user wishes to search is determined from the domain keyword entered. For users unfamiliar with the domain classification, the search interface can offer a preset list of domains to choose from at search time, so as to determine the domain ontology and perform word-sense expansion for that domain. If no domain is selected and no domain keyword is entered, the ontologies of all domains are used when determining the domain ontology.