A judicial similar-case search method based on multi-feature fusion
Technical field
The present invention relates to the field of judicial similar-case search, and specifically to a judicial similar-case search method based on multi-feature fusion.
Background technology
Law is a product of the state. It refers to the basic statutes and general laws promulgated by the ruling class (the ruling group, that is, political parties, including kings and monarchs) through a certain legislative procedure in order to rule and manage the country. Law is the embodiment of the will of the whole people and an instrument of state rule.
As social information becomes increasingly open, society pays more and more attention to the trial results of legal cases. If similar judgment documents can be recommended promptly as references during a trial, the effectiveness of the trial can be improved considerably. At present, keyword-based text retrieval systems are generally used. They compare the similarity of two cases simply by word matching, so it is difficult to obtain ideal search results accurately. The reasons can be summarized in three aspects: keyword features do not describe the document information comprehensively, which makes the similarity calculation inaccurate; keywords distributed in different blocks of a document influence the final similarity judgment to different degrees; and the constraint that contextual information places on keyword semantics is not considered in fine detail, so differences caused by changes in context cannot be distinguished effectively. Developing a search method with high accuracy has therefore become an important task.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defects of low retrieval efficiency and low accuracy in the prior art by providing a judicial similar-case search method based on multi-feature fusion.
The technical solution by which the present invention solves the above technical problem is as follows. The invention discloses a judicial similar-case search method based on multi-feature fusion, with the following specific steps:
(1) The user enters a query request;
(2) The user's query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords;
(3) The query word set is traversed in order; for each query word in the set, semantic query expansion is performed using a semantic dictionary, yielding an expanded list of query keywords;
(4) Documents are filtered using information points, and the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, after which multi-feature fusion is performed;
(5) The fused similarity value between each document and the query statement is computed, giving the final similarity score;
(6) The final search results are sorted and output.
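Steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the stop-word list and the semantic dictionary are hypothetical toy data, and whitespace splitting stands in for a real word segmenter, which would be needed for Chinese judicial text.

```python
# Hypothetical stop-word list and semantic dictionary (assumptions for illustration);
# a real system would load both from curated resources.
STOP_WORDS = {"the", "of", "a", "in", "for"}
SEMANTIC_DICT = {
    "theft": ["larceny", "stealing"],
    "vehicle": ["car"],
}


def preprocess(query):
    """Step (2): lowercase, split on whitespace, and drop stop words."""
    tokens = query.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def expand(keywords):
    """Step (3): add dictionary synonyms for each keyword, keeping order and skipping duplicates."""
    expanded = []
    for word in keywords:
        for candidate in [word] + SEMANTIC_DICT.get(word, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


keywords = preprocess("Theft of a vehicle in the city")
expanded_keywords = expand(keywords)
```

Words that have no entry in the dictionary pass through unchanged, so expansion never shrinks the keyword list.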
Preferably, in step (4), the feature vectors include a block-grouped keyword feature vector, a language model feature vector and a topic word set feature vector.
Preferably, the block-grouped keyword feature vector is obtained by computing tf-idf statistics for the terms in each block and then dividing them into groups;
Preferably, the language model feature vector is obtained by applying a sliding window of size N to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams;
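The sliding-window construction of the key gram list can be sketched as follows. The function names, the sample documents and the threshold value `min_count` are assumptions chosen for illustration; the text only states that a preset threshold is applied to the gram frequencies.

```python
from collections import Counter


def extract_grams(tokens, n):
    """Slide a window of size n over the token list and return each fragment as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def key_grams(documents, n=2, min_count=2):
    """Count grams over all documents and keep those whose frequency reaches min_count."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_grams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical pre-segmented documents.
docs = [
    ["defendant", "stole", "motor", "vehicle"],
    ["stolen", "motor", "vehicle", "recovered"],
]
key_list = key_grams(docs, n=2, min_count=2)
```

A document shorter than N produces no grams, so it simply contributes nothing to the counts.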
Preferably, the topic word set feature vector uses a topic to represent a concept or an aspect, which manifests as a series of related key topic words and is given by the conditional probabilities of those key words;
Preferably, in step (5), the similarity scoring formula after multi-feature fusion is as follows:
Score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Simcapte(q, d)
where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through mathematical modeling and solving over training data, the parameter combination (a, b, c) is adaptively adjusted to the optimum. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are then chosen empirically.
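One possible way to tune the weights can be sketched as follows. The grid search and the squared-error criterion are assumptions made for illustration, since the text does not specify how the parameters are fitted; `samples` is a hypothetical training set of per-document feature scores paired with target relevance values.

```python
def fused_score(word_score, gram_score, topic_score, a, b, c):
    """Weighted sum of the three feature scores; the weights must sum to 1."""
    assert abs(a + b + c - 1.0) < 1e-9
    return a * word_score + b * gram_score + c * topic_score


def grid_search(samples, step=0.1):
    """Try weight triples on a grid inside (0, 1) and return the one with the lowest squared error.

    samples: list of ((word_score, gram_score, topic_score), target_relevance) pairs.
    """
    best, best_err = None, float("inf")
    steps = int(round(1 / step))
    for i in range(1, steps):
        for j in range(1, steps - i):
            a, b = i * step, j * step
            c = 1.0 - a - b  # forces a + b + c = 1 and keeps c strictly positive
            err = sum((fused_score(*f, a, b, c) - y) ** 2 for f, y in samples)
            if err < best_err:
                best, best_err = (a, b, c), err
    return best


# Hypothetical training samples generated with weights (0.6, 0.3, 0.1).
samples = [((1.0, 0.0, 0.0), 0.6), ((0.0, 1.0, 0.0), 0.3), ((0.0, 0.0, 1.0), 0.1)]
best_weights = grid_search(samples)
```

A finer `step` gives a closer fit at the cost of more evaluations. A gradient-based optimizer could replace the grid if the training set is large.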
Compared with the prior art, the present invention has the following beneficial effects:
The present invention first performs semantic query expansion using a semantic dictionary, so that the relationships between query keywords and other words are described more comprehensively and a comprehensive and accurate keyword description is constructed. It then builds a similarity model from multiple features, such as block-weighted terms, the language model and the topic word set, and ranks the search results comprehensively, which greatly improves the precision and recall of similar-case retrieval.
Description of the drawings
Fig. 1 is a schematic diagram of the offline construction of the multi-feature model in Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the judicial similar-case search method based on multi-feature fusion in Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the multi-feature fusion in Embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the principle of the vector space model in Embodiment 1 of the present invention.
Detailed description of the embodiments
Referring to Figs. 1-4, the invention discloses a judicial similar-case search method based on multi-feature fusion, with the following specific steps:
(1) The user enters a query request;
(2) The user's query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords;
(3) The query word set is traversed in order; for each query word in the set, semantic query expansion is performed using a semantic dictionary, yielding an expanded list of query keywords;
(4) Documents are filtered using information points, and the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, after which multi-feature fusion is performed;
(5) The fused similarity value between each document and the query statement is computed, giving the final similarity score;
(6) The final search results are sorted and output.
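The inverted-index lookup used for candidate filtering in step (4) can be sketched as follows. This is a generic term-to-document index under stated assumptions: the document ids and contents are hypothetical, and the information-point filtering and per-feature indexes of the method are not modeled.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return index


def candidate_documents(index, keywords):
    """Return the ids of documents that contain at least one query keyword."""
    hits = set()
    for word in keywords:
        hits |= index.get(word, set())
    return hits


# Hypothetical pre-segmented judgment documents keyed by id.
docs = {
    1: ["theft", "motor", "vehicle"],
    2: ["contract", "dispute"],
    3: ["vehicle", "collision"],
}
index = build_inverted_index(docs)
candidates = candidate_documents(index, ["vehicle", "larceny"])
```

Only the candidate documents then need to be scored, which avoids comparing the query against every document in the collection.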
Preferably, in step (4), the feature vectors include a block-grouped keyword feature vector, a language model feature vector and a topic word set feature vector.
Preferably, the block-grouped keyword feature vector is obtained by computing tf-idf statistics for the terms in each block and then dividing them into groups;
Preferably, the language model feature vector is obtained by applying a sliding window of size N to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams;
Preferably, the topic word set feature vector uses a topic to represent a concept or an aspect, which manifests as a series of related key topic words and is given by the conditional probabilities of those key words;
Preferably, in step (5), the similarity scoring formula after multi-feature fusion is as follows:
Score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Simcapte(q, d)
where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through mathematical modeling and solving over training data, the parameter combination (a, b, c) is adaptively adjusted to the optimum. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are then chosen empirically.
Embodiment 1
The invention discloses a judicial similar-case search method based on multi-feature fusion, with the following specific steps:
(1) The user enters a query request;
(2) The user's query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords;
(3) The query word set is traversed in order; for each query word in the set, semantic query expansion is performed using a semantic dictionary, yielding an expanded list of query keywords;
(4) Documents are filtered using information points, and the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, including the block-grouped keyword feature vector, the language model feature vector and the topic word set feature vector. The block-grouped keyword feature vector is obtained by computing tf-idf statistics for the terms in each block and then dividing them into groups. The language model feature vector is obtained by applying a sliding window of size N to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams. Taking the 2-gram model as an example, the adjacent-word similarity score is calculated by the following formula:
Here gramScore(q, d) denotes the adjacent-word similarity score between the query string q and the document d; 2-gram(q) denotes the 2-gram set of the query string, and 2-gram(d) denotes the 2-gram set of the document.
The specific algorithm is described as follows:
Input: the preprocessed query string q and the document d
Output: the adjacent-word similarity score between q and d
A. Obtain the 2-gram set 2-gram(q) of q;
B. Obtain the 2-gram set 2-gram(d) of d;
C. Compute the adjacent-word similarity score gramScore(q, d) of q and d from 2-gram(q) and 2-gram(d);
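Steps A to C can be sketched as follows. The overlap measure used in step C is an assumption: this sketch uses the Jaccard coefficient of the two 2-gram sets, which is one common choice and may differ from the formula of the method. The token lists are hypothetical.

```python
def bigram_set(tokens):
    """Steps A and B: collect the set of adjacent word pairs (2-grams) of a token list."""
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}


def gram_score(query_tokens, doc_tokens):
    """Step C: overlap of the two 2-gram sets, measured here by the Jaccard coefficient (assumed)."""
    q, d = bigram_set(query_tokens), bigram_set(doc_tokens)
    if not q or not d:
        return 0.0  # no adjacent pairs to compare
    return len(q & d) / len(q | d)


# Hypothetical pre-segmented query and document.
query = ["motor", "vehicle", "theft"]
document = ["motor", "vehicle", "theft", "case"]
score = gram_score(query, document)
```

Because the score depends on adjacent pairs rather than single words, two texts with the same words in a different order receive a lower score than two texts sharing the same phrases.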
The topic word set feature vector uses a topic to represent a concept or an aspect, which manifests as a series of related key topic words and is given by the conditional probabilities of those key words.
Multi-feature fusion is then performed on the above feature vectors;
(5) The fused similarity value between each document and the query statement is computed, giving the final similarity score. The specific steps are as follows:
The model is assumed to treat a document as a vector of t-dimensional features, where the features are usually represented by words. The weight of each feature can be calculated according to some criterion, and these t weighted features together make up a document;
To compute the score, both the document and the query are expressed as vectors. A document is regarded as a series of words (terms), each with a weight (term weight); different terms influence the relevance score of the document according to their weights in the document.
The weights (term weights) of all terms in a document are then regarded as a vector:
Document = {term1, term2, ..., term N}
Document Vector = {weight1, weight2, ..., weight N}
Similarly, the query statement is regarded as a simple document and is also represented as a vector:
Query = {term1, term2, ..., term N}
Query Vector = {weight1, weight2, ..., weight N}
All retrieved document vectors and the query vector are placed in an N-dimensional space in which each word (term) is one dimension. The principle of the vector space model is shown in Fig. 4:
The similarity value between the document and the query statement is then obtained by the following formula:
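In the vector space model, the similarity of two weight vectors is commonly measured by the cosine of the angle between them. The sketch below assumes that measure; the function name and the sample weights are illustrative and not taken from the method.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_q * norm_d)


# Hypothetical term weights over the same three-term vocabulary.
query_vector = [1.0, 1.0, 0.0]
document_vector = [1.0, 0.0, 0.0]
similarity = cosine_similarity(query_vector, document_vector)
```

The value ranges from 0 for vectors with no shared terms to 1 for vectors pointing in the same direction, so a smaller angle means a more relevant document.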
Semantic query expansion makes the description of the relationships between query keywords and other words more comprehensive. The block-grouped keyword feature reflects keyword distribution information; the language-model-based keyword feature reflects keyword dependencies and the constraint of contextual semantics on keyword semantics; and the keyword feature based on the topic word set introduces the correlation between query terms and topic words, reflecting the likelihood between the query and a document block. The goal is to combine the block-grouped keyword feature, the language model feature and the topic word feature so that they offset one another's weaknesses, complement one another and jointly describe a document, and so that the similarity between the query and the document can be calculated from these features.
The similarity scoring formula after multi-feature fusion is as follows:
Score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Simcapte(q, d)
where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through mathematical modeling and solving over training data, the parameter combination (a, b, c) is adaptively adjusted to the optimum. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are then chosen empirically;
(6) The final search results are sorted and output.
The above embodiment merely illustrates the principles and effects of the present invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.