Disclosure of Invention
To solve the technical problem of inaccurate similar-case retrieval results, the application provides a legal document processing method and a legal document processing system based on big data, which improve the accuracy of similar-case retrieval results.
In a first aspect, the application provides a legal document processing method based on big data, comprising: segmenting a case description to obtain a plurality of description words; calculating the matching effectiveness of each description word in a case library, wherein the case library comprises a plurality of case pairs with matching labels, and the matching labels comprise matching and non-matching; for any description word, counting a first co-occurrence probability of the description word in the case pairs labeled as matching and a second co-occurrence probability of the description word in the case pairs labeled as non-matching, and taking the ratio of the first co-occurrence probability to the second co-occurrence probability as the matching effectiveness of the description word; taking the product of the normalized matching effectiveness and the TF-IDF value of each description word as a weighting coefficient, and performing a weighted summation of the semantic vectors of the description words to obtain case characteristics; and obtaining similar cases of the case description according to the similarity between the case characteristics and the case characteristics of historical cases.
The case description is segmented to obtain its description words. The case library comprises a plurality of case pairs, which can be divided into matching case pairs and non-matching case pairs. For any description word, the first co-occurrence probability is counted in the matching case pairs and the second co-occurrence probability is counted in the non-matching case pairs, and the ratio of the first to the second is taken as the matching effectiveness of the description word, which measures the ability of the description word to determine whether legal documents are similar. The product of the normalized matching effectiveness and the TF-IDF value is then used as a weighting coefficient for the weighted summation of the semantic vectors of the description words, yielding the case characteristics. Because the weighting coefficient of each description word combines matching effectiveness with the TF-IDF value, the case characteristics are extracted accurately. Finally, similar cases of the case description are obtained according to the similarity between the case characteristics and the case characteristics of the historical cases, which ensures the accuracy of the similar-case retrieval results.
Preferably, the case description is segmented using the jieba word segmentation tool.
Preferably, the matching effectiveness $E_w$ of a description word $w$ is:
$$E_w = \frac{P_m(w)}{P_n(w)}$$
where $P_m(w)$ is the first co-occurrence probability of the description word $w$, and $P_n(w)$ is the second co-occurrence probability of the description word $w$.
Preferably, the semantic vector of a description word is obtained as follows: in response to the matching effectiveness of the description word being greater than a preset threshold, the word vector of the description word is taken as its semantic vector; otherwise, the sum of the word vectors of the description word and of a plurality of description words in its context information is taken as its semantic vector.
A description word that plays a positive role in determining whether case pairs are similar (i.e., a description word with matching effectiveness greater than 1) can distinguish similar from dissimilar case pairs by its own word vector, so the word vector is used directly as the semantic vector. A description word that plays no role or a negative role (i.e., a description word with matching effectiveness less than or equal to 1) cannot do so by its own word vector, and its semantic vector must instead be obtained by combining the context information of the case description.
Preferably, the word vector is obtained using a Legal-BERT or Word2Vec model.
Preferably, taking the sum of the word vectors of the description word and of a plurality of description words in the context information as the semantic vector of the description word comprises: setting an initial window that contains the description word and its context information; in each case pair in which both historical cases contain the description word, taking the sum of the word vectors within the initial window as the semantic vector of the description word, and calculating the minimum Euclidean distance between the semantic vectors of the description word in the two historical cases of the case pair as the semantic deviation of the description word in that case pair; adjusting the size of the initial window, and taking the initial window corresponding to the maximum value of an objective function as the target window, wherein the objective function is positively correlated with the semantic deviation of the description word in non-matching case pairs and negatively correlated with the semantic deviation of the description word in matching case pairs; and extracting the semantic vector of the description word in the case description according to the target window.
The initial window contains the context information of the description word. Adjusting the size of the initial window progressively introduces more context information until the semantic information of the description word can distinguish whether case pairs are similar, at which point the target window is obtained. The semantic information of the description word in legal documents is thereby extracted accurately, and the extracted semantic information can effectively distinguish whether case pairs are similar.
Preferably, the objective function $F$ satisfies a relation over the following quantities (the expression itself is not recoverable from the source text): $d_i$ is the semantic deviation of the description word in non-matching case pair $i$, $d_j$ is the semantic deviation of the description word in matching case pair $j$, and $U$ and $M$ are the set of non-matching case pairs and the set of matching case pairs, respectively.
The value of the objective function accurately reflects the ability of the semantic vectors of the description word, extracted with the initial window, to determine whether case pairs are similar.
Preferably, after the similar cases of the case description are obtained, the processing method further comprises: determining a score for each similar case according to user feedback information; and, in response to the score of any similar case being greater than a score threshold, adding the case description and that similar case to the case library as a case pair whose matching label is matching, and otherwise adding them as a case pair whose matching label is non-matching, so as to update the case library.
In this way the case library is continuously updated and the matching effectiveness calculation is progressively refined, which safeguards the accuracy of the retrieved similar cases.
Preferably, the user feedback information includes adoption, collection, and praise.
In a second aspect of the present application, there is also provided a legal document processing system based on big data, comprising a processor and a memory, the memory storing computer program instructions which, when executed by the processor, implement a legal document processing method based on big data according to the first aspect of the present application.
The technical scheme of the application has the following beneficial technical effects:
The case description is segmented to obtain its description words, and the case pairs in the case library are divided into matching case pairs and non-matching case pairs. For any description word, the first co-occurrence probability is counted in the matching case pairs and the second co-occurrence probability in the non-matching case pairs, and the ratio of the first to the second is taken as the matching effectiveness of the description word. The matching effectiveness measures the ability of the description word to determine whether legal documents are similar: it is greater than 1 when the description word plays a positive role in determining whether case pairs are similar, equal to 1 when it plays no role, and less than 1 when it plays a negative role. The product of the normalized matching effectiveness and the TF-IDF value is then used as the weighting coefficient for the weighted summation of the semantic vectors of the description words, yielding the case characteristics. Because each weighting coefficient combines matching effectiveness with the TF-IDF value, the case characteristics are extracted accurately and can be used to judge reliably whether the case description is similar to a historical case. Similar cases are finally obtained according to the similarity between the case characteristics and the case characteristics of the historical cases, which ensures the accuracy of the similar-case retrieval results.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
According to a first aspect of the application, the application provides a legal document processing method based on big data, which is used for realizing retrieval of similar cases according to case descriptions. FIG. 1 is a flow chart of a legal document processing method based on big data in accordance with an embodiment of the present application. As shown in fig. 1, the big data based legal document processing method includes steps S101 to S103, which will be described in detail below.
S101, word segmentation is carried out on the case description, and a plurality of description words are obtained.
In one embodiment, a case description input by a user is obtained and segmented using jieba word segmentation, and stop words are removed after segmentation to obtain a plurality of description words, where the stop words are function words or modal particles such as "oh".
jieba is an existing Chinese word segmentation tool that segments Chinese text into individual words; each word is one description word.
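Step S101 can be illustrated with a minimal sketch. The function name `segment_description`, the `STOP_WORDS` set, and the whitespace-based stand-in segmenter are assumptions made for illustration and are not part of the application; the segmenter is passed in as a callable so that the sketch runs without third-party packages.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical stop-word list for illustration only; a real list would be much longer.
STOP_WORDS: Set[str] = {"the", "a", "of", "oh"}


def segment_description(
    text: str,
    segmenter: Callable[[str], Iterable[str]],
    stop_words: Set[str] = STOP_WORDS,
) -> List[str]:
    """Segment a case description and drop stop words and blank tokens."""
    tokens = (tok.strip() for tok in segmenter(text))
    return [tok for tok in tokens if tok and tok not in stop_words]


# Stand-in segmenter: whitespace splitting on an English example sentence.
words = segment_description("the defendant breached the contract of sale", str.split)
print(words)
```

If the jieba package is installed, `jieba.lcut` could be passed as the `segmenter` argument for Chinese text, together with a Chinese stop-word list.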
S102, calculating the matching effectiveness of each descriptor in a case library.
In one embodiment, the matching effectiveness of a descriptor can measure the ability of the descriptor to determine whether legal documents are similar, and the greater the matching effectiveness, the greater the ability of the descriptor to determine whether legal documents are similar.
The method comprises the steps of calculating matching effectiveness of each descriptor in a case library, wherein the case library comprises a plurality of case pairs with matching labels, the matching labels comprise matching and non-matching, counting first co-occurrence probability of any descriptor in the case pairs with the matching labels being matching, counting second co-occurrence probability of the descriptor in the case pairs with the matching labels being non-matching, and taking the ratio of the first co-occurrence probability to the second co-occurrence probability as the matching effectiveness of the descriptor.
The case library comprises a plurality of case pairs. Each case pair comprises two historical cases, each historical case corresponds to one or more legal documents, and each case pair has one matching label. The matching label is annotated manually and is either matching or non-matching: if the label is matching, the similarity between the legal documents of the two historical cases in the case pair is taken to be 1; if the label is non-matching, the similarity is taken to be 0.
For any one descriptor, there are three cases:
In the first case, the first co-occurrence probability is large and the second co-occurrence probability is small, meaning that the description word is more likely to co-occur in matching case pairs than in non-matching case pairs. The description word can effectively distinguish matching case pairs from non-matching case pairs and plays a positive role in determining whether case pairs are similar, so its matching effectiveness should be a large value.
In the second case, the first co-occurrence probability equals the second co-occurrence probability, meaning that the description word co-occurs equally often in matching and in non-matching case pairs. The description word cannot effectively distinguish matching case pairs from non-matching case pairs and plays no role in determining whether case pairs are similar.
In the third case, the first co-occurrence probability is small and the second co-occurrence probability is large, meaning that the description word is more likely to co-occur in non-matching case pairs than in matching case pairs. The description word can separate the two kinds of case pairs, but it tends to cause non-matching case pairs to be wrongly judged as matching, so it plays a negative role in determining whether case pairs are similar, and its matching effectiveness should be a small value.
Specifically, the matching effectiveness $E_w$ of a description word $w$ satisfies the following formula:
$$E_w = \frac{P_m(w)}{P_n(w)}$$
where $P_m(w)$ is the first co-occurrence probability of the description word $w$, and $P_n(w)$ is the second co-occurrence probability of the description word $w$.
In summary, for any descriptor obtained in step S101, the matching validity of the descriptor can be obtained, when the descriptor plays a positive role in determining whether the case pairs are similar, the matching validity of the descriptor is greater than 1, when the descriptor does not play a role in determining whether the case pairs are similar, the matching validity of the descriptor is equal to 1, when the descriptor plays a negative role in determining whether the case pairs are similar, the matching validity of the descriptor is less than 1, so as to realize accurate quantification of the matching validity of each descriptor.
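The calculation in step S102 can be sketched as follows. The sketch assumes that the co-occurrence probability of a description word is the fraction of case pairs with a given label in which the word appears in both historical cases; the application does not spell out this counting rule. The small constant `eps`, which avoids division by zero, is likewise an assumption, and all names and the toy case library are hypothetical.

```python
from typing import List, Set, Tuple

# A case pair: (words of case A, words of case B, label); label True means "matching".
CasePair = Tuple[Set[str], Set[str], bool]


def co_occurrence_probability(word: str, pairs: List[CasePair], label: bool) -> float:
    """Fraction of pairs with the given label in which `word` occurs in both cases."""
    selected = [(a, b) for a, b, lab in pairs if lab == label]
    if not selected:
        return 0.0
    hits = sum(1 for a, b in selected if word in a and word in b)
    return hits / len(selected)


def matching_effectiveness(word: str, pairs: List[CasePair], eps: float = 1e-6) -> float:
    """Ratio of the first to the second co-occurrence probability (eps is an assumed smoothing term)."""
    p_match = co_occurrence_probability(word, pairs, True)
    p_non_match = co_occurrence_probability(word, pairs, False)
    return (p_match + eps) / (p_non_match + eps)


# Toy case library, for illustration only.
pairs: List[CasePair] = [
    ({"theft", "night", "court"}, {"theft", "store", "court"}, True),
    ({"theft", "car"}, {"theft", "knife"}, True),
    ({"theft", "loan", "court"}, {"contract", "loan", "court"}, False),
    ({"contract", "night"}, {"lease", "night"}, False),
]

for w in ("theft", "court", "night"):
    print(w, matching_effectiveness(w, pairs))
```

In this toy library, "theft" co-occurs only in matching pairs (effectiveness above 1), "court" co-occurs equally often under both labels (effectiveness equal to 1), and "night" co-occurs only in a non-matching pair (effectiveness below 1), mirroring the three cases described above.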
S103, taking the product of the normalized matching effectiveness and the TF-IDF value as a weighting coefficient to carry out weighted summation on semantic vectors of the descriptive words to obtain case characteristics, and obtaining similar cases of case description according to the similarity between the case characteristics and case characteristics of the historical cases.
In one embodiment, the word vector of each descriptor is directly used as the semantic vector of the corresponding descriptor.
It should be noted that, for the descriptors that play a positive role in determining whether the case pairs are similar (i.e., the descriptors with matching validity greater than 1), it is possible to distinguish whether the case pairs are similar according to their own word vectors, while for the descriptors that do not play a role or play a negative role in determining whether the case pairs are similar (i.e., the descriptors with matching validity less than or equal to 1), their own word vectors cannot distinguish whether the case pairs are similar, and at this time, it is necessary to determine the semantic vector of the descriptor in combination with the context information of the case description.
The semantic vector of a description word is obtained as follows: in response to the matching effectiveness of the description word being greater than a preset threshold, the word vector of the description word is taken as its semantic vector; otherwise, the sum of the word vectors of the description word and of a plurality of description words in its context information is taken as its semantic vector.
Wherein, the value of the preset threshold is 1.
In one embodiment, taking the sum of the word vectors of the description word and of a plurality of description words in the context information as the semantic vector of the description word comprises: obtaining the word vectors of the description words; setting an initial window; in each case pair in which both historical cases contain the description word, taking the sum of the word vectors within the initial window as the semantic vector of the description word, and calculating the minimum Euclidean distance between the semantic vectors of the description word in the two historical cases of the case pair as the semantic deviation of the description word in that case pair; adjusting the size of the initial window, and taking the initial window corresponding to the maximum value of an objective function as the target window, wherein the objective function is positively correlated with the semantic deviation of non-matching case pairs and negatively correlated with the semantic deviation of matching case pairs; and extracting the semantic vector of the description word in the case description according to the target window.
The word vectors can be obtained using a Legal-BERT or Word2Vec model, where Legal-BERT is a model pre-trained on legal-domain text. Obtaining the word vector of each description word in a legal document with such a model is well known to those skilled in the art and is not described further here.
Each time the size of the initial window is adjusted, the window is extended by one word in each of the preceding and following context directions. That is, after one adjustment the window has a size of 5 (which implies an initial size of 3), and it contains the description word together with its context.
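A minimal sketch of the semantic-vector rule follows. It assumes a symmetric window described by a `radius` (window size `2 * radius + 1`, so radius 1 gives size 3 and radius 2 gives size 5) and toy two-dimensional word vectors in place of Legal-BERT or Word2Vec output; the function names and the example data are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def window_sum(tokens: List[str], index: int, radius: int, vectors: Dict[str, Vector]) -> Vector:
    """Sum the word vectors of all tokens within `radius` words of position `index`."""
    lo = max(0, index - radius)
    hi = min(len(tokens), index + radius + 1)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok in tokens[lo:hi]:
        if tok in vectors:  # tokens without a word vector are skipped
            total = [t + v for t, v in zip(total, vectors[tok])]
    return total


def semantic_vector(
    tokens: List[str],
    index: int,
    effectiveness: float,
    vectors: Dict[str, Vector],
    radius: int = 1,
    threshold: float = 1.0,
) -> Vector:
    """Use the word vector itself above the threshold, else the window sum."""
    if effectiveness > threshold:
        return list(vectors[tokens[index]])
    return window_sum(tokens, index, radius, vectors)


# Toy data, for illustration only.
tokens = ["night", "theft", "store"]
vectors = {"night": [1.0, 0.0], "theft": [0.0, 1.0], "store": [1.0, 1.0]}

print(semantic_vector(tokens, 1, 2.0, vectors))            # effectiveness > 1: own word vector
print(semantic_vector(tokens, 0, 0.5, vectors, radius=1))  # effectiveness <= 1: window sum
```

Windows that would run past the start or end of the text are truncated at the boundary, which is a design choice of this sketch rather than something stated in the application.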
For example, for a case pair consisting of historical case 1 and historical case 2, the semantic vectors of the description word are obtained from the legal documents of historical case 1 and of historical case 2, respectively. The description word occurs at least once in each of the two historical cases, and the minimum Euclidean distance between a semantic vector from historical case 1 and a semantic vector from historical case 2 is taken as the semantic deviation of the case pair.
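The semantic deviation of a case pair can be sketched as the minimum Euclidean distance over all cross-case pairs of semantic vectors. The function name and the example vectors below are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product
from typing import List, Sequence


def semantic_deviation(
    vectors_case1: List[Sequence[float]],
    vectors_case2: List[Sequence[float]],
) -> float:
    """Minimum Euclidean distance between any occurrence in case 1 and any in case 2."""
    return min(math.dist(u, v) for u, v in product(vectors_case1, vectors_case2))


# Two occurrences of the description word in each historical case (toy vectors).
case1 = [[0.0, 0.0], [3.0, 4.0]]
case2 = [[3.0, 0.0], [6.0, 8.0]]
deviation = semantic_deviation(case1, case2)
print(deviation)
```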
Adjusting the size of the initial window progressively introduces more context information of the description word, so that the semantic information of the description word in the legal documents is captured more accurately. Each adjustment yields a window and the objective function value of that window. The objective function is positively correlated with the semantic deviation of non-matching case pairs and negatively correlated with the semantic deviation of matching case pairs, so it reflects the ability of the semantic vector of the description word to determine whether case pairs are similar. Specifically, the objective function $F$ satisfies a relation over the following quantities (the expression itself is not recoverable from the source text): $d_i$ is the semantic deviation of the description word in non-matching case pair $i$, $d_j$ is the semantic deviation of the description word in matching case pair $j$, and $U$ and $M$ are the set of non-matching case pairs and the set of matching case pairs, respectively.
It can be understood that when the objective function reaches its maximum, the corresponding window maximizes the ability of the semantic vector of the description word to determine whether case pairs are similar. That window is taken as the target window of the description word and is used to extract the semantic vector of the description word in the case description. The target window may be determined using an optimization algorithm, such as a simulated annealing algorithm or a hill climbing algorithm.
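The window selection can be sketched under two loudly stated assumptions. First, the objective is taken to be the mean semantic deviation over non-matching case pairs minus the mean over matching case pairs; this is only one form consistent with the stated correlations and is not the application's own expression. Second, a plain exhaustive search over a few candidate window radii replaces the simulated annealing or hill climbing mentioned above. All names and the toy deviation table are hypothetical.

```python
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def objective(non_matching_devs: List[float], matching_devs: List[float]) -> float:
    """Assumed form: rises with non-matching deviations, falls with matching deviations."""
    return mean(non_matching_devs) - mean(matching_devs)


def best_window(
    radii: Iterable[int],
    deviations_at: Callable[[int], Tuple[List[float], List[float]]],
) -> Tuple[int, float]:
    """Exhaustively score each candidate radius and return (best radius, objective value)."""
    scored = [(objective(*deviations_at(r)), r) for r in radii]
    best_value, best_radius = max(scored)
    return best_radius, best_value


# Toy table: radius -> (deviations in non-matching pairs, deviations in matching pairs).
TOY_DEVIATIONS = {
    1: ([1.0, 1.2], [0.9, 1.1]),
    2: ([2.0, 2.4], [0.5, 0.7]),
    3: ([2.1, 2.1], [1.5, 1.7]),
}

radius, value = best_window(TOY_DEVIATIONS, TOY_DEVIATIONS.__getitem__)
print(radius, value)
```

Exhaustive search is practical here only because the number of candidate window sizes is small; a local search such as hill climbing would evaluate fewer windows at the risk of stopping at a local maximum.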
In this way, semantic vectors of the descriptors in the case description are obtained, products of normalized matching effectiveness and TF-IDF values are used as weighting coefficients, the semantic vectors of the descriptors are weighted and summed according to the weighting coefficients, and case characteristics are obtained, wherein the case characteristics focus on the descriptors playing a positive role in the process of judging whether the case pairs are similar or not, and the descriptors playing a negative role are weakened.
The normalized matching effectiveness of a description word is the ratio of its matching effectiveness to the sum of the matching effectiveness of all description words in the case description.
The TF-IDF algorithm is a common technique in text processing, and the TF-IDF value of each description word is obtained from the legal documents of all historical cases. TF-IDF tends to filter out common description words and retain important ones. Determining the weighting coefficient of each description word from both the matching effectiveness and the TF-IDF value allows the case characteristics to be extracted accurately, while ensuring that the extracted case characteristics can reliably determine whether the case description is similar to a historical case.
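The weighting and summation can be sketched as follows. The sketch assumes one semantic vector per distinct description word and takes precomputed matching effectiveness and TF-IDF values as inputs; the function name and the numbers are hypothetical.

```python
from typing import Dict, List


def case_characteristics(
    words: List[str],
    effectiveness: Dict[str, float],
    tfidf: Dict[str, float],
    semantic_vectors: Dict[str, List[float]],
) -> List[float]:
    """Weighted sum of semantic vectors; weight = normalized effectiveness * TF-IDF."""
    total_effectiveness = sum(effectiveness[w] for w in words)
    dim = len(semantic_vectors[words[0]])
    feature = [0.0] * dim
    for w in words:
        weight = effectiveness[w] / total_effectiveness * tfidf[w]
        for k in range(dim):
            feature[k] += weight * semantic_vectors[w][k]
    return feature


# Toy inputs, for illustration only.
words = ["theft", "night"]
effectiveness = {"theft": 3.0, "night": 1.0}
tfidf = {"theft": 0.5, "night": 0.2}
semantic_vectors = {"theft": [1.0, 0.0], "night": [0.0, 1.0]}

feature = case_characteristics(words, effectiveness, tfidf, semantic_vectors)
print(feature)
```

Here "theft" receives weight 0.75 × 0.5 = 0.375 and "night" receives weight 0.25 × 0.2 = 0.05, so the word with the higher matching effectiveness dominates the resulting vector.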
In one embodiment, the case characteristics of each historical case are obtained by the same method, and the similarity between the case characteristics of the case description and the case characteristics of each historical case is calculated. The historical cases are ranked in descending order of similarity, and a plurality of top-ranked historical cases are taken as the similar cases of the case description. The similarity is calculated based on Euclidean distance: the larger the Euclidean distance between the case characteristics of the case description and those of a historical case, the smaller the similarity between the case description and that historical case.
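The retrieval step can be sketched by sorting historical cases by ascending Euclidean distance, which is equivalent to descending similarity under the rule above. The function name, the case identifiers, and the vectors are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math
from typing import Dict, List, Sequence


def rank_similar_cases(
    query: Sequence[float],
    history: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[str]:
    """Return the ids of the top_k historical cases closest to the query characteristics."""
    ranked = sorted(history.items(), key=lambda item: math.dist(query, item[1]))
    return [case_id for case_id, _ in ranked[:top_k]]


# Toy case characteristics, for illustration only.
history = {
    "case_A": [0.4, 0.1],
    "case_B": [0.0, 1.0],
    "case_C": [0.3, 0.0],
}
similar = rank_similar_cases([0.375, 0.05], history, top_k=2)
print(similar)
```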
In one embodiment, after the similar cases of the case description are obtained, the processing method further includes: determining a score for each similar case according to user feedback information; and, in response to the score of any similar case being greater than a score threshold, adding the case description and that similar case to the case library as a case pair whose matching label is matching, and otherwise adding them as a case pair whose matching label is non-matching, so as to update the case library.
The user feedback information may be user behaviors such as adoption, collection, and praise, and a score may be assigned to each user behavior. For example, adoption scores 4, collection scores 3, and praise scores 3, for a total of 10; when the user performs a behavior, the score corresponding to that behavior is obtained. When the score of a similar case is greater than the score threshold, the case description and the similar case are added as a case pair whose matching label is matching; when the score is not greater than the score threshold, they are added as a case pair whose matching label is non-matching. The case library is thus continuously updated and the matching effectiveness calculation is progressively refined, which safeguards the accuracy of the retrieved similar cases.
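The feedback-driven update can be sketched as follows. The behavior scores follow the example in the text (adoption 4, collection 3, praise 3); the threshold value of 5, the tuple layout of the case library, and all names are assumptions made for illustration, since the application does not fix them.

```python
from typing import Iterable, List, Tuple

# Behavior scores taken from the example in the text.
BEHAVIOR_SCORES = {"adoption": 4, "collection": 3, "praise": 3}
# Hypothetical threshold; the application does not specify a value.
SCORE_THRESHOLD = 5

CaseLibrary = List[Tuple[str, str, str]]  # (case description id, similar case id, label)


def feedback_score(behaviors: Iterable[str]) -> int:
    """Sum the scores of the distinct behaviors the user performed."""
    return sum(BEHAVIOR_SCORES[b] for b in set(behaviors))


def update_case_library(
    library: CaseLibrary,
    description_id: str,
    similar_case_id: str,
    behaviors: Iterable[str],
    threshold: int = SCORE_THRESHOLD,
) -> str:
    """Append a labeled case pair to the library and return the label used."""
    label = "matching" if feedback_score(behaviors) > threshold else "non-matching"
    library.append((description_id, similar_case_id, label))
    return label


library: CaseLibrary = []
first = update_case_library(library, "query_1", "case_A", ["adoption", "praise"])
second = update_case_library(library, "query_1", "case_B", ["praise"])
print(library)
```

After such an update, the matching effectiveness of each description word would be recomputed against the enlarged case library on the next retrieval.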
According to a second aspect of the present application, the present application also provides a legal document processing system based on big data. FIG. 2 is a block diagram of a big data based legal document processing system in accordance with an embodiment of the present application. As shown in fig. 2, the system 50 includes a processor and a memory storing computer program instructions that when executed by the processor implement a big data based legal document processing method according to the first aspect of the present application. The system further comprises other components known to those skilled in the art, such as a communication bus and a communication interface, the arrangement and function of which are known in the art and are therefore not described in detail herein.
It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application.