Disclosure of Invention
To solve the above technical problems, the application provides a method and a device for analyzing the content of electronic case files of a smart court. The adopted technical scheme is as follows:
In a first aspect, an embodiment of the present application provides a method for analyzing the content of electronic case files of a smart court, the method including the following steps:
performing word segmentation on the text content of each electronic case file to obtain phrases, and removing stop words from all phrases to obtain keywords;
constructing an order metric coefficient of the same keyword between sentences of different electronic case files, based on the differences between the sentences that contain that keyword in the different electronic case files; and, based on the importance and part-of-speech characteristics of each keyword in each electronic case file, combined with the order difference metric, constructing a metric coefficient of each keyword in each electronic case file;
determining the weight of each keyword in each electronic case file based on the metric coefficient, and obtaining the fingerprint of each electronic case file in combination with a fingerprint generation algorithm;
clustering based on the maximum metric coefficient and the fingerprint of each electronic case file to obtain clusters;
and acquiring a core keyword set of each cluster based on the keywords of the electronic case files in that cluster.
In one embodiment, the calculation expression of the order metric coefficient is:
where the expression yields the order metric coefficient of keyword i between two sentences, one from each of two electronic case files; its inputs are the order values of keyword i in the two sentences, the lengths of the two sentences, a preset positive number, and the exponential function with the natural constant e as its base;
the first sentence is the z-th sentence containing keyword i in the F-th electronic case file, and the second sentence is the t-th sentence containing keyword i in the S-th electronic case file.
In one embodiment, the calculation expression of the order difference metric is:
where the expression yields the order difference metric of keyword i between the F-th electronic case file and the S-th electronic case file; its inputs are the order metric coefficient of keyword i between a sentence of the F-th electronic case file and a sentence of the S-th electronic case file, the numbers of sentences containing keyword i in the F-th and the S-th electronic case files respectively, and the minimum function.
In one embodiment, the metric coefficient is obtained as follows:
for any keyword in any electronic case file, the mean of the order difference metrics of that keyword between that electronic case file and all other electronic case files is taken as the comprehensive order difference metric of the keyword in that electronic case file;
the TF-IDF value of each keyword in each electronic case file is acquired, and a part-of-speech metric is set for each keyword based on its part of speech;
in any electronic case file, the metric coefficient of each keyword is positively correlated with the keyword's TF-IDF value and part-of-speech metric, and negatively correlated with the keyword's comprehensive order difference metric.
In one embodiment, the part-of-speech metric of a keyword is set to 1 if the keyword is a verb and to 0 for all other parts of speech.
In one embodiment, the calculation expression of the weight of each keyword in each electronic case file is:
where the expression yields the weight of keyword i in the F-th electronic case file; its inputs are the metric coefficient of keyword i in the F-th electronic case file, the number of keywords in the F-th electronic case file, and the maximum function.
In one embodiment, the fingerprint of each electronic case file is obtained as follows:
the sequence of all keywords in each electronic case file and the sequence of the weights of those keywords are taken as the inputs of a fingerprint generation algorithm, whose output is the fingerprint of the electronic case file.
In one embodiment, the clusters are obtained as follows:
the maximum of the metric coefficients of all keywords in each electronic case file is converted into a binary number; the fingerprint of each electronic case file and this binary number form a vector; and the vectors of all electronic case files are taken as the input of a clustering algorithm, whose output is the clusters.
In one embodiment, the core keyword set is obtained as follows:
in each electronic case file, the set of all keywords whose metric coefficient is greater than a preset segmentation threshold is recorded as the key keyword set;
in each cluster, the union of the intersection of the key keyword sets of all electronic case files and the set of all verbs is taken as the core keyword set of the cluster.
In a second aspect, an embodiment of the present application further provides a device for analyzing the content of electronic case files of a smart court, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor executes the computer program to implement the steps of any one of the methods described above.
The embodiment of the application has at least the following beneficial effects:
The order difference metric of the same keyword between different electronic case files is obtained from the differences between the sentences that contain that keyword in the different electronic case files. This metric is combined with the TF-IDF value and the part of speech of each keyword to construct the metric coefficient of each keyword in each electronic case file. Because the metric coefficient considers both the word frequency and the logical order of a keyword, it helps to judge the importance of the keyword more accurately. A weight function is constructed from the metric coefficients to calculate the weight of each keyword, which helps to improve the accuracy of the fingerprints computed with the SimHash algorithm and of the subsequent similarity judgment between electronic case files. Finally, clustering is performed on the fingerprints and the metric coefficients of the keywords, and the core keyword set of each cluster is obtained. This improves the accuracy of grouping cases of the same type, of keyword extraction and of case similarity measurement, and therefore of case content analysis and case type division.
Detailed Description
To further describe the technical means adopted by the application to achieve its intended purpose, and their effects, the specific implementation, structure, features and effects of the method and device for analyzing the content of electronic case files of a smart court are described in detail below with reference to the accompanying drawings and the preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
A specific scheme of the method and device for analyzing the content of electronic case files of a smart court according to the present application is described below with reference to the accompanying drawings.
Referring to FIG. 1, which shows a flowchart of a method for analyzing the content of electronic case files of a smart court according to an embodiment of the application, the method comprises the following steps:
Step S1, word segmentation is performed on the text content of each electronic case file to obtain phrases, and stop words are removed from all phrases to obtain keywords.
Besides the body text, an electronic case file also contains a title and a closing signature, which are redundant information. The title and closing signature of each electronic case file are filtered out by rule matching to obtain the body text data of each electronic case file. The LTP toolkit can perform sentence segmentation, word segmentation, part-of-speech tagging, stop-word removal and similar processing on the content of a document. Each electronic case file is divided into sentences with the LTP toolkit to obtain a sentence set; each sentence is segmented into phrases; each phrase of each sentence is tagged with its part of speech; and the stop words among the tagged phrases, such as conjunctions, prepositions and articles, are removed using the Harbin Institute of Technology (HIT) stop-word list. The remaining phrases are called keywords. Rule matching and the LTP toolkit are known techniques, and their specific processes are not described here.
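The stop-word removal at the end of step S1 can be illustrated with a minimal sketch. It assumes that sentence segmentation, word segmentation and part-of-speech tagging have already been performed by some toolkit, so the input is a list of sentences made of (word, part-of-speech) pairs. The stop-word set, the tag names and the function name `extract_keywords` are hypothetical and are not taken from the LTP toolkit or the HIT stop-word list.

```python
# Hypothetical stop-word set; a real pipeline would load a full stop-word list.
STOP_WORDS = {"the", "of", "and", "to", "in"}


def extract_keywords(tagged_sentences):
    """Remove stop words from each tagged sentence, keeping (word, pos) pairs.

    tagged_sentences: list of sentences, each a list of (word, pos) tuples
    produced by an upstream segmenter and tagger (assumed, not shown here).
    """
    return [
        [(word, pos) for word, pos in sentence if word.lower() not in STOP_WORDS]
        for sentence in tagged_sentences
    ]


# Illustrative input with made-up tags: "n" noun, "v" verb.
sentences = [
    [("the", "det"), ("defendant", "n"), ("struck", "v"),
     ("the", "det"), ("plaintiff", "n")],
    [("plaintiff", "n"), ("and", "conj"), ("defendant", "n"), ("signed", "v")],
]
keywords = extract_keywords(sentences)
```

The part-of-speech tag is kept next to each keyword because the later steps need to know which keywords are verbs.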
Step S2, a metric coefficient of each keyword in each electronic case file is constructed according to the differences between sentences containing the same keyword in different electronic case files, the TF-IDF value of each keyword in each electronic case file, and the part of speech of each keyword.
An electronic case file is a strongly logical text. Because of its special usage scenario, overly personalized expressions are not allowed, and for cases of the same type the descriptions of events and related content are highly similar. Since each person expresses things differently, however, this similarity is mainly reflected in the relevant keywords. The similarity between different electronic case files can therefore be calculated from their keyword characteristics, and representative keywords of cases of the same type can be extracted.
Cases of the same type often contain many similar sentences. Owing to the nature of case file text, certain sentences that identify the case type share the same keywords, and the order in which those keywords appear is fixed. The events described in an electronic case file are actions performed by one individual on another; an action is one of the important pieces of information in the electronic case file and is mentioned many times. Therefore, when a keyword appears frequently in one electronic case file and is a verb, the information it carries is more important, and the keyword is more important to that electronic case file. Across all electronic case files, however, the more frequently a keyword appears in all of them, the less it contributes to distinguishing different types of cases.
Based on the above analysis: (1) according to the differences between sentences containing the same keyword in different electronic case files, an order metric coefficient is constructed between each sentence containing a given keyword in a given electronic case file and each sentence containing the same keyword in each of the other electronic case files. The order metric coefficient measures the logical order in which each keyword appears:
In an electronic case file, a keyword may be contained in multiple sentences. Consider the z-th sentence containing keyword i in the F-th electronic case file and the t-th sentence containing keyword i in the S-th electronic case file. The calculation expression of the order metric coefficient of keyword i between these two sentences is as follows:
where the expression yields the order metric coefficient of keyword i between the two sentences; its inputs are the order values of keyword i in the two sentences, the lengths of the two sentences, the exponential function with the natural constant e as its base, and a preset positive number whose function is to prevent the denominator from being 0. Preferably, in one embodiment of the application, the preset positive number is set to 1. In other embodiments, the practitioner can set its value according to the actual situation.
When a keyword belongs to the expression on which the case type of an electronic case file is judged, that expression should exist in any two electronic case files of the same type. In other words, the two electronic case files contain at least one pair of sentences in which the keyword occupies the same or a similar position and whose phrase-sequence lengths are the same or similar, so that the numerator is close to 1 and the denominator is close to 1. Therefore, when two electronic case files correspond to cases of the same type and a keyword belongs to the judgment basis of that type, the order metric coefficient of the keyword is approximately 1. Conversely, when a keyword does not belong to the judgment basis of a case type, its order metric coefficient is far less than 1 or far greater than 1; the further it is from 1, the smaller the keyword's contribution to distinguishing the case type, and the less important the keyword.
(2) The order difference metric of the same keyword between different electronic case files is obtained from the order metric coefficient:
where the expression yields the order difference metric of keyword i between the F-th electronic case file and the S-th electronic case file; its inputs are the order metric coefficient of keyword i between a sentence of the F-th electronic case file and a sentence of the S-th electronic case file, the numbers of sentences containing keyword i in the F-th and the S-th electronic case files respectively, and the minimum function.
The closer the order metric coefficient is to 1, the more important the corresponding keyword is, and the closer the order difference metric is to 0.
(3) The mean of the order difference metrics of keyword i between the F-th electronic case file and all other electronic case files is taken as the comprehensive order difference metric of keyword i in the F-th electronic case file.
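The averaging in item (3) can be sketched as follows. The sketch assumes the pairwise order difference metric is already available through a caller-supplied function; the names `pairwise`, `comprehensive_order_difference` and the lookup table are hypothetical, and the table values are illustrative only.

```python
from statistics import mean


def comprehensive_order_difference(pairwise, keyword, f, n_files):
    """Mean order difference metric of `keyword` between file f and all others.

    pairwise(keyword, f, s) is assumed to return the order difference metric
    of `keyword` between the f-th and s-th electronic case files.
    """
    others = [s for s in range(n_files) if s != f]
    return mean(pairwise(keyword, f, s) for s in others)


# Illustrative pairwise values for one keyword across three case files.
table = {(0, 1): 0.25, (0, 2): 0.75}


def lookup(keyword, f, s):
    return table[(f, s)]


result = comprehensive_order_difference(lookup, "strike", 0, 3)
```

How a keyword that is absent from one of the other electronic case files should be handled is not stated in the description, so the sketch leaves that decision to the supplied pairwise function.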
(4) The TF-IDF value (term frequency-inverse document frequency) of each keyword in each electronic case file is calculated. TF-IDF is a known technique, and its specific process is not described here. A part-of-speech metric is preset for each keyword according to its part of speech. Because the information carried by verbs in an electronic case file is more important, the part-of-speech metric of a verb keyword is set to 1 and that of a keyword of any other part of speech is set to 0; the higher the part-of-speech metric of a keyword, the more important the keyword. It should be noted that the practitioner may set the part-of-speech metric of the keywords, and the embodiment of the present application does not specifically limit it.
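A minimal sketch of item (4) is given below. The description does not state which TF-IDF variant is used, so the sketch assumes the common form of relative term frequency multiplied by the natural logarithm of the number of documents divided by the document frequency. The function names and the verb tag "v" are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {keyword: tf-idf} dict per document.

    docs: list of keyword lists, one list per electronic case file.
    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each keyword.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


def pos_metric(pos):
    """Part-of-speech metric: 1 for verbs, 0 otherwise (tag "v" is assumed)."""
    return 1 if pos == "v" else 0


docs = [
    ["strike", "injury", "strike"],
    ["contract", "breach"],
    ["contract", "injury"],
]
scores = tf_idf(docs)
```

Under this variant, a keyword that occurs in every electronic case file receives a value of 0, which matches the observation that such keywords contribute little to distinguishing case types.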
(5) To measure how much each keyword contributes to distinguishing the case type or describing the case event, a metric coefficient of each keyword in each electronic case file is constructed from the keyword's TF-IDF value, comprehensive order difference metric and part-of-speech metric:
in any electronic case file, the metric coefficient of each keyword is positively correlated with the keyword's TF-IDF value and part-of-speech metric, and negatively correlated with the keyword's comprehensive order difference metric.
It can be understood that positive correlation and negative correlation in the present application refer to the relationship between an independent variable and a dependent variable. Positive correlation means that the dependent variable increases (decreases) as the independent variable increases (decreases); negative correlation means that the dependent variable decreases (increases) as the independent variable increases (decreases). The specific form of the positive and negative correlations can be determined according to the actual situation, and the present application does not specifically limit it.
Preferably, in one embodiment of the present application, the calculation expression of the metric coefficient of each keyword in any electronic case file may be: where the expression yields the metric coefficient of keyword i in the F-th electronic case file; its inputs are the comprehensive order difference metric of keyword i in the F-th electronic case file, the TF-IDF value of keyword i in the F-th electronic case file, the part-of-speech metric of keyword i in the F-th electronic case file, and a preset positive number that prevents the denominator from being 0. Preferably, in one embodiment of the application, the preset positive number is set to 1. In other embodiments of the present application, the practitioner can set its value according to the actual situation.
It will be appreciated that when a keyword contributes more to distinguishing a case type or describing a case event, its logical order is relatively fixed, that is, the position at which it appears in a sentence and the sentence length are relatively fixed or similar. In addition, the keyword occurs more frequently, and if it is a verb it may describe the behavior in the event and is therefore relatively more important. Consequently, the greater the contribution of a keyword to distinguishing a case type or describing a case event, the greater its metric coefficient; and the greater the metric coefficient, the more important the keyword. Conversely, the smaller the contribution of a keyword, the smaller its metric coefficient.
Step S3, the weight of each keyword in each electronic case file is calculated according to the metric coefficient of each keyword in each electronic case file, and the fingerprint of each electronic case file is generated in combination with the SimHash algorithm.
(1) In each electronic case file, the greater the contribution of a keyword, that is, the greater its metric coefficient, the more effective information the keyword provides for distinguishing case types or describing case events. The keyword is therefore more important and is given a greater weight, so that it has a greater influence on the subsequent generation of the fingerprint of the electronic case file and on the similarity judgment, which improves the accuracy of distinguishing case types from electronic case files. The calculation expression of the weight of each keyword in any electronic case file is:
where the expression yields the weight of keyword i in the F-th electronic case file; its inputs are the metric coefficient of keyword i in the F-th electronic case file, the number of keywords in the F-th electronic case file, and the maximum function.
The greater the contribution of a keyword to distinguishing the case type or describing the case event, the greater its metric coefficient and hence its weight. The denominator limits the value range of the weight to between 0 and 1.
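One reading consistent with the description above (a maximum function, and a denominator that bounds the weight between 0 and 1) is to divide each metric coefficient by the largest metric coefficient in the same electronic case file. The sketch below implements that reading only; it is an assumption rather than the stated expression, and it further assumes that all metric coefficients are positive. The function name `keyword_weights` is hypothetical.

```python
def keyword_weights(metric):
    """Scale metric coefficients of one case file into (0, 1].

    metric: {keyword: metric coefficient}, all values assumed positive.
    Assumed form: weight = coefficient / max coefficient in the same file.
    """
    peak = max(metric.values())
    return {keyword: value / peak for keyword, value in metric.items()}


# Illustrative metric coefficients for one electronic case file.
weights = keyword_weights({"strike": 2.0, "injury": 1.0, "court": 0.5})
```

With this reading, the keyword with the largest metric coefficient always receives a weight of 1, and the ordering of keywords by importance is preserved.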
(2) In each electronic case file, all keywords of all sentences are arranged in their order of appearance to obtain the phrase sequence of the electronic case file, and the weights of the keywords are arranged in the same order to form the keyword weight sequence of the electronic case file. The phrase sequence and the keyword weight sequence of each electronic case file are then taken as the inputs of the SimHash algorithm, whose output is the fingerprint of the electronic case file. The SimHash algorithm is a known technique, and its specific process is not described here.
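For reference, a weighted SimHash over a phrase sequence and its keyword weight sequence can be sketched as follows. The description does not specify the underlying hash function or the fingerprint length, so the use of MD5 truncated to 64 bits is an assumption, and the function name `simhash` is hypothetical.

```python
import hashlib


def simhash(keywords, weights, bits=64):
    """Weighted SimHash fingerprint of a keyword sequence, as an integer.

    Each keyword is hashed to `bits` bits (MD5 truncated, an assumption).
    For every bit position the keyword's weight is added when the bit is 1
    and subtracted when it is 0; the sign of the total gives the output bit.
    """
    totals = [0.0] * bits
    for word, weight in zip(keywords, weights):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            if (value >> b) & 1:
                totals[b] += weight
            else:
                totals[b] -= weight
    fingerprint = 0
    for b in range(bits):
        if totals[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


fp = simhash(["strike", "injury", "court"], [1.0, 0.5, 0.25])
```

Because heavier keywords dominate the per-bit totals, two electronic case files that share their most important keywords tend to produce fingerprints that differ in only a few bits.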
Step S4, clustering is performed based on the fingerprint of each electronic case file and the maximum metric coefficient in each electronic case file to obtain clusters.
The maximum of the metric coefficients of all keywords in each electronic case file is obtained and recorded as the maximum metric coefficient, which is then converted into a binary number. The vector formed by the fingerprint of an electronic case file and the binary number of its maximum metric coefficient is taken as the feature vector of that electronic case file.
The feature vectors of all electronic case files are taken as the input of the K-means clustering algorithm, and the number of cluster centers is set. Preferably, in the embodiment of the application, the number of cluster centers is set to 5; in other embodiments of the application, the practitioner can set the number of cluster centers according to the actual situation. Because the fingerprints and the metric coefficients are in the form of character strings, the Hamming distance is adopted as the distance measure. The output of the algorithm is the clusters. The K-means algorithm is a known technique, and its specific process is not described here. It can be understood that the application provides only one method for clustering the feature vectors of all electronic case files; many clustering methods exist, and the practitioner may adopt another one, which the application does not specifically limit.
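The feature vector and the distance measure used in step S4 can be sketched as follows. The description does not say how the maximum metric coefficient is converted into a binary number, so the sketch assumes its 64-bit IEEE 754 representation; the 64-bit fingerprint width and all function names are likewise hypothetical. The clustering algorithm itself is not shown.

```python
import struct


def float_to_bits(x):
    """64-bit IEEE 754 bit string of a float (an assumed conversion)."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")


def feature_bits(fingerprint, max_metric, fp_bits=64):
    """Concatenate the fingerprint bits and the bits of the max coefficient."""
    return format(fingerprint, f"0{fp_bits}b") + float_to_bits(max_metric)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two illustrative case files with fingerprints 5 and 7 and equal maxima.
v1 = feature_bits(5, 1.0)
v2 = feature_bits(7, 1.0)
distance = hamming(v1, v2)
```

A clustering algorithm that accepts a custom distance can then assign each feature vector to the center with the smallest Hamming distance.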
The maximum metric coefficient is used for the following reason. If electronic case files belong to the same type of case, their descriptions of events are similar and their descriptions of the judgment basis are essentially consistent, so the metric coefficients of their keywords are essentially equal. However, because the number of keywords is not necessarily the same across electronic case files, the metric coefficients cannot be compared one to one, so the maximum metric coefficient is used as a representative for clustering.
Step S5, a core keyword set of each cluster is acquired based on the keywords of the electronic case files in that cluster.
In each electronic case file, the quartile of the metric coefficients of all keywords is used as the preset segmentation threshold of the electronic case file, and the set of keywords whose metric coefficient is greater than the preset segmentation threshold is recorded as the key keyword set of the electronic case file.
The intersection of the key keyword sets of all electronic case files in a cluster is acquired and taken as the key keyword set of the cluster.
The set formed by the verbs of all electronic case files in the cluster, with no repeated elements, is taken as the verb set of the cluster.
The union of the key keyword set and the verb set of the cluster is taken as the core keyword set of the cluster. Each cluster represents the electronic case files of one type of case, so the core keyword set of a cluster is the set of keywords that contribute most to describing that type of case; that is, the keywords in the core keyword set contribute most to distinguishing the case type and describing the case event.
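The set operations of step S5 can be sketched as follows. The description refers to "the quartile" without saying which one, so the sketch assumes the upper quartile by default and exposes the choice as a parameter; the quartile estimation method of Python's `statistics.quantiles` is also an assumption. The function names are hypothetical.

```python
from statistics import quantiles


def key_keywords(metric, which=3):
    """Keywords whose metric coefficient exceeds a quartile threshold.

    metric: {keyword: metric coefficient} for one case file (at least two
    keywords). which: 1, 2 or 3 selects the quartile (3 is assumed).
    """
    threshold = quantiles(list(metric.values()), n=4)[which - 1]
    return {keyword for keyword, value in metric.items() if value > threshold}


def core_keywords(cluster_metrics, cluster_verbs):
    """Union of the shared key keywords and all verbs of one cluster.

    cluster_metrics: list of {keyword: coefficient}, one per case file.
    cluster_verbs: list of verb sets, one per case file.
    """
    shared = set.intersection(*(key_keywords(m) for m in cluster_metrics))
    verbs = set().union(*cluster_verbs)
    return shared | verbs


# Illustrative cluster with two case files.
metrics = [
    {"strike": 9.0, "injury": 8.0, "court": 1.0, "date": 0.5, "city": 0.2},
    {"strike": 7.0, "fee": 6.0, "court": 1.0, "date": 0.5, "city": 0.2},
]
verbs = [{"strike", "pay"}, {"strike"}]
core = core_keywords(metrics, verbs)
```

Taking the intersection first keeps only keywords that are important in every electronic case file of the cluster, while the union with the verb set retains all described actions.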
In this way, when the SimHash algorithm is used to calculate text similarity, both the word frequency of each keyword in the text and the logical order of each keyword within sentences are considered. The similarity of electronic case files is therefore calculated and judged more accurately, and electronic case files of the same type of case can be grouped into the same cluster more accurately. In addition, calculating the core keyword set of each cluster selects the keywords that contribute most to the case type and event description, which facilitates the extraction of the key keywords of each type of case, improves the description of each type of case, and achieves accurate analysis of the content of electronic case files and accurate division of case types.
A schematic diagram of the process of obtaining the metric coefficient of each keyword in each electronic case file is shown in FIG. 2.
Based on the same inventive concept as the above method, the embodiment of the application further provides a device for analyzing the content of electronic case files of a smart court, which comprises a memory, a processor and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the steps of any one of the above methods for analyzing the content of electronic case files of a smart court.
In summary, the embodiment of the application provides a method for analyzing the content of electronic case files of a smart court. The order difference metric of the same keyword between different electronic case files is obtained from the differences between the sentences that contain that keyword in the different electronic case files, and is combined with the TF-IDF value and the part of speech of each keyword to construct the metric coefficient of each keyword in each electronic case file. Because the metric coefficient considers both the word frequency and the logical order of a keyword, it helps to judge the importance of the keyword more accurately. A weight function is constructed from the metric coefficients to calculate the weight of each keyword, which helps to improve the accuracy of the fingerprints computed with the SimHash algorithm and of the subsequent similarity judgment between electronic case files. Finally, clustering is performed on the fingerprints and the metric coefficients of the keywords, which improves the accuracy of grouping cases of the same type and of keyword extraction, and thereby the accuracy of electronic case file content analysis.
It should be noted that the order of the embodiments of the present application is for description only and does not indicate that any embodiment is better or worse than another. The foregoing describes specific embodiments of this application. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present application are described in a progressive manner. For parts that are the same or similar between embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments.
The foregoing describes only preferred embodiments of the present application and is not intended to limit it. Any modification, equivalent replacement or improvement made within the principles of the present application shall fall within its scope of protection.