CN118964527B - Electronic case content analysis method and device for intelligent court - Google Patents

Electronic case content analysis method and device for intelligent court

Publication number
CN118964527B
Authority
CN
China
Prior art keywords
keyword
electronic
electronic case
case
measurement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411433147.XA
Other languages
Chinese (zh)
Other versions
CN118964527A (en)
Inventor
陈宇斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Newchuan Technology Co ltd
Original Assignee
Newchuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Newchuan Technology Co ltd
Priority to CN202411433147.XA
Publication of CN118964527A
Application granted
Publication of CN118964527B


Abstract

The present application relates to the field of natural language processing, and specifically to a method and device for analyzing the content of electronic case files in a smart court. The method obtains an order difference metric for the same keyword across different electronic case files from the differences between the sentences that contain that keyword, and combines it with the TF-IDF value and part of speech of each keyword to construct a metric coefficient for each keyword in each electronic case file. A weight function is constructed from the metric coefficients to calculate the weight of each keyword. The electronic case files are then clustered according to their fingerprints and keyword metric coefficients, and a core keyword set is obtained for each cluster. This improves the accuracy of grouping cases of the same type, of keyword extraction, and of measuring the similarity of electronic case files, and thereby helps improve the accuracy of content analysis and case-type classification.

Description

Electronic case content analysis method and device for intelligent court
Technical Field
The application relates to the technical field of natural language processing, in particular to a method and a device for analyzing contents of an electronic case in an intelligent court.
Background
As digital transformation accelerates, court systems are becoming more intelligent and digital in order to improve the efficiency and transparency of judicial work. Electronic case files are an important part of smart court construction. They are numerous and complex in content, so an efficient method for analyzing their content matters both for judicial efficiency and for optimizing their storage. Electronic case files of the same case type can be stored together so that they can be retrieved quickly and efficiently, which requires that files of the same case type be identified accurately.
In the prior art, content analysis of electronic case files distinguishes the files by their text similarity. The SimHash algorithm is a commonly used text similarity method that generates a fingerprint for each electronic case file. The fingerprints are locality-sensitive and efficient to retrieve, so the algorithm is widely used for similarity detection over large numbers of texts. However, when SimHash is applied to electronic case files, it considers keywords only along the word-frequency dimension and ignores the influence of their other characteristics. An electronic case file is a strongly logical text, and relying on word frequency alone ignores the logical order of its keywords, so the grouping of similar electronic case files and the extraction of core keywords end up inaccurate.
Disclosure of Invention
To solve the above technical problem, the application provides a method and a device for analyzing the content of electronic case files in a smart court, adopting the following technical solution:
In a first aspect, an embodiment of the present application provides a method for analyzing the content of electronic case files in a smart court, the method including the following steps:
segmenting the text content of each electronic case file into words to obtain phrases, and removing stop words from all phrases to obtain keywords;
constructing order metric coefficients of the same keyword between sentences of different electronic case files, based on the differences between sentences containing the same keyword in different electronic case files; obtaining an order difference metric of the same keyword between different electronic case files from the differences in those order metric coefficients; and constructing a metric coefficient for each keyword in each electronic case file based on the importance and part-of-speech features of the keyword, combined with the order difference metric;
determining the weight of each keyword in each electronic case file based on the metric coefficient, and obtaining the fingerprint of each electronic case file with a fingerprint generation algorithm;
clustering based on the maximum metric coefficient and the fingerprint of each electronic case file to obtain clusters;
obtaining a core keyword set for each cluster based on the keywords of each electronic case file in the cluster.
In one embodiment, the calculation expression of the order metric coefficient is:
[formula not reproduced in this text]
where $C_i(J_{i,z}^{F}, J_{i,t}^{S})$ is the order metric coefficient of keyword $i$ between sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$; $o_{i,z}^{F}$ and $o_{i,t}^{S}$ are the order values of keyword $i$ in sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$, respectively; $l_{i,z}^{F}$ and $l_{i,t}^{S}$ are the sentence lengths of sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$, respectively; $\varepsilon$ is a preset positive number; and $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base;
sentence $J_{i,z}^{F}$ is the $z$-th sentence containing keyword $i$ in the $F$-th electronic case file, and sentence $J_{i,t}^{S}$ is the $t$-th sentence containing keyword $i$ in the $S$-th electronic case file.
In one embodiment, the calculation expression of the order difference metric is:
[formula not reproduced in this text]
where $D_i^{F,S}$ is the order difference metric of keyword $i$ between the $F$-th and the $S$-th electronic case files; $C_i(J_{i,z}^{F}, J_{i,t}^{S})$ is the order metric coefficient of keyword $i$ between sentence $J_{i,z}^{F}$ (the $z$-th sentence containing keyword $i$ in the $F$-th file) and sentence $J_{i,t}^{S}$ (the $t$-th sentence containing keyword $i$ in the $S$-th file); $n_i^{F}$ and $n_i^{S}$ are the numbers of sentences containing keyword $i$ in the $F$-th and the $S$-th electronic case files, respectively; and $\min(\cdot)$ is the minimum function.
In one embodiment, the metric coefficient is obtained as follows:
for any keyword in any electronic case file, the mean of the order difference metrics of that keyword between that file and every other electronic case file is taken as the comprehensive order difference metric of the keyword in that file;
the TF-IDF value of each keyword in each electronic case file is obtained, and a part-of-speech metric is set for each keyword according to its part of speech;
in any electronic case file, the metric coefficient of each keyword is positively correlated with the keyword's TF-IDF value and part-of-speech metric, and negatively correlated with its comprehensive order difference metric.
In one embodiment, the part-of-speech metric is set to 1 for keywords that are verbs and to 0 for keywords of any other part of speech.
In one embodiment, the calculation expression of the weight of each keyword in each electronic case file is:
[formula not reproduced in this text]
where $w_i^{F}$ is the weight of keyword $i$ in the $F$-th electronic case file, $M_i^{F}$ is the metric coefficient of keyword $i$ in the $F$-th electronic case file, $N^{F}$ is the number of keywords in the $F$-th electronic case file, and $\max(\cdot)$ is the maximum function.
In one embodiment, the fingerprint of each electronic case file is obtained as follows:
the sequence of all keywords in each electronic case file and the sequence of the weights of those keywords are taken as the inputs of a fingerprint generation algorithm, whose output is the fingerprint of that file.
In one embodiment, the clusters are obtained as follows:
the maximum of the metric coefficients of all keywords in each electronic case file is converted to a binary number; the fingerprint of each electronic case file and this binary number form a vector; and the vectors of all electronic case files are taken as the input of a clustering algorithm, whose output is the clusters.
In one embodiment, the core keyword set is obtained as follows:
in each electronic case file, the set of all keywords whose metric coefficient is greater than a preset segmentation threshold is recorded as the key keyword set;
in each cluster, the intersection of the key keyword sets of all electronic case files is merged with all verbs, and the resulting set is taken as the core keyword set of the cluster.
In a second aspect, an embodiment of the present application further provides a device for analyzing the content of electronic case files in a smart court, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor, when executing the computer program, implements the steps of any one of the methods described above.
The embodiment of the application has at least the following beneficial effects:
An order difference metric of the same keyword between different electronic case files is obtained from the differences between sentences containing that keyword, and is combined with the TF-IDF value and part of speech of each keyword to construct a metric coefficient for each keyword in each electronic case file. Because this coefficient considers both the word frequency and the logical order of the keywords, it improves the accuracy of judging keyword importance. A weight function is constructed from the metric coefficients to calculate the weight of each keyword, which improves the accuracy of the fingerprints computed with the SimHash algorithm and of the subsequent similarity judgment between electronic case files. The files are clustered according to their fingerprints and keyword metric coefficients, and a core keyword set is obtained for each cluster. This improves the accuracy of grouping cases of the same type, of keyword extraction, and of measuring the similarity of electronic case files, and thereby helps improve the accuracy of content analysis and case-type classification.
Drawings
To illustrate the embodiments of the application, or the technical solutions and advantages of the prior art, more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the application; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for analyzing the content of electronic case files in a smart court according to an embodiment of the application;
FIG. 2 is a schematic diagram of the process for obtaining the metric coefficient of each keyword in each electronic case file.
Detailed Description
To further describe the technical means adopted by the application to achieve its intended purpose, and their effects, the specific implementation, structure, features, and effects of the method and device for analyzing the content of electronic case files in a smart court are described in detail below with reference to the drawings and preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The specific scheme of the method and device for analyzing the content of electronic case files in a smart court is described below with reference to the drawings.
Referring to FIG. 1, which shows a flowchart of a method for analyzing the content of electronic case files in a smart court according to an embodiment of the application, the method comprises the following steps:
Step S1: segment the text content of each electronic case file into words to obtain phrases, and remove stop words from all phrases to obtain keywords.
Besides the body text, an electronic case file also contains a title and a closing signature. These are redundant information and are filtered out of each file by rule matching, leaving the body text data of each electronic case file. The LTP toolkit can perform sentence splitting, word segmentation, part-of-speech tagging, and stop-word removal on document content. Each electronic case file is split into sentences with the LTP toolkit to obtain a sentence set; each sentence is segmented into phrases; and each phrase in each sentence is tagged with its part of speech. Stop words such as conjunctions, prepositions, and articles are then removed from the tagged phrases using the Harbin Institute of Technology (HIT) stop-word list, and the remaining phrases are called keywords. Rule matching and the LTP toolkit are known techniques, and their details are not repeated here.
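The preprocessing in step S1 can be sketched as follows. This is a minimal illustration only: it does not call the LTP toolkit, the regular-expression tokenizer stands in for a real Chinese word segmenter, and the `STOP_WORDS` set is a hypothetical placeholder for the HIT stop-word list.

```python
import re

# Hypothetical stop-word list; the source uses the HIT stop-word list instead.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on"}


def split_sentences(text: str) -> list[str]:
    """Split text into sentences on ASCII and CJK end punctuation."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def extract_keywords(sentence: str) -> list[str]:
    """Lower-case the sentence, tokenize on word characters, drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text: str) -> list[list[str]]:
    """Return one keyword list per sentence, preserving word order."""
    return [extract_keywords(s) for s in split_sentences(text)]


doc = "The defendant signed the contract. The plaintiff paid the deposit!"
sentences = preprocess(doc)
```

Keeping one keyword list per sentence preserves the position of each keyword and the length of each sentence, which the later steps rely on.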
Step S2: construct a metric coefficient for each keyword in each electronic case file from the differences between sentences containing the same keyword in different electronic case files, the TF-IDF value of each keyword in each file, and its part of speech.
An electronic case file is a strongly logical text. Because of its special use, overly personal expression is not allowed, and for cases of the same type the descriptions of events and related content are highly similar. Since every author phrases things differently, however, this similarity shows mainly in the relevant keywords. The similarity between electronic case files can therefore be calculated from their keyword features, and representative keywords can be extracted for each case type.
Electronic case files of the same type often contain many similar sentences. Because of the nature of the text, the sentences that identify the case type share the same keywords, and those keywords appear in a fixed order. The events described in an electronic case file are actions performed by one individual on another; the action is among the most important information in the file and is mentioned many times. So when a keyword appears frequently within one electronic case file and is a verb, the information it carries is more important, and it matters more to that file. Across all electronic case files, however, the more often a keyword appears in every file, the less it contributes to distinguishing case types.
Based on this analysis: (1) From the differences between sentences containing the same keyword in different electronic case files, construct an order metric coefficient between each sentence containing a given keyword in a given electronic case file and each sentence containing the same keyword in every other electronic case file. The order metric coefficient measures the logical position of each keyword within its sentence:
Within one electronic case file, a keyword may occur in several sentences. Denote the $z$-th sentence containing keyword $i$ in the $F$-th electronic case file as $J_{i,z}^{F}$, and the $t$-th sentence containing keyword $i$ in the $S$-th electronic case file as $J_{i,t}^{S}$. The calculation expression of the order metric coefficient of keyword $i$ between sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$ is:
[formula not reproduced in this text]
where $C_i(J_{i,z}^{F}, J_{i,t}^{S})$ is the order metric coefficient of keyword $i$ between sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$; $o_{i,z}^{F}$ and $o_{i,t}^{S}$ are the order values of keyword $i$ in sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$, respectively; $l_{i,z}^{F}$ and $l_{i,t}^{S}$ are the sentence lengths of sentence $J_{i,z}^{F}$ and sentence $J_{i,t}^{S}$, respectively; $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base; and $\varepsilon$ is a preset positive number that prevents the denominator from being 0. Preferably, in one embodiment of the application, $\varepsilon$ is set to 1. In other embodiments, the implementer can set the value of $\varepsilon$ according to the actual situation.
When a keyword belongs to the wording on which the case type of an electronic case file is judged, that wording should appear in any two electronic case files of the same type. That is, the two files each contain at least one sentence in which the keyword has the same or a similar position and whose phrase-sequence length is the same or similar, so the numerator is close to 1 and the denominator is close to 1. Hence, when two electronic case files belong to the same case type and a keyword is part of the judgment basis for that type, the order metric coefficient of the keyword is approximately 1. Conversely, when a keyword is not part of the judgment basis for a case type, its order metric coefficient is far less than 1 or far greater than 1; the further it is from 1, the less the keyword contributes to distinguishing the case type and the less important it is.
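The formula for the order metric coefficient is not reproduced in this text, so the sketch below uses an assumed form chosen only to match the stated properties: the numerator is an exponential that equals 1 when the keyword positions agree, and the denominator equals 1 when the sentence lengths agree and $\varepsilon = 1$. The signed exponent and the function name `order_metric_coefficient` are assumptions, not the patent's expression.

```python
import math


def order_metric_coefficient(order_f: int, order_s: int,
                             len_f: int, len_s: int,
                             eps: float = 1.0) -> float:
    """Assumed form: exp(o_F - o_S) / (|l_F - l_S| + eps).

    order_f, order_s: position of the keyword in each sentence.
    len_f, len_s:     length (in phrases) of each sentence.
    eps:              preset positive number keeping the denominator nonzero.
    """
    numerator = math.exp(order_f - order_s)   # 1 when the positions match
    denominator = abs(len_f - len_s) + eps    # eps when the lengths match
    return numerator / denominator


same = order_metric_coefficient(2, 2, 10, 10)     # identical position and length
earlier = order_metric_coefficient(1, 5, 10, 10)  # keyword much earlier in F
later = order_metric_coefficient(5, 1, 10, 10)    # keyword much later in F
```

Under this assumed form the coefficient is 1 for matching sentences and moves below or above 1 as the positions diverge, which is the behaviour the paragraph above describes.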
(2) Obtain the order difference metric of the same keyword between different electronic case files from the order metric coefficients:
[formula not reproduced in this text]
where $D_i^{F,S}$ is the order difference metric of keyword $i$ between the $F$-th and the $S$-th electronic case files; $C_i(J_{i,z}^{F}, J_{i,t}^{S})$ is the order metric coefficient of keyword $i$ between the $z$-th sentence containing keyword $i$ in the $F$-th file and the $t$-th sentence containing keyword $i$ in the $S$-th file; $n_i^{F}$ and $n_i^{S}$ are the numbers of sentences containing keyword $i$ in the $F$-th and the $S$-th electronic case files, respectively; and $\min(\cdot)$ is the minimum function;
The closer the order metric coefficient is to 1, the more important the corresponding keyword and the closer its order difference metric is to 0.
(3) Take the mean of the order difference metrics of keyword $i$ between the $F$-th electronic case file and every other electronic case file as the comprehensive order difference metric of keyword $i$ in the $F$-th electronic case file.
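Steps (2) and (3) can be sketched as below. The order difference formula is not reproduced in this text; the sketch assumes it is the smallest distance from 1 over all sentence pairs, which agrees with the statements that it uses a minimum function and approaches 0 as the order metric coefficient approaches 1. Both function names are hypothetical.

```python
from statistics import mean


def order_difference(coefficients: list[list[float]]) -> float:
    """Assumed form: min over all sentence pairs (z, t) of |C(z, t) - 1|.

    coefficients[z][t] is the order metric coefficient between the z-th
    sentence in file F and the t-th sentence in file S that contain the
    keyword. Both files are assumed to contain the keyword at least once.
    """
    return min(abs(c - 1.0) for row in coefficients for c in row)


def comprehensive_order_difference(per_file: list[float]) -> float:
    """Mean of the keyword's order differences against every other file."""
    return mean(per_file)


d_fs1 = order_difference([[0.2, 1.0], [3.0, 0.5]])  # one pair matches exactly
d_fs2 = order_difference([[0.5, 2.0]])              # best pair is 0.5 away from 1
d_bar = comprehensive_order_difference([d_fs1, d_fs2])
```

How a keyword that is absent from the other file should be handled is not stated in the source, so the sketch leaves that case out.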
(4) Calculate the TF-IDF (term frequency-inverse document frequency) value of each keyword in each electronic case file; TF-IDF is a known technique and its details are not repeated here. Preset a part-of-speech metric for each keyword according to its part of speech. Because verbs carry the more important information in an electronic case file, the part-of-speech metric is set to 1 for verb keywords and to 0 for keywords of other parts of speech, so a higher part-of-speech metric indicates a more important keyword. Note that the implementer may set the part-of-speech metric values as needed; the embodiment of the application does not specifically limit them.
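A small sketch of step (4) follows. The source does not say which TF-IDF variant is used, so this assumes the plain form (term count divided by document length, times the logarithm of the number of documents over the document frequency). The tag `"v"` for verbs is also an assumption about the tagger's label set.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {keyword: TF-IDF} mapping per document (plain variant)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each keyword once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def pos_metric(tag: str) -> int:
    """Part-of-speech metric: 1 for verbs, 0 for everything else."""
    return 1 if tag == "v" else 0  # "v" as the verb tag is an assumption


docs = [["sign", "contract", "sign"], ["pay", "contract"]]
scores = tf_idf(docs)
```

In this toy corpus "contract" appears in every document and so scores 0, which mirrors the observation that a keyword common to all files contributes little to telling case types apart.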
(5) To measure how much each keyword contributes to distinguishing the case type or describing the case events, construct the metric coefficient of each keyword in each electronic case file from the keyword's TF-IDF value, comprehensive order difference metric, and part-of-speech metric:
In any electronic case file, the metric coefficient of each keyword is positively correlated with the keyword's TF-IDF value and part-of-speech metric, and negatively correlated with its comprehensive order difference metric.
It can be understood that positive and negative correlation here describe the relationship between an independent variable and a dependent variable: under positive correlation the dependent variable increases (decreases) as the independent variable increases (decreases), and under negative correlation the dependent variable decreases (increases) as the independent variable increases (decreases). The specific form of the correlation can be determined according to the actual situation in application, and the application does not specifically limit it.
Preferably, in one embodiment of the present application, the calculation expression of the metric coefficient of each keyword in any electronic case file may be:
[formula not reproduced in this text]
where $M_i^{F}$ is the metric coefficient of keyword $i$ in the $F$-th electronic case file, $\bar{D}_i^{F}$ is the comprehensive order difference metric of keyword $i$ in the $F$-th electronic case file, $T_i^{F}$ is the TF-IDF value of keyword $i$ in the $F$-th electronic case file, $P_i^{F}$ is the part-of-speech metric of keyword $i$ in the $F$-th electronic case file, and $\varepsilon$ is a preset positive number that keeps the denominator from being 0. Preferably, in one embodiment of the application, $\varepsilon$ is set to 1. In other embodiments of the present application, the implementer can set the value of $\varepsilon$ according to the actual situation.
It can be understood that a keyword which contributes more to distinguishing the case type or describing the case events has a relatively fixed logical position: the position at which it appears in a sentence and the length of that sentence are fixed or nearly so. Such a keyword also appears more frequently, and if it is a verb it may describe the behavior in the event and is therefore more important. So the greater a keyword's contribution to distinguishing the case type or describing the case events, the larger its metric coefficient, and a larger metric coefficient indicates a more important keyword. Conversely, the smaller the contribution, the smaller the metric coefficient.
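The metric coefficient formula is not reproduced in this text. The sketch below shows one form that satisfies the stated correlations (rising with the TF-IDF value and the part-of-speech metric, falling with the comprehensive order difference metric, with $\varepsilon$ in the denominator). The factor `1 + pos` is an assumption, chosen so that non-verb keywords do not collapse to 0.

```python
def metric_coefficient(tf_idf_value: float, pos: int,
                       order_diff: float, eps: float = 1.0) -> float:
    """Assumed form: T * (1 + P) / (D + eps).

    tf_idf_value: TF-IDF value T of the keyword in this file.
    pos:          part-of-speech metric P (1 for verbs, 0 otherwise).
    order_diff:   comprehensive order difference metric D of the keyword.
    eps:          preset positive number keeping the denominator nonzero.
    """
    return tf_idf_value * (1 + pos) / (order_diff + eps)


verb_fixed = metric_coefficient(0.4, 1, 0.0)   # verb with a stable position
noun_loose = metric_coefficient(0.4, 0, 1.0)   # non-verb with a varying position
```

Any other monotone combination would fit the description equally well; the source explicitly leaves the specific form open.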
Step S3: calculate the weight of each keyword in each electronic case file from its metric coefficient, and generate the fingerprint of each electronic case file with the SimHash algorithm.
(1) In each electronic case file, the larger a keyword's contribution, that is, the larger its metric coefficient, the more useful information the keyword provides for distinguishing case types or describing case events. Such a keyword is more important and is given a larger weight, so that it has more influence on the generated fingerprint and on the similarity judgment, which improves the accuracy of distinguishing case types from electronic case files. The calculation expression of the weight of each keyword in any electronic case file is:
[formula not reproduced in this text]
where $w_i^{F}$ is the weight of keyword $i$ in the $F$-th electronic case file, $M_i^{F}$ is the metric coefficient of keyword $i$ in the $F$-th electronic case file, $N^{F}$ is the number of keywords in the $F$-th electronic case file, and $\max(\cdot)$ is the maximum function.
The greater a keyword's contribution to distinguishing the case type or describing the case events, the larger its metric coefficient and hence its weight; the denominator limits the weight to the range 0 to 1.
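The weight formula is not reproduced in this text, but the description (a maximum over the keywords of the file, used as a denominator that bounds the weight between 0 and 1) suggests dividing each metric coefficient by the largest one in the file. The sketch below assumes exactly that, and assumes non-negative coefficients.

```python
def keyword_weights(coefficients: dict[str, float]) -> dict[str, float]:
    """Assumed form: w_i = M_i / max(M_1, ..., M_N) within one file."""
    largest = max(coefficients.values(), default=0.0)
    if largest <= 0.0:
        # Degenerate file: no informative keyword, so give every keyword 0.
        return {word: 0.0 for word in coefficients}
    return {word: value / largest for word, value in coefficients.items()}


weights = keyword_weights({"sign": 0.8, "contract": 0.2, "pay": 0.4})
```

The keyword with the largest metric coefficient always receives weight 1, and all other weights scale relative to it.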
(2) In each electronic case file, arrange all keywords of all sentences in their order of appearance to obtain the phrase sequence of the file, and arrange the weights of the keywords in the same order to form the keyword weight sequence of the file. The phrase sequence and the keyword weight sequence of each electronic case file are then taken as the inputs of the SimHash algorithm, whose output is the fingerprint of the file. The SimHash algorithm is a known technique, and its details are not repeated here.
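For reference, a weighted SimHash can be sketched as follows. The choice of MD5 as the per-keyword hash and the 64-bit fingerprint width are assumptions; the source does not specify either.

```python
import hashlib


def token_hash(token: str, bits: int = 64) -> int:
    """Hash a keyword to a `bits`-wide integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(keywords: list[str], weights: list[float], bits: int = 64) -> int:
    """Weighted SimHash: each keyword votes +w or -w on every bit position."""
    totals = [0.0] * bits
    for word, weight in zip(keywords, weights):
        h = token_hash(word, bits)
        for b in range(bits):
            totals[b] += weight if (h >> b) & 1 else -weight
    fingerprint = 0
    for b in range(bits):
        if totals[b] > 0:  # a positive total sets the bit in the fingerprint
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp_single = simhash(["sign"], [1.0])
fp_mixed = simhash(["sign", "pay"], [1.0, 0.1])
```

A heavily weighted keyword dominates every bit vote, which is how the metric-coefficient weights shape the fingerprint.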
Step S4: cluster based on the fingerprint of each electronic case file and the maximum metric coefficient in each file to obtain clusters.
Obtain the maximum of the metric coefficients of all keywords in each electronic case file, record it as the maximum metric coefficient, and convert it to a binary number. The vector formed by the fingerprint of an electronic case file and the binary number of its maximum metric coefficient is taken as the feature vector of that file.
The feature vectors of all electronic case files are taken as the input of the K-means clustering algorithm, and the number of cluster centers is set. Preferably, in this embodiment the number of cluster centers is set to 5; in other embodiments the implementer can set it according to the actual situation. Because the fingerprints and the metric coefficients are in the form of character strings, the Hamming distance is used as the distance measure. The output is the clusters. The K-means algorithm is a known technique, and its details are not repeated here. It can be understood that the application gives only one way of clustering the feature vectors of all electronic case files; many clustering methods exist, and the implementer may use another one. The application does not specifically limit this.
The maximum metric coefficient is used for the following reason. If electronic case files belong to the same case type, their descriptions of events are similar and the wording of the judgment basis is essentially the same, so the metric coefficients of their keywords are essentially equal. But because the number of keywords differs from file to file, the metric coefficients cannot be compared one to one, so the maximum metric coefficient is used as a representative for clustering.
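Step S4 can be sketched as below. Two points are assumptions: the source does not say how the maximum metric coefficient is converted to binary, so the sketch quantises it to a fixed-width integer, and the cluster centres are updated by a bitwise majority vote so that they stay binary under the Hamming distance. The example uses 2 clusters and tiny vectors rather than the 5 centres of the preferred embodiment.

```python
def to_bits(value: int, width: int) -> list[int]:
    """Most-significant-bit-first binary expansion of a non-negative integer."""
    return [(value >> b) & 1 for b in reversed(range(width))]


def feature_vector(fingerprint: int, max_coeff: float, fp_bits: int = 64,
                   coeff_bits: int = 8, coeff_cap: float = 1.0) -> list[int]:
    """Fingerprint bits followed by the quantised maximum metric coefficient."""
    clipped = min(max(max_coeff, 0.0), coeff_cap)
    level = round(clipped / coeff_cap * (2 ** coeff_bits - 1))  # assumed encoding
    return to_bits(fingerprint, fp_bits) + to_bits(level, coeff_bits)


def hamming(a: list[int], b: list[int]) -> int:
    return sum(x != y for x, y in zip(a, b))


def cluster(vectors: list[list[int]], initial_centers: list[list[int]],
            max_iter: int = 20) -> list[int]:
    """K-means-style loop with Hamming distance and majority-vote centres."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda k: hamming(v, centers[k])) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [int(2 * sum(col) > len(members))
                              for col in zip(*members)]
    return labels


vectors = [feature_vector(0b1111, 1.0, 4, 2), feature_vector(0b1110, 0.9, 4, 2),
           feature_vector(0b0000, 0.0, 4, 2), feature_vector(0b0001, 0.1, 4, 2)]
labels = cluster(vectors, [vectors[0], vectors[2]])
```

Seeding the centres with chosen vectors keeps the example deterministic; a practical run would choose initial centres by some other rule.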
Step S5: obtain the core keyword set of each cluster based on the keywords of each electronic case file in the cluster.
In each electronic case file, the quartile of the metric coefficients of all keywords is used as the preset segmentation threshold of that file, and the set of keywords whose metric coefficient is greater than the threshold is recorded as the key keyword set of the file.
The intersection of the key keyword sets of all electronic case files in a cluster is taken as the key keyword set of the cluster;
the set formed by the verbs of all electronic case files in the cluster is taken as the verb set of the cluster, with no element repeated;
the union of the key keyword set and the verb set of the cluster is taken as the core keyword set of the cluster. Each cluster represents the electronic case files of one case type, so the core keyword set of a cluster is the set of keywords that contribute most to describing that case type, that is, to distinguishing the case type and describing the case events.
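The set operations of step S5 can be sketched as follows. The thresholds are passed in directly rather than derived from a quartile, since the source does not say which quartile is meant; the function names and sample data are hypothetical.

```python
def key_keyword_set(coefficients: dict[str, float], threshold: float) -> set[str]:
    """Keywords of one file whose metric coefficient exceeds the threshold."""
    return {word for word, value in coefficients.items() if value > threshold}


def core_keyword_set(files: list[dict[str, float]], thresholds: list[float],
                     verbs_per_file: list[set[str]]) -> set[str]:
    """Intersection of the per-file key keyword sets, united with all verbs."""
    key_sets = [key_keyword_set(c, t) for c, t in zip(files, thresholds)]
    shared = set.intersection(*key_sets) if key_sets else set()
    verbs = set().union(*verbs_per_file)  # a set, so no verb is repeated
    return shared | verbs


cluster_files = [
    {"sign": 0.9, "contract": 0.8, "court": 0.1},
    {"sign": 0.7, "contract": 0.6, "pay": 0.65, "date": 0.1},
]
core = core_keyword_set(cluster_files, [0.5, 0.5], [{"sign"}, {"sign", "pay"}])
```

Here "pay" enters the core set only through the verb set, because it is a key keyword in just one of the two files.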
In this way, when the SimHash algorithm is used to calculate text similarity, both the word frequency of each keyword in the text and its logical position within sentences are considered. This makes the similarity calculation for electronic case files more accurate, so that files of the same case type are more reliably placed in the same cluster. In addition, calculating the core keyword set of each cluster selects the keywords that contribute most to the case type and event description, which helps extract the key keywords of each case type and improves the description of each type, achieving accurate content analysis of electronic case files and accurate classification of case types.
FIG. 2 shows a schematic diagram of the process for obtaining the metric coefficient of each keyword in each electronic case file.
Based on the same inventive concept as the above method, an embodiment of the application further provides a device for analyzing the content of electronic case files in a smart court, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements the steps of any one of the above methods for analyzing the content of electronic case files in a smart court.
In summary, the embodiment of the application provides a method for analyzing the content of electronic case files in a smart court. The method obtains an order difference metric of the same keyword between different electronic case files from the differences between sentences containing that keyword, and combines it with the TF-IDF value and part of speech of each keyword to construct a metric coefficient for each keyword in each electronic case file. Because this coefficient considers both the word frequency and the logical order of the keywords, it helps improve the accuracy of judging keyword importance. A weight function constructed from the metric coefficients gives the weight of each keyword, which helps improve the accuracy of the fingerprints computed with the SimHash algorithm and of the subsequent similarity judgment between electronic case files. Finally, the files are clustered according to their fingerprints and keyword metric coefficients, which improves the accuracy of grouping cases of the same type and of keyword extraction, and thus the accuracy of electronic case file content analysis.
It should be noted that the order of the embodiments of the present application is for description only and does not indicate the relative merits of the embodiments. The foregoing describes specific embodiments of this application. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present application are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The foregoing describes only preferred embodiments of the present application and is not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the principles of the present application shall fall within its scope of protection.

Claims (3)

Translated from Chinese
1. A method for analyzing the content of electronic cases of a smart court, characterized in that the method comprises the following steps:
segmenting the body text of each electronic case into words to obtain phrases, and removing stop words from all phrases to obtain keywords;
constructing, based on the differences between sentences containing the same keyword in different electronic cases, an order measurement coefficient of the same keyword between sentences of different electronic cases; obtaining, based on the differences in the order measurement coefficients between sentences of different electronic cases, an order difference measure of the same keyword between different electronic cases; and constructing, based on the importance and part-of-speech features of each keyword in each electronic case and in combination with the order difference measure, a measurement coefficient of each keyword in each electronic case;
determining the weight of each keyword in each electronic case based on the measurement coefficient, and obtaining the fingerprint of each electronic case using a fingerprint generation algorithm;
clustering based on the maximum measurement coefficient and the fingerprint of each electronic case to obtain clusters;
obtaining the core keyword set of each cluster based on the keywords of each electronic case in the cluster;
wherein the calculation expression of the order measurement coefficient is: [formula not reproduced], whose terms are: the order measurement coefficient of keyword i between a first sentence and a second sentence; the ordinal values of keyword i in the first sentence and in the second sentence, respectively; the sentence lengths of the first sentence and the second sentence, respectively; a preset positive number; and an exponential function with the natural constant e as its base; the first sentence being the z-th sentence containing keyword i in the F-th electronic case, and the second sentence being the t-th sentence containing keyword i in the S-th electronic case;
the calculation expression of the order difference measure is: [formula not reproduced], whose terms are: the order difference measure of keyword i between the F-th electronic case and the S-th electronic case; the order measurement coefficient of keyword i between the first sentence and the second sentence; the numbers of sentences containing keyword i in the F-th electronic case and in the S-th electronic case, respectively; and a minimum-value function;
the core keyword set is obtained as follows:
in each electronic case, the set of all keywords whose measurement coefficient is greater than a preset segmentation threshold is recorded as the key keyword set;
in each cluster, the set formed by combining the intersection of the key keyword sets of all electronic cases with the verbs of all electronic cases is taken as the core keyword set of the cluster;
the measurement coefficient is obtained as follows:
for any keyword in any electronic case, the mean of the order difference measures of that keyword between that electronic case and all other electronic cases is taken as the comprehensive order difference measure of that keyword in that electronic case;
the TF-IDF value of each keyword in each electronic case is obtained, and the part-of-speech measure of each keyword is set based on its part of speech;
in any electronic case, the measurement coefficient of each keyword is positively correlated with the TF-IDF value and the part-of-speech measure of the keyword, and negatively correlated with the comprehensive order difference measure of the keyword;
the calculation expression of the weight of each keyword in each electronic case is: [formula not reproduced], whose terms are: the weight of keyword i in the F-th electronic case; the measurement coefficient of keyword i in the F-th electronic case; the number of keywords in the F-th electronic case; and a maximum-value function;
the fingerprint of each electronic case is obtained as follows:
the sequence formed by all keywords in each electronic case and the sequence formed by the weights of all keywords are taken as the input of the fingerprint generation algorithm, and the output is the fingerprint of the electronic case;
the clusters are obtained as follows:
the maximum of all keyword measurement coefficients in each electronic case is converted into a binary number, and the fingerprint of each electronic case and the binary number are combined into a vector; the vectors of all electronic cases are taken as the input of a clustering algorithm, and the output is the clusters.
2. The method for analyzing the content of electronic cases of a smart court according to claim 1, characterized in that the part-of-speech measure of each keyword is set as follows: the part-of-speech measure of a keyword that is a verb is set to 1, and the part-of-speech measure of a keyword of any other part of speech is set to 0.
3. An electronic case content analysis device for a smart court, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-2.
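The core keyword set defined in claim 1 (the intersection of the per-case key keyword sets, combined with the verbs of all cases in the cluster) can be sketched as follows. The data layout, in which each case is a dictionary with a `coefficients` mapping and a `verbs` set, and the function names are assumptions made for illustration only.

```python
def key_keywords(coefficients, threshold):
    """Keywords whose measurement coefficient exceeds the segmentation threshold."""
    return {word for word, coef in coefficients.items() if coef > threshold}


def core_keyword_set(cluster, threshold):
    """Core keyword set of one cluster.

    `cluster` is a list of cases; each case is a dict with
    'coefficients' (keyword -> measurement coefficient) and
    'verbs' (set of verb keywords). This layout is hypothetical.
    """
    if not cluster:
        return set()
    key_sets = [key_keywords(case["coefficients"], threshold) for case in cluster]
    shared = set.intersection(*key_sets)  # keywords important in every case
    verbs = set().union(*(case["verbs"] for case in cluster))  # verbs of all cases
    return shared | verbs
```

For example, two cases whose key keyword sets are {loan, repay} and {loan, default}, with verbs {repay} and {default, repay}, yield the core keyword set {loan, repay, default}.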
CN202411433147.XA | 2024-10-15 | 2024-10-15 | Electronic case content analysis method and device for intelligent court | Active | CN118964527B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411433147.XA (CN118964527B (en)) | 2024-10-15 | 2024-10-15 | Electronic case content analysis method and device for intelligent court

Publications (2)

Publication Number | Publication Date
CN118964527A (en) | 2024-11-15
CN118964527B (en) | 2025-01-17

Family

ID=93391601

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411433147.XA (Active, CN118964527B (en)) | Electronic case content analysis method and device for intelligent court | 2024-10-15 | 2024-10-15

Country Status (1)

Country | Link
CN (1) | CN118964527B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105989033A (en) * | 2015-02-03 | 2016-10-05 | 北京中搜网络技术股份有限公司 | Information duplication eliminating method based on information fingerprints
CN113704451A (en) * | 2021-08-30 | 2021-11-26 | 广东电网有限责任公司 | Power user appeal screening method and system, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112364625A (en) * | 2020-11-19 | 2021-02-12 | 深圳壹账通智能科技有限公司 | Text screening method, device, equipment and storage medium
KR102601932B1 (en) * | 2021-11-08 | 2023-11-14 | (주)사람인 | System and method for extracting data from document for each company using fingerprints and machine learning
CN118153568B (en) * | 2024-03-07 | 2024-07-30 | 中国人民解放军32011部队 | Intelligent management method for on-duty document data
CN118627972B (en) * | 2024-07-24 | 2024-11-05 | 武汉华林梦想科技有限公司 | Vocational skills assessment method and system based on big data

Also Published As

Publication number | Publication date
CN118964527A (en) | 2024-11-15

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
