CN111858842A

Info

Publication number: CN111858842A
Application number: CN201910352429.XA
Authority: CN
Inventors: 何铁科; 许金; 严格
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-04-26
Filing date: 2019-04-26
Publication date: 2020-10-30

Abstract

The present invention proposes an LDA-based method for recommending similar court cases. It efficiently recommends similar cases to legal personnel for reference and analysis, helps provide new approaches to difficult cases, and can also unify the standard of judicial decisions and avoid unfair rulings. The main innovations of the invention are: (1) cleaning the sentences of case documents and extracting keywords; (2) building a text matrix from the keyword lists for clustering; (3) combining LDA with cosine similarity to efficiently screen for the best-matching similar cases. The resulting LDA-based similar-case recommendation helps legal staff analyze cases quickly and efficiently and reach accurate judgments.

Description

A Judicial Case Screening Method Based on the LDA Topic Model

Technical Field

The present invention belongs to the field of machine learning in computer technology, in particular data analysis within machine learning. It adopts a topic model, which can extract the semantics of text resources. The bag-of-words model does not need to consider the order of words, which simplifies text analysis and processing and also leaves room for improving the model. The invention is a new method that helps judicial workers quickly screen a large volume of cases to find similar cases for comparison, reference, and analysis.

Background Art

Against the background of the judicial accountability reform, legal big data has set off a wave of "technological revolution" in China's judiciary, and deciding like cases alike has come to be regarded as closely matching the needs of front-line judges. Overall, deciding like cases alike can not only provide new approaches to difficult cases but also unify the standard of judicial decisions and avoid unfair rulings. It is expected to become a "corrective mechanism" that controls the degree of deviation between rulings and manages the quality of judges' case handling, and judicial practice departments have strong demand for and expectations of it. Against this background, the system plans to use an open-source Chinese word segmentation tool to perform automated natural language processing on case texts and extract the key information in each case; on that basis, it provides intelligent similar-case comparison according to text similarity.

Text processing uses the Chinese word segmenter jieba. jieba recognizes words by analyzing and understanding the text content in a human-like manner. The basic idea is as follows: during segmentation, the text is analyzed syntactically and semantically and the character sequence is split into individual words; a stop-word class loads the text and checks whether each word is a stop word (if so, it is deleted immediately); the remaining words are then recombined into a new word sequence according to the rules.

TextRank is a method for extracting keywords (as well as phrases and automatic summaries) from text. It is derived from PageRank: it uses the relationships between local words to rank candidate keywords and extracts them directly from the text itself. The text is split into individual words, a graph model is built on them, and a voting mechanism re-ranks the topic information of the text, so that keywords can be extracted from the content of the text alone.

The bag-of-words method treats each text as a collection of words and represents that collection as a word-frequency vector, converting complex text content into numerical information that is easy to model. The method ignores word order and similar features of the text: the sequence of words is disregarded and every word is treated as independent, which greatly reduces the complexity of the problem and also leaves room for improving the model. Each document represents a probability distribution over some topics, and each topic in turn represents a probability distribution over many words.
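As an illustration of the bag-of-words representation described above, the following minimal sketch turns already-segmented texts into fixed-dimension word-frequency vectors. The function names and the sample tokens are hypothetical and chosen only for this example; the patent does not prescribe a particular implementation.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a fixed column index (sorted for determinism)."""
    vocab = sorted({word for doc in documents for word in doc})
    return {word: idx for idx, word in enumerate(vocab)}


def to_count_vector(tokens, vocabulary):
    """Turn one tokenized document into a word-frequency vector; word order is ignored."""
    counts = Counter(tokens)
    # Iterating over the dict yields words in index order (insertion order).
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical, already-segmented case texts (illustrative only).
docs = [
    ["contract", "breach", "compensation", "contract"],
    ["theft", "property", "compensation"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
```

Because only counts are kept, shuffling the tokens of a document leaves its vector unchanged, which is the simplification the paragraph above refers to.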

Parameters are estimated using Gibbs sampling. The procedure is as follows. First, traverse every word in every text and randomly assign it a topic, i.e. Z_{m,n} = k ~ Mult(1/K), where m denotes the m-th document, n the n-th word in the document, k a topic, and K the total number of topics; then increment the corresponding counters n_m^{(k)}, n_m, n_k^{(t)} and n_k by 1. These denote, respectively, the number of occurrences of topic k in document m, the total number of topic assignments in document m, the number of times word t is assigned to topic k, and the total number of words assigned to topic k. Next, traverse all words in all documents: if word t in the current document m is assigned topic k, decrement n_m^{(k)}, n_m, n_k^{(t)} and n_k by 1, that is, take the current word out, sample a new topic according to the probability distribution of the topic samples in the text, and increment the corresponding n_m^{(k)}, n_m, n_k^{(t)} and n_k by 1. Repeat this iteration, and finally output the parameter values.
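The counting procedure above can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration, not the patent's implementation: the symmetric hyperparameter values `alpha` and `beta`, the iteration count, the fixed seed, and the toy word ids are all assumptions made for the example, and the sampling weights use the standard collapsed-LDA conditional, which the text does not spell out.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]                # topic k count in document m
    n_m = [0] * len(docs)                                  # total assignments in document m
    n_kt = [[0] * vocab_size for _ in range(num_topics)]   # word t count under topic k
    n_k = [0] * num_topics                                 # total words under topic k

    def update(m, k, t, delta):
        """Add delta (+1 or -1) to the four counters touched by one word."""
        n_mk[m][k] += delta
        n_m[m] += delta
        n_kt[k][t] += delta
        n_k[k] += delta

    # Initialization: give every word a topic drawn uniformly, Mult(1/K).
    z = []
    for m, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for t, k in zip(doc, z[m]):
            update(m, k, t, +1)

    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                update(m, z[m][n], t, -1)   # take the current word out
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                z[m][n] = rng.choices(range(num_topics), weights=weights)[0]
                update(m, z[m][n], t, +1)   # put it back under the new topic

    # Read the document-topic and topic-word distributions off the counts.
    theta = [[(n_mk[m][k] + alpha) / (n_m[m] + num_topics * alpha)
              for k in range(num_topics)] for m in range(len(docs))]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + vocab_size * beta)
            for t in range(vocab_size)] for k in range(num_topics)]
    return theta, phi


# Toy corpus of word ids (illustrative only).
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 0]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=5)
```

Each row of `theta` is one document's topic distribution and each row of `phi` is one topic's word distribution; both sum to 1 by construction.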

LDA is an unsupervised machine learning technique for discovering the topic information hidden in a large text collection or corpus. It does not require a manually labeled training set; only the text collection T and the specified number of topics K need to be provided. It is also a clustering method: topics correspond to cluster centers and documents correspond to the examples in the data set; both topics and texts live in the feature space, and the feature vectors are word-frequency vectors. LDA is first used to find the case most similar to the target case; that case is taken as the cluster center, and all cases associated with that cluster center are retrieved. Cosine similarity is then computed between the target case and every case belonging to the cluster center. The larger the resulting value (always <= 1), the more similar the cases and the more likely the case is to be recommended; the smaller the value, the less similar the cases and the less likely it is to be recommended.
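The cosine-similarity ranking described above might look like the following sketch. The vectors stand in for the topic or word-frequency vectors of cases; the case names, the values, and the helper names `cosine_similarity` and `top_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar(target, candidates, k=5):
    """Return the k candidate (name, score) pairs most similar to the target, best first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical topic vectors for a target case and three candidates.
target = [0.7, 0.2, 0.1]
candidates = {
    "case_a": [0.6, 0.3, 0.1],
    "case_b": [0.1, 0.1, 0.8],
    "case_c": [0.7, 0.2, 0.1],
}
recommended = top_similar(target, candidates, k=2)
```

With `k=5`, as in the method, the function returns the five highest-scoring cases when at least five candidates exist.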

Summary of the Invention

The problem to be solved by the present invention is to recommend similar cases to courts for analysis, based on LDA. The technical solution of the present invention is as follows:

1) Segment the semi-structured case text in XML format using the open-source Chinese word segmentation tool jieba, and return the segmentation result set.

2) Extract the keywords hidden in the result set using the TextRank keyword extraction technique. A case text may contain several topics, and each keyword in the text is generated by one of those topics; the bag-of-words method converts the keywords into a word-frequency vector of fixed dimension.

3) Estimate the parameters of the text matrix by Gibbs sampling, such as the values of the per-document topic multinomial distribution θ and the per-topic word multinomial distribution φ.

4) Select the cause of action of the query case and, according to that cause of action, obtain all other documents of the same class through the LDA method. Based on the case text vector model, use the cosine formula for vectors to calculate the cosine similarity between each of the other texts and the target case, and recommend the 5 texts with the highest similarity.

The beneficial effects of the present invention are as follows: the system offers a new, efficient approach to case analysis compared with traditional manual case review, and pushes similar cases to legal staff for analysis and reference. These similar cases cover the relevant legal provisions, the factors considered in the judgment, the judgment result, and so on. This can not only provide new approaches to difficult cases but also unify the standard of judicial decisions and avoid unfair rulings.

Description of the Drawings

Figure 1: System business flow chart

Figure 2: jieba word segmentation flow chart

Figure 3: LDA generative model diagram

Detailed Description

This method is based on the LDA model. It uses the Gibbs sampling algorithm to build the feature space of the case texts, models the text collection to form a topic distribution vector for each text, and constructs a similarity matrix to cluster the texts. For the cases belonging to a cluster center, the cosine formula is used to compute their similarity, and the five with the largest cosine values are recommended as similar cases. When a user inputs a semi-structured case in XML format, the information is preprocessed first. Preprocessing mainly includes segmenting the text content, filtering out stop words, and named entity recognition, followed by keyword extraction; if the text has already been processed, this stage can be skipped. The operating flow of the system is shown in Figure 1.

First, the uploaded case documents are filtered with the word segmentation tool jieba (Chinese word segmentation, stop-word removal, and so on). The jieba workflow is as follows: efficient word-graph scanning based on a Trie structure generates a directed acyclic graph (DAG) of all possible words that the Chinese characters in a sentence can form; dynamic programming finds the maximum-probability path, giving the best segmentation based on word frequency; for out-of-vocabulary words (in-vocabulary words are those in the dictionary, such as dict.txt; any word not in it is out-of-vocabulary), an HMM model based on the word-forming capability of Chinese characters is applied with the Viterbi algorithm. The detailed flow is shown in Figure 2. The full text of a case is split on special characters into independent clauses in their original order; on that basis, jieba is used to delete stop words such as "的" and "地", and the filtered result set is output.
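To make the DAG and dynamic-programming steps concrete, here is a simplified sketch written from the description above. It does not call jieba and is not jieba's source code: the toy dictionary `WORD_FREQ`, its frequencies, and the stop-word set are invented for illustration, and the HMM/Viterbi handling of out-of-vocabulary words is omitted (unknown characters simply become single-character words).

```python
import math

# Hypothetical toy dictionary: word -> frequency (a stand-in for a file such as dict.txt).
WORD_FREQ = {"独": 3, "立": 5, "独立": 60, "于": 100, "立法": 40,
             "立法权": 30, "法": 10, "权": 20, "法权": 2}
STOP_WORDS = {"于", "的", "地"}  # illustrative stop-word list


def build_dag(sentence, word_freq):
    """For each start index, list the end indices of all dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in word_freq]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def best_segmentation(sentence, word_freq):
    """Right-to-left dynamic programming over the DAG for the maximum log-probability path."""
    log_total = math.log(sum(word_freq.values()))
    dag = build_dag(sentence, word_freq)
    n = len(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best log-probability of sentence[i:], end of first word)
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(word_freq.get(sentence[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


def remove_stop_words(words, stop_words):
    """Drop stop words while keeping the remaining words in order."""
    return [w for w in words if w not in stop_words]


segmented = best_segmentation("独立于立法权", WORD_FREQ)
filtered = remove_stop_words(segmented, STOP_WORDS)
```

With this toy dictionary the whole-word path "独立 / 于 / 立法权" outscores paths through shorter words because each extra split multiplies in another probability below 1.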

The TextRank keyword extraction algorithm uses the relationships between local words to generate a keyword list. It computes a confidence score for each candidate keyword in the segmentation result list and sorts the candidates, yielding the most important words as the keyword sequence; in other words, it automatically extracts a number of meaningful words from a given case text. The steps are as follows:

1) Split the text: divide the given text into clauses at punctuation marks: Text = [S1, S2, …, Sn].

2) Extract candidate keywords: from the jieba segmentation result, keep only words with specified parts of speech, such as nouns, verbs, and adjectives, and treat them as candidate keywords: Si = [W1, W2, …, Wn].

3) Set the co-occurrence window size n and, according to n, regenerate the keyword lists for all sentences in the text: [W1, W2, …, Wn], [W2, W3, …, Wn+1], …

4) Compute the confidence score of each word in the lists. Initialize U0 = [1/n, 1/n, …, 1/n] and iterate

Un = α(M · Un-1) + (1 - α)U0,

where M is the initialized word co-occurrence square matrix and U0 is the uniform (mean) vector. Using this formula, propagate the weights of the nodes iteratively until convergence.

5) Sort by node weight to obtain the most important words and mark them in the original text; if marked words are adjacent, combine them into a multi-word keyword. For example, if a sentence in a case reads "独立于立法权之外的司法权" ("judicial power independent of legislative power") and both "立法权" (legislative power) and "司法权" (judicial power) are candidate keywords, the two adjacent words are combined as "立法权司法权" and added to the keyword sequence.
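Steps 3 to 5 above can be sketched as follows. This is an illustrative sketch only: the window size, the value of α (0.85 is a common PageRank-style choice, not a value given in the text), the fixed iteration count used in place of a convergence check, the column normalization of M, and the sample tokens are all assumptions. Part-of-speech filtering and the merging of adjacent keywords are left out.

```python
def textrank(tokens, window=3, alpha=0.85, iterations=50):
    """Rank candidate keywords with U_n = alpha * (M @ U_{n-1}) + (1 - alpha) * U_0."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)

    # Symmetric co-occurrence counts for words that share a sliding window.
    cooc = [[0.0] * size for _ in range(size)]
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            a, b = index[word], index[tokens[j]]
            if a != b:
                cooc[a][b] += 1.0
                cooc[b][a] += 1.0

    # Column-normalize so that each word distributes its weight among its neighbors.
    col_sums = [sum(cooc[r][c] for r in range(size)) for c in range(size)]
    M = [[cooc[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(size)]
         for r in range(size)]

    u0 = [1.0 / size] * size  # uniform starting vector U_0
    u = u0[:]
    for _ in range(iterations):
        u = [alpha * sum(M[r][c] * u[c] for c in range(size)) + (1 - alpha) * u0[r]
             for r in range(size)]

    return sorted(zip(vocab, u), key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate keywords from one case text, in their original order.
tokens = ["court", "judgment", "contract", "breach",
          "contract", "compensation", "court", "contract"]
ranked = textrank(tokens)
top_keywords = [word for word, _ in ranked[:3]]
```

Words that co-occur with many others inside the window accumulate weight from their neighbors, so the most connected word ends up at the head of `ranked`.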

In this method, the LDA model treats each case text as a collection of N words mixed from K topics. Each topic generates a number of keywords with certain probabilities, and each topic is itself generated with a certain probability. Documents-to-topics follows a multinomial distribution and topics-to-words follows a multinomial distribution; that is, each document represents a probability distribution over some topics, and each topic represents a probability distribution over many words. The LDA model is used to model the case texts, find the relationships between topics and words, and obtain the topic distribution of each text. Each text is generated as follows:

1) Choose the number of words N_m from a Poisson distribution: N_m ~ Poisson(ζ).

2) Choose θ from a Dirichlet distribution: θ ~ Dir(α).

3) For each of the N_m words: choose a topic from the multinomial distribution Mult(θ), then generate the word from the multinomial distribution conditioned on that topic.

4) Repeat the above until every keyword in the text has been traversed.
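The generative steps above can be simulated directly. The sketch below uses only the Python standard library; the topic-word table `phi`, the symmetric prior value `alpha`, the mean length, and the helper names are hypothetical, and words are represented by integer ids.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, phi, alpha, mean_length):
    """Generate one document following steps 1) to 4) of the generative process."""
    num_topics, vocab_size = len(phi), len(phi[0])
    length = sample_poisson(rng, mean_length)               # 1) N_m ~ Poisson(zeta)
    theta = sample_dirichlet(rng, [alpha] * num_topics)     # 2) theta ~ Dir(alpha)
    words = []
    for _ in range(length):                                 # 3) and 4) for every word
        topic = rng.choices(range(num_topics), weights=theta)[0]      # topic ~ Mult(theta)
        word = rng.choices(range(vocab_size), weights=phi[topic])[0]  # word ~ Mult(phi_topic)
        words.append(word)
    return theta, words


# Hypothetical topic-word distributions: 2 topics over a 4-word vocabulary.
phi = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]
rng = random.Random(0)
theta, words = generate_document(rng, phi, alpha=0.5, mean_length=8.0)
```

Inference with Gibbs sampling runs this process in reverse: it starts from the observed words and recovers plausible values of θ and the per-topic word distributions.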

The generation process is shown in Figure 3. Here K is the number of topics, M is the total number of documents, and N_m is the total number of words in the m-th document. β is the Dirichlet prior parameter of the multinomial distribution over words for each topic, and α is the Dirichlet prior parameter of the multinomial distribution over topics for each document. Z_{m,n} is the topic of the n-th word in the m-th document, and W_{m,n} is the n-th word in the m-th document. The remaining two latent variables, θ_m and φ_k, denote the topic distribution of the m-th document and the word distribution of the k-th topic, respectively; the former is a K-dimensional vector (K being the total number of topics) and the latter is a V-dimensional vector (V being the total number of words in the dictionary).

The Gibbs sampling algorithm is used to estimate the parameters θ (the document-topic distribution) and φ (the topic-word distribution) of the LDA model. The idea of the algorithm is to pick one dimension of the vector, determine its value by sampling conditioned on the values of the variables in the other dimensions, and repeat until convergence yields the target parameters. As described above, θ can be used to represent a case text; this vector supports semantic-level similarity analysis of words and texts for clustering. The cosine similarity between the target case text and the other texts is then computed, the results are sorted, and the five cases with the largest cosine similarity are selected as recommendations.
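For reference, the quantities involved can be written out as follows. These are the standard collapsed Gibbs sampling expressions for LDA under the assumption of symmetric scalar priors α and β; the patent text does not state them explicitly, so they are offered as a plausible reading rather than as the authors' exact formulas. Here n_m^{(k)} is the number of words in document m assigned to topic k, n_m the number of words in document m, n_k^{(t)} the number of times word t is assigned to topic k, n_k the number of words assigned to topic k, and V the vocabulary size.

```latex
% Conditional used to resample the topic of word t at position (m, n);
% the subscript \neg(m,n) means the current word is excluded from the counts.
p\left(Z_{m,n} = k \mid Z_{\neg(m,n)}, W\right) \propto
  \left(n_{m,\neg(m,n)}^{(k)} + \alpha\right)
  \frac{n_{k,\neg(m,n)}^{(t)} + \beta}{n_{k,\neg(m,n)} + V\beta}

% Point estimates read off the counts after convergence.
\theta_{m,k} = \frac{n_m^{(k)} + \alpha}{n_m + K\alpha}, \qquad
\varphi_{k,t} = \frac{n_k^{(t)} + \beta}{n_k + V\beta}

% Cosine similarity between the topic vectors of two cases a and b.
\cos(\theta_a, \theta_b) =
  \frac{\sum_{k=1}^{K} \theta_{a,k}\,\theta_{b,k}}
       {\sqrt{\sum_{k=1}^{K} \theta_{a,k}^{2}}\,\sqrt{\sum_{k=1}^{K} \theta_{b,k}^{2}}}
```

Since every component of θ is non-negative, the cosine value lies between 0 and 1, which is consistent with the statement that the similarity never exceeds 1.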

In summary, this method cleans the uploaded case documents (word segmentation, stop-word filtering, keyword extraction, and so on), builds a text matrix, and combines LDA with cosine similarity to complete similar-case recommendation efficiently.

Claims

1. A judicial case screening method based on the LDA topic model, characterized by comprising the following steps: (1) importing a target case document; (2) segmenting the document content into words and screening out the topic information hidden in the case, that is, extracting text keywords; (3) estimating the parameters of the LDA model using Gibbs sampling and establishing the LDA model; (4) representing the text data using the bag-of-words method when modeling the text; (5) calculating the similarity between cases by combining LDA with cosine similarity.

2. The method as claimed in claim 1, characterized by importing a case document in XML format, extracting the corresponding topic information from the document, finding the hidden text keywords from the extracted topics, representing the text data of the returned result set with the bag-of-words method, calculating the parameters of the LDA model by Gibbs sampling and establishing the LDA model, and realizing similarity calculation and recommendation by combining the LDA model with cosine similarity.

3. The LDA-based similar-case recommendation method as claimed in claims 1 and 2, characterized by the following specific steps:

1) performing word segmentation on the semi-structured case text in XML format using the open-source Chinese word segmentation tool jieba, and returning a word segmentation result set;

2) extracting the keywords hidden in the result set using the TextRank keyword extraction technique, wherein the case text may contain a plurality of topics, each keyword in the text is generated by one of the topics, and the bag-of-words method converts the keywords into a word-frequency vector of fixed dimension to construct a text vector model;

3) calculating the parameters of the text matrix by Gibbs sampling, such as the values of the topic multinomial distribution θ and the word multinomial distribution φ, and building the LDA model;

4) selecting the cause of action of the query case, obtaining all other documents of the same class through the LDA method according to the cause of action, calculating the cosine similarity between each of the other texts and the target case using the cosine formula for vectors on the basis of the case text vector model, and recommending the 5 texts with the highest similarity.