CN106445920A


Info

Publication number: CN106445920A
Application number: CN201610867254.2A
Authority: CN
Inventors: 罗森林; 陈倩柔; 潘丽敏; 原玉娇
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2017-02-22

Abstract

Translated from Chinese

To address feature sparsity in similarity calculation for sentences in short social-media texts, the present invention proposes a sentence similarity calculation method that uses sentence semantic structure features. First, sentence semantics are analyzed with a sentence semantic structure model, a topic model is used to mine latent topic knowledge, and sentence features are expanded according to the topic-word distribution, yielding a sentence vector based on the sentence's own features. Next, the Paragraph Vector deep learning model is introduced to learn sentence context features, yielding a sentence vector based on context information. Finally, the sentence similarities computed from the two kinds of sentence vector are combined by weighting. By mining both the semantic information and the context information of sentences in depth, the invention characterizes the intrinsic relations between sentences more comprehensively and accurately, and improves the accuracy of similarity calculation.

Description


Sentence Similarity Calculation Method Using Sentence Semantic Structure Features

Technical Field

The invention relates to a sentence similarity calculation method using sentence semantic structure features, and belongs to the fields of computer science and natural language processing.

Background

Sentence similarity calculation measures how semantically similar two texts are. It is a basic building block of natural language processing tasks such as information retrieval and automatic summarization. With the rapid growth of social networking sites, short social texts, typified by Weibo posts, have appeared in large numbers. They are short and varied in expression, and because they lack the structural information of long documents, traditional sentence similarity methods cannot be applied to them directly.

At present, sentence similarity methods for short social texts fall into three categories according to the depth of semantic analysis they perform: methods based on word features, methods based on word-sense features, and methods based on syntactic analysis features.

Methods based on word features are the earliest sentence similarity methods. They treat a sentence as a linear combination of words and use statistics to compute surface information such as word frequency, part of speech, sentence length, and word order. Typical examples are string matching with the Jaccard similarity coefficient, which derives the similarity from the number of words the two sentences share, and TF-IDF word-frequency statistics, which represent each sentence as a vector and use the cosine of the angle between the vectors as the similarity.
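
The two word-feature baselines named above can be sketched as follows. This is a minimal illustration, not part of the claimed method: it assumes sentences are already segmented into word lists, it uses the standard Jaccard definition (shared distinct words over all distinct words), and the smoothed IDF variant is an assumption, since the text does not specify one.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient: shared distinct words over all distinct words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tfidf_vectors(corpus):
    """Map each tokenized sentence to a sparse {word: tf * idf} dict."""
    n_docs = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    # Smoothed IDF; this particular variant is an assumption.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()} for tokens in corpus]


def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both measures see only surface overlap: two sentences that share no words score 0 regardless of meaning, which is the weakness the background discussion goes on to describe.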

Methods based on word-sense features take a semantic-analysis view and capture the semantic information of words with the help of semantic knowledge resources. Depending on the resource used, they divide into dictionary-based and corpus-based methods. Dictionary-based methods rely on lexical databases such as WordNet and HowNet, which organize vocabulary by word sense, and combine them with word-sense disambiguation to recover the meaning a word expresses in its context, which improves the semantic resolution of the whole sentence. Corpus-based methods introduce a language-model framework and infer the similarity of two words from the probability that they co-occur. A common technique is Latent Semantic Analysis (LSA), which applies singular value decomposition to the word-document matrix to map a high-dimensional feature representation into a low-dimensional latent semantic space.

Methods based on syntactic analysis features judge sentence similarity by analyzing the overall structure of the sentence. They hold that the core verb of a sentence governs the other components: the core verb is governed by no other component, while every other component is governed by it. The semantic information of the sentence is mined by analyzing the dependency relations between words. In practice, similarity is usually computed only between collocation pairs made up of content words such as verbs, nouns, and adjectives and the words directly attached to them, so that noisy data does not bias the result.

Although these methods compute sentence similarity at different levels of analysis, each has limits. Short social texts contain few content words, so word-feature methods, which perform no syntactic analysis and mine no sentence semantics, cannot recover the deeper information of words from statistics on surface features such as word frequency and word form. Word-sense methods do consider word semantics, but they depend on external semantic resources. Short social texts contain many out-of-vocabulary words and are highly time-sensitive, so incomplete dictionaries and sparse features often leave the semantic information unclear. Syntactic methods are limited by the immaturity of current parsing technology and consider neither sentence context nor deep semantic information; this missing information affects the accuracy of the similarity result in ways that are hard to predict.

Summary of the Invention

To address feature sparsity and the neglect of deep semantic information in similarity calculation for sentences in short social texts, the present invention proposes a sentence similarity calculation method that uses sentence semantic structure features. The method considers both the semantic information and the context information of a sentence and fuses them by weighting, which makes the sentence representation more complete. By mining sentence semantics in depth, it makes the similarity result independent of how the sentence is phrased, so that the semantic relatedness of sentences is computed more comprehensively and accurately.

The design principle of the invention is as follows. 1) Analyze sentence semantics with the Chinese Semantic Structure Model (CSM) and extract the sentence semantic components; use the Latent Dirichlet Allocation (LDA) topic model to mine latent topic knowledge; and expand the features on the dimensions corresponding to the sentence semantic components according to this knowledge base, yielding a sentence vector based on the semantic information of the sentence itself. 2) Introduce the Paragraph Vector (PV) deep learning model to learn text features adaptively, yielding a sentence vector based on sentence context information. 3) Compute the similarity between sentences from each of the two kinds of sentence vector, combine the results by linear weighting, and tune the weighting coefficients by grid search so that the similarity result is more accurate.

The specific steps are as follows:

Step 1: Preprocess the short social text set. First split the text into sentences, then perform word segmentation and part-of-speech tagging, and remove stop words.
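
A minimal sketch of step 1, under stated assumptions: the sentence-ending punctuation set, the function names, and the `segment` callback are all hypothetical. Real word segmentation and part-of-speech tagging need a dedicated Chinese segmenter, so the sketch takes the segmenter as a parameter and uses a whitespace splitter with a dummy tag purely as a stand-in for the demo.

```python
import re

# Assumed set of sentence-ending marks; the text does not list them.
SENTENCE_END = re.compile(r"[。！？!?\n]+")


def split_sentences(text):
    """Split raw text into non-empty sentences."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def preprocess(text, segment, stop_words):
    """Return one list of (word, pos_tag) pairs per sentence, stop words removed.

    `segment` is any callable mapping a sentence to (word, pos_tag) pairs.
    """
    result = []
    for sentence in split_sentences(text):
        tagged = [(word, tag) for word, tag in segment(sentence) if word not in stop_words]
        if tagged:
            result.append(tagged)
    return result


def whitespace_segmenter(sentence):
    """Stand-in segmenter for the demo only: splits on spaces, tags everything 'x'."""
    return [(word, "x") for word in sentence.split()]
```

Passing the segmenter in keeps the sketch independent of any particular tool; the later steps only need the resulting word lists.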

Step 2: Expand the features of each sentence and compute sentence similarity, using the CSM sentence semantic structure analysis of each sentence together with the topic and word distributions obtained by running the LDA topic model on the short text set.

Step 2.1: Building on step 1, perform sentence semantic structure analysis on each sentence and extract its topic, comment, basic items, and general items. CSM represents the semantics of a whole sentence as a structure tree with four layers: the sentence-type layer, the description layer, the object layer, and the detail layer. The sentence-type layer gives the semantic type of the sentence, one of four: simple, complex, compound, or multiple. The description layer contains the topic and the comment, which are a first division of the sentence meaning and the basic components of the sentence semantic structure; the topic is the object being described, and the comment is what is said about the topic. The object layer contains predicates, basic items, general items, and semantic cases. A semantic case is a semantic label on a word; there are 7 basic cases and 12 general cases. Basic items are the components directly linked to the predicate; they form the semantic backbone of the sentence, and their semantic cases are basic cases. General items are the modifying components, and their semantic cases are general cases. The detail layer contains the extended meaning of the sentence.
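
To make the four layers concrete, here is one possible in-memory representation of a CSM analysis result. All class and field names are hypothetical, and the case labels in the example ("agent", "object", "time") are placeholders, since the text gives only the number of semantic cases and not their names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    word: str
    semantic_case: str  # placeholder label; one of the 7 basic or 12 general cases
    is_basic: bool      # True for a basic item, False for a general item


@dataclass
class Component:
    """A topic or a comment, with the basic and general items found under it."""
    items: List[Item] = field(default_factory=list)

    def words(self, basic_only=False):
        return [item.word for item in self.items if item.is_basic or not basic_only]


@dataclass
class SentenceStructure:
    sentence_type: str          # sentence-type layer: simple, complex, compound, multiple
    topic: Component            # description layer: the object being described
    comment: Component          # description layer: what is said about the topic
    predicate: str              # object layer
    extended_meaning: str = ""  # detail layer


# Hand-built example for "小明 昨天 吃 苹果" (not produced by a real CSM analyzer).
example = SentenceStructure(
    sentence_type="simple",
    topic=Component([Item("小明", "agent", True)]),
    comment=Component([Item("苹果", "object", True), Item("昨天", "time", False)]),
    predicate="吃",
)
```

Steps 2.3 and 2.4 only need the word lists of the two components, which `words()` exposes.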

Step 2.2: Analyze the short text set with the LDA topic model to mine the latent topic knowledge in the text. Extract the LDA topics and the word distribution under each one, obtaining a text-topic matrix and a topic-word matrix. The LDA topic model recovers the topics in a text and can be used to partition its words: words under the same LDA topic have the same or similar semantics.

Step 2.3: Expand the features of the sentence according to its topic to obtain a topic-based sentence vector. If the same word appears once as part of the topic and once as part of the comment, the two occurrences are considered to carry different semantics and are treated as different words. Under this definition, feature expansion must be carried out separately for the topic part and the comment part. For the topic part, the procedure is: first, extract the words corresponding to the basic items and general items under the topic; next, using the topic-word matrix from step 2.2, compare each word's probability under the different LDA topics, select the LDA topic with the highest probability, and add the other words under that LDA topic to the sentence as part of it; finally, use all words of the sentence as features to build a feature vector representing the sentence. In this vector, the value on a dimension corresponding to a word originally in the sentence is the number of times the word occurs in the sentence, and the value on a dimension corresponding to an added word is computed by formula (1):

V = n * w (1)

where V is the value on the dimension of the added word, n is the number of times the added word occurs in the sentence, and w is the probability of the added word under the corresponding LDA topic.
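
The expansion in step 2.3 and formula (1) can be sketched as below. Several details are assumptions, because the text leaves them open: the topic-word matrix is stored as a list of `{word: probability}` dicts, only the `top_k` most probable words of the selected LDA topic are added, and n is interpreted as the number of times the expansion adds the word (once per occurrence of a source word that maps to that topic). The function name is hypothetical.

```python
from collections import Counter


def expand_features(words, topic_word, top_k=2):
    """Build a sparse sentence vector from the words of a topic (or comment) part.

    words:      words of the basic and general items under the part
    topic_word: topic_word[t][w] = probability of word w under LDA topic t
    """
    # Original words: the value is the occurrence count in the sentence.
    vector = {word: float(count) for word, count in Counter(words).items()}
    original = set(words)
    added = Counter()  # (added word, LDA topic index) -> n
    for word in words:
        probs = [dist.get(word, 0.0) for dist in topic_word]
        if not probs or max(probs) <= 0.0:
            continue  # word unknown to the topic model
        best = probs.index(max(probs))  # LDA topic with the highest probability
        ranked = sorted(topic_word[best].items(), key=lambda kv: kv[1], reverse=True)
        for candidate in [w for w, _ in ranked if w not in original][:top_k]:
            added[(candidate, best)] += 1
    for (candidate, topic), n in added.items():
        # Formula (1): V = n * w
        vector[candidate] = vector.get(candidate, 0.0) + n * topic_word[topic][candidate]
    return vector
```

With `topic_word = [{"apple": 0.5, "fruit": 0.3, "juice": 0.2}, {"phone": 0.6, "screen": 0.4}]` and `words = ["apple", "apple", "phone"]`, the words "fruit" and "juice" are each added twice and "screen" once, so their values are scaled by 2, 2, and 1 times their topic probabilities.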

Step 2.4: Following the method of step 2.3, expand the features of the sentence according to its comment to obtain a comment-based sentence vector.

Step 2.5: Compute one sentence similarity from the topic-based sentence vectors of step 2.3 and another from the comment-based sentence vectors of step 2.4, then weight the two values to obtain the similarity between the sentences; this weighted combination is formula (2).

In formula (2), S_A and S_B are any two sentences and sim1(S_A, S_B) is their similarity. Its inputs are the topic-based sentence vectors of S_A and S_B and the comment-based sentence vectors of S_A and S_B. ω is an adjustable parameter in [0, 1] that sets the weighting coefficients of the two similarities.
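
A minimal sketch of the weighted combination in step 2.5. Two points are assumptions, since the text does not settle them: cosine is used as the similarity between the sparse vectors, and ω is applied to the topic-based term with 1 - ω on the comment-based term. The function names are hypothetical.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sim1(topic_a, topic_b, comment_a, comment_b, omega=0.5):
    """Weighted sum of topic-based and comment-based similarity.

    Assumption: omega weights the topic term, (1 - omega) the comment term.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return omega * cosine(topic_a, topic_b) + (1.0 - omega) * cosine(comment_a, comment_b)
```

Keeping the topic and comment vectors separate is what lets the same word count differently depending on the role it plays in the sentence.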

Step 3: Feed all sentences preprocessed in step 1 into the PV deep learning model, use the model to learn text features and obtain a sentence vector for each sentence, and take the cosine of the angle between two sentence vectors as the similarity between the sentences; this is formula (3).

In formula (3), S_A and S_B are any two sentences, sim2(S_A, S_B) is their similarity, and the inputs are the sentence vectors of S_A and S_B learned by the PV model. PV is an unsupervised learning method. Its input is text of arbitrary length (an article, a paragraph, a sentence, or any other form, collectively called text here), and its output is a continuous distributed vector representation of that text. Its principle is similar to that of word2vec word vectors: it learns an effective vector representation of a sentence or document through feature learning while retaining semantic and word-order information. The PV model addresses the bag-of-words model's failure to account for word meaning and word order, and because the vectors it produces are dense, it also mitigates the feature sparsity of short-text sentence representations.

Step 4: Combine the similarity values from step 2 and step 3 by linear weighting, tune the parameters by grid search to find an optimal set of values, and output the final similarity of the sentence pair, computed as:

sim(S_A, S_B) = θ * sim1(S_A, S_B) + (1 - θ) * sim2(S_A, S_B) (4)

where S_A and S_B are any two sentences, sim(S_A, S_B) is their similarity, θ is an adjustable parameter in [0, 1], and sim1(S_A, S_B) and sim2(S_A, S_B) are computed by formulas (2) and (3) respectively. Substituting formulas (2) and (3) into formula (4) gives the complete sentence similarity formula, which contains the two adjustable parameters ω and θ.

Both ω and θ take values in [0, 1]. They are tuned by grid search against the results of the sentence similarity calculation or of the application that uses it, and the best-performing values are kept.
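
The combination in formula (4) and the grid search over ω and θ can be sketched as follows. The outer weighting follows formula (4) as written; the inner weighting (ω on the topic-based term) is an assumption. The grid step and the `score` callback are hypothetical: in the embodiment the score would be the clustering quality obtained with the combined similarity.

```python
import itertools


def combined_similarity(topic_sim, comment_sim, pv_sim, omega, theta):
    """Combine three precomputed similarities for one sentence pair."""
    # Inner combination; assumes omega weights the topic-based term.
    sim1 = omega * topic_sim + (1.0 - omega) * comment_sim
    # Formula (4): theta weights sim1, (1 - theta) weights the PV similarity.
    return theta * sim1 + (1.0 - theta) * pv_sim


def grid_search(score, step=0.25):
    """Return the (omega, theta) pair on a regular grid over [0, 1] that maximizes score."""
    n_steps = int(round(1.0 / step))
    grid = [i * step for i in range(n_steps + 1)]
    return max(itertools.product(grid, grid), key=lambda pair: score(*pair))
```

In use, `score(omega, theta)` would recompute all pairwise similarities with `combined_similarity`, run the downstream task, and return its quality measure. A finer `step` costs quadratically more evaluations, since every (ω, θ) pair on the grid is tried.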

Beneficial Effects

The sentence similarity calculation method of the invention reduces the loss of semantic information and characterizes the intrinsic relations between sentences more comprehensively and accurately. By mining both the context of a sentence and its internal semantic structure features, it makes the similarity calculation less dependent on how the sentence is phrased and improves the accuracy of the result.

Detailed Description

To better illustrate the purpose and advantages of the invention, an embodiment of the method is described in further detail below with a concrete example.

The experiment uses the corpus released for the Chinese microblog opinion element extraction evaluation task of the NLP&CC 2013 conference. Five topics, 10,896 sentences in total, were selected at random as the short text set. The sentence similarity calculation is evaluated by applying it to short-text clustering and assessing the clustering quality. Clustering quality is measured with the silhouette coefficient, a concept first proposed by Peter J. Rousseeuw in 1986, which combines cohesion and separation to judge a clustering.

The silhouette coefficient is computed as follows:

(1) For the i-th object, compute its average distance to the other objects in its own cluster, denoted a_i.

(2) For the i-th object, compute its average distance to all objects in each cluster that does not contain it, and take the minimum of these per-cluster averages as b_i.

(3) For the i-th object, the silhouette coefficient is denoted s_i and is computed by formula (6):

s_i = (b_i - a_i) / max(a_i, b_i) (6)

The silhouette coefficient ranges over [-1, 1]. From formula (6), s_i < 0 means that the average distance from the i-th object to the elements of its own cluster is greater than its minimum average distance to another cluster, so the object is poorly clustered. If a_i tends to 0, or b_i is sufficiently large, s_i approaches 1, meaning that data within clusters are tight, clusters are well separated, and the clustering is good.
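
The three steps above can be sketched directly from a pairwise distance matrix, using the standard definition s_i = (b_i - a_i) / max(a_i, b_i). How a sentence similarity is turned into a distance (for example 1 minus the similarity) is an assumption left to the caller, and assigning 0 to an object that is alone in its cluster is a common convention rather than something the text specifies.

```python
def silhouette_scores(distance, labels):
    """Per-object silhouette coefficients.

    distance: square matrix, distance[i][j] is the distance between objects i and j
    labels:   labels[i] is the cluster of object i
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters or a single cluster
            continue
        # a_i: average distance to the other members of the object's own cluster
        a_i = sum(distance[i][j] for j in own) / len(own)
        # b_i: smallest average distance to the members of any other cluster
        b_i = min(
            sum(distance[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom else 0.0)
    return scores


def mean_silhouette(distance, labels):
    """Average silhouette coefficient over all objects."""
    scores = silhouette_scores(distance, labels)
    return sum(scores) / len(scores)
```

For four points at positions 0, 1, 10, and 11 on a line, the labeling [0, 0, 1, 1] gives every object a score near 1, while the labeling [0, 1, 0, 1] gives negative scores, matching the interpretation of s_i given above.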

The specific implementation steps are:

Step 1: Split the short social text set into sentences, then segment and part-of-speech tag each sentence with ICTCLAS2015, and remove stop words using a stop-word list downloaded from the Internet.

Step 2: Analyze the sentence semantic structure of each sentence in the short text set with CSM, analyze the short text set with the LDA topic model to obtain its topic and word distributions, enrich the features of each sentence, and compute sentence similarity.

Step 2.1: Building on step 1, perform sentence semantic structure analysis on each sentence and extract its topic, comment, basic items, and general items.

Step 2.2: Analyze the short text set with the LDA topic model, extract the LDA topics and the word distribution under each one, and obtain the topic-word matrix.

Step 2.3: Expand the features of the sentence according to its topic to obtain a topic-based sentence vector. The procedure is: first, extract the words corresponding to the basic items and general items under the topic; next, using the topic-word matrix from step 2.2, compare each word's probability under the different LDA topics, select the LDA topic with the highest probability, and add the other words under that LDA topic to the sentence as part of it; finally, use all words of the sentence as features to build a feature vector representing the sentence. The value on a dimension corresponding to a word originally in the sentence is the number of times the word occurs in the sentence, and the value on a dimension corresponding to an added word is computed by formula (1).

Step 2.5: Compute sentence similarity from each of the two kinds of sentence vector obtained in steps 2.3 and 2.4, weight the two similarity values, and obtain the similarity between the sentences by formula (2).

Step 3: Feed all sentences preprocessed in step 1 into the PV deep learning model, use the model to learn text features and obtain sentence vectors, and take the cosine of the angle between two sentence vectors as the similarity between the sentences. All PV model parameters use the tool's default values.

Step 4: Combine the similarity values from step 2 and step 3 by linear weighting, tune the parameters ω and θ by grid search, and select the best set of values.

For clustering over the 5 topics, with PV vector length size = 100 and window length window = 5, the silhouette coefficient reaches its best value of 0.45 at ω = 0.33 and θ = 0.25. When θ = 0, that is, when only the similarity obtained from CSM sentence semantic structure analysis is considered, the silhouette coefficient is 0.42; when θ = 1, that is, when only the similarity obtained from PV analysis is considered, the silhouette coefficient is 0.31. The results show that sentence vectors built with CSM capture deeper sentence-internal semantic information, that the PV model lets sentence vectors capture rich context information, and that a similarity method using both a sentence's own semantics and its context measures the similarity between sentences more accurately than either source alone.