CN110263343A - Keyword extraction method and system based on phrase vectors - Google Patents

Keyword extraction method and system based on phrase vectors

Info

Publication number
CN110263343A
CN110263343A (application CN201910548261.XA)
Authority
CN
China
Prior art keywords
candidate
candidate term
weight
term
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910548261.XA
Other languages
Chinese (zh)
Other versions
CN110263343B (en)
Inventor
孙新
赵永妍
申长虹
杨凯歌
张颖捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201910548261.XA
Publication of CN110263343A
Application granted
Publication of CN110263343B
Status: Active
Anticipated expiration


Abstract

Translated from Chinese

The invention relates to the technical fields of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors. The main technical scheme comprises: segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set; constructing vector representations for the many phrases contained in the candidate term set; computing the topic weight of each candidate term; constructing a graph with the candidate terms as vertices and their co-occurrence information as edges, computing the edge weights from the semantic similarity and co-occurrence information between candidate terms, and iteratively computing and ranking the score of each candidate term. The keyword extraction method and system introduce both the topic information of the document and, through the semantic similarity between phrases, its context information; they better capture the key terms of the full text, with high semantic precision and a wide range of applications.

Description

Translated from Chinese
Keyword extraction method and system based on phrase vectors

Technical Field

The invention relates to the technical fields of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors.

Background Art

In recent years, massive data has brought people great convenience, but it has also posed enormous challenges for data analysis and retrieval. Against this big-data background, quickly obtaining the needed key information from massive data has become a problem that urgently needs solving. Keyword extraction refers to automatically extracting important, topical words or phrases from documents by means of an algorithm. In scientific literature, keywords or key phrases help users quickly grasp the content of a paper; they can also serve as search terms in information retrieval, natural language processing, and text mining. Word vectors that capture word semantics have already been applied to the keyword extraction task with good results. However, many professional papers, including corporate papers, contain a large number of proper nouns, and these nouns are often not single words but phrases, so word vectors alone are insufficient for the keyword extraction task and vector representations must be constructed for phrases.

Scholars have proposed constructing phrase vectors by combining word vectors with an autoencoder. An autoencoder consists of only two parts, an encoder and a decoder. When an autoencoder is used to combine word vectors into a phrase vector, the representation of each word in the phrase is fed into the encoder, which compresses them into an intermediate hidden-layer vector; the decoder then reconstructs the input phrase from this hidden-layer vector, so the intermediate vector can be regarded as a phrase vector representation containing the semantic information. However, a traditional autoencoder encodes and decodes directly with a basic fully connected network, in which adjacent layers are fully connected and the nodes within a layer are unconnected; such a plain autoencoder cannot handle the sequence information in a structure like a phrase.

In addition, existing algorithms compute the semantic similarity of words only through word vectors, ignoring the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to build a graph whose vertices are the candidate terms of a document and whose edges are constructed from the co-occurrence relations of those terms in the document; the candidate terms then vote for one another to iteratively compute weights, and finally the candidate terms are ranked by score to determine the extracted keywords. In traditional TextRank, the initial weight of every vertex is 1 (or 1/n, where n is the number of vertices) and the weight of every edge is also set to 1, so each vertex distributes its votes evenly among the vertices connected to it. This method is simple and convenient, but it ignores both the topicality of the document and the semantic relations between vertices.

In a recurrent neural network (RNN), the nodes between hidden layers are connected rather than unconnected, and the input of the hidden layer contains not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suitable for encoding sequence data. However, during RNN propagation, the forgetting of historical information and the accumulation of errors are serious problems, which are now usually alleviated with long short-term memory networks (LSTM).

LSTM is a special type of RNN. It uses a cell state to record information; the cell state undergoes only a small amount of linear interaction as it travels along the sequence, so historical information is well preserved. LSTM then uses gating mechanisms to protect and control the cell state. A gate is an abstract concept; concretely it consists of a sigmoid function and a point-wise multiplication. A gate controls the flow of information by outputting a value between 0 and 1: the closer the output is to 0, the less information is allowed through, and the closer it is to 1, the more is allowed through.

In an LSTM unit, the first thing to process is the information passed on from the previous step; LSTM controls the forgetting and retention of historical information through the forget gate. Based on the current information, the forget gate $f_t$ decides whether the previous information should be forgotten:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

where $\sigma$ denotes the sigmoid function, and $W_f$ and $b_f$ denote the weight matrix and bias of the forget gate, respectively.

Next, LSTM processes the current input. The input gate first controls which part of the current input is retained; a tanh function then creates a candidate cell state $\tilde{C}_t$ through which the information at the current step is added to the cell state:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

Through the forget gate and the input gate, LSTM can decide which past information should be kept and which current information should be stored, giving the current cell state $C_t$:

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

Finally, LSTM uses a sigmoid function to decide, through the output gate, what should be output at the current step based on the historical information and the current input; as with the input, the output state is filtered with a tanh function:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

$h_t = o_t * \tanh(C_t)$

Through this carefully designed gate mechanism, the long short-term memory network can remember earlier information while avoiding the vanishing-gradient problem.
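To make the gate equations above concrete, here is a minimal sketch of one LSTM step in Python with NumPy. It illustrates the standard equations only, not code from the invention; the function and parameter names are chosen for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step following the gate equations in the text.

    x_t is the current input vector; h_prev and C_prev are the previous
    hidden state and cell state. Each W_* has shape (hidden, hidden + input)
    and each b_* has shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)        # forget gate
    i_t = sigmoid(W_i @ concat + b_i)        # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ concat + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t
```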

Summary of the Invention

To solve the two problems that word vectors are insufficient for the keyword extraction task and that existing algorithms ignore the topic information of the text, the invention provides a keyword extraction method and system based on phrase vectors.

To achieve the above object, in a first aspect the invention provides a keyword extraction method based on phrase vectors, the method comprising:

S1. Segmenting the text and tagging parts of speech, retaining n-grams to obtain a candidate term set;

S2. Constructing phrase vectors for the candidate terms with an autoencoder;

S3. Determining the topic of the text, computing the similarity between each candidate term and the topic vector, and using this similarity as the topic weight of the candidate term;

S4. Obtaining keywords from the candidate term set with the TextRank algorithm.

Further, the autoencoder in step S2 comprises an encoder and a decoder; the encoder consists of a bidirectional LSTM layer and a fully connected layer, and the decoder consists of a unidirectional LSTM layer and a softmax layer.

Further, the autoencoder in step S2 comprises an encoder and a decoder, and its training method comprises the following steps:

S21. Selecting training samples and obtaining candidate terms;

S22. For a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder with a bidirectional LSTM in the forward and backward directions respectively:

$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}})$

$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}})$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden states and cell states in the left-to-right and right-to-left directions at time $t$ ($t = 1, 2, \ldots, T$), $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the corresponding states at time $t-1$, $x_t$ is the word of the candidate term input at time $t$, and $T$ is the number of words in the candidate term;

S23. In the encoder, computing $E_{ST}$ by the formulas:

$h_T = [\overrightarrow{h_T}; \overleftarrow{h_T}], \quad C_T = [\overrightarrow{C_T}; \overleftarrow{C_T}]$

$h'_T = f(W_h h_T + b_h)$

$C'_T = f(W_c C_T + b_c)$

$E_{ST} = (h'_T, C'_T)$

where $[\cdot\,;\cdot]$ is the concatenation operator, $W_h$, $b_h$, $W_c$, $b_c$ denote the parameter matrices and biases of the fully connected network, $f$ is the ReLU activation function of the fully connected network, and $E_{ST}$ is a tuple composed of $h'_T$ and $C'_T$;

S24. In the decoder, decoding with a unidirectional LSTM using $E_{ST}$ as the initial state:

$z_t = \mathrm{LSTM}(\hat{y}_{t-1}, z_{t-1})$

where $z_t$ is the hidden state of the decoder at time $t$, $z_{t-1}$ is the hidden state at time $t-1$, $E_{ST}$ is the encoder state, and $\hat{y}_{t-1}$ is the word of the candidate term output at time $t-1$;

S25. Estimating the probability of the current word from $z_t$:

$p(\hat{y}_t) = \mathrm{softmax}(W_s z_t + b_s)$

where $W_s z_t + b_s$ scores every possible output word and softmax is the normalization function;

S26. When the loss function L keeps decreasing during training and finally stabilizes, obtaining the encoder parameters $W_h$, $b_h$, $W_c$, $b_c$ and the decoder parameters $W_s$, $z_t$, thereby determining the autoencoder; where the loss function L is the negative log-likelihood of the input words:

$L = -\sum_{t=1}^{T} \log p(\hat{y}_t = x_t)$

Further, in step S2, the candidate term is input into the autoencoder, and the values in the $E_{ST}$ output by the encoder form the phrase vector of the candidate term.

Further, the topic vector $\vec{v}_{d_i}$ in step S3 is computed as:

$\vec{v}_{d_i} = \frac{1}{n} \sum_{t_i \in T_{d_i}} \vec{v}_{t_i}$

where $\vec{v}_{t_i}$ is the vector representation of topic term $t_i$ and $\vec{v}_{d_i}$ is the topic vector representation of the text $d_i$.

Further, in the TextRank algorithm of step S4, if candidate terms $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, whose weight is computed as:

$w_{jk} = similarity(c_j, c_k) \times occur_{count}(c_j, c_k)$

where $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ are the vector representations of candidate terms $c_j$ and $c_k$, $occur_{count}(c_j, c_k)$ is the number of times $c_j$ and $c_k$ co-occur in the co-occurrence window, $similarity(c_j, c_k)$ is the similarity between $c_j$ and $c_k$, and $w_{jk}$ is the weight of the edge between $c_j$ and $c_k$.

Further, the TextRank algorithm of step S4 also includes iteratively computing vertex weights, comprising:

Iteratively computing the weight of each candidate term until the maximum number of iterations is reached, the weight score $WS(c_j)$ being computed as:

$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in Adj(c_j)} \frac{w_{jk}}{\sum_{c_p \in Adj(c_k)} w_{kp}}\, WS(c_k)$

where $WS(c_j)$ is the score of candidate term $c_j$; $d$ is the damping coefficient, preferably 0.85; $\theta_{c_j}$ is the topic weight of candidate term $c_j$; $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$, and $w_{kp}$ the weight of the edge between candidate terms $c_k$ and $c_p$; $Adj(c_j)$ is the set of candidate terms connected to $c_j$, of which $c_k$ is an element, and likewise $Adj(c_k)$ is the set of candidate terms connected to $c_k$, of which $c_p$ is an element.

In a second aspect, the invention provides a keyword extraction system based on phrase vectors, comprising a text preprocessing module for segmenting the original text, tagging parts of speech, and retaining n-grams according to part of speech to obtain a candidate term set;

a phrase vector construction module for obtaining, for each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a phrase vector with semantic representation through the autoencoder;

a topic weight computation module for computing the topic weight of each candidate term;

a candidate term ranking module for computing a weight score for each candidate term and taking the top K candidate terms as keywords.

Further, the system also comprises an autoencoder training module for obtaining the autoencoder parameters through sample training, thereby determining the autoencoder.

Compared with existing keyword extraction methods and systems, the keyword extraction method and system based on phrase vectors provided by the invention have the following beneficial effects:

1. The method and system introduce both the topic information of the document and, through the semantic similarity between words, its context information; they better capture the key terms of the full text and make the extracted keywords more accurate.

2. The method and system obtain keywords using phrase vectors, which keeps the computation concise and efficient.

3. The phrase vector computation method innovatively introduces an LSTM-based autoencoder to compress word vectors; it represents the semantic information of phrases better, with higher semantic precision and a wider range of applications.

4. The invention improves the TextRank algorithm: phrase vectors are innovatively used to compute a topic weight for each candidate term, and edge weights are computed jointly from the semantic similarity and co-occurrence information between candidate terms. This both takes the topic of the whole document into account and introduces semantic information between vertices, making the ranking algorithm more accurate.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure or the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a structural diagram of the autoencoder of one embodiment of the invention;

Figure 2 is a flowchart of the keyword extraction method based on phrase vectors of one embodiment of the invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some rather than all of the embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention. Where there is no conflict, the examples in this application may be combined with one another.

The invention is further described below with reference to the drawings and specific embodiments.

The invention provides a keyword extraction method based on phrase vectors; as shown in Figure 2, the method comprises the following steps:

S1. Segment the original text $d_i$ and tag parts of speech; retain n-grams according to part of speech to obtain the candidate term set $C_{d_i}$.
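As a concrete illustration of S1, the sketch below selects candidate terms with the Jieba POS tagger, keeping 1- to 3-grams of consecutive nouns, verbs, and noun-verbs, as the experiments later in the text do. It is a minimal reading of the step under those assumptions, not the patent's exact preprocessing; the allowed POS set and the helper name are illustrative.

```python
import jieba.posseg as pseg

ALLOWED_POS = {"n", "v", "vn"}  # nouns, verbs, noun-verbs (cf. Table 4)

def candidate_terms(text, max_n=3):
    """Keep 1- to max_n-grams of consecutive allowed-POS words as candidates."""
    tagged = [(p.word, p.flag) for p in pseg.cut(text)]
    candidates = set()
    for i in range(len(tagged)):
        for n in range(1, max_n + 1):
            window = tagged[i:i + n]
            if len(window) == n and all(flag in ALLOWED_POS for _, flag in window):
                candidates.add("".join(word for word, _ in window))
    return candidates
```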

S2. For each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, obtain its phrase vector representation through the autoencoder, where $x_i$ is the word vector representation of the $i$-th word in candidate term $c_j$ and $T$ is the number of words in the candidate term.

S3. Compute the similarity between each candidate term $c_j$ and the topic vector $\vec{v}_{d_i}$ as its topic weight $\theta_{c_j}$, where $d_i$ denotes the $i$-th document. The autoencoder comprises an encoder and a decoder; the encoder consists of a bidirectional LSTM layer and a fully connected layer, and the decoder consists of a unidirectional LSTM layer and a softmax layer.

S4. Obtain keywords from the candidate term set through the improved TextRank algorithm.

In step S2, for each candidate term $c_j$ to be input, the encoder computes with a bidirectional LSTM in the forward and backward directions, takes the hidden state $h_T$ and cell state $C_T$ of the last time step as the final states, concatenates them, and finally obtains the encoder output $E_{ST}$ through a fully connected layer.

The decoder takes $E_{ST}$ as its initial input and decodes with a unidirectional LSTM structure; a softmax layer gives the probability distribution at each decoding step, and the loss function L maximizes the probability of decoding the correct word at each step.

The purpose of training is to optimize the parameters of the autoencoder so that the decoder, taking the encoder output as its input, restores as much as possible of the semantic information of the candidate term fed to the encoder.

The specific training method is as follows:

(1) Select training samples, then, as in S1, segment the samples and perform the other preprocessing operations to obtain the candidate term set.

A candidate term is denoted $c_j = (x_1, x_2, \ldots, x_T)$, where $x_i$ is the word vector representation of the $i$-th word in candidate term $c_j$ and $T$ is the number of words in the candidate term. Taking the candidate term $c_j$ = "北京理工大学" (Beijing Institute of Technology) as an example, $x_1$ is the word vector of "北京", $x_2$ the word vector of "理工", and $x_3$ the word vector of "大学".

(2) Train the model with a large number of candidate terms. Taking the candidate term "北京理工大学" as an example, the input is the word vector representations of "北京", "理工", and "大学"; encoding yields the phrase vector representation of "北京理工大学", and decoding from this phrase vector yields, in order, the probabilities of the sequence "北京", "理工", "大学", which training maximizes.

For each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, the encoder computes with a bidirectional LSTM in the forward and backward directions respectively:

$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}})$

$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}})$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden states and cell states in the left-to-right and right-to-left directions at time $t$ ($t = 1, 2, \ldots, T$), $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the corresponding states at time $t-1$, and $x_t$ is the word of the candidate term input at time $t$. At each time step, the computation of the current hidden state $h_t$ and cell state $C_t$ depends on the previous hidden state $h_{t-1}$, the previous cell state $C_{t-1}$, and the current input $x_t$.

Take the hidden state $h_T$ and cell state $C_T$ of the last time step as the final states and directly concatenate the states of the two directions. In addition, to provide the decoder with a fixed-size input, the concatenated states are processed by a fully connected layer. A fixed-size decoder input $E_{ST}$ is obtained by the following formulas:

$h_T = [\overrightarrow{h_T}; \overleftarrow{h_T}], \quad C_T = [\overrightarrow{C_T}; \overleftarrow{C_T}]$

$h'_T = f(W_h h_T + b_h)$

$C'_T = f(W_c C_T + b_c)$

$E_{ST} = (h'_T, C'_T)$

where $[\cdot\,;\cdot]$ is the concatenation operator, $W_h$, $b_h$, $W_c$, $b_c$ denote the parameter matrices and biases of the fully connected network, $f$ is the ReLU activation function of the fully connected network, and $E_{ST}$ is the tuple of $h'_T$ and $C'_T$ that is finally provided to the decoder.

In the decoder, decode with a unidirectional LSTM using $E_{ST}$ as the initial state:

$z_t = \mathrm{LSTM}(\hat{y}_{t-1}, z_{t-1})$

where $z_t$ is the hidden state of the decoder at time $t$, $z_{t-1}$ is the hidden state at time $t-1$, $E_{ST}$ is the encoder state, and $\hat{y}_{t-1}$ is the word of the candidate term output at time $t-1$.

Estimate the probability of the current word from $z_t$:

$p(\hat{y}_t) = \mathrm{softmax}(W_s z_t + b_s)$

where $W_s$ is a parameter matrix and $z_t$ is the hidden state of the decoder at time $t$; $W_s z_t + b_s$ scores every possible output word, and softmax normalization gives the probability of each word.

The training objective of the autoencoder is to maximize the probability of outputting the correct phrase. The autoencoder outputs a probability for each word, and the training objective is to maximize the probability of the correct word at each step; that is, training proceeds according to the loss function L, adjusting the parameters of the autoencoder (including the parameters inside the LSTMs, $W_h$, $b_h$, $W_c$, $b_c$ in the encoder, and $W_s$, $z_t$ in the decoder). When the loss keeps decreasing during training and finally stabilizes, the intermediate vector represents the phrase semantics well, and we can take the intermediate vector representation as the phrase vector. The loss function L is the negative log-likelihood of the input words:

$L = -\sum_{t=1}^{T} \log p(\hat{y}_t = x_t)$

After training, the loss value stabilizes and the autoencoder training is complete; a candidate term fed into the encoder yields the values in $E_{ST}$ as its phrase vector. Through the autoencoder constructed above, the word vectors are compressed using the sequence information of the candidate term, giving the phrase vector representation of the candidate term.

Once training is complete, whenever the phrase vector representation of a candidate term is needed, only the encoder part needs to be computed to obtain its phrase vector representation $E_{ST}$; the resulting $E_{ST}$ reflects the semantic information of the candidate term as a whole.
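The following PyTorch module is a minimal sketch of the autoencoder described above: a bidirectional LSTM encoder whose concatenated final states pass through fully connected ReLU layers to form $E_{ST}$, and a unidirectional LSTM decoder scored by a softmax over the vocabulary. Layer sizes, the teacher-forcing scheme, and all names are assumptions for illustration, not the patent's code.

```python
import torch
import torch.nn as nn

class PhraseAutoencoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                               batch_first=True)
        # Fully connected layers compressing the concatenated final states to E_ST.
        self.fc_h = nn.Linear(2 * hidden_dim, hidden_dim)
        self.fc_c = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # scores W_s z_t + b_s

    def encode(self, words):                  # words: (batch, T) word ids
        emb = self.embed(words)
        _, (h_n, c_n) = self.encoder(emb)     # h_n, c_n: (2, batch, hidden)
        h_st = torch.relu(self.fc_h(torch.cat([h_n[0], h_n[1]], dim=-1)))  # h'_T
        c_st = torch.relu(self.fc_c(torch.cat([c_n[0], c_n[1]], dim=-1)))  # C'_T
        return h_st, c_st                     # E_ST, used as the phrase vector

    def forward(self, words):
        h_st, c_st = self.encode(words)
        # Teacher forcing; a full implementation would shift the input right
        # with a begin-of-sequence token.
        emb = self.embed(words)
        z, _ = self.decoder(emb, (h_st.unsqueeze(0), c_st.unsqueeze(0)))
        return self.out(z)                    # per-step logits over the vocabulary
```

Training would minimize the cross-entropy between the per-step logits and the input word ids, which corresponds to the negative log-likelihood L above; after training, only `encode` is needed to obtain phrase vectors.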

In step S3, the topic weight is computed as follows:

(1) Determine the topic term set: take a highly summarizing topic sentence or paragraph of the text as representative, for example the title or abstract of a paper, determine the topic terms of the text from it, and add them to the topic term set of the text, $T_{d_i} = \{t_1, t_2, \ldots, t_n\}$, where $d_i$ denotes the $i$-th document and $n$ is the number of elements in the topic term set. For example, for the title "新形势下采矿设计行业发展思路实例分析" ("Case analysis of development ideas for the mining design industry under the new situation"), the topic term set could be "采矿设计" (mining design), "发展思路" (development ideas), and "实例分析" (case analysis).

(2) Compute the topic vector: compute the average of the word or phrase vectors of all terms in the topic term set as the topic vector of the document, representing the topic of the whole document:

$\vec{v}_{d_i} = \frac{1}{n} \sum_{t_i \in T_{d_i}} \vec{v}_{t_i}$

where $\vec{v}_{t_i}$ is the vector representation of topic term $t_i$ and $\vec{v}_{d_i}$ is the topic vector representation of document $d_i$.

(3) Compute the topic weight: for each candidate term $c_j$, compute the cosine similarity between it and the topic vector $\vec{v}_{d_i}$ of document $d_i$ as its topic weight:

$\theta_{c_j} = \cos(\vec{v}_{c_j}, \vec{v}_{d_i})$

where $\theta_{c_j}$ is the topic weight of candidate term $c_j$ in document $d_i$, $\vec{v}_{c_j}$ is the vector representation of candidate term $c_j$, and cos denotes cosine similarity.

Through steps (1)-(3) above, every candidate term is assigned a topic weight between 0 and 1. A topic weight of 1 means the candidate term is closest to the topic of the text, while 0 means it is far from the topic of the text.
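A small sketch of steps (1)-(3), assuming phrase vectors are already available as a dict from term to NumPy vector; `phrase_vec` and the function names are illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_weights(topic_terms, candidates, phrase_vec):
    """Topic vector = mean of the topic-term phrase vectors; weight = cosine to it."""
    topic_vector = np.mean([phrase_vec[t] for t in topic_terms], axis=0)
    return {c: cosine(phrase_vec[c], topic_vector) for c in candidates}
```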

In step S4, construct an undirected graph with the candidate term set of document $d_i$ as vertices, compute the weight score $WS(c_j)$ of each candidate term $c_j$, and take the top K candidate terms as keywords. This is achieved with an improved TextRank algorithm, as follows:

(1) Construct the undirected graph: construct an undirected graph with all elements of the candidate term set of document $d_i$ as vertices, where an edge exists between candidate terms $c_j$ and $c_k$ if they appear together in a co-occurrence window of length n.

(2) Compute the edge weights: the edge weights are one improvement of the invention, and their computation also relies on the phrase vectors constructed by the autoencoder. Each edge in the graph is assigned a weight $w_{jk}$ from the cosine similarity $similarity(c_j, c_k)$ between the vector representations of the two candidate terms $c_j$ and $c_k$ and their co-occurrence count $occur_{count}(c_j, c_k)$:

$similarity(c_j, c_k) = \cos(\vec{v}_{c_j}, \vec{v}_{c_k})$

$w_{jk} = similarity(c_j, c_k) \times occur_{count}(c_j, c_k)$

where $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ are the vector representations of candidate terms $c_j$ and $c_k$, cos denotes the cosine similarity of the vectors, and $occur_{count}(c_j, c_k)$ is the number of times $c_j$ and $c_k$ co-occur within the co-occurrence window. Multiplying the two uses the number of times the terms appear together to reinforce their semantic connection; $w_{jk}$ is the weight of the edge between $c_j$ and $c_k$.
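A sketch of the edge construction, reusing `cosine` and `phrase_vec` from the previous sketch; the window size and names are assumptions:

```python
from collections import Counter

def build_edges(term_sequence, phrase_vec, window=3):
    """Edge weight w_jk = cosine similarity x co-occurrence count in the window."""
    cooccur = Counter()
    for i, term in enumerate(term_sequence):
        for other in term_sequence[i + 1:i + window]:
            if other != term:
                cooccur[tuple(sorted((term, other)))] += 1
    return {(a, b): cosine(phrase_vec[a], phrase_vec[b]) * count
            for (a, b), count in cooccur.items()}
```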

(3) Iteratively compute the vertex weights: the vertex weights are also an improvement of the invention. Iteratively compute the weight of each vertex in the graph until the maximum number of iterations is reached, the weight score being computed as:

$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in Adj(c_j)} \frac{w_{jk}}{\sum_{c_p \in Adj(c_k)} w_{kp}}\, WS(c_k)$

where $WS(c_j)$ is the weight of candidate term $c_j$ in document $d_i$, and $d$ is the damping coefficient, whose role is to give every vertex a certain probability of voting for the other vertices so that every vertex receives a nonzero score and the algorithm can converge after enough iterations; it is usually set to 0.85. $\theta_{c_j}$ is the topic weight of candidate term $c_j$ in document $d_i$; $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$, and $w_{kp}$ the weight of the edge between candidate terms $c_k$ and $c_p$; $Adj(c_j)$ is the set of candidate terms connected to $c_j$, of which $c_k$ is an element, and likewise $Adj(c_k)$ is the set of candidate terms connected to $c_k$, of which $c_p$ is an element; $WS(c_k)$ is the weight of candidate term $c_k$ in document $d_i$. The second term on the right-hand side represents the votes cast for $c_j$ by the vertices connected to it.

(4) Rank the candidate terms: after many iterations, every vertex in the graph reaches a stable score. Sort the candidate term set by weight score in descending order and keep the top K candidate terms as the keywords of the document.
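Putting steps (3) and (4) together, a minimal sketch of the topic-biased iteration; the fixed iteration count, initialization, and names are assumptions:

```python
def rank_terms(candidates, edges, theta, d=0.85, max_iter=100, top_k=10):
    """Topic-biased TextRank over the weighted undirected graph."""
    adj = {c: {} for c in candidates}
    for (a, b), w in edges.items():
        adj[a][b] = w
        adj[b][a] = w
    ws = {c: theta.get(c, 0.0) for c in candidates}  # initialize with topic weights
    for _ in range(max_iter):
        ws = {c: (1 - d) * theta.get(c, 0.0)
                 + d * sum(w / sum(adj[k].values()) * ws[k]
                           for k, w in adj[c].items())
              for c in candidates}
    return sorted(ws, key=ws.get, reverse=True)[:top_k]
```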

After the above four steps S1-S4, the keywords of the document can be extracted.

The invention also provides a keyword extraction system based on phrase vectors, comprising:

a text preprocessing module for segmenting the original text, tagging parts of speech, and retaining n-grams according to part of speech to obtain the candidate term set;

a phrase vector construction module for obtaining, for each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a phrase vector with semantic representation through the autoencoder;

a topic weight computation module for computing the topic weight of each candidate term, with the computation method described above;

a candidate term ranking module for computing a weight score for each candidate term and taking the top K candidate terms as keywords, with the selection method described above.

Further, the system also comprises an autoencoder training module for processing the sequence information in the phrase structure and obtaining the phrase vector representations of the candidate terms, with the training method described above.

The following takes the enterprise paper data in an enterprise paper database as an example to illustrate the concrete keyword extraction method based on phrase vectors.

The enterprise paper database contains enterprise paper data from environmental protection and many other fields; the data include fields such as "title", "year", "abstract", "keywords", "English keywords", and "classification number". During keyword extraction, the "title" and "abstract" fields serve as the text content, and the "keywords" field serves as labeled data for verifying the extraction results.

When training the autoencoder, the "keywords" field of the database is taken as training data; some of the training parameters are shown in Table 1.

Table 1. Training parameter settings

Before keyword extraction, the labeled data are analyzed to determine some of the parameters of the algorithm. The dataset contains 59,913 papers, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e., the number of words each keyword contains, is counted; the results are shown in Table 2. Table 2 shows that the average length of all keywords is 1.98 and that the vast majority have lengths between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254,376 keywords. Therefore, 1-grams, 2-grams, and 3-grams of the text are retained when selecting candidate terms.

Then, the parts of speech of all words in the keywords are counted; the statistics are shown in Table 3. Part-of-speech tagging is done with the Jieba segmentation tool, and some of the POS tags are explained in Table 4. According to Table 3, the part-of-speech distribution of the words in keywords is less concentrated than the length distribution, but it still clusters mainly on nouns, verbs, and verbs with noun function, which together account for 73.1% of all word POS tags. Therefore, nouns, verbs, and noun-verbs in the text, and their combinations, are taken as candidate terms.

Table 2. Keyword length distribution

Table 3. Word part-of-speech distribution

Table 4. Jieba part-of-speech tags

Since the text content includes only the title and abstract of each paper, the title is taken as representative of the full-text topic when computing topic weights, and candidate terms extracted from the title are used to compute the topic vector of the text. The co-occurrence window size used in candidate term ranking is initially set to 3, and the number of candidate terms finally retained is set to 10; results are shown in Table 5.

Table 5. Keyword extraction results (partial)

Preferably, the invention takes one paper record from the enterprise paper database as an example and gives the concrete keyword extraction process.

The data content is: "新形势下采矿设计行业发展思路实例分析 — This paper reviews the ten-year period of rapid development of the coal industry and its profound influence on the mining design market. Against the background of the current rapid economic downturn of the coal industry and fierce competition in the coal design market, and taking the development of the mining specialty of the Tiandi Science and Technology Design Institute as an example, it analyzes the characteristics of the human resources and business changes of the mining specialty, puts forward development ideas and implementation measures for the mining specialty, and provides a reference for the development of the mining specialty of other design enterprises."

Here, "新形势下采矿设计行业发展思路实例分析" ("Case analysis of development ideas for the mining design industry under the new situation") is the title of the paper, and the remaining content is its abstract.

Candidate terms are selected through n-gram terms and part-of-speech tagging, and the candidate terms selected from the title of the paper form the topic term set of the text. The selected candidate terms are shown in Table 6.

Table 6. Candidate term results

The autoencoder is used to obtain the phrase vector representations of all terms in the topic term set, and their average is computed as the topic vector of the text. The computed topic vector of the document has size 400; some of its values are shown in Table 7.

Table 7. Topic vector results (partial)

For each candidate term, the cosine similarity between it and the topic vector of the text is computed to obtain its topic weight; some values are shown in Table 8.

Table 8. Topic weight results (partial)

An undirected graph is constructed with the candidate terms as vertices and their co-occurrence information as edges. Each edge in the graph is assigned a weight from the cosine similarity between the vector representations of its two candidate terms and their co-occurrence count, and the vertex weights are then computed by iterating with the topic weights and edge weights. After many iterations, every vertex in the graph obtains a stable score; some scores are shown in Table 9.

Table 9. Weight score results (partial)

The resulting scores are sorted, and the Top-10 candidate terms with the highest scores are taken as the final keywords, as shown in Table 10.

Table 10. Keyword extraction results (partial)

It should be noted that "first" and "second" in this document are used only to distinguish entities or operations with the same name and do not imply any order or relation between those entities or operations.

Those of ordinary skill in the art will understand that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some or all of the technical features with equivalents, and that such modifications and replacements do not remove the essence of the corresponding technical solutions from the scope defined by the claims of the invention.

Claims (9)

Translated from Chinese
1. A keyword extraction method based on phrase vectors, characterized in that the method comprises:

S1. Segmenting the text and tagging parts of speech, retaining n-grams to obtain a candidate term set;

S2. Constructing phrase vectors for the candidate terms with an autoencoder;

S3. Determining the topic of the text, computing the similarity between each candidate term and the topic vector, and using this similarity as the topic weight of the candidate term;

S4. Obtaining keywords from the candidate term set with the TextRank algorithm.

2. The method according to claim 1, characterized in that the autoencoder in step S2 comprises an encoder and a decoder, the encoder consisting of a bidirectional LSTM layer and a fully connected layer, and the decoder consisting of a unidirectional LSTM layer and a softmax layer.

3. The method according to claim 2, characterized in that the training method of the autoencoder in step S2 comprises the following steps:

S21. Selecting training samples and obtaining candidate terms;

S22. For a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder with a bidirectional LSTM in the forward and backward directions respectively:

$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}})$

$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}})$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden states and cell states in the left-to-right and right-to-left directions at time $t$ ($t = 1, 2, \ldots, T$), $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the corresponding states at time $t-1$, $x_t$ is the word of the candidate term input at time $t$, and $T$ is the number of words in the candidate term;

S23. In the encoder, computing $E_{ST}$ by the formulas:

$h_T = [\overrightarrow{h_T}; \overleftarrow{h_T}], \quad C_T = [\overrightarrow{C_T}; \overleftarrow{C_T}]$

$h'_T = f(W_h h_T + b_h)$

$C'_T = f(W_c C_T + b_c)$

$E_{ST} = (h'_T, C'_T)$

where $[\cdot\,;\cdot]$ is the concatenation operator, $W_h$, $b_h$, $W_c$, $b_c$ denote the parameter matrices and biases of the fully connected network, $f$ is the ReLU activation function of the fully connected network, and $E_{ST}$ is a tuple composed of $h'_T$ and $C'_T$;

S24. In the decoder, decoding with a unidirectional LSTM using $E_{ST}$ as the initial state:

$z_t = \mathrm{LSTM}(\hat{y}_{t-1}, z_{t-1})$

where $z_t$ is the hidden state of the decoder at time $t$, $z_{t-1}$ is the hidden state at time $t-1$, $E_{ST}$ is the encoder state, and $\hat{y}_{t-1}$ is the word of the candidate term output at time $t-1$;

S25. Estimating the probability of the current word from $z_t$:

$p(\hat{y}_t) = \mathrm{softmax}(W_s z_t + b_s)$

where $W_s z_t + b_s$ scores every possible output word and softmax is the normalization function;

S26. When the loss function L keeps decreasing during training and finally stabilizes, obtaining the encoder parameters $W_h$, $b_h$, $W_c$, $b_c$ and the decoder parameters $W_s$, $z_t$, thereby determining the autoencoder; where the loss function L is the negative log-likelihood of the input words:

$L = -\sum_{t=1}^{T} \log p(\hat{y}_t = x_t)$

4. The method according to claim 3, characterized in that in step S2 the candidate term is input into the autoencoder, and the values in the $E_{ST}$ output by the encoder form the phrase vector of the candidate term.

5. The method according to claim 1, characterized in that the topic vector $\vec{v}_{d_i}$ in step S3 is computed as:

$\vec{v}_{d_i} = \frac{1}{n} \sum_{t_i \in T_{d_i}} \vec{v}_{t_i}$

where $\vec{v}_{t_i}$ is the vector representation of topic term $t_i$ and $\vec{v}_{d_i}$ is the topic vector representation of the text $d_i$.

6. The method according to claim 1, characterized in that in the TextRank algorithm of step S4, if candidate terms $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, whose weight is computed as:

$w_{jk} = similarity(c_j, c_k) \times occur_{count}(c_j, c_k)$

where $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ are the vector representations of candidate terms $c_j$ and $c_k$, $occur_{count}(c_j, c_k)$ is the number of times $c_j$ and $c_k$ co-occur in the co-occurrence window, $similarity(c_j, c_k)$ is the similarity between $c_j$ and $c_k$, and $w_{jk}$ is the weight of the edge between $c_j$ and $c_k$.

7. The method according to claim 6, characterized in that the TextRank algorithm of step S4 further comprises iteratively computing the weights of the candidate terms until the maximum number of iterations is reached, the weight $WS(c_j)$ being computed as:

$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in Adj(c_j)} \frac{w_{jk}}{\sum_{c_p \in Adj(c_k)} w_{kp}}\, WS(c_k)$

where $WS(c_j)$ is the weight of candidate term $c_j$; $d$ is the damping coefficient, preferably 0.85; $\theta_{c_j}$ is the topic weight of candidate term $c_j$; $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$, and $w_{kp}$ the weight of the edge between candidate terms $c_k$ and $c_p$; $Adj(c_j)$ is the set of candidate terms connected to $c_j$, of which $c_k$ is an element; $Adj(c_k)$ is the set of candidate terms connected to $c_k$, of which $c_p$ is an element; and $WS(c_k)$ is the weight of candidate term $c_k$.

8. A keyword extraction system based on phrase vectors, characterized in that the system comprises:

a text preprocessing module for segmenting the original text, tagging parts of speech, and retaining n-grams according to part of speech to obtain a candidate term set;

a phrase vector construction module for obtaining, for each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a phrase vector with semantic representation through the autoencoder;

a topic weight computation module for computing the topic weight of each candidate term;

a candidate term ranking module for computing a weight score for each candidate term and taking the top K candidate terms as keywords.

9. The system according to claim 8, characterized in that the system further comprises an autoencoder training module for obtaining the autoencoder parameters through sample training, thereby determining the autoencoder.
CN201910548261.XA | Priority 2019-06-24 | Filed 2019-06-24 | Keyword extraction method and system based on phrase vectors | Active | Granted as CN110263343B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910548261.XA | 2019-06-24 | 2019-06-24 | Keyword extraction method and system based on phrase vectors (granted as CN110263343B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910548261.XA | 2019-06-24 | 2019-06-24 | Keyword extraction method and system based on phrase vectors (granted as CN110263343B)

Publications (2)

Publication Number | Publication Date
CN110263343A | 2019-09-20
CN110263343B | 2021-06-15

Family

ID=67920847

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN201910548261.XA | Keyword extraction method and system based on phrase vectors (granted as CN110263343B) | 2019-06-24 | 2019-06-24 | Active

Country Status (1)

Country | Link
CN (1) | CN110263343B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN111222333A (en)* | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | Keyword extraction method based on fusion of network high-order structure and topic model
CN111274428A (en)* | 2019-12-19 | 2020-06-12 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium
CN111785254A (en)* | 2020-07-24 | 2020-10-16 | 四川大学华西医院 | Self-service BLS training and assessment system based on simulator
CN112818686A (en)* | 2021-03-23 | 2021-05-18 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment
CN113312532A (en)* | 2021-06-01 | 2021-08-27 | 哈尔滨工业大学 | Public opinion grade prediction method based on deep learning for the public inspection field
CN114491030A (en)* | 2022-01-19 | 2022-05-13 | 北京百度网讯科技有限公司 | Skill label extraction and candidate phrase classification model training method and device
CN114580394A (en)* | 2020-12-01 | 2022-06-03 | 北大方正集团有限公司 | Key phrase extraction method and device
CN115146027A (en)* | 2022-05-31 | 2022-10-04 | 招联消费金融有限公司 | Text vectorized storage and retrieval method, device and computer equipment
CN116795956A (en)* | 2022-03-15 | 2023-09-22 | 华为技术有限公司 | Method for acquiring key phrases and related equipment

Citations (19)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
KR20080017686A (en)* | 2006-08-22 | 2008-02-27 | 에스케이커뮤니케이션즈 주식회사 | Computer-readable recording medium containing topics for building search engines, programs for classifying documents, and programs that can execute them
US8019708B2 (en)* | 2007-12-05 | 2011-09-13 | Yahoo! Inc. | Methods and apparatus for computing graph similarity via signature similarity
CN103744835A (en)* | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extraction method based on a topic model
KR101656245B1 (en)* | 2015-09-09 | 2016-09-09 | 주식회사 위버플 | Method and system for extracting sentences
CN106372064A (en)* | 2016-11-18 | 2017-02-01 | 北京工业大学 | Feature word weight calculation method for text mining
CN106970910A (en)* | 2017-03-31 | 2017-07-21 | 北京奇艺世纪科技有限公司 | Keyword extraction method and device based on a graph model
CN106997382A (en)* | 2017-03-22 | 2017-08-01 | 山东大学 | Automatic innovation-intention label tagging method and system based on big data
CN107122413A (en)* | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | Keyword extraction method and device based on a graph model
CN107133213A (en)* | 2017-05-06 | 2017-09-05 | 广东药科大学 | Text summary extraction method and system based on an algorithm
CN107193803A (en)* | 2017-05-26 | 2017-09-22 | 北京东方科诺科技发展有限公司 | Semantics-based keyword extraction method for task-specific text
CN107247780A (en)* | 2017-06-12 | 2017-10-13 | 北京理工大学 | Patent document similarity measurement method based on a knowledge ontology
CN107832457A (en)* | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Method and system for building a power transmission and transformation equipment defect dictionary based on the TextRank algorithm
CN108460019A (en)* | 2018-02-28 | 2018-08-28 | 福州大学 | Emerging hot topic detection system based on an attention mechanism
CN108710611A (en)* | 2018-05-17 | 2018-10-26 | 南京大学 | Short text topic model generation method based on word networks and word vectors
CN108984526A (en)* | 2018-07-10 | 2018-12-11 | 北京理工大学 | Document topic vector extraction method based on deep learning
CN109614626A (en)* | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Automatic keyword extraction method based on a gravitational model
CN109726394A (en)* | 2018-12-18 | 2019-05-07 | 电子科技大学 | Short text topic clustering method based on a fused BTM model
CN109918510A (en)* | 2019-03-26 | 2019-06-21 | 中国科学技术大学 | Cross-domain keyword extraction method
CN109918660A (en)* | 2019-03-04 | 2019-06-21 | 北京邮电大学 | Keyword extraction method and device based on TextRank

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20080017686A (en)* | 2006-08-22 | 2008-02-27 | SK Communications Co., Ltd. | Computer-readable recording media containing topics for creating search engines and classifying documents, and programs that can perform them
US8019708B2 (en)* | 2007-12-05 | 2011-09-13 | Yahoo! Inc. | Methods and apparatus for computing graph similarity via signature similarity
CN103744835A (en)* | 2014-01-02 | 2014-04-23 | Shanghai University | Text keyword extraction method based on a topic model
KR101656245B1 (en)* | 2015-09-09 | 2016-09-09 | Uberple Co., Ltd. | Method and system for extracting sentences
CN106372064A (en)* | 2016-11-18 | 2017-02-01 | Beijing University of Technology | Feature word weight calculation method for text mining
CN106997382A (en)* | 2017-03-22 | 2017-08-01 | Shandong University | Automatic labeling method and system for innovation intention labels based on big data
CN106970910A (en)* | 2017-03-31 | 2017-07-21 | Beijing QIYI Century Science and Technology Co., Ltd. | Keyword extraction method and device based on a graph model
CN107122413A (en)* | 2017-03-31 | 2017-09-01 | Beijing QIYI Century Science and Technology Co., Ltd. | Keyword extraction method and device based on a graph model
CN107133213A (en)* | 2017-05-06 | 2017-09-05 | Guangdong Pharmaceutical University | Algorithm-based text summary extraction method and system
CN107193803A (en)* | 2017-05-26 | 2017-09-22 | Beijing Dongfang Keno Science and Technology Development Co., Ltd. | Semantics-based keyword extraction method for task-specific text
CN107247780A (en)* | 2017-06-12 | 2017-10-13 | Beijing Institute of Technology | Patent document similarity measurement method based on knowledge ontology
CN107832457A (en)* | 2017-11-24 | 2018-03-23 | Electric Power Research Institute of State Grid Shandong Electric Power Company | Method and system for building a power transmission and transformation equipment defect dictionary based on the TextRank algorithm
CN108460019A (en)* | 2018-02-28 | 2018-08-28 | Fuzhou University | Emerging hot topic detection system based on an attention mechanism
CN108710611A (en)* | 2018-05-17 | 2018-10-26 | Nanjing University | Short text topic model generation method based on word networks and word vectors
CN108984526A (en)* | 2018-07-10 | 2018-12-11 | Beijing Institute of Technology | Document topic vector extraction method based on deep learning
CN109726394A (en)* | 2018-12-18 | 2019-05-07 | University of Electronic Science and Technology of China | Short text topic clustering method based on a fused BTM model
CN109614626A (en)* | 2018-12-21 | 2019-04-12 | Beijing Information Science and Technology University | Automatic keyword extraction method based on a gravitational model
CN109918660A (en)* | 2019-03-04 | 2019-06-21 | Beijing University of Posts and Telecommunications | TextRank-based keyword extraction method and device
CN109918510A (en)* | 2019-03-26 | 2019-06-21 | University of Science and Technology of China | Cross-domain keyword extraction method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BASALDELLA, MARCO et al.: "Bidirectional LSTM recurrent neural network for keyphrase extraction", Italian Research Conference on Digital Libraries *
ZHANG, Lijing et al.: "Keyword extraction algorithm based on improved TextRank", Journal of Beijing Institute of Graphic Communication *
LI, Hang et al.: "TextRank keyword extraction method fusing multiple features", Journal of Intelligence *
HONG, Dongmei: "Research on automatic text summarization technology based on LSTM", China Master's Theses Full-text Database, Information Science and Technology *
QI, Yichen et al.: "Application of deep-learning-based Chinese extractive summarization methods", The Guide of Science & Education *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111274428A (en)* | 2019-12-19 | 2020-06-12 | Beijing Chuangxin Lvcheng Network Technology Co., Ltd. | Keyword extraction method and device, electronic equipment, and storage medium
CN111274428B (en)* | 2019-12-19 | 2023-06-30 | Beijing Chuangxin Lvcheng Network Technology Co., Ltd. | Keyword extraction method and device, electronic equipment, and storage medium
CN111222333A (en)* | 2020-04-22 | 2020-06-02 | Chengdu Sobey Digital Technology Co., Ltd. | Keyword extraction method fusing a network high-order structure and a topic model
CN111785254A (en)* | 2020-07-24 | 2020-10-16 | West China Hospital of Sichuan University | Simulator-based self-service BLS training and assessment system
CN111785254B (en)* | 2020-07-24 | 2023-04-07 | West China Hospital of Sichuan University | Manikin-based self-service BLS training and assessment system
CN114580394A (en)* | 2020-12-01 | 2022-06-03 | Peking University Founder Group Co., Ltd. | Key phrase extraction method and device
CN112818686A (en)* | 2021-03-23 | 2021-05-18 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Domain phrase mining method and device, and electronic equipment
CN112818686B (en)* | 2021-03-23 | 2023-10-31 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Domain phrase mining method and device, and electronic equipment
CN113312532A (en)* | 2021-06-01 | 2021-08-27 | Harbin Institute of Technology | Public opinion grade prediction method for the public inspection field based on deep learning
CN114491030A (en)* | 2022-01-19 | 2022-05-13 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training method and device for skill label extraction and candidate phrase classification models
US20230139642A1 (en)* | 2022-01-19 | 2023-05-04 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method and apparatus for extracting skill label
CN116795956A (en)* | 2022-03-15 | 2023-09-22 | Huawei Technologies Co., Ltd. | Method for acquiring key phrases and related device
CN115146027A (en)* | 2022-05-31 | 2022-10-04 | Zhaolian Consumer Finance Co., Ltd. | Text vectorized storage and retrieval method and device, and computer equipment
CN115146027B (en)* | 2022-05-31 | 2025-09-09 | Zhaolian Consumer Finance Co., Ltd. | Text vectorized storage and retrieval method and device, and computer equipment

Also Published As

Publication number | Publication date
CN110263343B (en) | 2021-06-15

Similar Documents

Publication | Title
CN110263343B (en) | Method and system for keyword extraction based on phrase vector
CN109977413B (en) | Sentiment analysis method based on improved CNN-LDA
CN107122413B (en) | Keyword extraction method and device based on graph model
CN109670039B (en) | Semi-supervised e-commerce review sentiment analysis method based on a tripartite graph and cluster analysis
CN110083682A (en) | Machine reading comprehension answer acquisition method based on a multi-round attention mechanism
CN109960786A (en) | Chinese word similarity calculation method based on a fusion strategy
CN110674252A (en) | High-precision semantic search system for the judicial domain
CN106649561A (en) | Intelligent question-answering system for tax consultation service
CN111078833B (en) | Text classification method based on neural networks
CN106997382A (en) | Automatic labeling method and system for innovation intention labels based on big data
CN111178053B (en) | Text generation method combining semantics and text structure for abstract extraction
CN111694927B (en) | Automatic document review method based on an improved word mover's distance algorithm
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk
CN114818717A (en) | Chinese named entity recognition method and system fusing lexical and syntactic information
CN113761125B (en) | Dynamic summary determination method and device, computing device, and computer storage medium
CN117708336B (en) | Multi-strategy sentiment analysis method based on topic enhancement and knowledge distillation
CN112069312A (en) | Text classification method based on entity recognition, and electronic device
CN111859955A (en) | Public opinion data analysis model based on deep learning
CN111858842A (en) | Judicial case screening method based on the LDA topic model
Chang et al. | A Method of Fine-Grained Short Text Sentiment Analysis Based on Machine Learning
CN111061939A (en) | Keyword matching recommendation method for scientific research and academic news based on deep learning
CN114238586A (en) | Sentiment classification method combining BERT and a convolutional neural network under a federated learning framework
CN117574858A (en) | Automatic generation method for similar-case retrieval reports based on a large language model
Chen et al. | Sentiment classification of tourism reviews based on rules and the LDA topic model
CN115438195A (en) | Method and device for constructing a knowledge graph in the field of financial standardization

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
