Method for Automatically Extracting Text Keywords

Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for automatically extracting text keywords.
Background
According to the studies of Chen and Lin, automatic keyword extraction techniques can be roughly classified into four types: statistical methods, linguistic methods, machine learning methods, and other methods. Statistical methods extract keywords by classifying and counting text units such as words or sentences in the text; this approach does not rely on training data and is independent of the language and domain of the text. Common algorithms include n-gram statistics, word frequency computation, TF-IDF models, word co-occurrence, PAT trees, and the like. Linguistic methods utilize the linguistic features of words, sentences, and documents for keyword extraction, the most common features being lexical, syntactic, semantic, and phonetic. Machine learning methods are mainly divided into two types, namely supervised and unsupervised keyword extraction; a supervised method trains a model on a data set annotated with keywords and then applies the model to extract keywords from other text data. The following problems exist in the prior art:
the existing supervised keyword extraction methods require a manually annotated training data set: the annotation work is tedious, the process is unavoidably subjective, and the final experimental data may be rendered inaccurate; meanwhile, the existing methods do not take into account the internal word frequency information, external corpus information, structural information, and the like of the text.
Disclosure of Invention
The invention provides a method for automatically extracting text keywords, which aims to solve the above problems in the background art.
In order to solve the technical problems, the invention adopts the following technical scheme:
A method of automatically extracting text keywords, the method of automatically extracting text keywords comprising the steps of:
step one: a TextRank algorithm;
step two: a TF-IDF algorithm;
Step three: a word vector clustering weighting algorithm;
step four: a central node metric algorithm;
step five: extracting keywords;
In the first step, the TextRank algorithm pre-processes the text, constructs the adjacency relations between word units in the text by establishing a word graph G = (V, E), and calculates the importance of the word nodes. When constructing the nodes of the word graph, the text is first divided into sentence units; each sentence is segmented into words, stop words and repeated words are removed, and only nouns, verbs, and adjectives are retained, forming the node set V = {V1, V2, …, Vn} of the word graph. When determining whether an edge exists between two nodes, the word co-occurrence relation is used: when the words corresponding to two nodes appear in the same window, the two nodes are connected, forming the edge set E = {(Vi, Vj) | Vi ∈ V ∧ Vj ∈ V} of the word graph, where the window size can be set to 2 to 10 words.
By constructing the word graph G = (V, E), the TextRank value of each node can be calculated by the formula: WS(Vi) = (1 - d) + d × Σ_{Vj∈In(Vi)} ( wji / Σ_{Vk∈Out(Vj)} wjk ) × WS(Vj).
The technical scheme of the invention is further improved as follows: in the formula of the first step, d ∈ [0, 1] is a damping coefficient, indicating that any node jumps with probability 1 - d to a random node of the word graph; to ensure that the iterative computation of TextRank converges, d is generally taken as 0.85. WS(Vi) represents the TextRank value of node Vi; WS(Vj) represents the TextRank value of node Vj after the last iteration; In(Vi) represents the set of all nodes pointing to node Vi; Out(Vj) represents the set of all nodes to which node Vj points; wji represents the similarity between the two nodes Vi and Vj.
The technical scheme of the invention is further improved as follows: in the second step, the term frequency (TF) and the inverse document frequency (IDF) are multiplied to obtain the weight value of each word, and the top-ranked words are selected as keywords according to the weight ordering. The calculation formulas are: WTF-IDF(i) = TFi × IDFi; IDFi = log(N / DFi), where TFi represents the number of occurrences of word i in the text divided by the total number of words in the text, i.e. the frequency of occurrence of word i in the text; N represents the total number of texts in the corpus; DFi represents the number of texts containing word i; and IDFi reflects the category discrimination capability of word i.
The technical scheme of the invention is further improved as follows: in the third step, keywords can be examined by clustering the word vectors of the text and measuring the distance between each word vector and its cluster center. The word vector clustering weighting algorithm assumes that the words of a text can be divided into groups, called clusters, by calculating the vector similarity between them; the farther a word lies from the centroid of its cluster, the more it differs from the centroid and from the words around the centroid, and the higher its importance as a TextRank word graph node.
The technical scheme of the invention is further improved as follows: the given text is preprocessed to generate candidate keywords V = {V1, V2, …, Vn}, and the word vector vi of each candidate word is obtained by using a trained Word2Vec word vector model. Let C = {C1, C2, …, Cn} represent the clustering result obtained by K-means clustering of the document's word vector set. The importance of any word i in its cluster Ci is calculated as: WWord2Vec-Cluster(i) = dist(vi, ci) / ( (1/|Ci|) × Σ_{j∈Ci} dist(vj, ci) ), where ci denotes the vector of the centroid of cluster Ci, dist(·, ·) denotes the Euclidean distance between two vectors in the word vector space, and |Ci| denotes the number of words contained in cluster Ci.
The technical scheme of the invention is further improved as follows: the fourth step can be divided into the following steps:
A1: centering: the degree centrality is the most direct measurement index for describing the node centrality in network analysis, the degree of a node is higher, the degree centrality of the node is more important in the network, and the degree centrality calculation formula of a certain node is as follows:
A2: feature vector centrality: the basic idea of feature vector centrality is that the importance of a node depends on both its number of neighboring nodes and its importance of neighboring nodes. The characteristic vector centrality calculation formula of a certain node is as follows: Wherein lambda is a characteristic value, j is a neighbor node of the node i;
A3: median centrality: the basic idea of the betting center is to characterize node importance by counting the number of shortest paths through a node. The calculation formula of the medium centrality of a certain node is as follows: Where σut represents the number of shortest paths of nodes u to t, σut (i) represents the number of paths that these paths pass through node i, and N represents the number of nodes in the graph;
Combining A1, A2, and A3, the centrality of a node in the word graph is comprehensively estimated by the following formula: WCentrality(i) = Wd(i) + Wev(i) + Wb(i).
The technical scheme of the invention is further improved as follows: in the fifth step, for any word i in the text, the comprehensive weight of word i is calculated as: WWeight(i) = WTF-IDF(i) + WWord2Vec-Cluster(i) + WCentrality(i), while the node weight values are normalized using the formula: W'Weight(i) = (WWeight(i) - min_weight) / (max_weight - min_weight), where WWeight(i) is the weight of node i, min_weight is the minimum weight value among all nodes, and max_weight is the maximum weight value among all nodes. Following the TextRank algorithm, the probability transition formula between word graph nodes is: P(Vi, Vj) = W'Weight(j) / Σ_{Vk∈Out(Vi)} W'Weight(k), from which the probability transition matrix M = [Pij]n×n between the nodes of the word graph is constructed, where Pij represents the transition probability of a jump from node i to node j, i.e. P(Vi, Vj). With the probability transition matrix M, the final TextRank node weight iteration formula is: B(t+1) = (1 - d) + d × M^T × B(t), where B(t) is the vector of node scores after the t-th iteration.
The technical scheme of the invention is further improved as follows: in the fifth step, by continuously iterating the above formula, convergence is reached when the error in the graph falls below a given threshold; in general, the threshold is taken as 0.0001. Iteration then stops, and the score of each node represents its degree of importance in the graph. The nodes are sorted in descending order of score, and the top N words constitute the final result of keyword extraction.
By adopting the above technical schemes, the invention achieves the following technical advances compared with the prior art:
1. The invention provides a method for automatically extracting text keywords, which calculates word node weight information by adopting a TF-IDF algorithm, a word vector clustering weighting algorithm and a central node measurement algorithm of a graph, and fills initial information of each node in a word co-occurrence network referenced by a model during keyword extraction, thereby obviously improving the accuracy of keyword extraction.
2. The invention provides a method for automatically extracting text keywords, which fully considers the diversity requirement of keyword extraction in real application through the design of a model, so that the model can flexibly cope with text data of different types and scales, and the model is excellent in processing professional terms in specific fields and more general text situations, thereby providing powerful support for wide application scenes.
3. The invention provides a method for automatically extracting text keywords, which is characterized in that a TF-IDF algorithm, a word vector clustering weighting algorithm and a central node measurement algorithm are incorporated into a TextRank algorithm model to respectively realize comprehensive investigation of text internal information, text external corpus information and text structure information, and the central node measurement is taken as a distinguishing attribute of node importance in a graph, so that the importance and influence of the nodes in the graph can be effectively evaluated, and important nodes in the graph can be effectively mined in a word co-occurrence network, so that the text keywords are better extracted.
Drawings
FIG. 1 is a candidate keyword graph of the present invention;
Fig. 2 is a schematic overview of the method of the present invention.
Detailed Description
The invention is further illustrated by the following examples:
example 1
As shown in fig. 1 and 2, the present invention provides a method for automatically extracting text keywords, which comprises the following steps:
step one: a TextRank algorithm;
step two: a TF-IDF algorithm;
Step three: a word vector clustering weighting algorithm;
step four: a central node metric algorithm;
step five: extracting keywords;
In the first step, the TextRank algorithm pre-processes the text, constructs the adjacency relations between word units in the text by establishing a word graph G = (V, E), and calculates the importance of the word nodes. When constructing the nodes of the word graph, the text is first divided into sentence units; each sentence is segmented, stop words, repeated words, and the like are removed, and only nouns, verbs, and adjectives are retained, forming the node set V = {V1, V2, …, Vn} of the word graph. When determining whether an edge exists between two nodes, the word co-occurrence relation is used: when the words corresponding to two nodes appear in the same window, the two nodes are connected, forming the edge set E = {(Vi, Vj) | Vi ∈ V ∧ Vj ∈ V} of the word graph, where the window size can be set to 2 to 10 words. With the word graph G = (V, E) constructed, the TextRank value of each node is calculated by the formula: WS(Vi) = (1 - d) + d × Σ_{Vj∈In(Vi)} ( wji / Σ_{Vk∈Out(Vj)} wjk ) × WS(Vj), where d ∈ [0, 1] is a damping coefficient indicating that any node jumps with probability 1 - d to a random node of the word graph; to ensure that the iterative computation of TextRank converges, d is generally taken as 0.85. WS(Vi) represents the TextRank value of node Vi; WS(Vj) represents the TextRank value of node Vj after the last iteration; In(Vi) represents the set of all nodes pointing to node Vi; Out(Vj) represents the set of all nodes to which node Vj points; wji represents the similarity between the two nodes Vi and Vj.
In this embodiment, taking fig. 1 as an example, under the standard TextRank algorithm the probability that node α jumps to any one of its neighboring nodes in region A is uniform. However, research shows that jumping non-uniformly according to node importance can effectively improve the keyword extraction effect; therefore, the method enriches the initial weight information of the nodes on this basis and further optimizes the jump probability between the nodes.
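The word graph construction and TextRank iteration of the first step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the full method: the sentences are assumed to be already segmented, POS-filtered, and free of stop words; co-occurrence counts serve as the edge weights wji; and the function names are illustrative.

```python
from collections import defaultdict

def build_word_graph(sentences, window=3):
    """Build an undirected co-occurrence word graph.
    sentences: lists of filtered tokens (nouns/verbs/adjectives only,
    stop words and repeated words assumed already removed).
    Two words are linked when they appear within the same window."""
    edges = defaultdict(float)  # (wi, wj) -> co-occurrence count
    nodes = set()
    for sent in sentences:
        nodes.update(sent)
        for i, w in enumerate(sent):
            for j in range(i + 1, min(i + window, len(sent))):
                if sent[j] != w:
                    edges[(w, sent[j])] += 1.0  # store both directions
                    edges[(sent[j], w)] += 1.0
    return nodes, edges

def textrank(nodes, edges, d=0.85, iters=50):
    """Plain TextRank: WS(Vi) = (1-d) + d * sum_j (wji / out_j) * WS(Vj)."""
    ws = {v: 1.0 for v in nodes}
    out_weight = defaultdict(float)
    incoming = defaultdict(list)  # v -> [(u, w_uv), ...]
    for (u, v), w in edges.items():
        out_weight[u] += w
        incoming[v].append((u, w))
    for _ in range(iters):
        ws = {v: (1 - d) + d * sum(w / out_weight[u] * ws[u]
                                   for u, w in incoming[v])
              for v in nodes}
    return ws
```

For example, with `sentences = [["keyword", "extraction", "text"], ["keyword", "text", "graph"]]`, the frequently co-occurring word "keyword" receives a higher score than the peripheral word "graph".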
Example 2
As shown in figs. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, in the second step, the term frequency (TF) and the inverse document frequency (IDF) are multiplied to obtain the weight value of each word, and the top-ranked words are selected as keywords according to the weight ordering. The calculation formulas are: WTF-IDF(i) = TFi × IDFi; IDFi = log(N / DFi), where TFi represents the number of occurrences of word i in the text divided by the total number of words in the text, i.e. the frequency of occurrence of word i in the text; N represents the total number of texts in the corpus; DFi represents the number of texts containing word i; and IDFi reflects the category discrimination capability of word i.
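The TF-IDF formulas of this embodiment can be illustrated with a short Python sketch; the toy corpus and the function name are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs, target_index):
    """W_TF-IDF(i) = TF_i * IDF_i, with IDF_i = log(N / DF_i).
    docs: list of token lists; weights are computed for docs[target_index]."""
    n_docs = len(docs)
    df = Counter()                      # DF_i: number of docs containing word i
    for doc in docs:
        df.update(set(doc))
    doc = docs[target_index]
    tf = Counter(doc)                   # raw counts in the target document
    total = len(doc)
    return {w: (tf[w] / total) * math.log(n_docs / df[w]) for w in tf}
```

A word that is frequent in the target document but rare in the rest of the corpus receives the highest weight.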
Example 3
As shown in figs. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, in the third step, keywords can be examined by clustering the word vectors of the text and measuring the distance between each word vector and its cluster center. The word vector clustering weighting algorithm assumes that the words of a text can be divided into groups, called clusters, by calculating the vector similarity between them; the farther a word lies from the centroid of its cluster, the more it differs from the centroid and from the words around the centroid, and the higher its importance as a TextRank word graph node. In the third step, the given text is preprocessed to generate candidate keywords V = {V1, V2, …, Vn}, and the word vector vi of each candidate word is obtained by using a trained Word2Vec word vector model. Let C = {C1, C2, …, Cn} represent the clustering result obtained by K-means clustering of the document's word vector set. The importance of any word i in its cluster Ci is calculated as: WWord2Vec-Cluster(i) = dist(vi, ci) / ( (1/|Ci|) × Σ_{j∈Ci} dist(vj, ci) ), where ci denotes the vector of the centroid of cluster Ci, dist(·, ·) denotes the Euclidean distance between two vectors in the word vector space, and |Ci| denotes the number of words contained in cluster Ci.
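The cluster-distance weighting of this embodiment can be sketched in Python as follows. It assumes, as a reconstruction of the formula above, that a word's weight is its Euclidean distance to the cluster centroid divided by the mean member distance; the K-means clustering itself and the Word2Vec training are omitted.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cluster_importance(cluster_vectors, index):
    """Importance of the word at `index` within its cluster: its distance
    to the centroid divided by the mean distance of all cluster members,
    so words far from the centroid score higher."""
    c = centroid(cluster_vectors)
    dists = [euclidean(v, c) for v in cluster_vectors]
    mean_d = sum(dists) / len(dists)
    return dists[index] / mean_d if mean_d > 0 else 0.0
```

Note that, with this formulation, the importance values of a cluster's members sum to the cluster size, so the weight is scale-free with respect to the embedding.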
Example 4
As shown in fig. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, the fourth step may be divided into the following steps:
A1: centering: the degree centrality is the most direct measurement index for describing the node centrality in network analysis, the degree of a node is higher, the degree centrality of the node is more important in the network, and the degree centrality calculation formula of a certain node is as follows:
A2: feature vector centrality: the basic idea of feature vector centrality is that the importance of a node depends on both its number of neighboring nodes and its importance of neighboring nodes. The characteristic vector centrality calculation formula of a certain node is as follows: Wherein lambda is a characteristic value, j is a neighbor node of the node i;
A3: median centrality: the basic idea of the betting center is to characterize node importance by counting the number of shortest paths through a node. The calculation formula of the medium centrality of a certain node is as follows: Where σut represents the number of shortest paths of nodes u to t, σut (i) represents the number of paths that these paths pass through node i, and N represents the number of nodes in the graph;
Combining A1, A2, and A3, the centrality of a node in the word graph is comprehensively estimated by the following formula: WCentrality(i) = Wd(i) + Wev(i) + Wb(i).
In the embodiment, the central node measurement is used as the distinguishing attribute of the node importance in the graph, the importance and influence of the node in the graph can be effectively evaluated, and in the word co-occurrence network, the important node can be effectively mined, so that the text keyword can be better extracted.
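The three centrality measures A1–A3 can be sketched for a small undirected word graph as follows. This is an illustrative implementation under standard techniques (power iteration for eigenvector centrality, Brandes' algorithm for betweenness), not the patented method itself, and the adjacency-list input `adj` is a hypothetical name.

```python
from collections import deque

def degree_centrality(adj):
    """A1: W_d(i) = deg(i) / (N - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def eigenvector_centrality(adj, iters=100):
    """A2: power iteration for W_ev(i) = (1/lambda) * sum_j a_ij W_ev(j).
    Iterating with A + I (adding each node's own score) avoids
    oscillation on bipartite graphs without changing the eigenvectors."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        nx = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = max(nx.values())
        x = {v: s / norm for v, s in nx.items()}
    return x

def betweenness_centrality(adj):
    """A3: Brandes' algorithm, accumulating sigma_ut(i)/sigma_ut
    over all source nodes (unnormalized form)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                       # BFS counting shortest paths
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}  # undirected pairs counted twice
```

On a path graph a–b–c, for instance, the middle node b dominates all three measures, matching the intuition that it bridges the other two.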
Example 5
As shown in figs. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, in the fifth step, for any word i in the text, the comprehensive weight of word i is calculated as: WWeight(i) = WTF-IDF(i) + WWord2Vec-Cluster(i) + WCentrality(i), while the node weight values are normalized using the formula: W'Weight(i) = (WWeight(i) - min_weight) / (max_weight - min_weight), where WWeight(i) is the weight of node i, min_weight is the minimum weight value among all nodes, and max_weight is the maximum weight value among all nodes. Following the TextRank algorithm, the probability transition formula between word graph nodes is: P(Vi, Vj) = W'Weight(j) / Σ_{Vk∈Out(Vi)} W'Weight(k), from which the probability transition matrix M = [Pij]n×n between the nodes of the word graph is constructed, where Pij represents the transition probability of a jump from node i to node j, i.e. P(Vi, Vj). With the probability transition matrix M, the final TextRank node weight iteration formula is: B(t+1) = (1 - d) + d × M^T × B(t), where B(t) is the vector of node scores after the t-th iteration. By continuously iterating this formula, convergence is reached when the error in the graph falls below a given threshold; in general, the threshold is taken as 0.0001. Iteration then stops, and the score of each node represents its degree of importance in the graph. The nodes are sorted in descending order of score, and the top N words constitute the final result of keyword extraction.
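A Python sketch of the fifth step — min–max normalization of the comprehensive node weights, weight-biased transition probabilities P(Vi, Vj), and iteration to the 0.0001 threshold — might look as follows. The small epsilon that keeps zero-weight nodes reachable is an implementation assumption not stated above.

```python
def normalize_weights(weights):
    """W'(i) = (W(i) - min_weight) / (max_weight - min_weight)."""
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0
    return {v: (w - lo) / span for v, w in weights.items()}

def weighted_textrank(adj, node_weights, d=0.85, tol=1e-4, max_iter=200):
    """TextRank whose jump probabilities are biased by node weights:
    P(Vi, Vj) = W'(j) / sum_{Vk in Out(Vi)} W'(k)."""
    w = normalize_weights(node_weights)
    eps = 1e-9  # assumption: keeps nodes with normalized weight 0 reachable
    p = {}
    for i in adj:
        denom = sum(w[k] + eps for k in adj[i])
        for j in adj[i]:
            p[(i, j)] = (w[j] + eps) / denom
    incoming = {v: [u for u in adj if v in adj[u]] for v in adj}
    score = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        new = {v: (1 - d) + d * sum(score[u] * p[(u, v)] for u in incoming[v])
               for v in adj}
        if max(abs(new[v] - score[v]) for v in adj) < tol:
            score = new
            break
        score = new
    return score
```

On a triangle graph where node "a" carries the largest comprehensive weight, the biased jumps steer the random walk toward "a", which therefore finishes with the highest score.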
The workflow of the method of automatically extracting text keywords is described in detail below.
As shown in fig. 2, in the first step, the given text is preprocessed: stop words, repeated words, and the like are deleted, and verbs, nouns, and adjectives are retained. In the second step, the word graph is constructed according to the word co-occurrence window, where the window defines the maximum distance within which two words are considered to co-occur. In the third step, the TF-IDF value, the word vector clustering weight value, and the central node metric value of each word node are calculated respectively and summed. In the fourth step, the TextRank algorithm is iterated to obtain the final score of each candidate keyword, and after sorting, the top N words are selected as the final keywords.
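The summation of the third sub-step above can be sketched as a simple merge of the three per-word weight dictionaries; the dictionary names are illustrative.

```python
def comprehensive_weight(w_tfidf, w_cluster, w_centrality):
    """W_Weight(i) = W_TF-IDF(i) + W_Word2Vec-Cluster(i) + W_Centrality(i);
    a word missing from one source contributes 0 for that term."""
    words = set(w_tfidf) | set(w_cluster) | set(w_centrality)
    return {v: w_tfidf.get(v, 0.0) + w_cluster.get(v, 0.0)
               + w_centrality.get(v, 0.0)
            for v in words}
```

The resulting dictionary supplies the initial node weights that the biased TextRank iteration then refines.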
The invention has been described above both in general terms and in detail, but it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, modifications or improvements that remain within the spirit of the inventive concept are intended to be covered.