Disclosure of Invention
The invention provides a keyword extraction method fusing graph structure and node association (GSNA), which comprehensively considers the graph structure attributes of nodes, their semantic information, and the association characteristics among nodes, and can find keywords in long texts without relying on an external corpus.
In order to achieve the above object, the keyword extraction method fusing graph structure and node association according to the present invention comprises the following steps:
1) acquiring a text and performing text preprocessing;
2) mining a frequent closed item set of a document and generating a strong association rule set;
3) taking the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct an association graph of the document;
4) randomly walking on the association diagram by using a GSNA algorithm, iteratively calculating the importance score of each node, and sorting the results in a descending order;
5) clustering the top-ranked nodes, and taking the class center of each class as the keyword extraction result of the document.
The method for preprocessing the text in step 1) comprises the following steps:
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all directories, titles, figures and tables in the text and only leaving text information;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. duplicate words in each sentence are removed.
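As an illustration of steps a through e, the following is a minimal preprocessing sketch in Python; the whitespace tokenization and the small stop-word list are stand-ins for the segmenter and the stop-word list the invention actually uses.

```python
# A minimal sketch of preprocessing steps a-e, assuming whitespace-tokenized
# English input; STOP_WORDS is illustrative only, not the list the invention uses.

STOP_WORDS = {"the", "is", "at", "which", "on", "a", "of", "and"}

def preprocess(text: str) -> list[list[str]]:
    """Split text into sentences, tokenize, drop stop words, dedupe per sentence."""
    sentences = [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".")]
    transactions = []
    for s in sentences:
        if not s:
            continue
        words = [w.lower() for w in s.split()]             # step c: space-based segmentation
        words = [w for w in words if w not in STOP_WORDS]  # step d: stop-word filtering
        seen, deduped = set(), []
        for w in words:                                    # step e: remove duplicate words
            if w not in seen:
                seen.add(w)
                deduped.append(w)
        if deduped:
            transactions.append(deduped)
    return transactions
```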
The method for generating the strong association rule set in 2) comprises the following steps:
a. establishing a transaction set and an item set; a transaction is a single sentence in the text T; the transaction set is the whole preprocessed text T; an item set is any subset of all words contained in the text T, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set;
b. solving the closed item set; an item set X is a closed item set if and only if S(X) = f(g(X)) = f∘g(X) = X, where S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of the item set; for an association rule A → B, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, the association strength of A → B is lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B)) (1);
since lift(A → B) = lift(B → A), the lift metric satisfies symmetry. From the analysis of probability distribution, when lift(A → B) = 1, A and B obey independent probability distributions, indicating no correlation between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced is higher than the prior probability of B, indicating a positive association between A and B; when lift(A → B) < 1, the posterior probability of B after A is introduced is lower than the prior probability of B, indicating a negative association between A and B;
f. solving strong association rules; for an association rule A → B, if sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻lift or lift(A → B) ≥ min⁺lift, then A → B is a strong association rule; where minsup is the minimum support threshold of the association rule, max⁻lift is the maximum negative-association threshold, and min⁺lift is the minimum positive-association threshold (this test is sketched in code after this list);
g. solving the closed frequent item set; an item set X is a closed frequent item set if and only if sup(X) ≥ minsup and S(X) = X; the invention adopts the CLOSE algorithm to mine closed frequent item sets and thereby generate the strong association rule set;
the CLOSE algorithm is realized by the following steps:
step (1) preprocess the text; after deduplication, generate the 1-item sets, scan the transaction set to compute the closure and support of each 1-item set, and record the resulting set of closures and supports as FCC1; then remove from FCC1 the 1-item sets whose support is less than minsup or whose closure duplicates another, obtaining FC1 (FC1 is the set of closures and supports obtained from the 1-item sets by the CLOSE algorithm);
step (2) when generating the i-item sets (starting from i = 2), combine pairwise the different item sets among the generators of FC(i−1) as the generators of FCCi, compute the closure and support of each i-item set by scanning the transaction set to obtain FCCi, and remove the item sets whose support is less than minsup or whose closure is repeated, obtaining FCi;
step (3) increment i and repeat step (2) until the generator set of FCCi is empty, at which point the CLOSE algorithm ends;
step (4) finally, take out all the closures of FC1 through FCi to obtain all the closed frequent item sets FC of the transaction set;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L1, L2, ..., Lk}, where Li is the frequent closure set of the i-item sets; second, extract the closures of different lengths in L together with their supports to form the closed frequent item set FC; then, from FC obtain the frequent item sets of different lengths and their supports, Fre = {f1, f2, ..., fk}, where fi is the i-frequent item set and its support set; finally, according to fi (2 ≤ i ≤ k), generate association rule sets AR of different forms, each rule composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the (i−m)-item set, and compute lift(m-itemset → (i−m)-itemset) according to Fre. Extract the rules satisfying lift ≤ max⁻lift or lift ≥ min⁺lift as strong association rules.
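The lift computation and the strong-rule test of steps e and f can be sketched as follows; the symmetric lift form matches lift(A → B) = lift(B → A) stated above, and the default threshold values are the ones used in the worked example of step 3) below.

```python
# A sketch of the support, lift and strong-rule test of steps d-f, assuming
# lift(A -> B) = sup(A ∪ B) / (sup(A) * sup(B)), the symmetric form above.

def support(itemset: frozenset, transactions: list) -> float:
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a: frozenset, b: frozenset, transactions: list) -> float:
    return support(a | b, transactions) / (support(a, transactions) * support(b, transactions))

def is_strong_rule(a, b, transactions, minsup=0.4, max_neg_lift=0.95, min_pos_lift=1.2):
    """A -> B is strong if its support clears minsup and its lift is either
    clearly positive (>= min_pos_lift) or clearly negative (<= max_neg_lift)."""
    l = lift(a, b, transactions)
    return support(a | b, transactions) >= minsup and (l <= max_neg_lift or l >= min_pos_lift)
```

A compact sketch of the CLOSE mining of steps (1) through (4) follows, using the same closure operation S(X) = f(g(X)) defined in b; the pruning of candidates contained in the closure of a subset generator is a simplified reading of the algorithm description, not the reference CLOSE implementation.

```python
from itertools import combinations

def close(transactions, minsup):
    """Return {closure: support} for all closed frequent item sets (steps (1)-(4))."""
    n = len(transactions)

    def g(x):                      # transactions containing item set x
        return [t for t in transactions if x <= t]

    def f(ts):                     # items common to all transactions in ts
        return frozenset.intersection(*ts) if ts else frozenset()

    fc = {}                        # all closed frequent item sets found so far
    # step (1): the 1-item sets are the first generators
    generators = {frozenset([w]) for t in transactions for w in t}
    while generators:              # step (3): repeat until no generators remain
        kept, seen = {}, set()
        for gen in sorted(generators, key=sorted):   # deterministic order
            ts = g(gen)
            sup = len(ts) / n
            clo = f(ts)
            # drop infrequent generators and generators with duplicate closures
            if sup >= minsup and clo not in seen:
                seen.add(clo)
                kept[gen] = clo
                fc[clo] = sup
        # step (2): combine surviving generators pairwise into (i+1)-item candidates,
        # pruning any candidate already contained in the closure of a subset generator
        generators = set()
        for g1, g2 in combinations(kept, 2):
            cand = g1 | g2
            if len(cand) == len(g1) + 1 and not (cand <= kept[g1] or cand <= kept[g2]):
                generators.add(cand)
    return fc
```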
The method for generating the association graph in 3) comprises the following steps: take out all rule heads and rule bodies in the strong association rule set and put them into a set; take the non-repeated item sets in this set as the nodes of the association graph, the strong associations existing between nodes as the edges of the association graph, and the lift values as the weights of the edges.
The GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of the nodes; six indexes are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 respectively represent the number of nodes at distances 1, 2 and 3 from the node, atr4 is the ratio of atr1 to the average degree of the neighbor nodes, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; assuming node Ni exists in the association graph, Ni is formalized as Ni = (atr1, atr2, atr3, atr4, atr5, atr6) (a breadth-first sketch follows this list);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w1, w2, ..., wn}, the co-occurrence probability distribution of node Ni with I is P = {p1, p2, ..., pn}, and the co-occurrence probability distribution of Nj with I is Q = {q1, q2, ..., qn}; the semantic distance between Ni and Nj is then the Jensen-Shannon divergence shown in formula (2): JSD(Ni, Nj) = ½·KL(P ‖ M) + ½·KL(Q ‖ M), where M = (P + Q)/2 and KL is the Kullback-Leibler divergence;
c. calculating the similarity between nodes by combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of the nodes, the similarity measure of nodes Ni and Nj is shown in formula (3); the greater the difference of the graph structure attributes or semantic distributions of Ni and Nj, the lower the similarity sij; conversely, the closer the structural attributes or semantic distributions of Ni and Nj, the higher sij;
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes joined by a strong association have a similarity value and nodes without a strong association have similarity 0, a similarity matrix S(N×N) of the association graph can be formed. To measure more formally the degree of similarity of a node to all nodes, the similarity of each node in the association graph is normalized as shown in formula (4).
f. modifying the PageRank formula;
ri represents the average degree of similarity of node Ni to all nodes in the graph, and r = (r1, r2, ..., rN) is the stationary probability distribution of all nodes in the association graph with respect to graph structure and semantic attributes; using ri to modify the PageRank formula, nodes no longer obey equal-probability jumps; instead, nodes with a higher degree of similarity to the association graph are selected with higher probability, as shown in formula (5).
since the correlation diagram is an undirected graph, C (N) in the formula (5)k) Representing a node NkE represents NkThe number of neighbors of (2);
g. strengthening the PR voting score of positive associations and reducing the PR voting score of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule exists between them. To strengthen the PR voting score of positive associations and lower that of negative associations, formula (5) is improved using the lift values to obtain formula (6);
h. randomly walking on the association graph according to formula (6) and iteratively calculating the PR value of each node until formula (7) is satisfied, at which point the node scores reach a convergence state, where β is the random-walk termination threshold;
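As an illustration of step a, the following is a sketch of how the six indexes of one node might be computed by breadth-first search; the adjacency-set representation of the association graph is an assumption made for illustration.

```python
from collections import deque

def graph_attributes(adj: dict, node) -> tuple:
    """Compute atr1..atr6 for one node, where adj maps each node to the set of
    its neighbours in the (undirected) association graph."""
    dist, frontier = {node: 0}, deque([node])
    while frontier:
        u = frontier.popleft()
        if dist[u] >= 3:                      # only rings 1..3 are needed
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    atr1 = sum(1 for d in dist.values() if d == 1)   # nodes at distance 1
    atr2 = sum(1 for d in dist.values() if d == 2)   # nodes at distance 2
    atr3 = sum(1 for d in dist.values() if d == 3)   # nodes at distance 3
    neigh = adj[node]
    avg_deg = sum(len(adj[v]) for v in neigh) / len(neigh) if neigh else 1.0
    atr4 = atr1 / avg_deg                    # degree vs. average neighbour degree
    atr5 = atr2 / atr1 if atr1 else 0.0      # propagation from ring 1 to ring 2
    atr6 = atr3 / atr2 if atr2 else 0.0      # propagation from ring 2 to ring 3
    return (atr1, atr2, atr3, atr4, atr5, atr6)
```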
and 5) clustering the nodes, with the following steps: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; select K of the P nodes (K ≤ P) as K class centers; then calculate the semantic distance between each remaining node and each class center, assigning the node to the closest class; finally, after all nodes are assigned, search for the class center of each class again and reassign the non-center nodes; continuously repeat this process of center adjustment and node assignment until the difference between two adjacent assignment results reaches the clustering termination threshold, and extract the class center of each cluster as a final keyword of the text. The basic technical flow is shown in fig. 1.
The keyword extraction method fusing graph structure and node association disclosed by the invention has the following beneficial effects. The invention uses the CLOSE algorithm to mine the strong association rules from which the association graph is constructed; the strong association rules obtained on the basis of closed frequent item sets are small in scale, fast to compute, and lossless, which avoids the overly complex graph structure caused by extracting redundant frequent item sets and improves the overall execution efficiency of the algorithm. Node similarity is computed by jointly considering the graph structure attributes and the semantic information of the nodes in the association graph, so that node jumps no longer obey an equal-probability distribution and the probability of jumping between similar nodes increases. The invention proposes that, during the random walk on the association graph, votes between nodes carry positive and negative association scores: positive-association voting scores are strengthened and negative-association voting scores are reduced, making the node scores more reasonable. The method can find the keywords of a long text without relying on an external corpus; it comprehensively considers the graph structure attributes of nodes, their semantic information, and the association characteristics between nodes, and avoids the low execution efficiency that constructing a complex association graph would cause for GSNA. The method first mines the closed frequent item sets of a document and generates the strong association rule set; second, it takes the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and takes the association strength of the rules as edge weights to construct the association graph of the document; then, it randomly walks on the association graph using the GSNA algorithm, iteratively calculates the importance score of each node, and sorts the results in descending order; finally, to avoid semantic inclusion relations among the extracted keywords, it clusters the top-ranked nodes and takes the class center of each class as the keyword extraction result of the document.
Detailed Description
Example 1
As shown in figs. 1-4, the invention provides a keyword extraction method fusing graph structure and node association, which comprises the following steps:
1) acquiring a text and performing text preprocessing;
a. acquiring a text; Darwin's "On the Origin of Species" is selected as the text, which consists of a plurality of sentences;
b. removing all directories, titles, figures and tables in the text and only leaving text information;
c. segmenting the words of the text: English text is segmented simply on spaces, while Chinese text is segmented using a common word segmentation algorithm; in English, spaces serve as natural delimiters, whereas in Chinese only characters, sentences and paragraphs are delimited by obvious marks and individual words have no formal delimiter, so suitable word segmentation software must be chosen to segment Chinese; because the selected text is Chinese, it is segmented using the jieba segmenter; for example, a passage of "On the Origin of Species":
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
After word segmentation it becomes:
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
d. filtering the stop words of the text; stop words have no practical meaning: for English text, "the", "is", "at", "which", "on", etc. can be treated as stop words, and for Chinese text, words corresponding to "a", "here", "where", "I", "though", etc. can be treated as stop words. After the text is filtered for stop words, the text separated by stop marks ("," and ".") forms a sentence set St = {st1, st2, ..., stn}; the stop word list can be user-defined, which the invention does not limit. The invention uses the Baidu stop word list plus some stop words defined for the text. The passage given in step c, after stop-word filtering, becomes:
"natural selection action is fully preserved and accumulated organic and inorganic conditions of various variant organisms in life period are favorable
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
e. removing the repeated words in each sentence; after this step, the stop-word-filtered passage becomes:
"natural selection action is fully preserved and accumulated in the life period of various variant organisms under the organic and inorganic conditions
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
2) mining a frequent closed item set of a document and generating a strong association rule set;
a. establishing a transaction set and an item set; a transaction is a single sentence in the text T; the transaction set is the whole preprocessed text T; an item set is any subset of all words contained in the text T, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set. For "On the Origin of Species", the transaction set is the whole preprocessed text of the book, a transaction is one sentence of the preprocessed text, and an item set is any set of words in the preprocessed text, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set;
b. solving the closed item set; an item set X is closed if and only if S(X) = f(g(X)) = f∘g(X) = X, where S = f∘g is the closure operation. Assume the transaction set contains five transactions with IDs tr1, tr2, tr3, tr4, tr5 and item sets {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e} and {a, b, c, e}, respectively. One closed item set is S({b, e}) = f(g({b, e})) = f({tr2, tr3, tr4, tr5}) = {b, e}.
c. The length of the item set is calculated, the item set { a, c, d } is called a 3-item set, and the item set { a, b, c, e } is called a 4-item set;
d. solving the support of an item set: for an item set X, the support sup(X) is the ratio of the number of transactions g(X) containing X to the total number of transactions, i.e., |g(X)|/|T|. For an association rule A → B, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T. In the example in b, {b, e} occurs in transactions tr2, tr3, tr4, tr5, so the support of the item set {b, e} is sup({b, e}) = 4/5 (this is verified in the code snippet following step h);
e. calculating the association strength; for an association rule A → B, the association strength of A → B is lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B)) (1);
since lift(A → B) = lift(B → A), the lift metric satisfies symmetry. From the analysis of probability distribution, when lift(A → B) = 1, A and B obey independent probability distributions, indicating no correlation between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced is higher than the prior probability of B, indicating a positive association between A and B; when lift(A → B) < 1, the posterior probability of B after A is introduced is lower than the prior probability of B, indicating a negative association between A and B;
f. solving strong association rules; for an association rule A → B, if sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻lift or lift(A → B) ≥ min⁺lift, then A → B is a strong association rule; where minsup is the minimum support threshold of the association rule, max⁻lift is the maximum negative-association threshold, and min⁺lift is the minimum positive-association threshold;
g. solving the closed frequent item set; an item set X is a closed frequent item set if and only if sup(X) ≥ minsup and S(X) = X; minsup is the minimum support of X in the transaction set.
The conventional association rule mining algorithm Apriori constructs association rules by mining frequent item sets of different lengths; when the transaction scale is large, the transaction set is scanned repeatedly during rule generation, so a large number of redundant frequent item sets is easily produced, lowering efficiency and complicating the analysis. In order to quickly extract effective strong association rules from the document transaction set, the invention applies the CLOSE algorithm to mine closed frequent item sets and generate the strong association rule set; the algorithm uses the closure operation combined with a breadth-first strategy to search the item set space, quickly pruning non-frequent and non-closed item sets from the search space.
Assuming minsup = 2/5, for the example in b the CLOSE algorithm first generates the 1-item sets as generators of FCC1 and scans the transaction set to compute the closure and support of each 1-item set, obtaining FCC1; the item sets with support less than 2/5 and the 1-item sets with identical closures are then removed to obtain FC1. In generating the 2-item sets, only the different item sets in FC1 are combined pairwise, yielding {a, b}, {a, c}, {b, c}; since {a, c} is contained in the closure S({a}) = {a, c} of its subset {a}, {a, c} is pruned, and {a, b} and {b, c} are retained as the generators of FCC2; the closures and supports of the 2-item sets are computed by scanning the transaction set to obtain FCC2, and the item sets with support less than 2/5 and with repeated closures are removed to obtain FC2. In generating FCC3 from FC2, the only 3-item set generated by the generators of FC2 is {a, b, c}, but {a, b, c} is contained in the closure S({a, b}) = {a, b, c, e}, so the pruned 3-item candidate set is empty, at which point the CLOSE algorithm ends. Finally, all the closures of FC1 and FC2 are taken out to obtain all the closed frequent item sets FC of the transaction set. FIG. 2 illustrates the closed frequent item set generation process of this transaction set in detail;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L1, L2, ..., Lk}, where Li is the frequent closure set of the i-item sets; second, extract the closures of different lengths in L together with their supports to form the closed frequent item set FC; then, from FC obtain the frequent item sets of different lengths and their supports, Fre = {f1, f2, ..., fk}, where fi is the i-frequent item set and its support set; finally, according to fi, for 2 ≤ i ≤ k, generate association rule sets AR of different forms, each rule composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the (i−m)-item set, and compute lift(m-itemset → (i−m)-itemset) according to Fre; extract the rules satisfying lift ≤ max⁻lift or lift ≥ min⁺lift as strong association rules. In the example in b, with minsup = 2/5, the 3-item set {b, c, e} can produce rules whose head is a 1-item set and whose body is a 2-item set, or whose head is a 2-item set and whose body is a 1-item set; the lift metric satisfies symmetry, which can also be seen from the figure. The 2-item set {a, c} has only a 1-item-set rule head and rule body, and lift(a → c) = lift(c → a); the closed frequent item set generation process is shown in fig. 3;
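The worked examples of steps b and d can be checked directly in code; the following snippet verifies that {b, e} is closed and has support 4/5.

```python
# The five-transaction example from steps b and d, checked in code.

transactions = [
    frozenset("acd"),   # tr1
    frozenset("bce"),   # tr2
    frozenset("abce"),  # tr3
    frozenset("be"),    # tr4
    frozenset("abce"),  # tr5
]

def g(x):   # transactions containing item set x
    return [t for t in transactions if x <= t]

def f(ts):  # items common to all transactions in ts
    return frozenset.intersection(*ts) if ts else frozenset()

x = frozenset("be")
assert f(g(x)) == x                             # S({b,e}) = {b,e}: a closed item set
assert len(g(x)) / len(transactions) == 4 / 5   # sup({b,e}) = 4/5
```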
3) taking the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rules as edge weights to construct the association graph of the document. Take out all rule heads and rule bodies in the strong association rule set and put them into a set; take the non-repeated item sets in this set as the nodes of the association graph, the strong associations between nodes as the edges, and the lift values as the edge weights. Continuing the example in b, for the preprocessed document transaction set sample, assume minsup = 0.6, min⁺lift = 1.2 and max⁻lift = 0.95; the closed frequent item sets of the transaction set obtained by the CLOSE algorithm are FC = {{a, c}, {b, e}, {c}, {b, c, e}}, from which the strong association rule set AR is derived. Extract the non-repeated rule heads and rule bodies in AR as the association graph nodes V = {{b, c}, {b, e}, {c, e}, {a}, {b}, {c}, {e}}, and use the association strengths of the strong rules in AR as edge weights to construct the undirected weighted association graph shown in fig. 4;
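A sketch of the graph construction of this step follows; the rules listed below are illustrative triples whose lift values are computed from the worked five-transaction set, since the full rule set AR of this example appears only in the figure.

```python
# A sketch of step 3): collect unique rule heads/bodies as nodes and weight
# each edge by the rule's lift. The rules below are assumed for illustration.

def build_association_graph(rules):
    """rules: iterable of (head, body, lift) triples; returns (nodes, edges)."""
    nodes, edges = set(), {}
    for head, body, lift_value in rules:
        nodes.add(head)
        nodes.add(body)
        # undirected edge keyed on an order-free pair, weighted by lift
        edges[frozenset((head, body))] = lift_value
    return nodes, edges

# hypothetical strong rules over the worked transaction set; each lift is 1.25,
# e.g. lift(a -> c) = sup({a,c}) / (sup({a}) * sup({c})) = (3/5) / ((3/5)(4/5))
rules = [
    (frozenset("a"), frozenset("c"), 1.25),
    (frozenset("b"), frozenset("e"), 1.25),
    (frozenset("b"), frozenset("ce"), 1.25),
]
nodes, edges = build_association_graph(rules)
```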
4) randomly walking on the association diagram by using a GSNA algorithm, iteratively calculating the importance score of each node, and sorting the results in a descending order;
text repetition by different nodes in association graphsThe importance is different, the invention considers the existing method for measuring the node importance in the graph, and provides a GSNA algorithm comprehensively considering the similarity of the nodes and the correlation strength between the nodes, and the algorithm is essentially an improvement on the PageRank algorithm. The PageRank algorithm considers that an edge between the nodes A and B can be regarded as voting of the node A to the node B, the higher the sum of the edge weights of the nodes is, the higher the importance of the node is, and the PageRank simulates an internet owner, can randomly walk between the points on the graph and can jump. After a limited number of excursions, each point reaches a steady state and the expected value of each point tends to a constant value. And generating node sequencing once after each walk iteration, wherein the sequencing value is the PageRank score. Suppose the graph contains N nodes in total, NiFor the i-th node in the dependency graph, e denotes NiNumber of edge entries of (2), PR (N)k) Represents NkPageRank score of (C) (N)k) Represents NkD is used to measure the attenuation coefficient of a node migrating to another node, the PageRank score of Ni is defined as follows:
Here the jump between nodes is an equal-probability event, i.e., 1/N, whereas in theory the probability of jumping between similar nodes should be higher. In addition, nodes can be not only positively associated but also negatively associated with each other: a positive association means one node tends to appear together with another, while a negative association means that once one node appears, the probability of the other appearing decreases. For negative associations between nodes the PR voting score should be decreased, while for positive associations it should be increased.
The GSNA algorithm improves the PageRank from two aspects of the similarity and the relevance of the graph structure among the nodes. When measuring the similarity of the nodes, the GSNA considers the structural attribute and the implicit semantic information of the nodes in the association diagram; when measuring the relevance of the nodes, positive relevance transmission needs to be enhanced, negative relevance transmission needs to be reduced, and PR scores are reasonably scaled by GSNA, so that the node ranking is more accurate.
For the GSNA algorithm, the steps are as follows:
Step (1): quantifying the graph structure attributes of the nodes. Six indexes are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 respectively represent the number of nodes at distances 1, 2 and 3 from the node, and visually represent the node's ability to associate with other nodes; atr4 is the ratio of atr1 to the average degree of the neighbor nodes, representing the degree of similar matching between a node and its surrounding neighbors; atr5 is the ratio of atr2 to atr1 and atr6 is the ratio of atr3 to atr2, quantifying the propagation capability of a node in the association graph. Assuming node Ni exists in the association graph, Ni may be formalized as Ni = (atr1, atr2, atr3, atr4, atr5, atr6).
Step (2): solving the semantic distance between two nodes.
Suppose the word set of the preprocessed document is I = {w1, w2, ..., wn}; the co-occurrence probability distribution of node Ni with I is P = {p1, p2, ..., pn}, where pi is the ratio of the number of sentences in which Ni appears together with the i-th word of I to the total number of sentences of the text, and traversing I yields P. The co-occurrence probability distribution of Nj with I is Q = {q1, q2, ..., qn}; the semantic distance of Ni and Nj is then as shown in equation (3). The larger the semantic distance between two nodes, the larger JSD(Ni, Nj).
For the association graph of the document produced in fig. 4, the word set is I = {a, b, c, d, e}; the co-occurrence probability distribution of node {b, e} with I is P = {2/5, 4/5, 3/5, 0, 4/5}, and the co-occurrence probability distribution of node {a} with I is Q = {3/5, 2/5, 3/5, 1/5, 2/5}.
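A sketch of the semantic-distance computation for this worked example, assuming equation (3) applies the Jensen-Shannon divergence directly to the co-occurrence vectors P and Q as given (they need not sum to 1):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence with 0 * log(0/.) taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two co-occurrence distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [2/5, 4/5, 3/5, 0, 4/5]    # co-occurrence of node {b,e} with I = {a,b,c,d,e}
Q = [3/5, 2/5, 3/5, 1/5, 2/5]  # co-occurrence of node {a} with I
print(jsd(P, Q))                # semantic distance between {b,e} and {a}
```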
Step (3): calculating the similarity between nodes by combining graph structure attributes and semantic information. Integrating the graph structure attributes and semantic information of the nodes, the similarity measure of nodes Ni and Nj is shown in equation (4), with Ni = (atr1, atr2, atr3, atr4, atr5, atr6), and JSD(Ni, Nj) growing with the semantic distance between the two nodes. The greater the difference of the graph structure attributes or semantic distributions of Ni and Nj, the lower the similarity sij; conversely, the closer the structural attributes or semantic distributions of Ni and Nj, the higher sij.
Step (4): normalizing the similarity of each node in the association graph. Assuming the association graph contains N nodes in total, nodes joined by a strong association have a similarity value and nodes without a strong association have similarity 0, a similarity matrix S(N×N) of the association graph can be formed. To measure more formally the degree of similarity of a node to all nodes, the similarity of each node in the association graph is normalized as shown in formula (5).
and (5) correcting the PageRank formula.
riRepresenting a node NiDegree of similarity to the average of all nodes in the graph, r ═ r1,r2,...,rN) I.e. a fixed probability distribution that all nodes in the dependency graph are related to graph structure and semantic attributes. Using riModifying the PageRank formula to ensure that the nodes do not obey equal probability jump any more, but the nodes with higher similarity degree with the association graph are selected with higher probability, wherein the formula is as follows:
since the correlation diagram is an undirected graph, C (N) in the formula (6)k) Representing a node NkE represents NkThe number of neighbors.
Step (6): strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule exists between them. To strengthen the PR voting score of positive associations and lower that of negative associations, formula (6) is improved using the lift values to obtain formula (7).
and (7) carrying out random walk on the association diagram according to a formula (6), and iteratively calculating the PR value of each node until a formula (8) is met, so that the node score reaches a convergence state, wherein beta is a random walk termination threshold value.
5) clustering the top-ranked nodes and taking the class center of each class as the keyword extraction result of the document;
the nodes of the association graph are formed according to the frequent closed item set taking subsets with different lengths, the subsets of the k-item set contain 2k-1 non-empty subsets in total, and the 1-item set is necessarily the subset of the (k-1) -item set, and the phenomenon that the (k-1) -item set has semantic inclusion relation with the 1-item set can be considered. In order to prevent semantic inclusion relation between important nodes obtained after the nodes are sorted in a descending order, the invention carries out K-Mediod clustering on the sorted nodes.
Because non-key terms may otherwise be clustered on their own, the top P nodes after descending sorting are selected first; K of the P nodes (K ≤ P) are selected as K class centers; then the semantic distance between each remaining node and each class center is calculated and the node is assigned to the closest class; finally, after all nodes are assigned, the class center of each class is sought again and the non-center nodes are reassigned. This process of center adjustment and node assignment is repeated until the difference between two adjacent assignment results reaches the clustering termination threshold, and the class center of each cluster is extracted as a final keyword of the text. The algorithm is implemented by the following steps (a sketch follows the list):
step (1) select the top P nodes after descending sorting;
step (2) select K of the P nodes (K ≤ P) as K class centers;
step (3) calculate the semantic distance between each remaining node and each class center, assigning the node to the closest class;
step (4) for the nodes other than the class center in each class, find the node minimizing the sum of semantic distances to all other points in the class, and take that minimum point as the new class center;
step (5) continue repeating step (4) until the difference between two adjacent node assignment results reaches the clustering termination threshold, and extract the class center of each cluster as a final keyword of the text.
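A sketch of the K-Medoid clustering of steps (1) through (5); a stability test stands in for the threshold-based termination criterion described above, and the `distance` callable is assumed to be the semantic distance of step (2) of the GSNA algorithm.

```python
import random

def k_medoid(nodes, k, distance, max_iter=100):
    """K-Medoid clustering of the top-ranked nodes (steps (1)-(5)).

    nodes:    the top P nodes after descending sort
    distance: callable giving the semantic distance between two nodes
    Returns the k medoids, extracted as the final keywords.
    """
    medoids = random.sample(nodes, k)                        # step (2): pick K centres
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}
        for n in nodes:                                      # step (3): assign to nearest
            if n in medoids:
                continue
            clusters[min(medoids, key=lambda m: distance(n, m))].append(n)
        new_medoids = []
        for members in clusters.values():                    # step (4): re-centre each class
            new_medoids.append(min(members, key=lambda c: sum(distance(c, o) for o in members)))
        if set(new_medoids) == set(medoids):                 # step (5): stop when stable
            break
        medoids = new_medoids
    return medoids
```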
The invention denotes the extracted keyword sequence as Nret and the labeled keyword sequence as Nrel; the evaluation indexes involved include MAP (Mean Average Precision), Recall, and Fβ. Since the invention expects the retrieved keywords to be both accurate and ranked as early as possible, Fβ is slightly modified by introducing MAP into the Fβ calculation. The formula is defined as follows:
where the function index(ti, Nrel) returns the index position of term ti in the sequence Nrel; g(ti, Nrel) is an indicator function that returns 1 if term ti appears in the original labeled sequence Nrel and 0 otherwise. Compared with recall, the method focuses more on improving precision, so β in the Fβ index is set to 0.5.
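The metric formulas appear only as figures, so the following sketch assumes the standard MAP over the ranked extraction and an Fβ in which MAP stands in for precision, as the text describes, with β = 0.5.

```python
# Assumed metric definitions: standard MAP, and F-beta with MAP in place of
# precision; the exact formulas in the patent are given only as figures.

def map_score(n_ret, n_rel):
    """Mean average precision of the ranked extraction n_ret against gold n_rel."""
    hits, total = 0, 0.0
    for i, t in enumerate(n_ret, start=1):
        if t in n_rel:                 # g(t, n_rel) = 1
            hits += 1
            total += hits / i          # precision at this hit's rank
    return total / len(n_rel) if n_rel else 0.0

def f_beta(n_ret, n_rel, beta=0.5):
    """F-beta with MAP standing in for precision, as the text describes."""
    p = map_score(n_ret, n_rel)
    r = len(set(n_ret) & set(n_rel)) / len(n_rel)   # recall
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```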
Design of experiments
In order to verify the feasibility of the algorithm, the experimental design comprises four parts: first, selecting experimental data and suitable evaluation indexes; second, searching for the input parameter settings corresponding to the optimal association graph model; then discussing the number of keywords to extract under the optimal association graph; and finally selecting four keyword extraction algorithms to compare with the algorithm of the invention, further demonstrating its superiority.
1. Influence of input parameters
Because the model has many input parameters, different parameter settings affect the algorithm results to different degrees. For example, if minsup in the CLOSE algorithm is set too high, a large number of meaningful words is discarded; if it is set too low, a large number of redundant words is included. Changes in min⁺lift and max⁻lift directly influence the structural form of the association graph, and adjusting the damping coefficient d of the GSNA algorithm during the random walk also influences the keyword extraction effect. The invention sets the variation range of minsup to [0.5%, 10%]; min⁺lift increases in steps of 0.1 from a base of 0.1; max⁻lift starts from an initial value of 5 and increases in steps of 5; d is taken according to experience, with a variation range of [0.15, 0.85]. Suppose the sorted keyword sequence is Nv; to avoid the complexity of analyzing the extraction effect under too many parameter variations, the number of extracted keywords is temporarily not considered: when tuning the model parameters, the top |Nrel| terms of the keyword sequence Nv obtained each time are taken as the extracted keyword sequence Nret and used to calculate the MAP.
2. Selection of the number of extracted words
Keyword extraction focuses on selecting the words most relevant to the text, so the number and scale of the selected keywords matter. The invention introduces an extraction ratio r = |Nret| / |Nv|; given the parameters found in 1, |Nret| = r × |Nv| represents the number of keywords extracted from the ranked sequence Nv. Based on the parameters obtained in 1, the optimal r is calculated.
3. Algorithm comparison and analysis
In order to verify the superiority and comparability of the algorithm, TextRank, GraphSum, a keyword extraction method considering node association but not graph structure (Key Words Extraction Method by Node Association, NA), and a keyword extraction method considering graph structure but not node association (Key Words Extraction Method by Graph Structure, GS) are compared with the keyword results obtained by the algorithm of the invention.
The four comparison methods are selected based on the following considerations: (1) TextRank is a traditional algorithm for automatically generating keywords and considers neither graph structure nor node association information; (2) comparing the NA algorithm, which only considers node association information, and the GS algorithm, which only considers graph structure information, with the GSNA algorithm, which considers both, shows the importance of graph structure and node association to keyword extraction; (3) GSNA improves on the GraphSum method, which is the most similar to the method of the invention; since GraphSum is designed to generate text summaries, the sentence-covering step of the original GraphSum algorithm is removed in the experiment so as to obtain text keywords.