Disclosure of Invention
The invention provides a keyword extraction method fusing graph structure and node association (GSNA), which comprehensively considers the graph structure attributes of nodes, their semantic information, and the association characteristics among nodes, and can find keywords in long texts without relying on an external corpus.
In order to achieve the above object, the keyword extraction method fusing graph structure and node association according to the present invention comprises the following steps:
1) acquiring a text and performing text preprocessing;
2) mining a frequent closed item set of a document and generating a strong association rule set;
3) taking the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct an association graph of the document;
4) randomly walking on the association diagram by using a GSNA algorithm, iteratively calculating the importance score of each node, and sorting the results in a descending order;
5) clustering the top-ranked nodes, and taking the class center of each class as the keyword extraction result of the document.
The method for preprocessing the text in step 1) comprises the following steps:
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all directories, titles, figures and tables in the text and only leaving text information;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. duplicate words in each sentence are removed.
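As an illustration of steps a through e, the following is a minimal preprocessing sketch in Python; the whitespace tokenization and the small stop-word list are stand-ins for the segmenter and the stop-word list the invention actually uses.

```python
# A minimal sketch of preprocessing steps a-e, assuming whitespace-tokenized
# English input; STOP_WORDS is illustrative only, not the list the invention uses.

STOP_WORDS = {"the", "is", "at", "which", "on", "a", "of", "and"}

def preprocess(text: str) -> list[list[str]]:
    """Split text into sentences, tokenize, drop stop words, dedupe per sentence."""
    sentences = [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".")]
    transactions = []
    for s in sentences:
        if not s:
            continue
        words = [w.lower() for w in s.split()]             # step c: space-based segmentation
        words = [w for w in words if w not in STOP_WORDS]  # step d: stop-word filtering
        seen, deduped = set(), []
        for w in words:                                    # step e: remove duplicate words
            if w not in seen:
                seen.add(w)
                deduped.append(w)
        if deduped:
            transactions.append(deduped)
    return transactions
```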
The method for generating the strong association rule set in 2) comprises the following steps:
a. establishing a transaction set and an item set; a transaction is a single sentence in the text T; the transaction set is the whole preprocessed text T; an item set is any subset of all words contained in the text T, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set;
b. solving the closed item set; an item set X is a closed item set if and only if S(X) = f(g(X)) = f∘g(X) = X, where S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of the item set; for an association rule A → B, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, the association strength of A → B is lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B)) (1);
since lift(A → B) = lift(B → A), the lift metric satisfies symmetry. From the analysis of probability distribution, when lift(A → B) = 1, A and B obey independent probability distributions, indicating no correlation between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced is higher than the prior probability of B, indicating a positive association between A and B; when lift(A → B) < 1, the posterior probability of B after A is introduced is lower than the prior probability of B, indicating a negative association between A and B;
f. solving strong association rules; for an association rule A → B, if sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻lift or lift(A → B) ≥ min⁺lift, then A → B is a strong association rule; where minsup is the minimum support threshold of the association rule, max⁻lift is the maximum negative-association threshold, and min⁺lift is the minimum positive-association threshold (this test is sketched in code after this list);
g. solving the closed frequent item set; an item set X is a closed frequent item set if and only if sup(X) ≥ minsup and S(X) = X; the invention adopts the CLOSE algorithm to mine closed frequent item sets and thereby generate the strong association rule set;
the CLOSE algorithm is realized by the following steps:
step (1) preprocess the text; after deduplication, generate the 1-item sets, scan the transaction set to compute the closure and support of each 1-item set, and record the resulting set of closures and supports as FCC1; then remove from FCC1 the 1-item sets whose support is less than minsup or whose closure duplicates another, obtaining FC1 (FC1 is the set of closures and supports obtained from the 1-item sets by the CLOSE algorithm);
step (2) when generating the i-item sets (starting from i = 2), combine pairwise the different item sets among the generators of FC(i−1) as the generators of FCCi, compute the closure and support of each i-item set by scanning the transaction set to obtain FCCi, and remove the item sets whose support is less than minsup or whose closure is repeated, obtaining FCi;
step (3) increment i and repeat step (2) until the generator set of FCCi is empty, at which point the CLOSE algorithm ends;
step (4) finally, take out all the closures of FC1 through FCi to obtain all the closed frequent item sets FC of the transaction set;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L1, L2, ..., Lk}, where Li is the frequent closure set of the i-item sets; second, extract the closures of different lengths in L together with their supports to form the closed frequent item set FC; then, from FC obtain the frequent item sets of different lengths and their supports, Fre = {f1, f2, ..., fk}, where fi is the i-frequent item set and its support set; finally, according to fi (2 ≤ i ≤ k), generate association rule sets AR of different forms, each rule composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the (i−m)-item set, and compute lift(m-itemset → (i−m)-itemset) according to Fre. Extract the rules satisfying lift ≤ max⁻lift or lift ≥ min⁺lift as strong association rules.
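The lift computation and the strong-rule test of steps e and f can be sketched as follows; the symmetric lift form matches lift(A → B) = lift(B → A) stated above, and the default threshold values are the ones used in the worked example of step 3) below.

```python
# A sketch of the support, lift and strong-rule test of steps d-f, assuming
# lift(A -> B) = sup(A ∪ B) / (sup(A) * sup(B)), the symmetric form above.

def support(itemset: frozenset, transactions: list) -> float:
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a: frozenset, b: frozenset, transactions: list) -> float:
    return support(a | b, transactions) / (support(a, transactions) * support(b, transactions))

def is_strong_rule(a, b, transactions, minsup=0.4, max_neg_lift=0.95, min_pos_lift=1.2):
    """A -> B is strong if its support clears minsup and its lift is either
    clearly positive (>= min_pos_lift) or clearly negative (<= max_neg_lift)."""
    l = lift(a, b, transactions)
    return support(a | b, transactions) >= minsup and (l <= max_neg_lift or l >= min_pos_lift)
```

A compact sketch of the CLOSE mining of steps (1) through (4) follows, using the same closure operation S(X) = f(g(X)) defined in b; the pruning of candidates contained in the closure of a subset generator is a simplified reading of the algorithm description, not the reference CLOSE implementation.

```python
from itertools import combinations

def close(transactions, minsup):
    """Return {closure: support} for all closed frequent item sets (steps (1)-(4))."""
    n = len(transactions)

    def g(x):                      # transactions containing item set x
        return [t for t in transactions if x <= t]

    def f(ts):                     # items common to all transactions in ts
        return frozenset.intersection(*ts) if ts else frozenset()

    fc = {}                        # all closed frequent item sets found so far
    # step (1): the 1-item sets are the first generators
    generators = {frozenset([w]) for t in transactions for w in t}
    while generators:              # step (3): repeat until no generators remain
        kept, seen = {}, set()
        for gen in sorted(generators, key=sorted):   # deterministic order
            ts = g(gen)
            sup = len(ts) / n
            clo = f(ts)
            # drop infrequent generators and generators with duplicate closures
            if sup >= minsup and clo not in seen:
                seen.add(clo)
                kept[gen] = clo
                fc[clo] = sup
        # step (2): combine surviving generators pairwise into (i+1)-item candidates,
        # pruning any candidate already contained in the closure of a subset generator
        generators = set()
        for g1, g2 in combinations(kept, 2):
            cand = g1 | g2
            if len(cand) == len(g1) + 1 and not (cand <= kept[g1] or cand <= kept[g2]):
                generators.add(cand)
    return fc
```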
The method for generating the association graph in 3) comprises the following steps: take out all rule heads and rule bodies in the strong association rule set and put them into a set; take the non-repeated item sets in this set as the nodes of the association graph, the strong associations existing between nodes as the edges of the association graph, and the lift values as the weights of the edges.
The GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of the nodes; six indexes are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 respectively represent the number of nodes at distances 1, 2 and 3 from the node, atr4 is the ratio of atr1 to the average degree of the neighbor nodes, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; assuming node Ni exists in the association graph, Ni is formalized as Ni = (atr1, atr2, atr3, atr4, atr5, atr6) (a breadth-first sketch follows this list);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w1, w2, ..., wn}, the co-occurrence probability distribution of node Ni with I is P = {p1, p2, ..., pn}, and the co-occurrence probability distribution of Nj with I is Q = {q1, q2, ..., qn}; the semantic distance between Ni and Nj is then the Jensen-Shannon divergence shown in formula (2): JSD(Ni, Nj) = ½·KL(P ‖ M) + ½·KL(Q ‖ M), where M = (P + Q)/2 and KL is the Kullback-Leibler divergence;
c. calculating the similarity between nodes by combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of the nodes, the similarity measure of nodes Ni and Nj is shown in formula (3); the greater the difference of the graph structure attributes or semantic distributions of Ni and Nj, the lower the similarity sij; conversely, the closer the structural attributes or semantic distributions of Ni and Nj, the higher sij;
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes joined by a strong association have a similarity value and nodes without a strong association have similarity 0, a similarity matrix S(N×N) of the association graph can be formed. To measure more formally the degree of similarity of a node to all nodes, the similarity of each node in the association graph is normalized as shown in formula (4).
f. modifying the PageRank formula;
ri represents the average degree of similarity of node Ni to all nodes in the graph, and r = (r1, r2, ..., rN) is the stationary probability distribution of all nodes in the association graph with respect to graph structure and semantic attributes; using ri to modify the PageRank formula, nodes no longer obey equal-probability jumps; instead, nodes with a higher degree of similarity to the association graph are selected with higher probability, as shown in formula (5).
since the correlation diagram is an undirected graph, C (N) in the formula (5)k) Representing a node NkE represents NkThe number of neighbors of (2);
g. strengthening the PR voting score of positive associations and reducing the PR voting score of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule exists between them. To strengthen the PR voting score of positive associations and lower that of negative associations, formula (5) is improved using the lift values to obtain formula (6);
h. randomly walking on the association graph according to formula (6) and iteratively calculating the PR value of each node until formula (7) is satisfied, at which point the node scores reach a convergence state, where β is the random-walk termination threshold;
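As an illustration of step a, the following is a sketch of how the six indexes of one node might be computed by breadth-first search; the adjacency-set representation of the association graph is an assumption made for illustration.

```python
from collections import deque

def graph_attributes(adj: dict, node) -> tuple:
    """Compute atr1..atr6 for one node, where adj maps each node to the set of
    its neighbours in the (undirected) association graph."""
    dist, frontier = {node: 0}, deque([node])
    while frontier:
        u = frontier.popleft()
        if dist[u] >= 3:                      # only rings 1..3 are needed
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    atr1 = sum(1 for d in dist.values() if d == 1)   # nodes at distance 1
    atr2 = sum(1 for d in dist.values() if d == 2)   # nodes at distance 2
    atr3 = sum(1 for d in dist.values() if d == 3)   # nodes at distance 3
    neigh = adj[node]
    avg_deg = sum(len(adj[v]) for v in neigh) / len(neigh) if neigh else 1.0
    atr4 = atr1 / avg_deg                    # degree vs. average neighbour degree
    atr5 = atr2 / atr1 if atr1 else 0.0      # propagation from ring 1 to ring 2
    atr6 = atr3 / atr2 if atr2 else 0.0      # propagation from ring 2 to ring 3
    return (atr1, atr2, atr3, atr4, atr5, atr6)
```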
and 5) clustering the nodes, with the following steps: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; select K of the P nodes (K ≤ P) as K class centers; then calculate the semantic distance between each remaining node and each class center, assigning the node to the closest class; finally, after all nodes are assigned, search for the class center of each class again and reassign the non-center nodes; continuously repeat this process of center adjustment and node assignment until the difference between two adjacent assignment results reaches the clustering termination threshold, and extract the class center of each cluster as a final keyword of the text. The basic technical flow is shown in fig. 1.
The keyword extraction method fusing graph structure and node association disclosed by the invention has the following beneficial effects. The invention uses the CLOSE algorithm to mine the strong association rules from which the association graph is constructed; the strong association rules obtained on the basis of closed frequent item sets are small in scale, fast to compute, and lossless, which avoids the overly complex graph structure caused by extracting redundant frequent item sets and improves the overall execution efficiency of the algorithm. Node similarity is computed by jointly considering the graph structure attributes and the semantic information of the nodes in the association graph, so that node jumps no longer obey an equal-probability distribution and the probability of jumping between similar nodes increases. The invention proposes that, during the random walk on the association graph, votes between nodes carry positive and negative association scores: positive-association voting scores are strengthened and negative-association voting scores are reduced, making the node scores more reasonable. The method can find the keywords of a long text without relying on an external corpus; it comprehensively considers the graph structure attributes of nodes, their semantic information, and the association characteristics between nodes, and avoids the low execution efficiency that constructing a complex association graph would cause for GSNA. The method first mines the closed frequent item sets of a document and generates the strong association rule set; second, it takes the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and takes the association strength of the rules as edge weights to construct the association graph of the document; then, it randomly walks on the association graph using the GSNA algorithm, iteratively calculates the importance score of each node, and sorts the results in descending order; finally, to avoid semantic inclusion relations among the extracted keywords, it clusters the top-ranked nodes and takes the class center of each class as the keyword extraction result of the document.
Detailed Description
Example 1
As shown in figs. 1-4, the invention provides a keyword extraction method fusing graph structure and node association, which comprises the following steps:
1) acquiring a text and performing text preprocessing;
a. acquiring a text; Darwin's "On the Origin of Species" is selected as the text, which consists of a plurality of sentences;
b. removing all directories, titles, figures and tables in the text and only leaving text information;
c. segmenting the words of the text: English text is segmented simply on spaces, while Chinese text is segmented using a common word segmentation algorithm; in English, spaces serve as natural delimiters, whereas in Chinese only characters, sentences and paragraphs are delimited by obvious marks and individual words have no formal delimiter, so suitable word segmentation software must be chosen to segment Chinese; because the selected text is Chinese, it is segmented using the jieba segmenter; for example, a passage of "On the Origin of Species":
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
After word segmentation it becomes:
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
d. filtering the stop words of the text; stop words have no practical meaning: for English text, "the", "is", "at", "which", "on", etc. can be treated as stop words, and for Chinese text, words corresponding to "a", "here", "where", "I", "though", etc. can be treated as stop words. After the text is filtered for stop words, the text separated by stop marks ("," and ".") forms a sentence set St = {st1, st2, ..., stn}; the stop word list can be user-defined, which the invention does not limit. The invention uses the Baidu stop word list plus some stop words defined for the text. The passage given in step c, after stop-word filtering, becomes:
"natural selection action is fully preserved and accumulated organic and inorganic conditions of various variant organisms in life period are favorable
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
e. removing the repeated words in each sentence; after this step, the stop-word-filtered passage becomes:
"natural selection action is fully preserved and accumulated in the life period of various variant organisms under the organic and inorganic conditions
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
2) mining a frequent closed item set of a document and generating a strong association rule set;
a. establishing a transaction set and an item set; a transaction is a single sentence in the text T; the transaction set is the whole preprocessed text T; an item set is any subset of all words contained in the text T, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set. For "On the Origin of Species", the transaction set is the whole preprocessed text of the book, a transaction is one sentence of the preprocessed text, and an item set is any set of words in the preprocessed text, denoted I = {w1, w2, ..., wm}, where wi is the i-th word in the item set;
b. solving the closed item set; an item set X is closed if and only if S(X) = f(g(X)) = f∘g(X) = X, where S = f∘g is the closure operation. Assume the transaction set contains five transactions with IDs tr1, tr2, tr3, tr4, tr5 and item sets {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e} and {a, b, c, e}, respectively. One closed item set is S({b, e}) = f(g({b, e})) = f({tr2, tr3, tr4, tr5}) = {b, e}.
c. The length of the item set is calculated, the item set { a, c, d } is called a 3-item set, and the item set { a, b, c, e } is called a 4-item set;
d. solving the support of an item set: for an item set X, the support sup(X) is the ratio of the number of transactions g(X) containing X to the total number of transactions, i.e., |g(X)|/|T|. For an association rule A → B, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T. In the example in b, {b, e} occurs in transactions tr2, tr3, tr4, tr5, so the support of the item set {b, e} is sup({b, e}) = 4/5 (this is verified in the code snippet following step h);
e. calculating the association strength; for an association rule A → B, the association strength of A → B is lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B)) (1);
since lift(A → B) = lift(B → A), the lift metric satisfies symmetry. From the analysis of probability distribution, when lift(A → B) = 1, A and B obey independent probability distributions, indicating no correlation between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced is higher than the prior probability of B, indicating a positive association between A and B; when lift(A → B) < 1, the posterior probability of B after A is introduced is lower than the prior probability of B, indicating a negative association between A and B;
f. solving strong association rules; for an association rule A → B, if sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻lift or lift(A → B) ≥ min⁺lift, then A → B is a strong association rule; where minsup is the minimum support threshold of the association rule, max⁻lift is the maximum negative-association threshold, and min⁺lift is the minimum positive-association threshold;
g. solving the closed frequent item set; an item set X is a closed frequent item set if and only if sup(X) ≥ minsup and S(X) = X; minsup is the minimum support of X in the transaction set.
The conventional association rule mining algorithm Apriori constructs association rules by mining frequent item sets of different lengths; when the transaction scale is large, the transaction set is scanned repeatedly during rule generation, so a large number of redundant frequent item sets is easily produced, lowering efficiency and complicating the analysis. In order to quickly extract effective strong association rules from the document transaction set, the invention applies the CLOSE algorithm to mine closed frequent item sets and generate the strong association rule set; the algorithm uses the closure operation combined with a breadth-first strategy to search the item set space, quickly pruning non-frequent and non-closed item sets from the search space.
Assuming minsup = 2/5, for the example in b the CLOSE algorithm first generates the 1-item sets as generators of FCC1 and scans the transaction set to compute the closure and support of each 1-item set, obtaining FCC1; the item sets with support less than 2/5 and the 1-item sets with identical closures are then removed to obtain FC1. In generating the 2-item sets, only the different item sets in FC1 are combined pairwise, yielding {a, b}, {a, c}, {b, c}; since {a, c} is contained in the closure S({a}) = {a, c} of its subset {a}, {a, c} is pruned, and {a, b} and {b, c} are retained as the generators of FCC2; the closures and supports of the 2-item sets are computed by scanning the transaction set to obtain FCC2, and the item sets with support less than 2/5 and with repeated closures are removed to obtain FC2. In generating FCC3 from FC2, the only 3-item set generated by the generators of FC2 is {a, b, c}, but {a, b, c} is contained in the closure S({a, b}) = {a, b, c, e}, so the pruned 3-item candidate set is empty, at which point the CLOSE algorithm ends. Finally, all the closures of FC1 and FC2 are taken out to obtain all the closed frequent item sets FC of the transaction set. FIG. 2 illustrates the closed frequent item set generation process of this transaction set in detail;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L1, L2, ..., Lk}, where Li is the frequent closure set of the i-item sets; second, extract the closures of different lengths in L together with their supports to form the closed frequent item set FC; then, from FC obtain the frequent item sets of different lengths and their supports, Fre = {f1, f2, ..., fk}, where fi is the i-frequent item set and its support set; finally, according to fi, for 2 ≤ i ≤ k, generate association rule sets AR of different forms, each rule composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the (i−m)-item set, and compute lift(m-itemset → (i−m)-itemset) according to Fre; extract the rules satisfying lift ≤ max⁻lift or lift ≥ min⁺lift as strong association rules. In the example in b, with minsup = 2/5, the 3-item set {b, c, e} can produce rules whose head is a 1-item set and whose body is a 2-item set, or whose head is a 2-item set and whose body is a 1-item set; the lift metric satisfies symmetry, which can also be seen from the figure. The 2-item set {a, c} has only a 1-item-set rule head and rule body, and lift(a → c) = lift(c → a); the closed frequent item set generation process is shown in fig. 3;
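The worked examples of steps b and d can be checked directly in code; the following snippet verifies that {b, e} is closed and has support 4/5.

```python
# The five-transaction example from steps b and d, checked in code.

transactions = [
    frozenset("acd"),   # tr1
    frozenset("bce"),   # tr2
    frozenset("abce"),  # tr3
    frozenset("be"),    # tr4
    frozenset("abce"),  # tr5
]

def g(x):   # transactions containing item set x
    return [t for t in transactions if x <= t]

def f(ts):  # items common to all transactions in ts
    return frozenset.intersection(*ts) if ts else frozenset()

x = frozenset("be")
assert f(g(x)) == x                             # S({b,e}) = {b,e}: a closed item set
assert len(g(x)) / len(transactions) == 4 / 5   # sup({b,e}) = 4/5
```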
3) taking the non-repeated rule heads and rule bodies in the strong association rule set as nodes, where an edge exists between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rules as edge weights to construct the association graph of the document. Take out all rule heads and rule bodies in the strong association rule set and put them into a set; take the non-repeated item sets in this set as the nodes of the association graph, the strong associations between nodes as the edges, and the lift values as the edge weights. Continuing the example in b, for the preprocessed document transaction set sample, assume minsup = 0.6, min⁺lift = 1.2 and max⁻lift = 0.95; the closed frequent item sets of the transaction set obtained by the CLOSE algorithm are FC = {{a, c}, {b, e}, {c}, {b, c, e}}, from which the strong association rule set AR is derived. Extract the non-repeated rule heads and rule bodies in AR as the association graph nodes V = {{b, c}, {b, e}, {c, e}, {a}, {b}, {c}, {e}}, and use the association strengths of the strong rules in AR as edge weights to construct the undirected weighted association graph shown in fig. 4;
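A sketch of the graph construction of this step follows; the rules listed below are illustrative triples whose lift values are computed from the worked five-transaction set, since the full rule set AR of this example appears only in the figure.

```python
# A sketch of step 3): collect unique rule heads/bodies as nodes and weight
# each edge by the rule's lift. The rules below are assumed for illustration.

def build_association_graph(rules):
    """rules: iterable of (head, body, lift) triples; returns (nodes, edges)."""
    nodes, edges = set(), {}
    for head, body, lift_value in rules:
        nodes.add(head)
        nodes.add(body)
        # undirected edge keyed on an order-free pair, weighted by lift
        edges[frozenset((head, body))] = lift_value
    return nodes, edges

# hypothetical strong rules over the worked transaction set; each lift is 1.25,
# e.g. lift(a -> c) = sup({a,c}) / (sup({a}) * sup({c})) = (3/5) / ((3/5)(4/5))
rules = [
    (frozenset("a"), frozenset("c"), 1.25),
    (frozenset("b"), frozenset("e"), 1.25),
    (frozenset("b"), frozenset("ce"), 1.25),
]
nodes, edges = build_association_graph(rules)
```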
4) randomly walking on the association diagram by using a GSNA algorithm, iteratively calculating the importance score of each node, and sorting the results in a descending order;
text repetition by different nodes in association graphsThe importance is different, the invention considers the existing method for measuring the node importance in the graph, and provides a GSNA algorithm comprehensively considering the similarity of the nodes and the correlation strength between the nodes, and the algorithm is essentially an improvement on the PageRank algorithm. The PageRank algorithm considers that an edge between the nodes A and B can be regarded as voting of the node A to the node B, the higher the sum of the edge weights of the nodes is, the higher the importance of the node is, and the PageRank simulates an internet owner, can randomly walk between the points on the graph and can jump. After a limited number of excursions, each point reaches a steady state and the expected value of each point tends to a constant value. And generating node sequencing once after each walk iteration, wherein the sequencing value is the PageRank score. Suppose the graph contains N nodes in total, NiFor the i-th node in the dependency graph, e denotes NiNumber of edge entries of (2), PR (N)k) Represents NkPageRank score of (C) (N)k) Represents NkD is used to measure the attenuation coefficient of a node migrating to another node, the PageRank score of Ni is defined as follows:
Here the jump between nodes is an equal-probability event, i.e., 1/N, whereas in theory the probability of jumping between similar nodes should be higher. In addition, nodes can be not only positively associated but also negatively associated with each other: a positive association means one node tends to appear together with another, while a negative association means that once one node appears, the probability of the other appearing decreases. For negative associations between nodes the PR voting score should be decreased, while for positive associations it should be increased.
The GSNA algorithm improves the PageRank from two aspects of the similarity and the relevance of the graph structure among the nodes. When measuring the similarity of the nodes, the GSNA considers the structural attribute and the implicit semantic information of the nodes in the association diagram; when measuring the relevance of the nodes, positive relevance transmission needs to be enhanced, negative relevance transmission needs to be reduced, and PR scores are reasonably scaled by GSNA, so that the node ranking is more accurate.
For the GSNA algorithm, the steps are as follows:
Step (1): quantifying the graph structure attributes of the nodes. Six indexes are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 respectively represent the number of nodes at distances 1, 2 and 3 from the node, and visually represent the node's ability to associate with other nodes; atr4 is the ratio of atr1 to the average degree of the neighbor nodes, representing the degree of similar matching between a node and its surrounding neighbors; atr5 is the ratio of atr2 to atr1 and atr6 is the ratio of atr3 to atr2, quantifying the propagation capability of a node in the association graph. Assuming node Ni exists in the association graph, Ni may be formalized as Ni = (atr1, atr2, atr3, atr4, atr5, atr6).
Step (2): solving the semantic distance between two nodes.
Suppose the word set of the preprocessed document is I = {w1, w2, ..., wn}; the co-occurrence probability distribution of node Ni with I is P = {p1, p2, ..., pn}, where pi is the ratio of the number of sentences in which Ni appears together with the i-th word of I to the total number of sentences of the text, and traversing I yields P. The co-occurrence probability distribution of Nj with I is Q = {q1, q2, ..., qn}; the semantic distance of Ni and Nj is then as shown in equation (3). The larger the semantic distance between two nodes, the larger JSD(Ni, Nj).
For the association graph of the document produced in fig. 4, the word set is I = {a, b, c, d, e}; the co-occurrence probability distribution of node {b, e} with I is P = {2/5, 4/5, 3/5, 0, 4/5}, and the co-occurrence probability distribution of node {a} with I is Q = {3/5, 2/5, 3/5, 1/5, 2/5}.
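A sketch of the semantic-distance computation for this worked example, assuming equation (3) applies the Jensen-Shannon divergence directly to the co-occurrence vectors P and Q as given (they need not sum to 1):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence with 0 * log(0/.) taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two co-occurrence distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [2/5, 4/5, 3/5, 0, 4/5]    # co-occurrence of node {b,e} with I = {a,b,c,d,e}
Q = [3/5, 2/5, 3/5, 1/5, 2/5]  # co-occurrence of node {a} with I
print(jsd(P, Q))                # semantic distance between {b,e} and {a}
```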
Step (3): calculating the similarity between nodes by combining graph structure attributes and semantic information. Integrating the graph structure attributes and semantic information of the nodes, the similarity measure of nodes Ni and Nj is shown in equation (4), with Ni = (atr1, atr2, atr3, atr4, atr5, atr6), and JSD(Ni, Nj) growing with the semantic distance between the two nodes. The greater the difference of the graph structure attributes or semantic distributions of Ni and Nj, the lower the similarity sij; conversely, the closer the structural attributes or semantic distributions of Ni and Nj, the higher sij.
Step (4): normalizing the similarity of each node in the association graph. Assuming the association graph contains N nodes in total, nodes joined by a strong association have a similarity value and nodes without a strong association have similarity 0, a similarity matrix S(N×N) of the association graph can be formed. To measure more formally the degree of similarity of a node to all nodes, the similarity of each node in the association graph is normalized as shown in formula (5).
and (5) correcting the PageRank formula.
riRepresenting a node NiDegree of similarity to the average of all nodes in the graph, r ═ r1,r2,...,rN) I.e. a fixed probability distribution that all nodes in the dependency graph are related to graph structure and semantic attributes. Using riModifying the PageRank formula to ensure that the nodes do not obey equal probability jump any more, but the nodes with higher similarity degree with the association graph are selected with higher probability, wherein the formula is as follows:
since the correlation diagram is an undirected graph, C (N) in the formula (6)k) Representing a node NkE represents NkThe number of neighbors.
Step (6): strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule exists between them. To strengthen the PR voting score of positive associations and lower that of negative associations, formula (6) is improved using the lift values to obtain formula (7).
and (7) carrying out random walk on the association diagram according to a formula (6), and iteratively calculating the PR value of each node until a formula (8) is met, so that the node score reaches a convergence state, wherein beta is a random walk termination threshold value.
5) clustering the top-ranked nodes and taking the class center of each class as the keyword extraction result of the document;
the nodes of the association graph are formed according to the frequent closed item set taking subsets with different lengths, the subsets of the k-item set contain 2k-1 non-empty subsets in total, and the 1-item set is necessarily the subset of the (k-1) -item set, and the phenomenon that the (k-1) -item set has semantic inclusion relation with the 1-item set can be considered. In order to prevent semantic inclusion relation between important nodes obtained after the nodes are sorted in a descending order, the invention carries out K-Mediod clustering on the sorted nodes.
Because non-key terms may otherwise be clustered on their own, the top P nodes after descending sorting are selected first; K of the P nodes (K ≤ P) are selected as K class centers; then the semantic distance between each remaining node and each class center is calculated and the node is assigned to the closest class; finally, after all nodes are assigned, the class center of each class is sought again and the non-center nodes are reassigned. This process of center adjustment and node assignment is repeated until the difference between two adjacent assignment results reaches the clustering termination threshold, and the class center of each cluster is extracted as a final keyword of the text. The algorithm is implemented by the following steps (a sketch follows the list):
step (1) select the top P nodes after descending sorting;
step (2) select K of the P nodes (K ≤ P) as K class centers;
step (3) calculate the semantic distance between each remaining node and each class center, assigning the node to the closest class;
step (4) for the nodes other than the class center in each class, find the node minimizing the sum of semantic distances to all other points in the class, and take that minimum point as the new class center;
step (5) continue repeating step (4) until the difference between two adjacent node assignment results reaches the clustering termination threshold, and extract the class center of each cluster as a final keyword of the text.
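A sketch of the K-Medoid clustering of steps (1) through (5); a stability test stands in for the threshold-based termination criterion described above, and the `distance` callable is assumed to be the semantic distance of step (2) of the GSNA algorithm.

```python
import random

def k_medoid(nodes, k, distance, max_iter=100):
    """K-Medoid clustering of the top-ranked nodes (steps (1)-(5)).

    nodes:    the top P nodes after descending sort
    distance: callable giving the semantic distance between two nodes
    Returns the k medoids, extracted as the final keywords.
    """
    medoids = random.sample(nodes, k)                        # step (2): pick K centres
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}
        for n in nodes:                                      # step (3): assign to nearest
            if n in medoids:
                continue
            clusters[min(medoids, key=lambda m: distance(n, m))].append(n)
        new_medoids = []
        for members in clusters.values():                    # step (4): re-centre each class
            new_medoids.append(min(members, key=lambda c: sum(distance(c, o) for o in members)))
        if set(new_medoids) == set(medoids):                 # step (5): stop when stable
            break
        medoids = new_medoids
    return medoids
```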
The invention denotes the extracted keyword sequence as Nret and the labeled keyword sequence as Nrel; the evaluation indexes involved include MAP (Mean Average Precision), Recall, and Fβ. Since the invention expects the retrieved keywords to be both accurate and ranked as early as possible, Fβ is slightly modified by introducing MAP into the Fβ calculation. The formula is defined as follows:
where the function index(ti, Nrel) returns the index position of term ti in the sequence Nrel; g(ti, Nrel) is an indicator function that returns 1 if term ti appears in the original labeled sequence Nrel and 0 otherwise. Compared with recall, the method focuses more on improving precision, so β in the Fβ index is set to 0.5.
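The metric formulas appear only as figures, so the following sketch assumes the standard MAP over the ranked extraction and an Fβ in which MAP stands in for precision, as the text describes, with β = 0.5.

```python
# Assumed metric definitions: standard MAP, and F-beta with MAP in place of
# precision; the exact formulas in the patent are given only as figures.

def map_score(n_ret, n_rel):
    """Mean average precision of the ranked extraction n_ret against gold n_rel."""
    hits, total = 0, 0.0
    for i, t in enumerate(n_ret, start=1):
        if t in n_rel:                 # g(t, n_rel) = 1
            hits += 1
            total += hits / i          # precision at this hit's rank
    return total / len(n_rel) if n_rel else 0.0

def f_beta(n_ret, n_rel, beta=0.5):
    """F-beta with MAP standing in for precision, as the text describes."""
    p = map_score(n_ret, n_rel)
    r = len(set(n_ret) & set(n_rel)) / len(n_rel)   # recall
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```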
Design of experiments
In order to verify the feasibility of the algorithm, the experimental design comprises four parts: first, selecting experimental data and suitable evaluation indexes; second, searching for the input parameter settings corresponding to the optimal association graph model; then discussing the number of keywords to extract under the optimal association graph; and finally selecting four keyword extraction algorithms to compare with the algorithm of the invention, further demonstrating its superiority.
1. Influence of input parameters
Because the model has many input parameters, different parameter settings affect the algorithm results to different degrees. For example, if minsup in the CLOSE algorithm is set too high, a large number of meaningful words is discarded; if it is set too low, a large number of redundant words is included. Changes in min⁺lift and max⁻lift directly influence the structural form of the association graph, and adjusting the damping coefficient d of the GSNA algorithm during the random walk also influences the keyword extraction effect. The invention sets the variation range of minsup to [0.5%, 10%]; min⁺lift increases in steps of 0.1 from a base of 0.1; max⁻lift starts from an initial value of 5 and increases in steps of 5; d is taken according to experience, with a variation range of [0.15, 0.85]. Suppose the sorted keyword sequence is Nv; to avoid the complexity of analyzing the extraction effect under too many parameter variations, the number of extracted keywords is temporarily not considered: when tuning the model parameters, the top |Nrel| terms of the keyword sequence Nv obtained each time are taken as the extracted keyword sequence Nret and used to calculate the MAP.
2. Selection of the number of extracted words
Keyword extraction focuses on selecting the words most relevant to the text, so the number and scale of the selected keywords matter. The invention introduces an extraction ratio r = |Nret| / |Nv|; given the parameters found in 1, |Nret| = r × |Nv| represents the number of keywords extracted from the ranked sequence Nv. Based on the parameters obtained in 1, the optimal r is calculated.
3. Algorithm comparison and analysis
In order to verify the superiority and comparability of the algorithm, TextRank, GraphSum, a keyword extraction method considering node association but not graph structure (Key Words Extraction Method by Node Association, NA), and a keyword extraction method considering graph structure but not node association (Key Words Extraction Method by Graph Structure, GS) are compared with the keyword results obtained by the algorithm of the invention.
The four comparison methods are selected based on the following considerations: (1) TextRank is a traditional algorithm for automatically generating keywords and considers neither graph structure nor node association information; (2) comparing the NA algorithm, which only considers node association information, and the GS algorithm, which only considers graph structure information, with the GSNA algorithm, which considers both, shows the importance of graph structure and node association to keyword extraction; (3) GSNA improves on the GraphSum method, which is the most similar to the method of the invention; since GraphSum is designed to generate text summaries, the sentence-covering step of the original GraphSum algorithm is removed in the experiment so as to obtain text keywords.