CN108132927B - Keyword extraction method for combining graph structure and node association - Google Patents

Keyword extraction method for combining graph structure and node association

Info

Publication number: CN108132927B
Authority: CN (China)
Prior art keywords: nodes, association, node, graph, text
Legal status: Expired - Fee Related (the status is an assumption, not a legal conclusion)
Application number: CN201711285868.0A
Other languages: Chinese (zh)
Other versions: CN108132927A
Inventors: 马慧芳, 王双, 李苗, 赵卫中, 李宁
Current Assignee: Northwest Normal University
Original Assignee: Northwest Normal University
Priority date: 2017-12-07
Filing date: 2017-12-07

Legal events:
Application filed by Northwest Normal University
Priority to CN201711285868.0A
Publication of CN108132927A
Application granted
Publication of CN108132927B
Status: Expired - Fee Related
Anticipated expiration


Abstract

Translated from Chinese


The invention discloses a keyword extraction method that combines graph structure with node association. The method proceeds as follows: 1) mine the frequent closed itemsets of the document and generate a set of strong association rules; 2) take the non-repeating rule heads and rule bodies in the strong association rule set as nodes, with an edge between two nodes if and only if a strong association rule holds between them, and use the association strength of the rule as the edge weight to construct the document's association graph; 3) use the GSNA algorithm to walk randomly on the association graph, iteratively compute the importance score of each node, and sort the results in descending order; 4) cluster the top-ranked nodes and take the center of each cluster as the keyword extraction result of the document. The method can find keywords of long texts without an external corpus, avoids the influence of text length within a domain corpus, tests the importance of a word over a wider range, improves the extraction precision of domain keyword sets, and is applicable to extracting domain keyword sets from large-scale domain corpora.


Description

Keyword extraction method for combining graph structure and node association
Technical Field
The invention relates to a novel keyword extraction method, belongs to the field of text information processing, and particularly relates to a keyword extraction method that fuses graph structure and node association.
Background
With the spread of network technology, web news and electronic documents of all kinds have quickly permeated people's lives, and text keyword extraction has become essential for users to obtain valuable information from massive document collections. Keyword extraction is widely applied in web retrieval, text classification, topic lexicon construction, and similar tasks, and it helps users quickly grasp the core content of a document. In most text-mining tasks, keyword extraction ranks terms by their degree of relevance to the text content, and many single-document keyword extraction algorithms have arisen accordingly.
Long-text keyword extraction depends on the context in which terms occur, and keyword selection can be performed with supervised or unsupervised learning. Supervised approaches extract keywords in two ways: binary classification and feature selection. Binary classification decides whether a candidate word is a keyword, usually with a classical classifier. Feature selection focuses on the internal statistical features of terms within the text and their external statistical features in a corpus. Unsupervised keyword extraction has three directions: methods based on latent semantic analysis, graph-based keyword ranking, and topic-based keyword clustering. Latent semantic analysis relies heavily on a corpus for training, and in practice a corpus often cannot be obtained in time, which created an opening for graph-based keyword extraction techniques that can operate without a corpus. The basic source of graph-based keyword ranking is the PageRank algorithm. GraphSum observes that relevance relations between nodes vary in strength, and lowers the corresponding PageRank score when nodes with a negative relevance relation vote for each other, making node scores more reasonable. Attribute-based work points out that computing node importance should consider not only the graph structure but also external information such as node attributes, and systematically proposes 13 node attributes for measuring node similarity, giving the node ranking higher reliability.
Disclosure of Invention
The invention provides a keyword extraction method fusing Graph Structure and Node Association (GSNA), which jointly considers the graph-structure attributes of nodes, their semantic information, and the association characteristics between nodes, and can find keywords of long texts without relying on an external corpus.
In order to achieve the above object, the method for extracting keywords associated with a fusion graph structure and nodes according to the present invention comprises the following steps:
1) acquiring a text and performing text preprocessing;
2) mining a frequent closed item set of a document and generating a strong association rule set;
3) taking the non-repeating rule heads and rule bodies in the strong association rule set as nodes, with an edge between two nodes if and only if a strong association rule exists between them, and using the association strength of the rule as the edge weight to construct the association graph of the document;
4) walking randomly on the association graph using the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
5) clustering the top-ranked nodes, and taking the class center of each class as the keyword extraction result of the document.
The method for preprocessing the text in step 1) comprises the following steps:
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all tables of contents, headings, figures, and tables from the text, leaving only body text;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. duplicate words in each sentence are removed.
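Purely as an illustration of steps a through e, the following sketch assembles the preprocessing pipeline in Python; the use of the jieba package, the sentence-splitting regex, and the function name are assumptions for the sketch, not part of the invention:

```python
# Minimal sketch of step 1): split into sentences, segment words,
# filter stop words, and deduplicate words within each sentence.
import re
import jieba  # assumed Chinese word segmenter

def preprocess(text, stopwords):
    """Return one deduplicated word list (a 'transaction') per sentence."""
    transactions = []
    for sent in re.split(r"[。！？!?；;]", text):      # split on stop marks
        words = [w.strip() for w in jieba.cut(sent)]  # segment the sentence
        words = [w for w in words if w and w not in stopwords]
        seen, dedup = set(), []
        for w in words:                               # drop in-sentence repeats
            if w not in seen:
                seen.add(w)
                dedup.append(w)
        if dedup:
            transactions.append(dedup)
    return transactions
```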
The method for generating the strong association rule set in 2) comprises the following steps:
a. establishing the transaction set and itemsets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an itemset is any subset of the set of all words contained in T, denoted $I = \{w_1, w_2, \ldots, w_m\}$, where $w_i$ is the i-th word of the itemset;
b. finding closed itemsets; an itemset X is closed if and only if $S(X) = f(g(X)) = X$, where $S = f \circ g$ is the closure operator;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. computing itemset support; for an association rule $A \to B$ (with $A, B \subset I$ and $A \cap B = \emptyset$), the support $\sup(A \to B)$ is the ratio of the number of transactions in T containing $A \cup B$ to the total number of transactions in T;
e. calculating the association strength; for an association rule $A \to B$, the association strength of $A \to B$ is:
$$\mathrm{lift}(A \to B) = \frac{\sup(A \cup B)}{\sup(A) \cdot \sup(B)}$$
Since $\mathrm{lift}(A \to B) = \mathrm{lift}(B \to A)$, the lift measure is symmetric. Analyzed in terms of probability distributions: when $\mathrm{lift}(A \to B) = 1$, A and B follow independent distributions, indicating no association between them; when $\mathrm{lift}(A \to B) > 1$, the posterior probability of B given A is higher than the prior probability of B, indicating a positive association between A and B; when $\mathrm{lift}(A \to B) < 1$, the posterior probability of B given A is lower than the prior probability of B, indicating a negative association between A and B;
f. finding strong association rules; an association rule $A \to B$ is a strong association rule if $\sup(A \to B) \ge minsup$ and either $\mathrm{lift}(A \to B) \le max^{-}_{lift}$ or $\mathrm{lift}(A \to B) \ge min^{+}_{lift}$, where $minsup$ is the minimum support threshold of association rules, $max^{-}_{lift}$ is the maximum negative-association threshold, and $min^{+}_{lift}$ is the minimum positive-association threshold;
g. finding frequent closed itemsets; an itemset X is a frequent closed itemset if and only if $\sup(X) \ge minsup$ and $S(X) = X$; the invention adopts the CLOSE algorithm to mine frequent closed itemsets and thereby generate the strong association rule set;
the CLOSE algorithm is realized by the following steps:
step (1) preprocessing a text, then, changing the text into a 1-item set after the text is deduplicated, scanning a transaction set to calculate closure and support degree of the 1-item set, recording a generated closure and support degree set as FCC1, and then, removing the 1-item set with the support degree smaller than min and the same closure from FCC1 to obtain FC1(FC1 is a closure and support degree set obtained by a 1-item set through a CLOSE algorithm);
step (2) in the process of generating an i-item set (the initial value of i is 1), combining two different item sets in the generators of FC (i-1) in pairs as generators of FCCi, calculating the closure and the support degree of the i-item set by scanning a transaction set to obtain FCCi, and removing the item set with the support degree smaller than min and the item set with the repeated closure to obtain FCi;
step (3) adding i, repeating step (2) until the generator of FCCi is empty, and ending the CLOSE algorithm;
step (4) finally, all the closures of the FCs 1 to FCi are taken out to obtain all closed frequent item sets FC of the transaction set;
h. generating strong association rules; first, the set of frequent closures of transaction set T is $L = \{L_1, L_2, \ldots, L_k\}$, where $L_i$ is the set of frequent closures of the i-itemsets; second, the closures of different lengths in L, together with their supports, are extracted to form the frequent closed itemset collection FC; then FC yields the frequent itemsets of different lengths and their supports, $Fre = \{f_1, f_2, \ldots, f_k\}$, where $f_i$ is the set of frequent i-itemsets with their supports; finally, for each $f_i$ with $2 \le i \le k$, association rule sets AR of different forms are generated, each rule consisting of a rule head that is an m-itemset ($1 \le m \le i-1$) and a rule body that is the complementary (i-m)-itemset, and $\mathrm{lift}(m\text{-itemset} \to (i-m)\text{-itemset})$ is computed from Fre. The rules satisfying $\mathrm{lift} \le max^{-}_{lift}$ or $\mathrm{lift} \ge min^{+}_{lift}$ are extracted as strong association rules.
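As a minimal sketch of steps d through f, support, lift, and the strong-rule test can be written as follows; the function names and the threshold defaults (taken from the worked example later in the description) are illustrative assumptions:

```python
# Sketch of support, lift, and the strong-association-rule test.
# `transactions` is a list of word lists produced by preprocessing.
def support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def lift(a, b, transactions):
    # lift(A -> B) = sup(A u B) / (sup(A) * sup(B)); symmetric in A and B
    return support(a | b, transactions) / (
        support(a, transactions) * support(b, transactions))

def is_strong(a, b, transactions, minsup=0.4, min_pos=1.2, max_neg=0.95):
    # strong iff support clears minsup and lift is clearly positive or negative
    l = lift(a, b, transactions)
    return support(a | b, transactions) >= minsup and (l >= min_pos or l <= max_neg)
```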
The method for generating the association graph in 3) is as follows: take all rule heads and rule bodies from the strong association rule set and collect them into a set; use the non-repeating itemsets in that set as the nodes of the association graph, draw an edge between two nodes whenever a strong association exists between them, and use the lift value as the weight of the edge.
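A minimal sketch of this construction, assuming the networkx library and rules supplied as (head, body, lift) triples:

```python
# Build the undirected, lift-weighted association graph from strong rules.
import networkx as nx

def build_association_graph(strong_rules):
    g = nx.Graph()                            # the association graph is undirected
    for head, body, lift_val in strong_rules:
        u, v = frozenset(head), frozenset(body)
        g.add_edge(u, v, weight=lift_val)     # lift value serves as the edge weight
    return g
```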
The GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph-structure attributes of a node; six indicators are selected: atr1, atr2, and atr3 denote the numbers of nodes at distance 1, 2, and 3 from the node respectively, atr4 is the ratio of atr1 to the average degree of the node's neighbors, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; a node $N_i$ in the association graph is then formalized as $N_i = (atr1, atr2, atr3, atr4, atr5, atr6)$;
b. computing the semantic distance between two nodes; suppose the word set of the preprocessed document is $I = \{w_1, w_2, \ldots, w_n\}$, the co-occurrence probability distribution of node $N_i$ with I is $P = \{p_1, p_2, \ldots, p_n\}$, and that of $N_j$ with I is $Q = \{q_1, q_2, \ldots, q_n\}$; the semantic distance of $N_i$ and $N_j$ is then the Jensen-Shannon divergence
$$JSD(N_i, N_j) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, \frac{P+Q}{2}\right)$$
c. calculating inter-node similarity combining graph-structure attributes and semantic information; integrating the graph-structure attributes and semantic information of the nodes, the similarity of $N_i$ and $N_j$ is measured by formula (3) (rendered as an image in the original; it combines the distance between the attribute vectors with the semantic distance $JSD(N_i, N_j)$): the greater the difference in graph-structure attributes or semantic distribution between $N_i$ and $N_j$, the lower the similarity $s_{ij}$; conversely, the closer their structural attributes or semantic distributions, the higher $s_{ij}$;
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes connected by a strong association have a similarity value while nodes without a strong association have similarity 0, forming the similarity matrix $S_{N \times N}$ of the association graph. To measure more formally how similar one node is to all nodes, the similarity of each node in the association graph is normalized as in formula (4) (rendered as an image in the original; it normalizes each node's summed similarity so that the values form a probability distribution over the N nodes);
f. modifying the PageRank formula;
$r_i$ denotes the average degree of similarity between node $N_i$ and all nodes in the graph, and $r = (r_1, r_2, \ldots, r_N)$ is the fixed probability distribution over all nodes of the association graph determined by graph structure and semantic attributes. Using $r_i$, the PageRank formula is modified so that jumps no longer occur with equal probability; instead, nodes more similar to the association graph as a whole are selected with higher probability:
$$PR(N_i) = (1 - d)\, r_i + d \sum_{N_k \in e} \frac{PR(N_k)}{C(N_k)} \qquad (5)$$
Since the association graph is undirected, $C(N_k)$ in formula (5) denotes the degree of node $N_k$, and e denotes the neighbors of $N_k$;
g. strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph there is an edge between two nodes if and only if a strong association rule holds between them. To strengthen positive-association votes and weaken negative-association votes, formula (5) is improved using the lift values into formula (6) (rendered as an image in the original: each PR vote in formula (5) is scaled by the lift value of the corresponding edge);
h. walking randomly on the association graph according to formula (6), iteratively computing the PR value of each node until formula (7) is satisfied, so that the node scores converge; formula (7), rendered as an image in the original, requires successive PR scores to differ by less than the random-walk termination threshold $\beta$.
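A sketch of steps a, c, and e follows. Since formula (3) is only available as an image in the original, the exponential combination of attribute distance and JSD in `similarity` is a stand-in, not the patented formula; `normalized_r` follows the description of formula (4):

```python
# Node attribute vector (atr1-atr6), a stand-in similarity, and the
# normalized distribution r used to bias the random walk.
import math
import networkx as nx

def structure_vector(g, n):
    dist = nx.single_source_shortest_path_length(g, n, cutoff=3)
    atr = [sum(1 for d in dist.values() if d == k) for k in (1, 2, 3)]
    nb_deg = [g.degree(m) for m in g.neighbors(n)]
    atr4 = atr[0] / (sum(nb_deg) / len(nb_deg)) if nb_deg else 0.0
    atr5 = atr[1] / atr[0] if atr[0] else 0.0
    atr6 = atr[2] / atr[1] if atr[1] else 0.0
    return atr + [atr4, atr5, atr6]

def similarity(vi, vj, jsd):
    # stand-in for formula (3): closer attributes and smaller JSD -> higher s_ij
    return math.exp(-(math.dist(vi, vj) + jsd))

def normalized_r(sim_matrix):
    # formula (4): each node's summed similarity, normalized to a distribution
    row_sums = [sum(row) for row in sim_matrix]
    total = sum(row_sums)
    return [s / total for s in row_sums]
```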
The node clustering in step 5) proceeds as follows: perform K-Medoids clustering on the sorted nodes. First, select the top P nodes after descending sorting; choose K of the P nodes, with K ≤ P, as the K class centers; then compute the semantic distance between each remaining node and each class center and assign the node to the nearest class; finally, once all nodes are assigned, find the class center of each class anew and reassign the non-center nodes. The processes of center adjustment and node assignment are repeated until the difference between two successive node assignments reaches the clustering termination threshold, and the class center of each cluster is extracted as a final keyword of the text. The basic technical flow is shown in FIG. 1.
The keyword extraction method combining graph structure and node association disclosed by the invention has the following beneficial effects. The invention uses the CLOSE algorithm to mine the strong association rules used to construct the association graph; strong rules derived from frequent closed itemsets are small in number, fast to compute, and lossless, which avoids the overly complex graph structure caused by extracting redundant frequent itemsets and improves the overall efficiency of the algorithm. Node similarity is computed by jointly considering the graph-structure attributes of nodes and their semantic information in the association graph, so node jumps no longer follow an equal-probability distribution and jumps between similar nodes become more likely. The invention proposes that, during the random walk on the association graph, votes between nodes carry positive or negative association scores; positive-association votes are strengthened and negative-association votes are weakened, making the node scores more reasonable. The method can find the keywords of long texts without an external corpus. It jointly considers the graph-structure attributes of nodes, their semantic information, and the association characteristics between nodes, while avoiding the low execution efficiency that constructing a complex association graph would cause GSNA. First, the frequent closed itemsets of the document are mined and a strong association rule set is generated; second, the non-repeating rule heads and rule bodies of the set serve as nodes, with an edge between nodes if and only if a strong association rule exists, and the association strength serves as the edge weight of the document's association graph; then the GSNA algorithm walks randomly on the association graph, iteratively computes each node's importance score, and sorts the results in descending order; finally, to avoid semantic inclusion relations among the extracted keywords, the top-ranked nodes are clustered and the class center of each class is taken as the keyword extraction result of the document.
Drawings
FIG. 1 is a basic technical flow diagram of the present invention;
FIG. 2 is a diagram illustrating the steps of generating a closed frequent itemset according to the present invention;
FIG. 3 is a diagram of strong association rule generation of the present invention;
FIG. 4 is a correlation diagram of a document according to the present invention.
Detailed Description
Example 1
As shown in fig. 1-4, the invention provides a keyword extraction method for combining graph structures and node associations, which comprises the following steps:
1) acquiring a text and performing text preprocessing;
a. acquiring the text, selecting Darwin's On the Origin of Species as the text; the text consists of a number of sentences;
b. removing all tables of contents, headings, figures, and tables from the text, leaving only body text;
c. segmenting the text into words; English text is segmented on simple spaces, while Chinese text is segmented with a common word-segmentation algorithm. English uses the space as a natural word delimiter; in Chinese, characters, sentences, and paragraphs can be delimited by obvious marks, but words have no formal delimiter, so suitable segmentation software must be chosen to segment Chinese. Since the selected text is Chinese, it is segmented with the jieba segmenter. For example, a passage of On the Origin of Species:
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
After word segmentation it becomes (the word boundaries inserted by the segmenter are lost in this English rendering):
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
d. filtering stop words from the text; stop words carry no real meaning: for English text, words such as 'the', 'is', 'at', 'which', and 'on' can serve as stop words, and for Chinese text words meaning 'a', 'here', 'where', 'I', 'though', and the like. After stop-word filtering, the text is a set of sentences separated by stop marks, $St = \{st_1, st_2, \ldots, st_n\}$. The stop-word list can be user-defined, which the invention does not limit; the invention uses the Baidu stop-word list plus some stop words defined for the text. Filtering the passage given in step c for stop words yields:
"natural selection action is fully preserved and accumulated organic and inorganic conditions of various variant organisms in life period are favorable
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
e. removing duplicate words within each sentence; after the stop-word filtering step, deduplication gives:
"natural selection action is fully preserved and accumulated in the life period of various variant organisms under the organic and inorganic conditions
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
2) mining a frequent closed item set of a document and generating a strong association rule set;
a. establishing the transaction set and itemsets; a transaction is a single sentence of the text T, and the transaction set is the whole preprocessed text T; an itemset is any subset of the words of T, denoted $I = \{w_1, w_2, \ldots, w_m\}$, where $w_i$ is the i-th word of the itemset. For On the Origin of Species, the transaction set is all the preprocessed text of the book, a transaction is one sentence of the preprocessed text, and an itemset is any set of words of the preprocessed text;
b. finding closed itemsets; an itemset X is closed if and only if $S(X) = f(g(X)) = X$, where $S = f \circ g$ is the closure operator. Suppose the transaction set contains five transactions with IDs tr1, tr2, tr3, tr4, tr5 and itemsets {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e}, and {a, b, c, e} respectively. One closed itemset is $S(\{b, e\}) = f(g(\{b, e\})) = f(\{tr2, tr3, tr4, tr5\}) = \{b, e\}$.
c. computing itemset lengths; the itemset {a, c, d} is a 3-itemset and the itemset {a, b, c, e} is a 4-itemset;
d. computing support; for an itemset X, the support $\sup(X)$ is the ratio of the number of transactions g(X) containing X to the total number of transactions, i.e. $|g(X)| / |T|$. For an association rule $A \to B$, the support $\sup(A \to B)$ is the ratio of the number of transactions in T containing $A \cup B$ to the total number of transactions in T. In the example of step b, {b, e} appears in tr2, tr3, tr4, and tr5, so $\sup(\{b, e\}) = 4/5$;
e. calculating the association strength; for an association rule $A \to B$, the association strength of $A \to B$ is:
$$\mathrm{lift}(A \to B) = \frac{\sup(A \cup B)}{\sup(A) \cdot \sup(B)}$$
Since $\mathrm{lift}(A \to B) = \mathrm{lift}(B \to A)$, the lift measure is symmetric. Analyzed in terms of probability distributions: when $\mathrm{lift}(A \to B) = 1$, A and B follow independent distributions, indicating no association between them; when $\mathrm{lift}(A \to B) > 1$, the posterior probability of B given A is higher than the prior probability of B, indicating a positive association between A and B; when $\mathrm{lift}(A \to B) < 1$, the posterior probability of B given A is lower than the prior probability of B, indicating a negative association between A and B;
f. finding strong association rules; an association rule $A \to B$ is a strong association rule if $\sup(A \to B) \ge minsup$ and either $\mathrm{lift}(A \to B) \le max^{-}_{lift}$ or $\mathrm{lift}(A \to B) \ge min^{+}_{lift}$, where $minsup$ is the minimum support threshold of association rules, $max^{-}_{lift}$ is the maximum negative-association threshold, and $min^{+}_{lift}$ is the minimum positive-association threshold;
g. finding frequent closed itemsets; an itemset X is a frequent closed itemset if and only if $\sup(X) \ge minsup$ and $S(X) = X$, where minsup is the minimum support threshold over the transaction set.
The conventional association-rule mining algorithm Apriori builds association rules by mining frequent itemsets of different lengths; when the transaction set is large, the repeated scans of the transaction set during rule generation easily produce a large-scale redundant frequent itemset, lowering efficiency and complicating analysis. To extract effective strong association rules from the document transaction set quickly, the invention applies the CLOSE algorithm to mine frequent closed itemsets and generate the strong association rule set; the algorithm uses the closure operation combined with a breadth-first strategy to search the itemset space, quickly pruning non-frequent and non-closed itemsets from the search space.
Assuming minsup = 2/5, for the example in step b the CLOSE algorithm first takes the 1-itemsets as the generators of FCC1 and scans the transaction set to compute their closures and supports, obtaining FCC1; the itemsets with support below 2/5 and the 1-itemsets with identical closures are then removed to obtain FC1. When generating the 2-itemsets, only distinct itemsets in FC1 are combined pairwise, yielding {a, b}, {a, c}, and {b, c}; since $\{a, c\} \subseteq S(\{a\})$, {a, c} is pruned and {a, b} and {b, c} are retained as the generators of FCC2. Scanning the transaction set gives the closures and supports of the 2-itemsets, i.e. FCC2, and removing the itemsets with support below 2/5 and those with duplicate closures gives FC2. When generating FCC3 from FC2, the only 3-itemset produced by the generators of FC2 is {a, b, c}, but it is contained in the closure of one of its 2-subsets, so the pruned 3-itemset generator set is empty and the CLOSE algorithm ends. Finally, all closures of FC1 and FC2 are taken out, giving the complete set FC of closed frequent itemsets of the transaction set. FIG. 2 illustrates the closed-frequent-itemset generation process of this transaction set in detail;
h. generating strong association rules; first, the set of frequent closures of transaction set T is $L = \{L_1, L_2, \ldots, L_k\}$, where $L_i$ is the set of frequent closures of the i-itemsets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed itemset collection FC; then FC yields the frequent itemsets of different lengths and their supports, $Fre = \{f_1, f_2, \ldots, f_k\}$, where $f_i$ is the set of frequent i-itemsets with their supports; finally, for each $f_i$ with $2 \le i \le k$, association rule sets AR of different forms are generated, each consisting of a rule head that is an m-itemset ($1 \le m \le i-1$) and a rule body that is the complementary (i-m)-itemset, and $\mathrm{lift}(m\text{-itemset} \to (i-m)\text{-itemset})$ is computed from Fre; the rules satisfying $\mathrm{lift} \le max^{-}_{lift}$ or $\mathrm{lift} \ge min^{+}_{lift}$ are extracted as strong association rules. For the example in step b with minsup = 2/5, the 3-itemset {b, c, e} yields rules whose head is a 1-itemset and whose body is a 2-itemset, as well as rules whose head is a 2-itemset and whose body is a 1-itemset; the lift measure is symmetric, which the figure also shows. The 2-itemset {a, c} yields only rules whose head and body are 1-itemsets, with $\mathrm{lift}(a \to c) = \mathrm{lift}(c \to a)$. The strong-association-rule generation process is shown in FIG. 3;
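The closure operator on the tr1 through tr5 example above can be checked with a short script; `g` maps an itemset to the IDs of the transactions containing it and `f` maps transaction IDs back to their common items:

```python
# Worked example of the closure operator S = f(g(.)) on transactions tr1-tr5.
transactions = {
    "tr1": {"a", "c", "d"},
    "tr2": {"b", "c", "e"},
    "tr3": {"a", "b", "c", "e"},
    "tr4": {"b", "e"},
    "tr5": {"a", "b", "c", "e"},
}

def g(itemset):
    return {tid for tid, items in transactions.items() if itemset <= items}

def f(tids):
    return set.intersection(*(transactions[t] for t in tids)) if tids else set()

def closure(itemset):
    return f(g(itemset))

print(closure({"b", "e"}))  # {'b', 'e'}: equal to its closure, so {b, e} is closed
print(closure({"a"}))       # {'a', 'c'}: S({a}) = {a, c}, which is why {a, c} is pruned
```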
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, with an edge between two nodes if and only if a strong association rule exists between them, and constructing the association graph of the document with the association strength of the rules as edge weights. All rule heads and rule bodies of the strong association rule set are taken out and collected into a set; the non-repeating itemsets of that set serve as the nodes of the association graph, the strong associations between them as its edges, and the lift values as the edge weights. For the example in step b, assume minsup = 0.6, $min^{+}_{lift} = 1.2$, and $max^{-}_{lift} = 0.95$; the CLOSE algorithm gives the frequent closed itemsets of the transaction set as FC = {{a, c}, {b, e}, {c}, {b, c, e}}, from which the strong association rule set AR is derived (shown as an image in the original). Extracting the non-repeating rule heads and rule bodies of AR as association-graph nodes gives V = {{b, c}, {b, e}, {c, e}, {a}, {b}, {c}, {e}}, and using the association strengths of the strong rules in AR as edge weights yields the undirected weighted association graph shown in FIG. 4;
4) walking randomly on the association graph using the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
Different nodes in the association graph differ in their importance to the text. Building on existing methods for measuring node importance in graphs, the invention proposes the GSNA algorithm, which jointly considers node similarity and the association strength between nodes; it is essentially an improvement of the PageRank algorithm. PageRank treats an edge between nodes A and B as node A voting for node B: the higher the sum of a node's edge weights, the more important the node. PageRank simulates a web surfer who walks randomly between the points of the graph and may jump. After a finite number of walks, each point reaches a steady state and its expected value tends to a constant. Each walk iteration produces a node ranking whose value is the PageRank score. Suppose the graph contains N nodes in total, $N_i$ is the i-th node of the association graph, e denotes the set of nodes with edges into $N_i$, $PR(N_k)$ is the PageRank score of $N_k$, $C(N_k)$ is the degree of $N_k$, and d is the damping coefficient measuring the tendency of a node to walk to another node; the PageRank score of $N_i$ is defined as:
$$PR(N_i) = \frac{1 - d}{N} + d \sum_{N_k \in e} \frac{PR(N_k)}{C(N_k)} \qquad (2)$$
Here jumps between nodes are equal-probability events, i.e. probability 1/N, whereas in theory jumps between similar nodes should be more likely. Moreover, nodes are not only positively but also negatively associated with one another: a positive association means one node tends to appear together with another, while a negative association means that once one node appears, the probability of the other appearing drops. For negative associations between nodes the PR voting score should be decreased, while for positive associations it should be increased.
The GSNA algorithm improves PageRank in two respects: the graph-structure similarity between nodes and the association between nodes. When measuring node similarity, GSNA considers both the structural attributes and the latent semantic information of the nodes in the association graph; when measuring node association, positive-association propagation must be strengthened and negative-association propagation weakened, and GSNA scales the PR scores accordingly so that the node ranking is more accurate.
For the GSNA algorithm, the steps are as follows:
Step (1) quantifying the graph-structure attributes of a node. Six indicators are selected to quantify a node's graph-structure attributes: atr1, atr2, and atr3 denote the numbers of nodes at distance 1, 2, and 3 from the node, intuitively expressing its ability to associate with other nodes; atr4 is the ratio of atr1 to the average degree of the node's neighbors, expressing how closely the node matches its surrounding neighbors; atr5 is the ratio of atr2 to atr1 and atr6 is the ratio of atr3 to atr2, quantifying the node's propagation ability in the association graph. A node $N_i$ in the association graph is formalized as $N_i = (atr1, atr2, atr3, atr4, atr5, atr6)$.
Step (2) computing the semantic distance between two nodes.
Suppose the word set of the preprocessed document is $I = \{w_1, w_2, \ldots, w_n\}$ and the co-occurrence probability distribution of node $N_i$ with I is $P = \{p_1, p_2, \ldots, p_n\}$, where each entry is the number of sentences in which $N_i$ co-occurs with the corresponding word of I divided by the total number of sentences of the text; traversing I yields P. Likewise the co-occurrence distribution of $N_j$ with I is $Q = \{q_1, q_2, \ldots, q_n\}$. The semantic distance of $N_i$ and $N_j$ is then given by equation (3), the Jensen-Shannon divergence; the larger the semantic distance between two nodes, the larger $JSD(N_i, N_j)$:
$$JSD(N_i, N_j) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, \frac{P+Q}{2}\right) \qquad (3)$$
For the association graph of the document produced in FIG. 4, the word set is I = {a, b, c, d, e}, the co-occurrence probability distribution of node {b, e} with I is P = {2/5, 4/5, 3/5, 0, 4/5}, and that of node {a} with I is Q = {3/5, 2/5, 3/5, 1/5, 2/5}.
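A sketch of the Jensen-Shannon computation on this example; the base-2 logarithm is an assumption (the exact rendering of formula (3) is an image in the original), and the P and Q values are the co-occurrence figures quoted above:

```python
# Jensen-Shannon semantic distance between two co-occurrence distributions.
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [2/5, 4/5, 3/5, 0, 4/5]    # co-occurrence of node {b, e} with I = {a, b, c, d, e}
Q = [3/5, 2/5, 3/5, 1/5, 2/5]  # co-occurrence of node {a} with I
print(jsd(P, Q))
```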
Step (3) computing inter-node similarity combining graph-structure attributes and semantic information. Integrating the graph-structure attributes and semantic information of the nodes, the similarity of $N_i$ and $N_j$ is measured by formula (4) (rendered as an image in the original; it combines the distance between the attribute vectors $N_i = (atr1, atr2, atr3, atr4, atr5, atr6)$ and $N_j$ with the semantic distance $JSD(N_i, N_j)$). The greater the difference in graph-structure attributes or the larger the semantic distance between the two nodes, the lower the similarity $s_{ij}$; conversely, the closer the structural attributes or semantic distributions of $N_i$ and $N_j$, the higher $s_{ij}$.
Step (4) normalizing each node's similarity in the association graph. Assuming the association graph contains N nodes in total, nodes connected by a strong association have a similarity value while nodes without a strong association have similarity 0, forming the similarity matrix $S_{N \times N}$ of the association graph. To measure more formally how similar one node is to all nodes, the similarity of each node in the association graph is normalized as in formula (5) (rendered as an image in the original; it normalizes each node's summed similarity so that the values form a probability distribution over the N nodes).
and (5) correcting the PageRank formula.
riRepresenting a node NiDegree of similarity to the average of all nodes in the graph, r ═ r1,r2,...,rN) I.e. a fixed probability distribution that all nodes in the dependency graph are related to graph structure and semantic attributes. Using riModifying the PageRank formula to ensure that the nodes do not obey equal probability jump any more, but the nodes with higher similarity degree with the association graph are selected with higher probability, wherein the formula is as follows:
Figure GDA0003427662760000162
since the correlation diagram is an undirected graph, C (N) in the formula (6)k) Representing a node NkE represents NkThe number of neighbors.
Step (6) strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph there is an edge between two nodes if and only if a strong association rule holds between them. To strengthen positive-association votes and weaken negative-association votes, formula (6) is improved using the lift values into formula (7) (rendered as an image in the original: each PR vote in formula (6) is scaled by the lift value of the corresponding edge).
Step (7) walking randomly on the association graph according to formula (7), iteratively computing the PR value of each node until formula (8) is satisfied, so that the node scores converge; formula (8), rendered as an image in the original, requires successive PR scores to differ by less than the random-walk termination threshold $\beta$.
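Steps (5) through (7) can be sketched as one iteration loop. Because the lift-scaled formula (7) is only an image in the original, the plain `lift * PR / degree` weighting below is an illustrative choice, as is the L1 convergence test standing in for formula (8):

```python
# GSNA random walk: teleport to r instead of 1/N, scale votes by edge lift,
# iterate until successive PR vectors differ by less than beta.
def gsna_rank(neighbors, lift_w, r, d=0.85, beta=1e-6, max_iter=100):
    """neighbors[i]: adjacent node ids of i; lift_w[(k, i)]: lift of edge k-i."""
    pr = list(r)                                 # start from the distribution r
    for _ in range(max_iter):
        new = [(1 - d) * r[i] + d * sum(
                   lift_w[(k, i)] * pr[k] / len(neighbors[k])
                   for k in neighbors[i])        # lift-scaled neighbor votes
               for i in range(len(r))]
        if sum(abs(a - b) for a, b in zip(new, pr)) < beta:
            return new                           # converged (stands in for formula (8))
        pr = new
    return pr
```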
5) clustering the top-ranked nodes, and taking the class center of each class as the keyword extraction result of the document;
the nodes of the association graph are formed according to the frequent closed item set taking subsets with different lengths, the subsets of the k-item set contain 2k-1 non-empty subsets in total, and the 1-item set is necessarily the subset of the (k-1) -item set, and the phenomenon that the (k-1) -item set has semantic inclusion relation with the 1-item set can be considered. In order to prevent semantic inclusion relation between important nodes obtained after the nodes are sorted in a descending order, the invention carries out K-Mediod clustering on the sorted nodes.
Because non-key terms can end up clustered on their own, the top P nodes after descending sorting are selected first; then K of the P nodes, with K ≤ P, are chosen as the K class centers; next the semantic distance between each remaining node and each class center is computed and the node is assigned to the nearest class; finally, once all nodes are assigned, the class center of each class is sought anew and the non-center nodes are reassigned. The processes of center adjustment and node assignment are repeated until the difference between two successive node assignments reaches the clustering termination threshold, at which point the class center of each cluster is extracted as a final keyword of the text. The algorithm is concretely realized by the following steps:
Step (1) selecting the top P nodes after descending sorting;
Step (2) choosing K of the P nodes, with K ≤ P, as the K class centers;
Step (3) computing the semantic distance between each remaining node and each class center, and assigning the node to the nearest class;
Step (4) for the nodes of each class other than the class center, finding the point that minimizes the sum of semantic distances to all other points of the class, and taking it as the new class center;
Step (5) repeating step (4), reassigning nodes after each center adjustment, until the difference between two successive node assignments reaches the clustering termination threshold, then extracting the class center of each cluster as a final keyword of the text.
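A compact sketch of these five steps, assuming `dist` is the JSD semantic distance and using medoid stability as the stopping test in place of the assignment-difference threshold:

```python
# K-Medoids over the top-P ranked nodes (steps (1)-(5) above).
def k_medoids(nodes, dist, k, max_iter=50):
    """nodes: top-P node ids in rank order; dist(a, b): semantic distance."""
    medoids = nodes[:k]                           # seed with the k best-ranked nodes
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}      # assign each node to nearest medoid
        for n in nodes:
            if n not in medoids:
                clusters[min(medoids, key=lambda m: dist(n, m))].append(n)
        new_medoids = [min(members, key=lambda c: sum(dist(c, o) for o in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):      # stable medoids: stop
            return new_medoids                    # cluster centers = final keywords
        medoids = new_medoids
    return medoids
```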
The invention denotes the extracted keyword sequence as $N_{ret}$ and the labeled ground-truth vocabulary sequence as $N_{rel}$; the evaluation indexes used are MAP (Mean Average Precision), Recall, and $F_\beta$. Since the retrieved keywords should be both accurate and ranked as early as possible, $F_\beta$ is slightly modified by introducing MAP into its calculation. The defining formulas are rendered as images in the original: MAP averages the precision obtained at the rank of each correctly retrieved keyword, Recall is the fraction of $N_{rel}$ that appears in $N_{ret}$, and $F_\beta$ combines MAP and Recall. Here the function $index(t_i, N_{rel})$ returns the index position of term $t_i$ in the sequence $N_{rel}$, and $g(t_i, N_{rel})$ is an indicator function that returns 1 if term $t_i$ appears in the original vocabulary sequence $N_{rel}$ and 0 otherwise. Compared with recall, the method focuses more on improving precision, so the value of $\beta$ in the $F_\beta$ index is 0.5.
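Under the stated definitions, the metrics can be sketched as follows; since the exact MAP variant is an image in the original, standard average precision over the ranked hits is assumed here:

```python
# Recall and an F_beta that uses MAP in place of precision (beta = 0.5).
def map_score(retrieved, relevant):
    hits, precisions = 0, []
    for rank, term in enumerate(retrieved, start=1):
        if term in relevant:
            hits += 1
            precisions.append(hits / rank)        # precision at each correct hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

def f_beta(retrieved, relevant, beta=0.5):
    m, r = map_score(retrieved, relevant), recall(retrieved, relevant)
    return (1 + beta**2) * m * r / (beta**2 * m + r) if (m or r) else 0.0
```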
Design of experiments
To verify the feasibility of the algorithm, the experimental design comprises four parts: first, selecting the experimental data and suitable evaluation indexes; second, searching for the input parameter settings corresponding to the optimal association-graph model; then discussing the number of keywords to extract under the optimal association graph; and finally comparing four keyword extraction algorithms against the algorithm of the invention to further demonstrate its superiority.
1. Influence of input parameters
Because the model has many input parameters, different settings affect the algorithm's results to different degrees. For example, if minsup in the CLOSE algorithm is set too high, many meaningful words are discarded; if it is set too low, many redundant words are included. Changes in $min^{+}_{lift}$ and $max^{-}_{lift}$ directly affect the structure of the association graph, and adjusting the damping coefficient d of the GSNA random walk also influences the keyword extraction effect. The invention lets minsup vary over [0.5%, 10%]; $min^{+}_{lift}$ increases from a base of 0.1 in steps of 0.1; $max^{-}_{lift}$ starts at an initial value of 5 and increases in steps of 5; d is chosen empirically over the range [0.15, 0.85]. Suppose the ranked keyword sequence is $N_v$; to avoid the complexity that too many varying parameters would add to the analysis of the extraction effect, the number of extracted keywords is not considered for now: each time the model parameters are adjusted, the first $|N_{rel}|$ terms of the obtained keyword sequence $N_v$ are taken as the extracted sequence $N_{ret}$ and used to compute MAP.
2. Selection of the number of extracted keywords
Keyword extraction focuses on selecting the words most relevant to the text, so the number and scale of selected keywords matter. The invention introduces an extraction ratio $r = |N_{ret}| / |N_v|$; with the parameters found in experiment 1, $|N_{ret}| = r \times |N_v|$ determines how many keywords are taken from the ranked sequence $N_v$. The optimal r is computed on the basis of the parameters obtained in experiment 1.
3. Algorithm comparison and analysis
To verify the superiority of the algorithm and ensure comparability, TextRank, GraphSum, a keyword extraction method by node association (NA) that considers node association but not graph structure, and a keyword extraction method by graph structure (GS) that considers graph structure but not node association are compared with the keyword results obtained by the algorithm of the invention.
The four comparison methods are chosen for the following reasons: (1) TextRank is the traditional algorithm for automatic keyword generation and considers neither graph structure nor node-association information; (2) comparing the NA algorithm, which considers only node-association information, and the GS algorithm, which considers only graph-structure information, against the GSNA algorithm, which considers both, shows the importance of graph structure and node association for keyword extraction; (3) GSNA improves on the GraphSum method, which is the most similar to the method of the invention; since GraphSum generates text summaries rather than keywords, its sentence-covering step is removed in the experiment so that text keywords are obtained.

Claims (3)

Translated from Chinese
1. A keyword extraction method combining graph structure and node association, characterized by:
1) acquiring a text and performing text preprocessing; the text preprocessing steps are as follows:
a. acquiring the text, which consists of a number of sentences;
b. removing all tables of contents, headings, figures, and tables from the text, leaving only body text;
c. segmenting the text into words: English text is segmented on simple spaces, and Chinese text is segmented with a common word-segmentation algorithm;
d. filtering stop words from the text;
e. removing duplicate words within each sentence;
2) mining the frequent closed itemsets of the document and generating a strong association rule set; the steps for generating the strong association rule set are as follows:
a. establishing the transaction set and itemsets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an itemset is any subset of all words of T, denoted $I = \{w_1, w_2, \ldots, w_m\}$, where $w_i$ is the i-th word of the itemset;
b. finding closed itemsets; an itemset X is closed if and only if $S(X) = f(g(X)) = X$, where $S = f \circ g$ is the closure operator;
c. finding itemset lengths; the number of items in an itemset is its length, and an itemset of length k is a k-itemset;
d. computing itemset support; for an association rule $A \to B$, the support $\sup(A \to B)$ is the ratio of the number of transactions in T containing $A \cup B$ to the total number of transactions in T;
e. computing association strength; for an association rule $A \to B$, the association strength of $A \to B$ is $\mathrm{lift}(A \to B) = \sup(A \cup B) / (\sup(A) \cdot \sup(B))$;
f. finding strong association rules; $A \to B$ is a strong association rule if $\sup(A \to B) \ge minsup$ and either $\mathrm{lift}(A \to B) \le max^{-}_{lift}$ or $\mathrm{lift}(A \to B) \ge min^{+}_{lift}$, where $minsup$ is the minimum support threshold of association rules, $max^{-}_{lift}$ the maximum negative-association threshold, and $min^{+}_{lift}$ the minimum positive-association threshold;
g. finding frequent closed itemsets; X is a frequent closed itemset if and only if $\sup(X) \ge minsup$ and $S(X) = X$;
h. generating strong association rules; first, the frequent closure set of transaction set T is $L = \{L_1, L_2, \ldots, L_k\}$, where $L_i$ is the frequent closure set of the i-itemsets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed itemset collection FC; then FC yields the frequent itemsets of different lengths and their supports $Fre = \{f_1, f_2, \ldots, f_k\}$, where $f_i$ is the set of frequent i-itemsets with their supports; finally, for each $f_i$ with $2 \le i \le k$, association rule sets AR of different forms are generated, each consisting of a rule head that is an m-itemset with $1 \le m \le i-1$ and a rule body that is the (i-m)-itemset, and $\mathrm{lift}(m\text{-itemset} \to (i-m)\text{-itemset})$ is computed from Fre; the rules satisfying $\mathrm{lift} \le max^{-}_{lift}$ or $\mathrm{lift} \ge min^{+}_{lift}$ are extracted as strong association rules;
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, with an edge between two nodes if and only if a strong association rule exists between them, and constructing the association graph of the document with the association strength of the rules as edge weights;
4) using the GSNA algorithm to walk randomly on the association graph, iteratively computing each node's importance score, and sorting the results in descending order;
5) clustering the top-ranked nodes and taking the class center of each class as the keyword extraction result of the document.
2. The keyword extraction method combining graph structure and node association of claim 1, characterized in that the GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph-structure attributes of nodes; six indicators are selected: atr1, atr2, and atr3 denote the numbers of nodes at distance 1, 2, and 3 from a node respectively, atr4 is the ratio of atr1 to the average degree of the neighbor nodes, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; a node $N_i$ in the association graph is formalized as $N_i = (atr1, atr2, atr3, atr4, atr5, atr6)$;
b. computing the semantic distance between two nodes; suppose the word set of the preprocessed document is $I = \{w_1, w_2, \ldots, w_n\}$, the co-occurrence probability distribution of $N_i$ with I is $P = \{p_1, p_2, \ldots, p_n\}$, and that of $N_j$ with I is $Q = \{q_1, q_2, \ldots, q_n\}$; the semantic distance of $N_i$ and $N_j$ is the Jensen-Shannon divergence $JSD(N_i, N_j)$;
c. computing inter-node similarity combining graph-structure attributes and semantic information, as in formula (3) of the description: the greater the difference in graph-structure attributes or semantic distribution between $N_i$ and $N_j$, the lower the similarity $s_{ij}$; conversely, the closer their structural attributes or semantic distributions, the higher $s_{ij}$;
e. normalizing each node's similarity in the association graph; assuming the graph contains N nodes in total, nodes with a strong association have a similarity value and nodes without one have similarity 0, forming the similarity matrix $S_{N \times N}$ of the association graph; to measure more formally how similar one node is to all nodes, the similarity of each node is normalized as in formula (4);
f. modifying the PageRank formula; $r_i$ denotes the average similarity of node $N_i$ to all nodes in the graph, and $r = (r_1, r_2, \ldots, r_N)$ is the fixed probability distribution of all nodes of the association graph determined by graph structure and semantic attributes; $r_i$ modifies the PageRank formula (5) so that nodes no longer jump with equal probability and nodes more similar to the association graph are selected with higher probability; since the association graph is undirected, $C(N_k)$ in formula (5) denotes the degree of node $N_k$ and e the neighbors of $N_k$;
g. strengthening the PR voting score of positive associations and reducing that of negative associations; in the association graph there is an edge between two nodes if and only if a strong association rule exists between them; to strengthen positive-association votes and weaken negative-association votes, formula (5) is improved with the lift values into formula (6);
h. walking randomly on the association graph according to formula (6) and iteratively computing each node's PR value until formula (7) is satisfied, so that node scores converge, where $\beta$ is the random-walk termination threshold.
3. The keyword extraction method combining graph structure and node association of claim 1, characterized in that the node clustering in 5) comprises the following steps: performing K-Medoids clustering on the sorted nodes; first selecting the top P nodes after descending sorting; next choosing K of the P nodes, with K ≤ P, as the K class centers; then computing the semantic distance of each remaining node to each class center and assigning it to the nearest class; finally, once all nodes are assigned, finding the class center of each class anew and reassigning the non-center nodes; the processes of center adjustment and node assignment are repeated until the difference between two successive node assignments reaches the clustering termination threshold, at which point the class center of each cluster is extracted as a final keyword of the text.
Application CN201711285868.0A (priority date 2017-12-07, filing date 2017-12-07) | Keyword extraction method for combining graph structure and node association | Expired - Fee Related | Granted as CN108132927B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201711285868.0A | 2017-12-07 | 2017-12-07 | Keyword extraction method for combining graph structure and node association

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201711285868.0A | 2017-12-07 | 2017-12-07 | Keyword extraction method for combining graph structure and node association

Publications (2)

Publication Number | Publication Date
CN108132927A | 2018-06-08
CN108132927B (granted) | 2022-02-11

Family

ID=62389227

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201711285868.0A (Expired - Fee Related) | Keyword extraction method for combining graph structure and node association | 2017-12-07 | 2017-12-07

Country Status (1)

Country | Link
CN | CN108132927B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN108920660B (en)* | 2018-07-04 | 2020-11-20 | Bank of China Co., Ltd. | Keyword weight obtaining method and device, electronic equipment and readable storage medium
CN109614466A (en)* | 2018-11-15 | 2019-04-12 | Alibaba Group Holding Limited | Keyword rule generation method and device based on frequent itemset mining
CN109726402B (en)* | 2019-01-11 | 2022-12-23 | The 7th Research Institute of China Electronics Technology Group Corporation | A method for automatically extracting document subject terms
CN109783628B (en)* | 2019-01-16 | 2022-06-21 | Fuzhou University | Method for searching KSAARM by combining time window and association rule mining
CN111062356B (en)* | 2019-12-26 | 2024-03-26 | Shenyang Ligong University | Method for automatically identifying abnormal human body actions from monitoring video
CN111310436B (en)* | 2020-02-11 | 2022-02-15 | Tencent Technology (Shenzhen) Co., Ltd. | Text processing method and device based on artificial intelligence and electronic equipment
CN111461109B (en)* | 2020-02-27 | 2023-09-15 | Zhejiang University of Technology | A method to identify documents based on environmental multi-category lexicons
CN111401928B (en)* | 2020-04-01 | 2022-04-12 | Alipay (Hangzhou) Information Technology Co., Ltd. | Method and device for determining semantic similarity of text based on graph data
CN112016836B (en)* | 2020-08-31 | 2023-11-03 | China UnionPay Co., Ltd. | A method and device for determining similarity between objects
CN112561081B (en)* | 2020-12-18 | 2022-05-03 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Transformation method, device, electronic device and storage medium for deep learning models
CN113032551B (en)* | 2021-05-24 | 2021-09-10 | Beijing Zeqiao Media Technology Co., Ltd. | Delivery progress calculation method and system combining big data with article titles
CN114691859A (en)* | 2022-04-14 | 2022-07-01 | Xi'an University of Posts and Telecommunications | A text summary generation method fused with a semantic graph
CN116433837B (en)* | 2023-03-31 | 2023-09-12 | Zhejiang University | Three-dimensional CAD model difference analysis method based on key point matching

Citations (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN101488861A (en)* | 2008-12-19 | 2009-07-22 | Sun Yat-sen University | Keyword extraction method for unknown network applications
US7577652B1 (en)* | 2008-08-20 | 2009-08-18 | Yahoo! Inc. | Measuring topical coherence of keyword sets
CN102043851A (en)* | 2010-12-22 | 2011-05-04 | Sichuan University | Multi-document automatic summarization method based on frequent itemsets
CN104933032A (en)* | 2015-06-29 | 2015-09-23 | University of Electronic Science and Technology of China | Method for extracting blog keywords based on complex networks
CN105975488A (en)* | 2016-04-25 | 2016-09-28 | Harbin Engineering University | Method for querying keywords based on topic cluster units in a relational database
CN106202042A (en)* | 2016-07-06 | 2016-12-07 | Minzu University of China | A graph-based keyword extraction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN102609528B (en)* | 2012-02-14 | 2014-06-18 | Yunnan University | Frequent pattern association sorting method based on probabilistic graphical models


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Title
"News keyword extraction algorithm based on PageRank"; Gu Yiran et al.; Journal of University of Electronic Science and Technology of China; 2017-09-30; Vol. 46, No. 5; pp. 778-779 *
"Chinese keyword extraction algorithm based on frequent pattern mining"; Cui Chengyu et al.; Journal of Terahertz Science and Electronic Information Technology; 2015-04-30; Vol. 13, No. 2; pp. 280-282 *

Also Published As

Publication number | Publication date
CN108132927A | 2018-06-08

Similar Documents

Publication | Title
CN108132927B (en) | Keyword extraction method for combining graph structure and node association
KR102019194B1 (en) | Core keyword extraction system and method for documents
CN112836029B (en) | A graph-based document retrieval method, system and related components
KR100756921B1 (en) | Document classification method, and computer-readable recording medium storing a program for executing the document classification method on a computer
CN111177365A (en) | An unsupervised automatic summary extraction method based on a graph model
CN114265935B (en) | A text-mining-based decision support method and system for science and technology project approval management
JP2009093649A (en) | Term recommendation for defining an ontology space
CN112989215B (en) | A knowledge-graph-enhanced recommendation system based on sparse user behavior data
CN109885675B (en) | Text subtopic discovery method based on improved LDA
CN107506472B (en) | Method for classifying web pages browsed by students
CN109597995A (en) | A document representation method based on BM25-weighted combined word vectors
CN112989802A (en) | Bullet-screen-comment keyword extraction method, device, equipment and medium
CN112948544B (en) | Book retrieval method based on deep learning and quality influence
CN113742292A (en) | Multi-threaded data retrieval and retrieved-data access method based on AI technology
Bhutada et al. | Semantic latent Dirichlet allocation for automatic topic extraction
CN117973381A (en) | Method for automatically extracting text keywords
CN117057346A (en) | Domain keyword extraction method based on weighted TextRank and K-means
CN112214511A (en) | An API recommendation method based on the WTP-WCD algorithm
CN115329085A (en) | A social robot classification method and system
CN118484506B (en) | Business identification system based on semantic analysis
CN119760057A (en) | Retrieval-augmentation method and device for response large models based on a hierarchical cluster index structure
Oo | Pattern discovery using association rule mining on clustered data
CN117349512B (en) | User tag classification method and system based on big data
CN113111288A (en) | Web service classification method fusing unstructured and structured information
CN116796739A (en) | An unsupervised keyword extraction method for documents

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2022-02-11
