CN117973381A - Method for automatically extracting text keywords - Google Patents

Method for automatically extracting text keywords

Info

Publication number: CN117973381A
Authority: CN (China)
Prior art keywords: word, node, nodes, text, graph
Legal status: Pending
Application number: CN202410034861.5A
Other languages: Chinese (zh)
Inventors: 李平周, 赵娜, 王浩, 李小鹏, 王剑
Current Assignee: Yunnan University YNU
Original Assignee: Yunnan University YNU
Priority date: 2024-01-10
Filing date: 2024-01-10
Publication date: 2024-05-03
Application filed by Yunnan University YNU
Priority to CN202410034861.5A
Legal status: Pending (current)


Abstract

The invention discloses a method for automatically extracting text keywords, relating to the technical field of natural language processing, and comprising the following steps: step one: a TextRank algorithm; step two: a TF-IDF algorithm; step three: a word vector clustering weighting algorithm; step four: a central node metric algorithm; step five: extracting keywords. In step one, the TextRank algorithm preprocesses a text, builds adjacency relations between word units in the text by constructing a word graph $G=(V,E)$, and calculates the importance of word nodes; when constructing the nodes of the word graph, the text first needs to be divided into sentence units. According to the invention, the weight information of the word nodes is calculated using the TF-IDF algorithm, the word vector clustering weighting algorithm, and the central node metric algorithm of the graph, which enriches the initial information of each node in the word co-occurrence network that the model consults during keyword extraction and significantly improves the accuracy of keyword extraction.

Description

Method for automatically extracting text keywords
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for automatically extracting text keywords.
Background
According to studies by Chen and Lin, automatic keyword extraction techniques can be roughly classified into four types: statistical methods, linguistic methods, machine learning methods, and other methods. Statistical methods extract keywords by classifying and counting text units such as words or sentences. This approach does not rely on training data and is independent of the text's language and domain. Common algorithms include n-gram statistics, word frequency computation, TF-IDF models, word co-occurrence, PAT trees, and the like. Linguistic methods use the linguistic features of words, sentences, and documents for keyword extraction; the most common features include lexical, syntactic, semantic, and phonetic analysis. Machine learning methods are mainly divided into two types: supervised and unsupervised keyword extraction. Supervised methods train a model on a dataset annotated with keywords and then apply the model to extract keywords from other text data. The following problems exist in the prior art:
in the prior art, supervised keyword extraction requires manually annotated training datasets, which is tedious work, inevitably introduces subjectivity, and can make the final experimental data inaccurate; meanwhile, existing methods do not consider the internal word frequency information, external corpus information, structural information, and the like, of the text.
Disclosure of Invention
The invention provides a method for automatically extracting text keywords, which aims to solve the problems in the background technology.
In order to solve the technical problems, the invention adopts the following technical scheme:
A method of automatically extracting text keywords, the method of automatically extracting text keywords comprising the steps of:
step one: a TextRank algorithm;
step two: a TF-IDF algorithm;
Step three: a word vector clustering weighting algorithm;
step four: a central node metric algorithm;
step five: extracting keywords;
In step one, the TextRank algorithm preprocesses a text, constructs adjacency relations between word units in the text by building a word graph $G=(V,E)$, and calculates the importance of word nodes. When constructing the nodes of the word graph, the text first needs to be divided into sentence units; each sentence is then segmented into words, stop words and repeated words are removed, and only nouns, verbs, and adjectives are retained, forming the node set $V=\{V_1, V_2, \dots, V_n\}$ of the word graph. Whether an edge exists between two nodes is determined by the word co-occurrence relation: when the words corresponding to two nodes appear in the same window, the two nodes are connected, forming the edge set $E=\{(V_i, V_j) \mid V_i \in V \wedge V_j \in V\}$ of the word graph, where the window size can be set to 2 to 10 words.
By constructing the word graph $G=(V,E)$, the TextRank value of each node can be obtained by iterating the formula
$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j).$$
The technical scheme of the invention is further improved as follows: in the formula of step one, $d \in [0,1]$ is the damping coefficient, indicating that any node jumps with probability $1-d$ to a random node in the word graph; to ensure that the TextRank iteration converges, it is generally set to 0.85. $WS(V_i)$ denotes the TextRank value of node $V_i$; $WS(V_j)$ denotes the TextRank value of node $V_j$ after the previous iteration; $In(V_i)$ denotes the set of all nodes pointing to node $V_i$; $Out(V_j)$ denotes the set of all nodes that node $V_j$ points to; $w_{ji}$ denotes the similarity (edge weight) between nodes $V_j$ and $V_i$.
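To make step one concrete, here is a minimal Python sketch of the word-graph construction and TextRank iteration, assuming the text has already been split into sentences, tokenized, stop-word-filtered, and POS-filtered (the patent does not prescribe a tokenizer); using the co-occurrence count as the edge weight $w_{ij}$ is an illustrative choice, not the patent's prescription.

```python
from collections import defaultdict

def build_word_graph(sentences, window=3):
    """Build an undirected word co-occurrence graph.

    sentences: list of token lists (stop words removed; only nouns,
    verbs, adjectives kept).  Two words are linked when they co-occur
    inside a sliding window of `window` tokens; the co-occurrence
    count serves as the edge weight w_ij (an illustrative choice).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph

def textrank(graph, d=0.85, max_iter=100, tol=1e-4):
    """Iterate WS(Vi) = (1-d) + d * sum_j [w_ji / sum_k w_jk] * WS(Vj)."""
    ws = {v: 1.0 for v in graph}                     # uniform initial scores
    out_weight = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    for _ in range(max_iter):
        new_ws = {}
        for v in graph:
            rank = sum(w_ji / out_weight[u] * ws[u]
                       for u, w_ji in graph[v].items())
            new_ws[v] = (1 - d) + d * rank
        converged = max(abs(new_ws[v] - ws[v]) for v in graph) < tol
        ws = new_ws
        if converged:
            break
    return ws
```

For example, `textrank(build_word_graph(sentences, window=3))` returns a dict of node scores; the uniform initial score of 1.0 is the plain-TextRank baseline that the later steps enrich.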
The technical scheme of the invention is further improved as follows: in step two, the word frequency (TF) and the inverse document frequency (IDF) are multiplied to obtain the weight value of each word, and the top-ranked words by weight are selected as keywords. The calculation formulas are $W_{TF\text{-}IDF}(i) = TF_i \times IDF_i$ and $IDF_i = \log(N / DF_i)$, where $TF_i$ is the number of occurrences of word $i$ in the text divided by the total number of words in the text, i.e., the frequency of word $i$ in the text; $N$ is the total number of texts in the corpus; $DF_i$ is the number of texts containing word $i$; and $IDF_i$ reflects the category-discrimination capability of word $i$.
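A short sketch of the step-two computation under the patent's formulas; note that $IDF_i = \log(N/DF_i)$ is used verbatim, without the add-one smoothing common elsewhere, so it assumes the target text itself belongs to the corpus (otherwise $DF_i$ could be zero).

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus):
    """W_TF-IDF(i) = TF_i * IDF_i with IDF_i = log(N / DF_i).

    doc_tokens: tokens of the target text; corpus: list of token
    lists, one per text, used only for document frequencies.
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))              # each text counts a word once
    return {w: (c / total) * math.log(n_docs / df[w])
            for w, c in tf.items() if df[w] > 0}
```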
The technical scheme of the invention is further improved as follows: step three examines keywords by clustering the word vectors of the text and measuring the distance of each word vector from its cluster center. The word vector clustering weighting algorithm assumes that the words of a text can be divided into groups, called clusters, by computing the vector similarity between them; the farther a word is from the centroid of its cluster, the more it differs from the centroid and from the words around the centroid, and the higher its importance as a TextRank word-graph node.
The technical scheme of the invention is further improved as follows: a given text is preprocessed to generate candidate keywords $V=\{V_1, V_2, \dots, V_n\}$, and the word vector $\vec{v}_i$ of each candidate word is obtained from a trained Word2Vec model. Let $C=\{C_1, C_2, \dots, C_k\}$ denote the result of K-means clustering over the document's word vectors. The importance of any word $i$ in its cluster $C_i$ is calculated as
$$W_{Word2Vec\text{-}Cluster}(i) = \frac{dist(\vec{v}_i, \vec{c}_i)}{\frac{1}{|C_i|}\sum_{j \in C_i} dist(\vec{v}_j, \vec{c}_i)},$$
where $\vec{c}_i$ is the centroid vector of cluster $C_i$, $dist(\cdot,\cdot)$ is the Euclidean distance in the word vector space, and $|C_i|$ is the number of words contained in cluster $C_i$.
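A sketch of the step-three weighting under the reconstruction above (the normalization by mean intra-cluster distance is an assumed reading where the original formula was garbled). Pre-trained Word2Vec vectors are assumed to be supplied as a plain dict, scikit-learn's KMeans stands in for the unspecified K-means implementation, and `n_clusters` is an illustrative default.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(word_vectors, n_clusters=5, random_state=0):
    """Weight each word by its distance to its cluster centroid,
    normalized by the mean distance within that cluster (an assumed
    reading of the patent's garbled formula).

    word_vectors: dict mapping candidate keywords to trained Word2Vec
    vectors (equal-dimension np.ndarray); needs >= n_clusters words.
    """
    words = list(word_vectors)
    X = np.stack([word_vectors[w] for w in words])
    km = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=random_state).fit(X)
    # Euclidean distance of each word vector to its cluster centroid
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    weights = {}
    for c in range(n_clusters):
        members = km.labels_ == c
        mean_dist = dists[members].mean() or 1.0   # guard singleton clusters
        for idx in np.flatnonzero(members):
            weights[words[idx]] = dists[idx] / mean_dist
    return weights
```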
The technical scheme of the invention is further improved as follows: the fourth step can be divided into the following steps:
A1: centering: the degree centrality is the most direct measurement index for describing the node centrality in network analysis, the degree of a node is higher, the degree centrality of the node is more important in the network, and the degree centrality calculation formula of a certain node is as follows:
A2: feature vector centrality: the basic idea of feature vector centrality is that the importance of a node depends on both its number of neighboring nodes and its importance of neighboring nodes. The characteristic vector centrality calculation formula of a certain node is as follows: Wherein lambda is a characteristic value, j is a neighbor node of the node i;
A3: median centrality: the basic idea of the betting center is to characterize node importance by counting the number of shortest paths through a node. The calculation formula of the medium centrality of a certain node is as follows: Where σut represents the number of shortest paths of nodes u to t, σut (i) represents the number of paths that these paths pass through node i, and N represents the number of nodes in the graph;
The centrality of the node in the word graph is comprehensively estimated by the following formulas in combination with the A1, A2 and A3: wCentrality(i)=Wd9i)+Wev(i)+Wb (i).
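A compact sketch of step four; networkx is an assumed library choice (the patent names no implementation), and its degree_centrality, eigenvector_centrality, and betweenness_centrality correspond to A1, A2, and A3. Edge weights are ignored here since the A1-A3 formulas are stated for the unweighted graph.

```python
import networkx as nx

def centrality_weights(graph):
    """W_Centrality(i) = W_d(i) + W_ev(i) + W_b(i) over the word graph.

    graph: the adjacency dict produced by build_word_graph, converted
    to an undirected networkx graph (edge weights ignored).
    """
    g = nx.Graph()
    for u, nbrs in graph.items():
        for v in nbrs:
            g.add_edge(u, v)
    deg = nx.degree_centrality(g)                     # A1: degree / (N-1)
    ev = nx.eigenvector_centrality(g, max_iter=1000)  # A2: (1/lambda) sum over neighbors
    btw = nx.betweenness_centrality(g, normalized=True)  # A3: shortest-path counts
    return {v: deg[v] + ev[v] + btw[v] for v in g}
```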
The technical scheme of the invention is further improved as follows: in step five, for any word $i$ in the text, the comprehensive weight is calculated as $W_{Weight}(i) = W_{TF\text{-}IDF}(i) + W_{Word2Vec\text{-}Cluster}(i) + W_{Centrality}(i)$, and the node weight values are normalized using
$$W'_{Weight}(i) = \frac{W_{Weight}(i) - min\_weight}{max\_weight - min\_weight},$$
where $W_{Weight}(i)$ is the weight of node $i$, $min\_weight$ is the minimum weight value among all nodes, and $max\_weight$ is the maximum weight value among all nodes. Following the TextRank algorithm, the probability transition formula between word-graph nodes is
$$P(V_i, V_j) = \frac{W'_{Weight}(j)}{\sum_{V_k \in Out(V_i)} W'_{Weight}(k)},$$
from which the probability transition matrix $M = (P_{ij})_{n \times n}$ between nodes of the word graph is constructed, where $P_{ij}$ is the transition probability of a jump from node $i$ to node $j$, i.e., $P(V_i, V_j)$. With the probability transition matrix $M$, the final TextRank node-weight iteration formula is
$$\mathbf{WS}^{(t+1)} = (1-d)\,\mathbf{e} + d\, M^{\top} \mathbf{WS}^{(t)},$$
where $\mathbf{e}$ is the all-ones vector.
The technical scheme of the invention is further improved as follows: in step five, by continuously iterating the formula, convergence is reached when the change between successive iterations falls below a given threshold, generally 0.0001; iteration then stops, each node's score reflects its importance in the graph, the nodes are sorted in descending order of score, and the top N words form the final result of keyword extraction.
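The step-five combination can be sketched as follows, assuming the word graph from step one and the three per-word weight dicts from the earlier sketches; the matrix iteration implements $\mathbf{WS}^{(t+1)} = (1-d)\mathbf{e} + d M^{\top}\mathbf{WS}^{(t)}$ with the patent's threshold of 0.0001.

```python
import numpy as np

def weighted_textrank(graph, w_tfidf, w_cluster, w_central,
                      d=0.85, tol=1e-4, max_iter=100):
    """Combine the three node weights, min-max normalize, build the
    transition matrix M with P(Vi,Vj) proportional to node j's
    normalized weight among Vi's neighbors, and iterate
    WS <- (1-d)*e + d * M^T WS until convergence."""
    words = list(graph)
    idx = {w: k for k, w in enumerate(words)}
    n = len(words)
    w = np.array([w_tfidf.get(x, 0.0) + w_cluster.get(x, 0.0)
                  + w_central.get(x, 0.0) for x in words])
    w = (w - w.min()) / (w.max() - w.min() + 1e-12)  # W'_Weight(i)
    M = np.zeros((n, n))                             # M[i, j] = P(Vi, Vj)
    for u, nbrs in graph.items():
        total = sum(w[idx[v]] for v in nbrs)
        if total > 0:
            for v in nbrs:
                M[idx[u], idx[v]] = w[idx[v]] / total
    ws = np.ones(n)
    for _ in range(max_iter):
        new_ws = (1 - d) + d * (M.T @ ws)
        converged = np.abs(new_ws - ws).max() < tol  # patent threshold 1e-4
        ws = new_ws
        if converged:
            break
    return sorted(zip(words, ws), key=lambda p: -p[1])
```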
By adopting the above technical scheme, the invention achieves the following technical progress over the prior art:
1. The invention provides a method for automatically extracting text keywords that calculates word-node weight information using a TF-IDF algorithm, a word vector clustering weighting algorithm, and a central node metric algorithm of the graph, enriching the initial information of each node in the word co-occurrence network that the model consults during keyword extraction and thereby significantly improving the accuracy of keyword extraction.
2. The invention provides a method for automatically extracting text keywords whose model design fully accounts for the diverse requirements of keyword extraction in real applications, so that the model can flexibly handle text data of different types and scales; it performs well both on domain-specific technical terms and on more general texts, providing strong support for a wide range of application scenarios.
3. The invention provides a method for automatically extracting text keywords that incorporates the TF-IDF algorithm, the word vector clustering weighting algorithm, and the central node metric algorithm into the TextRank model, comprehensively examining the text's internal information, external corpus information, and structural information, respectively. Using node centrality as a distinguishing attribute of node importance makes it possible to effectively evaluate the importance and influence of nodes in the graph and to mine the important nodes of the word co-occurrence network, so that text keywords are better extracted.
Drawings
FIG. 1 is a candidate keyword graph of the present invention;
Fig. 2 is a schematic overview of the method of the present invention.
Detailed Description
The invention is further illustrated by the following examples:
example 1
As shown in fig. 1 and 2, the present invention provides a method for automatically extracting text keywords, which comprises the following steps:
step one: a TextRank algorithm;
step two: a TF-IDF algorithm;
Step three: a word vector clustering weighting algorithm;
step four: a central node metric algorithm;
step five: extracting keywords;
In step one, the TextRank algorithm preprocesses a text, constructs adjacency relations between word units in the text by building a word graph $G=(V,E)$, and calculates the importance of word nodes. When constructing the nodes of the word graph, the text is first divided into sentence units; each sentence is segmented, stop words, repeated words, and the like are removed, and only nouns, verbs, and adjectives are retained, forming the node set $V=\{V_1, V_2, \dots, V_n\}$ of the word graph. Whether an edge exists between two nodes is determined by the word co-occurrence relation: when the words corresponding to two nodes appear in the same window, the two nodes are connected, forming the edge set $E=\{(V_i, V_j) \mid V_i \in V \wedge V_j \in V\}$ of the word graph, where the window size can be set to 2 to 10 words. Having constructed the word graph $G=(V,E)$, the TextRank value of each node is obtained by iterating the formula
$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j),$$
where $d \in [0,1]$ is the damping coefficient, indicating that any node jumps with probability $1-d$ to a random node in the word graph; to ensure that the TextRank iteration converges, it is generally set to 0.85. $WS(V_i)$ denotes the TextRank value of node $V_i$; $WS(V_j)$ denotes the TextRank value of node $V_j$ after the previous iteration; $In(V_i)$ denotes the set of all nodes pointing to node $V_i$; $Out(V_j)$ denotes the set of all nodes that node $V_j$ points to; $w_{ji}$ denotes the similarity (edge weight) between nodes $V_j$ and $V_i$.
In this embodiment, taking fig. 1 as an example, the probability that node α jumps to any one of its neighboring nodes in region A is uniform, i.e., $1/|A|$, where $|A|$ is the number of neighbors in region A. However, research shows that non-uniform jumping according to node importance can effectively improve keyword extraction; the method therefore enriches the initial weight information of the nodes on this basis and further optimizes the jump probabilities between nodes.
Example 2
As shown in fig. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, in step two, the word frequency (TF) and the inverse document frequency (IDF) are multiplied to obtain the weight value of each word, and the top-ranked words by weight are selected as keywords. The calculation formulas are $W_{TF\text{-}IDF}(i) = TF_i \times IDF_i$ and $IDF_i = \log(N / DF_i)$, where $TF_i$ is the number of occurrences of word $i$ in the text divided by the total number of words in the text, i.e., the frequency of word $i$ in the text; $N$ is the total number of texts in the corpus; $DF_i$ is the number of texts containing word $i$; and $IDF_i$ reflects the category-discrimination capability of word $i$.
Example 3
As shown in fig. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, step three examines keywords by clustering the word vectors of the text and measuring the distance of each word vector from its cluster center. The word vector clustering weighting algorithm assumes that the words of a text can be divided into several groups, called clusters, by computing the vector similarity between them; the farther a word is from the centroid of its cluster, the more it differs from the centroid and from the words around the centroid, and the higher its importance as a TextRank word-graph node. In step three, a given text is preprocessed to generate candidate keywords $V=\{V_1, V_2, \dots, V_n\}$, and the word vector $\vec{v}_i$ of each candidate word is obtained from a trained Word2Vec model. Let $C=\{C_1, C_2, \dots, C_k\}$ denote the result of K-means clustering over the document's word vectors. The importance of any word $i$ in its cluster $C_i$ is calculated as
$$W_{Word2Vec\text{-}Cluster}(i) = \frac{dist(\vec{v}_i, \vec{c}_i)}{\frac{1}{|C_i|}\sum_{j \in C_i} dist(\vec{v}_j, \vec{c}_i)},$$
where $\vec{c}_i$ is the centroid vector of cluster $C_i$, $dist(\cdot,\cdot)$ is the Euclidean distance in the word vector space, and $|C_i|$ is the number of words contained in cluster $C_i$.
Example 4
As shown in fig. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, the fourth step may be divided into the following steps:
A1: centering: the degree centrality is the most direct measurement index for describing the node centrality in network analysis, the degree of a node is higher, the degree centrality of the node is more important in the network, and the degree centrality calculation formula of a certain node is as follows:
A2: feature vector centrality: the basic idea of feature vector centrality is that the importance of a node depends on both its number of neighboring nodes and its importance of neighboring nodes. The characteristic vector centrality calculation formula of a certain node is as follows: Wherein lambda is a characteristic value, j is a neighbor node of the node i;
A3: median centrality: the basic idea of the betting center is to characterize node importance by counting the number of shortest paths through a node. The calculation formula of the medium centrality of a certain node is as follows: Where σut represents the number of shortest paths of nodes u to t, σut (i) represents the number of paths that these paths pass through node i, and N represents the number of nodes in the graph;
In combination with A1, A2 and A3, the centrality of the node in the word graph is comprehensively estimated by the following formula: wCentrality(i)=Wd(i)+Wev(i)+Wb (i).
In this embodiment, node centrality serves as a distinguishing attribute of node importance in the graph; it effectively evaluates the importance and influence of nodes and mines the important nodes of the word co-occurrence network, so that text keywords are better extracted.
Example 5
As shown in fig. 1 and 2, on the basis of embodiment 1, the present invention provides a technical solution: preferably, in step five, for any word $i$ in the text, the comprehensive weight is calculated as $W_{Weight}(i) = W_{TF\text{-}IDF}(i) + W_{Word2Vec\text{-}Cluster}(i) + W_{Centrality}(i)$, and the node weight values are normalized using
$$W'_{Weight}(i) = \frac{W_{Weight}(i) - min\_weight}{max\_weight - min\_weight},$$
where $W_{Weight}(i)$ is the weight of node $i$, $min\_weight$ is the minimum weight value among all nodes, and $max\_weight$ is the maximum weight value among all nodes. Following the TextRank algorithm, the probability transition formula between word-graph nodes is
$$P(V_i, V_j) = \frac{W'_{Weight}(j)}{\sum_{V_k \in Out(V_i)} W'_{Weight}(k)},$$
from which the probability transition matrix $M = (P_{ij})_{n \times n}$ is constructed, where $P_{ij}$ is the transition probability of a jump from node $i$ to node $j$, i.e., $P(V_i, V_j)$. With the probability transition matrix $M$, the final TextRank node-weight iteration formula is
$$\mathbf{WS}^{(t+1)} = (1-d)\,\mathbf{e} + d\, M^{\top} \mathbf{WS}^{(t)}.$$
By continuously iterating this formula, convergence is reached when the change between successive iterations falls below a given threshold, generally 0.0001; iteration then stops, each node's score reflects its importance in the graph, the nodes are sorted in descending order of score, and the top N words form the final result of keyword extraction.
The workflow of the method of automatically extracting text keywords is described in detail below.
As shown in fig. 2, the first step preprocesses the given text, mainly deleting stop words, repeated words, and the like, and retaining verbs, nouns, and adjectives; the second step constructs the word graph according to a word co-occurrence window, which defines the maximum distance within which two words are considered to co-occur; the third step calculates the TF-IDF values, word vector clustering weights, and central node metric values of the word nodes and sums them with weighting; the fourth step iterates the TextRank algorithm to obtain the final score of each candidate keyword and, after sorting, selects the top N words as the final keywords.
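To make the workflow concrete, here is a hedged end-to-end sketch that glues together the helper functions sketched earlier in this document (build_word_graph, tf_idf_weights, cluster_weights, centrality_weights, and weighted_textrank are illustrative names introduced in those sketches, not the patent's):

```python
def extract_keywords(sentences, corpus, word_vectors, top_n=5):
    """Run the full pipeline of fig. 2: preprocess -> word graph ->
    three weight sources -> weighted TextRank -> top-N keywords."""
    graph = build_word_graph(sentences, window=3)
    tokens = [w for s in sentences for w in s]
    w_tfidf = tf_idf_weights(tokens, corpus)
    vecs = {w: word_vectors[w] for w in graph if w in word_vectors}
    w_cluster = cluster_weights(vecs) if len(vecs) > 5 else {}
    w_central = centrality_weights(graph)
    ranked = weighted_textrank(graph, w_tfidf, w_cluster, w_central)
    return [w for w, _ in ranked[:top_n]]
```

The guard on `len(vecs)` exists only because the illustrative `n_clusters=5` in cluster_weights needs at least as many vectors as clusters.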
The invention has been described above both generally and in detail, but it will be apparent to those skilled in the art that modifications and improvements can be made to it. Accordingly, such modifications and improvements, insofar as they remain within the spirit of the inventive concept, are intended to be covered.

Claims (8)

In step one, the TextRank algorithm preprocesses a text, constructs adjacency relations between word units in the text by building a word graph $G=(V,E)$, and calculates the importance of word nodes. When constructing the nodes of the word graph, the text needs to be divided into sentence units; each sentence is segmented, stop words and repeated words are removed, and only nouns, verbs, and adjectives are retained, forming the node set $V=\{V_1, V_2, \dots, V_n\}$ of the word graph. Whether an edge exists between two nodes is determined by the word co-occurrence relation: when the words corresponding to two nodes appear in the same window, the two nodes are connected, forming the edge set $E=\{(V_i, V_j) \mid V_i \in V \wedge V_j \in V\}$ of the word graph, where the window size can be set to 2 to 10 words;
7. The method for automatically extracting text keywords according to claim 1, wherein: in step five, for any word $i$ in the text, the comprehensive weight is calculated as $W_{Weight}(i) = W_{TF\text{-}IDF}(i) + W_{Word2Vec\text{-}Cluster}(i) + W_{Centrality}(i)$, and the node weight values are normalized using
$$W'_{Weight}(i) = \frac{W_{Weight}(i) - min\_weight}{max\_weight - min\_weight},$$
where $W_{Weight}(i)$ is the weight of node $i$, $min\_weight$ is the minimum weight value among all nodes, and $max\_weight$ is the maximum weight value among all nodes; following the TextRank algorithm, the probability transition formula between word-graph nodes is
$$P(V_i, V_j) = \frac{W'_{Weight}(j)}{\sum_{V_k \in Out(V_i)} W'_{Weight}(k)},$$
from which the probability transition matrix $M = (P_{ij})_{n \times n}$ is constructed, where $P_{ij}$ is the transition probability of a jump from node $i$ to node $j$, i.e., $P(V_i, V_j)$; with the probability transition matrix $M$, the final TextRank node-weight iteration formula is
$$\mathbf{WS}^{(t+1)} = (1-d)\,\mathbf{e} + d\, M^{\top} \mathbf{WS}^{(t)}.$$

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410034861.5A | 2024-01-10 | 2024-01-10 | Method for automatically extracting text keywords (published as CN117973381A, en)


Publications (1)

Publication Number | Publication Date
CN117973381A | 2024-05-03

Family ID: 90847128


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119446538A (en)* | 2025-01-08 | 2025-02-14 | 南充市中心医院 | An intelligent management method and system for diabetic patients based on artificial intelligence
CN119474339A (en)* | 2024-11-08 | 2025-02-18 | 浙江大学 | A method for generating and screening keywords from datasets based on large language models
CN119512517A (en)* | 2024-11-06 | 2025-02-25 | 天建软件(常州)有限公司 | A method and system for generating ERP code based on business flow analysis


Similar Documents

Publication | Title
CN109190117B (en) | Short text semantic similarity calculation method based on word vector
US9971974B2 (en) | Methods and systems for knowledge discovery
CN108132927B (en) | Keyword extraction method for combining graph structure and node association
CN110321925B (en) | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN117973381A (en) | Method for automatically extracting text keywords
CN111177365A (en) | An unsupervised automatic abstract extraction method based on graph model
CN114265935B (en) | A text mining-based decision-making support method and system for scientific and technological project establishment management
CN110909116B (en) | Entity set expansion method and system for social media
CN109902290B (en) | Text information-based term extraction method, system and equipment
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model
CN113988053A (en) | Hot word extraction method and device
Kumar et al. | Performance analysis of keyword extraction algorithms assessing extractive text summarization
CN112527981A (en) | Open type information extraction method and device, electronic equipment and storage medium
CN117672375A (en) | Method, device, equipment and storage medium for mining synthetic biological functional element
Ghalehtaki et al. | A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization
CN111563361B (en) | Text label extraction method and device and storage medium
CN112926340A (en) | Semantic matching model for knowledge point positioning
CN114298029B (en) | A new word discovery method combining word vector multi-feature fusion
CN115858787A (en) | A hotspot extraction and mining method based on problem appeal information in road transportation
CN115906825A (en) | Chinese word sense disambiguation combining multi-channel mixed hole convolution with residual error and attention
CN119760057A (en) | Response large model retrieval enhancement method and device based on hierarchical cluster index structure
Heidary et al. | Automatic Persian text summarization using linguistic features from text structure analysis
CN115392244B (en) | Academic keyword batch recognition system
Üstün et al. | Incorporating word embeddings in unsupervised morphological segmentation
CN116502637A (en) | Text keyword extraction method combining context semantics

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
