Disclosure of Invention
In view of the above technical problems, the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features, so as to improve the accuracy of similarity measurement between texts.
The text similarity calculation method fusing keyword features and multi-granularity semantic features executes, for any two texts d_i and d_j whose similarity is to be calculated, the following steps:
Step 1: extract the keywords of texts d_i and d_j;
Step 2: extract the keyword feature fingerprints f_i1 and f_j1 of the texts based on the Ksimhash algorithm, and calculate the Hamming distance between f_i1 and f_j1 to obtain the keyword feature similarity sim_1 of texts d_i and d_j;
Step 3: calculate the word semantic similarity sim_2 of texts d_i and d_j;
Step 4: calculate the chapter semantic similarity sim_3 of texts d_i and d_j;
Step 5: combine the keyword feature similarity sim_1, the word semantic similarity sim_2 and the chapter semantic similarity sim_3 to obtain the similarity sim of texts d_i and d_j.
Preferably, in step 1, extracting the keywords of the text specifically includes:
Step 1.1: perform text preprocessing on the content of the text to obtain a text candidate word set, where the text preprocessing includes word segmentation and stop-word removal.
Step 1.2: extract the keywords of the text from the text candidate word set: calculate the TFIDF value of every word in the candidate word set, and take the K words with the largest TFIDF values as the text keywords, where K is a positive integer that can be set according to the actual application scenario;
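As an illustrative, non-limiting Python sketch of step 1.2 (the function name and the assumption that all texts are already preprocessed into token lists are not part of the invention):

    import math
    from collections import Counter

    def top_k_keywords(doc_tokens, corpus, k=10):
        """Return the K words of doc_tokens with the largest TFIDF values."""
        n_docs = len(corpus)
        counts = Counter(doc_tokens)
        scores = {}
        for word, count in counts.items():
            tf = count / len(doc_tokens)                  # term frequency in this text
            # corpus is assumed to contain doc_tokens itself, so df >= 1
            df = sum(1 for doc in corpus if word in doc)  # document frequency
            idf = math.log(n_docs / df)                   # inverse document frequency
            scores[word] = tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:k]

Here doc_tokens would be the candidate word set of one text and corpus the candidate word sets of all texts.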
Preferably, in step 2, extracting the keyword features of the text includes:
Step 2.1: calculate a hash value K_h with a given number of bits (e.g., 16 bits) for each keyword k (k = 1, 2, …, K): perform a hash operation of the specified number of bits on each character code forming the keyword, and perform a bitwise XOR over the resulting hash values to obtain the hash value of the keyword;
Step 2.2: calculate the weighted hash value of each keyword: the weighted value of keyword k is W_k = TFIDF_k × K_h, where TFIDF_k represents the TFIDF value, i.e., the weight, of keyword k; each bit of K_h that is 1 is multiplied by the positive weight +TFIDF_k, and each bit that is 0 is multiplied by the negative weight -TFIDF_k. For example, a keyword with K_h = [0,1,0,1,1,1] and corresponding weight TFIDF_k = 5 is weighted to [-5, 5, -5, 5, 5, 5];
Step 2.3: sum the weighted hash values of all keywords bit by bit to obtain an accumulated vector. For example, [-5, 5, -5, 5, 5, 5], [-3, -3, -3, 3, -3, 3] and [1, -1, -1, 1, 1, 1] accumulate to [-7, 1, -9, 9, 3, 9];
Step 2.4: perform a dimensionality-reduction calculation on the accumulated vector to obtain the keyword feature fingerprint of the text: judge each element of the accumulated vector, set it to 1 if it is greater than 0, and set it to 0 otherwise. This yields the keyword feature fingerprints f_i1 and f_j1 of texts d_i and d_j, and the Hamming distance between these fingerprints then gives the keyword feature similarity sim_1 of texts d_i and d_j.
For example, dimensionality reduction of the accumulated vector [-7, 1, -9, 9, 3, 9] gives the text keyword feature fingerprint 010111.
Further, the keyword feature similarity sim_1 of texts d_i and d_j can be set as:
sim_1 = 1 - H_i,j / max(len(f_i1), len(f_j1))
where H_i,j represents the Hamming distance between the keyword feature fingerprints of texts d_i and d_j, the max() function takes the maximum of its arguments, and the len() function calculates the length of a string. When calculating the Hamming distance between two keyword feature fingerprints, if the fingerprint (string) lengths of the two texts differ, a bit-padding operation is performed on the low-order bits of the shorter fingerprint so that the two fingerprints have the same length. Preferably, zeros are appended to the low-order bits.
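A minimal Python sketch of steps 2.1-2.4 and of the above similarity follows; the 64-bit width matches the embodiment below, and md5 merely stands in for the character-code XOR hashing of step 2.1, so all names here are illustrative assumptions rather than the invention's prescribed implementation:

    import hashlib

    def keyword_fingerprint(weighted_keywords, bits=64):
        """Build the keyword feature fingerprint from (keyword, TFIDF) pairs."""
        acc = [0.0] * bits
        for word, weight in weighted_keywords:
            # md5 stands in for the bitwise-XOR character hashing of step 2.1
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                bit = (h >> (bits - 1 - i)) & 1
                acc[i] += weight if bit else -weight  # steps 2.2-2.3: signed weighting and accumulation
        return "".join("1" if v > 0 else "0" for v in acc)  # step 2.4: dimensionality reduction

    def keyword_similarity(fp_i, fp_j):
        """sim_1 = 1 - Hamming distance / max(len(f_i1), len(f_j1))."""
        n = max(len(fp_i), len(fp_j))
        fp_i, fp_j = fp_i.ljust(n, "0"), fp_j.ljust(n, "0")  # pad zeros in the low-order bits
        hamming = sum(a != b for a, b in zip(fp_i, fp_j))
        return 1 - hamming / n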
Preferably, step 3 includes the following steps:
Step 3.1: based on each keyword of the text, take the N words before and after the keyword as its context and train a real-valued vector (for example, using the CBOW model of word2vec and its hyper-parameters), so that each keyword w_k finally corresponds to a semantic vector v_k, where N is a positive integer;
Step 3.2: compute the word semantic fingerprint f_2 of the text: sum the semantic vectors of the K keywords to obtain the word semantic fingerprint f_2 of the text, namely:
f_2 = v_1 + v_2 + … + v_K
Step 3.3: calculate the cosine similarity of the word semantic fingerprints f_i2 and f_j2 corresponding to texts d_i and d_j to obtain the word semantic similarity sim_2 of d_i and d_j.
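Illustratively, steps 3.2-3.3 reduce to a vector sum and a cosine. A minimal sketch, assuming word_vectors maps each keyword to its trained semantic vector (for example, a gensim KeyedVectors object):

    import numpy as np

    def word_semantic_fingerprint(keywords, word_vectors):
        """f_2: sum of the semantic vectors of the K keywords (step 3.2)."""
        return np.sum([word_vectors[w] for w in keywords if w in word_vectors], axis=0)

    def cosine_similarity(u, v):
        """Cosine similarity used in step 3.3 (and again in step 4.3)."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))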
Preferably, step 4 includes the following steps:
Step 4.1: extract the L longest sentences in the text as the representative sentences of the text and acquire a sentence vector for each representative sentence, so that each representative sentence s_l corresponds to a semantic vector u_l, where L is a positive integer; for example, the sentence vector of a representative sentence may be calculated using the PV-DM model of Doc2vec;
Step 4.2: compute the chapter semantic fingerprint f_3 of the text: sum the semantic vectors of the L representative sentences to obtain the chapter semantic fingerprint f_3, namely:
f_3 = u_1 + u_2 + … + u_L
Step 4.3: calculate the cosine similarity of the chapter semantic fingerprints f_i3 and f_j3 corresponding to texts d_i and d_j to obtain the chapter semantic similarity sim_3 of d_i and d_j.
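A minimal sketch of steps 4.1-4.2, assuming sentence_vector is any sentence-to-vector function (such as the infer_vector method of a trained PV-DM Doc2vec model); measuring sentence length by character count is an illustrative assumption:

    import numpy as np

    def chapter_fingerprint(sentences, sentence_vector, L=5):
        """f_3: sum of the vectors of the L longest sentences (steps 4.1-4.2)."""
        representative = sorted(sentences, key=len, reverse=True)[:L]
        return np.sum([sentence_vector(s) for s in representative], axis=0)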
Preferably, in step 5, the similarity sim of texts d_i and d_j is obtained from the keyword feature similarity sim_1, the word semantic similarity sim_2 and the chapter semantic similarity sim_3; for example, as in the embodiment below, sim may be taken as their average.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: when the similarity of two texts is calculated, both the keyword features and the semantic features of the texts are fully considered. Meanwhile, the semantic features are not confined to the word granularity level but extend to the chapter granularity level, and a multi-dimensional text representation vector is established, so that text similarity calculation is more accurate. The invention can be used in application fields such as article duplicate checking and article retrieval.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features, aiming at the problems that conventional text similarity calculation methods ignore the combination of text keyword features and semantic features, pay excessive attention to word semantics, and ignore coarse-granularity semantics at the sentence, paragraph and whole-text levels. As shown in Figure 1, the method provided by the embodiment of the invention comprises the following steps.
In the present embodiment, the calculation of the similarity of texts d_i and d_j is described as an example, and features of three dimensions are considered: keyword features, word-level semantic features, and text-level semantic features.
Step 1: based on the Kshimhash algorithm, obtaining text fingerprints, calculating the Hamming distance between the two text fingerprints, and obtaining the similarity of the two texts on the key word characteristics. The method specifically comprises the following steps:
Step 1.1: perform word segmentation on the text content of the current text d_i.
In this embodiment, the word segmentation tool is jieba, and text d_i forms a bag of words after segmentation, i.e., d_i = [w_i1, w_i2, …, w_in], where w_ik (k = 1, 2, …, n) represents the k-th word of text d_i; there is no semantic association between the words in the bag;
Step 1.2: remove the stop words from the bag of words.
When removing stop words, a stop-word list is introduced; for each word in the bag, judge whether it appears in the stop-word list, and if it does, remove it from the bag;
Step 1.3: after removing the stop words, obtain the filtered bag of words d_i = [w_i1, w_i2, …, w_im], where w_ik (k = 1, 2, …, m) represents a word of text d_i that does not appear in the stop-word list.
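Illustratively, steps 1.1-1.3 could be implemented as follows; the stop-word file name is a placeholder assumption:

    import jieba

    def preprocess(text, stopwords):
        """Steps 1.1-1.3: segment with jieba, then drop stop words."""
        return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

    # The stop-word list would typically be loaded from a file, e.g.:
    # stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())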
Next, extract the keywords of the current text based on the filtered bag of words: considering the term frequency-inverse document frequency (TFIDF) feature of each word in the current bag, take the K words with the largest TFIDF values in the current text to form the keyword list (the value of K is empirical, with a preferred range of [5, 10]; in this embodiment K is set to 10). Each keyword keyword_i-k corresponds to a weight weight_i-k, i.e., the weight is its TFIDF value.
The TFIDF calculation formula is:
TFIDF_ik = (count(w_ik) / |d_i|) × log(N / Σ_m I(w_ik, d_m))
where count(w_ik) represents the number of occurrences of word w_ik in text d_i, |d_i| denotes the total number of words in the current text, N denotes the total number of texts, and I(w_ik, d_m) indicates whether word w_ik appears in text d_m: if it appears, I(w_ik, d_m) takes the value 1, otherwise 0.
A word-weight set (w_k, weight_k) is thus finally obtained;
Step 1.4: performing hash operation on words in the keyword list of the current text, and calculating the hash value of each keyword to obtain (hash)k,weightk) Gathering;
step 1.5: each keywordi-kBased on the hash value, according to the corresponding weighti-kWeighting is carried out, namely: wk=hashk×weightkThe hash value is 1, and the weight is multiplied positively, and the hash value is 0, and the weight is multiplied negatively. For example, a word is hashed to [010110 ]]And its corresponding weight is 5, then the weighting results in [ -5,5, -5,5,5, -5];
Step 1.6: summing the weighted hash vectors corresponding to all words in the keyword list, such as [ -5,5, -5,5,5,5], [ -3, -3, -3,3, -3,3], [ -1, -1, -1,1,1,1] to obtain [ -7,1, -9,9,3,9] after accumulation;
step 1.7: and performing dimensionality reduction operation on the obtained accumulated vector. Namely: if the value is greater than 0, the value is set to 1, otherwise, the value is set to 0, and the Kstimhash value of the statement is obtained.
E.g., [ -7,1, -9,9,3,9], resulting in 010111, which is the Ksimhash value of the current text.
Illustratively, d is obtained after the calculation process based on the aboveiThe Kshimhash value of the text is fi1:0011010101000110111001110000100010000110010111011101010000100100,djThe Kshimhash value of the text is fj1:0010110100010011110011100100100011001110110011011000110101001101;
Step 1.8: using text diAnd djRespective corresponding text fingerprints fi1And fj1Calculating fi1And fj1Hamming distance of Hi,j. Hamming distance is defined as the number of bits of difference in a text fingerprint.
Step 1.9: obtain the keyword feature similarity sim_1 of d_i and d_j, where
sim_1 = 1 - H_i,j / max(len(f_i1), len(f_j1))
Corresponding to the above example, the keyword feature similarity sim_1 of d_i and d_j is 0.65625.
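This value can be checked directly from the two 64-bit fingerprints given above, which differ in 22 bit positions:

    f_i1 = "0011010101000110111001110000100010000110010111011101010000100100"
    f_j1 = "0010110100010011110011100100100011001110110011011000110101001101"

    H = sum(a != b for a, b in zip(f_i1, f_j1))  # Hamming distance: 22 differing bits
    sim_1 = 1 - H / max(len(f_i1), len(f_j1))    # 1 - 22/64
    print(sim_1)                                 # 0.65625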
Step 2: obtain the word-level semantic vectors of the two texts based on a Word2vec model and calculate the cosine similarity of the two word-level semantic vectors. This specifically includes the following steps:
Step 2.1: train word vectors on all texts based on the Word2vec model, so that each word w_n corresponds to a semantic vector v_n;
Step 2.2: for the current text d_i, obtain the keyword list corresponding to the text using the TFIDF algorithm; each keyword_i-k in the list corresponds to a word vector v(keyword_i-k), and the word semantic vector corresponding to the current text is calculated as their sum, namely:
f_i2 = v(keyword_i-1) + v(keyword_i-2) + … + v(keyword_i-K)
Illustratively, the dimension of the semantic vectors may be set to 200.
Step 2.3: perform the operations of steps 2.1-2.2 on each text, so that each text has a unique corresponding word semantic vector;
Step 2.4: calculate the cosine similarity of the word semantic vectors f_i2 and f_j2 corresponding to texts d_i and d_j to obtain the word semantic similarity sim_2 of d_i and d_j. Let f_i2 and f_j2 have dimension n, i.e., f_i2 = [f_i21, f_i22, …, f_i2n] and f_j2 = [f_j21, f_j22, …, f_j2n]; then sim_2 is calculated as:
sim_2 = (Σ_t f_i2t × f_j2t) / (sqrt(Σ_t f_i2t²) × sqrt(Σ_t f_j2t²)), t = 1, 2, …, n
Illustratively, the sim_2 calculated in this example is 0.15181794593072392.
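A hedged sketch of the training in step 2.1 with gensim (version 4 or later is assumed; corpus_tokens, a list of token lists built by the preprocessing above, is an assumed variable):

    from gensim.models import Word2Vec

    # corpus_tokens: one token list per preprocessed text (assumed variable)
    model = Word2Vec(corpus_tokens, vector_size=200, window=5, sg=0, min_count=1)  # sg=0 selects CBOW
    word_vectors = model.wv  # word -> 200-dimensional semantic vector, as used in step 2.2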
Step 3: based on a Doc2vec model, obtain the chapter-level semantic vectors of the two texts and calculate the cosine similarity of the two text-level semantic vectors. The specific steps are as follows:
Step 3.1: obtain the text vectors f_i3 and f_j3 respectively corresponding to texts d_i and d_j using the Doc2vec model; that is, the text vectors f_i3 and f_j3 are obtained based on the representative sentences of the texts. For example, when calculating the semantic vector of each representative sentence, the dimension of the semantic vector may be set to 200.
Step 3.2: calculate the cosine similarity of f_i3 and f_j3 to obtain the chapter semantic similarity sim_3 of d_i and d_j. Let f_i3 and f_j3 have dimension n, i.e., f_i3 = [f_i31, f_i32, …, f_i3n] and f_j3 = [f_j31, f_j32, …, f_j3n]; then sim_3 is calculated as:
sim_3 = (Σ_t f_i3t × f_j3t) / (sqrt(Σ_t f_i3t²) × sqrt(Σ_t f_j3t²)), t = 1, 2, …, n
Illustratively, the sim_3 calculated in this example is 0.34401781495762856.
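A hedged sketch of step 3.1 with gensim's PV-DM implementation (gensim 4 or later assumed; text_tokens, one token list per text built from its representative sentences, is an assumed variable):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # text_tokens: one token list per text, built from its representative sentences
    tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(text_tokens)]
    model = Doc2Vec(tagged, vector_size=200, dm=1, min_count=1, epochs=20)  # dm=1 selects PV-DM
    f_i3, f_j3 = model.dv[0], model.dv[1]  # chapter-level vectors of d_i and d_j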
Step 4: add the three local similarity values sim_1, sim_2 and sim_3 and take their average to obtain the final similarity value sim of texts d_i and d_j:
sim = (sim_1 + sim_2 + sim_3) / 3
Corresponding to the above three examples, the final similarity value sim of texts d_i and d_j is 0.3840285869627842.
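This value follows directly from the three local similarities computed above:

    sim_1, sim_2, sim_3 = 0.65625, 0.15181794593072392, 0.34401781495762856
    sim = (sim_1 + sim_2 + sim_3) / 3
    print(sim)  # approximately 0.3840285869627842, matching the worked example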
The similarity calculation method provided by the embodiment of the invention can be used in application fields such as text retrieval and duplicate checking. For example, the text to be processed is denoted as text d_i, and any text in the retrieved text set or the duplicate-checking text library is denoted as text d_j. First, the final similarity value sim of texts d_i and d_j is calculated, and each text d_j whose similarity value sim reaches a first specified threshold (i.e., is greater than or equal to that threshold) is taken as the retrieval or duplicate-checking result.
Furthermore, it is also possible to first cluster the retrieved text set or the duplicate-checking text library to obtain a plurality of clustering results (a plurality of clusters), and then calculate the similarity sim between text d_i and the text corresponding to each cluster centre; for every cluster whose centre similarity reaches a second specified threshold, the similarity sim between text d_i and each text in that cluster is calculated, and the text corresponding to the maximum similarity sim is taken as the duplicate-checking or retrieval result. A sketch of this two-stage variant follows.
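An illustrative, non-limiting sketch of the two-stage variant (the data layout, a list of (centre_text, member_texts) pairs, and the function names are assumptions, with sim standing for the similarity calculation described above):

    def retrieve(query_text, clusters, sim, second_threshold):
        """Compare against cluster centres first, then only against members
        of clusters whose centre similarity reaches the second threshold."""
        best_text, best_sim = None, -1.0
        for centre_text, member_texts in clusters:
            if sim(query_text, centre_text) >= second_threshold:
                for text in member_texts:
                    s = sim(query_text, text)
                    if s > best_sim:
                        best_text, best_sim = text, s
        return best_text  # the duplicate-checking or retrieval result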
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept, and such changes and modifications likewise fall within the spirit and scope of the invention.