Disclosure of Invention
In view of the above technical problems, the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features, so as to improve the accuracy of similarity measurement between texts.
The text similarity calculation method fusing keyword features and multi-granularity semantic features executes, for any two texts d_i and d_j whose similarity is to be calculated, the following steps:
Step 1: extract the keywords of texts d_i and d_j;
Step 2: extract the keyword feature fingerprints f_i1 and f_j1 of the texts based on the Ksimhash algorithm, and calculate the Hamming distance between f_i1 and f_j1 to obtain the keyword feature similarity sim_1 of texts d_i and d_j;
Step 3: calculate the word semantic similarity sim_2 of texts d_i and d_j;
Step 4: calculate the chapter semantic similarity sim_3 of texts d_i and d_j;
Step 5: combine the keyword feature similarity sim_1, the word semantic similarity sim_2 and the chapter semantic similarity sim_3 to obtain the similarity sim of texts d_i and d_j.
Preferably, in step 1, extracting the keywords of the text specifically includes:
Step 1.1: perform text preprocessing on the content of the text to obtain a text candidate word set, where the text preprocessing includes word segmentation and stop-word removal.
Step 1.2: extract the keywords of the text from the text candidate word set: calculate the TFIDF value of every word in the candidate word set, and take the K words with the largest TFIDF values as the text keywords, where K is a positive integer that can be set according to the actual application scenario;
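As an illustrative, non-limiting Python sketch of step 1.2 (the function name and the assumption that all texts are already preprocessed into token lists are not part of the invention):

    import math
    from collections import Counter

    def top_k_keywords(doc_tokens, corpus, k=10):
        """Return the K words of doc_tokens with the largest TFIDF values."""
        n_docs = len(corpus)
        counts = Counter(doc_tokens)
        scores = {}
        for word, count in counts.items():
            tf = count / len(doc_tokens)                  # term frequency in this text
            # corpus is assumed to contain doc_tokens itself, so df >= 1
            df = sum(1 for doc in corpus if word in doc)  # document frequency
            idf = math.log(n_docs / df)                   # inverse document frequency
            scores[word] = tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:k]

Here doc_tokens would be the candidate word set of one text and corpus the candidate word sets of all texts.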
Preferably, in step 2, extracting the keyword features of the text includes:
Step 2.1: calculate a hash value K_h with a given number of bits (e.g., 16 bits) for each keyword k (k = 1, 2, …, K): perform a hash operation of the specified number of bits on each character code forming the keyword, and perform a bitwise XOR over the resulting hash values to obtain the hash value of the keyword;
Step 2.2: calculate the weighted hash value of each keyword: the weighted value of keyword k is W_k = TFIDF_k × K_h, where TFIDF_k represents the TFIDF value, i.e., the weight, of keyword k; each bit of K_h that is 1 is multiplied by the positive weight +TFIDF_k, and each bit that is 0 is multiplied by the negative weight -TFIDF_k. For example, a keyword with K_h = [0,1,0,1,1,1] and corresponding weight TFIDF_k = 5 is weighted to [-5, 5, -5, 5, 5, 5];
Step 2.3: sum the weighted hash values of all keywords bit by bit to obtain an accumulated vector. For example, [-5, 5, -5, 5, 5, 5], [-3, -3, -3, 3, -3, 3] and [1, -1, -1, 1, 1, 1] accumulate to [-7, 1, -9, 9, 3, 9];
Step 2.4: perform a dimensionality-reduction calculation on the accumulated vector to obtain the keyword feature fingerprint of the text: judge each element of the accumulated vector, set it to 1 if it is greater than 0, and set it to 0 otherwise. This yields the keyword feature fingerprints f_i1 and f_j1 of texts d_i and d_j, and the Hamming distance between these fingerprints then gives the keyword feature similarity sim_1 of texts d_i and d_j.
For example, dimensionality reduction of the accumulated vector [-7, 1, -9, 9, 3, 9] gives the text keyword feature fingerprint 010111.
Further, the keyword feature similarity sim_1 of texts d_i and d_j can be set as:
sim_1 = 1 - H_i,j / max(len(f_i1), len(f_j1))
where H_i,j represents the Hamming distance between the keyword feature fingerprints of texts d_i and d_j, the max() function takes the maximum of its arguments, and the len() function calculates the length of a string. When calculating the Hamming distance between two keyword feature fingerprints, if the fingerprint (string) lengths of the two texts differ, a bit-padding operation is performed on the low-order bits of the shorter fingerprint so that the two fingerprints have the same length. Preferably, zeros are appended to the low-order bits.
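A minimal Python sketch of steps 2.1-2.4 and of the above similarity follows; the 64-bit width matches the embodiment below, and md5 merely stands in for the character-code XOR hashing of step 2.1, so all names here are illustrative assumptions rather than the invention's prescribed implementation:

    import hashlib

    def keyword_fingerprint(weighted_keywords, bits=64):
        """Build the keyword feature fingerprint from (keyword, TFIDF) pairs."""
        acc = [0.0] * bits
        for word, weight in weighted_keywords:
            # md5 stands in for the bitwise-XOR character hashing of step 2.1
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                bit = (h >> (bits - 1 - i)) & 1
                acc[i] += weight if bit else -weight  # steps 2.2-2.3: signed weighting and accumulation
        return "".join("1" if v > 0 else "0" for v in acc)  # step 2.4: dimensionality reduction

    def keyword_similarity(fp_i, fp_j):
        """sim_1 = 1 - Hamming distance / max(len(f_i1), len(f_j1))."""
        n = max(len(fp_i), len(fp_j))
        fp_i, fp_j = fp_i.ljust(n, "0"), fp_j.ljust(n, "0")  # pad zeros in the low-order bits
        hamming = sum(a != b for a, b in zip(fp_i, fp_j))
        return 1 - hamming / n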
Preferably, step 3 includes the following steps:
Step 3.1: based on each keyword of the text, take the N words before and after the keyword as its context and train a real-valued vector (for example, using the CBOW model of word2vec and its hyper-parameters), so that each keyword w_k finally corresponds to a semantic vector v_k, where N is a positive integer;
Step 3.2: compute the word semantic fingerprint f_2 of the text: sum the semantic vectors of the K keywords to obtain the word semantic fingerprint f_2 of the text, namely:
f_2 = v_1 + v_2 + … + v_K
Step 3.3: calculate the cosine similarity of the word semantic fingerprints f_i2 and f_j2 corresponding to texts d_i and d_j to obtain the word semantic similarity sim_2 of d_i and d_j.
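Illustratively, steps 3.2-3.3 reduce to a vector sum and a cosine. A minimal sketch, assuming word_vectors maps each keyword to its trained semantic vector (for example, a gensim KeyedVectors object):

    import numpy as np

    def word_semantic_fingerprint(keywords, word_vectors):
        """f_2: sum of the semantic vectors of the K keywords (step 3.2)."""
        return np.sum([word_vectors[w] for w in keywords if w in word_vectors], axis=0)

    def cosine_similarity(u, v):
        """Cosine similarity used in step 3.3 (and again in step 4.3)."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))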
Preferably, step 4 includes the following steps:
Step 4.1: extract the L longest sentences in the text as the representative sentences of the text and acquire a sentence vector for each representative sentence, so that each representative sentence s_l corresponds to a semantic vector u_l, where L is a positive integer; for example, the sentence vector of a representative sentence may be calculated using the PV-DM model of Doc2vec;
Step 4.2: compute the chapter semantic fingerprint f_3 of the text: sum the semantic vectors of the L representative sentences to obtain the chapter semantic fingerprint f_3, namely:
f_3 = u_1 + u_2 + … + u_L
Step 4.3: calculate the cosine similarity of the chapter semantic fingerprints f_i3 and f_j3 corresponding to texts d_i and d_j to obtain the chapter semantic similarity sim_3 of d_i and d_j.
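A minimal sketch of steps 4.1-4.2, assuming sentence_vector is any sentence-to-vector function (such as the infer_vector method of a trained PV-DM Doc2vec model); measuring sentence length by character count is an illustrative assumption:

    import numpy as np

    def chapter_fingerprint(sentences, sentence_vector, L=5):
        """f_3: sum of the vectors of the L longest sentences (steps 4.1-4.2)."""
        representative = sorted(sentences, key=len, reverse=True)[:L]
        return np.sum([sentence_vector(s) for s in representative], axis=0)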
Preferably, in step 5, the similarity sim of texts d_i and d_j is obtained from the keyword feature similarity sim_1, the word semantic similarity sim_2 and the chapter semantic similarity sim_3; for example, as in the embodiment below, sim may be taken as their average.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: when the similarity of two texts is calculated, both the keyword features and the semantic features of the texts are fully considered. Meanwhile, the semantic features are not confined to the word granularity level but extend to the chapter granularity level, and a multi-dimensional text representation vector is established, so that text similarity calculation is more accurate. The invention can be used in application fields such as article duplicate checking and article retrieval.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features, aiming at the problems that conventional text similarity calculation methods ignore the combination of text keyword features and semantic features, pay excessive attention to word semantics, and ignore coarse-granularity semantics at the sentence, paragraph and whole-text levels. As shown in Figure 1, the method provided by the embodiment of the invention comprises the following steps.
In the present embodiment, the calculation of the similarity of texts d_i and d_j is described as an example, and features of three dimensions are considered: keyword features, word-level semantic features, and text-level semantic features.
Step 1: based on the Kshimhash algorithm, obtaining text fingerprints, calculating the Hamming distance between the two text fingerprints, and obtaining the similarity of the two texts on the key word characteristics. The method specifically comprises the following steps:
Step 1.1: perform word segmentation on the text content of the current text d_i.
In this embodiment, the word segmentation tool is jieba, and text d_i forms a bag of words after segmentation, i.e., d_i = [w_i1, w_i2, …, w_in], where w_ik (k = 1, 2, …, n) represents the k-th word of text d_i; there is no semantic association between the words in the bag;
Step 1.2: remove the stop words from the bag of words.
When removing stop words, a stop-word list is introduced; for each word in the bag, judge whether it appears in the stop-word list, and if it does, remove it from the bag;
Step 1.3: after removing the stop words, obtain the filtered bag of words d_i = [w_i1, w_i2, …, w_im], where w_ik (k = 1, 2, …, m) represents a word of text d_i that does not appear in the stop-word list.
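Illustratively, steps 1.1-1.3 could be implemented as follows; the stop-word file name is a placeholder assumption:

    import jieba

    def preprocess(text, stopwords):
        """Steps 1.1-1.3: segment with jieba, then drop stop words."""
        return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

    # The stop-word list would typically be loaded from a file, e.g.:
    # stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())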
Next, extract the keywords of the current text based on the filtered bag of words: considering the term frequency-inverse document frequency (TFIDF) feature of each word in the current bag, take the K words with the largest TFIDF values in the current text to form the keyword list (the value of K is empirical, with a preferred range of [5, 10]; in this embodiment K is set to 10). Each keyword keyword_i-k corresponds to a weight weight_i-k, i.e., the weight is its TFIDF value.
The TFIDF calculation formula is:
TFIDF_ik = (count(w_ik) / |d_i|) × log(N / Σ_m I(w_ik, d_m))
where count(w_ik) represents the number of occurrences of word w_ik in text d_i, |d_i| denotes the total number of words in the current text, N denotes the total number of texts, and I(w_ik, d_m) indicates whether word w_ik appears in text d_m: if it appears, I(w_ik, d_m) takes the value 1, otherwise 0.
A word-weight set (w_k, weight_k) is thus finally obtained;
Step 1.4: performing hash operation on words in the keyword list of the current text, and calculating the hash value of each keyword to obtain (hash)k,weightk) Gathering;
step 1.5: each keywordi-kBased on the hash value, according to the corresponding weighti-kWeighting is carried out, namely: wk=hashk×weightkThe hash value is 1, and the weight is multiplied positively, and the hash value is 0, and the weight is multiplied negatively. For example, a word is hashed to [010110 ]]And its corresponding weight is 5, then the weighting results in [ -5,5, -5,5,5, -5];
Step 1.6: summing the weighted hash vectors corresponding to all words in the keyword list, such as [ -5,5, -5,5,5,5], [ -3, -3, -3,3, -3,3], [ -1, -1, -1,1,1,1] to obtain [ -7,1, -9,9,3,9] after accumulation;
step 1.7: and performing dimensionality reduction operation on the obtained accumulated vector. Namely: if the value is greater than 0, the value is set to 1, otherwise, the value is set to 0, and the Kstimhash value of the statement is obtained.
E.g., [ -7,1, -9,9,3,9], resulting in 010111, which is the Ksimhash value of the current text.
Illustratively, d is obtained after the calculation process based on the aboveiThe Kshimhash value of the text is fi1:0011010101000110111001110000100010000110010111011101010000100100,djThe Kshimhash value of the text is fj1:0010110100010011110011100100100011001110110011011000110101001101;
Step 1.8: using text diAnd djRespective corresponding text fingerprints fi1And fj1Calculating fi1And fj1Hamming distance of Hi,j. Hamming distance is defined as the number of bits of difference in a text fingerprint.
Step 1.9: obtain the keyword feature similarity sim_1 of d_i and d_j, where
sim_1 = 1 - H_i,j / max(len(f_i1), len(f_j1))
Corresponding to the above example, the keyword feature similarity sim_1 of d_i and d_j is 0.65625.
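This value can be checked directly from the two 64-bit fingerprints given above, which differ in 22 bit positions:

    f_i1 = "0011010101000110111001110000100010000110010111011101010000100100"
    f_j1 = "0010110100010011110011100100100011001110110011011000110101001101"

    H = sum(a != b for a, b in zip(f_i1, f_j1))  # Hamming distance: 22 differing bits
    sim_1 = 1 - H / max(len(f_i1), len(f_j1))    # 1 - 22/64
    print(sim_1)                                 # 0.65625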
Step 2: obtain the word-level semantic vectors of the two texts based on a Word2vec model and calculate the cosine similarity of the two word-level semantic vectors. This specifically includes the following steps:
Step 2.1: train word vectors on all texts based on the Word2vec model, so that each word w_n corresponds to a semantic vector v_n;
Step 2.2: for the current text d_i, obtain the keyword list corresponding to the text using the TFIDF algorithm; each keyword_i-k in the list corresponds to a word vector v(keyword_i-k), and the word semantic vector corresponding to the current text is calculated as their sum, namely:
f_i2 = v(keyword_i-1) + v(keyword_i-2) + … + v(keyword_i-K)
Illustratively, the dimension of the semantic vectors may be set to 200.
Step 2.3: perform the operations of steps 2.1-2.2 on each text, so that each text has a unique corresponding word semantic vector;
Step 2.4: calculate the cosine similarity of the word semantic vectors f_i2 and f_j2 corresponding to texts d_i and d_j to obtain the word semantic similarity sim_2 of d_i and d_j. Let f_i2 and f_j2 have dimension n, i.e., f_i2 = [f_i21, f_i22, …, f_i2n] and f_j2 = [f_j21, f_j22, …, f_j2n]; then sim_2 is calculated as:
sim_2 = (Σ_t f_i2t × f_j2t) / (sqrt(Σ_t f_i2t²) × sqrt(Σ_t f_j2t²)), t = 1, 2, …, n
Illustratively, the sim_2 calculated in this example is 0.15181794593072392.
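A hedged sketch of the training in step 2.1 with gensim (version 4 or later is assumed; corpus_tokens, a list of token lists built by the preprocessing above, is an assumed variable):

    from gensim.models import Word2Vec

    # corpus_tokens: one token list per preprocessed text (assumed variable)
    model = Word2Vec(corpus_tokens, vector_size=200, window=5, sg=0, min_count=1)  # sg=0 selects CBOW
    word_vectors = model.wv  # word -> 200-dimensional semantic vector, as used in step 2.2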
Step 3: based on a Doc2vec model, obtain the chapter-level semantic vectors of the two texts and calculate the cosine similarity of the two text-level semantic vectors. The specific steps are as follows:
Step 3.1: obtain the text vectors f_i3 and f_j3 respectively corresponding to texts d_i and d_j using the Doc2vec model; that is, the text vectors f_i3 and f_j3 are obtained based on the representative sentences of the texts. For example, when calculating the semantic vector of each representative sentence, the dimension of the semantic vector may be set to 200.
Step 3.2: calculate the cosine similarity of f_i3 and f_j3 to obtain the chapter semantic similarity sim_3 of d_i and d_j. Let f_i3 and f_j3 have dimension n, i.e., f_i3 = [f_i31, f_i32, …, f_i3n] and f_j3 = [f_j31, f_j32, …, f_j3n]; then sim_3 is calculated as:
sim_3 = (Σ_t f_i3t × f_j3t) / (sqrt(Σ_t f_i3t²) × sqrt(Σ_t f_j3t²)), t = 1, 2, …, n
Illustratively, the sim_3 calculated in this example is 0.34401781495762856.
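A hedged sketch of step 3.1 with gensim's PV-DM implementation (gensim 4 or later assumed; text_tokens, one token list per text built from its representative sentences, is an assumed variable):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # text_tokens: one token list per text, built from its representative sentences
    tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(text_tokens)]
    model = Doc2Vec(tagged, vector_size=200, dm=1, min_count=1, epochs=20)  # dm=1 selects PV-DM
    f_i3, f_j3 = model.dv[0], model.dv[1]  # chapter-level vectors of d_i and d_j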
Step 4: add the three local similarity values sim_1, sim_2 and sim_3 and take their average to obtain the final similarity value sim of texts d_i and d_j:
sim = (sim_1 + sim_2 + sim_3) / 3
Corresponding to the above three examples, the final similarity value sim of texts d_i and d_j is 0.3840285869627842.
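This value follows directly from the three local similarities computed above:

    sim_1, sim_2, sim_3 = 0.65625, 0.15181794593072392, 0.34401781495762856
    sim = (sim_1 + sim_2 + sim_3) / 3
    print(sim)  # approximately 0.3840285869627842, matching the worked example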
The similarity calculation method provided by the embodiment of the invention can be used in application fields such as text retrieval and duplicate checking. For example, the text to be processed is denoted as text d_i, and any text in the retrieved text set or the duplicate-checking text library is denoted as text d_j. First, the final similarity value sim of texts d_i and d_j is calculated, and each text d_j whose similarity value sim reaches a first specified threshold (i.e., is greater than or equal to that threshold) is taken as the retrieval or duplicate-checking result.
Furthermore, it is also possible to first cluster the retrieved text set or the duplicate-checking text library to obtain a plurality of clustering results (a plurality of clusters), and then calculate the similarity sim between text d_i and the text corresponding to each cluster centre; for every cluster whose centre similarity reaches a second specified threshold, the similarity sim between text d_i and each text in that cluster is calculated, and the text corresponding to the maximum similarity sim is taken as the duplicate-checking or retrieval result. A sketch of this two-stage variant follows.
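An illustrative, non-limiting sketch of the two-stage variant (the data layout, a list of (centre_text, member_texts) pairs, and the function names are assumptions, with sim standing for the similarity calculation described above):

    def retrieve(query_text, clusters, sim, second_threshold):
        """Compare against cluster centres first, then only against members
        of clusters whose centre similarity reaches the second threshold."""
        best_text, best_sim = None, -1.0
        for centre_text, member_texts in clusters:
            if sim(query_text, centre_text) >= second_threshold:
                for text in member_texts:
                    s = sim(query_text, text)
                    if s > best_sim:
                        best_text, best_sim = text, s
        return best_text  # the duplicate-checking or retrieval result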
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept, and such changes and modifications likewise fall within the spirit and scope of the invention.