CN113011194A - Text similarity calculation method fusing keyword features and multi-granularity semantic features - Google Patents

Text similarity calculation method fusing keyword features and multi-granularity semantic features

Info

Publication number
CN113011194A
Authority
CN
China
Prior art keywords
text
keyword
semantic
word
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110403916.1A
Other languages
Chinese (zh)
Other versions
CN113011194B (en)
Inventor
刘丹
张成辉
史梦雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110403916.1A
Publication of CN113011194A
Application granted
Publication of CN113011194B
Status: Expired - Fee Related
Anticipated expiration


Abstract


The invention discloses a text similarity calculation method that fuses keyword features and multi-granularity semantic features, belonging to the technical field of intelligent natural language processing. The method first introduces the Ksimhash algorithm over keyword features to compute the similarity sim1 of the two texts. Second, the TFIDF algorithm extracts the text keywords, the Word2vec model supplies the word vector of each keyword, and the keywords together with their vectors yield a word-level semantic vector for each text, from which the similarity sim2 of the two texts is computed. Then the Doc2vec model yields a chapter-level semantic vector for each text, from which the similarity sim3 of the two texts is computed. Finally, sim1, sim2, and sim3 are added and averaged to obtain the final text similarity. The similarity computed by the invention is highly accurate and can be used in applications such as text retrieval and duplicate checking.


Description

Text similarity calculation method fusing keyword features and multi-granularity semantic features
Technical Field
The invention belongs to the technical field of natural language intelligent processing, and particularly relates to a text similarity calculation method fusing keyword features and multi-granularity semantic features.
Background
When comparing the similarity of two articles, conventional algorithms fall into two general categories.
The first segments the two articles into words to obtain word feature vectors and then computes a distance between those vectors, such as the Euclidean distance, the Hamming distance, or the cosine of the included angle, judging the similarity of the two articles from that distance. The second is traditional hashing, which generates a fingerprint for each text by means of a hash function. The first category represents the text content with word feature vectors alone and therefore easily loses semantics; the second is designed to make the overall hash distribution as uniform as possible, so the hash value may change greatly even when the input changes only slightly.
Ksimhash is a keyword-based hash. Its key idea is dimensionality reduction: a high-dimensional feature vector is mapped to a low-dimensional fingerprint, and whether two articles are duplicated or highly similar is decided from the Hamming distance between the two fingerprints. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ, i.e. the number of substitutions needed to turn one string into the other.
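As a minimal illustration (not part of the patent text), the Hamming distance between two equal-length strings can be computed as:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("010110", "010111"))  # → 1
```

The inputs here are arbitrary example bit strings; the same function applies to the 64-bit fingerprints used later in the embodiment.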
At the semantic-representation level of text, Word2vec is a common word-level vector model. Once training of a Word2vec model (e.g. its CBOW bag-of-words variant) is complete, it can map each word to a vector that reflects the word's sense features to some extent. The Doc2vec model, an extension of Word2vec, predicts a vector that represents the semantics of a whole text or paragraph; its structure overcomes the defect that the Word2vec bag-of-words model ignores word order and context.
Disclosure of Invention
Based on the technical problems, the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features so as to improve the accuracy of similarity measurement between texts.
The text similarity calculation method fusing keyword features and multi-granularity semantic features, when the similarity of any two texts d_i and d_j is to be obtained, executes the following steps:
Step 1: extract the keywords of texts d_i and d_j;
Step 2: extract the keyword feature fingerprints f_i1 and f_j1 of the texts with the Ksimhash algorithm, and compute the Hamming distance between f_i1 and f_j1 to obtain the keyword-feature similarity sim1 of texts d_i and d_j;
Step 3: compute the word-level semantic similarity sim2 of texts d_i and d_j;
Step 4: compute the chapter-level semantic similarity sim3 of texts d_i and d_j;
Step 5: combine the keyword-feature similarity sim1, the word-level semantic similarity sim2, and the chapter-level semantic similarity sim3 to obtain the similarity sim of texts d_i and d_j.
Preferably, in step 1, extracting the keywords of a text specifically comprises:
Step 1.1: preprocess the text content to obtain a candidate word set; the preprocessing comprises word segmentation and stop-word removal.
Step 1.2: extract the keywords of the text from the candidate word set: compute the TFIDF value of every word in the set and take the K words with the largest TFIDF values as the text keywords, where K is a positive integer that can be set according to the actual application scenario.
Preferably, in step 2, extracting the keyword feature fingerprint of a text comprises:
Step 2.1: compute a hash value K_h of a given number of bits (e.g. 16 bits) for each keyword k (k = 1, 2, …, K): apply a hash of the specified width to each character code making up the keyword, then XOR the per-character hash values bitwise to obtain the hash value of the keyword.
Step 2.2: compute the weighted hash value of each keyword: the weighted value of keyword k is W_k = TFIDF_k × K_h, where TFIDF_k is the TFIDF value (weight) of keyword k; each 1 bit of K_h is multiplied by +TFIDF_k and each 0 bit by -TFIDF_k. For example, a keyword with K_h = [010110] and weight TFIDF_k = 5 is weighted to [-5, 5, -5, 5, 5, -5].
Step 2.3: sum the weighted hash values of all keywords bitwise to obtain an accumulated vector. For example, [-5, 5, -5, 5, 5, 5], [-3, -3, -3, 3, -3, 3] and [1, -1, -1, 1, 1, 1] accumulate to [-7, 1, -9, 9, 3, 9].
Step 2.4: reduce the accumulated vector to the keyword feature fingerprint of the text: examine each element of the accumulated vector and set it to 1 if it is greater than 0, otherwise to 0. This yields the keyword feature fingerprints f_i1 and f_j1 of texts d_i and d_j, and the keyword-feature similarity sim1 of texts d_i and d_j is then obtained from the Hamming distance between them.
For example, reducing the accumulated vector [-7, 1, -9, 9, 3, 9] yields the keyword feature fingerprint 010111 for the text.
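Steps 2.2 to 2.4 can be sketched as follows (an illustrative implementation, not the patent's reference code); the hashing of step 2.1 is omitted and the example weighted vectors from the text are used directly:

```python
def weighted_vector(bits: str, weight: float) -> list[float]:
    # Step 2.2: each 1 bit contributes +weight, each 0 bit contributes -weight
    return [weight if b == "1" else -weight for b in bits]

def fingerprint(weighted: list[list[float]]) -> str:
    # Step 2.3: bitwise sum of the weighted vectors
    acc = [sum(col) for col in zip(*weighted)]
    # Step 2.4: sign reduction to a bit-string fingerprint
    return "".join("1" if x > 0 else "0" for x in acc)

vectors = [[-5, 5, -5, 5, 5, 5], [-3, -3, -3, 3, -3, 3], [1, -1, -1, 1, 1, 1]]
print(fingerprint(vectors))  # → 010111
```

The output matches the worked example, where [-7, 1, -9, 9, 3, 9] reduces to 010111.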
Further, the keyword-feature similarity sim1 of texts d_i and d_j can be set as:
sim1 = 1 - H_ij / max(len(f_i1), len(f_j1))
where H_ij is the Hamming distance between the keyword feature fingerprints of texts d_i and d_j, the max() function takes the maximum, and the len() function returns the length of a string. When computing the Hamming distance between two keyword feature fingerprints whose lengths (string lengths) differ, the shorter fingerprint is padded in its low-order bits so that the two have the same length; preferably it is padded with 0s.
Preferably, step 3 comprises the following steps:
Step 3.1: for each keyword of the text, take the N words before and after it as context and build a real-valued vector (for example using the hyper-parameters of the CBOW variant of word2vec), so that each keyword w_k corresponds to a semantic vector v_{w_k}, where N is a positive integer.
Step 3.2: compute the word semantic fingerprint f2 of the text: sum the semantic vectors of its K keywords, namely:
f2 = Σ_{k=1..K} v_{w_k}
Step 3.3: compute the cosine similarity of the word semantic fingerprints f_i2 and f_j2 of texts d_i and d_j to obtain their word-level semantic similarity sim2.
Preferably, step 4 comprises the following steps:
Step 4.1: extract the L longest sentences of the text as its representative sentences and obtain a sentence vector for each, so that each representative sentence s_l corresponds to a semantic vector v_{s_l}, where L is a positive integer; for example, the sentence vector of a representative sentence may be computed with the PV-DM model of Doc2vec.
Step 4.2: compute the chapter semantic fingerprint f3: sum the semantic vectors of the L representative sentences, namely:
f3 = Σ_{l=1..L} v_{s_l}
Step 4.3: compute the cosine similarity of the chapter semantic fingerprints f_i3 and f_j3 of texts d_i and d_j to obtain their chapter-level semantic similarity sim3.
Preferably, in step 5, the similarity sim of texts d_i and d_j is obtained from the keyword-feature similarity sim1, the word-level semantic similarity sim2, and the chapter-level semantic similarity sim3.
In summary, thanks to the above technical scheme, the invention has the following beneficial effects: when computing the similarity of two texts, both the keyword features and the semantic features of the texts are fully considered. Moreover, the semantic features are not confined to the word-granularity level but extend to the granularity of the whole chapter, establishing a multi-dimensional text representation and making the text similarity calculation more accurate. The invention can be used in applications such as article duplicate checking and article retrieval.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a processing flow chart of a text similarity calculation method fusing a keyword feature and a multi-granularity semantic feature according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a text similarity calculation method fusing keyword features and multi-granularity semantic features. It addresses the problems that conventional text similarity calculation methods ignore the combination of keyword features and semantic features, and that they focus excessively on word semantics while ignoring coarser-grained semantics at the level of sentences, paragraphs, and whole texts. As shown in Fig. 1, the method comprises the following steps:
In the present embodiment, computing the similarity of texts d_i and d_j is taken as the example, and features of three granularities are considered: keyword features, word-level semantic features, and text-level semantic features.
Step 1: based on the Kshimhash algorithm, obtaining text fingerprints, calculating the Hamming distance between the two text fingerprints, and obtaining the similarity of the two texts on the key word characteristics. The method specifically comprises the following steps:
step 1.1: for the current text diThe text content of (2) is word-segmented.
In this embodiment, the word segmentation tool is jieba and the text diForming word bags after word segmentation, i.e. di=[wi1,wi2,…,win],wikRepresenting text diThe kth (k ═ 1,2, …, n) words, there is no semantic association between each word in the bag;
Step 1.2: remove the stop words from the bag of words.
A stop-word list is introduced; each word in the bag is checked against the list and removed from the bag if it appears there;
Step 1.3: removing the stop words yields the filtered bag of words d_i = [w_i1, w_i2, …, w_im], in which w_ik (k = 1, 2, …, m) denotes a word of text d_i that does not appear in the stop-word list.
Based on the filtered bag of words, extract the keywords of the current text: considering the term frequency-inverse document frequency of each word in the bag, take the K words with the top-ranked TFIDF values (K is an empirical value, with a preferred range of [5, 10]; in this embodiment K is set to 10) to form the keyword list [keyword_1, keyword_2, …, keyword_K], in which each keyword_k corresponds to a weight_k, namely its TFIDF value.
Wherein, the TFIDF calculation formula is as follows:
Figure BDA0003021486390000041
Figure BDA0003021486390000042
Figure BDA0003021486390000051
wherein, count (w)ik) The representative word wikIn the text diNumber of occurrences, | diI denotes the total number of words of the current text, N denotes the total number of texts, I (w)ik,dm) The representative word wikWhether or not in the text dmIn (b), if present, I (w)ik,dm) Value 1, if not present, I (w)ik,dm) The value is 0.
Thereby finally obtaining a word-weight set (w)k,weightk);
Step 1.4: performing hash operation on words in the keyword list of the current text, and calculating the hash value of each keyword to obtain (hash)k,weightk) Gathering;
step 1.5: each keywordi-kBased on the hash value, according to the corresponding weighti-kWeighting is carried out, namely: wk=hashk×weightkThe hash value is 1, and the weight is multiplied positively, and the hash value is 0, and the weight is multiplied negatively. For example, a word is hashed to [010110 ]]And its corresponding weight is 5, then the weighting results in [ -5,5, -5,5,5, -5];
Step 1.6: summing the weighted hash vectors corresponding to all words in the keyword list, such as [ -5,5, -5,5,5,5], [ -3, -3, -3,3, -3,3], [ -1, -1, -1,1,1,1] to obtain [ -7,1, -9,9,3,9] after accumulation;
step 1.7: and performing dimensionality reduction operation on the obtained accumulated vector. Namely: if the value is greater than 0, the value is set to 1, otherwise, the value is set to 0, and the Kstimhash value of the statement is obtained.
E.g., [ -7,1, -9,9,3,9], resulting in 010111, which is the Ksimhash value of the current text.
Illustratively, d is obtained after the calculation process based on the aboveiThe Kshimhash value of the text is fi1:0011010101000110111001110000100010000110010111011101010000100100,djThe Kshimhash value of the text is fj1:0010110100010011110011100100100011001110110011011000110101001101;
Step 1.8: using text diAnd djRespective corresponding text fingerprints fi1And fj1Calculating fi1And fj1Hamming distance of Hi,j. Hamming distance is defined as the number of bits of difference in a text fingerprint.
Step 1.9: obtaining diAnd djSimilarity sim of keyword features of1Wherein
Figure BDA0003021486390000052
Corresponding to the above example, one can obtain the value diAnd djSimilarity sim of keyword features of1Is 0.65625.
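Steps 1.8 and 1.9 can be checked against the example fingerprints with a short sketch (illustrative code, not the patent's implementation); for these equal-length fingerprints no low-order padding is needed:

```python
def sim1(fp_a: str, fp_b: str) -> float:
    """Keyword-feature similarity: 1 - Hamming distance / fingerprint length."""
    assert len(fp_a) == len(fp_b), "pad the shorter fingerprint first"
    hamming = sum(x != y for x, y in zip(fp_a, fp_b))
    return 1 - hamming / len(fp_a)

fi1 = "0011010101000110111001110000100010000110010111011101010000100100"
fj1 = "0010110100010011110011100100100011001110110011011000110101001101"
print(sim1(fi1, fj1))  # → 0.65625
```

The two 64-bit fingerprints differ in 22 positions, so sim1 = 1 - 22/64 = 0.65625, matching the value stated in the embodiment.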
Step 2: and obtaining semantic vectors of the two texts in terms based on a Word2vec model, and calculating the cosine similarity of the semantic vectors at two term levels. The method specifically comprises the following steps:
step 2.1: training the Word vector with all text based on the Word2vec model, so that each Word wnAll correspond to a semantic vector
Figure BDA0003021486390000053
Step 2.2: for the current text diUsing TFIObtaining a keyword list corresponding to the text by using DF algorithm
Figure BDA0003021486390000054
Figure BDA0003021486390000055
Each word keyword in the listi-kAll correspond to a word vector
Figure BDA0003021486390000061
Calculating a word semantic vector corresponding to the current text, namely:
Figure BDA0003021486390000062
illustratively, the dimension of the semantic vector may be set to 200 dimensions.
Step 2.3: performing the operation of the step 2 on each text to ensure that each text has a unique corresponding word semantic vector;
step 2.4: calculating a text diAnd text djCorresponding word semantic vector fi2And fj2Cosine similarity of (d) to obtain diAnd djSemantic similarity sim of words2. Let fi2And fj2Has a dimension of n, i.e. fi2=[fi21,fi22,…,fi2n],fj2=[fj21,fj22,…,fj2n]Then sim2The calculation formula of (2) is as follows:
Figure BDA0003021486390000063
Illustratively, the sim2 calculated in this example is 0.15181794593072392.
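The cosine similarity used in steps 2.4 and 3.2 can be sketched as follows (illustrative code; the input vectors are hypothetical, since the embodiment's 200-dimensional vectors are not listed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product over the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```

Identical vectors give 1.0 and orthogonal vectors give 0.0, so the resulting sim2 and sim3 lie in [-1, 1].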
Step 3: obtain the chapter-level semantic vectors of the two texts from a Doc2vec model and compute the cosine similarity of the two text-level semantic vectors. Specifically:
Step 3.1: obtain with the Doc2vec model the text vectors f_i3 and f_j3 corresponding to texts d_i and d_j; that is, the text vectors f_i3 and f_j3 are obtained from the representative sentences of the texts. For example, when computing the semantic vector of each representative sentence, its dimension may be set to 200.
Step 3.2: compute the cosine similarity of f_i3 and f_j3 to obtain the chapter-level semantic similarity sim3 of d_i and d_j. Let f_i3 and f_j3 have dimension n, i.e. f_i3 = [f_i3,1, f_i3,2, …, f_i3,n] and f_j3 = [f_j3,1, f_j3,2, …, f_j3,n]; then sim3 is computed as:
sim3 = Σ_{t=1..n} f_i3,t × f_j3,t / ( sqrt(Σ_{t=1..n} f_i3,t²) × sqrt(Σ_{t=1..n} f_j3,t²) )
Illustratively, the sim3 calculated in this example is 0.34401781495762856.
Step 4: add the three local similarity values sim1, sim2, and sim3 and average them to obtain the final similarity value sim of texts d_i and d_j:
sim = (sim1 + sim2 + sim3) / 3
For the three examples above, the final similarity value sim of texts d_i and d_j is 0.3840285869627842.
The similarity calculation method provided by the embodiment of the invention can be used in applications such as text retrieval and duplicate checking. For example, denote the text to be processed as text d_i and any text in the searched text set or duplicate-checking library as text d_j; first compute the final similarity value sim of texts d_i and d_j, and take every text d_j whose similarity sim reaches a first specified threshold (is greater than or equal to it) as a retrieval or duplication result.
Furthermore, it is also possible to first cluster the searched text set or duplicate-checking library into several clusters, compute the similarity sim between text d_i and the text corresponding to each cluster center, and, for each cluster whose center similarity reaches a second specified threshold, compute the similarity sim between text d_i and each text in that cluster, taking the text with the maximum similarity sim as the duplicate-checking or retrieval result.
Finally, it should be noted that the above examples are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. Various changes and modifications can be made by those skilled in the art without departing from the spirit and scope of the inventive concept.

Claims (8)

1. A text similarity calculation method fusing keyword features and multi-granularity semantic features, used to obtain the similarity of any two texts d_i and d_j, characterized in that the following steps are executed:
Step 1: extract the keywords of texts d_i and d_j;
Step 2: extract the keyword feature fingerprints f_i1 and f_j1 of the texts with the Ksimhash algorithm, and compute the Hamming distance between f_i1 and f_j1 to obtain the keyword-feature similarity sim1 of texts d_i and d_j;
Step 3: compute the word-level semantic similarity sim2 of texts d_i and d_j;
Step 4: compute the chapter-level semantic similarity sim3 of texts d_i and d_j;
Step 5: combine the keyword-feature similarity sim1, the word-level semantic similarity sim2, and the chapter-level semantic similarity sim3 to obtain the similarity sim of texts d_i and d_j.
2. The method according to claim 1, wherein in step 1, extracting the keywords of the text specifically comprises:
Step 1.1: preprocess the text content to obtain a candidate word set; the preprocessing comprises word segmentation and stop-word removal;
Step 1.2: extract the keywords of the text from the candidate word set: compute the TFIDF value of every word in the set and take the K words with the largest TFIDF values as the text keywords, where K is a positive integer.
3. The method of claim 2, wherein the range of values for K is set to [5, 10].
4. A method according to claim 2 or 3, wherein in step 2 the keyword feature fingerprint of the text is calculated as:
Step 2.1: compute the hash value K_h of the specified number of bits for each keyword k: apply a hash of the specified width to each character code making up the keyword to obtain a per-character hash value, then XOR these values bitwise to obtain the hash value of the current keyword;
Step 2.2: compute the weighted hash value of each keyword:
define the weighted value of keyword k as W_k = TFIDF_k × K_h,
where TFIDF_k is the TFIDF value of keyword k,
and each 1 bit of K_h is multiplied by +TFIDF_k, each 0 bit by -TFIDF_k;
Step 2.3: sum the weighted hash values of all keywords of the text to obtain an accumulated vector;
Step 2.4: reduce the accumulated vector to the keyword feature fingerprint of the text: examine each element of the accumulated vector and set it to 1 if it is greater than 0, otherwise to 0.
5. The method of claim 4, wherein the keyword-feature similarity sim1 of texts d_i and d_j is:
sim1 = 1 - H_ij / max(len(f_i1), len(f_j1))
where H_ij is the Hamming distance between the keyword feature fingerprints f_i1 and f_j1 of texts d_i and d_j, the max() function takes the maximum, and the len() function returns the length of a string; when computing the Hamming distance between f_i1 and f_j1, if their lengths differ, the shorter fingerprint is padded in its low-order bits.
6. The method of claim 1, wherein said step 3 comprises the steps of:
Step 3.1: for each keyword of the text, take the N words before and after it as context and build a real-valued vector, so that each keyword w_k corresponds to a semantic vector v_{w_k}, where N is a positive integer;
Step 3.2: sum the semantic vectors of all keywords of the text to obtain its word semantic fingerprint f2;
Step 3.3: compute the cosine similarity of the word semantic fingerprints f_i2 and f_j2 of texts d_i and d_j to obtain their word-level semantic similarity sim2.
7. The method of claim 1, wherein said step 4 comprises the steps of:
Step 4.1: extract the L longest sentences of the text as its representative sentences and obtain a sentence vector for each, so that each representative sentence s_l corresponds to a semantic vector v_{s_l}, where L is a positive integer;
Step 4.2: sum the semantic vectors of the L representative sentences to obtain the chapter semantic fingerprint f3;
Step 4.3: compute the cosine similarity of the chapter semantic fingerprints f_i3 and f_j3 of texts d_i and d_j to obtain their chapter-level semantic similarity sim3.
8. The method according to claim 1, wherein in step 5 the similarity sim of texts d_i and d_j is obtained from the keyword-feature similarity sim1, the word-level semantic similarity sim2, and the chapter-level semantic similarity sim3.
CN202110403916.1A · 2021-04-15 · Text similarity calculation method fusing keyword features and multi-granularity semantic features · Expired - Fee Related · CN113011194B (en)

Priority Applications (1)

Application Number · Priority Date · Filing Date · Title
CN202110403916.1A · CN113011194B (en) · 2021-04-15 · 2021-04-15 · Text similarity calculation method fusing keyword features and multi-granularity semantic features

Applications Claiming Priority (1)

Application Number · Priority Date · Filing Date · Title
CN202110403916.1A · CN113011194B (en) · 2021-04-15 · 2021-04-15 · Text similarity calculation method fusing keyword features and multi-granularity semantic features

Publications (2)

Publication Number · Publication Date
CN113011194A · 2021-06-22
CN113011194B · 2022-05-03

Family

ID=76388805

Family Applications (1)

Application Number · Title · Priority Date · Filing Date
CN202110403916.1A · Expired - Fee Related · CN113011194B (en) · 2021-04-15 · 2021-04-15 · Text similarity calculation method fusing keyword features and multi-granularity semantic features

Country Status (1)

Country · Link
CN · CN113011194B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN113641800A (en) * · 2021-10-18 · 2021-11-12 · 中国铁道科学研究院集团有限公司科学技术信息研究所 · Text duplicate checking method, device and equipment and readable storage medium
CN113792119A (en) * · 2021-09-17 · 2021-12-14 · 平安科技(深圳)有限公司 · Article originality evaluation system, method, device and medium
CN114443830A (en) * · 2021-12-31 · 2022-05-06 · 深圳云天励飞技术股份有限公司 · Text matching method and related device
CN114943236A (en) * · 2022-06-30 · 2022-08-26 · 北京金山数字娱乐科技有限公司 · Keyword extraction method and device
CN115130454A (en) * · 2022-07-29 · 2022-09-30 · 北京明略昭辉科技有限公司 · Method, apparatus, electronic device and storage medium for calculating text similarity
CN115905505A (en) * · 2022-12-30 · 2023-04-04 · 企知道网络技术有限公司 · Patent duplicate checking method and device and electronic equipment
CN116431803A (en) * · 2023-02-09 · 2023-07-14 · 深圳市网联安瑞网络科技有限公司 · Chinese media commentary text automatic generation method, system, device, client
CN117371439A (en) * · 2023-12-04 · 2024-01-09 · 环球数科集团有限公司 · Similar word judging method based on AIGC

Citations (9)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN101404037A (en) * · 2008-11-18 · 2009-04-08 · 西安交通大学 · Method for detecting and positioning electronic text contents plagiary
CN102682104A (en) * · 2012-05-04 · 2012-09-19 · 中南大学 · Method for searching similar texts and link bit similarity measuring algorithm
CN103441924A (en) * · 2013-09-03 · 2013-12-11 · 盈世信息科技(北京)有限公司 · Method and device for spam filtering based on short text
US8661341B1 * · 2011-01-19 · 2014-02-25 · Google, Inc. · Simhash based spell correction
CN107193803A (en) * · 2017-05-26 · 2017-09-22 · 北京东方科诺科技发展有限公司 · A kind of particular task text key word extracting method based on semanteme
CN107644010A (en) * · 2016-07-20 · 2018-01-30 · 阿里巴巴集团控股有限公司 · A kind of Text similarity computing method and device
CN108132929A (en) * · 2017-12-25 · 2018-06-08 · 上海大学 · A kind of similarity calculation method of magnanimity non-structured text
CN109948125A (en) * · 2019-03-25 · 2019-06-28 · 成都信息工程大学 · Method and system of improved Simhash algorithm in text deduplication
CN112257453A (en) * · 2020-09-23 · 2021-01-22 · 昆明理工大学 · A Chinese-Vietnamese Text Similarity Calculation Method Fusion Keywords and Semantic Features

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
MENGYU SHI et al.: "An Improved Key Sentence Extraction Algorithm Based on Features Computing Oriented to Argumentative Essay", 2020 IEEE 20th International Conference on Communication Technology (ICCT) *
pyrsquared: "Weighted sum of word vectors for document similarity", https://datascience.stackexchange.com/questions/24855/weighted-sum-of-word-vectors-for-document-similarity *
WANG YUAN et al.: "Finding Similar Microblogs According to Their Word Similarities and Semantic Similarities", Advances in Computer Science and Ubiquitous Computing. CUTE 2017, CSA 2017. Lecture Notes in Electrical Engineering, vol. 474 *
NI Haiqing et al.: "Semantic-perception-based generation model for Chinese short text summarization", Computer Science *
LUO Yumin et al.: "Weighted-average Word2Vec entity alignment method", Computer Engineering and Design *
ZHAI Sheping et al.: "Sentence semantic similarity calculation method with multi-feature fusion", Computer Engineering and Design *
HUANG Shujing et al.: "Sentence similarity calculation method based on multi-feature fusion", Journal of Beijing Information Science and Technology University *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113792119A (en)* | 2021-09-17 | 2021-12-14 | 平安科技(深圳)有限公司 | Article originality evaluation system, method, device and medium
CN113641800A (en)* | 2021-10-18 | 2021-11-12 | 中国铁道科学研究院集团有限公司科学技术信息研究所 | Text duplicate checking method, device and equipment, and readable storage medium
CN114443830A (en)* | 2021-12-31 | 2022-05-06 | 深圳云天励飞技术股份有限公司 | Text matching method and related device
CN114943236A (en)* | 2022-06-30 | 2022-08-26 | 北京金山数字娱乐科技有限公司 | Keyword extraction method and device
CN115130454A (en)* | 2022-07-29 | 2022-09-30 | 北京明略昭辉科技有限公司 | Method, apparatus, electronic device and storage medium for calculating text similarity
CN115905505A (en)* | 2022-12-30 | 2023-04-04 | 企知道网络技术有限公司 | Patent duplicate checking method and device, and electronic equipment
CN116431803A (en)* | 2023-02-09 | 2023-07-14 | 深圳市网联安瑞网络科技有限公司 | Method, system, device and client for automatic generation of Chinese media commentary text
CN117371439A (en)* | 2023-12-04 | 2024-01-09 | 环球数科集团有限公司 | Similar word judgment method based on AIGC
CN117371439B (en)* | 2023-12-04 | 2024-03-08 | 环球数科集团有限公司 | Similar word judgment method based on AIGC

Also Published As

Publication number | Publication date
CN113011194B (en) | 2022-05-03

Similar Documents

Publication | Title
CN113011194B (en) | Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN107133213B (en) | Method and system for automatic extraction of text summaries based on an algorithm
JP6335898B2 (en) | Information classification based on product recognition
WO2023071118A1 (en) | Method and system for calculating text similarity, device, and storage medium
CN107832306A (en) | Similar entity mining method based on Doc2vec
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity
WO2020114100A1 (en) | Information processing method and apparatus, and computer storage medium
CN112417153B (en) | Text classification method, apparatus, terminal device and readable storage medium
CN110046250A (en) | Triple-embedding convolutional neural network model and text multi-classification method thereof
CN107391565B (en) | Matching method for cross-language hierarchical classification systems based on a topic model
CN110298024B (en) | Method and device for detecting confidential documents, and storage medium
CN109829151B (en) | Text segmentation method based on a hierarchical Dirichlet model
CN112860898B (en) | Short text box clustering method, system, equipment and storage medium
CN110858217A (en) | Method and device for detecting sensitive microblog topics, and readable storage medium
CN110209818A (en) | Semantics-oriented analysis method for sensitive words and phrases
CN108027814A (en) | Stop word recognition method and device
CN106547864A (en) | Personalized search method based on query expansion
CN113407660A (en) | Unstructured text event extraction method
KR102091633B1 (en) | Search method for related laws
CN114266249B (en) | Massive text clustering method based on BIRCH clustering
CN111061939A (en) | Deep-learning-based keyword matching and recommendation method for scientific research and academic news
WO2023173537A1 (en) | Text sentiment analysis method and apparatus, device and storage medium
CN110347977A (en) | Automatic news tagging method based on the LDA model
CN117272142A (en) | Log anomaly detection method and system, and electronic equipment
CN115146062A (en) | Intelligent event analysis method and system integrating expert recommendation and text clustering

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2022-05-03
