CN105653706A


Info

Publication number: CN105653706A
Application number: CN201511026567.7A
Authority: CN
Inventors: 张春霞; 陈俊鹏; 王森; 王树良; 赵小林
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2016-06-08
Anticipated expiration: 2035-12-31
Also published as: CN105653706B

Abstract

Translated from Chinese

The invention discloses a multi-layer citation recommendation method based on a knowledge graph of document content, belonging to the fields of information recommendation and intelligent information processing. The method first obtains the user's query, which consists of keywords from the title and abstract of the paper for which citations are to be recommended. It then expands the query terms using a knowledge graph of document content. The nodes of the knowledge graph are the research object words and research behavior words of the literature, and its edges represent semantic relations such as synonymy, near-synonymy, hypernymy and hyponymy, part-whole, and coordination. Finally, the method builds an inverted index of the documents in the dataset, selects candidate citations, computes the similarity between each candidate citation and the query, and uses a gradient boosting regression tree to recommend citations. By recommending citations at multiple levels on the basis of the knowledge graph of document content, the method widens the range of candidate citations, expresses the research object and content of a paper accurately, and improves the efficiency with which users obtain relevant literature. It has broad application prospects.

Description


Technical field

The present invention relates to the technical field of information recommendation, and in particular to a multi-layer citation recommendation method based on a knowledge graph of document content. The invention has broad application prospects in fields such as information recommendation, information retrieval, and online public opinion monitoring.

Background

At present, information recommendation methods fall into three categories: content-based recommendation, recommendation based on collaborative filtering, and hybrid methods.

A content-based recommendation method first builds a content feature model of the items to be recommended and a user interest model, then computes the similarity between each item and the user's interests, and finally recommends the items with the highest similarity to the user. Items and user models are usually represented by keyword features. The advantage of this approach is that the user interest model can be built from the user's history and so reflects the user's needs and preferences. It has two characteristic limitations. First, recommendation performance depends on the feature extraction method and the content feature model, that is, on the accuracy and completeness of the items' content features. Second, items and user interests are represented and compared on the basis of keywords, so the method stays at the string level, limits the user's grasp of higher-level concepts, and has difficulty meeting the user's real needs.

A recommendation method based on collaborative filtering makes recommendations from the correlation between items or the correlation between users. Such methods can be divided into user-based, item-based, and model-based collaborative recommendation. The advantage of this approach is that it can handle both structured and unstructured complex objects. Its characteristic limitations are the sparsity problem and the cold-start problem. The sparsity problem is that, for a user who has interacted with few items, it is difficult to find users with similar interests in a very large user set. The cold-start problem is that, when a new user or a new item first appears in the recommendation system, the system can hardly know the new user's preferences and can hardly recommend the new item.

Citation recommendation is an important research topic within information recommendation. Its purpose is to find, in a massive body of literature, the papers that the current paper needs to cite. Existing citation recommendation methods mainly rely on the citation relations between documents and represent the content of papers and the interests of users with keywords.

Summary of the invention

The purpose of the present invention is to solve the following problems of the prior art: recommendation is limited by the number of similar users; it is difficult to retrieve documents that are semantically similar but worded differently; it is difficult to retrieve documents that have different kinds of semantic association with the research object and research behavior of a paper; and the recommended citations do not meet users' needs well. To this end, the invention provides a multi-layer citation recommendation method based on a knowledge graph of document content.

The purpose of the present invention is achieved through the following technical solution.

A multi-layer citation recommendation method based on a knowledge graph of document content comprises the following steps:

Step 1, obtain the query requirements

Extract the title and abstract of the paper for which citations are to be recommended, apply stemming and lemmatization, and remove punctuation marks and stop words. Stop words are words that carry no substantive meaning, mainly auxiliary words, prepositions, and conjunctions. Then extract keywords to serve as the search terms of the query submitted to the search engine Lucene.

Step 2, expand the query using the knowledge graph of document content

First, expand the search terms of the query: use a synonym dictionary and a near-synonym dictionary to obtain synonyms and near-synonyms of the search terms, and add them to the search term set.

Second, identify the research object word u and the research behavior word v of the paper from its title and abstract.

Third, use the synonym dictionary and the near-synonym dictionary to extract synonyms and near-synonyms of the research object word and the research behavior word, construct search expansion terms from them, and add these to the search term set.

If the synonyms and near-synonyms of the research object word u are a₁, a₂, …, a_m (m is a natural number) and those of the research behavior word v are b₁, b₂, …, b_n (n is a natural number), construct the following search expansion terms, where "+" denotes the concatenation of two words. For example, "u+b₁" is the concatenation of word u and word b₁.

u+b₁, u+b₂, …, u+b_n,

a₁+v, a₁+b₁, a₁+b₂, …, a₁+b_n,

a₂+v, a₂+b₁, a₂+b₂, …, a₂+b_n,

…,

a_m+v, a_m+b₁, a_m+b₂, …, a_m+b_n.
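The synonym-based expansion above amounts to pairing every object-side word (u and its synonyms) with every behavior-side word (v and its synonyms), leaving out the original pair u+v. The following is a minimal Python sketch of that enumeration; the function name `expand_with_synonyms` and the use of a space as the joining character are illustrative assumptions, not part of the patent.

```python
from itertools import product


def expand_with_synonyms(u, v, object_synonyms, behavior_synonyms, joiner=" "):
    """Enumerate u+b_j, a_i+v and a_i+b_j in the order listed in the text.

    The name, signature and joiner are hypothetical; the patent only
    defines "+" as the concatenation of two words.
    """
    objects = [u] + list(object_synonyms)        # u, a_1, ..., a_m
    behaviors = [v] + list(behavior_synonyms)    # v, b_1, ..., b_n
    terms = []
    for obj, beh in product(objects, behaviors):
        if obj == u and beh == v:
            continue  # u+v is the original term, not an expansion
        terms.append(obj + joiner + beh)
    return terms


# Words taken from the embodiment: "recognition" has the near-synonyms
# "detection" and "extraction"; no object synonyms are assumed here.
expanded = expand_with_synonyms(
    "named entity", "recognition", [], ["detection", "extraction"]
)
print(expanded)
```

With m object synonyms and n behavior synonyms the sketch yields (m+1)(n+1)-1 terms, which matches the number of entries in the listing above.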

Fourth, use the hypernym-hyponym sub-network of the knowledge graph to extract the hypernyms (superordinate concepts) and hyponyms (subordinate concepts) of the research object word u and the research behavior word v;

If the hypernyms of u are c₁, c₂, …, c_p (p is a natural number), the hyponyms of u are d₁, d₂, …, d_q (q is a natural number), the hypernyms of v are e₁, e₂, …, e_s (s is a natural number), and the hyponyms of v are f₁, f₂, …, f_t (t is a natural number), construct the following search expansion terms:

u+e_j (j=1,2,…,s), u+f_j (j=1,2,…,t),

a_i+e_j (i=1,2,…,m, j=1,2,…,s), a_i+f_j (i=1,2,…,m, j=1,2,…,t),

c_i+v (i=1,2,…,p), d_i+v (i=1,2,…,q),

c_i+b_j (i=1,2,…,p, j=1,2,…,n), d_i+b_j (i=1,2,…,q, j=1,2,…,n),

c_i+e_j (i=1,2,…,p, j=1,2,…,s), c_i+f_j (i=1,2,…,p, j=1,2,…,t),

d_i+e_j (i=1,2,…,q, j=1,2,…,s), d_i+f_j (i=1,2,…,q, j=1,2,…,t).
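Read as a whole, the hypernym and hyponym listing above contains every object-behavior pair in which at least one side is a related concept taken from the knowledge graph. A hedged Python sketch of that pattern follows; `expand_with_related` and its parameter names are hypothetical, and the space joiner is an assumption.

```python
from itertools import product


def expand_with_related(object_base, behavior_base,
                        object_related, behavior_related, joiner=" "):
    """Pair words so that at least one side is a related concept.

    object_base:      [u, a_1, ..., a_m]
    behavior_base:    [v, b_1, ..., b_n]
    object_related:   hypernyms and hyponyms of u (c_i and d_i)
    behavior_related: hypernyms and hyponyms of v (e_j and f_j)
    All names are illustrative, not from the patent.
    """
    pairs = []
    pairs += product(object_base, behavior_related)     # u+e_j, a_i+f_j, ...
    pairs += product(object_related, behavior_base)     # c_i+v, d_i+b_j, ...
    pairs += product(object_related, behavior_related)  # c_i+e_j, d_i+f_j, ...
    return [obj + joiner + beh for obj, beh in pairs]


# Embodiment words: "entity" is a hypernym of "named entity"; "detection"
# and "extraction" are near-synonyms of "recognition".
terms = expand_with_related(
    ["named entity"],
    ["recognition", "detection", "extraction"],
    ["entity"],
    [],
)
print(terms)
```

The part-whole listing in the fifth item has the same shape, so the same sketch could be reused there by passing whole and part concepts as the related words.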

Fifth, use the part-whole sub-network of the knowledge graph to extract the part concepts and whole concepts of the research object word u and the research behavior word v. If the whole concepts of u are g₁, g₂, …, g_o (o is a natural number), the part concepts of u are h₁, h₂, …, h_r (r is a natural number), the whole concepts of v are k₁, k₂, …, k_w (w is a natural number), and the part concepts of v are l₁, l₂, …, l_z (z is a natural number), construct the following search expansion terms:

u+k_j (j=1,2,…,w), u+l_j (j=1,2,…,z),

a_i+k_j (i=1,2,…,m, j=1,2,…,w), a_i+l_j (i=1,2,…,m, j=1,2,…,z),

g_i+v (i=1,2,…,o), h_i+v (i=1,2,…,r),

g_i+b_j (i=1,2,…,o, j=1,2,…,n), h_i+b_j (i=1,2,…,r, j=1,2,…,n),

g_i+k_j (i=1,2,…,o, j=1,2,…,w), g_i+l_j (i=1,2,…,o, j=1,2,…,z),

h_i+k_j (i=1,2,…,r, j=1,2,…,w), h_i+l_j (i=1,2,…,r, j=1,2,…,z).

Sixth, use the coordinate-relation sub-network of the knowledge graph to extract the coordinate (sibling) concepts of the research object word u and the research behavior word v. If the coordinate concepts of u are x₁, x₂, …, x_k1 (k1 is a natural number) and the coordinate concepts of v are y₁, y₂, …, y_k2 (k2 is a natural number), construct the following search expansion terms:

u+y_j (j=1,2,…,k2), x_i+v (i=1,2,…,k1).
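To make the role of the knowledge graph in this step concrete, the sketch below stores labeled edges between words and looks up neighbors by relation type, then applies the coordinate-concept rule u+y_j, x_i+v. The class `KnowledgeGraph`, the relation label "coordinate", and the sample edges are assumptions made for illustration; the patent does not prescribe a storage format.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy labeled graph: word -> relation -> list of related words."""

    def __init__(self):
        self._edges = defaultdict(lambda: defaultdict(list))

    def add(self, head, relation, tail):
        self._edges[head][relation].append(tail)

    def related(self, word, relation):
        # .get avoids creating empty entries for unknown words
        return list(self._edges.get(word, {}).get(relation, []))


def coordinate_expansion(graph, u, v, joiner=" "):
    """Build u+y_j for coordinate concepts of v, then x_i+v for those of u."""
    terms = [u + joiner + y for y in graph.related(v, "coordinate")]
    terms += [x + joiner + v for x in graph.related(u, "coordinate")]
    return terms


graph = KnowledgeGraph()
# Hypothetical edges, loosely based on the words used in the embodiment.
graph.add("recognition", "coordinate", "linking")
graph.add("recognition", "coordinate", "disambiguation")
print(coordinate_expansion(graph, "named entity", "recognition"))
```

Other relation labels (for example "hypernym" or "part") could be stored in the same structure and queried with `related` to feed the earlier expansion rules.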

Step 3, build the inverted index of the documents

Build an inverted index from the titles and abstracts of the documents in the dataset. This comprises preprocessing, index construction, and index storage. Preprocessing consists of stemming and lemmatization and the removal of punctuation marks and stop words. Index construction consists of building a dictionary that maps words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and building the inverted posting lists, which together form the inverted index of the documents.
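The index construction described above can be sketched in a few lines of Python. This is an illustrative stand-in rather than Lucene's implementation: the stop-word list is a small assumed sample, stemming and lemmatization are omitted for brevity, and the names `tokenize` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (an assumption, not the patent's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "with", "for", "in", "on", "to"}


def tokenize(text):
    """Lowercase, drop punctuation, and remove stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to a sorted posting list of (doc_id, term_frequency).

    docs is a dict doc_id -> text (title plus abstract). Terms are emitted
    in lexicographic order and postings of the same term are merged.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(postings[term].items()) for term in sorted(postings)}


docs = {
    1: "Named entity recognition with hidden Markov models",
    2: "Entity linking and entity disambiguation",
}
index = build_inverted_index(docs)
print(index["entity"])
```

Storing term frequencies in the postings is a design choice of this sketch; it makes the later similarity computation possible without re-reading the documents.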

Step 4, select the candidate citation set

First, using the expanded search term set, retrieve from the dataset the papers whose title or abstract contains any of the search terms. Then compute the similarity between the query and each of these papers, and take the N papers with the highest similarity (N is a natural number) as the candidate citation set. The similarity between the query and a paper is computed with the vector space model of the search engine Lucene: the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.
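The candidate selection can be illustrated with a plain TF-IDF cosine similarity. This is only a sketch of the vector space idea: Lucene's actual scoring applies its own weighting and normalization, and the IDF formula, the function names, and the sample documents below are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF; the exact formula is an assumption of this sketch.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_candidates(query_tokens, docs, n):
    """docs: dict doc_id -> token list. Returns up to n (doc_id, score) pairs."""
    ids = list(docs)
    vectors, idf = tfidf_vectors([docs[i] for i in ids])
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(query_tokens).items()}
    # Keep only papers that contain at least one search term.
    hits = [(cosine(query, vec), doc_id)
            for doc_id, vec in zip(ids, vectors)
            if any(t in vec for t in query_tokens)]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [(doc_id, score) for score, doc_id in hits[:n]]


docs = {
    "d1": ["entity", "recognition", "hmm"],
    "d2": ["entity", "linking"],
    "d3": ["quantum", "field"],
}
candidates = top_n_candidates(["entity", "recognition"], docs, 2)
print(candidates)
```

Here "d3" shares no term with the query and is filtered out before scoring, which mirrors the retrieve-then-rank order of the step.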

Step 5, extract the similarity features between each candidate citation and the query

Two similarity features are used. The first is the Lucene-based similarity between the candidate citation and the query. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query: a latent Dirichlet allocation (LDA) model is first used to obtain the topic distributions of the query and of the candidate citation, and the KL divergence between the two distributions is then computed.
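The second feature can be sketched as follows, assuming the topic distributions have already been produced by an LDA model (not shown). The smoothing constant `eps`, the function name, and the direction of the divergence (query relative to candidate) are assumptions; the patent does not state which direction it uses.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions.

    A small eps is added before renormalizing so that zero probabilities
    in q do not cause a division by zero (an assumption of this sketch).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical topic distributions over three topics.
query_topics = [0.7, 0.2, 0.1]
candidate_topics = [0.6, 0.3, 0.1]
print(kl_divergence(query_topics, candidate_topics))
```

Because the divergence is not symmetric, whichever direction is chosen should be applied consistently to the training data and to new queries.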

Step 6, construct the training data for citation recommendation

First, for each training paper in the training dataset, use the search engine Lucene to retrieve candidate citations on the basis of its title and abstract.

Second, construct one training sample for each candidate citation p. The features of a training sample are the citation count of the candidate citation p and the similarity features between p and the query built from the training paper. If the training paper cites the candidate citation p, the class label of the sample is 1; otherwise it is 0. If the training paper has m references, m positive samples and n-m negative samples can be constructed, where n is the number of candidate citations.
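The labeling rule above can be sketched in Python as follows. The function name `build_training_samples`, the dictionary-based inputs, and the feature order are hypothetical choices made for illustration.

```python
def build_training_samples(candidates, references, citation_counts, similarity):
    """Create one (features, label) pair per candidate citation.

    candidates:      list of candidate paper ids for one training paper
    references:      set of paper ids the training paper actually cites
    citation_counts: dict paper id -> number of times the paper is cited
    similarity:      dict paper id -> tuple of similarity features
                     (for example Lucene score and KL distance)
    """
    samples = []
    for paper_id in candidates:
        features = [float(citation_counts.get(paper_id, 0)), *similarity[paper_id]]
        label = 1 if paper_id in references else 0
        samples.append((features, label))
    return samples


# Hypothetical values for three candidates of one training paper.
samples = build_training_samples(
    candidates=["a", "b", "c"],
    references={"a", "x"},
    citation_counts={"a": 12, "b": 3},
    similarity={"a": (0.8, 0.1), "b": (0.4, 0.9), "c": (0.2, 1.3)},
)
print(samples)
```

In this sketch a reference that was not retrieved as a candidate (such as "x") produces no sample, so the count of m positive samples holds only when all references appear among the candidates.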

Step 7, recommend citations with a gradient boosting regression tree

First, train a classification model with a gradient boosting regression tree, GBRT (Gradient Boost Regression Tree), to perform citation recommendation. The classification features are the similarity features between the candidate citation and the query and the citation count of the paper. The output of the GBRT is generally a real number between 0 and 1 and is used as the recommendation score of the candidate citation. The higher the score, the more likely the candidate citation is to be classified as "recommended". The M candidate citations with the highest recommendation scores (M is a natural number) are then taken as the citation recommendation result for the current paper;
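To show how boosted regression trees turn features into a recommendation score, here is a deliberately small pure-Python sketch. It uses depth-1 trees (stumps), squared-error loss, and a fixed learning rate, all of which are assumptions; the patent does not specify the tree depth, loss, or hyperparameters, and a practical system would normally rely on an established library.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None  # (error, feature, threshold, left_value, right_value)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for thr in values[:-1]:  # both sides of the split stay non-empty
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lmean, rmean)
    return best[1:] if best else None


def fit_gbrt(X, y, n_trees=50, learning_rate=0.1):
    """Fit stumps to the residuals of the current prediction, one per round."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature can be split
            break
        f, thr, lval, rval = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lval if row[f] <= thr else rval)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps


def predict(model, row):
    base, rate, stumps = model
    return base + rate * sum(lval if row[f] <= thr else rval
                             for f, thr, lval, rval in stumps)


def recommend(model, candidates, m):
    """candidates: dict paper id -> feature list. Returns the top-m ids."""
    return sorted(candidates, key=lambda pid: -predict(model, candidates[pid]))[:m]


# Hypothetical features: [similarity to the query, citation count].
X = [[0.9, 10], [0.8, 3], [0.2, 1], [0.1, 0]]
y = [1, 1, 0, 0]
model = fit_gbrt(X, y)
print(recommend(model, {"p1": [0.85, 5], "p2": [0.15, 0], "p3": [0.7, 2]}, 2))
```

The raw output of this sketch is not clipped to the interval from 0 to 1; it only tends toward the 0 and 1 labels as more trees are added, consistent with the statement that the output is "generally" in that range.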

Second, for each recommended citation p, identify the research object word x and the research behavior word y from its title and abstract, and construct the multi-layer semantic association between each citation p and the current paper. Let u and v be the research object word and the research behavior word of the current paper, respectively;

Case 1: If x is a whole concept of u, or y is a whole concept of v, then the research content of citation p includes the research content of the current paper. If x is a part concept of u, or y is a part concept of v, then the research content of the current paper includes the research content of citation p;

Case 2: If x is a hypernym of u, or y is a hypernym of v, then the research method of citation p can be applied to the research problem of the current paper. If x is a hyponym of u, or y is a hyponym of v, then the research method of the current paper can be applied to the research problem of citation p;

Case 3: If x is a coordinate concept of u, or y is a coordinate concept of v, then the research method of the current paper can draw on the research method of citation p.
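The three cases can be read as a lookup from the relation between the cited paper's words (x, y) and the current paper's words (u, v) to an explanatory statement. A minimal sketch follows; the relation labels, the `relations` dictionary format, and the function name are illustrative assumptions.

```python
# Relation of the citation's word to the current paper's word -> explanation.
EXPLANATIONS = {
    "whole": "the citation's research content includes that of the current paper",
    "part": "the current paper's research content includes that of the citation",
    "hypernym": "the citation's method can be applied to the current paper's problem",
    "hyponym": "the current paper's method can be applied to the citation's problem",
    "coordinate": "the current paper's method can draw on the citation's method",
}


def explain_citation(relations, x, y, u, v):
    """Return the explanations that hold for citation words (x, y).

    relations maps (cited_word, current_word) to a label meaning
    "cited_word is a <label> of current_word" (hypothetical format).
    """
    found = []
    for cited, current in ((x, u), (y, v)):
        label = relations.get((cited, current))
        if label in EXPLANATIONS and EXPLANATIONS[label] not in found:
            found.append(EXPLANATIONS[label])
    return found


relations = {
    ("entity", "named entity"): "hypernym",
    ("linking", "recognition"): "coordinate",
}
print(explain_citation(relations, "entity", "linking", "named entity", "recognition"))
```

Because each case uses "or", a citation can satisfy more than one case at once, which is why the sketch returns a list rather than a single label.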

This completes the entire procedure of the method.

Beneficial effects

Existing citation recommendation methods have difficulty retrieving documents that are semantically similar but worded differently, have difficulty retrieving documents that have different kinds of semantic association with the research object and research behavior of a paper, and are limited by the number of similar users. To address these problems, the method of the present invention introduces knowledge about the semantic associations between the contents of different documents and adopts a multi-layer citation recommendation method based on a knowledge graph of document content. The method uses the semantic relations of the research object words and research behavior words in document content to obtain search expansion terms, and performs multi-level citation recommendation with a gradient boosting regression tree, which improves the efficiency with which users obtain citations. The specific advantages are as follows:

(1) The invention represents the research content of a paper in two ways: by keywords extracted from its title and abstract, and by its research object words and research behavior words. This gives a semantic representation of the paper's research problem and content, expresses its research topic and content more accurately, and thereby improves citation recommendation.

(2) Search expansion terms are obtained from the knowledge graph of document content, that is, from the synonymy, near-synonymy, hypernym-hyponym, part-whole, and coordinate relations of the paper's research object words and research behavior words. This widens the range of candidate citations and thereby addresses both the problem of missed citations and the cold-start problem in the early phase of a recommendation system.

(3) The invention uses a gradient boosting regression tree (GBRT) for citation recommendation and treats citation recommendation as a classification problem: the class label of each training sample citation is 1 or 0, meaning "recommended" or "not recommended". This ensures both the quality of the recommendation results and the running efficiency of the method.

(4) Words that have different semantic relations with a paper's research object words and research behavior words can be added dynamically to the knowledge graph of document content, so that the graph is continuously extended. This improves the timeliness and flexibility of the citation recommendation method.

Description of drawings

Fig. 1 is a flowchart of the method of the present invention.

Detailed description

The method of the present invention is described in detail below with reference to an embodiment.

Embodiment

Step 1, obtain the query requirements.

Extract the title and abstract of the paper for which citations are to be recommended, apply stemming and lemmatization, and remove punctuation marks and stop words. For example, stemming converts the word "entities" to "entity", and lemmatization converts the word "identified" to "identify". Stop words are words that carry no substantive meaning, mainly auxiliary words, prepositions, and conjunctions. For example, "is", "with", and "and" are all stop words. Then extract keywords to serve as the search terms of the query submitted to the search engine Lucene.
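The preprocessing in this step can be illustrated with a toy Python sketch that reproduces the two examples in the text. The suffix rules below are a hypothetical stand-in for a real stemmer or lemmatizer, the stop-word list is a small assumed sample, and the function names are not from the patent.

```python
import re

# Illustrative stop words; "is", "with" and "and" come from the text above.
STOP_WORDS = {"is", "with", "and", "are", "a", "an", "the", "of", "in", "on", "for", "to"}


def normalize(word):
    """Toy normalization covering only the patterns used in the examples."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"   # entities -> entity
    if len(word) > 4 and word.endswith("ied"):
        return word[:-3] + "y"   # identified -> identify
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, normalize word forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Entities are identified with HMMs, and linked."))
```

Words outside the two suffix patterns, such as "hmms" or "linked", pass through unchanged, which is exactly where a full stemmer or lemmatizer would be needed in practice.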

Step 2, expand the query using the knowledge graph of document content.

First, expand the search terms of the query: use a synonym dictionary and a near-synonym dictionary to obtain synonyms and near-synonyms of the search terms, and add them to the set of search expansion terms.

For example, from a paper titled "Named Entity Recognition Based on a Hidden Markov Model", the keywords "hidden Markov model" and "named entity recognition" are extracted as search terms. The synonym and near-synonym dictionaries then yield the search expansion terms "HMM" (hidden Markov model) and "NER" (named entity recognition).

Second, identify the research object word u and the research behavior word v of the paper from its title and abstract. For example, for the paper titled "Named Entity Recognition Based on a Hidden Markov Model", the research object word is "named entity" and the research behavior word is "recognition".

If the synonyms and near-synonyms of the research object word u are a₁, a₂, …, a_m (m is a natural number) and those of the research behavior word v are b₁, b₂, …, b_n (n is a natural number), construct the following search expansion terms, where "+" denotes the concatenation of two words. For example, "u+b₁" is the concatenation of word u and word b₁. Likewise, "entity + detection" is the concatenation of the words "entity" and "detection", that is, "entity detection".

u+b₁, u+b₂, …, u+b_n,

…,

a_m+v, a_m+b₁, a_m+b₂, …, a_m+b_n.

For example, for the paper titled "Named Entity Recognition Based on a Hidden Markov Model", the near-synonyms of the research behavior word "recognition" are "detection" and "extraction". The search expansion terms "named entity detection" and "named entity extraction" are therefore constructed and added to the search term set.

Fourth, use the hypernym-hyponym sub-network of the knowledge graph to extract the hypernyms and hyponyms of the research object word u and the research behavior word v.

If the hypernyms of u are c₁, c₂, …, c_p (p is a natural number), the hyponyms of u are d₁, d₂, …, d_q (q is a natural number), the hypernyms of v are e₁, e₂, …, e_s (s is a natural number), and the hyponyms of v are f₁, f₂, …, f_t (t is a natural number), then the search expansion terms are constructed as described above.

For example, for the paper titled "Named Entity Recognition Based on a Hidden Markov Model", the hypernym "entity" of its research object "named entity" is extracted. The search expansion terms "entity recognition", "entity detection", and "entity extraction" can then be constructed and added to the search term set.

Fifth, use the part-whole sub-network of the knowledge graph to extract the part concepts and whole concepts of the research object word u and the research behavior word v. If the whole concepts of u are g₁, g₂, …, g_o (o is a natural number), the part concepts of u are h₁, h₂, …, h_r (r is a natural number), the whole concepts of v are k₁, k₂, …, k_w (w is a natural number), and the part concepts of v are l₁, l₂, …, l_z (z is a natural number), then the search expansion terms are constructed as described above.

For example, for the paper titled "Named Entity Recognition Based on a Hidden Markov Model", the whole concept "entity information" of "named entity" is extracted. The search expansion terms "entity information extraction", "entity information recognition", and "entity information detection" can then be constructed and added to the search term set.

For example, for the paper titled "Named Entity Recognition Based on a Hidden Markov Model", the coordinate concepts "linking" and "disambiguation" of its research behavior word "recognition" are extracted. The search expansion terms "entity disambiguation" and "entity linking" can then be constructed and added to the search term set.

Step 3, build the inverted index of the documents.

Step 4, select the candidate citation set.

First, using the expanded search term set, retrieve from the dataset the papers whose title or abstract contains any of the search terms. Then compute the similarity between the query and each of these papers, and take the N papers with the highest similarity (N is a natural number) as the candidate citation set. The similarity between the query and a paper is computed with the vector space model in Lucene: the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.

Step 5, extract the similarity features between each candidate citation and the query.

Two similarity features are used. The first is the Lucene-based similarity between the candidate citation and the query. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query: a latent Dirichlet allocation (LDA) model is first used to obtain the topic distributions of the query and of the candidate citation, and the KL divergence between the two distributions is then computed.

Step 6, construct the training data for citation recommendation.

Step 7, recommend citations with a gradient boosting regression tree.

First, train a classification model with a gradient boosting regression tree, GBRT (Gradient Boost Regression Tree), to perform citation recommendation. The classification features are the similarity features between the candidate citation and the query and the citation count of the paper. The output of the GBRT is generally a real number between 0 and 1 and is used as the recommendation score of the candidate citation. The higher the score, the more likely the candidate citation is to be classified as "recommended". The M candidate citations with the highest recommendation scores (M is a natural number) are then taken as the citation recommendation result for the current paper.

Case 1: If x is a whole concept of u, or y is a whole concept of v, then the research content of citation p includes the research content of the current paper. If x is a part concept of u, or y is a part concept of v, then the research content of the current paper includes the research content of citation p.

Case 2: If x is a hypernym of u, or y is a hypernym of v, then the research method of citation p can be applied to the research problem of the current paper. If x is a hyponym of u, or y is a hyponym of v, then the research method of the current paper can be applied to the research problem of citation p.

The invention was tested experimentally on scientific papers from the field of physics. Average precision, AP (Average Precision), is used to evaluate the citation recommendation results.

For a paper q, let x_q be the set of references of paper q, and let y_q be an ordered list of pairs representing the citation recommendation result for paper q. y_q(i) = (A, B) is the element at position i of y_q, where A is a paper ID and B indicates whether that paper is cited by q: 1 means cited and 0 means not cited. The entries of y_q are sorted in descending order of the output value of the gradient boosting regression tree GBRT. The precision of y_q at position k, P_k(y_q), where k is a natural number, is computed as follows.

$P_k(y_q) = \frac{1}{k} \sum_{i=1}^{k} M_{y_q(i)}$

Here $M_{y_q(i)}$ indicates whether the paper in y_q(i) belongs to the reference set of paper q: if it does, $M_{y_q(i)} = 1$; if it does not, $M_{y_q(i)} = 0$.

The average precision AP(y_q) of y_q is then computed with the following formula, where n is the number of pairs in y_q.

$AP(y_q) = \frac{1}{n} \sum_{k=1}^{n} P_k(y_q) \, M_{y_q(k)}$

Take the paper titled "More Confining N=1 SUSY Gauge Theories from Non-Abelian Duality" as an example. The first 10 citations obtained by querying the dataset with Lucene are, in order, (9811119,1), (9610139,1), (9804038,0), (9807222,0), (9603206,0), (9411149,1), (9607200,0), (9408155,0), (9810014,1), (9605113,0). The first 10 citations obtained with the method of the present invention are, in order, (9411149,1), (9407087,0), (9408099,0), (9610139,1), (9811119,1), (9510101,0), (9503179,1), (9510148,1), (9408155,0), (9602031,0). The average precision of the Lucene-based recommendation result is about 0.29, and that of the result obtained with the method of the present invention is about 0.33. These results indicate that the citation recommendation method of the invention improves the efficiency with which users obtain citations. In addition, the method does not involve similar users and is therefore not limited by their number, and by using the knowledge graph of document content it can recommend documents that have multi-layer semantic associations with the paper.
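The two reported values can be checked directly from the ranked lists above. The sketch below implements P_k and AP as defined in this description, normalizing by the list length n rather than by the number of cited papers; the function and variable names are illustrative.

```python
def precision_at_k(labels, k):
    """P_k: fraction of the first k recommendations that are actual references."""
    return sum(labels[:k]) / k


def average_precision(labels):
    """AP as defined in this description: (1/n) * sum_k P_k * M_k."""
    n = len(labels)
    return sum(precision_at_k(labels, k) * labels[k - 1] for k in range(1, n + 1)) / n


# Second components (cited = 1, not cited = 0) of the two ranked lists above.
lucene_labels = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
method_labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

print(round(average_precision(lucene_labels), 2))  # 0.29
print(round(average_precision(method_labels), 2))  # 0.33
```

For the Lucene list the cited papers sit at positions 1, 2, 6 and 9, giving (1 + 1 + 3/6 + 4/9) / 10 ≈ 0.29; for the proposed method they sit at positions 1, 4, 5, 7 and 8, giving (1 + 2/4 + 3/5 + 4/7 + 5/8) / 10 ≈ 0.33.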