CN104915448B

Movatterモバイル変換

Info

Publication number: CN104915448B
Application number: CN201510372795.3A
Authority: CN
Inventors: 包红云; 郑孙聪; 许家铭; 齐振宇; 徐博; 郝红卫
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2015-06-30
Filing date: 2015-06-30
Publication date: 2018-03-27
Anticipated expiration: 2035-06-30
Also published as: CN104915448A

Abstract

Translated fromChinese

一种基于层次卷积网络的实体与段落链接方法，包括：利用卷积神经网络通过词向量化表示转化成句子向量化表示；利用句子向量化表示再次经过卷积神经网络并考虑所述句子次序信息得到段落向量化表示；句子向量化表示和段落向量化表示通过Softmax输出，借助已有实体作为监督信息进行所述卷积神经网络模型的训练；同时，考虑段落语义向量特征与实体语义向量特征之间的pair‑wise相似度信息进一步改善卷积神经网络模型的训练；给定一个测试描述段落，利用训练好的神经网络模型进行深层语义特征抽取得到测试段落的向量化表示，然后基于此语义表示经过Softmax输出可直接链接到目标实体上。

A method for linking entities and paragraphs based on a hierarchical convolutional network, including: using a convolutional neural network to convert a word vectorized representation into a sentence vectorized representation; using a sentence vectorized representation to pass through the convolutional neural network again and considering the order of the sentences The information is obtained by paragraph vectorization representation; sentence vectorization representation and paragraph vectorization representation are output through Softmax, and the training of the convolutional neural network model is carried out with the help of existing entities as supervisory information; at the same time, paragraph semantic vector features and entity semantic vector features are considered The pair-wise similarity information between them further improves the training of the convolutional neural network model; given a test description paragraph, use the trained neural network model to perform deep semantic feature extraction to obtain a vectorized representation of the test paragraph, and then based on this semantic Indicates that the Softmax output can be directly linked to the target entity.

Description

Translated fromChinese

一种基于层次卷积网络的实体与段落链接方法A Linking Method of Entities and Paragraphs Based on Hierarchical Convolutional Networks

技术领域technical field

本发明涉及知识库构建技术领域，更具体地涉及一种基于层次卷积网络的实体与段落链接方法。The present invention relates to the technical field of knowledge base construction, and in particular to a method for linking entities and paragraphs based on hierarchical convolutional networks.

背景技术Background technique

如今，已广泛使用的大规模知识库有Freebase、WordNet和YAGO等。它们都致力于构建一个全局的资源库，并且允许机器更方便地访问并获取结构化公共信息。同时，这些知识库提供了应用程序结构(APIs)以方便人们查询相关实体更为丰富的信息。例如，当我们在YAGO数据库中检索一个城市名“Washington D.C.”时，返回结果如下表1所示：Today, the large-scale knowledge bases that have been widely used include Freebase, WordNet, and YAGO. They are all committed to building a global resource library and allowing machines to access and obtain structured public information more easily. At the same time, these knowledge bases provide application program structures (APIs) to facilitate people to query richer information about related entities. For example, when we retrieve a city name "Washington D.C." in the YAGO database, the returned results are shown in Table 1 below:

表1Table 1

可以看到，返回的结果信息都是一些高度结构化的组织信息。但这些结构化信息并不符合人们理解实体的实际语境及语义信息。和YAGO数据库不同，Freebase和WordNet则返回结构化信息的同时会额外返回与检索实体相关的描述性段落，如下表2所示：It can be seen that the returned result information is some highly structured organizational information. However, these structured information do not conform to the actual context and semantic information for people to understand entities. Unlike the YAGO database, Freebase and WordNet return structured information while additionally returning descriptive paragraphs related to the retrieved entity, as shown in Table 2 below:

表2Table 2

可以看到，如表2所示的描述性段落更有益于用户理解查询实体词的具体语境及语义信息。然而，Freebase和WordNet的描述性段落信息都是由人工进行编辑的，这会导致大数据下对实体进行段落描述的局限性并耗费大量的时间与人力。因此，如何设计一个高效的实体与描述性段落自动链接方法是大数据时代知识库构建所迫切亟需的任务。It can be seen that the descriptive paragraphs shown in Table 2 are more beneficial for users to understand the specific context and semantic information of query entity words. However, the descriptive paragraph information of Freebase and WordNet is manually edited, which will lead to the limitation of paragraph description of entities under big data and consume a lot of time and manpower. Therefore, how to design an efficient automatic linking method between entities and descriptive paragraphs is an urgent task for knowledge base construction in the era of big data.

从表2的返回结果中还可以看到，描述性内容并不一定要包含查询实体词，而只需包含一些相关词对实体进行多方面地描述即可。因此，为了解决此问题，实体与段落链接方法需要从两个方面着手：1、从给定的一段描述性段落中捕捉文本的主题信息；2、找到和实体相关的重要描述性内容。比较传统的方法多是基于主题模型方法抽取段落的主题信息，如狄利克雷分布(LDA)及概率潜语义分析(PLSA)等。这些方法的普遍问题是抽取的主题信息是基于文档层的词共现信息获得的，受社交媒体中短文本特征表示的高稀疏性影响比较严重，而且丢失了文本中的词序信息。It can also be seen from the returned results in Table 2 that the descriptive content does not necessarily contain query entity words, but only needs to contain some related words to describe the entity in multiple aspects. Therefore, in order to solve this problem, the entity-paragraph linking method needs to start from two aspects: 1. Capture the topic information of the text from a given descriptive paragraph; 2. Find the important descriptive content related to the entity. More traditional methods are mostly based on topic model methods to extract topic information of paragraphs, such as Dirichlet distribution (LDA) and probabilistic latent semantic analysis (PLSA). The general problem of these methods is that the extracted topic information is obtained based on word co-occurrence information at the document level, which is seriously affected by the high sparsity of short text feature representation in social media, and the word order information in the text is lost.

近些年，随着深度神经网络的兴起，一些研究者尝试采用深度模型及词向量化表示方法学习描述性段落的深层隐式语义特征表示以解决实体与段落的链接问题。然而，现有的基于深度模型方法在解决描述性段落的语义特征抽取时，只是简单地把整个段落看成一个长句进行处理或直接把多个语句进行加权平均得到语义向量。而实际上，段落中的句子顺序也是具有语义逻辑关系的。In recent years, with the rise of deep neural networks, some researchers have tried to use deep models and word vectorization representation methods to learn deep implicit semantic feature representations of descriptive paragraphs to solve the problem of linking entities and paragraphs. However, when the existing deep model-based methods solve the semantic feature extraction of descriptive paragraphs, they simply treat the entire paragraph as a long sentence for processing or directly perform weighted average of multiple sentences to obtain semantic vectors. In fact, the order of sentences in a paragraph also has a semantic and logical relationship.

另一方面，捕捉段落中与实体密切相关的描述性线索也是非常重要的。如上述表2返回结果中的描述性段落虽然没有直接包含查询实体词“Washington D.C.”，但是却包含了很多相关的词汇或短语，如：“George Washington”、“United States”及“capital”等。因此，对实体进行向量化特征表示有助于实体与描述性段落的链接工作。On the other hand, it is also very important to capture descriptive clues that are closely related to entities in a paragraph. Although the descriptive paragraphs in the results returned in Table 2 above do not directly contain the query entity word "Washington D.C.", they contain many related words or phrases, such as: "George Washington", "United States" and "capital", etc. . Therefore, vectorized feature representation for entities is helpful for linking entities with descriptive paragraphs.

发明内容Contents of the invention

针对上述技术问题，本发明的主要目的在于提供一种基于层次卷积网络的实体与段落链接方法，从而能够将互联网中的实体词与描述性段落无需人工参与即自动链接，有助于大数据下的语义知识库的构建。In view of the above technical problems, the main purpose of the present invention is to provide a method for linking entities and paragraphs based on hierarchical convolutional networks, so that entity words and descriptive paragraphs in the Internet can be automatically linked without manual participation, which is helpful for big data The construction of the semantic knowledge base under.

为了实现上述目的，本发明提供了一种基于层次卷积网络的实体与段落链接方法，包括以下步骤：In order to achieve the above object, the present invention provides a method for linking entities and paragraphs based on hierarchical convolutional networks, comprising the following steps:

利用卷积神经网络通过词向量化表示转化成句子向量化表示，所述卷积网络有利于抽取查询实体在描述段落中的重要线索；Using the convolutional neural network to convert the word vectorized representation into a sentence vectorized representation, the convolutional network is conducive to extracting important clues of the query entity in the description paragraph;

所述句子向量化表示再次经过卷积神经网络并考虑所述句子次序信息得到段落向量化表示；The sentence vectorized representation passes through the convolutional neural network again and considers the sentence sequence information to obtain a paragraph vectorized representation;

所述句子向量化表示和所述段落向量化表示通过Softmax输出，借助已有实体作为监督信息进行所述卷积神经网络模型的训练；The sentence vectorization representation and the paragraph vectorization representation are output by Softmax, and the training of the convolutional neural network model is carried out as supervision information by means of existing entities;

同时考虑所述段落语义向量特征与实体语义向量特征之间的pair-wise相似度信息进一步改善所述卷积神经网络模型的训练；At the same time, consider the pair-wise similarity information between the paragraph semantic vector feature and the entity semantic vector feature to further improve the training of the convolutional neural network model;

给定一个测试描述段落，利用所述训练好的神经网络模型进行深层语义特征抽取得到所述测试段落的向量化表示，然后基于此语义表示经过Softmax输出可直接链接到目标实体上。Given a test description paragraph, use the trained neural network model to perform deep semantic feature extraction to obtain a vectorized representation of the test paragraph, and then based on this semantic representation, it can be directly linked to the target entity through Softmax output.

本发明的实体与段落链接方法将实体与段落的链接中的特征学习问题划分为四个层次，分别为：原始文本段落通过词向量化表示得到的特征矩阵层；通过卷积神经网络得到的句子向量化表示特征层；通过卷积神经网络得到的段落向量化表示特征层；利用词向量查表法得到实体词的向量化表示特征层。通过卷积特征网络和词向量查表，本发明的方法在两个文本数据集上实体与段落链接方法的精度值ACC显著优越于其他对比方法，且相对于最好的对比方法二，本发明方法在两个数据集上的精度值分别提升了12.4％和16.76％。The method for linking entities and paragraphs of the present invention divides the feature learning problem in the linking of entities and paragraphs into four levels, which are: the feature matrix layer obtained by expressing the original text paragraph through word vectorization; the sentence obtained by the convolutional neural network The vectorized representation feature layer; the paragraph vectorized representation feature layer obtained through the convolutional neural network; the vectorized representation feature layer of entity words obtained by using the word vector lookup table method. Through the convolutional feature network and the word vector lookup table, the accuracy value ACC of the entity and paragraph linking method of the method of the present invention on the two text data sets is significantly superior to other comparison methods, and compared with the best comparison method 2, the present invention The accuracy values of the method on the two datasets are improved by 12.4% and 16.76%, respectively.

附图说明Description of drawings

图1为作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法的流程图；Fig. 1 is the flowchart of the method for linking entities and paragraphs based on hierarchical convolution network as an embodiment of the present invention;

图2为作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法的框架示意图；FIG. 2 is a schematic framework diagram of a method for linking entities and paragraphs based on a hierarchical convolutional network as an embodiment of the present invention;

图3为作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法的性能示意图。FIG. 3 is a schematic diagram of the performance of the method for linking entities and paragraphs based on hierarchical convolutional networks as an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白，以下结合具体实施例，并参照附图，对本发明作进一步的详细说明。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

本发明公开了一种基于层次卷积网络的实体与段落链接方法，能够将互联网中的实体词与描述性段落进行无需人工参与地自动链接，其总体构思是，通过层次化卷积神经网络先对段落中的词向量进行卷积得到句子的向量化表示。考虑句子在段落中的次序信息，并对句子的向量化表示进行再次卷积得到段落的向量化表示。然后利用实体特征做为监督信息指导卷积神经网络模型的参数学习，同时考虑段落的深度语义特征与实体语义向量化表示之间的pair-wise相似性信息改善卷积神经网络模型的学习。给定一个新描述性段落，则可利用训练好的卷积神经网络模型抽取其深度语义特征，并基于此特征输出得到对应的实体链接。The invention discloses a method for linking entities and paragraphs based on hierarchical convolutional networks, which can automatically link entity words and descriptive paragraphs in the Internet without manual participation. The vectorized representation of the sentence is obtained by convolving the word vectors in the paragraph. Consider the order information of the sentence in the paragraph, and perform convolution on the vectorized representation of the sentence again to obtain the vectorized representation of the paragraph. Then, the entity features are used as supervisory information to guide the parameter learning of the convolutional neural network model, and the pair-wise similarity information between the deep semantic features of the paragraph and the entity semantic vector representation is considered to improve the learning of the convolutional neural network model. Given a new descriptive paragraph, the trained convolutional neural network model can be used to extract its deep semantic features, and the corresponding entity links can be obtained based on this feature output.

更具体地，该方法首先利用卷积神经网络通过词向量化表示转化成句子向量化表示。然后利用句子向量化表示再次经过卷积神经网络并考虑所述句子次序信息得到段落向量化表示。句子向量化表示和段落向量化表示通过Softmax输出，借助已有实体作为监督信息进行所述卷积神经网络模型的训练。同时，考虑段落语义向量特征与实体语义向量特征之间的pair-wise相似度信息进一步改善卷积神经网络模型的训练。给定一个测试描述段落，利用训练好的神经网络模型进行深层语义特征抽取得到测试段落的向量化表示，然后基于此语义表示经过Softmax输出可直接链接到目标实体上。More specifically, the method first uses a convolutional neural network to convert word vectorization representations into sentence vectorization representations. Then, the sentence vectorized representation is used to pass through the convolutional neural network again and the paragraph vectorized representation is obtained by considering the sentence sequence information. The sentence vectorized representation and the paragraph vectorized representation are output through Softmax, and the convolutional neural network model is trained with the help of existing entities as supervisory information. At the same time, considering the pair-wise similarity information between the paragraph semantic vector features and the entity semantic vector features further improves the training of the convolutional neural network model. Given a test description paragraph, use the trained neural network model to perform deep semantic feature extraction to obtain a vectorized representation of the test paragraph, and then based on this semantic representation, it can be directly linked to the target entity through Softmax output.

下面结合附图对作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法进行详细描述。The method for linking entities and paragraphs based on a hierarchical convolutional network as an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

图1为作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法的流程图。FIG. 1 is a flowchart of a method for linking entities and paragraphs based on a hierarchical convolutional network as an embodiment of the present invention.

参照图1，在步骤S101，通过卷积神经网络模型和词向量化表示，抽取待处理段落中每条句子的向量化表示特征；With reference to Fig. 1, in step S101, by convolutional neural network model and word vectorized representation, extract the vectorized representation feature of each sentence in the paragraph to be processed;

根据本发明的一个示例性实施例，所述通过卷积神经网络模型和词向量化表示，抽取待处理段落中每条句子的向量化表示特征的步骤包括：According to an exemplary embodiment of the present invention, the step of extracting the vectorized representation features of each sentence in the paragraph to be processed through the convolutional neural network model and word vectorized representation includes:

在步骤S1011，给定待处理段落中的一条句子，利用查表法得到词项量化表示并将句子表征成矩阵形式；In step S1011, given a sentence in the paragraph to be processed, use the look-up table method to obtain the quantitative representation of the term and represent the sentence into a matrix form;

在步骤S1012，在所述句子矩阵化表示特征上进行一维卷积，获取卷积后的特征矩阵；In step S1012, one-dimensional convolution is performed on the sentence matrix representation feature to obtain a convolutional feature matrix;

在步骤S1013，在所述卷积后的特征矩阵上进行均值采样对特征进行压缩，得到句子的向量化表示。In step S1013, the mean value sampling is performed on the convoluted feature matrix to compress the features, and the vectorized representation of the sentence is obtained.

根据本发明的一个示例性实施例，所述利用查表法得到词项量化表示并将句子表征成矩阵形式的步骤包括：According to an exemplary embodiment of the present invention, the step of obtaining the quantitative representation of the term by using the look-up table method and representing the sentence into a matrix form includes:

给定一个word2vec训练好的词向量集合其中，|V|为词典大小，d为词向量的维度。则任意段落中长度为n的句子可表示成：Given a set of word vectors trained by word2vec Among them, |V| is the dictionary size, and d is the dimension of the word vector. Then a sentence of length n in any paragraph can be expressed as:

s＝(x₁；x₂；...；x_n) (1)s=(x₁ ; x₂ ; . . . ; x_n ) (1)

其中，x_i是利用查表法在词向量集合中找到的第i个词对应的向量化表示。其中，若词x_i未出现在已训练好的词向量集合中，则在本发明的该示例性实施例中直接对其进行随机初始化表示。Among them, x_i is the vectorized representation corresponding to the i-th word found in the word vector set by using the look-up table method. Wherein, if the word x_i does not appear in the trained word vector set, it is directly represented by random initialization in this exemplary embodiment of the present invention.

在步骤S1012，所述在句子矩阵化表示特征上进行一维卷积，获取卷积后的特征矩阵的步骤包括：In step S1012, the one-dimensional convolution is performed on the sentence matrix representation feature, and the step of obtaining the convolutional feature matrix includes:

这里，用表示句子s中从第i个词起始的h_s个连续词特征。给定一个一维卷积核则h_s个连续词特征卷积后的特征矩阵为：here, use Represents h_s continuous word features starting from the i-th word in sentence s. Given a one-dimensional convolution kernel Then the feature matrix after h_s continuous word feature convolution is:

其中，b⁽¹⁾是偏置项，f是激活函数，是h_s个连续词特征卷积后的特征矩阵。则所述句子的特征矩阵经过卷积后为：Among them, b⁽¹⁾ is the bias term, f is the activation function, is h_s consecutive word features Convolved feature matrix. Then the feature matrix of the sentence is after convolution:

在步骤S1013，所述在卷积后的特征矩阵上进行均值采样对特征进行压缩，得到句子的向量化表示的步骤包括：In step S1013, the step of performing mean sampling on the convoluted feature matrix to compress the feature, and obtaining the vectorized representation of the sentence includes:

在本发明的该示例性实施例中，所述采用均值采样的步骤为：In this exemplary embodiment of the present invention, the described step of adopting mean value sampling is:

至此，每个卷积核会生成一个d维特征向量如果使用了k个卷积核，则经过一个卷积层，最终会得到句子的向量化表示为则句子向量化表示的维度为d·k。So far, each convolution kernel will generate a d-dimensional feature vector If k convolution kernels are used, after a convolutional layer, the vectorized representation of the sentence will eventually be obtained as Then the dimension of sentence vectorization representation is d·k.

在步骤S102，利用卷积神经网络结构和所述句子向量化表示，学习所述段落的深度语义特征；In step S102, using the convolutional neural network structure and the vectorized representation of the sentence to learn the deep semantic features of the paragraph;

根据本发明的一个示例性实施例，所述段落的深度语义特征学习方法包括：According to an exemplary embodiment of the present invention, the deep semantic feature learning method of the paragraph includes:

在步骤S1021，利用所述段落中的句子向量特征按句子在所述段落中的词序将段落表征成矩阵形式；In step S1021, using the sentence vector feature in the paragraph to represent the paragraph into a matrix form according to the word order of the sentence in the paragraph;

在步骤S1022，在所述段落矩阵化表示特征上进行一维卷积，获取卷积后的特征矩阵；In step S1022, one-dimensional convolution is performed on the paragraph matrix representation feature to obtain a convolutional feature matrix;

在步骤S1023，在所述卷积后的特征矩阵上进行均值采样对特征进行压缩并进行一次线性变换，得到段落的向量化表示。In step S1023, the mean value sampling is performed on the convoluted feature matrix to compress the feature and perform a linear transformation to obtain a vectorized representation of the paragraph.

根据本发明的一个示例性实施例，所述利用段落中的句子向量特征按句子在所述段落中的词序将段落表征成矩阵形式的步骤包括：According to an exemplary embodiment of the present invention, the step of using the sentence vector feature in the paragraph to represent the paragraph into a matrix form according to the word order of the sentence in the paragraph includes:

已得到所述段落的l条句子的向量化表示，则段落可表示成：Having obtained the vectorized representation of the l sentences of the paragraph, the paragraph can be expressed as:

t＝(s₁；s₂；...；s_l) (5)t=(s₁ ; s₂ ; . . . ; s_l ) (5)

在步骤S1022，所述在段落矩阵化表示特征上进行一维卷积，获取卷积后的特征矩阵的步骤包括：In step S1022, the one-dimensional convolution is performed on the paragraph matrix representation feature, and the step of obtaining the convolutional feature matrix includes:

这里，用表示段落t中从第i个句子起始的h_t个连续句子特征。给定一个一维卷积核则h_t个连续句子特征卷积后的卷积特征为：here, use Represents h_t consecutive sentence features starting from the i-th sentence in paragraph t. Given a one-dimensional convolution kernel Then the convolution feature after h_t continuous sentence feature convolution is:

其中，b⁽²⁾是偏置项，f是激活函数，是h_t个连续句子特征卷积后的特征。则所述段落的特征经过卷积后为：Among them, b⁽²⁾ is the bias term, f is the activation function, is h_t consecutive sentence features Convolved features. Then the features of the paragraph after convolution are:

在步骤S1023，所述在卷积后的特征矩阵上进行均值采样对特征进行压缩并进行一次线性变换，得到段落的向量化表示的步骤包括：In step S1023, the step of performing mean sampling on the convolved feature matrix to compress the feature and perform a linear transformation to obtain the vectorized representation of the paragraph includes:

至此，经过卷积核W⁽²⁾生成一个d·k维特征向量为了方便计算段落特征与实体特征的相似性，需保证向量维度的统一，则对所述段落向量进行一次线性变换：So far, a d k-dimensional feature vector is generated through the convolution kernel W⁽²⁾ In order to facilitate the calculation of the similarity between the paragraph feature and the entity feature, it is necessary to ensure the unity of the vector dimension, then perform a linear transformation on the paragraph vector:

其中，为线性变换矩阵，而特征向量z为本发明的一个示例性实施例中最终段落特征向量。in, is a linear transformation matrix, and the eigenvector z is the final paragraph eigenvector in an exemplary embodiment of the present invention.

在步骤S103，所述句子的向量化表示和所述段落的向量化表示分别经过Softmax输出拟合所述段落所属实体；In step S103, the vectorized representation of the sentence and the vectorized representation of the paragraph are respectively outputted through Softmax to fit the entity to which the paragraph belongs;

根据本发明的一个示例性实施例，所述句子和所述段落的向量化表示分别拟合所述段落所属实体的方法包括以下步骤：According to an exemplary embodiment of the present invention, the method for respectively fitting the vectorized representations of the sentence and the paragraph to the entity to which the paragraph belongs includes the following steps:

在步骤S1031，对所述句子向量和段落向量分别进行线性变换得到输出向量，并使用Dropout技术进行正则；In step S1031, the sentence vector and the paragraph vector are respectively linearly transformed to obtain an output vector, and the Dropout technique is used for regularization;

在步骤S1032，使用Softmax函数计算候选实体的链接概率；In step S1032, use the Softmax function to calculate the link probability of the candidate entity;

根据本发明的一个示例性实施例，所述对句子向量和段落向量进行线性变换得到输出向量，并使用Dropout技术进行正则的步骤包括：According to an exemplary embodiment of the present invention, the step of performing linear transformation on the sentence vector and the paragraph vector to obtain the output vector, and using the Dropout technique to regularize includes:

对句子向量特征s和段落向量特征t分别进行线性变化，得到两个输出向量：Linearly change the sentence vector feature s and paragraph vector feature t respectively to obtain two output vectors:

ys＝W⁽⁴⁾·(sοr)+b⁽⁴⁾ (10)ys＝W⁽⁴⁾ ·(sοr)+b⁽⁴⁾ (10)

y＝W⁽⁵⁾·(zοr)+b⁽⁵⁾ (11)y＝W⁽⁵⁾ ·(zοr)+b⁽⁵⁾ (11)

其中，和是权重矩阵，m是本发明的一个示例性实施例中的实体数，符号。表示矩阵元素的乘操作，而则是一个服从一定概率ρ的伯努利分布。使用Dropout技术可以防止过拟合，可以增强神经网络模型的鲁棒性。in, and is the weight matrix, m is the entity number in an exemplary embodiment of the present invention, symbol. represents the multiplication operation of matrix elements, while It is a Bernoulli distribution with a certain probability ρ. Using Dropout technology can prevent overfitting and enhance the robustness of the neural network model.

在步骤S1032，使用所述Softmax函数计算所述候选实体的链接概率的步骤包括：In step S1032, the step of using the Softmax function to calculate the link probability of the candidate entity includes:

在所述句子向量特征和所述段落向量特征的两个所述输出层分别使用Softmax激活函数计算每个对应所述实体词的概率值：In the two output layers of the sentence vector feature and the paragraph vector feature, use the Softmax activation function to calculate the probability value of each corresponding entity word:

则在公式(12)和公式(13)中，ps_i和p_i分别表示对应第i个所述实体词的概率值。Then in formula (12) and formula (13), p_i and p_i represent the probability values corresponding to the i-th entity word respectively.

在步骤S104，计算所述实体的向量化表示与所述段落向量化表示的pair-wise相似信息；In step S104, calculate the pair-wise similarity information between the vectorized representation of the entity and the vectorized representation of the paragraph;

给定一个实体词集合E＝{e₁，e₂，...，e_m}，利用word2vec对所述实体词集合进行初始化，则实体词集合E与所述段落特征向量z的相似性为：Given an entity word set E=_{ e₁ , e₂ ,...,em }, use word2vec to initialize the entity word set, then the similarity between the entity word set E and the paragraph feature vector z is :

sim(z，E)＝{z·e₁，z·e₂，...，z·e_m} (14)sim(z, E)={z·e₁ , z·e₂ ,..., z·e_m } (14)

其中，操作符z·e表示所述段落特征向量z和对应所述实体词e的相似性。Wherein, the operator z·e represents the similarity between the paragraph feature vector z and the corresponding entity word e.

在步骤S105，通过Softmax拟合目标实体词及段落特征向量与目标实体词的pair-wise相似度信息进行误差反向传播训练卷积神经网络模型；In step S105, the pair-wise similarity information of the target entity word and the paragraph feature vector and the target entity word is fitted by Softmax to carry out error backpropagation training convolutional neural network model;

根据本发明的一个示例性实施例，所述通过Softmax拟合目标实体词及所述段落特征向量与目标实体词的pair-wise相似度信息进行误差反向传播所述训练卷积神经网络模型的步骤包括：According to an exemplary embodiment of the present invention, the pair-wise similarity information of the pair-wise similarity information of the target entity word and the paragraph feature vector and the target entity word through Softmax fitting is carried out error back propagation of the training convolutional neural network model Steps include:

在步骤S1051，根据所述句子特征和段落特征输出，利用所述Softmax对所述训练数据集中目标实体词的拟合结果设定目标函数；In step S1051, according to the output of the sentence feature and the paragraph feature, use the Softmax to set an objective function for the fitting result of the target entity word in the training data set;

在步骤S1052，根据所述段落特征与所述目标实体词的pair-wise相似度信息设定目标函数；In step S1052, an objective function is set according to the pair-wise similarity information of the paragraph feature and the target entity word;

在步骤S1053，设定全局目标约束函数；In step S1053, set the global target constraint function;

在步骤S1054，利用随机梯度下降方法对模型中的参数进行更新；In step S1054, the parameters in the model are updated using the stochastic gradient descent method;

根据本发明的一个示例性实施例，所述根据句子特征和段落特征输出利用所述Softmax对所述训练数据集中目标实体词的拟合结果设定目标函数的步骤包括：According to an exemplary embodiment of the present invention, the step of using the Softmax to set the target function for the fitting result of the target entity word in the training data set according to the sentence feature and the paragraph feature output includes:

利用公式(10)、(11)和公式(12)、(13)，设定所述句子向量化特征及所述段落向量化特征的目标约束函数分别为：Utilize formula (10), (11) and formula (12), (13), set the target constraint function of described sentence vectorization feature and described paragraph vectorization feature to be respectively:

其中，L_s为所述句子向量化特征的目标约束函数，L_p1为所述段落向量化特征的目标约束函数，为所有训练语料中段落集合中的所有句子集合，是第i个句子所属的正确实体词而是第i个段落所属的正确实体词。Wherein, L_s is the target constraint function of the sentence vectorization feature, L_p1 is the target constraint function of the paragraph vectorization feature, A collection of paragraphs in all training corpora The set of all sentences in , is the correct entity word to which the i-th sentence belongs and is the correct entity word to which the i-th paragraph belongs.

在步骤S1052，所述根据段落特征与所述目标实体词的pair-wise相似度信息设定目标函数的步骤包括：In step S1052, the step of setting an objective function according to the pair-wise similarity information of the paragraph feature and the target entity word includes:

为了增强所述段落与实体的语义表达能力，本发明通过设定目标约束函数增强所述段落向量化特征与其对应的所述所属实体词向量化特征的相似性，并削弱所述段落向量化特征与其对应的所述非所属实体词向量化特征的相似性，其目标约束函数如下：In order to enhance the semantic expression ability of the paragraph and the entity, the present invention enhances the similarity between the paragraph vectorization feature and the corresponding entity word vectorization feature by setting the target constraint function, and weakens the paragraph vectorization feature The similarity of the vectorized features of the non-subscribing entity words corresponding to it, its target constraint function is as follows:

其中，e_r是给定所述段落z所属的正确实体词。where e_r is the correct entity word to which the given paragraph z belongs.

在步骤S1053，所述设定全局目标约束函数的步骤如下：In step S1053, the steps of setting the global objective constraint function are as follows:

L＝L_s+(1-α)·L_p1+α·L_p2 (18)L＝L_s +(1-α)·L_p1 +α·L_p2 (18)

其中，α是权重调和系数，用来平衡所述段落向量化特征的两个约束L_p1和L_p2。Wherein, α is a weight reconciliation coefficient, which is used to balance the two constraints L_p1 and L_p2 of the paragraph vectorization feature.

在步骤S1054，所述利用随机梯度下降方法对所述模型中的参数进行更新的步骤包括：In step S1054, the step of using the stochastic gradient descent method to update the parameters in the model includes:

设定的所述目标约束函数中所有的模型训练参数统一表示为θ：All model training parameters in the set target constraint function are uniformly expressed as θ:

θ＝(x，W⁽¹⁾，b⁽¹⁾，W⁽²⁾，b⁽²⁾，α，W⁽³⁾，W⁽⁴⁾，b⁽⁴⁾，W⁽⁵⁾，b⁽⁵⁾，E) (19)θ=(x, W⁽¹⁾ , b⁽¹⁾ , W⁽²⁾ , b⁽²⁾ , α, W⁽³⁾ , W⁽⁴⁾ , b⁽⁴⁾ , W⁽⁵⁾ , b^{(5 )} , E) (19)

本发明的一个示例性实施例中，采用随机梯度下降方法进行误差反向传播对所述目标函数进行优化。In an exemplary embodiment of the present invention, the stochastic gradient descent method is used to perform error backpropagation to optimize the objective function.

在步骤S106，利用更新后卷积神经网络模型对测试描述性段落进行深度语义特征抽取，然后基于段落的向量化表示与对应的实体词进行链接。In step S106, use the updated convolutional neural network model to perform deep semantic feature extraction on the test descriptive paragraph, and then link the corresponding entity words based on the vectorized representation of the paragraph.

根据本发明的一个示例性实施例，所述用更新后的所述卷积神经网络模型对所述测试描述性段落进行深度语义特征抽取，然后基于所述段落的向量化表示与对应的所述实体词进行链接的步骤包括：According to an exemplary embodiment of the present invention, the updated convolutional neural network model is used to perform deep semantic feature extraction on the test descriptive paragraph, and then based on the vectorized representation of the paragraph and the corresponding The steps for linking entity words include:

在步骤S1061，给定一个测试段落文本，先通过公式(2)、(3)、(4)计算所述段落中句子的向量化特征s；In step S1061, given a test paragraph text, first calculate the vectorized feature s of the sentences in the paragraph by formulas (2), (3), and (4);

在步骤S1062，通过公式(6)、(7)、(8)、(9)计算所述段落的向量化特征z；In step S1062, calculate the vectorized feature z of the paragraph by formulas (6), (7), (8), and (9);

在步骤S1063，利用生成的所述段落的向量化特征z，使用无Dropout的线性变换及Softmax函数输出对应的所述实体词的匹配概率：In step S1063, using the generated vectorized feature z of the paragraph, use the linear transformation without Dropout and the Softmax function to output the matching probability of the corresponding entity word:

y＝W⁽⁵⁾·z+b⁽⁵⁾ (20)y＝W⁽⁵⁾ z+b⁽⁵⁾ (20)

则匹配概率最高的实体词即为所述测试段落的所属实体词。Then the entity word with the highest matching probability is the entity word to which the test paragraph belongs.

图2为作为本发明一个实施例的基于层次卷积网络的实体与段落链接方法的框架示意图。Fig. 2 is a schematic framework diagram of a method for linking entities and paragraphs based on a hierarchical convolutional network as an embodiment of the present invention.

参照图2，基于层次卷积网络的实体与段落链接方法共有四个层次的特征向量化表示，分别为：Referring to Figure 2, the entity and paragraph linking method based on the hierarchical convolutional network has four levels of feature vectorization representations, which are:

特征层次一：原始文本段落通过词向量化表示得到的特征矩阵；Feature level 1: The feature matrix obtained by expressing the original text paragraph through word vectorization;

特征层次二：通过卷积神经网络得到的句子向量化表示特征；Feature level 2: Sentence vectorization representation features obtained through convolutional neural network;

特征层次三：通过卷积神经网络得到的段落向量化表示特征；Feature level three: the paragraph vectorized representation features obtained through the convolutional neural network;

特征层次四：利用词向量查表法得到实体词的向量化表示特征；Feature level four: use the word vector lookup table method to obtain the vectorized representation features of entity words;

整个模型训练阶段共有三处监督信息进行指导，分别为：There are three supervisory information for guidance throughout the model training phase, namely:

监督信息一：句子的向量化表示特征经过线性变化和Softmax输出后对所属实体词的拟合信息；Supervision information 1: The vectorized representation of the sentence represents the fitting information of the entity word after the linear change and Softmax output;

监督信息二：段落的向量化表示特征经过线性变化和Softmax输出后对所属实体词的拟合信息；Supervision information 2: The vectorization of the paragraph represents the fitting information of the corresponding entity word after the feature is linearly changed and Softmax output;

监督信息三：段落的向量化表示特征经过线性变化后和所属实体词的Pair-wise相似度信息；Supervision information 3: The vectorized representation feature of the paragraph has been linearly changed and the Pair-wise similarity information of the entity word to which it belongs;

为了准确评估本发明方法的实体与段落的链接性能，本发明通过对比实体与段落链接结果和段落真实所属实体的一致性得到本发明方法的精度(ACC)。给定一个描述性段落样本x⁽ⁱ⁾，本发明方法链接的实体词为e⁽ⁱ⁾，及段落真实所述实体词为则精度的定义如下：In order to accurately evaluate the linking performance of the entity and the paragraph of the method of the present invention, the present invention obtains the accuracy (ACC) of the method of the present invention by comparing the consistency between the entity and the paragraph linking result and the entity to which the paragraph actually belongs. Given a descriptive paragraph sample x⁽ⁱ⁾ , the entity word linked by the method of the present invention is e⁽ⁱ⁾ , and the actual entity word described in the paragraph is Then precision is defined as follows:

其中，是描述性段落的个数，δ(x，y)是指示函数，当x＝y时指示函数为1，当x≠y时指示函数为0。in, is the number of descriptive paragraphs, δ(x, y) is an indicator function, which is 1 when x=y, and 0 when x≠y.

本发明的试验中采用两种公开文本数据集：Two kinds of public text data sets are used in the experiment of the present invention:

History：该数据集包含409个实体，1704条段落。History: This dataset contains 409 entities and 1704 paragraphs.

Literature：该数据集包含445个实体，2247条段落。Literature: This dataset contains 445 entities and 2247 paragraphs.

针对这些文本数据集，本发明不做任何处理(包括去停用词和词干还原等操作)。平均每个段落包含4-6条句子，而每个段落只包含1个实体词。数据集的具体统计信息如表3所示：For these text data sets, the present invention does not do any processing (including operations such as removing stop words and word stem restoration). On average, each paragraph contains 4-6 sentences, and each paragraph contains only 1 entity word. The specific statistical information of the dataset is shown in Table 3:

表3table 3

本发明的试验中采用以下对比方法：Adopt following comparative method in the test of the present invention:

对比方法一：基于词袋模型与逻辑斯特回归方法，该方法直接在原始文本的词袋模型上采用逻辑斯特回归方法；Comparison method 1: Based on the bag-of-words model and the logistic regression method, this method directly uses the logistic regression method on the bag-of-words model of the original text;

对比方法二：基于卷积神经网络的链接方法，该方法采用传统的卷积神经网络模型简单地将实体与段落链接问题看成一个分类问题。Contrast method 2: The linking method based on convolutional neural network, which uses the traditional convolutional neural network model to simply regard the problem of linking entities and paragraphs as a classification problem.

本发明试验中采用参数设置如表4所示：Adopt parameter setting in the test of the present invention as shown in table 4:

表4Table 4

数据集data setρρh_sh_sh_th_tddkkHistoryHistory0.50.5336610010011LiteratureLiterature0.50.5338810010011

表4中，参数ρ为模型训练时采用Dropout的比重系数，h_s为句子向量化特征表示时卷积核的框口大小，h_t为段落向量化特征表示时卷积核的框口大小，d为词向量维数，k为句子向量化特征表示时卷积核的个数。In Table 4, the parameter ρ is the proportion coefficient of Dropout used in model training, h_s is the frame size of the convolution kernel when the sentence vectorization feature is represented, h_t is the frame size of the convolution kernel when the paragraph vectorization feature is represented, d is the word vector dimension, and k is the number of convolution kernels when the sentence is vectorized.

本发明试验中，所有实体与段落链接方法执行50次求取其平均精度值(ACC)，最终的试验结果如表5所示：In the test of the present invention, all entities and paragraph linking methods are executed 50 times to obtain their average accuracy value (ACC), and the final test results are as shown in Table 5:

表5table 5

方法methodHistory/精度值(％)History/accuracy value (%)Literature/精度值(％)Literature/precision value (%)对比方法一Comparison method one65.10±0.0165.10±0.0161.17±0.0561.17±0.05对比方法二Comparison method two77.01±3.9277.01±3.9274.50±10.374.50±10.3本发明方法The method of the invention89.41±1.0589.41±1.0591.26±0.5091.26±0.50

表5为本发明方法、对比方法一、对比方法二在两个文本数据集上实体与段落链接方法的精度值(ACC)评测结果。试验结果表明，本发明方法的性能显著优越于其他对比方法。且相对于最好的对比方法二，本发明方法在两个数据集上的精度值分别提升了12.4％和16.76％。Table 5 shows the accuracy value (ACC) evaluation results of the method of the present invention, the comparison method 1, and the comparison method 2 on the entity and paragraph linking method on two text data sets. The test results show that the performance of the method of the present invention is significantly superior to other comparative methods. And compared with the best comparison method 2, the accuracy values of the method of the present invention on the two data sets are increased by 12.4% and 16.76% respectively.

同时，本发明试验验证在进行句子特征表示时卷积核的滑动词窗口大小对本发明方法进行实体与段落链接的精度值性能的影响，试验结果如图3所示。可以看到，当词窗口大小为3时，本发明方法性能在两个数据集上都达到最优，而当词窗口大小大于3时，本发明方法的精度值性能下降。因而本发明实验中采用的句子特征卷积核的滑动词窗口大小均为3。At the same time, the present invention verifies the effect of the size of the sliding word window of the convolution kernel on the performance of the accuracy value of the method of the present invention when performing sentence feature representation, and the test results are shown in Figure 3. It can be seen that when the size of the word window is 3, the performance of the method of the present invention is optimal on both data sets, and when the size of the word window is greater than 3, the accuracy value performance of the method of the present invention decreases. Thereby the sliding word window size of the sentence feature convolution kernel that adopts in the experiment of the present invention is 3.

以上所述的具体实施例，对本发明的目的、技术方案和有益效果进行了进一步详细说明，应理解的是，以上所述仅为本发明的具体实施例而已，并不用于限制本发明，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The specific embodiments described above have further described the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above descriptions are only specific embodiments of the present invention, and are not intended to limit the present invention. Within the spirit and principles of the present invention, any modifications, equivalent replacements, improvements, etc., shall be included in the protection scope of the present invention.