CN107239444A

Info

Publication number: CN107239444A
Application number: CN201710384135.6A
Authority: CN
Inventors: 文坤梅; 李瑞轩; 刘其磊; 李玉华; 辜希武; 昝杰; 杨琪
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-05-26
Filing date: 2017-05-26
Publication date: 2017-10-10
Anticipated expiration: 2037-05-26
Also published as: CN107239444B

Abstract

Translated from Chinese

The invention discloses a word vector training method and system that fuse part-of-speech and position information. The method includes: preprocessing data to obtain a target text; performing word segmentation and part-of-speech tagging on the target text; modeling the part-of-speech information and the position information; and, on the basis of a skip-gram model with a negative sampling strategy, fusing the part-of-speech and position information into word vector learning to obtain target word vectors, which are evaluated on word analogy and word similarity tasks. The invention takes both the part of speech and the position of words into account. Having modeled this information, it makes full use of the parts of speech of words and the positional relationships between parts of speech to assist word vector training, and it updates the parameters more reasonably during training.

Description

A word vector training method and system fusing part-of-speech and position information

Technical Field

The invention belongs to the technical field of natural language processing and, more specifically, relates to a word vector training method and system that fuse part-of-speech and position information.

Background

In recent years, the rapid development of mobile Internet technology has caused both the scale and the complexity of data on the Internet to grow sharply. Processing and analyzing this massive volume of unstructured, unlabeled data has therefore become a major problem.

Traditional machine learning methods use feature engineering to represent data symbolically so that models can be built and solved. However, bag-of-words representations commonly used in feature engineering, such as one-hot vectors, suffer from the curse of dimensionality: as data complexity grows, the feature dimension grows sharply as well. Methods based on one-hot vectors also suffer from a semantic gap. After the distributional hypothesis ("if two words have similar contexts, their meanings are similar") was proposed, distributional word representation techniques based on it have appeared continually. The most important are matrix-based, cluster-based and word-vector-based distributional representations. Matrix-based and cluster-based representations can express simple context information when the feature dimension is small, but when the feature dimension is high they cannot express context, especially complex context. Word-vector-based representations avoid the curse of dimensionality both when representing a single word and when representing the context of a word through a linear combination. Moreover, because the distance between two words can be measured by the cosine distance or Euclidean distance between their word vectors, the semantic gap of the traditional bag-of-words model is largely eliminated.

However, most existing research on word vectors focuses on reducing model complexity by simplifying the structure of the neural network in the model. Some work fuses sentiment, topic and similar information, but very little work fuses part-of-speech information. In that small body of work the part-of-speech granularity is coarse, the part-of-speech information is far from fully used, and the way the part-of-speech information is updated is not very reasonable.

Summary of the Invention

In view of the above defects of, or needs for improvement in, the prior art, the purpose of the invention is to provide a word vector training method and system that fuse part-of-speech and position information, thereby solving the technical problems that, in prior work fusing part-of-speech information, the part-of-speech granularity is coarse, the part-of-speech information is insufficiently used, and the part-of-speech information is updated unreasonably.

To achieve the above purpose, according to one aspect of the invention, a word vector training method fusing part-of-speech and position information is provided, including the following steps:

S1. Preprocess the original text to obtain the target text.

S2. According to the context information of each word, tag the words in the target text with parts of speech from the part-of-speech tag set.

S3. Model the tagged part-of-speech information to construct a part-of-speech association weight matrix M, and model the relative position i of the word pair corresponding to each part-of-speech pair to construct a position-specific part-of-speech association weight matrix M′_i. The row and column dimensions of M equal the number of parts of speech in the tag set, and each element of M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column. M′_i has the same row and column dimensions as M, and each element of M′_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.

S4. Fuse the modeled matrices M and M′_i into the skip-gram word vector model to construct a target model, and learn word vectors with the target model to obtain the target word vectors, which are used for word analogy tasks and word similarity tasks.

Preferably, step S2 includes the following sub-steps:

S2.1. Segment the target text into words so that all words in the target text are distinguished.

S2.2. For each sentence in the target text, tag each word with a part of speech from the part-of-speech tag set according to the word's context in the sentence.

Preferably, step S3 includes the following sub-steps:

S3.1. For each word in the target text, generate a word/part-of-speech pair consisting of the word and its part of speech, and construct the part-of-speech association weight matrix M from these pairs. The row and column dimensions of M equal the number of parts of speech in the tag set, and each element of M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column.

S3.2. Model the relative position i of the word pair corresponding to each part-of-speech pair, and construct the position-specific part-of-speech association weight matrix M′_i. M′_i has the same row and column dimensions as M, and each element of M′_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.

Preferably, step S4 includes the following sub-steps:

S4.1. Construct the initial objective function [formula not reproduced in this text], where C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words made up of the c words before and the c words after the target word w, and c denotes the window size.

S4.2. Fuse the modeled matrices M and M′_i into the skip-gram word vector model with negative sampling to construct the target model, and construct the new objective function of the target model from the initial objective function [formula not reproduced in this text]. Here NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the score of sample u, which is 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word during model training. The formula further involves the transpose of the word vector of a context word and the co-occurrence probability, at relative position i, of the two parts of speech T_u and [symbol not reproduced].

S4.3. Optimize the new objective function by maximizing its value, compute gradients for and update the parameter θ^u and the two other parameters [symbols not reproduced], and obtain the target word vectors once the whole training corpus has been traversed.

According to another aspect of the invention, a word vector training system fusing part-of-speech and position information is provided, including:

a preprocessing module, configured to preprocess the original text to obtain the target text;

a part-of-speech tagging module, configured to tag the words in the target text with parts of speech from the part-of-speech tag set according to the context information of each word;

a position and part-of-speech fusion module, configured to model the tagged part-of-speech information to construct the part-of-speech association weight matrix M, and to model the relative position i of the word pair corresponding to each part-of-speech pair to construct the position-specific part-of-speech association weight matrix M′_i, where the row and column dimensions of M equal the number of parts of speech in the tag set, each element of M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, M′_i has the same row and column dimensions as M, and each element of M′_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;

a word vector learning module, configured to fuse the modeled matrices M and M′_i into the skip-gram word vector model to construct the target model, and to learn word vectors with the target model to obtain the target word vectors, which are used for word analogy tasks and word similarity tasks.

Preferably, the part-of-speech tagging module includes:

a word segmentation module, configured to segment the target text into words so that all words in the target text are distinguished;

a part-of-speech tagging sub-module, configured to tag, for each sentence in the target text, each word with a part of speech from the part-of-speech tag set according to the word's context in the sentence.

Preferably, the position and part-of-speech fusion module includes:

a part-of-speech information modeling module, configured to generate, for each word in the target text, a word/part-of-speech pair consisting of the word and its part of speech, and to construct the part-of-speech association weight matrix M from these pairs, where the row and column dimensions of M equal the number of parts of speech in the tag set and each element of M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;

a position information modeling module, configured to model the relative position i of the word pair corresponding to each part-of-speech pair and to construct the position-specific part-of-speech association weight matrix M′_i, where M′_i has the same row and column dimensions as M and each element of M′_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.

Preferably, the word vector learning module includes:

an initial objective function construction module, configured to construct the initial objective function [formula not reproduced in this text], where C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words made up of the c words before and the c words after the target word w, and c denotes the window size;

a new objective function construction module, configured to fuse the modeled matrices M and M′_i into the skip-gram word vector model with negative sampling to construct the target model, and to construct the new objective function of the target model from the initial objective function [formula not reproduced in this text], where NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the score of sample u (1 for a positive sample, 0 for a negative sample), and θ^u is the auxiliary vector used for the sample word during model training; the formula further involves the transpose of the word vector of a context word and the co-occurrence probability, at relative position i, of the two parts of speech T_u and [symbol not reproduced];

a word vector learning sub-module, configured to optimize the new objective function by maximizing its value, to compute gradients for and update the parameter θ^u and the two other parameters [symbols not reproduced], and to obtain the target word vectors once the whole training corpus has been traversed.

In general, compared with prior-art solutions, the method of the invention achieves the following beneficial effects:

(1) By constructing association matrices based on part-of-speech association and position association, the part-of-speech and position information between words can be modeled well.

(2) By fusing the modeled association matrices, which are based on part-of-speech and position information, into the skip-gram word vector learning model with negative sampling, better word vectors can be obtained on the one hand, and on the other hand the association weights between parts of speech in the training corpus can also be obtained.

(3) Because the model adopts the negative sampling optimization strategy, it also trains relatively quickly.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a word vector training method fusing part-of-speech and position information disclosed in an embodiment of the invention;

FIG. 2 is a diagram of a model for part-of-speech and position information disclosed in an embodiment of the invention;

FIG. 3 is a simplified schematic diagram of the overall process disclosed in an embodiment of the invention;

FIG. 4 is a schematic flowchart of another word vector training method fusing part-of-speech and position information disclosed in an embodiment of the invention.

Detailed Description

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

Because existing word vector learning methods ignore part of speech and its importance in natural language, the invention provides a word vector learning method that fuses part-of-speech and position information. Building on the original skip-gram model, the method takes into account the part-of-speech association and the positional relationship between words, so that the model can train word vectors that incorporate more information, and the learned word vectors can then be used to perform word analogy and word similarity tasks better.

FIG. 1 is a schematic flowchart of a word vector learning method fusing part-of-speech and position information disclosed in an embodiment of the invention. The method shown in FIG. 1 includes the following steps:

S1. Preprocess the original text to obtain the target text. The original text contains a large amount of useless information such as XML tags, web links, image links, and symbols such as "[", "@", "&" and "#". This information does not help word vector training and can even act as noise that degrades word vector learning, so it needs to be filtered out, for example with a Perl script.
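The filtering described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl script mentioned in the text; the function name `clean_text` and the specific regular expressions are assumptions chosen for the example, not rules taken from the patent.

```python
import re

def clean_text(raw: str) -> str:
    """Strip markup, links and stray symbols from raw text (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", raw)        # XML/HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # web links and image links
    text = re.sub(r"[\[\]@&#]", " ", text)     # symbols such as [ @ & #
    return re.sub(r"\s+", " ", text).strip()   # collapse the leftover whitespace

print(clean_text("<doc>See [dogs] at http://example.com/a.png & more #pets</doc>"))
# prints: See dogs at more pets
```

A real corpus would likely need a longer list of patterns; the point is only that each class of noise named in the text maps to one substitution.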

S2. According to the context information of each word, tag the words in the target text with parts of speech from the part-of-speech tag set. Because the proposed method uses the part-of-speech information of words, a part-of-speech tagging tool is needed to tag the text. A word may have several parts of speech depending on its context; to resolve this, the text is tagged in advance, using the context of each word. Step S2 includes the following sub-steps:

S2.1. Segment the target text into words so that all words in the target text are distinguished. The tokenizer in OpenNLP can be used for this. For example, without tokenization the common word "apple" in "I buy an apple." would become "apple.", a word that does not exist, which would harm word vector learning.
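The effect of tokenization on the example sentence can be shown with a small regular-expression tokenizer. This is a stand-in written for illustration, not the OpenNLP tokenizer; the function name `tokenize` and the pattern are assumptions.

```python
import re

def tokenize(sentence: str) -> list[str]:
    # A token is a run of letters/digits (optionally with an apostrophe suffix)
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)

print("I buy an apple.".split())    # whitespace split keeps the period attached
print(tokenize("I buy an apple."))  # the period becomes its own token
```

With a plain whitespace split the last token is "apple.", whereas the tokenizer yields "apple" and "." separately, which is the behavior the text asks for.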

S2.2. For each sentence in the target text, tag each word with a part of speech from the part-of-speech tag set according to the word's context in the sentence. A whole sentence is tagged at once, so that the different parts of speech a word can take in different contexts are distinguished. The parts of speech assigned here belong to the Penn Treebank POS tag set.

For example, after tagging, the two sentences "i love you." and "she give her son too much love." become:

i_PRP (pronoun) love_VBP (verb) you_PRP (pronoun) ._.

she_PRP (pronoun) give_VB (verb) her_PRP$ (pronoun) son_NN (noun) too_RB (adverb) much_JJ (adjective) love_NN (noun) ._.
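The tagged sentences can be turned into word/part-of-speech pairs with a few lines of Python. The sketch assumes the `word_TAG` format shown above with the parenthetical glosses removed; `parse_tagged` is a hypothetical helper name.

```python
def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("_")  # split on the last underscore
        pairs.append((word, tag))
    return pairs

s1 = parse_tagged("i_PRP love_VBP you_PRP ._.")
s2 = parse_tagged("she_PRP give_VB her_PRP$ son_NN too_RB much_JJ love_NN ._.")
print(dict(s1)["love"], dict(s2)["love"])  # the same word carries two different tags
```

The word "love" comes out as a verb (VBP) in the first sentence and a noun (NN) in the second, which is why tagging is done per sentence rather than per vocabulary entry.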

S3. Model the tagged part-of-speech information to construct the part-of-speech association weight matrix M, and model the relative position i of the word pair corresponding to each part-of-speech pair to construct the position-specific part-of-speech association weight matrix M′_i. The row and column dimensions of M equal the number of parts of speech in the tag set, and each element of M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column. M′_i has the same row and column dimensions as M, and each element of M′_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i. FIG. 2 shows a model for part-of-speech and position information disclosed in an embodiment of the invention, in which T_0 to T_N in the rows and columns denote parts of speech and M′_i(T_t, T_{t-2}) denotes the co-occurrence probability of part of speech T_t and part of speech T_{t-2} at relative position i.

Once the parts of speech of the words have been obtained, the part-of-speech information must first be modeled before it can take part in the word vector learning model and the new model can be solved. The goal of the modeling is to build a part-of-speech association matrix whose row and column dimensions both equal the number of parts of speech in the tag set; each element of the matrix is the probability that the two corresponding parts of speech occur together. The positional relationship must be modeled as well, because when two parts of speech co-occur, their relative position is also very important. Step S3 includes the following sub-steps:

S3.1. For each word in the target text, generate a word/part-of-speech pair consisting of the word and its part of speech, and construct the part-of-speech association weight matrix M from these pairs. For example, for the word son in "she give her son too much love.", the part of speech is NN, and the part of speech of the word her is PRP$; the element of the matrix specified by the row for PRP$ and the column for NN is then the co-occurrence probability (that is, the weight) of the two parts of speech.
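One way to obtain such a matrix is to count part-of-speech pairs inside the context window and normalize the counts. The text does not state how the entries of M are computed or initialized, so the relative-frequency estimate below, the dictionary representation, and the name `pos_cooccurrence` are all assumptions made for illustration. Tags are taken from the tagged example sentence, where her is PRP$.

```python
from collections import Counter

def pos_cooccurrence(tagged_sentences, c):
    """Estimate M[(context_tag, target_tag)] as a relative frequency within a window of c."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for t, target in enumerate(tags):
            for j in range(max(0, t - c), min(len(tags), t + c + 1)):
                if j != t:  # skip the target word itself
                    counts[(tags[j], target)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

sent = [("she", "PRP"), ("give", "VB"), ("her", "PRP$"), ("son", "NN"),
        ("too", "RB"), ("much", "JJ"), ("love", "NN"), (".", ".")]
M = pos_cooccurrence([sent], c=3)
print(M[("PRP$", "NN")])  # weight for row PRP$ (her), column NN (son)
```

A sparse dictionary keyed by tag pairs stands in for the dense matrix here; with a fixed tag set the same counts could fill a square array indexed by tag.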

S3.2. Model the relative position i of the word pair corresponding to each part-of-speech pair, and construct the position-specific part-of-speech association weight matrix M′_i. M′_i has the same row and column dimensions as M, and each element of M′_i is the co-occurrence probability (that is, the weight) of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.

For example, if the window size is 2c, then i ∈ [−c, c] with i ≠ 0. When the window size is 6, six matrices must be built: M′_{-3}, M′_{-2}, M′_{-1}, M′_1, M′_2 and M′_3.

For example, for son and her in "she give her son too much love.", when son is the target word, the association weight of part of speech and position for these two words is M′_{-1}(PRP$, NN).
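The position-specific matrices can be sketched the same way, keeping one table per relative position. As before, the counting-and-normalizing estimate is an assumption (the text does not give the computation), and `positional_cooccurrence` is a hypothetical name. Tags follow the tagged example sentence, where her is PRP$.

```python
from collections import Counter, defaultdict

def positional_cooccurrence(tagged_sentences, c):
    """Return {i: {(context_tag, target_tag): probability}} for i in [-c, c], i != 0."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for t, target in enumerate(tags):
            for i in range(-c, c + 1):
                j = t + i
                if i != 0 and 0 <= j < len(tags):
                    counts[i][(tags[j], target)] += 1
    # Normalize each position's counts separately.
    return {i: {pair: n / sum(cnt.values()) for pair, n in cnt.items()}
            for i, cnt in counts.items()}

sent = [("she", "PRP"), ("give", "VB"), ("her", "PRP$"), ("son", "NN"),
        ("too", "RB"), ("much", "JJ"), ("love", "NN"), (".", ".")]
M_pos = positional_cooccurrence([sent], c=3)
print(sorted(M_pos))                  # prints: [-3, -2, -1, 1, 2, 3]
print(("PRP$", "NN") in M_pos[-1])    # prints: True
```

With c = 3 the function yields the six tables named in the text, and the entry for her (one position before son) lands in the table for i = −1.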

S4. Fuse the modeled matrices M and M′_i into the skip-gram word vector model to construct the target model, and learn word vectors with the target model to obtain the target word vectors, which are used for word analogy tasks and word similarity tasks.

Step S4 includes the following sub-steps:

S4.1. Construct the initial objective function [formula not reproduced in this text], where C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words made up of the c words before and the c words after the target word w, and c denotes the window size. The idea is the same as in the skip-gram model: the target word w_t is used to predict each context word w_{t+i}, where i denotes the position of w_{t+i} relative to w_t. Take the sample (Context(w_t), w_t) as an example, where |Context(w_t)| = 2c and Context(w_t) consists of the c words before and the c words after w_t. The final optimization goal of the target model is still, over the whole training corpus, to maximize the probability of predicting the context words from the target word w_t, that is, to optimize the initial objective function.
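For orientation, the standard skip-gram log-likelihood matches the description above. The patent's own formula is not reproduced in this text, so the following is the textbook form under the definitions of C and Context(w) given here, offered as an assumption rather than as the patent's exact expression; u is a symbol introduced here for a context word.

```latex
\mathcal{L} \;=\; \sum_{w \in C} \;\sum_{u \in \mathrm{Context}(w)} \log p(u \mid w)
```

Maximizing this sum over the corpus is what "maximizing the probability of predicting the context words from the target word" amounts to in the standard model.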

For example, in the sample "she give her son too much love.", if the word son is the target word w_t and c is 3, then Context(w_t) = {she, give, her, too, much, love}.
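The context set in this example can be reproduced with list slicing. The helper name `context` is an assumption; the window is simply truncated at sentence boundaries.

```python
def context(tokens, t, c):
    """Return up to c tokens before and c tokens after position t (target excluded)."""
    return tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]

tokens = ["she", "give", "her", "son", "too", "much", "love", "."]
print(context(tokens, tokens.index("son"), 3))
# prints: ['she', 'give', 'her', 'too', 'much', 'love']
```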

S4.2. Fuse the modeled matrices M and M′_i into the skip-gram word vector model with negative sampling to construct the target model, and construct the new objective function of the target model from the initial objective function [formula not reproduced in this text]. Here NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the score of sample u, which is 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word during model training. The formula further involves the transpose of the word vector of a context word and the co-occurrence probability, at relative position i, of the two parts of speech T_u and [symbol not reproduced].

For example, in the sample "she give her son too much love.", the word son is the positive sample and its label is 1; other words such as dog and flower are negative samples and their label is 0.
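The labelling of positive and negative samples can be sketched as below. Uniform sampling without replacement is a simplification assumed for the example (the text does not say which sampling distribution is used), and `labelled_samples` is a hypothetical name.

```python
import random

def labelled_samples(target, vocabulary, k, rng):
    """Return [(word, label)]: the target with label 1 plus k negatives with label 0."""
    candidates = [w for w in vocabulary if w != target]
    negatives = rng.sample(candidates, k)  # uniform sampling: an assumption
    return [(target, 1)] + [(u, 0) for u in negatives]

vocab = ["she", "give", "her", "son", "too", "much", "love", "dog", "flower"]
samples = labelled_samples("son", vocab, k=2, rng=random.Random(0))
print(samples)
```

The first entry is always the target word with label 1, matching L^w(u) = 1 for the positive sample; every sampled negative carries label 0.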

FIG. 3 is a simplified schematic diagram of the overall process disclosed in an embodiment of the invention. The constructed target model has three layers: an input layer, a projection layer and an output layer, where:

the input of the input layer is the center word w(t), and its output is the word vector corresponding to w(t);

the projection layer projects the output of the input layer; in this model both the input and the output of the projection layer are the word vector of the center word w(t);

the output layer uses the center word w(t) to predict the word vectors of context words such as w(t-2), w(t-1), w(t+1) and w(t+2).

The main purpose of the invention is to take into account the part of speech and the positional relationship between the center word and its context words when the center word w(t) is used to predict those context words.

S4.3. Optimize the new objective function by maximizing its value. For example, stochastic gradient ascent (SGA) may be used for this optimization. Gradients are computed for, and updates applied to, the parameter θ^u and the two other parameters [symbols not reproduced]; once the whole training corpus has been traversed, the target word vectors are obtained.
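A single stochastic gradient ascent step can be sketched as follows. Because the patent's objective and update formulas are not reproduced in this text, the sketch assumes one plausible per-sample term, label·log σ(m·vᵀθ) + (1 − label)·log(1 − σ(m·vᵀθ)), in which m stands for the position-specific part-of-speech weight, v for the context word vector and θ for the auxiliary vector of the sample. The names `sga_step` and `objective`, and the way m enters the score, are assumptions and not the patent's formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def objective(v, theta, m, label):
    """Assumed per-sample log-likelihood with a part-of-speech weight m on the score."""
    s = sigmoid(m * sum(a * b for a, b in zip(v, theta)))
    return label * math.log(s) + (1 - label) * math.log(1 - s)

def sga_step(v, theta, m, label, lr):
    """One gradient-ascent step on the assumed objective; returns updated (v, theta, m)."""
    dot = sum(a * b for a, b in zip(v, theta))
    g = label - sigmoid(m * dot)  # derivative with respect to the scaled score m * dot
    new_theta = [th + lr * g * m * a for th, a in zip(theta, v)]
    new_v = [a + lr * g * m * th for a, th in zip(v, theta)]
    new_m = m + lr * g * dot
    return new_v, new_theta, new_m

v, theta, m = [0.1, -0.2], [0.3, 0.4], 0.5
v2, theta2, m2 = sga_step(v, theta, m, label=1, lr=0.1)
print(objective(v, theta, m, 1), objective(v2, theta2, m2, 1))
```

All three quantities are updated from their old values in the same step, so that the weight m is learned alongside the vectors; this mirrors the text's statement that three parameters receive gradient updates, under the assumed objective.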

Optionally, the updates and gradient computations that yield the target word vectors may be carried out as follows: [formulas not reproduced in this text].

如图4所示为本发明实施例提供的另一种融合词性与位置信息的词向量训练方法的流程示意图，在图4所示的方法中，包括数据预处理、分词及词性标注、词性与位置信息建模、词向量训练、任务评估五个步骤。其中数据预处理、分词及词性标注、词性与位置信息建模、词向量训练如实施例1所描述的方法步骤，任务评估可以利用上面学习到的带有词性和位置信息的目标词向量后，可以将目标词向量用于单词类比任务以及单词相似度等任务中。主要包括以下两个步骤：As shown in Figure 4, it is a schematic flow chart of another word vector training method that combines part-of-speech and position information provided by the embodiment of the present invention. In the method shown in Figure 4, it includes data preprocessing, word segmentation and part-of-speech tagging, part-of-speech and There are five steps of location information modeling, word vector training, and task evaluation. Among them, data preprocessing, word segmentation and part-of-speech tagging, part-of-speech and position information modeling, and word vector training are as described in Embodiment 1. After the task evaluation can use the above-learned target word vector with part-of-speech and position information, Target word vectors can be used in tasks such as word analogy tasks and word similarity. It mainly includes the following two steps:

Use the learned target word vectors for the word analogy task. For example, for the two word pairs <king, queen> and <man, woman>, computing with the corresponding word vectors reveals a relationship of the form v(king) - v(queen) = v(man) - v(woman).
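The analogy relationship above can be checked with a short sketch. It rearranges the relation to v(queen) = v(king) - v(man) + v(woman) and returns the vocabulary word whose vector is closest, by cosine similarity, to the right-hand side. The three-dimensional vectors are hand-made for illustration and are not learned embeddings; the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def analogy(vectors, a, b, c):
    """Solve "a is to b as c is to ?" via the target vector v(b) - v(a) + v(c).

    The three query words are excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (hypothetical), not the output of any training run.
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "dog":   [0.0,  0.0, 1.0],
}
result = analogy(toy_vectors, "man", "king", "woman")
```

In a real evaluation the same lookup would be run over a benchmark of analogy questions, and accuracy would be the fraction answered correctly.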

Use the learned target word vectors for the word similarity task. For example, given a word such as "dog", computing the cosine distance or Euclidean distance between "dog" and the other words can yield the top N words most closely related to "dog", such as "puppy" and "cat".
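A minimal sketch of the top-N lookup follows, assuming cosine similarity as the measure (a higher value means more closely related). The vectors are hand-made for illustration rather than learned, and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def most_similar(vectors, query, top_n=2):
    """Return the top_n words ranked by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(cosine(q, vec), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]

# Hand-made toy vectors (hypothetical), not the output of any training run.
toy_vectors = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.90, 0.05],
    "cat":   [0.80, 0.30, 0.20],
    "car":   [0.10, 0.00, 0.90],
}
neighbours = most_similar(toy_vectors, "dog", top_n=2)
```

Ranking by Euclidean distance instead would only require replacing the score with a distance and sorting in ascending order.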

Those skilled in the art will readily understand that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A word vector training method integrating part-of-speech and position information, characterized in that it comprises the following steps:

S3. Modeling the tagged part-of-speech information to construct a part-of-speech association weight matrix M, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to construct a position-specific part-of-speech association weight matrix M′_i, wherein the number of rows and the number of columns of matrix M both equal the number of part-of-speech types in the tag set; each element of matrix M is the co-occurrence probability of the part of speech of the word corresponding to that element's row and the part of speech of the word corresponding to that element's column; matrix M′_i has the same dimensions as matrix M; and each element of matrix M′_i is the co-occurrence probability of the part of speech of the word corresponding to that element's row and the part of speech of the word corresponding to that element's column at relative position i;
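For illustration only, the matrices described in step S3 might be estimated from a tagged corpus as sketched below. Several choices here are assumptions not stated in the text: counts are normalized by the total number of counted pairs, the context window size is a free parameter, and the matrices are stored as sparse dictionaries keyed by (center tag, context tag) rather than as dense arrays.

```python
from collections import Counter

def pos_cooccurrence(tagged_sentences, window=2):
    """Estimate M and M'_i from sentences of (word, tag) pairs.

    Returns (M, M_pos), where M[(t1, t2)] is the co-occurrence probability of
    center tag t1 and context tag t2 anywhere in the window, and
    M_pos[i][(t1, t2)] is the same probability restricted to relative offset i.
    """
    offsets = [i for i in range(-window, window + 1) if i != 0]
    pair_counts = Counter()
    offset_counts = {i: Counter() for i in offsets}
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for c, center in enumerate(tags):
            for i in offsets:
                j = c + i
                if 0 <= j < len(tags):
                    pair_counts[(center, tags[j])] += 1
                    offset_counts[i][(center, tags[j])] += 1

    def normalize(counter):
        # Assumed normalization: divide by the total number of counted pairs.
        total = sum(counter.values())
        return {pair: n / total for pair, n in counter.items()} if total else {}

    M = normalize(pair_counts)
    M_pos = {i: normalize(counts) for i, counts in offset_counts.items()}
    return M, M_pos

# Tiny hypothetical corpus with one tagged sentence.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")]]
M, M_pos = pos_cooccurrence(corpus, window=1)
```

A dense layout with one row and one column per tag would match the claim wording more literally; the dictionary form simply avoids fixing a tag ordering in the sketch.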

2. The method according to claim 1, characterized in that step S2 specifically comprises the following sub-steps:

3. The method according to claim 1 or 2, characterized in that step S3 specifically comprises the following sub-steps:

4. The method according to claim 3, characterized in that step S4 specifically comprises the following sub-steps:

5. A word vector training system integrating part-of-speech and position information, characterized in that it comprises:

6. The system according to claim 5, characterized in that the part-of-speech tagging module comprises:

7. The system according to claim 5 or 6, characterized in that the position and part-of-speech fusion module comprises:

8. The system according to claim 7, characterized in that the word vector learning module comprises:

a new objective function construction module, configured to fuse the modeled matrix M and matrix M′_i into the negative-sampling-based skip-gram word vector model to build the target model, and to construct the new objective function of the target model from the initial objective function, wherein NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the label of sample u, which is 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word during model training; and the new objective function further involves the transpose of the word vector of the context word, and the co-occurrence probability of the part of speech T_u and a second part of speech at relative position i;