Disclosure of Invention
The invention aims to solve the technical problem that existing large-scale pre-trained neural language models have too many parameters to be experimentally deployed on limited resources. A language modeling data set and a translation data set are used to train the tensor Transformer model, and the network model is trained by back propagation and stochastic gradient descent optimization to obtain the prediction results of the optimal model on the test set, thereby yielding more accurate prediction and translation results.
A method for compressing a neural language model based on a tensor decomposition technology comprises the following steps:
constructing a tensor model by linearly representing an original attention function;
constructing a single block attention function by using Tucker in a tensor model;
constructing a Multi-linear attention function by using the Block-Term decomposition and a shared factor matrix in the tensor model, and realizing parameter compression of the Transformer neural network model;
embedding the Multi-linear attention function into the Transformer neural network model structure to obtain a compressed tensor Transformer model;
applying the compressed tensor Transformer model to a language modeling task and a machine translation task; the method comprises the following steps:
acquiring a language model data set and a German-English translation data set;
processing the data set and cleaning the data, wherein for text data the position of each word in a sentence carries relational information, and a word-vector representation model of the text is constructed by combining the position information of the words in the sentence;
inputting the word-vector representation model into the tensor Transformer model for model training to obtain the training loss function of the neural network model, wherein the training loss function is the cross-entropy loss:
L = − Σ_{i=1}^{n} y_i · log(ŷ_i)
where y_i denotes the true category label, ŷ_i denotes the prediction result, and n denotes the length of the sentence; the model is trained by a back propagation algorithm and a batch stochastic gradient descent method.
Training the tensor Transformer model on the training set, while validating it on the validation set at certain batch intervals and recording the model parameters that achieve the best result on the validation set;
testing the samples of the test set with the optimal model saved in the previous step to finally obtain the prediction result or translation result of each test sample, comparing them against the test labels, and calculating the accuracy of prediction and translation; wherein:
in the language modeling task, the test set of the language modeling data set is input into the trained tensor Transformer model for testing, and the probability of each sentence is calculated (a minimal sketch of this sentence-probability computation is given after these steps);
in the translation task, the test set of the German-English translation data set is input into the trained tensor Transformer model for testing, and the BLEU value between each translated sentence and its reference sentence is calculated;
and recording the experimental results of the tensor Transformer model on a language modeling task and a German-English translation task.
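As an illustration of the sentence-probability computation mentioned in the language modeling test step above, the following minimal PyTorch sketch shows how a per-sentence log-probability and the corresponding test perplexity (Test PPL) could be derived from the model's output scores; the tensor shapes and random toy inputs are assumptions made only for this example.

```python
import torch
import torch.nn.functional as F

def sentence_log_prob(logits, targets):
    # logits:  (n, vocab) unnormalised scores predicted for each position
    # targets: (n,) index of the word actually observed at each position
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(1, targets.unsqueeze(1)).sum()

def perplexity(logits, targets):
    # Test PPL of one sentence: exp of the average negative log-likelihood
    return torch.exp(-sentence_log_prob(logits, targets) / targets.numel())

# toy usage with random scores over a 1000-word vocabulary
logits = torch.randn(12, 1000)
targets = torch.randint(0, 1000, (12,))
print(perplexity(logits, targets))
```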
The invention has the beneficial effects that:
(1) A pre-trained language model with few parameters and high accuracy is built, overcoming the dilemma that a model with excessive parameters is difficult to deploy.
(2) The two compression ideas (low-rank decomposition and parameter sharing) are used in combination, so that a better compressed model can be obtained.
(3) The compressed model can improve the experimental results, and in the model testing stage the batch size can be increased to accommodate translation requests from multiple users. Tables 1 and 2 below show the experimental results of the compressed model on the language modeling and translation tasks, respectively.
Note: on the One-Billion data set, the lower the Test PPL, the better.
TABLE 1
Note: on the WMT-16 English-German translation data set, the larger the Test BLEU value, the better.
TABLE 2
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following. FIG. 1 shows a flow chart of the analysis method of the compression model proposed by the present invention; FIG. 2 shows a diagram of the neural network model designed according to the present invention; FIG. 3 shows a schematic diagram of the tensor representation that can reconstruct the original attention.
The invention discloses a method for compressing the parameters of a neural language model based on tensor decomposition. Owing to the outstanding performance of self-attention-based language models in natural language processing tasks, pre-trained (Pre-training) language models based on the encoder-decoder (Encoder-Decoder) structure with self-attention have become a research hotspot in the field of natural language processing, yet their excessively large number of parameters makes training difficult on limited computing resources. We develop a way to combine the tensor decomposition and parameter sharing compression ideas to compress a more general neural network language model, the Transformer, and apply the compressed model to language modeling and machine translation tasks.
The implementation and application of the invention mainly comprise the following steps: finding the structure that needs compression in the Transformer language model and studying its main properties; designing a new model structure by using the tensor decomposition technique; studying the difference between the new structure and the original structure and proving the rationality of the new structure; collecting a language modeling and a translation corpus data set, dividing them into training and test sets, and constructing a word representation for each word in the corpus from its position vector and semantic vector information; inputting the text word vectors of the training corpus into the compressed network structure and training the language model; and inputting the text word vectors of the test set into the trained language model, thereby calculating the prediction probability of each sample.
the present invention starts with a linear representation of the original attention function and then demonstrates that the attention function can be linearly represented by a set of orthonormal basis vectors, which we can then share in case of constructing a multi-headed mechanism. Thereby compressing the parameters. Meanwhile, the invention also proves that the original attention function can be reconstructed by the new expression, and the neural network model can have stronger discrimination capability by the tensor slicing mode modeling. The invention provides a new idea for developing a neural network model with low parameters and high accuracy.
The purpose of the invention is realized by the following technical scheme, which comprises the following steps:
constructing a tensor model by linearly representing an original attention function;
constructing a single block attention function by using Tucker in a tensor model;
constructing a Multi-linear attention function by using the Block-Term decomposition and a shared factor matrix in the tensor model, and realizing parameter compression of the Transformer neural network model;
since there are three encoding matrices in the Transformer neural network model, they are treated as the three factor matrices; after initializing a core tensor, a single-block attention function can be constructed by using the Tucker decomposition in the tensor decomposition technique.
The single-block attention function is constructed using the Tucker decomposition technique and has the form:
AttenTD(G; Q, K, V) = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{m=1}^{M} G_{ijm} · (Q_i ∘ K_j ∘ V_m)
where G is the core tensor, i, j and m are the indices of the core tensor, Q_i, K_j and V_m are the corresponding column vectors of the three factor matrices, and ∘ is the outer product of vectors. In particular, in the experiments the core tensor G is defined by the following equation:
G_{ijm} = rand(0, 1) if i = j = m, and G_{ijm} = 0 otherwise,
where rand(0, 1) is a function that generates random values. Hence the parameter we store is only a vector (the diagonal of G), on which we perform Softmax normalization.
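The following is a minimal PyTorch sketch of the single-block attention described above, under the assumption that Q, K and V are n×d factor matrices and that the core tensor is diagonal with a Softmax-normalized rand(0,1) vector on its diagonal; the function and variable names are illustrative only.

```python
import torch

def single_block_attention(Q, K, V, g):
    # Q, K, V: (n, d) factor matrices; g: (d,) diagonal of the core tensor.
    # Returns the third-order tensor A[x, y, z] = sum_r g[r] * Q[x, r] * K[y, r] * V[z, r].
    g = torch.softmax(g, dim=-1)                      # store a vector, Softmax-normalise it
    return torch.einsum('r,xr,yr,zr->xyz', g, Q, K, V)

n, d = 10, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
g = torch.rand(d)                                     # rand(0, 1) initialisation of the diagonal
A = single_block_attention(Q, K, V, g)                # -> tensor of shape (10, 10, 10)
```

Because only the d-dimensional diagonal is stored, the core tensor adds almost no parameters on top of the three factor matrices.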
Embedding the Multi-linear attention function into the Transformer neural network model structure to obtain a compressed tensor Transformer model;
To construct a multi-head-like attention mechanism while still being able to compress the parameters of the model, we use one set of linear mappings and then share the output of this part across blocks. In our model this is called the Multi-linear attention function, which can be formalized as:
MultiLinear(G; Q, K, V) = Concat(TensorSplit((1/h) · (T_1 + T_2 + … + T_h))) · W^O, with T_j = AttenTD(G_j; Q, K, V),
where G_j is the diagonal core tensor of the j-th block and W^O is the output weight matrix. The Multi-linear attention function herein can reconstruct the original attention of the language model. In multi-head compression, the compression ratio of the model is obtained by comparing the parameter count of standard multi-head attention with that of the Multi-linear attention; h is generally set to 8, so as d increases, the compression ratio of the model increases.
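A hedged PyTorch sketch of the Multi-linear attention is given below. It assumes the shapes used in the previous sketch, shares one set of factor projections W^q, W^k, W^v across the h blocks, averages the h single-block tensors, and maps the split-and-concatenated slices back to the model dimension; tying the output projection to the sequence length is an assumption of this sketch, not a statement of the invention's exact parameterization.

```python
import torch
import torch.nn as nn

class MultiLinearAttention(nn.Module):
    """Block-Term style multi-linear attention with shared factor matrices (sketch)."""
    def __init__(self, d_model, seq_len, h=8):
        super().__init__()
        self.h = h
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # shared factor matrices
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.cores = nn.Parameter(torch.rand(h, d_model))    # h diagonal core tensors
        self.w_o = nn.Linear(seq_len * seq_len, d_model, bias=False)  # output map W^O

    def forward(self, x):                                    # x: (n, d_model)
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        g = torch.softmax(self.cores, dim=-1)                # normalised diagonals
        # sum of the h single-block attention tensors, then average pooling
        T = torch.einsum('hr,xr,yr,zr->xyz', g, Q, K, V) / self.h    # (n, n, n)
        # tensor split along the second mode, then concatenation of the n slices
        flat = torch.cat(T.unbind(dim=1), dim=-1)            # (n, n*n)
        return self.w_o(flat)                                # (n, d_model)

attn = MultiLinearAttention(d_model=64, seq_len=10, h=8)
out = attn(torch.randn(10, 64))                              # -> (10, 64)
```

Compared with standard multi-head attention, the h blocks here reuse the same three projection matrices and each block only adds a d-dimensional diagonal core, which is where the parameter saving comes from.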
Applying the compressed tensor Transformer model to a language modeling task and a machine translation task; the method comprises the following steps:
acquiring a language model data set and a German-English translation data set;
processing the data set and cleaning the data, wherein for text data the position of each word in a sentence carries relational information, and a word-vector representation model of the text is constructed by combining the position information of the words in the sentence;
inputting the word-vector representation model into the tensor Transformer model for model training to obtain the training loss function of the neural network model, wherein the training loss function is the cross-entropy loss:
L = − Σ_{i=1}^{n} y_i · log(ŷ_i)
where y_i denotes the true category label, ŷ_i denotes the prediction result, and n denotes the length of the sentence; the model is trained by a back propagation algorithm and a batch stochastic gradient descent method (a training-loop sketch is given after these steps).
Training the tensor Transformer model on the training set, while validating it on the validation set at certain batch intervals and recording the model parameters that achieve the best result on the validation set;
testing the samples of the test set with the optimal model saved in the previous step to finally obtain the prediction result or translation result of each test sample, comparing them against the test labels, and calculating the accuracy of prediction and translation; wherein:
in the language modeling task, the test set of the language modeling data set is input into the trained tensor Transformer model for testing, and the probability of each sentence is calculated;
in the translation task, the test set of the German-English translation data set is input into the trained tensor Transformer model for testing, and the BLEU value between each translated sentence and its reference sentence is calculated;
and recording the experimental results of the tensor Transformer model on a language modeling task and a German-English translation task.
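The training-loop sketch referenced above is shown next. It is a minimal PyTorch illustration of the training steps (cross-entropy loss, back propagation, batch stochastic gradient descent, periodic validation, and saving the best parameters); the model, data loaders and hyper-parameter values are placeholders, not the invention's actual configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_loader, valid_loader, epochs=10, lr=0.1, eval_every=200):
    criterion = nn.CrossEntropyLoss()                        # cross-entropy training loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # batch stochastic gradient descent
    best_loss, best_state, step = float('inf'), None, 0
    for _ in range(epochs):
        for x, y in train_loader:
            model.train()
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                  # back propagation
            optimizer.step()
            step += 1
            if step % eval_every == 0:                       # validate at certain batch intervals
                model.eval()
                with torch.no_grad():
                    val = sum(criterion(model(xv), yv).item() for xv, yv in valid_loader)
                if val < best_loss:                          # record the best parameters
                    best_loss = val
                    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model

# toy usage: a linear stand-in model over a 100-word vocabulary
xs, ys = torch.randn(64, 32), torch.randint(0, 100, (64,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=8)
train(nn.Linear(32, 100), loader, loader, epochs=1, eval_every=2)
```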
According to the method, the original attention function is compressed by jointly using parameter sharing and low-rank decomposition, yielding a new attention function, the Multi-linear attention function, which is embedded into the Transformer language model to perform natural language processing tasks.
The specific application steps are as follows:
The method mainly comprises three parts: the first part describes how the Encoder structure of the Transformer is designed, the second part tests the compressed language model on a language modeling task, and the third part tests the compressed model on a translation task.
A first part:
(1): a Single-Block Attention function (Single-Block Attention) is constructed based on the Tucker tensor decomposition, as shown on the left side of FIG. 2.
(2): multiple single-block attention functions are constructed while sharing the same set of factor matrices. The tensor representations of these single-block attentions are then combined by average pooling, resulting in a third-order tensor representation, as shown on the right side of FIG. 2.
(3): in order to embed this third-order tensor representation into the Transformer neural network model framework, a tensor splitting method is adopted. The third-order tensor is segmented along its second dimension to obtain a plurality of matrices, the matrices are spliced together, and the resulting representation is embedded into the Transformer structure through a fully connected layer. As shown on the right side of FIG. 2, we split (split) an N×N×N third-order tensor and then use concatenation (concat) to arrive at the matrices T_1, …, T_n.
A second part:
(1) The three data sets are processed: punctuation marks are removed to obtain the word list of the corpus, each word is then vector-encoded, and at sentence-encoding time the relative position vector is spliced onto each word vector. The encoding method of the position vector is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
where pos is the position of the word and i denotes the dimension of the vector (a sketch covering this step and step (2) is given at the end of this part).
(2) The sentences in the corpus are processed in Batch (Batch), each word is coded by the method in (1), and then the word vector representation of one sentence is input into a tensor Transformer language model designed by us. The specific steps are introduced as follows:
the first step is as follows: these inputs are subjected to a linear encoding process three times:
w hereinV、WkAnd WQIs the three initialization parameter matrices and E is the vector input for the sentence.
The second step: the three matrices Q, K and V are then input into the Multi-linear attention function we have designed; the specific formula is as follows:
MultiLinear(Q, K, V) = Concat(H_1, H_2, …, H_n) · W^O, where (H_1, H_2, …, H_n) = TensorSplit(G),
G being the third-order tensor and H_1, H_2, …, H_n the matrices resulting from the tensor splitting.
The third step: the output of the Multi-linear attention function is input into a Feed-Forward network, whose function is:
FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
(3) The last layer applies a full connection to the final output, the loss function is then calculated, back propagation (BP) is performed, and the model is thereby trained.
(4) Model validation is performed through the validation set after each batch, and the model with the best result on the validation set is saved.
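The sketch referenced in step (1) above is given here. It strings together steps (1)-(3) of this part under simplified assumptions: the sinusoidal position vectors, the three linear encodings of the input E, an attention function passed in as an argument, and the feed-forward network. The scaled dot-product stand-in used in the toy call is only a placeholder; in the actual model the Multi-linear attention function sketched in the first part takes its place.

```python
import math
import torch

def positional_encoding(n, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(n, d_model)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def encoder_forward(E, W_q, W_k, W_v, attention_fn, W1, b1, W2, b2):
    # step (1): add position vectors to the word vectors E
    x = E + positional_encoding(E.size(0), E.size(1))
    # first step of (2): three linear encodings of the input
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # second step of (2): the attention function (Multi-linear attention in the text)
    H = attention_fn(Q, K, V)
    # third step of (2): FFN(x) = max(0, xW1 + b1)W2 + b2
    return torch.relu(H @ W1 + b1) @ W2 + b2

n, d, d_ff = 10, 64, 256
E = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
W1, b1, W2, b2 = torch.randn(d, d_ff), torch.zeros(d_ff), torch.randn(d_ff, d), torch.zeros(d)
stand_in = lambda Q, K, V: torch.softmax(Q @ K.T / math.sqrt(d), dim=-1) @ V  # placeholder only
out = encoder_forward(E, W_q, W_k, W_v, stand_in, W1, b1, W2, b2)             # -> (10, 64)
```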
And a third part:
(1) The encoding part adopts the same model structure as in language modeling. In this part, English is used as the source text and is input into the Encoder part of the Transformer language model, while German is used as the target sentence, i.e. the translation, and is input into the Decoder part; the Decoder part adopts the original self-attention function (a sketch of this function is given after this part), which is:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
(2) English passes through the encoder and German through the decoder; the matching probability is then calculated at the output layer, and the translation model is trained with a cross-entropy loss function.
(3) Model validation is performed through the validation set after each batch, and the model with the best result on the validation set is saved.
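The sketch referenced in item (1) of this part is shown below: the original scaled dot-product self-attention used in the Decoder. The causal mask that keeps a position from attending to later target words is standard practice for Transformer decoders and is included here as an assumption; the text itself only names the original attention function.

```python
import math
import torch

def decoder_self_attention(Q, K, V, causal=True):
    # softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask over future positions
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        n = scores.size(-1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V

Q = K = V = torch.randn(10, 64)
out = decoder_self_attention(Q, K, V)    # -> (10, 64)
```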
The model can compress away nearly half of the parameters of the original Transformer language model. In addition, with this technology the compressed self-attention module can be embedded back into the original model structure while an improvement in the experimental results is still ensured. The main technical support is that the tensor representation is a more informative representation and is able to reconstruct the original attention output, which is a marginal of the tensor representation. FIG. 3 shows this process: the left side is the third-order tensor representation obtained by the technology; by slicing the tensor along the vertical dimension, the N matrices shown above on the right side of FIG. 3 are obtained, and summing these N matrices yields the original attention output X. In our technique, the concatenation operation (concat) shown in FIG. 2 is adopted instead of summation, which ensures that richer information is modeled and improves the final modeling effect of the model.
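A shape-level sketch of the two operations compared in FIG. 3 follows: slicing a third-order attention tensor into N matrices and summing them (the marginal that recovers the original attention output in the text), versus concatenating them as the invention does. The random tensor is only a stand-in for the single-block attention output.

```python
import torch

A = torch.randn(5, 5, 5)             # stand-in for the third-order attention tensor

slices = A.unbind(dim=1)             # the N matrices obtained by slicing (FIG. 3)
X_sum = torch.stack(slices).sum(0)   # summation: the marginal used for reconstruction
X_cat = torch.cat(slices, dim=-1)    # concatenation (concat) adopted by the invention

assert torch.allclose(X_sum, A.sum(dim=1))
print(X_sum.shape, X_cat.shape)      # torch.Size([5, 5]) torch.Size([5, 25])
```

Summation collapses the N slices into a single matrix, whereas concatenation keeps every entry, which is why the concatenated representation carries richer information.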
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.