Disclosure of Invention
The invention aims to solve the technical problem that existing large-scale pre-trained neural language models have too many parameters to be experimentally deployed on limited resources. A language modeling data set and a translation data set are used to train the tensor Transformer model, and the network model is trained by back propagation and stochastic gradient descent optimization to obtain the prediction results of the optimal model on the test set, thereby yielding more accurate prediction and translation results.
A method for compressing a neural language model based on a tensor decomposition technology comprises the following steps:
constructing a tensor model by linearly representing an original attention function;
constructing a single block attention function by using Tucker in a tensor model;
constructing a Multi-linear attention function by using the Block-Term decomposition and a shared factor matrix in the tensor model, and realizing parameter compression of the Transformer neural network model;
embedding the Multi-linear attention function into the Transformer neural network model structure to obtain a compressed tensor Transformer model;
applying the compressed tensor Transformer model to a language modeling task and a machine translation task; the method comprises the following steps:
acquiring a language model data set and a German-English translation data set;
processing the data set and cleaning the data, wherein for text data the position of each word in a sentence carries relational information, and a word-vector representation model of the text is constructed by combining the position information of the words in the sentence;
inputting the word-vector representation model into the tensor Transformer model for model training to obtain the training loss function of the neural network model, wherein the training loss function is the cross-entropy loss:
L = − Σ_{i=1}^{n} y_i · log(ŷ_i)
where y_i denotes the true category label, ŷ_i denotes the prediction result, and n denotes the length of the sentence; the model is trained by a back propagation algorithm and a batch stochastic gradient descent method.
Training the tensor Transformer model on the training set, while validating it on the validation set at certain batch intervals and recording the model parameters that achieve the best result on the validation set;
testing the samples of the test set with the optimal model saved in the previous step to finally obtain the prediction result or translation result of each test sample, comparing them against the test labels, and calculating the accuracy of prediction and translation; wherein:
in the language modeling task, the test set of the language modeling data set is input into the trained tensor Transformer model for testing, and the probability of each sentence is calculated (a minimal sketch of this sentence-probability computation is given after these steps);
in the translation task, the test set of the German-English translation data set is input into the trained tensor Transformer model for testing, and the BLEU value between each translated sentence and its reference sentence is calculated;
and recording the experimental results of the tensor Transformer model on a language modeling task and a German-English translation task.
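As an illustration of the sentence-probability computation mentioned in the language modeling test step above, the following minimal PyTorch sketch shows how a per-sentence log-probability and the corresponding test perplexity (Test PPL) could be derived from the model's output scores; the tensor shapes and random toy inputs are assumptions made only for this example.

```python
import torch
import torch.nn.functional as F

def sentence_log_prob(logits, targets):
    # logits:  (n, vocab) unnormalised scores predicted for each position
    # targets: (n,) index of the word actually observed at each position
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(1, targets.unsqueeze(1)).sum()

def perplexity(logits, targets):
    # Test PPL of one sentence: exp of the average negative log-likelihood
    return torch.exp(-sentence_log_prob(logits, targets) / targets.numel())

# toy usage with random scores over a 1000-word vocabulary
logits = torch.randn(12, 1000)
targets = torch.randint(0, 1000, (12,))
print(perplexity(logits, targets))
```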
The invention has the beneficial effects that:
(1) A pre-trained language model with few parameters and high accuracy is built, overcoming the dilemma that a model with excessive parameters is difficult to deploy.
(2) The two compression ideas (low-rank decomposition and parameter sharing) are used in combination, so that a better compressed model can be obtained.
(3) The compressed model can improve the experimental results, and in the model testing stage the batch size can be increased to accommodate translation requests from multiple users. Tables 1 and 2 below show the experimental results of the compressed model on the language modeling and translation tasks, respectively.
Note: on the One-Billion data set, the lower the Test PPL, the better.
TABLE 1
Note: on the WMT-16 English-German translation data set, the larger the Test BLEU value, the better.
TABLE 2
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following. FIG. 1 shows a flow chart of the analysis method of the compression model proposed by the present invention; FIG. 2 shows a diagram of the neural network model designed according to the present invention; FIG. 3 shows a schematic diagram of the tensor representation that can reconstruct the original attention.
The invention discloses a method for compressing the parameters of a neural language model based on tensor decomposition. Owing to the outstanding performance of self-attention-based language models in natural language processing tasks, pre-trained (Pre-training) language models based on the encoder-decoder (Encoder-Decoder) structure with self-attention have become a research hotspot in the field of natural language processing, yet their excessively large number of parameters makes training difficult on limited computing resources. We develop a way to combine the tensor decomposition and parameter sharing compression ideas to compress a more general neural network language model, the Transformer, and apply the compressed model to language modeling and machine translation tasks.
The implementation and application of the invention mainly comprise the following steps: finding the structure that needs compression in the Transformer language model and studying its main properties; designing a new model structure by using the tensor decomposition technique; studying the difference between the new structure and the original structure and proving the rationality of the new structure; collecting a language modeling and a translation corpus data set, dividing them into training and test sets, and constructing a word representation for each word in the corpus from its position vector and semantic vector information; inputting the text word vectors of the training corpus into the compressed network structure and training the language model; and inputting the text word vectors of the test set into the trained language model, thereby calculating the prediction probability of each sample.
the present invention starts with a linear representation of the original attention function and then demonstrates that the attention function can be linearly represented by a set of orthonormal basis vectors, which we can then share in case of constructing a multi-headed mechanism. Thereby compressing the parameters. Meanwhile, the invention also proves that the original attention function can be reconstructed by the new expression, and the neural network model can have stronger discrimination capability by the tensor slicing mode modeling. The invention provides a new idea for developing a neural network model with low parameters and high accuracy.
The purpose of the invention is realized by the following technical scheme, which comprises the following steps:
constructing a tensor model by linearly representing an original attention function;
constructing a single block attention function by using Tucker in a tensor model;
constructing a Multi-linear attention function by using the Block-Term decomposition and a shared factor matrix in the tensor model, and realizing parameter compression of the Transformer neural network model;
since there are three encoding matrices in the Transformer neural network model, they are treated as the three factor matrices; after initializing a core tensor, a single-block attention function can be constructed by using the Tucker decomposition in the tensor decomposition technique.
The single-block attention function is constructed using the Tucker decomposition technique and has the form:
AttenTD(G; Q, K, V) = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{m=1}^{M} G_{ijm} · (Q_i ∘ K_j ∘ V_m)
where G is the core tensor, i, j and m are the indices of the core tensor, Q_i, K_j and V_m are the corresponding column vectors of the three factor matrices, and ∘ is the outer product of vectors. In particular, in the experiments the core tensor G is defined by the following equation:
G_{ijm} = rand(0, 1) if i = j = m, and G_{ijm} = 0 otherwise,
where rand(0, 1) is a function that generates random values. Hence the parameter we store is only a vector (the diagonal of G), on which we perform Softmax normalization.
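The following is a minimal PyTorch sketch of the single-block attention described above, under the assumption that Q, K and V are n×d factor matrices and that the core tensor is diagonal with a Softmax-normalized rand(0,1) vector on its diagonal; the function and variable names are illustrative only.

```python
import torch

def single_block_attention(Q, K, V, g):
    # Q, K, V: (n, d) factor matrices; g: (d,) diagonal of the core tensor.
    # Returns the third-order tensor A[x, y, z] = sum_r g[r] * Q[x, r] * K[y, r] * V[z, r].
    g = torch.softmax(g, dim=-1)                      # store a vector, Softmax-normalise it
    return torch.einsum('r,xr,yr,zr->xyz', g, Q, K, V)

n, d = 10, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
g = torch.rand(d)                                     # rand(0, 1) initialisation of the diagonal
A = single_block_attention(Q, K, V, g)                # -> tensor of shape (10, 10, 10)
```

Because only the d-dimensional diagonal is stored, the core tensor adds almost no parameters on top of the three factor matrices.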
Embedding the Multi-linear attention function into the Transformer neural network model structure to obtain a compressed tensor Transformer model;
To construct a multi-head-like attention mechanism while still being able to compress the parameters of the model, we use one set of linear mappings and then share the output of this part across blocks. In our model this is called the Multi-linear attention function, which can be formalized as:
MultiLinear(G; Q, K, V) = Concat(TensorSplit((1/h) · (T_1 + T_2 + … + T_h))) · W^O, with T_j = AttenTD(G_j; Q, K, V),
where G_j is the diagonal core tensor of the j-th block and W^O is the output weight matrix. The Multi-linear attention function herein can reconstruct the original attention of the language model. In multi-head compression, the compression ratio of the model is obtained by comparing the parameter count of standard multi-head attention with that of the Multi-linear attention; h is generally set to 8, so as d increases, the compression ratio of the model increases.
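A hedged PyTorch sketch of the Multi-linear attention is given below. It assumes the shapes used in the previous sketch, shares one set of factor projections W^q, W^k, W^v across the h blocks, averages the h single-block tensors, and maps the split-and-concatenated slices back to the model dimension; tying the output projection to the sequence length is an assumption of this sketch, not a statement of the invention's exact parameterization.

```python
import torch
import torch.nn as nn

class MultiLinearAttention(nn.Module):
    """Block-Term style multi-linear attention with shared factor matrices (sketch)."""
    def __init__(self, d_model, seq_len, h=8):
        super().__init__()
        self.h = h
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # shared factor matrices
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.cores = nn.Parameter(torch.rand(h, d_model))    # h diagonal core tensors
        self.w_o = nn.Linear(seq_len * seq_len, d_model, bias=False)  # output map W^O

    def forward(self, x):                                    # x: (n, d_model)
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        g = torch.softmax(self.cores, dim=-1)                # normalised diagonals
        # sum of the h single-block attention tensors, then average pooling
        T = torch.einsum('hr,xr,yr,zr->xyz', g, Q, K, V) / self.h    # (n, n, n)
        # tensor split along the second mode, then concatenation of the n slices
        flat = torch.cat(T.unbind(dim=1), dim=-1)            # (n, n*n)
        return self.w_o(flat)                                # (n, d_model)

attn = MultiLinearAttention(d_model=64, seq_len=10, h=8)
out = attn(torch.randn(10, 64))                              # -> (10, 64)
```

Compared with standard multi-head attention, the h blocks here reuse the same three projection matrices and each block only adds a d-dimensional diagonal core, which is where the parameter saving comes from.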
Applying the compressed tensor Transformer model to a language modeling task and a machine translation task; the method comprises the following steps:
acquiring a language model data set and a German-English translation data set;
processing the data set and cleaning the data, wherein for text data the position of each word in a sentence carries relational information, and a word-vector representation model of the text is constructed by combining the position information of the words in the sentence;
inputting the word-vector representation model into the tensor Transformer model for model training to obtain the training loss function of the neural network model, wherein the training loss function is the cross-entropy loss:
L = − Σ_{i=1}^{n} y_i · log(ŷ_i)
where y_i denotes the true category label, ŷ_i denotes the prediction result, and n denotes the length of the sentence; the model is trained by a back propagation algorithm and a batch stochastic gradient descent method (a training-loop sketch is given after these steps).
Training the tensor Transformer model on the training set, while validating it on the validation set at certain batch intervals and recording the model parameters that achieve the best result on the validation set;
testing the samples of the test set with the optimal model saved in the previous step to finally obtain the prediction result or translation result of each test sample, comparing them against the test labels, and calculating the accuracy of prediction and translation; wherein:
in the language modeling task, the test set of the language modeling data set is input into the trained tensor Transformer model for testing, and the probability of each sentence is calculated;
in the translation task, the test set of the German-English translation data set is input into the trained tensor Transformer model for testing, and the BLEU value between each translated sentence and its reference sentence is calculated;
and recording the experimental results of the tensor Transformer model on a language modeling task and a German-English translation task.
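The training-loop sketch referenced above is shown next. It is a minimal PyTorch illustration of the training steps (cross-entropy loss, back propagation, batch stochastic gradient descent, periodic validation, and saving the best parameters); the model, data loaders and hyper-parameter values are placeholders, not the invention's actual configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_loader, valid_loader, epochs=10, lr=0.1, eval_every=200):
    criterion = nn.CrossEntropyLoss()                        # cross-entropy training loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # batch stochastic gradient descent
    best_loss, best_state, step = float('inf'), None, 0
    for _ in range(epochs):
        for x, y in train_loader:
            model.train()
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                  # back propagation
            optimizer.step()
            step += 1
            if step % eval_every == 0:                       # validate at certain batch intervals
                model.eval()
                with torch.no_grad():
                    val = sum(criterion(model(xv), yv).item() for xv, yv in valid_loader)
                if val < best_loss:                          # record the best parameters
                    best_loss = val
                    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model

# toy usage: a linear stand-in model over a 100-word vocabulary
xs, ys = torch.randn(64, 32), torch.randint(0, 100, (64,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=8)
train(nn.Linear(32, 100), loader, loader, epochs=1, eval_every=2)
```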
According to the method, the original attention function is compressed by jointly using parameter sharing and low-rank decomposition, yielding a new attention function, the Multi-linear attention function, which is embedded into the Transformer language model to perform natural language processing tasks.
The specific application steps are as follows:
The method mainly comprises three parts: the first part describes how the Encoder structure of the Transformer is designed, the second part tests the compressed language model on a language modeling task, and the third part tests the compressed model on a translation task.
A first part:
(1): a Single-Block Attention function (Single-Block Attention) is constructed based on the Tucker tensor decomposition, as shown on the left side of FIG. 2.
(2): multiple single-block attention functions are constructed while sharing the same set of factor matrices. The tensor representations of these single-block attentions are then combined by average pooling, resulting in a third-order tensor representation, as shown on the right side of FIG. 2.
(3): in order to embed this third-order tensor representation into the Transformer neural network model framework, a tensor splitting method is adopted. The third-order tensor is segmented along its second dimension to obtain a plurality of matrices, the matrices are spliced together, and the resulting representation is embedded into the Transformer structure through a fully connected layer. As shown on the right side of FIG. 2, we split (split) an N×N×N third-order tensor and then use concatenation (concat) to arrive at the matrices T_1, …, T_n.
A second part:
(1) The three data sets are processed: punctuation marks are removed to obtain the word list of the corpus, each word is then vector-encoded, and at sentence-encoding time the relative position vector is spliced onto each word vector. The encoding method of the position vector is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
where pos is the position of the word and i denotes the dimension of the vector (a sketch covering this step and step (2) is given at the end of this part).
(2) The sentences in the corpus are processed in Batch (Batch), each word is coded by the method in (1), and then the word vector representation of one sentence is input into a tensor Transformer language model designed by us. The specific steps are introduced as follows:
the first step is as follows: these inputs are subjected to a linear encoding process three times:
w hereinV、WkAnd WQIs the three initialization parameter matrices and E is the vector input for the sentence.
The second step: the three matrices Q, K and V are then input into the Multi-linear attention function we have designed; the specific formula is as follows:
MultiLinear(Q, K, V) = Concat(H_1, H_2, …, H_n) · W^O, where (H_1, H_2, …, H_n) = TensorSplit(G),
G being the third-order tensor and H_1, H_2, …, H_n the matrices resulting from the tensor splitting.
The third step: the output of the Multi-linear attention function is input into a Feed-Forward network, whose function is:
FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
(3) The last layer applies a full connection to the final output, the loss function is then calculated, back propagation (BP) is performed, and the model is thereby trained.
(4) Model validation is performed through the validation set after each batch, and the model with the best result on the validation set is saved.
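The sketch referenced in step (1) above is given here. It strings together steps (1)-(3) of this part under simplified assumptions: the sinusoidal position vectors, the three linear encodings of the input E, an attention function passed in as an argument, and the feed-forward network. The scaled dot-product stand-in used in the toy call is only a placeholder; in the actual model the Multi-linear attention function sketched in the first part takes its place.

```python
import math
import torch

def positional_encoding(n, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(n, d_model)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def encoder_forward(E, W_q, W_k, W_v, attention_fn, W1, b1, W2, b2):
    # step (1): add position vectors to the word vectors E
    x = E + positional_encoding(E.size(0), E.size(1))
    # first step of (2): three linear encodings of the input
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # second step of (2): the attention function (Multi-linear attention in the text)
    H = attention_fn(Q, K, V)
    # third step of (2): FFN(x) = max(0, xW1 + b1)W2 + b2
    return torch.relu(H @ W1 + b1) @ W2 + b2

n, d, d_ff = 10, 64, 256
E = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
W1, b1, W2, b2 = torch.randn(d, d_ff), torch.zeros(d_ff), torch.randn(d_ff, d), torch.zeros(d)
stand_in = lambda Q, K, V: torch.softmax(Q @ K.T / math.sqrt(d), dim=-1) @ V  # placeholder only
out = encoder_forward(E, W_q, W_k, W_v, stand_in, W1, b1, W2, b2)             # -> (10, 64)
```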
And a third part:
(1) The encoding part adopts the same model structure as in language modeling. In this part, English is used as the source text and is input into the Encoder part of the Transformer language model, while German is used as the target sentence, i.e. the translation, and is input into the Decoder part; the Decoder part adopts the original self-attention function (a sketch of this function is given after this part), which is:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
(2) English passes through the encoder and German through the decoder; the matching probability is then calculated at the output layer, and the translation model is trained with a cross-entropy loss function.
(3) Model validation is performed through the validation set after each batch, and the model with the best result on the validation set is saved.
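The sketch referenced in item (1) of this part is shown below: the original scaled dot-product self-attention used in the Decoder. The causal mask that keeps a position from attending to later target words is standard practice for Transformer decoders and is included here as an assumption; the text itself only names the original attention function.

```python
import math
import torch

def decoder_self_attention(Q, K, V, causal=True):
    # softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask over future positions
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        n = scores.size(-1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V

Q = K = V = torch.randn(10, 64)
out = decoder_self_attention(Q, K, V)    # -> (10, 64)
```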
The model can compress away nearly half of the parameters of the original Transformer language model. In addition, with this technology the compressed self-attention module can be embedded back into the original model structure while an improvement in the experimental results is still ensured. The main technical support is that the tensor representation is a more informative representation and is able to reconstruct the original attention output, which is a marginal of the tensor representation. FIG. 3 shows this process: the left side is the third-order tensor representation obtained by the technology; by slicing the tensor along the vertical dimension, the N matrices shown above on the right side of FIG. 3 are obtained, and summing these N matrices yields the original attention output X. In our technique, the concatenation operation (concat) shown in FIG. 2 is adopted instead of summation, which ensures that richer information is modeled and improves the final modeling effect of the model.
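A shape-level sketch of the two operations compared in FIG. 3 follows: slicing a third-order attention tensor into N matrices and summing them (the marginal that recovers the original attention output in the text), versus concatenating them as the invention does. The random tensor is only a stand-in for the single-block attention output.

```python
import torch

A = torch.randn(5, 5, 5)             # stand-in for the third-order attention tensor

slices = A.unbind(dim=1)             # the N matrices obtained by slicing (FIG. 3)
X_sum = torch.stack(slices).sum(0)   # summation: the marginal used for reconstruction
X_cat = torch.cat(slices, dim=-1)    # concatenation (concat) adopted by the invention

assert torch.allclose(X_sum, A.sum(dim=1))
print(X_sum.shape, X_cat.shape)      # torch.Size([5, 5]) torch.Size([5, 25])
```

Summation collapses the N slices into a single matrix, whereas concatenation keeps every entry, which is why the concatenated representation carries richer information.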
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.