Disclosure of Invention
This patent improves and optimizes the GRU model, a recurrent neural network, and proposes a new text feature extraction model, the NGU language model. A normalization mechanism is introduced into the GRU gate units, the hyperbolic tangent function, which has a saturation region, is replaced by a layer normalization operation, and the feedforward layer neural network of the Transformer is fused into the iteration unit, improving the semantic expression ability of the model, that is, its ability to fit data. This patent defines the resulting model as the NGU language model.
The method solves the problems in the prior art and provides a text feature extraction method based on the NGU language model, which integrates the respective advantages of the GRU and the Transformer, and comprises the following steps:
S1, constructing a training data set: collecting and arranging a training data set related to the task and putting it into train.txt; the maximum text length input to the NGU language model is set to 1000, and when the length of a text is less than 1000 it is padded to the maximum length of 1000 with [PAD];
S2, constructing the mapping from characters to IDs: counting the characters in train.txt in the training set of S1 and recording them as token_list, then building a dictionary Dict_token from the characters in token_list, where each key of Dict_token is a character index number and each value is a specific single character, and [PAD] is the completion character used when a text does not reach the maximum length;
S3, adapting the training data to the model: if a text sample in the training data set obtained in step S1 does not reach the maximum length of 1000, its character list is padded to the maximum length of 1000 with [PAD], then mapped into a list of index numbers through the dictionary Dict_token and converted into the model input tensor X; with a batch size batch_size of 128 fed into the NGU language model, the size of X is [128, 1000];
S4, extracting text features with the NGU language model: the iteration formula of the original GRU network model is as follows:
f is an equivalent formula of the GRU gated recurrent unit, and the detailed formula of f is as follows:
the proposed NGU iteration formula is specifically as follows:
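For orientation, the standard GRU iteration can be written as below; it is followed by a hedged sketch of the NGU iteration consistent with the description in this patent (layer normalization before each gate sigmoid, layer normalization replacing tanh, and a Transformer feedforward sub-layer with a residual connection). The symbols W, U, b for weights and LN for layer normalization are introduced here only for illustration, and the exact patented formulas may differ:

```latex
% Standard GRU iteration h_t = f(x_t, h_{t-1}), given for reference:
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) &&\text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1})\right) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}

% Hedged sketch of the NGU iteration implied by the description
% (LN = layer normalization, FFN = Transformer feedforward sub-layer):
\begin{aligned}
z_t &= \sigma\left(\mathrm{LN}(W_z x_t + U_z h_{t-1})\right)\\
r_t &= \sigma\left(\mathrm{LN}(W_r x_t + U_r h_{t-1})\right)\\
\tilde{h}_t &= \mathrm{LN}\left(W_h x_t + U_h (r_t \odot h_{t-1})\right)\\
u_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\\
h_t &= \mathrm{LN}\left(u_t + \mathrm{FFN}(u_t)\right),\qquad
\mathrm{FFN}(u) = W_2\,\mathrm{GeLU}(W_1 u + b_1) + b_2
\end{aligned}
```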
The sigmoid function has a saturation region when x is far from 0, so information is lost when the output of the fully connected layer passes through the sigmoid function. A layer normalization operation is therefore introduced: normalization is carried out over the embedding representation dimension, and the sigmoid function is then applied, which effectively retains the text representation information;
When the input to the hyperbolic tangent in the GRU iteration is far from 0 and reaches the saturation region, the output tends towards a constant value and much semantic information is lost. The hyperbolic tangent function tanh in the GRU is therefore replaced by a layer normalization operation. The layer normalization operation is as follows: normalization is performed over the semantic representation dimension d_model; if the current word representation matrix is T, the matrix size is [1, d_model], and the values along the second dimension are, in order, t_1, t_2, ..., t_{d_model}:
the mean is: μ = (t_1 + t_2 + ... + t_{d_model}) / d_model;
the variance is: σ² = [(t_1 − μ)² + (t_2 − μ)² + ... + (t_{d_model} − μ)²] / d_model;
and each value output by the layer normalization is: LN(t_i) = (t_i − μ) / √(σ² + ε), where ε is a small constant added for numerical stability;
The layer normalization operation only translates and contracts the data and does not lose semantic information; at the same time, it normalizes the embedding representation dimension to be close to 0, which makes model training more stable. The feedforward layer neural network of the Transformer is migrated into the NGU model; the feedforward neural network comprises two linear transformations and one nonlinear transformation, the GeLU activation function, followed by a residual connection and a layer normalization operation, as sketched below;
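As an illustration only, a minimal PyTorch sketch of the feedforward sub-layer just described (two linear transformations around a GeLU nonlinearity, then a residual connection and layer normalization) could look as follows; the class name, the use of PyTorch and the default dimensions are assumptions of this sketch, not the patented implementation:

```python
import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    """Transformer-style feedforward sub-layer (sketch): two linear
    transformations with a GeLU nonlinearity, followed by a residual
    connection and layer normalization."""

    def __init__(self, d_model: int = 256, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # first linear transformation
        self.linear2 = nn.Linear(d_ff, d_model)   # second linear transformation
        self.act = nn.GELU()                      # nonlinear transformation
        self.norm = nn.LayerNorm(d_model)         # layer normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection around the two-layer feedforward network
        return self.norm(x + self.linear2(self.act(self.linear1(x))))
```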
The word embedding dimension is 256 and the hidden layer dimension of the feedforward neural network is 2048. For the model input of size [128, 1000] from step S3, each word in the text data is first given an embedding representation through token_embedding, so that at each of the 1000 word positions the embedded representation has matrix shape [128, 256], where 128 is the batch size batch_size during training and 256 is the word embedding dimension. Each word embedding in the text data X is then fed in turn into the NGU loop iteration unit, and the hidden state output at every time step is spliced together into an output tensor of dimension [128, 1000, 256]; the extraction of text features is thus completed by the NGU language model, and each word of each sentence of data in one batch is represented as a 256-dimensional vector;
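A minimal sketch of this shape flow, assuming PyTorch and using a stand-in for the NGU iteration unit (a real unit would follow the iteration formulas above); the vocabulary size of 5000 is an assumption made only so the example runs:

```python
import torch
import torch.nn as nn

batch_size, max_len, vocab_size, d_model = 128, 1000, 5000, 256

X = torch.randint(0, vocab_size, (batch_size, max_len))   # index tensor from S3, [128, 1000]
token_embedding = nn.Embedding(vocab_size, d_model)
E = token_embedding(X)                                     # embedded text, [128, 1000, 256]

def ngu_step(x_t, h_prev):
    # Stand-in for one NGU iteration; the real unit applies the layer-normalized
    # gates and the feedforward sub-layer described in step S4.
    return x_t + h_prev

h = torch.zeros(batch_size, d_model)
outputs = []
for t in range(max_len):
    h = ngu_step(E[:, t, :], h)      # [128, 256] hidden state at step t
    outputs.append(h)

H = torch.stack(outputs, dim=1)
print(H.shape)                       # torch.Size([128, 1000, 256])
```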
S5, applying the NGU language model: the proposed NGU language model is a non-pre-trained language model, and its parameters are trained for a specific natural language processing task; the text representation tensor obtained in S4 has size [128, 1000, 256], and this tensor is connected to the subsequent neural network for text classification, relation extraction, text generation or entity recognition, which is then trained.
The method has the following advantages. The GRU sigmoid activation function has a saturation region, and the intermediate fully connected layer can compress the representation of the text into this saturation region; the method therefore introduces layer normalization after the fully connected layer and before the activation function, shifting the centre of the data close to 0 so that information is retained. To further preserve text information, that is, to avoid activation functions with a saturation region as far as possible, the hyperbolic tangent function in the GRU is directly replaced by a layer normalization operation; because of the memory gating effect of the sigmoid, this does not affect the stability of model training. The normalization operation only translates the values and normalizes the variance, so the richness of the values is retained to the greatest extent and the amount of data the model can fit is increased. The feedforward neural network of the Transformer is a powerful module that performs fully connected operations over the embedding representation dimension and has a strong fitting capability. The NGU language model effectively combines the respective advantages of the GRU and the Transformer and can be applied to various natural language processing tasks. Although the idea of this patent is derived from the GRU, the hyperbolic tangent function in the GRU is replaced, layer normalization operations are introduced at multiple positions, and a feedforward fully connected neural network layer is introduced into the iteration unit, so the method is clearly different from the GRU model; this patent defines it as the NGU language model.
Detailed Description
The invention is described in further detail below with reference to a specific case:
The invention specifically relates to a text feature extraction method based on the NGU language model, which comprises the following steps:
S1, constructing a training data set: for the specific task in the project, such as text classification, relation extraction, entity recognition, text search, text writing, text completion or another natural language processing task, a training data set related to the task is collected, collated and placed into train.txt. The maximum text length input to the NGU language model is 1000, and when the text length is less than 1000, [PAD] is used to pad it to the maximum length of 1000;
S2, constructing the mapping from characters to IDs: the characters appearing in train.txt in the training set of S1 are counted and recorded as token_list. A dictionary Dict_token is then built from the characters in token_list; each key of Dict_token is a character index number and each value is a specific single character, for example Dict_token = {0: "middle", 1: "Hua", 2: "Tang", 3: "person", 4: "Min", 5: "[PAD]", ...}, where [PAD] is the completion character used when a text does not reach the maximum length;
S3, adapting the training data to the model: suppose a text sample in the training data set obtained in step S1 is "Along with the development of deep learning, artificial intelligence has gained more and more attention in the financial field, medical field and educational field." The sample is split into a list of its individual characters; since this list is shorter than the maximum length of 1000, it is padded to the maximum text length of 1000 with [PAD], then mapped into a list of index numbers through the dictionary Dict_token and converted into the model input tensor X. If the batch size fed into the NGU language model is 128, the size of X is [128, 1000], as in the sketch below.
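A minimal sketch of steps S1 to S3, written in Python purely for illustration: the file name train.txt, the [PAD] token and the shape [128, 1000] come from the text, while the helper names (build_dict, encode) and the use of PyTorch are assumptions of this sketch:

```python
import torch

MAX_LEN, PAD = 1000, "[PAD]"

def build_dict(lines):
    """S2: collect the characters of train.txt into token_list and build
    Dict_token mapping index number -> single character (plus [PAD])."""
    token_list = sorted({ch for line in lines for ch in line} | {PAD})
    return {i: ch for i, ch in enumerate(token_list)}

def encode(line, char_to_id):
    """S3: split a text into characters, pad to MAX_LEN with [PAD] and map
    every character to its index number."""
    chars = list(line)[:MAX_LEN]
    chars += [PAD] * (MAX_LEN - len(chars))
    return [char_to_id[ch] for ch in chars]

with open("train.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

Dict_token = build_dict(lines)                        # index -> character
char_to_id = {ch: i for i, ch in Dict_token.items()}  # character -> index

batch = lines[:128]                                   # one batch of 128 samples
X = torch.tensor([encode(line, char_to_id) for line in batch])
print(X.shape)                                        # torch.Size([128, 1000])
```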
S4, extracting the text features of the NGU language model: the original GRU network model iterative formula is as follows:
f is an equivalent formula of the GRU gated recurrent unit, and the detailed formula of f is as follows:
the NGU iterative formula proposed by this patent is specifically as follows:
The sigmoid function used in the GRU is plotted in Fig. 2.
The sigmoid function has a saturation region when x is far from 0, so information is lost when the output of the fully connected layer passes through the sigmoid function. The method introduces a layer normalization operation that normalizes over the embedding representation dimension before the sigmoid function is applied, which effectively preserves the richness of the text representation information.
The tanh function in the GRU also saturates when x is far from 0; its derivative there approaches 0, causing the gradient to vanish. The tanh function and its derivative are plotted in Fig. 3.
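The vanishing gradient in the saturation region can be seen directly from the derivative of the hyperbolic tangent, a standard identity reproduced here for reference:

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\tanh'(x) = 1 - \tanh^{2}(x) \;\to\; 0 \quad \text{as } |x| \to \infty
```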
When the input to the hyperbolic tangent in formula (1) is far from 0 and reaches the saturation region, the output tends towards a constant value and much semantic information is lost. This patent adopts a layer normalization (LayerNorm) operation as a replacement for the hyperbolic tangent function tanh in the GRU. The layer normalization operation is as follows: normalization is performed over the semantic representation dimension d_model. If the current word representation matrix is T, the matrix size is [1, d_model], and the values along the second dimension are, in order, t_1, t_2, ..., t_{d_model}:
the mean is: μ = (t_1 + t_2 + ... + t_{d_model}) / d_model;
the variance is: σ² = [(t_1 − μ)² + (t_2 − μ)² + ... + (t_{d_model} − μ)²] / d_model;
and each value output by the layer normalization is: LN(t_i) = (t_i − μ) / √(σ² + ε), where ε is a small constant added for numerical stability.
It can be seen that the layer normalization operation only translates and contracts the data and does not lose semantic information. At the same time, the operation normalizes the embedding representation dimension to be close to 0, which makes model training more stable. The layer normalization operation is introduced throughout the gate units, and the tanh hyperbolic tangent function in the GRU is replaced by the layer normalization operation, so the model trains faster and has good semantic representation capability. To further improve the data fitting capability of the method, the feedforward layer neural network of the Transformer is transferred into the NGU model; the feedforward neural network comprises two linear transformations and one nonlinear transformation (the GeLU activation function), followed by a residual connection and a layer normalization operation, as sketched below.
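Assembling the pieces described so far, one possible reading of a single NGU iteration unit is sketched below in PyTorch. It is only an interpretation of the textual description (layer normalization before each gate sigmoid, layer normalization replacing tanh, and a Transformer-style feedforward sub-layer with a residual connection); the class and variable names are assumptions, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class NGUCell(nn.Module):
    """One NGU iteration step (sketch): LayerNorm-before-sigmoid gates,
    LayerNorm in place of tanh, and a Transformer feedforward sub-layer."""

    def __init__(self, d_model: int = 256, d_ff: int = 2048):
        super().__init__()
        self.update = nn.Linear(2 * d_model, d_model)   # update gate z_t
        self.reset = nn.Linear(2 * d_model, d_model)    # reset gate r_t
        self.cand = nn.Linear(2 * d_model, d_model)     # candidate state
        self.norm_z = nn.LayerNorm(d_model)
        self.norm_r = nn.LayerNorm(d_model)
        self.norm_c = nn.LayerNorm(d_model)             # replaces tanh
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x_t, h_prev], dim=-1)
        z = torch.sigmoid(self.norm_z(self.update(xh)))           # update gate
        r = torch.sigmoid(self.norm_r(self.reset(xh)))            # reset gate
        cand = self.norm_c(self.cand(torch.cat([x_t, r * h_prev], dim=-1)))
        u = (1.0 - z) * h_prev + z * cand                         # GRU-style mixing
        return self.norm_out(u + self.ffn(u))                     # FFN + residual + LN
```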
In this patent, the word embedding dimension is 256 and the hidden layer dimension of the feedforward neural network is 2048. For the model input of size [128, 1000] from step S3, each word in the text data is first given an embedding representation through token_embedding; at this point the embedded representation at each of the 1000 word positions has matrix shape [128, 256], where 128 is the batch size batchsize during training and 256 is the word embedding dimension. Each word embedding in the text data X is then fed in turn into the NGU loop iteration unit, and the hidden state output at every time step is spliced together into an output tensor of dimension [128, 1000, 256]. That is, the extraction of text features is completed by the NGU language model, and each word of each sentence of data in one batch is represented as a 256-dimensional vector by the NGU language model.
S5, applying the NGU language model: the NGU language model provided by this patent is a non-pre-trained language model and needs parameter training for a specific natural language processing task; the text representation tensor obtained in S4 has size [128, 1000, 256], and this tensor is connected to the subsequent neural network for text classification, relation extraction, text generation or entity recognition and trained, which completes the application of the invention.
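As one hedged example of step S5, a simple text classification head on top of the [128, 1000, 256] representation could be attached as below; the mean pooling, the number of classes and the random stand-in tensor are choices made only for this illustration and are not specified by the patent:

```python
import torch
import torch.nn as nn

batch_size, max_len, d_model, num_classes = 128, 1000, 256, 10

# H stands in for the NGU output of step S4, shape [128, 1000, 256].
H = torch.randn(batch_size, max_len, d_model)

classifier = nn.Linear(d_model, num_classes)
sentence_repr = H.mean(dim=1)          # pool over the 1000 word positions
logits = classifier(sentence_repr)     # [128, num_classes]

labels = torch.randint(0, num_classes, (batch_size,))
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                        # trained jointly with the NGU parameters
```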
Through theoretical analysis, this patent improves on the GRU model and proposes the NGU language model. The main improvements are: a layer normalization operation is introduced before the sigmoid function in the GRU reset gate and update gate; the hyperbolic tangent function in the GRU is changed to a layer normalization operation; and the feedforward neural network of the Transformer is introduced into the NGU to further improve the semantic representation complexity of the model, that is, its data fitting capability. The NGU language model effectively combines the respective advantages of the GRU and the Transformer, and can be applied to various natural language processing tasks.