Disclosure of Invention
This patent improves and optimizes the GRU model, a recurrent neural network, and proposes a new text feature extraction model, the NGU language model. A normalization mechanism is introduced into the GRU gate units, the hyperbolic tangent function, which has a saturation region, is replaced by a layer normalization operation, and the feedforward layer neural network of the Transformer is fused into the iteration unit, improving the semantic expression ability of the model, that is, its ability to fit data. This patent defines the resulting model as the NGU language model.
The method solves the problems in the prior art and provides a text feature extraction method based on the NGU language model, which integrates the respective advantages of the GRU and the Transformer, and comprises the following steps:
S1, constructing a training data set: collecting and arranging a training data set related to the task and putting it into train.txt; the maximum text length input to the NGU language model is set to 1000, and when the length of a text is less than 1000 it is padded to the maximum length of 1000 with [PAD];
S2, constructing the mapping from characters to IDs: counting the characters in train.txt in the training set of S1 and recording them as token_list, then building a dictionary Dict_token from the characters in token_list, where each key of Dict_token is a character index number and each value is a specific single character, and [PAD] is the completion character used when a text does not reach the maximum length;
S3, adapting the training data to the model: if a text sample in the training data set obtained in step S1 does not reach the maximum length of 1000, its character list is padded to the maximum length of 1000 with [PAD], then mapped into a list of index numbers through the dictionary Dict_token and converted into the model input tensor X; with a batch size batch_size of 128 fed into the NGU language model, the size of X is [128, 1000];
S4, extracting text features with the NGU language model: the iteration formula of the original GRU network model is as follows:
f is an equivalent formula of the GRU gated recurrent unit, and the detailed formula of f is as follows:
the proposed NGU iteration formula is specifically as follows:
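For orientation, the standard GRU iteration can be written as below; it is followed by a hedged sketch of the NGU iteration consistent with the description in this patent (layer normalization before each gate sigmoid, layer normalization replacing tanh, and a Transformer feedforward sub-layer with a residual connection). The symbols W, U, b for weights and LN for layer normalization are introduced here only for illustration, and the exact patented formulas may differ:

```latex
% Standard GRU iteration h_t = f(x_t, h_{t-1}), given for reference:
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) &&\text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1})\right) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}

% Hedged sketch of the NGU iteration implied by the description
% (LN = layer normalization, FFN = Transformer feedforward sub-layer):
\begin{aligned}
z_t &= \sigma\left(\mathrm{LN}(W_z x_t + U_z h_{t-1})\right)\\
r_t &= \sigma\left(\mathrm{LN}(W_r x_t + U_r h_{t-1})\right)\\
\tilde{h}_t &= \mathrm{LN}\left(W_h x_t + U_h (r_t \odot h_{t-1})\right)\\
u_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\\
h_t &= \mathrm{LN}\left(u_t + \mathrm{FFN}(u_t)\right),\qquad
\mathrm{FFN}(u) = W_2\,\mathrm{GeLU}(W_1 u + b_1) + b_2
\end{aligned}
```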
The sigmoid function has a saturation region when x is far from 0, so information is lost when the output of the fully connected layer passes through the sigmoid function. A layer normalization operation is therefore introduced: normalization is carried out over the embedding representation dimension, and the sigmoid function is then applied, which effectively retains the text representation information;
When the input to the hyperbolic tangent in the GRU iteration is far from 0 and reaches the saturation region, the output tends towards a constant value and much semantic information is lost. The hyperbolic tangent function tanh in the GRU is therefore replaced by a layer normalization operation. The layer normalization operation is as follows: normalization is performed over the semantic representation dimension d_model; if the current word representation matrix is T, the matrix size is [1, d_model], and the values along the second dimension are, in order, t_1, t_2, ..., t_{d_model}:
the mean is: μ = (t_1 + t_2 + ... + t_{d_model}) / d_model;
the variance is: σ² = [(t_1 − μ)² + (t_2 − μ)² + ... + (t_{d_model} − μ)²] / d_model;
and each value output by the layer normalization is: LN(t_i) = (t_i − μ) / √(σ² + ε), where ε is a small constant added for numerical stability;
The layer normalization operation only translates and contracts the data and does not lose semantic information; at the same time, it normalizes the embedding representation dimension to be close to 0, which makes model training more stable. The feedforward layer neural network of the Transformer is migrated into the NGU model; the feedforward neural network comprises two linear transformations and one nonlinear transformation, the GeLU activation function, followed by a residual connection and a layer normalization operation, as sketched below;
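As an illustration only, a minimal PyTorch sketch of the feedforward sub-layer just described (two linear transformations around a GeLU nonlinearity, then a residual connection and layer normalization) could look as follows; the class name, the use of PyTorch and the default dimensions are assumptions of this sketch, not the patented implementation:

```python
import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    """Transformer-style feedforward sub-layer (sketch): two linear
    transformations with a GeLU nonlinearity, followed by a residual
    connection and layer normalization."""

    def __init__(self, d_model: int = 256, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # first linear transformation
        self.linear2 = nn.Linear(d_ff, d_model)   # second linear transformation
        self.act = nn.GELU()                      # nonlinear transformation
        self.norm = nn.LayerNorm(d_model)         # layer normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection around the two-layer feedforward network
        return self.norm(x + self.linear2(self.act(self.linear1(x))))
```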
The word embedding dimension is 256 and the hidden layer dimension of the feedforward neural network is 2048. For the model input of size [128, 1000] from step S3, each word in the text data is first given an embedding representation through token_embedding, so that at each of the 1000 word positions the embedded representation has matrix shape [128, 256], where 128 is the batch size batch_size during training and 256 is the word embedding dimension. Each word embedding in the text data X is then fed in turn into the NGU loop iteration unit, and the hidden state output at every time step is spliced together into an output tensor of dimension [128, 1000, 256]; the extraction of text features is thus completed by the NGU language model, and each word of each sentence of data in one batch is represented as a 256-dimensional vector;
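A minimal sketch of this shape flow, assuming PyTorch and using a stand-in for the NGU iteration unit (a real unit would follow the iteration formulas above); the vocabulary size of 5000 is an assumption made only so the example runs:

```python
import torch
import torch.nn as nn

batch_size, max_len, vocab_size, d_model = 128, 1000, 5000, 256

X = torch.randint(0, vocab_size, (batch_size, max_len))   # index tensor from S3, [128, 1000]
token_embedding = nn.Embedding(vocab_size, d_model)
E = token_embedding(X)                                     # embedded text, [128, 1000, 256]

def ngu_step(x_t, h_prev):
    # Stand-in for one NGU iteration; the real unit applies the layer-normalized
    # gates and the feedforward sub-layer described in step S4.
    return x_t + h_prev

h = torch.zeros(batch_size, d_model)
outputs = []
for t in range(max_len):
    h = ngu_step(E[:, t, :], h)      # [128, 256] hidden state at step t
    outputs.append(h)

H = torch.stack(outputs, dim=1)
print(H.shape)                       # torch.Size([128, 1000, 256])
```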
S5, applying the NGU language model: the proposed NGU language model is a non-pre-trained language model, and its parameters are trained for a specific natural language processing task; the text representation tensor obtained in S4 has size [128, 1000, 256], and this tensor is connected to the subsequent neural network for text classification, relation extraction, text generation or entity recognition, which is then trained.
The method has the following advantages. The GRU sigmoid activation function has a saturation region, and the intermediate fully connected layer can compress the representation of the text into this saturation region; the method therefore introduces layer normalization after the fully connected layer and before the activation function, shifting the centre of the data close to 0 so that information is retained. To further preserve text information, that is, to avoid activation functions with a saturation region as far as possible, the hyperbolic tangent function in the GRU is directly replaced by a layer normalization operation; because of the memory gating effect of the sigmoid, this does not affect the stability of model training. The normalization operation only translates the values and normalizes the variance, so the richness of the values is retained to the greatest extent and the amount of data the model can fit is increased. The feedforward neural network of the Transformer is a powerful module that performs fully connected operations over the embedding representation dimension and has a strong fitting capability. The NGU language model effectively combines the respective advantages of the GRU and the Transformer and can be applied to various natural language processing tasks. Although the idea of this patent is derived from the GRU, the hyperbolic tangent function in the GRU is replaced, layer normalization operations are introduced at multiple positions, and a feedforward fully connected neural network layer is introduced into the iteration unit, so the method is clearly different from the GRU model; this patent defines it as the NGU language model.
Detailed Description
The invention is described in further detail below with reference to a specific case:
The invention specifically relates to a text feature extraction method based on the NGU language model, which comprises the following steps:
S1, constructing a training data set: for the specific task in the project, such as text classification, relation extraction, entity recognition, text search, text writing, text completion or another natural language processing task, a training data set related to the task is collected, collated and placed into train.txt. The maximum text length input to the NGU language model is 1000, and when the text length is less than 1000, [PAD] is used to pad it to the maximum length of 1000;
S2, constructing the mapping from characters to IDs: the characters appearing in train.txt in the training set of S1 are counted and recorded as token_list. A dictionary Dict_token is then built from the characters in token_list; each key of Dict_token is a character index number and each value is a specific single character, for example Dict_token = {0: "middle", 1: "Hua", 2: "Tang", 3: "person", 4: "Min", 5: "[PAD]", ...}, where [PAD] is the completion character used when a text does not reach the maximum length;
S3, adapting the training data to the model: suppose a text sample in the training data set obtained in step S1 is "Along with the development of deep learning, artificial intelligence has gained more and more attention in the financial field, medical field and educational field." The sample is split into a list of its individual characters; since this list is shorter than the maximum length of 1000, it is padded to the maximum text length of 1000 with [PAD], then mapped into a list of index numbers through the dictionary Dict_token and converted into the model input tensor X. If the batch size fed into the NGU language model is 128, the size of X is [128, 1000], as in the sketch below.
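A minimal sketch of steps S1 to S3, written in Python purely for illustration: the file name train.txt, the [PAD] token and the shape [128, 1000] come from the text, while the helper names (build_dict, encode) and the use of PyTorch are assumptions of this sketch:

```python
import torch

MAX_LEN, PAD = 1000, "[PAD]"

def build_dict(lines):
    """S2: collect the characters of train.txt into token_list and build
    Dict_token mapping index number -> single character (plus [PAD])."""
    token_list = sorted({ch for line in lines for ch in line} | {PAD})
    return {i: ch for i, ch in enumerate(token_list)}

def encode(line, char_to_id):
    """S3: split a text into characters, pad to MAX_LEN with [PAD] and map
    every character to its index number."""
    chars = list(line)[:MAX_LEN]
    chars += [PAD] * (MAX_LEN - len(chars))
    return [char_to_id[ch] for ch in chars]

with open("train.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

Dict_token = build_dict(lines)                        # index -> character
char_to_id = {ch: i for i, ch in Dict_token.items()}  # character -> index

batch = lines[:128]                                   # one batch of 128 samples
X = torch.tensor([encode(line, char_to_id) for line in batch])
print(X.shape)                                        # torch.Size([128, 1000])
```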
S4, extracting the text features of the NGU language model: the original GRU network model iterative formula is as follows:
f is an equivalent formula of the GRU gated recurrent unit, and the detailed formula of f is as follows:
the NGU iterative formula proposed by this patent is specifically as follows:
The sigmoid function used in the GRU is plotted in Fig. 2.
The sigmoid function has a saturation region when x is far from 0, so information is lost when the output of the fully connected layer passes through the sigmoid function. The method introduces a layer normalization operation that normalizes over the embedding representation dimension before the sigmoid function is applied, which effectively preserves the richness of the text representation information.
The tanh function in the GRU also saturates when x is far from 0; its derivative there approaches 0, causing the gradient to vanish. The tanh function and its derivative are plotted in Fig. 3.
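The vanishing gradient in the saturation region can be seen directly from the derivative of the hyperbolic tangent, a standard identity reproduced here for reference:

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\tanh'(x) = 1 - \tanh^{2}(x) \;\to\; 0 \quad \text{as } |x| \to \infty
```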
When the input to the hyperbolic tangent in formula (1) is far from 0 and reaches the saturation region, the output tends towards a constant value and much semantic information is lost. This patent adopts a layer normalization (LayerNorm) operation as a replacement for the hyperbolic tangent function tanh in the GRU. The layer normalization operation is as follows: normalization is performed over the semantic representation dimension d_model. If the current word representation matrix is T, the matrix size is [1, d_model], and the values along the second dimension are, in order, t_1, t_2, ..., t_{d_model}:
the mean is: μ = (t_1 + t_2 + ... + t_{d_model}) / d_model;
the variance is: σ² = [(t_1 − μ)² + (t_2 − μ)² + ... + (t_{d_model} − μ)²] / d_model;
and each value output by the layer normalization is: LN(t_i) = (t_i − μ) / √(σ² + ε), where ε is a small constant added for numerical stability.
It can be seen that the layer normalization operation only translates and contracts the data and does not lose semantic information. At the same time, the operation normalizes the embedding representation dimension to be close to 0, which makes model training more stable. The layer normalization operation is introduced throughout the gate units, and the tanh hyperbolic tangent function in the GRU is replaced by the layer normalization operation, so the model trains faster and has good semantic representation capability. To further improve the data fitting capability of the method, the feedforward layer neural network of the Transformer is transferred into the NGU model; the feedforward neural network comprises two linear transformations and one nonlinear transformation (the GeLU activation function), followed by a residual connection and a layer normalization operation, as sketched below.
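Assembling the pieces described so far, one possible reading of a single NGU iteration unit is sketched below in PyTorch. It is only an interpretation of the textual description (layer normalization before each gate sigmoid, layer normalization replacing tanh, and a Transformer-style feedforward sub-layer with a residual connection); the class and variable names are assumptions, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class NGUCell(nn.Module):
    """One NGU iteration step (sketch): LayerNorm-before-sigmoid gates,
    LayerNorm in place of tanh, and a Transformer feedforward sub-layer."""

    def __init__(self, d_model: int = 256, d_ff: int = 2048):
        super().__init__()
        self.update = nn.Linear(2 * d_model, d_model)   # update gate z_t
        self.reset = nn.Linear(2 * d_model, d_model)    # reset gate r_t
        self.cand = nn.Linear(2 * d_model, d_model)     # candidate state
        self.norm_z = nn.LayerNorm(d_model)
        self.norm_r = nn.LayerNorm(d_model)
        self.norm_c = nn.LayerNorm(d_model)             # replaces tanh
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x_t, h_prev], dim=-1)
        z = torch.sigmoid(self.norm_z(self.update(xh)))           # update gate
        r = torch.sigmoid(self.norm_r(self.reset(xh)))            # reset gate
        cand = self.norm_c(self.cand(torch.cat([x_t, r * h_prev], dim=-1)))
        u = (1.0 - z) * h_prev + z * cand                         # GRU-style mixing
        return self.norm_out(u + self.ffn(u))                     # FFN + residual + LN
```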
In this patent, the word embedding dimension is 256 and the hidden layer dimension of the feedforward neural network is 2048. For the model input of size [128, 1000] from step S3, each word in the text data is first given an embedding representation through token_embedding; at this point the embedded representation at each of the 1000 word positions has matrix shape [128, 256], where 128 is the batch size batchsize during training and 256 is the word embedding dimension. Each word embedding in the text data X is then fed in turn into the NGU loop iteration unit, and the hidden state output at every time step is spliced together into an output tensor of dimension [128, 1000, 256]. That is, the extraction of text features is completed by the NGU language model, and each word of each sentence of data in one batch is represented as a 256-dimensional vector by the NGU language model.
S5, applying the NGU language model: the NGU language model provided by this patent is a non-pre-trained language model and needs parameter training for a specific natural language processing task; the text representation tensor obtained in S4 has size [128, 1000, 256], and this tensor is connected to the subsequent neural network for text classification, relation extraction, text generation or entity recognition and trained, which completes the application of the invention.
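As one hedged example of step S5, a simple text classification head on top of the [128, 1000, 256] representation could be attached as below; the mean pooling, the number of classes and the random stand-in tensor are choices made only for this illustration and are not specified by the patent:

```python
import torch
import torch.nn as nn

batch_size, max_len, d_model, num_classes = 128, 1000, 256, 10

# H stands in for the NGU output of step S4, shape [128, 1000, 256].
H = torch.randn(batch_size, max_len, d_model)

classifier = nn.Linear(d_model, num_classes)
sentence_repr = H.mean(dim=1)          # pool over the 1000 word positions
logits = classifier(sentence_repr)     # [128, num_classes]

labels = torch.randint(0, num_classes, (batch_size,))
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                        # trained jointly with the NGU parameters
```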
Through theoretical analysis, this patent improves on the GRU model and proposes the NGU language model. The main improvements are: a layer normalization operation is introduced before the sigmoid function in the GRU reset gate and update gate; the hyperbolic tangent function in the GRU is changed to a layer normalization operation; and the feedforward neural network of the Transformer is introduced into the NGU to further improve the semantic representation complexity of the model, that is, its data fitting capability. The NGU language model effectively combines the respective advantages of the GRU and the Transformer, and can be applied to various natural language processing tasks.