Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present application.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, if not in conflict, the features of the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
In order to facilitate understanding of the method provided in the embodiments of the present application, first, terms related to the embodiments of the present application are described:
(1) Neural network
A neural network may be composed of neural units and can be understood, in particular, as a network having an input layer, hidden layers, and an output layer, where in general the first layer is the input layer, the last layer is the output layer, and all middle layers are hidden layers. A neural network with many hidden layers is called a deep neural network (deep neural network, DNN). From a physical perspective, the operation of each layer in the neural network can be described by the mathematical expression y = a(W·x + b), and can be understood as completing a transformation from the input space into the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are accomplished by "W·x", operation 4 by "+b", and operation 5 by "a()". The word "space" is used here because the object to be classified is not a single thing but a class of things; space refers to the collection of all individuals of that class. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e., the W of each layer of the neural network controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Thus, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
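As a purely illustrative aid (not part of the claimed embodiments), the following minimal Python sketch shows the per-layer operation y = a(W·x + b) described above, with ReLU assumed as the activation a() and arbitrary illustrative dimensions:

```python
import numpy as np

def layer_forward(x, W, b):
    """One neural-network layer: y = a(W.x + b), with a() taken to be ReLU here."""
    z = W @ x + b              # W.x handles dimension change/scaling/rotation; +b handles translation
    return np.maximum(z, 0.0)  # the nonlinearity a() performs the "bending" of the space

# illustrative shapes: map a 4-dimensional input vector to a 3-neuron layer output
rng = np.random.default_rng(0)
x = rng.standard_normal(4)         # input vector
W = rng.standard_normal((3, 4))    # weight matrix of the layer (one row per neuron)
b = rng.standard_normal(3)         # bias vector
y = layer_forward(x, W, b)
print(y.shape)                     # (3,)
```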
It should be noted that, in the embodiments of the present application, the neural network is essentially the model employed by the machine learning task. Common components of a neural network include a convolution layer, a pooling layer, a normalization layer, a deconvolution (transposed convolution) layer, and the like. A model is designed and obtained by assembling these common components; when the model parameters (the weight matrices of all layers) are determined such that the model error meets a preset condition, or the number of parameter adjustments reaches a preset threshold, the model converges.
The convolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding stride for performing a convolution operation on the image. The purpose of the convolution operation is to extract different features of the input image: the first convolution layer may extract only low-level features such as edges, lines and corners, while deeper convolution layers iteratively extract more complex features from those low-level features. A downsampling convolution layer maps a high-dimensional space to a low-dimensional space while maintaining the connections/patterns between them (connection here refers to the connection at the time of convolution).
The deconvolution layer (also referred to as an upsampling convolution layer) is used to map a low-dimensional space to a high-dimensional space while maintaining the connections/patterns between them (connection here refers to the connection at the time of convolution). Similarly, the deconvolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding stride to perform a deconvolution operation on the image. Typically, a framework library for designing neural networks (e.g., the PyTorch library) has an upsample() function built in, and by calling this upsample() function, a low-dimensional to high-dimensional spatial mapping can be achieved.
The pooling layer (pooling) simulates the way the human visual system reduces the dimensionality of data or represents an image with higher-level features. Common pooling operations include max pooling, mean pooling, stochastic pooling, median pooling, combined pooling, and the like. Typically, pooling layers are periodically inserted between the convolution layers of a neural network to achieve dimensionality reduction.
The normalization layer is used to perform a normalization operation on all intermediate neurons to prevent gradient explosion and gradient vanishing.
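For illustration only, the following minimal PyTorch sketch assembles the common components just described (a convolution layer, a normalization layer, a pooling layer, and a transposed-convolution/upsampling layer); the channel counts and kernel sizes are assumptions and not taken from any embodiment:

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Illustrative assembly of the common components described above."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)   # convolution layer
        self.norm = nn.BatchNorm2d(16)                                     # normalization layer
        self.pool = nn.MaxPool2d(kernel_size=2)                            # pooling layer (downsampling)
        self.deconv = nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2)   # deconvolution (transposed convolution)
        # nn.Upsample(scale_factor=2) would be the built-in upsampling alternative mentioned above

    def forward(self, x):
        x = torch.relu(self.norm(self.conv(x)))
        x = self.pool(x)         # map to a lower spatial resolution
        return self.deconv(x)    # map back to the original resolution

out = TinyConvNet()(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```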
(2) Loss function
In the process of training a neural network, because the output of the neural network is expected to be as close as possible to the value actually desired, the weight matrix of each layer can be updated according to the difference between the predicted value of the current network and the actually desired target value (an initialization process is usually performed before the first update, that is, parameters are preconfigured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight matrices are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. Thus, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (loss), the larger the difference, and training the neural network becomes the process of reducing this loss as much as possible.
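By way of illustration, the following minimal PyTorch sketch shows how a loss function drives the weight updates described above; the single linear layer, the mean-squared-error loss and the SGD optimizer are assumptions chosen only to keep the example small:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                              # a one-layer "network" for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                               # measures the predicted-vs-target difference

x, target = torch.randn(16, 8), torch.randn(16, 1)
for step in range(100):
    pred = model(x)                                  # forward pass: current predicted value
    loss = loss_fn(pred, target)                     # higher loss means a larger difference
    optimizer.zero_grad()
    loss.backward()                                  # gradients of the loss w.r.t. the weight matrix
    optimizer.step()                                 # adjust the weights so that the loss decreases
```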
(3) Attention mechanism
The principle of the attention mechanism (Attention Mechanism) is to mimic the human visual and cognitive system, improving processing efficiency and accuracy by giving more attention to the important parts when processing information.
In a conventional neural network, the output of each neuron depends only on the outputs of all neurons of the previous layer, whereas with an attention mechanism, the output of each neuron depends not only on the outputs of all neurons of the previous layer but can also be weighted according to different parts of the input data, i.e., different parts are given different weights. In this way, the model can pay more attention to the key information in the input sequence, improving its accuracy and efficiency. It should be noted that the attention mechanism is not a specific neural network structure but a general mechanism, and it may be applied to different neural network structures. For example, attention mechanisms may be used in convolutional neural networks to focus on important regions of the input image, or in recurrent neural networks to focus on important portions of the input sequence.
The core idea of the attention mechanism is to enable the model to automatically decide which parts of the information are important and to give more attention to these parts, depending on the requirements of the current task, when processing the sequence data. This mechanism allows the model to focus more on the information most relevant to the current task by calculating the correlation between each element in the input sequence and the current task goal, assigning a weight to each element.
In deep learning, the implementation of the attention mechanism typically includes the following steps: feature extraction, i.e., extracting features of each element in the input sequence so that the model can understand the meaning of the elements; attention-score calculation, i.e., computing the similarity or correlation between each element and the current task target (such as the current word or the current state) to obtain an attention score, which represents the importance of the element to the current task; and weighted summation, i.e., weighting the feature vectors of all elements by their attention scores and summing them to obtain a new vector. This vector contains a summary of the important information in the input sequence and can be used by the model for subsequent prediction or classification tasks.
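As an illustrative sketch of the three steps just listed (feature extraction, attention-score calculation, weighted summation), the following assumes scaled dot-product attention and arbitrary feature sizes; it is not the specific attention used in the embodiments:

```python
import torch
import torch.nn.functional as F

def simple_attention(features, query):
    """features: (seq_len, d) one feature vector per element; query: (d,) current task target."""
    scores = features @ query / features.shape[-1] ** 0.5   # step 2: relevance of each element to the task
    weights = F.softmax(scores, dim=0)                      # normalize the scores into attention weights
    return weights @ features                               # step 3: weighted sum -> summary vector

features = torch.randn(5, 64)    # step 1: extracted features of 5 input elements
query = torch.randn(64)
context = simple_attention(features, query)
print(context.shape)             # torch.Size([64])
```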
(4) Multi-headed attention mechanism
The multiple heads in the multi-head attention mechanism differ from the convolution kernels of the multiple convolution layers in a convolutional neural network: the multiple convolution layers of a convolutional neural network are equivalent to duplicating a single convolution network num_layers times, and each convolution layer can operate independently. Multi-head attention can instead be understood as splitting the input feature values into more finely divided small blocks, assigning a separate trainable weight parameter to each block, and then sharing the same hidden layer to output the result; each head cannot be regarded as a complete, independent encoding-decoding structure operating on its own.
FIG. 1 is a schematic diagram of an operating environment of a method for training a dialogue model according to an embodiment of the present invention, including an electronic device 10.
The electronic device 10 is a device capable of automatically processing massive data at high speed according to a program, and is generally composed of a hardware system and a software system, for example: computers, servers, etc. The electronic device 10 may be a local device or a cloud device, for example: a cloud server, cloud host, cloud service platform, cloud computing platform, and the like.
On the basis of fig. 1, other embodiments of the present invention provide an electronic device 10. Please refer to fig. 2, which is a hardware configuration diagram of the electronic device 10 provided in an embodiment of the present invention. Specifically, as shown in fig. 2, the electronic device 10 includes at least one processor 11 and a memory 12 that are communicatively connected (in fig. 2, connection via a bus and one processor are taken as an example).
The processor 11 is configured to provide computing and control capabilities to control the electronic device 10 to perform corresponding tasks, for example, to control the electronic device 10 to perform any one of the methods for training a dialogue model provided in the following inventive embodiments or any one of the methods for intelligent dialogue provided in the following inventive embodiments.
It is appreciated that the processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 12, as a non-transitory computer-readable storage medium, is used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the method for training a dialogue model in the embodiments of the present invention, or program instructions/modules corresponding to the dialogue implementation method in the embodiments of the present invention. By running the non-transitory software programs, instructions and modules stored in the memory 12, the processor 11 may implement the method of training a dialogue model in any of the method embodiments described below, and may implement the method of intelligent dialogue in any of the method embodiments described below. In particular, the memory 12 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 12 may also include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In the following, a method for training a dialogue model according to an embodiment of the present invention is described in detail, referring to fig. 3, the method S100 includes, but is not limited to, the following steps:
S101: a plurality of groups of dialogue samples are obtained, wherein the dialogue samples comprise at least one question sentence and at least one reply sentence, and the last reply sentence is marked as a real label.
Among the currently common intelligent dialogue methods, deep-learning-based methods are the main focus of research, and the replies they generate are more personalized and natural. Because reply content generated by a deep-learning method is obtained through training on dialogue samples, each obtained dialogue sample includes at least one question sentence; and because the learnable parameters need to be adjusted according to the real label and the predicted label during training of the dialogue model, each dialogue sample also includes at least one reply sentence, and that reply sentence is marked as the real label. For example, dialogue sample 1 {"How is the weather today?", "The weather is good today."}, where "The weather is good today." is the real label.
The dialogue samples form a text collection composed of a plurality of dialogue segments, where the text collection contains multiple segments of dialogue and each segment of dialogue contains at least one question sentence and at least one reply sentence, used for training the required dialogue model. For example:
Dialogue sample 2 ["Have you eaten?", "I have eaten." (real label)]; dialogue sample 3 ["How are you today?", "Very good today." (real label)]; dialogue sample 4 ["How is your mood today?", "My mood is good today." (real label)]
A sentence that ends with a question mark, has interrogative features, and requires a reply is a question sentence; the sentence given in response to the corresponding question sentence is a reply sentence. For example, dialogue sample 5 {"Is the television series good? (question sentence)", "The television series is very good. (reply sentence)"}
The reply sentence corresponding to the target question sentence is the real label, which represents the correct answer or real state of each sample in the dialogue sample set and represents the expected value of the prediction result that the dialogue model learns. In deep learning, the real label is the reference point for calculating the accuracy of model predictions and adjusting model parameters; the model parameters are continuously adjusted during training so that the model prediction is as close to the real label as possible.
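Purely for illustration, the dialogue samples described above might be organized in code as follows; the field names ("turns", "label") are hypothetical and the sentences paraphrase dialogue samples 2 and 3:

```python
# Hypothetical in-memory layout for the dialogue samples described above.
dialogue_samples = [
    {"turns": ["Have you eaten?", "I have eaten."],
     "label": "I have eaten."},            # real label: the last reply sentence
    {"turns": ["How are you today?", "Very good today."],
     "label": "Very good today."},
]

for sample in dialogue_samples:
    questions, real_label = sample["turns"][:-1], sample["label"]
    # the questions feed the model; the real label is what the prediction is compared against
```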
S102: global text at the global level and speaker text at the speaker level are obtained from the dialog samples at the level.
Conversation scenes in daily life include not only two-person conversation scenes but also multi-person conversation scenes, and a dialogue sample may include more than one speaker. Therefore, text content can be extracted from the dialogue samples at the global level and at the speaker level respectively, obtaining global text at the global level and speaker text at the speaker level, and the obtained global text and speaker text are input into the generation network for training.
The global level refers to inputting the text content of a dialogue sample as a whole into the generation network for training; the speaker level refers to extracting the text content of the dialogue sample according to the text corresponding to each speaker, and then separately inputting each extracted single-speaker text set into the dialogue model for training. For example:
Dialogue sample 6 (global level) {A: "Have you eaten?" B: "I have eaten, and you?" A: "I haven't eaten yet; how was your meal?"} corresponds, at the speaker level, to speaker A {"Have you eaten?", "I haven't eaten yet; how was your meal?"} and speaker B {"I have eaten, and you?"}. The above only takes a dialogue between two speakers as an example; it can be understood that a dialogue sample may also be a dialogue among 3 or 4 speakers.
By dividing the text content of the dialogue sample into a global level and a speaker level and extracting text at each level, and then inputting the obtained global text and speaker text into the generation network for training, the generation network can better understand the internal logical relationships of the input text, improving the accuracy and consistency of the generated replies.
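The following minimal sketch illustrates one possible way to derive the global text and the speaker text from a dialogue such as dialogue sample 6; the (speaker, sentence) data layout is an assumption made only for this example:

```python
def split_by_level(dialogue):
    """dialogue: list of (speaker, sentence) pairs, e.g. [("A", "..."), ("B", "...")]."""
    global_text = [sentence for _, sentence in dialogue]     # global level: the whole dialogue in order
    speaker_text = {}                                        # speaker level: one text set per speaker
    for speaker, sentence in dialogue:
        speaker_text.setdefault(speaker, []).append(sentence)
    return global_text, speaker_text

dialogue6 = [("A", "Have you eaten?"),
             ("B", "I have eaten, and you?"),
             ("A", "I haven't eaten yet; how was your meal?")]
global_text, speaker_text = split_by_level(dialogue6)
# global_text -> all three sentences; speaker_text -> {"A": [two sentences], "B": [one sentence]}
```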
And S103, inputting the global text and the speaker text into a generating network to perform encoding and decoding processing to obtain a prediction tag.
In the generation network, the encoder performs the encoding process, i.e. it is responsible for processing the input sequence and converting it into a fixed-length internal representation that captures the key information of the input sequence. During processing, the encoder progressively compresses the input information in order to encode all necessary information into an abstract representation; the decoder performs the decoding process, i.e. the task of the decoder is to convert the internal representation of the encoder output into the target sequence, the decoder generating the output sequence step by step, each step possibly depending on the output of the previous step and the internal representation delivered from the encoder. During the generation process, the decoder progressively unwraps the encoder compressed information, converting it into meaningful output.
It can be understood that the predictive label is a predictive reply of the generating network to the input text content, and is used for comparing with the real label, and adjusting the adjustable parameters of the generating network according to the difference value between the predictive label and the real label, so as to improve the accuracy of generating the reply by the generating network.
In some embodiments, referring to fig. 4, the generating network includes an encoder and a decoder, the encoder including a representation unit and an understanding unit.
The representation unit respectively encodes the input global text and the speaker text to obtain global vector representation and speaker vector representation.
The understanding unit comprises a plurality of understanding branches which respectively correspond to the global level and each speaker level in the dialogue sample, and each understanding branch comprises a plurality of cascaded reasoning modules, and context feature extraction is carried out on vector representations of the corresponding levels to obtain context feature vectors of the corresponding levels.
And the decoder decodes the context feature vector corresponding to the global layer and the context feature vector corresponding to the speaker layer according to the generated word vector to obtain the prediction tag.
After text is extracted from the dialogue sample by level, the encoder and the decoder encode and decode the extracted text content to obtain the prediction label. The representation unit of the encoder converts the input text content into a vector representation: the vector representation obtained from the global text via the representation unit is the global vector representation, and the vector representation obtained from the speaker text via the representation unit is the speaker vector representation. In some embodiments, the representation unit comprises a plurality of downsampling convolution layers.
The understanding unit is part of an encoder, which contains a plurality of understanding branches, and the understanding branches are composed of a cascade of a plurality of inference modules. The above-described understanding unit gradually mines and fuses context information in the dialog samples through the inference module, helping the generation network to better understand the context of the dialog samples.
In some embodiments, the above-mentioned understanding branches respectively correspond to vector representations of different levels (as shown in fig. 4), the vector representations of different levels are input into the understanding unit, and the reasoning process and the searching process are iteratively executed through the reasoning module of each corresponding understanding branch, so as to help the generating network to better understand the inherent logic of the vector representations of the input understanding unit, and improve the accuracy and consistency of generating replies.
In some embodiments, the representation unit includes a word segmentation encoding module, a semantic parsing module, and a position embedding module, where the word segmentation encoding module is configured to perform word segmentation operation on each sentence in the input text, and encode each word to obtain a sentence code.
The semantic analysis module is used for carrying out semantic analysis on the sentence codes to obtain semantic vectors; semantic vectors corresponding to the sentences in the input text form a semantic vector set of the input text.
The position embedding module is used for introducing position information of each semantic vector to the input semantic vector set to obtain vector representation of the input text.
After the obtained global text and speaker text are input into the representation unit, the word segmentation encoding module first obtains the sentence code corresponding to the input based on methods well known to those skilled in the art, such as a word segmentation method based on character-string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics. In some embodiments, the word segmentation encoding module cuts the text content into minimal semantic units (tokens), then converts each token into a numerical id, i.e., a position code, and feeds the numerical ids to the model for learning.
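As a toy illustration of cutting text into tokens and converting them into numerical ids, the following sketch assumes simple whitespace tokenization and a small ad-hoc vocabulary rather than any of the word-segmentation methods named above:

```python
def build_vocab(sentences):
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in sentence.lower().split():     # toy segmentation into minimal units (tokens)
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Map each token to its numerical id -- the sentence code fed to the model."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.lower().split()]

vocab = build_vocab(["how is the weather today", "the weather is good today"])
print(encode("is the weather good", vocab))        # [3, 4, 5, 7]
```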
The semantic parsing module vectorizes the sentence codes to obtain vector representations of the corresponding sentences. In some embodiments, the semantic parsing module includes a bidirectional gated recurrent unit (Bi-GRU), which is a variant of the conventional recurrent neural network (RNN). A GRU includes a reset gate and an update gate, and the amount of information passed to the next time step is controlled by the gating mechanism, so that semantic associations across long sequences can be captured effectively.
The sentence code output by the word segmentation encoding module is input into the Bi-GRU, which outputs the vector representation of the sentence code, i.e., the semantic vector. The semantic vectors corresponding to the sentences of the input text form the semantic vector set of the input text. For example, the Bi-GRU vectorizes the input sentence code and takes the last hidden state as the representation of the sentence, i.e., the semantic vector of the current sentence is $s_i$; the semantic vector representations of the N sentences are collected into a set, i.e., the semantic vector set of the input text is $S = \{s_1, s_2, \ldots, s_N\}$.
It will be appreciated that Bi-GRU utilizes both forward and backward information flows, enabling the model to better understand and process complex timing data. Bi-GRU is better able to handle tasks with contextual relevance, such as speech recognition, machine translation, etc., than RNNs. In addition, bi-GRU has advantages in training and deployment because of its relatively simple structure and fewer parameters.
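For illustration, the following PyTorch sketch encodes one sentence of token ids with a bidirectional GRU and takes the final hidden states of the two directions as the sentence's semantic vector; the embedding and hidden sizes are assumptions:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)
bigru = nn.GRU(input_size=64, hidden_size=128, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, 1000, (1, 12))              # one sentence of 12 token ids
outputs, h_n = bigru(embed(token_ids))                   # h_n: (2, batch, 128), one final state per direction
semantic_vector = torch.cat([h_n[0], h_n[1]], dim=-1)    # last hidden states -> sentence representation (1, 256)
```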
In natural language processing tasks, when sequence data is processed, position embedding ensures that the position information corresponding to each element in the sequence can be captured by the model, so that the semantics and structure of the sequence can be understood. Position embedding is a technique that encodes the position information of each element in the sequence into vector form; in some embodiments, it includes a position encoding that captures the relative distance and order between different positions by using sine and cosine functions of different frequencies. For example, the positions of the semantic vector set are encoded and vectorized to obtain a position encoding vector set $P = \{p_1, p_2, \ldots, p_N\}$; combining the position encoding vector set with the semantic vector set yields the set of corresponding sentence vector representations, i.e., $C = \{s_1 + p_1, s_2 + p_2, \ldots, s_N + p_N\}$.
The above-described position-coded vectors can serve as additional features of the model input, helping the model to better understand the structure and meaning of the sequence, and thus better handle the order and relevance of the sequence data.
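A minimal sketch of the sine/cosine position encoding described above, and of adding it to the semantic vectors, is given below; the dimensions are illustrative:

```python
import math
import torch

def sinusoidal_positions(num_positions, dim):
    """Position encodings built from sines and cosines of different frequencies."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

semantic_vectors = torch.randn(10, 256)                                 # N = 10 sentence semantic vectors
sentence_vectors = semantic_vectors + sinusoidal_positions(10, 256)     # combine position and semantics
```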
In some embodiments, the representation unit further comprises a first attention module for performing attention processing on the input vector representation; the vector representation output by the first attention module is the final vector representation of the input text.
In some embodiments, the first attention module includes a self-attention mechanism. The self-attention mechanism is dedicated to handling relationships inside a single sequence, such as relationships between words in a sentence and/or relationships between sentences in a text paragraph. The self-attention mechanism allows the model to learn the dependency relationships inside the sequence, such as word-to-word relationships and relationships between parts of a sentence, by calculating the degree of attention of each element in the sequence to all other elements. For example, the set of corresponding sentence vector representations $C$ is processed by the self-attention mechanism to obtain the vector representation of the corresponding level, i.e., $H = \mathrm{SelfAttention}(C)$. It will be appreciated that the vector representation of the speaker A level and the vector representation of the speaker B level are, respectively, $H^{A} = \mathrm{SelfAttention}(C^{A})$ and $H^{B} = \mathrm{SelfAttention}(C^{B})$.
Self-attention mechanisms are a special form of attention mechanisms that reduce reliance on external information, and are more adept at capturing the correlation inside data or features. In this embodiment, the self-attention mechanism provides a powerful tool for processing complex sequence data problems through its efficient internal sequence modeling capability, not only improving the performance of the model, but also greatly enhancing the capability of the model to process long-distance dependencies.
In some embodiments, the inference module includes a feature extraction module, a second attention module, and a fusion module, where the feature extraction module is configured to perform feature extraction on an input feature vector;
the second attention module is used for carrying out attention processing on the feature vector input by the feature extraction module and the feature vector output by the feature extraction module.
The fusion module is used for carrying out fusion processing on the feature vector output by the feature extraction module and the feature vector output by the second attention module to obtain a context feature vector of a corresponding layer.
In some embodiments, the feature extraction module includes a bidirectional long short-term memory network (Bi-LSTM).
The bidirectional long short-term memory network is a variant of the long short-term memory network (LSTM). A Bi-LSTM does not change the internal structure of the LSTM; it applies the LSTM twice, in different directions, and concatenates the results of the two passes as the final output.
In this embodiment, the bidirectional long short-term memory network is adopted to simulate the human cognitive reasoning process. Its output vector is computed by the Bi-LSTM over the input vector representations of the corresponding level, and its initial state is initialized from the global-level context representation through a transformation whose weight and bias are learnable parameters.
The structure of the Bi-LSTM can capture specific preceding or following features in the language grammar, enhancing semantic associations. Learning the inherent logical order between the vector representations of the corresponding level through the bidirectional Bi-LSTM not only further extracts features from the sequence and captures more complex sequence structures and dependency relationships, but also makes it possible to process variable-length sequences and/or to batch sequences of different lengths.
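As a small illustration of the feature-extraction step, the following PyTorch sketch runs a bidirectional LSTM over a sequence of sentence vectors; the sizes are assumptions:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=256, hidden_size=128, bidirectional=True, batch_first=True)

sentence_vectors = torch.randn(1, 10, 256)     # a sequence of level-specific sentence vectors
features, _ = bilstm(sentence_vectors)         # (1, 10, 256): forward and backward features concatenated
```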
The second attention module includes an attention mechanism, as shown in fig. 5. The attention mechanism helps the model focus on important features through weight distribution: it assigns a weight to each input item of the generation network, representing the degree of attention the generation network pays to that part, and the weight can be calculated through a softmax function, for example:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

where Q, K and V respectively denote the Query, Key and Value matrices of the input sentence, each row of the matrices being the Query, Key or Value vector corresponding to one word, and $d_k$ denotes the vector length.
The second attention module matches the vector representation of the global level through an attention mechanism, namely, simulates a retrieval process, helps the dialogue model to pay attention to important features more accurately through the process, and enhances the judgment capability and decision capability of the dialogue model in a key part.
Illustratively, in the training of the dialogue model, the detailed calculation for the t-th word in the output prediction label is as follows: f() is a vector-multiplication function computed from the matrix multiplication of Q and K. Q is multiplied by the transpose of K and a softmax is applied (normalizing the scores of all words so that they are all positive and sum to 1), and the resulting softmax scores are then multiplied by the corresponding Value vectors, i.e., $\mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$.
The fusion module includes a residual link and normalization, as shown in fig. 5. The residual link adds the input and the output of the network, increasing the depth of the network without losing the initial features; normalization performs linear or nonlinear scaling on the input or output of a network layer and maps it into a specific range or distribution, improving the training stability and performance of the generation network. Since the residual link adds in the original features, the generation network could still suffer from the vanishing-gradient problem; the normalization processing therefore reduces computation and lowers the risk of the network gradient vanishing.
In some embodiments, the fusion module fuses the output $R$ of the reasoning process and the output $A$ of the retrieval process to obtain a fused vector $F$, which is input to the next reasoning module to continue the reasoning and retrieval steps, for example:

$F = \lambda \odot R + (1 - \lambda) \odot A$

where $\lambda$ is a weight coefficient and $\odot$ denotes element-wise multiplication. In summary, given the global-level context representation and the number of reasoning rounds, the entire understanding unit can be expressed as the cascaded reasoning modules applied iteratively to that representation.
It will be appreciated that, combined with the context representation of the speaker A level and the context representation of the speaker B level, the whole understanding unit is applied in the same way to each level. The utterance representation after the understanding unit is the concatenation of the output vectors of the global-level and speaker-level branches.
Residual connection allows the original input information to be directly transferred to deeper layers by introducing an identity mapping (identity mapping), thus alleviating the vanishing-gradient problem to some extent and making it easier for the network to learn the identity mapping or near-identity transformations during training, which helps the model converge to the optimal solution more quickly. Through residual connections, the model can more easily learn the nonlinear transformation of the input data, thereby improving its representation capability.
Normalization makes the input distribution of each layer relatively stable by standardizing the activation values of each layer, accelerating the training process of the model; at the same time, the standardization reduces the model's dependence on parameter initialization and gives it stronger robustness to changes in the distribution of the input data, thereby improving the stability of the model and making it easier to find the optimal solution.
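Purely as an illustration of combining a reasoning output and a retrieval (attention) output with a residual link and normalization, the following sketch uses a learnable element-wise gate as the weight coefficient; this gating choice is an assumption, not the formula of the embodiment:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Illustrative fusion of a reasoning output and a retrieval (attention) output."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.full((dim,), 0.5))      # learnable weight coefficient (assumed form)

    def forward(self, reasoning_out, retrieval_out):
        fused = self.gate * reasoning_out + (1 - self.gate) * retrieval_out   # element-wise weighted fusion
        return self.norm(fused + reasoning_out)                # residual link keeps the input features, then normalize

block = FusionBlock(256)
out = block(torch.randn(1, 10, 256), torch.randn(1, 10, 256))
```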
In some embodiments, the decoder includes a third attention module, a feed forward neural network, and a recurrent neural network;
The third attention module is used for carrying out attention processing on the generated word vector, the context feature vector corresponding to the global level and the context feature vector corresponding to the speaker level;
The feedforward neural network is used for mapping the output of the third attention module;
and the cyclic neural network decodes the output of the feedforward neural network to obtain a prediction tag.
The third attention module includes a self-attention mechanism and a multi-head self-attention mechanism. The multi-head self-attention mechanism is an extension of the attention mechanism: it uses multiple sets of different query (q), key (k) and value (v) matrices for each position of the input sequence, calculates the attention weights between each position and the other positions, and then applies the weighting to the sequence.
The generated word vector refers to the vector representation of the reply words that the generation network has already produced before the current time; for example, if the word currently being generated is the t-th word, the generated word vector is the vector representation of the first through the (t-1)-th words. The vector representation of the generated words is passed as input through the third attention module, obtaining a representation processed by multi-head self-attention. For example, to generate the t-th word $y_t$, the previously generated $t-1$ words $y_1, \ldots, y_{t-1}$ are first taken as input and embedded, using the embedding vectors of the already-generated words of the target response, to obtain the vector representation of the generated words.
The multi-head self-attention mechanism carries out attention distribution of different dimensions on input information in parallel by adding a plurality of attention heads on the basis of the self-attention mechanism, and processes the plurality of attention distribution in parallel, so that the attention capturing capability of the model is further enhanced, more abundant characteristics and contextual information are captured, and the expression capability and learning efficiency of the model are improved.
A feed-forward neural network (FNN) is a computational model of many interconnected neurons, in which each neuron receives input signals from other neurons and passes its output signal onward. Unlike an RNN, the information processing of an FNN is unidirectional, passing in sequence from the input layer to the output layer.
The basic structure of the FNN includes an input layer, a hidden layer, and an output layer, as well as corresponding activation functions, weights, and biases. These components together form the overall view of the network, with the input layer responsible for receiving raw data, typically corresponding to the dimensions of the features; the hidden layer comprises one or more layers, each layer is composed of a plurality of neurons and is used for extracting abstract characteristics of input data; the output layer generates a final prediction or classification result of the network; the activation function introduces nonlinear characteristics to the feedforward neural network, so that the network can learn complex functions; the weight is connected with the linear factors of the neurons of each layer, and controls the flow of information among the neurons; the bias allows neurons to activate without input, increasing the flexibility of the model. Weights and biases are learnable parameters of the neural network that can be continually adjusted during training to minimize prediction errors.
The input to the feed-forward layer first undergoes a linear transformation that maps it into a high-dimensional space; this linear transformation is typically implemented by a weight matrix and a bias vector. After the linear transformation, the input passes through an activation function, e.g., ReLU, Sigmoid or Tanh, which increases the nonlinear expressiveness of the model. After the activation function, another linear transformation maps the features of the high-dimensional space back to the original space, yielding the output of the feed-forward layer.
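A minimal sketch of such a feed-forward layer (linear map to a higher dimension, activation, linear map back) is shown below; ReLU and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.up = nn.Linear(dim, hidden_dim)       # linear transform into a higher-dimensional space
        self.act = nn.ReLU()                       # nonlinear activation
        self.down = nn.Linear(hidden_dim, dim)     # linear transform back to the original space

    def forward(self, x):
        return self.down(self.act(self.up(x)))

ffn = FeedForward(dim=256, hidden_dim=1024)
out = ffn(torch.randn(1, 10, 256))                 # same shape as the input
```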
It will be appreciated that, in some embodiments, the third attention module uses another multi-head attention that takes the representation of the response history as the query and the context feature vectors as the keys and values; the result is then passed through a feed-forward neural network to output a representation.
The feedforward layer can extract complex characteristics of input data through linear transformation and an activation function, and is helpful for generating a network to better understand the data. By increasing the number of hidden units of the feedforward layer, the complexity of the generation network can be increased, and the expression capability of the generation network can be improved. In deep neural networks, gradient disappearance is a common problem that the activation function in the feed forward layer can alleviate so that the generation network can better generate the reply response.
A recurrent neural network (RNN) is an artificial neural network for sequence data or time-series data; it has a memory mechanism that can store previous information for use when processing the sequence. The recurrent neural network includes gated recurrent units (GRUs), which address the RNN's inability to memorize over long ranges and its gradient problems in back-propagation: the GRU combines the forget gate and the input gate into a single "update gate", merges the cell state and the hidden state, and makes some other changes, resulting in a more simplified model. The core flow of the GRU is reset gate -> update gate -> candidate hidden state -> hidden state, where the reset gate helps capture short-term dependencies in the sequence and the update gate helps capture long-term dependencies in the sequence.
In some embodiments, a GRU model is employed as the decoder to generate replies; the decoder's decoding process can be represented by the following formula:

$h_t = \mathrm{GRU}(h_{t-1}, x_t)$

where $h_t$ is the hidden state of the GRU at time t and $x_t$ is the decoder input at time t (the representation output by the feed-forward neural network).
Compared with the common RNN, the GRU has simpler structure and better performance, requires less training time than other types of circulating neural networks, can effectively capture long-distance dependency in the sequence, and has better adaptability to the variable-length sequence.
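For illustration, the following sketch shows one GRU decoding step; concatenating the previous word embedding with an attended context vector is a common practice assumed here, not necessarily the exact input of the embodiment:

```python
import torch
import torch.nn as nn

embed_dim, ctx_dim, hidden_dim, vocab_size = 128, 256, 256, 1000
gru_cell = nn.GRUCell(embed_dim + ctx_dim, hidden_dim)
out_proj = nn.Linear(hidden_dim, vocab_size)

prev_word_emb = torch.randn(1, embed_dim)      # embedding of the word generated at step t-1
context = torch.randn(1, ctx_dim)              # attended context delivered from the encoder side
h_prev = torch.zeros(1, hidden_dim)

h_t = gru_cell(torch.cat([prev_word_emb, context], dim=-1), h_prev)   # hidden state at step t
logits = out_proj(h_t)                         # scores over the vocabulary for the t-th word
```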
S104, calculating the loss between the predicted label and the real label, and carrying out iterative training on the generated network according to the loss sum corresponding to the plurality of groups of dialogue samples until convergence to obtain a dialogue model.
Loss is an overall indicator of prediction inaccuracy of a machine learning model across the entire dataset, and model parameters can be optimized and prediction performance improved by minimizing loss. The specific calculation of the loss is completed through a loss function, the loss function receives a prediction label and a real label of the model as inputs and outputs a scalar value, namely a loss value, and the loss values corresponding to the elements of the data set are summed to obtain a loss sum, namely the loss sum represents the overall prediction error of the model on the whole data set.
The Loss Function (Loss Function) is a Function used in machine learning and deep learning to measure the difference between model predictive and real labels. Different tasks and models may require different loss functions. The loss function is located between forward propagation and backward propagation of the machine learning model, and in the forward propagation stage, the model generates a predicted value according to the input characteristics; the loss function receives the predicted values and calculates the difference between the predicted values and the true values; the differences are then used in the backward propagation stage to update the parameters of the model and reduce the next prediction error.
In some embodiments, the output of the FNN is predicted through a softmax layer. Softmax is typically used as the activation function of the output layer of a neural network, particularly in multi-class classification problems; its function is to translate the raw class scores into a probability distribution such that the probabilities of all classes sum to 1. The output of the neural network can thus be interpreted as the probability of each class, and the calculation formula of the log-loss function penalizes erroneous classifications, achieving an accurate measure of the classifier. For example, word probabilities are obtained from the hidden-layer representation $h_t$ of the aforementioned gated recurrent unit (GRU) at time t as follows:

$p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_o h_t + b_o)$

where $W_o$ and $b_o$ are trainable parameters. The log-likelihood of the corresponding reply response sequence $Y = (y_1, \ldots, y_T)$ is:

$\log p(Y \mid X) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X)$
In some embodiments, the loss function of the generation network is a logarithmic loss function. Log loss is a commonly used loss function widely applied in training classification models; the closer the model's predicted probability is to the actual label, the smaller the loss, so it can help the model fit the data better and improve classification accuracy.
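As a small illustration of the logarithmic (negative log-likelihood) loss over predicted word distributions, the following sketch assumes a 1000-word vocabulary and a 7-word reply:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 7, 1000, requires_grad=True)   # scores for a 7-word reply over a 1000-word vocabulary
targets = torch.randint(0, 1000, (1, 7))                # word ids of the real label (reference reply)

log_probs = F.log_softmax(logits, dim=-1)               # softmax turns the scores into probability distributions
loss = F.nll_loss(log_probs.view(-1, 1000), targets.view(-1))   # negative log-likelihood = log loss
loss.backward()                                         # in real training this would update the network parameters
```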
In summary, in the dialogue model training method of the embodiments of the present application, text is extracted from the dialogue samples by level, the vector representation at the global level and the vector representation at the speaker level are obtained and input into the understanding unit, and the understanding unit reasons over and retrieves from the dialogue samples by level, which allows the generation network to understand the dialogue samples from multiple angles, better grasp their internal logic, and improve the consistency of the generated replies. At the same time, the understanding unit feeds the input global-level vector representation and speaker-level vector representation into the corresponding understanding branches, and the cascaded reasoning modules in each branch iteratively reason over and retrieve from the input vector representations, so that the generation network fully learns the internal logical information of the dialogue samples, gradually mines and fuses the context information in the dialogue samples, and improves the accuracy of the generated replies.
The session implementation method provided by the embodiment of the present application is described below in connection with exemplary applications and implementations of the terminal provided by the embodiment of the present application. Referring to fig. 6, fig. 6 is a flow chart of a method for implementing a dialogue according to an embodiment of the present application. The method S200 comprises the steps of:
S201, a dialogue to be replied is acquired.
The dialogue to be replied is the dialogue text to be replied, and the final sentence is a question sentence. In some embodiments, the chat scene where the to-be-replied conversation is located may be a multi-person chat or a double-person chat.
S202, inputting the dialogue to be replied into a dialogue model and outputting a reply sentence.
The dialogue model is trained by adopting the method for training a dialogue model in any one of the above embodiments. The sentence to be replied is input into the dialogue model, and the dialogue model outputs the corresponding reply sentence.
It can be understood that the dialogue model is obtained by training the dialogue model in the above embodiment, and has the same structure and function as the dialogue model in the above embodiment, which is not described in detail herein.
The embodiment of the application also provides a computer readable storage medium, and the computer readable storage medium stores computer executable instructions for causing an electronic device to execute the method for training a dialogue model provided by the embodiment of the application, for example, the method for training the dialogue model shown in fig. 5 or the dialogue implementing method provided by the embodiment of the application.
In some embodiments, the storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM memory; or various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HyperText Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device (including devices such as smart terminals and servers) or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a computer, cause the computer to perform a method of training a dialog model or a dialog implementation method as in the previous embodiments.
It should be noted that the above-described apparatus embodiments are merely illustrative, and the units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the application as described above, which are not provided in detail for the sake of brevity; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.