Disclosure of Invention
The embodiment of the application provides a method, a device, a terminal and a storage medium for training a scoring model, and solves the problem that a scoring model cannot be trained when no reference score corresponding to a sample translation is available. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for training a scoring model, where the method includes:
acquiring a sample original text, a first sample translation and at least one second sample translation, wherein the semantics of the first sample translation are the same as the semantics of the sample original text, and the semantics of the second sample translation are different from the semantics of the first sample translation;
inputting the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, and inputting the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation;
determining loss information based on the first sample score and at least one second sample score;
based on the loss information, a scoring model is adjusted.
Optionally, before the obtaining the sample original text, the first sample translation and the at least one second sample translation, the method further includes:
acquiring a first sample text vector corresponding to the first sample translation and a Gaussian noise vector;
adding the first sample text vector and the Gaussian noise vector to obtain a first sample text vector after noise addition;
and inputting the first sample text vector after noise addition and the first sample text vector into a pre-trained denoising self-encoder to obtain the second sample translation.
Optionally, before the obtaining the sample original text, the first sample translation and the at least one second sample translation, the method further includes:
acquiring a first sample text vector corresponding to the first sample translation;
randomly corrupting the first sample translation to obtain a corrupted first sample translation;
determining a second sample text vector corresponding to the corrupted first sample translation;
and inputting the first sample text vector and the second sample text vector into a pre-trained denoising self-encoder to obtain a second sample translation.
Optionally, the determining loss information based on the first sample score and at least one second sample score includes:
determining the loss information based on the first sample score, the at least one second sample score, and a first preset formula;
the first preset formula is L = Σ_{x∈D} −(p_x × log(W_x × h(x)) + (1 − p_x) × log(1 − W_x × h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, x is any sample translation in the sample translation set D, h(x) is a score corresponding to the sample translation x, W_x is a predetermined coefficient, p_x is a predetermined constant, and the value of p_x lies in the range (0, 1).
Optionally, the determining loss information based on the first sample score and at least one second sample score includes:
determining the loss information based on the first sample score, the at least one second sample score, and a second preset formula;
the second preset formula is L = Σ_{x∈D} max(0, margin − (h(s) − h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, s is the first sample translation, h(s) is a first sample score corresponding to the first sample translation, x is any sample translation in the sample translation set D, h (x) is a score corresponding to the sample translation x, and margin is a preset constant.
Optionally, the method further includes:
and inputting the target original text and the target translation into a pre-trained scoring model to obtain a target score corresponding to the target translation.
Optionally, the scoring model includes a text preprocessing module, a feature extraction module, and a scoring module;
inputting the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, wherein the method comprises the following steps:
inputting the sample original text and the first sample translation into a text preprocessing module to obtain a sample character sequence;
inputting the sample character sequence into a feature extraction module to obtain sample feature information;
and inputting the sample characteristic information into a scoring module to obtain a first sample score corresponding to the first sample translation.
In a second aspect, an embodiment of the present application provides an apparatus for training a scoring model, where the apparatus includes:
a first obtaining module configured to obtain a sample original, a first sample translation, and at least one second sample translation, wherein the semantics of the first sample translation and the semantics of the sample original are the same, and the semantics of the second sample translation and the semantics of the first sample translation are different;
the input module is configured to input the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, and input the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation;
a determination module configured to determine loss information based on the first sample score and at least one second sample score;
an adjustment module configured to adjust a scoring model based on the loss information.
Optionally, the apparatus further includes a second obtaining module, where the second obtaining module is configured to:
acquiring a first sample text vector corresponding to the first sample translation and a Gaussian noise vector;
adding the first sample text vector and the Gaussian noise vector to obtain a first sample text vector after noise addition;
and inputting the first sample text vector after noise addition and the first sample text vector into a pre-trained denoising self-encoder to obtain the second sample translation.
Optionally, the apparatus further includes a third obtaining module, where the third obtaining module is configured to:
acquiring a first sample text vector corresponding to the first sample translation;
randomly corrupting the first sample translation to obtain a corrupted first sample translation;
determining a second sample text vector corresponding to the corrupted first sample translation;
and inputting the first sample text vector and the second sample text vector into a pre-trained denoising self-encoder to obtain a second sample translation.
Optionally, the determining module is configured to:
determining the loss information based on the first sample score, the at least one second sample score, and a first preset formula;
the first preset formula is L = Σ_{x∈D} −(p_x × log(W_x × h(x)) + (1 − p_x) × log(1 − W_x × h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, x is any sample translation in the sample translation set D, h(x) is a score corresponding to the sample translation x, W_x is a predetermined coefficient, p_x is a predetermined constant, and the value of p_x lies in the range (0, 1).
Optionally, the determining module is configured to:
determining the loss information based on the first sample score, the at least one second sample score, and a second preset formula;
the second preset formula is L = Σ_{x∈D} max(0, margin − (h(s) − h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, s is the first sample translation, h(s) is a first sample score corresponding to the first sample translation, x is any sample translation in the sample translation set D, h (x) is a score corresponding to the sample translation x, and margin is a preset constant.
Optionally, the apparatus further comprises a usage module configured to:
and inputting the target original text and the target translation into a pre-trained scoring model to obtain a target score corresponding to the target translation.
Optionally, the scoring model includes a text preprocessing module, a feature extraction module, and a scoring module;
the input module configured to:
inputting the sample original text and the first sample translation into a text preprocessing module to obtain a sample character sequence;
inputting the sample character sequence into a feature extraction module to obtain sample feature information;
and inputting the sample characteristic information into a scoring module to obtain a first sample score corresponding to the first sample translation.
In a third aspect, an embodiment of the present application provides a terminal, where the terminal includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the method for training a scoring model described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the above method for training a scoring model.
In a fifth aspect, the present application provides a computer program product or a computer program, where the computer program product or the computer program includes a computer program code, the computer program code is stored in a computer readable storage medium, a processor of a computer device reads the computer program code from the computer readable storage medium, and the processor executes the computer program code, so that the computer device executes the above method for training a scoring model.
In the embodiment of the application, a first sample score corresponding to a first sample translation whose semantics are the same as those of the sample original text is obtained, together with a second sample score corresponding to a second sample translation whose semantics differ from those of the sample original text. Loss information is determined based on the first sample score and the second sample score, and the scoring model is adjusted based on the loss information. Therefore, no reference score for the sample translation needs to be obtained, which solves the problem in the prior art that a scoring model cannot be trained without a reference score corresponding to a sample translation.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a method for training a scoring model according to an embodiment of the present application. As shown in fig. 1, the method may be implemented by the terminal 101 or the server 102.
The terminal 101 may include components such as a processor, memory, and the like. The processor, which may be a Central Processing Unit (CPU), may be configured to obtain a sample original text, a first sample translation, and at least one second sample translation, input the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, input the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation, determine loss information based on the first sample score and the at least one second sample score, adjust the scoring model based on the loss information, and the like. The memory may be a RAM (Random Access Memory), a Flash (Flash Memory), etc., and may be configured to store the sample original text, the first sample translation, the at least one second sample translation, and the like. The terminal 101 may also include a transceiver, image detection components, a screen, audio output components, audio input components, and the like. The audio output component may be a sound box, an earphone, etc. The audio input component may be a microphone or the like.
The server 102 may include components such as a processor, memory, and the like. The processor, which may be a Central Processing Unit (CPU), may be configured to obtain a sample original text, a first sample translation, and at least one second sample translation, input the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, input the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation, determine loss information based on the first sample score and the at least one second sample score, adjust the scoring model based on the loss information, and the like. The memory may be a RAM (Random Access Memory), a Flash (Flash Memory), etc., and may be configured to store the sample original text, the first sample translation, the at least one second sample translation, and the like.
Fig. 2 is a flowchart of a method for training a scoring model according to an embodiment of the present disclosure. Referring to fig. 2, the embodiment includes:
step 201, obtaining a sample original text, a first sample translation and at least one second sample translation.
The semantics of the first sample translation are the same as the semantics of the sample original text, and the semantics of the second sample translation are different from the semantics of the first sample translation; that is, the semantics of the second sample translation are different from the semantics of the sample original text. The sample original text, the first sample translation, the second sample translation 1, and the second sample translation 2 may be as shown in Table 1 below:
TABLE 1
Optionally, the second sample translation is a translation whose semantics differ from those of the first sample translation, but if the deviation between the second sample translation and the first sample translation is too large, the trained scoring model may not be effective. Therefore, the second sample translation in the embodiment of the present application deviates only slightly from the first sample translation. The embodiments of the present application provide various methods for obtaining a second sample translation with a small deviation from the first sample translation, which are as follows:
in the first method, a first sample text vector corresponding to a first sample translation and a Gaussian noise vector are obtained. The first sample text vector and the Gaussian noise vector are added to obtain the noise-added first sample text vector. The noise-added first sample text vector and the first sample text vector are input into a pre-trained denoising self-encoder to obtain a second sample translation.
A text vector is the vector form corresponding to a text, and the first sample text vector has the same number of dimensions as the Gaussian noise vector. The Gaussian noise vector is a noise vector randomly sampled from a multidimensional Gaussian distribution with a mean of 0 and a variance of 1; the method for generating the noise vector is the prior art and is not described in detail in the embodiments of the present application.
In implementation, the vector form corresponding to the first sample translation is obtained through a word embedding algorithm, so that the first sample text vector is obtained, and a Gaussian noise vector is randomly generated through the prior art. The first sample text vector and the Gaussian noise vector are added to obtain the noise-added first sample text vector. The noise-added first sample text vector and the first sample text vector are input into the pre-trained denoising self-encoder to obtain the second sample translation.
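For illustration only, the noise-addition step can be sketched in a few lines of PyTorch; the embedding size, vocabulary size and token ids below are hypothetical, and the denoising autoencoder itself is omitted:

```python
import torch

def add_gaussian_noise(sample_text_vector: torch.Tensor) -> torch.Tensor:
    """Add an element-wise Gaussian noise vector (mean 0, variance 1)
    to a text vector of the same dimensionality."""
    gaussian_noise_vector = torch.randn_like(sample_text_vector)
    return sample_text_vector + gaussian_noise_vector

# Hypothetical usage: embed the first sample translation, then add noise.
embedding = torch.nn.Embedding(num_embeddings=30000, embedding_dim=512)
token_ids = torch.tensor([11, 42, 7, 99, 3])  # placeholder token ids
first_sample_text_vector = embedding(token_ids)
noised_vector = add_gaussian_noise(first_sample_text_vector)
# noised_vector and first_sample_text_vector are then both fed to the
# pre-trained denoising autoencoder to obtain the second sample translation.
```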
The training process for the denoising autoencoder is as follows: a sample text is acquired, and the vector form corresponding to the sample text is obtained to acquire the sample text vector corresponding to the sample text. The sample text vector and a randomly generated Gaussian noise vector are added to obtain a noise-added sample text vector. The noise-added sample text vector and the sample text vector are input into the denoising autoencoder to obtain a predicted text. Loss information is obtained based on the predicted text and the sample text, and the parameters of the denoising autoencoder are adjusted based on the loss information to obtain a parameter-adjusted denoising autoencoder. The parameter-adjusted denoising autoencoder is then trained and adjusted with other sample texts until it converges, yielding the pre-trained denoising autoencoder.
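A minimal sketch of one such parameter-adjustment step, assuming the denoising autoencoder is a PyTorch module that maps a (noised vector, clean vector) pair to per-token vocabulary logits; all names here are hypothetical stand-ins, not the application's implementation:

```python
import torch
import torch.nn.functional as F

def denoiser_training_step(denoiser, optimizer, sample_text_vector, target_token_ids):
    """One training step: noise the sample text vector, predict the text,
    compare the prediction with the sample text, and adjust parameters."""
    noised_vector = sample_text_vector + torch.randn_like(sample_text_vector)
    logits = denoiser(noised_vector, sample_text_vector)  # (seq_len, vocab_size)
    loss = F.cross_entropy(logits, target_token_ids)      # predicted vs. sample text
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The step is repeated over other sample texts until the loss converges,
# yielding the pre-trained denoising autoencoder.
```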
The specific structure of the denoising self-encoder is shown in fig. 3A. A first sample text vector corresponding to the first sample translation and a randomly generated Gaussian noise vector are obtained, and the two are added to obtain a noise-containing first sample text vector. The noise-containing first sample text vector and the first sample text vector are input into the denoising self-encoder to obtain the second sample translation.
After the noise-containing first sample text vector is input into the denoising self-encoder, the denoising self-encoder performs linear mapping on it to obtain a linearly mapped first vector. The first vector is input into a multi-head self-attention layer to obtain a second vector. A residual connection is applied between the first vector and the second vector, and the residually connected vector is normalized to obtain a third vector. The third vector is input into a feed-forward layer to obtain a fourth vector. A residual connection is applied between the third vector and the fourth vector, and the residually connected vector is normalized to obtain a fifth vector. Meanwhile, after the first sample text vector is input into the denoising self-encoder, the denoising self-encoder performs linear mapping on it to obtain a linearly mapped sixth vector. The sixth vector is input into the masked multi-head self-attention layer to obtain a seventh vector. A residual connection is applied between the sixth vector and the seventh vector, and the residually connected vector is normalized to obtain an eighth vector. The eighth vector and the fifth vector are input into the multi-head mutual attention layer to obtain a ninth vector. A residual connection is applied between the eighth vector and the ninth vector, and the residually connected vector is normalized to obtain a tenth vector. The tenth vector is input into a feed-forward layer to obtain an eleventh vector. A residual connection is applied between the tenth vector and the eleventh vector, and the residually connected vector is normalized to obtain a twelfth vector. The twelfth vector is input into the linear layer, and Softmax processing is performed to obtain the second sample translation.
The denoising self-encoder has a single-encoder, single-decoder architecture with only one multi-head mutual attention layer, which implements the information interaction between the encoder and the decoder. The multi-head self-attention layer, the masked multi-head self-attention layer, the multi-head mutual attention layer, and the feed-forward layers are all neural networks.
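The encoder half of this flow maps directly onto a standard Transformer block; the following sketch mirrors the first-to-fifth-vector sequence described above, with layer sizes that are illustrative assumptions rather than values from the application:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention -> residual + layer normalization -> feed-forward
    -> residual + layer normalization."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, first_vector: torch.Tensor) -> torch.Tensor:
        second_vector, _ = self.self_attn(first_vector, first_vector, first_vector)
        third_vector = self.norm1(first_vector + second_vector)  # residual + norm
        fourth_vector = self.ffn(third_vector)                   # feed-forward layer
        return self.norm2(third_vector + fourth_vector)          # fifth vector
```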
In actual use, if noise is added to the first sample translation directly based on rules, the obtained second sample translation has obvious characteristics such as grammatical errors, stiff sentence patterns and poor diversity; these characteristics are easily captured by a neural network and are not conducive to training the scoring model. A second sample translation constructed by a translation model, in turn, often has a fixed syntactic pattern and cannot be guaranteed to contain semantic errors, which is likewise not conducive to training the scoring model.
In the embodiment of the present application, in order to construct a second sample translation whose semantics deviate only slightly from those of the first sample translation, the noise-added first sample text vector is obtained first, and then the noise-added first sample text vector and the first sample text vector are input into the pre-trained denoising self-encoder to obtain the second sample translation. Although the pre-trained denoising autoencoder is intended to correct the semantics of the noise-added first sample translation, in actual use it cannot do so completely; that is, it can only remove part of the noise in the noise-added first sample text vector, and the second sample translation is then obtained from a vector that still contains part of the noise. The retained noise gives the semantics of the second sample translation a small deviation from the semantics of the first sample translation. The scoring model is trained on the first sample translation and a second sample translation with a small semantic deviation from it, so the scoring model can capture finer-grained features during training, and the training effect is better.
In a second method, a first sample text vector corresponding to the first sample translation is obtained. The first sample translation is randomly corrupted to obtain a corrupted first sample translation. A second sample text vector corresponding to the corrupted first sample translation is determined. The first sample text vector and the second sample text vector are input into a pre-trained denoising self-encoder to obtain a second sample translation.
In practice, the first sample translation is randomly corrupted to obtain the corrupted first sample translation. A second sample text vector corresponding to the corrupted first sample translation is acquired. The first sample text vector and the second sample text vector are input into the pre-trained denoising self-encoder to obtain the second sample translation.
Random corruption of text includes random masking, random replacement, random deletion, and random insertion. Random masking masks some words with a [MASK] token at random; random replacement replaces some words with other random words; random deletion deletes some words in the text at random; and random insertion inserts random words at random positions. The methods for randomly corrupting text are the prior art and are not described again in the embodiments of the present application. For example, the results of randomly corrupting "I am Chinese, I love China." can be as shown in Table 2, and a sketch of these operations follows the table.
TABLE 2
| Source sentence | I am Chinese, I love China. |
| Random masking | I am [MASK], I [MASK] China. |
| Random replacement | I am Chinese, residual love China. |
| Random deletion | I am, I love China. |
| Random insertion | I am Chinese, I monitoring love China. |
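The four corruption operations in Table 2 can be sketched as follows; the corruption probability and the exact [MASK] token are assumptions chosen for illustration:

```python
import random

MASK = "[MASK]"

def random_mask(tokens, p=0.15):
    """Replace each token with [MASK] with probability p."""
    return [MASK if random.random() < p else t for t in tokens]

def random_replace(tokens, vocab, p=0.15):
    """Substitute a random vocabulary word for each token with probability p."""
    return [random.choice(vocab) if random.random() < p else t for t in tokens]

def random_delete(tokens, p=0.15):
    """Drop each token with probability p."""
    return [t for t in tokens if random.random() >= p]

def random_insert(tokens, vocab, n=1):
    """Insert n random vocabulary words at random positions."""
    out = list(tokens)
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(vocab))
    return out

# Example: corrupt the source sentence from Table 2.
tokens = "I am Chinese , I love China .".split()
print(random_mask(tokens))
```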
The third method both adds a Gaussian noise vector to the first sample text vector corresponding to the first sample translation and randomly corrupts the first sample translation. The specific steps are as follows: the first sample translation is randomly corrupted to obtain the randomly corrupted first sample translation. A sample text vector corresponding to the randomly corrupted first sample translation and a randomly generated Gaussian noise vector are acquired, and the two vectors are added to obtain an added vector. The added vector and the first sample text vector are then input into a pre-trained denoising autoencoder to obtain a second sample translation. The denoising self-encoder used in this method is the same as the denoising self-encoder used in the first method and the second method.
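Taking the third method as an example, the combination of random corruption and Gaussian noise can be sketched as below; random_mask and random_replace are the hypothetical helpers from the earlier sketch, and embed and denoiser stand in for the word-embedding step and the pre-trained denoising autoencoder:

```python
import torch

def make_second_sample_translation(first_sample_tokens, vocab, embed, denoiser):
    """Third method in sketch form: randomly corrupt the first sample
    translation, embed it, add a Gaussian noise vector, and feed the
    result together with the clean first sample text vector into the
    pre-trained denoising autoencoder."""
    corrupted_tokens = random_replace(random_mask(first_sample_tokens), vocab)
    corrupted_vector = embed(corrupted_tokens)
    added_vector = corrupted_vector + torch.randn_like(corrupted_vector)
    first_sample_text_vector = embed(first_sample_tokens)
    return denoiser(added_vector, first_sample_text_vector)
```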
In a fourth method, a third sample text vector corresponding to the sample original text, a first sample text vector corresponding to the first sample translation, a first Gaussian noise vector and a second Gaussian noise vector are obtained. The third sample text vector and the first Gaussian noise vector are added to obtain a noise-added third sample text vector. The first sample text vector and the second Gaussian noise vector are added to obtain a noise-added first sample text vector. The noise-added first sample text vector, the noise-added third sample text vector and the first sample text vector are input into a pre-trained denoising self-encoder to obtain a second sample translation. The first Gaussian noise vector and the second Gaussian noise vector may be the same vector or different vectors.
The training process of this denoising autoencoder is as follows: a sample original text and a first sample translation corresponding to the sample original text are acquired, and the noise-added first sample text vector, the noise-added third sample text vector and the first sample text vector are input into the denoising self-encoder to obtain a predicted text. The predicted text and the first sample translation are input into a loss function to obtain loss information, and the denoising autoencoder is adjusted based on the loss information to obtain an adjusted denoising autoencoder. The adjusted denoising autoencoder is then continuously adjusted with other sample original texts and the first sample translations corresponding to them, and when the adjusted denoising autoencoder converges, the pre-trained denoising autoencoder is obtained.
The structure of this denoising autoencoder differs from that of the denoising autoencoder used in the first three methods: it has a dual-encoder, single-decoder architecture, with the specific composition shown in fig. 4. A third sample text vector corresponding to the sample original text is obtained through a word embedding algorithm. The third sample text vector and the randomly generated first Gaussian noise vector are added to obtain the noise-added third sample text vector. Linear mapping is performed on the noise-added third sample text vector to obtain a thirteenth vector, and the thirteenth vector is input into the multi-head self-attention layer to obtain a fourteenth vector. A residual connection is applied between the fourteenth vector and the thirteenth vector, and the residually connected vector is normalized to obtain a fifteenth vector. The fifteenth vector is input into the feed-forward layer to obtain a sixteenth vector. A residual connection is applied between the fifteenth vector and the sixteenth vector, and the residually connected vector is normalized to obtain a seventeenth vector. Similarly, the first sample text vector corresponding to the first sample translation is obtained through a word embedding algorithm. The first sample text vector and the randomly generated second Gaussian noise vector are added to obtain the noise-added first sample text vector. Linear mapping is performed on the noise-added first sample text vector to obtain an eighteenth vector, and the eighteenth vector is input into the multi-head self-attention layer to obtain a nineteenth vector. A residual connection is applied between the eighteenth vector and the nineteenth vector, and the residually connected vector is normalized to obtain a twentieth vector. The twentieth vector is input into a feed-forward layer to obtain a twenty-first vector. A residual connection is applied between the twentieth vector and the twenty-first vector, and the residually connected vector is normalized to obtain a twenty-second vector. Similarly, the first sample text vector is linearly mapped to obtain a twenty-third vector, and the twenty-third vector is input into the masked multi-head self-attention layer to obtain a twenty-fourth vector. A residual connection is applied between the twenty-third vector and the twenty-fourth vector, and the residually connected vector is normalized to obtain a twenty-fifth vector. The twenty-fifth vector and the seventeenth vector are input into the multi-head mutual attention layer to obtain a twenty-sixth vector. A residual connection is applied between the twenty-sixth vector and the twenty-fifth vector, and the residually connected vector is normalized to obtain a twenty-seventh vector. The twenty-seventh vector and the twenty-second vector are input into a multi-head mutual attention layer to obtain a twenty-eighth vector. A residual connection is applied between the twenty-eighth vector and the twenty-seventh vector, and the residually connected vector is normalized to obtain a twenty-ninth vector.
The twenty-ninth vector is input into a feed-forward layer to obtain a thirtieth vector. A residual connection is applied between the twenty-ninth vector and the thirtieth vector, and the residually connected vector is normalized to obtain a thirty-first vector. The thirty-first vector is input into the linear layer, and Softmax processing is performed to obtain the second sample translation.
This denoising self-encoder comprises two multi-head mutual attention layers, which respectively exchange information with the two encoders to complete decoding. The multi-head self-attention layer, the feed-forward layer, the masked multi-head self-attention layer and the multi-head mutual attention layer in the denoising self-encoder are all neural networks. The multi-head self-attention layer projects the feature vectors input by the encoder through a plurality of linear transformations to obtain query, key and value triplets, then computes the attention weight between the query and the key, and multiplies the attention weight by the value to obtain the feature representation of the encoder input. The multi-head mutual attention layer projects the feature vectors input by the encoder and the decoder through a plurality of linear transformations to obtain query, key and value triplets, then computes the attention weight between the query and the key, and multiplies the attention weight by the value to obtain the feature representation after the encoder and decoder inputs interact. The feed-forward (fully connected) layer maps the input feature representation twice, increasing its representational capacity. The residual connection adds the input vector to the output, thereby avoiding the gradient vanishing problem. Layer normalization normalizes the neurons of the same layer to the same distribution, which keeps training stable.
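For reference, the attention computation described here (compute attention weights between query and key, then apply them to the values) reduces to the following single-head sketch; the 1/√dim scaling factor is the conventional choice and is an assumption here:

```python
import torch

def attention_core(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor):
    """Compute the attention weight between query and key, then multiply
    it by value to obtain the interacted feature representation.
    Shapes: (seq_len, dim)."""
    dim = query.size(-1)
    scores = query @ key.transpose(-2, -1) / dim ** 0.5
    weights = torch.softmax(scores, dim=-1)  # attention weight per position
    return weights @ value
```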
In a fifth method, the sample original text is randomly corrupted to obtain the randomly corrupted sample original text, and the first sample translation is randomly corrupted to obtain the randomly corrupted first sample translation. A fourth sample text vector corresponding to the corrupted sample original text and a second sample text vector corresponding to the corrupted first sample translation are obtained. The fourth sample text vector, the second sample text vector and the first sample text vector are input into a pre-trained denoising self-encoder to obtain a second sample translation.
In a sixth method, the first sample translation is randomly corrupted to obtain the randomly corrupted first sample translation. A sample text vector corresponding to the randomly corrupted first sample translation and a randomly generated third Gaussian noise vector are acquired, and the two vectors are added to obtain a first vector. The sample original text is randomly corrupted to obtain the randomly corrupted sample original text. A sample text vector corresponding to the randomly corrupted sample original text and a randomly generated fourth Gaussian noise vector are acquired, and the two vectors are added to obtain a second vector. The first vector, the second vector and the first sample text vector are input into a pre-trained denoising self-encoder to obtain a second sample translation.
The third Gaussian noise vector and the fourth Gaussian noise vector may be the same vector or different vectors.
In a seventh method, one of the sample original text and the first sample translation is randomly corrupted, and a text vector corresponding to the randomly corrupted text is obtained. For the other of the two, its sample text vector and a randomly generated Gaussian noise vector are acquired and added to obtain an added vector. These two vectors and the first sample text vector are input into the denoising self-encoder to obtain a second sample translation.
It should be noted that the training process and structure of the denoising autoencoder related to the fourth method, the fifth method, the sixth method, and the seventh method are the same, and are not described herein again.
Step 202, inputting the sample original text and the first sample translation into the scoring model to obtain a first sample score corresponding to the first sample translation, and inputting the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation.
Optionally, the scoring model in this embodiment of the present application includes a text preprocessing module, a feature extraction module, and a scoring module. Inputting the sample original text and the first sample translation into a scoring model, and obtaining a first sample score corresponding to the first sample translation specifically comprises the following steps: and inputting the sample original text and the first sample translation into a text preprocessing module to obtain a sample character sequence. And inputting the sample character sequence into a feature extraction module to obtain sample feature information. And inputting the sample characteristic information into a scoring module to obtain a first sample score corresponding to the first sample translation.
The text preprocessing module is an algorithm model mainly used for performing text preprocessing on the input sample original text and first sample translation to obtain a preprocessed sample original text and a preprocessed first sample translation, and splicing the two to obtain a sample character sequence.
Text preprocessing comprises word segmentation, sub-word segmentation, special character processing and truncation. Word segmentation separates punctuation from the text; sub-word segmentation further splits single words according to the frequency of occurrence of their consecutive letters; special character processing deletes non-printing characters and transcribes escape characters; and truncation cuts the input sequence according to the upper limit of the sentence length the model can process.
It should be noted that, because text preprocessing includes sub-word segmentation, the sample original text and the first sample translation may each be split into a plurality of sub-words. For example, after the sample original text "I eat apple." and the first sample translation "I drink an apple." pass through the BERT preprocessing flow, the preprocessed sample original text is "[CLS] I eat apple." and the preprocessed first sample translation is "[SEP] I drink an app ##le. [SEP]". The two are spliced to obtain "[CLS] I eat apple. [SEP] I drink an app ##le. [SEP]".
In the above sequence, the word "apple" is split into two parts, "app" and "##le". This process helps to reduce the size of the vocabulary and the computational overhead.
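Since the application names BERT's preprocessing flow, a standard BERT tokenizer is a plausible (but not confirmed) way to reproduce this splicing:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sample_original_text = "I eat apple."
first_sample_translation = "I drink an apple."

# Encoding the pair yields "[CLS] original [SEP] translation [SEP]",
# with rare words split into sub-words such as "app" and "##le".
encoded = tokenizer(sample_original_text, first_sample_translation)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```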
The feature extraction module is a neural network model mainly used for extracting features from the text vectors to obtain sample feature information. Its specific processing is as follows: each character in the character sequence is first converted into a text vector. The text vectors are then fed to an encoder in the feature extraction module. Each Transformer layer in the encoder encodes the text vectors into feature information layer by layer, so that the feature information corresponding to each word incorporates the context information of that word.
For example, although the word "bank" is included in both "I am fixing on the bank" and "I went to bank to sink money," the "bank" in the two sentences has different meanings. The "bank" in the first sentence should be translated as "bank" and the "bank" in the second sentence should be translated as "bank". The feature extraction module can distinguish different meanings of two words according to the context information, so that different expression vectors are given to the same word.
The scoring module is also a neural network model; its structure is a single fully connected layer, which maps the feature information to a real-valued continuous number, i.e., a score, used as the quality evaluation result for the sample original text and the first sample translation.
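A sketch of such a scoring module; the 768-dimensional feature vector (BERT-base hidden size) is an assumption:

```python
import torch
import torch.nn as nn

class ScoringModule(nn.Module):
    """One fully connected layer mapping the sample feature information
    to a single real-valued continuous score."""

    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim) from the feature extraction module
        return self.fc(features).squeeze(-1)  # (batch,) quality scores
```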
Step 203, determining loss information based on the first sample score and the at least one second sample score.
The second sample translation is obtained by adding noise to the first sample translation, so the second sample score corresponding to the second sample translation should be lower than the first sample score corresponding to the first sample translation. Loss information can therefore be obtained from the comparison between the first sample score and the second sample score. Based on this principle, the embodiment of the present application provides two forms of contrastive training: contrastive classification, as shown in FIG. 5A, and contrastive ranking, as shown in FIG. 5B. The goal of both loss functions is to make the first sample score higher than the second sample score. The two methods are described below.
The first method determines loss information based on a first sample score, at least one second sample score, and a first predetermined formula.
The first preset formula is L = Σ_{x∈D} −(p_x × log(W_x × h(x)) + (1 − p_x) × log(1 − W_x × h(x)));
Wherein L is loss information, D is a sample translation set consisting of a first sample translation and at least one second sample translation, x is any sample translation in the sample translation set D, h(x) is a score corresponding to the sample translation x, W_x is a predetermined coefficient, p_x is a predetermined constant, and the value of p_x lies in the range (0, 1).
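A direct transcription of the first preset formula, assuming W_x × h(x) lies in (0, 1) so the logarithms are defined; the numeric values below are placeholders:

```python
import torch

def contrastive_classification_loss(scores, p, w):
    """L = sum over x in D of -(p_x*log(W_x*h(x)) + (1-p_x)*log(1-W_x*h(x))).
    scores: h(x) for every translation in D; p: the constants p_x;
    w: the preset coefficients W_x."""
    q = w * scores
    return -(p * torch.log(q) + (1 - p) * torch.log(1 - q)).sum()

# Placeholder usage: the first sample score followed by two second sample scores.
scores = torch.tensor([0.92, 0.40, 0.35])
p = torch.tensor([0.9, 0.1, 0.1])  # p_x in (0, 1)
w = torch.tensor([1.0, 1.0, 1.0])
print(contrastive_classification_loss(scores, p, w))
```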
In FIG. 5A, S is the sample original text, T0 is the first sample translation, T1'~Tn' are the second sample translations, l0 is the first sample score, and l1'~ln' are the second sample scores. The sample original text S and the first sample translation T0 are input into the scoring model to obtain the first sample score l0. The sample original text S and the second sample translation T1' are input into the scoring model to obtain the second sample score l1'. In the same way, the sample original text S and the second sample translation Tn' are input into the scoring model to obtain the second sample score ln'. The first preset formula is then used to perform contrastive classification on l0 and l1'~ln'.
One or more scoring models may be used in fig. 5A, but when there are multiple scoring models, the parameters of each scoring model are shared, so only one scoring model is actually obtained by training. This parameter sharing improves the efficiency of neural network training and reduces the storage footprint of the scoring model.
In a second method, loss information is determined based on the first sample score, at least one second sample score, and a second preset formula.
The second preset formula is L = Σ_{x∈D} max(0, margin − (h(s) − h(x)));
Wherein, L is loss information, D is a sample translation set composed of a first sample translation and at least one second sample translation, s is the first sample translation, h(s) is a first sample score corresponding to the first sample translation, x is any sample translation in the sample translation set D, h (x) is a score corresponding to the sample translation x, and margin is a preset constant for enlarging a difference value between the first sample score and the second sample score.
As shown in fig. 5B, S is the sample original text, T is the first sample translation, T' is the second sample translation, l is the first sample score, and l' is the second sample score; the scoring models in fig. 5B are all the same model, or models that share parameters. The sample original text S and the first sample translation T are input into the scoring model to obtain the first sample score l. The sample original text S and the second sample translation T' are input into the scoring model to obtain the second sample score l'. The margin loss of l and l' is then calculated with the second preset formula.
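A sketch of the contrastive ranking side, using a hinge (margin) form consistent with the definitions above; the hinge shape and the margin value are assumptions:

```python
import torch

def contrastive_ranking_loss(first_score, second_scores, margin=0.1):
    """Penalize every second sample score that comes within `margin`
    of the first sample score, pushing h(s) above each h(x)."""
    gaps = margin - (first_score - second_scores)
    return torch.clamp(gaps, min=0.0).sum()

# Placeholder usage: l = 0.9 for the first sample translation, two scores l'.
print(contrastive_ranking_loss(torch.tensor(0.9), torch.tensor([0.50, 0.85])))
```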
And step 204, adjusting the scoring model based on the loss information.
In implementation, parameters in the scoring model are adjusted based on the loss information to obtain an adjusted scoring model. And training and adjusting the scoring model based on other sample original texts, the corresponding first sample translations and the corresponding second sample translations.
After the loss information is obtained, gradient back-propagation and parameter updating are performed on the scoring model using the deep-learning back-propagation algorithm. In one training pass, the parameters of the feature extraction module and the scoring module in the scoring model are both updated, using the same learning rate. Meanwhile, during training, the scoring model is validated once every preset number of training iterations. The validation process is similar to the training process in the prior art: the prediction score output by the scoring model is compared with a benchmark score to obtain loss information, the scoring model is then trained and adjusted based on this loss information, and at the same time the precision of the scoring model is calculated based on the prediction score and the benchmark score. When the calculated precision no longer improves, the trained scoring model is obtained.
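The update step itself is ordinary deep-learning back-propagation; a sketch with one optimizer over all parameters, so the feature extraction module and the scoring module share the same learning rate (the learning rate value is an assumption):

```python
import torch

def adjust_scoring_model(scoring_model, optimizer, loss):
    """Gradient back-propagation and parameter update based on the loss
    information from step 203."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# One optimizer covering every parameter updates the feature extraction
# module and the scoring module with the same learning rate, e.g.:
# optimizer = torch.optim.Adam(scoring_model.parameters(), lr=1e-5)
```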
The loss information here is calculated based on the prediction score, the benchmark score, and a third formula;
Wherein L_sent is the loss information, h(s) is the prediction score output by the scoring model, hter_s is the benchmark score, W_s is a preset coefficient, and sigmoid(x) is a mapping function that maps x into the numerical range from 0 to 1.
Therefore, when the scoring model is trained and adjusted, the scoring model can be trained based on the prediction score and the reference score, so that the output result of the trained scoring model is more accurate.
In an embodiment of the present application, a sample original, a first sample translation, and at least one second sample translation are obtained, where a semantic meaning of the first sample translation is the same as a semantic meaning corresponding to the sample original, and a semantic meaning of the second sample translation is different from a semantic meaning of the first sample translation. Inputting the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, and inputting the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation; determining loss information based on the first sample score and at least one second sample score; based on the loss information, a scoring model is adjusted. Therefore, the scoring model can be trained on the premise of not depending on the reference score.
In the related art, before training a scoring model, a professional translator or native speaker is required to evaluate the sample original text and its sample translation, score the sample translation on multiple different aspects such as accuracy and fluency, and then combine the multiple evaluation scores of the sample translation to obtain a final benchmark score, as shown in Table 3:
TABLE 3
| Sample original text | I am a Chinese. | I eat apples. |
| Sample translation | I am Chinese. | I drink an apple. |
| Manual evaluation result 1 | 1.0 | 0.2 |
| Manual evaluation result 2 | 0.9 | 0.4 |
| Manual evaluation result 3 | 1.0 | 0.35 |
| Final manual evaluation result | 0.9667 | 0.3167 |
In table 3, the manual evaluation results of the sample translation "I am Chinese." are 1.0, 0.9, and 1.0, and the final manual evaluation result is the average of the three, 0.9667, which serves as the benchmark score. The manual evaluation results of the sample translation "I drink an apple." are 0.2, 0.4, and 0.35, and the final manual evaluation result is the average of the three, 0.3167, which serves as the benchmark score.
The process of manual evaluation is time-consuming and labor-intensive, and a large number of professional translators must participate before objective evaluation scores can be obtained. Moreover, because different languages, different fields and different machine translation systems have different error distributions, an existing scoring model cannot be used directly when evaluating quality for a specific language, field or machine translation system; the scoring model must be trained and adjusted again for that specific language, field and machine translation system, which also wastes time and labor.
In the actual use of the scoring model, the target original text and the target translation are input into the pre-trained scoring model to obtain a target score corresponding to the target translation.
Fig. 6 is a schematic structural diagram of an apparatus for training a scoring model according to an embodiment of the present application, and referring to fig. 6, the apparatus includes:
a first obtaining module 610 configured to obtain a sample original text, a first sample translation, and at least one second sample translation, wherein the semantics of the first sample translation and the semantics of the sample original text are the same, and the semantics of the second sample translation and the semantics of the first sample translation are different;
the input module 620 is configured to input the sample original text and the first sample translation into a scoring model to obtain a first sample score corresponding to the first sample translation, and input the sample original text and each second sample translation into the scoring model to obtain a second sample score corresponding to each second sample translation;
a determining module 630 configured to determine loss information based on the first sample score and at least one second sample score;
an adjustment module 640 configured to adjust a scoring model based on the loss information.
Optionally, the apparatus further includes a second obtaining module, where the second obtaining module is configured to:
acquiring a first sample text vector corresponding to the first sample translation and a Gaussian noise vector;
adding the first sample text vector and the Gaussian noise vector to obtain a first sample text vector after noise addition;
and inputting the first sample text vector after noise addition and the first sample text vector into a pre-trained denoising self-encoder to obtain the second sample translation.
Optionally, the apparatus further includes a third obtaining module, where the third obtaining module is configured to:
acquiring a first sample text vector corresponding to the first sample translation;
randomly corrupting the first sample translation to obtain a corrupted first sample translation;
determining a second sample text vector corresponding to the corrupted first sample translation;
and inputting the first sample text vector and the second sample text vector into a pre-trained denoising self-encoder to obtain a second sample translation.
Optionally, the determining module 630 is configured to:
determining the loss information based on the first sample score, the at least one second sample score, and a first preset formula;
the first preset formula is L = Σ_{x∈D} −(p_x × log(W_x × h(x)) + (1 − p_x) × log(1 − W_x × h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, x is any sample translation in the sample translation set D, h(x) is a score corresponding to the sample translation x, W_x is a predetermined coefficient, p_x is a predetermined constant, and the value of p_x lies in the range (0, 1).
Optionally, the determining module 630 is configured to:
determining the loss information based on the first sample score, the at least one second sample score, and a second preset formula;
the second preset formula is L = Σ_{x∈D} max(0, margin − (h(s) − h(x)));
Wherein L is the loss information, D is a sample translation set composed of the first sample translation and the at least one second sample translation, s is the first sample translation, h(s) is a first sample score corresponding to the first sample translation, x is any sample translation in the sample translation set D, h (x) is a score corresponding to the sample translation x, and margin is a preset constant.
Optionally, the apparatus further comprises a usage module configured to:
and inputting the target original text and the target translation into a pre-trained scoring model to obtain a target score corresponding to the target translation.
Optionally, the scoring model includes a text preprocessing module, a feature extraction module, and a scoring module;
the input module 620 is configured to:
inputting the sample original text and the first sample translation into a text preprocessing module to obtain a sample character sequence;
inputting the sample character sequence into a feature extraction module to obtain sample feature information;
and inputting the sample characteristic information into a scoring module to obtain a first sample score corresponding to the first sample translation.
It should be noted that: in the device for training a score model according to the above embodiment, when the score model is trained, only the division of the functional modules is exemplified, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the device for training the scoring model and the method for training the scoring model provided by the above embodiments belong to the same concept, and the specific implementation process thereof is described in the method embodiments, and is not described herein again.
Fig. 7 shows a block diagram of a terminal 700 according to an exemplary embodiment of the present application. The terminal 700 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so on.
In general, the terminal 700 includes: a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 702 is used to store at least one program code for execution by the processor 701 to implement the method of training a scoring model provided by the method embodiments herein.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, a positioning component 708, and a power source 709.
The peripheral interface 703 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or above its surface. A touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the terminal 700; in other embodiments, there may be at least two display screens 705, disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display disposed on a curved surface or a folded surface of the terminal 700. The display screen 705 may even be arranged in a non-rectangular irregular shape, that is, an irregularly shaped screen. The display screen 705 may be made of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 701 for processing, or to the radio frequency circuit 704 to realize voice communication. For stereo sound collection or noise reduction, a plurality of microphones may be disposed at different portions of the terminal 700. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic location of the terminal 700 for navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 709 is used to supply power to the various components of the terminal 700. The power supply 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, the terminal 700 further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: an acceleration sensor 711, a gyro sensor 712, a pressure sensor 713, a fingerprint sensor 714, an optical sensor 715, and a proximity sensor 716.
The acceleration sensor 711 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established based on the terminal 700. For example, the acceleration sensor 711 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 701 may control the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used to collect motion data of a game or of the user.
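Purely as an illustrative sketch, and not part of the embodiments above, the landscape/portrait decision can be reduced to comparing the gravity components along the two screen axes; the axis convention and the Python helper below are assumptions made for illustration only.

    def choose_orientation(gx: float, gy: float) -> str:
        # Gravity mostly along the device's long (y) axis suggests the user
        # holds it upright; gravity mostly along the short (x) axis suggests
        # the device is on its side. The axis naming is an assumption.
        return "portrait" if abs(gy) >= abs(gx) else "landscape"

    # Device held upright: roughly 9.8 m/s^2 of gravity along -y.
    print(choose_orientation(0.3, -9.7))  # -> portrait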
The gyro sensor 712 may detect the body direction and rotation angle of the terminal 700, and may cooperate with the acceleration sensor 711 to collect the user's 3D motion of the terminal 700. Based on the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
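One common way to fuse gyroscope rates with accelerometer gravity readings for this kind of motion sensing is a complementary filter; the single-axis Python sketch below is illustrative only, and the fusion coefficient alpha is an assumed value, not taken from the application.

    import math

    def fuse_tilt(prev_angle, gyro_rate, ax, az, dt, alpha=0.98):
        # Short-term: integrate the gyroscope's angular rate (rad/s).
        gyro_angle = prev_angle + gyro_rate * dt
        # Long-term: the tilt implied by gravity corrects gyro drift.
        accel_angle = math.atan2(ax, az)
        return alpha * gyro_angle + (1 - alpha) * accel_angle

    # One 10 ms step: a slight rotation, with gravity almost along z.
    print(fuse_tilt(0.0, 0.1, 0.5, 9.8, 0.01))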
The pressure sensor 713 may be disposed on a side frame of the terminal 700 and/or on a lower layer of the display screen 705. When the pressure sensor 713 is disposed on a side frame of the terminal 700, it can detect the user's grip signal on the terminal 700, and the processor 701 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed on the lower layer of the display screen 705, the processor 701 controls the operable controls on the UI according to the user's pressure operation on the display screen 705. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 714 is used to collect the user's fingerprint; the processor 701 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal 700. When a physical button or a vendor logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical button or the vendor logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is decreased. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
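As a minimal sketch of the brightness adjustment just described (the lux ceiling, the brightness floor, and the 0-255 scale are assumed values for illustration, not taken from the application):

    def display_brightness(lux: float, max_lux: float = 1000.0) -> int:
        # Clamp the ambient illuminance into [0, 1] of the assumed range,
        # then map it to a 0-255 backlight level with a readable floor.
        level = min(max(lux / max_lux, 0.0), 1.0)
        return round(30 + level * (255 - 30))

    print(display_brightness(800.0))  # bright room -> high backlight
    print(display_brightness(5.0))    # dim room    -> near the floor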
The proximity sensor 716, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front surface of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually decreases, the processor 701 controls the display screen 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the screen-off state to the screen-on state.
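This switching can be sketched as threshold hysteresis on the measured distance; the two thresholds below are assumptions chosen for illustration, and using a pair of them (rather than one cut-off) keeps the screen from flickering when the distance hovers near a single value.

    def next_screen_state(state: str, distance_cm: float,
                          off_below: float = 3.0, on_above: float = 5.0) -> str:
        # Screen goes off only once the user is close, and back on only
        # once the user has clearly moved away again (hysteresis).
        if state == "on" and distance_cm < off_below:
            return "off"   # user approaching the front panel
        if state == "off" and distance_cm > on_above:
            return "on"    # user moving away again
        return state

    print(next_screen_state("on", 2.0))   # -> off
    print(next_screen_state("off", 8.0))  # -> on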
Those skilled in the art will appreciate that the structure shown in fig. 7 does not constitute a limitation of the terminal 700, and the terminal may include more or fewer components than those shown, combine some components, or adopt a different component arrangement.
The computer device provided by the embodiments of the present application may also be provided as a server. Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server 800 may vary greatly due to differences in configuration or performance, and may include one or more processors (CPUs) 801 and one or more memories 802, where the memory 802 stores at least one program code that is loaded and executed by the processor 801 to implement the method for training a scoring model provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input interface for obtaining input, and the server may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including program code, where the program code is executable by a processor in a terminal or a server to perform the method of training a scoring model in the above embodiments. For example, the computer-readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a CD-ROM (compact disc read-only memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by program code instructing relevant hardware. The program code may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is merely an exemplary embodiment of the present application and is not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.