Detailed Description
The process according to the invention is described in further detail below with reference to the accompanying drawings.
A low-resource machine translation method fusing domain terms in a semi-autoregressive manner, comprising the following steps:
Step 1: construct a decoding method based on a semi-autoregressive model to realize sequence generation in a semi-autoregressive manner.
Specifically, the constructed semi-autoregressive model is consistent with the Transformer at the encoder end, and decodes in a semi-autoregressive manner at the decoder end.
As shown in Fig. 1, the working principle of the semi-autoregressive decoder is as follows. When generating the translation, the decoder partitions the translation into blocks and decodes them synchronously. For example, a translation sequence S is partitioned into different blocks S1, S2, ..., SK. Within a block, the next word is predicted by autoregressive (AR) decoding, combining the source text information, the prior knowledge, and the generated historical translation, and the decoder generates a corresponding word or symbol for each incomplete block at every step. Specifically, as shown in formula (1):

P(y \mid x) = \prod_{t=1}^{L} \prod_{i=1}^{K} P\left(y_t^{i} \mid y_{<t}^{1}, \ldots, y_{<t}^{K}, x\right)    (1)

wherein P(y|x) represents the conditional probability, x represents the input sequence, and y represents the output sequence; y_t^i represents the t-th word or symbol in the i-th block; y_{<t}^i represents the historical translation already generated for the i-th block; L is the total length of the blocks, and K represents the number of blocks.
The predicted word or symbol in the i-th block Si is computed as shown in formula (2):

y_t^{i} = \arg\max_{w \in V \cup \{<BOS>, <EOS>\}} P\left(w \mid y_{<t}^{1}, \ldots, y_{<t}^{K}, x\right)    (2)

where V denotes the vocabulary, <BOS> and <EOS> denote the start and end symbols, respectively, P(·) denotes the probability distribution of the corresponding expression, and argmax returns the argument that maximizes the probability.
When y_t^i = <BOS>, the block Si is indicated as beginning decoding, and insertion from the constraint term library is allowed;
When y_t^i ∈ V, the block Si is not yet complete, and decoding is allowed to continue;
When y_t^i = <EOS>, the block Si is complete, and decoding of this block stops.
During the whole decoding process, when the predicted word is <EOS>, the historical information remains unchanged; once the maximum length is reached, decoding of the sequence is complete.
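The block-wise decoding procedure of Step 1 can be illustrated with a minimal sketch; the helper names below are assumptions, and predict_next stands in for the decoder's argmax prediction of formula (2), not the patented implementation itself:

```python
# Minimal sketch of block-wise semi-autoregressive decoding: the K blocks are
# advanced in parallel at each step, and inside a block the next token is
# chosen autoregressively until <EOS> or the maximum length is reached.

BOS, EOS = "<BOS>", "<EOS>"

def sat_decode(src, k, predict_next, max_len=50):
    """predict_next(src, blocks, i) -> next token for block i (cf. formula (2))."""
    blocks = [[BOS] for _ in range(k)]            # every block starts decoding
    finished = [False] * k
    for _ in range(max_len):
        if all(finished):
            break
        for i in range(k):                        # conceptually parallel across blocks
            if finished[i]:
                continue
            token = predict_next(src, blocks, i)  # AR step inside block i
            if token == EOS:
                finished[i] = True                # block i is complete; stop decoding it
            else:
                blocks[i].append(token)
    # concatenate the blocks (dropping the start symbols) into the final translation
    return [tok for block in blocks for tok in block[1:]]

# toy usage with a dummy predictor that ends every block immediately
print(sat_decode("source sentence", k=2, predict_next=lambda s, b, i: EOS))  # []
```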
Step 2: construct a term library fused with the specific field, and fuse this external knowledge into the translated sentences.
Specifically, during decoding, sequence generation in an autoregressive manner is realized within each block by incorporating prior knowledge. Words from the prior knowledge are fused in and redundant words are deleted, thereby realizing the fusion of prior knowledge.
In the stage of formally introducing the prior knowledge and decoding, a first iteration is performed with the sentence boundary y0 = <s> </s>; before the wrong words in the historical translation are deleted, the y0 sequence is filled with the target constraints, and the target sequence is then optimized through editing iterations. Here <s> represents the beginning of the sentence and </s> represents the end of the sentence.
The prior knowledge is integrated as follows:
The input to the decoder is a uniform embedding mapping z, z = f(x; θ_enc), where f(·) represents the mapping function and x represents the source-language input sequence, together with a given set of n pairs of prior knowledge {P1, P2, ..., Pn}, where z_k = ε(x_i), t = 1, 2, ..., Ty. Here θ_enc denotes the relevant parameters of the encoder, z_k denotes the mapping result of the k-th block, ε(x_i) denotes a step function, t denotes the current time step, Tx denotes the source-language sequence length, Ty denotes the target-language sequence length, and Pn denotes the n-th pair of prior knowledge.
Each pair of prior knowledge Pj consists of a different word or phrase wj, j = 1, 2, ..., n. Before formal decoding, the decoder combines the source-language information in advance, selects the corresponding target-language expressions from the prior knowledge base, and inserts them into the sequence y0 to be generated, y0 = <s> P1, P2, ..., Pn.
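How the initial sequence y0 might be assembled from the term base can be sketched as follows; the exact-token lookup and the names are illustrative assumptions, not the patented matching rule:

```python
# Sketch: build the initial target sequence y0 = <s> P1 P2 ... Pn </s> by
# looking up source-side terms in a prior-knowledge (term) base.

def build_initial_sequence(source_tokens, term_base):
    """term_base maps a source-language word/phrase to its target-language constraint."""
    constraints = [term_base[w] for w in source_tokens if w in term_base]
    return ["<s>"] + constraints + ["</s>"]

# toy usage with placeholder term pairs
term_base = {"src_term_1": "tgt_term_1", "src_term_2": "tgt_term_2"}
y0 = build_initial_sequence(["w1", "src_term_1", "w2", "src_term_2"], term_base)
print(y0)  # ['<s>', 'tgt_term_1', 'tgt_term_2', '</s>']
```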
The method for deleting redundant words is to perform the deletion operation during formal decoding by combining the constraint conditions, the source text information, and the generated historical translation.
If the prior knowledge does not cover the whole source text information, or a whole prior-knowledge constraint is removed by the deletion operation, then the constraint condition is not applied during translation generation, or the final translation does not contain the prior knowledge. To solve this problem, the present invention introduces a constraint mask to indicate the positions of constraint markers in the sequence, and prohibits deleting the constraint markers specified by the constraint mask. Constraint-mask placeholders are added randomly, the positions of the constraint masks are recomputed and updated at each iteration, and finally suitable candidate texts are selected from the prior knowledge base to replace the corresponding constraint masks. Specifically, the following manner may be adopted:
The introduced constraint mask operation includes two phases: constraint mask prediction and candidate text prediction.
In the constraint mask prediction phase, in each iteration, for the decoder input sequence y = <s> y1, y2, ..., yn, the model uses a classifier at each possible slot (y_i, y_{i+1}) in y to predict whether a constraint mask, denoted <PLH>, should be inserted, as shown in formula (3):

\hat{y}_i^{plh} = \mathrm{softmax}\left(\mathrm{concat}(h_i, h_{i+1}) \cdot \theta\right)    (3)

wherein \hat{y}_i^{plh} represents the prediction result of the constraint mask predictor at the i-th position in the sequence y, θ represents a model parameter, softmax represents the classification function, h_i represents the hidden state of the i-th word, h_{i+1} represents the hidden state of the (i+1)-th word, concat represents the splicing (concatenation) function, and n is the length of the sequence y.
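Formula (3) can be illustrated with a small PyTorch-style classifier over concatenated neighbouring hidden states; the dimensions and the binary insert/no-insert decision are assumptions for illustration:

```python
# Sketch of the constraint mask predictor: for each slot (y_i, y_{i+1}) the
# concatenated hidden states are classified to decide whether <PLH> is inserted.

import torch
import torch.nn as nn

class ConstraintMaskPredictor(nn.Module):
    def __init__(self, d_model=278, n_classes=2):   # 2 classes: insert / do not insert
        super().__init__()
        self.proj = nn.Linear(2 * d_model, n_classes)

    def forward(self, hidden):                        # hidden: (batch, n, d_model)
        pairs = torch.cat([hidden[:, :-1], hidden[:, 1:]], dim=-1)   # concat(h_i, h_{i+1})
        return torch.softmax(self.proj(pairs), dim=-1)               # (batch, n-1, n_classes)

probs = ConstraintMaskPredictor()(torch.randn(1, 6, 278))
print(probs.shape)  # torch.Size([1, 5, 2])
```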
In the candidate text prediction phase, for each predicted constraint mask, the model replaces the placeholder by training a token predictor that selects the actual character from the prior knowledge base, as shown in formula (4):

\hat{y}_i^{tok} = \mathrm{softmax}(h_i \cdot C), \quad \forall\, y_i = <PLH>    (4)

wherein \hat{y}_i^{tok} represents the prediction result of the candidate text predictor at the positions in the sequence y where a mask exists, θ represents a model parameter, softmax represents the classification function, h_i represents the hidden state of the i-th word, and y_i is the i-th word in the sequence; ∀ is a logical symbol representing "arbitrary", and C is a parameter shared with the word embedding matrix.
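A corresponding sketch for formula (4), again in PyTorch-style pseudocode with assumed dimensions; the actual candidate predictor restricts its choices to the prior knowledge base, which is omitted here for brevity:

```python
# Sketch of the candidate text predictor: every <PLH> position is filled with
# the most probable token, and the output projection C is shared with the
# word-embedding matrix.

import torch
import torch.nn as nn

class CandidateTokenPredictor(nn.Module):
    def __init__(self, vocab_size=40000, d_model=278):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)   # provides the shared matrix C

    def forward(self, hidden, plh_mask):                     # hidden: (batch, n, d_model)
        logits = hidden @ self.embedding.weight.t()          # h_i · C
        tokens = torch.softmax(logits, dim=-1).argmax(dim=-1)
        # only positions where plh_mask is True (i.e. y_i = <PLH>) are replaced
        return torch.where(plh_mask, tokens, torch.full_like(tokens, -1))

predictor = CandidateTokenPredictor()
mask = torch.tensor([[False, True, False]])
print(predictor(torch.randn(1, 3, 278), mask))
```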
Step 3: use a sentence-level knowledge distillation method, exploiting the guidance of the autoregressive model during training so that the non-autoregressive model learns the distribution of hidden variables and attention of the autoregressive model, thereby enhancing the model capacity and improving the translation effect.
At present, most non-autoregressive translation methods adopt knowledge distillation in the training stage, so that a smaller student model learns from a larger teacher model and obtains effective information from its feature distribution. The present invention improves the translation performance of the model through a sentence-level knowledge distillation method.
Specifically, first, the prior knowledge base is added to the original training corpus, and an autoregressive translation model is trained as the teacher model (the training method may adopt a term-constrained NMT method);
Then, the source text is translated with the trained teacher model to obtain the translation y';
Finally, a non-autoregressive translation model (for details, reference may be made to non-autoregressive translation with enhanced decoder input) is trained with the pseudo-parallel corpus (x, y'), where x represents the input sequence of the source language. In this way, the prior knowledge is integrated into the training corpus, and the problem of repeated or missing translation of the non-autoregressive translation method on the original corpus is reduced.
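The three-step distillation pipeline can be sketched as follows; the teacher and student interfaces are hypothetical stand-ins, not the actual training code:

```python
# Sketch of sentence-level knowledge distillation: the AR teacher re-labels the
# source side, and the NAR student is then trained on the pseudo-parallel corpus (x, y').

def distill_corpus(teacher_translate, source_sentences):
    """teacher_translate maps a source sentence x to the teacher's translation y'."""
    return [(x, teacher_translate(x)) for x in source_sentences]

def train_student(nar_train_step, pseudo_corpus, epochs=1):
    for _ in range(epochs):
        for x, y_prime in pseudo_corpus:
            nar_train_step(x, y_prime)   # e.g. a cross-entropy update on (x, y')

# toy usage with stand-in callables
pseudo = distill_corpus(lambda x: x.upper(), ["source sentence 1", "source sentence 2"])
train_student(lambda x, y: None, pseudo)
print(pseudo[0])  # ('source sentence 1', 'SOURCE SENTENCE 1')
```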
The translation model adopts a greedy search algorithm: a plurality of candidate translations is generated through parallel decoding, and the translation sequence with the highest probability is then selected as the final translation.
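A minimal sketch of this selection step, assuming candidates are scored by the sum of their per-token log-probabilities (the scoring rule is an illustrative assumption):

```python
# Sketch: pick the candidate translation with the highest total log-probability.

import math

def select_best(candidates):
    """candidates: list of (tokens, per-token log-probabilities)."""
    return max(candidates, key=lambda item: sum(item[1]))[0]

best = select_best([
    (["a", "b"], [math.log(0.9), math.log(0.8)]),
    (["a", "c"], [math.log(0.6), math.log(0.7)]),
])
print(best)  # ['a', 'b']
```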
Example verification
In order to prove the effect of the invention, experimental verification is carried out on a Tibetan-Chinese parallel corpus data set of a certain scale, and a comparative analysis is carried out against mainstream baseline models. The experimental procedure is as follows:
1. Experimental data
This embodiment independently constructs a term library in a specific field, covering Tibetan-Chinese professional new-word terms from 2015 to mid-2021; 30300 Tibetan-Chinese term pairs are obtained after cleaning. The CCMT2019 and JudCorpus Tibetan-Chinese parallel corpora are used as training corpora, where the CCMT2019 corpus contains 147434 sentence pairs and the JudCorpus corpus contains 163000 sentence pairs. For fair comparison, the Chinese corpus is processed with BPE, and the Tibetan corpus is processed with phonetic-word fusion. The vocabulary sizes of Tibetan and Chinese are 40000 phonetic-word fusion units and 40000 subwords, respectively, and the two vocabularies are shared throughout training. The invention combines the Test2018 and JudDev data sets as the development set, and adopts the Test2017, Dev2017, and JudTest data sets as the test sets.
2. Experimental setup
In this embodiment, the experimental parameters follow the settings of Gu et al. [17], with modifications based on RecoverSAT, a semi-autoregressive translation system developed by Ran et al. on top of the Transformer model. In the actual experiments, the corresponding parameters were modified owing to the smaller corpus scale: d_model = 278, d_hidden = 507, n_layer = 5, n_head = 2, p_dropout = 0.1. This embodiment uses a sequence-level distillation approach that mimics not only each word but also the distribution over whole sentences. The output of the teacher is sampled with a beam search algorithm and the student is trained with cross entropy, with the beam size set to 5. An AR translation model is pre-trained first, its encoder parameters are then used to initialize the encoder of the network and the corresponding parameters are shared, and the whole experiment is trained on a platform with two GeForce GTX 1080Ti GPUs, each with 11 GB of video memory.
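For reference, the quoted hyper-parameters can be gathered into a small configuration sketch; the values are copied from this embodiment, while the field names themselves are illustrative only:

```python
# Experimental configuration of this embodiment (field names illustrative).
config = {
    "d_model": 278,
    "d_hidden": 507,
    "n_layer": 5,
    "n_head": 2,
    "p_dropout": 0.1,
    "beam_size": 5,                 # beam used to sample the teacher's outputs
    "distillation": "sentence-level",
    "gpus": "2 x GeForce GTX 1080Ti, 11 GB each",
}
```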
3. Baseline method
For comparison, in this embodiment, corresponding baseline systems were selected in addition to the typical AR translation method based on the Transformer model (AT-FT) on both the Tibetan-to-Chinese and Chinese-to-Tibetan translation tasks: the NAR translation model (NAT-FT) first proposed by Jiatao Gu et al., and the term-constrained NMT training method (AT-FT+Term) proposed by Georgiana Dinu et al. The invention analyzes the models from the dimensions of the translation rate of new-word terms (Term%), the decoding speed (Time(s)), and the BLEU score on five test sets in the two domains.
4. Analysis of experimental data
A. Analysis of term usage and decoding rate
The performance of the AR and NAR models after introducing the prior knowledge is analyzed from two dimensions: the translation rate of new-word terms (Term%) and the decoding speed (Time(s)). The new-word term library is combined with both ends of each test set, and common words are replaced with new-word terms through a semantic similarity matching method so as to construct test sets containing new-word terms. The translation rate Term% of new-word terms refers to the ratio of the number of new-word terms accurately translated in the translation to the total number of new-word terms contained at the source end of the corresponding test set, and the decoding rate Time(s) refers to the average time consumed to decode one sentence. Table 1 shows the translation rate Term% and the decoding rate Time(s) of new-word terms on the five test sets of the two translation tasks.
Table 1 Translation rate Term% and decoding rate Time(s) of new-word terms on five test sets of the two translation tasks
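The two measures reported in Table 1 can be computed as in the following sketch; the containment check and the helper names are simplifying assumptions:

```python
# Sketch of the evaluation measures: Term% (correctly translated new-word terms
# over all terms at the source end) and Time(s) (average seconds per decoded sentence).

import time

def term_rate(hypotheses, term_sets):
    translated = sum(sum(term in hyp for term in terms)
                     for hyp, terms in zip(hypotheses, term_sets))
    total = sum(len(terms) for terms in term_sets)
    return 100.0 * translated / max(total, 1)

def decoding_time(decode_fn, sentences):
    start = time.perf_counter()
    outputs = [decode_fn(s) for s in sentences]
    return (time.perf_counter() - start) / max(len(sentences), 1), outputs

print(term_rate(["t1 found here"], [["t1", "t2"]]))  # 50.0
```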
Analyzing the translation rate of new-word terms: in the Tibetan-to-Chinese translation task, the Term translation rate of the AT-FT+Term baseline method is at least 86.45%, at most 98.32%, and 90.83% on average, while that of the method of the invention is at least 88.21%, at most 99.98%, and 92.58% on average. In the Chinese-to-Tibetan translation task, the Term translation rate of the AT-FT+Term baseline method is at least 86.12%, at most 96.73%, and 90.56% on average, while that of the method of the invention is at least 90.77%, at most 99.95%, and 94.55% on average. Compared with the baseline method, the model improves the translation rate of new-word terms in both the Tibetan-to-Chinese and the Chinese-to-Tibetan task. Therefore, the NMT method that combines the NAR training mode with the integration of the prior-knowledge new-word term library not only improves the quality of the whole translation but also effectively translates new-word terms in a specific field, and thus has strong application value in industry. In the analysis of the decoding rate, the decoding rate of the method is obviously improved over the baseline method on the five test sets. For example, the decoding rate of the method of the invention is improved by 24.13% over the baseline AT-FT+Term method on the five test sets of the Tibetan-to-Chinese task, and by 46.17% over the baseline AT-FT+Term method on the five test sets of the Chinese-to-Tibetan task. On average, decoding one sentence is at least 11.11% more efficient, which is an important performance improvement and can bring great benefit to related work.
B. Quality analysis of autoregressive and non-autoregressive translations
This embodiment further analyzes the improvement effect of the autoregressive (AR) and non-autoregressive (NAR) decoding modes on translation quality; the specific experiment is shown in Table 2. The table reports the BLEU values of the method of the invention and the other three baseline methods on the five test sets of the two translation tasks. In order to make the models comparable, the invention strictly controls the scale and granularity of the experimental data, the experimental platform, and other related resources and parameters. Judging only from the BLEU values of the experimental results, the translation quality scores of the AR and NAR models are comparable, both with and without the integration of the prior-knowledge new-word term library. On the Test2017, JudTest, JudDev, Dev2017, and JudTest test sets of the Tibetan-to-Chinese and Chinese-to-Tibetan tasks, the value of AT-FT+Term is higher than that of NAT-FT+Term, with an average drop of 0.07 BLEU; on the other test sets of the two tasks, the value of AT-FT+Term is lower than that of NAT-FT+Term, with an average gain of 0.43 BLEU. It can be seen that the magnitude of the decrease on the test sets of the two translation tasks is far lower than the magnitude of the increase, which proves that integrating prior knowledge into the non-autoregressive translation model effectively improves its translation performance.
Table 2 BLEU values of the method of the present invention and the other three baseline methods on five test sets of the two translation tasks
In order to further analyze the difference between the two translation modes, the invention compares the differences in the BLEU values of the two translation frameworks on the two translation tasks, as shown in Fig. 2. It can be seen that in both translation tasks, integrating the prior-knowledge new-word term library remarkably improves the performance of both the autoregressive and the non-autoregressive translation model. For example, in the autoregressively trained Tibetan-to-Chinese translation model, the AT-FT+Term method improves the BLEU value on the test set Test2017 by (43.37 − 42.84 = 0.53) compared with the AT-FT method, and in the non-autoregressively trained Tibetan-to-Chinese translation model, the NAT-FT+Term method improves the BLEU value on the test set Test2018 by (20.66 − 17.86 = 2.80) compared with the NAT-FT method. In addition, in the Tibetan-to-Chinese translation task, the NAT-FT+Term model improves the BLEU value on the test set Dev2017 by (42.34 − 41.89 = 0.45) compared with the AT-FT+Term model.
Different cross-comparison experiments prove that in both the Tibetan-to-Chinese and Chinese-to-Tibetan translation tasks, integrating the prior-knowledge new-word term library improves the indexes of the translation model, and in 50% of the tests the quality of the non-autoregressive translations is even higher than that of the autoregressive translations.
C. Influence of the number of divided blocks on the quality of the generated translation
Training is carried out with different values of the block number K, and the optimal segmentation value K is finally selected. In the experiment, K is set to 2, 5, 8, and 10, respectively; Test2018 and JudDev test sets containing new-word terms are newly constructed, and the BLEU value of the translation, the translation rate Term% of new-word terms, and the decoding rate Time(s) under the different K values are counted. The experimental results are shown in Table 3. When K is set to 2, the corresponding BLEU score and the translation rate of new terms are highest, but the decoding rate is slower. At this point, the difference between the BLEU value of the translation and that of the autoregressive translation is small, while the decoding rate is improved by 40.16% compared with the autoregressive model. When K is set to 10, the corresponding BLEU score and translation rate of new-word terms decrease, but the decoding rate is fastest. In this case, on the premise that the BLEU score of the translation decreases by less than 2.4, the decoding rate is remarkably improved compared with the autoregressive model.
Table 3 test results when different block numbers K are set
In summary, the value of K is inversely related to the BLEU value of the translation and to the translation rate of new terms, and positively related to the decoding rate. When K = 2, the BLEU value of the translation and the translation rate of new terms are both preserved, while the decoding rate is still effectively improved.
D. Translation sequence generation and refinement mode analysis
Compared with the insertion and deletion mechanisms of LevT, the invention introduces mechanisms such as deleting wrong words in the historical translation, inserting prior-knowledge term constraint words, predicting reasonable translation words, and retention. In addition, the method assists the non-autoregressive translation model through a sentence-level knowledge distillation method, improves the efficiency with which the model acquires information, and effectively alleviates problems such as information asymmetry. During training, the Transformer model updates the representation of each word layer by layer, which leaves the model little flexibility at decoding time, whereas LevT edits the sentence at each layer during training, which not only generates the translation sequence flexibly but also continuously refines the sequence and changes information such as the sequence length, thereby effectively improving the performance of the machine translation model. Compared with the Transformer model, the method of the invention generally obtains better translation quality and greatly improves the running speed.
Most existing NMT systems can achieve satisfactory results in the general domain, but they require large-scale parallel corpora to train well and obtain the best results. For the translation requirements of specific domains with scarce resources, a method that can integrate prior knowledge is therefore all the more significant. In addition, current NMT systems are built with an AR translation model as the core framework; however, because decoding in an AR translation model cannot be performed in parallel, the efficiency of generating translations is limited. Therefore, constructing a semi-autoregressive translation framework by modifying the AR translation mode has high application value. Based on the two problems presented above, the present invention proposes to build an NMT model that fuses domain-specific term libraries.
By comparing the AR and NAR translation modes and combining them with the Transformer model, the invention adopts a semi-autoregressive method in the decoder. The semi-autoregressive decoding mode is realized through blocking: when the translation is generated, it is decoded block by block and synchronously. The blocks are decoded in parallel in NAR mode, while tokens within each block are decoded serially in AR mode. Prior knowledge is incorporated within the blocks. The NMT model fusing the domain-specific term library is realized through mechanisms such as deleting erroneous information in the historical translation, inserting prior-knowledge term constraint words, predicting reasonable translation words, and retention. Finally, the method is compared with the three baseline methods on the five test sets of the two translation tasks, and indexes such as the BLEU value of the translations, the translation rate of terms, and the decoding rate are analyzed; it can be seen that the method achieves large improvements over the other baseline methods on all three indexes. Meanwhile, external discrete prior knowledge is effectively integrated without increasing the computational complexity. In addition, in order to determine an accurate block number K, the method separately sets K to 2, 5, 8, and 10 and tests the corresponding translation quality, term translation rate, and decoding rate. When K = 2, translation efficiency is ensured while new-word terms in a specific field can be effectively translated, so the method has strong application value in industry.