Detailed Description
The process according to the invention is described in further detail below with reference to the accompanying drawings.
A low-resource machine translation method fusing domain terms in a semi-autoregressive manner, comprising the following steps:
Step 1: construct a decoding method based on a semi-autoregressive model to realize sequence generation in a semi-autoregressive manner.
Specifically, the constructed semi-autoregressive model is consistent with the Transformer at the encoder end, and decodes in a semi-autoregressive manner at the decoder end.
As shown in Fig. 1, the working principle of the semi-autoregressive decoder is as follows. When generating the translation, the decoder partitions the translation into blocks and decodes them synchronously. For example, a translation sequence S is partitioned into different blocks S1, S2, ..., SK. Within a block, the next word is predicted by autoregressive (AR) decoding, combining the source text information, the prior knowledge, and the generated historical translation, and the decoder generates a corresponding word or symbol for each incomplete block at every step. Specifically, as shown in formula (1):

P(y \mid x) = \prod_{t=1}^{L} \prod_{i=1}^{K} P\left(y_t^{i} \mid y_{<t}^{1}, \ldots, y_{<t}^{K}, x\right)    (1)

wherein P(y|x) represents the conditional probability, x represents the input sequence, and y represents the output sequence; y_t^i represents the t-th word or symbol in the i-th block; y_{<t}^i represents the historical translation already generated for the i-th block; L is the total length of the blocks, and K represents the number of blocks.
The predicted word or symbol in the i-th block Si is computed as shown in formula (2):

y_t^{i} = \arg\max_{w \in V \cup \{<BOS>, <EOS>\}} P\left(w \mid y_{<t}^{1}, \ldots, y_{<t}^{K}, x\right)    (2)

where V denotes the vocabulary, <BOS> and <EOS> denote the start and end symbols, respectively, P(·) denotes the probability distribution of the corresponding expression, and argmax returns the argument that maximizes the probability.
When y_t^i = <BOS>, the block Si is indicated as beginning decoding, and insertion from the constraint term library is allowed;
When y_t^i ∈ V, the block Si is not yet complete, and decoding is allowed to continue;
When y_t^i = <EOS>, the block Si is complete, and decoding of this block stops.
During the whole decoding process, when the predicted word is <EOS>, the historical information remains unchanged; once the maximum length is reached, decoding of the sequence is complete.
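The block-wise decoding procedure of Step 1 can be illustrated with a minimal sketch; the helper names below are assumptions, and predict_next stands in for the decoder's argmax prediction of formula (2), not the patented implementation itself:

```python
# Minimal sketch of block-wise semi-autoregressive decoding: the K blocks are
# advanced in parallel at each step, and inside a block the next token is
# chosen autoregressively until <EOS> or the maximum length is reached.

BOS, EOS = "<BOS>", "<EOS>"

def sat_decode(src, k, predict_next, max_len=50):
    """predict_next(src, blocks, i) -> next token for block i (cf. formula (2))."""
    blocks = [[BOS] for _ in range(k)]            # every block starts decoding
    finished = [False] * k
    for _ in range(max_len):
        if all(finished):
            break
        for i in range(k):                        # conceptually parallel across blocks
            if finished[i]:
                continue
            token = predict_next(src, blocks, i)  # AR step inside block i
            if token == EOS:
                finished[i] = True                # block i is complete; stop decoding it
            else:
                blocks[i].append(token)
    # concatenate the blocks (dropping the start symbols) into the final translation
    return [tok for block in blocks for tok in block[1:]]

# toy usage with a dummy predictor that ends every block immediately
print(sat_decode("source sentence", k=2, predict_next=lambda s, b, i: EOS))  # []
```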
Step 2: construct a term library fused with the specific field, and fuse this external knowledge into the translated sentences.
Specifically, during decoding, sequence generation in an autoregressive manner is realized within each block by incorporating prior knowledge. Words from the prior knowledge are fused in and redundant words are deleted, thereby realizing the fusion of prior knowledge.
In the stage of formally introducing the prior knowledge and decoding, a first iteration is performed with the sentence boundary y0 = <s> </s>; before the wrong words in the historical translation are deleted, the y0 sequence is filled with the target constraints, and the target sequence is then optimized through editing iterations. Here <s> represents the beginning of the sentence and </s> represents the end of the sentence.
The prior knowledge is integrated as follows:
The input to the decoder is a uniform embedding mapping z, z = f(x; θ_enc), where f(·) represents the mapping function and x represents the source-language input sequence, together with a given set of n pairs of prior knowledge {P1, P2, ..., Pn}, where z_k = ε(x_i), t = 1, 2, ..., Ty. Here θ_enc denotes the relevant parameters of the encoder, z_k denotes the mapping result of the k-th block, ε(x_i) denotes a step function, t denotes the current time step, Tx denotes the source-language sequence length, Ty denotes the target-language sequence length, and Pn denotes the n-th pair of prior knowledge.
Each pair of prior knowledge Pj consists of a different word or phrase wj, j = 1, 2, ..., n. Before formal decoding, the decoder combines the source-language information in advance, selects the corresponding target-language expressions from the prior knowledge base, and inserts them into the sequence y0 to be generated, y0 = <s> P1, P2, ..., Pn.
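How the initial sequence y0 might be assembled from the term base can be sketched as follows; the exact-token lookup and the names are illustrative assumptions, not the patented matching rule:

```python
# Sketch: build the initial target sequence y0 = <s> P1 P2 ... Pn </s> by
# looking up source-side terms in a prior-knowledge (term) base.

def build_initial_sequence(source_tokens, term_base):
    """term_base maps a source-language word/phrase to its target-language constraint."""
    constraints = [term_base[w] for w in source_tokens if w in term_base]
    return ["<s>"] + constraints + ["</s>"]

# toy usage with placeholder term pairs
term_base = {"src_term_1": "tgt_term_1", "src_term_2": "tgt_term_2"}
y0 = build_initial_sequence(["w1", "src_term_1", "w2", "src_term_2"], term_base)
print(y0)  # ['<s>', 'tgt_term_1', 'tgt_term_2', '</s>']
```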
The method for deleting redundant words is to perform the deletion operation during formal decoding by combining the constraint conditions, the source text information, and the generated historical translation.
If the prior knowledge does not cover the whole source text information, or a whole prior-knowledge constraint is removed by the deletion operation, then the constraint condition is not applied during translation generation, or the final translation does not contain the prior knowledge. To solve this problem, the present invention introduces a constraint mask to indicate the positions of constraint markers in the sequence, and prohibits deleting the constraint markers specified by the constraint mask. Constraint-mask placeholders are added randomly, the positions of the constraint masks are recomputed and updated at each iteration, and finally suitable candidate texts are selected from the prior knowledge base to replace the corresponding constraint masks. Specifically, the following manner may be adopted:
The introduced constraint mask operation includes two phases: constraint mask prediction and candidate text prediction.
In the constraint mask prediction phase, in each iteration, for the decoder input sequence y = <s> y1, y2, ..., yn, the model uses a classifier at each possible slot (y_i, y_{i+1}) in y to predict whether a constraint mask, denoted <PLH>, should be inserted, as shown in formula (3):

\hat{y}_i^{plh} = \mathrm{softmax}\left(\mathrm{concat}(h_i, h_{i+1}) \cdot \theta\right)    (3)

wherein \hat{y}_i^{plh} represents the prediction result of the constraint mask predictor at the i-th position in the sequence y, θ represents a model parameter, softmax represents the classification function, h_i represents the hidden state of the i-th word, h_{i+1} represents the hidden state of the (i+1)-th word, concat represents the splicing (concatenation) function, and n is the length of the sequence y.
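Formula (3) can be illustrated with a small PyTorch-style classifier over concatenated neighbouring hidden states; the dimensions and the binary insert/no-insert decision are assumptions for illustration:

```python
# Sketch of the constraint mask predictor: for each slot (y_i, y_{i+1}) the
# concatenated hidden states are classified to decide whether <PLH> is inserted.

import torch
import torch.nn as nn

class ConstraintMaskPredictor(nn.Module):
    def __init__(self, d_model=278, n_classes=2):   # 2 classes: insert / do not insert
        super().__init__()
        self.proj = nn.Linear(2 * d_model, n_classes)

    def forward(self, hidden):                        # hidden: (batch, n, d_model)
        pairs = torch.cat([hidden[:, :-1], hidden[:, 1:]], dim=-1)   # concat(h_i, h_{i+1})
        return torch.softmax(self.proj(pairs), dim=-1)               # (batch, n-1, n_classes)

probs = ConstraintMaskPredictor()(torch.randn(1, 6, 278))
print(probs.shape)  # torch.Size([1, 5, 2])
```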
In the candidate text prediction phase, for each predicted constraint mask, the model replaces the placeholder by training a token predictor that selects the actual character from the prior knowledge base, as shown in formula (4):

\hat{y}_i^{tok} = \mathrm{softmax}(h_i \cdot C), \quad \forall\, y_i = <PLH>    (4)

wherein \hat{y}_i^{tok} represents the prediction result of the candidate text predictor at the positions in the sequence y where a mask exists, θ represents a model parameter, softmax represents the classification function, h_i represents the hidden state of the i-th word, and y_i is the i-th word in the sequence; ∀ is a logical symbol representing "arbitrary", and C is a parameter shared with the word embedding matrix.
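A corresponding sketch for formula (4), again in PyTorch-style pseudocode with assumed dimensions; the actual candidate predictor restricts its choices to the prior knowledge base, which is omitted here for brevity:

```python
# Sketch of the candidate text predictor: every <PLH> position is filled with
# the most probable token, and the output projection C is shared with the
# word-embedding matrix.

import torch
import torch.nn as nn

class CandidateTokenPredictor(nn.Module):
    def __init__(self, vocab_size=40000, d_model=278):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)   # provides the shared matrix C

    def forward(self, hidden, plh_mask):                     # hidden: (batch, n, d_model)
        logits = hidden @ self.embedding.weight.t()          # h_i · C
        tokens = torch.softmax(logits, dim=-1).argmax(dim=-1)
        # only positions where plh_mask is True (i.e. y_i = <PLH>) are replaced
        return torch.where(plh_mask, tokens, torch.full_like(tokens, -1))

predictor = CandidateTokenPredictor()
mask = torch.tensor([[False, True, False]])
print(predictor(torch.randn(1, 3, 278), mask))
```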
Step 3: use a sentence-level knowledge distillation method, exploiting the guidance of the autoregressive model during training so that the non-autoregressive model learns the distribution of hidden variables and attention of the autoregressive model, thereby enhancing the model capacity and improving the translation effect.
At present, most non-autoregressive translation methods adopt knowledge distillation in the training stage, so that a smaller student model learns from a larger teacher model and obtains effective information from its feature distribution. The present invention improves the translation performance of the model through a sentence-level knowledge distillation method.
Specifically, first, the prior knowledge base is added to the original training corpus, and an autoregressive translation model is trained as the teacher model (the training method may adopt a term-constrained NMT method);
Then, the source text is translated with the trained teacher model to obtain the translation y';
Finally, a non-autoregressive translation model (for details, reference may be made to non-autoregressive translation with enhanced decoder input) is trained with the pseudo-parallel corpus (x, y'), where x represents the input sequence of the source language. In this way, the prior knowledge is integrated into the training corpus, and the problem of repeated or missing translation of the non-autoregressive translation method on the original corpus is reduced.
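The three-step distillation pipeline can be sketched as follows; the teacher and student interfaces are hypothetical stand-ins, not the actual training code:

```python
# Sketch of sentence-level knowledge distillation: the AR teacher re-labels the
# source side, and the NAR student is then trained on the pseudo-parallel corpus (x, y').

def distill_corpus(teacher_translate, source_sentences):
    """teacher_translate maps a source sentence x to the teacher's translation y'."""
    return [(x, teacher_translate(x)) for x in source_sentences]

def train_student(nar_train_step, pseudo_corpus, epochs=1):
    for _ in range(epochs):
        for x, y_prime in pseudo_corpus:
            nar_train_step(x, y_prime)   # e.g. a cross-entropy update on (x, y')

# toy usage with stand-in callables
pseudo = distill_corpus(lambda x: x.upper(), ["source sentence 1", "source sentence 2"])
train_student(lambda x, y: None, pseudo)
print(pseudo[0])  # ('source sentence 1', 'SOURCE SENTENCE 1')
```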
The translation model adopts a greedy search algorithm: a plurality of candidate translations is generated through parallel decoding, and the translation sequence with the highest probability is then selected as the final translation.
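A minimal sketch of this selection step, assuming candidates are scored by the sum of their per-token log-probabilities (the scoring rule is an illustrative assumption):

```python
# Sketch: pick the candidate translation with the highest total log-probability.

import math

def select_best(candidates):
    """candidates: list of (tokens, per-token log-probabilities)."""
    return max(candidates, key=lambda item: sum(item[1]))[0]

best = select_best([
    (["a", "b"], [math.log(0.9), math.log(0.8)]),
    (["a", "c"], [math.log(0.6), math.log(0.7)]),
])
print(best)  # ['a', 'b']
```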
Example verification
In order to prove the effect of the invention, experimental verification is carried out on a Tibetan-Chinese parallel corpus data set of a certain scale, and a comparative analysis is carried out against mainstream baseline models. The experimental procedure is as follows:
1. Experimental data
This embodiment independently constructs a term library in a specific field, covering Tibetan-Chinese professional new-word terms from 2015 to mid-2021; 30300 Tibetan-Chinese term pairs are obtained after cleaning. The CCMT2019 and JudCorpus Tibetan-Chinese parallel corpora are used as training corpora, where the CCMT2019 corpus contains 147434 sentence pairs and the JudCorpus corpus contains 163000 sentence pairs. For fair comparison, the Chinese corpus is processed with BPE, and the Tibetan corpus is processed with phonetic-word fusion. The vocabulary sizes of Tibetan and Chinese are 40000 phonetic-word fusion units and 40000 subwords, respectively, and the two vocabularies are shared throughout training. The invention combines the Test2018 and JudDev data sets as the development set, and adopts the Test2017, Dev2017, and JudTest data sets as the test sets.
2. Experimental setup
In this embodiment, the experimental parameters follow the settings of Gu et al. [17], with modifications based on RecoverSAT, a semi-autoregressive translation system developed by Ran et al. on top of the Transformer model. In the actual experiments, the corresponding parameters were modified owing to the smaller corpus scale: d_model = 278, d_hidden = 507, n_layer = 5, n_head = 2, p_dropout = 0.1. This embodiment uses a sequence-level distillation approach that mimics not only each word but also the distribution over whole sentences. The output of the teacher is sampled with a beam search algorithm and the student is trained with cross entropy, with the beam size set to 5. An AR translation model is pre-trained first, its encoder parameters are then used to initialize the encoder of the network and the corresponding parameters are shared, and the whole experiment is trained on a platform with two GeForce GTX 1080Ti GPUs, each with 11 GB of video memory.
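For reference, the quoted hyper-parameters can be gathered into a small configuration sketch; the values are copied from this embodiment, while the field names themselves are illustrative only:

```python
# Experimental configuration of this embodiment (field names illustrative).
config = {
    "d_model": 278,
    "d_hidden": 507,
    "n_layer": 5,
    "n_head": 2,
    "p_dropout": 0.1,
    "beam_size": 5,                 # beam used to sample the teacher's outputs
    "distillation": "sentence-level",
    "gpus": "2 x GeForce GTX 1080Ti, 11 GB each",
}
```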
3. Baseline method
For comparison, in this embodiment, corresponding baseline systems were selected in addition to the typical AR translation method based on the Transformer model (AT-FT) on both the Tibetan-to-Chinese and Chinese-to-Tibetan translation tasks: the NAR translation model (NAT-FT) first proposed by Jiatao Gu et al., and the term-constrained NMT training method (AT-FT+Term) proposed by Georgiana Dinu et al. The invention analyzes the models from the dimensions of the translation rate of new-word terms (Term%), the decoding speed (Time(s)), and the BLEU score on five test sets in the two domains.
4. Analysis of experimental data
A. Analysis of term usage and decoding rate
The performance of the AR and NAR models after introducing the prior knowledge is analyzed from two dimensions: the translation rate of new-word terms (Term%) and the decoding speed (Time(s)). The new-word term library is combined with both ends of each test set, and common words are replaced with new-word terms through a semantic similarity matching method so as to construct test sets containing new-word terms. The translation rate Term% of new-word terms refers to the ratio of the number of new-word terms accurately translated in the translation to the total number of new-word terms contained at the source end of the corresponding test set, and the decoding rate Time(s) refers to the average time consumed to decode one sentence. Table 1 shows the translation rate Term% and the decoding rate Time(s) of new-word terms on the five test sets of the two translation tasks.
Table 1 Translation rate Term% and decoding rate Time(s) of new-word terms on five test sets of the two translation tasks
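The two measures reported in Table 1 can be computed as in the following sketch; the containment check and the helper names are simplifying assumptions:

```python
# Sketch of the evaluation measures: Term% (correctly translated new-word terms
# over all terms at the source end) and Time(s) (average seconds per decoded sentence).

import time

def term_rate(hypotheses, term_sets):
    translated = sum(sum(term in hyp for term in terms)
                     for hyp, terms in zip(hypotheses, term_sets))
    total = sum(len(terms) for terms in term_sets)
    return 100.0 * translated / max(total, 1)

def decoding_time(decode_fn, sentences):
    start = time.perf_counter()
    outputs = [decode_fn(s) for s in sentences]
    return (time.perf_counter() - start) / max(len(sentences), 1), outputs

print(term_rate(["t1 found here"], [["t1", "t2"]]))  # 50.0
```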
Analyzing the translation rate of new-word terms: in the Tibetan-to-Chinese translation task, the Term translation rate of the AT-FT+Term baseline method is at least 86.45%, at most 98.32%, and 90.83% on average, while that of the method of the invention is at least 88.21%, at most 99.98%, and 92.58% on average. In the Chinese-to-Tibetan translation task, the Term translation rate of the AT-FT+Term baseline method is at least 86.12%, at most 96.73%, and 90.56% on average, while that of the method of the invention is at least 90.77%, at most 99.95%, and 94.55% on average. Compared with the baseline method, the model improves the translation rate of new-word terms in both the Tibetan-to-Chinese and the Chinese-to-Tibetan task. Therefore, the NMT method that combines the NAR training mode with the integration of the prior-knowledge new-word term library not only improves the quality of the whole translation but also effectively translates new-word terms in a specific field, and thus has strong application value in industry. In the analysis of the decoding rate, the decoding rate of the method is obviously improved over the baseline method on the five test sets. For example, the decoding rate of the method of the invention is improved by 24.13% over the baseline AT-FT+Term method on the five test sets of the Tibetan-to-Chinese task, and by 46.17% over the baseline AT-FT+Term method on the five test sets of the Chinese-to-Tibetan task. On average, decoding one sentence is at least 11.11% more efficient, which is an important performance improvement and can bring great benefit to related work.
B. Quality analysis of autoregressive and non-autoregressive translations
This embodiment further analyzes the improvement effect of the autoregressive (AR) and non-autoregressive (NAR) decoding modes on translation quality; the specific experiment is shown in Table 2. The table reports the BLEU values of the method of the invention and the other three baseline methods on the five test sets of the two translation tasks. In order to make the models comparable, the invention strictly controls the scale and granularity of the experimental data, the experimental platform, and other related resources and parameters. Judging only from the BLEU values of the experimental results, the translation quality scores of the AR and NAR models are comparable, both with and without the integration of the prior-knowledge new-word term library. On the Test2017, JudTest, JudDev, Dev2017, and JudTest test sets of the Tibetan-to-Chinese and Chinese-to-Tibetan tasks, the value of AT-FT+Term is higher than that of NAT-FT+Term, with an average drop of 0.07 BLEU; on the other test sets of the two tasks, the value of AT-FT+Term is lower than that of NAT-FT+Term, with an average gain of 0.43 BLEU. It can be seen that the magnitude of the decrease on the test sets of the two translation tasks is far lower than the magnitude of the increase, which proves that integrating prior knowledge into the non-autoregressive translation model effectively improves its translation performance.
Table 2 BLEU values of the method of the present invention and the other three baseline methods on five test sets of the two translation tasks
In order to further analyze the difference between the two translation modes, the invention compares the differences in the BLEU values of the two translation frameworks on the two translation tasks, as shown in Fig. 2. It can be seen that in both translation tasks, integrating the prior-knowledge new-word term library remarkably improves the performance of both the autoregressive and the non-autoregressive translation model. For example, in the autoregressively trained Tibetan-to-Chinese translation model, the AT-FT+Term method improves the BLEU value on the test set Test2017 by (43.37 − 42.84 = 0.53) compared with the AT-FT method, and in the non-autoregressively trained Tibetan-to-Chinese translation model, the NAT-FT+Term method improves the BLEU value on the test set Test2018 by (20.66 − 17.86 = 2.80) compared with the NAT-FT method. In addition, in the Tibetan-to-Chinese translation task, the NAT-FT+Term model improves the BLEU value on the test set Dev2017 by (42.34 − 41.89 = 0.45) compared with the AT-FT+Term model.
Different cross-comparison experiments prove that in both the Tibetan-to-Chinese and Chinese-to-Tibetan translation tasks, integrating the prior-knowledge new-word term library improves the indexes of the translation model, and in 50% of the tests the quality of the non-autoregressive translations is even higher than that of the autoregressive translations.
C. Influence of the number of divided blocks on the quality of the generated translation
Training is carried out with different values of the block number K, and the optimal segmentation value K is finally selected. In the experiment, K is set to 2, 5, 8, and 10, respectively; Test2018 and JudDev test sets containing new-word terms are newly constructed, and the BLEU value of the translation, the translation rate Term% of new-word terms, and the decoding rate Time(s) under the different K values are counted. The experimental results are shown in Table 3. When K is set to 2, the corresponding BLEU score and the translation rate of new terms are highest, but the decoding rate is slower. At this point, the difference between the BLEU value of the translation and that of the autoregressive translation is small, while the decoding rate is improved by 40.16% compared with the autoregressive model. When K is set to 10, the corresponding BLEU score and translation rate of new-word terms decrease, but the decoding rate is fastest. In this case, on the premise that the BLEU score of the translation decreases by less than 2.4, the decoding rate is remarkably improved compared with the autoregressive model.
Table 3 test results when different block numbers K are set
In summary, the value of K is inversely related to the BLEU value of the translation and to the translation rate of new terms, and positively related to the decoding rate. When K = 2, the BLEU value of the translation and the translation rate of new terms are both preserved, while the decoding rate is still effectively improved.
D. Translation sequence generation and refinement mode analysis
Compared with the insertion and deletion mechanisms of LevT, the invention introduces mechanisms such as deleting wrong words in the historical translation, inserting prior-knowledge term constraint words, predicting reasonable translation words, and retention. In addition, the method assists the non-autoregressive translation model through a sentence-level knowledge distillation method, improves the efficiency with which the model acquires information, and effectively alleviates problems such as information asymmetry. During training, the Transformer model updates the representation of each word layer by layer, which leaves the model little flexibility at decoding time, whereas LevT edits the sentence at each layer during training, which not only generates the translation sequence flexibly but also continuously refines the sequence and changes information such as the sequence length, thereby effectively improving the performance of the machine translation model. Compared with the Transformer model, the method of the invention generally obtains better translation quality and greatly improves the running speed.
Most existing NMT systems can achieve satisfactory results in the general domain, but they require large-scale parallel corpora to train well and obtain the best results. For the translation requirements of specific domains with scarce resources, a method that can integrate prior knowledge is therefore all the more significant. In addition, current NMT systems are built with an AR translation model as the core framework; however, because decoding in an AR translation model cannot be performed in parallel, the efficiency of generating translations is limited. Therefore, constructing a semi-autoregressive translation framework by modifying the AR translation mode has high application value. Based on the two problems presented above, the present invention proposes to build an NMT model that fuses domain-specific term libraries.
By comparing the AR and NAR translation modes and combining them with the Transformer model, the invention adopts a semi-autoregressive method in the decoder. The semi-autoregressive decoding mode is realized through blocking: when the translation is generated, it is decoded block by block and synchronously. The blocks are decoded in parallel in NAR mode, while tokens within each block are decoded serially in AR mode. Prior knowledge is incorporated within the blocks. The NMT model fusing the domain-specific term library is realized through mechanisms such as deleting erroneous information in the historical translation, inserting prior-knowledge term constraint words, predicting reasonable translation words, and retention. Finally, the method is compared with the three baseline methods on the five test sets of the two translation tasks, and indexes such as the BLEU value of the translations, the translation rate of terms, and the decoding rate are analyzed; it can be seen that the method achieves large improvements over the other baseline methods on all three indexes. Meanwhile, external discrete prior knowledge is effectively integrated without increasing the computational complexity. In addition, in order to determine an accurate block number K, the method separately sets K to 2, 5, 8, and 10 and tests the corresponding translation quality, term translation rate, and decoding rate. When K = 2, translation efficiency is ensured while new-word terms in a specific field can be effectively translated, so the method has strong application value in industry.