CN109446537B

Info

Publication number: CN109446537B
Application number: CN201811306229.2A
Authority: CN
Inventors: 詹文法; 邵志伟; 陶鹏程; 张振林; 刘德阳
Original assignee: Anqing Normal University
Current assignee: Beijing Fengling Technology Co ltd
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2022-11-25
Anticipated expiration: 2038-11-05
Also published as: CN109446537A

Abstract

Translated from Chinese

The invention discloses a translation evaluation method and device for machine translation. The method includes: acquiring several corpus entries from a corpus and obtaining, for each entry, the concatenation of the context word vectors it contains; initializing the word vectors of words of different parts of speech contained in those entries; using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model; obtaining the target word of each corpus entry and translating it with the trained CBOW model; and obtaining the translation produced by a model to be evaluated for the target word, and evaluating the accuracy of that translation according to the similarity between the translation from the model to be evaluated and the translation from the trained CBOW model. By applying embodiments of the invention, the accuracy of translation results can be evaluated automatically.

Description

A translation evaluation method and device for machine translation

Technical field

The present invention relates to a translation evaluation method and device, and more particularly to a translation evaluation method and device for machine translation.

Background

With the development of modern society, the demand for conversion between languages keeps growing. In practice, traditional machine translation is rule-based: it relies on grammatical and semantic theory and obtains the translation by analyzing the grammatical collocation relationships in the context. However, because rules cannot cover every sentence, traditional machine translation mostly amounts to literal syntactic translation or conversion of sentence patterns.

With the continuous development of artificial intelligence, representation learning based on neural networks has begun to stand out in many fields. In particular, on a number of tasks, chiefly image recognition and speech recognition, methods based on representation learning outperform traditional methods based mainly on statistical learning. Modern machine translation is based on a "bilingual corpus": it uses a bilingual corpus containing many sentence patterns, extracts from the corpus example sentences similar to the input sentence at translation time, and then converts the source language into the target language with reference to the bilingual sentence patterns.

Natural language is an abstract expression of human intelligence and is difficult to represent with existing data structures. In natural language processing, the basic unit of data is the character or the word. For example, "苹果" (apple) can denote either a fruit or the company Apple. "麦克风" and "话筒" both denote a microphone, yet no correct connection between them can be established from the characters alone. Thus, at present, most translation systems can translate the general meaning of a sentence correctly. However, word and sentence usage differs significantly between languages, and translation results often suffer from wrong word order and from confused or misused words. For long sentences in particular, machine translation cannot achieve good accuracy, so the prior art has the technical problem that translation results still require manual evaluation.

Summary of the invention

The technical problem to be solved by the present invention is to provide a translation evaluation method and device for machine translation, so as to solve the prior-art technical problem that translation results still require manual evaluation.

The present invention solves the above technical problem through the following technical solutions:

An embodiment of the present invention provides a translation evaluation method for machine translation, the method comprising:

acquiring several corpus entries from a corpus, and obtaining the concatenation of the context word vectors contained in each entry; and initializing the word vectors of words of different parts of speech contained in the several entries;

using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;

obtaining the target word of each corpus entry, and translating it with the trained CBOW model;

obtaining the translation of the target word produced by a model to be evaluated, and evaluating the accuracy of that model's translation according to the similarity between the translation from the model to be evaluated and the translation from the trained CBOW model.

Optionally, initializing the word vectors of words of different parts of speech contained in the several corpus entries includes:

initializing the word vectors of words of different parts of speech contained in the several corpus entries using mutually non-overlapping value ranges.

Optionally, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further includes:

removing from each corpus entry all punctuation marks other than the set punctuation marks, where the set punctuation marks include one or a combination of: punctuation marks that express the tone of the entry, and punctuation marks that end the entry.

Optionally, obtaining the target word of each corpus entry includes:

Using the formula [formula image missing from the source], the target word of each corpus entry is obtained, where:

P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp() is the exponential function with base e; x is the input layer of the CBOW model; ∑ is the summation function; v is the corpus; ()^T denotes the matrix transpose.
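The formula itself did not survive in this text. Assuming the standard CBOW softmax form, which is consistent with the symbols defined above, it may be reconstructed roughly as follows. The output embedding $e'(w)$ and the reading of v as the set $V$ of candidate words are assumptions that the source does not state.

```latex
P(w \mid c) = \frac{\exp\left(e'(w)^{T} x\right)}{\sum_{w' \in V} \exp\left(e'(w')^{T} x\right)}
```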

Optionally, each corpus entry is a single sentence.

An embodiment of the present invention provides a translation evaluation device for machine translation, the device comprising:

an acquisition module, configured to acquire several corpus entries from a corpus, and obtain the concatenation of the context word vectors contained in each entry; and to initialize the word vectors of words of different parts of speech contained in the several entries;

Optionally, the acquisition module is configured to:

Optionally, the device further includes: a removal module, configured to remove from each corpus entry all punctuation marks other than the set punctuation marks, where the set punctuation marks include one or a combination of: punctuation marks that express the tone of the entry, and punctuation marks that end the entry.

Using the formula [formula image missing from the source], the target word of each corpus entry is obtained.

Compared with the prior art, the present invention has the following advantages:

Because context word order plays an important role in translation, applying an embodiment of the present invention and concatenating the context word vectors contained in each corpus entry yields a more accurate translation model. The model trained by the embodiment can then be used to proofread the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiment can evaluate the accuracy of translation results automatically.

Description of drawings

FIG. 1 is a schematic flowchart of a translation evaluation method for machine translation provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a CBOW model provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a translation evaluation device for machine translation provided by an embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementations and specific operating procedures are given, but the protection scope of the present invention is not limited to the following embodiments.

Embodiments of the present invention provide a translation evaluation method and device for machine translation. The translation evaluation method is introduced first.

FIG. 1 is a schematic flowchart of a translation evaluation method for machine translation provided by an embodiment of the present invention. As shown in FIG. 1, the method includes:

S101: Acquire several corpus entries from a corpus, and obtain the concatenation of the context word vectors contained in each entry; and initialize the word vectors of words of different parts of speech contained in the several entries.

Specifically, mutually non-overlapping value ranges may be used to initialize the word vectors of words of different parts of speech contained in the several corpus entries. Each corpus entry is a single sentence.
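As an illustration of initializing word vectors by part of speech from non-overlapping ranges, a minimal Python sketch follows. The part-of-speech tags, the numeric ranges, the vector dimension and the uniform distribution are all assumptions chosen for the example; the source only requires that the ranges for different parts of speech do not overlap.

```python
import numpy as np

# Hypothetical part-of-speech tags and value ranges; the source only requires
# that the ranges for different parts of speech do not overlap.
POS_RANGES = {
    "noun": (0.0, 0.1),
    "verb": (0.2, 0.3),
    "adj": (0.4, 0.5),
}


def init_word_vectors(vocab_pos, dim, seed=0):
    """Return {word: vector}, each vector drawn uniformly from its POS range."""
    rng = np.random.default_rng(seed)
    vectors = {}
    for word, pos in vocab_pos.items():
        low, high = POS_RANGES[pos]
        # Every component of the vector lies in [low, high).
        vectors[word] = rng.uniform(low, high, size=dim)
    return vectors


vocab_pos = {"apple": "noun", "eat": "verb", "red": "adj"}
vectors = init_word_vectors(vocab_pos, dim=4)
```

With such an initialization, the part of speech of a word can be read off from the range its initial vector falls in, which is one plausible way for the model to receive part-of-speech information.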

For example, a language model can be learned from a large-scale corpus. Because the quality of the language model directly affects the judgment of whether a sentence is correct, choosing a suitable corpus is important. For Chinese, entries from Chinese Wikipedia can be used for modeling.

S102: Use the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model.

FIG. 2 is a schematic structural diagram of a CBOW model provided by an embodiment of the present invention. As shown in FIG. 2, the CBOW model (Continuous Bag of Words) includes an input layer x and an output layer y. The input layer receives different phrases, which are translated and then output by the output layer.
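A forward pass of such a model can be sketched in Python as below. This is only an illustration: the source gives no layer sizes or weight matrices, so `E_in`, `W_out`, the dimensions and the softmax output are assumptions. The one detail taken from the text is that the context vectors are concatenated, which preserves word order, whereas classic CBOW averages them.

```python
import numpy as np


def cbow_probabilities(context_ids, E_in, W_out):
    """Softmax distribution over the vocabulary for one context window.

    context_ids: indices of the n-1 context words, in sentence order
    E_in:  (vocab_size, dim) input embeddings
    W_out: (vocab_size, (n-1)*dim) output weights, one row per candidate word
    """
    # Concatenate (not average) so that word order is kept in x.
    x = np.concatenate([E_in[i] for i in context_ids])
    scores = W_out @ x
    scores -= scores.max()  # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
vocab_size, dim, n_context = 6, 3, 4
E_in = rng.normal(size=(vocab_size, dim))
W_out = rng.normal(size=(vocab_size, n_context * dim))
probs = cbow_probabilities([0, 1, 3, 4], E_in, W_out)
```

Because of the concatenation, swapping two context words changes x and therefore the predicted distribution, which would not happen with an averaged input.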

S103: Obtain the target word of each corpus entry, and translate it with the trained CBOW model.

Specifically, the target word of each corpus entry can be obtained using the formula [formula image missing from the source], where,

(w, c) is an n-gram $w_{i-(n-1)/2}, \ldots, w_{i+(n-1)/2}$ selected from the corpus. n is generally chosen to be odd, which ensures that the number of context words is the same before and after the target word.
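The selection of (w, c) pairs can be sketched as follows. The function name `extract_windows` and the choice to skip positions without a full window are assumptions made for the example.

```python
def extract_windows(tokens, n=5):
    """Yield (target, context) pairs for every full n-gram in tokens."""
    if n % 2 == 0:
        raise ValueError("n must be odd so both sides have equal context")
    half = (n - 1) // 2
    for i in range(half, len(tokens) - half):
        # (n-1)/2 words before the target and (n-1)/2 words after it.
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        yield tokens[i], context


pairs = list(extract_windows(["I", "like", "to", "eat", "red", "apples"], n=5))
```

For this six-word sentence and n = 5, only the third and fourth words have two neighbours on each side, so two pairs are produced.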

The optimization objective of the model can be: [formula image missing from the source]

where D is the corpus.
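The objective formula is also missing from this text. Assuming the usual maximum-likelihood training of a CBOW model, it would be the log-likelihood of the target words over all (w, c) pairs in D, which is to be maximized. This reading is an assumption and is not confirmed by the source.

```latex
\max \sum_{(w, c) \in D} \log P(w \mid c)
```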

S104: Obtain the translation of the target word produced by the model to be evaluated, and evaluate the accuracy of that model's translation according to the similarity between the translation from the model to be evaluated and the translation from the trained CBOW model.

In practice, a translated sentence is judged multiple times using a sliding window. For example, with a window size of 5, the 1st, 2nd, ... words of the translation are taken in turn as the center word. Each judgment yields a similarity value; the similarity values are then averaged, and the resulting average is the score of the translated sentence. The higher the score, the more correct the translation.
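The windowed scoring described above could be sketched as follows. The similarity measure (cosine similarity between word vectors), the truncation of the window at sentence boundaries, and the `predict_center` callback standing in for the trained CBOW model are all assumptions; the source does not specify them.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def score_translation(tokens, predict_center, vectors, window=5):
    """Average similarity between each word and the word predicted from its context."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    half = (window - 1) // 2
    sims = []
    for i, word in enumerate(tokens):
        # Context is truncated at sentence boundaries (an assumption).
        context = tokens[max(0, i - half):i] + tokens[i + 1:i + half + 1]
        predicted = predict_center(context)
        sims.append(cosine(vectors[word], vectors[predicted]))
    return sum(sims) / len(sims)


# Toy vectors and a toy predictor that always answers "a".
vectors = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
score_same = score_translation(["a", "a", "a"], lambda ctx: "a", vectors)
score_diff = score_translation(["b", "b"], lambda ctx: "a", vectors)
```

In the toy example the first sentence matches the prediction at every position and the second never does, so the two scores sit at opposite ends of the scale.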

Because context word order plays an important role in translation, applying the embodiment shown in FIG. 1 and concatenating the context word vectors contained in each corpus entry yields a more accurate translation model. The model trained by the embodiment can then be used to proofread the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiment can evaluate the accuracy of translation results automatically.

Specifically, in one implementation of the embodiment, before step S102, the method further includes:

Before training the model, when the corpus is processed, special symbols are removed and punctuation marks that are useful to the model are kept, for example the period, exclamation mark and question mark.
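A minimal sketch of this preprocessing step is shown below. The set of marks to keep and the use of Unicode general categories to detect punctuation and symbols are assumptions; the source only names the period, exclamation mark and question mark as examples.

```python
import unicodedata

# Assumed set of punctuation to keep: sentence-final and tone-bearing marks,
# in both ASCII and full-width Chinese forms.
KEEP = set(".!?。！？")


def strip_punctuation(sentence):
    """Remove punctuation and symbols except the marks listed in KEEP."""
    kept = []
    for ch in sentence:
        category = unicodedata.category(ch)
        is_punct_or_symbol = category.startswith("P") or category.startswith("S")
        if ch in KEEP or not is_punct_or_symbol:
            kept.append(ch)
    return "".join(kept)


cleaned = strip_punctuation("你好，“世界”！#@ Really?")
```

Here the comma, the quotation marks, `#` and `@` are dropped, while the exclamation mark and question mark survive because they carry tone or mark the end of a sentence.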

By improving the language model, the present invention adds sentence information such as word order, part of speech and punctuation, which increases the representational power of the language model so that it can represent more complex sentences. The improved language model, combined with machine translation, makes it possible to judge the correctness of machine-translated text and to improve the accuracy of machine translation.

Corresponding to the embodiment shown in FIG. 1, an embodiment of the present invention further provides a translation evaluation device for machine translation.

FIG. 3 is a schematic structural diagram of a translation evaluation device for machine translation provided by an embodiment of the present invention. As shown in FIG. 3, the device includes:

an acquisition module 301, configured to acquire several corpus entries from a corpus, and obtain the concatenation of the context word vectors contained in each entry; and to initialize the word vectors of words of different parts of speech contained in the several entries;

In a specific implementation of the embodiment, the acquisition module 301 is configured to:

In a specific implementation of the embodiment, the device further includes: a removal module, configured to remove from each corpus entry all punctuation marks other than the set punctuation marks, where the set punctuation marks include one or a combination of: punctuation marks that express the tone of the entry, and punctuation marks that end the entry.

In a specific implementation of the embodiment, the acquisition module 301 is configured to: obtain the target word of each corpus entry using the formula [formula image missing from the source].

In a specific implementation of the embodiment, each corpus entry is a single sentence.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A translation evaluation method for machine translation, characterized in that the method comprises:

using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;

2. The translation evaluation method for machine translation according to claim 1, characterized in that initializing the word vectors of words of different parts of speech contained in the several corpus entries comprises:

3. The translation evaluation method for machine translation according to claim 1, characterized in that, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further comprises:

4. The translation evaluation method for machine translation according to claim 1, characterized in that obtaining the target word of each corpus entry comprises:

using the formula [formula image missing from the source] to obtain the target word of each corpus entry, where P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp() is the exponential function with base e; x is the input layer of the CBOW model; ∑ is the summation function; v is the corpus; ()^T denotes the matrix transpose.

5. The translation evaluation method for machine translation according to claim 1, characterized in that each corpus entry is a single sentence.

6. A translation evaluation device for machine translation, characterized in that the device comprises:

7. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:

8. The translation evaluation device for machine translation according to claim 6, characterized in that the device further comprises: a removal module, configured to remove from each corpus entry all punctuation marks other than the set punctuation marks, where the set punctuation marks include one or a combination of: punctuation marks that express the tone of the entry, and punctuation marks that end the entry.

9. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:

obtain the target word of each corpus entry using the formula [formula image missing from the source].

10. The translation evaluation device for machine translation according to claim 6, characterized in that each corpus entry is a single sentence.