CN104391885A

Info

Publication number: CN104391885A
Application number: CN201410624648.6A
Authority: CN
Inventors: 曹海龙; 张捷鑫; 赵铁军
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin University Of Technology High Tech Development Corp
Priority date: 2014-11-07
Filing date: 2014-11-07
Publication date: 2015-03-04
Anticipated expiration: 2034-11-07
Also published as: CN104391885B

Abstract

Translated from Chinese

A method for extracting parallel phrase pairs from document-level comparable corpora, trained on a parallel corpus. The invention addresses two problems: parallel corpora are expensive to obtain, and treating the two words or fragments with the most similar contexts as mutual translations depends heavily on a bilingual dictionary when applied to comparable corpora. The method comprises the following steps: (1) define a source-language sentence set S and a target-language sentence set T; (2) obtain the set of phrase pairs of the parallel corpus; (3) obtain the parallel phrase pairs of the parallel corpus; (4) obtain the non-parallel phrase pairs of the parallel corpus; (5) obtain a support vector machine binary classifier; (6) extract candidate parallel phrase pairs <s,t>; (7) obtain the noisy parallel phrase pairs of the comparable corpus; (8) obtain the parallel phrase pairs of the comparable corpus; (9) obtain an extended decoder. The invention applies to the extraction of parallel phrase pairs from comparable corpora.

Description

A method for extracting parallel phrase pairs from document-level comparable corpora based on parallel-corpus training

Technical field

The invention relates to methods for extracting phrase translation pairs, and in particular to a method for extracting phrase translation pairs at the document level.

Background

With the emergence of high-coverage media such as radio, television, and the Internet, the distance between people in space and time has shrunk abruptly, international exchange has become ever more frequent and convenient, and the whole earth resembles a small village in a vast universe. To let people communicate without barriers, machine translation, the automatic translation from one language into another, faces enormous market demand and broad application prospects.

In recent years computing power has advanced rapidly, and the growth and spread of the Internet, together with the multilingual archives of bilingual countries and the United Nations, have provided tens of millions of sentences of bilingual parallel corpora. These laid the necessary foundation for statistical machine translation, and many new models and methods have since been proposed with good results.

Building a statistical machine translation system generally involves two main steps, training and translation. The training step learns statistical knowledge from corpora and tunes parameters. Training a typical phrase-based statistical machine translation system comprises three main parts: translation model training on a large bilingual corpus, language model training on a monolingual target-language corpus, and parameter tuning. The size of the parallel corpus used for training is the main factor affecting translation performance. Some language pairs, such as Chinese-English and Arabic-English, have large amounts of parallel data available, but most language pairs do not: their parallel data is scarce or nonexistent, as for Hindi-English and French-Japanese, which severely degrades machine translation performance. Because parallel corpora are very expensive to obtain, other resources must be used to train statistical machine translation systems. Compared with parallel corpora, comparable corpora exist in large quantities for every language pair and are easy to obtain from the web, news, magazines, and similar sources. These comparable corpora contain many bilingual documents carrying similar information. How to bring this information into statistical machine translation has attracted growing attention, and researchers are using a variety of methods to extract richer and more accurate parallel knowledge from comparable corpora and add it to translation systems to improve their performance.

Most approaches to extracting parallel knowledge from comparable corpora rest on the distributional hypothesis, which holds that two words or fragments that are translations of each other across languages have similar or even identical contexts. Under this hypothesis, researchers map the contexts of unknown source-language and target-language words into a vector space through a bilingual dictionary and then compute the similarity between the vectors, for example by cosine distance, Euclidean distance, or skew distance. The two words or fragments with the most similar contexts are taken to be translations of each other. Many newer methods derive from this original one, for example by adding topic, semantic, or transliteration information, and they achieve some success. However, a parallel corpus has a symmetric structure and satisfies the hypothesis well, whereas a comparable corpus is asymmetric and sometimes does not. Treating the two words or fragments with the most similar contexts as mutual translations is therefore problematic on comparable corpora. The method also depends heavily on the bilingual dictionary: the size of the seed dictionary directly affects the quality of the extracted parallel knowledge.

Summary of the invention

The purpose of the invention is to provide a method for extracting phrase translation pairs from document-level comparable corpora that addresses the following problems: parallel data for statistical machine translation is scarce or nonexistent, parallel corpora are expensive to obtain, and treating the two words or fragments with the most similar contexts as mutual translations depends heavily on a bilingual dictionary when applied to comparable corpora.

This purpose is achieved through the following technical solution:

Step 1. Define the source-language sentence set S and the target-language sentence set T in the corpus, where the corpus comprises a parallel corpus and a comparable corpus;

Step 2. Split S and T in turn into phrases of a specified length, 2 to 7 words, and combine the resulting phrases pairwise to obtain the set of phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;

Step 3. Use the GIZA++ tool to extract a bidirectional word translation table from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system in Moses, which yields a phrase translation table. From the information in the bidirectional word translation table and the phrase translation table, extract the positive training examples, that is, the parallel phrase pairs of the parallel corpus. Each word translation pair in the bidirectional word translation table is followed by its translation probability. The phrase translation table contains five scores (the bidirectional phrase translation probabilities, the bidirectional lexical weights, and the word penalty) and the word alignment inside each phrase;

Step 4. Remove the parallel phrase pairs obtained in step 3 from the set of phrase pairs obtained in step 2; the remainder are the negative training examples, that is, the non-parallel phrase pairs of the parallel corpus;

Step 5. Extract classification features from the parallel phrase pairs and from the non-parallel phrase pairs of the parallel corpus; input the classification features into SVMlight and train with a radial basis function kernel to obtain a support vector machine binary classifier;

Step 6. Combine the sentences of the source-language documents of the comparable corpus with the sentences of its target-language documents and filter the combinations to obtain pseudo-parallel sentence pairs <S,T>. Extract candidate parallel phrase pairs <s,t> from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, with minimum source phrase length ≤ i ≤ maximum source phrase length, and t is a substring of sentence T of length j, with minimum target phrase length ≤ j ≤ maximum target phrase length;

Step 7. Classify the candidate parallel phrase pairs <s,t> with the support vector machine binary classifier to obtain the noisy parallel phrase pairs of the comparable corpus;

Step 8. Filter the noisy parallel phrase pairs of the comparable corpus: set a threshold θ, θ∈(0,1), and remove every phrase pair whose average log word translation probability is below θ, which yields the parallel phrase pairs of the comparable corpus;

Step 9. Add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder to obtain an extended decoder, where the baseline decoder is evaluated by the baseline BLEU score and the extended decoder by the extended BLEU score. This completes the method for extracting parallel phrase pairs from document-level comparable corpora based on parallel-corpus training.
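The phrase enumeration of steps 2 and 6 can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenized sentences; the function names `ngrams` and `phrase_pairs` are hypothetical and do not come from the patent.

```python
from itertools import product


def ngrams(tokens, min_len=2, max_len=7):
    """Return every contiguous word sequence whose length is in [min_len, max_len]."""
    spans = []
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            spans.append(tuple(tokens[start:start + n]))
    return spans


def phrase_pairs(src_tokens, tgt_tokens, min_len=2, max_len=7):
    """Pair every source phrase with every target phrase (one phrase from each side)."""
    return list(product(ngrams(src_tokens, min_len, max_len),
                        ngrams(tgt_tokens, min_len, max_len)))


# Toy sentences: 3 source phrases x 6 target phrases.
pairs = phrase_pairs("a b c".split(), "x y z w".split())
```

The number of pairs grows with the product of the phrase counts on both sides, which is why the later steps filter sentence pairs before enumerating phrases.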

Effects of the invention

The purpose of the invention is to mine parallel phrases from comparable corpora and so address the scarcity of parallel data. It aims to make full use of abundant comparable corpus resources, obtaining parallel phrases from them to improve the performance of phrase-based statistical machine translation systems.

The invention recasts the extraction of parallel phrases from comparable corpora as a binary classification problem. Useful feature information is extracted from the training data, a support vector machine binary classifier is built and used to separate parallel from non-parallel phrases, and the parallel phrases extracted from the comparable corpus are finally added to the translation system to improve machine translation quality. This is a fully automatic generate-and-test method.

Building the binary classifier consists of two parts, data acquisition and training:

In the data acquisition stage, the parallel source-language and target-language sentences S and T are known. S and T are each split by the specified lengths to generate all possible phrases, and the phrases are then paired; every phrase pair must contain one phrase from S and one phrase from T. The parallel data information that the GIZA++ tool obtains from S and T is used to label the training phrases as positive or negative examples.

In the training stage, nineteen features are extracted from the training data as classification features, using the parallel data information. Because the classification problem is nonlinear, a radial basis function kernel is used in the support vector machine classifier. A support vector machine classifier can thus be built from the training phrases obtained from the parallel corpus.
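As an illustration of how labeled feature vectors might be handed to SVMlight, the sketch below serializes examples into SVMlight's sparse text format (`<target> <index>:<value> ...`, indices starting at 1). The helper names and the file name are hypothetical; the patent does not describe this step in code.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight line: '<target> <index>:<value> ...'.

    label > 0 marks a parallel phrase pair (+1), anything else a
    non-parallel pair (-1). Feature indices start at 1 and increase.
    """
    target = "+1" if label > 0 else "-1"
    cols = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{target} {cols}"


def write_training_file(path, examples):
    """Write (label, feature_vector) examples, one per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, features in examples:
            fh.write(to_svmlight_line(label, features) + "\n")


line = to_svmlight_line(1, [1, 0, 0.5])
```

A file written this way could then be passed to SVMlight's `svm_learn` with kernel type `-t 2`, which selects the radial basis function kernel; the remaining options (for example the kernel width) are not given in the source and would have to be tuned.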

The performance of the invention is evaluated in two respects, classifier performance and effect on translation system performance:

The classifier is evaluated with the standard measures of precision, recall, and accuracy. Test phrases are generated in the same way as training phrases; to keep the test fair, the parallel data information used to label positive and negative examples must be the same as that used to generate the training phrases.
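For reference, the three measures can be computed from gold and predicted labels as in the sketch below, which assumes labels are encoded as 1 (parallel) and 0 (non-parallel); that encoding and the function name are illustrative choices.

```python
def classification_metrics(gold, predicted):
    """Return (precision, recall, accuracy) for binary labels 1/0."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted parallel
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among truly parallel
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0  # correct among all examples
    return precision, recall, accuracy
```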

The point of the invention is to obtain parallel phrases from comparable corpora in order to improve machine translation performance, so it is necessary to test whether the parallel phrases classified from the comparable corpus actually do so, judged by translation quality evaluation criteria. First a baseline decoder is trained on the small amount of parallel corpus available. The parallel phrases that the classifier extracts from the comparable corpus are then added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately.

The experimental results, namely the baseline BLEU score and the extended BLEU score, are shown in Table 3:

Table 3 shows that the invention classifies parallel and non-parallel phrases well, and that when the parallel phrases extracted from comparable corpora by the method of the invention are added to the translation system, the meaning of the translation output is closer to that of human translation.

Description of drawings

Fig. 1 is a flow chart of the method for extracting parallel phrase pairs from document-level comparable corpora based on parallel-corpus training.

Detailed description

Embodiment 1: The method of this embodiment for extracting parallel phrase pairs from document-level comparable corpora based on parallel-corpus training is carried out according to the following steps:

Step 2. Split S and T in turn into phrases of a specified length, 2 to 7 words, and combine the resulting phrases pairwise to obtain the full set of phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;

Step 3. Use the GIZA++ tool to extract a bidirectional word translation table from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system in Moses; the phrases contained in the resulting phrase translation table are mostly parallel phrase pairs. From the information in the bidirectional word translation table and the phrase translation table, extract the positive training examples (label the positive examples), that is, the parallel phrase pairs of the parallel corpus. Each word translation pair in the bidirectional word translation table is followed by its translation probability and, by normalization, the probabilities of all possible translations of a given word sum to 1. The phrase translation table contains five scores (the bidirectional phrase translation probabilities, the bidirectional lexical weights, and the word penalty) and the word alignment inside each phrase;

Step 4. Remove the parallel phrase pairs obtained in step 3 from the full set of phrase pairs obtained in step 2; the remainder are the negative training examples, that is, the non-parallel phrase pairs of the parallel corpus;

Two issues need attention when acquiring the training data set:

(1) A training example may occur many times while positive and negative examples are extracted, so duplicates must be removed during extraction to ensure that every training example is unique;

(2) The training set obtained from a parallel corpus can be very large. According to the source, training the classifier on too much data leads to overfitting and severely degrades classifier performance, so the training data set must be sampled and a suitable number of examples chosen as the final training set. When the quality of the positive and negative example sets can be guaranteed, random sampling can be used; some manual error checking can of course also be applied;
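The two points above (deduplication and random sampling) can be sketched as follows. The function name, the fixed seed, and the representation of examples as hashable tuples are assumptions made for illustration only.

```python
import random


def dedupe_and_sample(examples, sample_size, seed=0):
    """Drop duplicate examples, then draw a random sample of at most sample_size.

    Examples must be hashable, e.g. (source_phrase, target_phrase) tuples.
    """
    unique = list(dict.fromkeys(examples))  # keeps first occurrence, preserves order
    if len(unique) <= sample_size:
        return unique
    return random.Random(seed).sample(unique, sample_size)
```

Sampling positives and negatives separately with this helper would keep control over the class balance; the source does not say what balance was used.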

Step 6. The invention recasts the extraction of parallel phrase pairs from comparable corpora as a binary classification problem. Before parallel phrase pairs are extracted, the sentences of the source-language documents of the comparable corpus are combined with the sentences of its target-language documents and filtered to obtain pseudo-parallel sentence pairs <S,T>. Candidate parallel phrase pairs <s,t> are extracted from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, with minimum source phrase length ≤ i ≤ maximum source phrase length, and t is a substring of sentence T of length j, with minimum target phrase length ≤ j ≤ maximum target phrase length. This yields all candidate phrase pairs;

Step 7. Classify the candidate parallel phrase pairs <s,t> with the support vector machine binary classifier to obtain the noisy parallel phrase pairs of the comparable corpus; left unprocessed, the noise would harm translation system performance;

Step 8. Filter the noisy parallel phrase pairs of the comparable corpus: set a threshold θ, θ∈(0,1), from experience and the actual situation, and remove every phrase pair whose average log word translation probability is below θ, which yields the parallel phrase pairs of the comparable corpus;

Step 9. Add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, that is, of the phrase-based statistical machine translation system, to obtain an extended decoder, where the baseline decoder is evaluated by the baseline BLEU score and the extended decoder by the extended BLEU score, as shown in Fig. 1. This completes the method for extracting parallel phrase pairs from document-level comparable corpora based on parallel-corpus training.

Effects of this embodiment:

This embodiment recasts the extraction of parallel phrases from comparable corpora as a binary classification problem. Useful feature information is extracted from the training data, a support vector machine binary classifier is built and used to separate parallel from non-parallel phrases, and the parallel phrases extracted from the comparable corpus are finally added to the translation system to improve machine translation quality. This is a fully automatic generate-and-test method.

In the data acquisition stage, the parallel source-language and target-language sentences S and T are known. S and T are each split by the specified lengths to generate all possible phrases, and the phrases are then paired; every phrase pair must contain one phrase from S and one phrase from T. The parallel data information that the GIZA++ tool obtains from S and T is used to label the training phrases as positive or negative examples.

The classifier is evaluated with the standard measures of precision, recall, and accuracy. Test phrases are generated in the same way as training phrases; to keep the test fair, the parallel data information used to label positive and negative examples must be the same as that used to generate the training phrases.

The point of this embodiment is to obtain parallel phrases from comparable corpora in order to improve machine translation performance, so it is necessary to test whether the parallel phrases classified from the comparable corpus actually do so, judged by translation quality evaluation criteria. First a baseline decoder is trained on the small amount of parallel corpus available. The parallel phrases that the classifier extracts from the comparable corpus are then added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately.

Table 3 shows that this embodiment classifies parallel and non-parallel phrases well, and that when the parallel phrases extracted from comparable corpora by the method of the invention are added to the translation system, the meaning of the translation output is closer to that of human translation.

Embodiment 2: This embodiment differs from Embodiment 1 in that the positive training examples are extracted (labeled) in step 3 as follows:

(1) Let S_k be the word at position k in S, and consider the source phrase formed by the word sequence of S from position i to position j; let T_k' be the word at position k' in T, and consider the target phrase formed by the word sequence of T from position i' to position j'. Assume a threshold ε, ε∈(0,1);

(2) The threshold is chosen from experience and the actual situation. If the translation probability of two words in the bidirectional word translation table exceeds ε, the two words S_k and T_k' are considered translations of each other;

(3) S_k and T_k' are translations of each other, that is, aligned, if and only if k∈[i,j] and k'∈[i',j'];

when S_k and T_k' are not translations of each other, that is, not aligned, k∈[i,j] and k'∉[i',j'];

or, when S_k and T_k' are not translations of each other, k∉[i,j] and k'∈[i',j']. The source phrase and the target phrase are then considered translations of each other, and the pair is extracted as a positive training example. The other steps and parameters are the same as in Embodiment 1.
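One way to read the conditions above is as a boundary-consistency check: a phrase pair is a positive example when no pair of mutually translating words has exactly one word inside its span. The sketch below implements that reading. It additionally requires at least one translation link inside the spans, which is an assumption not stated in the source; `lex_prob` is a hypothetical dictionary standing in for the bidirectional word translation table, and positions are 0-based.

```python
def translation_links(src, tgt, lex_prob, eps=0.5):
    """Return {(k, k')} for word pairs whose translation probability exceeds eps."""
    return {(k, kp)
            for k, s in enumerate(src)
            for kp, t in enumerate(tgt)
            if lex_prob.get((s, t), 0.0) > eps}


def is_positive(links, i, j, ip, jp):
    """True if no link crosses the span boundary and at least one link lies inside.

    The source span is [i, j] and the target span is [ip, jp], both inclusive.
    """
    inside = False
    for k, kp in links:
        src_in = i <= k <= j
        tgt_in = ip <= kp <= jp
        if src_in != tgt_in:  # one word inside its span, its translation outside
            return False
        inside = inside or src_in
    return inside


lex = {("a", "x"): 0.9, ("b", "y"): 0.8, ("c", "z"): 0.7}
links = translation_links(["a", "b", "c"], ["x", "y", "z"], lex)
```

With these toy links, the pair covering source positions 0 to 1 and target positions 0 to 1 is accepted, while pairing source positions 0 to 1 with target positions 1 to 2 is rejected because the link between "a" and "x" crosses the boundary.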

Embodiment 3: This embodiment differs from Embodiment 1 or 2 in that the classification features extracted in step 5 from the parallel phrase pairs and the non-parallel phrase pairs of the parallel corpus are as follows:

(1) Phrase length difference: the absolute value of the difference between the lengths of the source phrase and the target phrase;

(2) Same start: 1 if the first word of the source phrase and the first word of the target phrase are translations of each other, otherwise 0;

(3) Same end: 1 if the last word of the source phrase and the last word of the target phrase are translations of each other, otherwise 0;

(4) Number of words in the phrase: the number of words in the source phrase and in the target phrase, respectively;

(5) Phrase length ratio: the ratio of the source phrase length to the target phrase length;

(6) Number of translations: the number of words in the source phrase that have a corresponding translation in the target phrase, where the word translation probability p(s|t) must exceed a threshold η;

(7) Number without translation: the number of words in the source phrase that have no corresponding translation in the target phrase;

(8) Translation ratio: the ratio of the number of source-phrase words that have a translation to the total number of words in the phrase;

(9) Half translated: 1 if at least half of the words of the source phrase have a translation in the target phrase, otherwise 0;

(10) Longest translated unit: the length of the longest contiguous word sequence in the source phrase that has a translation in the target phrase;

(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source phrase that has no translation in the target phrase;

Features (1) to (3) do not depend on the direction between source and target language, whereas features (4) to (11) are direction-dependent and are computed in both directions; 3 + 2 × 8 = 19 features are therefore extracted in total. The other steps and parameters are the same as in Embodiment 1 or 2.
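The nineteen features listed above can be sketched as follows. The callables `prob_st` and `prob_ts` are hypothetical stand-ins for lookups in the two directions of the word translation table, `eta=0.5` is an arbitrary illustrative threshold, and a word is taken to "have a translation" when some word of the other phrase exceeds the threshold; these are assumptions, not details confirmed by the source.

```python
def longest_run(bits, value):
    """Length of the longest contiguous run of `value` in a 0/1 list."""
    best = run = 0
    for bit in bits:
        run = run + 1 if bit == value else 0
        best = max(best, run)
    return best


def directional(a, b, prob, eta):
    """Features (4)-(11) for phrase a measured against phrase b."""
    # 1 for each word of a that some word of b translates with probability > eta
    bits = [int(any(prob(w, v) > eta for v in b)) for w in a]
    translated = sum(bits)
    return [
        len(a),                          # (4) number of words
        len(a) / len(b),                 # (5) length ratio
        translated,                      # (6) number of translations
        len(a) - translated,             # (7) number without translation
        translated / len(a),             # (8) translation ratio
        int(2 * translated >= len(a)),   # (9) at least half translated
        longest_run(bits, 1),            # (10) longest translated unit
        longest_run(bits, 0),            # (11) longest untranslated unit
    ]


def extract_features(src, tgt, prob_st, prob_ts, eta=0.5):
    """Return the 19-dimensional vector: 3 symmetric + 8 per direction."""
    symmetric = [
        abs(len(src) - len(tgt)),             # (1) length difference
        int(prob_st(src[0], tgt[0]) > eta),   # (2) same start
        int(prob_st(src[-1], tgt[-1]) > eta), # (3) same end
    ]
    return (symmetric
            + directional(src, tgt, prob_st, eta)
            + directional(tgt, src, prob_ts, eta))


table = {("a", "x"): 0.9, ("b", "y"): 0.8}
features = extract_features(
    ["a", "b", "c"], ["x", "y"],
    prob_st=lambda s, t: table.get((s, t), 0.0),
    prob_ts=lambda t, s: table.get((s, t), 0.0),
)
```

Both phrases are assumed non-empty, which holds for the 2 to 7 word phrases used by the method.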

Embodiment 4: This embodiment differs from any one of Embodiments 1 to 3 in that, in step 6, the sentences of the source-language documents of the comparable corpus are combined with the sentences of its target-language documents and filtered into pseudo-parallel sentence pairs <S,T> under the following conditions:

(1) The ratio of the word counts of the two sentences does not exceed 2;

(2) A dictionary lookup confirms that at least half of the words in one sentence have a translation in the other sentence;

Sentence pairs that do not satisfy both conditions are discarded; those that do are treated as pseudo-parallel sentence pairs. This process removes most non-parallel sentence pairs, but it also removes some nearly parallel pairs that fail the two conditions, mainly because the dictionary does not cover every entity. Such pairs are few and not necessarily reliable, so overall the filter greatly helps the precision and robustness of the system. Inevitably, the filter cannot remove all non-parallel sentence pairs, because the word-overlap condition is weak: stop words, for example, almost always have a translation in the other language, and if they happen to combine with a few matching content words to reach the overlap threshold, a non-parallel sentence pair may be misjudged as parallel. The other steps and parameters are the same as in any one of Embodiments 1 to 3.
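The two filter conditions can be sketched as below. The source does not say which sentence the "at least half" condition is measured on; this sketch checks the source sentence against the target sentence, which is an assumption. The `dictionary` argument is a hypothetical mapping from a source word to the set of its target translations.

```python
def is_pseudo_parallel(src, tgt, dictionary, max_ratio=2.0):
    """Return True if the sentence pair passes both filter conditions."""
    if not src or not tgt:
        return False
    # Condition (1): word-count ratio of the two sentences is at most max_ratio.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_ratio:
        return False
    # Condition (2): at least half of the source words have a dictionary
    # translation that occurs in the target sentence.
    tgt_words = set(tgt)
    covered = sum(1 for w in src if dictionary.get(w, set()) & tgt_words)
    return 2 * covered >= len(src)


dictionary = {"a": {"x"}, "b": {"y"}}
ok = is_pseudo_parallel(["a", "b", "c"], ["x", "y", "q"], dictionary)
```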

Embodiment 5: This embodiment differs from any one of Embodiments 1 to 4 in that, in step 8, the average of the logarithms of the word translation probabilities in each noisy parallel phrase pair from the comparable corpus is computed as follows:

$\mathrm{phrasepri} = \frac{\ln S_1 + \ln S_2 + \cdots + \ln S_n}{n}$

Here, S_i is the translation probability of the i-th word of the source-language phrase that has a translation in the target-language phrase, and n is the number of words of the source-language phrase that have a translation in the target-language phrase. The translation probabilities of stop words are excluded, because stop words contribute very little to the translation and are therefore ignored. The other steps and parameters are the same as in any one of Embodiments 1 to 4.
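As a hedged illustration of this score, the sketch below averages the natural logarithms of the word translation probabilities while skipping stop words. The lexical table layout (a mapping from (source word, target word) to probability), the function name, and the choice of the highest-probability target word as "the" translation of a source word are assumptions; the source does not specify them.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def phrase_score(src_phrase: List[str], tgt_phrase: List[str],
                 lex_prob: Dict[Tuple[str, str], float],
                 stop_words: Set[str]) -> Optional[float]:
    """Average of ln(S_i) over translated, non-stop source words.

    Returns None when no source word has a translation in the target phrase
    (n = 0), since the average is then undefined.
    """
    log_probs = []
    for s in src_phrase:
        if s in stop_words:
            continue  # stop words are excluded from the score
        # Assumption: S_i is the best probability among the target words.
        best = max((lex_prob.get((s, t), 0.0) for t in tgt_phrase), default=0.0)
        if best > 0.0:
            log_probs.append(math.log(best))
    if not log_probs:
        return None
    return sum(log_probs) / len(log_probs)
```

For example, with two translated content words of probability 0.5 each, the score is (ln 0.5 + ln 0.5) / 2 = ln 0.5.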

The following example is used to verify the beneficial effects of the invention:

Example 1:

In this example, a method for extracting parallel phrase pairs from document-level comparable corpora, trained on parallel corpora, is carried out according to the following steps:

(1) Let S_k be the word at position k of the source-language sentence set S, with a word sequence of S running from position i to position j, and let T_k' be the word at position k' of the target-language sentence set T, with a word sequence of T running from position i' to position j'; set a threshold ε = 0.5, where ε ∈ (0,1);

When S_k and T_k' are not mutual translations, and k' ∈ [i', j'], the word sequence of S from position i to position j and the word sequence of T from position i' to position j' are regarded as mutual translations, i.e., as an extracted positive training example.

(1) A training example may occur many times while positive and negative examples are being extracted, so duplicates are removed during extraction to ensure that every training example is unique.

(2) The training set obtained from the parallel corpus may be very large. Training the classifier on too much data leads to overfitting, which severely degrades classifier performance, so the training data must be sampled to arrive at a suitable number of final training examples. Provided the positive and negative example sets are of good quality, random sampling can be used; appropriate manual error checking can of course also be applied.
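The deduplication and random sampling described in (1) and (2) could look like the sketch below. The function name, the fixed seed, and the representation of a training example as a hashable tuple are assumptions made for illustration; the source does not state a sample size or a sampling procedure beyond "random sampling".

```python
import random
from typing import Hashable, List, Sequence, TypeVar

E = TypeVar("E", bound=Hashable)


def dedup_and_sample(examples: Sequence[E], k: int, seed: int = 0) -> List[E]:
    """Remove duplicate training examples, then draw at most k at random."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(examples))
    if len(unique) <= k:
        return unique
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(unique, k)
```

In practice one would likely call this separately on the positive and the negative examples, so that the sampled training set keeps a controlled class balance.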

The classification features extracted from the parallel phrase pairs and the non-parallel phrase pairs of the parallel corpus are as follows:

Features (1) to (3) are independent of the source-to-target direction, whereas features (4) to (11) are direction-dependent and are therefore counted once per direction. In total, 3 + 2 × 8 = 19 features are extracted.

The classification performance of the support vector machine (SVM) binary classifier is evaluated in terms of precision, recall, and accuracy. Five sets of training data are randomly selected to train five different classifiers, which are then tested on the same set of test data; the results are shown in Table 1:

Table 1

Using a single classifier, tests are run on five randomly selected sets of test data; the results are shown in Table 2:

Table 2

From these results it can be concluded that the binary SVM classification method described in the invention classifies parallel and non-parallel phrase pairs well. Testing with different training data and different test data shows that the method performs stably across data sets and achieves good results.
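For reference, the three evaluation measures named above can be computed as in the following sketch. It uses the standard definitions of precision, recall, and accuracy for a binary classifier, which the source does not spell out; the label convention (1 = parallel, 0 = non-parallel) and the function name are assumptions.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall and accuracy for labels 1 (parallel) / 0 (non-parallel)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    total = len(gold)
    return {
        # Share of pairs predicted parallel that really are parallel.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of truly parallel pairs that the classifier found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of all pairs that were classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```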

Step 6: The invention recasts the problem of extracting parallel phrase pairs from a comparable corpus as a binary classification problem. Before parallel phrase pairs are extracted, sentences from the source-language articles of the comparable corpus are first combined with sentences from the target-language articles of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs <S,T>, and candidate parallel phrase pairs <s,t> are extracted from the pseudo-parallel sentence pairs. Here s is a substring of sentence S of length i, with minimum source phrase length ≤ i ≤ maximum source phrase length (2 ≤ i ≤ 7), and t is a substring of sentence T of length j, with minimum target phrase length ≤ j ≤ maximum target phrase length (2 ≤ j ≤ 7). This yields all candidate phrase pairs.

The filtering conditions for obtaining the pseudo-parallel sentence pairs <S,T> are:

(1) the ratio of the word counts of the two sentences does not exceed 2;

(2) a dictionary check confirms that at least half of the words in one sentence have a translation in the other sentence.

A sentence pair that does not satisfy both conditions is discarded; a pair that satisfies both is treated as a pseudo-parallel sentence pair. This process removes most non-parallel sentence pairs, but it also removes some nearly parallel pairs that fail the two conditions, mainly because the dictionary does not contain every entity. Such pairs are few, however, and are not necessarily fully reliable, so on the whole the filter greatly benefits the precision and robustness of the system. Inevitably, the filter cannot remove every non-parallel pair, because the word-overlap condition is weak: stop words, for example, almost always have a translation in the other language, and if they happen to combine with a few matching content words so that the overlap threshold is reached, a non-parallel sentence pair may be misjudged as parallel.
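The candidate enumeration in step 6 amounts to pairing every contiguous word span of length 2 to 7 in S with every such span in T. A minimal sketch follows; the function names and the tuple representation of a phrase are assumptions made for the example.

```python
from typing import Iterator, List, Tuple

Phrase = Tuple[str, ...]


def spans(sentence: List[str], min_len: int = 2, max_len: int = 7) -> Iterator[Phrase]:
    """Yield every contiguous word span whose length is in [min_len, max_len]."""
    n = len(sentence)
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n - length + 1):
            yield tuple(sentence[start:start + length])


def candidate_pairs(S: List[str], T: List[str],
                    min_len: int = 2, max_len: int = 7
                    ) -> Iterator[Tuple[Phrase, Phrase]]:
    """Yield all candidate phrase pairs <s, t> from a pseudo-parallel pair <S, T>."""
    tgt_spans = list(spans(T, min_len, max_len))  # reused for every source span
    for s in spans(S, min_len, max_len):
        for t in tgt_spans:
            yield s, t
```

The number of candidates grows with the product of the two sentence lengths, which is one reason the sentence-level filter is applied before this enumeration.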

Step 8: Filter the noisy parallel phrase pairs obtained from the comparable corpus. A threshold θ is set according to experience and the actual situation, here θ = 0.3, and every noisy parallel phrase pair whose average log word-translation probability is below θ is removed, yielding the parallel phrase pairs of the comparable corpus;

The average of the logarithms of the word translation probabilities in each noisy parallel phrase pair from the comparable corpus is computed as follows:

$\mathrm{phrasepri} = \frac{\ln S_1 + \ln S_2 + \cdots + \ln S_n}{n}$

Here, S_i is the translation probability of the i-th word of the source-language phrase that has a translation in the target-language phrase, and n is the number of words of the source-language phrase that have a translation in the target-language phrase. The translation probabilities of stop words are excluded, because stop words contribute very little to the translation and are therefore ignored.

Step 9: Add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, i.e., a phrase-based statistical machine translation system, to obtain the extended decoder. The baseline decoder is evaluated by its baseline BLEU score and the extended decoder by its extended BLEU score; both scores are shown in Table 3:

Table 3

From these results it can be concluded that the parallel phrase pairs extracted with the binary classification method of the invention are of high quality. Adding the parallel phrase pairs extracted from the comparable corpus to the translation system improves its performance and brings its output closer to human translation, and the translation results continue to improve as the number of parallel phrase pairs increases. The experimental results show that the invention classifies parallel and non-parallel phrase pairs well, and that when the parallel phrase pairs extracted from the comparable corpus by the method of the invention are added to the translation system, the meaning expressed by the translation output is closer to that of human translation. This completes the method for extracting parallel phrase pairs from document-level comparable corpora, trained on parallel corpora.

The invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art can make corresponding changes and modifications based on the invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.