CN112906371B - Parallel corpus acquisition method, device, equipment and storage medium - Google Patents

Parallel corpus acquisition method, device, equipment and storage medium

Info

Publication number
CN112906371B
Authority
CN
China
Prior art keywords
sentence
similarity value
semantic similarity
element position
current element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110181644.5A
Other languages
Chinese (zh)
Other versions
CN112906371A (en)
Inventor
张闯
吴培昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110181644.5A
Publication of CN112906371A
Application granted
Publication of CN112906371B
Legal status: Active
Anticipated expiration

Abstract

Embodiments of the present disclosure disclose a parallel corpus acquisition method, device, equipment and storage medium. The method includes: splitting a pre-obtained first text and second text to obtain a first sentence list and a second sentence list, where the first text and the second text are in the same language and describe the same content; determining the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determining, according to the similarity value matrix, a mapping relationship between the first sentences and the second sentences, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and obtaining, according to the mapping relationship, the target second sentence associated with a first sentence, and recording the first sentence and the target second sentence as parallel corpus. Because the scheme determines the mapping relationship between sentences from their semantic similarity values, it improves the accuracy of the associated sentence pairs and, in turn, the accuracy of the parallel corpus.

Description

A parallel corpus acquisition method, device, equipment and storage medium

Technical field

Embodiments of the present disclosure relate to natural language processing technology, and in particular to a parallel corpus acquisition method, device, equipment and storage medium.

Background

Text simplification means rewriting text that contains difficult words and complex sentence patterns so that it becomes less difficult and is easier to read and understand for people with limited knowledge or cognitive impairments. With the development of deep learning, end-to-end neural network models are increasingly applied to text simplification. Training such models usually requires a large parallel corpus that pairs complex sentences with simple sentences.

Traditional methods of obtaining a parallel corpus mainly include distance-based methods, methods that compute sentence similarity from TF-IDF vectors, and methods based on word2vec vectors, but none of them obtains a parallel corpus accurately.

Summary

Embodiments of the present disclosure provide a parallel corpus acquisition method, device, equipment and storage medium that can improve the accuracy of the parallel corpus.

In a first aspect, embodiments of the present disclosure provide a parallel corpus acquisition method, including:

splitting a pre-obtained first text and second text to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content;

determining the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;

determining a mapping relationship between the first sentences and the second sentences according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;

obtaining, according to the mapping relationship, the target second sentence associated with the first sentence, and recording the first sentence and the target second sentence as parallel corpus.

In a second aspect, embodiments of the present disclosure also provide a parallel corpus acquisition device, including:

a splitting module, configured to split a pre-obtained first text and second text to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content;

a similarity value matrix determination module, configured to determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;

a mapping relationship determination module, configured to determine a mapping relationship between the first sentences and the second sentences according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;

a parallel corpus acquisition module, configured to obtain, according to the mapping relationship, the target second sentence associated with the first sentence, and to record the first sentence and the target second sentence as parallel corpus.

In a third aspect, embodiments of the present disclosure also provide an electronic device, including:

one or more processors;

a memory, configured to store one or more programs;

where the parallel corpus acquisition method described in the first aspect is implemented when the one or more programs are executed by the one or more processors.

In a fourth aspect, embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the parallel corpus acquisition method described in the first aspect is implemented.

Embodiments of the present disclosure provide a parallel corpus acquisition method, device, equipment and storage medium. A pre-obtained first text and second text are split to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content. The semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list is determined to obtain a similarity value matrix. A mapping relationship between the first sentences and the second sentences is determined according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2. The target second sentence associated with the first sentence is obtained according to the mapping relationship, and the first sentence and the target second sentence are recorded as parallel corpus. Because the scheme determines the mapping relationship between sentences from their semantic similarity values, it improves the accuracy of the associated sentence pairs and, in turn, the accuracy of the parallel corpus.

Description of drawings

The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

Figure 1 is a flow chart of a parallel corpus acquisition method provided by Embodiment 1 of the present disclosure;

Figure 2 is a flow chart of a parallel corpus acquisition method provided by Embodiment 2 of the present disclosure;

Figure 3 is a flow chart of a parallel corpus acquisition method provided by Embodiment 3 of the present disclosure;

Figure 4 is a structural diagram of a parallel corpus acquisition device provided by Embodiment 4 of the present disclosure;

Figure 5 is a structural diagram of an electronic device provided by Embodiment 5 of the present disclosure.

Detailed description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the disclosure.

It should be understood that the steps described in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "include" and its variations are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.

Note that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different objects and are not used to limit the order or interdependence of the functions performed by these objects.

Note that the modifiers "one" and "a plurality of" mentioned in this disclosure are illustrative rather than restrictive. Those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.

Embodiment 1

Figure 1 is a flow chart of a parallel corpus acquisition method provided by Embodiment 1 of the present disclosure. This embodiment is applicable to obtaining a parallel corpus. A parallel corpus consists of sentences that are related to each other in some way, for example sentences with a high degree of similarity. The method can be executed by a parallel corpus acquisition device, which can be implemented in software and/or hardware and can be configured in an electronic device with data processing capability. As shown in Figure 1, the method may include the following steps:

S110. Split the pre-obtained first text and second text to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.

Here, the first text and the second text are in the same language and describe the same content. The first text may contain hard-to-understand vocabulary and complex sentence patterns; it can also be called the complex text, and it is the more difficult of the two. The second text may contain simple vocabulary and simple sentence patterns; it can also be called the simple text, and it is easier to understand for foreign language learners and for people with limited knowledge or cognitive impairments. The first text and the second text describe the same content, for example the same object or the same event, and are written in the same language. The embodiment does not limit the specific language, which may be, for example, Chinese, English or Japanese. Optionally, the first text and the second text describing the same content can be obtained from a graded reading website, which stores texts of different difficulty levels, or from local storage.

The first sentence list stores the sentences obtained by splitting the first text, and the second sentence list stores the sentences obtained by splitting the second text. Optionally, the first text and the second text can each be split with the sentence segmentation function in NLTK (Natural Language Toolkit). Other splitting methods can also be used; the embodiment does not limit this. To distinguish the resulting sentences, each sentence can optionally be numbered according to its order in the corresponding text, so that a smaller number indicates an earlier position in the text. The length of the first sentence list equals the number of sentences in the first text, and the length of the second sentence list equals the number of sentences in the second text. The two texts may contain the same or different numbers of sentences.
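The splitting and numbering in S110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text names NLTK's sentence segmentation function as one option, while the sketch below substitutes a simple standard-library regular expression so that it runs without extra downloads. The function name `split_sentences` and the punctuation rule are assumptions made for this example, and the rule will mis-split abbreviations such as "e.g. this".

```python
from __future__ import annotations

import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace,
# or directly after a CJK full stop, exclamation mark or question mark.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。!?])")


def split_sentences(text: str) -> list[tuple[int, str]]:
    """Split text into sentences and number them by order of appearance.

    Returns (number, sentence) pairs; a smaller number means an earlier
    position in the text, as described for the sentence lists.
    """
    parts = _BOUNDARY.split(text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    return list(enumerate(sentences, start=1))


# Hypothetical complex and simple texts describing the same content.
first_list = split_sentences("The feline reposed upon the rug. It subsequently slumbered.")
second_list = split_sentences("The cat sat on the rug. Then it slept.")
```

The two lists may have different lengths in general; nothing in the sketch assumes they match.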

S120. Determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix.

A first sentence is a sentence obtained by splitting the first text, and a second sentence is a sentence obtained by splitting the second text. The semantic similarity value represents the degree of semantic similarity between two sentences, here between a first sentence and a second sentence. Optionally, the degree of semantic similarity can be expressed as a value between 0 and 5, where a smaller value means lower similarity. For example, 0 means the lowest degree of semantic similarity, so the two sentences can be regarded as having completely different meanings, and 5 means the highest degree, so the two sentences can be regarded as having exactly the same meaning. Because this embodiment determines the semantic similarity value between sentences, sentences that carry the same semantic information but differ greatly in vocabulary can be associated when the parallel corpus is later obtained, which improves the accuracy of the parallel corpus. Optionally, the first sentence and the second sentence can be input into a neural network model, which outputs the semantic similarity value between them. The embodiment does not limit the specific structure of the neural network model; for example, a Deep Structured Semantic Model (DSSM) or a Text-to-Text Transfer Transformer (T5 model for short) may be used. Other methods can also be used to determine the semantic similarity value between the first sentence and the second sentence; the embodiment does not limit this.

The similarity value matrix stores the semantic similarity values between the first sentences and the second sentences. Optionally, the values between each first sentence and all second sentences can be stored row by row. That is, each row of the similarity value matrix represents one first sentence and each column represents one second sentence, so the number of rows equals the number of first sentences in the first sentence list and the number of columns equals the number of second sentences in the second sentence list. For example, if the similarity value matrix is written as $T = (t_{xy})$, $x = 1, 2, \ldots, m$, $y = 1, 2, \ldots, n$, where $m$ is the number of first sentences and $n$ is the number of second sentences, then $t_{23}$ is the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list.
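The row-per-first-sentence layout of the similarity value matrix can be sketched as below. The scoring function here is only a placeholder: the patent suggests a neural model such as DSSM or T5, whereas `toy_similarity` is a hypothetical word-overlap score scaled to the 0 to 5 range so that the example is runnable on its own. Any callable that returns a value in that range could be passed in instead.

```python
def toy_similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 5]: Jaccard word overlap scaled by 5.

    Assumption for illustration only; it stands in for the neural
    semantic similarity model described in the text.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return 5.0 * len(words_a & words_b) / len(words_a | words_b)


def similarity_matrix(first_sentences, second_sentences, score=toy_similarity):
    """Build T with m rows (first sentences) and n columns (second sentences)."""
    return [[score(x, y) for y in second_sentences] for x in first_sentences]


first_sentences = ["a b", "c d"]
second_sentences = ["a b", "c e", "x"]
T = similarity_matrix(first_sentences, second_sentences)

# With 0-based Python indexing, the text's t_23 is T[1][2].
t_23 = T[1][2]
```

A nested list keeps the sketch dependency-free; an array library would be a natural choice for large texts.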

S130. Determine the mapping relationship between the first sentences and the second sentences according to the similarity value matrix.

Here, the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, where N is an integer greater than or equal to 2. One-to-N means that one first sentence is associated with multiple second sentences; N-to-one means that multiple first sentences are associated with one second sentence; one-to-one means that one first sentence is associated with one second sentence. Optionally, the mapping relationship can be determined from the semantic similarity value. For example, when the semantic similarity value equals a set threshold, the first sentence and the second sentence corresponding to that value are considered to be mapped one-to-one. The set threshold represents the highest degree of semantic similarity between a first sentence and a second sentence and may be, for example, 5. That is, when the semantic similarity value between a first sentence and a second sentence is 5, their mapping relationship is considered one-to-one.

When the semantic similarity value is less than the set threshold, in one example, the mapping relationship between the first sentence and the second sentence can be determined by also considering the semantic similarity values between that first sentence and the other second sentences, and between the other first sentences and that second sentence. For example, if the semantic similarity values between a first sentence and several different second sentences differ by no more than a preset difference, the first sentence is considered to be associated with those second sentences. The preset difference can be set according to the actual situation, for example to 0.1. As an illustration, suppose the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list is 2, the value between the second first sentence and the fourth second sentence is 2.1, the value between the second first sentence and the second second sentence is 0.5, and the values between the third second sentence and all other first sentences are less than 1. Then the mapping relationship between the second first sentence and the third second sentence is considered one-to-two; that is, the second first sentence in the first sentence list is associated with the third and fourth second sentences in the second sentence list.
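The one-to-two example above can be traced with a short sketch. The helper names, the comparison against the row's best score, and the matrix values not stated in the text (such as the scores in rows 1 and 3) are assumptions chosen for illustration; the patent only gives the values 2, 2.1 and 0.5, the preset difference 0.1, and the "less than 1" condition.

```python
def near_best_columns(row, max_gap=0.1):
    """Return 1-based column numbers whose score is within max_gap of the row's best.

    A small tolerance absorbs floating-point error (2.1 - 2.0 is not exactly 0.1).
    """
    best = max(row)
    return [j for j, value in enumerate(row, start=1) if best - value <= max_gap + 1e-9]


def others_below(matrix, row, col, limit=1.0):
    """Check that every other first sentence scores below limit against column col.

    row and col are 1-based, matching the numbering used in the text.
    """
    return all(r[col - 1] < limit for i, r in enumerate(matrix, start=1) if i != row)


# Hypothetical 3 x 4 similarity value matrix; row 2 carries the values from the example.
T = [
    [0.2, 0.4, 0.3, 0.1],
    [0.3, 0.5, 2.0, 2.1],
    [0.1, 0.2, 0.6, 0.4],
]

candidates = near_best_columns(T[1])      # second sentences associated with first sentence 2
exclusive = others_below(T, row=2, col=3)  # no other first sentence competes for second sentence 3
```

Under these assumed values, first sentence 2 is associated with second sentences 3 and 4, which is the one-to-two mapping described in the example.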

When the semantic similarity value is less than the set threshold, in another example, several first sentences or several second sentences can be merged, and the mapping relationship between the first sentence and the second sentence can be judged from the semantic similarity values computed on the merged sentences. For example, when the semantic similarity value between the third first sentence in the first sentence list and the first second sentence in the second sentence list is less than 5, the first and second second sentences can be merged, and the third and fourth first sentences can be merged. Three values are then compared: (a) the value between the third first sentence and the merged second sentence, (b) the value between the merged first sentence and the first second sentence, and (c) the value between the fourth first sentence and the second second sentence. If (a) > (b) > (c), the mapping relationship between the third first sentence and the first second sentence is considered one-to-N. If (b) > (a) > (c), the mapping relationship between the third first sentence and the first second sentence is considered N-to-one. Other methods can also be used to determine the mapping relationship between the first sentence and the second sentence; the embodiment does not limit this.
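The merge-and-compare rule can be sketched as a small function. This is an illustration under assumptions: merging is modelled as joining adjacent sentences with a space, the function name `classify_by_merging` and the `"undecided"` fallback are hypothetical, and `score` is any callable returning a semantic similarity value (the patent leaves the scoring model open).

```python
def classify_by_merging(first, second, i, j, score):
    """Compare merged-sentence scores for first[i] and second[j] (0-based indices).

    (a) first[i]                vs second[j] merged with second[j + 1]
    (b) first[i] merged with first[i + 1] vs second[j]
    (c) first[i + 1]            vs second[j + 1]
    """
    a = score(first[i], second[j] + " " + second[j + 1])
    b = score(first[i] + " " + first[i + 1], second[j])
    c = score(first[i + 1], second[j + 1])
    if a > b > c:
        return "one-to-N"
    if b > a > c:
        return "N-to-one"
    return "undecided"  # assumption: the text does not say what happens otherwise


# Stub scores standing in for a real similarity model.
table = {("A", "x y"): 4.0, ("A B", "x"): 3.0, ("B", "y"): 1.0}
result = classify_by_merging(["A", "B"], ["x", "y"], 0, 0, lambda s, t: table[(s, t)])
```

In the text's example, `i` would point at the third first sentence and `j` at the first second sentence.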

Note that, besides the one-to-one, one-to-N and N-to-one cases described above, the mapping relationship between first sentences and second sentences may also be N-to-N. An N-to-N mapping can be obtained by analyzing the one-to-N or N-to-one mappings between sentences. For example, if the second first sentence is associated with the third and fourth second sentences, and the third first sentence is also associated with the third and fourth second sentences, then the second and third sentences of the first text are considered to be associated with the third and fourth sentences of the second text, that is, a 2-to-2 mapping. The Nth first sentence is the Nth sentence of the first text and, similarly, the Nth second sentence is the Nth sentence of the second text.
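One way to derive N-to-N groups from one-to-N mappings is to collect the first sentences that share the same set of second sentences. The sketch below does exactly that and nothing more; the function name and the dictionary input format are assumptions, and the patent does not prescribe this particular grouping procedure.

```python
def group_many_to_many(one_to_n):
    """Group first-sentence numbers that map to an identical set of second-sentence numbers.

    one_to_n: dict mapping a first-sentence number to an iterable of
    second-sentence numbers. Returns a list of
    (first_numbers, second_numbers) pairs, both sorted.
    """
    groups = {}
    for first_number, second_numbers in one_to_n.items():
        groups.setdefault(frozenset(second_numbers), []).append(first_number)
    return [(sorted(firsts), sorted(seconds)) for seconds, firsts in groups.items()]


# The example from the text: first sentences 2 and 3 both map to second sentences 3 and 4.
merged = group_many_to_many({2: [3, 4], 3: [3, 4]})
```

Grouping by identical target sets is the simplest case; partially overlapping sets would need an additional rule that the text does not specify.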

S140. Obtain, according to the mapping relationship, the target second sentence associated with the first sentence, and record the first sentence and the target second sentence as parallel corpus.

The target second sentence is the sentence associated with the first sentence. It may be a single second sentence or a sentence obtained by merging multiple second sentences. Specifically:

If the mapping relationship is one-to-one, the corresponding first sentence and second sentence are associated as one group of parallel corpus, and that second sentence is the target second sentence.

If the mapping relationship is one-to-N, the N second sentences are merged and the first sentence is associated with the merged sentence as one group of parallel corpus; the target second sentence is the merged result of the N second sentences.

If the mapping relationship is N-to-one, the N first sentences are merged and the merged first sentence is associated with the second sentence as one group of parallel corpus; the target second sentence is the single second sentence.

If multiple first sentences are associated with multiple second sentences, the first sentences are merged, the second sentences are merged, and the two merged sentences are associated as one group of parallel corpus; the target second sentence is the merged result of the multiple second sentences.
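The four cases in S140 reduce to one operation once each mapping is written as a pair of index lists: merge the sentences on each side and pair the results. The sketch below assumes that representation, assumes 1-based sentence numbers, and models merging as joining with a space; the function name `build_parallel_corpus` and the sample sentences are hypothetical.

```python
def build_parallel_corpus(first, second, mappings):
    """Turn mappings into (first sentence, target second sentence) pairs.

    mappings: list of (first_numbers, second_numbers) with 1-based numbers.
    One-to-one, one-to-N, N-to-one and N-to-N differ only in list lengths.
    """
    pairs = []
    for first_numbers, second_numbers in mappings:
        source = " ".join(first[i - 1] for i in first_numbers)
        target = " ".join(second[j - 1] for j in second_numbers)
        pairs.append((source, target))
    return pairs


first = ["F1.", "F2.", "F3."]
second = ["S1.", "S2.", "S3."]
corpus = build_parallel_corpus(
    first,
    second,
    [([1], [1]), ([2], [2, 3]), ([2, 3], [3])],  # one-to-one, one-to-two, two-to-one
)
```

For languages written without spaces between sentences, such as Chinese or Japanese, joining with an empty string would be the more natural choice.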

This embodiment determines the semantic similarity values between sentences from their semantic information and determines the mapping relationship between sentences from those values, which improves the accuracy of the associated sentence pairs and, in turn, the accuracy of the parallel corpus. When the parallel corpus is later used to train a text simplification model, the accuracy of the model can be improved, and when the trained model is used to convert complex text into simple text, the accuracy of the conversion result can be improved.

Embodiment 1 of the present disclosure provides a parallel corpus acquisition method. A pre-obtained first text and second text are split to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content. The semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list is determined to obtain a similarity value matrix. A mapping relationship between the first sentences and the second sentences is determined according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2. The target second sentence associated with the first sentence is obtained according to the mapping relationship, and the first sentence and the target second sentence are recorded as parallel corpus. Because the scheme determines the mapping relationship between sentences from their semantic similarity values, it improves the accuracy of the associated sentence pairs and, in turn, the accuracy of the parallel corpus.

Embodiment 2

Figure 2 is a flowchart of a parallel corpus acquisition method provided in Embodiment 2 of the present disclosure. This embodiment is an optimization based on the above embodiment. Referring to Figure 2, the method may include the following steps:

S210. Split the first text and the second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.

S220. Input a first sentence from the first sentence list and a second sentence from the second sentence list into a semantic similarity value model, and have the semantic similarity value model output the semantic similarity value between the first sentence and the second sentence.

The semantic similarity value model is obtained by training on sentence pairs with different semantic similarity values. It is used later to determine the semantic similarity value between any two sentences; this embodiment takes the T5 model as an example of the semantic similarity value model. Before it is applied, the T5 model can be trained. Optionally, sentence pairs with different semantic similarity values can be obtained from the public dataset STS-B as training samples. The public dataset STS-B stores sentence pairs with different semantic similarity values. A semantic similarity value can be expressed as a number between 0 and 5, where 0 can indicate that the two sentences are semantically completely different; 1 can indicate that the two sentences are semantically different but describe the same topic; 2 can indicate that the two sentences are semantically different but share a small part of their information; 3 can indicate that the two sentences are basically semantically consistent, but some important information is inconsistent or missing; 4 can indicate that the two sentences are semantically very similar, but some unimportant information is inconsistent; and 5 can indicate that the two sentences are semantically identical.
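As a rough illustration of how a sentence pair might be presented to a T5-style model for STS-B scoring, the sketch below formats the input text and parses the textual score the model returns. The `stsb sentence1: ... sentence2: ...` prefix follows the convention used by the publicly released T5 checkpoints and is an assumption here, as are the helper names; the description itself does not state the input format, and no model is called.

```python
def format_stsb_input(first_sentence, second_sentence):
    """Build the text-to-text input for one sentence pair.

    The "stsb sentence1: ... sentence2: ..." layout is assumed from the
    public T5 convention; the patent text does not specify it.
    """
    return f"stsb sentence1: {first_sentence} sentence2: {second_sentence}"


def parse_score(model_output, low=0.0, high=5.0):
    """Convert the model's textual output to a float clamped to [low, high]."""
    value = float(model_output.strip())
    return min(max(value, low), high)


prompt = format_stsb_input("The rhino grazed on the grass.",
                           "A rhino is grazing in a field.")
score = parse_score(" 3.8 ")  # the string stands in for a model output
```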

This embodiment uses sentence pairs with different semantic similarity values as training samples, inputs them into the semantic similarity value model, and trains the model so that the trained model can determine the semantic similarity value between any two sentences. Optionally, for each first sentence, that first sentence and one second sentence from the second sentence list can be input into the trained model, and the semantic similarity values between that first sentence and each second sentence determined one after another. Alternatively, the first sentence and all of the second sentences can be input into the trained model to determine the semantic similarity values between the first sentence and every second sentence at the same time. Alternatively, for each second sentence, that second sentence and all of the first sentences can be input into the trained model to determine the semantic similarity values between that second sentence and every first sentence at the same time. Alternatively, all of the first sentences and all of the second sentences can be input into the trained model to determine the semantic similarity value between every first sentence and every second sentence at the same time, which can improve efficiency. Because this embodiment determines the semantic similarity values between sentences, sentences that carry the same semantic information but differ widely in vocabulary can be accurately associated when the parallel corpus is later determined, improving the accuracy of the parallel corpus. In addition, complex sentences with large syntactic changes and many deleted words can also be accurately associated with simple sentences.

S230. Arrange the semantic similarity values corresponding to each first sentence in order to obtain a similarity value matrix.

The number of rows of the similarity value matrix equals the number of first sentences in the first sentence list, and the number of columns equals the number of second sentences in the second sentence list. For example, the similarity value matrix may be denoted T, where T_{nm} denotes the semantic similarity values between the m first sentences and the n second sentences, and m and n are the lengths of the first sentence list and the second sentence list, respectively. For example, T_{n1} denotes the semantic similarity values between the first sentence of the first text and each sentence of the second text, and T_{1m} denotes the semantic similarity values between the first sentence of the second text and each sentence of the first text.
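The construction of the matrix can be sketched as below, with one row per first sentence and one column per second sentence. The scoring function is passed in as a parameter; `toy_similarity` is a placeholder based on word overlap and is not the trained semantic similarity value model. Both function names are hypothetical.

```python
def similarity_matrix(first_list, second_list, similarity):
    """Rows follow the first sentence list, columns the second sentence list."""
    return [[similarity(f, s) for s in second_list] for f in first_list]


def toy_similarity(a, b):
    """Placeholder score on a 0-5 scale: word-set overlap (Jaccard) times 5.

    This stands in for the semantic similarity value model only so that
    the sketch runs; it does not measure semantics.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 5.0
    return 5.0 * len(words_a & words_b) / len(words_a | words_b)


T = similarity_matrix(["a b", "c"], ["a b", "c d", "e"], toy_similarity)
# T has 2 rows (first sentences) and 3 columns (second sentences).
```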

S240. Determine the mapping relationship between the first sentence and the second sentence according to the similarity value matrix.

S250. Obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.

S260. Input the parallel corpus into a text simplification model and train the text simplification model to obtain a target text simplification model.

The target text simplification model is used to convert complex text into simple text. After the parallel corpus has been determined, the first sentence can be input into the text simplification model, which outputs a predicted sentence, and the parameters of the text simplification model are adjusted according to the deviation between the predicted sentence and the second sentence until that deviation satisfies a set condition, yielding the target text simplification model. Complex text can then be input into the target text simplification model, which outputs simple text, realizing the conversion from complex text to simple text.

Embodiment 2 of the present disclosure provides a parallel corpus acquisition method. On the basis of the above embodiment, a semantic similarity value model is trained on sentence pairs with different semantic similarity values, the trained model is used to determine the semantic similarity values between sentences to obtain a similarity value matrix, and the mapping relationship between sentences is then determined according to the similarity value matrix to obtain associated sentence pairs, which improves the accuracy of the associated sentence pairs.

Embodiment 3

Figure 3 is a flowchart of a parallel corpus acquisition method provided in Embodiment 3 of the present disclosure. This embodiment is an optimization based on the above embodiments. Referring to Figure 3, the method may include the following steps:

S310. Split the first text and the second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.

S320. Determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix.

S330. Record the position of the first element of the similarity value matrix as the current element position.

Assuming the similarity value matrix has m rows and n columns, the first element can be the element in the first row and first column, the second element can be the element in the first row and second column, the (n+1)-th element can be the element in the second row and first column, and so on. This embodiment performs a similar process for every element of the similarity value matrix; here the first element is taken as the current element and its position as the current element position in order to describe that process. Each element position corresponds to one semantic similarity value, namely the semantic similarity value between the first sentence and the second sentence corresponding to that element position. For example, the semantic similarity value at the element position in the second row and third column is the semantic similarity value between the second sentence of the first text and the third sentence of the second text, which can also be described as the semantic similarity value between the second first sentence of the first text and the third second sentence of the second text.
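The element ordering described above is row-major. A small sketch of the conversion from an element's ordinal number to its row and column, using 1-based numbering as in the text, might look like this (the function name is hypothetical):

```python
def element_position(k, n_cols):
    """Map the 1-based element number k to a 1-based (row, column) pair
    for a matrix with n_cols columns, counting row by row."""
    if k < 1 or n_cols < 1:
        raise ValueError("k and n_cols must be positive")
    return (k - 1) // n_cols + 1, (k - 1) % n_cols + 1


# With n = 4 columns, element n + 1 = 5 is the first element of row 2.
row, col = element_position(5, 4)
```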

S340. Determine whether the semantic similarity value corresponding to the current element position is equal to a first preset value; if so, execute S350; otherwise, execute S360.

The first preset value indicates that the first sentence and the second sentence corresponding to the current element position have the highest degree of semantic similarity. The first preset value depends on the semantic similarity values of the training samples used to train the semantic similarity value model. For example, if the semantic similarity values of the training samples lie between 0 and 5, the first preset value can be 5, indicating the highest degree of semantic similarity between two sentences; that is, the semantic similarity value corresponding to the current element position is either equal to 5 or less than 5.

S350. Determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one.

Specifically, if the semantic similarity value between the first sentence and the second sentence corresponding to the current element position is 5, the two sentences are considered semantically identical, and the mapping relationship between them can be determined to be one-to-one. S380 is then executed, the next element position is taken as the current element position, and execution returns to S340 to continue determining the mapping relationship between the first sentence and the second sentence corresponding to the current element position.

S360. Merge the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence, and merge the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence.

Because parallel corpus consists of two sentences with a high degree of semantic similarity, when the semantic similarity value corresponding to the current element position is less than 5 and greater than a set threshold, it can be further determined whether the relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N or N-to-one. For example, a certain number of sentences can be merged, and the semantic similarity value with the merged sentence used to determine whether the relationship is one-to-N or N-to-one. The set threshold can be chosen according to the actual situation. For example, a semantic similarity value of 3 indicates that the two sentences are basically semantically the same, so the set threshold can be set to 3 or to a value near 3.

Optionally, the second sentence corresponding to the current element position and the second sentence corresponding to the next element position can be merged to obtain the first merged sentence S(t_y:t_{y+1}), where t_y denotes the second sentence corresponding to the current element position, t_{y+1} denotes the second sentence corresponding to the next element position, and y=1,2,...,n-1. Similarly, the first sentence corresponding to the current element position and the first sentence corresponding to the next element position can be merged to obtain the second merged sentence C(t_x:t_{x+1}), where t_x denotes the first sentence corresponding to the current element position, t_{x+1} denotes the first sentence corresponding to the next element position, and x=1,2,...,m-1.

S370. Determine the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position.

Optionally, the semantic similarity value sim1 between the first sentence t_x corresponding to the current element position and the first merged sentence S(t_y:t_{y+1}) can be determined, together with the semantic similarity value sim2 between the second merged sentence C(t_x:t_{x+1}) and the second sentence t_y corresponding to the current element position; the mapping relationship between the first sentence t_x and the second sentence t_y is then determined according to sim1, sim2, and the semantic similarity value corresponding to the next element position. For ease of description, sim1 is called the first semantic similarity value, sim2 the second semantic similarity value, the semantic similarity value corresponding to the current element position the third semantic similarity value, and the semantic similarity value corresponding to the next element position the fourth semantic similarity value.

In one example, the mapping relationship between the first sentence and the second sentence can be determined as follows:

if the first semantic similarity value is less than or equal to the third semantic similarity value, or the first semantic similarity value is less than or equal to the fourth semantic similarity value, reduce the first semantic similarity value; otherwise, keep the first semantic similarity value unchanged;

if the second semantic similarity value is less than or equal to the third semantic similarity value, or the second semantic similarity value is less than or equal to the fourth semantic similarity value, reduce the second semantic similarity value; otherwise, keep the second semantic similarity value unchanged;

determine the maximum of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;

if the maximum is the fourth semantic similarity value, determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum is the first semantic similarity value, determine that the mapping relationship is one-to-N; if the maximum is the second semantic similarity value, determine that the mapping relationship is N-to-one.

Specifically, if the first semantic similarity value sim1 is less than or equal to the third semantic similarity value, or sim1 is less than or equal to the fourth semantic similarity value, sim1 is reduced; otherwise sim1 is kept unchanged. The second semantic similarity value sim2 is then compared with the third and fourth semantic similarity values: if sim2 is less than or equal to the third semantic similarity value, or sim2 is less than or equal to the fourth semantic similarity value, sim2 is reduced; otherwise sim2 is kept unchanged. The order can also be reversed, comparing sim2 with the third and fourth semantic similarity values first and then sim1; the process is similar. On this basis, sim1, sim2, and the fourth semantic similarity value are compared, where sim1 and sim2 are the values after the above adjustment. If sim1 is the largest, the mapping relationship between the first sentence and the second sentence at the current element position is considered to be one-to-N; if sim2 is the largest, the mapping relationship is considered to be N-to-one; if the fourth semantic similarity value is the largest, the mapping relationship is considered to be one-to-one.

When sim1 or sim2 needs to be reduced in the above process, this embodiment does not limit the amount of the reduction, as long as sim1, sim2, and the fourth semantic similarity value remain distinguishable when they are subsequently compared, so that the mapping relationship between the first sentence and the second sentence can be determined accurately. For example, when sim1 needs to be reduced it can be set directly to 0, and when sim2 needs to be reduced it can be set directly to 0.
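The decision rule described above can be sketched as a single function. Setting a reduced value to 0 follows the example given in the text. The order in which ties are resolved (the fourth value first, then sim1) is an assumption, because the description does not say what happens when two values are equal; the function name and the returned labels are also hypothetical.

```python
def classify_mapping(sim1, sim2, sim3, sim4):
    """Classify the relation at the current element position.

    sim1: first sentence vs. merged second sentences
    sim2: merged first sentences vs. second sentence
    sim3: value at the current element position
    sim4: value at the next element position
    """
    # Reduce sim1 / sim2 (here: set to 0) when merging did not help.
    if sim1 <= sim3 or sim1 <= sim4:
        sim1 = 0.0
    if sim2 <= sim3 or sim2 <= sim4:
        sim2 = 0.0

    best = max(sim1, sim2, sim4)
    # Tie-breaking order below is an assumption, not taken from the source.
    if best == sim4:
        return "one-to-one"
    if best == sim1:
        return "one-to-N"
    return "N-to-one"


label = classify_mapping(sim1=4.2, sim2=3.0, sim3=3.5, sim4=1.0)
```

In this call, merging the second sentences raises the score above both the current and the next element's values, so the position is classified as one-to-N.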

When the mapping relationship between the first sentence and the second sentence at the current element position is determined to be one-to-N or N-to-one, the value of N is further determined. For example, when the mapping relationship is determined to be one-to-N, the value of N can be determined as follows:

merge the second sentences from the second sentence corresponding to the current element position through the second sentence corresponding to a target element position to obtain a third merged sentence, where the column of the target element position is the sum of the column of the current element position and N-1, with N=3;

if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, determine N=2;

otherwise, let N=N+1 and repeat the above operations until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, and determine N=N-1.

One-to-N means that one first sentence corresponds to multiple second sentences, with N greater than or equal to 2. In this case N can first be incremented by 1, that is, it is checked whether N is 3. For example, three second sentences can be merged: the second sentence corresponding to the current element position, the second sentence corresponding to the next element position, and the second sentence corresponding to the element position after that are merged to obtain the third merged sentence. Note that the element positions of the three merged second sentences correspond to the same first sentence. The semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is then determined. If it is less than or equal to the semantic similarity value corresponding to the current element position, the search stops and N=2 is determined. If it is greater than the semantic similarity value corresponding to the current element position, N is incremented by 1 and the check continues until the semantic similarity value between the first sentence and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, at which point N=N-1.
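The search for N in the one-to-N case can be sketched as a loop that keeps widening the merged span of second sentences. The similarity function is passed in as a parameter; merging by joining with a space and stopping when the second sentence list runs out are assumptions that the description does not cover, and the function name is hypothetical.

```python
def find_n_one_to_many(first, seconds, start, current_sim, similarity, sep=" "):
    """Return N for a one-to-N mapping.

    first:       the first sentence at the current element position
    seconds:     all second sentences (the columns of the current row)
    start:       0-based column index of the current element position
    current_sim: semantic similarity value at the current element position
    similarity:  callable(first, merged_text) -> float (the model, assumed)
    """
    n = 3
    # Stop when the list has no further sentence to merge (assumed rule).
    while start + n <= len(seconds):
        merged = sep.join(seconds[start:start + n])
        if similarity(first, merged) <= current_sim:
            break          # merging n sentences no longer beats the baseline
        n += 1
    return n - 1


# Stand-in scores for illustration only: merging 3 helps, merging 4 does not.
scores = {"s1 s2 s3": 4.5, "s1 s2 s3 s4": 3.0}
n = find_n_one_to_many("f", ["s1", "s2", "s3", "s4", "s5"], 0, 4.0,
                       lambda a, b: scores.get(b, 0.0))
```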

On the basis of the similarity value matrix, this embodiment further determines the semantic similarity value between the first sentence and the sentence obtained by merging multiple second sentences, or between the sentence obtained by merging multiple first sentences and the second sentence, and determines the mapping relationship between the first sentence and the second sentence on that basis. This improves the accuracy of the associated sentence pairs, allows sentences with the same or similar semantics but large lexical differences to be accurately associated, and allows complex sentences with large syntactic changes and many deleted words to be associated with simple sentences, which increases the amount of parallel corpus. For example, with this embodiment, "You should also be careful when taking selfies" and "Think before you take a selfie", "It’s more comfortable high up in its clouds" and "But up in the clouds, it’s calmer", and "The surface of Venus has burning temperatures and crushing pressures" and "It is very hot and has strong pressures" can each be accurately associated.

When the mapping relationship between the first sentence and the second sentence corresponding to the current element position is determined to be N-to-one, the value of N can be determined as follows:

Merge the first sentences from the first sentence corresponding to the current element position through the first sentence corresponding to the target element position to obtain a fourth merged sentence, where the column of the target element position is the sum of the column of the current element position and N-1, with N=3 initially;

If the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determine N=2;

Otherwise, let N=N+1 and repeat the above operations until the semantic similarity value between the resulting fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and then determine N=N-1.

The process of determining N when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one is similar to the process of determining N when that mapping relationship is one-to-N, and is not repeated here.
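The search for N described above can be sketched as follows. This is only an illustration under stated assumptions: first sentences are indexed by row, merging is done by joining with a space, the search stops when the sentence list runs out (the text does not say what happens at the boundary), and `toy_similarity` is a hypothetical word-overlap stand-in for the trained semantic similarity value model.

```python
from typing import Callable, Sequence


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def determine_n(
    first: Sequence[str],
    second_sentence: str,
    row: int,
    base_similarity: float,
    similarity: Callable[[str, str], float] = toy_similarity,
) -> int:
    """Return N for an N-to-one mapping that starts at first[row].

    base_similarity is the similarity value stored at the current
    element position of the similarity value matrix.
    """
    n = 3  # the text starts the search at N=3
    # Assumption: stop growing the merge when no more first sentences exist.
    while row + n - 1 < len(first):
        merged = " ".join(first[row:row + n])  # the "fourth merged sentence"
        if similarity(merged, second_sentence) <= base_similarity:
            break  # merging N sentences no longer helps, so keep N-1
        n += 1
    return n - 1
```

The one-to-N case would follow the same pattern with the roles of the two sentence lists swapped.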

S380. Determine whether the current element position is the position of the last element in the similarity value matrix. If so, execute S3100; otherwise, execute S390 and then return to S340.

S390. Record the position of the next element as the current element position.

S3100. Obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.

After the element positions in the similarity value matrix have been traversed, the target second sentence associated with the first sentence is obtained according to the mapping relationship, and the first sentence and the target second sentence are recorded as parallel corpus.

S3110. Input the parallel corpus into a text simplification model and train the text simplification model to obtain a target text simplification model.

In one example, to make it easier to collect the parallel corpus, a sentence association matrix P can be initialized once the similarity value matrix is obtained, to record whether a first sentence and a second sentence are associated. Each element of P has an initial value of 0, and P has the same rows and columns as the similarity value matrix T. The value at an element position of P indicates whether the first sentence and the second sentence at the same element position of T are associated, so the parallel corpus can later be obtained directly from the values in P. For example, when the mapping relationship between the second first sentence of the first text and the third second sentence of the second text is determined to be one-to-one, the element in the second row and third column of P can be set to 1 at the same time, to indicate that the first sentence and the second sentence at that position are associated. As another example, when the second first sentence of the first text is determined to be associated with the second to fourth second sentences of the second text, the elements from the second row, second column through the second row, fourth column of P can be set to 1. As a further example, when the second to fifth first sentences of the first text are determined to be associated with the third second sentence of the second text, the elements at the second row, third column; the third row, third column; the fourth row, third column; and the fifth row, third column of P can be set to 1. After the traversal ends, the element positions whose value is 1 in P can be extracted, and the sentences corresponding to those positions combined into pairs of a complex sentence and a simple sentence to obtain the parallel corpus.
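A minimal sketch of the sentence association matrix P follows. The function names, the 0-based indexing, and the grouping rule used when reading pairs back out (several 1s in a row are treated as one-to-N, several 1s in a column as N-to-one) are assumptions made for illustration; the text only states that positions equal to 1 are extracted and combined.

```python
from typing import List, Sequence, Tuple


def new_association_matrix(n_first: int, n_second: int) -> List[List[int]]:
    """P has the same shape as the similarity matrix T; all entries start at 0."""
    return [[0] * n_second for _ in range(n_first)]


def mark(P: List[List[int]], rows: Sequence[int], cols: Sequence[int]) -> None:
    """Set P[r][c] = 1 for every associated (first, second) position."""
    for r in rows:
        for c in cols:
            P[r][c] = 1


def extract_pairs(
    P: List[List[int]], first: Sequence[str], second: Sequence[str]
) -> List[Tuple[str, str]]:
    """Read parallel pairs back out of P (assumed grouping rule, see lead-in)."""
    pairs: List[Tuple[str, str]] = []
    seen_cols = set()
    for i, row in enumerate(P):
        cols = [j for j, v in enumerate(row) if v == 1]
        if not cols:
            continue
        if len(cols) > 1:  # one-to-N: merge the second sentences
            pairs.append((first[i], " ".join(second[j] for j in cols)))
            continue
        j = cols[0]
        if j in seen_cols:  # this N-to-one group was already emitted
            continue
        seen_cols.add(j)
        rows = [r for r in range(len(P)) if P[r][j] == 1]
        pairs.append((" ".join(first[r] for r in rows), second[j]))
    return pairs
```

For instance, `mark(P, [1, 2, 3, 4], [2])` records the third example in the text (second to fifth first sentences associated with the third second sentence), using 0-based indices.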

Embodiment 3 of the present disclosure provides a parallel corpus acquisition method. On the basis of the above embodiments, the semantic similarity values between sentences are determined, and the mapping relationship between sentences is determined according to those values. This improves the accuracy of the associated sentence pairs. It also allows sentences with the same or similar semantics but large vocabulary differences to be associated accurately, and allows complex sentences with large syntactic changes and many deleted words to be associated with simple sentences, which increases the amount of parallel corpus.

Embodiment 4

FIG. 4 is a structural diagram of a parallel corpus acquisition device provided in Embodiment 4 of the present disclosure. The device can execute the parallel corpus acquisition method described in the above embodiments. As shown in FIG. 4, the device may include:

A splitting module 41, configured to split a first text and a second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content;

A similarity value matrix determination module 42, configured to determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;

A mapping relationship determination module 43, configured to determine the mapping relationship between the first sentence and the second sentence according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;

A parallel corpus acquisition module 44, configured to obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.

Embodiment 4 of the present disclosure provides a parallel corpus acquisition device. The device splits a first text and a second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content; determines the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determines the mapping relationship between the first sentence and the second sentence according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2; and obtains the target second sentence associated with the first sentence according to the mapping relationship and records the first sentence and the target second sentence as parallel corpus. Because the mapping relationship between sentences is determined from their semantic similarity values, the accuracy of the associated sentence pairs is improved, and with it the accuracy of the parallel corpus.

On the basis of the above embodiments, the similarity value matrix determination module 42 is specifically configured to:

Input a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, which outputs the semantic similarity value between the first sentence and the second sentence, where the semantic similarity value model is trained on sentence pairs with different semantic similarity values;

Arrange the semantic similarity values corresponding to each first sentence in order to obtain the similarity value matrix, where the number of rows of the similarity value matrix equals the number of first sentences in the first sentence list, and the number of columns equals the number of second sentences in the second sentence list.
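The construction of the similarity value matrix can be sketched as below. The trained semantic similarity value model is not specified here, so `toy_similarity` is a hypothetical word-overlap placeholder; any callable that scores a sentence pair could be passed in its place.

```python
from typing import Callable, List, Sequence


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical placeholder for the trained model: Jaccard word overlap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def build_similarity_matrix(
    first: Sequence[str],
    second: Sequence[str],
    similarity: Callable[[str, str], float] = toy_similarity,
) -> List[List[float]]:
    """Row i holds the scores of first[i] against every second sentence.

    The result has len(first) rows and len(second) columns.
    """
    return [[similarity(f, s) for s in second] for f in first]
```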

On the basis of the above embodiments, the mapping relationship determination module 43 is specifically configured to:

Record the position of the first element of the similarity value matrix as the current element position;

If the semantic similarity value corresponding to the current element position equals a first preset value, determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one, where the first preset value indicates that the first sentence and the second sentence corresponding to the current element position have the highest degree of semantic similarity;

Record the position of the next element as the current element position, and repeat the above operations.

On the basis of the above embodiments, the mapping relationship determination module 43 is specifically configured to:

Record the position of the first element of the similarity value matrix as the current element position;

If the semantic similarity value corresponding to the current element position is less than the first preset value, merge the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence, where the first preset value indicates that the first sentence and the second sentence corresponding to the current element position have the highest degree of semantic similarity; and merge the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;

Determine the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position;

Record the position of the next element as the current element position, and repeat the above operations.

On the basis of the above embodiments, the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence is recorded as the first semantic similarity value, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position is recorded as the second semantic similarity value, the semantic similarity value corresponding to the current element position is recorded as the third semantic similarity value, and the semantic similarity value corresponding to the next element position is recorded as the fourth semantic similarity value;

The mapping relationship determination module 43 is specifically configured to:

If the first semantic similarity value is less than or equal to the third semantic similarity value, or less than or equal to the fourth semantic similarity value, decrease the first semantic similarity value; otherwise, keep the first semantic similarity value unchanged;

If the second semantic similarity value is less than or equal to the third semantic similarity value, or less than or equal to the fourth semantic similarity value, decrease the second semantic similarity value; otherwise, keep the second semantic similarity value unchanged;

Determine the maximum of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;

If the maximum is the fourth semantic similarity value, determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum is the first semantic similarity value, determine that the mapping relationship is one-to-N; if the maximum is the second semantic similarity value, determine that the mapping relationship is N-to-one.
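The decision rule above can be sketched as a small function. Two details are assumptions, since this passage does not fix them: the amount by which a value is decreased (the hypothetical `penalty` parameter) and the tie-breaking order (here one-to-one wins ties, then one-to-N).

```python
def classify_mapping(
    s1: float, s2: float, s3: float, s4: float, penalty: float = 0.1
) -> str:
    """Classify the mapping at the current element position.

    s1: first sentence vs. merged second sentences (first similarity value)
    s2: merged first sentences vs. second sentence (second similarity value)
    s3: value at the current element position (third similarity value)
    s4: value at the next element position (fourth similarity value)
    penalty: assumed size of the "decrease" step; not given in the text.
    """
    # Merging must beat both un-merged values, otherwise it is penalized.
    if s1 <= s3 or s1 <= s4:
        s1 -= penalty
    if s2 <= s3 or s2 <= s4:
        s2 -= penalty

    best = max(s1, s2, s4)
    if best == s4:  # assumed tie-break: prefer one-to-one
        return "one-to-one"
    if best == s1:
        return "one-to-N"
    return "N-to-one"
```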

On the basis of the above embodiments, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, N is determined as follows:

Merge the second sentences from the second sentence corresponding to the current element position through the second sentence corresponding to the target element position to obtain a third merged sentence, where the column of the target element position is the sum of the column of the current element position and N-1, with N=3 initially;

If the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, determine N=2;

Otherwise, let N=N+1 and repeat the above operations until the semantic similarity value between the first sentence corresponding to the current element position and the resulting third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, and then determine N=N-1.

On the basis of the above embodiments, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, N is determined as follows:

Merge the first sentences from the first sentence corresponding to the current element position through the first sentence corresponding to the target element position to obtain a fourth merged sentence, where the column of the target element position is the sum of the column of the current element position and N-1, with N=3 initially;

If the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determine N=2;

Otherwise, let N=N+1 and repeat the above operations until the semantic similarity value between the resulting fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and then determine N=N-1.

On the basis of the above embodiments, the parallel corpus acquisition module 44 is specifically configured to:

If the mapping relationship is one-to-one, record the second sentence corresponding to the mapping relationship as the target second sentence;

If the mapping relationship is one-to-N, merge the second sentences corresponding to the mapping relationship to obtain the target second sentence;

If the mapping relationship is N-to-one, merge the first sentences corresponding to the mapping relationship, and record the second sentence corresponding to the mapping relationship as the target second sentence.
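The three cases above reduce to merging whichever side of the mapping contains more than one sentence. The sketch below assumes a hypothetical `Mapping` record holding 0-based sentence indices and merges by joining with a space; neither detail comes from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Mapping:
    """Hypothetical record of one mapping relationship (0-based indices)."""
    first_rows: List[int]   # one index for one-to-one / one-to-N, N for N-to-one
    second_cols: List[int]  # one index for one-to-one / N-to-one, N for one-to-N


def to_parallel_pair(
    m: Mapping, first: Sequence[str], second: Sequence[str]
) -> Tuple[str, str]:
    """Return (first-side sentence, target second sentence) for a mapping."""
    source = " ".join(first[i] for i in m.first_rows)    # merged only if N-to-one
    target = " ".join(second[j] for j in m.second_cols)  # merged only if one-to-N
    return source, target
```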

On the basis of the above embodiments, the device may further include:

A training module, configured to, after the first sentence and the target second sentence are recorded as parallel corpus, input the parallel corpus into a text simplification model and train the text simplification model to obtain a target text simplification model, where the target text simplification model is used to convert complex text into simple text.

The parallel corpus acquisition device provided by this embodiment of the present disclosure belongs to the same concept as the parallel corpus acquisition method provided by the above embodiments. For technical details not described in detail in this embodiment, refer to the above embodiments. This embodiment has the same beneficial effects as executing the parallel corpus acquisition method.

Embodiment 5

Referring now to FIG. 5, a schematic structural diagram of an electronic device 500 suitable for implementing embodiments of the present disclosure is shown. Electronic devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 5, the electronic device 500 may include a processing device (for example, a central processing unit or a graphics processor) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 507 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 508 including, for example, a magnetic tape and hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 5 shows the electronic device 500 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or present. More or fewer devices may alternatively be implemented or present.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above functions defined in the methods of the embodiments of the present disclosure are performed.

Embodiment 6

The computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, in the present disclosure, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.

In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.

The above computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: split a first text and a second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content; determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determine the mapping relationship between the first sentence and the second sentence according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2; and obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more logic functions that implement the specified executable instructions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or operations. , or can be implemented using a combination of specialized hardware and computer instructions.

描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,拆分模块还可以被描述为“拆分预先获取的第一文本和第二文本,得到所述第一文本对应的第一句子列表和所述第二文本对应的第二句子列表的模块”。The units involved in the embodiments of the present disclosure can be implemented in software or hardware. Among them, the name of the module does not constitute a limitation on the module itself under certain circumstances. For example, the split module can also be described as "splitting the first text and the second text obtained in advance to obtain the first text." module corresponding to the first sentence list and the second sentence list corresponding to the second text."

本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical device (CPLD) and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, the present disclosure provides a parallel corpus acquisition method, including:

Split a pre-acquired first text and second text to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content;

Determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;

Determine the mapping relationship between the first sentences and the second sentences according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;

Obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.
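The first step above, splitting each text into a sentence list, can be sketched as follows. This is a minimal illustration only: the source does not say how sentences are delimited, so the punctuation set and the name `split_sentences` are assumptions.

```python
import re
from typing import List

# Assumed sentence-ending punctuation for English and Chinese text.
_SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def split_sentences(text: str) -> List[str]:
    """Split text into a list of sentences, dropping empty fragments."""
    parts = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in parts if p]


# Two same-language texts describing the same content.
first_sentences = split_sentences("The cat sat on the mat. It was warm!")
second_sentences = split_sentences("The cat sat. On the mat. It was warm!")
```

A punctuation-based splitter mishandles abbreviations and decimals; a production system would more likely use a dedicated sentence segmenter.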

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, determining the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix includes:

Input a first sentence from the first sentence list and a second sentence from the second sentence list into a semantic similarity value model, which outputs the semantic similarity value between the first sentence and the second sentence, where the semantic similarity value model is trained on sentence pairs with different semantic similarity values;

Arrange the semantic similarity values corresponding to each first sentence in order to obtain the similarity value matrix, where the number of rows of the similarity value matrix equals the number of first sentences in the first sentence list, and the number of columns equals the number of second sentences in the second sentence list.
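The matrix construction described above can be sketched as follows. The source uses a trained semantic similarity value model; the token-overlap scorer `jaccard_similarity` here is only a stand-in assumption so the example runs without a model, and the function names are hypothetical.

```python
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in scorer (NOT the trained model): token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_matrix(
    first: List[str],
    second: List[str],
    score: Callable[[str, str], float] = jaccard_similarity,
) -> List[List[float]]:
    """Row i holds the scores of first[i] against every second sentence."""
    return [[score(s1, s2) for s2 in second] for s1 in first]


first = ["the cat sat on the mat", "it was warm"]
second = ["the cat sat", "on the mat", "it was warm"]
matrix = similarity_matrix(first, second)
```

The result has one row per first sentence and one column per second sentence, matching the dimensions stated above.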

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, determining the mapping relationship between the first sentences and the second sentences according to the similarity value matrix includes:

Record the position of the first element of the similarity value matrix as the current element position;

If the semantic similarity value at the current element position equals a first preset value, determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one, where the first preset value indicates that the first sentence and the second sentence corresponding to the current element position have the highest degree of semantic similarity;

Record the position of the next element as the current element position, and repeat the above operations.

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, determining the mapping relationship between the first sentences and the second sentences according to the similarity value matrix includes:

Record the position of the first element of the similarity value matrix as the current element position;

If the semantic similarity value at the current element position is less than the first preset value, merge the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence, where the first preset value indicates that the first sentence and the second sentence corresponding to the current element position have the highest degree of semantic similarity; and merge the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;

Determine the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value at the next element position;

Record the position of the next element as the current element position, and repeat the above operations.

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence is denoted the first semantic similarity value; the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position is denoted the second semantic similarity value; the semantic similarity value at the current element position is denoted the third semantic similarity value; and the semantic similarity value at the next element position is denoted the fourth semantic similarity value;

Determining the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value at the next element position includes:

If the first semantic similarity value is less than or equal to the third semantic similarity value, or the first semantic similarity value is less than or equal to the fourth semantic similarity value, reduce the first semantic similarity value; otherwise, keep the first semantic similarity value unchanged;

If the second semantic similarity value is less than or equal to the third semantic similarity value, or the second semantic similarity value is less than or equal to the fourth semantic similarity value, reduce the second semantic similarity value; otherwise, keep the second semantic similarity value unchanged;

Determine the maximum of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;

If the maximum is the fourth semantic similarity value, determine that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum is the first semantic similarity value, determine that the mapping relationship is one-to-N; if the maximum is the second semantic similarity value, determine that the mapping relationship is N-to-one.
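The decision rule above can be sketched as a small function over the four similarity values. The source says the first and second values are "reduced" without giving an amount, so the fixed `penalty` is a hypothetical choice, as is the tie-breaking order (one-to-one, then one-to-N, then N-to-one).

```python
def classify_mapping(s1: float, s2: float, s3: float, s4: float,
                     penalty: float = 0.1) -> str:
    """Classify the mapping at the current element position.

    s1: first sentence vs. first merged sentence (merged second sentences)
    s2: second merged sentence (merged first sentences) vs. second sentence
    s3: similarity value at the current element position
    s4: similarity value at the next element position
    penalty: assumed reduction amount; the source does not specify one.
    """
    # Reduce a merged score that does not beat both unmerged scores.
    if s1 <= s3 or s1 <= s4:
        s1 -= penalty
    if s2 <= s3 or s2 <= s4:
        s2 -= penalty
    best = max(s1, s2, s4)
    # Tie-breaking order is an assumption.
    if best == s4:
        return "one-to-one"
    if best == s1:
        return "one-to-N"
    return "N-to-one"


kind = classify_mapping(0.9, 0.5, 0.6, 0.3)
```

Here merging second sentences clearly improves the match (0.9 against 0.6 and 0.3), so the position is classified as one-to-N.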

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, the value of N is determined as follows:

Merge the second sentences from the second sentence corresponding to the current element position through the second sentence corresponding to a target element position to obtain a third merged sentence, where the column of the target element position is the column of the current element position plus N-1, with N initially set to 3;

If the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value at the current element position, determine N=2;

Otherwise, set N=N+1 and repeat the above operations until the semantic similarity value between the first sentence corresponding to the current element position and the resulting third merged sentence is less than or equal to the semantic similarity value at the current element position, and then determine N=N-1.
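The search for N in the one-to-N case can be sketched as a loop that widens the merged span until the score stops exceeding the value at the current element position. This is a sketch under assumptions: `token_overlap` stands in for the trained model, sentences are joined with a space, and stopping at the end of the second sentence list is an added bound that the source does not mention.

```python
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Stand-in scorer (NOT the trained model): token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def determine_n(first_sentence: str, second: List[str], j: int, base: float,
                score: Callable[[str, str], float] = token_overlap) -> int:
    """Return N for a one-to-N mapping starting at column j.

    base is the similarity value at the current element position.
    """
    n = 3
    # Assumed bound: never merge past the end of the second sentence list.
    while j + n - 1 < len(second):
        merged = " ".join(second[j:j + n])  # third merged sentence
        if score(first_sentence, merged) <= base:
            break  # merging n sentences no longer helps; keep n - 1
        n += 1
    return n - 1


first_sentence = "the cat sat on the mat by the door"
second = [
    "the cat sat",
    "on the mat",
    "by the door",
    "meanwhile a very long unrelated sentence about weather and travel plans followed",
]
base = token_overlap(first_sentence, second[0])
n = determine_n(first_sentence, second, 0, base)
```

In this example the first three second sentences together reproduce the first sentence, and adding the fourth drops the score below `base`, so the loop settles on N=3.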

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the value of N is determined as follows:

Merge the first sentences from the first sentence corresponding to the current element position through the first sentence corresponding to a target element position to obtain a fourth merged sentence, where the column of the target element position is the column of the current element position plus N-1, with N initially set to 3;

If the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value at the current element position, determine N=2;

Otherwise, set N=N+1 and repeat the above operations until the semantic similarity value between the resulting fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value at the current element position, and then determine N=N-1.

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, obtaining the target second sentence associated with the first sentence according to the mapping relationship includes:

If the mapping relationship is one-to-one, record the second sentence corresponding to the mapping relationship as the target second sentence;

If the mapping relationship is one-to-N, merge the second sentences corresponding to the mapping relationship to obtain the target second sentence;

If the mapping relationship is N-to-one, merge the first sentences corresponding to the mapping relationship, and record the second sentence corresponding to the mapping relationship as the target second sentence.
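The three cases above can be sketched as one helper that returns a parallel pair. The function name, the string labels for the mapping kinds, and joining merged sentences with a space are all illustrative assumptions, not details from the source.

```python
from typing import List, Tuple


def build_pair(first: List[str], second: List[str], i: int, j: int,
               kind: str, n: int = 1) -> Tuple[str, str]:
    """Return (first-side text, target second sentence) for one mapping.

    i, j: row and column of the current element position.
    kind: "one-to-one", "one-to-N", or "N-to-one" (hypothetical labels).
    n: number of sentences merged in the one-to-N and N-to-one cases.
    """
    if kind == "one-to-one":
        return first[i], second[j]
    if kind == "one-to-N":
        # Merge N consecutive second sentences into the target.
        return first[i], " ".join(second[j:j + n])
    if kind == "N-to-one":
        # Merge N consecutive first sentences; the target is unchanged.
        return " ".join(first[i:i + n]), second[j]
    raise ValueError(f"unknown mapping kind: {kind}")


first = ["The cat sat on the mat.", "It was warm."]
second = ["The cat sat.", "On the mat.", "It was warm."]
pair = build_pair(first, second, 0, 0, "one-to-N", n=2)
```

Each returned tuple is one entry of the parallel corpus.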

According to one or more embodiments of the present disclosure, in the parallel corpus acquisition method provided by the present disclosure, after recording the first sentence and the target second sentence as parallel corpus, the method further includes:

Input the parallel corpus into a text simplification model and train the text simplification model to obtain a target text simplification model, where the target text simplification model is used to convert complex text into simple text.

According to one or more embodiments of the present disclosure, the present disclosure provides a parallel corpus acquisition apparatus, including:

A splitting module, configured to split a pre-acquired first text and second text to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content;

A similarity value matrix determination module, configured to determine the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;

A mapping relationship determination module, configured to determine the mapping relationship between the first sentences and the second sentences according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;

A parallel corpus acquisition module, configured to obtain the target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.

According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, including:

One or more processors;

A memory, configured to store one or more programs;

When the one or more programs are executed by the one or more processors, the parallel corpus acquisition method according to any embodiment of the present disclosure is implemented.

According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the parallel corpus acquisition method according to any embodiment of the present disclosure.

The above description is merely a description of preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (11)

CN202110181644.5A | 2021-02-08 | 2021-02-08 | Parallel corpus acquisition method, device, equipment and storage medium | Active | CN112906371B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110181644.5A | 2021-02-08 | 2021-02-08 | Parallel corpus acquisition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110181644.5A | 2021-02-08 | 2021-02-08 | Parallel corpus acquisition method, device, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN112906371A (en) | 2021-06-04
CN112906371B (en) | 2024-03-01

Family

ID=76123432

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110181644.5A (Active) | Parallel corpus acquisition method, device, equipment and storage medium | 2021-02-08 | 2021-02-08

Country Status (1)

Country | Link
CN (1) | CN112906371B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105868187A (en) * | 2016-03-25 | 2016-08-17 | 北京语言大学 | A multi-translation version parallel corpus establishing method
CN109670178A (en) * | 2018-12-20 | 2019-04-23 | 龙马智芯(珠海横琴)科技有限公司 | Sentence-level bilingual alignment method and device, computer readable storage medium
CN109710950A (en) * | 2018-12-20 | 2019-05-03 | 龙马智芯(珠海横琴)科技有限公司 | Bilingual alignment method, device and system
CN110362820A (en) * | 2019-06-17 | 2019-10-22 | 昆明理工大学 | A Lao-Chinese bilingual parallel sentence extraction method based on the Bi-LSTM algorithm
CN110427629A (en) * | 2019-08-13 | 2019-11-08 | 苏州思必驰信息科技有限公司 | Semi-supervised text simplified model training method and system
CN110781686A (en) * | 2019-10-30 | 2020-02-11 | 普信恒业科技发展(北京)有限公司 | Statement similarity calculation method and device and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10713432B2 (en) * | 2017-03-31 | 2020-07-14 | Adobe Inc. | Classifying and ranking changes between document versions
KR102637340B1 (en) * | 2018-08-31 | 2024-02-16 | 삼성전자주식회사 | Method and apparatus for mapping sentences

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
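Title
A Chinese-English parallel corpus for information extraction; 惠浩添; 李云建; 钱龙华; 周国栋; Computer Engineering & Science (No. 12); full text *
Sentence alignment of bilingual comparable corpora based on Wikipedia; 胡弘思 et al.; Journal of Chinese Information Processing (No. 01); full text *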

Also Published As

Publication number | Publication date
CN112906371A (en) | 2021-06-04

Similar Documents

Publication | Title
CN111159220B (en) | Method and apparatus for outputting structured query statement
CN113139391B (en) | Translation model training method, device, equipment and storage medium
JP7553202B2 (en) | Text sequence generation method, device, equipment and medium
CN110457325B (en) | Method and apparatus for outputting information
CN113488050B (en) | Voice wakeup method and device, storage medium and electronic equipment
CN114861889B (en) | Deep learning model training method, target object detection method and device
CN113033682B (en) | Video classification method, device, readable medium, and electronic device
WO2022135080A1 (en) | Corpus sample determination method and apparatus, electronic device, and storage medium
CN112380883B (en) | Model training method, machine translation method, device, equipment and storage medium
CN110275962B (en) | Method and apparatus for outputting information
CN112307061B (en) | Method and device for querying data
CN113468330B (en) | Information acquisition method, device, equipment and medium
CN114969044B (en) | Materialized column creation method and data query method based on data lake
CN113688256B (en) | Construction method and device of clinical knowledge base
CN112182255A (en) | Method and apparatus for storing and retrieving media files
CN115146623A (en) | Text word replacement method, device, storage medium and electronic device
WO2023273598A1 (en) | Text search method and apparatus, and readable medium and electronic device
CN113033707B (en) | Video classification method and device, readable medium and electronic equipment
CN115640815A (en) | Translation method, device, readable medium and electronic equipment
CN116894188A (en) | Service tag set updating method and device, medium and electronic equipment
WO2023011260A1 (en) | Translation processing method and apparatus, device and medium
CN112148751A (en) | Method and apparatus for querying data
WO2024021790A1 (en) | Data lake-based virtual column construction method and data query method
CN116503596A (en) | Picture segmentation method, device, medium and electronic equipment
JP2023078411A (en) | Information processing method, model training method, device, equipment, medium and program product

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
