CN111369978B - A data processing method, a data processing device and a data processing device - Google Patents

A data processing method, a data processing device and a data processing device

Info

Publication number
CN111369978B
Authority
CN
China
Prior art keywords
decoding
language
voice frame
result
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811603538.6A
Other languages
Chinese (zh)
Other versions
CN111369978A (en)
Inventor
周盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201811603538.6A
Publication of CN111369978A
Application granted
Publication of CN111369978B
Status: Active
Anticipated expiration


Abstract


Embodiments of the present invention provide a data processing method, a data processing apparatus, and a device for data processing. The method specifically includes: determining the language type of a speech frame in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained on acoustic data of at least two language types; decoding the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and determining a recognition result corresponding to the speech information according to the first decoding result. Embodiments of the present invention can improve the accuracy of speech recognition.

Description

A data processing method, a data processing apparatus, and a device for data processing

Technical Field

The present invention relates to the field of computer technology, and in particular to a data processing method, a data processing apparatus, and a device for data processing.

Background

Speech recognition technology, also known as ASR (Automatic Speech Recognition), aims to convert the lexical content of speech into computer-readable input, such as keystrokes, binary codes, or character sequences.

In everyday language, multiple languages may be mixed within one utterance. Taking mixed Chinese and English as an example, a user speaking Chinese may intersperse English words and phrases, for example, "我买了最新款的iPhone" ("I bought the latest iPhone") and "来一首Yesterday once more" ("Play Yesterday Once More").

However, current speech recognition technology is fairly accurate for single-language speech; when the speech contains multiple languages, recognition accuracy drops significantly.

Summary of the Invention

Embodiments of the present invention provide a data processing method, a data processing apparatus, and a device for data processing, which can improve the accuracy of speech recognition when the speech contains multiple languages.

To solve the above problem, an embodiment of the present invention discloses a data processing method, the method comprising:

determining the language type of a speech frame in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;

decoding the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and

determining a recognition result corresponding to the speech information according to the first decoding result.

In another aspect, an embodiment of the present invention discloses a data processing apparatus, the apparatus comprising:

a type determination module, configured to determine the language type of a speech frame in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;

a first decoding module, configured to decode the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and

a result determination module, configured to determine a recognition result corresponding to the speech information according to the first decoding result.

In yet another aspect, an embodiment of the present invention discloses a device for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:

determining the language type of a speech frame in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;

decoding the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and

determining a recognition result corresponding to the speech information according to the first decoding result.

In still another aspect, an embodiment of the present invention discloses a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform one or more of the data processing methods described above.

Embodiments of the present invention include the following advantages:

An embodiment of the present invention can train a multilingual acoustic model on acoustic data of at least two language types. Through the multilingual acoustic model, the language type of each speech frame in the speech information can be determined. Therefore, when the speech information contains multiple language types, the embodiment can accurately distinguish speech frames of different language types and decode each speech frame with the decoding network of the corresponding language type to obtain its first decoding result. Because the first decoding result is produced by the decoding network corresponding to the language type of the speech frame, decoding accuracy is ensured, and the accuracy of speech recognition is thereby improved.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flow chart of the steps of an embodiment of a data processing method of the present invention;

FIG. 2 is a structural block diagram of an embodiment of a data processing apparatus of the present invention;

FIG. 3 is a block diagram of a device 800 for data processing according to the present invention; and

FIG. 4 is a schematic diagram of the structure of a server in some embodiments of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Method Embodiment

Referring to FIG. 1, a flow chart of the steps of an embodiment of a data processing method of the present invention is shown. The method may specifically include the following steps:

Step 101: determine the language type of a speech frame in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained on acoustic data of at least two language types.

Step 102: decode the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame.

Step 103: determine a recognition result corresponding to the speech information according to the first decoding result.

The data processing method of the embodiment of the present invention can be used in scenarios where speech information containing at least two language types is recognized. The method can be applied to electronic devices, including but not limited to: servers, smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart televisions, wearable devices, and so on.

It can be understood that the embodiment of the present invention does not limit how the speech information to be recognized is obtained. For example, the electronic device may obtain the speech information from a client or from a network via a wired or wireless connection, may record the speech information in real time, or may obtain it from an instant message received by an instant messaging application.

In the embodiment of the present invention, the speech information to be recognized can be segmented into multiple speech frames according to a preset window length and frame shift, where each speech frame is a short speech segment, and the speech information can then be decoded frame by frame. If the speech information to be recognized is analog (for example, a recording of a user's call), the analog signal needs to be converted into digital speech information before segmentation.

The window length denotes the duration of each speech frame, and the frame shift denotes the time offset between adjacent frames. For example, with a window length of 25 ms and a frame shift of 15 ms, the first frame covers 0-25 ms, the second frame covers 15-40 ms, and so on, thereby segmenting the speech information to be recognized. It can be understood that the specific window length and frame shift can be set according to actual needs, and the embodiment of the present invention does not limit them.
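The segmentation described above can be sketched as follows. This is a minimal illustration; the function name and the millisecond-based interface are our own, not from the patent text:

```python
def frame_boundaries(total_ms, win_ms=25, shift_ms=15):
    """Split an utterance of total_ms milliseconds into (start, end) frames,
    using the window length (win_ms) and frame shift (shift_ms) from the text."""
    frames = []
    start = 0
    # Emit a frame for every window position that fits inside the utterance.
    while start + win_ms <= total_ms:
        frames.append((start, start + win_ms))
        start += shift_ms
    return frames

# First frame 0-25 ms, second frame 15-40 ms, as in the example above.
print(frame_boundaries(70))  # [(0, 25), (15, 40), (30, 55), (45, 70)]
```

Adjacent frames overlap by `win_ms - shift_ms` (10 ms here), so no audio between frame centers is lost.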

Optionally, before segmenting the speech information to be recognized, the electronic device may also perform noise reduction on the speech information to improve subsequent processing.

In the embodiment of the present invention, the speech information can be input into a pre-trained multilingual acoustic model, and a speech recognition result can be obtained based on the model's output. The multilingual acoustic model may be a classification model that integrates multiple neural networks, including but not limited to at least one of the following, or a combination, stack, or nesting of at least two of them: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, RNN (Recurrent Neural Network), attention-based neural networks, and so on.

To improve the accuracy of recognizing speech information containing multiple language types, the embodiment of the present invention pre-trains a multilingual acoustic model on acoustic data of at least two language types. According to the multilingual acoustic model, the language type of a speech frame in the speech information can be determined; the speech frame can then be decoded by the decoding network corresponding to that language type to obtain its first decoding result, and the recognition result corresponding to the speech information can be determined from the first decoding result.

It can be understood that the embodiment of the present invention limits neither the number nor the choice of language types in the acoustic data used to train the multilingual acoustic model. For ease of description, the embodiments herein use speech information containing two language types, Chinese and English, as an example; that is, the multilingual acoustic model may be trained on collected Chinese and English acoustic data. Of course, acoustic data of more than two language types, such as Chinese, English, Japanese, and German, may also be collected to train the model. For scenarios with more than two language types, the implementation is similar to the two-language case, and the descriptions can be cross-referenced.

The decoding networks of the embodiment of the present invention may include decoding networks for at least two language types. For example, when recognizing mixed Chinese and English speech, a Chinese decoding network and an English decoding network can be constructed separately. Specifically, a Chinese text corpus can be collected to train a Chinese language model, and the Chinese decoding network can be built from knowledge sources such as the Chinese language model and a Chinese pronunciation dictionary; similarly, an English text corpus can be collected to train an English language model, and the English decoding network can be built from knowledge sources such as the English language model and an English pronunciation dictionary.

When decoding the speech information frame by frame, if the multilingual acoustic model determines that a speech frame's language type is Chinese, the frame can be decoded with the Chinese decoding network; if the model determines that the frame's language type is English, the frame can be decoded with the English decoding network.

In one application example of the present invention, suppose the speech information to be recognized is "我喜欢apple" ("I like apple"). Specifically, the language type of the first speech frame is determined according to the multilingual acoustic model; assuming it is Chinese, the first frame is decoded with the Chinese decoding network to obtain its first decoding result. The language type of the second frame is then determined, and the second frame is input into the decoding network corresponding to its language type to obtain its first decoding result; and so on. Assuming the m-th frame is determined to be English according to the multilingual acoustic model, it is decoded with the English decoding network to obtain its first decoding result, until the last frame is decoded. Finally, the recognition result of the speech information is obtained from the first decoding results of all frames; for example, the recognition result may include the text "我喜欢apple".
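The frame-by-frame routing in this example can be sketched as follows. The `classify_language` callable and the per-language decoders are hypothetical stand-ins for the multilingual acoustic model and the decoding networks, not the patent's actual components:

```python
def recognize(frames, classify_language, decoders):
    """Route each frame to the decoder of its predicted language type and
    concatenate the per-frame first decoding results."""
    pieces = []
    for frame in frames:
        lang = classify_language(frame)       # e.g. 'zh' or 'en'
        pieces.append(decoders[lang](frame))  # decode with that language's network
    return "".join(pieces)

# Toy stand-ins: ASCII frames are treated as "English", everything else "Chinese".
decoders = {"zh": lambda f: f, "en": lambda f: f}
result = recognize(["我", "喜", "欢", "apple"],
                   lambda f: "en" if f.isascii() else "zh",
                   decoders)
print(result)  # 我喜欢apple
```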

It can be seen that, through the trained multilingual acoustic model, the embodiment of the present invention can determine the language type of each speech frame in the speech information and decode each frame with the decoding network of the corresponding language type, thereby obtaining a more accurate recognition result.

In an optional embodiment of the present invention, determining the language type of a speech frame in the speech information according to the multilingual acoustic model may specifically include:

Step S11: determine the posterior probability of each state for the speech frame according to the multilingual acoustic model, wherein there is a correspondence between states and language types.

Step S12: determine, from the frame's per-state posterior probabilities and the language type of each state, the probability ratio between the language-type states in the frame's posterior probabilities.

Step S13: determine the language type of the speech frame according to the probability ratio.

The multilingual acoustic model converts the features of the input speech frame into posterior probabilities over states. A state may specifically be an HMM (Hidden Markov Model) state: several states may correspond to one phoneme, several phonemes to one word, and several words may form a sentence.

For example, suppose the output layer of the multilingual acoustic model outputs posterior probabilities for (M1+M2) states, where the M1 states correspond to Chinese and the M2 states correspond to English.

When a speech frame is input into the multilingual acoustic model, the model outputs the frame's posterior probability for each state. From these posteriors and the language type of each state, the probability ratio between the language-type states can be determined, such as the ratio between the Chinese-state and English-state probabilities in the frame's posteriors; the frame's language type can then be determined from this ratio.

For example, let p1 be the sum of the posterior probabilities of the M1 Chinese states and p2 the sum of the posterior probabilities of the M2 English states, with p1+p2=1. If p1 is greater than p2, the Chinese states dominate the frame's posteriors, and the frame's language type can be determined to be Chinese; otherwise, it can be determined to be English.

However, in mixed Chinese-English speech, the English posterior probability is usually small and rarely exceeds 0.5. Therefore, to reduce misjudgments, the embodiment of the present invention can set a preset threshold and determine the frame's language type by comparing the probability ratio with this threshold.

Taking mixed Chinese and English as an example, let the ratio of the English-state to Chinese-state posteriors be p2/p1. If p2/p1 exceeds a preset threshold (e.g., 0.25), the frame's language type can be determined to be English; similarly, the ratio of Chinese-state to English-state posteriors is p1/p2, and if p1/p2 exceeds 4, the frame can be determined to be Chinese. The preset threshold can be tuned experimentally; it can be understood that the embodiment of the present invention does not limit its specific value.

Of course, since p1+p2=1, p2/p1>0.25 is equivalent to p2>0.2, so the decision can also be made directly on the value of p1 or p2 alone.
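The ratio test above can be sketched as follows. The threshold value 0.25 follows the text's example; the function and the state index sets are our own illustration:

```python
def frame_language(posteriors, zh_states, en_states, threshold=0.25):
    """Classify one frame from its per-state posterior probabilities.
    p1/p2 sum the Chinese/English state posteriors; the frame is judged
    English when p2/p1 exceeds the threshold (0.25 in the text's example)."""
    p1 = sum(posteriors[i] for i in zh_states)
    p2 = sum(posteriors[i] for i in en_states)
    # p2 > threshold * p1 avoids dividing by zero when p1 is 0.
    return "en" if p2 > threshold * p1 else "zh"

# Chinese states {0, 1}, English states {2, 3}; p1 = 0.9, p2 = 0.1 -> Chinese.
print(frame_language([0.8, 0.1, 0.06, 0.04], {0, 1}, {2, 3}))  # zh
```

Note that the English threshold is deliberately below 1: an English frame is accepted even when its summed posterior is well under the Chinese one, which matches the observation that English posteriors rarely exceed 0.5 in mixed speech.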

In practical applications, if the user switches language types frequently, or the speech information is short, judging a frame's language type from that single frame alone may lead to errors.

To improve the accuracy of determining a frame's language type, in an optional embodiment of the present invention, the frame's language type can be determined from the average, over the consecutive frames within a preset window containing the frame, of the probability ratios between the language-type states.

It can be understood that the embodiment of the present invention does not limit the specific value of the preset window length; for example, it can be set to the duration of 10 consecutive speech frames. Specifically, 10 consecutive frames containing the current frame can be obtained, the ratio p2/p1 of English-state to Chinese-state posteriors computed for each of them, and the 10 ratios summed and averaged. If the average exceeds the preset threshold of 0.25, the frame's language type can be determined to be English. This avoids the misjudgments of single-frame decisions and thereby improves the accuracy of determining a frame's language type.
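The windowed decision can be sketched on top of the per-frame ratios. The window length of 10 frames and the 0.25 threshold follow the example above; the rest of the scaffolding is our own:

```python
def window_language(ratios, threshold=0.25, win=10):
    """Average the per-frame p2/p1 ratios over the last `win` frames and
    classify the window as English when the mean exceeds the threshold."""
    recent = ratios[-win:]
    mean_ratio = sum(recent) / len(recent)
    return "en" if mean_ratio > threshold else "zh"

# A single outlier frame no longer flips the decision:
print(window_language([0.1] * 9 + [0.9]))  # zh  (mean 0.18 <= 0.25)
```

Smoothing over a window trades a small amount of latency at language boundaries for robustness against per-frame noise.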

In an optional embodiment of the present invention, before determining the language type of a speech frame in the speech information according to the multilingual acoustic model, the method may further include:

Step S21: determine a target language type from the at least two language types.

Step S22: decode each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain a second decoding result of each speech frame.

After determining the language type of the speech frame in the speech information according to the multilingual acoustic model, the method may further include:

determining a target speech frame from the speech frames of the speech information, and determining the second decoding result of the target speech frame, wherein the language type of the target speech frame is a non-target language type.

Decoding the speech frame according to the decoding network corresponding to its language type to obtain its first decoding result may specifically include: decoding the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain the first decoding result of the target speech frame.

Determining the recognition result corresponding to the speech information according to the first decoding result may specifically include: replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and taking the second decoding result after replacement as the recognition result corresponding to the speech information.

In practical applications, users usually mix two language types, with most of the utterance in one language and only a small portion interspersed with the other. In addition, when the speech information is short, for example when it contains only a single English word, the decoding result may be inaccurate because a single word carries no context.

Therefore, the embodiment of the present invention can determine a target language type from the at least two language types; the target language type may be the primary language of the mixed expression, for example Chinese. When decoding the speech information, every speech frame is decoded with the Chinese decoding network to obtain its second decoding result (e.g., R1), which is a Chinese decoding result. Since the second decoding result is obtained by decoding a complete piece of speech information, each frame can draw on its context during decoding, which improves the accuracy of the second decoding result.

After the decoding network of the target language type has decoded all speech frames, the target speech frames can be determined from the speech frames of the speech information, where a target speech frame's language type is a non-target language type. For example, for mixed Chinese-English speech with Chinese as the target language type, English is the non-target language type; that is, the frames whose language type is English are determined to be target speech frames, and their first decoding result corresponding to English (e.g., R2) is determined, R2 being obtained by decoding the target frames with the English decoding network. Finally, replacing the corresponding R1 with R2 yields the recognition result corresponding to the speech information.

In one application example of the present invention, suppose the speech to be recognized is "我喜欢apple" ("I like apple") and the target language type is Chinese. Specifically, the speech information is first fed into the multilingual acoustic model to obtain a sequence of state posterior probabilities for each speech frame. The posterior probabilities of the Chinese states of each frame are decoded with the Chinese decoding network to obtain the second decoding result of each frame; suppose the second decoding result of the speech information is "我喜欢爱破", where the English word "apple" has been mis-transcribed as the phonetically similar Chinese characters "爱破". Next, the language type of each frame is determined from the posterior probability of each state and the language type that each state belongs to, and the frames whose language type is English are taken as target speech frames. The target speech frames are then decoded with the English decoding network to obtain their first decoding result for English, assumed to be "apple". Finally, "爱破", the part of the second decoding result "我喜欢爱破" that corresponds to "apple", is replaced with "apple", yielding the replaced second decoding result: "我喜欢apple".
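The two-pass flow of this example can be sketched as follows. This is a minimal illustration under assumed stub components: the acoustic model, the two decoders, and the per-frame language labels are all hypothetical stand-ins, not the actual implementation (which locates the span to replace via time boundary information rather than string substitution):

```python
def recognize_mixed(frames, acoustic_model, zh_decoder, en_decoder):
    """Two-pass recognition: full Chinese pass, then patch in the English span."""
    posteriors = [acoustic_model(f) for f in frames]
    # Second decoding result: decode every frame with the Chinese network
    second_result = zh_decoder([p["zh"] for p in posteriors])   # e.g. "我喜欢爱破"
    # Determine each frame's language type from its state posteriors
    en_frames = [f for f, p in zip(frames, posteriors) if p["lang"] == "en"]
    # First decoding result of the target (English) frames
    first_result = en_decoder(en_frames)                        # e.g. "apple"
    # Replace the Chinese mis-transcription of the English span
    return second_result.replace("爱破", first_result)

# Stub components reproducing the worked example
frames = ["f1", "f2", "f3", "f4", "f5"]
acoustic = lambda f: {"zh": None, "lang": "en" if f in ("f4", "f5") else "zh"}
zh_dec = lambda ps: "我喜欢爱破"
en_dec = lambda fs: "apple"
print(recognize_mixed(frames, acoustic, zh_dec, en_dec))  # → 我喜欢apple
```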

It should be noted that, in the embodiment of the present invention, for a speech frame whose language type is the target language type, the first decoding result and the second decoding result are identical. For example, in the above example the frames corresponding to "我喜欢" have language type Chinese, and the target language type is also Chinese, so both the first and the second decoding results of those frames are the text "我喜欢".

In an optional embodiment of the present invention, the first decoding result and the second decoding result may include time boundary information of the corresponding speech frames.

Replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame may specifically include:

Step S31: determining, from the second decoding result of the target speech frame, the result to be replaced, where the result to be replaced coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frame;

Step S32: replacing the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.

To ensure that the second decoding result of the target speech frame can be accurately replaced with the first decoding result of the corresponding language type, the first decoding result and the second decoding result of the embodiment of the present invention may include time boundary information of the corresponding speech frames.

For example, in the above example, each character of the second decoding result "我喜欢爱破" carries the time boundary information of its speech frames. Based on this information, the result to be replaced can be located within the second decoding result so that it coincides with the time boundary of the first decoding result of the corresponding language type. From the above example, that first decoding result is "apple". Assuming the part of "我喜欢爱破" whose time boundary coincides with that of "apple" is determined to be "爱破", then "爱破" in "我喜欢爱破" is replaced with "apple", giving the replaced decoding result "我喜欢apple".
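The time-boundary replacement of steps S31 and S32 can be sketched as follows. This is a minimal illustration; the `Segment` structure, the exact overlap test, and the example timestamps are assumptions, not the patented implementation:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str       # decoded token, e.g. a Chinese character or an English word
    start: float    # start time in seconds
    end: float      # end time in seconds

def replace_by_time_boundary(second_result, first_result):
    """Replace the part of the target-language pass (second_result) whose time
    boundary coincides with a non-target-language segment (first_result)."""
    # Step S31: segments outside first_result's time boundary are kept;
    # the rest is the result to be replaced
    kept = [s for s in second_result
            if s.end <= first_result.start or s.start >= first_result.end]
    replaced = [s for s in second_result if s not in kept]
    # Step S32: substitute the first decoding result for the replaced span
    merged = kept + [first_result]
    merged.sort(key=lambda s: s.start)
    return "".join(s.text for s in merged), [s.text for s in replaced]

# "我喜欢爱破" with per-character boundaries; "apple" spans 0.9–1.5 s
second = [Segment("我", 0.0, 0.3), Segment("喜", 0.3, 0.6), Segment("欢", 0.6, 0.9),
          Segment("爱", 0.9, 1.2), Segment("破", 1.2, 1.5)]
first = Segment("apple", 0.9, 1.5)
text, dropped = replace_by_time_boundary(second, first)
print(text)     # → 我喜欢apple
print(dropped)  # → ['爱', '破']
```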

In an optional embodiment of the present invention, the decoding network may specifically include a general decoding network and a professional decoding network, where the general decoding network may include a language model trained on a general text corpus, and the professional decoding network may include a language model trained on a text corpus of a preset domain.

Decoding the speech frame with the decoding network corresponding to the language type of the speech frame to obtain the first decoding result of the speech frame may specifically include:

Step S41: decoding the speech frame with the general decoding network and the professional decoding network respectively, to obtain a first score of the speech frame under the general decoding network and a second score of the speech frame under the professional decoding network;

Step S42: taking the decoding result with the higher of the first score and the second score as the first decoding result of the speech frame.

In practical applications, the decoding network usually decodes users' everyday conversational speech well. However, speech from professional domains, such as the medical domain, usually contains many domain-specific terms, such as "aspirin" and "Parkinson's disease", which degrades decoding performance.

To solve this problem, the decoding network of the embodiment of the present invention may include a general decoding network and a professional decoding network. The general decoding network may be the network used for users' everyday communication and may include a language model trained on a general text corpus, so it can recognize most users' everyday speech. The professional decoding network may be a network customized for a professional domain and may include a language model trained on a text corpus of a preset domain; the preset domain may be any domain, such as medicine, law, or computing.

For example, at a medical conference a speaker may use many mixed Chinese-English sentences together with a large amount of medical vocabulary. The embodiment of the present invention can recognize the speaker's speech as text in real time and display it on a large screen for the attendees.

Specifically, the speaker's speech may be decoded frame by frame with the general decoding network and the professional decoding network respectively, to obtain a first score of each speech frame under the general decoding network and a second score under the professional decoding network, and the decoding result with the higher of the two scores is taken as the first decoding result of the speech frame.
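The selection in steps S41–S42 can be sketched as follows; the decoder objects, their `decode` method returning a `(text, score)` pair, and the example scores are assumptions for illustration only:

```python
def best_first_result(frames, general_net, professional_net):
    """Decode with both networks and keep the higher-scoring hypothesis."""
    general_text, general_score = general_net.decode(frames)
    professional_text, professional_score = professional_net.decode(frames)
    # Step S42: the hypothesis with the higher score wins
    if professional_score > general_score:
        return professional_text, professional_score
    return general_text, general_score

class StubNet:
    """Stand-in decoder returning a fixed hypothesis and score."""
    def __init__(self, text, score):
        self.text, self.score = text, score
    def decode(self, frames):
        return self.text, self.score

# A medical term scores higher under the professional network
general = StubNet("阿司匹林?", 0.4)
professional = StubNet("阿司匹林", 0.9)
text, score = best_first_result([], general, professional)
print(text)   # → 阿司匹林
```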

It can be understood that the decoding network of the embodiment of the present invention may include decoding networks for multiple language types, and the decoding network of each language type may in turn include a general decoding network and a professional decoding network for that language type. In this way, the professional decoding network can supplement or correct the decoding results of the general decoding network, which improves decoding accuracy when the speech information contains domain-specific vocabulary.

It can be understood that the embodiment of the present invention does not limit how the multilingual acoustic model is trained. In an optional embodiment of the present invention, each item of the acoustic data of the at least two language types corresponds to at least two language types.

Specifically, the embodiment of the present invention may collect mixed acoustic data containing at least two language types to train the multilingual acoustic model, where mixed acoustic data means that each item of data corresponds to at least two language types. For example, the speech corresponding to "我喜欢apple" may be one item of mixed acoustic data.

Training a multilingual acoustic model on mixed acoustic data requires merging similar pronunciation units across languages to build a pronunciation dictionary suited to mixed languages, and this merging may introduce errors. In addition, mixed acoustic data containing at least two language types is usually scarce and hard to collect, which degrades the recognition accuracy of the multilingual acoustic model.

To solve these problems, in an optional embodiment of the present invention, each item of the acoustic data of the at least two language types corresponds to one language type.

Specifically, the embodiment of the present invention may separately collect monolingual acoustic data for each of the at least two language types and train the multilingual acoustic model on a training set composed of the monolingual data of each language type. For example, the speech corresponding to "今天天气很好" ("The weather is nice today") may be one item of monolingual acoustic data, and the speech corresponding to "What's the weather like today" may be another.

In an optional embodiment of the present invention, training the multilingual acoustic model may specifically include:

Step S51: training, from the collected acoustic data of at least two language types, a monolingual acoustic model for each language type;

Step S52: state-labeling the acoustic data of each of the at least two language types according to the corresponding monolingual acoustic model, where there is a correspondence between states and language types;

Step S53: training the multilingual acoustic model on a data set composed of the labeled acoustic data of the at least two language types.

Specifically, a Chinese monolingual acoustic model NN1 may be trained on the collected Chinese acoustic data L1, where every item in L1 has language type Chinese. The number of tied HMM states of Chinese speech may be set as the number of nodes in the output layer of NN1, e.g., M1. The output of the monolingual acoustic model may include the state probabilities of one language type; that is, the state probabilities of all M1 output nodes correspond to Chinese.

Similarly, an English monolingual acoustic model NN2 may be trained on the collected English acoustic data L2, where every item in L2 has language type English. The number of tied HMM states of English speech may be set as the number of nodes in the output layer of NN2, e.g., M2, and the state probabilities of all M2 nodes correspond to English.

Then, the trained NN1 and NN2 are used to force-align the Chinese acoustic data L1 and the English acoustic data L2 respectively, so as to state-label L1 and L2. Specifically, NN1 determines the state of each speech frame of every item in L1, and NN2 determines the state of each speech frame of every item in L2.

Finally, the labeled L1 and L2 are mixed into a labeled data set (L1+L2) to train the multilingual acoustic model NN3. The output of the multilingual acoustic model may include state probabilities for at least two language types. For example, the output layer of NN3 may have M1+M2 nodes, where the first M1 nodes correspond to the Chinese HMM states and the last M2 nodes correspond to the English HMM states.
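Given this split output layer (first M1 nodes Chinese, last M2 nodes English), the language type of a frame can be decided by comparing the posterior probability mass of the two blocks, in the spirit of the probability-ratio approach described earlier. A minimal sketch; the simple sum over each block and the threshold value are assumptions:

```python
def frame_language(posteriors, m1, ratio_threshold=1.0):
    """Decide a frame's language type from the NN3 posterior vector.

    `posteriors` has M1+M2 entries: the first m1 entries are Chinese HMM
    states, the rest English. The ratio of the two probability masses is
    compared against a threshold (the threshold value is an assumption)."""
    chinese_mass = sum(posteriors[:m1])
    english_mass = sum(posteriors[m1:])
    ratio = chinese_mass / max(english_mass, 1e-10)
    return "zh" if ratio >= ratio_threshold else "en"

# Toy posterior over M1=3 Chinese states and M2=2 English states
frame = [0.1, 0.05, 0.05, 0.5, 0.3]   # mass: zh = 0.2, en = 0.8
print(frame_language(frame, m1=3))    # → en
```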

During training of the multilingual acoustic model, the embodiment of the present invention can use the monolingual acoustic data of each language type, preserving the pronunciation characteristics of each language, so the model retains a degree of discriminability between language types at the acoustic level. In addition, collecting the acoustic data of each language type separately avoids the data scarcity caused by collecting mixed acoustic data of multiple language types, and can therefore improve the recognition accuracy of the multilingual acoustic model.

In summary, the embodiment of the present invention can train a multilingual acoustic model on acoustic data of at least two language types and use it to determine the language type of each speech frame in the speech information. Therefore, when the speech information contains multiple language types, the embodiment can accurately distinguish speech frames of different language types and decode each frame with the decoding network of the corresponding language type to obtain its first decoding result. Since the first decoding result is produced by the decoding network matching the language type of the speech frame, decoding accuracy is ensured, which in turn improves the accuracy of speech recognition.

It should be noted that, for simplicity of description, the method embodiments are presented as series of action combinations; however, those skilled in the art will appreciate that the embodiments of the present invention are not limited by the described order of actions, since according to the embodiments certain steps may be performed in other orders or simultaneously. Moreover, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Device Embodiment

Referring to FIG. 2, there is shown a structural block diagram of an embodiment of a data processing device of the present invention; the device may specifically include:

a type determination module 201, configured to determine the language type of a speech frame in speech information according to a multilingual acoustic model, where the multilingual acoustic model is trained on acoustic data of at least two language types;

a first decoding module 202, configured to decode the speech frame with the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame;

a result determination module 203, configured to determine, from the first decoding result, the recognition result corresponding to the speech information.

Optionally, the type determination module may specifically include:

a probability determination submodule, configured to determine, according to the multilingual acoustic model, the posterior probability of each state for the speech frame, where there is a correspondence between states and language types;

a ratio determination submodule, configured to determine, from the posterior probabilities of the states for the speech frame and the language type of each state, the probability ratio of the frame's posterior probabilities over the states of each language type;

a type determination submodule, configured to determine the language type of the speech frame according to the probability ratio.

Optionally, the device may further include:

a target language determination module, configured to determine a target language type from the at least two language types;

a second decoding module, configured to decode each speech frame in the speech information with the decoding network corresponding to the target language type, to obtain a second decoding result of each speech frame.

The device may further include:

a target frame determination module, configured to determine target speech frames from the speech frames of the speech information and to determine the second decoding results of the target speech frames, where the language type of a target speech frame is a non-target language type.

The first decoding module may specifically include:

a first decoding submodule, configured to decode the target speech frame with the decoding network corresponding to the language type of the target speech frame, to obtain the first decoding result of the target speech frame.

The result determination module may specifically include:

a first result determination submodule, configured to replace the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and to take the replaced second decoding result as the recognition result corresponding to the speech information.

Optionally, the first decoding result and the second decoding result include time boundary information of the corresponding speech frames.

The first result determination submodule may specifically include:

a result determination unit, configured to determine the result to be replaced from the second decoding result of the target speech frame, where the result to be replaced coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frame;

a replacement unit, configured to replace the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.

Optionally, the decoding network may specifically include a general decoding network and a professional decoding network, where the general decoding network includes a language model trained on a general text corpus and the professional decoding network includes a language model trained on a text corpus of a preset domain.

The first decoding module may specifically include:

a score determination submodule, configured to decode the speech frame with the general decoding network and the professional decoding network respectively, to obtain a first score of the speech frame under the general decoding network and a second score of the speech frame under the professional decoding network;

a second result determination submodule, configured to take the decoding result with the higher of the first score and the second score as the first decoding result of the speech frame.

Optionally, the device may further include a model training module configured to train the multilingual acoustic model; the model training module may specifically include:

a first training submodule, configured to train, from the collected acoustic data of at least two language types, a monolingual acoustic model for each language type;

a state labeling submodule, configured to state-label the acoustic data of each of the at least two language types according to the corresponding monolingual acoustic model, where there is a correspondence between states and language types;

a second training submodule, configured to train the multilingual acoustic model on a data set composed of the labeled acoustic data of at least two language types.

Optionally, each item of the acoustic data of the at least two language types corresponds to at least two language types; or each item of the acoustic data corresponds to one language type.

Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.

Regarding the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiment of the method and will not be elaborated here.

An embodiment of the present invention provides a device for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: determining the language type of a speech frame in speech information according to a multilingual acoustic model, the multilingual acoustic model being trained on acoustic data of at least two language types; decoding the speech frame with the decoding network corresponding to the language type of the speech frame to obtain a first decoding result of the speech frame; and determining, from the first decoding result, the recognition result corresponding to the speech information.

FIG. 3 is a block diagram of a device 800 for data processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.

Referring to FIG. 3, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, phone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and the other components; for example, a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation on the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.

The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.

音频组件810被配置为输出和/或输入音频信号。例如,音频组件810包括一个麦克风(MIC),当装置800处于操作模式,如呼叫模式、记录模式和语音信息处理模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器804或经由通信组件816发送。在一些实施例中,音频组件810还包括一个扬声器,用于输出音频信号。The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), and when the device 800 is in an operating mode, such as a call mode, a recording mode, and a voice information processing mode, the microphone is configured to receive an external audio signal. The received audio signal can be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.

I/O接口812为处理组件802和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。I/O interface 812 provides an interface between processing component 802 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include but are not limited to: home button, volume button, start button, and lock button.

传感器组件814包括一个或多个传感器,用于为装置800提供各个方面的状态评估。例如,传感器组件814可以检测到设备800的打开/关闭状态,组件的相对定位,例如所述组件为装置800的显示器和小键盘,传感器组件814还可以检测装置800或装置800一个组件的位置改变,用户与装置800接触的存在或不存在,装置800方位或加速/减速和装置800的温度变化。传感器组件814可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件814还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件814还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。The sensor assembly 814 includes one or more sensors for providing various aspects of status assessment for the device 800. For example, the sensor assembly 814 can detect the open/closed state of the device 800, the relative positioning of components, such as the display and keypad of the device 800, and the sensor assembly 814 can also detect the position change of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and the temperature change of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

通信组件816被配置为便于装置800和其他设备之间有线或无线方式的通信。装置800可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件816经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件816还包括近场通信(NFC)模块,以促进短程通信。例如,NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

在示例性实施例中,装置800可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。In an exemplary embodiment, the apparatus 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components to perform the above method.

在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器804,上述指令可由装置800的处理器820执行以完成上述方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory 804 including instructions, and the instructions can be executed by the processor 820 of the device 800 to perform the above method. For example, the non-transitory computer-readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.

图4是本发明的一些实施例中服务器的结构示意图。该服务器1900可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1922(例如,一个或一个以上处理器)和存储器1932,一个或一个以上存储应用程序1942或数据1944的存储介质1930(例如一个或一个以上海量存储设备)。其中,存储器1932和存储介质1930可以是短暂存储或持久存储。存储在存储介质1930的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1922可以设置为与存储介质1930通信,在服务器1900上执行存储介质1930中的一系列指令操作。FIG4 is a schematic diagram of the structure of a server in some embodiments of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1922 (e.g., one or more processors) and memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the server. Furthermore, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute a series of instruction operations in the storage medium 1930 on the server 1900.

服务器1900还可以包括一个或一个以上电源1926,一个或一个以上有线或无线网络接口1950,一个或一个以上输入输出接口1958,一个或一个以上键盘1956,和/或,一个或一个以上操作系统1941,例如Windows Server™,Mac OS X™,Unix™,Linux™,FreeBSD™等等。The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

一种非临时性计算机可读存储介质,当所述存储介质中的指令由装置(服务器或者终端)的处理器执行时,使得装置能够执行图1所示的数据处理方法。A non-transitory computer-readable storage medium, when instructions in the storage medium are executed by a processor of a device (server or terminal), enables the device to execute the data processing method shown in FIG. 1 .

一种非临时性计算机可读存储介质,当所述存储介质中的指令由装置(服务器或者终端)的处理器执行时,使得装置能够执行一种数据处理方法,所述方法包括:根据多语言声学模型,确定语音信息中语音帧的语言类型;其中,所述多语言声学模型为根据至少两种语言类型的声学数据训练得到;根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果;根据所述第一解码结果,确定所述语音信息对应的识别结果。A non-transitory computer-readable storage medium, when the instructions in the storage medium are executed by a processor of a device (server or terminal), enables the device to execute a data processing method, the method comprising: determining the language type of a speech frame in speech information according to a multilingual acoustic model; wherein the multilingual acoustic model is trained based on acoustic data of at least two language types; decoding the speech frame according to a decoding network corresponding to the language type of the speech frame to obtain a first decoding result of the speech frame; and determining a recognition result corresponding to the speech information according to the first decoding result.

本发明实施例公开了A1、一种数据处理方法,包括:根据多语言声学模型,确定语音信息中语音帧的语言类型;其中,所述多语言声学模型为根据至少两种语言类型的声学数据训练得到;The embodiment of the present invention discloses A1, a data processing method, comprising: determining the language type of a speech frame in speech information according to a multilingual acoustic model; wherein the multilingual acoustic model is trained according to acoustic data of at least two language types;

根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果;Decoding the speech frame according to a decoding network corresponding to the language type of the speech frame to obtain a first decoding result of the speech frame;

根据所述第一解码结果,确定所述语音信息对应的识别结果。A recognition result corresponding to the voice information is determined according to the first decoding result.

A2、根据A1所述的方法,所述根据多语言声学模型,确定语音信息中语音帧的语言类型,包括:A2. The method according to A1, wherein determining the language type of the speech frame in the speech information according to the multilingual acoustic model comprises:

根据多语言声学模型,确定语音帧对应各状态的后验概率;其中,所述状态与语言类型之间具有对应关系;Determine the posterior probability of each state corresponding to the speech frame according to the multilingual acoustic model; wherein there is a corresponding relationship between the state and the language type;

根据所述语音帧对应各状态的后验概率、以及各状态对应的语言类型,确定所述语音帧的后验概率对应各语言类型状态的概率比值;Determine a probability ratio of the posterior probability of the speech frame to each language type state according to the posterior probability of the speech frame corresponding to each state and the language type corresponding to each state;

根据所述概率比值,确定所述语音帧的语言类型。The language type of the speech frame is determined according to the probability ratio.
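The decision rule described in A2 can be illustrated with a short sketch. The data shapes (per-state posteriors as a dictionary) and the rule of summing the posterior mass per language before comparing ratios are assumptions made for illustration only; the claim itself does not fix a concrete formula.

```python
# Illustrative sketch of the per-frame language decision in A2.
# The inputs and the aggregation rule are hypothetical; the patent only
# requires that states map to languages and that a probability ratio decides.

def frame_language_type(state_posteriors, state_to_language):
    """Pick the language type of one speech frame.

    state_posteriors:  dict mapping acoustic-model state id -> posterior prob.
    state_to_language: dict mapping state id -> language label ("zh", "en", ...).
    Returns (chosen language, per-language probability ratios).
    """
    # Sum the posterior mass that falls on the states of each language.
    mass = {}
    for state, p in state_posteriors.items():
        lang = state_to_language[state]
        mass[lang] = mass.get(lang, 0.0) + p
    total = sum(mass.values())
    # Ratio of each language's state mass to the whole; the frame is
    # assigned to the language with the largest ratio.
    ratios = {lang: m / total for lang, m in mass.items()}
    return max(ratios, key=ratios.get), ratios
```

For example, a frame whose posterior mass concentrates on Chinese-tagged states would be labeled "zh".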

A3、根据A1所述的方法,在所述根据多语言声学模型,确定语音信息中语音帧的语言类型之前,所述方法还包括:A3. According to the method described in A1, before determining the language type of the speech frame in the speech information according to the multilingual acoustic model, the method further comprises:

从所述至少两种语言类型中确定目标语言类型;determining a target language type from the at least two language types;

根据所述目标语言类型对应的解码网络,对所述语音信息中的各语音帧进行解码,以得到所述各语音帧的第二解码结果;Decoding each speech frame in the speech information according to a decoding network corresponding to the target language type to obtain a second decoding result of each speech frame;

在所述根据多语言声学模型,确定语音信息中语音帧的语言类型之后,所述方法还包括:After determining the language type of the speech frame in the speech information according to the multilingual acoustic model, the method further includes:

从所述语音信息的语音帧中,确定目标语音帧,以及确定所述目标语音帧的第二解码结果;其中,所述目标语音帧的语言类型为非目标语言类型;Determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; wherein the language type of the target speech frame is a non-target language type;

所述根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果,包括:The decoding network corresponding to the language type of the speech frame decodes the speech frame to obtain a first decoding result of the speech frame, including:

根据所述目标语音帧的语言类型对应的解码网络,对所述目标语音帧进行解码,以得到所述目标语音帧的第一解码结果;Decoding the target speech frame according to a decoding network corresponding to the language type of the target speech frame to obtain a first decoding result of the target speech frame;

所述根据所述第一解码结果,确定所述语音信息对应的识别结果,包括:The determining, according to the first decoding result, a recognition result corresponding to the voice information includes:

将所述目标语音帧的第二解码结果替换为所述目标语音帧对应语言类型的第一解码结果,以及将替换后的第二解码结果,作为所述语音信息对应的识别结果。The second decoding result of the target speech frame is replaced by the first decoding result of the language type corresponding to the target speech frame, and the replaced second decoding result is used as the recognition result corresponding to the speech information.
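Taken together, A3 describes a two-pass scheme: decode the whole utterance with the target-language network, locate runs of frames whose detected language differs, re-decode those runs with their own language's network, and splice the results. A minimal sketch follows; the decoders and the per-frame language labels are hypothetical stand-ins (each decoder is assumed to return a list of (start_frame, end_frame, text) segments).

```python
# Sketch of the two-pass flow in A3 with placeholder decoder interfaces.

def recognize(frames, frame_langs, target_lang, decoders):
    """frames: feature frames; frame_langs: per-frame language labels from
    the multilingual acoustic model; decoders: dict lang -> decode function."""
    # First pass: decode everything with the target-language network
    # (the "second decoding result" in the claim's terminology).
    second_result = decoders[target_lang](frames)

    # Find contiguous runs of frames whose detected language is non-target.
    runs, start = [], None
    for i, lang in enumerate(frame_langs + [target_lang]):  # sentinel closes a tail run
        if lang != target_lang and start is None:
            start = i
        elif lang == target_lang and start is not None:
            runs.append((start, i))
            start = None

    # Second pass: re-decode each foreign run with its own network and
    # splice that "first decoding result" over the matching frame span.
    for s, e in runs:
        lang = frame_langs[s]
        first_result = [(fs + s, fe + s, text)
                        for fs, fe, text in decoders[lang](frames[s:e])]
        second_result = [seg for seg in second_result
                         if seg[1] <= s or seg[0] >= e] + first_result
    return sorted(second_result)
```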

A4、根据A3所述的方法,所述第一解码结果、以及所述第二解码结果包括:对应语音帧的时间边界信息;A4. According to the method described in A3, the first decoding result and the second decoding result include: time boundary information of the corresponding speech frame;

所述将所述目标语音帧的第二解码结果替换为所述目标语音帧对应语言类型的第一解码结果,包括:The step of replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame includes:

从所述目标语音帧的第二解码结果中,确定被替换结果;其中,所述被替换结果与所述目标语音帧对应语言类型的第一解码结果的时间边界相重合;Determine a replaced result from the second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of the first decoding result of the language type corresponding to the target speech frame;

将所述被替换结果替换为所述目标语音帧对应语言类型的第一解码结果。The replaced result is replaced by the first decoding result of the language type corresponding to the target speech frame.
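The boundary-coincidence replacement of A4 can be sketched as follows. The segment representation (start, end, text) and the containment test are illustrative assumptions; the claim only requires that the replaced result's time boundary coincides with that of the first decoding result.

```python
# Sketch of A4: locate the part of the second decoding result whose time
# span coincides with the first decoding result, and swap it out.

def replace_by_boundary(second_segments, first_segments):
    """second_segments / first_segments: lists of (start, end, text)."""
    lo = min(s for s, _, _ in first_segments)
    hi = max(e for _, e, _ in first_segments)
    # Keep every second-pass segment that lies outside [lo, hi); segments
    # whose boundaries fall inside that span are the "replaced result".
    kept = [seg for seg in second_segments
            if not (lo <= seg[0] and seg[1] <= hi)]
    return sorted(kept + first_segments)
```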

A5、根据A1所述的方法,所述解码网络,包括:通用解码网络和专业解码网络;其中,所述通用解码网络中包括:根据通用的文本语料训练得到的语言模型;所述专业解码网络中包括:根据预置领域的文本语料训练得到的语言模型;A5. According to the method described in A1, the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model trained according to a general text corpus; and the professional decoding network comprises: a language model trained according to a text corpus of a preset domain;

所述根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果,包括:The decoding network corresponding to the language type of the speech frame decodes the speech frame to obtain a first decoding result of the speech frame, including:

分别根据所述通用解码网络和所述专业解码网络对所述语音帧进行解码,以得到所述语音帧对应所述通用解码网络的第一得分,以及所述语音帧对应所述专业解码网络的第二得分;Decoding the speech frame according to the universal decoding network and the professional decoding network respectively to obtain a first score of the speech frame corresponding to the universal decoding network and a second score of the speech frame corresponding to the professional decoding network;

将所述第一得分和所述第二得分中得分高的解码结果作为所述语音帧的第一解码结果。A decoding result having a higher score between the first score and the second score is used as a first decoding result of the speech frame.
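A5's selection between the general and the domain-specific decoding network reduces to a score comparison. In this sketch the two decoders are hypothetical callables returning a (score, text) pair; the real networks and their scoring are not specified by the claim.

```python
# Sketch of A5: decode with both the general and the domain (professional)
# decoding network, keep whichever result scored higher.

def best_of_two(frame, general_decode, domain_decode):
    g_score, g_text = general_decode(frame)
    d_score, d_text = domain_decode(frame)
    # Ties go to the general network here; that tie-break is an assumption.
    return (g_text, g_score) if g_score >= d_score else (d_text, d_score)
```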

A6、根据A1所述的方法,所述多语言声学模型的训练步骤包括:A6. According to the method described in A1, the training step of the multilingual acoustic model comprises:

根据收集的至少两种语言类型的声学数据,分别训练各语言类型对应的单语言声学模型;Based on the collected acoustic data of at least two language types, training a single language acoustic model corresponding to each language type respectively;

根据所述单语言声学模型,对所述至少两种语言类型的声学数据分别进行状态标注,其中,所述状态与语言类型之间具有对应关系;According to the single language acoustic model, respectively labeling the acoustic data of the at least two language types with states, wherein there is a corresponding relationship between the states and the language types;

根据标注后的至少两种语言类型的声学数据组成的数据集,训练多语言声学模型。A multilingual acoustic model is trained based on a dataset consisting of labeled acoustic data of at least two language types.
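The A6 training recipe is a three-step pipeline: train one monolingual model per language, use each monolingual model to state-label its own corpus (keeping states language-specific), then train a single multilingual model on the pooled labeled data. In the sketch below, `train_model` and `align` stand in for a real acoustic-model toolkit (both are assumptions); only the control flow of the claim is exercised.

```python
# Sketch of the A6 multilingual acoustic-model training steps.

def train_multilingual(data_by_lang, train_model, align):
    """data_by_lang: dict lang -> list of utterances.
    train_model(data) -> model; align(model, utt) -> list of state ids."""
    # Step 1: one monolingual acoustic model per language.
    mono = {lang: train_model(utts) for lang, utts in data_by_lang.items()}
    # Step 2: state-label each corpus with its own monolingual model;
    # tag every state with its language so states keep a language mapping.
    labeled = []
    for lang, utts in data_by_lang.items():
        for utt in utts:
            states = [(lang, s) for s in align(mono[lang], utt)]
            labeled.append((utt, states))
    # Step 3: one multilingual model over the pooled, labeled data set.
    return train_model(labeled)
```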

A7、根据A1至A6中任一所述的方法,所述至少两种语言类型的声学数据中的每一个数据对应至少两种语言类型;或者,所述至少两种语言类型的声学数据中的每一个数据对应一种语言类型。A7. According to any one of the methods described in A1 to A6, each of the acoustic data of at least two language types corresponds to at least two language types; or, each of the acoustic data of at least two language types corresponds to one language type.

本发明实施例公开了B8、一种数据处理装置,包括:The embodiment of the present invention discloses B8, a data processing device, including:

类型确定模块,用于根据多语言声学模型,确定语音信息中语音帧的语言类型;其中,所述多语言声学模型为根据至少两种语言类型的声学数据训练得到;A type determination module, used to determine the language type of the speech frame in the speech information according to a multilingual acoustic model; wherein the multilingual acoustic model is trained based on acoustic data of at least two language types;

第一解码模块,用于根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果;A first decoding module, configured to decode the speech frame according to a decoding network corresponding to the language type of the speech frame to obtain a first decoding result of the speech frame;

结果确定模块,用于根据所述第一解码结果,确定所述语音信息对应的识别结果。A result determination module is used to determine a recognition result corresponding to the voice information according to the first decoding result.

B9、根据B8所述的装置,所述类型确定模块,包括:B9. According to the apparatus described in B8, the type determination module includes:

概率确定子模块,用于根据多语言声学模型,确定语音帧对应各状态的后验概率;其中,所述状态与语言类型之间具有对应关系;A probability determination submodule, for determining the posterior probability of each state corresponding to the speech frame according to the multilingual acoustic model; wherein the state has a corresponding relationship with the language type;

比值确定子模块,用于根据所述语音帧对应各状态的后验概率、以及各状态对应的语言类型,确定所述语音帧的后验概率对应各语言类型状态的概率比值;A ratio determination submodule, used to determine the probability ratio of the posterior probability of the speech frame corresponding to each language type state according to the posterior probability of the speech frame corresponding to each state and the language type corresponding to each state;

类型确定子模块,用于根据所述概率比值,确定所述语音帧的语言类型。The type determination submodule is used to determine the language type of the speech frame according to the probability ratio.

B10、根据B8所述的装置,所述装置还包括:B10. The device according to B8, further comprising:

目标语言确定模块,用于从所述至少两种语言类型中确定目标语言类型;A target language determination module, used to determine a target language type from the at least two language types;

第二解码模块,用于根据所述目标语言类型对应的解码网络,对所述语音信息中的各语音帧进行解码,以得到所述各语音帧的第二解码结果;A second decoding module, configured to decode each speech frame in the speech information according to a decoding network corresponding to the target language type, so as to obtain a second decoding result of each speech frame;

所述装置还包括:The device also includes:

目标帧确定模块,用于从所述语音信息的语音帧中,确定目标语音帧,以及确定所述目标语音帧的第二解码结果;其中,所述目标语音帧的语言类型为非目标语言类型;A target frame determination module, used to determine a target voice frame from the voice frames of the voice information, and to determine a second decoding result of the target voice frame; wherein the language type of the target voice frame is a non-target language type;

所述第一解码模块,包括:The first decoding module comprises:

第一解码子模块,用于根据所述目标语音帧的语言类型对应的解码网络,对所述目标语音帧进行解码,以得到所述目标语音帧的第一解码结果;A first decoding submodule, configured to decode the target speech frame according to a decoding network corresponding to the language type of the target speech frame to obtain a first decoding result of the target speech frame;

所述结果确定模块,包括:The result determination module comprises:

第一结果确定子模块,用于将所述目标语音帧的第二解码结果替换为所述目标语音帧对应语言类型的第一解码结果,以及将替换后的第二解码结果,作为所述语音信息对应的识别结果。The first result determination submodule is used to replace the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and use the replaced second decoding result as the recognition result corresponding to the speech information.

B11、根据B10所述的装置,所述第一解码结果、以及所述第二解码结果包括:对应语音帧的时间边界信息;B11. According to the apparatus described in B10, the first decoding result and the second decoding result include: time boundary information of the corresponding speech frame;

所述第一结果确定子模块,包括:The first result determination submodule includes:

结果确定单元,用于从所述目标语音帧的第二解码结果中,确定被替换结果;其中,所述被替换结果与所述目标语音帧对应语言类型的第一解码结果的时间边界相重合;A result determination unit, configured to determine a replaced result from the second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of the first decoding result of the target speech frame corresponding to the language type;

替换单元,用于将所述被替换结果替换为所述目标语音帧对应语言类型的第一解码结果。A replacement unit is used to replace the replaced result with a first decoding result of the language type corresponding to the target speech frame.

B12、根据B8所述的装置,所述解码网络,包括:通用解码网络和专业解码网络;其中,所述通用解码网络中包括:根据通用的文本语料训练得到的语言模型;所述专业解码网络中包括:根据预置领域的文本语料训练得到的语言模型;B12. The device according to B8, wherein the decoding network comprises: a general decoding network and a professional decoding network; wherein the general decoding network comprises: a language model trained based on a general text corpus; and the professional decoding network comprises: a language model trained based on a text corpus of a preset domain;

所述第一解码模块,包括:The first decoding module includes:

得分确定子模块,用于分别根据所述通用解码网络和所述专业解码网络对所述语音帧进行解码,以得到所述语音帧对应所述通用解码网络的第一得分,以及所述语音帧对应所述专业解码网络的第二得分;A score determination submodule, used to decode the speech frame according to the general decoding network and the professional decoding network respectively, so as to obtain a first score of the speech frame corresponding to the general decoding network and a second score of the speech frame corresponding to the professional decoding network;

第二结果确定子模块,用于将所述第一得分和所述第二得分中得分高的解码结果作为所述语音帧的第一解码结果。The second result determination submodule is used to use the decoding result with a higher score between the first score and the second score as the first decoding result of the speech frame.

B13、根据B8所述的装置,所述装置还包括:模型训练模块,用于训练所述多语言声学模型;所述模型训练模块,包括:B13. The device according to B8, further comprising: a model training module for training the multilingual acoustic model; the model training module comprising:

第一训练子模块,用于根据收集的至少两种语言类型的声学数据,分别训练各语言类型对应的单语言声学模型;A first training submodule is used to train a single language acoustic model corresponding to each language type according to the collected acoustic data of at least two language types;

状态标注子模块,用于根据所述单语言声学模型,对所述至少两种语言类型的声学数据分别进行状态标注,其中,所述状态与语言类型之间具有对应关系;A state labeling submodule, configured to label the acoustic data of the at least two language types according to the single language acoustic model, wherein there is a corresponding relationship between the state and the language type;

第二训练子模块,用于根据标注后的至少两种语言类型的声学数据组成的数据集,训练多语言声学模型。The second training submodule is used to train a multilingual acoustic model based on a data set consisting of labeled acoustic data of at least two language types.

B14、根据B8至B13中任一所述的装置,所述至少两种语言类型的声学数据中的每一个数据对应至少两种语言类型;或者,所述至少两种语言类型的声学数据中的每一个数据对应一种语言类型。B14. According to the device described in any one of B8 to B13, each of the acoustic data of at least two language types corresponds to at least two language types; or, each of the acoustic data of at least two language types corresponds to one language type.

本发明实施例公开了C15、一种用于数据处理的装置,其特征在于,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行以下操作的指令:The embodiment of the present invention discloses C15, a device for data processing, characterized in that it includes a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors. The one or more programs include instructions for performing the following operations:

根据多语言声学模型,确定语音信息中语音帧的语言类型;其中,所述多语言声学模型为根据至少两种语言类型的声学数据训练得到;Determining the language type of the speech frame in the speech information according to the multilingual acoustic model; wherein the multilingual acoustic model is trained based on acoustic data of at least two language types;

根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果;Decoding the speech frame according to a decoding network corresponding to the language type of the speech frame to obtain a first decoding result of the speech frame;

根据所述第一解码结果,确定所述语音信息对应的识别结果。A recognition result corresponding to the voice information is determined according to the first decoding result.

C16、根据C15所述的装置,所述根据多语言声学模型,确定语音信息中语音帧的语言类型,包括:C16. The device according to C15, wherein determining the language type of the speech frame in the speech information according to the multilingual acoustic model comprises:

根据多语言声学模型,确定语音帧对应各状态的后验概率;其中,所述状态与语言类型之间具有对应关系;Determine the posterior probability of each state corresponding to the speech frame according to the multilingual acoustic model; wherein there is a corresponding relationship between the state and the language type;

根据所述语音帧对应各状态的后验概率、以及各状态对应的语言类型,确定所述语音帧的后验概率对应各语言类型状态的概率比值;Determine a probability ratio of the posterior probability of the speech frame to each language type state according to the posterior probability of the speech frame corresponding to each state and the language type corresponding to each state;

根据所述概率比值,确定所述语音帧的语言类型。The language type of the speech frame is determined according to the probability ratio.

C17、根据C15所述的装置,所述装置还经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行以下操作的指令:C17. The device according to C15, wherein the device is further configured to execute, by one or more processors, the one or more programs including instructions for performing the following operations:

从所述至少两种语言类型中确定目标语言类型;determining a target language type from the at least two language types;

根据所述目标语言类型对应的解码网络,对所述语音信息中的各语音帧进行解码,以得到所述各语音帧的第二解码结果;Decoding each speech frame in the speech information according to a decoding network corresponding to the target language type to obtain a second decoding result of each speech frame;

所述装置还经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行以下操作的指令:The device is also configured to execute, by one or more processors, the one or more programs including instructions for performing the following operations:

从所述语音信息的语音帧中,确定目标语音帧,以及确定所述目标语音帧的第二解码结果;其中,所述目标语音帧的语言类型为非目标语言类型;Determining a target speech frame from the speech frames of the speech information, and determining a second decoding result of the target speech frame; wherein the language type of the target speech frame is a non-target language type;

所述根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果,包括:The decoding network corresponding to the language type of the speech frame decodes the speech frame to obtain a first decoding result of the speech frame, including:

根据所述目标语音帧的语言类型对应的解码网络,对所述目标语音帧进行解码,以得到所述目标语音帧的第一解码结果;Decoding the target speech frame according to a decoding network corresponding to the language type of the target speech frame to obtain a first decoding result of the target speech frame;

所述根据所述第一解码结果,确定所述语音信息对应的识别结果,包括:The determining, according to the first decoding result, a recognition result corresponding to the voice information includes:

将所述目标语音帧的第二解码结果替换为所述目标语音帧对应语言类型的第一解码结果,以及将替换后的第二解码结果,作为所述语音信息对应的识别结果。The second decoding result of the target speech frame is replaced by the first decoding result of the language type corresponding to the target speech frame, and the replaced second decoding result is used as the recognition result corresponding to the speech information.

C18、根据C17所述的装置,所述第一解码结果、以及所述第二解码结果包括:对应语音帧的时间边界信息;C18. According to the apparatus of C17, the first decoding result and the second decoding result include: time boundary information of the corresponding speech frame;

所述将所述目标语音帧的第二解码结果替换为所述目标语音帧对应语言类型的第一解码结果,包括:The step of replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame includes:

从所述目标语音帧的第二解码结果中,确定被替换结果;其中,所述被替换结果与所述目标语音帧对应语言类型的第一解码结果的时间边界相重合;Determine a replaced result from the second decoding result of the target speech frame; wherein the replaced result coincides with a time boundary of the first decoding result of the language type corresponding to the target speech frame;

将所述被替换结果替换为所述目标语音帧对应语言类型的第一解码结果。The replaced result is replaced by the first decoding result of the language type corresponding to the target speech frame.

C19、根据C15所述的装置,所述解码网络,包括:通用解码网络和专业解码网络;其中,所述通用解码网络中包括:根据通用的文本语料训练得到的语言模型;所述专业解码网络中包括:根据预置领域的文本语料训练得到的语言模型;C19. According to the device described in C15, the decoding network includes: a general decoding network and a professional decoding network; wherein the general decoding network includes: a language model trained according to a general text corpus; and the professional decoding network includes: a language model trained according to a text corpus of a preset domain;

所述根据所述语音帧的语言类型对应的解码网络,对所述语音帧进行解码,以得到所述语音帧的第一解码结果,包括:The decoding network corresponding to the language type of the speech frame decodes the speech frame to obtain a first decoding result of the speech frame, including:

分别根据所述通用解码网络和所述专业解码网络对所述语音帧进行解码,以得到所述语音帧对应所述通用解码网络的第一得分,以及所述语音帧对应所述专业解码网络的第二得分;Decoding the speech frame according to the universal decoding network and the professional decoding network respectively to obtain a first score of the speech frame corresponding to the universal decoding network and a second score of the speech frame corresponding to the professional decoding network;

将所述第一得分和所述第二得分中得分高的解码结果作为所述语音帧的第一解码结果。A decoding result having a higher score between the first score and the second score is used as a first decoding result of the speech frame.

C20、根据C15所述的装置,所述多语言声学模型的训练步骤包括:C20. According to the apparatus described in C15, the training step of the multilingual acoustic model includes:

根据收集的至少两种语言类型的声学数据,分别训练各语言类型对应的单语言声学模型;Based on the collected acoustic data of at least two language types, training a single language acoustic model corresponding to each language type respectively;

根据所述单语言声学模型,对所述至少两种语言类型的声学数据分别进行状态标注,其中,所述状态与语言类型之间具有对应关系;According to the single language acoustic model, respectively labeling the acoustic data of the at least two language types with states, wherein there is a corresponding relationship between the states and the language types;

根据标注后的至少两种语言类型的声学数据组成的数据集,训练多语言声学模型。A multilingual acoustic model is trained based on a dataset consisting of labeled acoustic data of at least two language types.

C21、根据C15至C20中任一所述的装置,所述至少两种语言类型的声学数据中的每一个数据对应至少两种语言类型;或者,所述至少两种语言类型的声学数据中的每一个数据对应一种语言类型。C21. According to the device described in any one of C15 to C20, each of the acoustic data of at least two language types corresponds to at least two language types; or, each of the acoustic data of at least two language types corresponds to one language type.

本发明实施例公开了D22、一种机器可读介质,其上存储有指令,当由一个或多个处理器执行时,使得装置执行如A1至A7中一个或多个所述的数据处理方法。An embodiment of the present invention discloses D22, a machine-readable medium having instructions stored thereon, which, when executed by one or more processors, enables the device to execute the data processing method as described in one or more of A1 to A7.

本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本发明的其它实施方案。本发明旨在涵盖本发明的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本发明的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本发明的真正范围和精神由下面的权利要求指出。Those skilled in the art will readily appreciate other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses or adaptations of the present invention that follow the general principles of the present invention and include common knowledge or customary techniques in the art that are not disclosed in this disclosure. The description and examples are to be considered exemplary only, and the true scope and spirit of the present invention are indicated by the following claims.

应当理解的是,本发明并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本发明的范围仅由所附的权利要求来限制。It should be understood that the present invention is not limited to the exact construction that has been described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present invention is limited only by the appended claims.

以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

以上对本发明所提供的一种数据处理方法、一种数据处理装置和一种用于数据处理的装置,进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。The data processing method, the data processing device and the device for data processing provided by the present invention are introduced in detail above. The principle and implementation mode of the present invention are explained by using specific examples in this article. The description of the above embodiments is only used to help understand the method of the present invention and its core idea. At the same time, for those skilled in the art, according to the idea of the present invention, there will be changes in the specific implementation mode and the scope of application. In summary, the content of this specification should not be understood as limiting the present invention.

Claims (19)

CN201811603538.6A | Priority/Filing Date: 2018-12-26 | A data processing method, a data processing device and a data processing device | Active | CN111369978B (en)

Priority Applications (1)

Application Number: CN201811603538.6A | Priority Date: 2018-12-26 | Filing Date: 2018-12-26 | Title: A data processing method, a data processing device and a data processing device

Publications (2)

Publication Number | Publication Date
CN111369978A (en) | 2020-07-03
CN111369978B (en) | 2024-05-17


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111883113B (en) * | 2020-07-30 | 2024-01-30 | 云知声智能科技股份有限公司 | Voice recognition method and device
CN112185348B (en) * | 2020-10-19 | 2024-05-03 | 平安科技(深圳)有限公司 | Multilingual voice recognition method and device and electronic equipment
CN112652311B (en) * | 2020-12-01 | 2021-09-03 | 北京百度网讯科技有限公司 | Chinese and English mixed speech recognition method and device, electronic equipment and storage medium
CN113593531B (en) * | 2021-07-30 | 2024-05-03 | 思必驰科技股份有限公司 | Speech recognition model training method and system
CN113990351B (en) * | 2021-11-01 | 2025-08-08 | 苏州声通信息科技有限公司 | Sound correction method, sound correction device and non-transient storage medium
CN116364097A (en) * | 2021-12-27 | 2023-06-30 | 中国移动通信有限公司研究院 | Data processing method and device, equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20070130112A1 (en) * | 2005-06-30 | 2007-06-07 | Intelligentek Corp. | Multimedia conceptual search system and associated search method
US20170011735A1 (en) * | 2015-07-10 | 2017-01-12 | Electronics and Telecommunications Research Institute | Speech recognition system and method
US10180935B2 (en) * | 2016-12-30 | 2019-01-15 | Facebook, Inc. | Identifying multiple languages in a content item

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1714390A (en) * | 2002-11-22 | 2005-12-28 | Koninklijke Philips Electronics N.V. | Speech recognition device and method
WO2004049308A1 (en) * | 2002-11-22 | 2004-06-10 | Koninklijke Philips Electronics N.V. | Speech recognition device and method
CN101201892A (en) * | 2005-12-20 | 2008-06-18 | 游旭 | Voice coding talking book and pickup main body
EP2058799A1 (en) * | 2007-11-02 | 2009-05-13 | Harman/Becker Automotive Systems GmbH | Method for preparing data for speech recognition and speech recognition system
CN101604522A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | Embedded speaker-independent mixed Chinese and English speech recognition method and system
CN104143328A (en) * | 2013-08-15 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords
CN104143329A (en) * | 2013-08-19 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for conducting voice keyword search
CN104681036A (en) * | 2014-11-20 | 2015-06-03 | 苏州驰声信息科技有限公司 | System and method for detecting language of audio
KR20170007107A (en) * | 2015-07-10 | 2017-01-18 | 한국전자통신연구원 | Speech recognition system and method
EP3133595A1 (en) * | 2015-08-20 | 2017-02-22 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method
CN107408111A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | End-to-end speech recognition
CN105513589A (en) * | 2015-12-18 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Speech recognition method and speech recognition device
WO2018153213A1 (en) * | 2017-02-24 | 2018-08-30 | 芋头科技(杭州)有限公司 | Multi-language hybrid speech recognition method
CN108630192A (en) * | 2017-03-16 | 2018-10-09 | 清华大学 | Non-Mandarin speech recognition method and system, and construction method thereof
CN108711420A (en) * | 2017-04-10 | 2018-10-26 | 北京猎户星空科技有限公司 | Multilingual hybrid model establishment and data acquisition method and device, and electronic equipment
CN107195295A (en) * | 2017-05-04 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device based on a mixed Chinese-English dictionary
CN107403620A (en) * | 2017-08-16 | 2017-11-28 | 广东海翔教育科技有限公司 | Speech recognition method and device
CN107526826A (en) * | 2017-08-31 | 2017-12-29 | 百度在线网络技术(北京)有限公司 | Voice search processing method, device and server
CN108986791A (en) * | 2018-08-10 | 2018-12-11 | 南京航空航天大学 | Chinese and English bilingual speech recognition method and system for the civil aviation air-ground communication field

Also Published As

Publication number | Publication date
CN111369978A (en) | 2020-07-03

Similar Documents

Publication | Title
CN111369978B (en) | A data processing method, a data processing device and a data processing device
CN111145756B (en) | Voice recognition method and device for voice recognition
CN107632980B (en) | Voice translation method and device for voice translation
CN107291690B (en) | Punctuation adding method and device and punctuation adding device
CN107221330B (en) | Punctuation adding method and device and punctuation adding device
CN110210310B (en) | Video processing method and device for video processing
CN111368541B (en) | Named entity identification method and device
CN111128183B (en) | Speech recognition method, apparatus and medium
CN107564526B (en) | Processing method, apparatus and machine-readable medium
CN107291704B (en) | Processing method and device for processing
CN108628813B (en) | Processing method and device for processing
CN110069624B (en) | Text processing method and device
CN107274903B (en) | Text processing method and device for text processing
CN108073572B (en) | Information processing method and device, simultaneous interpretation system
CN108628819B (en) | Processing method and device for processing
CN111160047A (en) | Data processing method and device and data processing device
CN111640452B (en) | Data processing method and device for data processing
CN107424612B (en) | Processing method, apparatus and machine-readable medium
CN111090998A (en) | A sign language conversion method, device and device for sign language conversion
CN111381685B (en) | A sentence association method and device
CN109887492B (en) | Data processing method and device and electronic equipment
CN113889105B (en) | A speech translation method, device and device for speech translation
CN113589954B (en) | Data processing method and device and electronic equipment
CN111832297B (en) | Part-of-speech tagging method, device and computer-readable storage medium
CN113053364B (en) | A speech recognition method, device and device for speech recognition

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TG01 | Patent term adjustment
