Technical Field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method, apparatus, system, server, and storage medium.
Background Art
Video conferencing has become a common way for people to communicate at work and in daily life. To record the content of a meeting, each participant's speech needs to be captured and recognized to obtain the corresponding text information. However, it is inevitable that multiple participants will sometimes speak at the same time, in which case the content spoken by each simultaneous speaker must be recognized.
In current speech recognition approaches, after the speech signal produced by multiple users speaking simultaneously is acquired, it is first separated into the speech of each individual user, and speech recognition is then performed on each user's separated speech to obtain what each user said.
Because speech separation damages the frequency spectrum of the speech signal, the accuracy of the subsequent speech recognition is low.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a speech recognition method, apparatus, system, server, and storage medium that improve the accuracy of speech recognition. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides a speech recognition method, the method comprising:
acquiring speech images and speech signals of multiple speakers in a conference, as well as voiceprint information of each speaker, wherein the speech signals include speech signals produced by the multiple speakers speaking simultaneously;
recognizing the speech images to determine orientation information and lip movement information of each speaker;
for each speaker, inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into a pre-trained speech recognition model to obtain text information corresponding to the speaker, wherein the speech recognition model is trained on multi-user speech samples, each of which includes the lip movement information, voiceprint information, and orientation information of each user, together with the speech signal produced by multiple users speaking simultaneously.
Optionally, the speech signal is collected by a microphone array comprising a plurality of array elements;
the step of inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model to obtain the text information corresponding to the speaker comprises:
inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model, so that the speech recognition model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements, and performs speech recognition on the speech features combined with the lip movement information to obtain the text information corresponding to the speaker.
Optionally, the speech recognition model comprises a residual layer, a first concatenation layer, a convolution layer, a second concatenation layer, and a recognition layer;
the step in which the speech recognition model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements, and performs speech recognition on the speech features combined with the lip movement information to obtain the text information corresponding to the speaker, comprises:
the residual layer performs feature extraction on the lip movement information to obtain lip features, which are input into the second concatenation layer;
the first concatenation layer concatenates the speech signal, the orientation information, and the voiceprint information, and inputs the concatenated result into the convolution layer;
the convolution layer extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements, and inputs the speech features into the second concatenation layer;
the second concatenation layer concatenates the speech features with the lip features and inputs the concatenated features into the recognition layer;
the recognition layer performs speech recognition based on the concatenated features, obtains the text information corresponding to the speaker, and outputs the text information.
Optionally, before the step of acquiring the images and speech signals of the multiple speakers and the voiceprint information of each speaker, the method further comprises:
acquiring a conference image of the conference and performing lip movement detection on the conference image to determine a target speaker who is currently speaking;
determining identity information of the target speaker based on a pre-established face database;
acquiring a speech signal of the target speaker and extracting voiceprint information from the speech signal;
recording the voiceprint information in correspondence with the identity information.
Optionally, the step of recognizing the speech images to determine the orientation information of each speaker comprises:
recognizing the speech images to determine facial pixels of each speaker;
for each speaker, determining angle information of the speaker relative to a voice acquisition device based on the position of the speaker's facial pixels in the speech image, pre-calibrated parameters of the image acquisition device that captured the speech image, and the position of the voice acquisition device, and using the angle information as the speaker's orientation information.
Optionally, the training of the speech recognition model comprises:
acquiring the multi-user speech samples and an initial model;
using the text information corresponding to each user included in each multi-user speech sample as sample labels;
inputting each multi-user speech sample into the initial model to obtain predicted text information;
adjusting model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user speech sample and the sample labels until the initial model converges, thereby obtaining the speech recognition model.
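By way of illustration, a minimal training-loop sketch of this procedure might look as follows, assuming PyTorch; the model interface, the sample fields, and the choice of CTC loss for the end-to-end recognition layer are illustrative assumptions rather than details fixed by the embodiments:

```python
# Hypothetical training-loop sketch for the procedure above; the model class,
# sample fields, and the CTC loss choice are illustrative assumptions.
import torch
import torch.nn as nn

def train_speech_model(initial_model, samples, epochs=10, lr=1e-4):
    """samples: iterable of dicts holding per-user lip/voiceprint/orientation
    tensors, the mixed multi-user waveform, and target token ids (labels)."""
    optimizer = torch.optim.Adam(initial_model.parameters(), lr=lr)
    criterion = nn.CTCLoss(blank=0)  # one plausible loss for end-to-end ASR
    for _ in range(epochs):
        for s in samples:
            logits = initial_model(s["lip"], s["voiceprint"],
                                   s["orientation"], s["mixed_audio"])
            log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, N, C)
            loss = criterion(log_probs, s["labels"],
                             s["input_lengths"], s["label_lengths"])
            optimizer.zero_grad()
            loss.backward()  # adjust parameters from prediction/label difference
            optimizer.step()
    return initial_model  # returned once training converges
```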
Optionally, the method further comprises:
generating a meeting record based on the text information corresponding to each speaker.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, the apparatus comprising:
a first acquisition module, configured to acquire speech images and speech signals of multiple speakers in a conference, as well as voiceprint information of each speaker, wherein the speech signals include speech signals produced by the multiple speakers speaking simultaneously;
a first determination module, configured to recognize the speech images to determine orientation information and lip movement information of each speaker;
a recognition module, configured to, for each speaker, input the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into a pre-trained speech recognition model to obtain text information corresponding to the speaker, wherein the speech recognition model is trained on multi-user speech samples, each of which includes the lip movement information, voiceprint information, and orientation information of each user, together with the speech signal produced by multiple users speaking simultaneously.
Optionally, the speech signal is collected by a microphone array comprising a plurality of array elements;
the recognition module comprises:
a first recognition unit, configured to input the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model, so that the speech recognition model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements, and performs speech recognition on the speech features combined with the lip movement information to obtain the text information corresponding to the speaker.
Optionally, the speech recognition model comprises a residual layer, a first concatenation layer, a convolution layer, a second concatenation layer, and a recognition layer;
the first recognition unit comprises:
a first extraction subunit, configured for the residual layer to perform feature extraction on the lip movement information to obtain lip features and input them into the second concatenation layer;
a first concatenation subunit, configured for the first concatenation layer to concatenate the speech signal, the orientation information, and the voiceprint information and input the concatenated result into the convolution layer;
a second extraction subunit, configured for the convolution layer to extract the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements, and input the speech features into the second concatenation layer;
a second concatenation subunit, configured for the second concatenation layer to concatenate the speech features with the lip features and input the concatenated features into the recognition layer;
a recognition subunit, configured for the recognition layer to perform speech recognition based on the concatenated features, obtain the text information corresponding to the speaker, and output the text information.
Optionally, the apparatus further comprises:
a second acquisition module, configured to acquire a conference image of the conference and perform lip movement detection on the conference image to determine a target speaker who is currently speaking;
a second determination module, configured to determine identity information of the target speaker based on a pre-established face database;
a third acquisition module, configured to acquire a speech signal of the target speaker and extract voiceprint information from the speech signal;
a recording module, configured to record the voiceprint information in correspondence with the identity information.
Optionally, the first determination module comprises:
a second recognition unit, configured to recognize the speech images to determine facial pixels of each speaker;
a determination unit, configured to, for each speaker, determine angle information of the speaker relative to a voice acquisition device based on the position of the speaker's facial pixels in the speech image, pre-calibrated parameters of the image acquisition device that captured the speech image, and the position of the voice acquisition device, and use the angle information as the speaker's orientation information.
Optionally, the speech recognition model is pre-trained by a model training module, the model training module comprising:
a sample acquisition unit, configured to acquire the multi-user speech samples and an initial model;
a label determination unit, configured to use the text information corresponding to each user included in each multi-user speech sample as sample labels;
a text prediction unit, configured to input each multi-user speech sample into the initial model to obtain predicted text information;
a parameter adjustment unit, configured to adjust model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user speech sample and the sample labels until the initial model converges, thereby obtaining the speech recognition model.
Optionally, the apparatus further comprises:
a generation module, configured to generate a meeting record based on the text information corresponding to each speaker.
In a third aspect, an embodiment of the present invention provides a speech recognition system comprising a server and a terminal, the terminal being provided with an image acquisition device and a voice acquisition device, wherein:
the image acquisition device is configured to capture images during a conference;
the voice acquisition device is configured to collect speech signals during the conference;
the terminal is configured to send the images and the speech signals to the server;
the server is configured to receive the images and the speech signals and execute the method steps of any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a server comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of the first aspect when executing the program stored in the memory.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps of any one of the first aspect.
Beneficial effects of the embodiments of the present invention:
In the solution provided by the embodiments of the present invention, the server can acquire speech images and speech signals of multiple speakers in a conference, as well as voiceprint information of each speaker, wherein the speech signals include speech signals produced by the multiple speakers speaking simultaneously; recognize the speech images to determine orientation information and lip movement information of each speaker; and, for each speaker, input the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into a pre-trained speech recognition model to obtain text information corresponding to the speaker, wherein the speech recognition model is trained on multi-user speech samples that include each user's lip movement information, voiceprint information, and orientation information, together with the speech signal produced by multiple users speaking simultaneously. With this solution, the server inputs the acquired speech images and speech signals of the multiple speakers and the voiceprint information of each speaker into the speech recognition model. Because the speech signals of the multiple speakers do not need to be separated by speaker, the frequency spectrum of each speaker's speech signal remains intact, which improves the accuracy of speech recognition. Of course, any product or method implementing the present invention does not necessarily need to achieve all of the advantages described above at the same time.
Brief Description of the Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other embodiments from these drawings.
FIG. 1 is a schematic diagram of an implementation scenario to which a speech recognition method provided by an embodiment of the present invention is applied;
FIG. 2 is a flowchart of a speech recognition method provided by an embodiment of the present invention;
FIG. 3 is a flowchart of recognition performed by the speech recognition model provided by an embodiment of the present invention;
FIG. 4 is a flowchart of another speech recognition method provided by an embodiment of the present invention;
FIG. 5 is a specific flowchart of step S202 in the embodiment shown in FIG. 2;
FIG. 6 is a flowchart of training the speech recognition model provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another speech recognition apparatus provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another speech recognition apparatus provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a speech recognition system provided by an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a server provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art fall within the protection scope of the present invention.
To improve the accuracy of speech recognition, the embodiments of the present invention provide a speech recognition method, apparatus, system, server, computer-readable storage medium, and computer program product. To facilitate understanding of the speech recognition method provided by the embodiments of the present invention, an implementation scenario to which the method can be applied is introduced first.
FIG. 1 is a schematic diagram of an implementation scenario to which the speech recognition method provided by an embodiment of the present invention is applied. Multiple participants take part in a video conference; the participants may include participant 1, participant 2, participant 3, participant 4, participant 5, participant 6, and participant 7. The server 130 and the terminal 140 are communicatively connected for data transmission. The terminal 140 may be an electronic device with a display screen, such as a conference tablet or an all-in-one touch device. The terminal 140 may also be provided with a voice acquisition device 110 and an image acquisition device 120: the voice acquisition device 110 collects the speech signals produced when participants speak during the conference, the image acquisition device 120 captures images of the participants during the conference, and the display screen can present conference-related information.
The voice acquisition device 110 may be a microphone array, which may be a linear array, a triangular array, a T-shaped array, a uniform circular array, or the like; FIG. 1 takes a linear array as an example. The image acquisition device 120 may be any device capable of capturing images, such as a camera, and is not specifically limited here.
After the conference ends, the terminal 140 can send the speech signals collected by the voice acquisition device 110 and the images captured by the image acquisition device 120 during the conference to the server 130, so that the server 130 obtains a conference video including the speech images and speech signals of multiple speakers. Since multiple speakers may speak simultaneously during a conference, and current methods for recognizing simultaneous speech are not accurate enough, the server 130 can apply the speech recognition method provided by the embodiments of the present invention to recognize the speech signals produced by multiple speakers speaking at the same time. The speech recognition method provided by an embodiment of the present invention is introduced below.
As shown in FIG. 2, a speech recognition method comprises:
S201: acquiring speech images and speech signals of multiple speakers in a conference, as well as voiceprint information of each speaker;
wherein the speech signals include speech signals produced by the multiple speakers speaking simultaneously.
S202: recognizing the speech images to determine orientation information and lip movement information of each speaker;
S203: for each speaker, inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into a pre-trained speech recognition model to obtain text information corresponding to the speaker.
The speech recognition model is trained on multi-user speech samples, which include the lip movement information, voiceprint information, and orientation information of each user, together with the speech signal produced by multiple users speaking simultaneously.
It can be seen that, in the solution provided by the embodiments of the present invention, the server can acquire speech images and speech signals of multiple speakers in a conference, as well as voiceprint information of each speaker, wherein the speech signals include speech signals produced by the multiple speakers speaking simultaneously; recognize the speech images to determine orientation information and lip movement information of each speaker; and, for each speaker, input the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into a pre-trained speech recognition model to obtain the text information corresponding to the speaker, wherein the speech recognition model is trained on multi-user speech samples that include each user's lip movement information, voiceprint information, and orientation information, together with the speech signal produced by multiple users speaking simultaneously. With this solution, the server inputs the acquired speech images and speech signals of the multiple speakers and the voiceprint information of each speaker into the speech recognition model. Because the speech signals of the multiple speakers do not need to be separated by speaker, the frequency spectrum of each speaker's speech signal remains intact, which improves the accuracy of speech recognition.
The conference video sent by the terminal to the server may include video corresponding to various speaking situations in the conference: a single speaker speaking, or multiple speakers speaking simultaneously.
When multiple speakers speak simultaneously, the server can obtain the speech images and speech signals of the multiple speakers, as well as the voiceprint information of each speaker, from the conference video. The speech images of the multiple speakers may be multiple frames captured by the image acquisition device that characterize the speakers' lip movements; they may be images that include all participants, or separate images of each participant, which is not specifically limited here.
In one implementation, the server can recognize the conference images in the conference video, determine the lip image features of the speakers in each conference image, and determine the number of speakers at the current moment from the motion of those lip image features. When there are multiple speakers, the server can treat the conference image as a speech image of the multiple speakers and obtain the speech signal collected at the moment corresponding to the speech image, together with the voiceprint information of each speaker.
If there is only one speaker, the server can obtain the speech signal collected at the moment corresponding to the speech image, which is the speech signal produced by that single speaker, and then apply a speech recognition algorithm to the speech signal to obtain the corresponding text information.
The speech signal of the multiple speakers above is the signal collected by the voice acquisition device during the period in which multiple speakers in the conference video speak simultaneously; it is a single signal formed by the mixture of the speech produced by the individual speakers.
For example, while recognizing the conference images, the server extracts the lip image features of the speakers in conference image 1 and determines from them that speaker A and speaker B are speaking simultaneously at the time point corresponding to conference image 1. The server continues recognizing the conference images in sequence until, at conference image 20, the lip image features indicate that only speaker A is speaking. The period between the time points of conference image 1 and conference image 20 is then the period during which speaker A and speaker B spoke simultaneously, and the speech signal of that period is the signal produced by the two of them speaking at the same time. This signal can be recognized using the method provided by the embodiments of the present invention. A sketch of this frame-by-frame logic follows.
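The per-frame logic of this example might be sketched as follows; the frame rate and the active_speakers() helper, which stands in for the lip-motion detector, are illustrative assumptions:

```python
# Hypothetical sketch of locating overlapped-speech segments from per-frame
# lip activity; active_speakers() stands in for the lip-motion detector.
def find_overlap_segments(frames, fps=25):
    """frames: list of conference images in time order.
    Returns (start_sec, end_sec, speaker_ids) for each overlapped segment."""
    segments, start, current = [], None, frozenset()
    for i, frame in enumerate(frames):
        speakers = frozenset(active_speakers(frame))  # ids with moving lips
        if start is None:
            if len(speakers) > 1:
                start, current = i, speakers            # overlap begins
        elif speakers != current:
            segments.append((start / fps, i / fps, current))  # overlap ends
            if len(speakers) > 1:
                start, current = i, speakers            # a new overlap begins
            else:
                start, current = None, frozenset()
    if start is not None:                               # overlap ran to the end
        segments.append((start / fps, len(frames) / fps, current))
    return segments
```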
The voiceprint information above is information that characterizes the spectral features of a speaker's voice. To make it easy to obtain, the server can capture and store each speaker's voiceprint information the first time that speaker speaks alone during the conference. Later, when the speech signals produced by multiple simultaneous speakers need to be recognized, the server determines each speaker's identity information from the pre-established face database and the speech images, and retrieves that speaker's voiceprint information based on the identity information.
After acquiring the speech images and speech signals of the multiple speakers in the conference and the voiceprint information of each speaker, the server can execute step S202: recognizing the speech images to determine the orientation information and lip movement information of each speaker.
The server can recognize the speech images and extract the lip image features of each speaker. For each speaker, it is reasonable either to take any single lip image feature as the speaker's position in the speech image, or to compute the average of the speaker's lip image features and take the point corresponding to that average as the speaker's position in the speech image.
Having determined the speaker's position in the speech image, the server can determine the speaker's actual position in the conference scene from the pre-calibrated extrinsic and intrinsic parameters of the image acquisition device, and then, from the position of the voice acquisition device, compute the relative position between the speaker and the voice acquisition device, thereby determining the speaker's orientation information.
In one implementation, the image acquisition device is a camera and the voice acquisition device is a microphone array. A coordinate system is established with the position of the microphone array in the conference scene as the origin, with the X axis and Y axis forming the horizontal plane. The server can extract the lip image features of each speaker in speech image 1, take lip image feature A as speaker A's position in that frame of the speech image, and, from the camera's intrinsic and extrinsic parameters, compute speaker A's three-dimensional coordinates (x, y, z) in this microphone-array-centered coordinate system. The angle whose tangent is x/y, i.e., arctan(x/y), then gives the speaker's orientation information.
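A minimal sketch of this angle computation follows, assuming the back-projection from pixel position to array-centered coordinates is available as a hypothetical helper camera_to_array_coords():

```python
# Minimal sketch of the azimuth computation above; camera_to_array_coords()
# is a hypothetical stand-in for the intrinsic/extrinsic back-projection.
import math

def speaker_azimuth(lip_pixel, camera_params, array_origin):
    """Returns the speaker's azimuth (degrees) relative to the microphone array."""
    x, y, z = camera_to_array_coords(lip_pixel, camera_params, array_origin)
    # atan2 handles y = 0 and preserves the sign of the angle, generalizing
    # the arctan(x / y) used in the text.
    return math.degrees(math.atan2(x, y))
```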
The server can recognize multiple frames of speech images in which multiple speakers speak simultaneously, extract the lip image features of each speaker from them, and take the changes of a speaker's lip image features across the frames as that speaker's lip movement information.
As one implementation, since the server may already have determined the lip image features of each speaker in the speech images when counting the number of simultaneous speakers, it can in that case reuse those lip image features as the corresponding speakers' lip movement information without recognizing the speech images again. A sketch of assembling such lip movement information follows.
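As an illustration of the two paragraphs above, lip movement information might be assembled from frame-wise lip image features as follows; detect_lip_features() is a hypothetical stand-in for the image recognition step:

```python
# Hypothetical sketch of building lip movement information from frame-wise
# lip features; detect_lip_features() stands in for the image recognizer.
import numpy as np

def lip_movement_info(frames, speaker_id):
    """Stacks per-frame lip features and their frame-to-frame changes."""
    feats = np.stack([detect_lip_features(f, speaker_id) for f in frames])
    deltas = np.diff(feats, axis=0)  # change across frames = lip motion
    return feats, deltas
```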
Next, the server can execute step S203: for each speaker, inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model to obtain the text information corresponding to the speaker.
Here, the speech signal is the signal produced by multiple speakers speaking simultaneously; that is, the server feeds the speaker's lip movement information, voiceprint information, and orientation information, together with that mixed signal, into the pre-trained speech recognition model to obtain the speaker's text information, rather than separating the mixed signal into multiple per-speaker signals.
The speech recognition model is trained in advance on multi-user speech samples, which may include the lip movement information, voiceprint information, and orientation information of each user, together with the speech signal produced by multiple users speaking simultaneously; that is, the model is trained on exactly these inputs.
During training, because the speech signal produced by multiple users speaking simultaneously is a mixture of the users' individual speech and is not separated by user, the speech recognition model can learn the correspondence between each user's lip movement information, voiceprint information, and orientation information plus the mixed multi-user speech signal on the one hand, and the text information corresponding to that user on the other. When the model is then used, it can process the input lip movement information, voiceprint information, and orientation information of each speaker together with the mixed speech signal, and produce the text information corresponding to that speaker.
When multiple speakers speak simultaneously, the server can perform speech recognition speaker by speaker: it traverses the simultaneous speakers and, for each one, inputs that speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the speech recognition model. In this way, the text information corresponding to each speaker is obtained separately, completing the recognition of simultaneous speech.
For example, if the server determines that speaker A, speaker B, and speaker C speak simultaneously from 2 min 5 s to 5 min 10 s, it can obtain the lip movement information, voiceprint information, and orientation information of speakers A, B, and C, as well as the speech signal a produced by the three of them speaking simultaneously, and then traverse each speaker.
Specifically, for speaker A, the server inputs speaker A's lip movement information, voiceprint information, orientation information, and speech signal a into the speech recognition model to obtain the text information corresponding to speaker A. It then traverses speaker B, inputting speaker B's lip movement information, voiceprint information, orientation information, and speech signal a to obtain speaker B's text information, and finally traverses speaker C in the same way to obtain speaker C's text information.
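This traversal might be sketched as follows, where model stands for the trained speech recognition model and the per-speaker records are assumed inputs:

```python
# Illustrative sketch of the per-speaker traversal; `model` is the trained
# speech recognition model and the speaker records are assumed inputs.
def recognize_overlap(model, speakers, mixed_signal):
    """speakers: list of dicts with 'id', 'lip', 'voiceprint', 'orientation'.
    Returns a mapping from speaker id to recognized text."""
    transcripts = {}
    for s in speakers:  # same mixed signal, different conditioning per speaker
        transcripts[s["id"]] = model(s["lip"], s["voiceprint"],
                                     s["orientation"], mixed_signal)
    return transcripts
```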
Because the speech recognition model is trained on each user's lip movement information, voiceprint information, and orientation information together with the speech signal produced by multiple users speaking simultaneously, and because that mixed signal is never separated by user during training, the server can likewise, when using the model, input a speaker's lip movement information, voiceprint information, and orientation information together with the mixed multi-speaker signal and recognize the text information without separating the signal by speaker. This keeps the frequency spectrum of every speaker's speech signal intact and thus improves the accuracy of speech recognition.
As one implementation of the embodiments of the present invention, the speech signal may be collected by a microphone array comprising multiple array elements. Because the array elements are located at different positions, the speech signals of multiple speakers arrive at the elements with different time delays; that is, the waveform received at each element has different phase characteristics. This property can be exploited to accurately identify the speech features of different speakers from the phase characteristics of the waveforms, without separating the mixed signal produced by the simultaneous speakers.
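One way to make this phase cue concrete, purely as an illustration, is to compute inter-channel phase differences (IPDs) from a multi-channel STFT; the use of explicit IPD features here is an assumption of this sketch, not a front end specified by the embodiments:

```python
# A minimal sketch of the inter-channel phase cue described above, computed
# as inter-channel phase differences (IPDs) of a multi-channel STFT.
import numpy as np

def interchannel_phase_diff(multichannel_wave, n_fft=512, hop=256):
    """multichannel_wave: (num_mics, num_samples) array.
    Returns IPDs of each mic relative to mic 0, shape (mics-1, freq, frames)."""
    mics, n = multichannel_wave.shape
    frames = 1 + (n - n_fft) // hop
    window = np.hanning(n_fft)
    spec = np.empty((mics, n_fft // 2 + 1, frames), dtype=complex)
    for m in range(mics):
        for t in range(frames):
            seg = multichannel_wave[m, t * hop:t * hop + n_fft] * window
            spec[m, :, t] = np.fft.rfft(seg)
    # Phase of the cross-spectrum = phase delay between array elements.
    return np.angle(spec[1:] * np.conj(spec[0]))
```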
In this case, the step of inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model to obtain the text information corresponding to the speaker may comprise:
inputting the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model, so that the speech recognition model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the multiple array elements, and performs speech recognition on the speech features combined with the lip movement information to obtain the text information corresponding to the speaker.
For each speaker, the server can input the speaker's lip movement information, voiceprint information, orientation information, and the speech signal into the pre-trained speech recognition model. Because the waveform received at each element of the microphone array has different phase characteristics, the speech recognition model can extract the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the elements. The lip movement information characterizes the speaker's lip images while speaking, and combining the speech features with the lip movement features for recognition improves the accuracy of speech recognition when multiple speakers speak simultaneously, yielding the text information corresponding to the speaker.
It can be seen that, in this implementation, the server can input each speaker's lip movement information, voiceprint information, and orientation information, together with the speech signal, into the pre-trained speech recognition model, so that the model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the array elements, and performs recognition on the speech features combined with the lip movement information to obtain the speaker's text information. Because the array elements receive the speech signal at the same moment with different time delays, different phase characteristics arise, and the speech recognition model can exploit these phase characteristics to accurately identify the speech features of different speakers without separating the signal produced by the simultaneous speakers. The frequency spectrum of each speaker's speech signal therefore remains intact, which improves the accuracy of speech recognition.
As one implementation of the embodiments of the present invention, as shown in FIG. 3, the speech recognition model may include a residual layer 350, a first concatenation layer 340, a convolution layer 330, a second concatenation layer 320, and a recognition layer 310.
Accordingly, the step in which the speech recognition model extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the multiple array elements, and performs speech recognition on the speech features combined with the lip movement information to obtain the text information corresponding to the speaker, may comprise:
The residual layer 350 performs feature extraction on the lip movement information 304 to obtain lip features and inputs them into the second concatenation layer 320. The first concatenation layer 340 concatenates the speech signal 301, the orientation information 303, and the voiceprint information 302, and inputs the concatenated result into the convolution layer 330. The convolution layer 330 extracts the speech features corresponding to the speaker from the speech signal 301 based on the orientation information 303, the voiceprint information 302, and the phase characteristics between the array elements, and inputs the speech features into the second concatenation layer 320. The second concatenation layer 320 concatenates the speech features with the lip features and inputs the concatenated features into the recognition layer 310, which performs speech recognition based on the concatenated features, obtains the text information corresponding to the speaker, and outputs it.
The convolution layer may be implemented with a convolutional neural network (CNN), the residual layer with a residual network, and the recognition layer with an end-to-end automatic speech recognition (ASR) module, none of which is specifically limited here.
The lip features above characterize the speaker's lip images while speaking. The concatenated result output by the first concatenation layer joins the mixed speech signal of the simultaneous speakers with the speaker's orientation information and voiceprint information, and can thus encode the speaker's orientation, the spectral characteristics of the speaker's voice, and the characteristics of the mixed multi-user speech signal.
Because the mixed speech signal of the simultaneous speakers is collected by a microphone array whose elements are at different positions, the signals of the multiple speakers arrive at the elements with different time delays, i.e., with different waveform phase characteristics. The convolution layer 330 can therefore extract the speech features corresponding to the speaker from the speech signal 301 based on the orientation information 303, the voiceprint information 302, and the phase characteristics between the elements; these speech features are concatenated with the lip features in the second concatenation layer 320.
At this point both the speech features and the lip features are features of the same speaker, characterizing the speaker along the acoustic and visual dimensions respectively. After the speaker's speech features and lip features are concatenated and input into the recognition layer 310, the recognition layer can accurately recognize the text information corresponding to the speaker from the fused acoustic-visual features and output it.
It can be seen that, in this implementation, the residual layer of the speech recognition model extracts features from the lip movement information to obtain lip features, which are input into the second concatenation layer; the first concatenation layer concatenates the speech signal, the orientation information, and the voiceprint information and inputs the result into the convolution layer; the convolution layer extracts the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the array elements, and inputs them into the second concatenation layer; the second concatenation layer concatenates the speech features with the lip features and inputs the concatenated features into the recognition layer; and the recognition layer performs speech recognition based on the concatenated features to obtain and output the text information corresponding to the speaker. A speech recognition model with this structure can perform speech recognition accurately, ensuring the accuracy of the resulting text information. A structural sketch follows.
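Purely for illustration, the model of FIG. 3 might be sketched as follows, assuming PyTorch; all layer sizes, the per-frame conditioning layout, and the GRU-based recognition head are assumptions of this sketch, not dimensions given by the embodiments:

```python
# A structural sketch of the model in FIG. 3; sizes and heads are assumptions.
import torch
import torch.nn as nn

class MultiSpeakerASR(nn.Module):
    def __init__(self, lip_dim=128, audio_ch=8, feat_dim=256, vocab=5000):
        super().__init__()
        self.lip_proj = nn.Linear(lip_dim, feat_dim)
        self.residual = nn.Sequential(       # residual layer 350 (lip features)
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        # First concatenation layer 340 is realized by torch.cat in forward().
        self.conv = nn.Conv1d(audio_ch + 2, feat_dim,
                              kernel_size=5, padding=2)   # conv layer 330
        self.recognizer = nn.GRU(2 * feat_dim, feat_dim,
                                 batch_first=True)        # recognition layer 310
        self.out = nn.Linear(feat_dim, vocab)

    def forward(self, lips, voiceprint, orientation, mixed_audio):
        # lips: (B, T, lip_dim); voiceprint, orientation: (B, T) per-frame
        # conditioning; mixed_audio: (B, audio_ch, T) microphone channels.
        x = self.lip_proj(lips)
        lip_feat = x + self.residual(x)                   # residual connection
        cond = torch.stack([orientation, voiceprint], dim=1)       # (B, 2, T)
        mixed = torch.cat([mixed_audio, cond], dim=1)     # concat layer 340
        speech_feat = self.conv(mixed).transpose(1, 2)    # (B, T, feat_dim)
        fused = torch.cat([speech_feat, lip_feat], dim=-1)  # concat layer 320
        h, _ = self.recognizer(fused)
        return self.out(h)                                # per-frame token logits
```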
As one implementation of the embodiments of the present invention, as shown in FIG. 4, before the step of acquiring the images and speech signals of the multiple speakers and the voiceprint information of each speaker, the method may further comprise:
S401: acquiring a conference image of the conference and performing lip movement detection on the conference image to determine a target speaker who is currently speaking;
The server can acquire a conference image of the current conference; the conference image may be a frame of the conference video corresponding to a particular time point. For example, conference image A corresponds to the conference picture at 1 min 13 s of the conference video. For speech recognition, the server performs lip movement detection on the conference image to determine the target speaker who is currently speaking; there may be one or more target speakers, which is not specifically limited here.
S402: determining identity information of the target speaker based on a pre-established face database;
After determining the target speaker who is speaking, the server can determine the target speaker's identity information based on the pre-established face database and the speaker's face image. To facilitate this, a face database can be built in advance to store pre-acquired face model information and corresponding identity information of each person, for example the correspondence between facial features and names.
Before the conference begins, the terminal can obtain the list of participants, which includes their identity information. From the identity information in the list, the terminal can extract the participants' facial features from the face database and record them, completing participant registration. The terminal may store each participant's facial features in correspondence with the participant's identity information locally, or record the correspondence and send it to the server; either is reasonable. A sketch of the identity lookup follows.
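A minimal sketch of looking up identity information against such a face database might look as follows; the precomputed face embeddings and the cosine-similarity threshold are illustrative assumptions, not components specified by the embodiments:

```python
# Hypothetical identity lookup against the face database.
import numpy as np

def identify_speaker(face_embedding, face_db, threshold=0.7):
    """face_db: dict mapping identity (e.g., a name) to a stored embedding.
    Returns the best-matching identity, or None if no match is close enough."""
    best_id, best_sim = None, threshold
    for identity, stored in face_db.items():
        sim = np.dot(face_embedding, stored) / (
            np.linalg.norm(face_embedding) * np.linalg.norm(stored))
        if sim > best_sim:          # keep the closest face above threshold
            best_id, best_sim = identity, sim
    return best_id
```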
S403,获取所述目标发言者的语音信号,并提取该语音信号的声纹信息;S403, acquiring a speech signal of the target speaker, and extracting voiceprint information of the speech signal;
在一种实施方式中,若目标发言者首次发言时为独自发言,那么,服务器便可以直接获取目标发言者独自发言的时间段内,语音采集设备所采集到的该目标发言者的语音信号,并提取该语音信号的声纹信息。In one implementation, if the target speaker's first utterance is made while speaking alone, the server can directly obtain the target speaker's voice signal collected by the voice collection device during the period in which the target speaker speaks alone, and extract the voiceprint information from that voice signal.
在另一种实施方式中,目标发言者首次发言时为包含目标发言者的多个人同时发言的情况,那么,服务器可以获取该包含目标发言者的多个人同时发言的时间段内,语音采集设备所采集到的多个发言者的语音信号,根据该目标发言者的唇动信息以及方位信息,从多个发言者的语音信号中提取该目标发言者的语音信号,并提取该语音信号的声纹信息。In another embodiment, when the target speaker speaks for the first time, multiple people including the target speaker speak at the same time. In this case, the server can obtain voice signals of the multiple speakers collected by the voice collection device during the time period when the multiple people including the target speaker speak at the same time, extract the voice signal of the target speaker from the voice signals of the multiple speakers based on the lip movement information and direction information of the target speaker, and extract the voiceprint information of the voice signal.
上述两种实施方式中,语音采集设备可以为麦克风阵列,麦克风阵列采集到的语音信号可以进行波束形成(波束成型)处理,即对各阵元的输出进行时延或相位补偿、幅度加权处理,以形成指向特定方向的波束。这样,服务器就可以得到更加精确的目标发言者的语音信号,从而所提取的声纹信息能够更加准确。In both of the above implementations, the voice collection device may be a microphone array, and the voice signals it collects may be beamformed, that is, the output of each array element is given time-delay or phase compensation and amplitude weighting so as to form a beam pointing in a specific direction. In this way, the server obtains a more precise voice signal of the target speaker, and the extracted voiceprint information is accordingly more accurate.
其中,上述从语音信号提取声纹信息可以采用时延神经网络(Time Delay Neural Network,TDNN)和概率线性判别分析(Probabilistic Linear Discriminant Analysis,PLDA)等技术,上述波束形成可以采用最小方差无失真响应(Minimum Variance Distortionless Response,MVDR),在此不做具体限定。Among them, the above-mentioned extraction of voiceprint information from speech signals can adopt technologies such as Time Delay Neural Network (TDNN) and Probabilistic Linear Discriminant Analysis (PLDA), and the above-mentioned beamforming can adopt Minimum Variance Distortionless Response (MVDR), which is not specifically limited here.
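作为示意,下面给出MVDR波束形成权重在单个频点、均匀线阵情形下的常见计算草图,其中阵元数、阵元间距与频率均为假设值。As an illustration, the following is a common sketch of computing MVDR beamforming weights for a single frequency bin and a uniform linear array, where the number of array elements, the element spacing and the frequency are assumed values.

```python
import numpy as np

def steering_vector(angle_rad, n_mics=4, spacing=0.05, freq=1000.0, c=343.0):
    """均匀线阵在给定方位角、给定频点处的导向矢量;阵元数、间距、频率为假设值。"""
    delays = np.arange(n_mics) * spacing * np.sin(angle_rad) / c
    return np.exp(-2j * np.pi * freq * delays)

def mvdr_weights(R, d):
    """MVDR 权重 w = R^{-1} d / (d^H R^{-1} d)。
    R: (M, M) 噪声协方差矩阵;d: (M,) 目标方向导向矢量。"""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# 波束输出为 y(t) = w^H x(t),其中 x(t) 为各阵元在该频点的信号
```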
S404,将所述声纹信息与所述身份信息对应记录。S404: Record the voiceprint information and the identity information in correspondence with each other.
在确定该目标发言者的身份信息以及获取到该目标发言者的声纹信息后,服务器可以将该目标发言者的身份信息与该目标发言者的声纹信息对应记录,从而获取到本次会议该目标发言者的声纹信息与该目标发言者的对应关系。例如:该目标发言者为目标发言者A,提取出目标发言者A的声纹信息1后,可以对应记录为"目标发言者A-声纹信息1"。对应记录的方式可以为采用表格进行记录等,在此不做具体限定。例如,可以如下表所示:After determining the identity information of the target speaker and obtaining the voiceprint information of the target speaker, the server can record the identity information of the target speaker and the voiceprint information of the target speaker in correspondence, thereby obtaining the corresponding relationship between the voiceprint information of the target speaker and the target speaker in this meeting. For example: the target speaker is target speaker A; after extracting the voiceprint information 1 of target speaker A, it can be recorded as "target speaker A-voiceprint information 1". The corresponding recording method can be recording in a table, etc., which is not specifically limited here. For example, it can be shown in the following table:

| 身份信息 Identity information | 声纹信息 Voiceprint information |
|---|---|
| 目标发言者A Target speaker A | 声纹信息1 Voiceprint information 1 |
可见,在本实施例中,服务器可以获取会议中的会议图像,并对会议图像进行唇动检测,确定正在发言的目标发言者;基于预先建立的人脸库,确定目标发言者的身份信息;获取目标发言者的语音信号,并提取该语音信号的声纹信息;将声纹信息与身份信息对应记录。相关技术中,在会议开始前进行与会人员的声纹注册,但同一个与会人员在声纹注册后,不同时间段内声纹信息波动较大,在实际使用过程中会导致语音识别率低的问题。而本实施例中,无需如相关技术中在会议开始前进行与会人员的声纹注册,而是提取会议过程中与会人员发出的语音信号,进行声纹信息的注册,从而避免了会议开始前后环境变化以及与会人员自身声纹波动大导致的预先注册的声纹信息不准确的问题。目标发言者的声纹信息更加准确,从而提高了后续语音识别的准确度。It can be seen that, in this embodiment, the server can obtain a conference image of the meeting, perform lip movement detection on it to determine the target speaker who is speaking, determine the target speaker's identity information based on the pre-established face database, obtain the target speaker's voice signal, extract the voiceprint information from that signal, and record the voiceprint information in correspondence with the identity information. In the related art, the participants' voiceprints are registered before the meeting starts, but a participant's voiceprint information fluctuates considerably across different time periods after registration, which leads to a low speech recognition rate in actual use. In this embodiment, there is no need to register the participants' voiceprints before the meeting starts as in the related art; instead, the voiceprint information is registered from the voice signals uttered by the participants during the meeting. This avoids the inaccuracy of pre-registered voiceprint information caused by environmental changes around the start of the meeting and by fluctuations in the participants' own voiceprints, so the target speaker's voiceprint information is more accurate, which in turn improves the accuracy of subsequent speech recognition.
作为本发明实施例的一种实施方式,如图5所示,上述对所述发言图像进行识别,确定每个发言者的方位信息的步骤,可以包括:As an implementation of an embodiment of the present invention, as shown in FIG5 , the step of identifying the speech image and determining the position information of each speaker may include:
S501,对所述发言图像进行识别,确定每个发言者的面部像素点;S501, identifying the speech image to determine the facial pixels of each speaker;
服务器可以对发言图像进行识别,确定出该发言图像中每个发言者的面部像素点,服务器可以选取该面部像素点的任一点作为该发言者的面部像素点位于图像中的位置,也可以计算该面部像素点的平均值,将平均值所对应的点作为该发言者的面部像素点位于图像中的位置,在此不做具体限定。The server can identify the speaking image and determine the facial pixels of each speaker in the speaking image. The server can select any point of the facial pixels as the position of the speaker's facial pixel in the image, or calculate the average value of the facial pixels and use the point corresponding to the average value as the position of the speaker's facial pixel in the image. No specific limitation is made here.
S502,针对每个发言者,基于该发言者的所述面部像素点在所述发言图像中位置、预先标定的拍摄所述发言图像的图像采集设备的参数以及语音采集设备的位置,确定该发言者相对于所述语音采集设备的角度信息,作为该发言者的方位信息。S502, for each speaker, based on the position of the facial pixels of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for taking the speaking image, and the position of the voice acquisition device, determine the angle information of the speaker relative to the voice acquisition device as the orientation information of the speaker.
在一种实施方式中,在获得该发言者的面部像素点在发言图像中的位置后,服务器可以基于该位置以及预先标定的拍摄该发言图像的图像采集设备的参数,计算得到该发言者位于会议场景中的位置,再基于语音采集设备与摄像机的相对位置,计算得到该发言者相对于语音采集设备的角度信息,作为该发言者的方位信息。In one implementation, after obtaining the position of the speaker's facial pixels in the speech image, the server can calculate the speaker's position in the conference scene based on that position and the pre-calibrated parameters of the image acquisition device that captured the speech image, and then, based on the relative position of the voice acquisition device and the camera, calculate the speaker's angle information relative to the voice acquisition device as the speaker's orientation information.
在一种实施方式中,图像采集设备为摄像机,语音采集设备为麦克风阵列,以摄像机在会议场景中的位置作为三维坐标系原点建立坐标系,X轴与Y轴构成水平平面。服务器可以从发言图像1中提取出每个发言者的面部像素点,将面部像素点的平均值对应的点作为发言者A的位置,根据摄像机的内部参数以及外部参数,计算得到该发言者在上述坐标系中的三维坐标(x1,y1,z1),而麦克风阵列在上述坐标系中的三维坐标为(x2,y2,z2),计算正切值为(|x1|+|x2|)/(|y1|+|y2|)的角度,即arctan((|x1|+|x2|)/(|y1|+|y2|)),作为该发言者的方位信息。In one implementation, the image acquisition device is a camera and the voice acquisition device is a microphone array. A coordinate system is established with the camera's position in the conference scene as the origin of a three-dimensional coordinate system, with the X-axis and Y-axis forming the horizontal plane. The server can extract the facial pixels of each speaker from speech image 1 and take the point corresponding to the average of the facial pixels as speaker A's position. From the camera's intrinsic and extrinsic parameters, the speaker's three-dimensional coordinates (x1, y1, z1) in this coordinate system are calculated, while the microphone array's three-dimensional coordinates in the same system are (x2, y2, z2). The angle whose tangent is (|x1|+|x2|)/(|y1|+|y2|), that is, arctan((|x1|+|x2|)/(|y1|+|y2|)), is then calculated as the speaker's orientation information.
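按上文字面给出的公式,方位角计算可以写成如下草图;坐标与公式均沿用上文的表述,函数名为假设。Following the formula as literally stated above, the azimuth computation can be sketched as follows; the coordinates and formula follow the wording above, and the function name is an assumption.

```python
import numpy as np

def speaker_azimuth(p_speaker, p_mic):
    """p_speaker = (x1, y1, z1), p_mic = (x2, y2, z2),均为以摄像机为原点的三维坐标。
    返回正切值为 (|x1|+|x2|)/(|y1|+|y2|) 的角度(弧度),作为方位信息。"""
    (x1, y1, _), (x2, y2, _) = p_speaker, p_mic
    # arctan2 的两个参数均非负,结果落在 [0, pi/2] 区间内
    return np.arctan2(abs(x1) + abs(x2), abs(y1) + abs(y2))
```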
可见,在本实施例中,服务器可以对发言图像进行识别,确定每个发言者的面部像素点,针对每个发言者,基于该发言者的所述面部像素点在发言图像中位置、预先标定的拍摄发言图像的图像采集设备的参数以及语音采集设备的位置,确定该发言者相对于所述语音采集设备的角度信息,作为该发言者的方位信息,这样,服务器便可以准确确定该发言者的方位信息,进而可以保证后续语音识别的准确度。It can be seen that in this embodiment, the server can identify the speaking image, determine the facial pixels of each speaker, and for each speaker, based on the position of the facial pixels of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for taking the speaking image, and the position of the voice acquisition device, determine the angle information of the speaker relative to the voice acquisition device as the orientation information of the speaker. In this way, the server can accurately determine the orientation information of the speaker, thereby ensuring the accuracy of subsequent voice recognition.
作为本发明实施例的一种实施方式,如图6所示,上述语音识别模型的训练方式,可以包括:As an implementation of an embodiment of the present invention, as shown in FIG6 , the training method of the speech recognition model may include:
S601,获取所述多用户语音样本以及初始模型;S601, obtaining the multi-user voice samples and the initial model;
服务器可以获取多用户语音样本以及初始模型,其中,多用户语音样本包含多个用户的唇动信息、声纹信息、方位信息、语音信号以及每个用户所对应的文本信息。初始模型的结构与上述语音识别模型的结构相同,即可以包括:残差层、第一拼接层、卷积层、第二拼接层以及识别层,初始模型的初始参数可以为默认值也可以随机初始化,在此不做具体限定。The server may obtain multi-user voice samples and an initial model, wherein the multi-user voice samples include lip movement information, voiceprint information, position information, voice signals of multiple users, and text information corresponding to each user. The structure of the initial model is the same as that of the above-mentioned speech recognition model, that is, it may include: a residual layer, a first concatenation layer, a convolution layer, a second concatenation layer, and a recognition layer. The initial parameters of the initial model may be default values or randomly initialized, and are not specifically limited here.
S602,将每个多用户语音样本中包括每个用户所对应的文本信息,作为样本标签;S602, taking text information corresponding to each user included in each multi-user voice sample as a sample label;
服务器可以获取每个多用户语音样本中每个用户所对应的文本信息,该文本信息可以为人工确定的,也可以预先确定文本信息,进而使多个用户同时按照对应的文本信息发出语音信号,获得上述多用户语音样本。每个多用户语音样本所对应的文本信息即可以作为该多用户语音样本对应的样本标签。The server can obtain the text information corresponding to each user in each multi-user voice sample. The text information may be determined manually, or it may be determined in advance so that multiple users simultaneously utter voice signals according to their corresponding text, thereby producing the multi-user voice sample. The text information corresponding to each multi-user voice sample then serves as the sample label of that sample.
S603,将每个所述多用户语音样本输入所述初始模型,得到预测文本信息;S603, inputting each of the multi-user voice samples into the initial model to obtain predicted text information;
针对每个多用户语音样本所包括的用户,可以将该用户的唇动信息输入至初始模型的残差层以对唇动信息进行特征提取,得到唇部特征后输入第二拼接层。将该用户的声纹信息、将该用户的方位信息以及多个用户同时发言的语音信号输入至第一拼接层进行拼接,并将拼接后的结果输入至卷积层。For each user included in the multi-user voice sample, the lip movement information of the user can be input into the residual layer of the initial model to extract the features of the lip movement information, and the lip features are input into the second splicing layer. The voiceprint information of the user, the position information of the user, and the voice signals of multiple users speaking simultaneously are input into the first splicing layer for splicing, and the spliced results are input into the convolution layer.
卷积层可以基于该用户的声纹信息、将该用户的方位信息以及麦克风阵列包括的多个阵元之间的相位特性,从多个用户同时发言的语音信号提取该用户对应的语音特征,并将该语音特征输入第二拼接层。为了保证训练得到的语音识别模型可以对语音信号进行准确处理,该麦克风阵列可以与上述实施例中所说的麦克风阵列相同。The convolution layer can extract the voice features corresponding to the user from the voice signals of multiple users speaking simultaneously based on the voiceprint information of the user, the position information of the user, and the phase characteristics between the multiple array elements included in the microphone array, and input the voice features into the second splicing layer. In order to ensure that the trained speech recognition model can accurately process the speech signal, the microphone array can be the same as the microphone array described in the above embodiment.
进而,第二拼接层可以将语音特征与唇部特征进行拼接,并将拼接后的特征输入识别层,识别层可以基于拼接后的特征进行语音识别,得到文本信息,作为预测文本信息。Furthermore, the second concatenation layer may concatenate the speech features with the lip features, and input the concatenated features into the recognition layer. The recognition layer may perform speech recognition based on the concatenated features to obtain text information as predicted text information.
S604,基于每个所述多用户语音样本对应的预测文本信息与样本标签之间的差异,调整所述初始模型的模型参数,直到所述初始模型收敛,得到所述语音识别模型。S604: Based on the difference between the predicted text information corresponding to each of the multi-user voice samples and the sample label, adjust the model parameters of the initial model until the initial model converges to obtain the speech recognition model.
由于当前的初始模型可能还不能对语音信号进行准确识别,所以可以基于每个多用户语音样本对应的预测文本信息与样本标签之间的差异,调整初始模型的模型参数,以使初始模型的参数越来越合适,提高语音识别的准确度,直到初始模型收敛。其中,可以采用梯度下降算法、随机梯度下降算法等调整初始模型的参数,在此不做具体限定。Since the current initial model may not be able to accurately recognize the speech signal, the model parameters of the initial model can be adjusted based on the difference between the predicted text information corresponding to each multi-user speech sample and the sample label, so that the parameters of the initial model are more and more suitable, and the accuracy of speech recognition is improved until the initial model converges. Among them, the parameters of the initial model can be adjusted by using a gradient descent algorithm, a stochastic gradient descent algorithm, etc., which are not specifically limited here.
在一种实施方式中,可以基于预测文本信息与样本标签之间的差异计算损失函数的函数值,当函数值达到预设值时,确定当前的初始模型收敛,得到语音识别模型。在一种实施方式中,在多用户语音样本迭代次数达到预设次数后,可以认为初始模型收敛,得到语音识别模型。In one embodiment, the function value of the loss function can be calculated based on the difference between the predicted text information and the sample label. When the function value reaches a preset value, it is determined that the current initial model has converged to obtain a speech recognition model. In one embodiment, after the number of iterations of the multi-user speech samples reaches a preset number, it can be considered that the initial model has converged to obtain a speech recognition model.
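结合前文的模型草图,训练过程可以示意如下;其中损失函数、优化器与收敛判据均为假设(实际语音识别常用CTC或注意力解码等序列损失,此处以逐帧交叉熵为例)。Combined with the earlier model sketch, the training procedure can be illustrated as follows; the loss function, optimizer and convergence criterion are assumptions (sequence losses such as CTC or attention-based decoding are common in practice; frame-level cross-entropy is used here as an example).

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, target_loss=0.1):
    """model: 前文草图中的 SpeechRecognitionSketch;loader 逐批产出
    (lip, speech, direction, voiceprint, labels),labels 为逐帧文本标签。"""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # 随机梯度下降
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):  # 迭代次数达到预设次数即视为收敛
        for lip, speech, direction, voiceprint, labels in loader:
            logits = model(lip, speech, direction, voiceprint)
            # 基于预测文本信息与样本标签之间的差异计算损失
            loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < target_loss:  # 损失达到预设值,确定模型收敛
            break
    return model
```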
可见,在本实施例中,服务器可以获取多用户语音样本以及初始模型,将每个多用户语音样本中包括每个用户所对应的文本信息,作为样本标签,将每个多用户语音样本输入所述初始模型,得到预测文本信息,基于每个多用户语音样本对应的预测文本信息与样本标签之间的差异,调整初始模型的模型参数,直到初始模型收敛,得到语音识别模型。通过该训练方式可以训练得到能够准确对唇动信息、声纹信息、方位信息以及多用户同时说话所产生的语音信号进行识别的模型,从而保证后续语音识别的准确度。It can be seen that in this embodiment, the server can obtain multi-user voice samples and an initial model, use the text information corresponding to each user in each multi-user voice sample as a sample label, input each multi-user voice sample into the initial model, obtain predicted text information, and adjust the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain a speech recognition model. Through this training method, a model that can accurately recognize lip movement information, voiceprint information, orientation information, and speech signals generated by multiple users speaking simultaneously can be trained, thereby ensuring the accuracy of subsequent speech recognition.
作为本发明实施例的一种实施方式,上述方法还可以包括:As an implementation of the embodiment of the present invention, the above method may further include:
基于所述每个发言者对应的文本信息,生成会议记录。A meeting record is generated based on the text information corresponding to each speaker.
由于在会议视频中,会有多个发言者同时发言或单个发言者发言的情况,服务器可以将不同情况下,发言者对应的文本信息按照对应的时间顺序进行记录,生成会议记录。Since there may be multiple speakers speaking at the same time or a single speaker speaking in a conference video, the server may record the text information corresponding to the speakers in different situations in a corresponding chronological order to generate a conference record.
例如,发言者A在时间a时发言所产生的语音信号对应的文本信息为“本次会议内容为上个季度的工作汇报”,在发言者A发言之后的时间b,发言者B和发言者C同时说话,发言者B发言所产生的语音信号对应的文本信息为“上个季度我部门完成了一个项目”,发言者C发言所产生的语音信号对应的文本信息为“我有一个问题想了解下”。那么服务器便可以生成会议记录:时间a:发言者A,本次会议内容为上个季度的工作汇报;时间b:发言者B,上个季度我部门完成了一个项目;发言者C,我有一个问题想了解下。For example, the text information corresponding to the voice signal generated by speaker A at time a is "The content of this meeting is the work report of the last quarter". At time b after speaker A spoke, speakers B and C spoke at the same time. The text information corresponding to the voice signal generated by speaker B is "My department completed a project last quarter", and the text information corresponding to the voice signal generated by speaker C is "I have a question I want to know". Then the server can generate a meeting record: time a: speaker A, the content of this meeting is the work report of the last quarter; time b: speaker B, my department completed a project last quarter; speaker C, I have a question I want to know.
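会议记录的生成可以示意为按时间聚合各发言者的文本。下面是一个假设性草图,其数据结构与输出格式均为假设。Generating the meeting record can be illustrated as aggregating each speaker's text by time; the following is a hypothetical sketch whose data structure and output format are assumptions.

```python
from collections import defaultdict

def build_minutes(utterances):
    """utterances: [(time, speaker, text), ...];同一时间的多个发言者
    (即同时发言的情况)被合并到同一时间条目下。"""
    by_time = defaultdict(list)
    for t, speaker, text in utterances:
        by_time[t].append(f"{speaker},{text}")
    return "\n".join(f"时间{t}:" + ";".join(items)
                     for t, items in sorted(by_time.items()))

# 示例:对应上文中时间a、时间b的发言
print(build_minutes([
    ("a", "发言者A", "本次会议内容为上个季度的工作汇报"),
    ("b", "发言者B", "上个季度我部门完成了一个项目"),
    ("b", "发言者C", "我有一个问题想了解下"),
]))
```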
在一种实施方式中,会议记录中还可以包括会议地点、会议名称等信息,在此不做具体限定。In one implementation, the meeting record may also include information such as the meeting location and meeting name, which is not specifically limited here.
可见,在本实施例中,服务器可以基于每个发言者对应的文本信息,生成会议记录,由于服务器可以将会议中多个发言者、单个发言者发言的情况按照会议时间顺序进行记录,并且针对多个发言者同时发言的情况,也可以准确进行语音识别得到准确的文本信息,不需要额外配备会议记录人员,节省了人力和成本。It can be seen that in this embodiment, the server can generate a meeting record based on the text information corresponding to each speaker. Since the server can record the speeches of multiple speakers or a single speaker in the meeting in chronological order of the meeting, and can also accurately perform voice recognition to obtain accurate text information when multiple speakers speak at the same time, there is no need for additional meeting recorders, saving manpower and costs.
相应与上述一种语音识别方法,本发明实施例还提供了一种语音识别装置,下面对本发明实施例所提供的一种语音识别装置进行介绍。Corresponding to the above-mentioned speech recognition method, an embodiment of the present invention further provides a speech recognition device. The speech recognition device provided by the embodiment of the present invention is introduced below.
如图7所示,一种语音识别装置,所述装置可以包括:As shown in FIG. 7 , a speech recognition device may include:
第一获取模块710,用于获取会议中多个发言者的发言图像、语音信号以及每个发言者的声纹信息,其中,所述语音信号包括所述多个发言者同时发言所产生的语音信号;A first acquisition module 710 is used to acquire speech images, voice signals and voiceprint information of each speaker of a conference, wherein the voice signal includes the voice signals generated by the multiple speakers speaking simultaneously;
第一确定模块720,用于对所述发言图像进行识别,确定每个发言者的方位信息以及唇动信息;A first determination module 720, configured to recognize the speech image and determine the position information and lip movement information of each speaker;
识别模块730,用于针对每个发言者,将该发言者的唇动信息、声纹信息、方位信息以及所述语音信号输入预先训练完成的语音识别模型,得到该发言者对应的文本信息,其中,所述语音识别模型为基于多用户语音样本训练得到的,所述多用户语音样本包括每个用户的唇动信息、声纹信息、方位信息以及多用户同时发言所产生的语音信号。The recognition module 730 is used to input the lip movement information, voiceprint information, direction information and the voice signal of each speaker into a pre-trained speech recognition model to obtain text information corresponding to the speaker, wherein the speech recognition model is trained based on multi-user voice samples, and the multi-user voice samples include the lip movement information, voiceprint information, direction information of each user and the voice signals generated by multiple users speaking simultaneously.
可见,本发明实施例提供的方案中,服务器可以获取会议中多个发言者的发言图像、语音信号以及每个发言者的声纹信息,其中,语音信号包括多个发言者同时发言所产生的语音信号,对发言图像进行识别,确定每个发言者的方位信息以及唇动信息,针对每个发言者,将该发言者的唇动信息、声纹信息、方位信息以及语音信号输入预先训练完成的语音识别模型,得到该发言者对应的文本信息,其中,语音识别模型为基于多用户语音样本训练得到的,多用户语音样本包括每个用户的唇动信息、声纹信息、方位信息以及多用户同时发言所产生的语音信号。通过上述方案,服务器可以将获取到的多个发言者的发言图像、语音信号以及每个发言者的声纹信息输入至语音识别模型中,由于不需要将多个发言者的语音信号按照不同的发言者进行分离,保证了不同发言者的语音信号的频谱完整,从而提高了语音识别的准确度。It can be seen that in the solution provided by the embodiment of the present invention, the server can obtain the speech images, voice signals and voiceprint information of multiple speakers in the conference, wherein the voice signals include the voice signals generated by multiple speakers speaking at the same time, recognize the speech images, determine the orientation information and lip movement information of each speaker, and for each speaker, input the lip movement information, voiceprint information, orientation information and voice signal of the speaker into the pre-trained speech recognition model to obtain the text information corresponding to the speaker, wherein the speech recognition model is obtained by training based on multi-user speech samples, and the multi-user speech samples include the lip movement information, voiceprint information, orientation information of each user and the speech signals generated by multiple users speaking at the same time. Through the above solution, the server can input the speech images, voice signals and voiceprint information of each speaker obtained from multiple speakers into the speech recognition model. Since it is not necessary to separate the speech signals of multiple speakers according to different speakers, the spectrum of the speech signals of different speakers is guaranteed to be complete, thereby improving the accuracy of speech recognition.
作为本发明实施例的一种实施方式,上述语音信号可以为麦克风阵列所采集的语音信号,所述麦克风阵列包括多个阵元;As an implementation manner of the embodiment of the present invention, the above-mentioned voice signal may be a voice signal collected by a microphone array, and the microphone array includes a plurality of array elements;
上述识别模块730可以包括:The identification module 730 may include:
第一识别单元,用于将该发言者的唇动信息、声纹信息、方位信息以及所述语音信号输入预先训练完成的语音识别模型,以使所述语音识别模型基于所述方位信息、所述声纹信息以及所述多个阵元之间的相位特性,从所述语音信号中提取该发言者对应的语音特征,并将所述语音特征结合所述唇动信息进行语音识别,得到该发言者对应的文本信息。The first recognition unit is used to input the speaker's lip movement information, voiceprint information, orientation information and the voice signal into a pre-trained voice recognition model, so that the voice recognition model extracts voice features corresponding to the speaker from the voice signal based on the orientation information, the voiceprint information and the phase characteristics between the multiple array elements, and performs voice recognition on the voice features in combination with the lip movement information to obtain text information corresponding to the speaker.
作为本发明实施例的一种实施方式,上述语音识别模型可以包括:残差层、第一拼接层、卷积层、第二拼接层以及识别层;As an implementation of an embodiment of the present invention, the speech recognition model may include: a residual layer, a first concatenation layer, a convolution layer, a second concatenation layer, and a recognition layer;
所述第一识别单元可以包括:The first identification unit may include:
第一提取子单元,用于所述残差层对所述唇动信息进行特征提取,得到唇部特征,并输入所述第二拼接层;A first extraction subunit is used for performing feature extraction on the lip movement information at the residual layer to obtain lip features and input the lip features into the second splicing layer;
第一拼接子单元,用于所述第一拼接层将所述语音信号、所述方位信息以及所述声纹信息进行拼接,并将拼接后的结果输入至所述卷积层;A first splicing subunit, configured to splice the speech signal, the position information and the voiceprint information at the first splicing layer, and input the spliced result to the convolution layer;
第二提取子单元,用于所述卷积层基于所述方位信息、所述声纹信息以及所述多个阵元之间的相位特性,从所述语音信号中提取该发言者对应的语音特征,并将所述语音特征输入所述第二拼接层;A second extraction subunit is used for the convolution layer to extract the speech features corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information and the phase characteristics between the multiple array elements, and input the speech features into the second concatenation layer;
第二拼接子单元,用于所述第二拼接层将所述语音特征与所述唇部特征进行拼接,并将拼接后的特征输入所述识别层;A second splicing subunit, configured to splice the speech feature and the lip feature at the second splicing layer, and input the spliced features into the recognition layer;
识别子单元,用于所述识别层基于所述拼接后的特征进行语音识别,得到该发言者对应的文本信息,并输出所述文本信息。The recognition subunit is used for the recognition layer to perform speech recognition based on the spliced features, obtain the text information corresponding to the speaker, and output the text information.
作为本发明实施例的一种实施方式,如图8所示,上述装置还可以包括:As an implementation of an embodiment of the present invention, as shown in FIG8 , the above-mentioned device may further include:
第二获取模块740,用于获取会议中的会议图像,并对所述会议图像进行唇动检测,确定正在发言的目标发言者;The second acquisition module 740 is used to acquire a conference image in the conference, and perform lip movement detection on the conference image to determine a target speaker who is speaking;
第二确定模块750,用于基于预先建立的人脸库,确定所述目标发言者的身份信息;A second determination module 750 is used to determine the identity information of the target speaker based on a pre-established face database;
第三获取模块760,用于获取所述目标发言者的语音信号,并提取该语音信号的声纹信息;The third acquisition module 760 is used to acquire the speech signal of the target speaker and extract the voiceprint information of the speech signal;
记录模块770,用于将所述声纹信息与所述身份信息对应记录。The recording module 770 is used to record the voiceprint information and the identity information in correspondence.
作为本发明实施例的一种实施方式,上述第一确定模块720可以包括:As an implementation of the embodiment of the present invention, the first determining module 720 may include:
第二识别单元,用于对所述发言图像进行识别,确定每个发言者的面部像素点;A second recognition unit is used to recognize the speech image and determine the facial pixels of each speaker;
确定单元,用于针对每个发言者,基于该发言者的所述面部像素点在所述发言图像中位置、预先标定的拍摄所述发言图像的图像采集设备的参数以及语音采集设备的位置,确定该发言者相对于所述语音采集设备的角度信息,作为该发言者的方位信息。A determination unit is used to determine, for each speaker, the angle information of the speaker relative to the voice acquisition device based on the position of the facial pixels of the speaker in the speech image, the pre-calibrated parameters of the image acquisition device for taking the speech image, and the position of the voice acquisition device, as the orientation information of the speaker.
作为本发明实施例的一种实施方式,上述语音识别模型通过模型训练模块预先训练得到,所述模型训练模块可以包括:As an implementation manner of an embodiment of the present invention, the above-mentioned speech recognition model is pre-trained by a model training module, and the model training module may include:
样本获取单元,用于获取所述多用户语音样本以及初始模型;A sample acquisition unit, used to acquire the multi-user speech samples and the initial model;
标签确定单元,用于将每个多用户语音样本中包括每个用户所对应的文本信息,作为样本标签;A label determination unit, used to use the text information corresponding to each user in each multi-user voice sample as a sample label;
文本预测单元,用于将每个所述多用户语音样本输入所述初始模型,得到预测文本信息;A text prediction unit, used for inputting each of the multi-user speech samples into the initial model to obtain predicted text information;
参数调整单元,用于基于每个所述多用户语音样本对应的预测文本信息与样本标签之间的差异,调整所述初始模型的模型参数,直到所述初始模型收敛,得到所述语音识别模型。The parameter adjustment unit is used to adjust the model parameters of the initial model based on the difference between the predicted text information corresponding to each of the multi-user voice samples and the sample label until the initial model converges to obtain the speech recognition model.
作为本发明实施例的一种实施方式,如图9所示,上述装置还可以包括:As an implementation of an embodiment of the present invention, as shown in FIG9 , the above device may further include:
生成模块780,用于基于所述每个发言者对应的文本信息,生成会议记录。The generating module 780 is used to generate the conference record based on the text information corresponding to each speaker.
相应与上述一种语音识别方法,本发明实施例还提供了一种语音识别系统,下面对本发明实施例所提供的一种语音识别系统进行介绍。Corresponding to the above-mentioned speech recognition method, an embodiment of the present invention further provides a speech recognition system. The speech recognition system provided by the embodiment of the present invention is introduced below.
如图10所示,一种语音识别系统,所述系统包括服务器1004和终端1003,所述终端设置有图像采集设备1001以及语音采集设备1002,其中:As shown in FIG10 , a speech recognition system includes a server 1004 and a terminal 1003, wherein the terminal is provided with an image acquisition device 1001 and a speech acquisition device 1002, wherein:
所述图像采集设备1001,用于在会议中采集图像;The image acquisition device 1001 is used to acquire images during a meeting;
所述语音采集设备1002,用于在会议中采集语音信号;The voice collection device 1002 is used to collect voice signals in a conference;
所述终端1003,用于将所述图像和所述语音信号发送至所述服务器1004;The terminal 1003 is used to send the image and the voice signal to the server 1004;
所述服务器1004,用于接收所述图像和所述语音信号,并执行上述实施例中任一所述的语音识别方法的步骤。The server 1004 is used to receive the image and the voice signal, and perform the steps of the speech recognition method described in any one of the above embodiments.
可见,本发明实施例提供的方案中,图像采集设备可以在会议中采集图像,语音采集设备可以在会议中采集语音信号,终端可以将图像和语音信号发送至服务器,服务器可以获取会议中多个发言者的发言图像、语音信号以及每个发言者的声纹信息,其中,语音信号包括多个发言者同时发言所产生的语音信号,对发言图像进行识别,确定每个发言者的方位信息以及唇动信息,针对每个发言者,将该发言者的唇动信息、声纹信息、方位信息以及语音信号输入预先训练完成的语音识别模型,得到该发言者对应的文本信息,其中,语音识别模型为基于多用户语音样本训练得到的,多用户语音样本包括每个用户的唇动信息、声纹信息、方位信息以及多用户同时发言所产生的语音信号。通过上述方案,服务器可以将多个发言者的发言图像、语音信号以及每个发言者的声纹信息输入至语音识别模型中,由于不需要将多个发言者的语音信号按照不同的发言者进行分离,保证了不同发言者的语音信号的频谱完整,从而提高了语音识别的准确度。It can be seen that in the solution provided by the embodiment of the present invention, the image acquisition device can acquire images in a conference, the voice acquisition device can acquire voice signals in a conference, the terminal can send images and voice signals to the server, and the server can obtain the speech images, voice signals and voiceprint information of multiple speakers in the conference, wherein the voice signals include the voice signals generated by multiple speakers speaking at the same time, recognize the speech images, determine the orientation information and lip movement information of each speaker, and for each speaker, input the lip movement information, voiceprint information, orientation information and voice signal of the speaker into the pre-trained speech recognition model to obtain the text information corresponding to the speaker, wherein the speech recognition model is obtained by training based on multi-user speech samples, and the multi-user speech samples include the lip movement information, voiceprint information, orientation information of each user and the voice signals generated by multiple users speaking at the same time. Through the above solution, the server can input the speech images, voice signals and voiceprint information of multiple speakers into the speech recognition model, and since it is not necessary to separate the voice signals of multiple speakers according to different speakers, the spectrum of the voice signals of different speakers is guaranteed to be complete, thereby improving the accuracy of speech recognition.
本发明实施例还提供了一种服务器,如图11所示,包括处理器1101、通信接口1102、存储器1103和通信总线1104,其中,处理器1101,通信接口1102,存储器1103通过通信总线1104完成相互间的通信,The embodiment of the present invention further provides a server, as shown in FIG11 , including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, wherein the processor 1101, the communication interface 1102, and the memory 1103 communicate with each other via the communication bus 1104.
存储器1103,用于存放计算机程序;Memory 1103, used for storing computer programs;
处理器1101,用于执行存储器1103上所存放的程序时,实现上述任一实施例所述的语音识别方法步骤。The processor 1101 is used to implement the speech recognition method steps described in any of the above embodiments when executing the program stored in the memory 1103.
可见,本发明实施例提供的方案中,服务器可以获取会议中多个发言者的发言图像、语音信号以及每个发言者的声纹信息,其中,语音信号包括多个发言者同时发言所产生的语音信号,对发言图像进行识别,确定每个发言者的方位信息以及唇动信息,针对每个发言者,将该发言者的唇动信息、声纹信息、方位信息以及语音信号输入预先训练完成的语音识别模型,得到该发言者对应的文本信息,其中,语音识别模型为基于多用户语音样本训练得到的,多用户语音样本包括每个用户的唇动信息、声纹信息、方位信息以及多用户同时发言所产生的语音信号。通过上述方案,服务器可以将获取到的多个发言者的发言图像、语音信号以及每个发言者的声纹信息输入至语音识别模型中,由于不需要将多个发言者的语音信号按照不同的发言者进行分离,保证了不同发言者的语音信号的频谱完整,从而提高了语音识别的准确度。It can be seen that in the solution provided by the embodiment of the present invention, the server can obtain the speech images, voice signals and voiceprint information of multiple speakers in the conference, wherein the voice signals include the voice signals generated by multiple speakers speaking at the same time, recognize the speech images, determine the orientation information and lip movement information of each speaker, and for each speaker, input the lip movement information, voiceprint information, orientation information and voice signal of the speaker into the pre-trained speech recognition model to obtain the text information corresponding to the speaker, wherein the speech recognition model is obtained by training based on multi-user speech samples, and the multi-user speech samples include the lip movement information, voiceprint information, orientation information of each user and the speech signals generated by multiple users speaking at the same time. Through the above solution, the server can input the speech images, voice signals and voiceprint information of each speaker obtained from multiple speakers into the speech recognition model. Since it is not necessary to separate the speech signals of multiple speakers according to different speakers, the spectrum of the speech signals of different speakers is guaranteed to be complete, thereby improving the accuracy of speech recognition.
上述服务器提到的通信总线可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The communication bus mentioned in the above server can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
通信接口用于上述服务器与其他设备之间的通信。The communication interface is used for communication between the above server and other devices.
存储器可以包括随机存取存储器(Random Access Memory,RAM),也可以包括非易失性存储器(Non-Volatile Memory,NVM),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离前述处理器的存储装置。The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.
上述的处理器可以是通用处理器,包括中央处理器(Central Processing Unit,CPU)、网络处理器(Network Processor,NP)等;还可以是数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
在本发明提供的又一实施例中,还提供了一种计算机可读存储介质,该计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现上述实施例中任一所述的语音识别方法的步骤。In another embodiment of the present invention, a computer-readable storage medium is provided, in which a computer program is stored. When the computer program is executed by a processor, the steps of the speech recognition method described in any of the above embodiments are implemented.
在本发明提供的又一实施例中,还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述实施例中任一所述的语音识别方法。In another embodiment of the present invention, a computer program product including instructions is provided. When the computer program product is run on a computer, the computer executes the speech recognition method described in any one of the above embodiments.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。In the above embodiments, the solution can be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, server or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), etc.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this article, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, the elements defined by the sentence "comprise a ..." do not exclude the presence of other identical elements in the process, method, article or device including the elements.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于一种语音识别方法、装置、系统、服务器、计算机可读存储介质以及计算机程序产品而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner, and the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for a speech recognition method, device, system, server, computer-readable storage medium, and computer program product, since they are basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.
以上所述仅为本发明的较佳实施例,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。The above description is only a preferred embodiment of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.