Technical Field
The present invention relates to the field of speech recognition, and in particular to a conference transcription method, device, and system based on intelligent speech separation and recognition.
Background Art
In the modern business environment, remote work and online meetings have become the norm. Team members may be distributed across different geographical locations and need to hold regular video conferences to discuss projects, make decisions, or conduct strategic planning. Conference systems use video, audio, and collaboration tools to let team members in different locations participate in meetings, improving meeting accessibility, enabling teams to collaborate more flexibly, and promoting information sharing and decision making. However, many conference transcription systems share a common problem: when multiple people talk over one another, the speech recognition module cannot automatically distinguish the roles of the participants from the speech signal, and the overlap degrades the automatic speech recognition (ASR) results.
Traditional intelligent conference transcription systems neglect the multi-speaker speech separation task. When the speech signals of several speakers overlap, speaker labeling is difficult, and accurately identifying the different speakers in the audio is itself a challenge. In other words, it is hard to determine which portions of the recognized text come from which speaker, so speaker roles cannot be distinguished. The speech recognition system then cannot accurately transcribe each speaker's content, and the recognized text suffers from misrecognition, missed recognition, or substitution errors, which in turn affects subsequent business processing.
The above content is provided only to assist in understanding the technical solution of the present invention and does not constitute an admission that it is prior art.
Summary of the Invention
The main purpose of the present invention is to provide a conference transcription method, device, and system based on intelligent speech separation and recognition, aiming to solve the technical problem that, in a multi-person conference, several speakers talking at the same time causes their speech signals to overlap, making it difficult for traditional conference transcription systems to distinguish the roles of different speakers and preventing the speech recognition module from accurately recognizing the meeting content.
To achieve the above object, the present invention provides a conference transcription method based on intelligent speech separation and recognition, the method comprising the following steps:
capturing the speech of the participants through a microphone array;
segmenting the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information;
matching the sub-speech information against a participant voice library to determine the speaker corresponding to each piece of sub-speech information; and
converting each piece of sub-speech information into text output attributed to its speaker through a speech recognition module.
Optionally, segmenting the captured speech through the end-to-end speech separation module to obtain the plurality of pieces of sub-speech information includes:
determining whether the set of participants is known;
if the participants are known, obtaining the number of participants through a participant determination module; and
segmenting the captured speech through the end-to-end speech separation module based on the number of participants to obtain the plurality of pieces of sub-speech information.
Optionally, the conference transcription method further includes:
if the participants are not known, identifying speakers by clustering and dividing the audio into different speaker groups according to its voice components to obtain a preliminary separation result; and
segmenting the captured speech through the end-to-end speech separation module based on the preliminary separation result to obtain the plurality of pieces of sub-speech information.
Optionally, identifying speakers by clustering and dividing the audio into different speaker groups to obtain the preliminary separation result includes:
splitting the audio corresponding to the captured speech into a plurality of audio segments of preset length;
extracting the voiceprint features of each audio segment through a deep-learning encoding module;
combining all voiceprint features into a voiceprint feature matrix, wherein the horizontal axis of the matrix represents the time dimension and the vertical axis represents the voiceprint dimension;
obtaining an audio-segment similarity matrix from the voiceprint feature matrix and its transpose, wherein the value of each element in the similarity matrix represents the similarity between the voiceprint of the audio segment indexed by that element's row and the voiceprint of the audio segment indexed by that element's column;
clustering the audio segments using the similarity matrix and K-means clustering, wherein two audio segments whose similarity meets a set threshold are grouped into one class; and
combining the audio segments of each class into one audio stream to determine the number of participants.
Optionally, the conference transcription method further includes:
classifying the audio features through a plurality of one-dimensional convolutional layers, batch normalization, and non-linear activation layers to obtain the plurality of pieces of sub-speech information, each piece being the separated speech of one participant.
Optionally, the conference transcription method further includes:
providing an interactive user interface for the participants, so that the participants can correct the transcribed meeting content in real time, and so that the meeting recorder can manually fix errors in the recognized output text and annotate specific parts of the record.
In addition, to achieve the above purpose, the present invention also proposes a conference transcription device based on intelligent speech separation and recognition, the device comprising:
an acquisition module, configured to capture the speech of the participants through a microphone array;
a segmentation module, configured to segment the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information;
a matching module, configured to match the sub-speech information against a participant voice library to determine the speaker corresponding to each piece of sub-speech information; and
a recognition module, configured to convert each piece of sub-speech information into text output attributed to its speaker through a speech recognition module.
Optionally, the segmentation module is further configured to determine whether the set of participants is known;
if the participants are known, obtain the number of participants through a participant determination module; and
segment the captured speech through the end-to-end speech separation module based on the number of participants to obtain the plurality of pieces of sub-speech information.
Optionally, the segmentation module is further configured to, if the participants are not known, identify speakers by clustering and divide the audio into different speaker groups according to voice components to obtain a preliminary separation result; and
segment the captured speech through the end-to-end speech separation module based on the preliminary separation result to obtain the plurality of pieces of sub-speech information.
In addition, to achieve the above purpose, the present invention also proposes a conference transcription system based on intelligent speech separation and recognition, the system comprising: a memory, a processor, and a conference transcription program for intelligent speech separation and recognition stored in the memory and executable on the processor, the program being configured to implement the steps of the conference transcription method described above.
The present invention captures the speech of the participants through a microphone array; segments the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information; matches the sub-speech information against a participant voice library to determine the speaker corresponding to each piece; and converts each piece into text output attributed to its speaker through a speech recognition module. In this way, combined with an efficient speech separation algorithm, overlapping speech signals can be effectively separated and recognized, so that the roles of the participants can be distinguished and each participant's utterances can be accurately transcribed, improving work efficiency, improving the communication experience, and providing users with more convenient, secure, and efficient service.
Brief Description of the Drawings
FIG. 1 is a flow chart of a first embodiment of the conference transcription method based on intelligent speech separation and recognition according to the present invention;
FIG. 2 is a framework diagram of the intelligent conference system in an embodiment of the conference transcription method;
FIG. 3 is a flow chart of speech separation in the conference transcription method;
FIG. 4 is a structural diagram of the end-to-end speech separation module in the conference transcription method;
FIG. 5 is a structural block diagram of a first embodiment of the conference transcription device based on intelligent speech separation and recognition according to the present invention.
The realization of the objects, functional features, and advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
An embodiment of the present invention provides a conference transcription method based on intelligent speech separation and recognition. Referring to FIG. 1, FIG. 1 is a flow chart of a first embodiment of the method.
In this embodiment, the conference transcription method includes the following steps.
Step S10: capturing the speech of the participants through a microphone array.
In this embodiment, the executing entity is the conference transcription system for intelligent speech separation and recognition, which has functions such as data processing, data communication, and program execution. The system may be a terminal data-processing device such as a computer, or any other device with similar functions; this embodiment imposes no limitation in this respect.
It should be noted that, in the modern business environment, remote work and online meetings have become the norm. Team members may be distributed across different geographical locations and need to hold regular video conferences to discuss projects, make decisions, or conduct strategic planning. Conference systems use video, audio, and collaboration tools to let team members in different locations participate in meetings, improving meeting accessibility, enabling teams to collaborate more flexibly, and promoting information sharing and decision making. However, many conference transcription systems share a common problem: when multiple people talk over one another, the speech recognition module cannot automatically distinguish the roles of the participants from the speech signal, and the overlap degrades the automatic speech recognition (ASR) results. Traditional intelligent conference transcription systems neglect the multi-speaker speech separation task. When the speech signals of several speakers overlap, speaker labeling is difficult, and accurately identifying the different speakers in the audio is itself a challenge. In other words, it is hard to determine which portions of the recognized text come from which speaker, so speaker roles cannot be distinguished. The speech recognition system then cannot accurately transcribe each speaker's content, and the recognized text suffers from misrecognition, missed recognition, or substitution errors, which in turn affects subsequent business processing.
To solve the above technical problems, this embodiment captures the speech of the participants through a microphone array; segments the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information; matches the sub-speech information against a participant voice library to determine the speaker corresponding to each piece; and converts each piece into text output attributed to its speaker through a speech recognition module. In this way, combined with an efficient speech separation algorithm, overlapping speech signals can be effectively separated and recognized, so that the roles of the participants can be distinguished and each participant's utterances can be accurately transcribed, improving work efficiency, improving the communication experience, and providing users with more convenient, secure, and efficient service. Specifically, this can be achieved as follows.
In a specific implementation, the overall framework of the intelligent conference system is first described with reference to FIG. 2. The intelligent conference system includes a participant determination module, a speech separation module, and a speech recognition module. When the participants are known, the number of participants can be obtained directly through the participant determination module; the speech separation module separates out each participant's specific speech; and the speech recognition module converts the separated speech into text for output. The overall scheme of this embodiment is as follows. Step 1: capture the participants' voices through a microphone array. Step 2: in the speech separation module, if the participants are known, obtain the number of participants through the participant determination module and segment the sound captured by the microphones through the end-to-end speech separation module. Step 3: if the participants are not known, identify and assign voice components to different speaker groups by clustering as a preliminary separation, then pass the preliminary separation result to the end-to-end model to further improve separation quality. Step 4: match the sub-speech information produced by the separation model against the participant voice library to determine the speaker. Step 5: pass the sub-speech information through the speech recognition module to convert the acoustic features into text output.
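The control flow of Steps 1 through 5 can be sketched as follows. This is a minimal illustration only: all callables (`capture`, `count_participants`, and so on) are hypothetical placeholders standing in for the modules of FIG. 2, not part of the invention itself.

```python
# Minimal control-flow sketch of Steps 1-5. Every component callable is a
# hypothetical placeholder to be supplied by the surrounding system.
from typing import Callable, List, Optional, Tuple


def transcribe_meeting(
    capture: Callable[[], bytes],                      # Step 1: microphone array
    count_participants: Optional[Callable[[], int]],   # participant determination module
    cluster_separate: Callable[[bytes], list],         # Step 3: clustering pre-separation
    e2e_separate: Callable[..., list],                 # Steps 2/3: end-to-end separation
    match_speaker: Callable[[bytes], str],             # Step 4: voice-library matching
    recognize: Callable[[bytes], str],                 # Step 5: ASR
) -> List[Tuple[str, str]]:
    mixture = capture()
    if count_participants is not None:
        # Participants known: separate directly using the known speaker count.
        subs = e2e_separate(mixture, num_speakers=count_participants())
    else:
        # Participants unknown: cluster first, then refine with the e2e model.
        prelim = cluster_separate(mixture)
        subs = e2e_separate(mixture, preliminary=prelim)
    # Attribute each separated stream to a speaker and transcribe it.
    return [(match_speaker(s), recognize(s)) for s in subs]
```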
In a specific implementation, this embodiment captures the speech of the participants through a microphone array.
Step S20: segmenting the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information.
When segmenting the speech, this embodiment first determines whether the set of participants is known, and handles the segmentation separately for the two cases: participants known and participants unknown.
Specifically, if the participants are known, the number of participants can be obtained directly through the participant determination module, and the captured speech is then segmented through the end-to-end speech separation module based on that number to obtain the plurality of pieces of sub-speech information.
If the participants are not known, this embodiment first identifies speakers by clustering and divides the audio into different speaker groups according to its voice components to obtain a preliminary separation result, and then segments the captured speech through the end-to-end speech separation module based on that result to obtain the plurality of pieces of sub-speech information. The role of cluster-based separation is, when the participants are unknown, to identify and assign voice components to different speaker groups: as a preliminary separation it determines who is talking, its result is passed to the end-to-end model to further improve separation quality, and the number of participants can be determined from it.
In this embodiment, the speech separation process is further explained with reference to the flow chart of the speech separation module shown in FIG. 3. An encoding network extracts the voiceprint features of each speech segment. The features are combined into a feature matrix whose horizontal axis represents the time dimension and whose vertical axis represents the voiceprint dimension. Next, an audio similarity matrix is constructed by multiplying the feature matrix by its transpose; the value of each element represents the similarity between the voiceprints of the audio segments indexed by that element's row and column. The segments are then clustered using the similarity matrix and K-means clustering: two segments whose similarity meets a set threshold are grouped into one class, and the segments of each class are combined into one audio stream, from which the number of participants is obtained.
Specifically, the audio corresponding to the captured speech is split into a plurality of segments of preset length; the voiceprint features of each segment are extracted through a deep-learning encoding module; all voiceprint features are combined into a voiceprint feature matrix (horizontal axis: time dimension; vertical axis: voiceprint dimension); an audio-segment similarity matrix is obtained from the voiceprint feature matrix and its transpose, where each element's value represents the similarity between the voiceprints of the segments indexed by its row and column; the segments are clustered using the similarity matrix and K-means clustering, with two segments whose similarity meets the set threshold grouped into one class; and the segments of each class are combined into one audio stream. The number of resulting streams equals the number of participants, so the number of participants can be determined.
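A minimal NumPy sketch of this counting procedure follows. It assumes the encoding module has already produced one voiceprint embedding per segment (rows of `embeddings`); after row normalization, the product of the feature matrix with its transpose holds cosine similarities. The sketch implements only the thresholded grouping rule stated above (transitively grouping any two segments whose similarity meets the threshold), not the full K-means step of the embodiment, and the threshold value 0.8 is illustrative.

```python
import numpy as np


def count_speakers(embeddings: np.ndarray, threshold: float = 0.8) -> int:
    """Estimate the number of participants from per-segment voiceprint
    embeddings (rows = audio segments over time, columns = voiceprint dims)."""
    # Row-normalize so that F @ F.T is the cosine-similarity matrix.
    F = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = F @ F.T                       # audio-segment similarity matrix
    n = len(S)
    # Union-find: segments whose similarity meets the threshold share a class.
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] >= threshold:
                parent[find(i)] = find(j)
    # Number of distinct classes = estimated number of participants.
    return len({find(i) for i in range(n)})
```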
Finally, speech separation is performed through the end-to-end model. The structure of the end-to-end separation model of the speech separation module in this embodiment is shown in FIG. 4: the audio features are classified through a plurality of one-dimensional convolutional layers, batch normalization, and non-linear activation layers to obtain the plurality of pieces of sub-speech information, each piece being the separated speech of one participant.
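A single one-dimensional convolution → batch-normalization → ReLU stage of such a stack might look as follows in NumPy. This is a toy sketch of the layer pattern only, not the actual model of FIG. 4; the channel counts, kernel size, and weights in the usage are placeholders.

```python
import numpy as np


def conv_bn_relu(x: np.ndarray, w: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """One 1-D convolution + batch-normalization + ReLU stage.

    x: input features, shape (channels_in, time)
    w: convolution kernels, shape (channels_out, channels_in, kernel_size)
    """
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1               # "valid" convolution length
    y = np.zeros((c_out, t_out))
    for o in range(c_out):                   # naive direct convolution
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    # Batch normalization: zero mean / unit variance per output channel.
    mean = y.mean(axis=1, keepdims=True)
    var = y.var(axis=1, keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)
    return np.maximum(y, 0.0)                # ReLU non-linear activation
```

Stacking several such stages and ending with a per-speaker output head would yield one separated stream per participant, as the embodiment describes.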
Step S30: matching the sub-speech information against the participant voice library to determine the speaker corresponding to each piece of sub-speech information.
After the sub-speech information has been separated, this embodiment applies speaker identification to the separated speech data: through voice feature extraction and pattern matching, each speaker's voice is associated with his or her identity. Once a speaker is identified, the system creates an independent audio channel for that speaker. In this way, the speaker corresponding to each piece of sub-speech information can be determined, i.e., the correspondence between each piece of speech and a specific participant, which facilitates the subsequent meeting record.
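This matching step can be sketched as a nearest-neighbor lookup. The sketch assumes the participant voice library stores one voiceprint embedding per enrolled participant and that matching is done by highest cosine similarity; the library structure and enrollment names in the usage are illustrative assumptions, not details from the embodiment.

```python
import numpy as np


def identify_speaker(segment_emb: np.ndarray,
                     library: dict) -> str:
    """Return the enrolled participant whose library voiceprint is most
    similar (by cosine similarity) to the embedding of one separated
    sub-speech stream. `library` maps participant name -> embedding."""
    v = segment_emb / np.linalg.norm(segment_emb)
    best_name, best_sim = None, -1.0
    for name, ref in library.items():
        sim = float(v @ (ref / np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```

In practice the returned name would key the independent audio channel and the attributed transcript line for that participant.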
Step S40: converting each piece of sub-speech information into text output attributed to its speaker through the speech recognition module.
In this implementation, the received sub-speech information is finally converted into text by the speech recognition module and output. It should be noted that all speech data and translated text from the meeting are recorded and stored; these data can be retrieved for later review or sharing, and the speech data can also be converted into text format to make it easier to search.
Furthermore, the intelligent conference system in this embodiment also provides a transcript correction function through an interactive user interface, so that participants can interact with the system to correct the transcribed meeting content in real time, and the meeting recorder can manually fix errors in the recognized output text and annotate specific parts of the record for later reference.
Furthermore, the intelligent conference system in this embodiment also supports conference management: a company can centrally manage and monitor meetings to ensure smooth communication among the participants, and managers can view meeting data in real time, including speaker information, translated text, and voice recordings.
In addition, the intelligent conference system provides permission and privacy settings to ensure the security of sensitive information. Only authorized personnel can access and manage meeting data, making meetings more efficient and manageable. The recording and retrieval functions also support knowledge management and decision making: participants can communicate more easily, and speech content is automatically separated, recognized, and recorded.
This embodiment captures the speech of the participants through a microphone array; segments the captured speech through an end-to-end speech separation module to obtain a plurality of pieces of sub-speech information; matches the sub-speech information against the participant voice library to determine the speaker corresponding to each piece; and converts each piece into text output attributed to its speaker through a speech recognition module. In this way, combined with an efficient speech separation algorithm, overlapping speech signals can be effectively separated and recognized, so that the roles of the participants can be distinguished and each participant's utterances can be accurately transcribed, improving work efficiency, improving the communication experience, and providing users with more convenient, secure, and efficient service.
参照图5,图5为本发明智能语音分离与识别的会议转录装置第一实施例的结构框图。5 , which is a structural block diagram of a first embodiment of a conference transcription device for intelligent speech separation and recognition according to the present invention.
如图5所示,本发明实施例提出的智能语音分离与识别的会议转录装置包括:As shown in FIG5 , the conference transcription device for intelligent speech separation and recognition proposed in an embodiment of the present invention includes:
获取模块10,用于通过麦克风阵列捕捉与会人员的语音信息;An acquisition module 10, configured to capture participants' speech through a microphone array;
分割模块20,用于通过端到端语音分离模块对捕捉到的语音信息进行分割,得到多个子语音信息;A segmentation module 20, configured to segment the captured speech into multiple sub-speech signals through an end-to-end speech separation module;
匹配模块30,用于将所述子语音信息和与会人员语音信息库进行匹配,以确定各个子语音信息对应的发音对象;A matching module 30, configured to match the sub-speech signals against a participant voiceprint library to determine the speaker of each sub-speech signal;
识别模块40,用于将各个子语音信息通过语音识别模块转换为发音对象的文本输出。A recognition module 40, configured to convert each sub-speech signal into text attributed to its speaker through a speech recognition module.
本实施例通过麦克风阵列捕捉与会人员的语音信息;通过端到端语音分离模块对捕捉到的语音信息进行分割,得到多个子语音信息;将所述子语音信息和与会人员语音信息库进行匹配,以确定各个子语音信息对应的发音对象;将各个子语音信息通过语音识别模块转换为发音对象的文本输出。通过上述方式,结合高效的语音分离算法,能够有效地分离和识别重叠的语音信号,使得能够区分与会人员的角色并且每个与会人员的说话内容都能够被准确地识别出来,提高工作效率,改善沟通交流体验,并为用户提供更加便捷、安全和高效的服务。This embodiment captures participants' speech through a microphone array; segments the captured speech into multiple sub-speech signals through an end-to-end speech separation module; matches the sub-speech signals against a participant voiceprint library to determine the speaker of each sub-speech signal; and converts each sub-speech signal into text attributed to its speaker through a speech recognition module. In this way, combined with an efficient speech separation algorithm, overlapping speech signals can be effectively separated and recognized, so that participant roles can be distinguished and each participant's utterances can be accurately transcribed, improving work efficiency, enhancing the communication experience, and providing users with more convenient, secure, and efficient service.
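The matching step performed by module 30 can be sketched as a cosine-similarity lookup of a separated sub-speech embedding against an enrolled voiceprint library. This is only an illustrative sketch: the embedding dimension, the threshold value, and the helper name `identify_speaker` are assumptions, not details from the disclosure.

```python
import numpy as np

def identify_speaker(sub_embedding, library, threshold=0.6):
    """Match one separated sub-speech embedding against enrolled
    participant voiceprints (dict: name -> embedding vector).
    Returns the best-matching participant, or None when no enrolled
    voiceprint is similar enough (the threshold is an assumed value)."""
    best_name, best_sim = None, -1.0
    v = sub_embedding / np.linalg.norm(sub_embedding)
    for name, ref in library.items():
        sim = float(np.dot(v, ref / np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

# Toy library of two enrolled participants (random but fixed vectors).
rng = np.random.default_rng(0)
library = {"alice": rng.normal(size=128), "bob": rng.normal(size=128)}
# A query that is a lightly perturbed copy of alice's voiceprint
# should map back to "alice".
query = library["alice"] + 0.05 * rng.normal(size=128)
print(identify_speaker(query, library))  # → alice
```

An unknown voice (low similarity to every enrolled voiceprint) returns None, which a system could route to a generic "Speaker N" label instead of a named participant.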
在一实施例中,所述分割模块20,还用于判断与会人员是否确定;若与会人员确定,则通过与会人员确定模块获取与会人员的人员数量信息;基于所述人员数量信息通过端到端语音分离模块对捕捉到的语音信息进行分割,得到多个子语音信息。In one embodiment, the segmentation module 20 is further configured to determine whether the set of participants is known; if the participants are known, to obtain the participant count through the participant determination module; and, based on that count, to segment the captured speech into multiple sub-speech signals through the end-to-end speech separation module.
在一实施例中,所述分割模块20,还用于若与会人员不确定,则通过聚类方式进行识别并按照声音成分将与会人员划分至不同的说话者群组,以得到初步分离结果;基于所述初步分离结果通过端到端语音分离模块对捕捉到的语音信息进行分割,得到多个子语音信息。In one embodiment, the segmentation module 20 is further configured to, if the set of participants is unknown, identify speakers by clustering and divide the audio into different speaker groups according to its voice components to obtain a preliminary separation result; and, based on that preliminary separation result, to segment the captured speech into multiple sub-speech signals through the end-to-end speech separation module.
在一实施例中,所述分割模块20,还用于将获取到的语音信息对应的音频片段切分为多个预设长度的音频片段;通过深度学习编码模块提取各个音频片段的声纹特征;将所有的声纹特征组合成声纹特征矩阵,所述声纹特征矩阵的横坐标表示时间维度,所述声纹特征矩阵的纵坐标表示声纹维度;根据所述声纹特征矩阵和所述声纹特征矩阵的转置矩阵得到音频片段相似度矩阵,所述音频片段相似度矩阵中每个元素的值代表了元素所在行的索引对应的音频片段声纹信息和元素所在列的索引对应的音频片段声纹信息的相似度;利用相似度矩阵和K-means聚类的方式,以对各个音频片段进行聚类,其中,各个音频片段之间相似度符合设定阈值的两个音频片段聚为一类;将同一类别的音频片段组合成一个音频,以确定与会人员的人员数量信息。In one embodiment, the segmentation module 20 is further configured to: split the audio corresponding to the acquired speech into multiple segments of a preset length; extract a voiceprint feature from each segment through a deep-learning encoder; stack all voiceprint features into a voiceprint feature matrix, whose horizontal axis represents the time dimension and whose vertical axis represents the voiceprint dimension; multiply the voiceprint feature matrix by its transpose to obtain a segment similarity matrix, in which the value of each element represents the similarity between the voiceprints of the segments indexed by the element's row and column; cluster the segments using the similarity matrix together with K-means clustering, where two segments whose similarity meets a set threshold fall into the same cluster; and combine the segments of each cluster into one audio stream, thereby determining the number of participants.
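A minimal numpy sketch of this segment-clustering step: per-segment voiceprint vectors form a (segments × features) matrix, the similarity matrix is that matrix times its transpose (on unit-normalised rows), and segments whose pairwise similarity meets the threshold are grouped. A simple union-find grouping stands in here for the K-means step named in the embodiment; the feature dimension, threshold, and synthetic data are illustrative assumptions.

```python
import numpy as np

def count_speakers(embeddings, threshold=0.8):
    """embeddings: (n_segments, dim) voiceprint features, one row per
    fixed-length audio segment. Builds the similarity matrix E @ E.T
    on unit-normalised rows, then merges segment pairs whose similarity
    meets the threshold (union-find in place of K-means).
    Returns (n_groups, labels)."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = E @ E.T                      # similarity matrix, S[i, j] in [-1, 1]
    n = len(E)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] >= threshold:
                parent[find(i)] = find(j)   # merge the two groups

    roots = {find(i) for i in range(n)}
    labels = [sorted(roots).index(find(i)) for i in range(n)]
    return len(roots), labels

# Two synthetic "speakers": segments are noisy copies of two base voiceprints.
rng = np.random.default_rng(1)
a, b = rng.normal(size=64), rng.normal(size=64)
segs = np.stack([a + 0.05 * rng.normal(size=64) for _ in range(3)]
                + [b + 0.05 * rng.normal(size=64) for _ in range(3)])
n, labels = count_speakers(segs)
print(n)  # → 2
```

The group count `n` is exactly the participant-count estimate the embodiment feeds back into the separation module when the attendee list is unknown.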
在一实施例中,所述分割模块20,还用于通过多个1维卷积层、批归一化和非线性激活层对音频特征进行分类,得到多个子语音信息,所述子语音信息为分离后每个与会人员的语音信息。In one embodiment, the segmentation module 20 is further configured to classify the audio features through multiple 1-D convolutional layers, batch normalization, and non-linear activation layers to obtain multiple sub-speech signals, each sub-speech signal being one participant's speech after separation.
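The separation head described above (stacked 1-D convolutions, batch normalisation, and a non-linear activation mapping mixed features to one output channel per speaker) can be sketched in plain numpy. The layer sizes, kernel width, and random weights below are placeholders for illustration, not the disclosed architecture.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution. x: (C_in, T); w: (C_out, C_in, K); b: (C_out,)."""
    C_out, C_in, K = w.shape
    T_out = x.shape[1] - K + 1
    out = np.empty((C_out, T_out))
    for t in range(T_out):
        out[:, t] = np.tensordot(w, x[:, t:t + K], axes=([1, 2], [0, 1])) + b
    return out

def batch_norm(x, eps=1e-5):
    """Normalise each channel over the time axis."""
    m = x.mean(axis=1, keepdims=True)
    v = x.var(axis=1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def relu(x):
    return np.maximum(x, 0.0)

def separation_head(features, n_speakers=2, hidden=8, k=3, seed=0):
    """Map mixed audio features (C, T) to one output channel per speaker:
    conv1d -> batch norm -> ReLU -> conv1d."""
    rng = np.random.default_rng(seed)
    C = features.shape[0]
    w1 = rng.normal(scale=0.1, size=(hidden, C, k)); b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.1, size=(n_speakers, hidden, k)); b2 = np.zeros(n_speakers)
    h = relu(batch_norm(conv1d(features, w1, b1)))
    return conv1d(h, w2, b2)        # shape: (n_speakers, T - 2*(k-1))

mix = np.random.default_rng(2).normal(size=(4, 50))   # 4-channel mixed features
out = separation_head(mix)
print(out.shape)  # → (2, 46)
```

In a trained system the weights would of course be learned; the point of the sketch is only the layer ordering and the per-speaker output channels.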
在一实施例中,所述智能语音分离与识别的会议转录装置还包括交互模块;In one embodiment, the conference transcription device for intelligent speech separation and recognition further includes an interaction module;
所述交互模块,用于为与会人员提供交互式用户界面,以使与会人员实时校正转录的会议内容,以及以便会议记录员手动调整识别输出的文本中的漏洞和记录的特定内容。The interaction module is configured to provide participants with an interactive user interface, so that participants can correct the transcribed meeting content in real time, and so that the meeting recorder can manually fix errors or omissions in the recognized output text and adjust specific recorded content.
本申请实施例还提供了一种智能语音分离与识别的会议转录系统,包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过通信总线完成相互间的通信,存储器,用于存放计算机程序;处理器,用于执行存储器上所存放的程序时,实现上述智能语音分离与识别的会议转录方法。An embodiment of the present application also provides a conference transcription system for intelligent speech separation and recognition, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus, and the memory is used to store computer programs; the processor is used to implement the above-mentioned conference transcription method for intelligent speech separation and recognition when executing the program stored in the memory.
上述智能语音分离与识别的会议转录系统提到的通信总线可以是外设部件互联标准(英文:Peripheral Component Interconnect,简称:PCI)总线或扩展工业标准结构(英文:Extended Industry Standard Architecture,简称:EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。The communication bus mentioned in the above-mentioned intelligent speech separation and recognition conference transcription system can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus can be divided into an address bus, a data bus, a control bus, etc.
通信接口用于上述智能语音分离与识别的会议转录系统与其他设备之间的通信。The communication interface is used for communication between the above-mentioned intelligent speech separation and recognition conference transcription system and other devices.
存储器可以包括随机存取存储器(英文:Random Access Memory,简称:RAM),也可以包括非易失性存储器(英文:Non-Volatile Memory,简称:NVM),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离前述处理器的存储装置。The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.
上述的处理器可以是通用处理器,包括中央处理器(英文:Central Processing Unit,简称:CPU)、网络处理器(英文:Network Processor,简称:NP)等;还可以是数字信号处理器(英文:Digital Signal Processor,简称:DSP)、专用集成电路(英文:Application Specific Integrated Circuit,简称:ASIC)、现场可编程门阵列(英文:Field-Programmable Gate Array,简称:FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。In the above embodiments, it can be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented using software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from a website site, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) mode to another website site, computer, server or data center. The computer-readable storage medium may be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more available media integrated. The available medium may be a magnetic medium, (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive Solid State Disk (SSD)), etc.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this article, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, the elements defined by the sentence "comprise a ..." do not exclude the existence of other identical elements in the process, method, article or device including the elements.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner, and the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.
以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit the same. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that the technical solutions described in the aforementioned embodiments may still be modified, or some of the technical features may be replaced by equivalents. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
应当理解的是,以上仅为举例说明,对本发明的技术方案并不构成任何限定,在具体应用中,本领域的技术人员可以根据需要进行设置,本发明对此不做限制。It should be understood that the above is only an example and does not constitute any limitation on the technical solution of the present invention. In specific applications, technicians in this field can make settings as needed, and the present invention does not limit this.
需要说明的是,以上所描述的工作流程仅仅是示意性的,并不对本发明的保护范围构成限定,在实际应用中,本领域的技术人员可以根据实际的需要选择其中的部分或者全部来实现本实施例方案的目的,此处不做限制。It should be noted that the workflow described above is merely illustrative and does not limit the scope of protection of the present invention. In practical applications, technicians in this field can select part or all of them according to actual needs to achieve the purpose of the present embodiment, and no limitation is made here.
另外,未在本实施例中详尽描述的技术细节,可参见本发明任意实施例所提供的智能语音分离与识别的会议转录方法,此处不再赘述。In addition, for technical details not described in detail in this embodiment, please refer to the conference transcription method for intelligent speech separation and recognition provided in any embodiment of the present invention, which will not be repeated here.
此外,需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。In addition, it should be noted that, in this article, the terms "include", "comprises" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or system. In the absence of further restrictions, an element defined by the sentence "comprises a ..." does not exclude the existence of other identical elements in the process, method, article or system including the element.
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the above embodiments of the present invention are only for description and do not represent the advantages or disadvantages of the embodiments.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如只读存储器(Read Only Memory,ROM)/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disk), and includes a number of instructions for a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present invention.
以上仅为本发明的优选实施例,并非因此限制本发明的专利范围,凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本发明的专利保护范围内。The above are only preferred embodiments of the present invention, and are not intended to limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the present invention specification and drawings, or directly or indirectly applied in other related technical fields, are also included in the patent protection scope of the present invention.
可理解的是,本发明实施例提供的系统与本发明实施例提供的方法相对应,相关内容的解释、举例和有益效果可以参考上述方法中的相应部分。It is understandable that the system provided by the embodiment of the present invention corresponds to the method provided by the embodiment of the present invention, and the explanation, examples and beneficial effects of the relevant contents can refer to the corresponding parts in the above method.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410916243.3ACN118782073A (en) | 2024-07-09 | 2024-07-09 | Conference transcription method, device and system for intelligent speech separation and recognition |
| Publication Number | Publication Date |
|---|---|
| CN118782073Atrue CN118782073A (en) | 2024-10-15 |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119741936A (en)* | 2024-12-09 | 2025-04-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Chorus audio segmentation method, computer device, storage medium and program product |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180197548A1 (en)* | 2017-01-09 | 2018-07-12 | Onu Technology Inc. | System and method for diarization of speech, automated generation of transcripts, and automatic information extraction |
| CN108922538A (en)* | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
| CN109388701A (en)* | 2018-08-17 | 2019-02-26 | 深圳壹账通智能科技有限公司 | Minutes generation method, device, equipment and computer storage medium |
| CN110021302A (en)* | 2019-03-06 | 2019-07-16 | 厦门快商通信息咨询有限公司 | A kind of Intelligent office conference system and minutes method |
| CN116153328A (en)* | 2023-02-10 | 2023-05-23 | 京东科技信息技术有限公司 | Audio data processing method, system, storage medium and electronic equipment |
| CN118038886A (en)* | 2024-02-18 | 2024-05-14 | 南京龙垣信息科技有限公司 | Heterogeneous multi-speaker self-adaptive voice conference recording system and method |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |