技术领域technical field
本发明属于语音识别的技术领域,尤其涉及说话人相关的端到端语音端点检测方法和装置。The present invention belongs to the technical field of speech recognition, and in particular relates to a speaker-dependent end-to-end voice endpoint detection method and device.
背景技术Background technique
相关技术中,语音端点检测(Voice Activity Detection,VAD),是语音识别,说话人识别等任务非常重要的预处理步骤。一个基础的语音端点检测系统的目标是要去除音频中的静音部分,而更加通用的检测系统可以去掉音频中所有不相关的部分,包括噪声和非目标说话人的语音。In the related art, Voice Activity Detection (VAD) is a very important preprocessing step for tasks such as speech recognition and speaker recognition. A basic speech endpoint detection system aims to remove the silent parts of the audio, while a more general detection system can remove all irrelevant parts of the audio, including noise and non-target speaker speech.
现有的方案主要是针对有背景人声干扰的场景(例如餐厅等),提出了一种鲁棒的语音端点系统,可以提取目标说话人的语音部分。相关技术所提出的系统是基于高斯混合模型(GMM)的,并且在语音和噪声分别建模的基础上,使用了一个额外的GMM模型对目标说话人进行建模,即用三个GMM来达到提取目标说话人语音部分的目标。Existing solutions are mainly aimed at scenarios with background speech interference (such as restaurants) and propose a robust voice endpoint system that can extract the target speaker's speech. The system proposed in the related art is based on the Gaussian mixture model (GMM): on top of modeling speech and noise separately, an additional GMM is used to model the target speaker, that is, three GMMs in total are used to extract the target speaker's speech.
发明人在实现本申请的过程中发现,现有的方案至少存在以下缺陷:During the process of realizing the present application, the inventor found that the existing solution at least has the following defects:
其余非目标说话人的声音是被看作背景噪声的(目标说话人的能量明显高于其余说话人),并不适用于多人对话的场景。其次在面对复杂环境时,这种技术的检测准确率会有明显降低。The voices of other non-target speakers are regarded as background noise (the energy of the target speaker is significantly higher than that of the other speakers), which is not suitable for multi-person dialogue scenarios. Secondly, in the face of complex environments, the detection accuracy of this technology will be significantly reduced.
发明内容SUMMARY OF THE INVENTION
本发明实施例提供一种说话人相关的端到端语音端点检测方法和装置,用于至少解决上述技术问题之一。Embodiments of the present invention provide a speaker-dependent end-to-end voice endpoint detection method and device, which are used to solve at least one of the above technical problems.
第一方面,本发明实施例提供一种说话人相关的端到端语音端点检测方法,包括:提取待检测语音的声学特征;将所述声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;将所述新的输入特征输入至神经网络中进行训练并输出所述待检测语音是否为目标说话人语音的预测结果。In a first aspect, an embodiment of the present invention provides a speaker-dependent end-to-end voice endpoint detection method, including: extracting acoustic features of the speech to be detected; splicing the acoustic features with the i-vector features of the target speaker to form new input features; and inputting the new input features into a neural network for training and outputting a prediction result of whether the speech to be detected is the target speaker's speech.
第二方面,本发明实施例提供一种说话人相关的端到端语音端点检测装置,包括:提取模块,配置为提取待检测语音的声学特征;拼接模块,配置为将所述声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;以及输出模块,配置为将所述新的输入特征输入至神经网络中进行训练并输出所述待检测语音是否为目标说话人语音的预测结果。In a second aspect, an embodiment of the present invention provides a speaker-dependent end-to-end voice endpoint detection device, including: an extraction module configured to extract acoustic features of the speech to be detected; a splicing module configured to splice the acoustic features with the i-vector features of the target speaker to form new input features; and an output module configured to input the new input features into a neural network for training and to output a prediction result of whether the speech to be detected is the target speaker's speech.
第三方面,提供一种电子设备,其包括:至少一个处理器,以及与所述至少一个处理器通信连接的存储器,其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行本发明任一实施例的说话人相关的端到端语音端点检测方法的步骤。In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the speaker-dependent end-to-end voice endpoint detection method of any embodiment of the present invention.
第四方面,本发明实施例还提供一种计算机程序产品,所述计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行本发明任一实施例的说话人相关的端到端语音端点检测方法的步骤。In a fourth aspect, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the steps of the speaker-dependent end-to-end voice endpoint detection method according to any embodiment of the present invention.
本申请的方法和装置提供的方案对不同说话人提取相应的区分性特征,然后将该特征加入到语音端点检测系统中,利用深度学习的方法提高了噪声环境下语音端点检测的鲁棒性。进一步地,本申请的方案不仅针对对话场景下的语音端点检测提出了新的方案,更展现了使用说话人相关的特征来提升性能的各种可能。The solution provided by the method and device of the present application extracts corresponding distinguishing features for different speakers, and then adds the features to the speech endpoint detection system, using the deep learning method to improve the robustness of speech endpoint detection in a noisy environment. Further, the solution of the present application not only proposes a new solution for voice endpoint detection in a dialogue scenario, but also shows various possibilities of using speaker-related features to improve performance.
附图说明Description of drawings
为了更清楚地说明本发明实施例的技术方案,下面将对实施例描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.
图1为本发明一实施例提供的一种说话人相关的端到端语音端点检测方法的流程图;FIG. 1 is a flowchart of a speaker-related end-to-end voice endpoint detection method according to an embodiment of the present invention;
图2为本发明一实施例提供的基于LSTM的与说话人相关的VAD;FIG. 2 is a speaker-related VAD based on LSTM provided by an embodiment of the present invention;
图3为本发明一实施例提供的一种特征合并的方法;FIG. 3 is a method for combining features provided by an embodiment of the present invention;
图4分别为本发明一实施例提供的不同的系统测试用例的预测结果;FIG. 4 is respectively prediction results of different system test cases provided by an embodiment of the present invention;
图5为本发明一实施例提供的一种说话人相关的端到端语音端点检测装置的框图;5 is a block diagram of a speaker-related end-to-end voice endpoint detection device according to an embodiment of the present invention;
图6是本发明一实施例提供的电子设备的结构示意图。FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
具体实施方式Detailed ways
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments These are some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
请参考图1,其示出了本申请的说话人相关的端到端语音端点检测方法一实施例的流程图,本实施例的说话人相关的端到端语音端点检测方法可以适用于具备语言模型的终端,如智能语音电视、智能音箱、智能对话玩具以及其他现有的具备说话人检测功能的智能终端等。Please refer to FIG. 1, which shows a flowchart of an embodiment of the speaker-dependent end-to-end voice endpoint detection method of the present application. The speaker-dependent end-to-end voice endpoint detection method of this embodiment can be applied to terminals equipped with a language model, such as smart voice TVs, smart speakers, smart dialogue toys, and other existing smart terminals with a speaker detection function.
如图1所示,在步骤101中,提取待检测语音的声学特征;As shown in Figure 1, in step 101, the acoustic features of the speech to be detected are extracted;
在步骤102中,将声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;In step 102, the acoustic feature is spliced with the i-vector feature of the target speaker as a new input feature;
在步骤103中,将新的输入特征输入至神经网络中进行训练并输出待检测语音是否为目标说话人语音的预测结果。In step 103, the new input feature is input into the neural network for training and the prediction result of whether the speech to be detected is the speech of the target speaker is output.
在本实施例中,对于步骤101,说话人相关的端到端语音端点检测装置首先提取待检测语音的声学特征,然后在步骤102中,将提取的声学特征和同样是从待检测语音中提取的能够表征其身份的i-vector特征进行拼接,将拼接后的特征作为新的输入特征,由于i-vector特征携带有说话人信息,因此拼接之后能够更好地对说话人进行检测。之后,对于步骤103,将该新的输入特征输入至神经网络中对该神经网络进行训练并输出待检测语音是否为目标说话人语音的预测结果。In this embodiment, for step 101, the speaker-dependent end-to-end voice endpoint detection device first extracts the acoustic features of the speech to be detected. Then, in step 102, the extracted acoustic features are spliced with the i-vector features, likewise extracted from the speech to be detected, that characterize the speaker's identity, and the spliced features are used as the new input features; since the i-vector features carry speaker information, the speaker can be detected better after splicing. Afterwards, for step 103, the new input features are input into the neural network to train the network and to output a prediction result of whether the speech to be detected is the target speaker's speech.
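下面给出特征拼接这一步骤的一个示意性代码草图(使用Python/NumPy;36维声学特征和200维i-vector等维度取自下文实验部分,仅作示例,并非规范实现):The following is an illustrative Python/NumPy sketch of the feature splicing step; the 36-dimensional acoustic features and 200-dimensional i-vector are example dimensions taken from the experiment section below, not a normative implementation.

```python
# 示意性草图 / Illustrative sketch: splice per-frame acoustic features with the
# target speaker's i-vector to form the new input features.
import numpy as np

def build_input_features(acoustic_feats: np.ndarray, target_ivector: np.ndarray) -> np.ndarray:
    """acoustic_feats: (T, D_acoustic) frame-level features, e.g. 36-dim log filterbanks.
    target_ivector: (D_ivector,) fixed-length speaker representation, e.g. 200-dim i-vector.
    Returns: (T, D_acoustic + D_ivector) new input features."""
    T = acoustic_feats.shape[0]
    ivector_tiled = np.tile(target_ivector, (T, 1))   # repeat the i-vector for every frame
    return np.concatenate([acoustic_feats, ivector_tiled], axis=1)

# Example with the dimensions mentioned in the text (36-dim fbank, 200-dim i-vector)
feats = np.random.randn(1000, 36).astype(np.float32)
ivec = np.random.randn(200).astype(np.float32)
new_input = build_input_features(feats, ivec)          # shape (1000, 236)
```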
本实施例的方案在建模过程中加入了说话人相关的信息(i-vector特征),同时这是一个在线的检测系统,系统延迟很低。Because the solution of this embodiment incorporates speaker-related information (i-vector features) into the modeling process and is implemented as an online detection system, the system latency is very low.
在一些可选的实施例中,神经网络为深度神经网络,将新的输入特征输入至神经网络中进行训练并输出待检测语音是否为目标说话人语音的预测结果包括:将新的输入特征中的每一帧数据分别输入至深度神经网络;分别输出每一帧数据是否是目标说话人语音的检测结果。从而可以对每一帧数据是否为目标说话人语音进行检测。In some optional embodiments, the neural network is a deep neural network, and inputting the new input features into the neural network for training and outputting a prediction result of whether the speech to be detected is the target speaker's speech includes: inputting each frame of the new input features into the deep neural network respectively; and outputting, for each frame, a detection result of whether that frame is the target speaker's speech. In this way, each frame of data can be checked as to whether it is the target speaker's speech.
在一些可选的实施例中,神经网络为长短时记忆循环神经网络,将新的输入特征输入至神经网络中进行训练并输出待检测语音是否为目标说话人语音的预测结果包括:将新的输入特征对应的整个句子数据输入至神经网络;输出每一帧数据是否是目标说话人语音的预测结果。从而可以对整个句子数据是否为目标说话人语音进行检测。In some optional embodiments, the neural network is a long short-term memory recurrent neural network, and inputting the new input features into the neural network for training and outputting a prediction result of whether the speech to be detected is the target speaker's speech includes: inputting the whole sentence of data corresponding to the new input features into the network; and outputting, for each frame, a prediction result of whether that frame is the target speaker's speech. In this way, the whole sentence of data can be checked as to whether it is the target speaker's speech.
在一些可选的实施例中,在将新的输入特征输入至神经网络中进行训练并输出待检测语音是否为目标说话人语音的预测结果之前,方法还包括:将新的输入特征中相邻的n个语音帧合并然后取平均值作为输入,同时把每一个预测输出对应的预测结果重复n次以形成最终输出。从而,在特征输入部分把相邻的n个语音帧以取平均值的方式进行合并,得到的新的特征在长度上是原来的n分之一,这样做的目的是加强语音之间的连续性。然后在模型输出预测值之后,再把每一个预测值重复n次,这样长度就和最初输入的特征长度一致,保证每一帧都有对应的预测输出。上述方法用在说话人相关的语音端点检测中,可以解决语音和非语音之间的错误转换问题和"碎片化问题"。In some optional embodiments, before inputting the new input features into the neural network for training and outputting a prediction result of whether the speech to be detected is the target speaker's speech, the method further includes: merging every n adjacent speech frames of the new input features by averaging them as the input, and repeating the prediction result corresponding to each prediction output n times to form the final output. That is, in the feature input part, every n adjacent speech frames are merged by averaging, and the resulting new features are one n-th of the original length; the purpose is to strengthen the continuity between speech frames. Then, after the model outputs the predicted values, each predicted value is repeated n times so that the length matches the original input feature length, ensuring that every frame has a corresponding prediction output. Applied to speaker-dependent voice endpoint detection, this method can solve the problem of erroneous transitions between speech and non-speech and the "fragmentation problem".
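下面给出特征合并及预测结果展开的一个示意性代码草图;对不足n帧的尾部如何处理文中未作规定,此处以重复最后一帧补齐作为假设。The following is an illustrative sketch of the feature merging and prediction-expansion step; how trailing frames that do not fill a bin are handled is not specified in the text, and padding with the last frame is an assumption made here.

```python
# 示意性草图 / Illustrative sketch: average every n adjacent frames, then repeat each
# prediction n times so that every original frame gets an output.
import numpy as np

def bin_features(feats: np.ndarray, n: int) -> np.ndarray:
    """feats: (T, D). Returns (ceil(T/n), D), each row the mean of n consecutive frames."""
    T, D = feats.shape
    pad = (-T) % n                                   # assumed: pad by repeating the last frame
    if pad:
        feats = np.concatenate([feats, np.repeat(feats[-1:], pad, axis=0)], axis=0)
    return feats.reshape(-1, n, D).mean(axis=1)

def unbin_predictions(preds: np.ndarray, n: int, T: int) -> np.ndarray:
    """preds: (ceil(T/n),). Repeat each prediction n times and truncate to T frames."""
    return np.repeat(preds, n)[:T]

feats = np.random.randn(1003, 236).astype(np.float32)
binned = bin_features(feats, n=4)                                    # (251, 236)
frame_preds = unbin_predictions(np.zeros(len(binned)), n=4, T=1003)  # (1003,)
```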
在一些可选的实施例中,将声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征包括:利用预训练的i-vector提取器从待检测语音中提取目标说话人的i-vector特征;将帧级别的声学特征和i-vector特征连接起来作为新的输入。从而实现对i-vector特征的提取和拼接,使其更好地帮助识别目标说话人的语音。In some optional embodiments, splicing the acoustic features with the i-vector features of the target speaker to form the new input features includes: extracting the i-vector features of the target speaker from the speech to be detected by using a pre-trained i-vector extractor; and connecting the frame-level acoustic features with the i-vector features as the new input. In this way, the extraction and splicing of the i-vector features are realized, which better helps to identify the target speaker's speech.
下面通过对发明人在实现本发明的过程中遇到的一些问题和对最终确定的方案的一个具体实施例进行说明,以使本领域技术人员更好地理解本申请的方案。The following describes some problems encountered by the inventor in the process of implementing the present invention and a specific embodiment of the finalized solution, so that those skilled in the art can better understand the solution of the present application.
发明人在实现本申请的过程中发现:现有技术把无关说话人的声音看作是背景噪声而非正常对话,所以无法在能量相似的情况下提取目标说话人的语音,其次它主要使用的还是传统的GMM方法,没有用到深度学习的方法,所以系统检测能力有限。In the process of implementing the present application, the inventor found that the prior art regards the voices of irrelevant speakers as background noise rather than normal dialogue, so it cannot extract the target speaker's speech when the energies are similar; moreover, it mainly uses the traditional GMM method without deep learning, so the detection capability of the system is limited.
本领域技术人员为了解决现有技术中存在的缺陷,可能会采用以下方案:传统的语音端点检测只能检测语音和非语音部分,无法进行说话人的区分,如果是要提取特定说话人的语音部分,一般会先用普通的语音端点检测系统找出音频中所有的语音段,然后再用说话人确认技术(Speaker Verification,SV)对所有的语音段进行筛选,找出目标说话人的语音部分。这样两个阶段的解决方案是比较容易想到的。In order to overcome the defects in the prior art, those skilled in the art might adopt the following solution: traditional voice endpoint detection can only detect the speech and non-speech parts and cannot distinguish speakers, so to extract the speech of a specific speaker, an ordinary voice endpoint detection system is generally used first to find all the speech segments in the audio, and then speaker verification (Speaker Verification, SV) is used to screen all the speech segments and find the target speaker's speech. Such a two-stage solution is relatively easy to think of.
而本申请的方案是一个端到端的神经网络的系统,之前并没有同样的工作。我们是在传统的语音端点检测系统的训练过程中加入了说话人相关的信息(i-vector),并将深度神经网络(DNN)和长短时记忆神经网络(LSTM)应用到语音端点检测中,实现了端到端的说话人相关的端点检测系统,通过单个网络就可以直接输出目标说话人的语音部分,去除音频中其他的静音段和非目标说话人的语音。The solution of the present application, by contrast, is an end-to-end neural network system, and no prior work does the same. We add speaker-related information (i-vector) to the training process of a traditional voice endpoint detection system, and apply deep neural networks (DNN) and long short-term memory networks (LSTM) to voice endpoint detection, realizing an end-to-end speaker-dependent endpoint detection system: a single network can directly output the target speaker's speech, removing the other silent segments and non-target speakers' speech from the audio.
如图2所示,其示出了本申请一实施例提供的基于LSTM的与说话人相关的VAD,将一段音频中每一帧的声学特征与目标说话人的i-vector特征拼接。As shown in FIG. 2 , it shows the LSTM-based speaker-related VAD provided by an embodiment of the present application, which splices the acoustic feature of each frame in a piece of audio with the i-vector feature of the target speaker.
其中,最下面特征输入的部分,传统方法都是将声学特征直接输入,而我们则是加入了说话人相关的信息(i-vector),将每一帧的声学特征与目标说话人的表征i-vector拼接起来作为新的输入特征。在经过中间神经网络的训练之后,可以直接输出每一帧是否是目标说话人语音的预测结果。中间的神经网络部分可以替换成其他的网络,区别在于DNN输入的是拼接之后每一帧数据,而LSTM输入的是拼接之后整个句子的数据。In the feature input part at the bottom, traditional methods input the acoustic features directly, whereas we add speaker-related information (i-vector) and splice the acoustic features of each frame with the i-vector representation of the target speaker as the new input features. After the intermediate neural network is trained, it can directly output, for each frame, a prediction of whether the frame is the target speaker's speech. The neural network in the middle can be replaced with other networks; the difference is that the DNN takes each spliced frame as input, while the LSTM takes the whole spliced sentence as input.
如图3所示,这里介绍的是一种特征合并的方法,用在说话人相关的语音端点检测中,可以解决语音和非语音之间的错误转换问题和"碎片化问题"。首先在特征输入部分会把相邻的n个语音帧以取平均值的方式进行合并,得到的新的特征在长度上是原来的n分之一,这样做的目的是加强语音之间的连续性。然后在模型输出预测值之后,再把每一个预测值重复n次,这样长度就和最初输入的特征长度一致,保证每一帧都有对应的预测输出。As shown in FIG. 3, a feature merging method is introduced here. Used in speaker-dependent voice endpoint detection, it can solve the problem of erroneous transitions between speech and non-speech and the "fragmentation problem". First, in the feature input part, every n adjacent speech frames are merged by averaging, and the resulting new features are one n-th of the original length; the purpose is to strengthen the continuity between speech frames. Then, after the model outputs the predicted values, each predicted value is repeated n times so that the length matches the original input feature length, ensuring that every frame has a corresponding prediction output.
本申请的实施例提出了一个新的任务:说话人相关的语音端点检测(Speaker-Dependent Voice Activity Detection,SDVAD),就是可以从音频中单独提取出目标说话人的语音部分。这个任务在真实的生活场景中很常见,通常的解决方法是在语音端点检测系统识别出所有语音段落之后,再用说话人确认(Speaker Verification,SV)来筛选出目标说话人的语音部分。在本申请的实施例中,我们提出了一种端到端的,基于神经网络的方法来解决这个问题,在建模过程中加入了说话人相关的信息,同时这是一个在线的检测系统,系统延迟很低。基于Switchboard数据集,我们生成了一个电话对话场景的语音数据集,并在这个数据集上做了一些实验。实验结果表明,相比于语音端点检测+说话人确认的方法,我们提出的在线检测系统在帧级别准确率和F-score这两个指标上都取得了更好的效果。我们也使用了之前我们自己提出的段落级别的评价指标对不同系统进行了更加全面的分析。The embodiments of the present application propose a new task: speaker-dependent voice activity detection (SDVAD), i.e. extracting only the target speaker's speech from the audio. This task is common in real-life scenarios, and the usual solution is to run speaker verification (Speaker Verification, SV) after a voice endpoint detection system has identified all speech segments, so as to pick out the target speaker's speech. In the embodiments of the present application, we propose an end-to-end, neural-network-based method to solve this problem; speaker-related information is added in the modeling process, and the system is an online detection system with very low latency. Based on the Switchboard corpus, we generate a speech dataset of telephone conversation scenarios and conduct experiments on it. The experimental results show that, compared with the voice endpoint detection + speaker verification method, our proposed online detection system achieves better results in both frame-level accuracy and F-score. We also use the segment-level evaluation metric we proposed previously to analyze the different systems more comprehensively.
以下通过介绍发明人实现本申请的过程和所进行的实验及相关的实验数据,以使本领域技术人员更好地理解本申请的方案。The process of realizing the present application, the experiments performed by the inventor and the related experimental data are introduced below, so that those skilled in the art can better understand the solution of the present application.
简介Introduction
语音端点检测(VAD,voice activity detection)是语音信号处理中最关键的技术之一,用于将语音与音频内的非语音段分离。VAD通常用作各种语音处理任务的预处理步骤,例如自动语音识别(ASR),语音合成,说话人识别和网际协议语音(VoIP)。VAD的质量直接影响后续任务的性能。Voice activity detection (VAD) is one of the most critical techniques in speech signal processing for separating speech from non-speech segments within audio. VAD is commonly used as a preprocessing step for various speech processing tasks, such as automatic speech recognition (ASR), speech synthesis, speaker recognition, and Voice over Internet Protocol (VoIP). The quality of VAD directly affects the performance of subsequent tasks.
在传统的VAD系统中,非语音部分通常由静音和噪声组成,而在这项工作中,非语音部分还包括非目标说话人的语音部分。这在实际应用中非常普遍,例如,语音助理可能只需要回复特定说话人的命令,或者在会话环境中,来自非目标说话人的语音应被视为非语音。所解决的问题被称为说话人相关的语音端点检测(SDVAD),它是传统VAD任务的扩展。在此任务中,我们只想检测来自目标说话人的语音,因此来自非目标说话人的语音和环境噪音都将被忽略。该任务的简单方法有两个步骤:(1)使用普通VAD系统检测出所有的语音段;(2)对获得的语音段执行说话人验证来识别出目标说话人的语音。然而,该方法以离线方式执行并且有较大的延迟。In a traditional VAD system, the non-speech part usually consists of silence and noise, while in this work the non-speech part also includes speech from non-target speakers. This is very common in practical applications: for example, a voice assistant may only need to respond to commands from a specific speaker, or in a conversational environment, speech from non-target speakers should be treated as non-speech. The problem addressed here is called speaker-dependent voice activity detection (SDVAD), which is an extension of the traditional VAD task. In this task we only want to detect speech from the target speaker, so speech from non-target speakers and environmental noise are both ignored. A simple approach to this task has two steps: (1) use an ordinary VAD system to detect all speech segments; (2) perform speaker verification on the obtained segments to identify the target speaker's speech. However, this approach runs offline and has a large delay.
传统的VAD算法可以分为两类,基于特征的方法和基于模型的方法。关于基于特征的方法,首先提取不同的声学特征,例如时域能量,过零率等等,然后应用诸如阈值比较的方法来进行检测。关于基于模型的方法,训练单独的统计模型以通过不同的概率分布来表示语音和非语音段,通过后验概率来进行判断,主要包括了高斯混合模型(GMM)和隐马尔可夫模型(HMM)等。同时也可以直接训练区分性模型来区分语音和非语音,诸如支持向量机(SVM)和深度神经网络模型。Traditional VAD algorithms can be divided into two categories: feature-based methods and model-based methods. For feature-based methods, different acoustic features, such as short-time energy and zero-crossing rate, are first extracted, and then methods such as threshold comparison are applied for detection. For model-based methods, separate statistical models are trained to represent speech and non-speech segments with different probability distributions, and the decision is made from the posterior probabilities; these mainly include the Gaussian mixture model (GMM) and the hidden Markov model (HMM). Discriminative models, such as support vector machines (SVM) and deep neural networks, can also be trained directly to distinguish speech from non-speech.
最近,深度学习方法已成功应用于包括VAD在内的许多任务。对于复杂环境中的VAD,DNN具有比传统方法更好的建模能力,递归神经网络(RNN)和长短时记忆网络(LSTM)可以更好地模拟输入之间的连续性,卷积神经网络(CNN)可以为神经网络训练生成更好的特征。Recently, deep learning methods have been successfully applied to many tasks, including VAD. For VAD in complex environments, DNNs have better modeling ability than traditional methods, recurrent neural networks (RNN) and long short-term memory networks (LSTM) can better model the continuity between inputs, and convolutional neural networks (CNN) can generate better features for neural network training.
为了解决说话人相关的语音端点检测,我们提出了一种基于神经网络的系统,该系统将说话人相关的信息加入到传统的VAD系统中。具体的做法是在声学特征中加入目标说话人的特征(i-vector)。与两阶段VAD/SV方法相比,我们提出的方法可以实现为端到端的在线系统,延迟相对较低。实验是在基于Switchboard数据集生成的多人对话的数据上进行的,结果表明,与离线VAD/SV方法相比,我们提出的在线方法可以得到更好的性能,而且降低了延迟。To address speaker-dependent voice endpoint detection, we propose a neural-network-based system that incorporates speaker-related information into the traditional VAD system. Specifically, the target speaker's representation (i-vector) is added to the acoustic features. Compared with the two-stage VAD/SV method, our proposed method can be implemented as an end-to-end online system with relatively low latency. Experiments are conducted on multi-speaker dialogue data generated from the Switchboard corpus, and the results show that, compared with the offline VAD/SV method, our proposed online method achieves better performance with lower latency.
基于神经网络的语音端点检测中首先应用的就是深度神经网络模型(DNN)。基于DNN的VAD系统不仅可以得到更好的效果,而且检测复杂度低。典型的基于DNN的VAD系统训练基于帧的二进制分类器以将每个帧分类为两类:语音和非语音。通常,DNN的输入是每一帧的声学特征加上前后帧的扩展,The deep neural network (DNN) was the first model applied in neural-network-based voice endpoint detection. DNN-based VAD systems not only achieve better results but also have low detection complexity. A typical DNN-based VAD system trains a frame-based binary classifier to classify each frame into two classes: speech and non-speech. Usually, the input to the DNN is the acoustic features of each frame plus a context expansion over the preceding and following frames,
O_t = [x_{t-r}, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{t+r}]    (1)
其中r是上下文扩展的长度。DNN通过交叉熵的损失函数来进行优化。对于每个帧,通过两个类的后验概率之间的比较来执行分类。where r is the length of the context extension. DNN is optimized by the loss function of cross entropy. For each frame, classification is performed by a comparison between the posterior probabilities of the two classes.
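下面给出式(1)上下文扩展的一个示意性实现草图;边界帧以首/尾帧重复填充,这一处理方式为此处的假设,文中并未规定。The following is an illustrative sketch of the context expansion in Eq. (1); padding the edges by repeating the first/last frame is an assumption made here and is not specified in the text.

```python
# 示意性草图 / Illustrative sketch: stack r frames of left and right context onto each frame.
import numpy as np

def add_context(feats: np.ndarray, r: int) -> np.ndarray:
    """feats: (T, D). Returns (T, (2r+1)*D), the r-frame context expansion of every frame."""
    padded = np.concatenate([np.repeat(feats[:1], r, axis=0),   # assumed edge padding
                             feats,
                             np.repeat(feats[-1:], r, axis=0)], axis=0)
    # column blocks are ordered x_{t-r}, ..., x_t, ..., x_{t+r}
    return np.concatenate([padded[i:i + len(feats)] for i in range(2 * r + 1)], axis=1)

feats = np.random.randn(100, 36).astype(np.float32)
dnn_input = add_context(feats, r=5)   # 11-frame window -> shape (100, 396)
```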
基于LSTM的VAD系统LSTM-based VAD system
LSTM能够对序列进行建模并捕获一系列特征中的长期的相关性。它的核心是由称为块的特殊单元组成。每个块包含一个输入门,一个输出门和一个遗忘门,使模型能够记忆短时间或长时间的信息相关性。LSTM结构可以有效地使用上下文来顺序地对输入声学特征进行建模。LSTMs can model sequences and capture long-term correlations in a set of features. Its core is made up of special units called blocks. Each block contains an input gate, an output gate, and a forget gate, enabling the model to memorize short- or long-term information dependencies. The LSTM structure can effectively use context to sequentially model the input acoustic features.
LSTM网络计算从输入序列x=[x_1, x_2, ..., x_T]到输出序列y=[y_1, y_2, ..., y_T]的映射。这种架构的更多细节可以参考相关的论文。The LSTM network computes a mapping from the input sequence x=[x_1, x_2, ..., x_T] to the output sequence y=[y_1, y_2, ..., y_T]. More details of this architecture can be found in the related papers.
如果应用于VAD,则基于LSTM的系统逐帧输出预测,但是当前帧的每个预测部分地取决于其历史。训练准则与DNN相同。If applied to VAD, the LSTM-based system outputs predictions frame by frame, but each prediction for the current frame partly depends on its history. The training criterion is the same as for the DNN.
与说话人相关的语音端点检测相关工作Work related to speaker-dependent speech endpoint detection
对于与说话人相关的VAD,一些先前的研究使用麦克风阵列来跟踪目标说话人。某些研究也考虑了VAD的说话人身份,他们使用的VAD系统基于高斯混合模型,使用一个额外的GMM来对目标说话人进行建模。但是,应该注意的是,我们与之前的工作有不同的实验环境,同时是解决不同的问题。在之前的研究中,来自其他说话人的语音表现为背景噪音,而在我们的任务中,针对的是会话场景,其中来自不同说话人的语音不重叠。另一种情况是在家中使用智能音响的时候,其中的语音识别系统将受到其他家庭成员对话的干扰。通常,对于仅想接受来自特定说话人的语音信号的系统,需要这种与说话人相关的语音端点检测器。For speaker-dependent VAD, some previous studies used microphone arrays to track the target speaker. Some studies also considered speaker identity in VAD; the VAD system they used is based on Gaussian mixture models, with an additional GMM to model the target speaker. It should be noted, however, that our experimental setting differs from the previous work and we solve a different problem. In the previous studies, speech from other speakers appears as background noise, whereas our task targets conversational scenarios in which speech from different speakers does not overlap. Another scenario is using a smart speaker at home, where the speech recognition system is interfered with by conversations of other family members. In general, such a speaker-dependent voice endpoint detector is needed by systems that only want to accept speech signals from a specific speaker.
基于说话人的特征i-vectorSpeaker-based feature i-vector
说话人建模在语音处理任务中起着至关重要的作用,例如说话人识别,说话人分割聚类,语音识别的说话人自适应。近年来,基于因子分析的i-vector系统在说话人识别任务中取得了显著的性能提升,这种说话人的表征方式也适用于其他相关任务,如语音转换和语音识别的说话人自适应训练。Speaker modeling plays a crucial role in speech processing tasks such as speaker recognition, speaker diarization, and speaker adaptation for speech recognition. In recent years, i-vector systems based on factor analysis have achieved significant performance improvements in speaker recognition tasks, and this speaker representation is also applicable to other related tasks, such as voice conversion and speaker-adaptive training for speech recognition.
基本上,i-vector是语音的低维固定长度表示,其保留说话人特定信息。对于i-vector框架,说话人和会话相关的超矢量M(从UBM导出)被建模为Basically, i-vectors are low-dimensional fixed-length representations of speech that preserve speaker-specific information. For the i-vector framework, the speaker- and session-dependent supervectors M (derived from UBM) are modeled as
M=m+Tw (2)M=m+Tw (2)
其中m是说话人和会话无关的超向量,T是低秩矩阵,其表示说话人和会话可变性,i-vector是w的后验均值。where m is a speaker- and session-independent supervector, T is a low-rank matrix representing speaker and session variability, and i-vector is the posterior mean of w.
基线系统Baseline system
正如引言中所提到的,对于与说话人相关的VAD的任务,直观的方法将是一个两阶段的方法。首先,普通VAD用于检测所有语音段而不区分说话人,然后我们使用说话人验证系统来挑选属于目标说话人的语音段。因此,基线系统是VAD和与文本无关的说话人验证系统的组合,在本文的其余部分将其称为VAD/SV方法。As mentioned in the introduction, for the task of speaker-dependent VAD, an intuitive approach would be a two-stage approach. First, ordinary VAD is used to detect all speech segments without distinguishing speakers, and then we use a speaker verification system to pick out speech segments belonging to the target speaker. Therefore, the baseline system is a combination of VAD and a text-independent speaker verification system, which is referred to as the VAD/SV method in the rest of this paper.
在这项工作中,基于DNN和LSTM的系统是针对VAD阶段进行训练的,而对于说话人验证部分,我们使用目前最先进的基于i-vector的概率线性判别分析(i-vector/PLDA框架)。In this work, the DNN and LSTM based systems are trained for the VAD stage, while for the speaker verification part, we use the state-of-the-art i-vector-based probabilistic linear discriminant analysis (i-vector/PLDA framework) .
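下面给出两阶段VAD/SV基线流程的一个示意性代码草图,其中run_vad、extract_ivector和plda_score均为假设的占位函数,并非某个具体工具包的接口。The following is an illustrative sketch of the two-stage VAD/SV baseline pipeline; run_vad, extract_ivector and plda_score are hypothetical placeholders rather than the API of any particular toolkit.

```python
# 示意性草图 / Illustrative sketch of the two-stage VAD/SV baseline.
import numpy as np

def frames_to_segments(frame_labels: np.ndarray):
    """Turn a 0/1 frame sequence into (start, end) index pairs of contiguous speech runs."""
    segments, start = [], None
    for t, lab in enumerate(frame_labels):
        if lab and start is None:
            start = t
        elif not lab and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(frame_labels)))
    return segments

def vad_sv_baseline(audio_feats, target_ivector, run_vad, extract_ivector, plda_score, threshold=0.0):
    """Stage 1: generic VAD; Stage 2: keep only segments whose PLDA score against
    the target speaker's i-vector exceeds a threshold."""
    speech_frames = run_vad(audio_feats)                 # (T,) 0/1 decisions, speaker-independent
    decisions = np.zeros_like(speech_frames)
    for start, end in frames_to_segments(speech_frames):
        seg_ivec = extract_ivector(audio_feats[start:end])
        if plda_score(seg_ivec, target_ivector) > threshold:
            decisions[start:end] = 1                     # accept target-speaker segment
    return decisions
```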
端到端说话人相关的VAD系统(SDVAD)End-to-End Speaker Dependent VAD System (SDVAD)
根据基线系统,说话人验证阶段是在获得整个音频的VAD预测结果之后,这增加了系统的延迟。而且,它并没有直接优化这项任务的最终目标。因此,我们建议在原始VAD网络中引入说话人建模,以使模型能够提供帧级说话人的相关预测。由于该模型现在以端到端的方式进行训练,因此可以充分利用数据信息来获得更好的效果。According to the baseline system, the speaker verification stage is after obtaining the VAD prediction results of the whole audio, which increases the latency of the system. Moreover, it does not directly optimize the ultimate goal of this task. Therefore, we propose to introduce speaker modeling in the original VAD network to enable the model to provide frame-level speaker correlation predictions. Since the model is now trained in an end-to-end manner, the data information can be fully exploited for better results.
所提出的系统在前述的图2中描绘,利用预训练的i-vector提取器,将从用户注册的语音中提取目标说话人的i-vector。然后我们将帧级别的声学特征和目标说话人的i-vector拼接起来作为神经网络的新输入。这对于训练和推理阶段都是可行的。对于训练阶段,音频数据都有相应的标注,因此可以使用说话人特定数据以便提取相应的i-vector。在推理阶段,要求用户在首次使用系统时首先注册他们的声音也是合理的。The proposed system, depicted in the aforementioned Figure 2, utilizes a pretrained i-vector extractor that will extract the target speaker's i-vector from the user's registered speech. Then we concatenate the frame-level acoustic features and the i-vector of the target speaker as a new input to the neural network. This is possible for both training and inference phases. For the training phase, the audio data has corresponding annotations, so speaker-specific data can be used in order to extract the corresponding i-vectors. During the inference phase, it is also reasonable to require users to first register their voices when using the system for the first time.
在训练过程中,仅目标说话人的语音部分被视为正样本,而非目标说话人的语音部分和非语音部分都被视为负样本。因此,该模型能够直接输出每帧的最终说话人相关预测,而无需额外的说话人验证阶段。我们所提出的与说话人相关的VAD系统是具有较低延迟的在线系统。During training, only the speech part of the target speaker is regarded as a positive sample, and both the speech part and the non-speech part of the non-target speaker are regarded as a negative sample. Therefore, the model is able to directly output final speaker-related predictions for each frame without the need for an additional speaker verification stage. Our proposed speaker-dependent VAD system is an online system with lower latency.
后处理和特征合并Post-processing and Feature Merging
VAD与常见的二进制分类问题不同,因为音频信号的特征是具有连续性的,这意味着相邻帧是高度相关的。模型的原始输出通常包含许多错误转换,由于脉冲噪声和其他干扰而导致"碎片问题"。对于像DNN这样的基于帧的分类器,这种问题更为明显。因此,应用后处理方法来平滑模型的原始输出并减少语音和非语音之间频繁的错误转换非常重要。常用的是基于规则的后处理方法,使用滑动窗口来对模型的输出进行平滑,消除一些错误的语音非语音转换。VAD differs from common binary classification problems because audio signals are continuous in nature, which means that adjacent frames are highly correlated. The raw output of the model usually contains many erroneous transitions, leading to the "fragmentation problem" caused by impulse noise and other interference. For frame-based classifiers such as the DNN, this problem is more obvious. Therefore, it is important to apply post-processing to smooth the raw output of the model and reduce frequent erroneous transitions between speech and non-speech. A commonly used approach is rule-based post-processing, which uses a sliding window to smooth the model output and remove some erroneous speech/non-speech transitions.
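下面给出基于滑动窗口的后处理平滑的一个示意性草图;此处采用窗口内多数表决,具体的平滑规则文中未给出,属于假设。The following is an illustrative sketch of sliding-window post-processing; majority voting within the window is assumed here, since the exact smoothing rule is not given in the text.

```python
# 示意性草图 / Illustrative sketch: majority voting over a sliding window to smooth
# the frame-level 0/1 output of the model.
import numpy as np

def smooth_predictions(frame_preds: np.ndarray, win: int = 10) -> np.ndarray:
    """frame_preds: (T,) 0/1 frame decisions. Returns the majority-voted sequence."""
    half = win // 2
    padded = np.pad(frame_preds.astype(float), (half, half), mode="edge")
    smoothed = np.array([padded[t:t + win].mean() for t in range(len(frame_preds))])
    return (smoothed >= 0.5).astype(int)
```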
大多数后期处理方法都会为在线VAD系统增加额外的延迟。在本文中,另一种称为特征合并的方法用于帮助解决与说话人相关的VAD中的"碎片问题"。不同之处在于我们尝试平滑输入特征而不是模型输出。关于VAD,通过将值分组为固定数量的块来完成特征合并。如前述图3所示,我们使用均值缩减合并相邻n帧的输入特征,合并过程中帧不重叠。此过程将原始帧数减小到1/n。然后,模型的每个输出预测会被重复n次,以对应于每帧的原始特征。此方法引起的延迟可以忽略不计。Most post-processing methods add extra latency to an online VAD system. In this work, another method called feature merging is used to help solve the "fragmentation problem" in speaker-dependent VAD. The difference is that we try to smooth the input features rather than the model output. For VAD, feature merging is done by grouping values into a fixed number of bins. As shown in the aforementioned FIG. 3, we merge the input features of every n adjacent frames by mean reduction, and the frames do not overlap during merging. This process reduces the original number of frames to 1/n. Each output prediction of the model is then repeated n times to correspond to the original per-frame features. The latency introduced by this method is negligible.
对于DNN模型,正常的帧扩展用于添加上下文信息并减少预测结果中的错误转换。对于LSTM模型,我们使用特征合并来保持语音的连续性并降低计算时间。For the DNN model, normal frame expansion is used to add contextual information and reduce erroneous transitions in the prediction results. For the LSTM model, we use feature merging to preserve the continuity of speech and reduce computation time.
实验experiment
数据集data set
我们对从Switchboard语料库生成的对话数据集进行了实验。在去除重复的对话以及数据不足的说话人之后,我们还剩下500个说话人的250h音频数据,其中每个音频仅包含一个说话人。然后我们将这些过滤后的数据分为train,dev和test set。训练集中有450个说话人,开发集中有10个说话人,测试集中有剩下的40个说话人。We conduct experiments on dialogue datasets generated from the Switchboard corpus. After removing duplicate conversations and speakers with insufficient data, we are left with 250h of audio data for 500 speakers, where each audio contains only one speaker. Then we divide these filtered data into train, dev and test sets. There are 450 speakers in the training set, 10 speakers in the dev set, and the remaining 40 speakers in the test set.
训练数据的生成过程如下(列表之后给出一段示意性代码):The training data is generated as follows (an illustrative code sketch is given after the list):
(1)为训练集中的说话人提取i-vector。(1) Extract i-vectors for speakers in the training set.
(2)从第s个发言者的数据中随机选择第i个音频,称为utt_s_i,同时选择第t个说话人的第j个音频,其中s不等于t,并把这两段音频拼接起来,作为新的句子utt_new。(2) Randomly select the i-th utterance from the data of the s-th speaker, denoted utt_s_i, and also select the j-th utterance of the t-th speaker, where s is not equal to t; splice the two utterances together as a new sentence utt_new.
(3)将目标说话人的i-vector连接到utt_new音频的每一帧上,以形成神经网络的最终输入。开发数据和测试数据的生成是类似的,而我们假设目标发言者的i-vector是通过额外的注册阶段获得的。(3) Concatenate the i-vector of the target speaker to each frame of the utt_new audio to form the final input of the neural network. The generation of development data and test data is similar, while we assume that the i-vector of the target speaker is obtained through an additional registration stage.
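下面给出上述训练数据生成过程的一个示意性代码草图,其中load_feats、utts_by_spk和speaker_ivectors为假设的占位;为简化起见,目标说话人一侧的所有帧都被标为正样本,未单独区分其中的静音帧,这也是此处的假设。The following is an illustrative sketch of the training-data generation above; load_feats, utts_by_spk and speaker_ivectors are hypothetical placeholders, and labeling all frames on the target-speaker side as positive (without separating silence frames) is a simplifying assumption made here.

```python
# 示意性草图 / Illustrative sketch: splice one target-speaker utterance and one
# interfering-speaker utterance, attach the target i-vector to every frame, and label frames.
import numpy as np
import random

def make_training_utterance(spk_s, spk_t, utts_by_spk, speaker_ivectors, load_feats):
    """Concatenate one utterance from speaker s (target) and one from speaker t (s != t)."""
    assert spk_s != spk_t
    feats_s = load_feats(random.choice(utts_by_spk[spk_s]))   # (T_s, D)
    feats_t = load_feats(random.choice(utts_by_spk[spk_t]))   # (T_t, D)
    utt_new = np.concatenate([feats_s, feats_t], axis=0)
    ivec = np.tile(speaker_ivectors[spk_s], (len(utt_new), 1))
    inputs = np.concatenate([utt_new, ivec], axis=1)          # (T_s + T_t, D + D_ivec)
    labels = np.concatenate([np.ones(len(feats_s)), np.zeros(len(feats_t))])
    return inputs, labels
```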
特征feature
对于i-vector提取器,提取帧长为25ms的20维MFCC作为前端特征。UBM由2048个GMM组成,提取的i-vector的维数为200。PLDA用作评分并补偿信道失真。所有神经网络的基本特征是36维对数滤波器组,帧长25ms,帧移为10ms。对于DNN模型,输入层由11帧的上下文窗口形成。DNN和LSTM模型都包含了两个隐层。For the i-vector extractor, a 20-dimensional MFCC with a frame length of 25ms is extracted as front-end features. UBM consists of 2048 GMMs, and the dimension of the extracted i-vector is 200. PLDA is used as a score and compensates for channel distortion. The basic feature of all neural networks is a 36-dimensional logarithmic filter bank with a frame length of 25ms and a frame shift of 10ms. For the DNN model, the input layer is formed by a context window of 11 frames. Both DNN and LSTM models contain two hidden layers.
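下面用PyTorch给出DNN与LSTM模型结构的一个示意性草图,输入维度取自上文(36维fbank、11帧上下文、200维i-vector、两个隐层),隐层单元数为假设值,文中未给出。The following is an illustrative PyTorch sketch of the DNN and LSTM model structures, with input dimensions taken from the text above (36-dim filterbanks, 11-frame context, 200-dim i-vector, two hidden layers); the hidden-layer sizes are assumed values not given in the text.

```python
# 示意性草图 / Illustrative sketch of the two model structures; not the patent's reference code.
import torch
import torch.nn as nn

class DnnSDVAD(nn.Module):
    def __init__(self, feat_dim=36, context=11, ivector_dim=200, hidden=512):
        super().__init__()
        in_dim = feat_dim * context + ivector_dim            # per-frame input after splicing
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                            # target-speaker speech vs. everything else

    def forward(self, x):                                    # x: (batch, in_dim)
        return self.net(x)

class LstmSDVAD(nn.Module):
    def __init__(self, feat_dim=36, ivector_dim=200, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + ivector_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 2)

    def forward(self, x):                                    # x: (batch, T, feat_dim + ivector_dim)
        h, _ = self.lstm(x)
        return self.out(h)                                   # frame-level logits

criterion = nn.CrossEntropyLoss()                            # cross-entropy training criterion
```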
帧级评估Frame-level evaluation
帧级别评估的结果以准确度(ACC)和F分数(F1,精度和召回的调和平均值)报告,列于表1中。The results of frame-level evaluations are reported in accuracy (ACC) and F-score (F1, the harmonic mean of precision and recall), listed in Table 1.
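下面给出帧级准确率与F分数计算方式的一个示意性草图(以类别1表示目标说话人语音)。The following is an illustrative sketch of how the frame-level accuracy and F-score can be computed, with class 1 denoting target-speaker speech.

```python
# 示意性草图 / Illustrative sketch: frame-level accuracy and F1 (harmonic mean of precision and recall).
import numpy as np

def frame_metrics(pred: np.ndarray, ref: np.ndarray):
    """pred, ref: (T,) 0/1 frame labels; class 1 = target-speaker speech."""
    acc = float((pred == ref).mean())
    tp = float(np.sum((pred == 1) & (ref == 1)))
    precision = tp / max(np.sum(pred == 1), 1)
    recall = tp / max(np.sum(ref == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return acc, f1
```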
如果没有任何预处理或后处理,可以发现LSTM在VAD/SV基线和SDVAD系统中具有比DNN更好的性能,这归因于其序列建模能力。对于LSTM模型,SDVAD系统的ACC和F-score略高于VAD/SV基线系统,这意味着我们提出的与说话人相关的VAD方法是有效的。Without any pre-processing or post-processing, the LSTM performs better than the DNN in both the VAD/SV baseline and the SDVAD system, which is attributed to its sequence modeling ability. For the LSTM models, the ACC and F-score of the SDVAD system are slightly higher than those of the VAD/SV baseline system, which means that our proposed speaker-dependent VAD method is effective.
为了解决"碎片问题"并进一步提高系统性能,前面提到的基于规则的后处理和特征合并被应用于这些系统。从结果可以看出,后处理可以略微改善DNN和LSTM SDVAD的性能。To address the "fragmentation problem" and further improve system performance, the aforementioned rule-based post-processing and feature merging are applied to these systems. The results show that post-processing slightly improves the performance of the DNN and LSTM SDVAD systems.
表1:不同系统的ACC(%)和F得分(%)。VAD/SV表示先VAD后说话人验证的两阶段基线系统,SDVAD表示我们提出的端到端说话人相关的VAD系统。"+post"和"+binning"分别表示应用后处理和特征合并。后处理的滑动窗口大小为10帧,特征合并的大小为4。Table 1: ACC (%) and F-score (%) of the different systems. VAD/SV denotes the two-stage baseline of VAD followed by speaker verification, and SDVAD denotes our proposed end-to-end speaker-dependent VAD system. "+post" and "+binning" denote applying post-processing and feature merging, respectively. For post-processing, the sliding window size is 10 frames; the feature merging size is 4.
另一方面,我们所使用的特征合并方法可以极大地有益于基于LSTM的SDVAD系统,将ACC从88.31%提高到94.42%,并且可以通过后处理进一步提高到94.62%。F-score与ACC具有相同的改进。On the other hand, the feature merging method we used can greatly benefit the LSTM-based SDVAD system, improving the ACC from 88.31% to 94.42%, and can be further improved to 94.62% by post-processing. F-score has the same improvement as ACC.
在这里我们需要注意,作为基线系统的第一阶段,普通VAD可以在没有太多碎片的情况下获得语音/非语音分类(没有说话人区分)的良好准确性。特征合并对第一阶段没有太大影响,因此无法改善整个VAD/SV系统。出于同样的原因,后处理方法不能改进VAD/SV系统,因此后处理的VAD/SV结果没有添加到表1中。两种处理方法之间性能差异的原因是后处理操作不会影响SDVAD的训练过程,而作为预处理步骤的特征合并可以被视为神经网络的一部分,这有助于网络充分利用信息。It should be noted here that, as the first stage of the baseline system, ordinary VAD can already achieve good accuracy in speech/non-speech classification (without speaker discrimination) with little fragmentation. Feature merging has little effect on this first stage and therefore cannot improve the whole VAD/SV system. For the same reason, the post-processing method cannot improve the VAD/SV system either, so the post-processed VAD/SV results are not added to Table 1. The reason for the performance difference between the two processing methods is that the post-processing operation does not affect the training process of SDVAD, whereas feature merging, as a pre-processing step, can be regarded as part of the neural network, which helps the network make full use of the information.
段落级别评估Paragraph level assessment
ACC和F分数仅是帧级别分类能力的指示。我们希望在段落级别进一步研究VAD/SV基线和SDVAD系统的性能。这里使用我们之前提出的评估度量J_VAD。ACC and F-scores are only indicators of frame-level classification ability. We hope to further investigate the performance of VAD/SV baselines and SDVAD systems at the paragraph level. Here we use our previously proposed evaluation metric J_VAD.
J_VAD包含四个不同的子标准,即起始边界精度(SBA),结束边界精度(EBA),边界精度(BP)和帧精度(ACC)。ACC是正确识别帧的准确率。SBA和EBA是边界级精度的指示。BP是衡量VAD输出段完整性的指标。将上述四个子标准的调和平均值定义为段级J_VAD。分析是从这四个方面进行的。段落级别的J_VAD结果如表2所示。J_VAD contains four different sub-criteria, namely the start boundary accuracy (SBA), the end boundary accuracy (EBA), the boundary precision (BP) and the frame accuracy (ACC). ACC is the accuracy of correctly identified frames. SBA and EBA are indicators of boundary-level accuracy. BP measures the integrity of the segments output by the VAD. The harmonic mean of the above four sub-criteria is defined as the segment-level J_VAD. The analysis is carried out from these four aspects. The segment-level J_VAD results are shown in Table 2.
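按照上述定义,段级指标可示意性地写为四个子标准的调和平均:J_VAD = 4 / (1/SBA + 1/EBA + 1/BP + 1/ACC)。According to the above definition, the segment-level metric can be written, illustratively, as the harmonic mean of the four sub-criteria: J_VAD = 4 / (1/SBA + 1/EBA + 1/BP + 1/ACC).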
为了更直观的比较,此处仅使用LSTM模型。与VAD/SV基线系统相比,我们可以发现原始的SDVAD系统受到“碎片问题”的限制。在不加任何预处理和后处理的方法时SDVAD系统的预测可能包含一些错误状态转换和片段。这些碎片导致不同系统检测到的段落数量增加。For a more intuitive comparison, only the LSTM model is used here. Compared with the VAD/SV baseline system, we can find that the original SDVAD system is limited by the "fragmentation problem". The predictions of the SDVAD system may contain some erroneous state transitions and fragments without any preprocessing and postprocessing methods. These fragments lead to an increase in the number of paragraphs detected by different systems.
表2:列出了不同系统的J_VAD(%)和除ACC之外的3个子标准(%),ACC见表1。Table 2: J_VAD (%) and the three sub-criteria (%) other than ACC are listed for the different systems; ACC is shown in Table 1.
因此,BP评估很差。特征合并可以有效地减少这些错误转换。所有段落评估指标均已得到改进,并且接近于基线系统的效果。Therefore, BP evaluation is poor. Feature merging can effectively reduce these erroneous transformations. All passage evaluation metrics have been improved and are close to the performance of the baseline system.
为了更好地比较不同的系统,测试用例的预测结果如图4所示。图4示出了不同系统的预测,可以观察到SDVAD系统的预测结果中存在一些片段,并且特征合并可以有效地解决“碎片问题”。To better compare different systems, the prediction results of the test cases are shown in Figure 4. Figure 4 shows the predictions of different systems, it can be observed that there are some fragments in the prediction results of the SDVAD system, and feature merging can effectively solve the "fragmentation problem".
VAD/SV系统为非目标说话人提供了一些误报,这是合理的,因为VAD和SV是两个分离的阶段,无法针对任务的最终目标进行优化。The VAD/SV system provides some false positives for non-target speakers, which is reasonable since VAD and SV are two separate stages that cannot be optimized for the final goal of the task.
结论in conclusion
在本文中,基于端到端神经网络的系统被设计用于解决与说话人相关的VAD问题,该问题旨在仅检测来自目标说话人的语音。与延迟较高的两阶段VAD/SV方法相比,我们提出的端到端方法(SDVAD)直接将说话人信息带入建模过程并可直接执行在线预测。根据帧级别的指标和我们先前提出的段落级别的指标得到了一系列实验的结果。对于帧级别的评估,我们提出的LSTM SDVAD系统比传统的VAD/SV系统有了显著的性能提升,在帧精度方面从86.62%到94.42%。为了解决“碎片问题”,我们在LSTM SDVAD系统中引入了特征合并,这显著改善了段落级别的评估效果。In this paper, an end-to-end neural network based system is designed to solve the speaker-dependent VAD problem, which aims to detect only speech from the target speaker. Compared with two-stage VAD/SV methods with higher latency, our proposed end-to-end method (SDVAD) directly brings speaker information into the modeling process and can directly perform online prediction. The results of a series of experiments are obtained based on frame-level metrics and our previously proposed paragraph-level metrics. For frame-level evaluation, our proposed LSTM SDVAD system achieves a significant performance improvement over conventional VAD/SV systems, from 86.62% to 94.42% in terms of frame accuracy. To address the "fragmentation problem", we introduce feature merging in the LSTM SDVAD system, which significantly improves paragraph-level evaluation.
请参考图5,其示出了本发明一实施例提供的说话人相关的端到端语音端点检测装置的框图。Please refer to FIG. 5 , which shows a block diagram of a speaker-related end-to-end voice endpoint detection apparatus provided by an embodiment of the present invention.
如图5所示,端到端语音端点检测装置500,包括提取模块510、拼接模块520和输出模块530。As shown in FIG. 5 , the end-to-end voice endpoint detection apparatus 500 includes an extraction module 510 , a splicing module 520 and an output module 530 .
其中,提取模块510,配置为提取待检测语音的声学特征;拼接模块520,配置为将所述声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;以及输出模块530,配置为将所述新的输入特征输入至神经网络中进行训练并输出所述待检测语音是否为目标说话人语音的预测结果。The extraction module 510 is configured to extract acoustic features of the speech to be detected; the splicing module 520 is configured to splice the acoustic features with the i-vector features of the target speaker to form new input features; and the output module 530 is configured to input the new input features into a neural network for training and to output a prediction result of whether the speech to be detected is the target speaker's speech.
在一些可选的实施例中,拼接模块520还进一步配置为:利用预训练的i-vector提取器从所述待检测语音中提取目标说话人的i-vector特征;将帧级的声学特征和目标说话人的i-vector特征连接起来作为新的输入。In some optional embodiments, the splicing module 520 is further configured to: extract the i-vector features of the target speaker from the speech to be detected by using a pre-trained i-vector extractor; and connect the frame-level acoustic features with the i-vector features of the target speaker as the new input.
应当理解,图5中记载的诸模块与参考图1中描述的方法中的各个步骤相对应。由此,上文针对方法描述的操作和特征以及相应的技术效果同样适用于图5中的诸模块,在此不再赘述。It should be understood that the modules recited in FIG. 5 correspond to various steps in the method described with reference to FIG. 1 . Therefore, the operations and features described above with respect to the method and the corresponding technical effects are also applicable to the modules in FIG. 5 , which will not be repeated here.
值得注意的是,本申请的实施例中的模块并不用于限制本申请的方案,例如提取模块可以描述为提取待检测语音的声学特征的模块。另外,还可以通过硬件处理器来实现相关功能模块,例如提取模块也可以用处理器实现,在此不再赘述。It is worth noting that the modules in the embodiments of the present application are not used to limit the solution of the present application. For example, the extraction module may be described as a module for extracting acoustic features of the speech to be detected. In addition, the relevant functional modules may also be implemented by a hardware processor, for example, the extraction module may also be implemented by a processor, which will not be repeated here.
在另一些实施例中,本发明实施例还提供了一种非易失性计算机存储介质,计算机存储介质存储有计算机可执行指令,该计算机可执行指令可执行上述任意方法实施例中的说话人相关的端到端语音端点检测方法;In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions, where the computer-executable instructions can execute the speaker-dependent end-to-end voice endpoint detection method in any of the above method embodiments;
作为一种实施方式,本发明的非易失性计算机存储介质存储有计算机可执行指令,计算机可执行指令设置为:As an embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:
提取待检测语音的声学特征;Extract the acoustic features of the speech to be detected;
将所述声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;splicing the acoustic feature with the i-vector feature of the target speaker as a new input feature;
将所述新的输入特征输入至神经网络中进行训练并输出所述待检测语音是否为目标说话人语音的预测结果。The new input feature is input into the neural network for training and the prediction result of whether the speech to be detected is the speech of the target speaker is output.
非易失性计算机可读存储介质可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据说话人相关的端到端语音端点检测装置的使用所创建的数据等。此外,非易失性计算机可读存储介质可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,非易失性计算机可读存储介质可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至说话人相关的端到端语音端点检测装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created by the use of the speaker-dependent end-to-end voice endpoint detection device, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memories located remotely from the processor, and these remote memories may be connected via a network to the speaker-dependent end-to-end voice endpoint detection device. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
本发明实施例还提供一种计算机程序产品,计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序,计算机程序包括程序指令,当程序指令被计算机执行时,使计算机执行上述任一项说话人相关的端到端语音端点检测方法。An embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute any of the above speaker-dependent end-to-end voice endpoint detection methods.
图6是本发明实施例提供的电子设备的结构示意图,如图6所示,该设备包括:一个或多个处理器610以及存储器620,图6中以一个处理器610为例。端到端语音端点检测方法的设备还可以包括:输入装置630和输出装置640。处理器610、存储器620、输入装置630和输出装置640可以通过总线或者其他方式连接,图6中以通过总线连接为例。存储器620为上述的非易失性计算机可读存储介质。处理器610通过运行存储在存储器620中的非易失性软件程序、指令以及模块,从而执行服务器的各种功能应用以及数据处理,即实现上述方法实施例说话人相关的端到端语音端点检测方法。输入装置630可接收输入的数字或字符信息,以及产生与说话人相关的端到端语音端点检测装置的用户设置以及功能控制有关的键信号输入。输出装置640可包括显示屏等显示设备。FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 6, the device includes: one or more processors 610 and a memory 620, with one processor 610 taken as an example in FIG. 6. The apparatus for the end-to-end voice endpoint detection method may further include: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630 and the output device 640 may be connected by a bus or in other manners; connection by a bus is taken as an example in FIG. 6. The memory 620 is the aforementioned non-volatile computer-readable storage medium. The processor 610 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 620, that is, it implements the speaker-dependent end-to-end voice endpoint detection method of the above method embodiments. The input device 630 can receive input numerical or character information and generate key signal inputs related to user settings and function control of the speaker-dependent end-to-end voice endpoint detection device. The output device 640 may include a display device such as a display screen.
上述产品可执行本发明实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本发明实施例所提供的方法。The above product can execute the method provided by the embodiment of the present invention, and has corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
作为一种实施方式,上述电子设备应用于说话人相关的端到端语音端点检测装置中,包括:至少一个处理器;以及,与至少一个处理器通信连接的存储器;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行,以使至少一个处理器能够:As an implementation manner, the above-mentioned electronic device is applied to a speaker-related end-to-end voice endpoint detection apparatus, comprising: at least one processor; and a memory communicatively connected to the at least one processor; Instructions executed by at least one processor, the instructions being executed by at least one processor to enable at least one processor to:
提取待检测语音的声学特征;Extract the acoustic features of the speech to be detected;
将所述声学特征与目标说话人的i-vector特征进行拼接以作为新的输入特征;splicing the acoustic feature with the i-vector feature of the target speaker as a new input feature;
将所述新的输入特征输入至神经网络中进行训练并输出所述待检测语音是否为目标说话人语音的预测结果。The new input feature is input into the neural network for training and the prediction result of whether the speech to be detected is the speech of the target speaker is output.
本申请实施例的电子设备以多种形式存在,包括但不限于:The electronic devices in the embodiments of the present application exist in various forms, including but not limited to:
(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机(例如iPhone)、多媒体手机、功能性手机,以及低端手机等。(1) Mobile communication equipment: This type of equipment is characterized by having mobile communication functions, and its main goal is to provide voice and data communication. Such terminals include: smart phones (eg iPhone), multimedia phones, feature phones, and low-end phones.
(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如iPad。(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has the characteristics of mobile Internet access. Such terminals include: PDAs, MIDs, and UMPC devices, such as iPads.
(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器(例如iPod),掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。(3) Portable entertainment equipment: This type of equipment can display and play multimedia content. Such devices include: audio and video players (eg iPod), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
(4)服务器:提供计算服务的设备,服务器的构成包括处理器、硬盘、内存、系统总线等,服务器和通用的计算机架构类似,但是由于需要提供高可靠的服务,因此在处理能力、稳定性、可靠性、安全性、可扩展性、可管理性等方面要求较高。(4) Server: A device that provides computing services. The composition of the server includes a processor, a hard disk, a memory, a system bus, etc. The server is similar to a general computer architecture, but due to the need to provide highly reliable services, the processing power, stability , reliability, security, scalability, manageability and other aspects of high requirements.
(5)其他具有数据交互功能的电子装置。(5) Other electronic devices with data interaction function.
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place , or distributed to multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分的方法。From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above-mentioned technical solutions can be embodied in the form of software products in essence or the parts that make contributions to the prior art, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic Disks, optical discs, etc., include instructions for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform the methods of various embodiments or portions of embodiments.
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand: it can still be The technical solutions described in the foregoing embodiments are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910517374.3ACN110136749B (en) | 2019-06-14 | 2019-06-14 | Method and device for detecting end-to-end voice endpoint related to speaker |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910517374.3ACN110136749B (en) | 2019-06-14 | 2019-06-14 | Method and device for detecting end-to-end voice endpoint related to speaker |
| Publication Number | Publication Date |
|---|---|
| CN110136749Atrue CN110136749A (en) | 2019-08-16 |
| CN110136749B CN110136749B (en) | 2022-08-16 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910517374.3AActiveCN110136749B (en) | 2019-06-14 | 2019-06-14 | Method and device for detecting end-to-end voice endpoint related to speaker |
| Country | Link |
|---|---|
| CN (1) | CN110136749B (en) |
Patent Citations

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1588535A (en)* | 2004-09-29 | 2005-03-02 | 上海交通大学 | Automatic sound identifying treating method for embedded sound identifying system |
| CN104167208A (en)* | 2014-08-08 | 2014-11-26 | 中国科学院深圳先进技术研究院 | Speaker recognition method and device |
| CN108320732A (en)* | 2017-01-13 | 2018-07-24 | 阿里巴巴集团控股有限公司 | The method and apparatus for generating target speaker's speech recognition computation model |
| CN108417201A (en)* | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Single-channel multi-speaker identification method and system |
| CN109256135A (en)* | 2018-08-28 | 2019-01-22 | 桂林电子科技大学 | A kind of end-to-end method for identifying speaker, device and storage medium |
Non-Patent Citations

| Title |
|---|
| XIAOWEI JIANG ET AL.: "Integrating Online i-vector into GMM-UBM for Text-dependent Speaker Verification", Proceedings of APSIPA Annual Summit and Conference 2017* |
Cited By

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110705907A (en)* | 2019-10-16 | 2020-01-17 | 江苏网进科技股份有限公司 | Classroom teaching auxiliary supervision method and system based on audio voice processing technology |
| CN110808073A (en)* | 2019-11-13 | 2020-02-18 | 苏州思必驰信息科技有限公司 | Voice activity detection method, voice recognition method and system |
| CN113261056A (en)* | 2019-12-04 | 2021-08-13 | 谷歌有限责任公司 | Speaker perception using speaker-dependent speech models |
| CN111179972A (en)* | 2019-12-12 | 2020-05-19 | 中山大学 | Human voice detection algorithm based on deep learning |
| CN111312218A (en)* | 2019-12-30 | 2020-06-19 | 苏州思必驰信息科技有限公司 | Neural network training and voice endpoint detection method and device |
| CN111540344A (en)* | 2020-04-21 | 2020-08-14 | 北京字节跳动网络技术有限公司 | Acoustic network model training method and device and electronic equipment |
| US12300218B2 (en) | 2020-04-21 | 2025-05-13 | Beijing Bytedance Network Technology Co., Ltd. | Method and apparatus for training acoustic network model, and electronic device |
| CN111540344B (en)* | 2020-04-21 | 2022-01-21 | 北京字节跳动网络技术有限公司 | Acoustic network model training method and device and electronic equipment |
| CN111641599A (en)* | 2020-05-11 | 2020-09-08 | 国家计算机网络与信息安全管理中心 | Identification method of VoIP network flow affiliated platform |
| CN111641599B (en)* | 2020-05-11 | 2022-04-15 | 国家计算机网络与信息安全管理中心 | Identification method of VoIP network flow affiliated platform |
| WO2021151310A1 (en)* | 2020-06-19 | 2021-08-05 | 平安科技(深圳)有限公司 | Voice call noise cancellation method, apparatus, electronic device, and storage medium |
| CN111785258A (en)* | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
| CN111785258B (en)* | 2020-07-13 | 2022-02-01 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
| CN111816215A (en)* | 2020-07-24 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Voice endpoint detection model training and use method and device |
| CN111816218A (en)* | 2020-07-31 | 2020-10-23 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and storage medium |
| CN111816218B (en)* | 2020-07-31 | 2024-05-28 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and storage medium |
| CN111816216A (en)* | 2020-08-25 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Voice activity detection method and device |
| CN111986680A (en)* | 2020-08-26 | 2020-11-24 | 天津洪恩完美未来教育科技有限公司 | Method and device for evaluating spoken language of object, storage medium and electronic device |
| CN112017685B (en)* | 2020-08-27 | 2023-12-22 | 抖音视界有限公司 | Speech generation method, device, equipment and computer readable medium |
| CN112017685A (en)* | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
| CN112489692A (en)* | 2020-11-03 | 2021-03-12 | 北京捷通华声科技股份有限公司 | Voice endpoint detection method and device |
| CN112562724A (en)* | 2020-11-30 | 2021-03-26 | 携程计算机技术(上海)有限公司 | Speech quality evaluation model, training evaluation method, system, device, and medium |
| CN112562724B (en)* | 2020-11-30 | 2024-05-17 | 携程计算机技术(上海)有限公司 | Speech quality assessment model, training assessment method, training assessment system, training assessment equipment and medium |
| CN112735385B (en)* | 2020-12-30 | 2024-05-31 | 中国科学技术大学 | Voice endpoint detection method, device, computer equipment and storage medium |
| CN112735385A (en)* | 2020-12-30 | 2021-04-30 | 科大讯飞股份有限公司 | Voice endpoint detection method and device, computer equipment and storage medium |
| CN113345423B (en)* | 2021-06-24 | 2024-02-13 | 中国科学技术大学 | Voice endpoint detection method, device, electronic equipment and storage medium |
| CN113345423A (en)* | 2021-06-24 | 2021-09-03 | 科大讯飞股份有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium |
| CN113470698B (en)* | 2021-06-30 | 2023-08-08 | 北京有竹居网络技术有限公司 | A speaker transition point detection method, device, equipment and storage medium |
| WO2023273984A1 (en)* | 2021-06-30 | 2023-01-05 | 北京有竹居网络技术有限公司 | Speaker change point detection method and apparatus, and device and storage medium |
| US12039981B2 (en) | 2021-06-30 | 2024-07-16 | Beijing Youzhuju Network Technology Co., Ltd. | Method, apparatus, device, and storage medium for speaker change point detection |
| CN113470698A (en)* | 2021-06-30 | 2021-10-01 | 北京有竹居网络技术有限公司 | Speaker transfer point detection method, device, equipment and storage medium |
| CN114495947A (en)* | 2022-03-04 | 2022-05-13 | 蔚来汽车科技(安徽)有限公司 | Method and apparatus for detecting voice activity |
| CN115565527A (en)* | 2022-08-10 | 2023-01-03 | 科大讯飞华南有限公司 | Voice processing method and device applied to robot |
| WO2024183583A1 (en)* | 2023-03-06 | 2024-09-12 | 维沃移动通信有限公司 | Voice activity detection method and apparatus, and electronic device and readable storage medium |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 2020-06-16. Address after: 215123, 14 Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu. Applicant after: AI SPEECH Co., Ltd.; Shanghai Jiaotong University Intellectual Property Management Co., Ltd. Address before: 215123, 14 Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu. Applicant before: AI SPEECH Co., Ltd.; SHANGHAI JIAO TONG UNIVERSITY |
| | TA01 | Transfer of patent application right | Effective date of registration: 2020-10-28. Address after: 215123, 14 Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu. Applicant after: AI SPEECH Co., Ltd. Address before: 215123, 14 Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu. Applicant before: AI SPEECH Co., Ltd.; Shanghai Jiaotong University Intellectual Property Management Co., Ltd. |
| | CB02 | Change of applicant information | Address after: 215123, Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant after: Sipic Technology Co., Ltd. Address before: 215123, Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: AI SPEECH Co., Ltd. |
| | GR01 | Patent grant | |