Technical Field
The present invention relates to the technical field of voiceprint recognition, and in particular to a speaker identity verification method, apparatus, and storage medium.
Background
Voiceprint recognition, also known as speaker recognition, comprises speaker verification and speaker identification. Speaker verification refers to confirming a speaker's identity from known audio and speech information. Speaker verification technology has matured steadily and is now widely applied in practice, for example in criminal investigation by public security agencies, urban community personnel management, and voiceprint-based attendance in office areas.
However, existing voiceprint recognition technology typically achieves a high recognition rate only in the laboratory; in real applications, complex environmental factors impose severe limitations, so in actual use the technology falls far short of the expected performance.
Summary of the Invention
In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a speaker identity verification method, apparatus, and storage medium that enable a speaker verification neural network to verify speakers more accurately.
To solve the above technical problem, the present invention provides a speaker identity verification method comprising the following steps:
obtaining a trained speaker verification neural network;
inputting the speech of a speaker to be identified, together with a speaker speech database, into the trained speaker verification neural network, and identifying the identity of the speaker corresponding to that speech, where the speaker speech database contains multiple utterances from multiple different speakers.
The beneficial effect of the present invention is as follows: by obtaining the trained speaker verification neural network and inputting the speech of the speaker to be identified, together with the speaker speech database, into that network, the identity of the corresponding speaker is determined. Speaker verification is thereby performed on the basis of the trained network, improving both the accuracy and the stability with which the network identifies speakers.
On the basis of the above technical solution, the present invention can be further improved as follows.
Further, the method also includes a step of pre-training the speaker verification neural network, which specifically includes:
constructing a speaker verification neural network for extracting speaker feature representations;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix from the training samples;
training the speaker verification neural network on the training samples and the extended similarity matrix to obtain the trained speaker verification neural network.
The beneficial effect of this further solution is as follows: a speaker verification neural network is first constructed, training samples are selected, an extended similarity matrix is determined from those samples, and the network is trained on the samples and the matrix, yielding the trained speaker verification neural network. Speaker verification can then be performed on the basis of the trained network, improving the accuracy and stability with which it identifies speakers.
Further, constructing the speaker verification neural network specifically includes:
obtaining speech samples;
extracting acoustic features from the speech samples;
inputting the acoustic features into an LSTM network to learn speaker feature representations of the speech samples, thereby obtaining the speaker verification neural network.
The beneficial effect of this further solution is as follows: by extracting the acoustic features of the speech samples and feeding them into an LSTM network to learn the acoustic characteristics of the speakers in the samples, a simple speaker verification neural network capable of extracting speaker feature representations is obtained.
Further, the training samples include a speech sample to be identified, positive training samples, negative training samples for contrast, and auxiliary training samples that supplement the number of positive training samples. Selecting training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, comprising one target speaker and N-1 contrast speakers, and selecting N-1 speech subsets for each of the speakers, each speech subset containing M utterances;
selecting one of the target speaker's speech subsets as the target speech subset, selecting one utterance from the target speech subset as the speech sample to be identified, and taking the other utterances in the target speech subset as the positive training samples;
taking the target speaker's speech subsets other than the target speech subset as the auxiliary training samples;
selecting a speech subset from the contrast speakers' speech subsets as the negative training samples.
The beneficial effect of this further solution is as follows: adding auxiliary training samples to the sample selection turns the original one-to-one or one-to-many pairing of positive and negative training samples into a many-to-many one, in which the number of positive plus auxiliary training samples equals the number of negative training samples. The training samples of the speaker verification neural network are thus more balanced, which improves the recognition accuracy and speed of the trained network.
Further, determining the extended similarity matrix from the training samples specifically includes:
computing a positive training sample center, negative training sample centers, and auxiliary training sample centers from the positive, negative, and auxiliary training samples;
computing the distance between the speech sample to be identified and the positive training sample center, and constructing a vector matrix from that distance;
computing the distances between the speech sample to be identified and the negative training sample centers, and constructing a negative-sample similarity matrix from those distances;
combining the vector matrix and the negative-sample similarity matrix into a positive-negative similarity matrix;
computing the distances between the speech sample to be identified and the auxiliary training sample centers, and building an auxiliary similarity matrix from those distances;
obtaining the extended similarity matrix from the positive-negative similarity matrix and the auxiliary similarity matrix.
The beneficial effect of this further solution is as follows: by constructing the extended similarity matrix, the auxiliary training samples can be incorporated into the training of the speaker verification neural network, yielding a trained model whose recognition accuracy and speed are both substantially improved.
Further, the method also includes:
constructing a loss function, and optimizing the speaker verification neural network to convergence based on the loss function.
Further, the loss function is expressed as:
L(e_{i,o}) = 1 − σ( min( S_{i,oi,pos}, min_k S_{i,ok,ass} ) ) + α · σ( max_j S_{i,oj,neg} )
where e_{i,o} denotes the i-th utterance in the speech subset of the target speaker o among the speech samples to be identified; N denotes the number of different speakers; k indexes the k-th auxiliary speech subset; j indexes the j-th speech subset among the negative training samples; σ denotes the sigmoid function; S_{i,ok,ass} denotes the distance between the speech sample to be identified and the auxiliary training sample center of the k-th auxiliary speech subset; S_{i,oi,pos} denotes the distance between the speech sample to be identified and the positive training sample center; S_{i,oj,neg} denotes the distance between the speech sample to be identified and the negative training sample center of the j-th speech subset; and α is an adjustment factor.
The beneficial effect of this further solution is as follows: the loss function selects the best sample-center distances, enabling the speaker verification neural network to converge quickly and reducing the computational load of the network.
To solve the above technical problem, an embodiment of the present invention further provides a speaker identity verification apparatus comprising:
an obtaining module, configured to obtain a trained speaker verification neural network;
an identification module, configured to input the speech of a speaker to be identified, together with a speaker speech database, into the trained speaker verification neural network and to identify the identity of the speaker corresponding to that speech.
To solve the above technical problem, an embodiment of the present invention further provides a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the speaker identity verification method of any of the above embodiments.
To solve the above technical problem, an embodiment of the present invention further provides a speaker identity verification apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the speaker identity verification method of any of the above embodiments.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speaker identity verification method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the negative, auxiliary, and positive training sample centers provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the construction of the extended similarity matrix provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the mapping relationship among the positive, negative, and auxiliary training samples provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a speaker identity verification apparatus provided by an embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the accompanying drawings; the examples given serve only to explain the invention and are not intended to limit its scope.
FIG. 1 shows a schematic flowchart of a speaker identity verification method provided by an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment includes the following steps:
obtaining a trained speaker verification neural network;
inputting the speech of a speaker to be identified, together with a speaker speech database, into the trained speaker verification neural network, and identifying the identity of the speaker corresponding to that speech, where the speaker speech database contains multiple utterances from multiple different speakers.
The trained speaker verification neural network, once obtained, is used to perform speaker verification. The speech of the speaker to be identified and the speaker speech database are input into the trained network, which identifies the identity of the speaker corresponding to the input speech.
It should be noted that the speaker speech database is a speech database composed of multiple utterances from multiple different speakers. The identification process matches the speech of the speaker to be identified against the speech samples in the database to confirm the identity of the corresponding speaker; a sketch of one way to implement this matching is given below.
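As one illustration of the matching step, the following sketch assumes each utterance has already been mapped to an L2-normalized embedding by the verification network, and that the database holds per-speaker enrollment embeddings; the function name and the decision threshold are illustrative assumptions, not values specified by this embodiment.

```python
import numpy as np

def identify_speaker(probe_embedding, database, threshold=0.7):
    """Score a probe embedding against per-speaker enrollment centroids by
    cosine similarity; return the best-matching identity, or None when no
    speaker clears the (assumed) acceptance threshold.

    database: dict mapping speaker id -> (K, D) array of enrolled embeddings,
    with every embedding L2-normalized so a dot product is cosine similarity.
    """
    best_id, best_score = None, -1.0
    for speaker_id, embeddings in database.items():
        centroid = embeddings.mean(axis=0)
        centroid /= np.linalg.norm(centroid)          # re-normalize the centroid
        score = float(np.dot(probe_embedding, centroid))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= threshold else None
```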
In the above embodiment, the trained speaker verification neural network is obtained, the speech of the speaker to be identified and the speaker speech database are input into it, and the identity of the speaker corresponding to the speech is identified. Speaker verification is thus performed on the basis of the trained network, improving the accuracy and stability with which the network identifies speakers.
Optionally, the method also includes a step of pre-training the speaker verification neural network, which specifically includes:
constructing a speaker verification neural network for extracting speaker feature representations;
selecting training samples based on different speech subsets of different speakers;
determining an extended similarity matrix from the training samples;
training the speaker verification neural network on the training samples and the extended similarity matrix to obtain the trained speaker verification neural network.
In the above embodiment, a speaker verification neural network is first constructed, training samples are selected, an extended similarity matrix is determined from those samples, and the network is trained on the samples and the matrix, yielding the trained speaker verification neural network, so that speaker verification can be performed on the basis of the trained network with improved accuracy and stability.
Specifically, constructing the speaker verification neural network includes:
obtaining speech samples;
extracting acoustic features from the speech samples;
inputting the acoustic features into an LSTM network to learn speaker feature representations of the speech samples, thereby obtaining the speaker verification neural network.
It should be noted that, after the speech samples are obtained, they are preprocessed. Each speech sample is divided into windowed frames, with a frame length of 25 ms and a frame shift of 10 ms, and the first 180 frames and the last 180 frames of each sample are taken as input data. Filter-bank acoustic features are then extracted for each frame, and the features of each speech sample are fed into a 3-layer LSTM network (Long Short-Term Memory, a type of recurrent neural network, RNN) to learn speaker feature representations of the speech samples.
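A minimal sketch of this front end is given below, assuming PyTorch and torchaudio are available; the feature dimension, hidden size, and embedding size are illustrative assumptions, since the embodiment specifies only the framing parameters and the three LSTM layers.

```python
import torch
import torch.nn as nn
import torchaudio

def extract_fbank(waveform, sample_rate=16000, num_mel_bins=40):
    """Frame the waveform with a 25 ms window and 10 ms shift and compute
    filter-bank features, matching the preprocessing described above."""
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        frame_length=25.0,   # ms
        frame_shift=10.0,    # ms
        num_mel_bins=num_mel_bins,
    )

class SpeakerEmbedder(nn.Module):
    """3-layer LSTM mapping a frame sequence to an utterance-level
    speaker embedding e (layer sizes here are assumptions)."""
    def __init__(self, feat_dim=40, hidden=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)
        e = self.proj(out[:, -1, :])     # state at the last frame
        return nn.functional.normalize(e, dim=-1)   # L2-normalize for cosine scoring
```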
In the above embodiment, by extracting the acoustic features of the speech samples and feeding them into the LSTM network to learn the acoustic characteristics of the speakers in the samples, a simple speaker verification neural network is obtained.
Specifically, the training samples include a speech sample to be identified, positive training samples, negative training samples for contrast, and auxiliary training samples that supplement the number of positive training samples. Selecting training samples based on different speech subsets of different speakers specifically includes:
selecting N different speakers, comprising one target speaker and N-1 contrast speakers, and selecting N-1 speech subsets for each of the speakers, each speech subset containing M utterances;
selecting one of the target speaker's speech subsets as the target speech subset, selecting one utterance from the target speech subset as the speech sample to be identified, and taking the other utterances in the target speech subset as the positive training samples;
taking the target speaker's speech subsets other than the target speech subset as the auxiliary training samples;
selecting a speech subset from the contrast speakers' speech subsets as the negative training samples.
Conventional sample selection for training the speaker verification neural network is one-to-one or one-to-many: one positive training sample is paired with one negative training sample, or one positive training sample with several negative training samples. The resulting imbalance in the training samples lowers the recognition accuracy of the neural network. The present invention therefore introduces auxiliary training samples into the sample selection; these supplement the number of positive training samples and turn the original one-to-many scheme into a many-to-many one. For example, when three auxiliary training samples are introduced, training pairs one positive training sample plus three auxiliary training samples against four negative training samples.
It should be noted that the training samples are selected as follows: N different speakers are randomly selected, comprising one target speaker and (N-1) contrast speakers, and (N-1) speech subsets are randomly selected for each speaker, each subset containing M utterances. One utterance from one of the target speaker's speech subsets is randomly selected as the speech sample to be identified, the other utterances in that subset serve as the positive training samples, and the remaining (N-2) subsets serve as auxiliary speech subsets, referred to as the auxiliary training samples. The auxiliary training samples supplement the positive training samples so that the number of positive plus auxiliary training samples equals the number of negative training samples; a sketch of this sampling scheme follows.
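The following sketch illustrates this sampling scheme under the assumption that the corpus is held as a mapping from speaker id to that speaker's utterance subsets; all names are illustrative.

```python
import random

def sample_training_batch(corpus, N, M):
    """corpus: dict speaker_id -> list of speech subsets, each holding at
    least M utterances. Returns the utterance to identify, the positive
    samples, the (N-2) auxiliary subsets, and the (N-1) negative subsets."""
    speakers = random.sample(list(corpus), N)
    target, contrasts = speakers[0], speakers[1:]        # 1 target, N-1 contrast speakers

    # N-1 subsets of M utterances per selected speaker
    subsets = {s: [random.sample(sub, M) for sub in random.sample(corpus[s], N - 1)]
               for s in speakers}

    target_subsets = subsets[target]
    probe_subset = random.choice(target_subsets)
    probe = random.choice(probe_subset)                  # speech sample to identify
    positives = [u for u in probe_subset if u is not probe]
    auxiliaries = [sub for sub in target_subsets if sub is not probe_subset]
    negatives = [random.choice(subsets[c]) for c in contrasts]
    return probe, positives, auxiliaries, negatives
```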
In the above embodiment, adding auxiliary training samples to the sample selection turns the original one-to-one or one-to-many training scheme into a many-to-many one, balancing the training samples and thereby improving the recognition accuracy and speed of the trained speaker verification neural network.
Specifically, determining the extended similarity matrix from the training samples includes:
computing a positive training sample center, negative training sample centers, and auxiliary training sample centers from the positive, negative, and auxiliary training samples;
computing the distance between the speech sample to be identified and the positive training sample center, and constructing a vector matrix from that distance;
computing the distances between the speech sample to be identified and the negative training sample centers, and constructing a negative-sample similarity matrix from those distances;
combining the vector matrix and the negative-sample similarity matrix into a positive-negative similarity matrix;
computing the distances between the speech sample to be identified and the auxiliary training sample centers, and building an auxiliary similarity matrix from those distances;
obtaining the extended similarity matrix from the positive-negative similarity matrix and the auxiliary similarity matrix.
As shown in FIG. 2, the sample centers of the positive, negative, and auxiliary training samples are computed, yielding the positive training sample center, the negative training sample centers, and the auxiliary training sample centers. The negative training sample center is computed as:
c_{j,o,neg} = (1/M) · Σ_{m=1}^{M} e_{j,m}
where c_{j,o,neg} denotes the negative training sample center, o denotes the target speaker, j indexes the speech subset of the j-th contrast speaker among the negative training samples, e_{j,m} denotes the m-th utterance in the speech subset of the j-th negative training sample, and M is the number of utterances in each speaker's speech subset among the negative training samples.
The auxiliary training sample center is computed as:
c_{k,o,ass} = (1/M) · Σ_{m=1}^{M} e_{k,m}
where c_{k,o,ass} denotes the auxiliary training sample center, o denotes the target speaker, k indexes the k-th auxiliary speech subset among the auxiliary training samples, e_{k,m} denotes the m-th utterance in the k-th auxiliary speech subset, and M is the number of utterances in each auxiliary speech subset.
The positive training sample center is computed as:
c_o^(-i) = (1/(M-1)) · Σ_{m=1, m≠i}^{M} e_m
where c_o^(-i) denotes the positive training sample center, o denotes the target speaker, (-i) indicates that the i-th utterance is excluded, e_m denotes the m-th utterance in the speech subset of the speaker to be identified, and M is the number of utterances in the positive training sample speech subset.
It should be noted that the distance between the speech sample to be identified and the positive training sample center is computed as:
S_{i,oi,pos} = w · cos(e_{i,o}, c_o^(-i)) + b
where S_{i,oi,pos} denotes the distance between the speech sample to be identified and the positive training sample center, i ∈ (1, M), and w and b are learned parameters.
The distance between the speech sample to be identified and a negative training sample center is computed as:
S_{i,oj,neg} = w · cos(e_{i,o}, c_{j,o,neg}) + b
where S_{i,oj,neg} denotes the distance between the speech sample to be identified and the center of the j-th negative training sample, i ∈ (1, M), j ∈ (1, N-1), and w and b are learned parameters.
The distance between the speech sample to be identified and an auxiliary training sample center is computed as:
S_{i,ok,ass} = w · cos(e_{i,o}, c_{k,o,ass}) + b
where S_{i,ok,ass} denotes the distance between the speech sample to be identified and the center of the k-th auxiliary training sample, i ∈ (1, M), k ∈ (1, N-2), and w and b are learned parameters.
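The centroid formulas and the scaled-cosine distances above can be written compactly as follows; this is a sketch assuming each utterance is already an L2-normalized embedding vector produced by the network, with w and b the learned scale and bias.

```python
import torch
import torch.nn.functional as F

def centroid(subset):
    """Mean embedding of a subset (M, D): used for c_{j,o,neg} and c_{k,o,ass}."""
    return subset.mean(dim=0)

def positive_centroid_excluding(subset, i):
    """c_o^(-i): mean over the target subset with the i-th utterance removed."""
    mask = torch.ones(subset.size(0), dtype=torch.bool)
    mask[i] = False
    return subset[mask].mean(dim=0)

def scaled_cosine(e, c, w, b):
    """S = w * cos(e, c) + b, the learned affine cosine similarity."""
    return w * F.cosine_similarity(e.unsqueeze(0), c.unsqueeze(0)).squeeze(0) + b
```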
FIG. 3 is a schematic diagram of the construction of the extended similarity matrix provided by an embodiment of the present invention. As shown in FIG. 3, first, the distance between the speech sample to be identified and the positive training sample center is computed, and a vector matrix is constructed from that distance.
Next, the distances between the speech sample to be identified and the negative training sample centers are computed, and a negative-sample similarity matrix is constructed from those distances.
The vector matrix and the negative-sample similarity matrix are then combined into a positive-negative similarity matrix.
Next, the distances between the speech sample to be identified and the auxiliary training sample centers are computed, and an auxiliary similarity matrix is built from those distances.
Finally, the extended similarity matrix is obtained from the positive-negative similarity matrix and the auxiliary similarity matrix.
That is, the distances between the speech sample to be identified and the positive training sample center are assembled into a vector matrix, the distances to the negative training sample centers are assembled into a negative-sample similarity matrix, and the two are combined into the positive-negative similarity matrix.
Because the present invention introduces auxiliary training samples into the sample selection, the similarity computation must also fold the auxiliary training samples into the similarity matrix. First, the distances between the speech sample to be identified and the auxiliary training sample centers are computed, and all of these distances are combined into the auxiliary similarity matrix; the positive-negative similarity matrix and the auxiliary similarity matrix are then combined into a new similarity matrix, the extended similarity matrix. Through the extended similarity matrix, the auxiliary training samples are incorporated into the training of the speaker verification neural network, as in the sketch below.
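The assembly just described amounts to two concatenations; the sketch below assumes the per-center distances have already been computed (for example with the functions of the previous sketch), and the matrix shapes are illustrative assumptions.

```python
import torch

def extended_similarity_matrix(s_pos, s_neg, s_ass):
    """s_pos: (M, 1) distances to the positive center (the vector matrix);
    s_neg: (M, N-1) distances to the negative centers;
    s_ass: (M, N-2) distances to the auxiliary centers.
    Returns the (M, 2N-2) extended similarity matrix."""
    pos_neg = torch.cat([s_pos, s_neg], dim=1)   # positive-negative similarity matrix
    return torch.cat([pos_neg, s_ass], dim=1)    # append the auxiliary similarity matrix
```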
FIG. 4 is a schematic diagram of the mapping relationship among the positive, negative, and auxiliary training samples provided by an embodiment of the present invention. As shown in FIG. 4, introducing the auxiliary training samples makes the number of positive training samples combined with auxiliary training samples equal to the number of negative training samples.
It should be noted that the auxiliary training samples are introduced because, in conventional training of a speaker verification neural network, positive and negative training samples are selected one-to-one or one-to-many. The resulting imbalance makes the training samples unfair and keeps the network from performing well. The present invention therefore introduces new auxiliary training samples to balance the training samples and improve the accuracy and speed of recognition by the speaker verification neural network.
Optionally, the method also includes:
constructing a loss function, and optimizing the speaker verification neural network to convergence based on the loss function.
Specifically, the loss function is expressed as:
L(e_{i,o}) = 1 − σ( min( S_{i,oi,pos}, min_k S_{i,ok,ass} ) ) + α · σ( max_j S_{i,oj,neg} )
where e_{i,o} denotes the i-th utterance in the speech subset of the target speaker o among the speech samples to be identified; N denotes the number of different speakers; k indexes the k-th auxiliary speech subset; j indexes the j-th speech subset among the negative training samples; σ denotes the sigmoid function; S_{i,ok,ass} denotes the distance between the speech sample to be identified and the auxiliary training sample center of the k-th auxiliary speech subset; S_{i,oi,pos} denotes the distance between the speech sample to be identified and the positive training sample center; S_{i,oj,neg} denotes the distance between the speech sample to be identified and the negative training sample center of the j-th speech subset; and α is an adjustment factor.
Because auxiliary training samples are introduced into the training samples, a new loss function must be designed. Through this loss function, the minimum of the distances between the speech sample to be identified and the auxiliary training sample centers or the positive training sample center (that is, the own-speaker center farthest from the speech sample to be identified), together with the maximum of the distances to the negative training sample centers (that is, the contrast-speaker center closest to the speech sample to be identified), participates in the computation of the loss. The loss function thus selects the best sample-center distances, enabling the network to converge quickly while achieving more accurate speaker recognition.
Specifically, the total ATS-GE2E loss is obtained by summing the per-utterance losses L(e_{oi}):
L = Σ_{o=1}^{N} Σ_{i=1}^{M} L(e_{o,i})
where o ∈ (1, N) and i ∈ (1, M).
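Read literally, the selection rule above corresponds to a contrast-style GE2E loss; the sketch below is one plausible reading of it rather than a formulation confirmed by the source, with σ the sigmoid and α the adjustment factor.

```python
import torch

def ats_ge2e_loss(s_pos, s_ass, s_neg, alpha=1.0):
    """s_pos: scalar distance to the positive center; s_ass: (N-2,) distances
    to the auxiliary centers; s_neg: (N-1,) distances to the negative centers.
    Penalizes the farthest own-speaker center and the nearest contrast center;
    the total loss is the sum of this quantity over all o and i."""
    own_min = torch.min(torch.cat([s_pos.view(1), s_ass]))  # farthest own center
    neg_max = torch.max(s_neg)                              # nearest contrast center
    return 1.0 - torch.sigmoid(own_min) + alpha * torch.sigmoid(neg_max)
```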
Meanwhile, as shown in FIG. 5, an embodiment of the present invention further provides a speaker identity verification apparatus comprising:
an obtaining module, configured to obtain a trained speaker verification neural network;
an identification module, configured to input the speech of a speaker to be identified, together with a speaker speech database, into the trained speaker verification neural network and to identify the identity of the speaker corresponding to that speech.
Meanwhile, an embodiment of the present invention further provides a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the speaker identity verification method of any of the above embodiments.
Meanwhile, an embodiment of the present invention further provides a speaker identity verification apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the speaker identity verification method of any of the above embodiments.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, the part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.