Technical Field
Embodiments of the present application relate to the field of machine learning technology, and in particular to an identity recognition model training method, an identity recognition method, and related apparatuses.
Background
With the rapid development of science and technology, more and more repetitive manual labor is gradually being replaced by artificial intelligence. As an important branch of artificial intelligence, machine learning has been widely applied in fields such as machine translation, AI customer service, text detection, and voiceprint wake-up.
In the field of voiceprint wake-up, identity recognition can be used to confirm whether a user to be identified is a pre-registered user, and operations such as waking up a terminal device or an application deployed on the terminal device can then be performed according to the identity recognition result. In the prior art, a voiceprint recognition algorithm can be applied to the voiceprint information to obtain an identity recognition result.
However, existing voiceprint recognition algorithms generally first compress an acquired text-dependent voiceprint vector sequence to obtain a voiceprint codebook set, then compute the Euclidean distance between the user's voiceprint information received in the text-dependent voiceprint recognition scenario and the voiceprint codebook set, and determine the user's identity recognition result from that Euclidean distance. Because the identity recognition result is determined from only a single dimension, the user's voiceprint information, the accuracy of the identity recognition result is reduced, which may in turn cause the terminal device or related applications to be woken up by mistake.
Summary
Embodiments of the present application provide an identity recognition model training method, an identity recognition method, and related apparatuses, so as to improve the accuracy of identity recognition results.
In a first aspect, embodiments of the present application provide an identity recognition model training method, the method including:
obtaining a training audio data set, and performing feature extraction on the training audio data set to obtain a training feature set;
inputting the training feature set into a content recognition model included in a model to be trained for iterative training, and inputting the training feature set into the trained content recognition model to output a content vector; and inputting the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and inputting the training feature set into the trained voiceprint recognition model to output a voiceprint vector;
inputting the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximized and its parameters converge, thereby obtaining an identity recognition model.
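The stopping criterion in the last step, iterating until the likelihood is maximized and the parameters converge, can be sketched generically. The following is a minimal illustration only: the toy quadratic log-likelihood, learning rate, and tolerance are illustrative assumptions, not details of this application's actual classifier.

```python
def train_until_converged(grad_log_likelihood, theta0, lr=0.1, tol=1e-6, max_iters=10000):
    """Generic gradient-ascent loop: stop once the parameter update falls
    below tol, i.e. the log-likelihood has stopped improving and the
    parameter has converged."""
    theta = theta0
    for _ in range(max_iters):
        step = lr * grad_log_likelihood(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Toy concave log-likelihood L(theta) = -(theta - 3)^2 with gradient
# -2 * (theta - 3); its maximum is at theta = 3, so the loop should
# converge to a value near 3.
theta_hat = train_until_converged(lambda t: -2.0 * (t - 3.0), theta0=0.0)
```

In practice the classifier would have many parameters and the likelihood would be computed over the content/voiceprint vector pairs, but the convergence test ("likelihood maximized, parameters converged") has the same shape.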
It can be seen that, in the embodiments of the present application, the trained identity recognition model includes both a content recognition model and a voiceprint recognition model, so that the two dimensions of content and voiceprint are taken into account simultaneously, improving the accuracy of subsequent identity recognition. In addition, the identity recognition model further includes a classifier, and the data used to train the classifier are extracted by the trained content recognition model and voiceprint recognition model; since the data extracted by the trained models are highly accurate, the accuracy of the classifier trained on these data is improved, which in turn improves the accuracy of the identity recognition model and ultimately further improves the accuracy of subsequent identity recognition.
In a second aspect, embodiments of the present application provide an identity recognition method, the method including:
obtaining first voice data of a user to be identified;
inputting the first voice data into a content recognition model included in an identity recognition model to output a target content vector; and inputting the first voice data into a voiceprint recognition model included in the identity recognition model to output a target voiceprint vector;
inputting the target content vector, the target voiceprint vector, a preset content vector, and a preset voiceprint vector into a classifier included in the identity recognition model to output a likelihood distribution value, where the preset content vector is obtained by inputting second voice data of a target user into the content recognition model, and the preset voiceprint vector is obtained by inputting the second voice data of the target user into the voiceprint recognition model;
when the likelihood distribution value is greater than a preset likelihood distribution value threshold, determining that the user to be identified and the target user are the same user.
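The decision in the last step reduces to a threshold comparison on the classifier's likelihood output. A minimal sketch follows; the function names and the mapping to a wake-up action are illustrative assumptions, not this application's actual API.

```python
def is_same_user(likelihood_value: float, likelihood_threshold: float) -> bool:
    """Accept the user to be identified as the target user only when the
    classifier's likelihood distribution value is strictly greater than
    the preset likelihood distribution value threshold."""
    return likelihood_value > likelihood_threshold

def decide_wakeup(likelihood_value: float, likelihood_threshold: float) -> str:
    """Map the identity decision to the wake-up behavior described above."""
    return "wake" if is_same_user(likelihood_value, likelihood_threshold) else "ignore"
```

Note that the comparison is strict ("greater than"), matching the wording of the claim, so a value exactly at the threshold is rejected.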
It can be seen that, in the embodiments of the present application, identity recognition comprehensively considers both content information and voiceprint information, adding factors to the identity decision instead of relying on a single type of information, thereby improving the accuracy of identity recognition.
In a third aspect, embodiments of the present application provide an identity recognition model training apparatus, the apparatus including:
a first obtaining module, configured to obtain a training audio data set and perform feature extraction on the training audio data set to obtain a training feature set;
a first processing module, configured to input the training feature set into a content recognition model included in a model to be trained for iterative training, and input the training feature set into the trained content recognition model to output a content vector; and to input the training feature set into a voiceprint recognition model included in the model to be trained for iterative training, and input the training feature set into the trained voiceprint recognition model to output a voiceprint vector;
the first processing module being further configured to input the content vector and the voiceprint vector into a classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximized and its parameters converge, thereby obtaining an identity recognition model.
In a fourth aspect, embodiments of the present application provide an identity recognition apparatus, the apparatus including:
a second obtaining module, configured to obtain first voice data of a user to be identified;
a second processing module, configured to input the first voice data into a content recognition model included in an identity recognition model to output a target content vector, and to input the first voice data into a voiceprint recognition model included in the identity recognition model to output a target voiceprint vector;
the second processing module being further configured to input the target content vector, the target voiceprint vector, a preset content vector, and a preset voiceprint vector into a classifier included in the identity recognition model to output a likelihood distribution value, where the preset content vector is obtained by inputting second voice data of a target user into the content recognition model, and the preset voiceprint vector is obtained by inputting the second voice data of the target user into the voiceprint recognition model;
the second processing module being further configured to determine that the user to be identified and the target user are the same user when the likelihood distribution value is greater than a preset likelihood distribution value threshold.
In a fifth aspect, embodiments of the present application provide an electronic device, including at least one processor and a memory;
the memory storing computer-executable instructions;
the at least one processor executing the computer-executable instructions stored in the memory, so that the at least one processor performs the identity recognition model training method according to any one of the first aspect, or the identity recognition method according to any one of the second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the identity recognition model training method according to any one of the first aspect, or the identity recognition method according to any one of the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product including a computer program which, when executed by a processor, implements the identity recognition model training method according to the first aspect and its various possible designs, or the identity recognition method according to any one of the second aspect.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of an implementation environment of the identity recognition model training method provided by an embodiment of the present application;
Figure 2 is a schematic flowchart of the identity recognition model training method provided by an embodiment of the present application;
Figure 3 is a schematic architecture diagram of the identity recognition model provided by an embodiment of the present application;
Figure 4 is a schematic flowchart of an identity recognition model training method provided by another embodiment of the present application;
Figure 5 is a schematic diagram of the principle of the identity recognition model training process provided by an embodiment of the present application;
Figure 6 is a schematic flowchart of the identity recognition method provided by an embodiment of the present application;
Figure 7 is a visualized scenario diagram of the identity recognition method provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of the identity recognition model training apparatus provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of the identity recognition apparatus provided by an embodiment of the present application;
Figure 10 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the accompanying drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can also be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
With the development of science and technology, artificial intelligence has been widely applied in fields such as machine translation, AI customer service, text detection, and voiceprint wake-up. In the field of voiceprint wake-up, identity recognition can be used to confirm whether a user to be identified is a pre-registered user, and operations such as waking up a terminal device or an application deployed on the terminal device can then be performed according to the identity recognition result. For example, in some application scenarios, a user may need to wake up the terminal device or an application deployed on it by voiceprint because the user's hands are occupied or the user is some distance away. For instance, a user needs to start payment application A while shopping, but is carrying shopping bags in both hands and cannot conveniently start payment application A by touching the terminal device; the user can therefore start payment application A by voiceprint wake-up to make a payment.
In the prior art, a voiceprint recognition algorithm can be applied to the voiceprint information to obtain an identity recognition result, and whether the terminal device or a related application needs to be woken up is then determined according to that result. However, existing voiceprint recognition algorithms generally first compress an acquired text-dependent voiceprint vector sequence to obtain a voiceprint codebook set, then compute the Euclidean distance between the user's voiceprint information received in the text-dependent voiceprint recognition scenario and the voiceprint codebook set, and determine the user's identity recognition result from that Euclidean distance.
Continuing the above example, when the user starts payment application A by voiceprint wake-up to make a payment: because payment applications involve real transactions, the security requirements for wake-up are high, i.e., the user's identity recognition result must be determined first, and the application can be started only after the identity check passes. An existing voiceprint recognition algorithm can obtain an identity recognition result through voiceprint recognition and thereby determine the user's identity, but another problem then arises: if multiple applications, or multiple payment applications, are deployed on the terminal device, the terminal device has determined via voiceprint recognition only that the speaker is the preset user and that an application may be started, but cannot determine which application to start. In addition, an application might be started according to a preset startup order, yet that application might not be a payment application and so cannot complete the payment, or might not be the particular payment application the user wants to start, and similar problems of inaccurate starting may exist in other application scenarios. In other words, determining the identity recognition result from the user's voiceprint information alone both reduces the recognition accuracy of the result and may cause false wake-ups, degrading the user experience.
To address the above problems, the present application trains an identity recognition model by combining the voiceprint information and the content information in a speech training set, and then uses the trained identity recognition model to jointly recognize the voiceprint information and the content information in the speech. By no longer relying solely on voiceprint information, this achieves the technical effect of improving the recognition accuracy of identity recognition results while reducing subsequent false wake-ups of terminal devices or related applications.
Figure 1 is a schematic diagram of an implementation environment of the identity recognition model training method provided by an embodiment of the present application. As shown in Figure 1, the implementation environment of this embodiment may mainly include a server 101 and a terminal device 102, where the terminal device 102 communicates with the server 101 in a wireless or wired manner. The wired manner may be data transmission between the terminal device 102 and the server 101 over a line such as a High Definition Multimedia Interface (HDMI), and the wireless manner may be communication between the terminal device 102 and the server 101 via Bluetooth, WIFI, or the like.
In addition, the implementation environment of this embodiment may further include a database 103 storing a training audio data set. In one implementation, as shown in Figure 1, the server 101 may obtain the training audio data set from the database 103 and then perform model training based on it to obtain an identity recognition model. After training is completed, the identity recognition model may be deployed on the terminal device 102; the terminal device 102 may recognize the user's identity information according to the identity recognition model, obtain an identity recognition result, and then perform a wake-up operation on the terminal device 102 or an application deployed on it according to that result.
In another implementation, the terminal device 102 may also obtain the training audio data set directly from the database 103 and perform model training based on it to obtain the identity recognition model. After training is completed, the terminal device 102 may recognize the user's identity information according to the identity recognition model, obtain an identity recognition result, and then perform a wake-up operation on the terminal device 102 or an application deployed on it according to that result.
It should be noted that the terminal device 102 may be, but is not limited to, a smart interactive device such as a smartphone, a tablet, a personal computer, a smart home appliance (for example, a water heater, a washing machine, a television, or a smart speaker), or a smart wearable device.
In addition, the server 101 may be an independently deployed server or a cluster server.
It should be noted that the method provided by the present application can be widely applied to different application scenarios involving a voiceprint wake-up function. The implementation of the identity recognition model training method and the identity recognition method provided by the present application is described in detail below with reference to specific application scenarios.
The technical solutions of the present application are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
Figure 2 is a schematic flowchart of the identity recognition model training method provided by an embodiment of the present application. The method of this embodiment may be performed by the server 101 or the terminal device 102. As shown in Figure 2, the method of this embodiment may include:
S201: obtain a training audio data set, and perform feature extraction on the training audio data set to obtain a training feature set.
In this embodiment, the training audio data set may contain multiple pieces of training audio data; the pieces of training audio data may all be produced by the same user, may be produced by different users, or some of them may be produced by the same user.
In addition, when obtaining the training audio data set, a pre-stored training audio data set may be obtained directly from a database, or the training audio data set may be obtained from a third-party training audio generation system. Of course, other ways of obtaining the training audio data set also fall within the protection scope of the present application and are not specifically limited here.
A user can be represented by the user's voiceprint information, and the meaning the user expresses can be represented by the content information extracted from the audio data. Therefore, the training feature set obtained after feature extraction on the training audio data set may include the users' voiceprint features and content features. The voiceprint features and content features can also be represented as vectors, so a voiceprint vector corresponding to the voiceprint features and a content vector corresponding to the content features can be obtained.
Further, the training feature set is extracted from the training audio data set using Mel-Frequency Cepstral Coefficients (MFCC).
Specifically, the Mel frequency scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. MFCC features are spectral features computed using this relationship and are mainly used for extracting features from speech data; for example, 80-dimensional voiceprint features and content features can be extracted via MFCC. Using MFCC to extract features is common practice in audio processing, which improves both the convenience and the accuracy of feature extraction.
In addition, existing toolkits such as kaldi, espnet, or librosa can be used to perform feature extraction on the training audio data set.
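The nonlinear Hz-to-Mel correspondence mentioned above can be illustrated with the widely used conversion formula mel = 2595 · log10(1 + f/700); note that this is one common variant of the Mel scale (toolkits such as librosa also offer the Slaney variant), shown here as an illustrative sketch rather than this application's prescribed formula.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse conversion, from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# By construction, 1000 Hz maps to approximately 1000 Mel; the mapping is
# roughly linear below 1 kHz and logarithmic above, matching how pitch
# resolution of the human ear degrades at higher frequencies.
```

In an MFCC pipeline, a bank of triangular filters spaced evenly on this Mel scale is applied to the power spectrum before the cepstral coefficients are computed.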
In addition, the training audio data set contains target training audio; the target training audio contains a preset wake-up word, and the proportion of the target training audio in the training audio data set is greater than or equal to a preset proportion threshold.
Specifically, to make the content features of the training audio data set more distinct, the number of target training audio samples containing the preset wake-up word can be increased, thereby making the content features (i.e., the phoneme features) corresponding to the preset wake-up word more distinct. Correspondingly, the proportion of the target training audio in the training audio data set can be set to be greater than or equal to the preset proportion threshold, where the proportion threshold can be customized according to the actual application scenario and is not discussed in detail here.
For example, if the wake-up word is "小马小马" ("pony pony"), the training audio data set needs to contain target training audio whose speech includes "小马小马", and the proportion of such target training audio in the training audio data set must be greater than or equal to the proportion threshold.
By increasing the number of target training audio samples containing the preset wake-up word, the distribution of the content features corresponding to the wake-up word becomes more concentrated, which shortens the training time of the identity recognition model and thus improves training efficiency.
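The proportion constraint described above can be checked with a small helper; this is a minimal sketch in which the function name and the default threshold of 0.5 are illustrative assumptions, since the application leaves the threshold to be customized per scenario.

```python
def meets_wakeword_proportion(contains_wakeword, proportion_threshold=0.5):
    """Check that the share of training clips containing the preset
    wake-up word is at least the preset proportion threshold.

    contains_wakeword: iterable of booleans, one per training clip,
    True if the clip's speech includes the wake-up word.
    """
    flags = list(contains_wakeword)
    if not flags:
        return False  # an empty data set cannot satisfy the constraint
    proportion = sum(flags) / len(flags)
    return proportion >= proportion_threshold

# Example: 6 of 8 clips contain the wake word, a proportion of 0.75.
```

A data-preparation step could use such a check to decide whether more wake-word recordings need to be added before training begins.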
S202:将训练特征集输入待训练模型包括的内容识别模型中进行迭代训练,以及将训练特征集输入训练完成后的内容识别模型输出内容向量;以及将训练特征集输入待训练模型包括的声纹识别模型中进行迭代训练,以及将训练特征集输入训练完成后的声纹识别模型输出声纹向量。S202: Input the training feature set into the content recognition model included in the model to be trained for iterative training, and input the training feature set into the content recognition model output content vector after the training is completed; and input the training feature set into the voiceprint included in the model to be trained. Iterative training is performed in the recognition model, and the training feature set is input into the voiceprint recognition model to output a voiceprint vector after training.
在本实施例中,在得到训练特征集之后,训练特征集中包含若干用户的声纹特征和内容特征。现有技术中,在获取到训练特征集之后,可能直接忽视了训练特征集中的内容特征,只是提取出了用户的声纹特征,然后根据用户的声纹特征训练相关的声纹识别算法,并根据声纹识别算法进行用户身份识别。然而,单一的特征并不能全面的对语音信息进行概括。示例性的,在终端设备语音唤醒场景中,需要特定的用户说出特定的关键词或特定语句才可以实现终端设备的唤醒。例如,需要用户B说出关键词“小马小马”才可以实现终端设备的唤醒,然而,现有的声纹识别算法仅可以识别出用户B,并不能识别出关键词或特定语句,若用户B表达了其他关键词或特定语句,也可能唤醒终端设备,导致出现误唤醒的情况,降低了用户的使用体验。In this embodiment, after the training feature set is obtained, the training feature set contains voiceprint features and content features of several users. In the existing technology, after obtaining the training feature set, the content features in the training feature set may be directly ignored, and only the user's voiceprint features are extracted, and then the relevant voiceprint recognition algorithm is trained based on the user's voiceprint features, and User identification is performed based on the voiceprint recognition algorithm. However, a single feature cannot comprehensively summarize speech information. For example, in a terminal device voice wake-up scenario, a specific user needs to say specific keywords or specific sentences to wake up the terminal device. For example, user B needs to say the keyword "小马小马" to wake up the terminal device. However, the existing voiceprint recognition algorithm can only identify user B and cannot identify keywords or specific sentences. If User B expresses other keywords or specific sentences, which may also wake up the terminal device, resulting in accidental wake-up and reducing the user experience.
To reduce false wake-ups, the model to be trained can be trained as a whole. The model to be trained may include a content recognition model, a voiceprint recognition model, and a classifier; once these three components are trained, the identity recognition model is obtained.
Further, the training feature set can be used to train the voiceprint recognition model for extracting voiceprint features and the content recognition model for extracting content features. The trained content recognition model outputs a content vector and the trained voiceprint recognition model outputs a voiceprint vector, so the speech information is expressed by combining the two dimensions of content features and voiceprint features.
It should be noted that there is no fixed order between the training and feature-extraction processes of the content recognition model and the voiceprint recognition model: the content recognition model may be trained first and the voiceprint recognition model second, or vice versa, or the two models may be trained simultaneously. The same applies to feature extraction, which is not discussed in further detail here.
The content vector and the voiceprint vector may both be one-dimensional vectors, and the extracted content vector may be a phoneme vector; that is, the content vector may carry the phoneme information corresponding to each training audio clip (i.e., each training keyword or training sentence) in the training feature set. Specifically, a training keyword may be a single noun (e.g., "小马小马"), while a training sentence may take any other form (e.g., "打开电视机", "turn on the TV"). For example, the content vector may be a = (phonemes of "小马小马", phonemes of "小牛", phonemes of "小鸡小鸡", phonemes of "小马小马", phonemes of "小马小马", phonemes of "小马", phonemes of "小猫", phonemes of "小马小马", phonemes of "小鸡小猫"), and the voiceprint vector may be b = (voiceprint of user 1, voiceprint of user 2, voiceprint of user 3, voiceprint of user 2, voiceprint of user 2, voiceprint of user 1, voiceprint of user 3, voiceprint of user 3, voiceprint of user 3).
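As a rough illustration of the paired vectors above, the content vector and voiceprint vector can be thought of as two aligned label sequences over the same batch of training clips. The phoneme strings and user identifiers below are hypothetical placeholders, not the application's actual representation:

```python
# Toy sketch: a content vector and a voiceprint vector as two aligned
# sequences over the same batch of training clips. The phoneme strings
# (pinyin-like) and user IDs are hypothetical placeholders.
clips = [
    ("小马小马", "user1"), ("小牛", "user2"), ("小鸡小鸡", "user3"),
    ("小马小马", "user2"), ("小马小马", "user2"), ("小马", "user1"),
    ("小猫", "user3"), ("小马小马", "user3"), ("小鸡小猫", "user3"),
]

# Hypothetical phoneme lookup for each training keyword.
phonemes = {
    "小马小马": "x iao m a x iao m a", "小牛": "x iao n iu",
    "小鸡小鸡": "x iao j i x iao j i", "小马": "x iao m a",
    "小猫": "x iao m ao", "小鸡小猫": "x iao j i x iao m ao",
}

content_vector = [phonemes[text] for text, _ in clips]    # vector a
voiceprint_vector = [speaker for _, speaker in clips]     # vector b

assert len(content_vector) == len(voiceprint_vector) == 9
```

The two sequences are index-aligned: entry i of both vectors describes the same training clip, once by its content and once by its speaker.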
S203: Input the content vector and the voiceprint vector into the classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximized and its parameters converge, thereby obtaining the identity recognition model.
In this embodiment, after the content recognition model and the voiceprint recognition model included in the model to be trained have been trained, the content vector output by the content recognition model and the voiceprint vector output by the voiceprint recognition model can be obtained. The classifier included in the model to be trained is then iteratively trained on the content vector and the voiceprint vector until its likelihood is maximized and its parameters converge, yielding the identity recognition model. The classifier may be a PLDA (Probabilistic Linear Discriminant Analysis) model. During the classifier's iterative training, maximum likelihood estimation can be used until the likelihood is maximized and the parameters converge, at which point the identity recognition model is obtained. Correspondingly, the likelihood value produced under the converged parameters is the maximum estimated by the classifier.
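The "train until the likelihood is maximized and the parameters converge" loop can be sketched on a deliberately simple stand-in model: a one-dimensional Gaussian whose mean is updated step by step while the log-likelihood is monitored, with training stopping once the likelihood no longer improves. This is an illustration of the stopping rule only, not of PLDA itself:

```python
import math

def gaussian_loglik(data, mu, var):
    # Log-likelihood of data under N(mu, var).
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in data)

def train_until_converged(data, tol=1e-8, max_iters=100):
    # Hypothetical stand-in for the classifier's iterative training:
    # each step moves the parameter estimate toward the maximum-likelihood
    # solution, and training stops when the likelihood stops improving.
    mu, var = 0.0, 1.0
    prev = gaussian_loglik(data, mu, var)
    cur = prev
    for _ in range(max_iters):
        mu = mu + 0.5 * (sum(data) / len(data) - mu)  # partial step toward the ML mean
        cur = gaussian_loglik(data, mu, var)
        if cur - prev < tol:   # likelihood maximized -> parameters converged
            break
        prev = cur
    return mu, cur

data = [1.8, 2.2, 2.0, 1.9, 2.1]
mu, loglik = train_until_converged(data)
assert abs(mu - 2.0) < 1e-3   # converges to the ML estimate (the sample mean)
```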
Figure 3 is a schematic architecture diagram of the identity recognition model provided by an embodiment of the present application. As shown in Figure 3, in this embodiment the identity recognition model may include a content recognition model for extracting the content vector, a voiceprint recognition model for extracting the voiceprint vector, and a classifier for obtaining a likelihood value from the content vector and the voiceprint vector; the output ends of the content recognition model and the voiceprint recognition model are both connected to the input end of the classifier.
As can be seen, in the embodiments of the present application the trained identity recognition model includes both a content recognition model and a voiceprint recognition model, so that the two dimensions of information, content and voiceprint, are taken into account simultaneously, improving the accuracy of subsequent identity recognition. In addition, the identity recognition model also includes a classifier whose training data is extracted by the trained content recognition model and voiceprint recognition model. Because the data extracted by the trained models is highly accurate, the classifier trained on that data is more accurate, which in turn improves the accuracy of the identity recognition model and ultimately of subsequent identity recognition.
Based on the method of Figure 2, embodiments of this specification also provide some specific implementations of the method, described below.
In another embodiment, the content recognition model may be a Conformer network (Convolution-augmented Transformer for Speech Recognition), i.e., a convolution-augmented phoneme recognition model, and the content vector is a phoneme vector. The phoneme recognition model extracts the phoneme features from the speech information and then takes the output vector of the last layer of the extracted phoneme features as the phoneme vector.
By adding a judgment on the phoneme vector, it can be effectively determined whether the acquired speech is the speech of the required wake-up word. Moreover, because phoneme features reflect the content of speech more faithfully, an identity recognition model trained on phoneme features can determine the meaning of speech information more accurately than a voiceprint recognition algorithm trained on keyword features, improving the accuracy of content recognition and hence of identity recognition.
Further, if the content recognition model is a convolution-augmented phoneme recognition model, inputting the training feature set into the content recognition model included in the model to be trained for iterative training may specifically include:
Inputting the training feature set into the phoneme recognition model for iterative training, and determining a first loss value of the phoneme recognition model according to a preset first gradient descent algorithm.
When the first loss value is greater than or equal to a first loss value threshold, the training of the phoneme recognition model is completed.
Specifically, the first gradient descent algorithm may be an existing algorithm and is not discussed in detail here. The first loss value threshold may be determined according to factors such as the degree of convergence of the phoneme recognition model or its test accuracy; for example, the threshold may be set so that training completes when the test accuracy reaches 97% or 98%.
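The threshold-based stopping criterion described here can be sketched as a loop that monitors a scalar metric after each epoch. The `eval_metric` callable below is a hypothetical stand-in for "run one epoch of gradient descent, then measure test accuracy"; no real model is trained:

```python
def train_with_threshold(eval_metric, threshold=0.97, max_epochs=50):
    # Hypothetical sketch of the stopping rule: run one training epoch at
    # a time and stop once the monitored value reaches the preset
    # threshold. `eval_metric(epoch)` stands in for one epoch of gradient
    # descent followed by a test-accuracy measurement.
    value = 0.0
    for epoch in range(1, max_epochs + 1):
        value = eval_metric(epoch)
        if value >= threshold:   # training of the model is completed
            return epoch, value
    return max_epochs, value

# Fake accuracy curve that improves each epoch (illustrative only).
curve = lambda epoch: 1.0 - 0.5 * (0.8 ** epoch)
epoch, acc = train_with_threshold(curve)
assert acc >= 0.97
```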
In addition, the voiceprint recognition model may be a ResNet-TDNN (Residual Neural Network with Time-Delay Neural Network). The residual network and time-delay neural network are types of deep convolutional neural networks and enable the introduction of dynamic routing layers: this not only makes the network deeper while remaining trainable, but also greatly reduces the number of network parameters, improving network performance while effectively improving network efficiency. Attaching a TDNN layer after the ResNet allows the network to better capture the temporal information of the audio, thereby improving the accuracy of user voiceprint recognition.
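The time-delay idea that lets the TDNN layer capture audio timing can be sketched as frame splicing: each output frame is built from input frames at fixed temporal offsets. The offsets and the two-dimensional "features" below are illustrative, not the application's actual configuration:

```python
# Minimal sketch of the time-delay (TDNN) idea: each output frame is
# computed from a splice of input frames at fixed temporal offsets,
# which is how the network captures audio timing information.

def tdnn_splice(frames, offsets=(-2, 0, 2)):
    # For every frame position where all offsets are valid, concatenate
    # the frames at those offsets into one context window.
    spliced = []
    for t in range(len(frames)):
        if all(0 <= t + d < len(frames) for d in offsets):
            window = []
            for d in offsets:
                window.extend(frames[t + d])
            spliced.append(window)
    return spliced

frames = [[float(t), float(t) + 0.5] for t in range(6)]  # 6 frames, dim 2
out = tdnn_splice(frames)
assert len(out) == 2    # valid positions: t = 2 and t = 3
assert len(out[0]) == 6 # 3 offsets x 2 feature dims
```

In a real ResNet-TDNN, a learned linear transform (a 1-D convolution) would follow the splice; only the temporal-context step is shown here.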
Correspondingly, if the voiceprint recognition model is a residual network with a time-delay neural network, inputting the training feature set into the voiceprint recognition model included in the model to be trained for iterative training may specifically include:
Inputting the training feature set into the residual network and time-delay neural network for iterative training, and determining a second loss value of the residual network and time-delay neural network according to a preset second gradient descent algorithm.
When the second loss value is greater than or equal to a second loss value threshold, the training of the residual network and time-delay neural network is completed.
Specifically, the second gradient descent algorithm may be an existing algorithm and is not discussed in detail here; the first and second gradient descent algorithms may be the same algorithm or different algorithms, as long as each implements the corresponding function.
Likewise, the second loss value threshold may be determined according to factors such as the degree of convergence of the residual network and time-delay neural network or their test accuracy; for example, the threshold may be set so that training completes when the test accuracy reaches 97% or 98%.
Figure 4 is a schematic flow chart of an identity recognition model training method provided by another embodiment of the present application. As shown in Figure 4, in this embodiment S203 may specifically include:
S401: Concatenate the content vector and the voiceprint vector into a one-dimensional speech training vector.
In this embodiment, the content vector may be a vector representing the speech content and the voiceprint vector may be a vector representing the user. Before the classifier is iteratively trained on the content vector and the voiceprint vector, the two vectors may first be concatenated into a one-dimensional speech training vector, and the classifier is then trained on that vector.
For example, if the content vector is a = (phonemes of "小马小马", phonemes of "小牛", phonemes of "小鸡小鸡", phonemes of "小马小马", phonemes of "小马小马", phonemes of "小马", phonemes of "小猫", phonemes of "小马小马", phonemes of "小鸡小猫") and the voiceprint vector is b = (voiceprint of user 1, voiceprint of user 2, voiceprint of user 3, voiceprint of user 2, voiceprint of user 2, voiceprint of user 1, voiceprint of user 3, voiceprint of user 3, voiceprint of user 3), then the speech training vector is c = (phonemes of "小马小马", phonemes of "小牛", phonemes of "小鸡小鸡", phonemes of "小马小马", phonemes of "小马小马", phonemes of "小马", phonemes of "小猫", phonemes of "小马小马", phonemes of "小鸡小猫", voiceprint of user 1, voiceprint of user 2, voiceprint of user 3, voiceprint of user 2, voiceprint of user 2, voiceprint of user 1, voiceprint of user 3, voiceprint of user 3, voiceprint of user 3).
The maximum length of each vector can be customized according to the actual application and is not discussed in detail here.
Training the classifier on the one-dimensional speech training vector formed by concatenating the content vector and the voiceprint vector means that the vectors fed in within one batch contain both content and voiceprint information. This increases the variety of features in the speech training vector and enriches its training dimensions, allowing the classifier to learn the features corresponding to the training feature set more accurately.
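The concatenation step of S401 is a simple vector operation; the numeric embeddings below are toy placeholders standing in for the vectors a, b, and c above:

```python
# Minimal sketch of S401: concatenating the content vector and the
# voiceprint vector into one one-dimensional speech training vector.
# The numeric embeddings are toy placeholders.
content_vector = [0.12, -0.40, 0.33]     # stands in for vector a
voiceprint_vector = [0.91, 0.07, -0.25]  # stands in for vector b

speech_training_vector = content_vector + voiceprint_vector  # vector c

assert len(speech_training_vector) == len(content_vector) + len(voiceprint_vector)
assert speech_training_vector[:3] == content_vector
```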
S402: Input the speech training vector into the classifier for iterative training.
In this embodiment, the classifier may be a PLDA model; after the speech training vector is obtained, the PLDA model can be iteratively trained on it.
Further, the specific implementation of S402 may include:
Performing mean processing and initialization processing on the classifier according to the speech training vector to obtain an initial maximum likelihood estimation expression.
Inputting the speech training vector and the initial maximum likelihood estimation expression into the classifier for iterative training.
Specifically, the speech training vector contains different speech content corresponding to different users, and the classifier can be iteratively trained accordingly. Correspondingly, the PLDA model is subjected to mean processing and initialization processing to obtain the initial maximum likelihood estimation expression.
Further, the parameters included in the initial maximum likelihood estimation expression may be the mean of all speech training vectors, an identity-space feature matrix, a noise-space feature matrix, and a noise covariance. The identity-space feature matrix represents the information of different users; the noise-space feature matrix represents the information of different speech variations of the same user plus the final residual noise term; and the noise covariance represents what has not yet been explained. The speech training vector and the initial maximum likelihood estimation expression are then input into the classifier, and iterative training is carried out with a preset algorithm to determine the value of each parameter in the expression. For example, the preset algorithm may be the EM (Expectation-Maximization) algorithm.
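The parameter set described here corresponds to the standard PLDA generative model x = mu + F h + G w + eps, where mu is the global mean, F is the identity-space matrix, G is the noise-space matrix, and eps carries the noise covariance. The tiny dimensions and all numeric values below are illustrative placeholders, used only to show how the parameters combine:

```python
import random

# Sketch of the PLDA generative model implied by the parameters above:
#   x = mu + F h + G w + eps
# mu: global mean of the speech training vectors
# F:  identity-space matrix (speaker information), h: latent identity
# G:  noise-space matrix (within-speaker variation), w: latent channel
# eps: residual noise with the noise covariance

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def sample_plda(mu, F, G, h, noise_std, rng):
    w = [rng.gauss(0.0, 1.0) for _ in range(len(G[0]))]
    eps = [rng.gauss(0.0, noise_std) for _ in range(len(mu))]
    Fh, Gw = matvec(F, h), matvec(G, w)
    return [mu[i] + Fh[i] + Gw[i] + eps[i] for i in range(len(mu))]

rng = random.Random(0)
mu = [0.0, 0.0]
F = [[1.0], [0.5]]   # identity-space matrix (one speaker factor)
G = [[0.1], [0.1]]   # noise-space matrix
h = [2.0]            # latent identity of one speaker

# Two utterances of the same speaker share h but differ in w and eps.
x1 = sample_plda(mu, F, G, h, noise_std=0.01, rng=rng)
x2 = sample_plda(mu, F, G, h, noise_std=0.01, rng=rng)
assert abs(x1[0] - x2[0]) < 1.0  # same-speaker samples stay close
```

EM training then inverts this generative process: it estimates mu, F, G, and the noise covariance from observed vectors x.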
The training idea of the EM algorithm is as follows: through maximum likelihood estimation, estimate the parameter values from the observation data already given; then estimate the missing data from the parameter values estimated in the previous step; then re-estimate the parameters from the estimated missing data together with the previously acquired data to obtain new parameter values; and iterate repeatedly in this way until the likelihood of the classifier is maximized and the parameters converge.
With the above scheme, different speech features yield different likelihood distributions under the PLDA model, so user identity can be recognized according to those differences. Moreover, because the PLDA model has strong channel-compensation capability, it can represent the features of the speech information to the greatest extent, improving the accuracy with which the fitted parameters express the corresponding features of the speech information and hence the accuracy of identity recognition.
In addition, in another embodiment, the identity recognition model may further include a feature extraction module, with the input ends of the content recognition model and the voiceprint recognition model both connected to the output end of the feature extraction module. Correspondingly, performing feature extraction on the training audio data set to obtain the training feature set may specifically include:
Performing feature extraction on the training audio data set through the feature extraction module to obtain the training feature set.
In this embodiment, the feature extraction module may use an existing toolkit such as kaldi, espnet, or librosa to extract features from the training audio data set, improving the efficiency and accuracy of feature extraction.
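As a minimal stand-in for such a feature extraction module, the sketch below frames a waveform and computes one log-energy value per frame. A real deployment would use kaldi, espnet, or librosa (e.g., filterbank or MFCC features); the frame length and hop here are illustrative assumptions:

```python
import math

# Minimal stand-in for the feature extraction module: frame the waveform
# and compute a log-energy feature per frame.

def extract_features(samples, frame_len=4, hop=2):
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        feats.append(math.log(energy + 1e-10))  # log-energy feature
    return feats

waveform = [0.0, 0.5, -0.5, 0.25, -0.25, 0.1, -0.1, 0.05]
features = extract_features(waveform)
assert len(features) == 3  # (8 - 4) // 2 + 1 frames
```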
Figure 5 is a schematic diagram of the principle of the identity recognition model training process provided by an embodiment of the present application. As shown in Figure 5, in this embodiment the content recognition model is a phoneme recognition model. The training audio data set is obtained first and input into the feature extraction module for feature extraction to obtain the training feature set. The training feature set is then input into the phoneme recognition model and the voiceprint recognition model included in the model to be trained for training. After training is complete, the voiceprint vector is obtained through the trained voiceprint recognition model and the phoneme vector through the trained phoneme recognition model. The phoneme vector and the voiceprint vector are then concatenated into a one-dimensional speech training vector, and the PLDA model is iteratively trained on that vector until its likelihood is maximized and its parameters converge, yielding the identity recognition model.
The identity recognition model obtained by the identity recognition model training method of the foregoing embodiments can be applied in the field of voice wake-up. After the identity recognition model is obtained, it can be deployed in a terminal device to wake up the terminal device or an application deployed in it. The technical solution of the present application is described in detail below with specific embodiments; the following embodiments can be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments.
Figure 6 is a schematic flow chart of the identity recognition method provided by an embodiment of the present application. As shown in Figure 6, this embodiment applies the identity recognition model obtained by the identity recognition model training method of the foregoing embodiments. The method of this embodiment can be executed by the terminal device 102 and may include:
S601: Obtain first voice data of the user to be identified.
In this embodiment, in some application scenarios the user wants to wake up a terminal device, or an application deployed in it, remotely, or is unable at that moment to do so by touching the device. For example, in one scenario the user wants to turn on the water heater for heating but does not want to do so by hand, so the water heater can be turned on remotely through voice wake-up. In another scenario, the user cannot find the smartphone and has no other smart terminal at hand to communicate with it, so the smartphone can be woken up by voice in order to determine its location. In yet another scenario, the user wants to open a payment application while shopping but is holding shopping bags in both hands and cannot conveniently touch the device, so the payment application can be opened for payment through voice wake-up.
In the above scenarios, the terminal device first obtains the first voice data, then performs identity recognition based on it, and then determines, according to the identity recognition result, whether to wake up the terminal device or an application deployed in it.
Further, the first voice data may include content information and the user's voiceprint information. For example, the first voice data may be: user A saying the wake-up keyword "小马小马", user B saying the wake-up keyword "小猫小马", or user C saying the wake-up keyword "小马小马".
S602: Input the first voice data into the content recognition model included in the identity recognition model to output a target content vector; and input the first voice data into the voiceprint recognition model included in the identity recognition model to output a target voiceprint vector.
In this embodiment, after the first voice data is obtained, it can be input into the content recognition model included in the identity recognition model for recognition to obtain the target content vector, and into the voiceprint recognition model included in the identity recognition model for recognition to obtain the target voiceprint vector. If the content recognition model is a phoneme recognition model, the target content vector obtained is a target phoneme vector.
In addition, the first voice data may be input first into the content recognition model, or first into the voiceprint recognition model, or into both models simultaneously; no specific limitation is imposed here.
S603: Input the target content vector, the target voiceprint vector, a preset content vector, and a preset voiceprint vector into the classifier included in the identity recognition model to output a likelihood value. The preset content vector is obtained by inputting second voice data of a target user into the content recognition model, and the preset voiceprint vector is obtained by inputting the second voice data of the target user into the voiceprint recognition model.
In this embodiment, before the first voice data is obtained, the second voice data of the target user can be obtained; inputting it into the content recognition model yields the preset content vector, and inputting it into the voiceprint recognition model yields the preset voiceprint vector. Correspondingly, the target user may be a pre-registered user permitted to wake up the terminal device or a related application; that is, the preset content vector and preset voiceprint vector obtained from the target user's second voice data serve as the basis for judging the user's identity.
After registration is complete, whether a user has produced first voice data can be detected in real time. Once the first voice data is obtained, its target content vector and target voiceprint vector are determined, and the target content vector, target voiceprint vector, preset content vector, and preset voiceprint vector are input into the classifier included in the identity recognition model to output a likelihood value. The likelihood value represents the similarity between the first voice data and the second voice data.
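The data flow of S603 can be sketched as a scoring function over the two concatenated vector pairs. A real PLDA classifier computes a likelihood ratio; the cosine similarity scaled to a 0-100 score below is a hypothetical stand-in used only to illustrate how the four vectors are combined:

```python
import math

# Hypothetical stand-in for the classifier's scoring step: concatenate
# each (content, voiceprint) pair and compare the test vector with the
# enrolled vector. Cosine similarity scaled to [0, 100] substitutes for
# the PLDA likelihood here, purely for illustration.

def score(target_content, target_voiceprint, preset_content, preset_voiceprint):
    test = target_content + target_voiceprint       # concatenated test vector
    enrolled = preset_content + preset_voiceprint   # concatenated enrolled vector
    dot = sum(a * b for a, b in zip(test, enrolled))
    norm = (math.sqrt(sum(a * a for a in test))
            * math.sqrt(sum(b * b for b in enrolled)))
    return 50.0 * (dot / norm + 1.0)                # map [-1, 1] to [0, 100]

# Toy enrolled and test vectors: the test utterance closely matches enrollment.
preset_c, preset_v = [0.9, 0.1], [0.2, 0.8]
target_c, target_v = [0.88, 0.12], [0.22, 0.79]
s = score(target_c, target_v, preset_c, preset_v)
assert s > 90.0
```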
S604:在似然分布数值大于预设似然分布数值阈值的情况下,确定待识别用户和目标用户为相同用户。S604: When the likelihood distribution value is greater than the preset likelihood distribution value threshold, determine that the user to be identified and the target user are the same user.
在本实施例中,若似然分布数值大于预设似然分布数值阈值,则表明第一语音数据和第二语音数据之间的相似度较高,即可以确定待识别用户和目标用户为相同用户。若似然分布数值小于或等于预设似然分布数值阈值,则表明第一语音数据和第二语音数据之间的相似度较低,即可以确定待识别用户和目标用户不是同一用户。In this embodiment, if the likelihood distribution value is greater than the preset likelihood distribution value threshold, it indicates that the similarity between the first voice data and the second voice data is high, and it can be determined that the user to be identified and the target user are the same user. If the likelihood distribution value is less than or equal to the preset likelihood distribution value threshold, it indicates that the similarity between the first voice data and the second voice data is low, and it is determined that the user to be identified and the target user are not the same user.
其中,似然分布数值阈值可以根据实际应用场景自定义进行设置,示例性的,似然分布数值阈值可以为80-90中的任意值。Among them, the likelihood distribution numerical threshold can be customized and set according to the actual application scenario. For example, the likelihood distribution numerical threshold can be any value from 80 to 90.
可以看出,在本申请实施例中,身份识别综合考虑了内容信息和声纹信息,增加了身份识别的考量因素,不再仅依赖单一信息进行身份识别,提高了身份识别的准确性。It can be seen that in the embodiment of the present application, the identity recognition comprehensively considers the content information and the voiceprint information, which increases the consideration factors of the identity recognition and no longer relies solely on a single information for identity recognition, thereby improving the accuracy of the identity recognition.
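As a minimal sketch of the decision step above, the classifier can be stood in for by a hypothetical scoring function that compares the content pair and the voiceprint pair and rescales the result to a 0-100 range, so that a threshold in the 80-90 range is meaningful. The cosine-similarity scoring is an illustrative assumption; the patent does not fix a concrete classifier implementation at this point.

```python
import numpy as np

def likelihood_score(target_content, target_voiceprint,
                     preset_content, preset_voiceprint) -> float:
    """Hypothetical stand-in for the classifier: average cosine similarity
    of the content pair and the voiceprint pair, mapped from [-1, 1] to
    [0, 100] so it plays the role of the likelihood distribution value."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sim = 0.5 * (cos(target_content, preset_content)
                 + cos(target_voiceprint, preset_voiceprint))
    return 50.0 * (sim + 1.0)

def is_same_user(score: float, threshold: float = 85.0) -> bool:
    # Same user only when the score strictly exceeds the preset threshold.
    return score > threshold
```

With identical enrollment and test vectors the score is 100 and the user is accepted; with orthogonal vectors the score drops to 50 and the user is rejected.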
In another embodiment, the method may further include:
performing wake-up word recognition on the first voice data to obtain a wake-up word recognition result; and
if the wake-up word recognition result indicates correct recognition, executing the steps of inputting the first voice data into the content recognition model included in the identity recognition model to output the target content vector, and inputting the first voice data into the voiceprint recognition model included in the identity recognition model to output the target voiceprint vector.
In this embodiment, a third-party wake-up word recognition system may be pre-deployed on the terminal device. Before the first voice data is recognized by the identity recognition model, it can first be recognized by the third-party wake-up word recognition system to obtain a wake-up word recognition result. If the result indicates correct recognition, meaning the first voice data contains the predefined keyword, recognition can continue with the identity recognition model to obtain the identification result. If the result indicates a recognition error, meaning the first voice data does not contain the predefined keyword, a wake-up failure prompt can be generated directly to notify the user that voice wake-up failed, without further recognition by the identity recognition model.
In addition, after it is determined that the user to be identified and the target user are the same user, the method may further include: generating a wake-up instruction, where the wake-up instruction is used to wake up the terminal device or an application deployed on the terminal device.
With the above solution, double verification is achieved: even when the third-party wake-up word recognition system misidentifies the wake-up word, the identity recognition model, which incorporates phoneme modeling, can still effectively determine whether the voice data matches the preset voice data. This effectively reduces speech misrecognition and, in turn, reduces accidental wake-ups of terminal devices or related applications in practical use.
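The two-stage flow above can be sketched as a small gating pipeline. Both recognizers are hypothetical callables injected by the caller (names are illustrative only); the sketch only captures the control flow: the identity recognition model runs only after the wake word is recognized, and a wake-up instruction is issued only when both stages pass.

```python
def wake_up_pipeline(first_voice_data,
                     recognize_wake_word,   # callable -> bool: keyword present?
                     identify_user):        # callable -> bool: same user as target?
    """Return 'wake' only when both the third-party wake-word stage and the
    identity recognition stage pass; otherwise return a failure prompt,
    skipping the second stage when the wake word is absent."""
    if not recognize_wake_word(first_voice_data):
        return "wake-up failed"   # no predefined keyword: do not run the model
    if identify_user(first_voice_data):
        return "wake"             # generate the wake-up instruction
    return "wake-up failed"       # wake word heard, but not the registered user
```

For example, `wake_up_pipeline(audio, wake_word_model, identity_model)` returns `"wake"` only when both stages agree.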
Figure 7 is a visualized scenario diagram of the identity recognition method provided by an embodiment of the present application. This embodiment describes in detail, with reference to the visualized scenario, both the training process of the identity recognition model and the complete implementation flow of the identity recognition method. As shown in Figure 7, this embodiment may proceed as follows. The identity recognition model is trained first. Specifically, a training audio data set is obtained and input into a feature extraction module for feature extraction, yielding a training feature set. The training feature set is then input into the phoneme recognition model and the voiceprint recognition model respectively for training. After training is completed, a voiceprint vector is obtained from the trained voiceprint recognition model, and a phoneme vector is obtained from the trained phoneme recognition model. The voiceprint vector and the phoneme vector are then concatenated in parallel into a one-dimensional speech training vector, and the classifier included in the model to be trained is iteratively trained with the speech training vector until the likelihood of the classifier is maximized and its parameters converge, yielding the identity recognition model.
After training of the identity recognition model is completed, it can be deployed on a terminal device; for example, the terminal device may be a smartphone. After the identity recognition model is deployed on the smartphone, the first voice data generated by the user to be identified and the pre-stored second voice data can be obtained, and the first voice data and the second voice data are then processed by the identity recognition model to obtain a likelihood distribution value. If the likelihood distribution value is greater than the likelihood distribution value threshold, it is determined that the user to be identified and the target user are the same user, and a wake-up instruction can be generated to wake up the terminal device or a related application. Otherwise, it is determined that the user to be identified and the target user are different users, and a wake-up failure prompt can be generated to notify the user that the terminal device or related application failed to wake up.
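The Figure 7 training flow can be sketched end to end as follows. The phoneme and voiceprint models are replaced by hypothetical embedding callables, and the classifier is fit in closed form as a single diagonal Gaussian, which is one simple reading of "train until the likelihood is maximal"; the actual models and classifier family in the patent are more elaborate.

```python
import numpy as np

def train_identity_model(training_features, embed_phoneme, embed_voiceprint):
    """Toy sketch: run both trained models over the training feature set,
    concatenate their outputs in parallel into one-dimensional speech
    training vectors, and fit the 'classifier' by maximum likelihood
    (closed-form mean/variance of a diagonal Gaussian)."""
    phoneme_vecs = np.stack([embed_phoneme(f) for f in training_features])
    voice_vecs = np.stack([embed_voiceprint(f) for f in training_features])
    speech_vecs = np.concatenate([phoneme_vecs, voice_vecs], axis=1)
    mean = speech_vecs.mean(axis=0)
    var = speech_vecs.var(axis=0) + 1e-6   # small floor for stability
    return {"mean": mean, "var": var}
```

The closed-form fit stands in for the iterative likelihood maximization described in the text; it illustrates the data flow, not the patent's actual optimization.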
Based on the same idea, the embodiments of this specification also provide a device corresponding to the above method. Figure 8 is a schematic structural diagram of the identity recognition model training device provided by an embodiment of the present application. As shown in Figure 8, the device provided by this embodiment may include:
a first acquisition module 801, configured to obtain a training audio data set and perform feature extraction on the training audio data set to obtain a training feature set.
In this embodiment, the training feature set is extracted from the training audio data set using Mel-frequency cepstral coefficients (MFCC).
Further, the training audio data set includes target training audio, the target training audio contains a preset wake-up word, and the proportion of the target training audio in the training audio data set is greater than or equal to a preset proportion threshold.
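The data-set constraint above can be sketched as a simple check: the share of training clips that contain the preset wake-up word must be at least the preset proportion threshold. The clip representation (a dict with a boolean flag) and the default threshold are illustrative assumptions.

```python
def meets_proportion_threshold(clips, threshold=0.5):
    """clips: iterable of dicts like {"has_wake_word": bool, ...}.
    Return True when the proportion of target training audio (clips
    containing the preset wake-up word) is >= the preset threshold."""
    clips = list(clips)
    if not clips:
        return False  # an empty data set cannot satisfy the constraint
    share = sum(c["has_wake_word"] for c in clips) / len(clips)
    return share >= threshold
```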
a first processing module 802, configured to input the training feature set into the content recognition model included in the model to be trained for iterative training, and to input the training feature set into the trained content recognition model to output a content vector; and to input the training feature set into the voiceprint recognition model included in the model to be trained for iterative training, and to input the training feature set into the trained voiceprint recognition model to output a voiceprint vector.
The first processing module 802 is further configured to input the content vector and the voiceprint vector into the classifier included in the model to be trained for iterative training until the likelihood of the classifier is maximized and its parameters converge, yielding the identity recognition model.
In another embodiment, the content recognition model is a convolution-augmented phoneme recognition model, and the content vector is a phoneme vector. Correspondingly, the first processing module 802 is further configured to:
input the training feature set into the phoneme recognition model for iterative training, and determine a first loss value of the phoneme recognition model according to a preset first gradient descent algorithm; and
when the first loss value is greater than or equal to a first loss value threshold, complete the training of the phoneme recognition model.
In addition, the voiceprint recognition model may be a residual network combined with a time-delay neural network. Correspondingly, the first processing module 802 is further configured to:
input the training feature set into the residual network and time-delay neural network for iterative training, and determine a second loss value of the residual network and time-delay neural network according to a preset second gradient descent algorithm; and
when the second loss value is greater than or equal to a second loss value threshold, complete the training of the residual network and time-delay neural network.
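Both per-model procedures above share the same shape: iterate a gradient-descent update, compute the loss, and finish training once the loss meets a preset threshold condition. A generic sketch follows; the update step, loss function, and the exact threshold comparison are injected placeholders, since the patent fixes only the pattern of comparing the loss against a preset threshold.

```python
def train_until_threshold(params, step, loss_fn, stop_condition, max_iters=1000):
    """Generic training loop: apply one gradient-descent update per
    iteration and stop as soon as the preset loss-threshold condition
    (stop_condition) holds, or after max_iters iterations."""
    for _ in range(max_iters):
        params = step(params)              # one gradient-descent update
        if stop_condition(loss_fn(params)):
            break                          # loss reached the preset threshold
    return params
```

For example, halving a scalar "parameter" toward zero with a stop condition of `loss <= 0.1` terminates after a few iterations.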
In another embodiment, the first processing module 802 is further configured to:
concatenate the content vector and the voiceprint vector in parallel into a one-dimensional speech training vector; and
input the speech training vector into the classifier for iterative training.
In this embodiment, the first processing module 802 is further configured to:
perform mean processing and initialization processing on the classifier according to the speech training vector to obtain an initial maximum likelihood estimation expression; and
input the speech training vector and the initial maximum likelihood estimation expression into the classifier for iterative training.
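The concatenation and initialization steps above can be sketched as follows. Reading the classifier as a Gaussian model whose initial maximum-likelihood parameters come from the mean of the concatenated vectors (with an identity covariance as the initialization) is an assumption for illustration; the patent does not name a concrete classifier family here.

```python
import numpy as np

def init_max_likelihood_params(content_vecs, voiceprint_vecs):
    """Concatenate content and voiceprint vectors in parallel into
    one-dimensional speech training vectors, then derive initial
    maximum-likelihood parameters: the mean (mean processing) and an
    identity covariance (initialization processing)."""
    speech = np.concatenate([np.asarray(content_vecs),
                             np.asarray(voiceprint_vecs)], axis=1)
    mean = speech.mean(axis=0)
    cov = np.eye(speech.shape[1])
    return speech, {"mean": mean, "cov": cov}
```

The returned speech vectors and initial parameters would then be fed to the classifier's iterative training, as the text describes.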
Figure 9 is a schematic structural diagram of the identity recognition device provided by an embodiment of the present application, which applies the identity recognition model obtained by the identity recognition model training device of the foregoing embodiments. As shown in Figure 9, the device provided by this embodiment may include:
a second acquisition module 901, configured to obtain first voice data of a user to be identified; and
a second processing module 902, configured to input the first voice data into the content recognition model included in the identity recognition model to output a target content vector, and to input the first voice data into the voiceprint recognition model included in the identity recognition model to output a target voiceprint vector.
The second processing module 902 is further configured to input the target content vector, the target voiceprint vector, a preset content vector, and a preset voiceprint vector into the classifier included in the identity recognition model to output a likelihood distribution value, where the preset content vector is obtained by inputting second voice data of a target user into the content recognition model, and the preset voiceprint vector is obtained by inputting the second voice data of the target user into the voiceprint recognition model.
The second processing module 902 is further configured to determine that the user to be identified and the target user are the same user when the likelihood distribution value is greater than a preset likelihood distribution value threshold.
In another embodiment, the second processing module 902 is further configured to:
perform wake-up word recognition on the first voice data to obtain a wake-up word recognition result;
if the wake-up word recognition result indicates correct recognition, execute the steps of inputting the first voice data into the content recognition model included in the identity recognition model to output the target content vector, and inputting the first voice data into the voiceprint recognition model included in the identity recognition model to output the target voiceprint vector; and
after it is determined that the user to be identified and the target user are the same user, generate a wake-up instruction, where the wake-up instruction is used to wake up a terminal device or an application deployed on the terminal device.
The device provided by the embodiments of the present application can implement the method of the embodiment shown in Figure 2. Its implementation principles and technical effects are similar and will not be repeated here.
Figure 10 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application. As shown in Figure 10, the device 1000 provided by this embodiment includes at least one processor 1001 and a memory 1002, where the processor 1001 and the memory 1002 are connected via a bus 1003.
In a specific implementation, the at least one processor 1001 executes computer-executable instructions stored in the memory 1002, causing the at least one processor 1001 to perform the method in the above method embodiments.
For the specific implementation process of the processor 1001, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and will not be repeated in this embodiment.
In the embodiment shown in Figure 10 above, it should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the invention may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one magnetic disk storage.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, the bus in the drawings of this application is not limited to only one bus or one type of bus.
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the identity recognition model training method or the identity recognition method of the above method embodiments.
Embodiments of the present application also provide a computer program product, including a computer program which, when executed by a processor, implements the identity recognition model training method or the identity recognition method described above.
The above computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. A readable storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the readable storage medium may also exist as discrete components in the device.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When the program is executed, the steps of the above method embodiments are performed; and the aforementioned storage media include various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features therein; and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110681339.2A (CN113421573B) | 2021-06-18 | 2021-06-18 | Identity recognition model training method, identity recognition method and device |
| Publication Number | Publication Date |
|---|---|
| CN113421573A (en) | 2021-09-21 |
| CN113421573B (en) | 2024-03-19 |
| Publication number | Publication date |
|---|---|
| CN113421573A (en) | 2021-09-21 |