CN110534099A - Voice wake-up processing method, device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110534099A
Authority
CN
China
Prior art keywords
confidence
audio frame
verification
judgment
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910828451.7A
Other languages
Chinese (zh)
Other versions
CN110534099B (en)
Inventor
陈杰
苏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910828451.7A
Publication of CN110534099A
Application granted
Publication of CN110534099B
Legal status: Active
Anticipated expiration

Abstract

In the voice wake-up processing method, device, storage medium, and electronic device provided by the present application, the audio frame features of the input voice information are acquired and fed into an acoustic model for processing to obtain the posterior probability of the target audio frame feature corresponding to each syllable of a preset wake-up word. Confidence judgments deployed separately for an adult mode and a child mode then perform a double confidence judgment on these posterior probabilities, so that each syllable receives two confidence scores. If the judgment on either confidence score passes, verification audio frame features of the corresponding length are fetched from the cache for a second confidence verification; once the confidence verification passes, the instruction corresponding to the preset wake-up word can be responded to directly, controlling the electronic device to perform a preset operation. It can be seen that the voice wake-up processing method provided by this embodiment accommodates both adult and child voice wake-up performance, improving the efficiency and accuracy of voice wake-up.

Description

Voice wake-up processing method, device, storage medium and electronic equipment

Technical Field

The present application relates to the field of artificial intelligence applications, and in particular to a voice wake-up processing method, device, storage medium, and electronic device.

Background Art

As an artificial intelligence technology, speech recognition has been widely applied in industry, home appliances, communications, automotive electronics, medical care, home services, consumer electronics, and many other fields, giving the electronic devices used in these fields speech recognition capabilities. By recognizing a wake-up word uttered by the user, the electronic device and the applications it contains can be woken up, which greatly facilitates the user's use of the device.

In the prior art, referring to the schematic flow chart of an existing voice wake-up processing method shown in FIG. 1, the voice information input by the user is usually sent to an acoustic model (such as a deep neural network) to obtain the phonemes or syllables that make up the wake-up word; at the same time, non-wake-up words are obtained through a filler unit. The smoothing window and confidence calculation window of a posterior processing module then process the phonemes or syllables of the wake-up word to obtain a confidence score for the wake-up word, and if the confidence score reaches a threshold, the wake-up word is responded to and the electronic device is controlled to perform a preset operation.

It can be seen that although the existing voice wake-up processing method can balance wake-up performance by adjusting the threshold, it does not take into account the differences between adult and child voice characteristics, which lowers the output accuracy of the acoustic model and degrades the voice wake-up performance of the electronic device.

Summary of the Invention

In view of this, the embodiments of the present application provide a voice wake-up processing method, device, storage medium, and electronic device that accommodate both adult and child voice wake-up performance and improve the efficiency and accuracy of voice wake-up.

To achieve the above purpose, the embodiments of the present application provide the following technical solutions:

In one aspect, the present application proposes a voice wake-up processing method, the method comprising:

acquiring audio frame features of input voice information;

inputting the audio frame features into an acoustic model for processing to obtain the posterior probability of the target audio frame feature corresponding to each syllable of a preset wake-up word;

performing a double confidence judgment on the posterior probability of the target audio frame feature corresponding to each syllable to obtain a first confidence score and a second confidence score for the corresponding syllable;

using whichever judgment result of the first confidence score and the second confidence score passes, acquiring verification audio frame features from the audio frame features of the voice information;

acquiring a confidence verification result for the verification audio frame features, the confidence verification result being obtained by performing a second confidence judgment on the verification audio frame features;

if the confidence verification result passes, responding to the instruction corresponding to the preset wake-up word and controlling the electronic device to perform a preset operation.

In another aspect, the present application proposes a voice wake-up processing device, the device comprising:

a feature acquisition module, configured to acquire audio frame features of input voice information;

a posterior probability acquisition module, configured to input the audio frame features into an acoustic model for processing to obtain the posterior probability of the target audio frame feature corresponding to each syllable of a preset wake-up word;

a confidence judgment module, configured to perform a double confidence judgment on the posterior probability of the target audio frame feature corresponding to each syllable to obtain a first confidence score and a second confidence score for the corresponding syllable;

a verification feature acquisition module, configured to use whichever judgment result of the first confidence score and the second confidence score passes to acquire verification audio frame features from the audio frame features of the voice information;

a confidence verification result acquisition module, configured to acquire a confidence verification result for the verification audio frame features, the confidence verification result being obtained by performing a second confidence judgment on the verification audio frame features;

a voice wake-up module, configured to, if the confidence verification result passes, respond to the instruction corresponding to the preset wake-up word and control the electronic device to perform a preset operation.

In another aspect, the present application proposes a storage medium on which a computer program is stored, the computer program being executed by a processor to implement each step of the voice wake-up processing described above.

In another aspect, the present application proposes an electronic device, the electronic device comprising:

a sound collector, configured to collect voice information output by a user;

a communication interface;

a memory, configured to store a program implementing the voice wake-up processing described above;

a processor, configured to load and execute the program stored in the memory to implement each step of the voice wake-up processing described above.

It can thus be seen that, compared with the prior art, after acquiring the voice information input by the user for the electronic device, the present application acquires the audio frame features of the voice information and inputs them into an acoustic model for processing to obtain the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word contained in the voice information. Taking into account the differences between the voice characteristics of different types of users (such as adults and children), this embodiment then deploys separate confidence judgment modules for an adult mode and a child mode that share one acoustic model, realizing a double confidence judgment on the obtained posterior probabilities so that each syllable receives two confidence scores. If the judgment on either confidence score passes, verification audio frame features of the corresponding length are fetched from the cache for a second confidence verification. Once the confidence verification passes, it can be determined that the voice information contains the preset wake-up word, and the instruction corresponding to the preset wake-up word can be responded to directly, controlling the electronic device to perform a preset operation. It can be seen that the voice wake-up processing method provided by this embodiment accommodates both adult and child voice wake-up performance and improves the efficiency and accuracy of voice wake-up.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a schematic flow chart of an existing voice wake-up processing method;

FIG. 2 is a schematic diagram of an optional structure for implementing a voice wake-up processing method proposed during the research and development of the voice wake-up processing method of the present application;

FIG. 3 is a schematic structural diagram of an optional example for implementing the voice wake-up processing method proposed in the present application;

FIG. 4 is a schematic diagram of the hardware structure of an optional example of the electronic device proposed in the present application;

FIG. 5 is a schematic diagram of the hardware structure of another optional example of the electronic device proposed in the present application;

FIG. 6 is a flowchart of an optional example of the voice wake-up processing method proposed in the present application;

FIG. 7 is a signaling flow chart of an optional example of the voice wake-up processing method proposed in the present application;

FIG. 8 is a schematic structural diagram of an optional example of the voice wake-up processing device proposed in the present application;

FIG. 9 is a schematic structural diagram of another optional example of the voice wake-up processing device proposed in the present application;

FIG. 10 is a schematic structural diagram of a system implementing the voice wake-up processing method proposed in the present application;

FIG. 11 is a schematic diagram of an application scenario for implementing the voice wake-up processing method proposed in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.

In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B can mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "multiple" means two or more.

The terms "first" and "second" below are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of this embodiment, unless otherwise specified, "multiple" means two or more.

As introduced in the Background Art section, in current voice wake-up applications, the voice wake-up processing method executed by the electronic device uses only one acoustic model to process the voice information of different types of users (such as adult users and child users). As a result, this single acoustic model cannot accommodate the voice wake-up performance of both adults and children. Usually, the adult data in the sample data used for model training is significantly larger than the child data, so the existing voice wake-up processing method may achieve high wake-up performance for adults but poor wake-up performance for children.

In order to improve voice wake-up performance, the present application proposes training two acoustic models of different sizes to form a two-level acoustic model, sharing one posterior processing module that computes the confidence score and makes the final judgment. Referring to the schematic flow chart of a voice wake-up processing method shown in FIG. 2, for the voice information output by the user, voice feature information can first be extracted, for example using MFCC (Mel-scale Frequency Cepstral Coefficients), although the method is not limited to this. The extracted voice feature information is then written into a frame buffer, and the first-level model, i.e. the smaller acoustic model (the first acoustic model in FIG. 2), computes a confidence score for the extracted voice feature information, for example using a hidden Markov model (HMM), or using the posterior processing module shown in FIG. 1 above. After the first-level model is triggered, the same extracted voice feature information can also be sent to the larger acoustic model (the second acoustic model in FIG. 2), and the confidence score of the voice feature information is computed in a similar manner, thereby realizing a second judgment on the same voice feature information. Compared with the single-model voice wake-up processing shown in FIG. 1, this improves voice wake-up performance to a certain extent.

At the same time, the present application also proposes another voice wake-up processing method, which differs from the voice wake-up processing method shown in FIG. 2 above in that, after the first-level model is triggered, the voice information output by the user is sent to a server in the cloud and recognized by the server's automatic speech recognition (ASR) component. In this case, the server can use a larger-scale acoustic model combined with a larger language model and, after decoder processing, realize the second judgment on the voice information.

It can thus be seen that both voice wake-up processing methods proposed above introduce a larger second-level model to improve system performance. However, although these methods can improve voice wake-up performance compared with the single acoustic model scheme, neither truly considers the differences between adult and child voice characteristics, such as the fact that children speak much more slowly than adults. Consequently, none of the acoustic models built by these methods can truly accommodate the performance of both adults and children, so electronic devices using these voice wake-up processing methods cannot serve adults and children well at the same time, which greatly degrades the user experience.

Combining the improvement schemes proposed above, in order to solve the problem that the voice wake-up performance of children and adults cannot be accommodated at the same time, the present application proposes, on the basis of the system architecture used by the voice wake-up processing method shown in FIG. 1 above, improvements targeted at the characteristics of children's speech: a double confidence judgment mechanism is added, and in the second-level model, the child and adult models are separated so that the voice feature information and training data input to the two differ, significantly improving child wake-up performance.

Specifically, referring to the schematic diagram of the system structure for implementing the voice wake-up processing method proposed in the embodiments of the present application shown in FIG. 3, the system can be composed of two levels and three models connected in series. As shown in FIG. 3, in addition to a feature calculation module and a feature cache module, the first-level model is configured with one acoustic model and one double confidence judgment module. The double confidence judgment module performs posterior processing according to adult and child models respectively; that is, the double confidence judgment module can include an adult posterior processing module and a child posterior processing module. In the second-level model, a corresponding adult verification model and child verification model are configured for these two posterior processing modules and share the first-level model. When the output result of either posterior processing module passes, the second-level model is triggered to perform a second confidence judgment; if it passes, the preset wake-up word contained in the voice information is responded to and the electronic device is controlled to perform a preset operation. For the specific implementation process, refer to the description of the corresponding parts of the method embodiments below.
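
As a rough, non-limiting sketch of how the FIG. 3 layout could be held together in code, the following Python fragment groups the components into a single configuration object; every name and numeric value here (window sizes, field names such as adult_verifier) is an illustrative assumption rather than an identifier taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

# A scorer maps a segment of cached audio frame features to a confidence score.
ScoreFn = Callable[[np.ndarray], float]

@dataclass
class TwoLevelWakeSystem:
    """Mirror of the FIG. 3 structure: level one holds a single shared acoustic
    model plus a dual (adult/child) posterior-processing pair; level two holds
    a separate verification model per user type."""
    acoustic_model: Callable[[np.ndarray], np.ndarray]  # shared level-one acoustic model
    adult_posterior_scorer: ScoreFn                     # adult posterior processing module
    child_posterior_scorer: ScoreFn                     # child posterior processing module
    adult_verifier: ScoreFn                             # level-two adult verification model
    child_verifier: ScoreFn                             # level-two child verification model
    adult_window_frames: int = 100                      # adult judgment window size (assumed)
    child_window_frames: int = 150                      # larger window for slower child speech (assumed)
```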

Combining the above analysis of the technical concept of the voice wake-up processing method proposed in the present application, the voice wake-up processing method can be applied to computer devices such as electronic devices (i.e. terminal devices) and/or servers. Specifically, the first-level model proposed above can be deployed on the electronic device, while the second-level model, which runs after the first-level model is triggered, can be deployed on the electronic device or on a cloud server; the deployment is not limited to these options and can be determined according to the requirements of the actual scenario.

Exemplarily, the voice wake-up processing method proposed in the present application can be applied to an electronic device; that is, both the first-level model and the second-level model in the above system structure can be located on the electronic device. Of course, according to actual needs, the first-level model can be located on the electronic device while the second-level model is located on a server or other device. Whatever the system layout, the process of implementing the voice wake-up processing method is similar, so the present application does not describe the implementation separately for each layout.

The electronic device may be a mobile phone, tablet computer, wearable device, vehicle-mounted device, smart home device, augmented reality (AR)/virtual reality (VR) device, notebook computer, ultra-mobile personal computer (UMPC), personal digital assistant (PDA), or the like; the embodiments of the present application do not limit the specific type of the electronic device.

It should be understood that, to achieve voice control of an electronic device, the electronic device usually needs a speech recognition function, for example by having an application such as a voice assistant installed. In this way, when the user needs to use the electronic device, the user can simply say the device's wake-up word, without any manual operation, to start the electronic device or an application installed on it, which is very convenient. Usually, the wake-up words configured for starting the system and the various applications may differ across different types of electronic devices from different manufacturers, which is not detailed here, and the user can flexibly adjust the wake-up words of the device's system and applications according to actual needs; the present application does not detail the configuration or usage of wake-up words.

Exemplarily, FIG. 4 shows a schematic diagram of the hardware structure of an electronic device implementing the voice wake-up processing method provided by the present application. The electronic device may include a sound collector 11, a communication interface 12, a memory 13, and a processor 14, wherein:

In this embodiment, the sound collector 11, the communication interface 12, the memory 13, and the processor 14 can communicate with each other through a communication bus, and the number of sound collectors 11, communication interfaces 12, memories 13, processors 14, and communication buses can each be at least one and can be determined according to specific application requirements; the present application does not limit the number of components of the above electronic device.

The sound collector 11 can collect voice information output by the user for the electronic device, which usually contains a wake-up word for waking up the electronic device system and/or any application installed on the electronic device. That is, when the user needs to wake up the electronic device or one of its applications, the user can directly say the corresponding preset wake-up word, and the sound collector 11 of the electronic device collects the voice information output by the user containing the wake-up word, so that by recognizing the wake-up word and responding to the corresponding control instruction, the electronic device is controlled to perform a preset operation. The present application does not detail the configuration and usage of the electronic device's wake-up words.

The communication interface 12 can receive the voice information output by the sound collector 11 and send it to the processor 14 for processing, and can also be used to implement data interaction between the sound collector 11 and the memory 13, between the memory 13 and the processor 14, or between other components in the electronic device and between those components and the components listed in this embodiment. The present application does not detail the content of the data sent and received by the communication interface 12, which can be determined according to the product type of the electronic device and its application scenario.

On this basis, the communication interface 12 can include interfaces of wireless communication modules and/or wired communication modules, such as the interface of a GSM (Global System for Mobile Communications) module, the interface of a WIFI module, or the interface of a GPRS (General Packet Radio Service) module, and can also include a USB (universal serial bus) interface, serial/parallel ports, and so on, which are not detailed one by one in the present application.

The memory 13 can be used to store the program implementing the voice wake-up processing method proposed in the present application, and can also store at least one preset wake-up word, various intermediate data generated during the operation of the voice wake-up processing method, data sent by other electronic devices or users, and so on, which can be determined according to the requirements of the application scenario and are not detailed in the present application.

In practical applications, the memory 13 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one magnetic disk memory.

The processor 14 can be used to call and execute the program stored in the memory to implement each step of the voice wake-up processing method applied to the electronic device as described above; for the specific implementation process, refer to the description of the corresponding parts of the method embodiments below.

In this embodiment, the processor 14 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application; the present application does not detail the specific structure of the processor 14.

Optionally, the memory 13 can be independent of the processor 14 or deployed within the processor 14. Similarly, at least some of the interfaces included in the communication interface can also be deployed within the processor 14, such as an inter-integrated circuit interface, an inter-integrated circuit audio interface, a USB interface, and so on. The present application does not limit the deployment relationship between the memory 13 and the processor 14, or the number and types of communication interfaces deployed within the processor 14, which can be determined according to actual needs.

In addition, it should be understood that the system composition of the electronic device is not limited to the sound collector, communication interface, memory, and processor listed above. As shown in FIG. 5, the electronic device may also include a display, an input device, a power supply module, a speaker, a sensor module, a camera, indicator lights, an antenna, and other components, which are not listed one by one in the present application. The electronic device may include more or fewer components than shown in FIG. 5, or combine or split certain components, or use a different arrangement of components; the illustrated components can be implemented in hardware, software, or a combination of hardware and software.

Moreover, the interface connection relationships between the modules shown in FIG. 5 are only schematic illustrations and do not constitute a structural limitation on the electronic device. That is, in other embodiments, the electronic device may also adopt interface connection relationships different from those in this embodiment, or a combination of multiple interface connection modes, which are not detailed one by one in the present application.

In combination with the schematic diagram of the system structure shown in FIG. 3 above, and referring to FIG. 6, a schematic flow chart of a voice wake-up processing method provided by an embodiment of the present application is shown. As above, the method can be implemented by an electronic device, or jointly by an electronic device and a server. This embodiment is mainly described from the perspective of the electronic device, and the specific implementation process may include, but is not limited to, the following steps:

Step S11: acquiring audio frame features of input voice information;

In the practical application of this embodiment, the user wants to control the electronic device by voice instead of traditional manual operation, freeing the user's hands. Usually, corresponding wake-up words can be preconfigured for the various operations of different types of electronic devices, so the user only needs to say the wake-up word corresponding to the desired operation to control the electronic device to perform that operation by voice.

For example, if the user wants to control a smart speaker to play song A, the user can say "xx, play song A"; by analyzing this voice information, the smart speaker can recognize the wake-up word it contains, wake up the smart speaker system, and play song A.

In this process, since the voice characteristics of different types of users differ greatly, for example between the two broad categories of adults and children, this embodiment divides the voice information input to the electronic device into multiple frames (i.e. multiple audio frames) of data in order to accurately recognize the wake-up word contained in the voice information, and then performs feature extraction on each frame of data to obtain the corresponding audio frame feature. Each audio frame feature can be a feature vector, so this embodiment obtains a sequence of n feature vectors, where the value of n depends on the number of audio frames contained in the voice information; the present application does not limit the value of n.

It should be noted that the present application does not limit the process of performing feature extraction on the acquired input voice information to obtain the feature data fed into the acoustic model. For example, after frame-splitting preprocessing of the voice information, FBank (FilterBank) feature extraction can be performed frame by frame on each preprocessed audio frame to obtain the audio frame feature of the corresponding frame. The present application does not detail the specific implementation of FBank feature extraction, and the way of obtaining the audio frame features of each audio frame of the voice information is not limited to FBank feature extraction.
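
As an illustration only of the frame-splitting and FBank extraction just described, the following sketch uses librosa's log-mel filterbank features; the sampling rate, frame length, hop size, and number of filterbank channels are assumed values, not parameters specified by the patent.

```python
import numpy as np
import librosa

def extract_fbank_features(wav_path: str, n_mels: int = 40) -> np.ndarray:
    """Split the input speech into frames and compute log-mel (FBank) features.

    Returns an array of shape (num_frames, n_mels): one feature vector per audio frame.
    """
    # Load the recorded voice information; 16 kHz is a common rate for wake-word systems (assumed).
    signal, sr = librosa.load(wav_path, sr=16000)
    # 25 ms frames with a 10 ms hop are typical framing choices (assumed here).
    mel_spec = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
    )
    # Log compression gives the FBank features that are cached and fed to the acoustic model.
    fbank = np.log(mel_spec + 1e-6).T
    return fbank
```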

Step S12: inputting the audio frame features into the acoustic model for processing to obtain the posterior probability of the target audio frame feature corresponding to each syllable of the preset wake-up word;

The acoustic model is one of the most important parts of a speech recognition system. It can be modeled with a hidden Markov model (HMM), but is not limited to this modeling approach; other deep learning networks such as neural networks can also be used to build the acoustic model. The hidden Markov model is a discrete-time finite-state automaton, and the corresponding algorithms for scoring, decoding, and training can be the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm, respectively; the present application does not detail the modeling process of the acoustic model.

Usually, the input of the acoustic model is the multi-dimensional features extracted by the feature extraction module, whose values can be discrete or continuous. In this embodiment, the audio frame features fed into the acoustic model can be obtained according to actual requirements.

After the multiple audio frame features of the obtained voice information are input into the acoustic model in this embodiment, the acoustic model can process these audio frame features against the acoustic features corresponding to the preset wake-up word, so as to select from them the range of audio frames corresponding to each syllable of the acoustic features of the preset wake-up word. Then, using the acoustic likelihood score of each audio frame within each selected range, a preset number of target audio frames meeting preset requirements can be determined from each range, for example a preset number of target audio frames whose acoustic likelihood score reaches a preset score, although the determination is not limited to this approach. In this embodiment, the audio frame features corresponding to the target audio frames are recorded as target audio frame features. Finally, the acoustic model can be used to compute the acoustic posterior score, i.e. the posterior probability, of each of these target audio frame features; the present application does not detail how the acoustic model computes the posterior probability of an audio frame feature.

It can be seen that inputting the audio frame feature of each frame into the acoustic model yields a posterior probability, which represents how likely the corresponding audio frame feature is to be an audio frame feature of the preset wake-up word; usually, the larger the posterior probability, the more likely it is that the corresponding audio frame feature belongs to the preset wake-up word.

It should be understood that in practical applications, after all the audio frame features of the voice information are input into the acoustic model, the output data may include not only the posterior probabilities of the audio frame features of the syllables or phonemes that make up the wake-up word, but often also the posterior probabilities of the audio frame features of other, non-wake-up-word syllables or phonemes. The present application performs subsequent processing on the posterior probabilities of the audio frame features of the syllables or phonemes that make up the wake-up word, so the required part of the posterior probabilities can be filtered out of the acoustic model's output data; the specific implementation process is not detailed.
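
The per-frame posteriors of step S12 could, for illustration, be produced along the following lines; here a toy single-layer network stands in for the acoustic model, and the mapping from output units to wake-up-word syllables (wake_word_units) is an assumed example rather than the patent's actual model.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def wake_word_posteriors(features: np.ndarray, weights: np.ndarray, bias: np.ndarray,
                         wake_word_units: list) -> np.ndarray:
    """Toy stand-in for the acoustic model: features of shape (T, D) are mapped to
    per-frame posteriors over all output units, and only the columns listed in
    wake_word_units (the units tied to the preset wake-up word's syllables) are
    kept for the confidence judgment; the remaining units cover filler speech."""
    posteriors = softmax(features @ weights + bias)  # posterior probability per frame and unit
    return posteriors[:, wake_word_units]            # target audio frame posteriors per syllable
```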

The preset wake-up word in this embodiment may refer to the preset wake-up word corresponding to the voice control currently being performed by the user on the electronic device. Usually, when the user issues a voice instruction to the electronic device to perform an operation, the voice information spoken by the user contains the preset wake-up word; the present application does not limit the content of the preset wake-up word.

In addition, it should be noted that the target audio frame features corresponding to each syllable of the preset wake-up word in step S12 may be those input audio frame features that the acoustic model considers likely to correspond to each syllable of the preset wake-up word.

Step S13: performing a double confidence judgment on the posterior probability of the target audio frame feature corresponding to each syllable of the preset wake-up word to obtain a first confidence score and a second confidence score for the corresponding syllable;

In this embodiment, after the cached audio frame features of the voice information have been processed by the acoustic model, different confidence judgment modules preset for different types of users are used to perform a double confidence judgment on the processing results, so that each syllable of the voice information that may belong to the preset wake-up word receives two confidence scores, recorded as the first confidence score and the second confidence score. The present application does not limit how the confidence of each syllable in the voice information that may belong to the preset wake-up word is computed, which may include but is not limited to the following calculation:

confidence = \left( \prod_{i=1}^{n-1} \max_{h_{max} \le k \le j} p'_{ik} \right)^{1/(n-1)}

In the above confidence calculation formula, n can denote the number of output units of the acoustic model, whose specific value can be determined from the specific structure of the acoustic model; p'_{ij} can denote the smoothed posterior probability of the audio frame feature of the j-th frame for the i-th output unit; and h_{max} = max{1, j - w_{max} + 1} can denote the position of the first frame within the confidence calculation window (i.e. the confidence judgment window) of size w_{max}.

It can be seen from the above confidence calculation formula that the present application determines, from the posterior probabilities of the audio frame features of each output unit of the acoustic model, the maximum posterior probability of each output unit, and after multiplying these maxima and taking the root, the confidence score of each syllable of the preset wake-up word is obtained. For example, if the wake-up word with which the user wants the electronic device to perform a preset operation is "okay google", then under the above confidence calculation, the obtained confidence score indicates how likely it is that "okay" and "google" occurred within the time window starting at h_{max}.
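
A plain numpy rendering of this smoothing-and-windowing calculation might look as follows; the smoothing window length and the zero-based indexing are assumptions made for the sketch.

```python
import numpy as np

def smooth_posteriors(posteriors: np.ndarray, w_smooth: int = 30) -> np.ndarray:
    """Moving-average smoothing of the raw per-frame posteriors (shape (T, n_units))
    over the previous w_smooth frames, giving the p'_{ik} used in the formula."""
    smoothed = np.empty_like(posteriors)
    for j in range(posteriors.shape[0]):
        h = max(0, j - w_smooth + 1)
        smoothed[j] = posteriors[h:j + 1].mean(axis=0)
    return smoothed

def keyword_confidence(smoothed: np.ndarray, j: int, w_max: int) -> float:
    """Confidence at frame j: for each wake-up-word output unit take its maximum
    smoothed posterior inside the judgment window [h_max, j], multiply the maxima,
    and take the root (the geometric-mean style score described above)."""
    h_max = max(0, j - w_max + 1)              # first frame of the window (0-based here)
    window = smoothed[h_max:j + 1]             # (window_length, n_units)
    per_unit_max = window.max(axis=0)          # best evidence for each syllable unit
    n_units = per_unit_max.shape[0]
    return float(np.prod(per_unit_max) ** (1.0 / n_units))
```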

Following the above analysis of the technical concept of the voice wake-up processing method proposed in the present application, the present application adopts different confidence judgment rules for different types of users to improve voice wake-up accuracy. Taking adult users and underage users (younger children) as an example of different user types, corresponding confidence judgment modules (i.e. posterior processing modules) can be preconfigured for these two types of users to perform posterior processing, such as the adult posterior processing module and the child posterior processing module in FIG. 3 above. These two posterior processing modules perform confidence calculations, respectively, on the posterior probabilities obtained above for the target audio frame features corresponding to each syllable of the preset wake-up word, so that two confidence scores are obtained for each syllable.

It should be noted that there are large differences between the voice characteristics of different types of users; for example, children usually speak more slowly than adults. Consequently, during the confidence calculation, a judgment window sized for adult users' voice information may not cover the complete speech of a child's wake-up word. Therefore, the present application can configure the judgment window for children's voice information to be larger than the judgment window for adults' voice information; the specific sizes of these two judgment windows are not limited and can be adjusted flexibly according to actual needs.

It can thus be seen that, because the two confidence judgment modules are configured with judgment windows of different sizes, the lengths of time over which they cache the posterior probabilities of audio frame features also differ, and when the current judgment passes, the length of cached audio frame features fetched for the subsequent second judgment changes accordingly. This length can match the size of the corresponding judgment window, so that the audio frame features used for the second judgment contain the complete wake-up word features as far as possible.

After the above judgment window is configured, for example with the judgment window set to cache 100 frames of audio frame features, then once 100 frames of audio frame features have been saved, obtaining the audio frame feature of the newest frame discards the earliest cached frame and adds the newest frame's audio frame feature, achieving the caching purpose; the judgment window is not limited to the size described in this embodiment.
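
The sliding feature cache described here behaves like a fixed-length ring buffer; a minimal sketch using collections.deque is shown below, with the 100-frame capacity taken from the example above and the class name chosen only for illustration.

```python
from collections import deque

import numpy as np

class FeatureCache:
    """Keeps only the most recent max_frames audio frame features: once the cache
    is full, appending the newest frame silently discards the earliest one."""

    def __init__(self, max_frames: int = 100):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame_feature: np.ndarray) -> None:
        self._frames.append(frame_feature)

    def snapshot(self, num_frames: int) -> np.ndarray:
        """Return the latest num_frames cached features, e.g. the verification
        segment matching the adult or child judgment window."""
        frames = list(self._frames)[-num_frames:]
        return np.stack(frames)
```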

Step S14: using whichever judgment result of the first confidence score and the second confidence score passes to acquire the verification audio frame features from the audio frame features of the voice information;

Following the above analysis, the confidence scores obtained by the different confidence judgment modules are compared against different thresholds to judge whether the corresponding syllable is a syllable of the preset wake-up word; in this embodiment, these different thresholds are recorded as the first confidence judgment threshold, the second confidence judgment threshold, and so on.

In this way, after the first confidence score and the second confidence score are obtained, the first confidence score can be compared with the first confidence judgment threshold and the second confidence score with the second confidence judgment threshold. If either confidence score reaches its corresponding confidence judgment threshold, the syllable can be considered to belong to the preset wake-up word input by the corresponding type of user. At this point, the first-level model in FIG. 3 above is triggered, and the verification audio frame features can be fetched from the cache according to the size of the judgment window corresponding to that user type.

For example, if the second confidence score obtained by the confidence judgment module for children reaches the second confidence judgment threshold (i.e. the children's confidence judgment threshold; correspondingly, the first confidence judgment threshold applies to adults), verification audio frame features of the corresponding length can be obtained from the cached audio frame features according to the size of the judgment window corresponding to children. Similarly, if the first confidence score obtained by the confidence judgment module for adults reaches the first confidence judgment threshold, verification audio frame features of the length matching the adult judgment window size can be obtained; the specific acquisition process is not detailed.
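
Putting the two judgments together, the trigger logic of step S14 could be sketched as follows; the threshold values and window sizes are placeholders, and cache is assumed to be the feature cache from the earlier sketch.

```python
def first_level_trigger(adult_score: float, child_score: float, cache,
                        adult_threshold: float = 0.6, child_threshold: float = 0.5,
                        adult_window: int = 100, child_window: int = 150):
    """Return (mode, verification_features) if either confidence judgment passes,
    otherwise None. The cache slice length matches the judgment window of the mode
    that triggered, so the second judgment sees the whole wake-up word."""
    if adult_score >= adult_threshold:
        return "adult", cache.snapshot(adult_window)
    if child_score >= child_threshold:
        return "child", cache.snapshot(child_window)
    return None
```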

Step S15: acquiring the confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a second confidence judgment on the verification audio frame features;

Based on the above analysis, this embodiment uses the double confidence judgment module in the first-level model to recognize the wake-up word in the voice information, and after the first-level model is woken up, i.e. once it is preliminarily determined that the voice information contains the preset wake-up word, the second-level model continues to perform a second verification on the voice information. As analyzed above, the second-level model can be deployed on the electronic device or on a server; the present application does not limit the deployment location or structure of the second-level model.

Optionally, for the second-level model shown in FIG. 3, corresponding verification models can be configured for different types of users, such as the adult model and the child model in FIG. 3. The network structures of these two verification models can be the same, for example the larger acoustic model plus posterior processing module deployed on the electronic device or in the cloud, as proposed during the development of the above technical solution, or the acoustic model of the first-level model plus the corresponding confidence judgment module, etc.; the present application does not limit the specific network structure of the verification models.

It should be noted that, when building the verification models corresponding to different types of users, voice samples of the corresponding user type need to be used for training, and during training, the audio frame lengths of the sample features input to the network also differ; refer to the description of the judgment window above.

The second confidence judgment process on the verification audio frame features is similar to the first confidence judgment process performed by the first-level model on the target audio frame features, and is not repeated here.
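
The hand-off from the first-level trigger to the second confidence verification of step S15 could then be sketched as follows; the verification models are assumed to be callables that return a confidence score, and the second-stage threshold is a placeholder.

```python
def second_level_verify(mode: str, verification_features, adult_verifier, child_verifier,
                        verify_threshold: float = 0.7) -> bool:
    """Run the adult or child verification model over the cached verification segment
    and apply the second-stage confidence judgment."""
    verifier = adult_verifier if mode == "adult" else child_verifier
    score = verifier(verification_features)   # e.g. a larger acoustic model plus posterior processing
    return score >= verify_threshold

# Usage sketch: wake the device only when both levels pass.
# trigger = first_level_trigger(adult_score, child_score, cache)
# if trigger and second_level_verify(*trigger, adult_verifier, child_verifier):
#     respond_to_wake_word()  # hypothetical hook for the preset operation
```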

Step S16: if the confidence verification result passes, responding to the instruction corresponding to the preset wake-up word and controlling the electronic device to perform a preset operation.

As analyzed above, the present application performs the second confidence judgment only after the first-level model is woken up, i.e. only when at least one of the judgment results of the first confidence score and the second confidence score in step S14 above passes. Once the confidence judgment result obtained by the second confidence judgment also passes, it can be considered that the wake-up word identified from the voice information is indeed the preset wake-up word, i.e. the wake-up word in the voice information input by the user has been accurately recognized. The electronic device can then respond to the instruction corresponding to the wake-up word and be controlled to perform a preset operation, such as controlling the smart speaker to play song A.

In summary, in this embodiment, after acquiring the voice information input by the user for the electronic device, the audio frame features of the voice information are acquired and input into an acoustic model for processing to obtain the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word contained in the voice information. Taking into account the differences between the voice characteristics of different types of users (such as adults and children), this embodiment then deploys confidence judgments for an adult mode and a child mode respectively, realizing a double confidence judgment on the obtained posterior probabilities so that each syllable receives two confidence scores. If the judgment on either confidence score passes, verification audio frame features of the corresponding length are fetched from the cache for a second confidence verification. Once the confidence verification passes, it can be determined that the voice information contains the preset wake-up word, and the instruction corresponding to the preset wake-up word can be responded to directly, controlling the electronic device to perform a preset operation. It can be seen that the voice wake-up processing method provided by this embodiment accommodates both adult and child voice wake-up performance and improves the efficiency and accuracy of voice wake-up.

The voice wake-up processing method described above in this application is refined below, though the application is not limited to the refined example described here. FIG. 7 is a signaling flow chart of a refined example of the voice wake-up processing method proposed in this application; the method may include but is not limited to the following steps:

Step S21: the electronic device acquires the voice information input by the user.

Step S22: the electronic device performs frame-by-frame feature extraction on the voice information, obtaining audio frame features and caching them.

In this embodiment, frame-by-frame feature extraction on the voice information input by the user yields the audio frame features of each audio frame composing the voice information. The obtained audio frame features can then be cached and used to recognize the wake-up word in the voice information, thereby realizing voice wake-up control of the electronic device.

This application does not limit how the audio frame features are obtained or cached, which may include but is not limited to the methods described in the above embodiments.
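
To make the frame-by-frame extraction and caching step concrete, the following is a minimal sketch, assuming log-mel-style frame features and a simple ring buffer as the cache; the frame length, hop size, feature dimension and cache capacity are illustrative assumptions, not values specified by the embodiment.

```python
from collections import deque

import numpy as np


class FrameFeatureCache:
    """Fixed-capacity cache holding the most recent audio frame features."""

    def __init__(self, max_frames: int = 200):
        self._frames = deque(maxlen=max_frames)

    def push(self, feature: np.ndarray) -> None:
        self._frames.append(feature)

    def last(self, num_frames: int) -> np.ndarray:
        """Return the most recent `num_frames` features as a (T, D) array."""
        frames = list(self._frames)[-num_frames:]
        return np.stack(frames) if frames else np.empty((0, 0))


def extract_frame_features(samples: np.ndarray, sample_rate: int = 16000,
                           frame_ms: int = 25, hop_ms: int = 10,
                           n_mels: int = 40) -> np.ndarray:
    """Split audio into overlapping frames and compute a log energy-band feature per frame."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Placeholder mel projection: pool the spectrum into n_mels bands.
        bands = np.array_split(spectrum, n_mels)
        features.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    if not features:
        return np.empty((0, n_mels))
    return np.stack(features)  # shape (num_frames, n_mels)
```

In such a setup, each new feature row would be pushed into the cache as it is produced, so the later verification step can read back a window of recent frames without re-running the extraction.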

Step S23: the electronic device inputs the cached audio frame features into the acoustic model for processing, obtaining the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word.

For the implementation of step S23, reference may be made to the description of the corresponding part of the above embodiments.

Step S24: the electronic device performs confidence calculation according to the first confidence judgment rule and the second confidence judgment rule respectively, obtaining a first confidence score and a second confidence score for the same syllable of the preset wake-up word contained in the voice information.

With reference to the description of the above embodiments, in this embodiment the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word may be subjected to confidence calculation according to the first confidence judgment rule, yielding the first confidence score of the corresponding syllable, and according to the second confidence judgment rule, yielding the second confidence score of the corresponding syllable. The decision window size and the confidence judgment threshold of the first confidence judgment rule differ from those of the second confidence judgment rule; the decision window is used to determine the time length of the target audio frame features over which the confidence is calculated, and the specific values are not limited.

In this embodiment, the first confidence judgment rule and the second confidence judgment rule may be the confidence calculation rules followed by different confidence judgment modules (i.e., posterior processing modules) during confidence calculation; this application does not limit their specific content, which may be determined by the confidence calculation method of the corresponding confidence judgment module. As analyzed above, the confidence judgment modules may include an adult confidence judgment module and a child confidence judgment module. Compared with the prior art, a confidence judgment module for the child mode is added, and it is independent of the adult-mode confidence judgment module; by setting a larger decision window, the wake-up performance for children's voices can be effectively improved without affecting the wake-up performance for adults.
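
As a rough illustration of the dual confidence judgment described above, the sketch below scores the wake-up-word syllables under two independent rules that differ only in decision window size and threshold. The window lengths, thresholds and the max-over-window scoring are assumptions for illustration, not values fixed by the embodiment.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ConfidenceRule:
    name: str           # e.g. "adult" or "child"
    window_frames: int  # decision window: how many recent frames to score over
    threshold: float    # confidence judgment threshold


def syllable_confidences(posteriors: np.ndarray, syllable_ids: list[int],
                         rule: ConfidenceRule) -> np.ndarray:
    """posteriors: (T, num_units) frame-level posteriors from the shared acoustic model.

    For each wake-up-word syllable, take the best frame-level posterior inside
    the rule's decision window as that syllable's confidence score.
    """
    window = posteriors[-rule.window_frames:]
    return np.array([window[:, sid].max() for sid in syllable_ids])


def judge(posteriors: np.ndarray, syllable_ids: list[int],
          rule: ConfidenceRule) -> bool:
    """A rule passes only if every syllable's score clears the rule's threshold."""
    scores = syllable_confidences(posteriors, syllable_ids, rule)
    return bool(np.all(scores >= rule.threshold))


# Two independent rules sharing one acoustic model's output; the child rule
# uses a larger window so slower child speech still fits inside it.
ADULT_RULE = ConfidenceRule("adult", window_frames=80, threshold=0.60)
CHILD_RULE = ConfidenceRule("child", window_frames=120, threshold=0.55)
```

With both rules applied to the same shared posteriors, the first stage counts as awakened as soon as either judgment passes, which is what triggers the acquisition of the verification features in the next steps.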

Step S25: the electronic device judges the first confidence score against the first confidence judgment threshold to obtain a first judgment result, and judges the second confidence score against the second confidence judgment threshold to obtain a second judgment result.

This embodiment does not limit the specific values of the first confidence judgment threshold and the second confidence judgment threshold.

Step S26: if the first judgment result or the second judgment result passes, the electronic device obtains the verification audio frame features.

The verification audio frame features are the cached audio frame features matching the size of the decision window corresponding to the passing judgment result; for the specific acquisition process, reference may be made to the description of the corresponding part of the above embodiments.
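
A minimal sketch of fetching the verification features from the cache is given below, reusing the hypothetical FrameFeatureCache and ConfidenceRule objects from the earlier sketches; the only point it illustrates is that the slice length follows the decision window of whichever rule passed.

```python
import numpy as np


def fetch_verification_features(cache: "FrameFeatureCache",
                                passed_rule: "ConfidenceRule") -> np.ndarray:
    """Read back exactly one decision window of cached frame features
    for the rule (adult or child) whose first-stage judgment passed."""
    return cache.last(passed_rule.window_frames)
```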

Step S27: the electronic device sends a voice confidence verification request to the server.

The voice confidence verification request may carry the verification audio frame features and the user type identifier corresponding to those features, such as an adult user identifier or a child user identifier. It should be noted that the content carried by the voice confidence verification request is not limited to this; it may also include the first confidence judgment result, such as pass or fail.
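
The request described here could be serialized in many ways; the following sketch assumes a simple JSON body over HTTP, with field names (user_type, features, first_pass_result) chosen for illustration rather than taken from the embodiment.

```python
import json
import urllib.request

import numpy as np


def send_verification_request(server_url: str, features: np.ndarray,
                              user_type: str, first_pass_result: bool) -> dict:
    """Post the verification frame features and the user type identifier
    (e.g. "adult" or "child") to the server and return its decoded reply."""
    payload = {
        "user_type": user_type,                  # user type identifier
        "features": features.tolist(),           # verification audio frame features
        "first_pass_result": first_pass_result,  # optional: first confidence judgment result
    }
    request = urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```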

Step S28: the server parses the voice confidence verification request, obtaining the verification audio frame features and the corresponding user type identifier.

Step S29: the server performs confidence verification on the verification audio frame features using the verification model corresponding to the user type identifier, obtaining a confidence verification result.

It can be seen that, after the verification audio frame features are determined, confidence verification can be performed on them using the verification model corresponding to the passing judgment result, yielding the confidence verification result of the verification audio frame features. A corresponding verification model is configured for each confidence judgment rule, and each verification model is obtained by training on voice samples of the user type corresponding to that confidence judgment rule. For the specific implementation, reference may be made to the description of the corresponding part of the above embodiments; the processing is not limited to the manner described in this embodiment.
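
On the server side, dispatch by user type identifier could look like the following sketch; the VerificationModel interface and the 0-to-1 score thresholds are assumptions, and in practice each model would be the larger acoustic model plus posterior processing mentioned earlier, trained on that user type's voice samples.

```python
from typing import Protocol

import numpy as np


class VerificationModel(Protocol):
    def score(self, features: np.ndarray) -> float:
        """Return a confidence score in [0, 1] for the wake-up word."""
        ...


# One verification model per user type, each trained on that type's voice samples.
VERIFICATION_MODELS: dict[str, VerificationModel] = {}
VERIFICATION_THRESHOLDS = {"adult": 0.70, "child": 0.65}  # illustrative values


def verify(features: np.ndarray, user_type: str) -> dict:
    """Run the secondary confidence judgment with the model matching the user type."""
    model = VERIFICATION_MODELS[user_type]
    score = model.score(features)
    passed = score >= VERIFICATION_THRESHOLDS[user_type]
    return {"passed": passed, "score": score}
```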

Step S210: the server feeds the confidence verification result back to the electronic device.

Step S211: if the confidence verification result passes, the electronic device responds to the instruction corresponding to the preset wake-up word and performs the preset operation.

To sum up, the electronic device of this embodiment is configured with two confidence judgment modules, i.e., a dual confidence judgment module, tailored to the characteristics of children's and adults' voices. Compared with the prior art, a confidence judgment for the child mode is added, and the two confidence judgment modules are relatively independent, so that the electronic device can effectively improve the wake-up performance for children's voices by setting a larger decision window, without affecting the wake-up performance for adults.

Moreover, in the first-level model shown in FIG. 3, the acoustic model is shared regardless of whether the voice information is input by an adult user or a child user; there is no need to set up two acoustic models for these two types of users, which reduces the amount of computation and the consumption of device resources, making the solution suitable for resource-constrained scenarios on electronic devices.

In addition, in the second-level model of FIG. 3, this application configures different verification models for different types of users. The two verification models can be modeled separately on adult voice samples and children's voice samples, effectively exploiting the voice samples of the two types of users to obtain optimal performance for each, which improves the accuracy of the secondary confidence judgment while raising the wake-up rate for children's voices.

Referring to FIG. 8, which is a structural diagram of an optional example of the voice wake-up processing apparatus proposed in this application, the apparatus can be used in an electronic device; this application does not limit the product type of the electronic device. As shown in FIG. 8, the apparatus may include:

a feature acquisition module 21, configured to acquire the audio frame features of the input voice information;

Optionally, the feature acquisition module 21 may include:

a voice information acquisition unit, configured to acquire the voice information input for the electronic device;

a feature extraction unit, configured to perform feature extraction on the voice information to obtain the audio frame features of each audio frame composing the voice information, and to cache the obtained audio frame features.

a posterior probability acquisition module 22, configured to input the audio frame features into the acoustic model for processing to obtain the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word;

a confidence judgment module 23, configured to perform a dual confidence judgment on the posterior probability of the target audio frame features corresponding to each syllable, obtaining a first confidence score and a second confidence score of the corresponding syllable;

a verification feature acquisition module 24, configured to obtain, using the passing judgment result among those of the first confidence score and the second confidence score, the verification audio frame features from the audio frame features of the voice information;

As an optional example of this application, as shown in FIG. 9, the confidence judgment module 23 may include:

a first confidence calculation unit 231, configured to perform confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable according to the first confidence judgment rule, obtaining the first confidence score of the corresponding syllable;

a second confidence calculation unit 232, configured to perform confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable according to the second confidence judgment rule, obtaining the second confidence score of the corresponding syllable;

wherein the decision window size and the confidence judgment threshold of the first confidence judgment rule differ from those of the second confidence judgment rule, the decision window being used to determine the time length of the target audio frame features over which the confidence is calculated.

Accordingly, the above verification feature acquisition module 24 may include:

a first judgment unit 241, configured to judge the first confidence score against the first confidence judgment threshold to obtain a first judgment result;

a second judgment unit 242, configured to judge the second confidence score against the second confidence judgment threshold to obtain a second judgment result;

a verification audio frame feature acquisition unit 243, configured to, if the first judgment result or the second judgment result passes, obtain from the audio frame features of the voice information the verification audio frame features matching the size of the decision window corresponding to the passing judgment result.

a confidence verification result acquisition module 25, configured to obtain the confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence judgment on the verification audio frame features;

Optionally, the confidence verification result acquisition module 25 may include:

a confidence verification unit, configured to perform confidence verification on the verification audio frame features using the verification model corresponding to the passing judgment result, obtaining the confidence verification result of the verification audio frame features;

wherein a corresponding verification model is configured for each confidence judgment rule, the verification model being obtained by training on voice samples of the user type corresponding to that confidence judgment rule.

In practical applications, the confidence verification result of the verification audio frame features may be obtained directly by the electronic device performing the secondary confidence judgment, or by a server or another electronic device communicatively connected to the electronic device. This application does not limit the specific method of obtaining the confidence verification result of the verification audio frame features; reference may be made to the description of the corresponding part of the above method embodiments.

Based on this, the above confidence verification unit may include:

a confidence verification request sending unit, configured to send a voice confidence verification request to the server, the voice confidence verification request carrying the verification audio frame features and the user type identifier corresponding to the verification audio frame features;

a confidence verification result receiving unit, configured to receive the confidence verification result of the verification audio frame features fed back by the server, the confidence verification result being obtained by the server, in response to the voice confidence verification request, performing confidence verification on the verification audio frame features using the verification model corresponding to the user type identifier.

Based on the above analysis, it should be understood that, in the case where the confidence verification result is computed directly by the electronic device, similarly to the processing described in this embodiment, verification models corresponding to the different user type identifiers can be trained in advance, and the verification model is used to perform the secondary confidence verification on the verification audio features corresponding to the respective user type identifier. The specific verification process may be similar to the preceding confidence judgment method corresponding to that user type identifier and is not repeated in this embodiment.

a voice wake-up module 26, configured to, if the confidence verification result passes, respond to the instruction corresponding to the preset wake-up word and control the electronic device to perform the preset operation.

To sum up, in this embodiment, the acquired voice information is subjected to a dual confidence judgment that takes the voice characteristics of different types of users into account, and the dual confidence judgment module shares the same acoustic model; that is, the dual confidence judgment module performs confidence judgments on the same audio frame features. As long as one confidence judgment passes, the subsequent secondary confidence verification is triggered: verification audio frame features of a length matching the decision window size used by the passing confidence judgment are obtained and sent to the verification model of the corresponding user type for confidence verification. If the verification passes, it is determined that the acquired voice information contains the preset wake-up word, and the electronic device can respond to the voice information input by the user and perform the preset operation. It can be seen that the voice wake-up processing scheme proposed in this application takes both adult and child voice wake-up performance into account; compared with the prior art, it improves the wake-up performance for children's voices, i.e., improves the efficiency and accuracy of voice wake-up.

In addition, it should be noted that the modules and units in the above voice wake-up processing apparatus are in fact functional modules composed of program code; the function of each functional module is realized by executing the corresponding program code. For the process by which each functional module realizes its function, reference may be made to the description of the corresponding part of the above embodiments.

An embodiment of the present application further provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above voice wake-up processing method. For the implementation of the voice wake-up processing method, reference may be made to the description of the above method embodiments.

Referring to FIG. 10, which is a schematic structural diagram of an optional example of the voice wake-up processing system proposed in this application, the system may include but is not limited to: at least one electronic device 31 and a server 32, wherein:

This embodiment does not limit the product type of each electronic device 31, which is not restricted to the types of electronic devices shown in FIG. 10.

The server 32 may be a standalone service device or a server cluster composed of multiple service devices; this application does not limit the structure or type of the server 32. For example, it may include a communication interface, a memory and a processor. The memory in the server may store a program implementing the secondary confidence judgment method for the verification audio frame features; the processor can load and execute the program to perform the secondary confidence judgment on the verification audio frame features and obtain the confidence verification result of the verification audio frame features. For the specific implementation, reference may be made to the description of the corresponding part of the above method embodiments.

As shown in FIG. 11, when the user wishes to voice-control the electronic device to perform an operation (i.e., a preset operation), the user can speak the corresponding wake-up word. For example, to have a smart speaker play song B, the user can say "xx (which may be the wake-up word of the smart speaker system, but is not limited to this), play song B". After the electronic device collects the voice information output by the user, it can process it in the manner described in the above embodiments: the electronic device performs frame-by-frame feature extraction on the voice information to obtain multiple audio frame features, feeds them into the preset acoustic model for processing to obtain the posterior probability of each audio frame feature, and determines the posterior probabilities of the at least one target audio frame feature corresponding to each syllable of what may be the preset wake-up word contained in the voice information. A dual confidence judgment is then performed on the posterior probabilities of the target audio frame features determined for each syllable, for example processed by the adult confidence judgment module and the child confidence judgment module respectively. It can be seen that this application takes into account the differences between adult and child voice characteristics: the different confidence judgment modules share one acoustic model and perform confidence calculation and judgment on the posterior probabilities of the target audio frame features output by that acoustic model. It should be noted that the decision window sizes and confidence thresholds used here differ and can be determined according to the characteristics of the different user types; usually the decision window for children is larger than that for adults, so as to preserve the completeness of the wake-up word features as far as possible.
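
Pulling the pieces together, the flow just described might be orchestrated as in the sketch below, reusing the hypothetical helpers from the earlier sketches (extract_frame_features, judge, fetch_verification_features, send_verification_request, ADULT_RULE, CHILD_RULE); the acoustic model call is left abstract and the whole function is only an illustration of the ordering of the two stages.

```python
import numpy as np


def handle_utterance(samples: np.ndarray, acoustic_model, cache: "FrameFeatureCache",
                     syllable_ids: list[int], server_url: str) -> bool:
    """Return True when the device should respond to the preset wake-up word."""
    features = extract_frame_features(samples)
    for frame in features:
        cache.push(frame)
    posteriors = acoustic_model(features)  # shared by both confidence judgments

    # First stage: dual confidence judgment over the same posteriors.
    for rule in (ADULT_RULE, CHILD_RULE):
        if judge(posteriors, syllable_ids, rule):
            # Second stage: window-matched features go to the type-specific model.
            verify_feats = fetch_verification_features(cache, rule)
            reply = send_verification_request(server_url, verify_feats,
                                              rule.name, first_pass_result=True)
            if reply.get("passed"):
                return True
    return False
```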

In practical applications, as long as one of the above dual confidence judgments passes, the first-level model shown in FIG. 3 is considered activated and the second-level model can be triggered. At this point, verification audio frame features whose length matches the decision window size of the user type whose confidence judgment passed are obtained and sent to the verification model corresponding to that user type (which may be deployed on the electronic device or on another electronic device, such as the above server). The verification model (e.g., the adult verification model or the child verification model) performs the secondary confidence verification on the verification audio frame features in the manner described above; the specific process is not repeated. The verification models for the different user types are trained on data of the corresponding user type, which ensures the accuracy of the secondary confidence judgment.

Once both of the above confidence judgments pass, it can be determined that the currently acquired voice information contains the preset wake-up word, and the electronic device can respond to the control instruction corresponding to the preset wake-up word and perform the preset operation, satisfying the user's voice wake-up control requirement for the electronic device. For example, if the child confidence judgment is the one that passes in the first confidence judgment, the voice information may be considered to have been produced by a child and to possibly contain the preset wake-up word; verification audio frame features matching the child decision window size are obtained from the cached audio frame features and sent to the child verification model for the secondary confidence judgment. If it passes, it is determined that the voice information was produced by a child and contains the preset wake-up word, and the electronic device responds to the voice information, improving the performance of children's voice wake-up.

It should be noted that, in the application scenario of this embodiment, after the verification audio frame features are obtained, the processing is not limited to the manner shown in FIG. 11, i.e., sending them to the server for the secondary confidence judgment; the electronic device itself may also perform the secondary confidence judgment. The specific implementation is the same and is not repeated in this application.

The embodiments in this specification are described in a progressive or parallel manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments, reference may be made to one another. For the apparatus, system and electronic device disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, their description is relatively brief; for relevant details, refer to the description of the method parts.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as exceeding the scope of this application.

The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the core idea or scope of the application. Therefore, the present application will not be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

Translated from Chinese

1. A voice wake-up processing method, characterized in that the method comprises:
acquiring audio frame features of input voice information;
inputting the audio frame features into an acoustic model for processing to obtain a posterior probability of target audio frame features corresponding to each syllable of a preset wake-up word;
performing a dual confidence judgment on the posterior probability of the target audio frame features corresponding to each syllable to obtain a first confidence score and a second confidence score of the corresponding syllable;
obtaining, using the passing judgment result among those of the first confidence score and the second confidence score, verification audio frame features from the audio frame features of the voice information;
obtaining a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence judgment on the verification audio frame features;
if the confidence verification result passes, responding to an instruction corresponding to the preset wake-up word and controlling an electronic device to perform a preset operation.

2. The method according to claim 1, characterized in that performing the dual confidence judgment on the posterior probability of the target audio frame features corresponding to each syllable to obtain the first confidence score and the second confidence score of the corresponding syllable comprises:
performing confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable according to a first confidence judgment rule to obtain the first confidence score of the corresponding syllable;
performing confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable according to a second confidence judgment rule to obtain the second confidence score of the corresponding syllable;
wherein the decision window size and the confidence judgment threshold of the first confidence judgment rule differ from those of the second confidence judgment rule, the decision window being used to determine the time length of the target audio frame features over which the confidence is calculated.

3. The method according to claim 2, characterized in that obtaining, using the passing judgment result among those of the first confidence score and the second confidence score, the verification audio frame features from the audio frame features of the voice information comprises:
judging the first confidence score against a first confidence judgment threshold to obtain a first judgment result, and judging the second confidence score against a second confidence judgment threshold to obtain a second judgment result;
if the first judgment result or the second judgment result passes, obtaining, from the audio frame features of the voice information, the verification audio frame features matching the size of the decision window corresponding to the passing judgment result.

4. The method according to any one of claims 1 to 3, characterized in that obtaining the confidence verification result of the verification audio frame features comprises:
performing confidence verification on the verification audio frame features using a verification model corresponding to the passing judgment result to obtain the confidence verification result of the verification audio frame features;
wherein a corresponding verification model is configured for each confidence judgment rule, the verification model being obtained by training on voice samples of the user type corresponding to that confidence judgment rule.

5. The method according to claim 4, characterized in that performing the confidence verification on the verification audio frame features using the verification model corresponding to the passing judgment result to obtain the confidence verification result of the verification audio frame features comprises:
sending a voice confidence verification request to a server, the voice confidence verification request carrying the verification audio frame features and a user type identifier corresponding to the verification audio frame features;
receiving the confidence verification result of the verification audio frame features fed back by the server, the confidence verification result being obtained by the server, in response to the voice confidence verification request, performing confidence verification on the verification audio frame features using the verification model corresponding to the user type identifier.

6. The method according to any one of claims 1 to 4, characterized in that acquiring the audio frame features of the input voice information comprises:
acquiring voice information input for the electronic device;
performing feature extraction on the voice information to obtain the audio frame features of each audio frame composing the voice information, and caching the obtained audio frame features.

7. A voice wake-up processing apparatus, characterized in that the apparatus comprises:
a feature acquisition module, configured to acquire audio frame features of input voice information;
a posterior probability acquisition module, configured to input the audio frame features into an acoustic model for processing to obtain a posterior probability of target audio frame features corresponding to each syllable of a preset wake-up word;
a confidence judgment module, configured to perform a dual confidence judgment on the posterior probability of the target audio frame features corresponding to each syllable to obtain a first confidence score and a second confidence score of the corresponding syllable;
a verification feature acquisition module, configured to obtain, using the passing judgment result among those of the first confidence score and the second confidence score, verification audio frame features from the audio frame features of the voice information;
a confidence verification result acquisition module, configured to obtain a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence judgment on the verification audio frame features;
a voice wake-up module, configured to, if the confidence verification result passes, respond to an instruction corresponding to the preset wake-up word and control an electronic device to perform a preset operation.

8. The apparatus according to claim 7, characterized in that the confidence verification result acquisition module comprises:
a confidence verification unit, configured to perform confidence verification on the verification audio frame features using a verification model corresponding to the passing judgment result to obtain the confidence verification result of the verification audio frame features;
wherein a corresponding verification model is configured for each confidence judgment rule, the verification model being obtained by training on voice samples of the user type corresponding to that confidence judgment rule.

9. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the voice wake-up processing according to any one of claims 1 to 6.

10. An electronic device, characterized in that the electronic device comprises:
a sound collector, configured to collect voice information output by a user;
a communication interface;
a memory, configured to store a program implementing the voice wake-up processing according to any one of claims 1 to 6;
a processor, configured to load and execute the program stored in the memory to implement the steps of the voice wake-up processing according to any one of claims 1 to 6.
Application CN201910828451.7A — filed 2019-09-03 — Voice wake-up processing method and device, storage medium and electronic equipment — Status: Active — granted as CN110534099B

Priority Applications (1)

Application Number: CN201910828451.7A — Priority Date: 2019-09-03 — Filing Date: 2019-09-03 — Title: Voice wake-up processing method and device, storage medium and electronic equipment

Publications (2)

CN110534099A — 2019-12-03 (application)
CN110534099B — 2021-12-14 (grant)

Family ID: 68666681

Also Published As

CN110534099B (en) — 2021-12-14

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
