
Voice wake-up method and device, electronic equipment and storage medium

Info

Publication number: CN111508493B
Authority: CN (China)
Prior art keywords: output, voice, matching, probability, terminal
Legal status: Active
Application number: CN202010312299.XA
Other languages: Chinese (zh)
Other versions: CN111508493A
Inventor: 宋天龙
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd

Events:
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010312299.XA
Publication of CN111508493A
Application granted
Publication of CN111508493B
Anticipated expiration

Abstract

The application discloses a voice wake-up method and device, an electronic device, and a storage medium, relating to the technical field of voice processing. The method includes: acquiring an input voice collected by an audio collector; matching the input voice based on a first voice matching model to obtain a first probability output, where the first probability output indicates the probability that the input voice contains a specified text; acquiring at least one probability output produced by the first voice matching model before the current first probability output as a second probability output; fusing the first probability output with the second probability output to obtain an updated first probability output; taking the updated first probability output as the first matching result of the first voice matching model matching the input voice; and, if the first matching result indicates that the input voice contains the specified text, waking up the terminal. By fusing historical outputs with the current output, the application can improve keyword recognition accuracy and reduce the false wake-up rate.

Description

Voice wake-up method and device, electronic equipment and storage medium

Technical Field

The present application relates to the technical field of voice processing, and more specifically, to a voice wake-up method and device, an electronic device, and a storage medium.

Background

With the rapid development of voice processing technology, voice dialogue functions have appeared in terminals in people's daily life. Users can control the terminal by speaking specific phrases, for example to wake and light the screen, wake and unlock, or wake and start the voice dialogue function. A terminal may receive multiple voices at the same time; to identify the voice the user intends for controlling the terminal, it generally detects whether the voice contains a wake-up word and wakes up only if it does. In actual use, however, the terminal is often awakened by mistake when the user has not said the wake-up word, i.e., the false wake-up rate of current voice wake-up schemes is relatively high.

Summary of the Invention

Embodiments of the present application provide a voice wake-up method and device, an electronic device, and a storage medium, which can reduce the false wake-up rate of voice wake-up on a terminal.

In a first aspect, an embodiment of the present application provides a voice wake-up method applied to a terminal provided with an audio collector. The method includes: acquiring an input voice collected by the audio collector; matching the input voice based on a first voice matching model to obtain a first probability output, where the first probability output indicates the probability that the input voice contains a specified text; acquiring at least one probability output produced by the first voice matching model before the current first probability output as a second probability output; fusing the first probability output with the second probability output to obtain an updated first probability output; taking the updated first probability output as the first matching result of the first voice matching model matching the input voice; and, if the first matching result indicates that the input voice contains the specified text, waking up the terminal.

In a second aspect, an embodiment of the present application provides a voice wake-up device applied to a terminal provided with an audio collector. The device includes: a voice acquisition module configured to acquire an input voice collected by the audio collector; a first output module configured to match the input voice based on the first voice matching model to obtain a first probability output, where the first probability output indicates the probability that the input voice contains the specified text; a second output module configured to acquire at least one probability output produced by the first voice matching model before the current first probability output as a second probability output; an output update module configured to fuse the first probability output with the second probability output to obtain an updated first probability output; a result acquisition module configured to take the updated first probability output as the first matching result of the first voice matching model matching the input voice; and a terminal wake-up module configured to wake up the terminal if the first matching result indicates that the input voice contains the specified text.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory; one or more processors coupled to the memory; and one or more application programs stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to execute the voice wake-up method provided in the first aspect above.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing program code that can be invoked by a processor to execute the voice wake-up method provided in the first aspect above.

In the voice wake-up method and device, electronic device, and storage medium provided by the embodiments of the present application, the input voice collected by the audio collector is acquired and matched based on a first voice matching model to obtain a first probability output indicating whether the input voice contains the specified text; at least one probability output produced by the first voice matching model before the current first probability output is acquired as a second probability output; the first probability output is then fused with the second probability output to obtain an updated first probability output, which serves as the first matching result of the first voice matching model matching the input voice; if the first matching result indicates that the input voice contains the specified text, the terminal is woken up. By fusing the current first probability output of the first voice matching model with the historically output second probability output to obtain the first matching result on whether the current input voice contains the specified text, the application can improve keyword recognition accuracy, effectively suppress keyword detection jumps, and reduce the false wake-up rate.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on them without creative effort.

FIG. 1 shows a schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of the present application.

FIG. 2 shows a schematic flowchart of a voice wake-up method provided by an embodiment of the present application.

FIG. 3 shows a schematic flowchart of a voice wake-up method provided by another embodiment of the present application.

FIG. 4 shows a schematic diagram of the MFCC feature extraction process involved in an exemplary embodiment of the present application.

FIG. 5 shows a schematic structural diagram of a convolutional neural network provided by an exemplary embodiment of the present application.

FIG. 6 shows a schematic flowchart of step S230 in FIG. 3 according to an exemplary embodiment of the present application.

FIG. 7 shows a schematic diagram of the attention weight extraction process involved in an exemplary embodiment of the present application.

FIG. 8 shows a schematic flowchart of step S231 in FIG. 6 provided by an exemplary embodiment of the present application.

FIG. 9 shows a schematic diagram of the pooling process involved in an exemplary embodiment of the present application.

FIG. 10 shows a schematic diagram of the attention scaling process provided by an exemplary embodiment of the present application.

FIG. 11 shows a schematic flowchart of step S250 in FIG. 3 provided by an exemplary embodiment of the present application.

FIG. 12 shows a schematic flowchart of step S254 in FIG. 11 provided by an exemplary embodiment of the present application.

FIG. 13 shows a schematic diagram of the history fusion process involved in an exemplary embodiment of the present application.

FIG. 14 shows a schematic flowchart of a voice wake-up method provided by yet another embodiment of the present application.

FIG. 15 shows a schematic flowchart of a voice wake-up method provided by still another embodiment of the present application.

FIG. 16 shows a schematic flowchart of step S490 in FIG. 15 provided by an exemplary embodiment of the present application.

FIG. 17 shows a module block diagram of the voice wake-up device provided by an embodiment of the present application.

FIG. 18 shows a structural block diagram of the electronic device provided by an embodiment of the present application.

FIG. 19 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code implementing the voice wake-up method according to the embodiments of the present application.

Detailed Description

To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the drawings in the embodiments of the present application.

Current voice wake-up methods are generally based on speech recognition of a wake-up word: when waking a terminal by voice, the user's input voice is recognized and checked for the wake-up word, and if it is present, a wake-up operation such as unlocking or lighting the screen is executed. For example, if the terminal stores the wake-up word "Xiaoou Xiaoou", then when the user says "Xiaoou Xiaoou" and the terminal recognizes the wake-up word in the acquired voice input, it performs the screen-lighting operation.

However, the inventors found that in actual use, even with voice wake-up schemes like the above, the terminal is often falsely awakened when the user's input voice does not contain the wake-up word. This wastes terminal resources and power unnecessarily, and may trigger wrong operations, degrading the user experience.

In view of the above problems, embodiments of the present application provide a voice wake-up method and device, an electronic device, and a computer-readable storage medium. The input voice collected by the audio collector is acquired and matched based on a first voice matching model to obtain a first probability output indicating whether the input voice contains a specified text; at least one probability output produced by the first voice matching model before the current first probability output is acquired as a second probability output; the first probability output is then fused with the second probability output to obtain an updated first probability output, which serves as the first matching result of the first voice matching model matching the input voice; if the first matching result indicates that the input voice contains the specified text, the terminal is woken up. By fusing the current first probability output of the first voice matching model with the historically output second probability output to obtain the first matching result on whether the current input voice contains the specified text, the application can improve keyword recognition accuracy, effectively suppress keyword detection jumps, and reduce the false wake-up rate.

For ease of detailed description, the application scenarios applicable to the embodiments of the present application are first illustrated below with reference to the drawings.

Please refer to FIG. 1, which shows a schematic diagram of an application scenario of the voice wake-up method provided by an embodiment of the present application. The scenario includes a voice wake-up system 10 provided by an embodiment of the present application, which includes a terminal 100 and a server 200.

The terminal 100 may be, but is not limited to, a mobile phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a personal computer, or a wearable electronic device. The embodiments of the present application do not limit the specific device type of the terminal.

In the embodiments of the present application, the terminal 100 is provided with an audio collector, such as a microphone, through which voice can be collected. The embodiments of the present application do not limit the specific type of the audio collector.

The server 200 may be a traditional server or a cloud server, and may be a single server, a server cluster composed of several servers, or a cloud computing service center.

In some possible implementations, the device for processing the input voice may be deployed on the server 200. After the terminal 100 acquires the input voice, it may send the input voice to the server 200; the server 200 processes the input voice and returns the processing result to the terminal 100, so that the terminal 100 can perform subsequent operations according to the processing result.

The device for processing the input voice may be a voice matching device.

In some possible embodiments, the device for processing the input voice may further include a voiceprint recognition device to perform voiceprint recognition on the input voice.

As one implementation, the voice matching device may be deployed on the server 200 and the voiceprint recognition device on the terminal 100. The server 200 then returns the voice matching result to the terminal 100, which determines, based on that result, whether voiceprint recognition is required, and performs voiceprint recognition and subsequent operations when needed.

As another implementation, the positions of the voice matching device and the voiceprint recognition device may be swapped: the voice matching device may be deployed on the terminal 100 and the voiceprint recognition device on the server 200. After the terminal 100 performs voice matching with the voice matching device, if the matching passes, it may send the input voice to the server 200, instructing the server 200 to perform voiceprint recognition on the input voice based on the voiceprint recognition device and return the voiceprint recognition result to the terminal 100, so that the terminal 100 can determine whether to wake up based on that result.

As yet another implementation, both the voice matching device and the voiceprint recognition device may be deployed on the server 200, which returns the voiceprint recognition result to the terminal 100 so that the terminal 100 can determine whether to wake up based on that result.

In other possible implementations, the device for processing the input voice may also be deployed on the terminal 100, so that the terminal 100 can process the input voice and obtain the processing result without establishing communication with the server 200. In this case, the voice wake-up system 10 may include only the terminal 100.

The voice wake-up method and device, electronic device, and storage medium provided by the embodiments of the present application are described in detail below through specific embodiments.

Please refer to FIG. 2, which shows a schematic flowchart of a voice wake-up method provided by an embodiment of the present application, applicable to the above terminal. The flow shown in FIG. 2 is elaborated below. The voice wake-up method may include:

Step S110: Acquire the input voice collected by the audio collector.

The terminal may be provided with an audio collector or connected to an external audio collector; the connection may be wireless or wired, which is not limited here. In some implementations, for a wireless connection, the terminal may be provided with a wireless communication module, such as a Wireless Fidelity (WiFi) module or a Bluetooth module, and may acquire the input voice collected by the audio collector through the wireless communication module.

In some embodiments, the terminal may pick up sound through an audio collector such as a microphone and acquire the input voice it collects. Since the power consumption of sound pickup by the audio collector is low, the audio collector can remain on at all times. Moreover, in some implementations, the audio collector may periodically buffer the collected audio and send it to the processor for processing.

Step S120: Match the input voice based on the first voice matching model to obtain a first probability output.

The first probability output indicates the probability that the input voice contains the specified text.

The first voice matching model may be trained on first training data comprising multiple positive sample voices and multiple negative sample voices, where a positive sample voice contains the specified text and a negative sample voice does not. Matching the input voice with the first voice matching model therefore checks whether the input voice contains the specified text, yielding a first probability output that indicates the probability that it does.

The specified text may be preset by the program or user-defined, which is not limited here. For example, the specified text may be "Xiaoou" or "Xiaoou Xiaoou". In one example, a positive sample voice may be the voice corresponding to "Xiaoou Xiaoou, how is the weather today", and a negative sample voice the voice corresponding to "How is the weather". In some embodiments, the specified text may also be called a wake-up word, which is not limited in the embodiments of the present application.

In some embodiments, the user may preset the specified text on the terminal, for example by entering it on the terminal's wake-up word setting page, either by inputting the specified voice corresponding to the specified text or by merely inputting its text content.

As one implementation, the user may input the specified voice corresponding to the specified text, so that the terminal acquires the specified voice through the wake-up word setting page to train the voice wake-up algorithm.

In a specific example, the user may enter the wake-up word setting page through a series of operations (e.g., Settings - Security - Smart Unlock - Set Digital Password - Wake-up Word Setting) to set the wake-up word. The terminal displays the wake-up word setting page and prompts the user to record the wake-up word; the user says it, for example "Xiaobu Xiaobu", and the terminal acquires the corresponding voice data as training data for the voice wake-up algorithm.

In addition, in some embodiments, to improve the recognition accuracy of the voice wake-up algorithm, the terminal may prompt the user to record the wake-up word several times, feed the repeatedly recorded voice data into the voice wake-up algorithm as training data, and prompt the user when training is complete. After training, the voice wake-up algorithm can be used to detect whether the input voice contains the specified text.

In some embodiments, the first voice matching model may be constructed from a neural network, which may be, but is not limited to, a convolutional neural network (CNN), a recurrent neural network (RNN), or the like; this embodiment does not limit it.

In some implementations, when matching the input voice based on the first voice matching model, the input voice may first be preprocessed into multiple frames of voice segments, which are then matched based on the first voice matching model.

In one example, the input voice may be divided into frames of a preset length to obtain multiple voice segments, each no longer than the preset length. The preset length may be determined as needed or user-defined, for example 0.5 s, in which case preprocessing yields voice segments of at most 0.5 s per frame. Feeding the preprocessed voice segments into the first voice matching model in sequence yields multiple probability outputs, with the current voice segment corresponding to the first probability output, as sketched below.
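A minimal framing sketch (not the patented implementation), assuming a mono waveform and a 0.5 s frame length; all names are illustrative:

```python
import numpy as np

def frame_audio(samples: np.ndarray, sample_rate: int, frame_sec: float = 0.5):
    """Split a 1-D waveform into consecutive segments of at most frame_sec."""
    frame_len = int(sample_rate * frame_sec)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

# Usage: 3 s of 16 kHz audio yields six 0.5 s segments fed to the model in turn.
segments = frame_audio(np.zeros(48000), sample_rate=16000)
```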

Step S130: Acquire at least one probability output produced by the first voice matching model before the current first probability output as a second probability output.

In some implementations, the terminal may store each first probability output produced by the first voice matching model. After obtaining the current first probability output, the probability outputs of the previous M inferences can be acquired as the second probability output, whose size is M*C. Here M is greater than or equal to 1; its specific value can be determined according to actual needs, which is not limited in this embodiment. In the foregoing example, the second probability output may be the probability outputs corresponding to at least one voice segment preceding the current one. A sketch of such a history buffer follows.
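A minimal sketch, assuming a fixed-capacity buffer holding the model's last M probability outputs (stacked into an M*C matrix for C output classes); all names are illustrative:

```python
from collections import deque
import numpy as np

class OutputHistory:
    def __init__(self, m: int):
        self.buf = deque(maxlen=m)  # keeps only the most recent M outputs

    def push(self, prob_output: np.ndarray):
        self.buf.append(prob_output)

    def second_probability_output(self) -> np.ndarray:
        # Stack the stored outputs into an M*C matrix (fewer rows at startup).
        return np.stack(self.buf) if self.buf else np.empty((0, 0))
```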

Step S140: Fuse the first probability output with the second probability output to obtain an updated first probability output.

The length of the input voice segment corresponding to each probability output is generally smaller than the keyword length. For example, the duration of the voice corresponding to the user speaking the specified text, i.e., the keyword length, may be between 1 s and 2 s, while the voice actually fed into the first voice matching model is a segment of the input voice framed at a preset length smaller than the keyword length, e.g., about 0.5 s. Recognition of one input voice is therefore split into multiple voice segments fed into the first voice matching model, yielding multi-frame probability outputs: the current one is the first probability output, and the historical ones are recorded as the second probability output. After obtaining both, fusing the first probability output with the second probability output allows a fused judgment over the multi-frame results of passing the input voice through the first voice matching model, yielding the updated first probability output. The current output (the first probability output currently obtained) thus takes the historical output (the second probability output) into account, and updating the first probability output by combining the two makes keyword recognition in continuous speech tractable.

It should be noted that this embodiment does not limit the specific values of the preset length and the keyword length; it suffices that the preset length is smaller than the keyword length.

In some embodiments, the ways of fusing the first probability output with the second probability output include, but are not limited to, taking the maximum, minimum, or average of the two; averaging may also be weighted, among other variants, which is not limited in this embodiment. These options are sketched below.
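A minimal sketch of the fusion options named above (max, min, plain or weighted mean); the stacking convention and weights are assumptions, not the patent's specification:

```python
import numpy as np

def fuse(first: np.ndarray, second: np.ndarray, mode: str = "mean",
         weights: np.ndarray | None = None) -> np.ndarray:
    # Rows: historical outputs (second, shape M*C), then the current output (first, shape C).
    stacked = np.vstack([second, first[None, :]])
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "min":
        return stacked.min(axis=0)
    if mode == "weighted" and weights is not None:
        return np.average(stacked, axis=0, weights=weights)
    return stacked.mean(axis=0)
```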

In other embodiments, feature extraction may instead be performed on the first probability output and the result compared with a preset value: if the feature extraction result exceeds the preset value, the first probability output is used as the updated first probability output; otherwise, the updated first probability output is determined from the second probability output. The preset value may be determined according to actual needs, for example trained on the first training data, or user-defined. This effectively mitigates keyword detection jumps and reduces the false wake-up rate. Specific implementations are given in later embodiments and not repeated here.

Step S150: Obtain, according to the updated first probability output, the first matching result of the first voice matching model matching the input voice.

The updated first probability output obtained through the preceding steps is a scalar. In some implementations, if it is greater than a preset output threshold, the input voice is judged to contain the specified text and the corresponding first matching result is obtained, i.e., the first matching result indicates that the input voice contains the specified text; if it is less than or equal to the preset output threshold, the input voice is judged not to contain the specified text and the corresponding first matching result is obtained, i.e., the first matching result indicates that the input voice does not contain the specified text, in which case the first voice matching model can wait for subsequent input voice to start a new check. A minimal decision sketch follows.
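A minimal sketch of this threshold decision; the threshold value is an illustrative assumption, not one prescribed by the patent:

```python
def first_matching_result(updated_prob: float, threshold: float = 0.8) -> bool:
    """True if the input voice is judged to contain the specified text."""
    return updated_prob > threshold
```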

The preset output threshold may be determined according to actual needs, preset by the program, or user-defined, which is not limited here.

Step S160: If the first matching result indicates that the input voice contains the specified text, wake up the terminal.

In some implementations, if the first matching result indicates that the input voice contains the specified text, the terminal may be woken up to perform a preset operation. As one implementation, the terminal may pre-store a mapping table between the terminal's current state and preset operations, where the current state includes, but is not limited to, the screen state (screen off or not, screen locked or not), the currently running application, the current time, and so on. If the first matching result indicates that the input voice contains the specified text, the terminal's current state can be acquired to determine the corresponding preset operation, and the terminal is woken up to perform it. The preset operations may include, but are not limited to, lighting the screen, unlocking, and activating the voice assistant, which is not limited in this embodiment. A sketch of such a mapping table follows.
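A minimal sketch, assuming the state-to-operation mapping is a simple lookup table; the states and operations listed are illustrative examples only:

```python
WAKE_ACTIONS = {
    ("screen_off", "locked"): "light_screen_and_show_unlock",
    ("screen_off", "unlocked"): "light_screen",
    ("screen_on", "unlocked"): "activate_voice_assistant",
}

def preset_operation(screen_state: str, lock_state: str) -> str | None:
    # Returns None when no preset operation is registered for this state.
    return WAKE_ACTIONS.get((screen_state, lock_state))
```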

In some implementations, if voiceprint recognition passes verification, the terminal may be woken up and switched from the screen-off state to a non-screen-off state, which may include a to-be-unlocked state in which the screen is lit and the unlock interface is displayed, and an unlocked state in which the screen is lit and no unlock interface is displayed.

In other implementations, if voiceprint recognition passes verification, the terminal may be woken up and an unlocking operation performed, so that the user can unlock the terminal directly by voice; on this basis the method enables accurate, secure, and convenient unlocking. For example, after the user says "Xiaoou Xiaoou" to the terminal and voiceprint recognition passes verification, the terminal screen can light up and display the unlocked interface, which may be the interface shown before the last lock or the desktop, which is not limited here.

It can be understood that the above is only an example; the method provided in this embodiment is not limited to the above scenarios, which are not exhaustively enumerated here for reasons of space.

Current related technologies mainly realize voice wake-up by recognizing isolated words, i.e., each audio segment contains only one wake-up word, such as "Xiaobu Xiaobu" or "Xiaoou Xiaoou", so the input voice fed into the algorithm must be cut precisely. As a result, related technologies recognize keywords in continuous speech poorly. For "Xiaoou Xiaoou, how is the weather today", for instance, it is difficult for related technologies to accurately recognize the keyword "Xiaoou Xiaoou" and thus to separate "Xiaoou Xiaoou" from "how is the weather today", so wake-up cannot be achieved by recognizing "Xiaoou Xiaoou", followed after wake-up by natural language processing of "how is the weather today" to trigger the corresponding operation.

With the voice wake-up method provided in this embodiment, by contrast, the current first probability output is obtained from the first voice matching model, and at least one probability output produced before it is acquired as the second probability output. Since the length of the input voice segment corresponding to each probability output is generally smaller than the keyword length (for example, the keyword may last 1 s to 2 s, while the voice actually fed into the first voice matching model is framed at a preset length smaller than the keyword length, e.g., about 0.5 s), recognition of one input voice is split into multiple frames fed into the first voice matching model, yielding multi-frame results: the current result is the first probability output and the historical results are the second probability output. Fusing the first probability output with the historically output second probability output allows a fused judgment over the multi-frame results obtained by the first voice matching model from the input voice, yielding the first matching result that finally indicates whether the input voice contains the specified text. Keyword detection jumps and false wake-ups can thus be effectively suppressed, the problem of keyword recognition in continuous speech is solved, keyword recognition accuracy improves, and the false wake-up rate drops.

Please refer to FIG. 3, which shows a schematic flowchart of a voice wake-up method provided by another embodiment of the present application, applicable to the above terminal. The voice wake-up method may include:

Step S210: Acquire the input voice collected by the audio collector.

Step S220: Extract the acoustic features of the input voice and perform convolution operations on them through the first voice matching model to obtain the convolutional neural network output.

The first voice matching model may be constructed based on a convolutional neural network. If the terminal is in the screen-off state, the terminal can extract the acoustic features of the input voice and perform convolution operations on them through the first voice matching model to obtain the convolutional neural network output.

In one implementation, the first voice matching model may first perform feature extraction on the input voice, carrying out feature generation and dimensionality reduction to obtain its acoustic features, which may be Mel Frequency Cepstrum Coefficient (MFCC) features.

Please refer to FIG. 4, which shows a schematic diagram of the MFCC feature extraction process involved in an exemplary embodiment of the present application. As shown in FIG. 4, the input voice passes through the preprocessing module 401, the windowing module 402, the Fourier transform module 403, and the MFCC extraction module 404 in sequence, yielding the MFCC features of the input voice as its acoustic features.

The preprocessing module 401 may be a high-pass filter; optionally, its expression may be H(z) = 1 - a*z^(-1), where H(z) denotes the filtered voice data of the input voice and a is a correction coefficient, generally taking a value of 0.95-0.97.

Further, the windowing module 402 can smooth the filtered voice data, smoothing the edges of each frame signal. Optionally, the windowing module 402 may use a Hamming window function for smoothing, whose expression may be

w(n) = 0.54 - 0.46 * cos(2πn / M), n = 0, 1, 2, 3, ..., M,

where n is an integer and M is the number of Fourier transform points; optionally, M may be 512.

Further, the Fourier transform module 403 yields the spectrum of the smoothed voice data, and the MFCC extraction module 404 then applies Mel filtering to convert this spectrum into a Mel spectrum matching human auditory perception. Optionally, the function used for Mel filtering may be:

Fmel(f) = 2595 * log10(1 + f / 700),

where Fmel(f) denotes the extracted Mel spectrum and f is a frequency point after the Fourier transform.

Optionally, after the Mel spectrum is obtained through the above processing, the logarithm of the resulting Fmel(f) can be taken and then a Discrete Cosine Transform (DCT) applied; the final DCT coefficients serve as the extracted MFCC features. The MFCC features of the input voice can thus be extracted as its acoustic features. The whole chain is sketched below.
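A minimal sketch of the pipeline described above (pre-emphasis, Hamming window, FFT, Mel filtering, log, DCT), assuming a precomputed Mel filter bank; the parameter values are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame: np.ndarray, mel_filters: np.ndarray,
               a: float = 0.97, n_fft: int = 512, n_coeffs: int = 13):
    # Pre-emphasis: H(z) = 1 - a*z^(-1).
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])
    # Hamming window, then power spectrum via FFT.
    windowed = emphasized * np.hamming(len(emphasized))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    # Mel filtering (mel_filters: (n_mels, n_fft//2 + 1) filter bank), log, DCT.
    mel_energies = mel_filters @ spectrum
    return dct(np.log(mel_energies + 1e-10), norm="ortho")[:n_coeffs]
```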

It should be noted that the above parameters are only an example; other parameters may be chosen in other examples, which is not limited in this embodiment.

After the acoustic features of the input voice are extracted, convolution operations can be performed on them through the first voice matching model to obtain the convolutional neural network output. As one example, in the first voice matching model, a convolutional neural network may be connected after the MFCC extraction module 404 to perform convolution operations on the acoustic features it outputs, obtaining the convolutional neural network output.

In some implementations, the first voice matching model may include n groups of sequentially connected convolutional layers, batch normalization (BN) layers, and linear activation layers. Taking one group of sequentially connected convolutional layer, batch normalization layer, and linear activation layer as one convolution block, the specific implementation of extracting the acoustic features of the input voice and performing convolution operations on them through the first voice matching model to obtain the convolutional neural network output may include: extracting the acoustic features of the input voice as the input of the first convolution block; processing the output of the preceding group through the n convolution blocks in sequence; and taking the output of the n-th convolution block as the convolutional neural network output, where the input of the i-th convolution block is obtained by fusing the output of the (i-1)-th convolution block with the input of the (n-i+1)-th convolution block, with n ∈ N, n ≥ 2, and i = {1, 2, ..., n}.

In a specific example, the structure of the above convolutional neural network in the first voice matching model may be as shown in FIG. 5. The convolutional neural network includes n sequentially connected convolution blocks 500, each comprising a convolutional layer 501, a batch normalization layer 502, and a linear activation layer 503 connected in sequence. The acoustic features extracted by the above method serve as the input of the first convolution block, and the convolutional neural network output is obtained through the n convolution blocks 500.

The convolutional layer 501 is a neural network layer whose main computation is convolution. Optionally, the data size of the acoustic features is C*R*1, where C is the number of feature columns, R the number of feature rows, and the number of channels is 1. The extracted acoustic features are fed into the convolutional layers 501 in sequence to compute local features; optionally, the calculation formula is as follows:

O = Σ (I * W) + bias    Formula (1)

In Formula (1), I denotes the input, W the weights of the convolution kernel, and bias the bias term; the result computed by the convolutional layer is a 3-dimensional feature of size c*r*l.

The batch normalization layer 502 is a network layer that performs effective adaptive normalization on each layer's output; optionally, its calculation formulas are as follows:

μ(k) = E[x(k)]    Formula (2)

σ(k) = sqrt(Var[x(k)])    Formula (3)

x̂(k) = (x(k) - μ(k)) / sqrt((σ(k))² + ε)    Formula (4)

y(k) = γ(k) * x̂(k) + β(k)    Formula (5)

In Formulas (2)-(5), x is the output of the previous layer, β and γ are adaptive parameters, and k indexes the batch. Formula (2) computes the mean of the previous layer's output, Formula (3) computes its standard deviation, Formula (4) performs the normalization, and Formula (5) reconstructs the normalized data x̂(k) into y(k). Feeding x into the batch normalization layer 502 and computing its variance and mean yields the adaptive parameters γ and β, which are applied during model inference to realize batch normalization.

The linear activation layer 503 can be used to linearly boost the output features; optionally, its calculation formula is as follows:

y = f(x), f(x) = max(λ*x, 0)    Formula (6)

The output keeps only the positive part of the features y; a positive feature x is multiplied by the factor λ as a means of linear enhancement.

The U-shaped residual structure module 504 is a layered structure in which the features of the individual layers are separated and merged: the first group (i=1) is feature-fused with the final group (i=n), the second group (i=2) with the penultimate group (i=n-1), and so on, so that the input of the i-th convolution block is obtained by fusing the output of the (i-1)-th convolution block with the input of the (n-i+1)-th convolution block, where n ∈ N, n ≥ 2, and i = {1, 2, ..., n}. Based on the U-shaped residual structure module 504, the entire feature information flow is retained and computed, and low-level and high-level features from the inference process are fused at multiple scales, allowing the network to be designed deeper while preserving the expressive power of the output features, which further improves keyword detection accuracy. In a specific example, this ultimately improves the result by 3%. A sketch of such a U-shaped residual stack follows.
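A minimal sketch of the U-shaped skip rule described above, assuming that "fusion" is element-wise addition, that all blocks share one channel width so fused tensors match in shape, and that fusion is applied only once the mirrored block's input already exists; this illustrates the structure, not the patented network:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    def __init__(self, ch: int):
        super().__init__(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),  # convolutional layer 501
            nn.BatchNorm2d(ch),                           # batch normalization layer 502
            nn.ReLU(),                                    # linear activation layer 503
        )

class UResidualCNN(nn.Module):
    def __init__(self, ch: int = 16, n: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(ConvBlock(ch) for _ in range(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = len(self.blocks)
        inputs, out = [], x
        for i in range(1, n + 1):
            cur = out                      # output of block i-1 (or x for i=1)
            j = n - i + 1                  # mirrored block index
            if j < i:                      # its input exists: apply the U-shaped fusion
                cur = cur + inputs[j - 1]  # fuse with the input of block n-i+1
            inputs.append(cur)
            out = self.blocks[i - 1](cur)
        return out
```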

Continuing with FIG. 5, the convolutional neural network deepens the model's longitudinal dimension by repeatedly applying the convolutional layer 501, batch normalization layer 502, linear activation layer 503, and U-shaped residual structure 504. Model features can thus be abstracted and extracted multiple times for effective classification, improving keyword detection accuracy while steadily reducing the dimensionality of the model output and avoiding the problem of an overly deep network being hard to train. After multiple stackings, the model yields the final output of the convolutional neural network based on the U-shaped residual structure, i.e., the convolutional neural network output.

At present, reducing the false wake-up rate requires making the wake-up algorithm more accurate and complex and running it continuously on the terminal processor, which imposes a considerable power burden on the terminal. This may matter little for plugged-in terminals (e.g., smart speakers), but for battery-powered terminals (e.g., mobile phones and tablet computers) it accelerates battery drain and shortens standby time. By continuously deepening the model's longitudinal dimension through the above operations, abstracting and extracting model features multiple times for effective classification, and continuously reducing the model's output dimensionality through the U-shaped residual structure 504 to control the model size, this embodiment improves the keyword recognition accuracy of the first voice matching model while allowing it to run on a low-power module with lower computing performance but lower power consumption. This reduces the false wake-up rate and, when the model is deployed on a low-power module, lowers the power the terminal needs for voice wake-up, helping it respond to the user's wake-up in all-day scenarios.

Step S230: Match the convolutional neural network output with the acoustic features corresponding to the specified text to obtain the first probability output.

In some embodiments, the convolutional neural network output can be matched with the acoustic features corresponding to the specified text, judging from the convolutional neural network output whether the input voice contains the specified text. For example, the convolutional neural network output can be fed into a classifier, such as a Softmax classifier, to obtain the probability that the input voice contains the specified text as the first probability output. Other classifiers may also be used in other examples, which is not limited here. A minimal classifier-head sketch follows.
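A minimal sketch, assuming a linear layer plus Softmax over two classes ("contains the specified text" / "does not"); the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64, 2),      # 64 = assumed flattened size of the CNN output
    nn.Softmax(dim=-1),
)

cnn_output = torch.randn(1, 64)                          # stand-in CNN output
first_probability_output = classifier(cnn_output)[0, 1]  # P(specified text)
```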

In other embodiments, step S230 may include steps S231 to S233, introducing an attention mechanism on top of the convolutional neural network to optimize it for attention, so as to handle the loss of model accuracy on overly long sequences and improve keyword-detection accuracy. Specifically, please refer to FIG. 6, which shows a schematic flowchart of step S230 in FIG. 3 in an exemplary embodiment of the present application; step S230 includes:

Step S231: Extract attention weights from the convolutional neural network output channel by channel to obtain the attention weight vector corresponding to that output.

In some embodiments, the channel-wise attention-weight extraction may be implemented by passing the convolutional neural network output through a pooling layer, a convolutional layer, a fully connected (FC) layer, and a nonlinear activation layer in sequence, thereby obtaining the attention weight vector corresponding to the convolutional neural network output.

In an exemplary embodiment, the attention-weight extraction flow may be as shown in FIG. 7. To describe that flow in detail, please also refer to FIG. 8, which shows a schematic flowchart of step S231 in FIG. 6 provided by an exemplary embodiment of the present application. In this embodiment, step S231 may include:

Step S2311: Through the pooling layer, sort the feature values of each channel of the convolutional neural network output from largest to smallest, and take the top few values of each channel as that channel's pooled feature values, yielding the pooled features.

The pooling layer reduces the dimensionality of the extracted features: on the one hand it makes the features smaller, simplifying the network's computational complexity and, to some extent, avoiding overfitting; on the other hand it retains the salient features.

In this embodiment, the pooling layer may adopt Top-N pooling, i.e. an operation that pools out the N largest values of a vector, applied to the pooling layer's input features. Specifically, as shown in FIG. 7, if the data size of the convolutional neural network output is C*H*W, i.e. the pooling layer's input is a feature of size C*H*W, where C is the number of input channels, H the input height, and W the input width, then Top-N pooling yields a feature of size C*N*1, i.e. the pooled feature.

In one implementation, the Top-N pooling flow can be seen in FIG. 9: the pooling layer's input is C*H*W; for the feature of each channel c (c∈C), of size H*W, the feature values are sorted from largest to smallest and the top N are taken as that channel's admitted pooling values; performing this for every channel in turn yields an output of size C*N*1, i.e. the pooled feature.
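As a reference, a minimal sketch of the Top-N pooling operation just described, assuming a PyTorch tensor of size C*H*W (the function name is illustrative):

```python
import torch

def top_n_pooling(x: torch.Tensor, n: int) -> torch.Tensor:
    """Sort each channel's H*W values from largest to smallest and keep
    the top N, turning a C*H*W feature into a C*N*1 pooled feature."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)            # one row of H*W values per channel
    top_n, _ = flat.topk(n, dim=1)        # topk returns values sorted descending
    return top_n.unsqueeze(-1)            # C x N x 1

pooled = top_n_pooling(torch.randn(32, 10, 8), n=4)
print(pooled.shape)                       # torch.Size([32, 4, 1])
```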

In addition, in some other embodiments the pooling layer may adopt max pooling (Maxpooling), mean pooling, etc., which is not limited here. Taking max pooling as an example, each channel of the convolutional neural network output may be divided into several regions, the maximum of each region taken as that region's output, finally giving an output composed of the per-region maxima.

Step S2312: Perform feature extraction on the pooled features through the convolutional layer to obtain a one-dimensional vector.

The pooled features obtained through the pooling layer are fed into the convolutional layer for feature extraction; the computation yields a one-dimensional vector of size (C/N)*1*1.

Step S2313: Pass the one-dimensional vector through the fully connected layer and the nonlinear activation layer in turn to obtain the attention weight vector corresponding to the convolutional neural network output.

The fully connected layer is a neural network layer that computes with weights; the one-dimensional vector obtained through the pooling and convolutional layers is fed into the fully connected layer to compute local features. Optionally, its calculation formula may be:

$y = W \cdot I + \mathrm{bias}$    Formula (7)

In formula (7) above, I denotes the input, W the weight corresponding to the convolution, and bias the bias term; the fully connected computation yields a feature of size C*1*1.

In some implementations, as shown in FIG. 7, two fully connected layers may be connected in sequence after the convolutional layer. In other possible implementations, one, or more than two, fully connected layers may be connected, which is not limited here.

The nonlinear activation layer may be used to nonlinearly enhance the output features. In one implementation, the nonlinear activation layer may be realized with a nonlinear activation function, such as the Sigmoid function or the Tanh function. Optionally, taking the Sigmoid function as an example, its calculation formula may be:

$y = \mathrm{sigmoid}(x)$    Formula (8)

Thus, after activation by the above nonlinear activation layer, a one-dimensional vector of size C*1*1 is obtained and serves as the attention weight vector corresponding to the convolutional neural network output.
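Putting steps S2312-S2313 together, the sketch below carries the C*N*1 pooled feature through a convolution to (C/N)*1*1, two fully connected layers (per formula (7)) and a Sigmoid activation (per formula (8)) to the C*1*1 attention weight vector. The specific layer constructors and the choice of two FC layers of these widths are assumptions; the embodiment fixes only the data sizes.

```python
import torch
import torch.nn as nn

class AttentionWeightHead(nn.Module):
    def __init__(self, channels: int = 32, n: int = 4):
        super().__init__()
        # a length-N convolution maps the pooled C*N*1 feature to (C/N)*1*1
        self.conv = nn.Conv1d(channels, channels // n, kernel_size=n)
        self.fc1 = nn.Linear(channels // n, channels)   # y = W*I + bias, formula (7)
        self.fc2 = nn.Linear(channels, channels)        # second FC layer (FIG. 7)
        self.act = nn.Sigmoid()                         # formula (8)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        z = self.conv(pooled.squeeze(-1).unsqueeze(0))  # 1 x C/N x 1
        z = self.fc2(self.fc1(z.flatten(1)))            # 1 x C
        return self.act(z).reshape(-1, 1, 1)            # C x 1 x 1 weight vector
```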

Step S232: Weight the convolutional neural network output according to the attention weight vector to obtain the attention output features.

As shown in FIG. 7, after the attention weight vector corresponding to the convolutional neural network output is obtained, attention scaling may be performed between it and the input features before weight extraction (i.e. the convolutional neural network output): the convolutional neural network output is weighted according to the attention weight vector to obtain the attention output features.

In an exemplary embodiment, please refer to FIG. 10, which shows a schematic diagram of the attention scaling process provided by an exemplary embodiment of the present application. As shown in FIG. 10, abstracting the algorithm of step S232 into a module, the attention scaling module, its inputs are the convolutional neural network output (of size C*H*W) and the attention weight vector (of size C*1*1).

In one implementation, the attention weight vector may be updated through a metric structure to obtain the attention update weights, where the update may be based on a predetermined formula. The predetermined formula may be at least one of the following formulas (9)-(13), though it is of course not limited to them:

$a_t = g_{BO}(h_t) = b_t$    Formula (9)

$a_t = g_L(h_t) = w_t^T h_t + b_t$    Formula (10)

$a_t = g_{SL}(h_t) = w^T h_t + b$    Formula (11)

$a_t = g_{NL}(h_t) = V_t^T \tanh(w_t^T h_t + b_t)$    Formula (12)

$a_t = g_{SNL}(h_t) = V^T \tanh(w^T h_t + b)$    Formula (13)

Each of the above formulas (9)-(13) can reach convergence through end-to-end training, and each has its own advantages for models with different feature distributions. Here $h_t$ denotes the attention weight vector and $a_t$ the attention update weight, with $t \in (1, T)$; the attention update weight $a_t$ corresponding to the attention weight vector $h_t$ is obtained through the formulas above. Optionally, summing and averaging the attention update weights $a_t$ element by element yields a feature of size C*1*1; feature mapping then yields the attention scaling weight $a_t'$, and normalizing $a_t'$ yields the vector $p_t$. Accumulating the convolutional neural network output $j_t$ with the vector $p_t$ channel by channel gives the attention output feature, whose size is C*H*W. In this way, the above convolutional neural network is attention-optimized on the basis of the attention mechanism, so that the optimized network output fuses low-dimensional and high-dimensional features and the first voice matching model generalizes better across a variety of scenarios.

Step S233: Match the attention output features with the acoustic feature corresponding to the specified text to obtain the first probability output.

In some embodiments, the attention output features may be matched with the acoustic feature corresponding to the specified text by feature-mapping the attention output features to the output category corresponding to the specified text, obtaining the first probability output. For example, if the specified text is "Xiaoou", the output category may be "Xiaoou"; by feature-mapping the attention output features to the output category, the probability that the attention output features match the specified text is obtained, i.e. the first probability output.

The first probability output reflects how well the attention output features match the specified text; generally, the higher the probability, the better the match.

As one implementation, the attention output features may first be dimension-reduced by global pooling, i.e. the C*H*W attention output features are pooled over height and width. Optionally, global max pooling may specifically be adopted, for example based on a formula of the form $\hat{y}_c = \max_{i} x_{c,i}$, where $i \in H \ast W$ and $\beta_i$ is the size of the pooling window; pooling each window to obtain that window's maximum yields the per-channel output features, of size C*1*1.

Further, in order to feature-map the output features to the output categories, the output features may be globally normalized; optionally, this may be realized by the following formula:

$k_t = \dfrac{e^{x_t}}{\sum_j e^{x_j}}$

Thereby, the vector $k_t$ is obtained as the probability estimate for the corresponding output category, i.e. the first probability output, where $k_t \in [0, 1]$.

In some implementations, if $k_t$ is greater than the preset result threshold, it may be determined that the input speech contains the specified text; if $k_t$ is less than or equal to the preset result threshold, it may be determined that the input speech does not contain the specified text.
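A minimal sketch of this decision path, pooling the C*H*W attention output globally, normalizing the class scores to $k_t$, and comparing against the preset result threshold; the `classifier` mapping (here a plain linear layer) is an assumed component:

```python
import torch

def keyword_decision(attn_out: torch.Tensor, classifier, thre: float):
    c = attn_out.shape[0]
    pooled = attn_out.reshape(c, -1).max(dim=1).values   # global max pooling -> C
    k = torch.softmax(classifier(pooled), dim=0)         # k_t in [0, 1] per class
    return k, bool(k.max() > thre)                       # contains specified text?

# e.g. keyword_decision(torch.randn(32, 10, 8), torch.nn.Linear(32, 2), thre=0.5)
```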

The preset result threshold may be set according to actual needs. In one implementation, the equal error rate (EER) on the training data set may be computed, i.e. the point where the false acceptance rate (FAR) equals the false rejection rate (FRR), and the threshold at which the EER is smallest taken as the preset result threshold. This balances the false wake-up rate and the false rejection rate of the first voice matching model, trading off recall and precision so that the model can capture voice wake-up words in all-day scenarios: with high recall and comparatively low precision it effectively recognizes as many potential user wake-up scenarios as possible, keeping the miss rate low.
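A sketch of the threshold selection just described, sweeping candidate thresholds over a scored data set and picking the point where FAR and FRR are closest (the equal-error-rate point); the score and label conventions are assumptions:

```python
import numpy as np

def eer_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores in [0, 1]; labels 1 = contains keyword, 0 = does not."""
    best_thre, best_gap = 0.5, np.inf
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] > t)    # false acceptance rate
        frr = np.mean(scores[labels == 1] <= t)   # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_thre = abs(far - frr), float(t)
    return best_thre
```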

In other implementations, the preset result threshold may also be computed in other ways, or be user-defined, which is not limited in this embodiment.

Step S240: Obtain at least one probability output produced by the first voice matching model before the current first probability output, as the second probability output.

Step S250: Fuse the first probability output with the second probability output to obtain the updated first probability output.

In some embodiments, step S250 may include steps S251 to S254. Specifically, please refer to FIG. 11, which shows a schematic flowchart of step S250 in FIG. 3 provided by an exemplary embodiment of the present application. In this embodiment, step S250 may include:

Step S251: Perform feature extraction on the second probability output to obtain the first historical feature and the second historical feature.

In some implementations, a recurrent neural network may be used for each extraction, performing feature extraction on the second probability output to obtain the first historical feature and the second historical feature.

As one implementation, when the input sequence is long, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU) may specifically be used to extract features from the second probability output, which this embodiment does not limit.

In addition, in some embodiments, to better capture context and resolve ambiguity, a Bidirectional Recurrent Neural Network (Bi-RNN) may specifically be used to extract features from the second probability output; it can learn context dependencies in both directions, allowing more effective feature extraction and processing of sequence information. Specifically, the second probability output is fed into two bidirectional RNNs respectively, yielding the first historical feature and the second historical feature, where each bidirectional RNN comprises several nodes; this embodiment does not limit the number of nodes, which may be determined according to actual needs.

In some examples, the first voice matching model may include a first and a second bidirectional recurrent neural network layer; feature extraction on the second probability output then yields the first and second historical features. Specifically, the first historical feature may be obtained by performing feature extraction on the historical attention output through the first bidirectional recurrent neural network layer, and the second historical feature through the second bidirectional recurrent neural network layer. The repeating module in a bidirectional recurrent neural network layer may be a conventional RNN, an LSTM, or a GRU, which is not limited here.
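For reference, a minimal sketch of the two bidirectional layers (GRU cells here, one of the options named above); the input dimension and hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HistoryFeatureExtractor(nn.Module):
    def __init__(self, in_dim: int = 2, hidden: int = 16):
        super().__init__()
        self.rnn1 = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.rnn2 = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, second_prob: torch.Tensor):
        feat1, _ = self.rnn1(second_prob)   # first historical feature
        feat2, _ = self.rnn2(second_prob)   # second historical feature
        return feat1, feat2

# second_prob: (batch, T, in_dim) -- the T earlier probability outputs
```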

In some embodiments, the network parameters of the first and second bidirectional recurrent neural network layers differ and may be set according to actual needs.

In some other implementations, other neural networks may also be used to extract the first and second historical features from the second probability output, which this embodiment does not limit.

Step S252: Fuse the first probability output with the first historical feature to obtain the historical attention weight vector corresponding to the second probability output.

In some implementations, the first probability output has the same size as the first historical feature; the two may be multiplied point by point to obtain a one-dimensional vector feature of size C, which is then fed into a normalization layer for normalization. In one example, the normalization layer may use the Softmax function; optionally, the normalization formula may be as follows:

$h_t = \dfrac{e^{c_t}}{\sum_j e^{c_j}}$

Here $c_t$ denotes the t-th feature value of the one-dimensional vector feature and $h_t$ the normalized value of the t-th feature value. After the normalization layer normalizes the fused feature of the first probability output and the first historical feature, the vector $h_t$ is obtained as the historical attention weight vector corresponding to the second probability output.

Step S253: Weight the second historical feature according to the historical attention weight vector to obtain the history fusion output.

After the historical attention weight vector is obtained, the second historical feature may be weighted according to it to obtain the history fusion output. As one implementation, the historical attention weight vector may be multiplied point by point with the second historical feature, giving the history fusion output optimized on the basis of the attention mechanism.
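A sketch of steps S252-S253 combined, assuming the three tensors already have matching sizes as the text requires:

```python
import torch

def history_fusion(first_prob, first_hist, second_hist):
    c = first_prob * first_hist        # point-wise product, size-C vector
    h = torch.softmax(c, dim=-1)       # historical attention weight vector h_t
    return h * second_hist             # history fusion output ("memory")
```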

Step S254: Fuse the first probability output with the history fusion output to obtain the updated first probability output.

In some embodiments, step S254 may specifically include steps S2541 to S2543. Specifically, please refer to FIG. 12, which shows a schematic flowchart of step S254 in FIG. 11 provided by an exemplary embodiment of the present application. In this embodiment, step S254 may include:

Step S2541: Perform feature extraction on the first probability output to obtain the output coefficient corresponding to the first probability output.

In one implementation, the first probability output may be passed through a fully connected layer and a nonlinear activation layer for feature extraction, giving the output coefficient G corresponding to the first probability output.

Step S2542: If the output coefficient is greater than the preset result threshold, take the first probability output as the updated first probability output.

Step S2543: If the output coefficient is less than or equal to the preset result threshold, take the history fusion output as the updated first probability output.

The computation of the preset result threshold is as in the foregoing embodiments and is not repeated here. Denoting the preset result threshold by thre, the update formula for the output coefficient G may be as follows:

$G = \begin{cases} 1, & G > thre \\ 0, & G \le thre \end{cases}$    Formula (15)

At this point, denoting the history fusion output by memory and the first probability output by input, the updated first probability output result may be computed as follows:

$\mathrm{result} = G \cdot \mathrm{input} + (1 - G) \cdot \mathrm{memory}$    Formula (16)

Combining formulas (15) and (16) above: if the output coefficient G is greater than the preset result threshold thre, then G=1 and the updated first probability output result is the first probability output input; if G is less than or equal to thre, then G=0 and the updated first probability output result is the history fusion output memory.

The above steps thus realize the fusion of the historical memory result (the history) with the current output result (the first probability output): when the output coefficient exceeds the preset result threshold, the current output result is taken as the updated first probability output; when the output coefficient is less than or equal to the preset result threshold, the attention-optimized history fusion output of the historical memory result is taken as the updated first probability output.
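A sketch of step S254's gating, with `gate_head` (an FC layer, an assumption) plus a Sigmoid standing in for the feature extraction of step S2541, and formulas (15)-(16) applied verbatim:

```python
import torch

def fuse_with_history(first_prob, memory, gate_head, thre: float):
    g = torch.sigmoid(gate_head(first_prob))    # output coefficient from FC + activation
    g = (g > thre).float()                      # formula (15): binarize G against thre
    return g * first_prob + (1 - g) * memory    # formula (16): result
```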

Current related technology mainly realizes voice wake-up by recognizing isolated words, i.e. each audio segment contains only one wake-up word, such as "Xiaobu Xiaobu", so the input speech fed to the algorithm must be cut precisely; as a result, the related technology does not recognize keywords in continuous speech very well, e.g. "Xiaoou Xiaoou, how is the weather today". By fusing the historical memory results with the current output results as above, the first voice matching model can effectively suppress keyword-detection jitter and false wake-ups, solving the continuous-speech keyword recognition problem.

In an exemplary embodiment, the above process of fusing the historical memory result (i.e. the second probability output) with the current output result (i.e. the first probability output) can be seen in FIG. 13, which shows a schematic diagram of the history fusion process involved in an exemplary embodiment of the present application; the principles and data flow involved are as described above and are not repeated here.

In addition, in other exemplary embodiments, the output coefficient need not be updated according to formula (15) above.

Step S260: Take the updated first probability output as the first matching result of the first voice matching model matching the input speech.

Step S270: If the first matching result indicates that the input speech contains the specified text, wake up the terminal.

It should be noted that for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, which are not repeated here.

The voice wake-up method provided by this embodiment builds the first voice matching model on a convolutional neural network and, on that basis, introduces a U-shaped residual structure, abstracting and extracting the model features multiple times for effective classification, improving keyword-detection accuracy while continuously reducing the dimensionality of the model output. The model is further attention-optimized to handle the loss of accuracy on overly long sequences. In addition, fusing the historical memory results with the current output results effectively suppresses keyword-detection jitter and false wake-ups, solving the continuous keyword recognition problem and improving wake-up accuracy.

In some embodiments, to further reduce the false wake-up rate, the operation of obtaining the first matching result based on the first voice matching model may serve as a first-level check; when the first matching result indicates that the input speech contains the specified text, a second-level check is made based on the second voice matching model to obtain the second matching result, and the terminal is woken up only when the second matching result also indicates that the input speech contains the specified text. Specifically, please refer to FIG. 14, which shows a schematic flowchart of the voice wake-up method provided by yet another embodiment of the present application. In this embodiment, the method may include:

Step S310: Obtain the input speech collected by the audio collector.

Step S320: Match the input speech based on the first voice matching model to obtain the first probability output.

Step S330: Obtain at least one probability output produced by the first voice matching model before the current first probability output, as the second probability output.

Step S340: Fuse the first probability output with the second probability output to obtain the updated first probability output.

Step S350: Obtain, according to the updated first probability output, the first matching result of the first voice matching model matching the input speech.

Step S360: If the first matching result indicates that the input speech contains the specified text, match the input speech based on the second voice matching model to obtain the second matching result.

The second voice matching model may be trained on second training data, which may include multiple positive sample voices and multiple negative sample voices; the second training data may be the same as the first training data or different, which this embodiment does not limit. Since both models are trained on positive sample voices containing the specified text and negative sample voices not containing it, the second voice matching model, like the first, can match the input speech and judge whether it contains the specified text; when it does, the obtained second matching result indicates that the input speech contains the specified text.

In some embodiments, the second voice matching model may be built from a neural network, which may be, but is not limited to, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), etc.; this embodiment does not limit it.

The first and second voice matching models have different matching rules, where a matching rule is the algorithm for judging whether the input speech contains the specified text; different matching rules therefore mean the two models use different algorithms. As one implementation, the two models may both be built on the same kind of neural network but with different numbers of network layers, i.e. different depths: for example, both built on CNNs but with different numbers of convolutional layers. As another implementation, they may be built on completely different, or partially different, neural networks, which this embodiment does not limit. Thus, after the input speech passes the first check by the first voice matching model, it must also pass a second check by the second voice matching model with a different matching rule, so that input speech that finally passes voice matching and enters voiceprint recognition satisfies at least two matching rules, improving keyword-recognition accuracy and lowering the false wake-up rate.

In this embodiment, the complexity of the first voice matching model is lower than that of the second. By first performing the check with the less complex first voice matching model, and only on passing performing the check with the more complex second model, a simple algorithm serves as the first-level check and a complex algorithm as the second-level check. Since the simple algorithm consumes less power and fewer computing resources than the complex one, this further lowers the false wake-up rate while achieving the initial screening without excessive power or resource cost, improving wake-up accuracy while balancing power consumption and computing resources, greatly improving the user experience. Complexity here may refer to network complexity, number of network layers, and so on; for example, the second voice matching model may be deeper than the first, with more layers.
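The cascade reduces, in effect, to the short control flow sketched below; all names and thresholds are illustrative, not the embodiment's API:

```python
def wake_check(input_voice, model1, model2, thre1, thre2) -> bool:
    if model1(input_voice) <= thre1:     # first-level check (low-power chip) fails
        return False                     # second model is never run
    return model2(input_voice) > thre2   # second-level check decides the wake-up
```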

In some embodiments, so that the second voice matching model can achieve higher keyword-recognition accuracy, as one approach the convergence condition for judging that training of the second model is complete may be stricter than that of the first; in addition, the second training data may be larger in quantity than the first and cover more complex scenes.

In some implementations, the algorithms of both the first and second voice matching models may be stored and run locally on the terminal, so the terminal does not depend on the network environment and, with no communication latency to consider, runs both models directly on-device, which helps improve voice wake-up efficiency. In some other implementations, at least one model may also be stored on a server, which this embodiment does not limit.

Further, in some implementations, the terminal may include a first chip and a second chip, the first voice matching model running on the first chip and the second on the second, where the power consumption of the first chip is lower than that of the second. The less complex first voice matching model thus runs on the low-power chip, so that even continuous operation while the terminal's screen is off does not cause excessive power consumption, supporting long-term listening to the input speech for the first-level check and realizing low-power terminal voice wake-up.

In an exemplary embodiment, when the terminal collects no voice signal through the audio collector, the second chip may be in a sleep state. When the terminal collects a voice signal through the audio collector and obtains the corresponding input speech, if the input speech fails the first-level check of the first voice matching model, the second chip may remain asleep; if it passes, the first chip may send an interrupt signal to switch the second chip from the sleep state to the working state and transmit the voice data containing the specified text to the second chip. At this point the first chip may switch from the listening state to the sleep state, and the second chip runs the second voice matching model to perform the second-level check on the voice data containing the specified text and obtain the second matching result.

In another exemplary embodiment, the first chip may also remain in the working state at all times, with the audio collector continuously monitoring and collecting voice signals. If a voice signal is detected, the corresponding input speech is obtained and fed into the first voice matching model for the first-level check; if no voice signal is detected, or one is detected but fails the first-level check, the audio collector continues monitoring and collecting voice signals for the first-level check. If a voice signal is detected and its corresponding input speech passes the first-level check, the input speech containing the specified text may be transmitted to the second voice matching model, the audio collector may be controlled to stop collecting audio signals, and the second-level check is performed on the input speech based on the second voice matching model to obtain the second matching result.

In an exemplary implementation, the first chip may be a Digital Signal Processor (DSP) and the second chip a RISC microprocessor, such as an ARM (Advanced RISC Machine) chip.

In some examples, the ARM chip is commonly used in terminals and usually works as the main processor when the terminal is awake; it offers high computing performance and can run more complex algorithms, but in the working state it also consumes more power and occupies more memory. Keeping the ARM chip working while the terminal's screen is off could cause excessive power consumption and memory occupancy, yet more accurate wake-up requires a more complex algorithm. This embodiment therefore adds to the terminal a first chip with lower power consumption than the ARM chip, such as a DSP chip, and runs on it the first voice matching model, whose complexity is lower than the second's; it captures the user's wake-up word at low power, using high recall and precision lower than the second model's to effectively recognize potential user wake-up scenarios.

In some embodiments, the first voice matching model, based on the aforementioned U-shaped-residual convolutional neural network and the attention optimization of the network's output, can, within hardware limits, increase model depth and keyword-recognition accuracy as much as possible while keeping the model small enough to run on a low-power chip. The second voice matching model may be obtained with a larger, deeper network; as one implementation, it may be built with a larger, deeper sequence-based LSTM, making the second model more accurate than the first. In addition, the voiceprint recognition algorithm may also run on the second chip.

In one example, the audio collector may be integrated into the first chip, enabling low-power audio collection and facilitating continuous listening to the surrounding audio.

It should be noted that after the first voice matching model performs the first-level check on the input speech, either the entire input speech may be transmitted to the second voice matching model, or a speech segment containing only the specified text may be cut from the input speech and transmitted, sparing the second model from recognizing other speech segments that do not contain the specified text and improving keyword-recognition efficiency for the specified text.

If the terminal is not in the screen-off state, the input speech is matched based on the second voice matching model to obtain the second matching result. Thus, when the input speech is obtained, different voice wake-up schemes may be adopted depending on whether the terminal's screen is off; when the screen is not off, the input speech is matched directly against the second voice matching model to obtain the second matching result.

In some implementations, the second voice matching model is more complex than the first and recognizes keywords more accurately; therefore, when the terminal is not in the screen-off state, sending the input speech directly to the second chip's second voice matching model for checking improves recognition efficiency while maintaining high recognition accuracy.

In addition, in some embodiments, the second voice matching model runs on the higher-power second chip, e.g. an ARM chip; when the terminal's screen is not off, the second chip is usually already in the working state, so a single check can be made directly by the second voice matching model running on it, without going through the first model, improving recognition efficiency. Moreover, since the second model's achievable recognition accuracy is higher than the first's, checking directly on the second chip also avoids the power cost of running the first chip. In some examples, when the terminal is not in the screen-off state, the first chip may be controlled to sleep to reduce power consumption as much as possible.

Step S370: If the second matching result indicates that the input speech contains the specified text, wake up the terminal.

It should be noted that for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, which are not repeated here.

In some embodiments, to lower terminal power consumption while lowering the false wake-up rate and extend standby time, when the terminal's screen is off the first voice matching model may first perform the first-level check on the input speech, with the second voice matching model performing the second-level check after the first passes; when the terminal's screen is not off, the first-level check is skipped and the second voice matching model directly performs the second-level check on the input speech. Specifically, please refer to FIG. 15, which shows a schematic flowchart of the voice wake-up method provided by a further embodiment of the present application. In this embodiment, the method may include:

Step S410: Obtain the input speech collected by the audio collector.

Step S420: Detect whether the terminal is in the screen-off state.

The terminal obtains the input speech collected by the audio collector and, before processing it, may first detect whether the terminal is in the screen-off state; if so, step S430 may be executed. The screen-off state means the backlight is normally turned off and the screen is dark; when the terminal is in the screen-off state, it can be considered to be in standby. In some embodiments, the screen-off state may also be referred to by other names; this embodiment has defined the screen-off state and does not limit its specific naming.

As one implementation, the terminal may obtain its current screen state by calling a screen-state detection interface. The screen state includes the screen-off state and the non-screen-off state; power consumption in the screen-off state is lower than in the non-screen-off state.

In some examples, detection may be performed by calling a screen-state detection interface. For example, if the terminal runs the Android system, it may call PowerManager's isScreenOn and determine from the returned flag whether the terminal's screen is off: if the returned flag is "false", the terminal is in the screen-off state; if "true", the terminal is in the non-screen-off state.

In some embodiments, if the terminal is not in the screen-off state, voice matching may be performed on the input speech only once, improving voice wake-up efficiency. Specific implementations are described in later embodiments and not repeated here. In this embodiment, after detecting whether the terminal is in the screen-off state, the method further includes:

If the terminal is in the screen-off state, step S430 may be executed;

If the terminal is not in the screen-off state, step S480 may be executed.

Step S430: Match the input speech based on the first voice matching model to obtain the first probability output.

Step S440: Obtain at least one probability output produced by the first voice matching model before the current first probability output, as the second probability output.

Step S450: Fuse the first probability output with the second probability output to obtain the updated first probability output.

Step S460: Obtain, according to the updated first probability output, the first matching result of the first voice matching model matching the input speech.

Step S470: Judge whether the first matching result indicates that the input speech contains the specified text.

In this embodiment, if the first matching result indicates that the input speech contains the specified text, step S480 may be executed; if it indicates that the input speech does not contain the specified text, the flow may return to step S410 to continue collecting input speech.

In addition, in some possible embodiments, the method may also end here, which is not limited.

Step S480: Match the input speech based on the second voice matching model to obtain the second matching result.

Step S490: If the second matching result indicates that the input speech contains the specified text, wake up the terminal.

If the second matching result indicates that the input speech contains the specified text, the input speech passes the second-level check: both the first and second voice matching models have recognized the specified text in the input speech, and the terminal may be woken up.

As one implementation, the second voice matching model may run on the second chip, so that, with hardware supporting complex algorithms, a larger and deeper model achieves more accurate recognition.

In some embodiments, if the second matching result indicates that the input speech does not contain the specified text, the second-level check of the input speech fails.

As another implementation, when the second-level check of the input speech based on the second voice matching model is performed, both the first and second chips are in the working state; if the second-level check fails, the second chip may be switched from the working state to the sleep state to reduce power consumption.

It should be noted that for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, which are not repeated here.

The voice wake-up method provided by this embodiment first detects, upon obtaining the input speech, whether the terminal is in the screen-off state; when it is, the first voice matching model performs the first-level check on the input speech, the second matching model performs the second-level check after the first passes, and the terminal is woken up only after the second-level check also passes. By performing two checks with the first and second voice matching models, which have different matching rules, while the terminal's screen is off, input speech that can wake the terminal must pass at least two different matching rules before the terminal wakes successfully, greatly lowering the false wake-up rate; and since waking the terminal from the screen-off state consumes considerable power, performing two checks in that state lowers terminal power consumption along with the false wake-up rate.

In addition, in some embodiments, whether only the first voice matching model performs a first-level check, only the second performs a second-level check, or the first performs a first-level check followed, on passing, by the second performing a second-level check, when the input speech is judged to contain the specified text, voiceprint recognition may be performed on the speech containing the specified text, further improving the security of terminal use and preventing others from waking the terminal at will, which would cause unnecessary power consumption or pose a security threat to the terminal.

In one embodiment, taking voiceprint recognition after the second-level check passes as an example, the operation of voiceprint recognition is described. Specifically, please refer to FIG. 16, which shows a schematic flowchart of step S490 in FIG. 15 provided by an exemplary embodiment of the present application. In this embodiment, step S490 may include:

Step S491: If the second matching result indicates that the input speech contains the specified text, perform voiceprint recognition on the input speech.

If the second matching result indicates that the input speech contains the specified text, the input speech passes the second-level check: both the first and second voice matching models have recognized the specified text in the input speech, and the input speech may then be fed into the voiceprint recognition algorithm for voiceprint recognition.

In some implementations, the terminal may pre-store voiceprint templates; there may be several, e.g. a template for user A, a template for user B, etc. A voiceprint template is used to match the voiceprint features of the input speech. When the second matching result indicates that the input speech contains the specified text, the voiceprint features of the input speech may be extracted and matched against the voiceprint templates: if a template matching those features exists, voiceprint recognition is deemed to have passed verification; if none exists, it is deemed to have failed.
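As an illustrative sketch of the template matching (the embodiment does not fix a particular voiceprint algorithm), cosine similarity between an extracted voiceprint embedding and each stored template is one common choice; the embedding extractor and the threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def voiceprint_passes(embedding: torch.Tensor, templates, sim_thre: float = 0.75) -> bool:
    for tpl in templates:                # compare against every stored template
        if F.cosine_similarity(embedding, tpl, dim=0) > sim_thre:
            return True                  # a matching template exists: verified
    return False                         # no template matched: verification fails
```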

In some embodiments, voiceprint templates can be stored via the aforementioned wake-up word setting page. Specifically, the user can record a voice containing the specified text through the wake-up word setting page; the terminal extracts the voiceprint features from this recording as a voiceprint template corresponding to that specified text and stores it for voiceprint verification during later voiceprint recognition.

In some embodiments, the voiceprint recognition algorithm can run on the second chip together with the second voice matching model.

Step S492: if the voiceprint recognition passes verification, wake up the terminal.

If the voiceprint recognition passes verification, the input voice has passed not only the checks of two different matching rules but also the voiceprint verification. At this point the voice wake-up is judged successful and the terminal is woken up, which greatly reduces the false wake-up rate.

If the voiceprint recognition passes verification, the terminal may be woken up to perform a preset operation. As one implementation, the terminal may pre-store a mapping table between voiceprint templates and preset operations; the preset operation corresponding to the template that matches the voiceprint features of the input voice is then determined, and the terminal is woken up to perform it. The preset operation may include, but is not limited to, turning on the screen, unlocking, and the like, which this embodiment does not limit.

In some implementations, if the voiceprint recognition passes verification, the terminal may be woken up and switched from the screen-off state to a non-screen-off state. The non-screen-off state may include a to-be-unlocked state in which the screen is on and an unlock interface is displayed, and an unlocked state in which the screen is on and no unlock interface is displayed.

In other implementations, if the voiceprint recognition passes verification, the terminal may be woken up and an unlock operation performed, so that the user can unlock the terminal directly by voice; on this basis the method enables accurate, secure, and convenient unlocking. For example, after the user says "Xiaoou Xiaoou" to the terminal and the voiceprint recognition passes verification, the terminal screen may light up and display an unlocked interface, which may be the interface shown before the last lock or the home screen; this is not limited here.

In still other implementations, if the voiceprint recognition passes verification, environmental information may also be acquired to determine whether the current scene is a designated payment scene; if so, the terminal is woken up and the payment operation corresponding to that scene is completed. In one example, the terminal may pre-register transit card information. When the user boards a bus, the user holds the terminal near the card reader and says the wake-up word "Xiaoou Xiaoou". The terminal then receives the NFC signal emitted by the card reader, determines that the current scene is a transit payment scene, wakes up, and completes the payment based on the pre-registered transit card information, thereby enabling convenient and secure payment.

It should be understood that the above are merely examples; the method provided in this embodiment is not limited to these scenarios, which are not exhaustively enumerated here for reasons of space. In addition, the operations performed after waking the terminal described above are equally applicable to any of the foregoing embodiments.

In some embodiments, step S492 may be implemented specifically as: if the voiceprint recognition passes verification, wake up the terminal to execute a target instruction, where the target instruction is bound to the voiceprint template that matches the voiceprint features of the input voice. Specifically, the terminal may pre-store a mapping table between voiceprint templates and control instructions; if the voiceprint recognition passes verification, the control instruction corresponding to the matching voiceprint template is obtained as the target instruction, and the terminal is woken up to execute it. The control instruction may be an unlock operation, activation of a voice assistant, a payment operation, and the like, which this embodiment does not limit.

In some implementations, the aforementioned mapping table may additionally store the terminal's screen state, the currently displayed interface, or other current terminal information, stored in correspondence with the voiceprint templates and control instructions. When the input voice passes voiceprint recognition, the corresponding control instruction is then determined from both the voiceprint features of the input voice and the current terminal information.
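As a minimal sketch of such a lookup, assuming the mapping table is keyed on the matched template's owner and a coarse terminal state (both the table layout and the state labels are illustrative; the patent does not fix them):

```python
from enum import Enum

class TerminalState(Enum):
    SCREEN_OFF = "screen_off"
    TO_BE_UNLOCKED = "to_be_unlocked"  # screen on, unlock interface displayed
    UNLOCKED = "unlocked"              # screen on, no unlock interface

# Hypothetical mapping table: (matched user, terminal state) -> control instruction.
INSTRUCTION_TABLE = {
    ("user_a", TerminalState.TO_BE_UNLOCKED): "unlock",
    ("user_a", TerminalState.UNLOCKED): "activate_voice_assistant",
    ("user_b", TerminalState.TO_BE_UNLOCKED): "unlock",
}

def target_instruction(matched_user: str, state: TerminalState) -> str | None:
    """Look up the control instruction bound to the matched voiceprint template
    and the current terminal information; None means no instruction is bound."""
    return INSTRUCTION_TABLE.get((matched_user, state))
```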

In some embodiments, the non-screen-off state may include a to-be-unlocked state in which the screen is on and an unlock interface is displayed, and an unlocked state in which the screen is on and no unlock interface is displayed. Based on the foregoing implementations, if the terminal is in the to-be-unlocked state, then when the voiceprint recognition passes verification the terminal may be woken up to perform an unlock operation, activate a voice assistant, and so on. If the terminal is in the unlocked state, or is currently displaying the interface of an application with a payment function, then when the voiceprint recognition passes verification the terminal may be woken up to perform a payment operation and the like; that is, this scheme realizes identity authentication based on voiceprint recognition.

It should be understood that the above are merely examples; the method provided in this embodiment is not limited to these scenarios, which are not exhaustively enumerated here for reasons of space.

In addition, in some embodiments, if the terminal is not in the screen-off state, it may first check whether a voiceprint template is pre-stored. If none is pre-stored, the terminal may recognize the input voice, take the text content of the input voice as its specified text, extract the voiceprint features of the input voice as a voiceprint template, store the specified text and the voiceprint template in correspondence on the terminal, and train the first and second voice matching models and the voiceprint recognition algorithm on the specified text and voiceprint template. The next time input voice is acquired, it can then be verified: the terminal judges whether it contains the specified text and, if so, verifies its voiceprint features against the pre-stored template. In this way, when the terminal has no pre-stored voiceprint template, keyword detection and storage can be performed on the user's input voice so that the voice wake-up method provided by the embodiments of the present application can be carried out the next time.

As one implementation, if no voiceprint template is pre-stored, the audio collector may continuously collect voice signals, allowing the user to repeat the wake-up word to be stored several times for subsequent keyword detection and storage. Understandably, the more repetitions, the more accurate the stored voiceprint template, the better the subsequent training of the first and second voice matching models and the voiceprint recognition algorithm, and the more stable the recognition.
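As a minimal sketch of this enrollment step, assuming each repetition yields a fixed-length embedding and that the stored template is the mean of the repetitions (the patent does not prescribe how repetitions are combined):

```python
import numpy as np

def enroll_voiceprint(repetition_features: list[np.ndarray]) -> np.ndarray:
    """Build a voiceprint template from several repetitions of the wake-up word.

    repetition_features: one embedding per repetition, all of equal length.
    Averaging is an assumed aggregation strategy; more repetitions yield a
    more stable template, consistent with the observation above.
    """
    stacked = np.stack(repetition_features)              # (n_repetitions, dim)
    template = stacked.mean(axis=0)                      # element-wise mean
    return template / (np.linalg.norm(template) + 1e-8)  # unit-normalize
```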

In some embodiments, if the voiceprint recognition fails verification or the second-level verification fails, the audio collector may be controlled to continue monitoring and collecting voice signals, and the corresponding input voice is fed into the first voice matching model.

In other embodiments, if the voiceprint recognition fails verification or the second-level verification fails, the second chip may also be switched from the working state to the sleep state to cut the relatively large power consumption it introduces, while the first chip is switched from the sleep state to the listening state to continue monitoring and collecting voice signals.

Thus, in the voice wake-up method provided in this embodiment, upon acquiring the input voice the terminal first detects whether it is in the screen-off state. When the screen is off, the input voice undergoes a first-level verification based on the first voice matching model; after the first-level verification passes, a second-level verification is performed based on the second voice matching model; and only after the second-level verification also passes is voiceprint recognition performed on the input voice, with the terminal woken up once the voiceprint recognition passes verification. In this way, when the terminal is screen-off, two verifications are carried out by the first and second voice matching models with different matching rules, and voiceprint recognition is performed only after both pass, so an input voice that is to wake the terminal must pass at least the checks of two different matching rules plus voiceprint recognition. This greatly reduces the false wake-up rate; and since waking the terminal from the screen-off state consumes considerable power, performing two verifications in the screen-off state lowers both the false wake-up rate and the terminal's power consumption.

In a specific example, the low-power first chip runs the first voice matching model to capture input voice in all-day scenarios and performs a first-level verification on it. After the first-level verification passes, the input voice is transmitted to the second voice matching model running on the second chip, which performs a second-level verification; only after that also passes is voiceprint recognition performed on the voice containing the specified text, and the terminal is woken up only once the voiceprint recognition passes verification. This not only achieves an all-day, low-power first-level verification with high recall, complemented by a second-level verification with higher precision than the first, so that the terminal can judge more accurately whether the input voice contains the specified text; it also requires voiceprint recognition after the second-level verification passes, preventing others from waking the terminal at will and improving the security of terminal use.
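As a minimal sketch of this three-stage cascade, with each stage abstracted as a callable returning a pass/fail decision and folding in the fallback behavior described above (all function names and the chip-control hooks are illustrative assumptions; in the patent's example the first stage runs on the DSP chip and the second stage on the ARM chip):

```python
def wake_pipeline(input_voice,
                  first_stage,       # low-power model on the first chip (high recall)
                  second_stage,      # stricter model on the second chip (high precision)
                  voiceprint_pass,   # voiceprint verification, returns True/False
                  wake_terminal,
                  sleep_second_chip,
                  listen_first_chip) -> bool:
    """Screen-off wake-up cascade: level-1 check -> level-2 check -> voiceprint."""
    if not first_stage(input_voice):
        return False  # stay in low-power listening on the first chip
    if not (second_stage(input_voice) and voiceprint_pass(input_voice)):
        # On a level-2 or voiceprint failure, put the second chip back to
        # sleep and resume listening on the low-power first chip.
        sleep_second_chip()
        listen_first_chip()
        return False
    wake_terminal()
    return True
```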

Referring to FIG. 17, which shows a structural block diagram of a voice wake-up device 1700 provided by an embodiment of the present application: the voice wake-up device 1700 can be applied to the aforementioned terminal and may include a voice acquisition module 1710, a first output module 1720, a second output module 1730, an output update module 1740, a result acquisition module 1750, and a terminal wake-up module 1760. Specifically:

a voice acquisition module, configured to acquire the input voice collected by the audio collector;

a first output module, configured to match the input voice based on a first voice matching model to obtain a first probability output, the first probability output indicating the probability that the input voice contains the specified text;

a second output module, configured to obtain, as a second probability output, at least one probability output produced by the first voice matching model before the current first probability output;

an output update module, configured to fuse the first probability output with the second probability output to obtain an updated first probability output;

a result acquisition module, configured to take the updated first probability output as the first matching result of the first voice matching model matching the input voice; and a terminal wake-up module, configured to wake up the terminal if the first matching result indicates that the input voice contains the specified text.

Further, the output update module includes: a history extraction submodule, a first history fusion submodule, an attention processing submodule, and a second history fusion submodule, wherein:

the history extraction submodule is configured to perform feature extraction on the second probability output to obtain a first historical feature and a second historical feature;

the first history fusion submodule is configured to fuse the first probability output with the first historical feature to obtain a historical attention weight vector corresponding to the second probability output;

the attention processing submodule is configured to weight the second historical feature according to the historical attention weight vector to obtain a historical fusion output;

the second history fusion submodule is configured to fuse the first probability output with the historical fusion output to obtain the updated first probability output.
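As a minimal NumPy sketch of the history-fusion path just described, assuming the probability outputs are vectors, abstracting the two feature extractions as callables, and using a softmax over history steps for the attention weights (the shapes, the scoring function, and the final averaging are assumptions; the coefficient-gated variant described below refines the final step):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_with_history(first_out: np.ndarray,
                      history_outs: np.ndarray,
                      extract_key, extract_value) -> np.ndarray:
    """Attention-style fusion of the current output with earlier outputs.

    first_out:    current first probability output, shape (dim,).
    history_outs: stacked earlier outputs (the second probability output),
                  shape (n_history, dim).
    extract_key / extract_value: stand-ins for the first/second historical
    feature extraction; each maps one history step to a (dim,) vector.
    """
    keys = np.stack([extract_key(h) for h in history_outs])      # first historical features
    values = np.stack([extract_value(h) for h in history_outs])  # second historical features
    scores = keys @ first_out          # fuse current output with first historical feature
    weights = softmax(scores)          # historical attention weight vector
    history_fused = weights @ values   # weighted second historical feature
    return 0.5 * (first_out + history_fused)  # assumed fusion rule (simple average)
```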

Further, the first voice matching model includes first and second bidirectional recurrent neural network layers, and the history extraction submodule includes a first history extraction unit and a second history extraction unit, wherein:

the first history extraction unit is configured to perform feature extraction on the historical attention output through the first bidirectional recurrent neural network layer to obtain the first historical feature;

the second history extraction unit is configured to perform feature extraction on the historical attention output through the second bidirectional recurrent neural network layer to obtain the second historical feature.

Further, the second history fusion submodule includes a coefficient extraction unit, a first output unit, and a second output unit, wherein:

the coefficient extraction unit is configured to perform feature extraction on the first probability output to obtain an output coefficient corresponding to the first probability output;

the first output unit is configured to take the first probability output as the updated first probability output if the output coefficient is greater than a preset result threshold;

the second output unit is configured to take the historical fusion output as the updated first probability output if the output coefficient is less than or equal to the preset result threshold.
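As a minimal sketch of the coefficient-gated selection just described, assuming the output coefficient is a scalar confidence derived from the current output (the extraction is abstracted as a callable and the threshold value is illustrative):

```python
import numpy as np

def gated_update(first_out: np.ndarray,
                 history_fused: np.ndarray,
                 extract_coefficient,
                 result_threshold: float = 0.5) -> np.ndarray:
    """Choose the updated first probability output by gating on an output coefficient.

    When the current output is confident (coefficient above the preset result
    threshold), it is kept as-is; otherwise the history-fused output is used.
    """
    coefficient = extract_coefficient(first_out)  # scalar confidence, assumed in [0, 1]
    return first_out if coefficient > result_threshold else history_fused
```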

Further, the first voice matching model is a convolutional neural network model, and the first output module includes an acoustic feature extraction submodule and a first probability output submodule, wherein:

the acoustic feature extraction submodule is configured to extract the acoustic features of the input voice and perform a convolution operation on the acoustic features through the first voice matching model to obtain a convolutional neural network output;

the first probability output submodule is configured to match the convolutional neural network output against the acoustic features corresponding to the specified text to obtain the first probability output.

Further, the first probability output submodule includes a weight extraction unit, a weighting processing unit, and a probability output unit, wherein:

the weight extraction unit is configured to perform per-channel attention weight extraction on the convolutional neural network output to obtain the attention weight vector corresponding to the convolutional neural network output;

the weighting processing unit is configured to weight the convolutional neural network output according to the attention weight vector to obtain an attention output feature;

the probability output unit is configured to match the attention output feature against the acoustic features corresponding to the specified text to obtain the first probability output.
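As a minimal sketch of the per-channel attention step just described, in the style of a squeeze-and-excitation block, assuming the CNN output is a (channels, time, frequency) tensor and using a single learned projection for the weights (the squeeze operation and layer sizes are assumptions; the patent only states that attention weights are extracted per channel):

```python
import numpy as np

def channel_attention(cnn_out: np.ndarray,
                      w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Reweight a CNN output per channel.

    cnn_out: convolutional output, shape (channels, time, freq).
    w, b:    parameters of an assumed one-layer scoring projection,
             w of shape (channels, channels), b of shape (channels,).
    """
    squeezed = cnn_out.mean(axis=(1, 2))     # squeeze: one scalar per channel
    logits = w @ squeezed + b                # per-channel attention logits
    weights = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> attention weight vector
    return cnn_out * weights[:, None, None]  # broadcast weights over each channel
```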

Further, the terminal wake-up module includes a second-level verification submodule and a second-level wake-up submodule, wherein:

the second-level verification submodule is configured to, if the first matching result indicates that the input voice contains the specified text, match the input voice based on a second voice matching model to obtain a second matching result, the first voice matching model and the second voice matching model having different matching rules;

the second-level wake-up submodule is configured to wake up the terminal if the second matching result indicates that the input voice contains the specified text.

Further, the terminal includes a first chip and a second chip; the first voice matching model runs on the first chip and the second voice matching model runs on the second chip, the power consumption of the first chip being lower than that of the second chip.

Further, the first chip is a DSP chip and the second chip is an ARM chip.

Further, the second-level wake-up submodule includes a voiceprint recognition unit and a voiceprint wake-up unit, wherein:

the voiceprint recognition unit is configured to perform voiceprint recognition on the input voice if the second matching result indicates that the input voice contains the specified text;

the voiceprint wake-up unit is configured to wake up the terminal if the voiceprint recognition passes verification.

Further, the first output module includes a screen-off detection submodule and a first output submodule, wherein:

the screen-off detection submodule is configured to detect whether the terminal is in the screen-off state;

the first output submodule is configured to match the input voice based on the first voice matching model to obtain the first probability output.

Further, after detecting whether the terminal is in the screen-off state, the voice wake-up device 1700 further includes a non-screen-off matching module and a non-screen-off wake-up module, wherein:

the non-screen-off matching module is configured to, if the terminal is not in the screen-off state, match the input voice based on the second voice matching model to obtain a second matching result;

the non-screen-off wake-up module is configured to wake up the terminal if the second matching result indicates that the input voice contains the specified text.

The voice wake-up device provided in the embodiments of the present application is used to implement the corresponding voice wake-up method in the foregoing method embodiments and has the beneficial effects of those embodiments, which are not repeated here.

In the several embodiments provided in the present application, the coupling between modules may be electrical, mechanical, or in other forms.

In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, may each exist physically on its own, or two or more of them may be integrated into one module. The integrated modules may be implemented in hardware or as software functional modules.

Referring to FIG. 18, which shows a structural block diagram of an electronic device provided by an embodiment of the present application: the electronic device 1800 may be a smartphone, tablet computer, e-book reader, notebook computer, personal computer, or other electronic device capable of running application programs. The electronic device 1800 in this application may include one or more of the following components: a processor 1810, a memory 1820, and one or more application programs, where the one or more application programs may be stored in the memory 1820 and configured to be executed by the one or more processors 1810, the one or more programs being configured to perform the methods described in the foregoing method embodiments.

The processor 1810 may include one or more processing cores. The processor 1810 connects the various parts of the electronic device 1800 using various interfaces and lines, and performs the functions of the electronic device 1800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1820 and by invoking data stored in the memory 1820. Optionally, the processor 1810 may be implemented in at least one hardware form among a DSP chip, an ARM chip, a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 1810 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing displayed content; the modem handles wireless communication. It should be understood that the modem may also not be integrated into the processor 1810 and may instead be implemented by a separate communication chip.

The memory 1820 may include random access memory (RAM) and may also include read-only memory (ROM). The memory 1820 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1820 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the foregoing method embodiments, and so on. The data storage area may store data created by the electronic device 1800 during use (such as a phonebook, audio and video data, and chat records).

In some embodiments, the electronic device 1800 is provided with an audio collector, which can collect voice signals and transmit them to the processor 1810 for processing, and also to the memory 1820 for data storage.

In some implementations, the audio collector may be disposed within the processor 1810; for example, the processor 1810 may include a first chip and a second chip, with the audio collector integrated into the first chip. As one example, the first chip may be a DSP chip and the second chip an ARM chip.

Referring to FIG. 19, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application: the computer-readable storage medium 1900 stores program code that can be invoked by a processor to execute the methods described in the foregoing embodiments.

The computer-readable storage medium 1900 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 1900 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1900 has storage space for program code 1910 that performs any of the method steps in the methods above. The program code can be read from or written into one or more computer program products. The program code 1910 may, for example, be compressed in a suitable form.

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of their technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A voice wake-up method, applied to a terminal, the terminal being provided with an audio collector, the method comprising:
acquiring input speech collected by the audio collector;
matching the input speech based on a first speech matching model to obtain a first probability output, the first probability output indicating a probability that the input speech contains specified text;
obtaining, as a second probability output, at least one probability output produced by the first speech matching model before the current first probability output;
performing feature extraction on the second probability output to obtain a first historical feature and a second historical feature;
fusing the first probability output and the first historical feature to obtain a historical attention weight vector corresponding to the second probability output;
weighting the second historical feature according to the historical attention weight vector to obtain a historical fusion output;
fusing the first probability output with the historical fusion output to obtain an updated first probability output;
taking the updated first probability output as a first matching result of the first speech matching model matching the input speech;
and if the first matching result indicates that the input speech contains the specified text, waking up the terminal.
2. The method of claim 1, wherein the first speech matching model comprises first and second bidirectional recurrent neural network layers, and wherein said performing feature extraction on the second probability output to obtain a first historical feature and a second historical feature comprises:
performing feature extraction on the historical attention output through the first bidirectional recurrent neural network layer to obtain a first historical feature;
and performing feature extraction on the historical attention output through the second bidirectional recurrent neural network layer to obtain a second historical feature.
3. The method of claim 1, wherein said fusing the first probability output with the historical fusion output to obtain an updated first probability output comprises:
performing feature extraction on the first probability output to obtain an output coefficient corresponding to the first probability output;
if the output coefficient is greater than a preset result threshold, taking the first probability output as the updated first probability output;
and if the output coefficient is less than or equal to the preset result threshold, taking the historical fusion output as the updated first probability output.
4. The method of claim 1, wherein the first speech matching model is a convolutional neural network model, and wherein said matching the input speech based on the first speech matching model to obtain a first probability output comprises:
extracting acoustic features of the input speech, and performing a convolution operation on the acoustic features through the first speech matching model to obtain a convolutional neural network output;
and matching the convolutional neural network output against the acoustic features corresponding to the specified text to obtain the first probability output.
5. The method of claim 4, wherein matching the convolutional neural network output to acoustic features corresponding to the specified text to obtain a first probability output comprises:
performing per-channel attention weight extraction on the convolutional neural network output to obtain an attention weight vector corresponding to the convolutional neural network output;
weighting the convolutional neural network output according to the attention weight vector to obtain an attention output feature;
and matching the attention output feature against the acoustic features corresponding to the specified text to obtain the first probability output.
6. The method of claim 1, wherein waking up the terminal if the first matching result indicates that the input speech includes the specified text comprises:
if the first matching result indicates that the input speech contains the specified text, matching the input speech based on a second speech matching model to obtain a second matching result, the matching rules of the first speech matching model and the second speech matching model being different;
and if the second matching result indicates that the input speech contains the specified text, waking up the terminal.
7. The method of claim 6, wherein the terminal comprises a first chip and a second chip, the first speech matching model running on the first chip and the second speech matching model running on the second chip, the power consumption of the first chip being lower than the power consumption of the second chip.
8. The method of claim 7, wherein the first chip is a DSP chip and the second chip is an ARM chip.
9. The method according to claim 6, wherein waking up the terminal if the second matching result indicates that the input speech includes the specified text comprises:
if the second matching result indicates that the input speech contains the specified text, performing voiceprint recognition on the input speech;
and if the voiceprint recognition passes verification, waking up the terminal.
10. The method of any one of claims 6-9, wherein said matching the input speech based on a first speech matching model to obtain a first probability output comprises:
detecting whether the terminal is in a screen-off state;
and matching the input speech based on the first speech matching model to obtain the first probability output.
11. The method of claim 10, wherein after detecting whether the terminal is in a screen-off state, the method further comprises:
if the terminal is not in the screen-off state, matching the input speech based on the second speech matching model to obtain a second matching result;
and if the second matching result indicates that the input speech contains the specified text, waking up the terminal.
12. A voice wake-up device, applied to a terminal, the terminal being provided with an audio collector, the device comprising:
a voice acquisition module, configured to acquire input speech collected by the audio collector;
a first output module, configured to match the input speech based on a first speech matching model to obtain a first probability output, the first probability output indicating a probability that the input speech contains specified text;
a second output module, configured to obtain, as a second probability output, at least one probability output produced by the first speech matching model before the current first probability output;
an output update module, the output update module comprising a history extraction submodule, a first history fusion submodule, an attention processing submodule, and a second history fusion submodule; the history extraction submodule being configured to perform feature extraction on the second probability output to obtain a first historical feature and a second historical feature; the first history fusion submodule being configured to fuse the first probability output with the first historical feature to obtain a historical attention weight vector corresponding to the second probability output; the attention processing submodule being configured to weight the second historical feature according to the historical attention weight vector to obtain a historical fusion output; the second history fusion submodule being configured to fuse the first probability output with the historical fusion output to obtain an updated first probability output;
a result acquisition module, configured to take the updated first probability output as a first matching result of the first speech matching model matching the input speech;
and a terminal wake-up module, configured to wake up the terminal if the first matching result indicates that the input speech contains the specified text.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of any one of claims 1-11.
14. A computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of any of claims 1-11.
CN202010312299.XA · Priority date: 2020-04-20 · Filing date: 2020-04-20 · Title: Voice wake-up method and device, electronic equipment and storage medium · Status: Active · Granted as CN111508493B (en)

Priority Applications (1)

Application Number: CN202010312299.XA · Priority date: 2020-04-20 · Filing date: 2020-04-20 · Title: Voice wake-up method and device, electronic equipment and storage medium


Publications (2)

CN111508493A · published 2020-08-07
CN111508493B · granted 2022-11-15

Family ID: 71877613





Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
